Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes

Author: Benjamini
Bhattacharjee
Carter
Chen
Chen Yao
Chenguang Wang
David
Diehn
Dong Wang
Efron
Ein-Dor
Ein-Dor
Fleiss
Frantz
Garber
Guo
Hardeo
Haslett
Hosack
Hui Xiao
Irizarry
Jeffery
Jinfeng Zou
Jing Wang
Klebanov
Klebanov
Klebanov
Lapointe
Lin Zhang
Mao
Marshall
Michiels
Miklos
Min Zhang
Oshlack
Pescatori
Qing Liu
Qiu
Qiu
Ransohoff
Ransohoff
Rhodes
Shi
Shi
Singh
Subramanian
Tan
Tong
Troyanskaya
Tusher
Xie
Xu
Yang
Zhang
Zhang
Zheng Guo
Zhu
Publication venue: Oxford University Press

Motivation: According to current consistency metrics such as percentage of overlapping genes (POG), lists of differentially expressed genes (DEGs) detected from different microarray studies for a complex disease are often highly inconsistent. This irreproducibility problem also exists in other high-throughput post-genomic areas such as proteomics and metabolism. A complex disease is often characterized by many coordinated molecular changes, which should be considered when evaluating the reproducibility of discovery lists from different studies.
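As a rough illustration of the POG metric mentioned in this abstract, the sketch below computes the share of each gene list that also appears in the other list. It assumes the simplest reading of POG (shared genes divided by list length, ignoring the direction of regulation); the function name `pog` and the example gene lists are hypothetical and not taken from the paper.

```python
def pog(list_a, list_b):
    """Percentage of overlapping genes between two DEG lists.

    Assumption: POG is the number of shared genes divided by the
    length of each list, so the score is reported once per list.
    """
    set_a, set_b = set(list_a), set(list_b)
    if not set_a or not set_b:
        raise ValueError("both gene lists must be non-empty")
    shared = len(set_a & set_b)
    # Return (POG relative to list A, POG relative to list B).
    return shared / len(set_a), shared / len(set_b)


# Hypothetical DEG lists from two studies of the same disease.
study_1 = ["TP53", "BRCA1", "MYC", "EGFR"]
study_2 = ["MYC", "EGFR", "KRAS", "PTEN", "TP53"]
pog_12, pog_21 = pog(study_1, study_2)
```

Because the two lists can differ in length, the score is asymmetric, which is one reason a single overlap number can understate how consistent two studies are.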

CiteSeerX

arXiv.org e-Print Archive

Stable Feature Selection for Biomarker Discovery

Author: He Zengyou
Yu Weichuan
Publication date: 01/01/2010

Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered. Only recently has this issue received more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic for convenient reference; (2) categorizing existing methods under an expandable framework for future research and development.

CiteSeerX

Hong Kong University of Science and Technology Institutional Repository

Reproducible Cancer Biomarker Discovery in SELDI-TOF MS Using Different Pre-Processing Algorithms

Author: A Carvajal-Rodriguez
A Cruz-Marcelo
AC Sauve
AK Callesen
AW Bell
B Huang
BL Adam
C Li
C Mathelin
C Truntzer
Chen Yao
DF Ransohoff
DM Rissin
DM Rocke
DW Swinkels
EP Diamandis
FJ Esteva
G Kristina
Guini Hong
HJ Song
II Emanuele VA
J Frobel
J Li
J MacQueen
J Wang
JA Mead
JF Timms
Jinfeng Zou
Jing Wang
JM Hogan
JW Wong
KA Baggerly
KR Coombes
L Diao
L Ein-Dor
L Ein-Dor
L Klebanov
L Pusztai
L Shi
L Sun
Lin Zhang
M De Bock
M Dijkstra
M Zhang
M Zhang
MA Kuzyk
ME Sanders
MK Tuck
ML Lee
P Du
PC Carvalho
PJ Rousseeuw
R Aebersold
RE Caffrey
SM Hanash
T Fortin
TC Poon
W Meuleman
WC Cho
WC Cho
William C.S. Cho
X Gong
X Li
X Qiu
Xinwu Guo
Y Benjamini
Y Pawitan
Y Yasui
Zheng Guo
Publication venue: Public Library of Science
Publication date: 14/10/2011

BACKGROUND: There has been much interest in differentiating diseased and normal samples using biomarkers derived from mass spectrometry (MS) studies. However, biomarker identification for specific diseases has been hindered by irreproducibility. Specifically, a peak profile extracted from a dataset for biomarker identification depends on a data pre-processing algorithm. Until now, no widely accepted agreement has been reached. RESULTS: In this paper, we investigated the consistency of biomarker identification using differentially expressed (DE) peaks from peak profiles produced by three widely used average spectrum-dependent pre-processing algorithms based on SELDI-TOF MS data for prostate and breast cancers. Our results revealed two important factors that affect the consistency of DE peak identification using different algorithms. One factor is that some DE peaks selected from one peak profile were not detected as peaks in other profiles, and the second factor is that the statistical power of identifying DE peaks in large peak profiles with many peaks may be low due to the large scale of the tests and small number of samples. Furthermore, we demonstrated that the DE peak detection power in large profiles could be improved by the stratified false discovery rate (FDR) control approach and that the reproducibility of DE peak detection could thereby be increased. CONCLUSIONS: Comparing and evaluating pre-processing algorithms in terms of reproducibility can elucidate the relationship among different algorithms and also help in selecting a pre-processing algorithm. The DE peaks selected from small peak profiles with few peaks for a dataset tend to be reproducibly detected in large peak profiles, which suggests that a suitable pre-processing algorithm should be able to produce peaks sufficient for identifying useful and reproducible biomarkers
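The stratified FDR control mentioned in this abstract can be sketched as follows, assuming it means applying the Benjamini-Hochberg step-up procedure separately within each stratum of peaks rather than once over the pooled tests. How the strata are defined is not stated here, so the `strata` labels, function names, and p-values below are purely illustrative.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def stratified_bh(pvalues, strata, alpha=0.05):
    """Apply BH within each stratum (assumed form of stratified FDR control)."""
    groups = {}
    for i, label in enumerate(strata):
        groups.setdefault(label, []).append(i)
    rejected = [False] * len(pvalues)
    for indices in groups.values():
        sub = benjamini_hochberg([pvalues[i] for i in indices], alpha)
        for i, flag in zip(indices, sub):
            rejected[i] = flag
    return rejected


# Hypothetical p-values: stratum "A" is enriched for small p-values.
pvals = [0.001, 0.02, 0.03, 0.5, 0.04, 0.9, 0.8, 0.7]
labels = ["A", "A", "A", "B", "A", "B", "B", "B"]
pooled = benjamini_hochberg(pvals)
stratified = stratified_bh(pvals, labels)
```

In this toy input the pooled procedure rejects only the smallest p-value, while the stratified version rejects all four tests in stratum "A". This is the kind of power gain the abstract describes, though real gains depend on how informative the strata are.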

Public Library of Science (PLOS)

Reproducibility and Concordance of Differential DNA Methylation and Gene Expression in Cancer

Author: Guo Zheng
He Lang
He Zheng
Li Hongdong
Shen Xiaopei
Yao Chen
Publication venue: Public Library of Science
Publication date: 03/01/2012

Background: Hundreds of genes with differential DNA methylation of promoters have been identified for various cancers. However, the reproducibility of differential DNA methylation discoveries for cancer and the relationship between DNA methylation and aberrant gene expression have not been systematically analysed. Methodology/Principal Findings: Using array data for seven types of cancers, we first evaluated the effects of experimental batches on differential DNA methylation detection. Second, we compared the directions of DNA methylation changes detected from different datasets for the same cancer. Third, we evaluated the concordance between methylation and gene expression changes. Finally, we compared DNA methylation changes in different cancers. For a given cancer, the directions of methylation and expression changes detected from different datasets, excluding potential batch effects, were highly consistent. In different cancers, DNA hypermethylation was highly inversely correlated with the down-regulation of gene expression, whereas hypomethylation was only weakly correlated with the up-regulation of genes. Finally, we found that genes commonly hypomethylated in different cancers primarily performed functions associated with chronic inflammation, such as ‘keratinization’, ‘chemotaxis’ and ‘immune response’. Conclusions: Batch effects could greatly affect the discovery of DNA methylation biomarkers. For a particular cancer, both differential DNA methylation and gene expression can be reproducibly detected from different studies with no batc

CiteSeerX

Springer - Publisher Connector

FigShare

Multi-level reproducibility of signature hubs in human interactome for breast cancer metastasis

Author: Guo Zheng
Li Hongdong
Yao Chen
Zhang Lin
Zhou Chenggui
Zou Jinfeng
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: It has been suggested that, in the human protein-protein interaction network, changes of co-expression between highly connected proteins ("hub") and their interaction neighbours might have important roles in cancer metastasis and be predictive disease signatures for patient outcome. However, for a cancer, such disease signatures identified from different studies have little overlap. Results: Here, we propose a systemic approach to evaluate the reproducibility of disease signatures at multiple levels, on the basis of some statistically testable biological models. Using two datasets for breast cancer metastasis, we showed that different signature hubs identified from different studies were highly consistent in terms of significantly sharing interaction neighbours and displaying consistent co-expression changes with their overlapping neighbours, whereas the shared interaction neighbours were significantly over-represented with known cancer genes and enriched in pathways deregulated in breast cancer pathogenesis. Then, we showed that the signature hubs identified from the two datasets were highly reproducible at the protein interaction and pathway levels in three other independent datasets. Conclusions: Our results provide a possible biological model that different signature hubs altered in different patient cohorts could disturb the same pathways associated with cancer metastasis through their interaction neighbours.

Springer - Publisher Connector

Concordance analysis of microarray studies identifies representative gene expression changes in Parkinson’s disease: a comparison of 33 human and animal studies

Publication venue: BioMed Central
Publication date: 23/03/2017

A statistical framework for integrating two microarray data sets in differential expression analysis

Author: D Lockhart
D Singh
EM Conlon
F Hong
GJ McLachlan
GJ McLachlan
GJ McLachlan
I Borozan
JD Storey
Jin-Xiong She
JK Choi
KHS Wilson
L Ein-Dor
L Xu
L Xu
M Miron
M Schena
M Zhang
P Cahan
PT Spellman
S Dudoit
Sarah E Eckenrode
SE Eckenrode
TR Golub
VK Mootha
X Cui
Y Benjamini
Y Lai
Yinglei Lai
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Different microarray data sets can be collected for studying the same or similar diseases. We expect to achieve a more efficient analysis of differential expression if an efficient statistical method can be developed for integrating different microarray data sets. Although many statistical methods have been proposed for data integration, the genome-wide concordance of different data sets has not been well considered in the analysis. Results: Before considering data integration, it is necessary to evaluate the genome-wide concordance so that misleading results can be avoided. Based on the test results, different subsequent actions are suggested. Both the evaluation of genome-wide concordance and the data integration can be achieved with normal-distribution-based mixture models. Conclusion: The results from our simulation study suggest that misleading results can be generated if the genome-wide concordance issue is not appropriately considered. Our method provides a rigorous parametric solution. The results also show that our method is robust to certain model misspecification and is practically useful for the integrative analysis of differential expression.

George Washington University: Health Sciences Research Commons (HSRC)

Concordance analysis of microarray studies identifies representative gene expression changes in Parkinson’s disease: a comparison of 33 human and animal studies

Author: A Dumitriu
A Dumitriu
A Kasim
A Kauffmann
A Kuhn
A Schroeder
AD Strand
Andreas Bender
B Haibe-Kains
B Zheng
C Stretch
CA Davie
E Saccenti
Erin Oerton
G Konopka
G Yu
GT Sutherland
H Braak
I Cantuti-Castelvetri
J Blesa
J Li
J Russ
J Seok
JE Larkin
JT Dudley
K Kadota
K Takao
L Ein-Dor
L Gautier
L Guo
L Shi
L Zhang
LW Huson
M Atz
M Cruz-Monteagudo
M Mistry
M Zhang
ME Ritchie
OR Bandapalli
P Calabresi
P D’Haeseleer
P Preece
PA Lewis
R Core Team
R Jaksik
R Miller
R Suzuki
RA Ach
RM Miller
S Kilpinen
SAFT Hijum van
SH Lam
WK Lim
X Zheng-Bradley
Y Lu
YJK Edwards
Publication venue: Springer Science and Business Media LLC

GMCM: Unsupervised Clustering and Meta-Analysis Using Gaussian Mixture Copula Models

Author: Bilgrau Anders Ellern
Boegsted Martin
Dybkaer Karen
Eriksen Poul Svante
Johnsen Hans Erik
Rasmussen Jakob Gulddahl
Publication venue: Foundation for Open Access Statistics
Publication date: 01/04/2016

Methods for clustering in unsupervised learning are an important part of the statistical toolbox in numerous scientific disciplines. Tewari, Giering, and Raghunathan (2011) proposed to use so-called Gaussian mixture copula models (GMCM) for general unsupervised learning based on clustering. Li, Brown, Huang, and Bickel (2011) independently discussed a special case of these GMCMs as a novel approach to meta-analysis in high-dimensional settings. GMCMs have attractive properties which make them highly flexible and therefore interesting alternatives to other well-established methods. However, parameter estimation is hard because of intrinsic identifiability issues and intractable likelihood functions. Both aforementioned papers discuss similar expectation-maximization-like algorithms as their pseudo maximum likelihood estimation procedure. We present and discuss an improved implementation in R of both classes of GMCMs along with various alternative optimization routines to the EM algorithm. The software is freely available in the R package GMCM. The implementation is fast, general, and optimized for very large numbers of observations. We demonstrate the use of package GMCM through different applications.