113 research outputs found

    A Nonparametric Mean-Variance Smoothing Method to Assess Arabidopsis Cold Stress Transcriptional Regulator CBF2 Overexpression Microarray Data

    Microarray is a powerful tool for genome-wide gene expression analysis. In microarray expression data, the mean and variance of gene expression are often related. We present a non-parametric mean-variance smoothing method (NPMVS) to analyze differentially expressed genes. In this method, a nonlinear smoothing curve is fitted to estimate the relationship between mean and variance, and inference is then based on shrinkage estimation of posterior means, with the variances treated as known. Different methods were applied to simulated datasets in which a variety of mean-variance relationships were imposed. The simulation study showed that NPMVS outperformed two other popular shrinkage estimation methods under some mean-variance relationships and was competitive with them under the others. A real biological dataset, in which a cold stress transcription factor gene, CBF2, was overexpressed, was also analyzed with the three methods. Gene ontology and cis-element analysis showed that NPMVS identified more cold- and stress-responsive genes than the other two methods did. The good performance of NPMVS is mainly due to its shrinkage estimation of both means and variances. In addition, NPMVS exploits a non-parametric regression between mean and variance instead of assuming a specific parametric relationship between them. The source code, written in R, is available from the authors on request.
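    The authors' R code is not shown here. As a rough illustration of the mean-variance smoothing idea only (not NPMVS itself), the sketch below fits a running-median trend of variance against mean on simulated data and shrinks each raw gene variance toward that trend; the window size and shrinkage weight are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated expression data: 200 genes x 4 replicates, with
# variance increasing with mean (a common microarray pattern).
means_true = rng.uniform(4, 12, size=200)
data = means_true[:, None] + rng.normal(0, 0.1 * means_true[:, None], size=(200, 4))

gene_mean = data.mean(axis=1)
gene_var = data.var(axis=1, ddof=1)

# Nonparametric smooth of variance against mean: a running median over
# genes ordered by mean (a stand-in for the paper's smoothing curve).
order = np.argsort(gene_mean)
window = 31
smoothed = np.empty_like(gene_var)
for rank, idx in enumerate(order):
    lo = max(0, rank - window // 2)
    hi = min(len(order), rank + window // 2 + 1)
    smoothed[idx] = np.median(gene_var[order[lo:hi]])

# Shrink each raw gene variance toward the smoothed trend
# (the 0.5 weight is a hypothetical choice, not from the paper).
weight = 0.5
var_shrunk = weight * gene_var + (1 - weight) * smoothed

# A moderated t-like statistic would then use the shrunken variance
# in place of the unstable per-gene estimate.
t_mod = gene_mean / np.sqrt(var_shrunk / data.shape[1])
```

    Shrinking toward the trend stabilizes the per-gene estimates, which is the source of the method's power with few replicates.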

    A close examination of double filtering with fold change and t test in microarray analysis

    Background: Many researchers use the double filtering procedure with fold change and t test to identify differentially expressed genes, in the hope that the double filtering will provide extra confidence in the results. Due to its simplicity, the procedure has remained popular with applied researchers despite the development of more sophisticated methods. Results: This paper, for the first time to our knowledge, provides theoretical insight into the drawback of the double filtering procedure. We show that fold change assumes all genes to have a common variance, while the t statistic assumes gene-specific variances; the two statistics are therefore based on contradictory assumptions. Under the assumption that gene variances arise from a mixture of a common variance and gene-specific variances, we develop the theoretically most powerful likelihood ratio test statistic. We further demonstrate that posterior inference based on a Bayesian mixture model, and the widely used significance analysis of microarrays (SAM) statistic, are better approximations to the likelihood ratio test than the double filtering procedure. Conclusion: We demonstrate through hypothesis testing theory, simulation studies and real data examples that well-constructed shrinkage testing methods, which can be united under the mixture gene variance assumption, can considerably outperform the double filtering procedure.
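    As a concrete illustration of the procedure this paper critiques, the following sketch (not from the paper; the cut-offs are illustrative) applies the double filter to simulated two-condition data on the log2 scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# 100 genes, 5 replicates per condition, log2 scale;
# the first 10 genes are shifted up in condition B.
a = rng.normal(0.0, 1.0, size=(100, 5))
b = rng.normal(0.0, 1.0, size=(100, 5))
b[:10] += 2.0

log_fc = b.mean(axis=1) - a.mean(axis=1)       # log2 fold change
t_stat, p_val = stats.ttest_ind(b, a, axis=1)  # per-gene two-sample t test

# The double filter: keep genes that clear BOTH thresholds
# (a 2-fold change and p <= 0.05 here are common but arbitrary choices).
selected = (np.abs(log_fc) >= 1.0) & (p_val <= 0.05)
```

    The paper's point is that the two filters rest on contradictory variance assumptions, so their intersection is not a principled test.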

    ROAST: rotation gene set tests for complex microarray experiments

    Motivation: A gene set test is a differential expression analysis in which a P-value is assigned to a set of genes as a unit. Gene set tests are valuable for increasing statistical power, for organizing and interpreting results, and for relating expression patterns across different experiments. Existing methods are based on permutation. Methods that rely on permutation of probes unrealistically assume independence of genes, while those that rely on permutation of samples are suitable only for two-group comparisons with a good number of replicates in each group.

    A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level

    The complexity of mammalian transcriptomes is compounded by alternative splicing, which allows one gene to produce multiple transcript isoforms. However, transcriptome comparison has been limited to differential analysis at the gene level instead of the individual transcript isoform level. High-throughput sequencing technologies and high-resolution tiling arrays provide an unprecedented opportunity to compare transcriptomes at the level of individual splice variants. However, sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. Here we propose a hierarchical Bayesian model, BASIS (Bayesian Analysis of Splicing IsoformS), to infer the differential expression level of each transcript isoform in response to two conditions. A latent variable was introduced to perform direct statistical selection of differentially expressed isoforms. Model parameters were inferred based on an ergodic Markov chain generated by our Gibbs sampler. BASIS has the ability to borrow information across different probes (or positions) from the same genes and different genes. BASIS can handle the heteroskedasticity of probe intensity or sequence read coverage. We applied BASIS to a human tiling-array data set and a mouse RNA-seq data set. Some of the predictions were validated by quantitative real-time RT–PCR experiments.

    Use of genomic DNA control features and predicted operon structure in microarray data analysis: ArrayLeaRNA – a Bayesian approach

    Background: Microarrays are widely used for the study of gene expression; however, deciding whether observed differences in expression are significant remains a challenge. Results: A computing tool (ArrayLeaRNA) has been developed for gene expression analysis. It implements a Bayesian approach based on the Gumbel distribution, using printed genomic DNA control features for normalization and for estimation of the parameters of the Bayesian model, together with prior knowledge from predicted operon structure. The method is compared with two other approaches: classical LOWESS normalization followed by a two-fold cut-off criterion, and the OpWise method (Price, et al. 2006. BMC Bioinformatics. 7, 19), a published Bayesian approach that also uses predicted operon structure. The three methods were compared on experimental datasets with prior knowledge of gene expression. With ArrayLeaRNA, data normalization is carried out using the genomic control features, which behave as equally transcribed genes; the statistical significance of a difference in expression is likewise assessed from the variability of these equally transcribed genes. The operon information helps the classification of genes with low-confidence measurements. ArrayLeaRNA is implemented in Visual Basic and freely available as an Excel add-in at http://www.ifr.ac.uk/safety/ArrayLeaRNA/. Conclusion: We have introduced a novel Bayesian model and demonstrated that it is a robust method for analysing microarray expression profiles. ArrayLeaRNA showed a considerable improvement in data normalization, in the estimation of the experimental variability intrinsic to each hybridization, and in the establishment of a clear boundary between non-changing and differentially expressed genes. The method is applicable to data derived from hybridizations of labelled cDNA samples as well as from hybridizations of labelled cDNA with genomic DNA, and can be used for the analysis of datasets where differentially regulated genes predominate.

    A full Bayesian hierarchical mixture model for the variance of gene differential expression

    Background: In many laboratory-based high-throughput microarray experiments, there are very few replicates of gene expression levels, so estimates of gene variances are inaccurate. Visual inspection of graphical summaries of these data usually reveals that heteroscedasticity is present, and the standard approach to address this is a log2 transformation. It is then common to assume that gene variability is constant in subsequent analyses; however, this is perhaps too stringent an assumption, and more careful inspection reveals that the simple log2 transformation does not remove the heteroscedasticity. An alternative strategy is to assume independent gene-specific variances, although this is also problematic, as variance estimates based on few replicates are highly unstable. More meaningful and reliable comparisons of gene expression, across conditions or tissue samples, might be achieved with test statistics based on accurate estimates of gene variability; this is a crucial step in the identification of differentially expressed genes. Results: We propose a Bayesian mixture model that classifies genes according to similarity in their variance. Genes in the same latent class share a similar variance, estimated from a larger number of replicates than those of any single gene, namely the total of all replicates of all genes in that class. An example dataset, consisting of 9216 genes with four replicates per condition, yielded four latent classes based on variance similarity. Conclusion: The mixture variance model provides a realistic and flexible estimate of the variance of gene expression data under limited replication. We believe that by using the latent-class variances, estimated from the larger number of genes in each derived latent group, the resulting p-values are more robust than those obtained with either a constant or a gene-specific variance estimate.
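    Not the paper's model, but a toy illustration of the pay-off: once genes are grouped into variance classes, each gene's variance can be estimated from every replicate in its class. Here a crude median split stands in for the Bayesian mixture's latent-class assignment.

```python
import numpy as np

rng = np.random.default_rng(3)
# 300 genes with 4 replicates; genes come from two latent variance classes.
cls_true = rng.integers(0, 2, size=300)
sd = np.where(cls_true == 0, 0.2, 1.5)
data = rng.normal(0.0, sd[:, None], size=(300, 4))

gene_var = data.var(axis=1, ddof=1)

# Crude stand-in for the latent-class assignment: split genes at the
# median of their raw variances (the paper infers this with a mixture).
label = (gene_var > np.median(gene_var)).astype(int)

# Replace each gene's unstable 4-replicate estimate with the class-pooled
# value, based on all replicates of all genes in the same class.
pooled = np.array([gene_var[label == k].mean() for k in (0, 1)])
var_class = pooled[label]
```

    Test statistics built on `var_class` use hundreds of replicates per variance estimate rather than four, which is what makes the resulting p-values more stable.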

    Empirical Bayes analysis of sequencing-based transcriptional profiling without replicates

    Background: Recent technological advancements have made high-throughput sequencing an increasingly popular approach for transcriptome analysis. Advantages of sequencing-based transcriptional profiling over microarrays have been reported, including lower technical variability. However, advances in technology do not remove biological variation between replicates, and this variation is often neglected in many analyses. Results: We propose an empirical Bayes method, Analysis of Sequence Counts (ASC), to detect differential expression based on sequencing technology. ASC borrows information across sequences to establish a prior distribution of sample variation, so that biological variation can be accounted for even when replicates are not available. Compared to current approaches that simply test for equality of proportions in two samples, ASC is less biased towards highly expressed sequences and can identify more genes with a greater log fold change at lower overall abundance. Conclusions: ASC unifies the biological and statistical significance of differential expression by estimating the posterior mean of the log fold change and estimating false discovery rates based on that posterior mean. The implementation in R is available at http://www.stat.brown.edu/Zwu/research.aspx.

    A unified framework for finding differentially expressed genes from microarray experiments

    Background: This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, (a) two-way clustering and (b) combined adaptive ranking, to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion; the DEGs are then selected using FDR analysis. The third module performs a three-fold validation of the obtained DEGs: the robustness of gene selection is first illustrated using false discovery rate analysis; a clustering-based validation is then performed by applying an adaptive subspace-based clustering algorithm to the training and test datasets; finally, a projection-based visualization is used to validate the DEGs obtained with the unified framework. Results: The performance of the unified framework is compared with well-known ranking algorithms such as the t-statistic, Significance Analysis of Microarrays (SAM), adaptive ranking, combined adaptive ranking and two-way clustering. Performance curves obtained on 50 simulated microarray datasets, each following one of two different distributions, indicate the superiority of the unified framework over the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show similar improvements in performance: the three-fold validation process is provided for the two-sample cancer datasets, and the analysis of the Parkinson's datasets demonstrates the scalability of the proposed method to multi-sample microarray datasets. Conclusion: This paper presents a unified framework for the robust selection of genes from two-sample as well as multi-sample microarray experiments. The two ranking methods used in module 1 bring diversity to the selection of genes. The conversion of ranks to p-values, the fusion of the p-values and the FDR analysis aid the identification of significant genes that cannot be judged on gene ranking alone. The three-fold validation (robustness of selection under FDR analysis, clustering, and visualization) demonstrates the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach: the 3 cancer datasets demonstrate its utility on data with two classes of samples, and the 3 sets of Parkinson's data address its scalability to multi-sample (more than two sample classes) datasets. These analyses show that the unified framework outperforms other gene selection methods in selecting differentially expressed genes from microarray data.
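    The fusion step in module 2 relies on Fisher's omnibus criterion, which combines independent p-values for the same gene into one: under the null, -2 Ī£ log(p) follows a chi-square distribution with 2k degrees of freedom for k p-values. A minimal sketch with made-up p-values:

```python
import numpy as np
from scipy import stats

# Two sets of per-gene p-values, e.g. from two different ranking
# methods (the values here are invented for illustration).
p1 = np.array([0.01, 0.20, 0.80, 0.04])
p2 = np.array([0.03, 0.50, 0.60, 0.02])

# Fisher's omnibus statistic: -2 * sum(log p) is chi-square
# distributed with 2k degrees of freedom under the null (k = 2 here).
x2 = -2 * (np.log(p1) + np.log(p2))
p_fused = stats.chi2.sf(x2, df=4)
```

    A gene that is moderately significant under both methods (first entry) ends up more significant after fusion than under either method alone.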

    Comparison study of microarray meta-analysis methods

    Background: Meta-analysis methods exist for combining multiple microarray datasets. However, a wide range of issues are associated with microarray meta-analysis, and there is limited ability to compare the performance of different meta-analysis methods. Results: We compare eight meta-analysis methods: five existing methods, two naive methods and a novel approach (mDEDS). Comparisons are performed using simulated data and two biological case studies with varying degrees of meta-analysis complexity. The performance of the methods is assessed via ROC curves and, where applicable, prediction accuracy. Conclusions: Existing meta-analysis methods vary in their ability to perform successful meta-analysis, and this success depends strongly on the complexity of the data and the type of analysis. Our proposed method, mDEDS, performs competitively as a meta-analysis tool even as complexity increases. Given the varying abilities of the compared methods, care should be taken when choosing a meta-analysis method for particular research.

    Pre-processing Agilent microarray data

    Background: Pre-processing methods for two-sample long-oligonucleotide arrays, specifically the Agilent technology, have not been extensively studied. The goal of this study is to quantify some of the sources of error that affect measurement of expression using Agilent arrays, and to compare Agilent's Feature Extraction software with pre-processing methods that have become the standard for normalization of cDNA arrays: log transformation followed by loess normalization, with or without background subtraction, and often a between-array scale normalization procedure. The larger goal is to define best study design and pre-processing practices for Agilent arrays, and we offer some suggestions. Results: Simple loess normalization without background subtraction produced the lowest variability. However, without background subtraction, fold changes were biased towards zero, particularly at low intensities. ROC analysis of a spike-in experiment showed that differentially expressed genes are most reliably detected when background is not subtracted: loess normalization with no background subtraction yielded an AUC of 99.7%, compared with 88.8% for Agilent-processed fold changes. All methods performed well when error was taken into account by t- or z-statistics (AUCs ≥ 99.8%). A substantial proportion of genes, 43% (99% CI: 39%, 47%), showed dye effects, although these effects were generally small regardless of the pre-processing method. Conclusion: Simple loess normalization without background subtraction resulted in low-variance fold changes that ranked gene expression more reliably than the other methods. While t-statistics and other measures that take variation into account, including Agilent's z-statistic, can also be used to reliably select differentially expressed genes, fold changes remain a standard measure of differential expression for exploratory work, cross-platform comparison and biological interpretation, and cannot be entirely replaced. Although dye effects are small for most genes, many array features are affected; an experimental design that incorporates dye swaps or a common reference could therefore be valuable.
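    A minimal sketch of the kind of intensity-dependent (MA-plot) normalization discussed above, on simulated two-channel data with no background subtraction; a cubic polynomial stands in for the loess smoother, and none of this is the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated two-channel log intensities with an intensity-dependent
# dye bias added to the red channel.
n = 500
true = rng.uniform(6, 14, size=n)
green = true + rng.normal(0, 0.3, size=n)
red = true + 0.05 * (true - 10) ** 2 + rng.normal(0, 0.3, size=n)

# MA transform: M = log ratio, A = average log intensity.
m = red - green
a = (red + green) / 2

# Intensity-dependent normalization: fit a smooth trend of M on A and
# subtract it (a cubic polynomial replaces loess for simplicity).
trend = np.polyval(np.polyfit(a, m, deg=3), a)
m_norm = m - trend
```

    After subtraction the log ratios are centred on zero across the intensity range, which is the property loess normalization is designed to enforce.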