
    The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models

    Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and on team proficiency, and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators who evaluate methods for global gene expression analysis.

    Improving the value of public RNA-seq expression data by phenotype prediction.

    Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements, using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed with a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package, and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.
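    The in silico phenotyping idea above amounts to training a classifier on well-annotated expression profiles and applying it to unlabeled samples. Below is a minimal nearest-centroid sketch, not the phenopredict implementation; the marker genes and expression values are invented for illustration.

```python
def train_centroids(profiles, labels):
    """Average the expression vectors per phenotype label (nearest-centroid model)."""
    sums, counts = {}, {}
    for vec, lab in zip(profiles, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(centroids[lab], vec)))

# Toy training data: two hypothetical sex-marker genes per sample
profiles = [[9.1, 0.2], [8.7, 0.4], [0.3, 7.9], [0.5, 8.2]]
labels = ["female", "female", "male", "male"]
centroids = train_centroids(profiles, labels)
print(predict(centroids, [8.9, 0.3]))  # → female
```

    Once trained on consortium data, the same predictor can be run over every unannotated sample in a large compendium to fill in the missing phenotype labels.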

    Optimizing the noise versus bias trade-off for Illumina whole genome expression BeadChips

    Five strategies for pre-processing intensities from Illumina expression BeadChips are assessed from the point of view of precision and bias. The strategies include a popular variance stabilizing transformation and model-based background corrections that either use or ignore the control probes. Four calibration data sets are used to evaluate precision, bias and false discovery rate (FDR). The original algorithms are shown to have operating characteristics that are not easily comparable. Some tend to minimize noise while others minimize bias. Each original algorithm is shown to have an innate intensity offset, by which unlogged intensities are bounded away from zero, and the size of this offset determines its position on the noise–bias spectrum. By adding extra offsets, a continuum of related algorithms with different noise–bias trade-offs is generated, allowing direct comparison of the performance of the strategies on equivalent terms. Adding a positive offset is shown to decrease the FDR of each original algorithm. The potential of each strategy to generate an algorithm with an optimal noise–bias trade-off is explored by finding the offset that minimizes its FDR. The use of control probes as part of the background correction and normalization strategy is shown to achieve the lowest FDR for a given bias.
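    The offset idea can be illustrated numerically: adding a larger offset before the log transform compresses fold-changes at low intensities (more bias) while flattening the variance of background-level log intensities (less noise). The numbers below are invented for illustration.

```python
import math
import statistics

def log_fc(a, b, offset):
    """Log2 fold-change between two intensities after adding an offset."""
    return math.log2(a + offset) - math.log2(b + offset)

# A true 4-fold change near background (20 vs 5 units above zero):
fcs = [round(log_fc(20, 5, k), 2) for k in (1, 16, 64)]
print(fcs)  # shrinks toward zero as the offset grows (bias)

# Noisy background-level intensities:
noise = [1, 3, 5, 7, 9]
sds = [round(statistics.pstdev([math.log2(x + k) for x in noise]), 3)
       for k in (1, 16, 64)]
print(sds)  # log-scale spread shrinks as the offset grows (less noise)
```

    Sweeping the offset along this continuum is what allows the paper to compare algorithms at matched points on the noise–bias spectrum.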

    Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data

    Background: Many researchers are concerned with the comparability and reliability of microarray gene expression data. The recent completion of the MicroArray Quality Control (MAQC) project provides a unique opportunity to assess reproducibility across multiple sites and comparability across multiple platforms. The MAQC analysis presented for the conclusion of inter- and intra-platform comparability/reproducibility of microarray gene expression measurements is inadequate. We evaluate the reproducibility/comparability of the MAQC data for 12,901 common genes in four titration samples generated from five high-density one-color microarray platforms and the TaqMan technology. We discuss some of the problems with the use of the correlation coefficient as a metric for evaluating inter- and intra-platform reproducibility, and with the percent of overlapping genes (POG) as a measure for evaluating a gene selection procedure, as done by MAQC. Results: A total of 293 arrays were used in the intra- and inter-platform analysis. A hierarchical cluster analysis shows distinct differences in the measured intensities among the five platforms. A number of genes show a small fold-change in one platform and a large fold-change in another, even though the correlations between platforms are high. An analysis of variance shows that thirty percent of the genes exhibit inconsistent expression patterns across the five platforms. We illustrate that POG does not reflect the accuracy of a selected gene list: a non-overlapping gene can be truly differentially expressed under a stringent cutoff, and an overlapping gene can be non-differentially expressed under a non-stringent cutoff. In addition, POG is unsuitable as a selection criterion, since it can increase or decrease irregularly as the cutoff changes, and there is no criterion for choosing a cutoff at which POG is optimized. Conclusion: Using various statistical methods, we demonstrate that there are differences in the intensities measured by different platforms and by different sites within a platform. Within each platform, the patterns of expression are generally consistent, but there is site-by-site variability. Evaluation of data analysis methods for regulatory decisions should take the no-treatment-effect case into consideration: when there is no treatment effect, "a fold-change cutoff with a non-stringent p-value cutoff" could result in 100% false-positive selection.
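    The irregular behaviour of POG is easy to reproduce in a few lines; the gene rankings below are invented for illustration. Note how the overlap is 100% at a cutoff of 2, drops at 3, and rises again at 5.

```python
def pog(list_a, list_b, cutoff):
    """Percent of overlapping genes among the top-`cutoff` genes of two ranked lists."""
    return 100.0 * len(set(list_a[:cutoff]) & set(list_b[:cutoff])) / cutoff

# Hypothetical DE gene rankings from two platforms:
platform1 = ["g1", "g2", "g3", "g4", "g5", "g6"]
platform2 = ["g2", "g1", "g5", "g7", "g3", "g8"]
for k in (2, 3, 5):
    print(k, round(pog(platform1, platform2, k), 1))  # 100.0, then 66.7, then 80.0
```

    Because the curve is non-monotone, there is no principled cutoff at which POG is optimized, which is the abstract's point.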

    Estimating the proportion of microarray probes expressed in an RNA sample

    A fundamental question in microarray analysis is the estimation of the number of expressed probes in different RNA samples. Negative control probes available in the latest microarray platforms, such as Illumina whole genome expression BeadChips, provide a unique opportunity to estimate the number of expressed probes without setting a threshold. A novel algorithm was proposed in this study to estimate the number of expressed probes in an RNA sample by utilizing these negative controls to measure background noise. The performance of the algorithm was demonstrated by comparing different generations of Illumina BeadChips, comparing the set of probes targeting well-characterized RefSeq NM transcripts with other probes on the array, and comparing pure samples with heterogeneous samples. Furthermore, hematopoietic stem cells were found to have a larger transcriptome than progenitor cells, and Aire knockout medullary thymic epithelial cells were shown to have significantly fewer expressed probes than matched wild-type cells.
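    The published algorithm avoids fixing a single threshold; as a simplified illustration of how negative controls calibrate such an estimate, the sketch below does use one quantile of the control distribution and corrects the observed exceedance for the expected false-positive rate. All data and names are invented, and it assumes (a simplification) that expressed probes exceed the threshold.

```python
def prop_expressed(probe_intensities, negative_controls, q=0.95):
    """Rough estimate of the proportion of expressed probes: fraction exceeding
    the q-quantile of the negative controls, corrected for the (1 - q) share of
    non-expressed probes expected to exceed it by chance."""
    controls = sorted(negative_controls)
    t = controls[int(q * len(controls)) - 1]  # q-quantile of the controls
    p_obs = sum(x > t for x in probe_intensities) / len(probe_intensities)
    return max(0.0, (p_obs - (1 - q)) / q)

controls = list(range(1, 101))            # background-only intensities
probes = list(range(1, 51)) + [200] * 50  # half background, half expressed
print(round(prop_expressed(probes, controls), 2))  # → 0.47 (true value 0.5)
```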

    Fold change and p-value cutoffs significantly alter microarray interpretations

    Background: As context is important to gene expression, so is preprocessing to the interpretation of microarray transcriptomics data. Microarray data suffer from several normalization and significance problems. Arbitrary fold-change (FC) cutoffs of >2 and significance p-value cutoffs of <0.02 restrict attention to the genes that vary most wildly among all genes. Questions therefore arise as to whether the biology or the statistical cutoff matters more for the interpretation. In this paper, we reanalyzed a zebrafish (D. rerio) microarray data set using GeneSpring and different differential gene expression cutoffs and found that the data interpretation was drastically different. Furthermore, despite advances in microarray technology, the array captures a large portion of known genes yet still leaves large voids in the set of genes assayed, such as leptin, a pleiotropic hormone directly related to hypoxia-induced angiogenesis. Results: The data strongly suggest that more differentially expressed genes are up-regulated than down-regulated, with many genes showing signalling conserved with previously known functions. Reanalysis of the data from Marques et al. (2008) gave similar but, surprisingly, partly different results, with some genes showing unexpected signalling that may be a product of the tissue (heart) studied or of a transient response. Conclusions: Our analyses suggest that, depending on the chosen statistical or fold-change cutoff, microarray analysis can provide essentially more than one answer, making data interpretation more of an art than a science and follow-up gene expression studies a must. Furthermore, gene chip annotation and development need to keep pace not only with newly sequenced genomes but also with novel genes that are crucial to the overall interpretation of gene chips.
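    The cutoff sensitivity described above is straightforward to demonstrate; the gene names, fold-changes, and p-values below are invented for illustration.

```python
results = [  # (gene, fold_change, p_value), invented values
    ("geneA", 2.5, 0.001),
    ("geneB", 1.8, 0.0005),
    ("geneC", 3.1, 0.030),
    ("geneD", 2.1, 0.015),
]

def select_de(results, fc_cut, p_cut):
    """Keep genes passing both the fold-change and the p-value cutoff."""
    return sorted(g for g, fc, p in results if fc > fc_cut and p < p_cut)

print(select_de(results, 2.0, 0.02))  # → ['geneA', 'geneD']
print(select_de(results, 1.5, 0.01))  # → ['geneA', 'geneB']
```

    Two equally defensible cutoff choices already yield different gene lists, and hence different biological stories.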

    Gene selection with multiple ordering criteria

    BACKGROUND: A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, fold-change and p-value are two commonly used criteria for selecting differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor experiment, say treatment by time, the investigator may be interested in a single gene list that responds to both treatment and time effects. RESULTS: We propose three layer ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preferred gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to three univariate ranking criteria: fold-change, p-value, and frequency of selection by the SVM-RFE classifier. A simulation experiment shows that, for experiments with small or moderate sample sizes (less than 20 per group) and for detecting a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first illustrates a use of the layer rankings to potentially improve predictive accuracy. The second illustrates an application to a two-factor experiment involving two dose levels and two time points; the layer rankings are applied to selecting differentially expressed genes relating to the dose and time effects. In the third application, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from the long list of differentially expressed genes generated from the three dilution concentrations.
    CONCLUSION: The layer ranking algorithms are useful for helping investigators select the most promising genes from multiple gene lists generated by different filtering, normalization, or analysis methods for various objectives.
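    As a hedged sketch of the Pareto variant (not the authors' code), genes can be peeled into layers, where layer 1 contains every gene not dominated on both criteria simultaneously. The fold-changes and p-values are invented.

```python
def pareto_layers(genes):
    """Peel genes into Pareto layers: gene X is dominated if some other gene has
    a fold-change at least as large AND a p-value at least as small."""
    remaining = dict(genes)  # name -> (fold_change, p_value)
    layers = []
    while remaining:
        front = [n for n, (fc, p) in remaining.items()
                 if not any(fc2 >= fc and p2 <= p and (fc2, p2) != (fc, p)
                            for fc2, p2 in remaining.values())]
        layers.append(sorted(front))
        for n in front:
            del remaining[n]
    return layers

genes = {"g1": (4.0, 0.001), "g2": (5.0, 0.010),
         "g3": (3.0, 0.005), "g4": (2.0, 0.050)}
print(pareto_layers(genes))  # → [['g1', 'g2'], ['g3'], ['g4']]
```

    g1 and g2 land in layer 1 because neither is beaten on both criteria at once, even though g2 has the larger fold-change and g1 the smaller p-value.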

    Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments

    Background: High-throughput sequencing technologies, such as the Illumina Genome Analyzer, are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key for drawing meaningful and accurate conclusions from the massive and complex datasets generated by the sequencers. We provide a detailed evaluation of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing (mRNA-Seq) data. Results: We compare statistical methods for detecting genes that are significantly DE between two types of biological samples and find that there are substantial differences in how the test statistics handle low-count genes. We evaluate how DE results are affected by features of the sequencing platform, such as varying gene lengths, the base-calling calibration method (with and without a phi X control lane), and flow-cell/library preparation effects. We investigate the impact of the read-count normalization method on DE results and show that the standard approach of scaling by total lane counts (e.g., RPKM) can bias estimates of DE. We propose more general quantile-based normalization procedures and demonstrate an improvement in DE detection. Conclusions: Our results have significant practical and methodological implications for the design and analysis of mRNA-Seq experiments. They highlight the importance of appropriate statistical methods for normalization and DE inference to account for features of the sequencing platform that could impact the accuracy of results. They also reveal the need for further research in the development of statistical and computational methods for mRNA-Seq.
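    The bias from total-count scaling can be seen with a toy example: a single extremely abundant gene inflates one lane's total, while a quantile-based factor such as the upper quartile of non-zero counts is insensitive to it. The counts below are invented and this is only a sketch of the quantile-based idea, not the paper's exact procedure.

```python
import statistics

def upper_quartile(lane_counts):
    """Per-lane scale factor: 75th percentile of the non-zero gene counts."""
    nonzero = sorted(c for c in lane_counts if c > 0)
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]

lane_a = [0, 3, 10, 25, 40, 2000]  # one extremely abundant gene
lane_b = [0, 3, 10, 25, 40, 100]
print(sum(lane_a), sum(lane_b))                        # 2078 vs 178
print(upper_quartile(lane_a), upper_quartile(lane_b))  # 40.0 vs 40.0
```

    Dividing by lane totals would make every other gene in lane_a look under-expressed relative to lane_b, even though their counts are identical; the upper-quartile factors agree.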

    Classification across gene expression microarray studies

    Background: The increasing number of gene expression microarray studies represents an important resource in biomedical research. As a result, gene expression based diagnosis has entered clinical practice for patient stratification in breast cancer. However, the integration and combined analysis of microarray studies remains a challenge. We assessed the potential benefit of data integration on classification accuracy and systematically evaluated the generalization performance of selected methods on four breast cancer studies comprising almost 1000 independent samples. To this end, we introduced an evaluation framework which aims to establish good statistical practice and a graphical way to monitor differences. The classification goal was to correctly predict estrogen receptor status (negative/positive) and histological grade (low/high) of each tumor sample in an independent study which was not used for training. For classification we chose support vector machines (SVM), predictive analysis of microarrays (PAM), random forest (RF) and k-top scoring pairs (kTSP). Guided by considerations relevant for classification across studies, we developed and additionally evaluated a generalization of kTSP. Our derived version (DV) aims to improve the robustness of the intrinsic invariance of kTSP with respect to technologies and preprocessing. Results: For each individual study the generalization error was benchmarked via complete cross-validation and was found to be similar for all classification methods. The misclassification rates were substantially higher in classification across studies, when each single study was used as an independent test set while all remaining studies were combined for training the classifier. However, with an increasing number of independent microarray studies used in training, the overall classification performance improved.
    DV performed better than the average and showed slightly less variance. In particular, the better predictive results of DV in across-platform classification indicate higher robustness of the classifier when trained on single-channel data and applied to gene expression ratios. Conclusions: We present a systematic evaluation of strategies for the integration of independent microarray studies in a classification task. Our findings on across-study classification may guide further research aiming at the construction of more robust and reliable methods for stratification and diagnosis in clinical practice.
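    The rank-based idea behind kTSP can be illustrated with a single top scoring pair: the decision depends only on which of two genes is expressed higher within a sample, so any monotone per-sample transformation (a platform or preprocessing change) leaves it unchanged. The data and function names below are invented for illustration; kTSP itself votes over k such pairs.

```python
def tsp_score(samples_a, samples_b, i, j):
    """Difference between classes in the frequency of the event expr[i] < expr[j]."""
    fa = sum(s[i] < s[j] for s in samples_a) / len(samples_a)
    fb = sum(s[i] < s[j] for s in samples_b) / len(samples_b)
    return abs(fa - fb)

def classify(sample, i, j):
    """Predict class A when gene i is expressed below gene j, else class B."""
    return "A" if sample[i] < sample[j] else "B"

# Toy expression matrices (rows = samples, columns = genes):
class_a = [[1.0, 5.0, 2.0], [0.8, 6.1, 2.2]]  # gene0 < gene1 in class A
class_b = [[7.0, 2.0, 2.1], [6.5, 1.5, 2.0]]  # gene0 > gene1 in class B
print(tsp_score(class_a, class_b, 0, 1))  # → 1.0 (perfectly separating pair)
print(classify([0.9, 4.0, 2.0], 0, 1))    # → A
```

    Because only the within-sample ordering matters, the same rule can be applied unchanged to data from a different platform, which is the invariance the derived version builds on.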