
INRIA a CCSD electronic archive server

HAL-Rennes 1

Blind Source Separation and the Analysis of Microarray Data

Author: B. Torrésani
Beirlant J.
Dudoit S.
Ghosh D.
Gruvberger S.
M.C. Roubaud
P. Chiappetta
Sekowska A.
Publication venue: Mary Ann Liebert Inc

Microarray Probe Expression Measures, Data Normalization and Statistical Validation

Author: Affymetrix
Baldi
Bolstad
Dudoit
Golub
Hartemink
Irizarry
Irizarry
Irizarry
Kim
Li
Li
Raffaele A. Calogero
Rocke
Saviozzi
Schena
Silvia Saviozzi
Tusher
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2003

DNA microarray technology is a high-throughput method for gaining information on gene function. It is based on the ordered deposition or synthesis, on a solid surface, of thousands of EST sequences, genes, or oligonucleotides. Because of the large number of data points generated, computational tools are essential in microarray data analysis and mining to extract knowledge from experimental results. In this review, we focus on some of the methodologies currently available for defining gene expression intensity measures, normalizing microarray data, and statistically validating differential expression.

Institutional Research Information System University of Turin

Knowledge-based gene expression classification via matrix factorization

Author: A. M. Tomé
Affymetrix
Allison
Baldi
Barnhill
Bolstad
Breiman
Cardoso
Cardoso
Chen
D. Lutter
Diaz-Uriarte
Diaz-Uriarte
Dougherty
Dougherty
Dudoit
E. W. Lang
F. J. Theis
G. Schmitz
Galton
Galton
Golub
Guyon
Hochreiter
Irrizarry
Lee
Li
Liebermeister
Liu
Lutter
M. Stetter
Mangasarian
P. Gómez Vilda
P. Knollmüller
Pearson
Quackenbush
R. Schachtner
Saidi
Schachtner
Schachtner
Schölkopf
Simon
Spang
Talloen
Troyanskaya
Tusher
Wu
Wu
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2008

Motivation: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently being explored for the analysis of gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). The extracted features are considered indicative of underlying regulatory processes. They can also be applied to the classification of gene expression datasets, either by grouping samples into different categories for diagnostic purposes or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks. Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy versus Niemann Pick C disease patients. Funding: Siemens AG, Munich; DFG (Graduate College 638); DAAD (PPP Luso-Alemã and PPP Hispano-Alemanas).
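The metagene idea mentioned in the abstract above can be illustrated with a minimal sketch of plain NMF. This is not the sparse NMF variant used in the study; it uses the standard multiplicative update rules for the squared-error objective, and the function names (`nmf`, `frob_error`), the toy matrix, and all parameter values are assumptions made for illustration only.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def frob_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (genes x samples) as W H with k metagenes.

    Plain multiplicative updates for the squared-error objective
    (an assumption; the study itself uses a sparse variant).
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H

# Hypothetical toy expression matrix: 4 genes x 3 samples.
X = [[5.0, 4.0, 0.5], [4.0, 5.0, 0.2], [0.3, 0.1, 3.0], [0.2, 0.4, 4.0]]
W, H = nmf(X, k=2)
print("reconstruction error:", frob_error(X, W, H))
```

In this sketch the columns of `W` play the role of metagenes and the rows of `H` give their weights in each sample; the updates keep both factors non-negative because they only multiply by non-negative ratios.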

University of Regensburg Publication Server

Repositório Institucional da Universidade de Aveiro

PuSH

GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest

Author: A Alibés
A Liaw
B Efron
C Ambroise
C Strobl
EJ Kontoghiorghes
H Sutter
I Foster
I Medina
J Dongarra
KH Pan
L Ein-Dor
NL Pochet
P Pacheco
R Development Core Team
R Diaz-Uriarte
R Díaz-Uriarte
R Díaz-Uriarte
R Simon
Ramón Diaz-Uriarte
RL Somorjai
S Dudoit
S Dudoit
S Michiels
S Patel
S Varma
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor, including carefully avoiding selection biases and allowing analysis of multiple solutions, together with access to additional functional information on the selected genes. Methodologically, such a tool would be of greater use if it incorporated state-of-the-art computational approaches and made its source code available. Results: We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, which makes it possible to take advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of the prediction error rate and assessments of the stability of the solutions. Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS for examination of PubMed references, GO terms, and KEGG and Reactome pathways characteristic of the sets of genes selected for class prediction. The full source code is available, so the software can be extended. The web-based application is available from http://genesrf2.bioinfo.cnio.es. All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN. Conclusion: varSelRF and GeneSrF implement a validated method for gene selection, including bootstrap estimates of the classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology used (a combination of parallelization with a web-based application), they are also of methodological interest to bioinformaticians and biostatisticians.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

Biblos-e Archivo

Predictive response-relevant clustering of expression data provides insights into disease processes

Author: Abe
Amanda K. Sampson
Anna F. Dominiczak
Bach
Bae
Benjamini
Bennett
Bishop
Breitling
Bunger
Clark
de Snoo
Delyth Graham
Doi
Dudoit
Golub
Gore
Graham
Graham Young
Hanczar
Harris
Hoffbrand
Hubert
Huffman
Irizarry
Jeffs
John D. McClure
Kearney
Keith J. Harris
Lee
Lee
Lisa E. M. Hopcroft
Mark A. Girolami
Martin W. McBride
McBride
Mohri
Park
Stein
Tessa L. Holyoake
Tibshirani
Vinh
Weinberger
Woon
Ziino
Zuber
Publication venue: Oxford University Press (OUP)
Publication date: 22/06/2010

This article describes and illustrates a novel method of microarray data analysis that couples model-based clustering and binary classification to form clusters of 'response-relevant' genes; that is, genes that are informative when discriminating between the different values of the response. Predictions are subsequently made using an appropriate statistical summary of each gene cluster, which we call the 'meta-covariate' representation of the cluster, in a probit regression model. We first illustrate this method by analysing a leukaemia expression dataset, before focusing closely on the meta-covariate analysis of a renal gene expression dataset in a rat model of salt-sensitive hypertension. We explore the biological insights provided by our analysis of these data. In particular, we identify a highly influential cluster of 13 genes, including three transcription factors (Arntl, Bhlhe41 and Npas2), that is implicated as being protective against hypertension in response to increased dietary sodium. Functional and canonical pathway analysis of this cluster using Ingenuity Pathway Analysis implicated transcriptional activation and circadian rhythm signalling, respectively. Although we illustrate our method using only expression data, the method is applicable to any high-dimensional dataset.

White Rose Research Online

Enlighten

CUED - Cambridge University Engineering Department

Nonparametric relevance-shifted multiple testing procedures for the analysis of high-dimensional multivariate data with small sample sizes

Author: AI Fleishman
C Frömke
C Li
Cornelia Frömke
D Hauschke
DC Polacek
DJ Schaid
E Witt
J Khan
JF Chich
L Guo
LA Hothorn
Ludwig A Hothorn
N Zimmermann
NF Cariello
OG Troyanskaya
PH Westfall
PH Westfall
S Dudoit
S Dudoit
S Holm
S Kropf
S Kropf
S Lange
Siegfried Kropf
T Speed
VR Iyer
Y Benjamini
Y Ge
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: In many research areas it is necessary to find differences between treatment groups with several variables. For example, studies of microarray data seek, for each variable, a significant difference in location parameters from zero, or a significant deviation of their ratio from one. However, in some studies a significant deviation of the difference in locations from zero (or from 1 in terms of the ratio) is biologically meaningless. A relevant difference or ratio is sought in such cases. Results: This article addresses the use of relevance-shifted tests on ratios for a multivariate parallel two-sample group design. Two empirical procedures are proposed which embed the relevance-shifted test on ratios. As both procedures test a hypothesis for each variable, the resulting multiple testing problem has to be considered. Hence, the procedures include a multiplicity correction. Both procedures are extensions of available procedures for point null hypotheses achieving exact control of the familywise error rate. Whereas the shift of the null hypothesis alone would give straightforward solutions, the problems motivating the empirical considerations discussed here arise from the fact that the shift is considered in both directions, and the whole parameter space between these two limits has to be accepted as the null hypothesis. Conclusion: The first procedure to be discussed uses a permutation algorithm and is appropriate for designs with a moderately large number of observations. However, many experiments have limited sample sizes. In that case the second procedure might be more appropriate, where multiplicity is corrected according to a concept of data-driven ordering of hypotheses.

Springer - Publisher Connector

Institutionelles Repositorium der Leibniz Universität Hannover

Server für wissenschaftliche Schriften der Hochschule Hannover

Speeding up the Consensus Clustering methodology for microarray data analysis

Author: A Ben-Hur
A Bertoni
A Bertoni
A Borodin
A Jain
AK Jain
B Everitt
B Mirkin
E Levine
Filippo Utro
G Frahling
G Milligan
J Handl
J Kraus
JA Hartigan
JA Rice
JP Brunet
K Devarajan
K Yeung
L Kaufman
P Bertrand
P D'haeseleer
P Hansen
R Giancarlo
R Shamir
R Tibshirani
Raffaele Giancarlo
S Dudoit
S Dudoit
S Klie
S Monti
S Salvador
S Seal
T Hastie
TP Speed
V Di Gesú
V Roth
W Krzanowski
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results: Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus: the closely related algorithm FC (Fast Consensus), which would have the same precision as Consensus with substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of Consensus. Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations. Conclusions: In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures.
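The consensus matrix mentioned in the abstract above can be sketched generically: repeatedly subsample the items, cluster each subsample, and record how often each pair of items lands in the same cluster. The code below is a minimal illustration of that idea only, not the FC algorithm or the Consensus implementation from the paper. The one-dimensional two-cluster routine `two_means_1d` is a stand-in clusterer, and the function names, toy data, and parameter values are all assumptions.

```python
import random

def two_means_1d(values, iters=20):
    """Stand-in clusterer (assumption): 1-D 2-means, initialised at min and max."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return labels

def consensus_matrix(data, n_resamples=50, fraction=0.8, seed=0):
    """Entry (i, j): fraction of resamples containing both i and j
    in which the two items were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(2, int(fraction * n))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)
        labels = two_means_1d([data[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: two well-separated groups of three items each.
data = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
M = consensus_matrix(data)
for row in M:
    print([round(v, 2) for v in row])
```

Entries near 1 indicate pairs that are stably clustered together, so one minus each entry can serve as a dissimilarity for a downstream clustering algorithm. Stable structure at a given number of clusters shows up as a matrix whose entries are close to 0 or 1.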

Springer - Publisher Connector

Archivio istituzionale della ricerca - Università di Palermo

Classes of Multiple Decision Functions Strongly Controlling FWER and FDR

Author: B Efron
B Efron
CR Genovese
E Roquain
EA Peña
Edsel A. Peña
G Blanchard
G Blanchard
G Kang
H Finner
J Scott
J Storey
JD Habiger
JD Habiger
JJ Goeman
JL Doob
Joshua D. Habiger
K Roeder
M Bogdan
M Guindani
P Müller
PH Westfall
PH Westfall
S Dudoit
S Holm
SK Sarkar
SK Sarkar
SK Sarkar
W Hoeffding
W Sun
W Wu
Wensong Wu
Y Benjamini
Y Benjamini
Z Šidák
Publication venue: Springer Science and Business Media LLC
Publication date: 15/07/2010

This paper provides two general classes of multiple decision functions, where each member of the first class strongly controls the family-wise error rate (FWER), while each member of the second class strongly controls the false discovery rate (FDR). These classes offer the possibility that an optimal multiple decision function with respect to a pre-specified criterion, such as the missed discovery rate (MDR), could be found within them. Such multiple decision functions can be utilized in multiple testing, specifically, but not limited to, the analysis of high-dimensional microarray data sets. Comment: 19 pages
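To make the distinction between FWER and FDR control concrete, here is a minimal sketch of two standard textbook procedures: Holm's step-down method (FWER) and the Benjamini-Hochberg step-up method (FDR, under independence or positive dependence of the p-values). These are not the classes of decision functions constructed in the paper; the function names and the example p-values are assumptions chosen for illustration.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure: controls the FWER at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: controls the FDR at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value lies under the line alpha * rank / m.
        if pvalues[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

# Hypothetical p-values, e.g. one per gene.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("Holm:", holm_reject(pvals))
print("BH:  ", bh_reject(pvals))
```

On these example p-values Holm rejects only the smallest one, whereas Benjamini-Hochberg rejects the two smallest, which reflects the usual trade-off: FDR control is less strict than FWER control and typically yields more discoveries.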

arXiv.org e-Print Archive