Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression
One important issue commonly encountered in the analysis of microarray data
is to decide which and how many genes should be selected for further studies.
For discriminant microarray data analyses based on statistical models, such as
the logistic regression models, gene selection can be accomplished by a
comparison of the maximum likelihood of the model given the real data
and the expected maximum likelihood of the model given an
ensemble of surrogate data with randomly permuted labels.
Typically, the computational burden of obtaining the latter is immense,
often exceeding the limits of available computing resources by orders of
magnitude. Here, we propose an approach that circumvents such heavy
computations by mapping the simulation problem to an extreme-value problem. We
present the derivation of an asymptotic distribution of the extreme-value as
well as its mean, median, and variance. Using this distribution, we propose two
gene selection criteria, and we apply them to two microarray datasets and three
classification tasks for illustration. Comment: to be published in Journal of Computational Biology (2004)
GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data
Background: Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. Results: We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. Conclusions: GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. 
The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines.
Department of Agriculture, Food and the Marine; European Commission - Seventh Framework Programme (FP7); Science Foundation Ireland; University College Dublin
Stable Feature Selection for Biomarker Discovery
Feature selection techniques have been used as the workhorse in biomarker
discovery applications for a long time. Surprisingly, the stability of feature
selection with respect to sampling variations has long been under-considered.
Only recently has this issue begun to receive more attention.
In this article, we review existing stable feature selection methods for
biomarker discovery using a generic hierarchical framework. We have two
objectives: (1) providing an overview of this new yet fast growing topic for
convenient reference; (2) categorizing existing methods under an expandable
framework for future research and development.
Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays
A volcano plot displays unstandardized signal (e.g. log-fold-change) against
noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from
the t test). We review the basic and interactive uses of the volcano plot,
and its crucial role in understanding the regularized t-statistic. The joint
filtering gene selection criterion based on regularized statistics has a curved
discriminant line in the volcano plot, as compared to the two perpendicular
lines for the "double filtering" criterion. This review attempts to provide a
unifying framework for discussions on alternative measures of differential
expression, improved methods for estimating variance, and visual display of a
microarray analysis result. We also discuss the possibility of applying volcano
plots to other fields beyond microarrays. Comment: 8 figures
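The "double filtering" criterion described above can be illustrated with a minimal sketch. It assumes p-values have already been computed elsewhere; the function names, cutoffs, and data are hypothetical and are not taken from the reviewed paper. A gene is kept only if it clears both a fold-change cutoff and a p-value cutoff, which corresponds to the two perpendicular lines on the volcano plot.

```python
import math


def volcano_points(log_fold_changes, p_values):
    """Return (x, y) pairs: x = log fold change, y = -log10(p-value)."""
    return [(lfc, -math.log10(p)) for lfc, p in zip(log_fold_changes, p_values)]


def double_filter(log_fold_changes, p_values, min_abs_lfc=1.0, max_p=0.05):
    """Indices of genes passing both cutoffs (illustrative thresholds)."""
    y_cut = -math.log10(max_p)  # horizontal line on the volcano plot
    selected = []
    for i, (x, y) in enumerate(volcano_points(log_fold_changes, p_values)):
        # vertical lines at +/- min_abs_lfc, horizontal line at y_cut
        if abs(x) >= min_abs_lfc and y >= y_cut:
            selected.append(i)
    return selected


# Made-up values for five genes
log_fc = [2.1, 0.3, -1.8, 1.2, -0.1]
p_vals = [0.001, 0.0004, 0.03, 0.20, 0.70]
print(double_filter(log_fc, p_vals))  # prints [0, 2]
```

Gene 1 has the smallest p-value but a small fold change, so it is excluded; a joint criterion based on a regularized statistic would instead trace a curved boundary through the same plane.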
The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies
Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend the use of FC ranking plus a non-stringent P cutoff as a baseline practice in order to generate more reproducible DEG lists. The FC criterion enhances reproducibility while the P criterion balances sensitivity and specificity.
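The recommended baseline (FC ranking plus a non-stringent P cutoff) could be sketched roughly as follows. The abstract does not state a specific threshold, so `max_p=0.05`, `top_n`, the function name, and the data are all assumptions made for illustration.

```python
def rank_by_fold_change(genes, log_fold_changes, p_values, max_p=0.05, top_n=3):
    """Drop genes failing a loose P cutoff, then rank the rest by |log FC|."""
    kept = [
        (gene, lfc)
        for gene, lfc, p in zip(genes, log_fold_changes, p_values)
        if p <= max_p  # assumed non-stringent cutoff
    ]
    # Largest absolute fold change first
    kept.sort(key=lambda item: abs(item[1]), reverse=True)
    return [gene for gene, _ in kept[:top_n]]


# Made-up values for five genes
genes = ["g1", "g2", "g3", "g4", "g5"]
log_fc = [0.4, 2.5, -3.0, 1.1, -1.9]
p_vals = [0.0001, 0.04, 0.20, 0.01, 0.03]
print(rank_by_fold_change(genes, log_fc, p_vals))  # prints ['g2', 'g5', 'g4']
```

Here g3 has the largest fold change but fails the P cutoff, while g1 has the smallest p-value but ranks last by fold change, which shows how the two criteria play different roles.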
Algebraic Comparison of Partial Lists in Bioinformatics
The outcome of a functional genomics pipeline is usually a partial list of
genomic features, ranked by their relevance in modelling biological phenotype
in terms of a classification or regression model. Due to resampling protocols
or just within a meta-analysis comparison, instead of one list it is often the
case that sets of alternative feature lists (possibly of different lengths) are
obtained. Here we introduce a method, based on the algebraic theory of
symmetric groups, for studying the variability between lists ("list stability")
in the case of lists of unequal length. We provide algorithms evaluating
stability for lists embedded in the full feature set or just limited to the
features occurring in the partial lists. The method is demonstrated first on
synthetic data in a gene filtering task and then for finding gene profiles on a
recent prostate cancer dataset.
A framework for list representation, enabling list stabilization through incorporation of gene exchangeabilities
Analysis of multivariate data sets from e.g. microarray studies frequently
results in lists of genes which are associated with some response of interest.
The biological interpretation is often complicated by the statistical
instability of the obtained gene lists with respect to sampling variations,
which may partly be due to the functional redundancy among genes, implying that
multiple genes can play exchangeable roles in the cell. In this paper we use
the concept of exchangeability of random variables to model this functional
redundancy and thereby account for the instability attributable to sampling
variations. We present a flexible framework to incorporate the exchangeability
into the representation of lists. The proposed framework supports
straightforward robust comparison between any two lists. It can also be used to
generate new, more stable gene rankings incorporating more information from the
experimental data. Using a microarray data set from lung cancer patients we
show that the proposed method provides more robust gene rankings than existing
methods with respect to sampling variations, without compromising the
biological significance.
An evaluation of DNA-damage response and cell-cycle pathways for breast cancer classification
Accurate subtyping or classification of breast cancer is important for
ensuring proper treatment of patients and also for understanding the molecular
mechanisms driving this disease. While there have been several gene signatures
proposed in the literature to classify breast tumours, these signatures show
very low overlaps, different classification performance, and not much relevance
to the underlying biology of these tumours. Here we evaluate DNA-damage
response (DDR) and cell cycle pathways, which are critical pathways implicated
in a considerable proportion of breast tumours, for their usefulness and
ability in breast tumour subtyping. We think that subtyping breast tumours
based on these two pathways could lead to vital insights into molecular
mechanisms driving these tumours. Here, we performed a systematic evaluation of
DDR and cell-cycle pathways for subtyping of breast tumours into the five known
intrinsic subtypes. The Homologous Recombination (HR) pathway showed the best
performance in subtyping breast tumours, indicating that HR genes are strongly
involved in all breast tumours. Comparisons of pathway based signatures and two
standard gene signatures supported the use of known pathways for breast tumour
subtyping. Further, the evaluation of these standard gene signatures showed
that breast tumour subtyping, prognosis and survival estimation are all closely
related. Finally, we constructed an all-inclusive super-signature by combining
(taking the union of) all genes and performing stringent feature selection, and
found it to be reasonably accurate and robust in both classification and
prognostic value. Adopting DDR and cell-cycle pathways achieved robust and
accurate breast tumour subtyping, and a super-signature that combines a
feature-selected mix of genes from these molecular pathways with clinical
aspects is valuable in clinical
practice. Comment: 28 pages, 7 figures, 6 tables
EFSIS: Ensemble Feature Selection Integrating Stability
Ensemble learning that can be used to combine the predictions from multiple
learners has been widely applied in pattern recognition, and has been reported
to be more robust and accurate than the individual learners. This ensemble
logic has recently also been increasingly applied in feature selection. There are
basically two strategies for ensemble feature selection, namely data
perturbation and function perturbation. Data perturbation performs feature
selection on data subsets sampled from the original dataset and then selects
the features consistently ranked highly across those data subsets. This has
been found to improve both the stability of the selector and the prediction
accuracy for a classifier. Function perturbation frees the user from having to
decide on the most appropriate selector for any given situation and works by
aggregating multiple selectors. This has been found to maintain or improve
classification performance. Here we propose a framework, EFSIS, combining these
two strategies. Empirical results indicate that EFSIS gives both high
prediction accuracy and stability. Comment: 20 pages, 3 figures
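The data perturbation strategy described above can be illustrated with a small sketch. This is not the EFSIS implementation: the base selector (absolute difference in class means), the mean-rank aggregation, and every name and parameter below are assumptions chosen to keep the example short and limited to the standard library.

```python
import random
from statistics import mean


def rank_features(rows, labels):
    """Rank feature indices by absolute difference in class means (best first)."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def data_perturbation_select(rows, labels, k=2, n_subsets=20, fraction=0.8, seed=0):
    """Select the k features ranked highest on average across random subsets.

    Assumes both classes (0 and 1) are present in `labels`.
    """
    rng = random.Random(seed)
    n = len(rows)
    rank_sums = [0] * len(rows[0])
    used = 0
    while used < n_subsets:
        idx = rng.sample(range(n), max(2, int(fraction * n)))
        sub_labels = [labels[i] for i in idx]
        if len(set(sub_labels)) < 2:
            continue  # the selector needs both classes in the subset
        order = rank_features([rows[i] for i in idx], sub_labels)
        for position, j in enumerate(order):
            rank_sums[j] += position  # lower total = consistently ranked higher
        used += 1
    return sorted(range(len(rank_sums)), key=lambda j: rank_sums[j])[:k]


# Made-up data: feature 0 separates strongly, feature 2 moderately, feature 1 not at all
rows = [
    [5.0, 0.1, 2.0], [5.2, -0.2, 2.1], [4.9, 0.0, 1.8], [5.1, 0.2, 2.2], [5.0, -0.1, 1.9],
    [0.0, 0.0, 0.0], [0.1, 0.1, 0.3], [-0.2, -0.1, -0.1], [0.2, 0.2, 0.1], [0.0, -0.2, 0.2],
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(data_perturbation_select(rows, labels, k=2))  # prints [0, 2]
```

Function perturbation would follow the same aggregation pattern but vary the selector instead of the data subset; combining the two would mean looping over both.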