37,815 research outputs found
Spectral analysis of gene expression profiles using gene networks
Microarrays have become extremely useful for analysing genetic phenomena, but
establishing a relation between microarray analysis results (typically a list
of genes) and their biological significance is often difficult. Currently, the
standard approach is to map a posteriori the results onto gene networks to
elucidate the functions perturbed at the level of pathways. However,
integrating a priori knowledge of the gene networks could help in the
statistical analysis of gene expression data and in their biological
interpretation. Here we propose a method to integrate a priori the knowledge of
a gene network in the analysis of gene expression data. The approach is based
on the spectral decomposition of gene expression profiles with respect to the
eigenfunctions of the graph, resulting in an attenuation of the high-frequency
components of the expression profiles with respect to the topology of the
graph. We show how to derive unsupervised and supervised classification
algorithms of expression profiles, resulting in classifiers with biological
relevance. We applied the method to the analysis of a set of expression
profiles from irradiated and non-irradiated yeast strains. It performed at
least as well as the usual classification but provides much more biologically
relevant results and allows a direct biological interpretation
GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data
Background: Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. Results: We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. Conclusions: GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines.Department of Agriculture, Food and the MarineEuropean Commission - Seventh Framework Programme (FP7)Science Foundation IrelandUniversity College Dubli
A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer
Recently, several classifiers that combine primary tumor data, like gene
expression data, and secondary data sources, such as protein-protein
interaction networks, have been proposed for predicting outcome in breast
cancer. In these approaches, new composite features are typically constructed
by aggregating the expression levels of several genes. The secondary data
sources are employed to guide this aggregation. Although many studies claim
that these approaches improve classification performance over single gene
classifiers, the gain in performance is difficult to assess. This stems mainly
from the fact that different breast cancer data sets and validation procedures
are employed to assess the performance. Here we address these issues by
employing a large cohort of six breast cancer data sets as benchmark set and by
performing an unbiased evaluation of the classification accuracies of the
different approaches. Contrary to previous claims, we find that composite
feature classifiers do not outperform simple single gene classifiers. We
investigate the effect of (1) the number of selected features; (2) the specific
gene set from which features are selected; (3) the size of the training set and
(4) the heterogeneity of the data set on the performance of composite feature
and single gene classifiers. Strikingly, we find that randomization of
secondary data sources, which destroys all biological information in these
sources, does not result in a deterioration in performance of composite feature
classifiers. Finally, we show that when a proper correction for gene set size
is performed, the stability of single gene sets is similar to the stability of
composite feature sets. Based on these results there is currently no reason to
prefer prognostic classifiers based on composite features over single gene
classifiers for predicting outcome in breast cancer
Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems
Background: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits.Results: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework.Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets
Predicting Genetic Regulatory Response Using Classification
We present a novel classification-based method for learning to predict gene
regulatory response. Our approach is motivated by the hypothesis that in simple
organisms such as Saccharomyces cerevisiae, we can learn a decision rule for
predicting whether a gene is up- or down-regulated in a particular experiment
based on (1) the presence of binding site subsequences (``motifs'') in the
gene's regulatory region and (2) the expression levels of regulators such as
transcription factors in the experiment (``parents''). Thus our learning task
integrates two qualitatively different data sources: genome-wide cDNA
microarray data across multiple perturbation and mutant experiments along with
motif profile data from regulatory sequences. We convert the regression task of
predicting real-valued gene expression measurement to a classification task of
predicting +1 and -1 labels, corresponding to up- and down-regulation beyond
the levels of biological and measurement noise in microarray measurements. The
learning algorithm employed is boosting with a margin-based generalization of
decision trees, alternating decision trees. This large-margin classifier is
sufficiently flexible to allow complex logical functions, yet sufficiently
simple to give insight into the combinatorial mechanisms of gene regulation. We
observe encouraging prediction accuracy on experiments based on the Gasch S.
cerevisiae dataset, and we show that we can accurately predict up- and
down-regulation on held-out experiments. Our method thus provides predictive
hypotheses, suggests biological experiments, and provides interpretable insight
into the structure of genetic regulatory networks.Comment: 8 pages, 4 figures, presented at Twelfth International Conference on
Intelligent Systems for Molecular Biology (ISMB 2004), supplemental website:
http://www.cs.columbia.edu/compbio/geneclas
Partition Decoupling for Multi-gene Analysis of Gene Expression Profiling Data
We present the extention and application of a new unsupervised statistical
learning technique--the Partition Decoupling Method--to gene expression data.
Because it has the ability to reveal non-linear and non-convex geometries
present in the data, the PDM is an improvement over typical gene expression
analysis algorithms, permitting a multi-gene analysis that can reveal
phenotypic differences even when the individual genes do not exhibit
differential expression. Here, we apply the PDM to publicly-available gene
expression data sets, and demonstrate that we are able to identify cell types
and treatments with higher accuracy than is obtained through other approaches.
By applying it in a pathway-by-pathway fashion, we demonstrate how the PDM may
be used to find sets of mechanistically-related genes that discriminate
phenotypes.Comment: Revise
- âŠ