7,079 research outputs found
Recommended from our members
The robust selection of predictive genes via a simple classifier
Identifying genes that direct the mechanism of a disease from expression data is extremely useful in understanding how that mechanism works.
This in turn may lead to better diagnoses and potentially can lead to a cure for that disease. This task becomes extremely challenging when the
data are characterised by only a small number of samples and a high number of dimensions, as it is often the case with gene expression data.
Motivated by this challenge, we present a general framework that focuses on simplicity and data perturbation. These are the keys for the robust
identification of the most predictive features in such data. Within this framework, we propose a simple selective naĀØıve Bayes classifier discovered using a global search technique, and combine it with data perturbation to
increase its robustness to small sample sizes.
An extensive validation of the method was carried out using two applied datasets from the field of microarrays and a simulated dataset, all
confounded by small sample sizes and high dimensionality. The method has been shown capable of identifying genes previously confirmed or associated with prostate cancer and viral infections
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Of mice and men: Sparse statistical modeling in cardiovascular genomics
In high-throughput genomics, large-scale designed experiments are becoming
common, and analysis approaches based on highly multivariate regression and
anova concepts are key tools. Shrinkage models of one form or another can
provide comprehensive approaches to the problems of simultaneous inference that
involve implicit multiple comparisons over the many, many parameters
representing effects of design factors and covariates. We use such approaches
here in a study of cardiovascular genomics. The primary experimental context
concerns a carefully designed, and rich, gene expression study focused on
gene-environment interactions, with the goals of identifying genes implicated
in connection with disease states and known risk factors, and in generating
expression signatures as proxies for such risk factors. A coupled exploratory
analysis investigates cross-species extrapolation of gene expression
signatures--how these mouse-model signatures translate to humans. The latter
involves exploration of sparse latent factor analysis of human observational
data and of how it relates to projected risk signatures derived in the animal
models. The study also highlights a range of applied statistical and genomic
data analysis issues, including model specification, computational questions
and model-based correction of experimental artifacts in DNA microarray data.Comment: Published at http://dx.doi.org/10.1214/07-AOAS110 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays
Volcano plot displays unstandardized signal (e.g. log-fold-change) against
noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from
the t test). We review the basic and an interactive use of the volcano plot,
and its crucial role in understanding the regularized t-statistic. The joint
filtering gene selection criterion based on regularized statistics has a curved
discriminant line in the volcano plot, as compared to the two perpendicular
lines for the "double filtering" criterion. This review attempts to provide an
unifying framework for discussions on alternative measures of differential
expression, improved methods for estimating variance, and visual display of a
microarray analysis result. We also discuss the possibility to apply volcano
plots to other fields beyond microarray.Comment: 8 figure
MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology is often facing
the problem of missing data. Algorithms commonly used in the analysis of such
large-scale data often depend on a complete set. Missing value imputation
offers a solution to this problem. However, the majority of available
imputation methods are restricted to one type of variable only: continuous or
categorical. For mixed-type data the different types are usually handled
separately. Therefore, these methods ignore possible relations between variable
types. We propose a nonparametric method which can cope with different types of
variables simultaneously. We compare several state of the art methods for the
imputation of missing values. We propose and evaluate an iterative imputation
method (missForest) based on a random forest. By averaging over many unpruned
classification or regression trees random forest intrinsically constitutes a
multiple imputation scheme. Using the built-in out-of-bag error estimates of
random forest we are able to estimate the imputation error without the need of
a test set. Evaluation is performed on multiple data sets coming from a diverse
selection of biological fields with artificially introduced missing values
ranging from 10% to 30%. We show that missForest can successfully handle
missing values, particularly in data sets including different types of
variables. In our comparative study missForest outperforms other methods of
imputation especially in data settings where complex interactions and nonlinear
relations are suspected. The out-of-bag imputation error estimates of
missForest prove to be adequate in all settings. Additionally, missForest
exhibits attractive computational efficiency and can cope with high-dimensional
data.Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
Modeling dependent gene expression
In this paper we propose a Bayesian approach for inference about dependence
of high throughput gene expression. Our goals are to use prior knowledge about
pathways to anchor inference about dependence among genes; to account for this
dependence while making inferences about differences in mean expression across
phenotypes; and to explore differences in the dependence itself across
phenotypes. Useful features of the proposed approach are a model-based
parsimonious representation of expression as an ordinal outcome, a novel and
flexible representation of prior information on the nature of dependencies, and
the use of a coherent probability model over both the structure and strength of
the dependencies of interest. We evaluate our approach through simulations and
in the analysis of data on expression of genes in the Complement and
Coagulation Cascade pathway in ovarian cancer.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS525 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
- ā¦