Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies
Lung cancer is among the most common cancers in the United States, in terms
of incidence and mortality. In 2009, it is estimated that more than 150,000
deaths will result from lung cancer alone. Genetic information is an extremely
valuable data source in characterizing the personal nature of cancer. Over the
past several years, investigators have conducted numerous association studies
where intensive genetic data is collected on relatively few patients compared
to the numbers of gene predictors, with one scientific goal being to identify
genetic features associated with cancer recurrence or survival. In this note,
we propose high-dimensional survival analysis through a new application of
boosting, a powerful tool in machine learning. Our approach is based on an
accelerated lifetime model and minimizing the sum of pairwise differences in
residuals. We apply our method to a recent microarray study of lung
adenocarcinoma and find that our ensemble is composed of 19 genes, while a
proportional hazards (PH) ensemble is composed of nine genes, a proper subset
of the 19-gene panel. In one of our simulation scenarios, we demonstrate that
PH boosting in a misspecified model tends to underfit and ignore
moderately-sized covariate effects, on average. Diagnostic analyses suggest
that the PH assumption is not satisfied in the microarray data and may explain,
in part, the discrepancy in the sets of active coefficients. Our simulation
studies and comparative data analyses demonstrate how statistical learning by
PH models alone is insufficient.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS426 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
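The "sum of pairwise differences in residuals" criterion described above can be read as a Gehan-type rank loss on the residuals of an accelerated lifetime model. The sketch below is an illustrative reconstruction under that assumption, not the authors' implementation; the function name and array conventions are mine.

```python
import numpy as np

def gehan_loss(log_time, delta, xb):
    """Gehan-type sum of pairwise residual differences for an
    accelerated lifetime model.  Residuals are e_i = log(t_i) - x_i'b;
    each pair (i, j) with an observed event at i contributes
    max(0, e_j - e_i), so the loss penalizes rankings in which a
    later (or censored) observation has a smaller residual."""
    e = np.asarray(log_time, float) - np.asarray(xb, float)
    delta = np.asarray(delta, float)        # 1 = event observed, 0 = censored
    diff = e[None, :] - e[:, None]          # diff[i, j] = e_j - e_i
    return float(np.sum(delta[:, None] * np.maximum(diff, 0.0)))
```

In a boosting scheme of the kind the abstract describes, base learners would be fitted to the negative gradient of such a loss, so that the ensemble accumulates one selected gene at a time.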
Inferential stability in systems biology
The modern biological sciences are fraught with statistical difficulties. Biomolecular
stochasticity, experimental noise, and the “large p, small n” problem all contribute to
the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful
conclusions from observations. In this thesis, we explore methods for assessing
the effects of data variability upon downstream inference, in an attempt to quantify and
promote the stability of the inferences we make.
We start with a review of existing methods for addressing this problem, focusing upon the
bootstrap and similar methods. The key requirement for all such approaches is a statistical
model that approximates the data generating process.
We move on to consider biomarker discovery problems. We present a novel algorithm for
proposing putative biomarkers on the strength of both their predictive ability and the stability
with which they are selected. In a simulation study, we find that our approach
performs favourably in comparison with strategies that select on the basis of
predictive performance alone.
We then consider the real problem of identifying protein peak biomarkers for HAM/TSP,
an inflammatory condition of the central nervous system caused by HTLV-1 infection.
We apply our algorithm to a set of SELDI mass spectral data, and identify a number of
putative biomarkers. Additional experimental work, together with known results from the
literature, provides corroborating evidence for the validity of these putative biomarkers.
Having focused on static observations, we then make the natural progression to time
course data sets. We propose a (Bayesian) bootstrap approach for such data, and then
apply our method in the context of gene network inference and the estimation of parameters
in ordinary differential equation models. We find that the inferred gene networks
are relatively unstable, and demonstrate the importance of finding distributions of ODE
parameter estimates, rather than single point estimates.
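The (Bayesian) bootstrap mentioned above can be illustrated with Rubin's weighting scheme: instead of resampling observations with replacement, each replicate draws observation weights from a flat Dirichlet distribution and recomputes the statistic of interest. A minimal sketch, using a weighted mean as the statistic; the function name and defaults are mine, not the thesis's.

```python
import numpy as np

def bayesian_bootstrap_means(data, n_draws=1000, rng=None):
    """Bayesian bootstrap draws of a mean: each row of Dirichlet
    weights sums to one, and the weighted mean under each draw is one
    posterior sample of the statistic.  The spread of the returned
    draws quantifies the stability of the inference."""
    if rng is None:
        rng = np.random.default_rng(0)
    data = np.asarray(data, dtype=float)
    w = rng.dirichlet(np.ones(len(data)), size=n_draws)  # n_draws x n weights
    return w @ data
```

The same recipe applies to richer statistics, such as the ODE parameter estimates discussed above: re-estimate under each weight draw and report the resulting distribution rather than a single point estimate.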
Gene set analysis for longitudinal gene expression data
Background: Gene set analysis (GSA) has become a successful tool for interpreting gene expression profiles in terms of biological functions, molecular pathways, or genomic locations. GSA performs statistical tests for independent microarray samples at the level of gene sets rather than individual genes. An increasing number of microarray studies are now conducted to explore the dynamic changes of gene expression in a variety of species and biological scenarios. In these longitudinal studies, gene expression is repeatedly measured over time, so a GSA needs to take into account the within-gene correlations in addition to possible between-gene correlations.
Results: We provide a robust nonparametric approach to compare the expression of longitudinally measured sets of genes under multiple treatments or experimental conditions. The limiting distributions of our statistics are derived as the number of genes goes to infinity, while the number of replications can be small. When the number of genes in a gene set is small, we recommend permutation tests based on our nonparametric test statistics to achieve reliable type I error and better power while incorporating unknown correlations between and within genes. Simulation results demonstrate that the proposed method has greater power than other methods for various data distributions and heteroscedastic correlation structures. The method was applied to an IL-2 stimulation study, and significantly altered gene sets were identified.
Conclusions: The simulation study and the real data application showed that the proposed gene set analysis provides a promising tool for longitudinal microarray analysis. R scripts for simulating longitudinal data and calculating the nonparametric statistics are posted on the North Dakota INBRE website (http://ndinbre.org/programs/bioinformatics.php). Raw microarray data are available in Gene Expression Omnibus (National Center for Biotechnology Information) under accession number GSE6085.
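The permutation tests recommended above for small gene sets follow a generic recipe: permute the sample labels, recompute the test statistic, and take the proportion of permuted statistics at least as extreme as the observed one. The sketch below uses a simple difference-in-means summary for illustration, not the paper's nonparametric statistics; the function name is mine.

```python
import numpy as np

def permutation_pvalue(group_a, group_b, n_perm=2000, rng=None):
    """Two-sample permutation test: pool both groups, repeatedly
    shuffle, re-split at the original group size, and compare the
    absolute difference in means against the observed value.  The
    add-one correction keeps the p-value strictly positive."""
    if rng is None:
        rng = np.random.default_rng(0)
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        count += stat >= observed
    return (count + 1) / (n_perm + 1)
```

Because the null distribution is built from the data themselves, the test stays valid under the unknown between- and within-gene correlation structures the abstract emphasizes.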
Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations
The development of molecular signatures for the prediction of time-to-event
outcomes is a methodologically challenging task in bioinformatics and
biostatistics. Although there are numerous approaches for the derivation of
marker combinations and their evaluation, the underlying methodology often
suffers from the problem that different optimization criteria are mixed during
the feature selection, estimation and evaluation steps. This might result in
marker combinations that are only suboptimal regarding the evaluation criterion
of interest. To address this issue, we propose a unified framework to derive
and evaluate biomarker combinations. Our approach is based on the concordance
index for time-to-event data, which is a non-parametric measure to quantify the
discriminatory power of a prediction rule. Specifically, we propose a
component-wise boosting algorithm that results in linear biomarker combinations
that are optimal with respect to a smoothed version of the concordance index.
We investigate the performance of our algorithm in a large-scale simulation
study and in two molecular data sets for the prediction of survival in breast
cancer patients. Our numerical results show that the new approach is not only
methodologically sound but can also lead to a higher discriminatory power than
traditional approaches for the derivation of gene signatures.
Comment: revised manuscript - added simulation study, additional results
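The smoothed concordance index that this framework optimizes can be illustrated as follows: the usual C-index counts comparable pairs (an observed event at i with time_i < time_j) and scores a pair as concordant when the risk score ranks it correctly; replacing that indicator with a sigmoid makes the criterion differentiable, and hence amenable to gradient boosting. This is a sketch of the idea, not the authors' algorithm; the smoothing parameter sigma and the function name are my choices.

```python
import numpy as np

def smoothed_cindex(time, status, score, sigma=0.1):
    """Sigmoid-smoothed concordance index.  For each comparable pair
    (event observed at i, time_i < time_j), a concordant ranking means
    the earlier failure has the higher risk score; the sigmoid of the
    scaled score difference replaces the hard 0/1 indicator."""
    t, d, s = map(np.asarray, (time, status, score))
    num = den = 0.0
    for i in range(len(t)):
        if not d[i]:                      # censored: i cannot anchor a pair
            continue
        for j in range(len(t)):
            if t[i] < t[j]:               # comparable pair
                num += 1.0 / (1.0 + np.exp(-(s[i] - s[j]) / sigma))
                den += 1.0
    return num / den
```

As sigma shrinks, the smoothed value approaches the ordinary C-index; a component-wise boosting step would then add, at each iteration, the single biomarker that most improves this criterion.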
Evaluation of statistical correlation and validation methods for construction of gene co-expression networks
High-throughput technologies such as microarrays have led to the rapid accumulation of large scale genomic data providing opportunities to systematically infer gene function and co-expression networks. Typical steps of co-expression network analysis using microarray data consist of estimation of pair-wise gene co-expression using some similarity measure, construction of co-expression networks, identification of clusters of co-expressed genes and post-cluster analyses such as cluster validation. This dissertation is primarily concerned with development and evaluation of approaches for the first and the last steps – estimation of gene co-expression matrices and validation of network clusters. Since clustering methods are not a focus, only a paraclique clustering algorithm will be used in this evaluation.
First, a novel Bayesian approach is presented for combining the Pearson correlation with prior biological information from Gene Ontology, yielding a biologically relevant estimate of gene co-expression. The addition of biological information by the Bayesian approach reduced noise in the paraclique gene clusters, as indicated by high silhouette widths and by the increased homogeneity of clusters in terms of molecular function. Standard similarity measures were also evaluated: the Pearson, Spearman, Kendall's tau, shrinkage, and partial correlation coefficients; mutual information; and the Euclidean and Manhattan distances. Based on quality metrics such as cluster homogeneity and stability with respect to ontological categories, clusters resulting from partial correlation and mutual information were more biologically relevant than those from any other measure.
Second, statistical quality of clusters was evaluated using approaches based on permutation tests and Mantel correlation to identify significant and informative clusters that capture most of the covariance in the dataset. Third, the utility of statistical contrasts was studied for classification of temporal patterns of gene expression. Specifically, polynomial and Helmert contrast analyses were shown to provide a means of labeling the co-expressed gene sets because they showed similar temporal profiles.
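The first step of the pipeline described above, estimating pairwise gene co-expression and turning it into a network, can be sketched with the simplest of the evaluated measures, the Pearson correlation. The threshold value and function name below are illustrative choices of mine, not the dissertation's.

```python
import numpy as np

def coexpression_adjacency(expr, threshold=0.8):
    """Build an unweighted co-expression network: compute the
    genes-by-genes Pearson correlation matrix from a genes-by-samples
    expression matrix, then keep an edge wherever the absolute
    correlation meets the threshold.  Self-edges are removed."""
    corr = np.corrcoef(expr)                       # genes x genes
    adj = (np.abs(corr) >= threshold).astype(int)
    np.fill_diagonal(adj, 0)                       # no self-edges
    return adj
```

Clustering algorithms such as the paraclique method mentioned above would then operate on this adjacency matrix, and the post-cluster validation steps assess the resulting gene clusters.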
Distance-based methods for detecting associations in structured data with applications in bioinformatics
In bioinformatics applications samples of biological variables of interest can take a variety
of structures. For instance, in this thesis we consider vector-valued observations
of multiple gene expression and genetic markers, curve-valued gene expression time
courses, and graph-valued functional connectivity networks within the brain. This
thesis considers three problems routinely encountered when dealing with such variables:
detecting differences between populations, detecting predictive relationships
between variables, and detecting association between variables.
Distance-based approaches to these problems are considered, offering great flexibility
over alternative approaches, such as traditional multivariate approaches which
may be inappropriate. The notion of distance has been widely adopted in recent years
to quantify the dissimilarity between samples, and suitable distance measures can be
applied depending on the nature of the data and on the specific objectives of the study.
For instance, for gene expression time courses modeled as time-dependent curves, distance
measures can be specified to capture biologically meaningful aspects of these
curves which may differ. On obtaining a distance matrix containing all pairwise distances
between the samples of a given variable, many distance-based testing procedures
can then be applied. The main inhibitor of their effective use in bioinformatics is that
p-values are typically estimated by using Monte Carlo permutations. Thousands or
even millions of tests need to be performed simultaneously, and time/computational
constraints lead to a low number of permutations being enumerated for each test.
The contributions of this thesis include the proposal of two new distance-based
statistics, the DBF statistic for the problem of detecting differences between populations,
and the GRV coefficient for the problem of detecting association between
variables. In each case approximate null distributions are derived, allowing estimation
of p-values with reduced computational cost, and through simulation these are shown to work well for a range of distances and data types. The tests are also demonstrated
to be competitive with existing approaches. For the problem of detecting predictive
relationships between variables, the approximate null distribution is derived for the
routinely used distance-based pseudo F test, and through simulation this is shown to
work well for a range of distances and data types. All tests are applied to real datasets,
including a longitudinal human immune cell M. tuberculosis dataset, an Alzheimer’s
disease dataset, and an ovarian cancer dataset.
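The "routinely used distance-based pseudo F test" mentioned above operates directly on a matrix of pairwise distances, comparing between-group to within-group sums of squared distances. A minimal sketch in the PERMANOVA form is given below; this illustrates the statistic itself, with p-values still requiring either permutations or an approximate null distribution of the kind the thesis derives.

```python
import numpy as np

def pseudo_f(D, labels):
    """Distance-based pseudo F statistic computed from a symmetric
    pairwise distance matrix D and group labels.  Total and
    within-group sums of squares come from squared pairwise distances
    divided by group size; their difference is the between-group part."""
    D = np.asarray(D, float)
    labels = np.asarray(labels)
    n, groups = len(labels), np.unique(labels)
    a = len(groups)
    iu = np.triu_indices(n, k=1)
    ss_total = np.sum(D[iu] ** 2) / n
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        sub = D[np.ix_(idx, idx)]
        iu_g = np.triu_indices(len(idx), k=1)
        ss_within += np.sum(sub[iu_g] ** 2) / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))
```

Because D can come from any suitable distance measure, the same statistic applies unchanged to vectors, expression time courses, or brain connectivity networks, which is the flexibility the abstract highlights.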