Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies
Lung cancer is among the most common cancers in the United States, in terms
of incidence and mortality. In 2009, it is estimated that more than 150,000
deaths will result from lung cancer alone. Genetic information is an extremely
valuable data source in characterizing the personal nature of cancer. Over the
past several years, investigators have conducted numerous association studies
where intensive genetic data is collected on relatively few patients compared
to the numbers of gene predictors, with one scientific goal being to identify
genetic features associated with cancer recurrence or survival. In this note,
we propose high-dimensional survival analysis through a new application of
boosting, a powerful tool in machine learning. Our approach is based on an
accelerated lifetime model and minimizing the sum of pairwise differences in
residuals. We apply our method to a recent microarray study of lung
adenocarcinoma and find that our ensemble is composed of 19 genes, while a
proportional hazards (PH) ensemble is composed of nine genes, a proper subset
of the 19-gene panel. In one of our simulation scenarios, we demonstrate that
PH boosting in a misspecified model tends to underfit and ignore
moderately-sized covariate effects, on average. Diagnostic analyses suggest
that the PH assumption is not satisfied in the microarray data and may explain,
in part, the discrepancy in the sets of active coefficients. Our simulation
studies and comparative data analyses demonstrate how statistical learning by
PH models alone is insufficient.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS426 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
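The "sum of pairwise differences in residuals" criterion described above can be read as a Gehan-type rank loss on the residuals of an accelerated lifetime model. The sketch below is an illustrative reconstruction under that assumption, not the authors' implementation; the function name and array conventions are mine.

```python
import numpy as np

def gehan_loss(log_time, delta, xb):
    """Gehan-type sum of pairwise residual differences for an
    accelerated lifetime model.  Residuals are e_i = log(t_i) - x_i'b;
    each pair (i, j) with an observed event at i contributes
    max(0, e_j - e_i), so the loss penalizes rankings in which a
    later (or censored) observation has a smaller residual."""
    e = np.asarray(log_time, float) - np.asarray(xb, float)
    delta = np.asarray(delta, float)        # 1 = event observed, 0 = censored
    diff = e[None, :] - e[:, None]          # diff[i, j] = e_j - e_i
    return float(np.sum(delta[:, None] * np.maximum(diff, 0.0)))
```

In a boosting scheme of the kind the abstract describes, base learners would be fitted to the negative gradient of such a loss, so that the ensemble accumulates one selected gene at a time.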
Inferential stability in systems biology
The modern biological sciences are fraught with statistical difficulties. Biomolecular
stochasticity, experimental noise, and the “large p, small n” problem all contribute to
the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful
conclusions from observations. In this thesis, we explore methods for assessing
the effects of data variability upon downstream inference, in an attempt to quantify and
promote the stability of the inferences we make.
We start with a review of existing methods for addressing this problem, focusing upon the
bootstrap and similar methods. The key requirement for all such approaches is a statistical
model that approximates the data generating process.
We move on to consider biomarker discovery problems. We present a novel algorithm for
proposing putative biomarkers on the strength of both their predictive ability and the stability
with which they are selected. In a simulation study, we find that our approach
performs favourably in comparison with strategies that select on the basis of
predictive performance alone.
We then consider the real problem of identifying protein peak biomarkers for HAM/TSP,
an inflammatory condition of the central nervous system caused by HTLV-1 infection.
We apply our algorithm to a set of SELDI mass spectral data, and identify a number of
putative biomarkers. Additional experimental work, together with known results from the
literature, provides corroborating evidence for the validity of these putative biomarkers.
Having focused on static observations, we then make the natural progression to time
course data sets. We propose a (Bayesian) bootstrap approach for such data, and then
apply our method in the context of gene network inference and the estimation of parameters
in ordinary differential equation models. We find that the inferred gene networks
are relatively unstable, and demonstrate the importance of finding distributions of ODE
parameter estimates, rather than single point estimates.
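The (Bayesian) bootstrap mentioned above can be illustrated with Rubin's weighting scheme: instead of resampling observations with replacement, each replicate draws observation weights from a flat Dirichlet distribution and recomputes the statistic of interest. A minimal sketch, using a weighted mean as the statistic; the function name and defaults are mine, not the thesis's.

```python
import numpy as np

def bayesian_bootstrap_means(data, n_draws=1000, rng=None):
    """Bayesian bootstrap draws of a mean: each row of Dirichlet
    weights sums to one, and the weighted mean under each draw is one
    posterior sample of the statistic.  The spread of the returned
    draws quantifies the stability of the inference."""
    if rng is None:
        rng = np.random.default_rng(0)
    data = np.asarray(data, dtype=float)
    w = rng.dirichlet(np.ones(len(data)), size=n_draws)  # n_draws x n weights
    return w @ data
```

The same recipe applies to richer statistics, such as the ODE parameter estimates discussed above: re-estimate under each weight draw and report the resulting distribution rather than a single point estimate.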
Gene set analysis for longitudinal gene expression data
Background: Gene set analysis (GSA) has become a successful tool for interpreting gene expression profiles in terms of biological functions, molecular pathways, or genomic locations. GSA performs statistical tests for independent microarray samples at the level of gene sets rather than individual genes. An increasing number of microarray studies are now conducted to explore the dynamic changes of gene expression in a variety of species and biological scenarios. In these longitudinal studies, gene expression is repeatedly measured over time, so a GSA needs to take into account the within-gene correlations in addition to possible between-gene correlations.
Results: We provide a robust nonparametric approach to compare the expression of longitudinally measured sets of genes under multiple treatments or experimental conditions. The limiting distributions of our statistics are derived as the number of genes goes to infinity, while the number of replications can be small. When the number of genes in a gene set is small, we recommend permutation tests based on our nonparametric test statistics to achieve reliable type I error and better power while incorporating unknown correlations between and within genes. Simulation results demonstrate that the proposed method has greater power than other methods for various data distributions and heteroscedastic correlation structures. The method was applied to an IL-2 stimulation study, and significantly altered gene sets were identified.
Conclusions: The simulation study and the real data application showed that the proposed gene set analysis provides a promising tool for longitudinal microarray analysis. R scripts for simulating longitudinal data and calculating the nonparametric statistics are posted on the North Dakota INBRE website (http://ndinbre.org/programs/bioinformatics.php). Raw microarray data are available in Gene Expression Omnibus (National Center for Biotechnology Information) under accession number GSE6085.
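The permutation tests recommended above for small gene sets follow a generic recipe: permute the sample labels, recompute the test statistic, and take the proportion of permuted statistics at least as extreme as the observed one. The sketch below uses a simple difference-in-means summary for illustration, not the paper's nonparametric statistics; the function name is mine.

```python
import numpy as np

def permutation_pvalue(group_a, group_b, n_perm=2000, rng=None):
    """Two-sample permutation test: pool both groups, repeatedly
    shuffle, re-split at the original group size, and compare the
    absolute difference in means against the observed value.  The
    add-one correction keeps the p-value strictly positive."""
    if rng is None:
        rng = np.random.default_rng(0)
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        count += stat >= observed
    return (count + 1) / (n_perm + 1)
```

Because the null distribution is built from the data themselves, the test stays valid under the unknown between- and within-gene correlation structures the abstract emphasizes.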
Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations
The development of molecular signatures for the prediction of time-to-event
outcomes is a methodologically challenging task in bioinformatics and
biostatistics. Although there are numerous approaches for the derivation of
marker combinations and their evaluation, the underlying methodology often
suffers from the problem that different optimization criteria are mixed during
the feature selection, estimation and evaluation steps. This might result in
marker combinations that are only suboptimal regarding the evaluation criterion
of interest. To address this issue, we propose a unified framework to derive
and evaluate biomarker combinations. Our approach is based on the concordance
index for time-to-event data, which is a non-parametric measure to quantify the
discriminatory power of a prediction rule. Specifically, we propose a
component-wise boosting algorithm that results in linear biomarker combinations
that are optimal with respect to a smoothed version of the concordance index.
We investigate the performance of our algorithm in a large-scale simulation
study and in two molecular data sets for the prediction of survival in breast
cancer patients. Our numerical results show that the new approach is not only
methodologically sound but can also lead to a higher discriminatory power than
traditional approaches for the derivation of gene signatures.
Comment: revised manuscript - added simulation study, additional results
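The smoothed concordance index that this framework optimizes can be illustrated as follows: the usual C-index counts comparable pairs (an observed event at i with time_i < time_j) and scores a pair as concordant when the risk score ranks it correctly; replacing that indicator with a sigmoid makes the criterion differentiable, and hence amenable to gradient boosting. This is a sketch of the idea, not the authors' algorithm; the smoothing parameter sigma and the function name are my choices.

```python
import numpy as np

def smoothed_cindex(time, status, score, sigma=0.1):
    """Sigmoid-smoothed concordance index.  For each comparable pair
    (event observed at i, time_i < time_j), a concordant ranking means
    the earlier failure has the higher risk score; the sigmoid of the
    scaled score difference replaces the hard 0/1 indicator."""
    t, d, s = map(np.asarray, (time, status, score))
    num = den = 0.0
    for i in range(len(t)):
        if not d[i]:                      # censored: i cannot anchor a pair
            continue
        for j in range(len(t)):
            if t[i] < t[j]:               # comparable pair
                num += 1.0 / (1.0 + np.exp(-(s[i] - s[j]) / sigma))
                den += 1.0
    return num / den
```

As sigma shrinks, the smoothed value approaches the ordinary C-index; a component-wise boosting step would then add, at each iteration, the single biomarker that most improves this criterion.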
Evaluation of statistical correlation and validation methods for construction of gene co-expression networks
High-throughput technologies such as microarrays have led to the rapid accumulation of large scale genomic data providing opportunities to systematically infer gene function and co-expression networks. Typical steps of co-expression network analysis using microarray data consist of estimation of pair-wise gene co-expression using some similarity measure, construction of co-expression networks, identification of clusters of co-expressed genes and post-cluster analyses such as cluster validation. This dissertation is primarily concerned with development and evaluation of approaches for the first and the last steps – estimation of gene co-expression matrices and validation of network clusters. Since clustering methods are not a focus, only a paraclique clustering algorithm will be used in this evaluation.
First, a novel Bayesian approach is presented for combining the Pearson correlation with prior biological information from Gene Ontology, yielding a biologically relevant estimate of gene co-expression. The addition of biological information by the Bayesian approach reduced noise in the paraclique gene clusters, as indicated by high silhouette widths and by the increased homogeneity of clusters in terms of molecular function. Standard similarity measures were also evaluated: the Pearson, Spearman, Kendall's tau, shrinkage, and partial correlation coefficients; mutual information; and the Euclidean and Manhattan distances. Based on quality metrics such as cluster homogeneity and stability with respect to ontological categories, clusters resulting from partial correlation and mutual information were more biologically relevant than those from any other measure.
Second, statistical quality of clusters was evaluated using approaches based on permutation tests and Mantel correlation to identify significant and informative clusters that capture most of the covariance in the dataset. Third, the utility of statistical contrasts was studied for classification of temporal patterns of gene expression. Specifically, polynomial and Helmert contrast analyses were shown to provide a means of labeling the co-expressed gene sets because they showed similar temporal profiles.
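The first step of the pipeline described above, estimating pairwise gene co-expression and turning it into a network, can be sketched with the simplest of the evaluated measures, the Pearson correlation. The threshold value and function name below are illustrative choices of mine, not the dissertation's.

```python
import numpy as np

def coexpression_adjacency(expr, threshold=0.8):
    """Build an unweighted co-expression network: compute the
    genes-by-genes Pearson correlation matrix from a genes-by-samples
    expression matrix, then keep an edge wherever the absolute
    correlation meets the threshold.  Self-edges are removed."""
    corr = np.corrcoef(expr)                       # genes x genes
    adj = (np.abs(corr) >= threshold).astype(int)
    np.fill_diagonal(adj, 0)                       # no self-edges
    return adj
```

Clustering algorithms such as the paraclique method mentioned above would then operate on this adjacency matrix, and the post-cluster validation steps assess the resulting gene clusters.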
Distance-based methods for detecting associations in structured data with applications in bioinformatics
In bioinformatics applications samples of biological variables of interest can take a variety
of structures. For instance, in this thesis we consider vector-valued observations
of multiple gene expression and genetic markers, curve-valued gene expression time
courses, and graph-valued functional connectivity networks within the brain. This
thesis considers three problems routinely encountered when dealing with such variables:
detecting differences between populations, detecting predictive relationships
between variables, and detecting association between variables.
Distance-based approaches to these problems are considered, offering great flexibility
over alternative approaches, such as traditional multivariate approaches which
may be inappropriate. The notion of distance has been widely adopted in recent years
to quantify the dissimilarity between samples, and suitable distance measures can be
applied depending on the nature of the data and on the specific objectives of the study.
For instance, for gene expression time courses modeled as time-dependent curves, distance
measures can be specified to capture biologically meaningful aspects of these
curves which may differ. On obtaining a distance matrix containing all pairwise distances
between the samples of a given variable, many distance-based testing procedures
can then be applied. The main inhibitor of their effective use in bioinformatics is that
p-values are typically estimated by using Monte Carlo permutations. Thousands or
even millions of tests need to be performed simultaneously, and time/computational
constraints lead to a low number of permutations being enumerated for each test.
The contributions of this thesis include the proposal of two new distance-based
statistics, the DBF statistic for the problem of detecting differences between populations,
and the GRV coefficient for the problem of detecting association between
variables. In each case approximate null distributions are derived, allowing estimation
of p-values with reduced computational cost, and through simulation these are shown to work well for a range of distances and data types. The tests are also demonstrated
to be competitive with existing approaches. For the problem of detecting predictive
relationships between variables, the approximate null distribution is derived for the
routinely used distance-based pseudo F test, and through simulation this is shown to
work well for a range of distances and data types. All tests are applied to real datasets,
including a longitudinal human immune cell M. tuberculosis dataset, an Alzheimer’s
disease dataset, and an ovarian cancer dataset.
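The "routinely used distance-based pseudo F test" mentioned above operates directly on a matrix of pairwise distances, comparing between-group to within-group sums of squared distances. A minimal sketch in the PERMANOVA form is given below; this illustrates the statistic itself, with p-values still requiring either permutations or an approximate null distribution of the kind the thesis derives.

```python
import numpy as np

def pseudo_f(D, labels):
    """Distance-based pseudo F statistic computed from a symmetric
    pairwise distance matrix D and group labels.  Total and
    within-group sums of squares come from squared pairwise distances
    divided by group size; their difference is the between-group part."""
    D = np.asarray(D, float)
    labels = np.asarray(labels)
    n, groups = len(labels), np.unique(labels)
    a = len(groups)
    iu = np.triu_indices(n, k=1)
    ss_total = np.sum(D[iu] ** 2) / n
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        sub = D[np.ix_(idx, idx)]
        iu_g = np.triu_indices(len(idx), k=1)
        ss_within += np.sum(sub[iu_g] ** 2) / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))
```

Because D can come from any suitable distance measure, the same statistic applies unchanged to vectors, expression time courses, or brain connectivity networks, which is the flexibility the abstract highlights.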