Variable selection and regression analysis for graph-structured covariates with an application to genomics
Graphs and networks are common ways of depicting biological information. In
biology, many different biological processes are represented by graphs, such as
regulatory networks, metabolic pathways and protein--protein interaction
networks. This kind of a priori use of graphs is a useful supplement to the
standard numerical data such as microarray gene expression data. In this paper
we consider the problem of regression analysis and variable selection when the
covariates are linked on a graph. We study a graph-constrained regularization
procedure and its theoretical properties for regression analysis to take into
account the neighborhood information of the variables measured on a graph. This
procedure involves a smoothness penalty on the coefficients that is defined as
a quadratic form of the Laplacian matrix associated with the graph. We
establish estimation and model selection consistency results and provide
estimation bounds for both fixed and diverging numbers of parameters in
regression models. We demonstrate by simulations and a real data set that the
proposed procedure can lead to better variable selection and prediction than
existing methods that ignore the graph information associated with the
covariates. Comment: Published at http://dx.doi.org/10.1214/10-AOAS332 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
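The smoothness penalty described in this abstract can be illustrated with a minimal sketch. It assumes the unnormalized Laplacian L = D - A of an undirected graph; the paper itself may use a normalized variant, and the function names `graph_laplacian` and `smoothness_penalty` are hypothetical, chosen only for illustration. The quadratic form is zero when linked coefficients are equal and grows as neighbors disagree.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A (an assumption; a normalized form is also common)."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def smoothness_penalty(beta, laplacian):
    """Quadratic form beta^T L beta, equal to the sum over edges of (beta_i - beta_j)^2."""
    beta = np.asarray(beta, dtype=float)
    return float(beta @ laplacian @ beta)

# Toy path graph on three covariates: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L = graph_laplacian(A)

# Equal coefficients on linked covariates incur no penalty
smooth = smoothness_penalty([1.0, 1.0, 1.0], L)
# Coefficients that flip sign across each edge are penalized
rough = smoothness_penalty([1.0, -1.0, 1.0], L)
```

In a penalized regression this term would be added to the loss with a tuning weight, so that coefficients of covariates linked on the graph are pulled toward each other.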
Censored Data Regression in High-Dimension and Low-Sample Size Settings For Genomic Applications
New high-throughput technologies are generating various types of high-dimensional genomic and proteomic data and meta-data (e.g., networks and pathways) in order to obtain a systems-level understanding of various complex diseases such as human cancers and cardiovascular diseases. As the amount and complexity of the data increase and as the questions being addressed become more sophisticated, we face the great challenge of how to model such data in order to draw valid statistical and biological conclusions. One important problem in genomic research is to relate these high-throughput genomic data to various clinical outcomes, including possibly censored survival outcomes such as age at disease onset or time to cancer recurrence. We review some recently developed methods for censored data regression in the high-dimension and low-sample size setting, with emphasis on applications to genomic data. These methods include dimension reduction-based methods, regularized estimation methods such as the Lasso and the threshold gradient descent method, gradient descent boosting methods and nonparametric pathways-based regression models. These methods are demonstrated and compared by analysis of a data set of microarray gene expression profiles of 240 patients with diffuse large B-cell lymphoma together with follow-up survival information. Areas of further research are also presented.
Conditional Screening for Ultra-high Dimensional Covariates with Survival Outcomes
Identifying important biomarkers that are predictive of cancer patients'
prognosis is key to gaining better insights into the biological influences on
the disease and has become a critical component of precision medicine. The
emergence of large-scale biomedical survival studies, which typically involve
an excessive number of biomarkers, has created high demand for efficient
screening tools for selecting predictive biomarkers. The vast number of
biomarkers defies any existing variable selection methods via regularization.
The recently developed variable screening methods, though powerful in many
practical settings, fail to incorporate prior information on the importance of
each biomarker and are less powerful in detecting marginally weak but jointly
important signals. We propose a new conditional screening method for survival
outcome data by computing the marginal contribution of each biomarker given
previously known biological information. This is based on the premise that some
biomarkers are known to be associated with disease outcomes a priori. Our
method possesses sure screening properties and a vanishing false selection
rate. The utility of the proposal is further confirmed with extensive
simulation studies and analysis of a diffuse large B-cell lymphoma (DLBCL)
dataset. Comment: 34 pages, 3 figures
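The idea of scoring each biomarker by its contribution given already-known covariates can be sketched as follows. This is a linear-regression analogue, not the survival-outcome method of the paper: it residualizes the outcome and each candidate on the known set and ranks candidates by the resulting coefficient magnitude. The function name `conditional_screening_stats`, the simulated data, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def conditional_screening_stats(X, y, known_idx):
    """|coefficient| of each candidate column when y is regressed on the
    known columns plus that one candidate (linear-model analogue only)."""
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X[:, known_idx]])

    def residualize(v):
        # Remove the part of v explained by the intercept and known covariates
        coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ coef

    r_y = residualize(y)
    stats = np.zeros(p)
    for j in range(p):
        if j in known_idx:
            continue  # known covariates are not screened
        r_x = residualize(X[:, j])
        stats[j] = abs(r_x @ r_y) / (r_x @ r_x)
    return stats

# Simulated example: column 0 is known a priori; column 1 is correlated with it
# and built so that its marginal association with y is close to zero.
rng = np.random.default_rng(0)
n, p = 2000, 6
X = rng.standard_normal((n, p))
X[:, 1] = 0.5 * X[:, 0] + np.sqrt(0.75) * X[:, 1]
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

stats = conditional_screening_stats(X, y, known_idx=[0])
top = int(np.argmax(stats))                    # candidate ranked first
marginal_corr = np.corrcoef(X[:, 1], y)[0, 1]  # marginal signal of column 1
```

In this toy setup the marginal correlation of column 1 with the outcome is near zero, so purely marginal screening would tend to miss it, while conditioning on the known column 0 brings it to the top of the ranking.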
Statistical Workflow for Feature Selection in Human Metabolomics Data
High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for more standardization of, as well as advances in, how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations.
Genomic architecture and prediction of censored time-to-event phenotypes with a Bayesian genome-wide analysis
While recent advancements in computation and modelling have improved the analysis of complex traits, our understanding of the genetic basis of the time at symptom onset remains limited. Here, we develop a Bayesian approach (BayesW) that provides probabilistic inference of the genetic architecture of age-at-onset phenotypes in a sampling scheme that facilitates biobank-scale time-to-event analyses. We show in extensive simulation work the benefits BayesW provides in terms of number of discoveries, model performance and genomic prediction. In the UK Biobank, we find many thousands of common genomic regions underlying the age-at-onset of high blood pressure (HBP), cardiac disease (CAD), and type-2 diabetes (T2D), and for the genetic basis of onset reflecting the underlying genetic liability to disease. Age-at-menopause and age-at-menarche are also highly polygenic, but with higher variance contributed by low frequency variants. Genomic prediction into the Estonian Biobank data shows that BayesW gives higher prediction accuracy than other approaches.
Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes
The vast amount of biological knowledge accumulated over the years has
allowed researchers to identify various biochemical interactions and define
different families of pathways. There is an increased interest in identifying
pathways and pathway elements involved in particular biological processes. Drug
discovery efforts, for example, are focused on identifying biomarkers as well
as pathways related to a disease. We propose a Bayesian model that addresses
this question by incorporating information on pathways and gene networks in the
analysis of DNA microarray data. Such information is used to define pathway
summaries, specify prior distributions, and structure the MCMC moves to fit the
model. We illustrate the method with an application to gene expression data
with censored survival outcomes. In addition to identifying markers that would
have been missed otherwise and improving prediction accuracy, the integration
of existing biological knowledge into the analysis provides a better
understanding of underlying molecular processes. Comment: Published at http://dx.doi.org/10.1214/11-AOAS463 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org