Search CORE

469 research outputs found

Variable selection and regression analysis for graph-structured covariates with an application to genomics

Author: Li Caiyan
Li Hongzhe
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 15/11/2010
Field of study

Graphs and networks are common ways of depicting biological information. In biology, many different biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein--protein interaction networks. This kind of a priori use of graphs is a useful supplement to the standard numerical data such as microarray gene expression data. In this paper we consider the problem of regression analysis and variable selection when the covariates are linked on a graph. We study a graph-constrained regularization procedure and its theoretical properties for regression analysis to take into account the neighborhood information of the variables measured on a graph. This procedure involves a smoothness penalty on the coefficients that is defined as a quadratic form of the Laplacian matrix associated with the graph. We establish estimation and model selection consistency results and provide estimation bounds for both fixed and diverging numbers of parameters in regression models. We demonstrate by simulations and a real data set that the proposed procedure can lead to better variable selection and prediction than existing methods that ignore the graph information associated with the covariates.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS332 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

Crossref

Censored Data Regression in High-Dimension and Low-Sample Size Settings For Genomic Applications

Author: Li Hongzhe
Publication venue: Collection of Biostatistics Research Archive
Publication date: 16/03/2006
Field of study

New high-throughput technologies are generating various types of high-dimensional genomic and proteomic data and meta-data (e.g., networks and pathways) in order to obtain a systems-level understanding of various complex diseases such as human cancers and cardiovascular diseases. As the amount and complexity of the data increase and as the questions being addressed become more sophisticated, we face the great challenge of how to model such data in order to draw valid statistical and biological conclusions. One important problem in genomic research is to relate these high-throughput genomic data to various clinical outcomes, including possibly censored survival outcomes such as age at disease onset or time to cancer recurrence. We review some recently developed methods for censored data regression in the high-dimension and low-sample size setting, with emphasis on applications to genomic data. These methods include dimension reduction-based methods, regularized estimation methods such as Lasso and threshold gradient descent method, gradient descent boosting methods and nonparametric pathways-based regression models. These methods are demonstrated and compared by analysis of a data set of microarray gene expression profiles of 240 patients with diffuse large B-cell lymphoma together with follow-up survival information. Areas of further research are also presented

Collection Of Biostatistics Research Archive

Conditional Screening for Ultra-high Dimensional Covariates with Survival Outcomes

Author: Hong Hyokyoung Grace
Kang Jian
Li Yi
Publication venue
Publication date: 19/03/2016
Field of study

Identifying important biomarkers that are predictive for cancer patients' prognosis is key in gaining better insights into the biological influences on the disease and has become a critical component of precision medicine. The emergence of large-scale biomedical survival studies, which typically involve excessive number of biomarkers, has brought high demand in designing efficient screening tools for selecting predictive biomarkers. The vast amount of biomarkers defies any existing variable selection methods via regularization. The recently developed variable screening methods, though powerful in many practical setting, fail to incorporate prior information on the importance of each biomarker and are less powerful in detecting marginally weak while jointly important signals. We propose a new conditional screening method for survival outcome data by computing the marginal contribution of each biomarker given priorly known biological information. This is based on the premise that some biomarkers are known to be associated with disease outcomes a priori. Our method possesses sure screening properties and a vanishing false selection rate. The utility of the proposal is further confirmed with extensive simulation studies and analysis of a Diffuse large B-cell lymphoma (DLBCL) dataset.Comment: 34 pages, 3 figure

arXiv.org e-Print Archive

Collection Of Biostatistics Research Archive

Recommended from our members

Statistical Workflow for Feature Selection in Human Metabolomics Data.

Author: Antonelli Joseph
Cheng Susan
Claggett Brian L
Demler Olga V
Deng Katherine
Henglin Mir
Hushcha Pavel V
Jain Mohit
Kim Andy
Kim Nicole
Lagerborg Kim A
Mora Samia
Niiranen Teemu J
Ovsak Gavin
Pereira Alexandre C
Rao Kevin
Tyagi Octavia
Watrous Jeramie D
Publication venue: eScholarship, University of California
Publication date: 01/07/2019
Field of study

High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for more standardization of as well as advances in how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations

eScholarship - University of California

Genomic architecture and prediction of censored time-to-event phenotypes with a Bayesian genome-wide analysis

Author: Fischer Krista
Kousathanas Athanasios
Kutalik Zoltan
Lall Kristi
Magi Reedik
Ojavee Sven E
Orliac Etienne J
Patxot Marion
Robinson Matthew Richard
Trejo Banos Daniel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021
Field of study

While recent advancements in computation and modelling have improved the analysis of complex traits, our understanding of the genetic basis of the time at symptom onset remains limited. Here, we develop a Bayesian approach (BayesW) that provides probabilistic inference of the genetic architecture of age-at-onset phenotypes in a sampling scheme that facilitates biobank-scale time-to-event analyses. We show in extensive simulation work the benefits BayesW provides in terms of number of discoveries, model performance and genomic prediction. In the UK Biobank, we find many thousands of common genomic regions underlying the age-at-onset of high blood pressure (HBP), cardiac disease (CAD), and type-2 diabetes (T2D), and for the genetic basis of onset reflecting the underlying genetic liability to disease. Age-at-menopause and age-at-menarche are also highly polygenic, but with higher variance contributed by low frequency variants. Genomic prediction into the Estonian Biobank data shows that BayesW gives higher prediction accuracy than other approaches

Serveur académique lausannois

IST Austria: PubRep (Institute of Science and Technology)

Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes

Author: Chen Yian A.
Stingo Francesco C.
Tadesse Mahlet G.
Vannucci Marina
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2010
Field of study

The vast amount of biological knowledge accumulated over the years has allowed researchers to identify various biochemical interactions and define different families of pathways. There is an increased interest in identifying pathways and pathway elements involved in particular biological processes. Drug discovery efforts, for example, are focused on identifying biomarkers as well as pathways related to a disease. We propose a Bayesian model that addresses this question by incorporating information on pathways and gene networks in the analysis of DNA microarray data. Such information is used to define pathway summaries, specify prior distributions, and structure the MCMC moves to fit the model. We illustrate the method with an application to gene expression data with censored survival outcomes. In addition to identifying markers that would have been missed otherwise and improving prediction accuracy, the integration of existing biological knowledge into the analysis provides a better understanding of underlying molecular processes.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS463 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

CiteSeerX

Crossref