13 research outputs found
Fast cross-validation for multi-penalty ridge regression
High-dimensional prediction with multiple data types needs to account for
potentially strong differences in predictive signal. Ridge regression is a
simple model for high-dimensional data that has challenged the predictive
performance of many more complex models and learners, and that allows inclusion
of data type specific penalties. The largest challenge for multi-penalty ridge
is to optimize these penalties efficiently in a cross-validation (CV) setting,
in particular for GLM and Cox ridge regression, which require an additional
estimation loop by iterative weighted least squares (IWLS). Our main
contribution is a computationally very efficient formula for the multi-penalty,
sample-weighted hat-matrix, as used in the IWLS algorithm. As a result, nearly
all computations are in low-dimensional space, rendering a speed-up of several
orders of magnitude. We developed a flexible framework that facilitates
multiple types of response, unpenalized covariates, several performance
criteria and repeated CV. Extensions to paired and preferential data types are
included and illustrated on several cancer genomics survival prediction
problems. Moreover, we present similar computational shortcuts for maximum
marginal likelihood and Bayesian probit regression. The corresponding
R-package, multiridge, serves as a versatile standalone tool, but also as a
fast benchmark for other more complex models and multi-view learners.
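As a minimal sketch of why the computations can stay in low-dimensional space, assume a linear response and no unpenalized covariates (a simplification of the setting described above). The multi-penalty ridge hat matrix can then be formed from n x n matrices only. The function and variable names are illustrative and are not taken from the multiridge package.

```python
import numpy as np

def multi_penalty_hat(blocks, lambdas):
    """Hat matrix of linear ridge with one penalty per data block.

    With K = sum_b X_b X_b^T / lambda_b (an n x n matrix), the fitted
    values are H y with H = K (K + I)^{-1}.
    """
    n = blocks[0].shape[0]
    K = np.zeros((n, n))
    for X_b, lam in zip(blocks, lambdas):
        K += X_b @ X_b.T / lam
    # K and (K + I) commute, so (K + I)^{-1} K equals K (K + I)^{-1}.
    return np.linalg.solve(K + np.eye(n), K)

def multi_penalty_hat_naive(blocks, lambdas):
    """Same hat matrix computed in the p-dimensional covariate space."""
    X = np.hstack(blocks)
    penalty = np.concatenate(
        [np.full(X_b.shape[1], lam) for X_b, lam in zip(blocks, lambdas)]
    )
    return X @ np.linalg.solve(X.T @ X + np.diag(penalty), X.T)

# Simulated example: 20 samples, two data types with 300 and 500 covariates.
rng = np.random.default_rng(0)
n = 20
blocks = [rng.standard_normal((n, 300)), rng.standard_normal((n, 500))]
lambdas = [5.0, 50.0]

H_fast = multi_penalty_hat(blocks, lambdas)
H_naive = multi_penalty_hat_naive(blocks, lambdas)
print(np.allclose(H_fast, H_naive))
```

The block Gram matrices X_b X_b^T do not depend on the penalties, so in a penalty search they could be computed once and reweighted for each candidate set of penalties.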
Fast marginal likelihood estimation of penalties for group-adaptive elastic net
Nowadays, clinical research routinely uses omics data, such as gene
expression, for predicting clinical outcomes or selecting markers.
Additionally, so-called co-data are often available, providing complementary
information on the covariates, like p-values from previously published studies
or groups of genes corresponding to pathways. Elastic net penalisation is
widely used for prediction and covariate selection. Group-adaptive elastic net
penalisation learns from co-data to improve the prediction and covariate
selection, by penalising important groups of covariates less than other groups.
Existing methods are, however, computationally expensive. Here we present a
fast method for marginal likelihood estimation of group-adaptive elastic net
penalties for generalised linear models. We first derive a low-dimensional
representation of the Taylor approximation of the marginal likelihood and its
first derivative for group-adaptive ridge penalties, to efficiently estimate
these penalties. Then we show by using asymptotic normality of the linear
predictors that the marginal likelihood for elastic net models may be
approximated well by the marginal likelihood for ridge models. The ridge group
penalties are then transformed to elastic net group penalties by using the
variance function. The method allows for overlapping groups and unpenalised
variables. We demonstrate the method in a model-based simulation study and an
application to cancer genomics. The method substantially decreases computation
time and outperforms or matches other methods by learning from co-data.
Comment: 16 pages, 6 figures, 1 table
Flexible co-data learning for high-dimensional prediction.
Clinical research often focuses on complex traits in which many variables play a role in mechanisms driving, or curing, diseases. Clinical prediction is hard when data are high-dimensional, but additional information, like domain knowledge and previously published studies, may help to improve predictions. Such complementary data, or co-data, provide information on the covariates, such as genomic location or P-values from external studies. We use multiple and various co-data to define possibly overlapping or hierarchically structured groups of covariates. These are then used to estimate adaptive multi-group ridge penalties for generalized linear and Cox models. Available group-adaptive methods primarily target settings with few groups. They are therefore likely to overfit when groups are non-informative, correlated or numerous, and they do not account for known structure at the group level. To handle these issues, our method combines empirical Bayes estimation of the hyperparameters with an extra level of flexible shrinkage. This renders a uniquely flexible framework, as any type of shrinkage can be used at the group level. We describe various types of co-data and propose suitable forms of hypershrinkage. The method is very versatile, as it allows for integration and weighting of multiple co-data sets, inclusion of unpenalized covariates and posterior variable selection. For three cancer genomics applications we demonstrate improvements compared to other models in terms of performance, variable selection stability and validation.
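To illustrate what group-specific ridge penalties do once they are available, the sketch below fits a linear ridge model with one penalty per covariate group by rescaling columns. It assumes a linear response, non-overlapping groups, and penalties that are simply given. It does not implement the empirical Bayes estimation or the hypershrinkage described above, and the names are hypothetical rather than part of any package.

```python
import numpy as np

def group_ridge_fit(X, y, groups, group_penalties):
    """Ridge coefficients with one penalty per covariate group.

    groups[j] is the group index of covariate j; group_penalties[g] is the
    (assumed known) penalty of group g.
    """
    lam = np.asarray(group_penalties, dtype=float)[np.asarray(groups)]
    # Dividing column j by sqrt(lambda_j) turns the problem into ordinary
    # ridge with unit penalty on the rescaled covariates.
    X_scaled = X / np.sqrt(lam)
    n = X.shape[0]
    # Solve the unit-penalty ridge problem via an n x n system.
    alpha = np.linalg.solve(X_scaled @ X_scaled.T + np.eye(n), y)
    beta_scaled = X_scaled.T @ alpha
    # Map the coefficients back to the original covariate scale.
    return beta_scaled / np.sqrt(lam)

# Simulated example: two groups of 50 covariates, the second penalized more.
rng = np.random.default_rng(1)
n, p = 30, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
groups = np.repeat([0, 1], p // 2)
penalties = [1.0, 100.0]

beta = group_ridge_fit(X, y, groups, penalties)
beta_direct = np.linalg.solve(
    X.T @ X + np.diag(np.array(penalties)[groups]), X.T @ y
)
print(np.allclose(beta, beta_direct))
```

A smaller penalty for a group shrinks its coefficients less, which is how informative groups can receive more weight in the predictor.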
Fast cross-validation for multi-penalty ridge regression
Prediction based on multiple high-dimensional data types needs to account for the potentially strong differences in predictive signal. Ridge regression is a simple, yet versatile and interpretable model for high-dimensional data that has challenged the predictive performance of many more complex models and learners, in particular in dense settings. Moreover, it allows using a specific penalty per data type to account for differences between those. Then, the largest challenge for multi-penalty ridge is to optimize these penalties efficiently in a cross-validation (CV) setting, in particular for GLM and Cox ridge regression, which require an additional loop for fitting the model by iterative weighted least squares (IWLS). Our main contribution is a computationally very efficient formula for the multi-penalty, sample-weighted hat-matrix, as used in the IWLS algorithm. As a result, nearly all computations are in the low-dimensional sample space. We show that our approach is several orders of magnitude faster than more naive ones. We developed a very flexible framework that includes prediction of several types of response, allows for unpenalized covariates, can optimize several performance criteria and implements repeated CV. Moreover, extensions to pair data types and to allow a preferential order of data types are included and illustrated on several cancer genomics survival prediction problems. The corresponding R-package, multiridge, serves as a versatile standalone tool, but also as a fast benchmark for other more complex models and multi-view learners.
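The IWLS loop mentioned above can also be sketched entirely in sample space. The example below is a simplified illustration for logistic ridge, assuming no unpenalized covariates (so no intercept) and no step-size control. It is not the multiridge implementation, and all names are hypothetical. Each iteration updates the linear predictor as eta = K (I + W K)^{-1} W z, where K is the penalty-weighted sum of block Gram matrices, W holds the IWLS weights and z is the working response.

```python
import numpy as np

def logistic_ridge_iwls(blocks, lambdas, y, n_iter=50, tol=1e-8):
    """Linear predictor of multi-penalty logistic ridge, fitted by IWLS
    using n x n matrices only."""
    n = len(y)
    K = np.zeros((n, n))
    for X_b, lam in zip(blocks, lambdas):
        K += X_b @ X_b.T / lam
    eta = np.zeros(n)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = prob * (1.0 - prob)          # IWLS weights
        Wz = w * eta + (y - prob)        # W z, avoiding division by w
        # Weighted update: eta_new = K (I + W K)^{-1} W z
        eta_new = K @ np.linalg.solve(np.eye(n) + w[:, None] * K, Wz)
        converged = np.max(np.abs(eta_new - eta)) < tol
        eta = eta_new
        if converged:
            break
    return eta, K

# Simulated example: 40 samples, two data types, binary response.
rng = np.random.default_rng(2)
n = 40
blocks = [rng.standard_normal((n, 200)), rng.standard_normal((n, 300))]
lambdas = [200.0, 600.0]
y = rng.integers(0, 2, size=n).astype(float)

eta, K = logistic_ridge_iwls(blocks, lambdas, y)
prob = 1.0 / (1.0 + np.exp(-eta))
# At the penalized maximum likelihood solution, eta = K (y - prob).
residual = np.max(np.abs(eta - K @ (y - prob)))
print(residual < 1e-6)
```

In a cross-validation setting the same update could be applied to the training-fold submatrix of K, so the high-dimensional covariates are only needed to form the block Gram matrices.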
ecpc: an R-package for generic co-data models for high-dimensional prediction
Background: High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous co-data such as external p values or correlations often arise in practice, more generic co-data models are needed. Results: Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for non-linear co-data relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper.
Conclusions: The R-package ecpc accommodates linear, generalised additive and shape constrained additive co-data models for the purpose of improved high-dimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on CRAN (https://cran.r-project.org/web/packages/ecpc/).
Percolate: An Exponential Family JIVE Model to Design DNA-Based Predictors of Drug Response
Motivation: Anti-cancer drugs may elicit resistance or sensitivity through mechanisms which involve several genomic layers. Nevertheless, we have demonstrated that gene expression contains most of the predictive capacity compared to the remaining omic data types. Unfortunately, this comes at a price: gene expression biomarkers are often hard to interpret and show poor robustness. Results: To capture the best of both worlds, i.e. the accuracy of gene expression and the robustness of other genomic levels, such as mutations, copy-number or methylation, we developed Percolate, a computational approach which extracts the joint signal between gene expression and the other omic data types. We developed an out-of-sample extension of Percolate which allows predictions on unseen samples without the necessity to recompute the joint signal on all data. We employed Percolate to extract the joint signal between gene expression and either mutations, copy-number or methylation, and used the out-of-sample extension to perform response prediction on unseen samples. We showed that the joint signal recapitulates, and sometimes exceeds, the predictive performance achieved with each data type individually. Importantly, molecular signatures created by Percolate do not require gene expression to be evaluated, rendering them suitable to clinical applications where only one data type is available. Availability: Percolate is available as a Python 3.7 package and the scripts to reproduce the results are available here.
CSF proteome profiling reveals biomarkers to discriminate dementia with Lewy bodies from Alzheimer's disease
Diagnosis of dementia with Lewy bodies (DLB) is challenging and specific biofluid biomarkers are highly needed. We employed proximity extension-based assays to measure 665 proteins in the cerebrospinal fluid (CSF) from patients with DLB (n=109), Alzheimer's disease (AD, n=235) and cognitively unimpaired controls (n=190). We identified over 50 CSF proteins dysregulated in DLB, enriched in myelination processes among others. The dopamine biosynthesis enzyme DDC was the most strongly dysregulated protein, and could efficiently discriminate DLB from controls and AD (AUC: 0.91 and 0.81, respectively). Classification modeling unveiled a 7-CSF biomarker panel that better discriminates DLB from AD (AUC: 0.93). A custom multiplex panel for six of these markers (DDC, CRH, MMP-3, ABL1, MMP-10, THOP1) was developed and validated in independent cohorts, including an AD and DLB autopsy cohort. This DLB CSF proteome study identifies DLB-specific protein changes and translates these findings to a practicable biomarker panel that accurately identifies DLB patients, providing promising diagnostic and clinical trial testing opportunities.