Lecture notes on ridge regression
The linear regression model cannot be fitted to high-dimensional data, as the
high-dimensionality brings about empirical non-identifiability. Penalized
regression overcomes this non-identifiability by augmenting the loss function
with a penalty, i.e., a function of the regression coefficients. The ridge
penalty is the sum of squared regression coefficients, giving rise to ridge
regression. Here, many aspects of ridge regression are reviewed, e.g., its
moments, mean squared error, its equivalence to constrained estimation, and its
relation to Bayesian regression. Finally, its behaviour and use are illustrated
in simulation and on omics data. Subsequently, ridge regression is generalized
to allow for a more general penalty. The ridge penalization framework is then
translated to logistic regression, and its properties are shown to carry over.
To contrast ridge penalized estimation, the final chapter introduces its lasso
counterpart.
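As a minimal sketch of the closed-form ridge estimator reviewed in these notes, beta_hat(lambda) = (X'X + lambda*I)^{-1} X'y; the simulated data and the penalty value below are illustrative assumptions:

    import numpy as np

    # Closed-form ridge estimator: beta_hat(lambda) = (X'X + lambda*I)^{-1} X'y.
    # The simulated data and penalty value are illustrative assumptions.
    def ridge(X, y, lam):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 100))       # n = 10 << p = 100: high-dimensional
    y = X[:, 0] + rng.normal(0, 0.1, 10)
    beta_hat = ridge(X, y, lam=1.0)      # well-defined although X'X is singular

For any lambda > 0 the penalized normal equations are strictly positive definite, which is precisely how the penalty restores identifiability in the p > n setting.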
Ridge Estimation of Inverse Covariance Matrices from High-Dimensional Data
We study ridge estimation of the precision matrix in the high-dimensional
setting where the number of variables is large relative to the sample size. We
first review two archetypal ridge estimators and note that the penalties they
employ do not coincide with common ridge penalties. Subsequently, starting
from a common ridge penalty, analytic expressions are derived for two
alternative ridge estimators of the precision matrix. The alternative
estimators are compared to the archetypes with regard to eigenvalue shrinkage
and risk. The alternatives are also compared to the graphical lasso within the
context of graphical modeling. The comparisons may give reason to prefer the
proposed alternative estimators.
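As a sketch, assuming a zero target matrix: the archetypal form inverts the diagonally inflated sample covariance matrix, while an estimator derived from a common ridge penalty takes a square-root form. The snippet below illustrates both shapes and is not a substitute for the paper's derivations:

    import numpy as np

    def ridge_precision_archetype(S, lam):
        # Archetypal ridge-type estimator: (S + lambda*I)^{-1}.
        p = S.shape[0]
        return np.linalg.inv(S + lam * np.eye(p))

    def ridge_precision_alternative(S, lam):
        # Estimator derived from a common ridge penalty (sketch, zero target):
        # Omega(lambda) = { [lambda*I + S^2/4]^{1/2} + S/2 }^{-1}.
        p = S.shape[0]
        vals, vecs = np.linalg.eigh(lam * np.eye(p) + 0.25 * S @ S)
        sqrt_term = (vecs * np.sqrt(vals)) @ vecs.T   # matrix square root
        return np.linalg.inv(sqrt_term + 0.5 * S)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 50))        # n = 20 < p = 50
    S = np.cov(X, rowvar=False)          # singular sample covariance
    P_arch = ridge_precision_archetype(S, lam=0.5)
    P_alt = ridge_precision_alternative(S, lam=0.5)

Both maps shrink the eigenvalues of the sample covariance before inversion; how they do so is where the estimators differ and where the risk comparison in the paper takes place.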
Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines
DNA copy number and mRNA expression are widely used data types in cancer
studies, which, combined, provide more insight than either does separately. Whereas in
existing literature the form of the relationship between these two types of
markers is fixed a priori, in this paper we model their association. We employ
piecewise linear regression splines (PLRS), which combine good interpretation
with sufficient flexibility to identify any plausible type of relationship. The
specification of the model leads to estimation and model selection in a
constrained, nonstandard setting. We provide methodology for testing the effect
of DNA on mRNA and choosing the appropriate model. Furthermore, we present a
novel approach to obtain reliable confidence bands for constrained PLRS, which
incorporates model uncertainty. The procedures are applied to colorectal and
breast cancer data. Common assumptions are found to be potentially misleading
for biologically relevant genes. More flexible models may bring more insight
into the interaction between the two markers.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS605 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
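A minimal sketch of a one-knot piecewise linear regression spline fitted by ordinary least squares; the knot location, the truncated-power basis, and the simulated data are illustrative assumptions, and the paper's method additionally imposes shape constraints and performs model selection:

    import numpy as np

    def plrs_basis(x, kappa):
        # Intercept, linear term, and a hinge at the knot kappa.
        return np.column_stack([np.ones_like(x), x, np.maximum(x - kappa, 0.0)])

    rng = np.random.default_rng(2)
    x = rng.uniform(-2, 2, 200)                        # e.g. DNA copy number
    y = (1.0 + 0.2 * x + 1.5 * np.maximum(x, 0.0)
         + rng.normal(0, 0.3, 200))                    # e.g. mRNA expression
    beta, *_ = np.linalg.lstsq(plrs_basis(x, kappa=0.0), y, rcond=None)

The fitted line changes slope at the knot, which is what lets the model pick up, say, a dose-response that only sets in once copy number is gained.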
Normalized, Segmented or Called aCGH Data?
Array comparative genomic hybridization (aCGH) is a high-throughput lab technique to measure genome-wide chromosomal copy numbers. Data from aCGH experiments require extensive pre-processing, which consists of three steps: normalization, segmentation and calling. Each of these pre-processing steps yields a different data set: normalized data, segmented data, and called data. Publications using aCGH base their findings on data from all stages of the pre-processing; hence, there is no consensus on which should be used for further down-stream analysis. Such a consensus is, however, important for correct reporting of findings and for comparison of results from different studies. We discuss several issues that should be taken into account when deciding which data are to be used. We express the belief that called data are best used, but would welcome opposing views.
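A toy sketch of the three pre-processing stages on log2-ratio data; the median-centering normalization, the hard-coded breakpoints, and the calling thresholds are illustrative assumptions, not the specific algorithms under discussion:

    import numpy as np

    log2_ratios = np.array([0.10, 0.12, 0.50, 0.55, -0.60, -0.58, 0.02])

    # 1) normalization: remove a global shift (here: median centering)
    normalized = log2_ratios - np.median(log2_ratios)
    # 2) segmentation: replace probe values by their segment means
    #    (breakpoints assumed given; in practice a segmenter estimates them)
    segments = [(0, 2), (2, 4), (4, 6), (6, 7)]
    segmented = np.concatenate([np.full(b - a, normalized[a:b].mean())
                                for a, b in segments])
    # 3) calling: discretize segment means into loss (-1) / normal (0) / gain (+1)
    called = np.digitize(segmented, bins=[-0.3, 0.3]) - 1

Each stage discards information relative to the previous one, which is exactly the trade-off at stake when choosing which of the three data sets to carry into down-stream analysis.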
The spectral condition number plot for regularization parameter evaluation
Many modern statistical applications ask for the estimation of a covariance (or precision) matrix in settings where the number of variables is larger than the number of observations. There exists a broad class of ridge-type estimators that employ regularization to cope with the resulting singularity of the sample covariance matrix. These estimators depend on a penalty parameter, and choosing its value can be hard: selection procedures may be computationally unfeasible or tenable only for a restricted set of ridge-type estimators. Here we introduce a simple graphical tool, the spectral condition number plot, for informed heuristic penalty parameter assessment. The proposed tool is computationally friendly and can be employed for the full class of ridge-type covariance (precision) estimators.
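A sketch of the idea for the archetypal ridge-type covariance estimator S + lambda*I (an assumption; the paper covers the full class of ridge-type estimators): plot the spectral condition number against the penalty parameter and read off where further regularization stops paying:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    X = rng.normal(size=(20, 50))        # n = 20 observations, p = 50 variables
    S = np.cov(X, rowvar=False)          # singular sample covariance (p > n)

    lambdas = np.logspace(-4, 2, 100)
    conds = [np.linalg.cond(S + lam * np.eye(50)) for lam in lambdas]

    plt.loglog(lambdas, conds)           # the spectral condition number plot
    plt.xlabel("penalty parameter")
    plt.ylabel("spectral condition number")
    plt.show()

np.linalg.cond returns the 2-norm (spectral) condition number by default, so the curve flattens once the penalty dominates the spectrum, a heuristic anchor for choosing the penalty value.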
rags2ridges: A One-Stop-ℓ₂-Shop for Graphical Modeling of High-Dimensional Precision Matrices
A graphical model is an undirected network representing the conditional independence properties between random variables. Graphical modeling has become part and parcel of systems or network approaches to multivariate data, in particular when the variable dimension exceeds the observation dimension. rags2ridges is an R package for graphical modeling of high-dimensional precision matrices through ridge (ℓ₂) penalties. It provides a modular framework for the extraction, visualization, and analysis of Gaussian graphical models from high-dimensional data. Moreover, it can handle the incorporation of prior information as well as multiple heterogeneous data classes. As such, it provides a one-stop-ℓ₂-shop for graphical modeling of high-dimensional precision matrices. The functionality of the package is illustrated with an example dataset pertaining to blood-based metabolite measurements in persons suffering from Alzheimer's disease.
Better prediction by use of co-data: Adaptive group-regularized ridge regression
For many high-dimensional studies, additional information on the variables,
like (genomic) annotation or external p-values, is available. In the context of
binary and continuous prediction, we develop a method for adaptive
group-regularized (logistic) ridge regression, which makes structural use of
such 'co-data'. Here, 'groups' refer to a partition of the variables according
to the co-data. We derive empirical Bayes estimates of group-specific
penalties, which possess several nice properties: i) they are analytical; ii)
they adapt to the informativeness of the co-data for the data at hand; iii)
only one global penalty parameter requires tuning by cross-validation. In
addition, the method allows use of multiple types of co-data at little extra
computational effort.
We show that the group-specific penalties may lead to a larger distinction
between 'near-zero' and relatively large regression parameters, which
facilitates post-hoc variable selection. The method, termed GRridge, is
implemented in an easy-to-use R-package. It is demonstrated on two cancer
genomics studies, which both concern the discrimination of precancerous
cervical lesions from normal cervix tissues using methylation microarray data.
For both examples, GRridge clearly improves the predictive performances of
ordinary logistic ridge regression and the group lasso. In addition, we show
that for the second study the relatively good predictive performance is
maintained when selecting only 42 variables.
Comment: 15 pages, 2 figures. Supplementary Information available on the first
author's web site.
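A minimal sketch of group-regularized ridge regression with group-specific penalties fixed by hand; GRridge estimates these penalties by empirical Bayes, which this sketch does not reproduce, and the data and group labels are illustrative assumptions:

    import numpy as np

    def group_ridge(X, y, groups, lambdas):
        # One ridge penalty per co-data group: solve (X'X + Lambda) beta = X'y,
        # with Lambda diagonal and Lambda_jj = lambda_{group(j)}.
        penalty = np.diag([lambdas[g] for g in groups])
        return np.linalg.solve(X.T @ X + penalty, X.T @ y)

    rng = np.random.default_rng(4)
    X = rng.normal(size=(50, 6))
    y = X[:, :3].sum(axis=1) + rng.normal(0, 0.5, 50)
    groups = [0, 0, 0, 1, 1, 1]              # partition induced by the co-data
    beta = group_ridge(X, y, groups, lambdas={0: 1.0, 1: 10.0})

Variables in the less informative group are shrunk harder, which is how differential penalization widens the gap between near-zero and large coefficients.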
- …