High-Dimensional Metrics in R
The R package hdm (High-dimensional Metrics) is an evolving
collection of statistical methods for estimation and quantification of
uncertainty in high-dimensional approximately sparse models. It focuses on
providing confidence intervals and significance testing for (possibly many)
low-dimensional subcomponents of the high-dimensional parameter vector.
The package provides efficient estimators and uniformly valid confidence
intervals for regression coefficients on target variables (e.g., a treatment or
policy variable) in a high-dimensional approximately sparse regression model,
for the average treatment effect (ATE) and the average treatment effect on the
treated (ATET), as well as for extensions of these parameters to the endogenous
setting. Theory-grounded, data-driven methods for selecting the penalization
parameter in Lasso regressions under heteroscedastic and non-Gaussian errors
are implemented. Moreover, joint/simultaneous confidence intervals for the
regression coefficients of a high-dimensional sparse regression are
implemented, including a joint significance test for Lasso regression. Data
sets that have been used in the
literature and might be useful for classroom demonstration and for testing new
estimators are included. R and the package hdm are open-source
software projects and can be freely downloaded from CRAN:
http://cran.r-project.org.
Comment: 34 pages; vignette for the R package hdm, available at
http://cran.r-project.org/web/packages/hdm/ and
http://r-forge.r-project.org/R/?group_id=2084 (development version)
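The workflow described above can be sketched in a few lines of R. The function names rlasso, rlassoEffects, and the joint option of confint follow the package's documented interface, but the simulated design, sample sizes, and settings below are illustrative assumptions rather than an excerpt from the vignette.

    # Minimal sketch of the hdm workflow; simulated data and settings are assumptions.
    library(hdm)
    set.seed(1)
    n <- 100; p <- 200
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(2, 1.5, rep(0, p - 2))                # approximately sparse coefficients
    y <- as.numeric(X %*% beta + rnorm(n))

    # Lasso with the theory-grounded, data-driven penalty level
    fit <- rlasso(X, y, post = TRUE)
    summary(fit)

    # Uniformly valid (and jointly valid) confidence intervals for target coefficients
    eff <- rlassoEffects(X, y, index = 1:5)
    confint(eff, level = 0.95, joint = TRUE)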
lassopack: Model selection and prediction with regularized regression in Stata
This article introduces lassopack, a suite of programs for regularized
regression in Stata. lassopack implements lasso, square-root lasso, elastic
net, ridge regression, adaptive lasso and post-estimation OLS. The methods are
suitable for the high-dimensional setting where the number of predictors, p,
may be large and possibly greater than the number of observations, n. We
offer three different approaches for selecting the penalization (`tuning')
parameters: information criteria (implemented in lasso2), K-fold
cross-validation and h-step ahead rolling cross-validation for cross-section,
panel and time-series data (cvlasso), and theory-driven (`rigorous')
penalization for the lasso and square-root lasso for cross-section and panel
data (rlasso). We discuss the theoretical framework and practical
considerations for each approach. We also present Monte Carlo results to
compare the performance of the penalization approaches.
Comment: 52 pages, 6 figures, 6 tables; submitted to Stata Journal; for more
information see https://statalasso.github.io
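lassopack itself is a Stata suite, so the small R sketch below (using glmnet, which is not part of lassopack) serves only as a language-neutral illustration of two of the tuning approaches described above: selecting the lasso penalty by an information criterion along the lambda path, in the spirit of lasso2, and by K-fold cross-validation, in the spirit of cvlasso. The simulated data and the BIC formula used are assumptions for illustration only.

    # Illustrative R analogue (glmnet), not the Stata implementation.
    library(glmnet)
    set.seed(1)
    n <- 80; p <- 120
    X <- matrix(rnorm(n * p), n, p)
    y <- as.numeric(X[, 1:3] %*% c(3, -2, 1.5) + rnorm(n))

    # Information-criterion selection along the lasso path (BIC shown here)
    path <- glmnet(X, y, alpha = 1)
    rss  <- colSums((y - predict(path, X))^2)       # residual sum of squares per lambda
    bic  <- n * log(rss / n) + log(n) * path$df     # df ~ number of nonzero coefficients
    lambda.bic <- path$lambda[which.min(bic)]
    coef(path, s = lambda.bic)

    # K-fold cross-validation selection
    cvfit <- cv.glmnet(X, y, nfolds = 10)
    cvfit$lambda.min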
Random lasso
We propose a computationally intensive method, the random lasso method, for
variable selection in linear models. The method consists of two major steps. In
step 1, the lasso method is applied to many bootstrap samples, each using a set
of randomly selected covariates. This step yields an importance measure for
each covariate. In step 2, a procedure similar to the first step is
implemented with the exception that for each bootstrap sample, a subset of
covariates is randomly selected with unequal selection probabilities determined
by the covariates' importance. Adaptive lasso may be used in the second step
with weights determined by the importance measures. The final set of covariates
and their coefficients are determined by averaging bootstrap results obtained
from step 2. The proposed method alleviates some of the limitations of the
lasso, elastic net, and related methods noted especially in the context of
microarray data analysis: unlike those methods, which tend to either drop
highly correlated variables altogether or select them all, it maintains maximal
flexibility in estimating their coefficients, particularly with different
signs; the number of selected variables is no longer limited by the sample
size; and the resulting prediction accuracy is competitive with or superior to
that of the alternatives. We illustrate
the proposed method by extensive simulation studies. The proposed method is
also applied to a glioblastoma microarray data analysis.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS377 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
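A compact R sketch of the two-step procedure (via glmnet) is given below. The values of B and q, the use of cross-validated plain lasso rather than adaptive lasso in step 2, and the small jitter added to the selection probabilities are illustrative assumptions, not the authors' implementation.

    # Sketch of random lasso: bootstrap samples + random covariate subsets, in two steps.
    library(glmnet)
    set.seed(1)
    n <- 60; p <- 100; B <- 100; q <- 30
    X <- matrix(rnorm(n * p), n, p)
    y <- as.numeric(X[, 1:4] %*% c(3, 3, -2, -2) + rnorm(n))

    boot.lasso <- function(prob) {
      coefs <- matrix(0, B, p)
      for (b in 1:B) {
        idx  <- sample(n, n, replace = TRUE)                 # bootstrap sample
        vars <- sample(p, q, prob = prob)                    # random subset of covariates
        fit  <- cv.glmnet(X[idx, vars, drop = FALSE], y[idx])
        coefs[b, vars] <- as.numeric(as.matrix(coef(fit, s = "lambda.min")))[-1]
      }
      colMeans(coefs)                                        # average over bootstrap fits
    }

    # Step 1: equal selection probabilities; importance = |averaged coefficients|
    importance <- abs(boot.lasso(prob = rep(1 / p, p)))

    # Step 2: selection probabilities proportional to importance (jittered to stay positive)
    beta.hat <- boot.lasso(prob = (importance + 1e-6) / sum(importance + 1e-6))
    head(order(abs(beta.hat), decreasing = TRUE))            # top-ranked covariates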
A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models
Constructing confidence intervals for the coefficients of high-dimensional
sparse linear models remains a challenge, mainly because of the complicated
limiting distributions of the widely used estimators, such as the lasso.
Several methods have been developed for constructing such intervals. Bootstrap
lasso+ols is notable for its technical simplicity, good interpretability, and
performance that is comparable with that of other more complicated methods.
However, bootstrap lasso+ols relies on the beta-min assumption, a theoretical
condition that is often violated in practice. Thus, we introduce a new method,
called bootstrap lasso+partial ridge, to relax this assumption. Lasso+partial
ridge is a two-stage estimator. First, the lasso is used to select features.
Then, the partial ridge is used to refit the coefficients. Simulation results
show that bootstrap lasso+partial ridge outperforms bootstrap lasso+ols when
there exist small but nonzero coefficients, a common situation that violates
the beta-min assumption. For such coefficients, the confidence intervals
constructed using bootstrap lasso+partial ridge have, on average, larger
coverage probabilities than those of bootstrap lasso+ols. Bootstrap
lasso+partial ridge also has, on average, shorter confidence interval
lengths than those of the de-sparsified lasso methods, regardless of whether
the linear models are misspecified. Additionally, we provide theoretical
guarantees for bootstrap lasso+partial ridge under appropriate conditions, and
implement it in the R package "HDCI".
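The HDCI package provides its own implementation; purely as an illustration of the estimator described above, here is a small R sketch in which the lasso (via glmnet) selects features and a partial ridge refit penalizes only the unselected coefficients. The choice of lambda2, the use of lambda.1se for selection, and the simple paired percentile bootstrap are assumptions made for the sketch.

    # Sketch of lasso + partial ridge with a paired percentile bootstrap (illustrative only).
    library(glmnet)
    set.seed(1)
    n <- 80; p <- 150
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(2, -1.5, 0.1, rep(0, p - 3))                   # includes a small nonzero coefficient
    y <- as.numeric(X %*% beta + rnorm(n))

    lasso.partial.ridge <- function(X, y, lambda2 = 1) {
      b.lasso <- as.numeric(as.matrix(coef(cv.glmnet(X, y), s = "lambda.1se")))[-1]
      sel <- which(b.lasso != 0)                             # features selected by the lasso
      D <- diag(rep(1, ncol(X))); diag(D)[sel] <- 0          # no ridge penalty on selected features
      solve(crossprod(X) + lambda2 * D, crossprod(X, y))     # partial ridge refit
    }

    # Percentile bootstrap confidence interval for the first coefficient
    boot <- replicate(100, {
      idx <- sample(n, n, replace = TRUE)
      lasso.partial.ridge(X[idx, ], y[idx])[1]
    })
    quantile(boot, c(0.025, 0.975))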
An Algorithmic Framework for Computing Validation Performance Bounds by Using Suboptimal Models
Practical model building processes are often time-consuming because many
different models must be trained and validated. In this paper, we introduce a
novel algorithm that can be used for computing the lower and the upper bounds
of model validation errors without actually training the model itself. A key
idea behind our algorithm is to use side information available from a
suboptimal model. If a reasonably good suboptimal model is available, our
algorithm can compute lower and upper bounds of many useful quantities for
making inferences on the unknown target model. We demonstrate the advantage of
our algorithm in the context of model selection for regularized learning
problems.
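The abstract does not spell out the construction, so the R sketch below only illustrates one standard way such bounds can arise, under assumptions that are mine rather than the paper's: if a suboptimal linear model w.hat is known (e.g., via the duality gap of a strongly convex regularized objective) to lie within distance r of the unknown optimal model w*, then each validation score of w* is confined to an interval, which yields lower and upper bounds on its 0-1 validation error without training w*.

    # Illustrative bounds on the validation error of an unknown optimal model w*,
    # assuming only that ||w.hat - w*|| <= r for a known suboptimal model w.hat.
    set.seed(1)
    n <- 200; d <- 20
    Xval <- matrix(rnorm(n * d), n, d)
    w.star <- rnorm(d)                               # unknown in practice; used here only to simulate labels
    yval <- sign(as.numeric(Xval %*% w.star))

    w.hat <- w.star + 0.05 * rnorm(d)                # stand-in for a suboptimal model
    r <- 0.2                                         # assumed bound on ||w.hat - w*||

    margin <- yval * as.numeric(Xval %*% w.hat)
    slack  <- r * sqrt(rowSums(Xval^2))              # max score change within the ball (Cauchy-Schwarz)
    err.lower <- mean(margin + slack < 0)            # misclassified by every model in the ball
    err.upper <- mean(margin - slack < 0)            # misclassified by some model in the ball
    c(err.lower, err.upper)                          # the validation error of w* lies in this range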