Adaptive robust variable selection
Heavy-tailed high-dimensional data are commonly encountered in various
scientific fields and pose great challenges to modern statistical analysis. A
natural procedure to address this problem is to use penalized quantile
regression with weighted $L_1$-penalty, called weighted robust Lasso
(WR-Lasso), in which weights are introduced to ameliorate the bias problem
induced by the $L_1$-penalty. In the ultra-high dimensional setting, where the
dimensionality can grow exponentially with the sample size, we investigate the
model selection oracle property and establish the asymptotic normality of the
WR-Lasso. We show that only mild conditions on the model error distribution are
needed. Our theoretical results also reveal that adaptive choice of the weight
vector is essential for the WR-Lasso to enjoy these nice asymptotic properties.
To make the WR-Lasso practically feasible, we propose a two-step procedure,
called adaptive robust Lasso (AR-Lasso), in which the weight vector in the
second step is constructed based on the $L_1$-penalized quantile regression
estimate from the first step. This two-step procedure is justified
theoretically to possess the oracle property and the asymptotic normality.
Numerical studies demonstrate the favorable finite-sample performance of the
AR-Lasso.
Comment: Published at http://dx.doi.org/10.1214/13-AOS1191 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
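To make the two-step idea concrete, here is a minimal sketch in Python of an AR-Lasso-style fit, not the authors' implementation: a pilot $L_1$-penalized quantile regression supplies coefficient-specific weights, and the weighted penalty is imposed in a second quantile fit by rescaling the design columns. The weight rule 1/(|pilot| + eps) and all tuning constants are illustrative assumptions.

import numpy as np
from sklearn.linear_model import QuantileRegressor

def ar_lasso(X, y, tau=0.5, alpha1=0.1, alpha2=0.1, eps=1e-4):
    """Two-step sketch: pilot L1 quantile fit, then a weighted refit.
    The weight rule 1 / (|pilot| + eps) is an illustrative choice."""
    # Step 1: L1-penalized quantile regression gives a pilot estimate.
    pilot = QuantileRegressor(quantile=tau, alpha=alpha1, solver="highs").fit(X, y)
    w = 1.0 / (np.abs(pilot.coef_) + eps)            # adaptive weights
    # Step 2: weighted L1 penalty, imposed by rescaling column j by 1 / w_j.
    Xw = X / w
    fit = QuantileRegressor(quantile=tau, alpha=alpha2, solver="highs").fit(Xw, y)
    return fit.coef_ / w, fit.intercept_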
One-step estimator paths for concave regularization
The statistics literature of the past 15 years has established many favorable
properties for sparse diminishing-bias regularization: techniques which can
roughly be understood as providing estimation under penalty functions spanning
the range of concavity between the $\ell_0$ and $\ell_1$ norms. However, lasso
$\ell_1$-regularized estimation remains the standard tool for industrial `Big
Data' applications because of its minimal computational cost and the presence
of easy-to-apply rules for penalty selection. In response, this article
proposes a simple new algorithm framework that requires no more computation
than a lasso path: the path of one-step estimators (POSE) does penalized
regression estimation on a grid of decreasing penalties, but adapts
coefficient-specific weights to decrease as a function of the coefficient
estimated in the previous path step. This provides sparse diminishing-bias
regularization at no extra cost over the fastest lasso algorithms. Moreover,
our `gamma lasso' implementation of POSE is accompanied by a reliable heuristic
for the fit degrees of freedom, so that standard information criteria can be
applied in penalty selection. We also provide novel results on the distance
between weighted-$\ell_1$ and $\ell_0$ penalized predictors; this allows us to build
intuition about POSE and other diminishing-bias regularization schemes. The
methods and results are illustrated in extensive simulations and in application
of logistic regression to evaluating the performance of hockey players.
Comment: Data and code are in the gamlr package for R. Supplemental appendix
is at https://github.com/TaddyLab/pose/raw/master/paper/supplemental.pd
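As a rough illustration of a one-step weighted path, and not the gamlr package's exact update, the sketch below re-weights the $\ell_1$ penalty at each step of a decreasing penalty grid using the previous step's coefficients; the weight rule 1/(1 + gamma*|beta|) is assumed here in the spirit of the gamma lasso.

import numpy as np
from sklearn.linear_model import Lasso

def pose_path(X, y, lambdas, gamma=1.0):
    """Path of one-step estimators: at each penalty level, coefficient-specific
    weights shrink as a function of the previous step's estimate."""
    beta_prev = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):          # decreasing penalty grid
        w = 1.0 / (1.0 + gamma * np.abs(beta_prev))    # assumed weight rule
        Xw = X / w                                     # weighted L1 via column rescaling
        fit = Lasso(alpha=lam, max_iter=10000).fit(Xw, y)
        beta_prev = fit.coef_ / w
        path.append(beta_prev.copy())
    return np.array(path)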
APPLE: Approximate Path for Penalized Likelihood Estimators
In high-dimensional data analysis, penalized likelihood estimators are shown
to provide superior results in both variable selection and parameter
estimation. A new algorithm, APPLE, is proposed for calculating the Approximate
Path for Penalized Likelihood Estimators. Both the convex penalty (such as
LASSO) and the nonconvex penalty (such as SCAD and MCP) cases are considered.
The APPLE efficiently computes the solution path for the penalized likelihood
estimator using a hybrid of the modified predictor-corrector method and the
coordinate-descent algorithm. APPLE is compared with several well-known
packages via simulation and analysis of two gene expression data sets.
Comment: 24 pages, 9 figures
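APPLE itself hybridizes a modified predictor-corrector step with coordinate descent; as background only, the sketch below shows the coordinate-descent building block with the standard MCP thresholding rule for standardized columns. It is a generic illustration of that inner update, not the APPLE path algorithm.

import numpy as np

def mcp_threshold(z, lam, gamma=3.0):
    """Univariate MCP update for a standardized coordinate."""
    if abs(z) <= gamma * lam:
        soft = np.sign(z) * max(abs(z) - lam, 0.0)     # soft-thresholding
        return soft / (1.0 - 1.0 / gamma)
    return z                                           # large coefficients are left unshrunk

def cd_mcp(X, y, lam, gamma=3.0, n_iter=100):
    """Cyclic coordinate descent for least squares + MCP; assumes columns of X
    are centered and scaled so that X[:, j] @ X[:, j] == n."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            z = X[:, j] @ r / n + beta[j]              # partial-residual fit
            new = mcp_threshold(z, lam, gamma)
            r += X[:, j] * (beta[j] - new)             # keep residual in sync
            beta[j] = new
    return beta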
Maximum Likelihood Estimation of Stochastic Frontier Models with Endogeneity
We propose and study a maximum likelihood estimator of stochastic frontier
models with endogeneity in cross-section data when the composite error term may
be correlated with inputs and environmental variables. Our framework is a
generalization of the normal half-normal stochastic frontier model with
endogeneity. We derive the likelihood function in closed form using three
fundamental assumptions: the existence of control functions that fully capture
the dependence between regressors and unobservables; the conditional
independence of the two error components given the control functions; and the
conditional distribution of the stochastic inefficiency term given the control
functions being a folded normal distribution. We also provide a Battese-Coelli
estimator of technical efficiency. Our estimator is computationally fast and
easy to implement. We study some of its asymptotic properties, and we showcase
its finite sample behavior in Monte-Carlo simulations and an empirical
application to farmers in Nepal.
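For orientation, the baseline normal half-normal frontier without endogeneity, which the proposed model generalizes, already has a closed-form log-likelihood; the LaTeX sketch below records that standard form, with the control-function extension left to the paper. The Battese-Coelli efficiency score is then the conditional mean of exp(-u_i) given the composite error, evaluated at the MLE.

% Baseline normal half-normal stochastic frontier (no endogeneity):
% y_i = x_i'\beta + v_i - u_i,  v_i ~ N(0, \sigma_v^2),  u_i ~ |N(0, \sigma_u^2)|,
% with \sigma^2 = \sigma_v^2 + \sigma_u^2, \lambda = \sigma_u / \sigma_v,
% and \varepsilon_i = y_i - x_i'\beta:
\ell(\beta, \sigma, \lambda) = \sum_{i=1}^{n} \left[
  \ln\frac{2}{\sigma}
  + \ln \phi\!\left(\frac{\varepsilon_i}{\sigma}\right)
  + \ln \Phi\!\left(-\frac{\lambda \varepsilon_i}{\sigma}\right) \right].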
Variance Estimation Using Refitted Cross-validation in Ultrahigh Dimensional Regression
Variance estimation is a fundamental problem in statistical modeling. In
ultrahigh dimensional linear regressions where the dimensionality is much
larger than sample size, traditional variance estimation techniques are not
applicable. Recent advances on variable selection in ultrahigh dimensional
linear regressions make this problem accessible. One of the major problems in
ultrahigh dimensional regression is the high spurious correlation between the
unobserved realized noise and some of the predictors. As a result, the realized
noises are actually predicted when extra irrelevant variables are selected,
leading to a serious underestimate of the noise level. In this paper, we propose
a two-stage refitted procedure via a data splitting technique, called refitted
cross-validation (RCV), to attenuate the influence of irrelevant variables with
high spurious correlations. Our asymptotic results show that the resulting
procedure performs as well as the oracle estimator, which knows in advance the
mean regression function. The simulation studies lend further support to our
theoretical claims. The naive two-stage estimator, which fits the selected
variables in the first stage, and the plug-in one-stage estimators using LASSO
and SCAD are also studied and compared. Their performances can be improved by
the proposed RCV method.
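A minimal sketch of refitted cross-validation in Python, assuming a lasso is used as the first-stage selector (the method allows other selectors); names and tuning choices are illustrative.

import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def rcv_variance(X, y, seed=0):
    """Select variables on one half of the data, refit by OLS and estimate the
    noise variance on the other half, then swap the halves and average."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    halves = (idx[: len(y) // 2], idx[len(y) // 2 :])
    sigma2 = []
    for a, b in (halves, halves[::-1]):
        sel = np.flatnonzero(LassoCV(cv=5).fit(X[a], y[a]).coef_)   # stage 1: selection
        if sel.size == 0:
            resid, dof = y[b] - y[b].mean(), len(b) - 1
        else:
            ols = LinearRegression().fit(X[b][:, sel], y[b])        # stage 2: refit
            resid = y[b] - ols.predict(X[b][:, sel])
            dof = max(len(b) - sel.size - 1, 1)
        sigma2.append(resid @ resid / dof)
    return float(np.mean(sigma2))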
Doubly Robust Inference when Combining Probability and Non-probability Samples with High-dimensional Data
Non-probability samples are becoming increasingly popular in survey statistics but
may suffer from selection biases that limit the generalizability of results to
the target population. We consider integrating a non-probability sample with a
probability sample which provides high-dimensional representative covariate
information of the target population. We propose a two-step approach for
variable selection and finite population inference. In the first step, we use
penalized estimating equations with folded-concave penalties to select
important variables for the sampling score of selection into the
non-probability sample and the outcome model. We show that the penalized
estimating equation approach enjoys the selection consistency property for
general probability samples. The major technical hurdle is due to the possible
dependence of the sample under the finite population framework. To overcome
this challenge, we construct martingales, which enable us to apply the Bernstein
concentration inequality for martingales. In the second step, we focus on a
doubly robust estimator of the finite population mean and re-estimate the
nuisance model parameters by minimizing the asymptotic squared bias of the
doubly robust estimator. This estimating strategy mitigates the possible
first-step selection error and renders the doubly robust estimator root-n
consistent if either the sampling probability or the outcome model is correctly
specified.
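For orientation, a generic doubly robust estimator of the finite population mean in this setting combines an inverse sampling-score residual correction over the non-probability sample with a design-weighted outcome prediction over the probability sample. The sketch below assumes the fitted sampling scores and outcome predictions are supplied, and it omits the paper's bias-minimizing re-estimation of the nuisance parameters.

import numpy as np

def dr_mean(y_B, pi_B, m_B, m_A, d_A, N=None):
    """Generic doubly robust mean: non-probability sample B with fitted sampling
    scores pi_B and outcome predictions m_B; probability sample A with survey
    design weights d_A and outcome predictions m_A."""
    if N is None:
        N = d_A.sum()                           # design-based estimate of the population size
    correction = np.sum((y_B - m_B) / pi_B)     # residual term from sample B
    prediction = np.sum(d_A * m_A)              # design-weighted prediction from sample A
    return (correction + prediction) / N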
Publication Bias in Meta-Analysis: Confidence Intervals for Rosenthal's Fail-Safe Number
The purpose of the present paper is to assess the efficacy of confidence
intervals for Rosenthal's fail-safe number. Although Rosenthal's estimator is
widely used by researchers, its statistical properties are largely unexplored.
First, we developed statistical theory that allowed us to produce
confidence intervals for Rosenthal's fail-safe number. This was done by
distinguishing whether the number of studies analysed in a meta-analysis is fixed
or random. Each case produces different variance estimators. For a given number
of studies and a given distribution, we provided five variance estimators.
Confidence intervals are examined with a normal approximation and a
nonparametric bootstrap. The accuracy of the different confidence interval
estimates was then tested by methods of simulation under different
distributional assumptions. The half normal distribution variance estimator has
the best probability coverage. Finally, we provide a table of lower confidence
intervals for Rosenthal's estimator.
Comment: Published in the International Scholarly Research Notices in December
201
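As background for what the intervals are built around, Rosenthal's fail-safe number is conventionally computed from the studies' z-scores as in the sketch below (one-tailed alpha = 0.05 by default); this is the classical point estimate only, not the paper's variance estimators or bootstrap intervals.

import numpy as np
from scipy.stats import norm

def fail_safe_n(z_scores, alpha=0.05):
    """Rosenthal's fail-safe number: how many unpublished null-result studies
    would be needed to push the combined (Stouffer) test above alpha."""
    z = np.asarray(z_scores, dtype=float)
    z_alpha = norm.ppf(1.0 - alpha)             # 1.645 for one-tailed alpha = 0.05
    return z.sum() ** 2 / z_alpha ** 2 - z.size

# Example: fail_safe_n([2.1, 1.8, 2.5, 1.9, 2.2]) is roughly 36.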