Robust rank correlation based screening
Independence screening is a variable selection method that uses a ranking
criterion to select significant variables, particularly for statistical models
with nonpolynomial dimensionality or "large p, small n" paradigms when p can be
as large as an exponential of the sample size n. In this paper we propose a
robust rank correlation screening (RRCS) method to deal with ultra-high
dimensional data. The new procedure is based on the Kendall \tau correlation
coefficient between response and predictor variables rather than the Pearson
correlation of existing methods. The new method has four desirable features
compared with existing independence screening methods. First, the sure
independence screening property holds under only the existence of a
second-order moment of the predictor variables, rather than exponential tails
or similar conditions, even when the number of predictor variables grows
exponentially with the sample size. Second, it can be used to deal with
semiparametric models such as transformation regression models and single-index
models under monotonic constraint to the link function without involving
nonparametric estimation even when there are nonparametric functions in the
models. Third, the procedure is largely robust against outliers and influential
points in the observations. Last, the use of indicator functions in rank
correlation screening greatly simplifies the theoretical derivation due to the
boundedness of the resulting statistics, compared with previous studies on
variable screening. Simulations are carried out for comparisons with existing
methods and a real data example is analyzed.
Comment: Published at http://dx.doi.org/10.1214/12-AOS1024 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org). arXiv admin note: text overlap with
arXiv:0903.525
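
As a rough illustration of the screening idea described in the abstract above, the following Python sketch ranks predictors by the absolute value of their Kendall tau with the response and keeps the top d. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names `kendall_tau` and `rrcs_screen`, the cutoff `d`, and the simulated data are all hypothetical, and the abstract does not give the exact statistic or thresholding rule used in the paper.

```python
import numpy as np


def kendall_tau(x, y):
    """Kendall tau-a from pairwise signs (O(n^2) memory; fine for small n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # Sum over ordered pairs (i, j) with i != j; the diagonal contributes 0.
    return float((sx * sy).sum()) / (n * (n - 1))


def rrcs_screen(X, y, d):
    """Return (indices of the d predictors with largest |tau|, all taus)."""
    taus = np.array([kendall_tau(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(-np.abs(taus))[:d]
    return keep, taus


# Hypothetical simulation: heavy-tailed predictors and a response that is a
# monotone (exponential) transform of a linear index in predictors 0 and 1.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_t(df=3, size=(n, p))
index = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n)
y = np.exp(index)
selected, taus = rrcs_screen(X, y, d=10)
print(sorted(selected.tolist()))
```

Because the statistic depends only on the ordering of the observations, applying a strictly increasing transformation to the response leaves the ranking of predictors unchanged, which is consistent with the abstract's remark about monotone link functions.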
Doubly Robust Inference when Combining Probability and Non-probability Samples with High-dimensional Data
Non-probability samples have become increasingly popular in survey statistics but
may suffer from selection biases that limit the generalizability of results to
the target population. We consider integrating a non-probability sample with a
probability sample which provides high-dimensional representative covariate
information of the target population. We propose a two-step approach for
variable selection and finite population inference. In the first step, we use
penalized estimating equations with folded-concave penalties to select
important variables for the sampling score of selection into the
non-probability sample and the outcome model. We show that the penalized
estimating equation approach enjoys the selection consistency property for
general probability samples. The major technical hurdle is due to the possible
dependence of the sample under the finite population framework. To overcome
this challenge, we construct martingales that enable us to apply the Bernstein
concentration inequality for martingales. In the second step, we focus on a
doubly robust estimator of the finite population mean and re-estimate the
nuisance model parameters by minimizing the asymptotic squared bias of the
doubly robust estimator. This estimating strategy mitigates the possible
first-step selection error and renders the doubly robust estimator root-n
consistent if either the sampling probability or the outcome model is correctly
specified.
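
To make the second step of the abstract above more concrete, here is a minimal Python sketch of one common doubly robust form for a finite population mean when a non-probability sample (A, with outcomes) is combined with a probability sample (B, with design weights). The abstract does not state the estimator's formula, so this Hajek-type form is an assumption borrowed from the general data-integration literature. The function name `doubly_robust_mean` and its arguments are hypothetical, the fitted values are taken as given, and the paper's re-estimation of nuisance parameters by minimizing the asymptotic squared bias is not implemented.

```python
import numpy as np


def doubly_robust_mean(y_a, m_a, pi_a, m_b, d_b):
    """Doubly robust estimate of a finite population mean (assumed form).

    y_a  : outcomes observed in the non-probability sample A
    m_a  : outcome-model predictions for units in A
    pi_a : estimated probabilities of selection into A
    m_b  : outcome-model predictions for units in the probability sample B
    d_b  : design weights of units in B
    """
    y_a, m_a, pi_a, m_b, d_b = (
        np.asarray(v, dtype=float) for v in (y_a, m_a, pi_a, m_b, d_b)
    )
    n_hat_a = np.sum(1.0 / pi_a)  # population size estimated from A
    n_hat_b = np.sum(d_b)         # population size estimated from B
    # Inverse-probability-weighted residuals correct the outcome model.
    correction = np.sum((y_a - m_a) / pi_a) / n_hat_a
    # Design-weighted mean of predictions over the probability sample.
    prediction = np.sum(d_b * m_b) / n_hat_b
    return correction + prediction


# If the outcome model fits A exactly, only the prediction term remains.
est = doubly_robust_mean(
    y_a=[1.0, 2.0, 3.0], m_a=[1.0, 2.0, 3.0], pi_a=[0.2, 0.5, 0.4],
    m_b=[2.0, 4.0], d_b=[10.0, 30.0],
)
print(est)
```

The two terms show where the double robustness comes from in this assumed form: a correct outcome model makes the residual term vanish on average, while a correct selection model makes the weighted residuals repair a misspecified outcome model.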
Penalized single-index quantile regression
This article is made available through the Brunel Open Access Publishing Fund. Copyright for this article is retained by the author(s), with first publication rights granted to the journal.
This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution
license (http://creativecommons.org/licenses/by/3.0/).
The single-index (SI) regression and single-index quantile (SIQ) estimation methods produce linear combinations of all the original predictors. However, it is possible that there are many unimportant predictors within the original predictors. Thus, the precision of parameter estimation as well as the accuracy of prediction will be affected by the existence of those unimportant predictors when the previous methods are used. In this article, an extension of the SIQ method of Wu et al. (2010) has been proposed, which considers Lasso and Adaptive Lasso for estimation and variable selection. Computational algorithms have been developed in order to calculate the penalized SIQ estimates. A simulation study and a real data application have been used to assess the performance of the methods under consideration.
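
For orientation, a penalized single-index quantile objective of the kind described above can be written roughly as follows. This is a sketch under assumptions rather than the article's exact formulation: the check loss is the standard quantile loss, but the link function g, the weights w_j, the pilot estimate, and the exponent gamma are notation introduced here, and any identifiability constraint on beta is omitted.

```latex
\begin{align*}
\rho_\tau(u) &= u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr), \\
(\hat{\beta}, \hat{g}) &= \arg\min_{\beta,\, g}\;
  \sum_{i=1}^{n} \rho_\tau\bigl(y_i - g(x_i^\top \beta)\bigr)
  + \lambda \sum_{j=1}^{p} w_j\,|\beta_j|, \\
w_j &= 1 \ \text{(Lasso)}, \qquad
w_j = |\tilde{\beta}_j|^{-\gamma} \ \text{(Adaptive Lasso, with pilot estimate } \tilde{\beta}\text{)}.
\end{align*}
```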
Multinomial Inverse Regression for Text Analysis
Text data, including speeches, stories, and other document forms, are often
connected to sentiment variables that are of interest for research in
marketing, economics, and elsewhere. It is also very high dimensional and
difficult to incorporate into statistical analyses. This article introduces a
straightforward framework of sentiment-preserving dimension reduction for text
data. Multinomial inverse regression is introduced as a general tool for
simplifying predictor sets that can be represented as draws from a multinomial
distribution, and we show that logistic regression of phrase counts onto
document annotations can be used to obtain low dimension document
representations that are rich in sentiment information. To facilitate this
modeling, a novel estimation technique is developed for multinomial logistic
regression with very high-dimension response. In particular, independent
Laplace priors with unknown variance are assigned to each regression
coefficient, and we detail an efficient routine for maximization of the joint
posterior over coefficients and their prior scale. This "gamma-lasso" scheme
yields stable and effective estimation for general high-dimension logistic
regression, and we argue that it will be superior to current methods in many
settings. Guidelines for prior specification are provided, algorithm
convergence is detailed, and estimator properties are outlined from the
perspective of the literature on non-concave likelihood penalization. Related
work on sentiment analysis from statistics, econometrics, and machine learning
is surveyed and connected. Finally, the methods are applied in two detailed
examples and we provide out-of-sample prediction studies to illustrate their
effectiveness.
Comment: Published in the Journal of the American Statistical Association 108,
2013, with discussion (rejoinder is here: http://arxiv.org/abs/1304.4200).
Software is available in the textir package for
- …
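
The link between Laplace priors with unknown scale and a non-concave penalty, mentioned in the multinomial inverse regression abstract above, can be sketched with a short derivation. It assumes a rate parameterization of the Laplace prior and a Gamma(s, r) hyperprior on each rate; the symbols lambda_j, s, and r are introduced here for illustration, since the abstract itself gives no formulas.

```latex
\begin{align*}
p(\beta_j \mid \lambda_j) &= \frac{\lambda_j}{2}\, e^{-\lambda_j |\beta_j|},
\qquad
p(\lambda_j) = \frac{r^{s}}{\Gamma(s)}\, \lambda_j^{\,s-1} e^{-r \lambda_j}, \\
-\log p(\beta_j, \lambda_j) &= \lambda_j \bigl(r + |\beta_j|\bigr) - s \log \lambda_j + \text{const}, \\
\hat{\lambda}_j &= \frac{s}{r + |\beta_j|}
\quad \text{(minimizing over } \lambda_j \text{ for fixed } \beta_j\text{)}, \\
-\log p(\beta_j, \hat{\lambda}_j) &= s \log\bigl(r + |\beta_j|\bigr) + \text{const}.
\end{align*}
```

With the rate held fixed, the first line reduces to the usual lasso penalty on the coefficient; maximizing jointly over the rate instead yields a logarithmic penalty that is concave in the absolute coefficient, so large coefficients are shrunk less than under a fixed-rate lasso.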