87 research outputs found
No penalty no tears: Least squares in high-dimensional linear models
Ordinary least squares (OLS) is the default method for fitting linear models,
but is not applicable for problems with dimensionality larger than the sample
size. For these problems, we advocate the use of a generalized version of OLS
motivated by ridge regression, and propose two novel three-step algorithms
involving least squares fitting and hard thresholding. The algorithms are
methodologically simple to understand intuitively, computationally easy to
implement efficiently, and theoretically appealing for choosing models
consistently. Numerical exercises comparing our methods with penalization-based
approaches in simulations and data analyses illustrate the great potential of
the proposed algorithms.
Comment: Added results for non-sparse models; added results for elliptical distributions; added simulations for the adaptive lasso.
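For readers who want a concrete picture, here is a minimal sketch of one plausible reading of such a three-step procedure: a generalized OLS fit of the form X^T (X X^T)^{-1} y (the ridge-regression limit when p > n), hard thresholding to keep the largest coefficients, and an OLS refit on the retained variables. The function names, the top-k thresholding rule, and the choice of k are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def generalized_ols(X, y):
    """Generalized OLS for p > n: beta_hat = X^T (X X^T)^{-1} y (ridge limit as lambda -> 0)."""
    return X.T @ np.linalg.solve(X @ X.T, y)

def three_step_fit(X, y, k):
    """Illustrative three-step fit: generalized OLS, hard thresholding, OLS refit.
    k (the number of coefficients kept after thresholding) is an assumed tuning choice."""
    beta = generalized_ols(X, y)                    # step 1: least-squares-type fit
    keep = np.argsort(np.abs(beta))[::-1][:k]       # step 2: hard-threshold to the k largest entries
    coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)  # step 3: OLS refit on retained variables
    beta_refit = np.zeros_like(beta)
    beta_refit[keep] = coef
    return beta_refit, np.sort(keep)
```

In this reading, hard thresholding takes over the role a penalty plays in lasso-type methods, while both fitting steps remain ordinary (generalized) least squares.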
Asymptotics in directed exponential random graph models with an increasing bi-degree sequence
Although asymptotic analyses of undirected network models based on degree
sequences have started to appear in recent literature, it remains an open
problem to study statistical properties of directed network models. In this
paper, we provide for the first time a rigorous analysis of directed
exponential random graph models using the in-degrees and out-degrees as
sufficient statistics with binary as well as continuous weighted edges. We
establish the uniform consistency and the asymptotic normality for the maximum
likelihood estimate, when the number of parameters grows and only one realized
observation of the graph is available. One key technique in the proofs is to
approximate the inverse of the Fisher information matrix using a simple matrix
with high accuracy. Numerical studies confirm our theoretical findings.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1343 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
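As a rough illustration of the binary-edge case, a model with in- and out-degrees as sufficient statistics can be written as P(A_ij = 1) = exp(a_i + b_j) / (1 + exp(a_i + b_j)) for i ≠ j, and the MLE solves the moment equations matching expected to observed degrees. The sketch below solves these equations by a damped fixed-point/gradient iteration; the step size, the centering used for identifiability, and the stopping rule are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

def fit_directed_degree_model(A, n_iter=2000, tol=1e-8):
    """MLE for a binary directed graph whose sufficient statistics are the out- and
    in-degrees: P(A_ij = 1) = exp(a_i + b_j) / (1 + exp(a_i + b_j)), i != j.
    Solved here by a damped gradient/fixed-point iteration on the moment equations
    (an illustrative solver, not necessarily the one used in the paper)."""
    n = A.shape[0]
    d_out = A.sum(axis=1).astype(float)      # observed out-degrees
    d_in = A.sum(axis=0).astype(float)       # observed in-degrees
    a = np.zeros(n)
    b = np.zeros(n)
    mask = 1.0 - np.eye(n)                   # exclude self-loops
    for _ in range(n_iter):
        P = mask / (1.0 + np.exp(-(a[:, None] + b[None, :])))   # edge probabilities
        step_a = (d_out - P.sum(axis=1)) / n     # moment-equation residuals, damped by 1/n
        step_b = (d_in - P.sum(axis=0)) / n
        a, b = a + step_a, b + step_b
        shift = b.mean()                         # resolve the additive non-identifiability
        a, b = a + shift, b - shift
        if max(np.abs(step_a).max(), np.abs(step_b).max()) < tol:
            break
    return a, b
```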
Dynamic Linear Discriminant Analysis in High Dimensional Space
High-dimensional data that evolve dynamically are a prominent feature of the modern data era. As a partial response, recent years have seen increasing emphasis on addressing the dimensionality challenge. However, the
non-static nature of these datasets is largely ignored. This paper addresses
both challenges by proposing a novel yet simple dynamic linear programming
discriminant (DLPD) rule for binary classification. Different from the usual
static linear discriminant analysis, the new method is able to capture the
changing distributions of the underlying populations by modeling their means
and covariances as smooth functions of covariates of interest. Under an
approximate sparse condition, we show that the conditional misclassification
rate of the DLPD rule converges to the Bayes risk in probability uniformly over
the range of the variables used for modeling the dynamics, when the
dimensionality is allowed to grow exponentially with the sample size. The
minimax lower bound of the estimation of the Bayes risk is also established,
implying that the misclassification rate of our proposed rule is minimax-rate
optimal. The promising performance of the DLPD rule is illustrated via
extensive simulation studies and the analysis of a breast cancer dataset.
Comment: 34 pages; 3 figures.
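To make the idea concrete, here is a rough sketch in the spirit of a dynamic linear-programming discriminant: kernel-smoothed estimates of the two class means and a pooled covariance at the covariate value of interest, followed by an l1-minimizing discriminant direction obtained from a linear program. The Gaussian kernel, the bandwidth h, the tuning parameter lam, and the pooling scheme are assumptions for illustration; the paper's estimator and tuning may differ.

```python
import numpy as np
from scipy.optimize import linprog

def kernel_weights(U, u0, h):
    """Gaussian kernel weights for smoothing at covariate value u0 (bandwidth h is assumed)."""
    w = np.exp(-0.5 * ((U - u0) / h) ** 2)
    return w / w.sum()

def dynamic_lpd_direction(X1, U1, X2, U2, u0, h=0.5, lam=0.1):
    """Illustrative dynamic LPD step: smooth the class means and a pooled covariance at u0,
    then solve  min ||beta||_1  s.t.  ||Sigma_hat beta - (mu1_hat - mu2_hat)||_inf <= lam."""
    w1, w2 = kernel_weights(U1, u0, h), kernel_weights(U2, u0, h)
    mu1, mu2 = X1.T @ w1, X2.T @ w2                      # kernel-smoothed class means
    R1, R2 = X1 - mu1, X2 - mu2
    Sigma = (R1.T * w1) @ R1 / 2 + (R2.T * w2) @ R2 / 2  # weighted pooled covariance
    delta = mu1 - mu2
    p = X1.shape[1]
    # LP variables z = [beta, t]; minimize sum(t) with |beta_j| <= t_j and |Sigma beta - delta| <= lam
    c = np.concatenate([np.zeros(p), np.ones(p)])
    A_ub = np.block([
        [Sigma, np.zeros((p, p))],
        [-Sigma, np.zeros((p, p))],
        [np.eye(p), -np.eye(p)],
        [-np.eye(p), -np.eye(p)],
    ])
    b_ub = np.concatenate([delta + lam, -delta + lam, np.zeros(2 * p)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * p + [(0, None)] * p)
    return res.x[:p], mu1, mu2

def classify(x, beta, mu1, mu2):
    """Assign x to class 1 if the discriminant score is positive (illustrative rule)."""
    return 1 if (x - (mu1 + mu2) / 2) @ beta > 0 else 2
```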
High-dimensional ordinary least-squares projection for screening variables
Variable selection is a challenging issue in statistical applications when the number of predictors p far exceeds the number of observations n. In this ultra-high-dimensional setting, the sure independence screening (SIS) procedure was introduced to significantly reduce the dimensionality by preserving the true model with overwhelming probability, before a refined second-stage analysis. However, this sure screening property relies strongly on the assumption that the important variables in the model have large marginal correlations with the response, which rarely holds in practice. To overcome this, we propose a novel and simple screening technique called high-dimensional ordinary least-squares projection (HOLP). We show that HOLP possesses the sure screening property, gives consistent variable selection without the strong correlation assumption, and has low computational complexity. A ridge-type HOLP procedure is also discussed. Simulation studies show that HOLP performs competitively compared with many other marginal-correlation-based methods. An application to a mammalian eye disease dataset illustrates the attractiveness of HOLP.
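Since the screening step is a single linear-algebra operation, a minimal sketch is easy to give, assuming the estimator takes the form beta_hat = X^T (X X^T)^{-1} y (or the ridge-type variant X^T (X X^T + r I)^{-1} y) and that variables are kept according to the largest |beta_hat_j|. The default screening size d = n below is an assumed choice, not prescribed by the abstract.

```python
import numpy as np

def holp_screen(X, y, d=None, ridge=0.0):
    """High-dimensional ordinary least-squares projection (HOLP) screening sketch.

    Computes beta_hat = X^T (X X^T + ridge * I)^{-1} y (ridge = 0 gives the plain
    projection, ridge > 0 a ridge-type variant) and keeps the d variables with the
    largest |beta_hat|; d defaults to the sample size n (an assumed convention)."""
    n = X.shape[0]
    d = n if d is None else d
    beta = X.T @ np.linalg.solve(X @ X.T + ridge * np.eye(n), y)
    keep = np.argsort(np.abs(beta))[::-1][:d]
    return np.sort(keep), beta
```

A refined second-stage method (e.g., a penalized fit on the screened variables) would then be applied to the reduced design X[:, keep].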
Optimal Subsampling Bootstrap for Massive Data
The bootstrap is a widely used procedure for statistical inference because of
its simplicity and attractive statistical properties. However, the vanilla version of the bootstrap is no longer computationally feasible for many modern massive datasets because of the need to repeatedly resample the entire dataset.
Therefore, several improvements to the bootstrap method have been made in
recent years, which assess the quality of estimators by subsampling the full
dataset before resampling the subsamples. Naturally, the performance of these
modern subsampling methods is influenced by tuning parameters such as the size
of subsamples, the number of subsamples, and the number of resamples per
subsample. In this paper, we develop a novel methodology for selecting these tuning parameters. Formulated as an optimization problem that maximizes a measure of the estimator's accuracy subject to a computational-cost constraint, our framework provides closed-form solutions for the optimal hyperparameter values for the subsampled bootstrap, the subsampled double bootstrap, and the bag of little bootstraps, at little or no extra time cost. Using the mean squared error as a proxy for the accuracy measure, we apply our methodology in a simulation study to compare and improve the performance of these modern versions of the bootstrap developed for massive data. The results are promising.
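As background for the tuning parameters being selected, here is a minimal sketch of the bag of little bootstraps with its three hyperparameters exposed: the subsample size b, the number of subsamples s, and the number of resamples r per subsample. The standard error is used as the accuracy measure and `weighted_mean` as the estimator purely for illustration; the optimal (b, s, r) would come from a selection rule such as the one proposed in the paper.

```python
import numpy as np

def bag_of_little_bootstraps(data, estimator, b, s, r, rng=None):
    """Bag of little bootstraps (BLB) with explicit tuning parameters:
    b = subsample size, s = number of subsamples, r = resamples per subsample.
    Returns the BLB estimate of the estimator's standard error (one possible
    accuracy measure)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    se_per_subsample = []
    for _ in range(s):
        idx = rng.choice(n, size=b, replace=False)             # subsample without replacement
        sub = data[idx]
        stats = []
        for _ in range(r):
            counts = rng.multinomial(n, np.full(b, 1.0 / b))   # resampling weights summing to n
            stats.append(estimator(sub, counts))                # weighted estimator on a size-n resample
        se_per_subsample.append(np.std(stats, ddof=1))          # per-subsample quality assessment
    return float(np.mean(se_per_subsample))                     # average across subsamples

def weighted_mean(x, w):
    """Example estimator: mean of a univariate sample with resampling weights."""
    return np.sum(w * x) / np.sum(w)
```

For instance, a common configuration in the BLB literature takes b on the order of n^0.6 with a few dozen subsamples and resamples; the point of the paper above is to choose these values optimally, by trading accuracy against computational cost, rather than by convention.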