Feature screening for ultrahigh-dimensional binary classification via linear projection
Linear discriminant analysis (LDA) is one of the most widely used methods in discriminant classification and pattern recognition. However, with the rapid development of information science and technology, the dimensionality of collected data is often high or ultrahigh, which causes LDA to fail. To address this issue, a feature screening procedure based on Fisher's linear projection and a marginal score test is proposed for the ultrahigh-dimensional binary classification problem. The sure screening property is established, ensuring that important features are retained and irrelevant predictors are eliminated. The finite-sample properties of the proposed procedure are assessed by Monte Carlo simulation studies and a real-data example.
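A minimal sketch of the general idea of marginal screening for binary classification is given below. This is not the authors' exact procedure: it ranks each feature by a two-sample t-type statistic as a stand-in for the marginal score test, and keeps the top d features. The function name, the cutoff d, and the toy data are all illustrative assumptions.

```python
# Marginal screening sketch for binary classification (illustrative, not the paper's method).
import numpy as np

def marginal_screen(X, y, d):
    """Rank the columns of X (n x p) by a marginal two-sample statistic for
    binary labels y in {0, 1}; return the indices of the top d features."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    mean_diff = X1.mean(axis=0) - X0.mean(axis=0)
    pooled_var = ((n0 - 1) * X0.var(axis=0, ddof=1)
                  + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    stat = np.abs(mean_diff) / np.sqrt(pooled_var * (1 / n0 + 1 / n1) + 1e-12)
    return np.argsort(stat)[::-1][:d]

# Toy usage: only the first three of 1000 features affect the label.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
y = (X[:, 0] + 0.8 * X[:, 1] - 0.8 * X[:, 2] + rng.standard_normal(n) > 0).astype(int)
print(sorted(marginal_screen(X, y, 10)))   # the active indices 0, 1, 2 should survive
```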
Variable Screening for High Dimensional Time Series
Variable selection is a widely studied problem in high-dimensional statistics, primarily because estimating the precise relationship between the covariates and the response is of great importance in many scientific disciplines. However, most of the theory and methods developed towards this goal for the linear model invoke the assumption of iid sub-Gaussian covariates and errors. This paper analyzes the theoretical properties of Sure Independence Screening (SIS) (Fan and Lv [J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849-911]) for high-dimensional linear models with dependent and/or heavy-tailed covariates and errors. We also introduce a generalized least squares screening (GLSS) procedure which utilizes the serial correlation present in the data. By exploiting this serial correlation when estimating the marginal effects, GLSS is shown to outperform SIS in many cases. For both procedures we prove sure screening properties, which depend on the moment conditions and the strength of dependence in the error and covariate processes, amongst other factors. Additionally, combining these screening procedures with the adaptive Lasso is analyzed. Dependence is quantified by functional dependence measures (Wu [Proc. Natl. Acad. Sci. USA 102 (2005) 14150-14154]), and the results rely on Nagaev-type and exponential inequalities for dependent random variables. We also conduct simulations to demonstrate the finite-sample performance of these procedures, and include a real-data application of forecasting the US inflation rate.
Comment: Published in the Electronic Journal of Statistics (https://projecteuclid.org/euclid.ejs/1519700498)
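A minimal sketch of plain SIS by marginal correlation, in the spirit of Fan and Lv (2008), is shown below. It is not the GLSS procedure of this paper (GLSS additionally exploits the estimated serial correlation when computing marginal effects); the cutoff d = n/log(n) is the conventional choice, and the AR(1) toy covariates are an assumption for illustration.

```python
# Sure Independence Screening (SIS) by marginal correlation (sketch).
import numpy as np

def sis(X, y, d=None):
    """Return the indices of the d features with the largest absolute
    marginal sample correlation with the response y."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))                      # conventional SIS cutoff
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corr = np.abs(Xc.T @ yc) / n                    # marginal sample correlations
    return np.argsort(corr)[::-1][:d]

# Toy usage with serially dependent (AR(1)) covariates.
rng = np.random.default_rng(1)
n, p = 300, 2000
X = np.zeros((n, p))
X[0] = rng.standard_normal(p)
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + np.sqrt(0.75) * rng.standard_normal(p)
beta = np.zeros(p)
beta[[0, 10, 100]] = [2.0, -1.5, 1.0]               # three active covariates
y = X @ beta + rng.standard_normal(n)
print(sorted(sis(X, y)))                             # should contain 0, 10 and 100
```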
Bayesian Variable Selection for Ultrahigh-dimensional Sparse Linear Models
We propose a Bayesian variable selection procedure for ultrahigh-dimensional linear regression models. The number of regressors involved in the regression is allowed to grow exponentially with the sample size. Assuming the true model to be sparse, in the sense that only a small number of regressors contribute to the model, we propose a set of priors suitable for this regime. The model selection procedure based on the proposed set of priors is shown to be variable selection consistent when all possible models are considered. In the ultrahigh-dimensional setting, selecting the true model among all possible ones involves prohibitive computation. To cope with this, we present a two-step model selection algorithm based on screening and Gibbs sampling. The first step of screening discards a large set of unimportant covariates and retains a smaller set containing all the active covariates with probability tending to one. In the next step, we search for the best model among the covariates obtained in the screening step. This procedure is computationally fast, simple, and intuitive. We demonstrate the competitive performance of the proposed algorithm on a variety of simulated and real data sets when compared with several frequentist as well as Bayesian methods.
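A minimal sketch of such a screen-then-select strategy follows. Step 1 is marginal-correlation screening; step 2 runs a simple Gibbs sampler over inclusion indicators under Zellner's g-prior. The prior choices (g = n, prior inclusion probability w), the screening cutoff, and all function names are illustrative assumptions rather than the paper's actual priors or algorithm.

```python
# Two-step screening + Gibbs variable selection (illustrative sketch).
import numpy as np

def screen(X, y, d):
    """Step 1: keep the d covariates with largest marginal correlation."""
    corr = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    return np.sort(np.argsort(corr)[::-1][:d])

def log_marginal(Xs, yc, gamma, g):
    """Log marginal likelihood of the model indexed by gamma under a g-prior."""
    ssy = yc @ yc
    k = int(gamma.sum())
    if k == 0:
        return -0.5 * (len(yc) - 1) * np.log(ssy)
    Xg = Xs[:, gamma]
    beta_hat, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
    r2 = yc @ (Xg @ beta_hat)                       # regression sum of squares
    return (-0.5 * k * np.log(1 + g)
            - 0.5 * (len(yc) - 1) * np.log(ssy - g / (1 + g) * r2))

def gibbs_select(X, y, d=20, iters=1000, g=None, w=0.1, seed=0):
    """Step 2: Gibbs sampling over inclusion indicators of the screened set."""
    rng = np.random.default_rng(seed)
    keep = screen(X, y, d)
    Xs = X[:, keep] - X[:, keep].mean(0)            # work with centered data
    yc = y - y.mean()
    g = g or len(y)
    gamma = np.zeros(d, dtype=bool)
    counts = np.zeros(d)
    for _ in range(iters):
        for j in range(d):                          # flip each indicator in turn
            logp = np.empty(2)
            for val in (0, 1):
                gamma[j] = bool(val)
                logp[val] = (log_marginal(Xs, yc, gamma, g)
                             + gamma.sum() * np.log(w)
                             + (d - gamma.sum()) * np.log(1 - w))
            prob1 = 1.0 / (1.0 + np.exp(np.clip(logp[0] - logp[1], -700, 700)))
            gamma[j] = rng.random() < prob1
        counts += gamma
    return keep, counts / iters                     # posterior inclusion frequencies

# Toy usage: two active covariates out of 1000.
rng = np.random.default_rng(2)
n, p = 150, 1000
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 5] + rng.standard_normal(n)
keep, freq = gibbs_select(X, y, d=15, iters=500)
print([int(k) for k, f in zip(keep, freq) if f > 0.5])   # expected: [0, 5]
```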
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data scientists. On one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and how these features drive changes in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
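A toy simulation (not from the article) illustrates one of the phenomena named above, spurious correlation: when the number of pure-noise covariates greatly exceeds the sample size, some of them appear strongly correlated with a response that is independent of all of them. The sample sizes and dimensions below are arbitrary choices for illustration.

```python
# Spurious correlation demo: max |sample correlation| grows with dimension p
# even though the response is independent of every covariate.
import numpy as np

rng = np.random.default_rng(0)
n = 60
for p in (100, 1000, 10000):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)                      # independent of every column of X
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    max_corr = np.abs(Xc.T @ yc / n).max()
    print(f"p = {p:6d}: max |sample correlation| = {max_corr:.2f}")
```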