
    Feature screening for ultrahigh-dimensional binary classification via linear projection

    Linear discriminant analysis (LDA) is one of the most widely used methods in discriminant classification and pattern recognition. However, with the rapid development of information science and technology, collected data are now often high- or ultrahigh-dimensional, a setting in which standard LDA breaks down. To address this issue, a feature screening procedure based on Fisher's linear projection and a marginal score test is proposed for the ultrahigh-dimensional binary classification problem. The sure screening property is established, ensuring that the important features are retained while the irrelevant predictors are eliminated. The finite-sample performance of the proposed procedure is assessed through Monte Carlo simulation studies and a real-data example.
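    To make the marginal-screening idea concrete, here is a minimal sketch assuming NumPy. The paper's exact marginal score statistic is not reproduced; a standardized two-sample mean difference serves as a stand-in marginal statistic, and the function name and parameters are illustrative only.

```python
import numpy as np

def marginal_screen_binary(X, y, d):
    """Rank each feature by a standardized two-sample mean difference
    between the two classes and keep the top d (a stand-in for the
    paper's marginal score test)."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # Pooled variance per feature.
    pooled_var = ((n0 - 1) * X0.var(axis=0, ddof=1)
                  + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    stat = np.abs(X1.mean(axis=0) - X0.mean(axis=0)) \
        / np.sqrt(pooled_var * (1 / n0 + 1 / n1))
    # Indices of the d features with the largest marginal statistics.
    return np.argsort(stat)[::-1][:d]
```

    A common default in the screening literature is to retain d on the order of n / log n features, where n is the sample size.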

    Variable Screening for High Dimensional Time Series

    Variable selection is a widely studied problem in high-dimensional statistics, primarily because estimating the precise relationship between the covariates and the response is of great importance in many scientific disciplines. However, most of the theory and methods developed toward this goal for the linear model invoke the assumption of iid sub-Gaussian covariates and errors. This paper analyzes the theoretical properties of Sure Independence Screening (SIS) (Fan and Lv [J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849-911]) for high-dimensional linear models with dependent and/or heavy-tailed covariates and errors. We also introduce a generalized least squares screening (GLSS) procedure which utilizes the serial correlation present in the data. By exploiting this serial correlation when estimating the marginal effects, GLSS is shown to outperform SIS in many cases. For both procedures we prove sure screening properties, which depend on the moment conditions and the strength of dependence in the error and covariate processes, among other factors. Additionally, we analyze the combination of these screening procedures with the adaptive Lasso. Dependence is quantified by functional dependence measures (Wu [Proc. Natl. Acad. Sci. USA 102 (2005) 14150-14154]), and the results rely on Nagaev-type and exponential inequalities for dependent random variables. We also conduct simulations to demonstrate the finite-sample performance of these procedures, and include a real-data application of forecasting the US inflation rate.

    Comment: Published in the Electronic Journal of Statistics (https://projecteuclid.org/euclid.ejs/1519700498)
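    The core of SIS is to rank covariates by their absolute marginal correlation with the response. The sketch below, assuming NumPy, also includes a quasi-differencing (AR(1) whitening) variant as a simplified stand-in for GLSS; the paper's actual GLSS estimates the error covariance from the data rather than taking an AR(1) coefficient as given.

```python
import numpy as np

def sis(X, y, d):
    """Sure Independence Screening: keep the d covariates with the
    largest absolute marginal correlation with the response."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:d]

def glss_ar1(X, y, d, rho):
    """Quasi-difference (Cochrane-Orcutt) transform to whiten AR(1)
    errors, then screen on the transformed data -- a simplified
    stand-in for the paper's GLSS."""
    Xw = X[1:] - rho * X[:-1]
    yw = y[1:] - rho * y[:-1]
    return sis(Xw, yw, d)
```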

    Bayesian Variable Selection for Ultrahigh-dimensional Sparse Linear Models

    We propose a Bayesian variable selection procedure for ultrahigh-dimensional linear regression models. The number of regressors, $p_n$, is allowed to grow exponentially with $n$. Assuming the true model to be sparse, in the sense that only a small number of regressors contribute to it, we propose a set of priors suitable for this regime. The model selection procedure based on the proposed priors is shown to be variable selection consistent when all $2^{p_n}$ models are considered. In the ultrahigh-dimensional setting, selecting the true model among all $2^{p_n}$ possible ones involves prohibitive computation. To cope with this, we present a two-step model selection algorithm based on screening and Gibbs sampling. The screening step discards a large set of unimportant covariates and retains a smaller set containing all the active covariates with probability tending to one. In the second step, we search for the best model among the covariates obtained in the screening step. This procedure is computationally fast, simple, and intuitive. We demonstrate competitive performance of the proposed algorithm on a variety of simulated and real data sets when compared with several frequentist as well as Bayesian methods.
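    A hedged sketch of the two-step idea, assuming NumPy: step 1 screens by marginal correlation, and step 2 runs a single-flip Metropolis search over inclusion indicators. BIC is used as a rough stand-in for the paper's posterior model probabilities; the paper's specific priors and Gibbs sampler are not reproduced, and all names and defaults here are illustrative.

```python
import numpy as np

def bic(X, y, subset):
    """BIC of an OLS fit on the given covariate subset (intercept included)."""
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, subset]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + (len(subset) + 1) * np.log(n)

def two_step_select(X, y, m=50, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: screening -- keep the m covariates most correlated with y.
    corr = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    keep = np.argsort(corr)[::-1][:m]
    # Step 2: single-flip Metropolis over models on the screened set;
    # exp(-(BIC_new - BIC_old)/2) approximates the posterior odds.
    gamma = np.zeros(m, dtype=bool)
    score = bic(X[:, keep], y, np.flatnonzero(gamma))
    best, best_score = gamma.copy(), score
    for _ in range(n_iter):
        j = rng.integers(m)
        gamma[j] = ~gamma[j]          # propose flipping one indicator
        new = bic(X[:, keep], y, np.flatnonzero(gamma))
        if np.log(rng.random()) < -(new - score) / 2:
            score = new               # accept the move
            if score < best_score:
                best, best_score = gamma.copy(), score
        else:
            gamma[j] = ~gamma[j]      # reject: revert the flip
    return keep[best]
```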

    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and how these features drive paradigm shifts in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; they can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
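    The spurious-correlation point is easy to demonstrate numerically. A small simulation, assuming NumPy; the sizes n = 100 and p = 10,000 are illustrative choices, not taken from the article:

```python
import numpy as np

# With n = 100 iid Gaussian samples and p = 10,000 pure-noise covariates,
# the largest absolute marginal correlation with an independent response
# is large purely by chance.
rng = np.random.default_rng(0)
n, p = 100, 10_000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # independent of every column of X
Xc = (X - X.mean(0)) / X.std(0)     # standardize columns
yc = (y - y.mean()) / y.std()
corr = np.abs(Xc.T @ yc) / n        # sample correlations
print(f"max |corr| over {p} noise features: {corr.max():.2f}")
# Typically around 0.4 -- easily mistaken for a real signal.
```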