
    The Influence Function of Penalized Regression Estimators

    To perform regression analysis in high dimensions, lasso or ridge estimation is a common choice. However, it has been shown that these methods are not robust to outliers. Therefore, alternatives such as penalized M-estimation or the sparse least trimmed squares (LTS) estimator have been proposed. The robustness of these regression methods can be measured with the influence function, which quantifies the effect of infinitesimal perturbations in the data. Furthermore, it can be used to compute the asymptotic variance and the mean squared error. In this paper we compute the influence function, the asymptotic variance and the mean squared error for penalized M-estimators and the sparse LTS estimator. The asymptotic biasedness of the estimators makes the calculations nonstandard. We show that only M-estimators whose loss function has a bounded derivative are robust against regression outliers. In particular, the lasso has an unbounded influence function.
    Comment: appears in Statistics: A Journal of Theoretical and Applied Statistics, 201
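
    The bounded-versus-unbounded contrast can be seen empirically. Below is a minimal sketch (our illustration, not code from the paper) of a finite-sample sensitivity curve: one response value is dragged further and further out, and the maximal coefficient shift is recorded for a lasso fit and for a Huber M-estimator; all data-generating choices and tuning values are assumptions.

```python
# Finite-sample sensitivity curve: how far does one contaminated response
# move the coefficient estimates of a lasso fit vs. a Huber M-estimator?
import numpy as np
from sklearn.linear_model import Lasso, HuberRegressor

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ beta + 0.5 * rng.standard_normal(n)

lasso_clean = Lasso(alpha=0.1).fit(X, y).coef_
huber_clean = HuberRegressor(epsilon=1.35, max_iter=1000).fit(X, y).coef_

for z in [5, 20, 80, 320]:
    y_bad = y.copy()
    y_bad[0] = z                                   # one gross outlier
    d_lasso = np.abs(Lasso(alpha=0.1).fit(X, y_bad).coef_ - lasso_clean).max()
    d_huber = np.abs(HuberRegressor(epsilon=1.35, max_iter=1000)
                     .fit(X, y_bad).coef_ - huber_clean).max()
    print(f"outlier at {z:4d}: lasso shift {d_lasso:.3f}, huber shift {d_huber:.3f}")
# Expected pattern: the lasso shift keeps growing with the outlier size
# (unbounded influence), while the Huber shift levels off (bounded psi).
```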

    Robustness in sparse linear models: relative efficiency based on robust approximate message passing

    Understanding efficiency in high dimensional linear models is a longstanding problem of interest. Classical work with smaller dimensional problems dating back to Huber and Bickel has illustrated the benefits of efficient loss functions. When the number of parameters p is of the same order as the sample size n, p ≈ n, an efficiency pattern different from that of Huber was recently established. In this work, we consider the effects of model selection on the estimation efficiency of penalized methods. In particular, we explore whether sparsity results in new efficiency patterns when p > n. In the interest of deriving the asymptotic mean squared error for regularized M-estimators, we use the powerful framework of approximate message passing. We propose a novel, robust and sparse approximate message passing algorithm (RAMP), which is adaptive to the error distribution. Our algorithm includes many non-quadratic and non-differentiable loss functions. We derive its asymptotic mean squared error and show its convergence, while allowing p, n, s → ∞ with n/p ∈ (0, 1) and n/s ∈ (1, ∞). We identify new patterns of relative efficiency regarding a number of penalized M-estimators when p is much larger than n. We show that the classical information bound is no longer reachable, even for light-tailed error distributions. We show that the penalized least absolute deviation estimator dominates the penalized least squares estimator in cases of heavy-tailed distributions. We observe this pattern for all choices of the number of non-zero parameters s, both s ≤ n and s ≈ n. In non-penalized problems where s = p ≈ n, the opposite regime holds. Therefore, we discover that the presence of model selection significantly changes the efficiency patterns.
    Comment: 49 pages, 10 figures
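
    For readers unfamiliar with the machinery, the sketch below shows the classical soft-thresholding AMP iteration for the quadratic-loss (lasso) case, including the Onsager correction; RAMP replaces the residual and denoising steps with robust, possibly non-differentiable counterparts, which this toy version does not attempt. Problem sizes and the threshold schedule are assumptions.

```python
# Classical soft-thresholding AMP for y = A x + noise (quadratic-loss case).
# RAMP, as described above, swaps in robust residual/denoising steps; this
# toy version only shows the baseline Onsager-corrected loop.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(1)
n, p, s = 250, 500, 20                      # n < p with a sparse signal
A = rng.standard_normal((n, p)) / np.sqrt(n)
x_true = np.zeros(p)
x_true[:s] = 3.0 * rng.standard_normal(s)
y = A @ x_true + 0.1 * rng.standard_normal(n)

delta = n / p                               # undersampling ratio
x = np.zeros(p)
z = y.copy()
for _ in range(30):
    pseudo = x + A.T @ z                    # effective observation of x
    theta = 1.5 * np.std(z)                 # crude threshold schedule (assumption)
    x_new = soft(pseudo, theta)
    onsager = (z / delta) * np.mean(np.abs(x_new) > 0)   # <eta'> correction
    z = y - A @ x_new + onsager
    x = x_new
print("relative recovery error:",
      np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```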

    Robust high-dimensional data analysis using a weight shrinkage rule

    In high-dimensional settings, a penalized least squares approach may lose its efficiency in both estimation and variable selection due to the existence of either outliers or heteroscedasticity. In this thesis, we propose a novel approach to perform robust high-dimensional data analysis in a penalized weighted least squares framework. The main idea is to relate the irregularity of each observation to a weight vector and to obtain the outlying status data-adaptively using a weight shrinkage rule. By using L1-type regularization on both the coefficients and the weight vectors, the proposed method is able to perform simultaneous variable selection and outlier detection efficiently. Eventually, this procedure results in estimators with potentially strong robustness and non-asymptotic consistency. We provide a unified link between the weight shrinkage rule and robust M-estimation in general settings. We also establish non-asymptotic oracle inequalities for the joint estimation of both the regression coefficients and the weight vectors. These theoretical results allow the number of variables to far exceed the sample size. The performance of the proposed estimator is demonstrated in both simulation studies and real examples.
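
    A closely related formulation (our simplification, not necessarily the thesis's exact estimator) augments the linear model with a per-observation shift vector and puts L1 penalties on both the coefficients and the shifts, so that a nonzero shift flags an outlier. The sketch below runs a minimal alternating scheme under that related formulation: fix the shifts and solve a lasso, then fix the coefficients and soft-threshold the residuals.

```python
# Alternating scheme for a mean-shift outlier model: coefficients get an L1
# penalty (via sklearn's Lasso, whose scaling is absorbed into lam_b) and
# per-observation shifts g are soft-thresholded; nonzero g_i flags an outlier.
import numpy as np
from sklearn.linear_model import Lasso

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mean_shift_lasso(X, y, lam_b=0.1, lam_g=1.0, n_iter=20):
    g = np.zeros(len(y))
    lasso = Lasso(alpha=lam_b)
    for _ in range(n_iter):
        lasso.fit(X, y - g)               # coefficient step on shift-corrected y
        r = y - lasso.predict(X)          # residuals given current coefficients
        g = soft(r, lam_g)                # shift step: threshold the residuals
    return lasso.coef_, g

rng = np.random.default_rng(2)
n, p = 80, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + 0.3 * rng.standard_normal(n)
y[:4] += 15.0                             # plant four gross outliers
coef, g = mean_shift_lasso(X, y)
print("flagged outliers:", np.nonzero(g)[0])   # ideally indices 0..3
```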

    High-dimensional variable selection for genomics data, from both frequentist and Bayesian perspectives

    Doctor of Philosophy, Department of Statistics. Major Professor: Cen Wu.
    Variable selection is one of the most popular tools for analyzing high-dimensional genomic data. It has been developed to accommodate complex data structures and to lead to structured sparse identification of important genomic features. We focus on the network and interaction structures that commonly exist in genomic data, and develop novel variable selection methods from both frequentist and Bayesian perspectives.
    Network-based regularization has achieved success in variable selection for high-dimensional cancer genomic data due to its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions and are contaminated by outliers, network-constrained regularization that does not take robustness into account leads to false identification of network structure and biased estimation of patients' survival. In the first project, we develop a novel robust network-based variable selection method under the accelerated failure time (AFT) model. Extensive simulation studies show the advantage of the proposed method over alternative methods. Promising findings are made in two case studies of lung cancer datasets with high-dimensional gene expression measurements.
    Gene-environment (G×E) interactions are important for the elucidation of disease etiology beyond the main genetic and environmental effects. In the second project, a novel and powerful semi-parametric Bayesian variable selection model is proposed to investigate linear and nonlinear G×E interactions simultaneously. It can further conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Simulation shows that the proposed model conducts Bayesian variable selection more efficiently and accurately than competing alternatives in terms of both identification and prediction. In the case study, the proposed Bayesian method leads to the identification of effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
    In the last project, a robust Bayesian variable selection method is developed for G×E interaction studies. The proposed robust Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. Spike-and-slab priors are incorporated at both the individual and group levels to identify the sparse main and interaction effects. Extensive simulation studies and analyses of both the diabetes data with SNP measurements from the Nurses' Health Study and the TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.
    To facilitate reproducible research and fast computation, we have developed open-source R packages for each project, which provide highly efficient C++ implementations of all the proposed and alternative approaches. The R packages regnet and spinBayes, associated with the first and second projects respectively, are available on CRAN. For the third project, the R package robin is available from GitHub and will be submitted to CRAN soon.
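
    As a minimal illustration of the network-constrained regularization idea behind the first project (stripped of the AFT model and the robustness component, and unrelated to the actual regnet implementation), the sketch below runs proximal gradient descent on a lasso objective augmented with a graph-Laplacian penalty; the toy network and tuning values are assumptions.

```python
# ISTA for network-constrained regression:
#   (1/(2n))||y - X b||^2 + lam2 * b' L b + lam1 * ||b||_1,
# where L is a graph Laplacian encoding links among features.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def network_lasso(X, y, L, lam1=0.05, lam2=0.1, n_iter=500):
    n, p = X.shape
    # Step size = 1 / Lipschitz constant of the smooth part.
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + 2 * lam2 * np.linalg.norm(L, 2))
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ b) / n + 2 * lam2 * (L @ b)  # smooth gradient
        b = soft(b - step * grad, step * lam1)              # prox of lam1*||.||_1
    return b

# Toy network: features 0 and 1 are linked, so the Laplacian penalty
# (b0 - b1)^2 encourages their coefficients to agree.
rng = np.random.default_rng(3)
n, p = 60, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[0] = beta[1] = 2.0
y = X @ beta + 0.3 * rng.standard_normal(n)
L = np.zeros((p, p)); L[0, 0] = L[1, 1] = 1.0; L[0, 1] = L[1, 0] = -1.0
print(np.round(network_lasso(X, y, L)[:4], 2))
```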

    Robust penalized regression for complex high-dimensional data

    Robust high-dimensional data analysis has become an important and challenging task in complex Big Data analysis due to high dimensionality and data contamination. One of the most popular procedures is robust penalized regression. In this dissertation, we address three typical robust ultra-high dimensional regression problems via penalized regression approaches. The first problem concerns the linear model in the presence of outliers, dealing with outlier detection, variable selection and parameter estimation simultaneously. The second problem concerns robust high-dimensional mean regression under irregular settings such as data contamination, data asymmetry and heteroscedasticity. The third problem concerns robust bi-level variable selection for the linear regression model with grouping structures in the covariates. In Chapter 1, we introduce the background and challenges through overviews of penalized least squares methods and robust regression techniques.
    In Chapter 2, we propose a novel approach in a penalized weighted least squares framework to perform simultaneous variable selection and outlier detection. We provide a unified link between the proposed framework and robust M-estimation in general settings. We also establish non-asymptotic oracle inequalities for the joint estimation of both the regression coefficients and the weight vectors.
    In Chapter 3, we establish a framework of robust estimators in high-dimensional regression models using Penalized Robust Approximated quadratic M-estimation (PRAM). This framework allows general settings in which the random errors lack symmetry and homogeneity or the covariates are not sub-Gaussian. Theoretically, we show that, in the ultra-high dimension setting, the PRAM estimator achieves local estimation consistency at the minimax rate enjoyed by the LS-Lasso and owns the local oracle property, under certain mild conditions.
    In Chapter 4, we extend the study of Chapter 3 to robust high-dimensional data analysis with structured sparsity. In particular, we propose a framework of high-dimensional M-estimators for bi-level variable selection, which encourages bi-level sparsity through a computationally efficient two-stage procedure and produces strongly robust parameter estimators when nonconvex redescending loss functions are applied. In theory, we provide sufficient conditions under which the proposed two-stage penalized M-estimator possesses simultaneous local estimation consistency and bi-level variable selection consistency if a certain nonconvex penalty function is used at the group level. The performance of the proposed estimators is demonstrated in both simulation studies and real examples.
    In Chapter 5, we provide some discussions and future work.
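
    The robust-loss-plus-penalty template behind Chapters 2 through 4 can be illustrated with a simple special case: L1-penalized Huber regression solved by proximal gradient descent. This is a hedged stand-in, not the PRAM estimator itself (which uses an approximated quadratic construction); the Huber cutoff and tuning values are assumptions.

```python
# Proximal gradient (ISTA) for L1-penalized Huber regression:
#   (1/n) * sum_i huber(y_i - x_i'b) + lam * ||b||_1.
import numpy as np

def psi(r, delta):
    # Derivative of the Huber loss: linear in the middle, clipped in the
    # tails, which is what bounds the influence of gross outliers.
    return np.clip(r, -delta, delta)

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def huber_lasso(X, y, lam=0.05, delta=1.345, n_iter=500):
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2     # 1/L for the smooth Huber part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ psi(y - X @ b, delta) / n
        b = soft(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(4)
n, p = 100, 300
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.3 * rng.standard_normal(n)
y[:5] += 20.0                                # heavy contamination
print(np.round(huber_lasso(X, y)[:5], 2))    # first three should remain
                                             # close to the truth
```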

    HB-PLS: A statistical method for identifying biological process or pathway regulators by integrating Huber loss and Berhu penalty with partial least squares regression

    Gene expression data feature high dimensionality, multicollinearity, and non-Gaussian noise, posing hurdles for the identification of the true regulatory genes controlling a biological process or pathway. In this study, we integrated the Huber loss function and the Berhu penalty (HB) into the partial least squares (PLS) framework to deal with the high dimensionality and multicollinearity of gene expression data, and developed a new method called HB-PLS regression to model the relationships between regulatory genes and pathway genes. To solve the Huber-Berhu optimization problem, we developed an accelerated proximal gradient descent algorithm that is at least 10 times faster than the general convex optimization solver (CVX). Application of HB-PLS to recognize pathway regulators of lignin biosynthesis and photosynthesis in Arabidopsis thaliana led to the identification of many known positive pathway regulators that had previously been experimentally validated. Compared to sparse partial least squares (SPLS) regression, an efficient method for variable selection and dimension reduction in handling multicollinearity, HB-PLS has higher efficacy in identifying more known positive regulators and, in ranking the true positive known regulators to the top of the output regulatory gene lists for the two aforementioned pathways, a much higher sensitivity/(1-specificity) for one pathway and a slightly lower one for the other. In addition, each method identified some unique regulators that the other could not. Our results show that the overall performance of HB-PLS slightly exceeds that of SPLS, but both methods are instrumental for identifying real pathway regulators from high-throughput gene expression data, suggesting that the integration of statistics, machine learning and convex optimization can yield a method with high efficacy that is worth further exploration.
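
    The distinctive ingredient here is the Berhu (reverse Huber) penalty, which acts like the lasso near zero and like ridge in the tails. Its proximal operator has a closed form, sketched below from the standard derivation (not code from the paper); this is the map an accelerated proximal gradient solver would apply once per iteration.

```python
# Closed-form proximal operator of the Berhu (reverse Huber) penalty
#   B(t) = |t|                 if |t| <= c
#        = (t^2 + c^2) / (2c)  if |t| >  c,
# i.e. lasso-like near zero, ridge-like in the tails.
import numpy as np
from scipy.optimize import minimize_scalar

def prox_berhu(v, lam, c=1.0):
    v = np.asarray(v, dtype=float)
    out = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)   # soft-threshold branch
    big = np.abs(v) > c + lam                             # ridge-like branch
    return np.where(big, c * v / (c + lam), out)

def berhu(t, c=1.0):
    return abs(t) if abs(t) <= c else (t * t + c * c) / (2 * c)

# Quick numerical check against brute-force scalar minimization.
for v in [-3.0, -0.4, 0.2, 1.3, 5.0]:
    brute = minimize_scalar(lambda t: 0.5 * (t - v) ** 2 + 0.3 * berhu(t),
                            bounds=(-10, 10), method="bounded").x
    assert abs(prox_berhu(v, 0.3) - brute) < 1e-4
print("Berhu prox matches brute-force minimization.")
```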

    Robust variable screening for regression using factor profiling

    Sure Independence Screening is a fast procedure for variable selection in ultra-high dimensional regression analysis. Unfortunately, its performance greatly deteriorates with increasing dependence among the predictors. To solve this issue, Factor Profiled Sure Independence Screening (FPSIS) models the correlation structure of the predictor variables, assuming that it can be represented by a few latent factors. The correlations can then be profiled out by projecting the data onto the orthogonal complement of the subspace spanned by these factors. However, neither of these methods can handle the presence of outliers in the data. Therefore, we propose a robust screening method which uses least trimmed squares to estimate the latent factors and the factor profiled variables. Variable screening is then performed on the factor profiled variables using regression MM-estimators. Different types of outliers in this model and their roles in variable screening are studied. Both simulation studies and a real data analysis show that the proposed robust procedure performs well on clean data and outperforms the two nonrobust methods on contaminated data.
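
    For context, the non-robust baseline step is plain Sure Independence Screening: rank predictors by marginal correlation with the response and keep the top few. The sketch below shows only that baseline (the factor profiling, LTS estimation and MM-screening stages are not reproduced); the cutoff d = n/log(n) is a common choice but an assumption here.

```python
# Baseline Sure Independence Screening: rank predictors by marginal
# correlation with the response and keep the top d of them.
import numpy as np

def sis(X, y, d):
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    score = np.abs(Xs.T @ ys) / len(y)        # marginal correlations
    return np.argsort(score)[::-1][:d]        # indices of top-d predictors

rng = np.random.default_rng(5)
n, p = 100, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[[10, 50, 400]] = [3.0, -2.0, 2.5]
y = X @ beta + rng.standard_normal(n)
keep = sis(X, y, d=int(n / np.log(n)))        # common cutoff choice
print("active set recovered:", {10, 50, 400} <= set(keep))
```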