The Influence Function of Penalized Regression Estimators
To perform regression analysis in high dimensions, the lasso and ridge
estimators are common choices. However, these methods have been shown not to
be robust to outliers. Therefore, alternatives such as penalized M-estimation
and the sparse least trimmed squares (LTS) estimator have been proposed. The
robustness of these regression methods can be measured with the influence
function, which quantifies the effect of infinitesimal perturbations in the
data. Furthermore, it can be used to compute the asymptotic variance and the
mean squared error.
In this paper we compute the influence function, the asymptotic variance and
the mean squared error for penalized M-estimators and the sparse LTS estimator.
The asymptotic biasedness of the estimators makes the calculations nonstandard.
We show that only M-estimators whose loss function has a bounded derivative
are robust against regression outliers. In particular, the lasso has an
unbounded influence function.
Comment: appears in Statistics: A Journal of Theoretical and Applied Statistics, 201
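The abstract's key claim, that only M-estimators whose loss derivative (score function) is bounded are robust, can be illustrated with a toy comparison of the least squares and Huber score functions. This is a minimal sketch; the function names are hypothetical, purely for illustration:

```python
import numpy as np

def psi_ls(r):
    # Score (derivative of the loss) for least squares: unbounded in r,
    # so a single gross outlier can dominate the estimating equations.
    return 2.0 * r

def psi_huber(r, c=1.345):
    # Score for the Huber loss: clipped at +/- c, hence bounded, which
    # is what caps the influence of any single observation.
    return np.clip(r, -c, c)

residuals = np.array([0.5, 2.0, 100.0])  # last value mimics a gross outlier
print(psi_ls(residuals))      # the outlier's contribution grows without bound
print(psi_huber(residuals))   # the outlier's contribution is capped at c
```

The same boundedness argument is what makes the influence function of the lasso (whose quadratic loss has score 2r) unbounded in the residual.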
Robustness in sparse linear models: relative efficiency based on robust approximate message passing
Understanding efficiency in high-dimensional linear models is a longstanding
problem of interest. Classical work on smaller-dimensional problems dating
back to Huber and Bickel has illustrated the benefits of efficient loss
functions. When the number of parameters is of the same order as the sample
size, an efficiency pattern different from that of Huber was recently
established. In this work, we consider the effects of model selection on the
estimation efficiency of penalized methods. In particular, we explore whether
sparsity results in new efficiency patterns when the number of parameters
exceeds the sample size. In
the interest of deriving the asymptotic mean squared error for regularized
M-estimators, we use the powerful framework of approximate message passing. We
propose a novel, robust and sparse approximate message passing algorithm
(RAMP) that is adaptive to the error distribution. Our algorithm handles many
non-quadratic and non-differentiable loss functions. We derive its asymptotic
mean squared error and show its convergence, while allowing the number of
parameters to grow with the sample size. We identify new patterns of relative
efficiency for a number of penalized estimators when the number of parameters
is much larger than the sample size. We show that the classical information
bound is no longer reachable, even for light-tailed error distributions. We
show that the penalized least absolute deviation estimator dominates the
penalized least squares estimator in cases of heavy-tailed distributions. We
observe this pattern for all choices of the number of non-zero parameters. In
non-penalized problems, the opposite regime holds. Therefore, we discover that
the presence of model selection significantly changes the efficiency patterns.
Comment: 49 pages, 10 figures
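Approximate message passing algorithms of this family induce sparsity by applying a thresholding denoiser at each iteration; for the l1 penalty that denoiser is the soft-thresholding operator. A minimal sketch of that single building block (not the paper's RAMP algorithm, which additionally adapts to the error distribution):

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the l1 penalty: shrinks every entry toward
    # zero by t and sets small entries exactly to zero, which is where
    # sparsity enters AMP-style iterations.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([-3.0, -0.4, 0.2, 1.5])
print(soft_threshold(x, 0.5))   # entries within +/- 0.5 are zeroed out
```

In an AMP iteration this operator is applied to a pseudo-data vector formed from the current estimate plus a corrected residual term; the robust variant replaces the quadratic residual step with a non-quadratic loss.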
Robust high-dimensional data analysis using a weight shrinkage rule
In high-dimensional settings, a penalized least squares approach may lose its efficiency in both estimation and variable selection due to the existence of either outliers or heteroscedasticity. In this thesis, we propose a novel approach to perform robust high-dimensional data analysis in a penalized weighted least squares framework. The main idea is to relate the irregularity of each observation to a weight vector and to obtain the outlying status data-adaptively using a weight shrinkage rule. By using L1-type regularization on both the coefficients and the weight vector, the proposed method is able to perform simultaneous variable selection and outlier detection efficiently. This procedure results in estimators with potentially strong robustness and non-asymptotic consistency. We provide a unified link between the weight shrinkage rule and robust M-estimation in general settings. We also establish non-asymptotic oracle inequalities for the joint estimation of both the regression coefficients and the weight vector. These theoretical results allow the number of variables to far exceed the sample size. The performance of the proposed estimator is demonstrated in both simulation studies and real examples.
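The idea of tying each observation's outlying status to an l1-penalized per-observation parameter can be sketched via the closely related mean-shift formulation, where soft-thresholding the residuals is exactly what links the approach to Huber-type M-estimation. The following alternating-minimization toy (function name and tuning are hypothetical, not the thesis implementation) flags the contaminated observation:

```python
import numpy as np

def fit_with_outlier_shifts(X, y, lam=1.0, n_iter=50):
    # Toy mean-shift model: y = X @ beta + gamma + noise, with an l1
    # penalty on the per-observation shifts gamma; nonzero entries of
    # gamma mark observations treated as outliers.
    n, p = X.shape
    beta, gamma = np.zeros(p), np.zeros(n)
    for _ in range(n_iter):
        # beta-step: least squares on the shift-corrected response
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        # gamma-step: soft-threshold the residuals (the M-estimation link)
        r = y - X @ beta
        gamma = np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)
    return beta, gamma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)
y[0] += 10.0                        # contaminate one observation
beta, gamma = fit_with_outlier_shifts(X, y, lam=1.0)
print(np.flatnonzero(gamma))        # index of the flagged observation
```

The coefficient estimate is barely affected by the contaminated point because its influence is absorbed by the corresponding shift parameter.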
High-dimensional variable selection for genomics data, from both frequentist and Bayesian perspectives
Doctor of Philosophy, Department of Statistics, Cen Wu
Variable selection is one of the most popular tools for analyzing high-dimensional genomic data. It has been developed to accommodate complex data structures and to achieve structured sparse identification of important genomic features. We focus on the network and interaction structures that commonly exist in genomic data, and develop novel variable selection methods from both frequentist and Bayesian perspectives.
Network-based regularization has achieved success in variable selection for high-dimensional cancer genomic data, owing to its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions and are contaminated by outliers, network-constrained regularization that does not take robustness into account leads to false identification of network structure and biased estimation of patients’ survival. In the first project, we develop a novel robust network-based variable selection method under the accelerated failure time (AFT) model. Extensive simulation studies show the advantage of the proposed method over alternative methods. Promising findings are made in two case studies of lung cancer datasets with high-dimensional gene expression measurements.
Gene-environment (G×E) interactions are important for elucidating disease etiology beyond the main genetic and environmental effects. In the second project, a novel and powerful semi-parametric Bayesian variable selection model is proposed to investigate linear and nonlinear G×E interactions simultaneously. It can further conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Simulation shows that the proposed model conducts Bayesian variable selection more efficiently and accurately than competing alternatives, in terms of both identification and prediction. In the case study, the proposed Bayesian method leads to the identification of effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
In the last project, a robust Bayesian variable selection method has been developed for G×E interaction studies. The proposed robust Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. Spike and slab priors are incorporated on both individual and group levels to identify the sparse main and interaction effects. Extensive simulation studies and analysis of both the diabetes data with SNP measurements from the Nurses’ Health Study and TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.
To facilitate reproducible research and fast computation, we have developed open-source R packages for each project, providing highly efficient C++ implementations of all the proposed and alternative approaches. The R packages regnet and spinBayes, associated with the first and second projects respectively, are available on CRAN. For the third project, the R package robin is available from GitHub and will be submitted to CRAN soon.
Robust penalized regression for complex high-dimensional data
Robust high-dimensional data analysis has become an important and challenging task in complex Big Data analysis due to high dimensionality and data contamination. One of the most popular procedures is robust penalized regression. In this dissertation, we address three typical robust ultra-high-dimensional regression problems via penalized regression approaches. The first problem concerns the linear model in the presence of outliers, handling outlier detection, variable selection and parameter estimation simultaneously. The second problem concerns robust high-dimensional mean regression under irregular settings such as data contamination, data asymmetry and heteroscedasticity. The third problem concerns robust bi-level variable selection for the linear regression model with grouping structures in the covariates. In Chapter 1, we introduce the background and challenges through overviews of penalized least squares methods and robust regression techniques. In Chapter 2, we propose a novel approach in a penalized weighted least squares framework to perform simultaneous variable selection and outlier detection. We provide a unified link between the proposed framework and robust M-estimation in general settings. We also establish non-asymptotic oracle inequalities for the joint estimation of both the regression coefficients and weight vectors. In Chapter 3, we establish a framework of robust estimators in high-dimensional regression models using Penalized Robust Approximated quadratic M-estimation (PRAM). This framework allows general settings in which the random errors lack symmetry and homogeneity, or the covariates are not sub-Gaussian. Theoretically, we show that, in the ultra-high-dimensional setting, the PRAM estimator achieves local estimation consistency at the minimax rate enjoyed by the LS-Lasso and possesses the local oracle property, under certain mild conditions.
In Chapter 4, we extend the study in Chapter 3 to robust high-dimensional data analysis with structured sparsity. In particular, we propose a framework of high-dimensional M-estimators for bi-level variable selection. This framework encourages bi-level sparsity through a computationally efficient two-stage procedure. It produces strongly robust parameter estimators if nonconvex redescending loss functions are applied. In theory, we provide sufficient conditions under which the proposed two-stage penalized M-estimator possesses simultaneous local estimation consistency and bi-level variable selection consistency when a certain nonconvex penalty function is used at the group level. The performance of the proposed estimators is demonstrated in both simulation studies and real examples. In Chapter 5, we provide some discussion and directions for future work.
HB-PLS: A statistical method for identifying biological process or pathway regulators by integrating Huber loss and Berhu penalty with partial least squares regression
Gene expression data feature high dimensionality, multicollinearity, and non-Gaussian noise, posing hurdles for the identification of the true regulatory genes controlling a biological process or pathway. In this study, we integrated the Huber loss function and the Berhu penalty (HB) into the partial least squares (PLS) framework to deal with the high dimensionality and multicollinearity of gene expression data, and developed a new method called HB-PLS regression to model the relationships between regulatory genes and pathway genes. To solve the Huber-Berhu optimization problem, an accelerated proximal gradient descent algorithm, at least 10 times faster than the general convex optimization solver (CVX), was developed. Applying HB-PLS to recognize pathway regulators of lignin biosynthesis and photosynthesis in Arabidopsis thaliana led to the identification of many known positive pathway regulators that had previously been experimentally validated. Compared with sparse partial least squares (SPLS) regression, an efficient method for variable selection and dimension reduction in handling multicollinearity, HB-PLS identified more known positive regulators and achieved a better sensitivity/(1-specificity) trade-off in ranking the true positive known regulators toward the top of the output regulatory gene lists for the two aforementioned pathways. In addition, each method identified some unique regulators that the other could not. Our results show that the overall performance of HB-PLS slightly exceeds that of SPLS, but both methods are instrumental for identifying real pathway regulators from high-throughput gene expression data, suggesting that integrating statistics, machine learning and convex optimization can yield a method of high efficacy that is worth further exploration.
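The two building blocks combine complementary behaviors: the Huber loss is quadratic near zero and linear in the tails (robust to large residuals), while the Berhu ("reversed Huber") penalty is linear near zero, inducing sparsity, and quadratic for large coefficients, handling collinearity. A sketch of the standard textbook forms (not the HB-PLS code):

```python
import numpy as np

def huber_loss(r, c=1.345):
    # Huber loss: 0.5*r^2 for |r| <= c, linear beyond c, so large
    # residuals contribute only linearly to the objective.
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def berhu_penalty(b, c=1.0):
    # Berhu penalty: |b| (lasso-like) for |b| <= c, quadratic
    # (ridge-like) beyond c, the mirror image of the Huber function.
    a = np.abs(b)
    return np.where(a <= c, a, (a**2 + c**2) / (2.0 * c))

print(huber_loss(np.array([0.5, 10.0])))     # quadratic, then linear regime
print(berhu_penalty(np.array([0.5, 3.0])))   # linear, then quadratic regime
```

Both functions are convex and continuously differentiable at the transition point c, which is what makes the combined objective amenable to proximal gradient methods.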
Robust variable screening for regression using factor profiling
Sure Independence Screening is a fast procedure for variable selection in
ultra-high dimensional regression analysis. Unfortunately, its performance
greatly deteriorates with increasing dependence among the predictors. To solve
this issue, Factor Profiled Sure Independence Screening (FPSIS) models the
correlation structure of the predictor variables, assuming that it can be
represented by a few latent factors. The correlations can then be profiled out
by projecting the data onto the orthogonal complement of the subspace spanned
by these factors. However, neither of these methods can handle the presence of
outliers in the data. Therefore, we propose a robust screening method which
uses a least trimmed squares method to estimate the latent factors and the
factor profiled variables. Variable screening is then performed on factor
profiled variables by using regression MM-estimators. Different types of
outliers in this model and their roles in variable screening are studied. Both
simulation studies and a real data analysis show that the proposed robust
procedure performs well on clean data and outperforms the two nonrobust
methods on contaminated data.
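The non-robust profiling step can be sketched as follows: estimate the latent factors with a truncated SVD, project the data onto the orthogonal complement of the factor subspace, then screen predictors by marginal correlation. This mirrors FPSIS only; the proposed robust method replaces the SVD step with least trimmed squares and the screening step with MM-estimators, and all names below are illustrative:

```python
import numpy as np

def factor_profile_screen(X, y, n_factors=2, top=10):
    # Estimate factor scores via truncated SVD of the predictor matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    F = U[:, :n_factors]
    # Project onto the orthogonal complement of the factor subspace,
    # removing the common correlation structure from X and y.
    P = np.eye(X.shape[0]) - F @ F.T
    Xp, yp = P @ X, P @ y
    # Screen by absolute marginal correlation of the profiled variables.
    cors = np.abs([np.corrcoef(Xp[:, j], yp)[0, 1] for j in range(X.shape[1])])
    return np.argsort(cors)[::-1][:top]   # indices of top-ranked predictors

rng = np.random.default_rng(1)
n, p, k = 100, 30, 2
F = rng.normal(size=(n, k))
X = F @ rng.normal(size=(k, p)) + 0.5 * rng.normal(size=(n, p))  # factor-correlated predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=n)
print(factor_profile_screen(X, y, n_factors=k, top=5))  # should surface columns 0 and 3
```

Profiling out the factors is what restores the near-independence among predictors that Sure Independence Screening implicitly relies on.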