55,005 research outputs found

    Outlier Detection Using Nonconvex Penalized Regression

    Full text link
    This paper studies the outlier detection problem from the point of view of penalized regressions. Our regression model adds one mean shift parameter for each of the nn data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual L1L_1 penalty yields a convex criterion, but we find that it fails to deliver a robust estimator. The L1L_1 penalty corresponds to soft thresholding. We introduce a thresholding (denoted by Θ\Theta) based iterative procedure for outlier detection (Θ\Theta-IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We find that Θ\Theta-IPOD is much faster than iteratively reweighted least squares for large data because each iteration costs at most O(np)O(np) (and sometimes much less) avoiding an O(np2)O(np^2) least squares estimate. We describe the connection between Θ\Theta-IPOD and MM-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on BIC. The tuned Θ\Theta-IPOD shows outstanding performance in identifying outliers in various situations in comparison to other existing approaches. This methodology extends to high-dimensional modeling with p≫np\gg n, if both the coefficient vector and the outlier pattern are sparse

    Letter to the Editor

    Full text link
    The paper by Alfons, Croux and Gelper (2013), Sparse least trimmed squares regression for analyzing high-dimensional large data sets, considered a combination of least trimmed squares (LTS) and lasso penalty for robust and sparse high-dimensional regression. In a recent paper [She and Owen (2011)], a method for outlier detection based on a sparsity penalty on the mean shift parameter was proposed (designated by "SO" in the following). This work is mentioned in Alfons et al. as being an "entirely different approach." Certainly the problem studied by Alfons et al. is novel and interesting.Comment: Published in at http://dx.doi.org/10.1214/13-AOAS640 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Approaches for Outlier Detection in Sparse High-Dimensional Regression Models

    Get PDF
    Modern regression studies often encompass a very large number of potential predictors, possibly larger than the sample size, and sometimes growing with the sample size itself. This increases the chances that a substantial portion of the predictors is redundant, as well as the risk of data contamination. Tackling these problems is of utmost importance to facilitate scientific discoveries, since model estimates are highly sensitive both to the choice of predictors and to the presence of outliers. In this thesis, we contribute to this area considering the problem of robust model selection in a variety of settings, where outliers may arise both in the response and the predictors. Our proposals simplify model interpretation, guarantee predictive performance, and allow us to study and control the influence of outlying cases on the fit. First, we consider the co-occurrence of multiple mean-shift and variance-inflation outliers in low-dimensional linear models. We rely on robust estimation techniques to identify outliers of each type, exclude mean-shift outliers, and use restricted maximum likelihood estimation to down-weight and accommodate variance-inflation outliers into the model fit. Second, we extend our setting to high-dimensional linear models. We show that mean-shift and variance-inflation outliers can be modeled as additional fixed and random components, respectively, and evaluated independently. Specifically, we perform feature selection and mean-shift outlier detection through a robust class of nonconcave penalization methods, and variance-inflation outlier detection through the penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination – which allows the number of features to exponentially increase with the sample size – and detects truly outlying cases of each type with asymptotic probability one. This provides an optimal trade-off between a high breakdown point and efficiency. Third, focusing on high-dimensional linear models affected by meanshift outliers, we develop a general framework in which L0-constraints coupled with mixed-integer programming techniques are used to perform simultaneous feature selection and outlier detection with provably optimal guarantees. In particular, we provide necessary and sufficient conditions for a robustly strong oracle property, where again the number of features can increase exponentially with the sample size, and prove optimality for parameter estimation and the resulting breakdown point. Finally, we consider generalized linear models and rely on logistic slippage to perform outlier detection and removal in binary classification. Here we use L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem of feature selection and outlier detection, and the framework allows us again to pursue optimality guarantees. For all the proposed approaches, we also provide computationally lean heuristic algorithms, tuning procedures, and diagnostic tools which help to guide the analysis. We consider several real-world applications, including the study of the relationships between childhood obesity and the human microbiome, and of the main drivers of honey bee loss. All methods developed and data used, as well as the source code to replicate our analyses, are publicly available

    Diagnostic measures for linear mixed measurement error models

    Get PDF
    In this paper, we present case deletion and mean shift outlier models for linear mixed measurement error models using the corrected likelihood of Nakamura (1990). We derive the corrected score test statistic for outliers detection based on mean shift outlier models. Furthermore, several case deletion diagnostics are constructed as a tool for influence diagnostics. It is found that they can be written in terms of studentized residuals of model, error contrast matrix and the inverse of the response variable covariance matrix. Our influence diagnostics are illustrated through a real data set

    Mean Shift versus Variance Inflation Approach for Outlier Detection—A Comparative Study

    Get PDF
    Outlier detection is one of the most important tasks in the analysis of measured quantities to ensure reliable results. In recent years, a variety of multi-sensor platforms has become available, which allow autonomous and continuous acquisition of large quantities of heterogeneous observations. Because the probability that such data sets contain outliers increases with the quantity of measured values, powerful methods are required to identify contaminated observations. In geodesy, the mean shift model (MS) is one of the most commonly used approaches for outlier detection. In addition to the MS model, there is an alternative approach with the model of variance inflation (VI). In this investigation the VI approach is derived in detail, truly maximizing the likelihood functions and examined for outlier detection of one or multiple outliers. In general, the variance inflation approach is non-linear, even if the null model is linear. Thus, an analytical solution does usually not exist, except in the case of repeated measurements. The test statistic is derived from the likelihood ratio (LR) of the models. The VI approach is compared with the MS model in terms of statistical power, identifiability of actual outliers, and numerical effort. The main purpose of this paper is to examine the performance of both approaches in order to derive recommendations for the practical application of outlier detection

    A comparison of two methods for detecting abrupt changes in the variance of climatic time series

    Full text link
    Two methods for detecting abrupt shifts in the variance, Integrated Cumulative Sum of Squares (ICSS) and Sequential Regime Shift Detector (SRSD), have been compared on both synthetic and observed time series. In Monte Carlo experiments, SRSD outperformed ICSS in the overwhelming majority of the modelled scenarios with different sequences of variance regimes. The SRSD advantage was particularly apparent in the case of outliers in the series. When tested on climatic time series, in most cases both methods detected the same change points in the longer series (252-787 monthly values). The only exception was the Arctic Ocean SST series, when ICSS found one extra change point that appeared to be spurious. As for the shorter time series (66-136 yearly values), ICSS failed to detect any change points even when the variance doubled or tripled from one regime to another. For these time series, SRSD is recommended. Interestingly, all the climatic time series tested, from the Arctic to the Tropics, had one thing in common: the last shift detected in each of these series was toward a high-variance regime. This is consistent with other findings of increased climate variability in recent decades.Comment: 32 pages, 11 figure

    Outlier detection in multivariate time series via projection pursuit

    Get PDF
    This article uses Projection Pursuit methods to develop a procedure for detecting outliers in a multivariate time series. We show that testing for outliers in some projection directions could be more powerful than testing the multivariate series directly. The optimal directions for detecting outliers are found by numerical optimization of the kurtosis coefficient of the projected series. We propose an iterative procedure to detect and handle multiple outliers based on univariate search in these optimal directions. In contrast with the existing methods, the proposed procedure can identify outliers without pre-specifying a vector ARMA model for the data. The good performance of the proposed method is verified in a Monte Carlo study and in a real data analysis

    Finding an unknown number of multivariate outliers

    Get PDF
    We use the forward search to provide robust Mahalanobis distances to detect the presence of outliers in a sample of multivariate normal data. Theoretical results on order statistics and on estimation in truncated samples provide the distribution of our test statistic. We also introduce several new robust distances with associated distributional results. Comparisons of our procedure with tests using other robust Mahalanobis distances show the good size and high power of our procedure. We also provide a unification of results on correction factors for estimation from truncated samples
    • 

    corecore