    Approaches for Outlier Detection in Sparse High-Dimensional Regression Models

    Modern regression studies often encompass a very large number of potential predictors, possibly larger than the sample size, and sometimes growing with the sample size itself. This increases the chances that a substantial portion of the predictors is redundant, as well as the risk of data contamination. Tackling these problems is of utmost importance to facilitate scientific discoveries, since model estimates are highly sensitive both to the choice of predictors and to the presence of outliers. In this thesis, we contribute to this area considering the problem of robust model selection in a variety of settings, where outliers may arise both in the response and the predictors. Our proposals simplify model interpretation, guarantee predictive performance, and allow us to study and control the influence of outlying cases on the fit. First, we consider the co-occurrence of multiple mean-shift and variance-inflation outliers in low-dimensional linear models. We rely on robust estimation techniques to identify outliers of each type, exclude mean-shift outliers, and use restricted maximum likelihood estimation to down-weight and accommodate variance-inflation outliers into the model fit. Second, we extend our setting to high-dimensional linear models. We show that mean-shift and variance-inflation outliers can be modeled as additional fixed and random components, respectively, and evaluated independently. Specifically, we perform feature selection and mean-shift outlier detection through a robust class of nonconcave penalization methods, and variance-inflation outlier detection through the penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination – which allows the number of features to exponentially increase with the sample size – and detects truly outlying cases of each type with asymptotic probability one. 
This provides an optimal trade-off between a high breakdown point and efficiency. Third, focusing on high-dimensional linear models affected by mean-shift outliers, we develop a general framework in which L0-constraints coupled with mixed-integer programming techniques are used to perform simultaneous feature selection and outlier detection with provably optimal guarantees. In particular, we provide necessary and sufficient conditions for a robustly strong oracle property, where again the number of features can increase exponentially with the sample size, and prove optimality for parameter estimation and the resulting breakdown point. Finally, we consider generalized linear models and rely on logistic slippage to perform outlier detection and removal in binary classification. Here we use L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem of feature selection and outlier detection, and the framework again allows us to pursue optimality guarantees. For all the proposed approaches, we also provide computationally lean heuristic algorithms, tuning procedures, and diagnostic tools which help to guide the analysis. We consider several real-world applications, including the study of the relationships between childhood obesity and the human microbiome, and of the main drivers of honey bee loss. All methods developed and data used, as well as the source code to replicate our analyses, are publicly available.
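For orientation, the L0-constrained approach to simultaneous feature selection and mean-shift outlier detection described above is commonly written in the following form. This is a sketch under assumed notation, not necessarily the thesis's exact formulation: the symbols $\phi$ (a vector of case-specific mean shifts), $k_p$ (feature budget) and $k_n$ (outlier budget) are introduced here for illustration only.

\[
\min_{\beta \in \mathbb{R}^{p},\ \phi \in \mathbb{R}^{n}} \ \lVert y - X\beta - \phi \rVert_2^2
\quad \text{subject to} \quad \lVert \beta \rVert_0 \le k_p, \qquad \lVert \phi \rVert_0 \le k_n .
\]

Under this reading, a nonzero $\phi_i$ flags case $i$ as a mean-shift outlier and effectively removes it from the fit, while a nonzero $\beta_j$ marks feature $j$ as selected. The two cardinality constraints are what make the problem "doubly combinatorial" and amenable to mixed-integer programming.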

    Profile control charts based on nonparametric L-1 regression methods

    Classical statistical process control often relies on univariate characteristics. In many contemporary applications, however, the quality of products must be characterized by some functional relation between a response variable and its explanatory variables. Monitoring such functional profiles has been a rapidly growing field due to increasing demands. This paper develops a novel nonparametric L-1 location-scale model to screen the shapes of profiles. The model is built on three basic elements: location shifts, local shape distortions, and overall shape deviations, which are quantified by three individual metrics. The proposed approach is applied to the previously analyzed vertical density profile data, leading to some interesting insights. Comment: Published at http://dx.doi.org/10.1214/11-AOAS501 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Wage Determination in Russia: An Econometric Investigation

    Using a firm-level dataset from four regions of Russia covering 1996/97, an investigation was carried out into how the surplus created within the firm is divided between profits and wages. An efficient bargaining framework based on the work of Svejnar (1986) is employed, which takes into account the alternative wage or outside option available to employees in the firm as well as the value added per employee. Statistical differences are found in the share of the surplus taken by employees in state, private and mixed forms of firms. In addition, the results prove sensitive to the presence of outliers and influential observations. A variety of diagnostic methods are employed to identify these influential observations, and robust methods are employed to reduce their influence. Whereas in practice some of the diagnostic and robust methods utilised proved incapable of identifying or accommodating the gross outlier(s) in the data, the more successful methods included robust regression, Winsorising, the Hadi and Simonoff algorithm, Cook's Distance and Covratio. http://deepblue.lib.umich.edu/bitstream/2027.42/39679/3/wp295.pd
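Two of the diagnostics named in this abstract, Cook's Distance and Winsorising, can be illustrated with a minimal sketch. The data below are simulated and have no connection to the Russian firm dataset; the function names `cooks_distance` and `winsorise`, the 5%/95% cut-offs, and the planted outlier are all assumptions made for illustration.

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's distance for each case in an OLS fit of y on X.

    X is assumed to already contain an intercept column.
    """
    n, p = X.shape
    # Leverages (diagonal of the hat matrix) from the thin QR factor of X.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    # Ordinary least-squares fit and residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 * h / (p * s2 * (1.0 - h) ** 2)


def winsorise(v, lower=0.05, upper=0.95):
    """Clip values to the given empirical quantiles (illustrative cut-offs)."""
    lo, hi = np.quantile(v, [lower, upper])
    return np.clip(v, lo, hi)


# Simulated data with one gross outlier planted at a high-leverage point.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 4.0, -20.0

X = np.column_stack([np.ones(n), x])
d = cooks_distance(X, y)
most_influential = int(np.argmax(d))  # index of the case with largest influence
y_w = winsorise(y)                    # response with extreme values pulled in
```

Cook's distance combines residual size and leverage, so it is suited to flagging single influential cases. Winsorising does not flag anything; it only limits how far extreme values can pull the fit.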

    A variance shift model for outlier detection and estimation in linear and linear mixed models

    Outliers are data observations that fall outside the usual conditional ranges of the response data. They are common in experimental research data, for example, due to transcription errors or faulty experimental equipment. Often outliers are quickly identified and addressed, that is, corrected, removed from the data, or retained for subsequent analysis. However, in many cases they are completely anomalous and it is unclear how to treat them. Case deletion techniques are established methods in detecting outliers in linear fixed effects analysis. The extension of these methods to detecting outliers in linear mixed models has not been entirely successful in the literature. This thesis focuses on a variance shift outlier model as an approach to detecting and assessing outliers in both linear fixed effects and linear mixed effects analysis. A variance shift outlier model assumes a variance shift parameter, $\omega_i$, for the ith observation, where $\omega_i$ is unknown and estimated from the data. Estimated values of $\omega_i$ indicate observations with possibly inflated variances relative to the remainder of the observations in the data set and hence outliers. When outliers lurk within anomalous elements in the data set, a variance shift outlier model offers an opportunity to include anomalies in the analysis, but down-weighted using the variance shift estimate $\hat{\omega}_i$. This down-weighting might be considered preferable to omitting data points (as in case-deletion methods). For very large values of $\omega_i$ a variance shift outlier model is approximately equivalent to the case deletion approach. We commence with a detailed review of parameter estimation and inferential procedures for the linear mixed model. The review is necessary for the development of the variance shift outlier model as a method for detecting outliers in linear fixed and linear mixed models. This review is followed by a discussion of the status of current research into linear mixed model diagnostics. 
Different types of residuals in the linear mixed model are defined. A decomposition of the leverage matrix for the linear mixed model leads to interpretable leverage measures. A detailed review of a variance shift outlier model in linear fixed effects analysis is given. The purpose of this review is firstly, to gain insight into the general case (the linear mixed model) and secondly, to develop the model further in linear fixed effects analysis. A variance shift outlier model can be formulated as a linear mixed model so that the calculations required to estimate the parameters of the model are those associated with fitting a linear mixed model, and hence the model can be fitted using standard software packages. Likelihood ratio and score test statistics are developed as objective measures for the variance shift estimates. The proposed test statistics initially assume balanced longitudinal data with a Gaussian distributed response variable. The dependence of the proposed test statistics on the second derivatives of the log-likelihood function is also examined. For the single-case outlier in linear fixed effects analysis, analytical expressions for the proposed test statistics are obtained. A resampling algorithm is proposed for assessing the significance of the proposed test statistics and for handling the problem of multiple testing. A variance shift outlier model is then adapted to detect a group of outliers in a fixed effects model. Properties and performance of the likelihood ratio and score test statistics are also investigated. A variance shift outlier model for detecting single-case outliers is also extended to linear mixed effects analysis under Gaussian assumptions for the random effects and the random errors. The variance parameters are estimated using the residual maximum likelihood method. Likelihood ratio and score tests are also constructed for this extended model. 
Two distinct computing algorithms, which constrain the variance parameter estimates to be positive, are given. Properties of the resulting variance parameter estimates from each computing algorithm are also investigated. A variance shift outlier model for detecting single-case outliers in linear mixed effects analysis is extended to detect groups of outliers or subjects having outlying profiles with random intercepts and random slopes that are inconsistent with the corresponding model elements for the remaining subjects in the data set. The issue of influence on the fixed effects under a variance shift outlier model is also discussed.
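As a reading aid, one common way of writing a variance shift outlier model for a single case $i$ in the linear fixed effects setting is sketched below. The parameterisation is an assumption made for illustration (the thesis may use a different one); $d_i$ denotes the $i$th unit vector and $\delta_i$ a case-specific random effect, both introduced here.

\[
y = X\beta + d_i\,\delta_i + \varepsilon, \qquad
\delta_i \sim N(0,\ \omega_i \sigma^2), \qquad
\varepsilon \sim N(0,\ \sigma^2 I_n),
\]

so that

\[
\operatorname{Var}(y_i) = \sigma^2 (1 + \omega_i), \qquad
\operatorname{Var}(y_j) = \sigma^2 \quad (j \neq i).
\]

Written this way, the model is a linear mixed model with one extra random effect, which is why standard mixed-model software can fit it. In the resulting weighted fit, case $i$ receives relative weight $1/(1+\omega_i)$, which tends to zero as $\omega_i \to \infty$; this matches the abstract's remark that very large $\omega_i$ is approximately equivalent to case deletion.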

    A variance shift model for outlier detection and estimation in linear and linear mixed models

    Outliers are data observations that fall outside the usual conditional ranges of the response data. They are common in experimental research data, for example, due to transcription errors or faulty experimental equipment. Often outliers are quickly identified and addressed, that is, corrected, removed from the data, or retained for subsequent analysis. However, in many cases they are completely anomalous and it is unclear how to treat them. Case deletion techniques are established methods in detecting outliers in linear fixed effects analysis. The extension of these methods to detecting outliers in linear mixed models has not been entirely successful in the literature. This thesis focuses on a variance shift outlier model as an approach to detecting and assessing outliers in both linear fixed effects and linear mixed effects analysis. A variance shift outlier model assumes a variance shift parameter, $\omega_i$, for the ith observation, where $\omega_i$ is unknown and estimated from the data. Estimated values of $\omega_i$ indicate observations with possibly inflated variances relative to the remainder of the observations in the data set and hence outliers. When outliers lurk within anomalous elements in the data set, a variance shift outlier model offers an opportunity to include anomalies in the analysis, but down-weighted using the variance shift estimate $\hat{\omega}_i$. This down-weighting might be considered preferable to omitting data points (as in case-deletion methods). For very large values of $\omega_i$ a variance shift outlier model is approximately equivalent to the case deletion approach.