482 research outputs found

    A Fast Algorithm for Robust Regression with Penalised Trimmed Squares

    The presence of groups of high-leverage outliers makes linear regression a difficult problem due to the masking effect. The available high-breakdown estimators based on Least Trimmed Squares (LTS) often fail to detect masked high-leverage outliers in finite samples. An alternative to the LTS estimator, called the Penalised Trimmed Squares (PTS) estimator, was introduced by the authors in [ZiouAv:05, ZiAvPi:07] and appears to be less sensitive to the masking problem. This estimator is defined by a Quadratic Mixed Integer Programming (QMIP) problem whose objective function includes a penalty cost for each observation; this cost serves as an upper bound on the residual error for any feasible regression line. Since the PTS does not require presetting the number of outliers to delete from the data set, it has better efficiency than other estimators. However, due to the high computational complexity of the resulting QMIP problem, exact solution of moderately large regression problems is infeasible. In this paper we further establish the theoretical properties of the PTS estimator, such as its high breakdown point and efficiency, and propose an approximate algorithm called Fast-PTS to compute the PTS estimator efficiently for large data sets. Extensive computational experiments on sets of benchmark instances with varying degrees of outlier contamination indicate that the proposed algorithm performs well in identifying groups of high-leverage outliers in reasonable computational time. Comment: 27 pages
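
    The abstract describes the QMIP structure precisely enough to state it. Below is a hedged reconstruction in notation of my own choosing (the penalty costs p_i and deletion indicators z_i are inferred from the description above, not copied from the paper): deleting observation i costs p_i, so at an optimum no retained observation contributes more than p_i of squared residual.

    ```latex
    % Hedged reconstruction of the PTS objective (notation mine, not the authors'):
    % z_i = 1 deletes observation i at penalty cost p_i.
    \min_{\beta \in \mathbb{R}^p,\ z \in \{0,1\}^n} \;
      \sum_{i=1}^{n} (1 - z_i)\,\bigl(y_i - x_i^{\top}\beta\bigr)^2
      \;+\; \sum_{i=1}^{n} z_i\, p_i
    ```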

    Selecting Good Measurements via ℓ₁ Relaxation: A Convex Approach for Robust Estimation Over Graphs

    © 2014 IEEE. DOI: 10.1109/IROS.2014.6942927
    Pose graph optimization is an elegant and efficient formulation for robot localization and mapping. Experimental evidence suggests that, in real problems, the set of measurements used to estimate robot poses is prone to contain outliers, due to perceptual aliasing and incorrect data association. While several related works deal with the rejection of outliers during pose estimation, the goal of this paper is to propose a grounded strategy for measurement selection, i.e., the output of our approach is a set of “reliable” measurements rather than pose estimates. Because the classification into inliers/outliers is not observable in general, we pose the problem as finding the maximal subset of the measurements that is internally coherent. In the linear case, we show that the selection of the maximal coherent set can be (conservatively) relaxed to obtain a linear programming problem with an ℓ₁ objective. We show that this approach can be extended to (nonlinear) planar pose graph optimization using ideas similar to those in our previous work on linear approaches to pose graph optimization. We evaluate our method on standard datasets and show that it is robust to a large number of outliers and to different outlier generation models, while retaining the advantages of linear programming (fast computation, scalability).
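
    To make the linear case concrete, here is a minimal sketch (not the authors' code; the toy graph, measurements, and threshold tau are illustrative assumptions) of ℓ₁-based selection on a small measurement graph: estimate node values from pairwise-difference measurements by minimizing the ℓ₁ norm of the residuals, then keep only measurements with small residuals.

    ```python
    # Minimal l1-selection sketch on a toy graph (illustrative, not the paper's code).
    import cvxpy as cp
    import numpy as np

    edges = [(0, 1), (1, 2), (2, 3), (0, 3)]  # hypothetical measurement graph
    meas = np.array([1.0, 1.0, 1.0, 10.0])    # pairwise differences; last edge is an outlier

    x = cp.Variable(4)
    res = cp.hstack([meas[k] - (x[j] - x[i]) for k, (i, j) in enumerate(edges)])
    cp.Problem(cp.Minimize(cp.norm1(res)), [x[0] == 0]).solve()  # fix the gauge freedom

    tau = 1.0  # illustrative coherence threshold
    reliable = [edges[k] for k in range(len(edges)) if abs(res.value[k]) <= tau]
    print("selected measurements:", reliable)  # expect the three consistent edges
    ```

    On this toy chain the ℓ₁ objective leaves the three mutually consistent edges with zero residual and pushes the full error onto the contradictory edge (0, 3), which is then rejected.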

    Approaches for Outlier Detection in Sparse High-Dimensional Regression Models

    Modern regression studies often encompass a very large number of potential predictors, possibly larger than the sample size, and sometimes growing with the sample size itself. This increases the chances that a substantial portion of the predictors is redundant, as well as the risk of data contamination. Tackling these problems is of utmost importance to facilitate scientific discoveries, since model estimates are highly sensitive both to the choice of predictors and to the presence of outliers. In this thesis, we contribute to this area by considering the problem of robust model selection in a variety of settings, where outliers may arise both in the response and in the predictors. Our proposals simplify model interpretation, guarantee predictive performance, and allow us to study and control the influence of outlying cases on the fit.

    First, we consider the co-occurrence of multiple mean-shift and variance-inflation outliers in low-dimensional linear models. We rely on robust estimation techniques to identify outliers of each type, exclude mean-shift outliers, and use restricted maximum likelihood estimation to down-weight and accommodate variance-inflation outliers in the model fit.

    Second, we extend our setting to high-dimensional linear models. We show that mean-shift and variance-inflation outliers can be modeled as additional fixed and random components, respectively, and evaluated independently. Specifically, we perform feature selection and mean-shift outlier detection through a robust class of nonconcave penalization methods, and variance-inflation outlier detection through penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination, which allows the number of features to increase exponentially with the sample size, and detects truly outlying cases of each type with asymptotic probability one. This provides an optimal trade-off between a high breakdown point and efficiency.

    Third, focusing on high-dimensional linear models affected by mean-shift outliers, we develop a general framework in which L0-constraints coupled with mixed-integer programming techniques are used to perform simultaneous feature selection and outlier detection with provably optimal guarantees. In particular, we provide necessary and sufficient conditions for a robustly strong oracle property, where again the number of features can increase exponentially with the sample size, and prove optimality for parameter estimation and the resulting breakdown point.

    Finally, we consider generalized linear models and rely on logistic slippage to perform outlier detection and removal in binary classification. Here we use L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem of feature selection and outlier detection, and the framework again allows us to pursue optimality guarantees.

    For all the proposed approaches, we also provide computationally lean heuristic algorithms, tuning procedures, and diagnostic tools that help guide the analysis. We consider several real-world applications, including the study of the relationships between childhood obesity and the human microbiome, and of the main drivers of honey bee loss. All methods developed and data used, as well as the source code to replicate our analyses, are publicly available.
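
    As a concrete reference point for the mean-shift model y = Xβ + γ + ε with sparse γ, here is a minimal alternating heuristic (my own illustrative sketch with an assumed fixed outlier budget k; the thesis itself uses L0-constrained mixed-integer programming with provably optimal guarantees):

    ```python
    # Alternating heuristic for sparse mean-shift outliers (illustrative sketch only).
    import numpy as np

    def mean_shift_trim(X, y, k, iters=50):
        n = len(y)
        flagged = np.zeros(n, dtype=bool)
        for _ in range(iters):
            keep = ~flagged
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # fit on unflagged cases
            r = np.abs(y - X @ beta)
            new_flagged = np.zeros(n, dtype=bool)
            new_flagged[np.argsort(r)[-k:]] = True  # re-flag the k largest residuals
            if np.array_equal(new_flagged, flagged):
                break  # fixed point reached
            flagged = new_flagged
        return beta, flagged
    ```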

    Collinearity and consequences for estimation: a study and simulation


    Aspects of estimation in the linear model with special reference to collinearity


    Exact Combinatorial Optimization with Graph Convolutional Neural Networks

    Combinatorial optimization problems are typically tackled by the branch-and-bound paradigm. We propose to learn a variable selection policy for branch-and-bound in mixed-integer linear programming by imitation learning on a diversified variant of the strong-branching expert rule. We encode states as bipartite graphs and parameterize the policy as a graph convolutional neural network. Experiments on a series of synthetic problems demonstrate that our approach produces policies that improve upon expert-designed branching rules on large problems and generalize to instances significantly larger than those seen during training.
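
    A minimal sketch of one bipartite message pass, to make the state encoding concrete (shapes and names are illustrative simplifications; the paper's architecture interleaves constraint-to-variable and variable-to-constraint passes and scores variables with an MLP head):

    ```python
    # One bipartite half-convolution (illustrative simplification, plain numpy).
    import numpy as np

    def half_convolution(A, H_src, H_dst, W):
        # A: (n_dst, n_src) adjacency of the variable-constraint bipartite graph;
        # H_src, H_dst: node feature matrices; W: a learned weight matrix.
        messages = A @ (H_src @ W)                # aggregate transformed neighbor features
        return np.maximum(H_dst + messages, 0.0)  # residual update + ReLU

    rng = np.random.default_rng(0)
    A = rng.integers(0, 2, size=(5, 3)).astype(float)  # 5 variables, 3 constraints
    H_cons, H_vars = rng.normal(size=(3, 4)), rng.normal(size=(5, 4))
    W = rng.normal(size=(4, 4))
    scores = half_convolution(A, H_cons, H_vars, W).sum(axis=1)  # crude per-variable score
    ```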

    Robust estimation, regression and ranking with applications in portfolio optimization

    Thesis (Ph.D.) by Tri-Dung Nguyen--Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2009. Includes bibliographical references (p. 108-112).
    Classical methods of maximum likelihood and least squares rely a great deal on the correctness of the model assumptions. Since these assumptions are only approximations of reality, many robust statistical methods have been developed to produce estimators that are robust against deviations from the model assumptions. Unfortunately, these techniques have very high computational complexity, which prevents their application to large-scale problems. We present computationally efficient methods for robust mean-covariance estimation and robust linear regression using special mathematical programming models and semidefinite programming (SDP). In the robust covariance estimation problem, we design an optimization model with a loss function on the weighted Mahalanobis distances and show that the problem is equivalent to a system of equations that can be solved using the Newton-Raphson method. The problem can also be transformed into an SDP problem, through which we can flexibly incorporate prior beliefs into the estimators without much increase in computational complexity. The robust regression problem is often formulated as the least trimmed squares (LTS) regression problem, where we want to find the best subset of observations with the smallest sum of squared residuals. We show that the LTS problem is equivalent to a concave minimization problem, which is very hard to solve. We resolve this difficulty by introducing the “maximum trimmed squares” problem, which finds the worst subset of observations. This problem can be transformed into an SDP problem that can be solved efficiently while still guaranteeing that we can identify outliers. In addition, we model the robust ranking problem as a mixed-integer minimax problem where the ranking lies in a discrete uncertainty set. We use mixed-integer programming methods, specifically column generation and network flows, to solve the robust ranking problem. To illustrate the power of these robust methods, we apply them to the mean-variance portfolio optimization problem in order to incorporate estimation errors into the model.
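
    For contrast with the SDP route taken in the thesis, the classic concentration ("C-step") heuristic below makes the LTS objective concrete: repeatedly refit on the h observations with the smallest squared residuals. This is the standard heuristic, not the thesis's method; the subset size h and the single random start are illustrative.

    ```python
    # C-step heuristic for least trimmed squares (standard sketch, single start).
    import numpy as np

    def lts_cstep(X, y, h, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        subset = rng.choice(len(y), size=h, replace=False)  # random initial subset
        for _ in range(iters):
            beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
            r2 = (y - X @ beta) ** 2
            new_subset = np.argsort(r2)[:h]  # keep the h smallest squared residuals
            if set(new_subset) == set(subset):
                break  # the trimmed sum of squares can no longer decrease
            subset = new_subset
        return beta, np.sort(subset)
    ```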

    Multi-criteria optimization in regression

    In this paper, we consider standard as well as instrumental-variables regression. Specification problems related to autocorrelation, heteroskedasticity, neglected nonlinearity, unsatisfactory out-of-sample performance, and endogeneity can be addressed in the context of multi-criteria optimization. The new technique performs well: it mitigates all of these problems simultaneously and, for the most part, eliminates them. Markov Chain Monte Carlo techniques are used to perform the computations. An empirical application to NASDAQ returns is provided.
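
    A hedged sketch of the general idea (the diagnostics, weights, and the Metropolis-style search are my own illustrative stand-ins; the paper's exact criteria and MCMC scheme are not reproduced here): scalarize several specification diagnostics into one objective and search over the coefficient vector.

    ```python
    # Scalarized multi-criteria regression search (illustrative stand-in only).
    import numpy as np

    def criteria(beta, X, y):
        r = y - X @ beta
        ar1 = abs(np.corrcoef(r[:-1], r[1:])[0, 1])   # residual autocorrelation proxy
        het = abs(np.corrcoef(r**2, X @ beta)[0, 1])  # heteroskedasticity proxy
        return r @ r, ar1, het

    def multi_criteria_fit(X, y, weights=(1.0, 10.0, 10.0),
                           steps=5000, scale=0.05, temp=1.0, seed=0):
        rng = np.random.default_rng(seed)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares start
        obj = sum(w * c for w, c in zip(weights, criteria(beta, X, y)))
        for _ in range(steps):
            cand = beta + scale * rng.standard_normal(beta.shape)
            cand_obj = sum(w * c for w, c in zip(weights, criteria(cand, X, y)))
            # Metropolis rule: always accept improvements, sometimes uphill moves.
            if cand_obj < obj or rng.random() < np.exp(-(cand_obj - obj) / temp):
                beta, obj = cand, cand_obj
        return beta
    ```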

    Rails Quality Data Modelling via Machine Learning-Based Paradigms
