A Fast Algorithm for Robust Regression with Penalised Trimmed Squares
The presence of groups containing high leverage outliers makes linear
regression a difficult problem due to the masking effect. The available high
breakdown estimators based on Least Trimmed Squares often do not succeed in
detecting masked high leverage outliers in finite samples.
An alternative to the LTS estimator, called the Penalised Trimmed Squares (PTS)
estimator, was introduced by the authors in \cite{ZiouAv:05,ZiAvPi:07} and
appears to be less sensitive to the masking problem. This estimator is defined
by a Quadratic Mixed Integer Programming (QMIP) problem whose objective
function includes a penalty cost for each observation, serving as an
upper bound on the residual error for any feasible regression line. Since
PTS does not require presetting the number of outliers to delete from the data
set, it achieves better efficiency than other estimators. However, due to
the high computational complexity of the resulting QMIP problem, computing exact
solutions for moderately large regression problems is infeasible.
In this paper we further establish the theoretical properties of the PTS
estimator, such as high breakdown and efficiency, and propose an approximate
algorithm called Fast-PTS to compute the PTS estimator for large data sets
efficiently. Extensive computational experiments on sets of benchmark instances
with varying degrees of outlier contamination indicate that the proposed
algorithm performs well in identifying groups of high leverage outliers in
reasonable computational time.
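The penalty mechanism behind PTS can be illustrated with a simple alternating heuristic: an observation stays in the fit only while its squared residual is below its penalty cost, since otherwise deleting it (and paying the penalty) is cheaper. This is a hedged sketch, not the exact QMIP formulation or the authors' Fast-PTS algorithm; the function name and the constant penalties in the example are illustrative.

```python
import numpy as np

def pts_heuristic(X, y, penalty, max_iter=50):
    """Illustrative PTS-style heuristic (not the exact QMIP solution):
    keep an observation while its squared residual is below its
    penalty cost; otherwise deleting it is cheaper."""
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_iter):
        # ordinary least squares on the currently retained observations
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        r2 = (y - X @ beta) ** 2
        new_keep = r2 <= penalty  # the penalty acts as a residual bound
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return beta, keep
```

On a toy line-fitting problem with a couple of gross outliers, iterating fit-then-delete in this way typically removes exactly the contaminated points and recovers the clean fit.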
Selecting Good Measurements via ℓ₁ Relaxation: A Convex Approach for Robust Estimation Over Graphs
DOI: 10.1109/IROS.2014.6942927
Pose graph optimization is an elegant and efficient formulation for robot
localization and mapping. Experimental evidence suggests that, in real
problems, the set of measurements used to estimate robot poses is prone to
contain outliers, due to perceptual aliasing and incorrect data association.
While several related works deal with the rejection of outliers during pose
estimation, the goal of this paper is to propose a grounded strategy for
measurement selection, i.e., the output of our approach is a set of
"reliable" measurements, rather than pose estimates. Because the
classification into inliers and outliers is not observable in general, we pose
the problem as finding the maximal subset of the measurements that is
internally coherent.
In the linear case, we show that the selection of the maximal
coherent set can be (conservatively) relaxed to obtain a linear
programming problem with ℓ₁ objective. We show that this
approach can be extended to (nonlinear) planar pose graph
optimization using similar ideas as our previous work on linear
approaches to pose graph optimization. We evaluate our method
on standard datasets, and we show that it is robust to a large
number of outliers and different outlier generation models,
while entailing the advantages of linear programming (fast
computation, scalability).
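In the linear case, the relaxed problem is an ordinary ℓ₁-regression LP, which can be assembled directly. A minimal sketch assuming scipy; the residual tolerance `tol` used for the final reliable/unreliable classification is an assumption of the example, not a parameter from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def l1_select(A, y, tol=0.5):
    """Sketch of the l1-relaxation idea: solve the LP
    min_x sum_i |a_i^T x - y_i| and mark measurements with small
    residuals as 'reliable'. Names and tol are illustrative."""
    m, n = A.shape
    # variables z = [x (n entries), s (m slack entries)]; minimize 1^T s
    c = np.concatenate([np.zeros(n), np.ones(m)])
    # encode -s <= A x - y <= s as two blocks of inequalities
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + m))
    x = res.x[:n]
    resid = np.abs(A @ x - y)
    return x, resid <= tol  # boolean mask of "reliable" measurements
```

Because the ℓ₁ objective is solved exactly as an LP, the approach inherits the fast computation and scalability noted in the abstract.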
Approaches for Outlier Detection in Sparse High-Dimensional Regression Models
Modern regression studies often encompass a very large number of potential predictors,
possibly larger than the sample size, and sometimes growing with the sample
size itself. This increases the chances that a substantial portion of the predictors
is redundant, as well as the risk of data contamination. Tackling these problems is
of utmost importance to facilitate scientific discoveries, since model estimates are
highly sensitive both to the choice of predictors and to the presence of outliers. In
this thesis, we contribute to this area considering the problem of robust model selection
in a variety of settings, where outliers may arise both in the response and
the predictors. Our proposals simplify model interpretation, guarantee predictive
performance, and allow us to study and control the influence of outlying cases on
the fit.
First, we consider the co-occurrence of multiple mean-shift and variance-inflation
outliers in low-dimensional linear models. We rely on robust estimation techniques
to identify outliers of each type, exclude mean-shift outliers, and use restricted
maximum likelihood estimation to down-weight and accommodate variance-inflation
outliers into the model fit. Second, we extend our setting to high-dimensional linear
models. We show that mean-shift and variance-inflation outliers can be modeled as
additional fixed and random components, respectively, and evaluated independently.
Specifically, we perform feature selection and mean-shift outlier detection through
a robust class of nonconcave penalization methods, and variance-inflation outlier
detection through the penalization of the restricted posterior mode. The resulting
approach satisfies a robust oracle property for feature selection in the presence of
data contamination (which allows the number of features to increase exponentially
with the sample size) and detects truly outlying cases of each type with asymptotic
probability one. This provides an optimal trade-off between a high breakdown point
and efficiency. Third, focusing on high-dimensional linear models affected by
mean-shift outliers, we develop a general framework in which L0-constraints coupled with
mixed-integer programming techniques are used to perform simultaneous feature
selection and outlier detection with provably optimal guarantees. In particular,
we provide necessary and sufficient conditions for a robustly strong oracle property,
where again the number of features can increase exponentially with the sample size,
and prove optimality for parameter estimation and the resulting breakdown point.
Finally, we consider generalized linear models and rely on logistic slippage to perform
outlier detection and removal in binary classification. Here we use L0-constraints
and mixed-integer conic programming techniques to solve the underlying double
combinatorial problem of feature selection and outlier detection, and the framework
allows us again to pursue optimality guarantees.
For all the proposed approaches, we also provide computationally lean heuristic
algorithms, tuning procedures, and diagnostic tools which help to guide the analysis.
We consider several real-world applications, including the study of the relationships
between childhood obesity and the human microbiome, and of the main drivers of
honey bee loss. All methods developed and data used, as well as the source code to
replicate our analyses, are publicly available.
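The mean-shift outlier model underlying the second contribution (outliers as additional fixed components) can be illustrated with a convex ℓ₁ surrogate in place of the nonconcave penalties used in the thesis. In the sketch below, the alternating scheme and the value of `lam` are assumptions of the example; nonzero entries of `gamma` flag mean-shift outliers.

```python
import numpy as np

def mean_shift_outliers(X, y, lam=3.0, n_iter=100):
    """Illustrative convex surrogate for y = X beta + gamma + noise
    with sparse gamma: alternate an OLS step for beta with
    soft-thresholding for gamma (block coordinate descent on
    0.5*||y - X beta - gamma||^2 + lam*||gamma||_1)."""
    gamma = np.zeros(len(y))
    for _ in range(n_iter):
        # beta-step: least squares on the outlier-corrected response
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        r = y - X @ beta
        # gamma-step: soft-threshold the residuals at lam
        gamma = np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)
    return beta, gamma
```

Replacing the ℓ₁ penalty on `gamma` with an L0 constraint recovers the combinatorial formulation that the thesis attacks with mixed-integer programming.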
Exact Combinatorial Optimization with Graph Convolutional Neural Networks
Combinatorial optimization problems are typically tackled by the branch-and-bound paradigm. We propose to learn a variable selection policy for branch-and-bound in mixed-integer linear programming, by imitation learning on a diversified variant of the strong branching expert rule. We encode states as bipartite graphs and parameterize the policy as a graph convolutional neural network. Experiments on a series of synthetic problems demonstrate that our approach produces policies that can improve upon expert-designed branching rules on large problems, and generalize to instances significantly larger than seen during training.
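The strong branching rule that the learned policy imitates scores each fractional variable by solving the two child LP relaxations it would create. A minimal minimization-form sketch using scipy's LP solver; the product scoring rule and the `eps` safeguard are common conventions, and the exact diversified variant used in the paper may differ.

```python
import numpy as np
from scipy.optimize import linprog

def strong_branching_scores(c, A_ub, b_ub, bounds, x_lp, frac_idx, eps=1e-6):
    """Sketch of strong branching: for each fractional variable j,
    solve the down-child (x_j <= floor) and up-child (x_j >= ceil)
    LPs and score j by the product of objective degradations."""
    base = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).fun
    scores = {}
    for j in frac_idx:
        degr = []
        for child in [(bounds[j][0], np.floor(x_lp[j])),   # down-branch
                      (np.ceil(x_lp[j]), bounds[j][1])]:   # up-branch
            bb = list(bounds)
            bb[j] = child
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bb)
            degr.append(res.fun - base if res.status == 0 else np.inf)
        scores[j] = max(degr[0], eps) * max(degr[1], eps)
    return scores
```

Computing these scores requires two LP solves per candidate at every node, which is exactly the cost that motivates learning a cheap imitation of the rule.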
Robust estimation, regression and ranking with applications in portfolio optimization
Thesis (Ph.D.), Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2009. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 108-112). By Tri-Dung Nguyen.
Classical methods of maximum likelihood and least squares rely a great deal on the correctness of the model assumptions. Since these assumptions are only approximations of reality, many robust statistical methods have been developed to produce estimators that are robust against deviations from the model assumptions. Unfortunately, these techniques have very high computational complexity, which prevents their application to large-scale problems. We present computationally efficient methods for robust mean-covariance estimation and robust linear regression using special mathematical programming models and semi-definite programming (SDP). In the robust covariance estimation problem, we design an optimization model with a loss function on the weighted Mahalanobis distances and show that the problem is equivalent to a system of equations that can be solved using the Newton-Raphson method. The problem can also be transformed into an SDP problem, which lets us flexibly incorporate prior beliefs into the estimators without much increase in computational complexity. The robust regression problem is often formulated as the least trimmed squares (LTS) regression problem, where we want to find the best subset of observations with the smallest sum of squared residuals. We show the LTS problem is equivalent to a concave minimization problem, which is very hard to solve. We resolve this difficulty by introducing the "maximum trimmed squares" problem, which finds the worst subset of observations. This problem can be transformed into an SDP problem that can be solved efficiently while still guaranteeing that we can identify outliers. In addition, we model the robust ranking problem as a mixed integer minimax problem where the ranking lies in a discrete uncertainty set. We use mixed integer programming methods, specifically column generation and network flows, to solve the robust ranking problem. To illustrate the power of these robust methods, we apply them to the mean-variance portfolio optimization problem in order to incorporate estimation errors into the model.
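The LTS subset search described above, finding the h observations with the smallest sum of squared residuals, is commonly approximated by concentration steps from random starts (in the spirit of FAST-LTS). This is a hedged sketch of that standard heuristic, not the SDP reformulation proposed in the thesis; all parameter values are illustrative.

```python
import numpy as np

def lts_concentration(X, y, h, n_starts=20, n_steps=10, seed=0):
    """Approximate least trimmed squares: from random size-h subsets,
    repeatedly refit and keep the h observations with the smallest
    squared residuals, returning the best fit found."""
    rng = np.random.default_rng(seed)
    n = len(y)
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        for _ in range(n_steps):
            # concentration step: fit on subset, then re-select the
            # h observations that the fit explains best
            beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
            r2 = (y - X @ beta) ** 2
            idx = np.argsort(r2)[:h]
        obj = np.sort(r2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta
```

Each concentration step never increases the trimmed objective, which is why a handful of steps per random start usually suffices.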
Multi-criteria optimization in regression
In this paper, we consider standard as well as instrumental variables regression. Specification problems related to autocorrelation, heteroskedasticity, neglected non-linearity, unsatisfactory out-of-sample performance, and endogeneity can be addressed in the context of multi-criteria optimization. The new technique performs well: it minimizes all these problems simultaneously and eliminates them for the most part. Markov Chain Monte Carlo techniques are used to perform the computations. An empirical application to NASDAQ returns is provided.