6,776 research outputs found
A Fast Algorithm for Robust Regression with Penalised Trimmed Squares
The presence of groups of high-leverage outliers makes linear
regression a difficult problem due to the masking effect. The available
high-breakdown estimators based on Least Trimmed Squares (LTS) often do not
succeed in detecting masked high-leverage outliers in finite samples.
An alternative to the LTS estimator, called the Penalised Trimmed Squares (PTS)
estimator, was introduced by the authors in \cite{ZiouAv:05,ZiAvPi:07} and
appears to be less sensitive to the masking problem. This estimator is defined
by a Quadratic Mixed Integer Programming (QMIP) problem whose objective
function includes a penalty cost for each observation, serving as an upper
bound on the residual error for any feasible regression line. Since PTS does
not require presetting the number of outliers to delete from the data set, it
achieves better efficiency than other estimators. However, due to the high
computational complexity of the resulting QMIP problem, computing exact
solutions for moderately large regression problems is infeasible.
In this paper we further establish the theoretical properties of the PTS
estimator, such as a high breakdown point and efficiency, and propose an
approximate algorithm called Fast-PTS to compute the PTS estimator for large
data sets efficiently. Extensive computational experiments on sets of benchmark
instances with varying degrees of outlier contamination indicate that the
proposed algorithm performs well in identifying groups of high-leverage
outliers in reasonable computational time.
Comment: 27 pages
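To make the penalty idea concrete, here is a minimal numerical sketch (a simplified reading of the capped-residual objective, not the authors' QMIP formulation or the Fast-PTS algorithm; the function name and penalty value are illustrative assumptions): for a fixed candidate coefficient vector, each observation contributes its squared residual capped at a penalty cost, so a point whose residual exceeds the cap is deleted at a fixed price.

```python
import numpy as np

# Sketch of a penalised-trimming objective: each observation contributes
# min(residual^2, penalty), so deleting a point costs at most its penalty.
def pts_objective(beta, X, y, penalty):
    r2 = (y - X @ beta) ** 2
    return float(np.sum(np.minimum(r2, penalty)))

# Toy data: y = 1 + 2x with one gross outlier at the last point.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = 1.0 + 2.0 * np.arange(6.0)
y[5] = 100.0

beta_true = np.array([1.0, 2.0])                  # fits the five clean points exactly
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # dragged toward the outlier

# The true line pays only the capped cost of the outlier, so it scores
# lower than the contaminated least-squares fit.
cost_true = pts_objective(beta_true, X, y, penalty=4.0)
cost_ls = pts_objective(beta_ls, X, y, penalty=4.0)
```

Minimising this capped objective over the coefficients is the hard combinatorial part that the QMIP formulation solves exactly and Fast-PTS approximates; note that no outlier count has to be fixed in advance.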
SOCP relaxation bounds for the optimal subset selection problem applied to robust linear regression
This paper deals with the problem of finding the globally optimal subset of h
elements from a larger set of n elements in d space dimensions so as to
minimize a quadratic criterion, with a special emphasis on applications to
computing the Least Trimmed Squares Estimator (LTSE) for robust regression. The
computation of the LTSE is a challenging subset selection problem involving a
nonlinear program with continuous and binary variables, linked in a highly
nonlinear fashion. The selection of a globally optimal subset using the branch
and bound (BB) algorithm is limited to problems in very low dimension,
typically d < 5, as the complexity of the problem increases exponentially with d.
We introduce a bold pruning strategy in the BB algorithm that results in a
significant reduction in computing time, at the price of a negligible loss of
accuracy. The novelty of our algorithm is that the bounds at the nodes of the
BB tree come from pseudo-convexifications derived using a linearization
technique with approximate bounds for the nonlinear terms. The approximate
bounds are computed by solving an auxiliary semidefinite optimization problem.
We show through a computational study that our algorithm performs well on a
wide set of the most difficult instances of the LTSE problem.
Comment: 12 pages, 3 figures, 2 tables
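For intuition about the underlying subset-selection problem, tiny instances can be solved by brute force (a sketch of the combinatorial problem itself, not the paper's BB algorithm or its SOCP-based bounds): enumerate every h-subset, fit ordinary least squares on each, and keep the subset with the smallest residual sum of squares.

```python
import numpy as np
from itertools import combinations

def exact_lts(X, y, h):
    """Globally optimal LTS fit by exhaustive enumeration of h-subsets.
    Feasible only for tiny n; branch and bound prunes this search tree."""
    best_rss, best_beta, best_idx = np.inf, None, None
    for idx in combinations(range(len(y)), h):
        idx = list(idx)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        rss = float(np.sum((y[idx] - X[idx] @ beta) ** 2))
        if rss < best_rss:
            best_rss, best_beta, best_idx = rss, beta, idx
    return best_rss, best_beta, best_idx

# Toy data: a clean line plus one gross outlier; h = n - 1 trims it away.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = 1.0 + 2.0 * np.arange(6.0)
y[5] = 100.0
rss, beta, idx = exact_lts(X, y, h=5)
```

The loop visits C(n, h) subsets, which is exactly the exponential growth that pruning and relaxation bounds are designed to tame.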
Robust Sparse Canonical Correlation Analysis
Canonical correlation analysis (CCA) is a multivariate statistical method
which describes the associations between two sets of variables. The objective
is to find linear combinations of the variables in each data set having maximal
correlation. This paper discusses a method for Robust Sparse CCA. Sparse
estimation produces canonical vectors with some of their elements estimated as
exactly zero. As such, their interpretability is improved. We also robustify
the method such that it can cope with outliers in the data. To estimate the
canonical vectors, we convert the CCA problem into an alternating regression
framework, and use the sparse Least Trimmed Squares estimator. We illustrate
the good performance of the Robust Sparse CCA method in several simulation
studies and two real data examples.
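The alternating-regression idea can be sketched as follows (a plain least-squares version on made-up data; the paper's method substitutes the sparse Least Trimmed Squares estimator for each regression step to gain sparsity and robustness):

```python
import numpy as np

def cca_alternating(X, Y, n_iter=100, seed=0):
    """First canonical pair via alternating least-squares regressions:
    each step regresses one set's scores on the other set's variables."""
    X = X - X.mean(axis=0)                  # CCA works on centred data
    Y = Y - Y.mean(axis=0)
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(Y.shape[1])
    for _ in range(n_iter):
        a, *_ = np.linalg.lstsq(X, Y @ b, rcond=None)
        a /= np.linalg.norm(X @ a)          # normalise the canonical variate
        b, *_ = np.linalg.lstsq(Y, X @ a, rcond=None)
        b /= np.linalg.norm(Y @ b)
    return a, b

# Two data sets sharing one latent variable z, plus pure-noise columns.
rng = np.random.default_rng(1)
z = rng.standard_normal(500)
X = np.column_stack([z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)])
Y = np.column_stack([z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)])
a, b = cca_alternating(X, Y)
r = np.corrcoef(X @ a, Y @ b)[0, 1]        # estimated canonical correlation
```

Replacing each `lstsq` call with a sparse LTS regression is what yields canonical vectors that are both interpretable (exact zeros) and resistant to outliers.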
Evolutionary algorithms for robust methods
A drawback of robust statistical techniques is the increased computational effort often needed compared to non-robust methods. Robust estimators possessing the exact fit property, for example, are NP-hard to compute. This means that, under the widely believed assumption that the computational complexity classes NP and P are not equal, there is no hope of computing exact solutions for large high-dimensional data sets. To tackle this problem, search heuristics are used to compute NP-hard estimators in high dimensions. Here, an evolutionary algorithm that is applicable to different robust estimators is presented. Further, variants of this evolutionary algorithm for selected estimators, most prominently least trimmed squares and least median of squares, are introduced and shown to outperform existing popular search heuristics in difficult data situations. The results increase the applicability of robust methods and underline the usefulness of evolutionary computation for computational statistics.
Keywords: evolutionary algorithms, robust regression, least trimmed squares (LTS), least median of squares (LMS), least quantile of squares (LQS), least quartile difference (LQD)
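As a hedged illustration of the approach (a toy genetic search over trimming subsets, not the paper's algorithm or its operators), a chromosome can encode which h observations the LTS fit keeps, with fitness equal to the trimmed residual sum of squares:

```python
import numpy as np

def evolve_lts(X, y, h, pop_size=20, generations=100, seed=0):
    """Toy evolutionary search for the LTS fit: a chromosome is an
    h-subset of observation indices; mutation swaps one index."""
    rng = np.random.default_rng(seed)
    n = len(y)

    def fitness(idx):
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        return float(np.sum((y[idx] - X[idx] @ beta) ** 2))

    pop = [rng.choice(n, size=h, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                  # selection: keep the fittest half
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent.copy()
            pos = rng.integers(h)              # mutation: swap one index out...
            outside = np.setdiff1d(np.arange(n), child)
            child[pos] = rng.choice(outside)   # ...for one currently excluded
            children.append(child)
        pop = survivors + children
    best = min(pop, key=fitness)
    beta, *_ = np.linalg.lstsq(X[best], y[best], rcond=None)
    return beta, np.sort(best)

# Line y = 1 + 2x with two gross outliers; keep h = 8 of n = 10 points.
X = np.column_stack([np.ones(10), np.arange(10.0)])
y = 1.0 + 2.0 * np.arange(10.0)
y[8], y[9] = 100.0, -50.0
beta, kept = evolve_lts(X, y, h=8)
```

Keeping the survivors unchanged (elitism) guarantees the best subset found is never lost, which matters because the trimmed objective has many poor local optima.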
Outlier Detection Using Nonconvex Penalized Regression
This paper studies the outlier detection problem from the point of view of
penalized regressions. Our regression model adds one mean shift parameter for
each of the data points. We then apply a regularization favoring a sparse
vector of mean shift parameters. The usual penalty yields a convex
criterion, but we find that it fails to deliver a robust estimator. The
penalty corresponds to soft thresholding. We introduce a thresholding (denoted
by ) based iterative procedure for outlier detection (-IPOD). A
version based on hard thresholding correctly identifies outliers on some hard
test problems. We find that -IPOD is much faster than iteratively
reweighted least squares for large data because each iteration costs at most
(and sometimes much less) avoiding an least squares estimate.
We describe the connection between -IPOD and -estimators. Our
proposed method has one tuning parameter with which to both identify outliers
and estimate regression coefficients. A data-dependent choice can be made based
on BIC. The tuned -IPOD shows outstanding performance in identifying
outliers in various situations in comparison to other existing approaches. This
methodology extends to high-dimensional modeling with , if both the
coefficient vector and the outlier pattern are sparse
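The mean-shift iteration admits a compact sketch (a naive version on synthetic data; the names and the full least-squares refit are illustrative assumptions, since the paper obtains its per-iteration cost with more careful linear algebra): alternate an OLS fit on the outlier-corrected response with thresholding of the residuals, then read outliers off the nonzero mean-shift entries.

```python
import numpy as np

def hard_threshold(r, lam):
    # Hard-thresholding rule: keep entries with |r| > lam, zero the rest.
    return np.where(np.abs(r) > lam, r, 0.0)

def theta_ipod(X, y, lam, n_iter=50):
    """Alternate: fit beta by OLS on the shifted response y - gamma,
    then re-estimate the mean-shift vector gamma by thresholding."""
    gamma = np.zeros_like(y)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        gamma = hard_threshold(y - X @ beta, lam)
    return beta, gamma

# Line y = 1 + 2x with a single mean-shifted observation at x = 4.
X = np.column_stack([np.ones(10), np.arange(10.0)])
y = 1.0 + 2.0 * np.arange(10.0)
y[4] += 50.0
beta, gamma = theta_ipod(X, y, lam=10.0)   # nonzero gamma flags outliers
```

The single tuning parameter here is the threshold, which per the abstract can be chosen in a data-dependent way via BIC.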
BSA - exact algorithm computing LTS estimate
The main result of this paper is a new exact algorithm computing the estimate
given by the Least Trimmed Squares (LTS). The algorithm works under very weak
assumptions. To prove that, we study the respective objective function using
basic techniques of analysis and linear algebra.
Comment: 18 pages, 1 figure