Robust nonparametric regression: review and practical considerations
Nonparametric regression models offer a way to understand and quantify
relationships between variables without having to identify an appropriate
family of possible regression functions. Although many estimation methods for
these models have been proposed in the literature, most of them can be highly
sensitive to the presence of a small proportion of atypical observations in the
training set. In this paper we review outlier robust estimation methods for
nonparametric regression models, paying particular attention to practical
considerations. Since outliers can also negatively influence the regression
estimator by affecting the selection of bandwidths or smoothing parameters, we
also discuss available robust alternatives for this task. Finally, since using
many of the ``classical'' nonparametric regression estimators (and their robust
counterparts) can be very challenging in settings with a moderate or large
number of explanatory variables, we review recent robust nonparametric
regression methods that scale well with a growing number of covariates.
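To illustrate the kind of outlier-robust smoother the review covers, here is a minimal sketch (not any specific estimator from the paper): a local-constant (Nadaraya-Watson) kernel regression fit, robustified by iteratively reweighted least squares with Huber weights and a MAD residual scale. All function names are for illustration only.

```python
import numpy as np

def huber_weights(r, c=1.345):
    """Huber psi(r)/r weights: 1 inside [-c, c], c/|r| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def robust_kernel_smoother(x, y, grid, h, n_iter=10):
    """Local-constant kernel fit, robustified by IRLS with Huber
    weights and a MAD residual scale."""
    fitted = np.empty(len(grid))
    for j, x0 in enumerate(grid):
        k = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights
        mu = np.sum(k * y) / np.sum(k)           # classical start
        for _ in range(n_iter):
            r = y - mu
            s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
            w = k * huber_weights(r / s)         # kernel times robustness weight
            mu = np.sum(w * y) / np.sum(w)
        fitted[j] = mu
    return fitted
```

On data with a few gross outliers, the reweighting step shrinks their influence on each local fit, while the classical estimator is pulled toward them.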
Uniform asymptotics for robust location estimates when the scale is unknown
Most asymptotic results for robust estimates rely on regularity conditions
that are difficult to verify in practice. Moreover, these results apply to
fixed distribution functions. In the robustness context the distribution of the
data remains largely unspecified and hence results that hold uniformly over a
set of possible distribution functions are of theoretical and practical
interest. Also, it is desirable to be able to determine the size of the set of
distribution functions where the uniform properties hold. In this paper we
study the problem of obtaining verifiable regularity conditions that suffice to
yield uniform consistency and uniform asymptotic normality for location robust
estimates when the scale of the errors is unknown.
We study M-location estimates calculated with an S-scale and we obtain
uniform asymptotic results over contamination neighborhoods. Moreover, we show
how to calculate the maximum size of the contamination neighborhoods where
these uniform results hold. There is a trade-off between the size of these
neighborhoods and the breakdown point of the scale estimate.

Comment: Published by the Institute of Mathematical Statistics
(http://www.imstat.org) in the Annals of Statistics
(http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/00905360400000054
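The flavour of the estimates studied can be sketched as follows. This illustration uses a Huber M-estimate of location with a normalized-MAD scale as a simple stand-in for the M-location/S-scale combination analyzed in the paper (the actual S-scale is more involved):

```python
import numpy as np

def m_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via IRLS; the error scale is
    estimated by the normalized MAD (an illustrative stand-in for
    the S-scale of the paper)."""
    mu = np.median(x)                                 # robust start
    s = 1.4826 * np.median(np.abs(x - np.median(x)))  # normalized MAD
    for _ in range(max_iter):
        r = (x - mu) / s
        w = np.where(np.abs(r) <= c, 1.0,
                     c / np.maximum(np.abs(r), 1e-12))
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu
```

Under 10% contamination at a distant value, this estimate stays near the centre of the clean data while the sample mean is dragged toward the contamination.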
A robust and sparse K-means clustering algorithm
In many situations where the interest lies in identifying clusters one might
expect that not all available variables carry information about these groups.
Furthermore, data quality (e.g. outliers or missing entries) might present a
serious and sometimes hard-to-assess problem for large and complex datasets. In
this paper we show that a small proportion of atypical observations might have
serious adverse effects on the solutions found by the sparse clustering
algorithm of Witten and Tibshirani (2010). We propose a robustification of
their sparse K-means algorithm based on the trimmed K-means algorithm of
Cuesta-Albertos et al. (1997). Our proposal is also able to handle datasets with
missing values. We illustrate the use of our method on microarray data for
cancer patients where we are able to identify strong biological clusters with a
much reduced number of genes. Our simulation studies show that, when there are
outliers in the data, our robust sparse K-means algorithm performs better than
other competing methods both in terms of the selection of features and also the
identified clusters. This robust sparse K-means algorithm is implemented in the
R package RSKC, which is publicly available from the CRAN repository.
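The trimming idea underlying the robustification can be sketched as follows: a Lloyd-type trimmed K-means in the spirit of Cuesta-Albertos et al. (1997), without the sparsity (variable-weighting) step that RSKC adds on top. This is an illustrative sketch, not the RSKC implementation.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, n_init=10, seed=0):
    """Trimmed K-means sketch: at each Lloyd step, the ceil(alpha * n)
    points farthest from their nearest centre are set aside before the
    centres are recomputed; best of n_init random starts is kept."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_trim = int(np.ceil(alpha * n))
    best_cost, best = np.inf, None
    for _ in range(n_init):
        centers = X[rng.choice(n, size=k, replace=False)].astype(float)
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            nearest = d.min(axis=1)
            keep = np.argsort(nearest)[: n - n_trim]   # retain closest points
            for j in range(k):
                pts = X[keep][labels[keep] == j]
                if len(pts):
                    centers[j] = pts.mean(axis=0)      # update from inliers only
        cost = nearest[keep].sum()                     # trimmed within-cluster cost
        if cost < best_cost:
            best_cost, best = cost, (centers, labels, keep)
    centers, labels, keep = best
    trimmed = np.ones(n, dtype=bool)
    trimmed[keep] = False
    return centers, labels, trimmed
```

Because the farthest points are excluded before each centre update, a small group of gross outliers ends up flagged as trimmed rather than distorting the cluster centres.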
S-estimation for penalized regression splines
This paper is about S-estimation for penalized regression splines. Penalized regression splines are among the most widely used methods for smoothing noisy data. The estimation method used for fitting such a penalized regression spline model is usually based on least squares, which is known to be sensitive to outlying observations. In real-world applications, outliers are quite commonly observed. There are several robust estimation methods that take outlying observations into account. We define and study S-estimators for penalized regression spline models, replacing the least squares estimation method for penalized regression splines by a suitable S-estimation method. By keeping the modeling by means of splines and by keeping the penalty term, though using S-estimators instead of least squares estimators, we arrive at an estimation method that is both robust and flexible enough to capture non-linear trends in the data. Simulated data and a real data example are used to illustrate the effectiveness of the procedure.

Keywords: M-estimator; penalized least squares method; penalized regression spline; S-estimator; smoothing parameter
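The idea of replacing penalized least squares by a robust criterion can be sketched as follows. For simplicity this uses a linear truncated-power basis with a ridge penalty on the hinge coefficients, and an M-type IRLS step with Tukey-bisquare weights and a MAD scale as a stand-in for the full S-estimation procedure studied in the paper; all names and tuning choices here are illustrative.

```python
import numpy as np

def pspline_basis(x, knots):
    """Linear truncated-power basis: intercept, slope, one hinge per knot."""
    cols = [np.ones_like(x), x] + [np.clip(x - k, 0, None) for k in knots]
    return np.column_stack(cols)

def robust_pspline(x, y, knots, lam=1.0, c=4.685, n_iter=20):
    """Penalized spline fit where the least-squares criterion is replaced
    by IRLS with Tukey-bisquare weights and a MAD residual scale (a
    simplified M-type stand-in for the S-estimator)."""
    B = pspline_basis(x, knots)
    P = np.zeros((B.shape[1], B.shape[1]))
    P[2:, 2:] = np.eye(len(knots))       # penalize only the hinge coefficients
    beta = np.linalg.solve(B.T @ B + lam * P, B.T @ y)   # LS start
    for _ in range(n_iter):
        r = y - B @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        u = r / (c * s)
        w = np.where(np.abs(u) < 1, (1 - u ** 2) ** 2, 0.0)  # bisquare weights
        W = B * w[:, None]
        beta = np.linalg.solve(W.T @ B + lam * P, W.T @ y)
    return beta, B
```

The bisquare weights give zero weight to gross outliers, so the fitted curve tracks the bulk of the data while the penalty term still controls its roughness.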
RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm
Witten and Tibshirani (2010) proposed an algorithm to simultaneously find clusters and select clustering variables, called sparse K-means (SK-means). SK-means is particularly useful when the dataset has a large fraction of noise variables (that is, variables carrying no useful information to separate the clusters). SK-means works very well on clean and complete data, but it can handle neither outliers nor missing data. To remedy these problems we introduce a new robust and sparse K-means clustering algorithm implemented in the R package RSKC. We demonstrate the use of our package on four datasets. We also conduct a Monte Carlo study to compare the performance of RSK-means and SK-means regarding the selection of important variables and the identification of clusters. Our simulation study shows that RSK-means performs well on clean data and better than SK-means and other competitors on outlier-contaminated data.