
    Robust nonparametric regression: review and practical considerations

    Nonparametric regression models offer a way to understand and quantify relationships between variables without having to identify an appropriate family of possible regression functions. Although many estimation methods for these models have been proposed in the literature, most of them can be highly sensitive to the presence of even a small proportion of atypical observations in the training set. In this paper we review outlier-robust estimation methods for nonparametric regression models, paying particular attention to practical considerations. Since outliers can also negatively influence the regression estimator by affecting the selection of bandwidths or smoothing parameters, we also discuss available robust alternatives for this task. Finally, since using many of the "classical" nonparametric regression estimators (and their robust counterparts) can be very challenging in settings with a moderate or large number of explanatory variables, we review recent robust nonparametric regression methods that scale well with a growing number of covariates.
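As a minimal illustration of the kind of estimator this review covers, the sketch below computes a kernel-weighted Huber M-estimate of the regression function at a point, solved by iteratively reweighted least squares. The Gaussian kernel, the MAD residual scale, and the tuning constant c = 1.345 are generic textbook choices for illustration, not the specific proposals discussed in the paper.

```python
import numpy as np

def local_m_estimate(x0, x, y, h, c=1.345, n_iter=50, tol=1e-8):
    """Kernel-weighted Huber M-estimate of the regression function at x0,
    computed by iteratively reweighted least squares (IRLS)."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)            # Gaussian kernel weights
    s = 1.4826 * np.median(np.abs(y - np.median(y)))  # MAD residual scale
    m = np.median(y)                                  # robust starting value
    for _ in range(n_iter):
        r = (y - m) / s
        # Huber IRLS weights: 1 for small residuals, c/|r| beyond c
        wi = w * np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
        m_new = np.sum(wi * y) / np.sum(wi)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

# Noisy sine curve with 5% gross outliers
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(200)
y[::20] += 8.0
grid = np.linspace(0.1, 0.9, 5)
fit = [local_m_estimate(t, x, y, h=0.1) for t in grid]
```

Because the Huber weights bound each observation's influence, the gross outliers barely move the local fit, whereas an unweighted local average would be pulled towards them.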

    Uniform asymptotics for robust location estimates when the scale is unknown

    Most asymptotic results for robust estimates rely on regularity conditions that are difficult to verify in practice. Moreover, these results apply to fixed distribution functions. In the robustness context the distribution of the data remains largely unspecified and hence results that hold uniformly over a set of possible distribution functions are of theoretical and practical interest. Also, it is desirable to be able to determine the size of the set of distribution functions where the uniform properties hold. In this paper we study the problem of obtaining verifiable regularity conditions that suffice to yield uniform consistency and uniform asymptotic normality for location robust estimates when the scale of the errors is unknown. We study M-location estimates calculated with an S-scale and we obtain uniform asymptotic results over contamination neighborhoods. Moreover, we show how to calculate the maximum size of the contamination neighborhoods where these uniform results hold. There is a trade-off between the size of these neighborhoods and the breakdown point of the scale estimate.
    Comment: Published by the Institute of Mathematical Statistics (http://www.imstat.org) in the Annals of Statistics (http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/00905360400000054
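The class of estimates studied here can be sketched as follows: an S-scale solves mean(rho(r/s)) = b for a bounded rho function, and the location is then an M-estimate computed with that scale held fixed. The constants below (bisquare with c = 1.547 and b = 0.5, which give a 50% breakdown point, and Huber c = 1.345 for the location step) are standard reference values, not taken from the paper.

```python
import numpy as np

def rho_bisquare(u, c=1.547):
    """Tukey bisquare rho, normalized so that rho -> 1 for |u| >= c."""
    v = np.minimum(np.abs(u) / c, 1.0)
    return 1.0 - (1.0 - v ** 2) ** 3

def s_scale(r, b=0.5, c=1.547, n_iter=100, tol=1e-10):
    """S-scale: solves mean(rho(r / s)) = b by fixed-point iteration.
    b = 0.5 with c = 1.547 gives a 50% breakdown point."""
    s = 1.4826 * np.median(np.abs(r))  # MAD as starting value
    for _ in range(n_iter):
        s_new = s * np.sqrt(np.mean(rho_bisquare(r / s, c)) / b)
        if abs(s_new - s) < tol * s:
            break
        s = s_new
    return s

def m_location(x, c=1.345, n_iter=100, tol=1e-10):
    """Huber M-estimate of location with the scale fixed at an S-scale."""
    mu = np.median(x)
    s = s_scale(x - mu)
    for _ in range(n_iter):
        r = (x - mu) / s
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))  # Huber weights
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu, s

# 10% of the observations replaced by gross outliers at 10
rng = np.random.default_rng(3)
x = rng.standard_normal(500)
x[:50] = 10.0
mu, s = m_location(x)
```

The trade-off mentioned in the abstract shows up here through c in the S-scale: larger c lowers the breakdown point but enlarges the contamination neighborhoods over which the uniform results can hold.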

    A robust and sparse K-means clustering algorithm

    In many situations where the interest lies in identifying clusters one might expect that not all available variables carry information about these groups. Furthermore, data quality (e.g. outliers or missing entries) might present a serious and sometimes hard-to-assess problem for large and complex datasets. In this paper we show that a small proportion of atypical observations might have serious adverse effects on the solutions found by the sparse clustering algorithm of Witten and Tibshirani (2010). We propose a robustification of their sparse K-means algorithm based on the trimmed K-means algorithm of Cuesta-Albertos et al. (1997). Our proposal is also able to handle datasets with missing values. We illustrate the use of our method on microarray data for cancer patients where we are able to identify strong biological clusters with a much reduced number of genes. Our simulation studies show that, when there are outliers in the data, our robust sparse K-means algorithm performs better than other competing methods both in terms of feature selection and in the identification of clusters. This robust sparse K-means algorithm is implemented in the R package RSKC which is publicly available from the CRAN repository.
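The trimming idea underlying the proposal can be sketched as plain trimmed K-means (without the variable-weighting step of the sparse algorithm): at each iteration, the fraction alpha of points farthest from their nearest centre is set aside before the centres are recomputed. The function below is a simplified illustration, not the RSKC implementation; the `init` argument and all defaults are choices made for this sketch.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=100, init=None, seed=0):
    """Trimmed K-means (in the spirit of Cuesta-Albertos et al., 1997):
    the ceil(alpha * n) points farthest from their nearest centre are
    discarded before the centres are updated."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_trim = int(np.ceil(alpha * n))
    idx = rng.choice(n, size=k, replace=False) if init is None else np.asarray(init)
    centers = X[idx].astype(float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
        labels = d.argmin(axis=1)
        keep = np.argsort(d.min(axis=1))[: n - n_trim]   # untrimmed points
        new_centers = np.array([
            X[keep][labels[keep] == j].mean(axis=0)
            if np.any(labels[keep] == j) else centers[j]  # keep empty-cluster centre
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    trimmed = np.setdiff1d(np.arange(n), keep)
    return centers, labels, trimmed

# Two clean clusters plus 20 gross outliers
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(10, 0.5, size=(100, 2)),
               rng.normal((50, -50), 0.5, size=(20, 2))])
centers, labels, trimmed = trimmed_kmeans(X, k=2, alpha=0.1, init=[0, 100])
```

With alpha chosen at least as large as the contamination fraction, the outliers end up in the trimmed set and the centre estimates stay close to the clean cluster means.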

    S-estimation for penalized regression splines.

    This paper is about S-estimation for penalized regression splines. Penalized regression splines are currently among the most widely used methods for smoothing noisy data. The estimation method used for fitting such a penalized regression spline model is usually based on least squares, which is known to be sensitive to outlying observations. In real-world applications, outliers are quite commonly observed. There are several robust estimation methods that take outlying observations into account. We define and study S-estimators for penalized regression spline models, replacing the least squares criterion by a suitable S-estimation criterion. By keeping the modeling by means of splines and by keeping the penalty term, while using S-estimators instead of least squares estimators, we arrive at an estimation method that is both robust and flexible enough to capture non-linear trends in the data. Simulated data and a real data example are used to illustrate the effectiveness of the procedure.
    Keywords: M-estimator; penalized least squares method; penalized regression spline; S-estimator; smoothing parameter.
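A simplified version of the idea, replacing the least squares criterion of a penalized regression spline by a robust one, can be sketched as follows. The sketch uses a Huber loss solved by IRLS with a MAD residual scale; the truncated power basis, ridge-type penalty, and Huber/MAD choices are simplifications made here for illustration (the paper uses a genuine S-estimation criterion).

```python
import numpy as np

def tp_basis(x, knots, degree=3):
    """Truncated power basis: 1, x, ..., x^p, (x - k)_+^p for each knot."""
    cols = [x ** j for j in range(degree + 1)]
    cols += [np.maximum(x - k, 0.0) ** degree for k in knots]
    return np.column_stack(cols)

def robust_pspline(x, y, knots, lam=1.0, degree=3, c=1.345, n_iter=50):
    """Penalized-spline fit with a Huber loss solved by IRLS.
    The ridge penalty acts on the knot coefficients only; the residual
    scale is a MAD (a simple stand-in for a full S-scale)."""
    B = tp_basis(x, knots, degree)
    p = B.shape[1]
    D = np.zeros((p, p))
    D[degree + 1:, degree + 1:] = np.eye(p - degree - 1)  # penalize knot terms
    beta = np.linalg.solve(B.T @ B + lam * D, B.T @ y)     # least squares start
    for _ in range(n_iter):
        r = y - B @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r)))   # MAD residual scale
        u = r / s
        w = np.minimum(1.0, c / np.maximum(np.abs(u), 1e-12))  # Huber weights
        BW = B * w[:, None]
        beta_new = np.linalg.solve(BW.T @ B + lam * D, BW.T @ y)
        if np.allclose(beta_new, beta, atol=1e-10):
            break
        beta = beta_new
    return beta, B

# Noisy sine curve with gross outliers
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 300)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(300)
y[::15] += 6.0
knots = np.linspace(0.1, 0.9, 15)
beta, B = robust_pspline(x, y, knots, lam=1e-4)
fit = B @ beta
```

The spline basis and penalty term are kept exactly as in the classical fit; only the loss changes, which is the structural point the abstract makes.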

    RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm

    Witten and Tibshirani (2010) proposed an algorithm to simultaneously find clusters and select clustering variables, called sparse K-means (SK-means). SK-means is particularly useful when the dataset has a large fraction of noise variables (that is, variables without useful information to separate the clusters). SK-means works very well on clean and complete data but cannot handle outliers or missing data. To remedy these problems we introduce a new robust and sparse K-means clustering algorithm implemented in the R package RSKC. We demonstrate the use of our package on four datasets. We also conduct a Monte Carlo study to compare the performance of RSK-means and SK-means regarding the selection of important variables and the identification of clusters. Our simulation study shows that RSK-means performs well on clean data, and better than SK-means and other competitors on outlier-contaminated data.