A fast algorithm for robust constrained clustering
The application of "concentration" steps is the main principle behind Forgy's
k-means algorithm and Rousseeuw and van Driessen's fast-MCD algorithm.
Despite this coincidence, it is not straightforward to combine the two
algorithms into a clustering method that is not severely affected
by a few outlying observations and is also able to cope with non-spherical clusters.
A sensible way of combining them relies on controlling the relative cluster
scatters through constrained concentration steps. With this idea in mind,
a new algorithm for the TCLUST robust clustering procedure is proposed
which implements such constrained concentration steps in a computationally
efficient fashion.
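The constrained concentration step described above can be sketched as follows. This is a simplified illustration, not the paper's actual implementation: the eigenvalue rule shown is a crude clip toward the largest eigenvalue, whereas TCLUST optimizes the truncation threshold; all function names are ours.

```python
import numpy as np

def truncate_eigenvalues(scatters, c):
    """Constrain the cluster scatter eigenvalues so that the global
    max/min ratio is at most c, by clipping them into [max_eig/c, max_eig].
    (Crude stand-in: the real TCLUST rule optimizes the threshold.)"""
    decomp = [np.linalg.eigh(S) for S in scatters]
    hi = max(w.max() for w, _ in decomp)
    lo = hi / c
    return [V @ np.diag(np.clip(w, lo, hi)) @ V.T for w, V in decomp]

def concentration_step(X, centers, scatters, alpha, c=4.0):
    """One trimmed 'concentration' step: score each point by its smallest
    squared Mahalanobis distance to a center, discard the alpha fraction
    with the worst scores, refit means/scatters on the retained points,
    then enforce the eigenvalue constraint."""
    n, k = len(X), len(centers)
    d = np.empty((n, k))
    for j in range(k):
        inv = np.linalg.inv(scatters[j])
        diff = X - centers[j]
        d[:, j] = np.einsum('ij,jk,ik->i', diff, inv, diff)
    best = d.min(axis=1)
    keep = best <= np.quantile(best, 1 - alpha)   # trim worst alpha fraction
    labels = d.argmin(axis=1)
    new_c, new_S = [], []
    for j in range(k):
        Xj = X[keep & (labels == j)]
        new_c.append(Xj.mean(axis=0))
        new_S.append(np.cov(Xj.T) + 1e-6 * np.eye(X.shape[1]))
    return new_c, truncate_eigenvalues(new_S, c)
```

Iterating this step from several random starts, and keeping the best trimmed likelihood, is the general shape of such algorithms.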
Robustness and Outliers
Unexpected deviations from assumed models, as well as the presence of certain amounts of outlying data, are common in most practical statistical applications. This fact can lead to undesirable solutions when non-robust statistical techniques are applied. This is often the case in cluster analysis, too. The search for homogeneous groups with large heterogeneity between them can be spoiled by the lack of robustness of standard clustering methods. For instance, the presence of even a few outlying observations may result in heterogeneous clusters artificially joined together or in the detection of spurious clusters merely made up of outlying observations. In this chapter we analyze the effects of different kinds of outlying data in cluster analysis and explore several alternative methodologies designed to avoid or minimize their undesirable effects.
Funding: Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (programa de apoyo a proyectos de investigación, Ref. VA212U13).
Trimming Stability Selection increases variable selection robustness
Contamination can severely distort an estimator unless the estimation
procedure is suitably robust. This is a well-known issue and has been addressed
in Robust Statistics; however, the relation between contamination and distorted
variable selection has rarely been considered in the literature. As for variable
selection, many methods for sparse model selection have been proposed,
including Stability Selection, a meta-algorithm that wraps a variable
selection algorithm in order to immunize it against particular data
configurations. We introduce the variable selection breakdown point, which
quantifies the number of cases (resp. cells) that have to be contaminated
so that no relevant variable is detected. We show that particular outlier
configurations can completely mislead model selection and argue why even
cell-wise robust methods cannot fix this problem. We combine the variable
selection breakdown point with resampling, resulting in the Stability Selection
breakdown point, which quantifies the robustness of Stability Selection. We
propose a trimmed Stability Selection which aggregates only the models with the
lowest in-sample losses so that, heuristically, models computed on heavily
contaminated resamples are trimmed away. An extensive simulation study
with non-robust regression and classification algorithms, as well as with Sparse
Least Trimmed Squares, reveals both the potential of our approach to boost
model selection robustness and the fragility of variable selection using
non-robust algorithms, even for an extremely small cell-wise contamination
rate.
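The trimming idea can be sketched in a few lines, assuming a user-supplied `select_fn` that returns a support mask together with its in-sample loss (the interface and all names are hypothetical, not the paper's):

```python
import numpy as np

def trimmed_stability_selection(X, y, select_fn, B=50, subsample=0.5,
                                keep_frac=0.2, pi_thr=0.5, seed=0):
    """Sketch of trimmed Stability Selection. `select_fn(X, y)` must
    return (support_mask, in_sample_loss). Models fitted on heavily
    contaminated resamples tend to have large in-sample loss, so only
    the keep_frac fraction of best-loss models gets to vote."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = int(subsample * n)
    supports, losses = [], []
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=False)
        mask, loss = select_fn(X[idx], y[idx])
        supports.append(mask)
        losses.append(loss)
    supports, losses = np.array(supports), np.array(losses)
    keep = np.argsort(losses)[: int(keep_frac * B)]   # trim worst models
    freq = supports[keep].mean(axis=0)                # selection frequency
    return freq >= pi_thr
```

A plain Stability Selection would average over all B supports; the only change here is the loss-based trimming before aggregation.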
Robust fuzzy clustering for object recognition and classification of relational data
Prototype-based fuzzy clustering algorithms have the unique ability to partition the data while detecting multiple clusters simultaneously. However, since real data is often contaminated with noise, clustering methods need to be made robust to be useful in practice. This dissertation focuses on robust detection of multiple clusters from noisy range images for object recognition. Dave's noise clustering (NC) method has been shown to make prototype-based fuzzy clustering techniques robust. In this work, NC is generalized and the new NC membership is shown to be a product of the fuzzy c-means (FCM) membership and a robust M-estimator weight (or possibilistic membership). Thus the generalized NC approach is shown to have the partitioning ability of FCM and the robustness of M-estimators. Since the NC (or FCM) algorithms are based on a fixed-point iteration technique, they suffer from the problem of initialization. To overcome this problem, the sampling-based robust LMS algorithm is considered and extended to a fuzzy c-LMS algorithm for detecting multiple clusters. The concept of repeated evidence has been incorporated to increase the speed of the new approach. The main problem with the LMS approach is the need for ordering the distance data. To eliminate this problem, a novel sampling-based robust algorithm, called the NLS method, is proposed following the NC principle; it directly searches for clusters in the maximum density region of the range data without requiring the specification of the number of clusters.
The NC concept is also introduced to several fuzzy methods for robust classification of relational data for pattern recognition. This is also extended to non-Euclidean relational data.
The resulting algorithms are used for object recognition from range images as well as for identification of bottleneck parts while creating desegregated cells of machines/components in cellular manufacturing and group technology (GT) applications.
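The factorization mentioned above (NC membership as the product of an FCM-style membership and a robust weight) can be illustrated as follows. This is a sketch based on the standard noise-clustering membership with a single noise distance delta, not the dissertation's generalized version:

```python
import numpy as np

def nc_memberships(d2, delta2, m=2.0):
    """Noise-clustering-style memberships (sketch). d2: (n, k) squared
    distances to k prototypes; delta2: squared noise distance; m: fuzzifier.
    The NC membership u_ij = d_ij^(-1/(m-1)) / (sum_l d_il^(-1/(m-1)) + delta^(-2/(m-1)))
    factors exactly as (FCM partition membership) x (robust weight that
    shrinks toward 0 for points far from all prototypes, which the
    'noise cluster' absorbs)."""
    inv = d2 ** (-1.0 / (m - 1))
    fcm = inv / inv.sum(axis=1, keepdims=True)        # FCM membership
    s = inv.sum(axis=1)
    robust_w = s / (s + delta2 ** (-1.0 / (m - 1)))   # M-estimator-like weight
    return fcm * robust_w[:, None]
```

A point near one prototype keeps a membership close to 1; a point far from every prototype keeps FCM-style relative memberships, but its total membership is driven toward 0 by the robust weight.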
A data driven equivariant approach to constrained Gaussian mixture modeling
Maximum likelihood estimation of Gaussian mixture models with different
class-specific covariance matrices is known to be problematic. This is due to
the unboundedness of the likelihood, together with the presence of spurious
maximizers. Existing methods to bypass this obstacle are based on the fact that
unboundedness is avoided if the eigenvalues of the covariance matrices are
bounded away from zero. This can be done by imposing constraints on the
covariance matrices, i.e. by incorporating a priori information on the
covariance structure of the mixture components. The present work introduces a
constrained equivariant approach, where the class conditional covariance
matrices are shrunk towards a pre-specified matrix Psi. Data-driven choices of
the matrix Psi, when a priori information is not available, and the optimal
amount of shrinkage are investigated. The effectiveness of the proposal is
evaluated on the basis of a simulation study and an empirical example.
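The shrinkage idea can be sketched in a few lines. The convex-combination form and the fixed amount `lam` are illustrative assumptions; the paper investigates data-driven choices of both Psi and the shrinkage amount:

```python
import numpy as np

def shrink_covariances(S_list, Psi, lam):
    """Shrink each class-conditional covariance toward a target Psi
    (convex combination; lam in [0, 1] is the shrinkage amount). As long
    as Psi is positive definite, every shrunk matrix has eigenvalues
    bounded away from zero, which keeps the mixture likelihood bounded."""
    return [(1.0 - lam) * S + lam * Psi for S in S_list]
```

A natural data-driven default for Psi, when no prior information is available, would be something like the pooled within-class covariance, which preserves equivariance under affine transformations of the data.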
Advances in robust clustering methods with applications
Robust methods in statistics are mainly concerned with deviations from model assumptions.
As already pointed out in Huber (1981) and in Huber & Ronchetti (2009),
"these assumptions are not exactly true since they are just a mathematically
convenient rationalization of an often fuzzy knowledge or belief". For that reason "a
minor error in the mathematical model should cause only a small error in the final
conclusions". Nevertheless it is well known that many classical statistical procedures
are "excessively sensitive to seemingly minor deviations from the assumptions".
All statistical methods based on the minimization of the average square loss may
suffer from a lack of robustness. Illustrative examples of how outliers' influence may
completely alter the final results in regression analysis and in the linear model context are
provided in Atkinson & Riani (2012). A presentation of robust counterparts of
classical multivariate tools is provided in Farcomeni & Greco (2015).
The whole dissertation is focused on robust clustering models and the outline of the
thesis is as follows.
Chapter 1 is focused on robust methods. Robust methods are aimed at increasing
the efficiency when contamination appears in the sample. Thus a general definition
of such a (quite general) concept is required. To do so we give a brief account of
some kinds of contamination we can encounter in real data applications. Secondly
we introduce the "Spurious outliers model" (Gallegos & Ritter 2009a), which is the
cornerstone of robust model-based clustering models. Such a model is aimed at
formalizing clustering problems when one has to deal with contaminated samples.
The assumption standing behind the "Spurious outliers model" is that two different
random mechanisms generate the data: one is assumed to generate the "clean"
part, while the other one generates the contamination. This idea is actually very
common within robust models such as the "Tukey-Huber model", which is introduced in
Subsection 1.2.2. Outliers' recognition, especially in the multivariate case, plays a
key role and is not straightforward as the dimensionality of the data increases. An
overview of the most widely used (robust) methods for outlier detection is provided
in Section 1.3. Finally, in Section 1.4, we provide a non-technical review of the
classical tools introduced in the Robust Statistics literature aimed at evaluating the robustness properties of a methodology.
Chapter 2 is focused on model-based clustering methods and their robustness properties.
Cluster analysis, "the art of finding groups in the data" (Kaufman & Rousseeuw
1990), is one of the most widely used tools within the unsupervised learning context.
A very popular method is the k-means algorithm (MacQueen et al. 1967), which is
based on minimizing the Euclidean distance of each observation from the estimated
clusters' centroids and is therefore affected by a lack of robustness. Indeed even a
single outlying observation may completely alter the centroids' estimation and simultaneously
provoke a bias in the standard errors' estimation. Clusters' contours may be
inflated and the "real" underlying clusterwise structure might be completely hidden.
A first attempt at robustifying the k-means algorithm appeared in Cuesta-Albertos
et al. (1997), where a trimming step is inserted in the algorithm in order to avoid
the outliers' excessive influence.
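The trimming step inserted by Cuesta-Albertos et al. (1997) can be roughly sketched as follows (a simplified Euclidean version; function and variable names are ours, not the paper's):

```python
import numpy as np

def trimmed_kmeans_step(X, centers, alpha):
    """One step of trimmed k-means (sketch): assign each point to its
    nearest center, discard the alpha fraction of points with the largest
    nearest-center distances, and recompute each center on the retained
    points only, so outliers never pull the centroids."""
    d = np.linalg.norm(X[:, None, :] - np.array(centers)[None], axis=2)
    nearest = d.min(axis=1)
    keep = nearest <= np.quantile(nearest, 1 - alpha)   # trim the worst alpha
    labels = d.argmin(axis=1)
    return [X[keep & (labels == j)].mean(axis=0) for j in range(len(centers))]
```

Iterated to convergence (from multiple random starts), the kept fraction concentrates on the bulk of the data while the trimmed fraction absorbs the outliers.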
It should be noticed that the k-means algorithm is efficient for detecting spherical homoscedastic
clusters. Whenever more flexible shapes are desired the procedure becomes
inefficient. In order to overcome this problem Gaussian model-based clustering
methods should be adopted instead of the k-means algorithm. An example, among
the other proposals described in Chapter 2, is the TCLUST methodology (García-
Escudero et al. 2008), which is the cornerstone of the thesis. Such a methodology is
based on two main characteristics: trimming a fixed proportion of observations and
imposing a constraint on the estimates of the scatter matrices. As will be explained
in Chapter 2, trimming is used to protect the results from outliers' influence,
while the constraint is imposed because spurious maximizers may completely spoil the
solution.
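Why the scatter constraint is needed can be seen numerically: a component fitted to a handful of nearly identical points has a near-singular scatter matrix, so its Gaussian density, and with it the likelihood, grows without bound. A small sketch (the eigenvalue floor used here is a simplified stand-in for the TCLUST eigenvalue-ratio constraint):

```python
import numpy as np

def gaussian_logpdf(x, mu, S):
    """Log-density of N(mu, S) at x."""
    p = len(mu)
    diff = x - mu
    return -0.5 * (p * np.log(2 * np.pi)
                   + np.log(np.linalg.det(S))
                   + diff @ np.linalg.solve(S, diff))

def clip_eigenvalues(S, floor):
    """Simplified stand-in for the scatter constraint: bound the
    eigenvalues of S below by `floor`, keeping the eigenvectors."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.maximum(w, floor)) @ V.T
```

As the scatter degenerates, the log-density at its own center diverges to infinity (a "spurious maximizer"); after flooring the eigenvalues it stays bounded, which is exactly what the constraint on the scatter estimates buys.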
Chapters 3 and 4 are mainly focused on extending the TCLUST methodology.
In particular, in Chapter 3, we introduce a new contribution (compare Dotto et al.
2015 and Dotto et al. 2016b), based on the TCLUST approach, called reweighted
TCLUST, or RTCLUST for the sake of brevity. The idea standing behind such a
method is based on reweighting the observations initially flagged as outlying. This
is helpful both to gain efficiency in the parameter estimation process and to provide
a reliable estimate of the true contamination level. Indeed, as TCLUST
is based on trimming a fixed proportion of observations, a proper choice of the
trimming level is required. Such a choice, especially in applications, can be cumbersome.
As will be clarified later on, the RTCLUST methodology allows the user to
overcome this problem. Indeed, in the RTCLUST approach the user is only required
to impose a high preventive trimming level. The procedure, by iterating through a
sequence of decreasing trimming levels, is aimed at reinserting the discarded observations
at each step, and provides more precise estimates of the parameters and a final estimate of the true contamination level.
The theoretical properties of the methodology are studied in Section 3.6 and proved
in Appendix A.1, while Section 3.7 contains a simulation study aimed at evaluating
the properties of the methodology and its advantages with respect to some other
robust (reweighted and single-step) procedures.
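The decreasing-trimming-level loop can be sketched as follows; the `fit_fn` interface is an assumption made purely for illustration, not the actual RTCLUST algorithm:

```python
import numpy as np

def reweighted_trimming(X, fit_fn, alphas):
    """RTCLUST-flavoured sketch (interface assumed): start from a high
    preventive trimming level and move through the decreasing sequence
    `alphas`; at each step refit on the currently retained points and
    re-flag outliers, so previously discarded observations can be
    reinserted. Returns the final keep-mask; 1 - keep.mean() plays the
    role of the estimated contamination level.
    `fit_fn(X_clean)` must return a scoring function giving the
    outlyingness of every row of X with respect to the fitted model."""
    keep = np.ones(len(X), dtype=bool)
    for a in sorted(alphas, reverse=True):
        score = fit_fn(X[keep])(X)                   # score ALL points
        keep = score <= np.quantile(score, 1 - a)    # reinsert / re-flag
    return keep
```

Because every point is rescored at every level, observations discarded under the high preventive level can re-enter once the fit has stabilized on the clean bulk of the data.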
Chapter 4 contains an extension of the TCLUST method to fuzzy linear clustering
(Dotto et al. 2016a). Such a contribution can be viewed as the extension of
Fritz et al. (2013a) to linear clustering problems, or, equivalently, as the extension
of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010) to the fuzzy
clustering framework. Fuzzy clustering is also useful to deal with contamination.
Fuzziness is introduced to deal with overlapping between clusters and with the presence
of bridge points, to be defined in Section 1.1. Indeed bridge points may arise in case
of overlapping between clusters and may completely alter the estimated clusters'
parameters (i.e. the coefficients of a linear model in each cluster). By introducing
fuzziness such observations are suitably down-weighted and the clusterwise structure
can be correctly detected. On the other hand, robustness against gross outliers,
as in the TCLUST methodology, is guaranteed by trimming a fixed proportion of
observations. Additionally a simulation study, aimed at comparing the proposed
methodology with other proposals (both robust and non-robust), is provided in
Section 4.4.
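The interplay of fuzzy down-weighting (for bridge points) and hard trimming (for gross outliers) can be sketched for simple linear regression clusters; all names and the membership formula below are illustrative, not the exact proposal of Chapter 4:

```python
import numpy as np

def fuzzy_linear_memberships(X, y, coefs, alpha, m=2.0):
    """Sketch of robust fuzzy linear clustering weights. Squared
    residuals from each cluster's regression line give FCM-style fuzzy
    memberships, so a bridge point lying between two lines has its
    membership split (down-weighted) across them, while the alpha
    fraction of points with the largest best-fit residual is
    hard-trimmed (gross outliers get zero weight).
    `coefs` is a list of (intercept, slope) pairs."""
    r2 = np.stack([(y - (a + b * X)) ** 2 for a, b in coefs], axis=1)
    inv = (r2 + 1e-12) ** (-1.0 / (m - 1))            # fuzzifier m
    u = inv / inv.sum(axis=1, keepdims=True)          # fuzzy memberships
    best = r2.min(axis=1)
    keep = best <= np.quantile(best, 1 - alpha)       # hard trimming
    return u * keep[:, None]
```

Re-estimating each line by weighted least squares with these memberships, and iterating, gives the general shape of a trimmed fuzzy linear clustering algorithm.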
Chapter 5 is entirely dedicated to real-data applications of the proposed contributions.
In particular, the RTCLUST method is applied to two different datasets. The
first one is the "Swiss Bank Note" dataset, a well-known benchmark dataset for clustering
models; the second is a dataset collected by the Gallup Organization which, to our
knowledge, is an original dataset to which no other existing proposals have been applied
yet. Section 5.3 contains an application of our fuzzy linear clustering proposal
to allometry data. In our opinion such a dataset, already considered in the robust
linear clustering proposal of García-Escudero, Gordaliza, Mayo-Iscar &
San Martín (2010), is particularly useful to show the advantages of our proposed
methodology. Indeed allometric quantities are often linked by a linear relationship
but, at the same time, there may be overlap between different groups, and outliers
may often appear due to errors in data registration.
Finally, Chapter 6 contains the concluding remarks and further directions of
research. In particular we wish to mention an ongoing work (Dotto & Farcomeni,
in preparation) in which we consider the possibility of implementing robust parsimonious
Gaussian clustering models. Within the chapter, the algorithm is briefly
described and some illustrative examples are also provided. The potential advantages
of such proposals are the following. First of all, by considering the parsimonious
models introduced in Celeux & Govaert (1995), the user is able to impose the shape of the detected clusters, which often, in applications, plays a key role.
Secondly, by constraining the shape of the detected clusters, the constraint on the
eigenvalue ratio can be avoided. This leads to the removal of a tuning parameter of
the procedure and, at the same time, allows the user to obtain affine equivariant estimators.
Finally, since the possibility of trimming a fixed proportion of observations
is allowed, the procedure is also formally robust.
The usefulness of robust multivariate methods: A case study with the menu items of a fast food restaurant chain
Multivariate statistical methods have been playing an important role in statistics and data analysis for a very long time. Nowadays, with the increase in the amounts of data collected every day in many disciplines, and with the rise of data science, machine learning and applied statistics, that role is even more important. Two of the most widely used multivariate statistical methods are cluster analysis and principal component analysis. These, similarly to many other models and algorithms, are adequate when the data satisfy certain assumptions. However, when the distribution of the data is not normal and/or it shows heavy tails and outlying observations, the classic models and algorithms might produce erroneous conclusions. Robust statistical methods, such as algorithms for robust cluster analysis and for robust principal component analysis, are of great usefulness when analyzing contaminated data with outlying observations. In this paper we consider a data set containing the products available in a fast food restaurant chain together with their respective nutritional information, and discuss the usefulness of robust statistical methods for classification, clustering and data visualization.