Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering
The robust improper maximum likelihood estimator (RIMLE) is a new method for
robust multivariate clustering that finds approximately Gaussian clusters. It
maximizes a pseudo-likelihood defined by adding a component with improper
constant density for accommodating outliers to a Gaussian mixture. A special
case of the RIMLE is the MLE for multivariate finite Gaussian mixture models. In
this paper we treat existence, consistency, and breakdown theory for the RIMLE
comprehensively. RIMLE's existence is proved under non-smooth covariance matrix
constraints. It is shown that these can be implemented via a computationally
feasible Expectation-Conditional Maximization algorithm.
Comment: the title of this paper was originally "A consistent and breakdown robust model-based clustering method".
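The abstract above describes the RIMLE pseudo-likelihood: a Gaussian mixture augmented with an improper component of constant density that absorbs outliers. As a minimal sketch of evaluating such a pseudo-likelihood (the weights, component parameters, and the constant c below are illustrative assumptions, not values from the paper), this might look like:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rimle_pseudo_loglik(X, weights, means, covs, c):
    """Pseudo-log-likelihood of a Gaussian mixture augmented with an
    improper uniform component of constant density c (for outliers).
    weights[0] belongs to the improper component, weights[1:] to the
    Gaussian components. Illustrative sketch only."""
    dens = np.full(len(X), weights[0] * c)          # improper component
    for w, mu, S in zip(weights[1:], means, covs):  # Gaussian components
        dens += w * multivariate_normal.pdf(X, mean=mu, cov=S)
    return np.log(dens).sum()

rng = np.random.default_rng(0)
# two Gaussian clusters plus a few gross outliers (synthetic data)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.uniform(-20, 20, (5, 2))])
ll = rimle_pseudo_loglik(X,
                         weights=[0.05, 0.475, 0.475],
                         means=[np.zeros(2), np.full(2, 6.0)],
                         covs=[np.eye(2), np.eye(2)],
                         c=1e-4)
```

The actual estimator maximizes this quantity over the parameters under covariance constraints, which the paper does via an Expectation-Conditional Maximization algorithm; the snippet only shows the objective being evaluated.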
A skewness-based clustering method
Partitive clustering methods represent one of the earliest and best-known families of strategies in the field
of clustering. The name comes from their main feature: all these methods start from an initial
partition and modify it at every step of the process according to a known criterion, until a given
convergence rule is satisfied. In other words, as pointed out by Äyrämö and Kärkkäinen (2006),
they work essentially as iterative allocation algorithms. In this framework, we do not only focus on
“canonical” approaches such as K-means and fuzzy C-means, but discuss some recent symmetry-based
partitive clustering methods, mostly developed in the context of computer science and
engineering. As will be shown, these approaches seem to provide encouraging results, especially
in the field of image recognition and some related applications, and for this reason, they represent a
starting point for our work.
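The iterative-allocation scheme described above can be sketched with plain K-means: start from an initial partition, reassign each point to its nearest centroid, and stop when the allocation no longer changes. The implementation below is a minimal illustration of that loop, not any specific method from the literature cited:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means as an iterative allocation algorithm: initialize
    centroids from random data points, reassign each point to its nearest
    centroid, and stop when the assignment no longer changes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # allocation step: distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # convergence rule
            break
        labels = new_labels
        # update step: recompute each non-empty cluster's centroid
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(10, 0.5, (30, 2))])
labels, centroids = kmeans(X, 2)
```

At convergence, every point sits with its nearest centroid, which is exactly the "known criterion" that drives the iterative allocation.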
In this respect, we are particularly interested in the case of overlapping clusters. As we will clarify,
this case may represent a critical aspect for most clustering methods we have considered. In
particular, we started our analysis by noting that, in the case of high-dimensional data with
overlapping clusters, it may be difficult to choose the component-specific distributions, and no
graphical device can help us. So, we decided to investigate nonparametric approaches to clustering.
In this framework, we focused on the case of clusters with elliptical shapes, and on Gaussian
mixtures as a special case. Then, we realized that for elliptical shapes symmetry could be a
“natural” choice. So, we searched for such clustering approaches, and we found the symmetry-based
methods cited above. But, surprisingly, none of them was intended to focus on elliptical
clusters, since they aim essentially at image recognition of various symmetric shapes.
So, we decided to discuss this issue, and to test whether a suitable function of symmetry could
improve clustering results in the case of elliptical overlapping clusters.
Since we are interested in elliptical shapes, from a clustering point of view, another broad subject
that we will discuss is the Gaussian mixture model. In this context, our interest is in the EM-based
Mclust algorithm from the R library mclust, see Fraley and Raftery (1999). Thus, our work addresses
both of these topics: partitive clustering methods (with a focus on the symmetry-based approach)
and Gaussian model-based clustering.
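Mclust itself is an R package; purely as a rough stand-in for the EM machinery it relies on, here is a minimal EM for a Gaussian mixture in Python (the initialization, regularization constant, and fixed iteration count are simplifying assumptions, not Mclust's actual scheme):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture, the kind of engine behind
    model-based clustering tools such as Mclust. Returns mixture weights,
    means, covariances, and the soft responsibilities."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    S = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] proportional to w_j N(x_i; mu_j, S_j)
        r = np.column_stack([w[j] * multivariate_normal.pdf(X, mu[j], S[j])
                             for j in range(k)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            S[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return w, mu, S, r

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)),
               rng.normal(10, 0.5, (40, 2))])
w, mu, S, r = gmm_em(X, 2)
```

Hard cluster labels are then obtained by taking each point's largest responsibility, which is how soft model-based clustering is turned into a partition.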
The main reason for such a choice, that is, to address two partially different subjects, derives from the
essential features of our proposal: a symmetry-based partitive method intended to deal with
elliptical clusters (with the Gaussian being a special case). In this sense, we evaluate our
clustering performance by comparing it with the Gaussian mixture model
implemented in the Mclust library, see Fraley and Raftery (1999). This is surely a challenging task,
since this method has home-court advantage in the case of Gaussian clusters. In this framework, as
pointed out before, we are mainly interested in the case of overlapping clusters. In this sense, a
starting point for our work was the assumption that Mclust (also in its “natural” framework, that is
Gaussian mixtures) could have problems in centroid estimation when clusters are highly
overlapping. Quite obviously, this drawback could be related to its dependency on the multivariate
Gaussian density. So, we searched for a nonparametric skewness-based method, which could be
appropriate for elliptical distributions (including the Gaussian) in the case of overlapping clusters. This
was exactly the framework of the proposed Sbam (Skewness-Based Allocation Method) algorithm.
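The Sbam algorithm itself is not detailed in this excerpt; purely as an illustration of the underlying idea, the sketch below scores how asymmetric the points allocated to a tentative cluster are around its centroid, using a simple standardized third-moment statistic (this particular symmetry function is a hypothetical choice, not necessarily Sbam's):

```python
import numpy as np

def asymmetry_score(points, centroid):
    """Mean absolute standardized third moment of the points around the
    centroid: close to 0 for a symmetric (e.g. Gaussian) cluster, larger
    when the allocation is lopsided. Hypothetical statistic, not Sbam's."""
    z = (points - centroid) / points.std(axis=0)
    return np.abs((z ** 3).mean(axis=0)).mean()

rng = np.random.default_rng(0)
sym = rng.normal(0, 1, (2000, 2))                      # symmetric cluster
skewed = np.vstack([sym, rng.normal(4, 1, (600, 2))])  # contaminated by an overlapping neighbour
s1 = asymmetry_score(sym, sym.mean(axis=0))
s2 = asymmetry_score(skewed, skewed.mean(axis=0))
```

The intuition matches the text: when an elliptical cluster absorbs points from an overlapping neighbour, the allocation becomes asymmetric around the centroid, and a symmetry function of this kind can flag (and help correct) the misallocation.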