Breakdown points for maximum likelihood estimators of location-scale mixtures
ML-estimation based on mixtures of Normal distributions is a widely used tool
for cluster analysis. However, a single outlier can make the parameter
estimation of at least one of the mixture components break down. Among others,
the estimation of mixtures of t-distributions by McLachlan and
Peel [Finite Mixture Models (2000) Wiley, New York] and the addition of a
further mixture component accounting for "noise" by Fraley and Raftery
[The Computer J. 41 (1998) 578-588] were suggested as more robust
alternatives.
In this paper, the definition of an adequate robustness measure for cluster
analysis is discussed and bounds for the breakdown points of the mentioned
methods are given. It turns out that the two alternatives, while adding
stability in the presence of outliers of moderate size, do not possess a
substantially better breakdown behavior than estimation based on Normal
mixtures. If the number of clusters s is treated as fixed, r additional points
suffice for all three methods to let the parameters of r clusters explode. Only
in the case of r=s is this not possible for t-mixtures. The ability to estimate
the number of mixture components, for example, by use of the Bayesian
information criterion of Schwarz [Ann. Statist. 6 (1978)
461-464], and to isolate gross outliers as clusters of one point, is crucial
for an improved breakdown behavior of all three techniques. Furthermore, a
mixture of Normals with an improper uniform distribution is proposed to achieve
more robustness in the case of a fixed number of components.
Comment: Published by the Institute of Mathematical Statistics
(http://www.imstat.org) in the Annals of Statistics
(http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/00905360400000057
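The breakdown phenomenon described above can be illustrated with a toy experiment. The sketch below (invented data and starting values, not taken from the paper) runs a plain EM algorithm for a univariate two-component Normal mixture and shows that appending a single gross outlier drags one component's fitted mean far away from any cluster:

```python
import numpy as np

def em_normal_mixture(x, means, sds, weights, n_iter=50, var_floor=1e-6):
    """Plain EM for a univariate Normal mixture (no robustness tweaks)."""
    x = np.asarray(x, dtype=float)
    means = np.array(means, dtype=float)
    sds = np.array(sds, dtype=float)
    weights = np.array(weights, dtype=float)
    for _ in range(n_iter):
        # E-step: component responsibilities, computed in log space
        # so that extreme outliers do not underflow to 0/0.
        logd = np.array([np.log(w) - np.log(s) - 0.5 * ((x - m) / s) ** 2
                         for m, s, w in zip(means, sds, weights)])
        logd -= logd.max(axis=0)
        resp = np.exp(logd)
        resp /= resp.sum(axis=0)
        # M-step: weighted means, variances, mixing proportions
        nk = resp.sum(axis=1)
        means = (resp @ x) / nk
        var = (resp * (x[None, :] - means[:, None]) ** 2).sum(axis=1) / nk
        sds = np.sqrt(np.maximum(var, var_floor))
        weights = nk / len(x)
    return means, sds, weights

clean = np.concatenate([np.linspace(-1, 1, 50),   # cluster around 0
                        np.linspace(4, 6, 50)])   # cluster around 5
contaminated = np.append(clean, 1e4)              # one gross outlier

m_clean, _, _ = em_normal_mixture(clean, [0.0, 5.0], [1.0, 1.0], [0.5, 0.5])
m_dirty, _, _ = em_normal_mixture(contaminated, [0.0, 5.0], [1.0, 1.0], [0.5, 0.5])
print(sorted(m_clean), sorted(m_dirty))
```

On the clean data both fitted means stay near the true cluster centers; with the single contaminating point, the larger fitted mean is pulled to roughly the weighted average of the second cluster and the outlier, and it grows without bound as the outlier is moved further out.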
What are the true clusters?
Constructivist philosophy and Hasok Chang's active scientific realism are
used to argue that the idea of "truth" in cluster analysis depends on the
context and the clustering aims. Different characteristics of clusterings are
required in different situations. Researchers should be explicit about the
requirements and the idea of "true clusters" on which their research is based, because
clustering becomes scientific not through uniqueness but through transparent
and open communication. The idea of "natural kinds" is a human construct, but
it highlights the human experience that the reality outside the observer's
control seems to make certain distinctions between categories inevitable.
Various desirable characteristics of clusterings and various approaches to
define a context-dependent truth are listed, and I discuss what impact these
ideas can have on the comparison of clustering methods, the choice of a
clustering method, and related decisions in practice.
Beyond subjective and objective in statistics
We argue that the words "objectivity" and "subjectivity" in statistics
discourse are used in a mostly unhelpful way, and we propose to replace each of
them with broader collections of attributes, with objectivity replaced by
transparency, consensus, impartiality, and correspondence to observable
reality, and subjectivity replaced by awareness of multiple perspectives and
context dependence. The advantage of these reformulations is that the
replacement terms do not oppose each other. Instead of debating over whether a
given statistical method is subjective or objective (or normatively debating
the relative merits of subjectivity and objectivity in statistical practice),
we can recognize desirable attributes such as transparency and acknowledgment
of multiple perspectives as complementary goals. We demonstrate the
implications of our proposal with recent applied examples from pharmacology,
election polling, and socioeconomic stratification.
Comment: 35 pages
Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering
The two main topics of this paper are the introduction of the "optimally
tuned improper maximum likelihood estimator" (OTRIMLE) for robust clustering
based on the multivariate Gaussian model for clusters, and a comprehensive
simulation study comparing the OTRIMLE to Maximum Likelihood in Gaussian
mixtures with and without noise component, mixtures of t-distributions, and the
TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant
density for modelling outliers and noise. This can be chosen optimally so that
the non-noise part of the data looks as close to a Gaussian mixture as
possible. Some deviation from Gaussianity can be traded in for lowering the
estimated noise proportion. Covariance matrix constraints and computation of
the OTRIMLE are also treated. In the simulation study, all methods are
confronted with setups in which their model assumptions are not exactly
fulfilled, and in order to evaluate the experiments in a standardized way by
misclassification rates, a new model-based definition of "true clusters" is
introduced that deviates from the usual identification of mixture components
with clusters. In the study, every method turns out to be superior for one or
more setups, but the OTRIMLE achieves the most satisfactory overall
performance. The methods are also applied to two real datasets, one without and
one with known "true" clusters.
Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering
The robust improper maximum likelihood estimator (RIMLE) is a new method for
robust multivariate clustering that finds approximately Gaussian clusters. It
maximizes a pseudo-likelihood defined by adding a component with improper
constant density for accommodating outliers to a Gaussian mixture. A special
case of the RIMLE is MLE for multivariate finite Gaussian mixture models. In
this paper we treat existence, consistency, and breakdown theory for the RIMLE
comprehensively. RIMLE's existence is proved under non-smooth covariance matrix
constraints. It is shown that these can be implemented via a computationally
feasible Expectation-Conditional Maximization algorithm.
Comment: The title of this paper was originally "A consistent and breakdown
robust model-based clustering method".
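The core idea of the pseudo-likelihood can be sketched in one dimension. In the toy version below (all numbers invented, not the paper's implementation), the pseudo-density is psi(x) = pi_0 * c + sum_j pi_j * phi(x; mu_j, sigma_j), where c is the improper constant density; the E-step-style "noise responsibility" shows a gross outlier being absorbed by the constant component while an inlier is not:

```python
import math

def noise_responsibility(x, c, pi0, comps):
    """P(noise | x) under the pseudo-density; comps = [(pi_j, mu_j, sigma_j), ...]."""
    noise = pi0 * c
    gauss = sum(p * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for p, m, s in comps)
    return noise / (noise + gauss)

comps = [(0.9, 0.0, 1.0)]          # one Gaussian cluster at 0
r_inlier = noise_responsibility(0.0, c=0.01, pi0=0.1, comps=comps)
r_outlier = noise_responsibility(10.0, c=0.01, pi0=0.1, comps=comps)
print(r_inlier, r_outlier)
```

Because the Gaussian density decays to essentially zero ten standard deviations out while the constant density does not, the outlier's noise responsibility is close to 1, so it barely influences the Gaussian parameter estimates in the subsequent M-step.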
Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes
A key issue in cluster analysis is the choice of an appropriate clustering
method and the determination of the best number of clusters. Different
clusterings are optimal on the same data set according to different criteria,
and the choice of such criteria depends on the context and aim of clustering.
Therefore, researchers need to consider what data analytic characteristics the
clusters they are aiming at are supposed to have, such as within-cluster
homogeneity, between-cluster separation, and stability. Here, a set of
internal clustering validity indexes measuring different aspects of clustering
quality is proposed, including some indexes from the literature. Users can
choose the indexes that are relevant in the application at hand. In order to
measure the overall quality of a clustering (for comparing clusterings from
different methods and/or different numbers of clusters), the index values are
calibrated for aggregation. Calibration is relative to a set of random
clusterings on the same data. Two specific aggregated indexes are proposed and
compared with existing indexes on simulated and real data.
Comment: 42 pages, 11 figures
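The calibration idea can be sketched as follows. This toy version (invented index and data, not the paper's exact indexes) takes one internal index, the mean within-cluster pairwise distance, and standardizes it against the same index evaluated on random clusterings of the same data, so that indexes on different scales become comparable and can be aggregated:

```python
import numpy as np

rng = np.random.default_rng(0)

def within_cluster_index(X, labels):
    """Mean pairwise distance within clusters (lower = more homogeneous)."""
    vals = []
    for k in np.unique(labels):
        pts = X[labels == k]
        if len(pts) < 2:
            continue
        d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
        vals.append(d[np.triu_indices(len(pts), 1)].mean())
    return float(np.mean(vals))

def calibrated(X, labels, n_random=100):
    """Standardize the index against random clusterings (higher = better)."""
    obs = within_cluster_index(X, labels)
    ref = [within_cluster_index(X, rng.permutation(labels)) for _ in range(n_random)]
    return (np.mean(ref) - obs) / np.std(ref)

# two well-separated Gaussian clusters
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
good = np.repeat([0, 1], 50)
z_good = calibrated(X, good)
z_random = calibrated(X, rng.permutation(good))
print(z_good, z_random)
```

The correct two-cluster labeling scores many reference standard deviations better than random clusterings, whereas a random labeling scores near zero; several such calibrated indexes, each capturing a different quality aspect, can then be averaged into an overall score.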
Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
There are two notoriously hard problems in cluster analysis, estimating the
number of clusters, and checking whether the population to be clustered is not
actually homogeneous. Given a dataset, a clustering method and a cluster
validation index, this paper proposes to set up null models that capture
structural features of the data that cannot be interpreted as indicating
clustering. Artificial datasets are sampled from the null model with parameters
estimated from the original dataset. This can be used for testing the null
hypothesis of a homogeneous population against a clustering alternative. It can
also be used to calibrate the validation index for estimating the number of
clusters, by taking into account the expected distribution of the index under
the null model for any given number of clusters. The approach is illustrated by
three examples, involving various different clustering techniques (partitioning
around medoids, hierarchical methods, a Gaussian mixture model), validation
indexes (average silhouette width, prediction strength and BIC), and issues
such as mixed-type data and temporal and spatial autocorrelation.
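A minimal 1-D version of the parametric bootstrap scheme can be sketched as follows (invented data and statistic, not one of the paper's examples): fit a homogeneous null model, here a single Normal, to the data, repeatedly sample artificial datasets from it, and compare a cluster-quality statistic on the real data with its null distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

def two_means_ratio(x, n_iter=20):
    """Within-cluster SS / total SS after a simple 1-D 2-means (small = clustered)."""
    c = np.array([x.min(), x.max()])            # initialize centers at the extremes
    for _ in range(n_iter):
        lab = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([x[lab == j].mean() for j in (0, 1)])
    wss = sum(((x[lab == j] - c[j]) ** 2).sum() for j in (0, 1))
    return wss / ((x - x.mean()) ** 2).sum()

x = np.concatenate([rng.normal(-5, 1, 60), rng.normal(5, 1, 60)])  # clearly bimodal
obs = two_means_ratio(x)

# Null model: a single Normal with parameters estimated from the data.
null_stats = [two_means_ratio(rng.normal(x.mean(), x.std(), len(x)))
              for _ in range(99)]
p_value = (1 + sum(s <= obs for s in null_stats)) / (1 + len(null_stats))
print(obs, p_value)
```

For bimodal data the observed within/total ratio is far smaller than anything the homogeneous null model produces, so the homogeneity hypothesis is rejected; the same null distribution, computed per candidate number of clusters, is what calibrates a validation index when estimating the number of clusters.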
- …