3,494 research outputs found
Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering
The two main topics of this paper are the introduction of the "optimally
tuned robust improper maximum likelihood estimator" (OTRIMLE) for robust clustering
based on the multivariate Gaussian model for clusters, and a comprehensive
simulation study comparing the OTRIMLE to Maximum Likelihood in Gaussian
mixtures with and without noise component, mixtures of t-distributions, and the
TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant
density for modelling outliers and noise. This can be chosen optimally so that
the non-noise part of the data looks as close to a Gaussian mixture as
possible. Some deviation from Gaussianity can be traded in for lowering the
estimated noise proportion. Covariance matrix constraints and computation of
the OTRIMLE are also treated. In the simulation study, all methods are
confronted with setups in which their model assumptions are not exactly
fulfilled, and in order to evaluate the experiments in a standardized way by
misclassification rates, a new model-based definition of "true clusters" is
introduced that deviates from the usual identification of mixture components
with clusters. In the study, every method turns out to be superior for one or
more setups, but the OTRIMLE achieves the most satisfactory overall
performance. The methods are also applied to two real datasets, one without and
one with known "true" clusters.
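The central device of the abstract above, an improper constant density competing with the Gaussian components for outlying points, can be illustrated with a minimal E-step-style computation. This is a hypothetical sketch of the idea only, not the OTRIMLE algorithm: the function names, parameters, and the fixed value of the improper density `delta` are all illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def noise_responsibilities(X, means, covs, weights, noise_weight, delta):
    """Responsibilities for a Gaussian mixture plus an improper constant
    density delta modelling outliers/noise (illustrative sketch, not the
    OTRIMLE estimator; in OTRIMLE delta is tuned optimally, not fixed).
    Returns an (n, K+1) matrix whose last column is the noise weight."""
    n, K = X.shape[0], len(means)
    dens = np.empty((n, K + 1))
    for k in range(K):
        dens[:, k] = weights[k] * gaussian_pdf(X, means[k], covs[k])
    dens[:, K] = noise_weight * delta  # improper "uniform" noise density
    return dens / dens.sum(axis=1, keepdims=True)
```

Points far from every Gaussian component receive high responsibility in the noise column, which is how the improper density absorbs outliers.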
A robust approach to model-based classification based on trimming and constraints
In a standard classification framework a set of trustworthy learning data are
employed to build a decision rule, with the final aim of classifying unlabelled
units belonging to the test set. Therefore, unreliable labelled observations,
namely outliers and data with incorrect labels, can strongly undermine the
classifier performance, especially if the training size is small. The present
work introduces a robust modification to the Model-Based Classification
framework, employing impartial trimming and constraints on the ratio between
the maximum and the minimum eigenvalue of the group scatter matrices. The
proposed method effectively handles the presence of noise in both response and
explanatory variables, providing reliable classification even when dealing with
contaminated datasets. A robust information criterion is proposed for model
selection. Experiments on real and simulated data, artificially adulterated,
are provided to underline the benefits of the proposed method.
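The eigenvalue-ratio constraint mentioned above can be sketched as a projection of a single scatter matrix so that its condition number does not exceed a bound c. This is a simplified, single-matrix illustration under assumed names; the method in the abstract constrains the eigenvalues jointly across all group scatter matrices, which this sketch does not reproduce.

```python
import numpy as np

def constrain_eigenratio(S, c=10.0):
    """Clip the eigenvalues of a symmetric scatter matrix S so that the
    ratio between the largest and smallest eigenvalue is at most c.
    Illustrative sketch only: the paper's constraint is applied jointly
    over all groups, not to one matrix in isolation."""
    vals, vecs = np.linalg.eigh(S)
    floor = vals.max() / c          # smallest eigenvalue allowed
    vals = np.clip(vals, floor, None)
    return vecs @ np.diag(vals) @ vecs.T
```

Bounding the eigenvalues away from zero in this way is what rules out the degenerate, spurious solutions discussed in the next abstract as well.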
A data driven equivariant approach to constrained Gaussian mixture modeling
Maximum likelihood estimation of Gaussian mixture models with different
class-specific covariance matrices is known to be problematic. This is due to
the unboundedness of the likelihood, together with the presence of spurious
maximizers. Existing methods to bypass this obstacle are based on the fact that
unboundedness is avoided if the eigenvalues of the covariance matrices are
bounded away from zero. This can be done by imposing some constraints on the
covariance matrices, i.e. by incorporating a priori information on the
covariance structure of the mixture components. The present work introduces a
constrained equivariant approach, where the class conditional covariance
matrices are shrunk towards a pre-specified matrix Psi. Data-driven choices of
the matrix Psi, when a priori information is not available, and the optimal
amount of shrinkage are investigated. The effectiveness of the proposal is
evaluated on the basis of a simulation study and an empirical example.
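The shrinkage step described above amounts to a convex combination of each class-conditional covariance estimate with the target matrix Psi. The sketch below shows this linear shrinkage together with one common data-driven target (a scaled identity); both the function names and the identity target are assumptions for illustration, not necessarily the paper's choices of Psi or of the shrinkage amount.

```python
import numpy as np

def identity_target(S):
    """A generic data-driven target when no prior information is
    available: a scaled identity matching the average variance of S.
    An assumed example target, not necessarily the paper's Psi."""
    p = S.shape[0]
    return (np.trace(S) / p) * np.eye(p)

def shrink_covariance(S, Psi, lam):
    """Shrink a class-conditional covariance estimate S towards the
    target Psi by a convex combination (1 - lam) * S + lam * Psi,
    with lam in [0, 1]. The data-driven choice of lam is not shown."""
    return (1.0 - lam) * S + lam * Psi
```

Because every class matrix is pulled towards the same well-conditioned Psi, the smallest eigenvalues stay bounded away from zero, which is exactly what removes the unboundedness of the likelihood.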
A fast and recursive algorithm for clustering large datasets with k-medians
Clustering with fast algorithms large samples of high dimensional data is an
important challenge in computational statistics. Borrowing ideas from MacQueen
(1967) who introduced a sequential version of the k-means algorithm, a new
class of recursive stochastic gradient algorithms designed for the k-medians
loss criterion is proposed. By their recursive nature, these algorithms are
very fast and are well adapted to deal with large samples of data that are
allowed to arrive sequentially. It is proved that the stochastic gradient
algorithm converges almost surely to the set of stationary points of the
underlying loss criterion. A particular attention is paid to the averaged
versions, which are known to have better performances, and a data-driven
procedure that allows automatic selection of the value of the descent step is
proposed.
The performance of the averaged sequential estimator is compared on a
simulation study, both in terms of computation speed and accuracy of the
estimations, with more classical partitioning techniques such as k-means,
trimmed k-means and PAM (partitioning around medoids). Finally, this new
online clustering technique is illustrated on determining television audience
profiles with a sample of more than 5000 individual television audiences
measured every minute over a period of 24 hours.
Comment: Under revision for Computational Statistics and Data Analysis
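The recursive scheme described in the abstract above can be sketched as a MacQueen-style online update: each arriving point moves its nearest centre one decreasing step along the negative gradient of the Euclidean k-medians loss, i.e. along the unit vector towards the point. Function name, step-size schedule, and initialisation below are illustrative assumptions; the averaged version and the paper's data-driven step selection are not reproduced.

```python
import numpy as np

def online_kmedians(X, k, gamma0=0.5, alpha=0.66, seed=0):
    """One sequential pass of a stochastic-gradient k-medians sketch.
    For the loss sum_i min_j ||x_i - c_j||, the gradient with respect
    to the nearest centre is -(x - c) / ||x - c||, so each point nudges
    its nearest centre a step gamma_t towards itself. The schedule
    gamma_t = gamma0 / t**alpha is an illustrative choice."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for t, x in enumerate(X, start=1):
        d = np.linalg.norm(centers - x, axis=1)
        j = int(np.argmin(d))
        if d[j] > 0:                      # gradient undefined at the centre
            gamma = gamma0 / t ** alpha   # decreasing descent step
            centers[j] += gamma * (x - centers[j]) / d[j]
    return centers
```

Because each update touches a single centre and costs O(k) distance evaluations, the pass scales to large samples arriving sequentially, which is the setting of the television-audience application above.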