An Entropy criterion for assessing the number of clusters in a mixture model
In this paper, we consider an entropy criterion to estimate the number of clusters arising from a mixture model. This criterion is derived from a relation linking the likelihood and the classification likelihood of a mixture. Its performance is investigated through Monte-Carlo numerical experiments and shows favourable results compared with other classical criteria.
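The criterion rests on the entropy of the posterior classification probabilities, which is zero for a hard partition and grows as cluster memberships become ambiguous. A minimal sketch of that entropy term (the matrix name `tau` and function name are ours, not from the paper):

```python
import math

def posterior_entropy(tau):
    """Entropy of a soft classification matrix tau (n points x K clusters):
    E(K) = -sum_i sum_k tau_ik * log(tau_ik).
    E(K) = 0 for a hard partition (well-separated clusters) and increases
    as cluster memberships become more ambiguous."""
    return -sum(t * math.log(t) for row in tau for t in row if t > 0.0)

# Hard assignments: zero entropy.
hard = [[1.0, 0.0], [0.0, 1.0]]
# Maximally ambiguous assignments: entropy of n * log(K).
fuzzy = [[0.5, 0.5], [0.5, 0.5]]

print(posterior_entropy(hard))   # 0.0
print(posterior_entropy(fuzzy))  # 2 * log(2) ≈ 1.386
```

Comparing this entropy across candidate values of K, together with the likelihood, is the idea behind entropy-based cluster-number criteria.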
Detection of elliptical shapes via cross-entropy clustering
We consider the problem of finding elliptical shapes in an image and discuss a
solution based on cross-entropy clustering. The proposed method allows
searching for ellipses with predefined sizes and positions in the space.
Moreover, it works well for finding ellipsoids in higher dimensions.
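Cross-entropy clustering scores a Gaussian partition by the cost sum_k p_k (-ln p_k + H(N(mu_k, Sigma_k))), where the differential entropy of a d-dimensional Gaussian is (d/2) ln(2*pi*e) + (1/2) ln det(Sigma); constraining the covariances fixes the ellipse shapes being searched for. A minimal sketch of this cost for the 2-D case (helper names are ours, illustrative only):

```python
import math

def gaussian_entropy_2d(cov):
    """Differential entropy of a 2-D Gaussian: ln(2*pi*e) + 0.5*ln(det cov)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return math.log(2 * math.pi * math.e) + 0.5 * math.log(det)

def cec_cost(weights, covs):
    """Cross-entropy clustering cost of a Gaussian partition:
    sum_k p_k * (-ln p_k + H(N(mu_k, Sigma_k))).
    Fixing or constraining the covariances favours ellipses of a given shape."""
    return sum(p * (-math.log(p) + gaussian_entropy_2d(c))
               for p, c in zip(weights, covs))

# Two equal-weight clusters with unit covariance (circular "ellipses").
eye = [[1.0, 0.0], [0.0, 1.0]]
print(cec_cost([0.5, 0.5], [eye, eye]))
```

Minimizing this cost over partitions simultaneously chooses the number of clusters (clusters whose weight cost exceeds their gain are removed) and the ellipse parameters.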
Localizing the Latent Structure Canonical Uncertainty: Entropy Profiles for Hidden Markov Models
This report addresses state inference for hidden Markov models. These models
rely on unobserved states, which often have a meaningful interpretation. This
makes it necessary to develop diagnostic tools for quantification of state
uncertainty. The entropy of the state sequence that explains an observed
sequence for a given hidden Markov chain model can be considered as the
canonical measure of state sequence uncertainty. This canonical measure of
state sequence uncertainty is not reflected by the classic multivariate state
profiles computed by the smoothing algorithm, which summarizes the possible
state sequences. Here, we introduce a new type of profile with the following
properties: (i) these profiles of conditional entropies are a decomposition of
the canonical measure of state sequence uncertainty along the sequence and make
it possible to localize this uncertainty; (ii) these profiles are univariate
and thus remain easily interpretable on tree structures. We show how to extend
the smoothing algorithms for hidden Markov chain and tree models to compute
these entropy profiles efficiently.
Comment: Submitted to Journal of Machine Learning Research; No RR-7896 (2012).
Enhancing the selection of a model-based clustering with external categorical variables
In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables which are not directly involved in clustering the data. An approach is proposed in the model-based clustering context to select a number of clusters which both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted by the maximum likelihood methodology to the data, excluding the external variables, which are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion.
Adaptive Seeding for Gaussian Mixture Models
We present new initialization methods for the expectation-maximization
algorithm for multivariate Gaussian mixture models. Our methods are adaptations
of the well-known k-means++ initialization and the Gonzalez algorithm.
Thereby we aim to close the gap between simple methods such as uniform random
initialization and complex methods that crucially depend on the right choice of
hyperparameters. Our extensive experiments indicate the usefulness of our
methods compared to common techniques, e.g. applying the original k-means++ and
Gonzalez algorithms directly, on artificial as well as real-world data sets.
Comment: This is a preprint of a paper that has been accepted for publication
in the Proceedings of the 20th Pacific-Asia Conference on Knowledge Discovery
and Data Mining (PAKDD) 2016. The final publication is available at
link.springer.com (http://link.springer.com/chapter/10.1007/978-3-319-31750-2_24).
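For reference, the k-means++ seeding the paper adapts picks the first centre uniformly at random and each further centre with probability proportional to the squared distance to the nearest centre chosen so far. A minimal 1-D sketch (our own, not the paper's GMM variant; for mixtures one would then run EM from these seeds):

```python
import random

def kmeans_pp_seed(points, k, rng=random.Random(0)):
    """k-means++ seeding: first centre uniform at random, then each further
    centre drawn with probability proportional to D^2, the squared distance
    to the nearest centre chosen so far."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min((p - c) ** 2 for c in centres) for p in points]
        total = sum(d2)
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres

data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
print(kmeans_pp_seed(data, 2))
```

The D^2 weighting makes it very likely that the two seeds land in different clusters here, which is exactly the property that makes the scheme attractive as an EM initializer.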
Segmental K-Means Learning with Mixture Distribution for HMM Based Handwriting Recognition
This paper investigates the performance of hidden Markov models (HMMs) for handwriting recognition. The Segmental K-Means algorithm is used for updating the transition and observation probabilities, instead of the Baum-Welch algorithm. Observation probabilities are modelled as multivariate Gaussian mixture distributions. A deterministic clustering technique is used to estimate the initial parameters of an HMM. The Bayesian information criterion (BIC) is used to select the topology of the model. The wavelet transform is used to extract features from a grey-scale image, avoiding binarization of the image.
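Segmental K-Means alternates between Viterbi decoding (a hard alignment of observations to states) and re-estimating the parameters from that alignment, instead of Baum-Welch's soft expected counts. A minimal sketch of the Viterbi step on a toy discrete HMM (our own illustrative numbers; the paper uses Gaussian mixture emissions):

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely state path for a discrete HMM; Segmental K-Means
    re-estimates the HMM parameters from this hard alignment."""
    K = len(init)
    score = [math.log(init[k]) + math.log(emit[k][obs[0]]) for k in range(K)]
    back = []
    for o in obs[1:]:
        prev = score[:]
        back.append([max(range(K), key=lambda j: prev[j] + math.log(trans[j][k]))
                     for k in range(K)])
        score = [prev[back[-1][k]] + math.log(trans[back[-1][k]][k])
                 + math.log(emit[k][o]) for k in range(K)]
    path = [max(range(K), key=lambda k: score[k])]
    for b in reversed(back):
        path.append(b[path[-1]])
    return path[::-1]

init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0, 0, 1, 1], init, trans, emit))  # [0, 0, 1, 1]
```

In the full algorithm, transition probabilities are re-estimated by counting transitions along this path, and the Gaussian mixture emission parameters from the observations assigned to each state, iterating until the path stops changing.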
Bayesian solutions to the label switching problem
The label switching problem, the unidentifiability of the permutation of clusters or, more generally, latent variables, makes the interpretation of results computed with MCMC sampling difficult. We introduce a fully Bayesian treatment of the permutations which performs better than the alternatives. The method can be used to compute summaries of the posterior samples even for nonparametric Bayesian methods, for which no good solutions have existed so far. Although approximate in this case, the results are very promising. The summaries are intuitively appealing: a summarized cluster is defined as a set of points for which the likelihood of being in the same cluster is maximized.
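The fully Bayesian treatment in the paper is more involved, but the problem itself is easy to illustrate: each MCMC sample labels the same partition with an arbitrary permutation of cluster indices. A common baseline (our own sketch, not the paper's method) realigns each sample by brute-forcing the permutation that best matches a reference labelling:

```python
from itertools import permutations

def relabel(sample, reference, k):
    """Undo label switching for one MCMC sample: find the permutation of the
    k cluster labels that maximizes agreement with a reference labelling.
    Brute force is feasible only for small k; the fully Bayesian treatment
    instead places a posterior over these permutations."""
    best, best_score = list(sample), -1
    for perm in permutations(range(k)):
        mapped = [perm[z] for z in sample]
        score = sum(a == b for a, b in zip(mapped, reference))
        if score > best_score:
            best, best_score = mapped, score
    return best

reference = [0, 0, 1, 1, 2, 2]
switched = [1, 1, 2, 2, 0, 0]            # same partition, permuted labels
print(relabel(switched, reference, 3))   # [0, 0, 1, 1, 2, 2]
```

After relabelling, posterior summaries such as per-cluster means become meaningful; without it, averaging over samples mixes quantities from different clusters.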
Mixed mode data clustering: an approach based on tetrachoric correlations
In this paper we face the problem of clustering mixed-mode data by assuming that the observed binary variables are generated from latent continuous variables. We perform a principal components analysis on the matrix of tetrachoric correlations, and we then estimate the scores of each latent variable and construct a data matrix with continuous variables to be used in fully Gaussian mixture models or in k-means cluster analysis. The calculation of the expected a posteriori (EAP) estimates may proceed by simply considering a limited number of quadrature points. Results on a simulation study and on a real data set are reported.
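Once the tetrachoric correlation matrix has been estimated, the principal components step is standard eigenanalysis. A minimal sketch of extracting the leading component by power iteration (the matrix values below are hypothetical, purely illustrative):

```python
def leading_component(corr, iters=200):
    """First principal component (leading eigenvector) of a correlation
    matrix via power iteration; scores on this component give one continuous
    latent variable per observation."""
    n = len(corr)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# A hypothetical 3x3 tetrachoric correlation matrix (illustrative values).
R = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]
print(leading_component(R))
```

In the approach above, scores on the retained components form the continuous data matrix that is then passed to a Gaussian mixture model or to k-means.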
Model-Based Clustering and Classification of Functional Data
The problem of complex data analysis is a central topic of modern statistical
science and learning systems and is becoming of broader interest with the
increasing prevalence of high-dimensional data. The challenge is to develop
statistical models and autonomous algorithms that are able to acquire knowledge
from raw data for exploratory analysis, which can be achieved through
clustering techniques or to make predictions of future data via classification
(i.e., discriminant analysis) techniques. Latent data models, including mixture
model-based approaches are one of the most popular and successful approaches in
both the unsupervised context (i.e., clustering) and the supervised one (i.e.,
classification or discrimination). Although traditionally tools of multivariate
analysis, they are growing in popularity when considered in the framework of
functional data analysis (FDA). FDA is the data analysis paradigm in which the
individual data units are functions (e.g., curves, surfaces), rather than
simple vectors. In many areas of application, the analyzed data are indeed
often available in the form of discretized values of functions or curves (e.g.,
time series, waveforms) and surfaces (e.g., 2d-images, spatio-temporal data).
This functional aspect of the data adds additional difficulties compared to the
case of a classical multivariate (non-functional) data analysis. We review and
present approaches for model-based clustering and classification of functional
data. We derive well-established statistical models along with efficient
algorithmic tools to address problems regarding the clustering and the
classification of these high-dimensional data, including their heterogeneity,
missing information, and dynamical hidden structure. The presented models and
algorithms are illustrated on real-world functional data analysis problems from
several application areas.