In cluster analysis, it can be useful to interpret the partition built from
the data in the light of external categorical variables which were not directly
involved in clustering the data. An approach is proposed in the model-based
clustering context to select a model and a number of clusters which both fit
the data well and take advantage of the potential illustrative ability of the
external variables. This approach makes use of the integrated joint likelihood
of the data and the partitions at hand, namely the model-based partition and
the partitions associated with the external variables. Note that each mixture
model is fitted to the data by maximum likelihood, excluding the external
variables, which are used only to select a relevant mixture model. Numerical
experiments illustrate the promising behaviour of the derived criterion.

Amorim, Maria José

Baudry, Jean-Patrick

Cardoso, Margarida

Celeux, Gilles

Ferreira, Ana Sousa

English

arXiv

In cluster analysis, it is often useful to interpret the obtained partition with respect to external qualitative variables (defining known partitions) derived from alternative information. An approach is proposed in the model-based clustering context to select a model and a number of clusters in order to get a partition which both provides a good fit with the data and is related to the external variables. This approach makes use of the integrated joint likelihood of the data, the partition derived from the mixture model and the known partitions. It is worth noting that the external qualitative variables are only used to select a relevant mixture model. Each mixture model is fitted by the maximum likelihood methodology from the observed data. Numerical experiments illustrate the promising behaviour of the derived criterion.

In unsupervised classification, it is often useful to interpret the clustering with the help of external qualitative variables which themselves define partitions. We propose an approach based on the probability mixture model that selects a model and a number of clusters producing both a good fit to the data and a strong link with the external qualitative variables. This approach relies on a criterion approximating the integrated likelihood of the data completed by the labels of the sought partition and by those of the partitions associated with the external variables. It is important to stress that the external variables are used only to select a mixture model estimated by maximum likelihood. Numerical illustrations show the promising behaviour of the proposed criterion.

Amorim, Maria-José

Sousa Ferreira, Ana

INRIA a CCSD electronic archive server

Enhancing the selection of a model-based clustering with external qualitative variables

In cluster analysis, it can be useful to interpret the obtained partition with respect to external qualitative variables. An approach is proposed in the model-based clustering context to select a model and a number of clusters in order to get a partition which both provides a good fit with the data and is closely related to the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is worth noting that the known partitions are only used to select a relevant mixture model. Each mixture model is fitted by the maximum likelihood methodology from the data. Numerical experiments illustrate the promising behaviour of the derived criterion.
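
The abstract does not state the criterion itself, so the following is only a minimal sketch of the selection idea under stated assumptions. It assumes that each candidate mixture model already comes with a fit score (a hypothetical placeholder for whatever penalised likelihood value the maximum likelihood fit yields) and a hard cluster assignment, and that the link with an external partition is measured by a plug-in conditional log-likelihood computed from the contingency table. The names `external_link_loglik` and `select_model` are illustrative and do not come from the source.

```python
from collections import Counter
from math import log


def external_link_loglik(clusters, external):
    """Plug-in log-likelihood of an external partition given the clusters.

    Assumed form (not given in the abstract): the sum over the cells of the
    contingency table of n_kl * log(n_kl / n_k), where n_kl counts the items
    in cluster k with external label l and n_k is the size of cluster k.
    The value is 0 when each cluster holds a single external label and
    becomes more negative as the labels mix within clusters.
    """
    if len(clusters) != len(external):
        raise ValueError("partitions must label the same observations")
    joint = Counter(zip(clusters, external))
    sizes = Counter(clusters)
    return sum(n * log(n / sizes[k]) for (k, _), n in joint.items())


def select_model(candidates, external_partitions):
    """Return the name of the candidate with the best combined score.

    candidates: dict mapping a model name to (fit_score, cluster_labels),
    where fit_score is a hypothetical, higher-is-better fit criterion.
    external_partitions: list of label sequences, one per external variable.
    The external variables enter only at this selection step.
    """
    def combined(item):
        fit_score, labels = item[1]
        link = sum(external_link_loglik(labels, u) for u in external_partitions)
        return fit_score + link

    return max(candidates.items(), key=combined)[0]


if __name__ == "__main__":
    external = [["a", "a", "b", "b"]]
    candidates = {
        "model_A": (-10.0, [0, 0, 1, 1]),  # clusters match the external labels
        "model_B": (-9.5, [0, 1, 0, 1]),   # slightly better fit, no link
    }
    print(select_model(candidates, external))
```

In this toy setting the candidate whose clusters agree with the external labels is preferred even though its fit score alone is slightly lower, which is the trade-off the abstract describes.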

Hal-Diderot
