8,256 research outputs found
Bayesian Cluster Enumeration Criterion for Unsupervised Learning
We derive a new Bayesian Information Criterion (BIC) by formulating the
problem of estimating the number of clusters in an observed data set as
maximization of the posterior probability of the candidate models. Given that
some mild assumptions are satisfied, we provide a general BIC expression for a
broad class of data distributions. This serves as a starting point when
deriving the BIC for specific distributions. Along this line, we provide a
closed-form BIC expression for multivariate Gaussian distributed variables. We
show that incorporating the data structure of the clustering problem into the
derivation of the BIC results in an expression whose penalty term is different
from that of the original BIC. We propose a two-step cluster enumeration
algorithm. First, a model-based unsupervised learning algorithm partitions the
data according to a given set of candidate models. Subsequently, the number of
clusters is determined as the one associated with the model for which the
proposed BIC is maximal. The performance of the proposed two-step algorithm is
tested using synthetic and real data sets.Comment: 14 pages, 7 figure
- …