An Alternative Prior Process for Nonparametric Bayesian Clustering
Prior distributions play a crucial role in Bayesian approaches to clustering.
Two commonly-used prior distributions are the Dirichlet and Pitman-Yor
processes. In this paper, we investigate the predictive probabilities that
underlie these processes, and the implicit "rich-get-richer" characteristic of
the resulting partitions. We explore an alternative prior for nonparametric
Bayesian clustering -- the uniform process -- for applications where the
"rich-get-richer" property is undesirable. We also explore the cost of this
process: partitions are no longer exchangeable with respect to the ordering of
variables. We present new asymptotic and simulation-based results for the
clustering characteristics of the uniform process and compare these with known
results for the Dirichlet and Pitman-Yor processes. We compare performance on a
real document clustering task, demonstrating the practical advantage of the
uniform process despite its lack of exchangeability over orderings.
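The contrast between these predictive rules can be sketched in a few lines: under the Dirichlet process (the Chinese restaurant process), an existing cluster attracts a new point with weight proportional to its current size, while under the uniform process every existing cluster receives the same weight. The function and parameter names below are illustrative, not taken from the paper.

```python
import random

def sample_partition(n, new_weight, existing_weight, seed=0):
    """Sequentially assign n items to clusters using the given
    predictive weights (a sketch; `new_weight` takes the current
    number of clusters, `existing_weight` a cluster's size)."""
    rng = random.Random(seed)
    sizes = []  # one entry per cluster: its current size
    for _ in range(n):
        weights = [existing_weight(s) for s in sizes] + [new_weight(len(sizes))]
        r = rng.random() * sum(weights)
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if k < len(sizes):
            sizes[k] += 1        # join existing cluster k
        else:
            sizes.append(1)      # open a new cluster
    return sizes

alpha = 1.0
# Dirichlet process: existing-cluster weight proportional to its size
dp = sample_partition(500, lambda K: alpha, lambda s: s)
# Uniform process: every existing cluster gets the same weight
up = sample_partition(500, lambda K: alpha, lambda s: 1.0)
```

Running both with the same number of points makes the "rich-get-richer" effect visible: the Dirichlet process tends to concentrate mass in a few large clusters, while the uniform process spreads points far more evenly.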
Nonparametric Hierarchical Clustering of Functional Data
In this paper, we address the problem of clustering curves. We propose a
nonparametric method which partitions the curves into clusters and discretizes
the dimensions of the curve points into intervals. The cross-product of these
partitions forms a data-grid which is obtained using a Bayesian model selection
approach while making no assumptions regarding the curves. Finally, a
post-processing technique, aimed at reducing the number of clusters in order
to improve the interpretability of the clustering, is proposed. It consists of
optimally merging the clusters step by step, which corresponds to an
agglomerative hierarchical classification whose dissimilarity measure is the
variation of the criterion. Interestingly, this measure is none other than the
sum of the Kullback-Leibler divergences between the cluster distributions before
and after each merge. The practical interest of the approach for exploratory
analysis of functional data is demonstrated and compared with an alternative
approach on an artificial and a real-world data set.
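As a rough sketch of the merging step, assuming each cluster is summarized by a size and a discrete distribution over intervals, the cost of a merge can be written as the size-weighted sum of KL divergences from each cluster's distribution to the merged one. Names and the exact weighting are illustrative, not the paper's precise criterion.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(c1, c2):
    """Dissimilarity of merging two clusters, each given as
    (size, distribution): the size-weighted sum of KL divergences
    from each cluster's distribution to their pooled mixture."""
    n1, p1 = c1
    n2, p2 = c2
    merged = [(n1 * a + n2 * b) / (n1 + n2) for a, b in zip(p1, p2)]
    return n1 * kl(p1, merged) + n2 * kl(p2, merged)

def agglomerate(clusters, target):
    """Greedily merge the cheapest pair until `target` clusters remain."""
    clusters = list(clusters)
    while len(clusters) > target:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        n1, p1 = clusters[i]
        n2, p2 = clusters[j]
        merged = (n1 + n2, [(n1 * a + n2 * b) / (n1 + n2)
                            for a, b in zip(p1, p2)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Merging two clusters with identical distributions costs exactly zero, which is why such merges happen first in the hierarchy.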
Entropy regularization in probabilistic clustering
Bayesian nonparametric mixture models are widely used to cluster observations. However, a major drawback of the approach is that the estimated partition often exhibits unbalanced cluster frequencies, with only a few dominating clusters and a large number of sparsely populated ones. This feature renders the results hard to interpret unless we are willing to ignore a substantial number of observations and clusters. Interpreting the posterior distribution as a penalized likelihood, we show how the imbalance can be explained as a direct consequence of the cost functions involved in estimating the partition. In light of our findings, we propose a novel Bayesian estimator of the clustering configuration. The proposed estimator is equivalent to a post-processing procedure that reduces the number of sparsely populated clusters and enhances interpretability. The procedure takes the form of entropy regularization of the Bayesian estimate. While computationally convenient with respect to alternative strategies, it is also theoretically justified as a correction to the Bayesian loss function used for point estimation and, as such, can be applied to any posterior distribution over clusterings, regardless of the specific model used.
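The post-processing idea can be sketched schematically: among candidate partitions (for example, drawn from the posterior), pick the one minimizing a base loss plus an entropy penalty on the cluster-size proportions. The names and the exact form of the penalty below are assumptions for illustration, not the paper's estimator.

```python
import math
from collections import Counter

def entropy(partition):
    """Shannon entropy of the cluster-size proportions of a partition,
    given as a list of cluster labels (one per observation)."""
    n = len(partition)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(partition).values())

def entropy_regularized_estimate(candidates, base_loss, lam=0.5):
    """Pick the candidate partition minimizing base loss plus an
    entropy penalty. `base_loss` stands in for the posterior expected
    loss (e.g. Binder or variation of information) of each candidate;
    `lam` is an illustrative regularization strength."""
    return min(candidates, key=lambda p: base_loss(p) + lam * entropy(p))
```

With a zero base loss, the penalty alone favors the coarser of two partitions, which is the direction of the correction: configurations with many small clusters are discouraged.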
Probabilistic Size-constrained Microclustering
Microclustering refers to clustering models that produce small clusters or, equivalently, models in which the size of the clusters grows sublinearly with the number of samples. We formulate probabilistic microclustering models by assigning a prior distribution to the size of the clusters, and in particular consider microclustering models with explicit bounds on cluster size. The combinatorial constraints make full Bayesian inference complicated, but we develop a Gibbs sampling algorithm that can efficiently sample from the joint cluster allocation of all data points. We empirically demonstrate the computational efficiency of the algorithm on problem instances of varying difficulty.
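A minimal sketch of size-constrained Gibbs sampling, using generic CRP-style weights rather than the paper's model: when a point is reassigned, clusters already at the size bound simply receive zero probability, so the bound is preserved at every step.

```python
import random
from collections import Counter

def gibbs_sweep(labels, max_size, alpha, rng):
    """One Gibbs sweep over cluster assignments with a hard size bound
    (a sketch; `alpha` and the size-proportional weights are
    illustrative CRP-style choices, not the paper's conditionals)."""
    for i in range(len(labels)):
        sizes = Counter(labels)
        sizes[labels[i]] -= 1  # remove point i from its cluster
        # existing clusters are eligible only if strictly below the bound
        options = [k for k, s in sizes.items() if 0 < s < max_size]
        probs = [sizes[k] for k in options]
        fresh = max(sizes) + 1  # an unused cluster label
        options.append(fresh)
        probs.append(alpha)     # weight of opening a new cluster
        labels[i] = rng.choices(options, weights=probs)[0]
    return labels
```

Starting from singletons and sweeping repeatedly, no cluster can ever exceed `max_size`, since every reassignment checks the bound before a point may join.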
A Hybrid Model of Event Comprehension Predicts Human Activity at Human Scale
To act effectively, humans store event schemas and use them to predict the near future. How are schemas learned and represented in memory, and used in online comprehension? One way to answer these questions is to model event comprehension. What, then, are the computational principles of event comprehension? We proposed three candidate properties: 1) abstract representation of visual features, 2) a predictive mechanism with prediction error as feedback, and 3) contextual cues to guide prediction, and adapted a computational model embodying these properties. The model learned to predict activity dynamics from one pass through an 18-hour corpus of naturalistic human activity. Evaluated on another 3.5 hours of activities, it updated at times corresponding with human segmentation and formed human-like event categories, despite being given no feedback about segmentation or categorization. These results establish that a computational model embodying the three proposed properties can naturally reproduce two important features of human event comprehension.
Unsupervised morpheme segmentation in a non-parametric Bayesian framework
Learning morphemes from plain text is an emerging research area in natural language processing. Knowledge about the process of word formation is helpful in devising automatic segmentation of words into their constituent morphemes. This thesis applies an unsupervised morpheme induction method, based on the statistical behavior of words, to induce morphemes for word segmentation. The morpheme cache used for this purpose is based on the Dirichlet process (DP) and stores frequency information about the induced morphemes, whose occurrences follow a Zipfian distribution.
This thesis uses a number of empirical, morpheme-level grammar models to classify the induced morphemes under the labels prefix, stem, and suffix. These grammar models capture the different structural relationships among the morphemes. Furthermore, the morphemic categorization reduces the problem of over-segmentation. The output of the strategy demonstrates a significant improvement over the baseline system.
Finally, the thesis measures the performance of the unsupervised morphology learning system on Nepali.
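A DP-based morpheme cache of this kind can be sketched as a Chinese-restaurant-style predictive distribution: an already-induced morpheme is reused with weight proportional to its stored frequency, and an unseen one enters with weight alpha times a base measure. All names, and the crude letter-based base measure, are illustrative assumptions, not the thesis's implementation.

```python
from collections import Counter

class MorphemeCache:
    """A Dirichlet-process-style cache over morphemes (a sketch):
    frequent morphemes become increasingly likely to be reused,
    yielding the Zipfian frequency profile mentioned above."""

    def __init__(self, alpha=1.0, base=None):
        self.alpha = alpha
        # crude base measure: uniform over letters plus an end symbol
        self.base = base or (lambda m: (1.0 / 27) ** (len(m) + 1))
        self.counts = Counter()
        self.total = 0

    def prob(self, m):
        """Predictive probability of morpheme m under the cache."""
        return ((self.counts[m] + self.alpha * self.base(m))
                / (self.total + self.alpha))

    def add(self, m):
        """Record one more occurrence of morpheme m."""
        self.counts[m] += 1
        self.total += 1
```

After a few additions, cached morphemes dominate unseen ones, which is exactly the reuse pressure that drives the induction toward a compact morpheme inventory.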