Semi-supervised cross-entropy clustering with information bottleneck constraint
In this paper, we propose a semi-supervised clustering method, CEC-IB, that
models data with a set of Gaussian distributions and that retrieves clusters
based on a partial labeling provided by the user (partition-level side
information). By combining the ideas from cross-entropy clustering (CEC) with
those from the information bottleneck method (IB), our method trades off
three conflicting goals: the accuracy with which the data set is modeled, the
simplicity of the model, and the consistency of the clustering with side
information. Experiments demonstrate that CEC-IB has a performance comparable
to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but
is faster, more robust to noisy labels, automatically determines the optimal
number of clusters, and performs well when not all classes are present in the
side information. Moreover, in contrast to other semi-supervised models, it can
be successfully applied in discovering natural subgroups if the partition-level
side information is derived from the top levels of a hierarchical clustering
Recommended from our members
Statistical clustering of data
Cluster analysis aims at segmenting objects into groups with similar members and therefore helps to discover the distribution of properties and correlations in large datasets. Data clustering has been widely studied because it arises in many domains, including marketing, engineering, and the social sciences. In particular, the large-scale occurrence of transactional and experimental datasets in recent years has significantly increased the need for clustering techniques that reduce the number of objects to be handled and give a better understanding of the data. This report introduces fundamental concepts related to cluster analysis, addresses the similarity and dissimilarity measures used to define clusters, and explains three major clustering algorithms (hierarchical clustering, K-means clustering, and the Gaussian mixture model fitted by the Expectation-Maximization (EM) algorithm) theoretically and experimentally to illustrate the clustering process. Finally, methods for determining the number of clusters and validating the clustering are presented for clustering evaluation.
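As a rough illustration of the K-means algorithm named in the abstract above, the following is a minimal sketch of Lloyd's iteration in plain Python. It is not taken from the report: the function name `kmeans`, its parameters, the fixed initial centroids, and the sample points are all assumptions made for this example.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on a list of coordinate tuples.

    `points` and `centroids` are hypothetical names; `centroids` holds
    the initial centers chosen by the caller.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = updated
    return labels, centroids


# Two well-separated groups of made-up 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In this sketch the number of clusters is fixed by the length of the initial centroid list, which is why a separate method for choosing the number of clusters, as the abstract mentions, is needed in practice.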