Stochastic Data Clustering
In 1961 Herbert Simon and Albert Ando published the theory behind the
long-term behavior of a dynamical system that can be described by a nearly
uncoupled matrix. Over the past fifty years this theory has been used in a
variety of contexts, including queueing theory, brain organization, and
ecology. In all these applications, the structure of the system is known and
the point of interest is the various stages the system passes through on its
way to some long-term equilibrium.
This paper looks at this problem from the other direction. That is, we
develop a technique for using the evolution of the system to tell us about its
initial structure, and we use this technique to develop a new algorithm for
data clustering. Comment: 23 pages
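The idea of reading structure off the evolution of a nearly uncoupled system can be illustrated with a small sketch. This is not the paper's algorithm: the matrix `P`, the helper names, the step counts, and the tolerance are all assumptions chosen for illustration. The sketch raises a nearly uncoupled stochastic matrix to a moderate power and groups states whose rows have become almost identical, which is the intermediate stage the Simon-Ando theory describes.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, steps):
    """Return p raised to the power `steps` by repeated multiplication."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result


def cluster_by_evolution(p, steps, tol):
    """Group states whose rows of p**steps differ by less than tol.

    Hypothetical helper: a greedy grouping that is adequate for a
    toy example, not a robust clustering procedure.
    """
    pt = mat_pow(p, steps)
    labels = [-1] * len(p)
    next_label = 0
    for i in range(len(p)):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, len(p)):
            gap = max(abs(x - y) for x, y in zip(pt[i], pt[j]))
            if labels[j] == -1 and gap < tol:
                labels[j] = next_label
        next_label += 1
    return labels


# Assumed toy system: blocks {0, 1} and {2, 3}, coupled with weight 0.001.
P = [
    [0.700, 0.299, 0.001, 0.000],
    [0.299, 0.700, 0.000, 0.001],
    [0.001, 0.000, 0.600, 0.399],
    [0.000, 0.001, 0.399, 0.600],
]

print(cluster_by_evolution(P, steps=1, tol=1e-3))   # too early: no grouping yet
print(cluster_by_evolution(P, steps=20, tol=1e-3))  # blocks have equilibrated internally
```

In this toy system the within-block differences decay much faster than the between-block difference, so an intermediate number of steps exposes the two blocks. Running far longer would let the whole system approach its long-term equilibrium, where the block structure is no longer visible.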
Multi-View Multiple Clusterings using Deep Matrix Factorization
Multi-view clustering aims at integrating complementary information from
multiple heterogeneous views to improve clustering results. Existing multi-view
clustering solutions can only output a single clustering of the data. Due to
their multiplicity, multi-view data can have different groupings that are
reasonable and interesting from different perspectives. However, how to find
multiple, meaningful, and diverse clustering results from multi-view data is
still a rarely studied and challenging topic in multi-view clustering and
multiple clusterings. In this paper, we introduce a deep matrix factorization
based solution (DMClusts) to discover multiple clusterings. DMClusts gradually
factorizes multi-view data matrices into representational subspaces
layer-by-layer and generates one clustering in each layer. To enforce the
diversity between generated clusterings, it minimizes a new redundancy
quantification term derived from the proximity between samples in these
subspaces. We further introduce an iterative optimization procedure to
simultaneously seek multiple clusterings with quality and diversity.
Experimental results on benchmark datasets confirm that DMClusts outperforms
state-of-the-art multiple clustering solutions.
Clustering Patients with Tensor Decomposition
In this paper we present a method for the unsupervised clustering of
high-dimensional binary data, with a special focus on electronic healthcare
records. We present a robust and efficient heuristic that addresses this problem
using tensor decomposition. We explain why this approach is preferable to the
more commonly used distance-based methods for tasks such as clustering patient
records. We run the algorithm on two datasets of healthcare
records, obtaining clinically meaningful results. Comment: Presented at 2017 Machine Learning for Healthcare Conference (MLHC
2017). Boston, MA
Crowdclustering
Is it possible to crowdsource categorization? Among the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, and (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters/categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations.
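To make challenges (a) and (b) concrete, here is a minimal sketch of combining partial clusterings from several workers. It deliberately swaps the Bayesian model for a much simpler stand-in: pairwise co-association voting followed by merging of pairs that most workers placed together. The function name, the `threshold` parameter, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def aggregate_worker_clusterings(worker_labels, threshold=0.5):
    """Merge items that a majority of workers who saw both grouped together.

    worker_labels: list of dicts mapping item -> that worker's own label.
    Each worker may see only some items and may use any label scheme.
    """
    together = defaultdict(int)  # times a pair shared a label
    seen = defaultdict(int)      # times a pair was judged by one worker
    items = set()
    for labels in worker_labels:
        items.update(labels)
        for a, b in combinations(sorted(labels), 2):
            seen[(a, b)] += 1
            if labels[a] == labels[b]:
                together[(a, b)] += 1

    # Union-find over items; link pairs whose agreement exceeds the threshold.
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), count in seen.items():
        if together[(a, b)] / count > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for item in items:
        groups[find(item)].add(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical workers, each seeing a different subset with different labels.
workers = [
    {"cat": 0, "dog": 0, "car": 1},
    {"dog": "animal", "car": "vehicle", "bus": "vehicle"},
    {"cat": 1, "bus": 2, "car": 2},
]
print(aggregate_worker_clusterings(workers))
```

Unlike the Bayesian approach described in the abstract, this voting scheme does not model per-worker parameters or hierarchical categories; it only shows how partial, inconsistently labelled views can still be reconciled through pairwise evidence.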
Unsupervised Algorithms for Microarray Sample Stratification
The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. A wide range of analytical techniques has been proposed to discover hidden patterns in such data. Here, we describe the basic methodologies for analyzing microarray datasets, focusing on the task of (sub)group discovery. Peer reviewed