Unsupervised cryo-EM data clustering through adaptively constrained K-means algorithm
In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering algorithm is widely used for unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and the uncertainty of molecular orientations, the traditional K-means clustering algorithm may assign images to the wrong classes and produce classes with large variation in membership. Overcoming these limitations requires further development of clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term into the objective function, our algorithm not only avoids large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis.
Comment: 35 pages, 14 figures
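The flavor of such a size constraint can be illustrated with a small sketch. The penalty below (squared distance plus `lam` times the cluster's running size) is an illustrative stand-in, not the paper's actual adaptive constraint term; the function name and all parameters are invented for the example.

```python
def size_penalized_kmeans(points, k, lam=0.5, iters=20):
    """Toy K-means with a cluster-size penalty in the assignment cost.

    Assigning a point to cluster j costs ||x - c_j||^2 + lam * n_j, where
    n_j is cluster j's size so far in the current pass, so oversized
    clusters become progressively less attractive and class sizes stay
    balanced. (Illustrative only; not the paper's adaptive constraint.)
    """
    # Naive initialization (first k points) keeps the sketch deterministic.
    centers = [list(points[i]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        sizes = [0] * k
        for i, p in enumerate(points):
            costs = [sum((a - b) ** 2 for a, b in zip(p, c)) + lam * sizes[j]
                     for j, c in enumerate(centers)]
            labels[i] = min(range(k), key=costs.__getitem__)
            sizes[labels[i]] += 1
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Because the penalty grows with the running size, a cluster that has already absorbed many points effectively "repels" further assignments, which is the balancing behavior the abstract describes.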
Scalable k-Means Clustering via Lightweight Coresets
Coresets are compact representations of data sets such that models trained on
a coreset are provably competitive with models trained on the full data set. As
such, they have been successfully used to scale up clustering models to massive
data sets. While existing approaches generally only allow for multiplicative
approximation errors, we propose a novel notion of lightweight coresets that
allows for both multiplicative and additive errors. We provide a single
algorithm to construct lightweight coresets for k-means clustering as well as
soft and hard Bregman clustering. The algorithm is substantially faster than
existing constructions, embarrassingly parallel, and the resulting coresets are
smaller. We further show that the proposed approach naturally generalizes to
statistical k-means clustering and that, compared to existing results, it can
be used to compute smaller summaries for empirical risk minimization. In
extensive experiments, we demonstrate that the proposed algorithm outperforms
existing data summarization strategies in practice.
Comment: To appear in the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD)
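The lightweight-coreset sampling distribution is simple enough to sketch directly: each point x is sampled with probability q(x) = 1/(2n) + d(x, mean)^2 / (2 * sum of squared distances to the mean), and receives weight 1/(m * q(x)). The helper below is a minimal pure-Python rendering of that construction; the function name and signature are ours.

```python
import random

def lightweight_coreset(points, m, seed=0):
    """Sample a weighted coreset of size m for k-means.

    The mixture q(x) = 1/(2n) + d(x, mu)^2 / (2 * sum_x' d(x', mu)^2)
    combines uniform sampling (additive error) with sensitivity-style
    sampling (multiplicative error); weights 1/(m * q(x)) make the
    weighted coreset cost an unbiased estimate of the full-data cost.
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    sq = [sum((p[j] - mean[j]) ** 2 for j in range(d)) for p in points]
    total = sum(sq) or 1.0  # guard against all-identical data
    q = [0.5 / n + 0.5 * s / total for s in sq]
    rng = random.Random(seed)
    idx = rng.choices(range(n), weights=q, k=m)
    return [(points[i], 1.0 / (m * q[i])) for i in idx]
```

Note the two halves of q: the uniform term guarantees every point has sampling probability at least 1/(2n), which is what permits the additive error and makes the construction a single cheap pass (mean, distances, sample) rather than the k-means++-style sequential seeding used by earlier coresets.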
Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
This paper presents a novel algorithm, based upon the dependent Dirichlet
process mixture model (DDPMM), for clustering batch-sequential data containing
an unknown number of evolving clusters. The algorithm is derived via a
low-variance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM,
and provides a hard clustering with convergence guarantees similar to those of
the k-means algorithm. Empirical results from a synthetic test with moving
Gaussian clusters and a test with real ADS-B aircraft trajectory data
demonstrate that the algorithm requires orders of magnitude less computational
time than contemporary probabilistic and hard clustering algorithms, while
providing higher accuracy on the examined datasets.
Comment: This paper is from NIPS 2013. Please use the following BibTeX citation: @inproceedings{Campbell13_NIPS, Author = {Trevor Campbell and Miao Liu and Brian Kulis and Jonathan P. How and Lawrence Carin}, Title = {Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process}, Booktitle = {Advances in Neural Information Processing Systems (NIPS)}, Year = {2013}}
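The kind of hard algorithm such low-variance asymptotics yield is easiest to see in the batch setting: DP-means (Kulis and Jordan, 2012), where a point whose squared distance to every center exceeds a threshold lam spawns a new cluster, so k is inferred rather than fixed. The sketch below shows only that static analogue; the paper's Dynamic Means additionally tracks cluster motion, birth, and death across batches, which is not reproduced here.

```python
def dp_means(points, lam, iters=10):
    """DP-means sketch: k-means where a far-away point opens a cluster.

    This is the low-variance asymptotic limit of Gibbs sampling for a
    (static) Dirichlet process mixture; lam plays the role the DP
    concentration parameter plays in the probabilistic model.
    """
    centers = [list(points[0])]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            j = min(range(len(centers)), key=d2.__getitem__)
            if d2[j] > lam:          # too far from every center:
                centers.append(list(p))  # spawn a new cluster at this point
                j = len(centers) - 1
            labels[i] = j
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

The convergence guarantee mirrors k-means: each assignment and each center update monotonically decreases the DP-means objective (sum of squared distances plus lam times the number of clusters).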
A Convex Formulation for Spectral Shrunk Clustering
Spectral clustering is a fundamental technique in the field of data mining
and information processing. Most existing spectral clustering algorithms
integrate dimensionality reduction into the clustering process assisted by
manifold learning in the original space. However, the manifold in the reduced-dimensional subspace is likely to exhibit properties different from those in the original space. Thus, applying manifold information obtained from the original space to the clustering process in a low-dimensional subspace is prone to inferior performance. To address this issue, we propose a novel convex algorithm that mines the manifold structure in the low-dimensional subspace. In addition, our unified learning process makes the manifold learning particularly tailored for the clustering. Compared with related methods, the proposed algorithm produces a more structured clustering result. To validate the efficacy of the proposed algorithm, we perform extensive experiments on several benchmark datasets in comparison with state-of-the-art clustering approaches. The experimental results demonstrate that the proposed algorithm achieves promising clustering performance.
Comment: AAAI201
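For contrast with the joint formulation the abstract proposes, the classical two-stage spectral pipeline it improves on can be sketched: compute a spectral embedding from the graph Laplacian, then cut. The toy below finds the Fiedler vector (second-smallest eigenvector of L = D - A) by power iteration on sigma*I - L, deflating the constant eigenvector, and makes a sign cut. It is a baseline sketch for two-way partitioning, not the paper's convex shrunk algorithm; the function name is ours.

```python
import math

def fiedler_split(adj, iters=500):
    """Two-way spectral cut via the Fiedler vector of L = D - A.

    Power iteration on (sigma*I - L) with the all-ones eigenvector
    projected out converges to the eigenvector of the second-smallest
    Laplacian eigenvalue; its sign pattern gives the partition.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    sigma = 2.0 * max(deg)  # shift so sigma*I - L is positive semidefinite
    v = [math.sin(i + 1.0) for i in range(n)]  # arbitrary nonzero start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # deflate the constant (lambda = 0) eigenvector
        # w = (sigma*I - L) v, with (L v)_i = deg_i * v_i - sum_j A_ij v_j
        w = [sigma * v[i] - (deg[i] * v[i]
             - sum(adj[i][j] * v[j] for j in range(n)))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return [0 if x < 0 else 1 for x in v]
```

The abstract's point is precisely that this pipeline computes the embedding from the original-space graph and only then clusters; the proposed convex method instead learns the manifold structure in the low-dimensional subspace itself.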
Reducing the Time Requirement of k-Means Algorithm
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray
data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in ddimensional
space Rd and an integer k. The problem is to determine a set of k points in Rd, called centers, so as to minimize
the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm,
which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is
based on the recently established relationship between principal component analysis and the k-means clustering. We
provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and
six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is
empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the
clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the
quality is good (ARIHA.0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA.0.9).
In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to
microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm
can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the
members is used. This has been demonstrated in this work on six non-biological data
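The reduce-then-cluster idea behind such PCA-based k-means can be sketched under a strong simplification: project the data onto the top principal component (found by power iteration on the covariance) and run 1-D k-means on the projections. This is our illustrative reading of the PCA/k-means relationship, not the paper's proven procedure, and the spread-out initialization is a naive heuristic.

```python
import math

def top_pc(points, iters=200):
    """First principal component via power iteration on X^T X."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    x = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0] * d
    for _ in range(iters):
        s = [sum(row[j] * v[j] for j in range(d)) for row in x]   # X v
        w = [sum(x[i][j] * s[i] for i in range(n)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w)) or 1.0
        v = [c / norm for c in w]
    return mean, v

def pca_kmeans(points, k, iters=20):
    """Sketch: cluster 1-D PCA projections instead of the raw d-dim data.

    Each k-means iteration then costs O(n*k) instead of O(n*k*d),
    which is the kind of saving the reduce-then-cluster idea buys.
    """
    mean, v = top_pc(points)
    z = [sum((p[j] - mean[j]) * v[j] for j in range(len(v))) for p in points]
    centers = sorted(z)[::max(1, len(z) // k)][:k]  # naive spread-out init
    labels = [0] * len(z)
    for _ in range(iters):
        for i, t in enumerate(z):
            labels[i] = min(range(k), key=lambda j: (t - centers[j]) ** 2)
        for j in range(k):
            m = [t for t, l in zip(z, labels) if l == j]
            if m:
                centers[j] = sum(m) / len(m)
    return labels
```

For real microarray data one would keep more than one component; a single component is used here only to keep the sketch short.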
Parallel Hierarchical Affinity Propagation with MapReduce
The accelerated evolution and explosion of the Internet and social media are generating voluminous quantities of data (on zettabyte scales). Paramount among the challenges of manipulating and extracting actionable intelligence from such volumes is the need for scalable, performance-conscious analytics algorithms. To directly address this need, we propose a novel MapReduce
implementation of the exemplar-based clustering algorithm known as Affinity
Propagation. Our parallelization strategy extends to the multilevel
Hierarchical Affinity Propagation algorithm and enables tiered aggregation of
unstructured data with minimal free parameters, in principle requiring only a
similarity measure between data points. We detail the linear run-time
complexity of our approach, overcoming the limiting quadratic complexity of the
original algorithm. Experimental validation of our clustering methodology on a
variety of synthetic and real data sets (e.g. images and point data)
demonstrates our competitiveness against other state-of-the-art MapReduce
clustering techniques.
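For reference, the serial message-passing updates that such a MapReduce implementation parallelizes look as follows: Frey and Dueck's responsibility/availability rules with damping, applied to a similarity matrix whose diagonal holds the exemplar preferences. This toy version keeps the original quadratic per-iteration cost; the paper's contribution, the MapReduce tiering and the linear run-time strategy, is not shown.

```python
def affinity_propagation(s, iters=200, damping=0.5):
    """Plain serial Affinity Propagation on a similarity matrix s.

    r[i][k]: how well-suited k is to serve as exemplar for i, relative
             to i's other candidates.
    a[i][k]: accumulated evidence that k should be an exemplar at all.
    Returns, for each point, the index of its chosen exemplar.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for k in range(n):
                best = max(a[i][kk] + s[i][kk] for kk in range(n) if kk != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best)
        for i in range(n):
            for k in range(n):
                pos = sum(max(0.0, r[ii][k])
                          for ii in range(n) if ii not in (i, k))
                new = pos if i == k else min(0.0, r[k][k] + pos)
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]
```

The responsibility update is a per-row (per-point) computation and the availability update a per-column one, which is exactly the structure that makes the algorithm amenable to map/reduce phases keyed by row and by column.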