Unsupervised cryo-EM data clustering through adaptively constrained K-means algorithm
In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering
algorithm is widely used in unsupervised 2D classification of projection images
of biological macromolecules. 3D ab initio reconstruction requires accurate
unsupervised classification in order to separate molecular projections of
distinct orientations. Due to background noise in single-particle images and
uncertainty in molecular orientations, the traditional K-means clustering algorithm
may classify images into wrong classes and produce classes with a large
variation in membership. Overcoming these limitations requires further
development of clustering algorithms for cryo-EM data analysis. We propose a
novel unsupervised data clustering method building upon the traditional K-means
algorithm. By introducing an adaptive constraint term in the objective
function, our algorithm not only avoids a large variation in class sizes but
also produces more accurate data clustering. Applications of this approach to
both simulated and experimental cryo-EM data demonstrate that our algorithm is
a significantly improved alternative to the traditional K-means algorithm in
single-particle cryo-EM analysis. Comment: 35 pages, 14 figures
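The abstract above does not give the form of the adaptive constraint term, so the following is only a minimal sketch of the general idea of penalizing uneven class sizes in a K-means objective. It assumes a simple penalty proportional to the current class size, added to the squared distance during assignment; the function name `size_penalized_kmeans` and the `penalty` parameter are hypothetical and are not taken from the paper.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def size_penalized_kmeans(points, k, penalty=0.0, iters=20, seed=0):
    """K-means with an assumed size penalty in the assignment step.

    Each point is assigned to the class minimizing
        squared distance + penalty * (number of points already in the class),
    so penalty=0 reduces to the ordinary Lloyd iteration.
    This is an illustrative stand-in, not the paper's constraint term.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        sizes = [0] * k
        # Assignment step: points are processed in order, so the penalty
        # grows for a class as it fills up within this pass.
        for i, p in enumerate(points):
            costs = [sq_dist(p, centers[j]) + penalty * sizes[j] for j in range(k)]
            labels[i] = min(range(k), key=costs.__getitem__)
            sizes[labels[i]] += 1
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a class is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers
```

With a very large `penalty`, the size term dominates and the class sizes differ by at most one; with `penalty=0.0` the class sizes are unconstrained, which is the behavior the abstract describes as problematic for noisy images.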
Space Exploration via Proximity Search
We investigate what computational tasks can be performed on a point set in
R^d, if we are only given black-box access to it via nearest-neighbor
search. This is a reasonable assumption if the underlying point set is either
provided implicitly, or it is stored in a data structure that can answer such
queries. In particular, we show the following: (A) One can compute an
approximate bi-criteria k-center clustering of the point set, and more
generally compute a greedy permutation of the point set. (B) One can decide if
a query point is (approximately) inside the convex-hull of the point set.
We also investigate the problem of clustering the given point set, such that
meaningful proximity queries can be carried out on the centers of the clusters,
instead of the whole point set.
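For readers unfamiliar with the term, a greedy permutation orders the points so that each new point is the one farthest from all points chosen so far; its first k points give the classic 2-approximate k-center solution. The sketch below is the textbook farthest-first traversal with direct access to all distances. It is not the black-box, nearest-neighbor-only procedure the abstract describes, and the name `greedy_permutation` is chosen here for illustration.

```python
import math


def greedy_permutation(points, start=0):
    """Farthest-first traversal over a list of coordinate tuples.

    Returns (order, radii): order is a list of point indices, and
    radii[i] is the distance from the (i+1)-th chosen point to the
    points chosen before it. The radii are non-increasing.
    """
    order = [start]
    # dist[i] = distance from point i to the nearest chosen point so far
    dist = [math.dist(p, points[start]) for p in points]
    radii = []
    for _ in range(len(points) - 1):
        nxt = max(range(len(points)), key=dist.__getitem__)
        radii.append(dist[nxt])
        order.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return order, radii


order, radii = greedy_permutation([(0, 0), (1, 0), (10, 0), (4, 0)])
# The first k indices of `order` serve as k-center centers.
```

This version costs O(n^2) distance evaluations, which is the direct access that a nearest-neighbor-only setting does not allow.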
The Evolution of Beliefs over Signed Social Networks
We study the evolution of opinions (or beliefs) over a social network modeled
as a signed graph. The sign attached to an edge in this graph characterizes
whether the corresponding individuals or end nodes are friends (positive links)
or enemies (negative links). Pairs of nodes are randomly selected to interact
over time, and when two nodes interact, each of them updates its opinion based
on the opinion of the other node and the sign of the corresponding link. This
model generalizes the DeGroot model to account for negative links: when two enemies
interact, their opinions go in opposite directions. We provide conditions for
convergence and divergence in expectation, in mean square, and in the almost sure
sense, and exhibit phase transition phenomena for these notions of convergence
depending on the parameters of the opinion update model and on the structure of
the underlying graph. We establish a no-survivor theorem, stating that
the difference in opinions of any two nodes diverges whenever opinions in the
network diverge as a whole. We also prove a live-or-die lemma, indicating
that almost surely, the opinions either converge to an agreement or diverge.
Finally, we extend our analysis to cases where opinions have hard lower and
upper limits. In these cases, we study when and how opinions may become
asymptotically clustered to the belief boundaries, and highlight the crucial
influence of (strong or weak) structural balance of the underlying network on
this clustering phenomenon.
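The pairwise dynamics described above can be illustrated with a small simulation. The exact update rule is not stated in the abstract, so this sketch assumes a linear rule: across a positive link each node moves a fraction `alpha` toward the other, and across a negative link each node moves a fraction `beta` away from the other. The names `interact`, `simulate`, `alpha`, and `beta` are hypothetical.

```python
import random


def interact(x, i, j, sign, alpha=0.3, beta=0.3):
    """Update opinions x[i], x[j] in place for one interaction.

    Assumed rule (illustrative only):
      positive link: move toward the other node by a fraction alpha;
      negative link: move away from the other node by a fraction beta.
    """
    xi, xj = x[i], x[j]
    if sign > 0:
        x[i] = xi + alpha * (xj - xi)
        x[j] = xj + alpha * (xi - xj)
    else:
        x[i] = xi - beta * (xj - xi)
        x[j] = xj - beta * (xi - xj)


def simulate(x0, edges, steps, seed=0, alpha=0.3, beta=0.3):
    """Pick a random signed edge (i, j, sign) at each step and apply interact."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        i, j, sign = rng.choice(edges)
        interact(x, i, j, sign, alpha, beta)
    return x


# All-positive triangle: opinions contract toward a common value.
friends = [(0, 1, +1), (1, 2, +1), (0, 2, +1)]
final = simulate([0.0, 0.5, 1.0], friends, steps=500)
```

Under this assumed rule, a single negative interaction widens the gap between the two nodes by a factor of 1 + 2 * beta, which is the mechanism behind the divergence the abstract refers to.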
The Bane of Low-Dimensionality Clustering
In this paper, we give a conditional lower bound of on
running time for the classic k-median and k-means clustering objectives (where
n is the size of the input), even in low-dimensional Euclidean space of
dimension four, assuming the Exponential Time Hypothesis (ETH). We also
consider k-median (and k-means) with penalties where each point need not be
assigned to a center, in which case it must pay a penalty, and extend our lower
bound to at least three-dimensional Euclidean space.
This stands in stark contrast to many other geometric problems such as the
traveling salesman problem, or computing an independent set of unit spheres.
While these problems benefit from the so-called (limited) blessing of
dimensionality, as they can be solved in time or
in d dimensions, our work shows that widely-used clustering
objectives have a lower bound of , even in dimension four.
We complete the picture by considering the two-dimensional case: we show that
there is no algorithm that solves the penalized version in time less than
, and provide a matching upper bound of .
The main tool we use to establish these lower bounds is the placement of
points on the moment curve, which takes its inspiration from constructions of
point sets yielding Delaunay complexes of high complexity.
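As background for the last sentence, the moment curve in d dimensions is the curve t -> (t, t^2, ..., t^d). The sketch below only generates points on it and checks, in the plane, the general-position property that makes it useful (no three points on the parabola are collinear). It does not reproduce the paper's reduction; the helper names are chosen here for illustration.

```python
def moment_curve_point(t, d):
    """Point (t, t^2, ..., t^d) on the moment curve in d dimensions."""
    return tuple(t ** i for i in range(1, d + 1))


def orientation_2d(p, q, r):
    """Twice the signed area of triangle pqr; zero iff p, q, r are collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


# Four points on the moment curve in dimension four.
points_4d = [moment_curve_point(t, 4) for t in range(1, 5)]

# In the plane the moment curve is the parabola y = x^2.
a, b, c = (moment_curve_point(t, 2) for t in (1, 2, 3))
collinear = orientation_2d(a, b, c) == 0
```

Using integer parameters keeps the coordinates exact, so the collinearity check involves no floating-point rounding.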
Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization
We study the problem of detecting a structured, low-rank signal matrix
corrupted with additive Gaussian noise. This includes clustering in a Gaussian
mixture model, sparse PCA, and submatrix localization. Each of these problems
is conjectured to exhibit a sharp information-theoretic threshold, below which
the signal is too weak for any algorithm to detect. We derive upper and lower
bounds on these thresholds by applying the first and second moment methods to
the likelihood ratio between these "planted models" and null models where the
signal matrix is zero. Our bounds differ by at most a factor of root two when
the rank is large (in the clustering and submatrix localization problems, when
the number of clusters or blocks is large) or the signal matrix is very sparse.
Moreover, our upper bounds show that for each of these problems there is a
significant regime where reliable detection is information-theoretically
possible but where known algorithms such as PCA fail completely, since the
spectrum of the observed matrix is uninformative. This regime is analogous to
the conjectured 'hard but detectable' regime for community detection in sparse
graphs. Comment: For sparse PCA and submatrix localization, we determine the
information-theoretic threshold exactly in the limit where the number of
blocks is large or the signal matrix is very sparse based on a conditional
second moment method, closing the factor of root two gap in the first version.