8,824 research outputs found
Faster K-Means Cluster Estimation
There has been considerable work on improving popular clustering algorithm
`K-means' in terms of mean squared error (MSE) and speed, both. However, most
of the k-means variants tend to compute distance of each data point to each
cluster centroid for every iteration. We propose a fast heuristic to overcome
this bottleneck with only marginal increase in MSE. We observe that across all
iterations of K-means, a data point changes its membership only among a small
subset of clusters. Our heuristic predicts such clusters for each data point by
looking at nearby clusters after the first iteration of k-means. We augment
well known variants of k-means with our heuristic to demonstrate effectiveness
of our heuristic. For various synthetic and real-world datasets, our heuristic
achieves speed-up of up-to 3 times when compared to efficient variants of
k-means.Comment: 6 pages, Accepted at ECIR 201
BigFCM: Fast, Precise and Scalable FCM on Hadoop
Clustering plays an important role in mining big data both as a modeling
technique and a preprocessing step in many data mining process implementations.
Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing
each data record to belong to more than one cluster to some degree. However, a
serious challenge in fuzzy clustering is the lack of scalability. Massive
datasets in emerging fields such as geosciences, biology and networking do
require parallel and distributed computations with high performance to solve
real-world problems. Although some clustering methods are already improved to
execute on big data platforms, but their execution time is highly increased for
large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering named
BigFCM is proposed and designed for the Hadoop distributed data platform. Based
on the map-reduce programming model, it exploits several mechanisms including
an efficient caching design to achieve several orders of magnitude reduction in
execution time. Extensive evaluation over multi-gigabyte datasets shows that
BigFCM is scalable while it preserves the quality of clustering
Learning Representations in Model-Free Hierarchical Reinforcement Learning
Common approaches to Reinforcement Learning (RL) are seriously challenged by
large-scale applications involving huge state spaces and sparse delayed reward
feedback. Hierarchical Reinforcement Learning (HRL) methods attempt to address
this scalability issue by learning action selection policies at multiple levels
of temporal abstraction. Abstraction can be had by identifying a relatively
small set of states that are likely to be useful as subgoals, in concert with
the learning of corresponding skill policies to achieve those subgoals. Many
approaches to subgoal discovery in HRL depend on the analysis of a model of the
environment, but the need to learn such a model introduces its own problems of
scale. Once subgoals are identified, skills may be learned through intrinsic
motivation, introducing an internal reward signal marking subgoal attainment.
In this paper, we present a novel model-free method for subgoal discovery using
incremental unsupervised learning over a small memory of the most recent
experiences (trajectories) of the agent. When combined with an intrinsic
motivation learning mechanism, this method learns both subgoals and skills,
based on experiences in the environment. Thus, we offer an original approach to
HRL that does not require the acquisition of a model of the environment,
suitable for large-scale applications. We demonstrate the efficiency of our
method on two RL problems with sparse delayed feedback: a variant of the rooms
environment and the first screen of the ATARI 2600 Montezuma's Revenge game
Unsupervised cryo-EM data clustering through adaptively constrained K-means algorithm
In single-particle cryo-electron microscopy (cryo-EM), K-means clustering
algorithm is widely used in unsupervised 2D classification of projection images
of biological macromolecules. 3D ab initio reconstruction requires accurate
unsupervised classification in order to separate molecular projections of
distinct orientations. Due to background noise in single-particle images and
uncertainty of molecular orientations, traditional K-means clustering algorithm
may classify images into wrong classes and produce classes with a large
variation in membership. Overcoming these limitations requires further
development on clustering algorithms for cryo-EM data analysis. We propose a
novel unsupervised data clustering method building upon the traditional K-means
algorithm. By introducing an adaptive constraint term in the objective
function, our algorithm not only avoids a large variation in class sizes but
also produces more accurate data clustering. Applications of this approach to
both simulated and experimental cryo-EM data demonstrate that our algorithm is
a significantly improved alterative to the traditional K-means algorithm in
single-particle cryo-EM analysis.Comment: 35 pages, 14 figure
- …