29,447 research outputs found
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
Clustering non-Euclidean data is difficult, and one of the most used
algorithms besides hierarchical clustering is the popular algorithm
Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In
Euclidean geometry the mean-as used in k-means-is a good estimator for the
cluster center, but this does not hold for arbitrary dissimilarities. PAM uses
the medoid instead, the object with the smallest dissimilarity to all others in
the cluster. This notion of centrality can be used with any (dis-)similarity,
and thus is of high relevance to many domains such as biology that require the
use of Jaccard, Gower, or more complex distances.
A key issue with PAM is its high run time cost. We propose modifications to
the PAM algorithm to achieve an O(k)-fold speedup in the second SWAP phase of
the algorithm, but will still find the same results as the original PAM
algorithm. If we slightly relax the choice of swaps performed (at comparable
quality), we can further accelerate the algorithm by performing up to k swaps
in each iteration. With the substantially faster SWAP, we can now also explore
alternative strategies for choosing the initial medoids. We also show how the
CLARA and CLARANS algorithms benefit from these modifications. It can easily be
combined with earlier approaches to use PAM and CLARA on big data (some of
which use PAM as a subroutine, hence can immediately benefit from these
improvements), where the performance with high k becomes increasingly
important.
In experiments on real data with k=100, we observed a 200-fold speedup
compared to the original PAM SWAP algorithm, making PAM applicable to larger
data sets as long as we can afford to compute a distance matrix, and in
particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the
speedup is expected to increase with k)
BigFCM: Fast, Precise and Scalable FCM on Hadoop
Clustering plays an important role in mining big data both as a modeling
technique and a preprocessing step in many data mining process implementations.
Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing
each data record to belong to more than one cluster to some degree. However, a
serious challenge in fuzzy clustering is the lack of scalability. Massive
datasets in emerging fields such as geosciences, biology and networking do
require parallel and distributed computations with high performance to solve
real-world problems. Although some clustering methods are already improved to
execute on big data platforms, but their execution time is highly increased for
large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering named
BigFCM is proposed and designed for the Hadoop distributed data platform. Based
on the map-reduce programming model, it exploits several mechanisms including
an efficient caching design to achieve several orders of magnitude reduction in
execution time. Extensive evaluation over multi-gigabyte datasets shows that
BigFCM is scalable while it preserves the quality of clustering
Asynchronous Training of Word Embeddings for Large Text Corpora
Word embeddings are a powerful approach for analyzing language and have been
widely popular in numerous tasks in information retrieval and text mining.
Training embeddings over huge corpora is computationally expensive because the
input is typically sequentially processed and parameters are synchronously
updated. Distributed architectures for asynchronous training that have been
proposed either focus on scaling vocabulary sizes and dimensionality or suffer
from expensive synchronization latencies.
In this paper, we propose a scalable approach to train word embeddings by
partitioning the input space instead in order to scale to massive text corpora
while not sacrificing the performance of the embeddings. Our training procedure
does not involve any parameter synchronization except a final sub-model merge
phase that typically executes in a few minutes. Our distributed training scales
seamlessly to large corpus sizes and we get comparable and sometimes even up to
45% performance improvement in a variety of NLP benchmarks using models trained
by our distributed procedure which requires of the time taken by the
baseline approach. Finally we also show that we are robust to missing words in
sub-models and are able to effectively reconstruct word representations.Comment: This paper contains 9 pages and has been accepted in the WSDM201
- …