Efficient clustering techniques for big data
Clustering is an essential data mining technique that divides observations into
groups where each group contains similar observations. K-Means is one of the
most popular clustering algorithms and has been in widespread use for over
fifty years. The majority of the running time in the original K-Means algorithm
(known as Lloyd’s algorithm) is spent on computing distances from each data
point to all cluster centres to find the closest centre to each data point. Due to
the current exponential growth of data, it has become necessary to improve K-Means
further so that it can cope with large-scale datasets, known as Big Data. Hence,
the main aim of this thesis is to improve the efficiency and scalability of Lloyd’s
K-Means.
One of the most effective techniques to accelerate K-Means is to use the triangle
inequality. Implementing such techniques on a reliable distributed model
creates a powerful combination. This combination can lead to an efficient and
highly scalable parallel version of K-Means that offers a practical solution to the
problem of clustering Big Data.
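As a rough illustration of the triangle-inequality idea (a minimal NumPy sketch of the standard pruning lemma used by variants such as Elkan's and Hamerly's algorithms, not the thesis's own implementation; all names are illustrative): if the distance between the currently best centre and another centre is at least twice the distance from the point to the best centre, the other centre cannot be closer and its distance need not be computed.

```python
import numpy as np

def assign_with_triangle_inequality(points, centres):
    """Assign each point to its closest centre, skipping distance computations
    that the triangle inequality proves unnecessary.

    Pruning lemma: if d(c_a, c_b) >= 2 * d(x, c_a), then d(x, c_b) >= d(x, c_a),
    so centre c_b cannot be closer to x than c_a."""
    # Pairwise centre-to-centre distances, computed once per iteration.
    centre_dist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    labels = np.empty(len(points), dtype=int)
    for i, x in enumerate(points):
        best = 0
        best_dist = np.linalg.norm(x - centres[0])
        for j in range(1, len(centres)):
            # Prune: centre j provably cannot beat the current best centre.
            if centre_dist[best, j] >= 2.0 * best_dist:
                continue
            d = np.linalg.norm(x - centres[j])
            if d < best_dist:
                best, best_dist = j, d
        labels[i] = best
    return labels
```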
MapReduce, and its popular open-source implementation known as Hadoop,
provides a distributed computing framework that efficiently stores, manages, and
processes large-scale datasets over a large cluster of commodity machines. Many
studies introduced a parallel implementation of Lloyd’s K-Means on Hadoop in
order to improve the algorithm’s scalability. This research examines methods
based on triangle inequality to achieve further improvements on the efficiency of
the parallel Lloyd’s K-Means on Hadoop.
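For orientation, one Lloyd's iteration maps naturally onto MapReduce roughly as follows (a plain-Python sketch of the map and reduce roles, not Hadoop's Java API; it assumes the current centres are broadcast to every mapper, e.g. via the distributed cache):

```python
def map_point(point, centres):
    """Mapper: emit (closest centre id, (point, 1)) for one data point."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centre)) for centre in centres]
    closest = distances.index(min(distances))
    return closest, (point, 1)

def reduce_centre(centre_id, values):
    """Reducer: average the points assigned to one centre to obtain its new position."""
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [t + p for t, p in zip(total, point)]
        count += n
    return centre_id, [t / count for t in total]
```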
Variants of K-Means that use the triangle inequality usually require extra information,
such as distance bounds and cluster assignments, from the previous iteration
to work efficiently. This is a challenging task to achieve on Hadoop for two reasons:
1) Hadoop does not directly support iterative algorithms; and 2) Hadoop does not
allow information to be exchanged between two consecutive iterations. Hence, two
techniques are proposed to give Hadoop the ability to pass information from one
iteration to the next. The first technique uses a data structure referred to as an
Extended Vector (EV), which appends the extra information to the original data
vector. The second technique stores the extra information in files, where each file
is referred to as a Bounds File (BF).
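A minimal sketch of what an Extended Vector record might carry is shown below; the field names and serialization format are illustrative assumptions, not the thesis's exact layout, but they convey how the previous iteration's state can travel with the data vector through a MapReduce job.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExtendedVector:
    """Sketch of an Extended Vector (EV): the original data vector plus the state
    a triangle-inequality K-Means variant needs from the previous iteration."""
    features: List[float]                 # the original data vector
    assignment: int = -1                  # cluster assigned in the previous iteration
    upper_bound: float = float("inf")     # upper bound on distance to the assigned centre
    lower_bounds: List[float] = field(default_factory=list)  # lower bounds to other centres

    def serialize(self) -> str:
        """Flatten to one text line so the record can flow through a MapReduce job."""
        parts = self.features + [self.assignment, self.upper_bound] + self.lower_bounds
        return ",".join(str(p) for p in parts)
```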
To evaluate the two proposed techniques, two K-Means variants are implemented
on Hadoop using each of them. Each variant is tested against a variable
number of clusters, dimensions, data points, and mappers. Furthermore, the
performance of various implementations of K-Means on Hadoop and Spark is investigated.
The results show a significant improvement in the efficiency of the
new implementations compared to Lloyd's K-Means on Hadoop on both real and
artificial datasets.
Fast Color Quantization Using Weighted Sort-Means Clustering
Color quantization is an important operation with numerous applications in
graphics and image processing. Most quantization methods are essentially based
on data clustering algorithms. However, despite its popularity as a general-purpose
clustering algorithm, k-means has not received much attention in the
color quantization literature because of its high computational requirements
and sensitivity to initialization. In this paper, a fast color quantization
method based on k-means is presented. The method involves several modifications
to the conventional (batch) k-means algorithm including data reduction, sample
weighting, and the use of triangle inequality to speed up the nearest neighbor
search. Experiments on a diverse set of images demonstrate that, with the
proposed modifications, k-means becomes very competitive with state-of-the-art
color quantization methods in terms of both effectiveness and efficiency.
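To make the sample-weighting modification concrete, the sketch below shows one weighted batch k-means step in NumPy; it assumes (as an illustration, not the paper's exact algorithm) that each unique colour carries a weight equal to its pixel count after the data-reduction pass.

```python
import numpy as np

def weighted_kmeans_step(colours, weights, centres):
    """One weighted (batch) k-means step: assign each unique colour to its
    nearest centre, then recompute centres as weight-averaged means."""
    # Assignment step: distance of every colour to every centre.
    dists = np.linalg.norm(colours[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: weighted mean of the colours assigned to each centre.
    new_centres = centres.copy()
    for k in range(len(centres)):
        mask = labels == k
        if mask.any():
            w = weights[mask][:, None]
            new_centres[k] = (w * colours[mask]).sum(axis=0) / w.sum()
    return labels, new_centres
```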
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
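The coarse-to-fine search behind this entropy scaling can be sketched as follows (a generic illustration of searching over covering spheres with a triangle-inequality bound, not the Ammolite/MICA/esFragBag code; all names are assumptions):

```python
def entropy_scaling_search(query, clusters, dist, radius, cluster_radius):
    """Range search over a pre-clustered dataset. Each cluster is a pair
    (representative, members); `cluster_radius` bounds the distance from any
    member to its representative."""
    hits = []
    for representative, members in clusters:
        # Triangle inequality: a member can only lie within `radius` of the query
        # if its representative lies within `radius + cluster_radius`.
        if dist(query, representative) <= radius + cluster_radius:
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits
```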
Faster K-Means Cluster Estimation
There has been considerable work on improving the popular clustering algorithm
K-means in terms of both mean squared error (MSE) and speed. However, most
k-means variants compute the distance from each data point to every
cluster centroid in every iteration. We propose a fast heuristic to overcome
this bottleneck with only marginal increase in MSE. We observe that across all
iterations of K-means, a data point changes its membership only among a small
subset of clusters. Our heuristic predicts such clusters for each data point by
looking at nearby clusters after the first iteration of k-means. We augment
well-known variants of k-means with our heuristic to demonstrate its
effectiveness. For various synthetic and real-world datasets, our heuristic
achieves a speed-up of up to 3 times when compared to efficient variants of
k-means.
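A minimal sketch of this candidate-set heuristic (NumPy, illustrative parameter names, not the authors' code): after the first full iteration, each point remembers its few nearest centroids, and later assignment steps only measure distances to those candidates.

```python
import numpy as np

def build_candidate_sets(points, centres, num_candidates):
    """After the first full iteration, keep each point's `num_candidates`
    nearest centres as the only clusters it may move between later."""
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return np.argsort(dists, axis=1)[:, :num_candidates]

def assign_restricted(points, centres, candidates):
    """Assignment step that only measures distances to each point's candidate centres."""
    labels = np.empty(len(points), dtype=int)
    for i, x in enumerate(points):
        cand = candidates[i]
        d = np.linalg.norm(centres[cand] - x, axis=1)
        labels[i] = cand[d.argmin()]
    return labels
```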
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
concern variety. A metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric spaces,
a collection of indexing techniques for metric data has been proposed.
However, existing surveys each offer only narrow coverage, and no
comprehensive empirical study of these techniques exists. We offer a survey of
all existing metric indexes that support exact similarity search by i)
summarizing the partitioning, pruning, and validation techniques
used by metric indexes, ii) providing time and storage complexity analyses
of index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
complexity analysis alone reveals few differences in similarity query
processing, and because query performance depends on pruning and validation
abilities that are tied to the data distribution. This article aims at revealing
the strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting, and at directing future research on metric indexes.
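The pruning common to many such indexes can be illustrated with a simple pivot filter (a generic sketch of triangle-inequality filtering for exact range search, not any specific index from the survey; all names are assumptions):

```python
def pivot_filter_search(query, radius, objects, pivot_dists, pivot, dist):
    """Exact range search with pivot-based pruning in a metric space.
    `pivot_dists[i]` holds the precomputed distance d(objects[i], pivot)."""
    d_qp = dist(query, pivot)
    results = []
    for obj, d_op in zip(objects, pivot_dists):
        # Lower bound from the triangle inequality: |d(q, p) - d(o, p)| <= d(q, o).
        # If the bound already exceeds the radius, the object cannot qualify.
        if abs(d_qp - d_op) > radius:
            continue
        if dist(query, obj) <= radius:
            results.append(obj)
    return results
```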