Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass
Clustering is an essential data mining tool for analyzing and grouping
similar objects. In big data applications, however, many clustering algorithms
are infeasible due to their high memory requirements and/or unfavorable runtime
complexity. In contrast, Contraction Clustering (RASTER) is a single-pass
algorithm for identifying density-based clusters with linear time complexity.
Due to its favorable runtime and the fact that its memory requirements are
constant, this algorithm is highly suitable for big data applications where the
amount of data to be processed is huge. It consists of two steps: (1) a
contraction step which projects objects onto tiles and (2) an agglomeration
step which groups tiles into clusters. This algorithm is extremely fast in both
sequential and parallel execution. Our quantitative evaluation shows that a
sequential implementation of RASTER performs significantly better than various
standard clustering algorithms. Furthermore, the parallel speedup is
significant: on a contemporary workstation, an implementation in Rust processes
a batch of 500 million points with 1 million clusters in less than 50 seconds
on one core. With 8 cores, the algorithm is about four times faster.
Comment: 19 pages; journal paper extending a previous conference publication
(cf. https://doi.org/10.1007/978-3-319-72926-8_6)
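
The two steps named in the abstract, contraction onto tiles and agglomeration of tiles into clusters, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `precision`, `threshold`, and `min_size` parameters, the restriction to 2D points, and the use of an 8-neighbourhood for adjacency are all assumptions made for illustration.

```python
from collections import Counter
import math


def contract(points, precision=1, threshold=2):
    """Contraction step (sketch): project 2D points onto grid tiles.

    `precision` (assumed parameter) sets the tile size to 10**-precision.
    `threshold` (assumed parameter) is the minimum number of points a
    tile must hold to be kept. The input is consumed in a single pass.
    """
    scale = 10 ** precision
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n >= threshold}


def agglomerate(tiles, min_size=1):
    """Agglomeration step (sketch): group adjacent tiles into clusters.

    Adjacency is assumed to mean the 8-neighbourhood on the tile grid.
    Clusters with fewer than `min_size` tiles (assumed parameter) are dropped.
    """
    remaining = set(tiles)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster = {seed}
        frontier = [seed]
        while frontier:
            tx, ty = frontier.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (tx + dx, ty + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        cluster.add(neighbour)
                        frontier.append(neighbour)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


# Usage: unit-sized tiles, at least two points per kept tile.
points = [(0.2, 0.3), (0.7, 0.9), (1.1, 0.5), (1.8, 0.2),
          (5.5, 5.5), (5.1, 5.9), (9.0, 9.0)]
tiles = contract(points, precision=0, threshold=2)
clusters = agglomerate(tiles)
```

In this sketch each point is touched once and each kept tile is visited once during agglomeration. Memory grows with the number of occupied tiles rather than with the number of points, so it stays bounded whenever the tile grid is bounded.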