2 research outputs found
FINEX: A Fast Index for Exact & Flexible Density-Based Clustering (Extended Version with Proofs)
Density-based clustering aims to find groups of similar objects (i.e.,
clusters) in a given dataset. Applications include, e.g., process mining and
anomaly detection. It comes with two user parameters (ε, MinPts) that
determine the clustering result, but are typically unknown in advance. Thus,
users need to interactively test various settings until satisfying clusterings
are found. However, existing solutions suffer from the following limitations:
(a) Ineffective pruning of expensive neighborhood computations. (b) Approximate
clustering, where objects are falsely labeled noise. (c) Restricted parameter
tuning that is limited to ε whereas MinPts is constant, which reduces
the explorable clusterings. (d) Inflexibility in terms of applicable data types
and distance functions. We propose FINEX, a linear-space index that overcomes
these limitations. Our index provides exact clusterings and can be queried with
either of the two parameters. FINEX avoids neighborhood computations where
possible and reduces the complexities of the remaining computations by
leveraging fundamental properties of density-based clusters. Hence, our
solution is efficient and flexible regarding data types and distance functions.
Moreover, FINEX respects the original and straightforward notion of
density-based clustering. In our experiments on 12 large real-world datasets
from various domains, FINEX frequently outperforms state-of-the-art techniques
for exact clustering by orders of magnitude.
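As a point of reference for the two parameters mentioned in the abstract above, the following is a minimal sketch of the original density-based clustering notion (DBSCAN-style), in which ε is the neighborhood radius and MinPts is the density threshold. It is not FINEX and does not use any index: it computes every neighborhood by brute force, which is exactly the cost an index would aim to avoid. The function name `dbscan`, the `NOISE` constant, and the sample points are assumptions made for illustration only.

```python
from collections import deque
from math import dist

NOISE = -1  # hypothetical label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE.

    Brute-force illustration of exact density-based clustering;
    every eps-neighborhood is computed explicitly (O(n^2) distances).
    """
    n = len(points)
    # eps-neighborhood of each point; a point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points in its eps-neighborhood.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Grow a new cluster outward from an unlabeled core point.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster_id
                    # Only core points expand the cluster further;
                    # border points are labeled but not expanded.
                    if is_core[q]:
                        queue.append(q)
        cluster_id += 1

    # Points never reached from any core point are noise.
    return [NOISE if lab is None else lab for lab in labels]


# Illustrative data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Rerunning the function with a different `eps` or `min_pts` recomputes all neighborhoods from scratch, which gives a sense of why interactive parameter exploration is expensive without an index.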
Fast Euclidean OPTICS with bounded precision in low dimensional space
OPTICS is a popular method for visualizing multidimensional clusters. All the existing implementations of this method have a time complexity of O(n²), where n is the size of the input dataset, and thus may not be suitable for datasets of large volumes. This paper alleviates the problem by resorting to approximation with guarantees. The main result is a new algorithm that runs in O(n log n) time under any fixed dimensionality, and computes a visualization that has provably small discrepancies from that of OPTICS. As a side product, our algorithm gives an index structure that occupies linear space, and supports the cluster group-by query with near-optimal cost. The quality of the cluster visualizations produced by our techniques and the efficiency of the proposed algorithms are demonstrated with an empirical evaluation on real data.