Clustering is a fundamental task in data science with wide-ranging
applications. In k-medoids clustering, cluster centers must be actual
datapoints and arbitrary distance metrics may be used; these features allow for
greater interpretability of the cluster centers and the clustering of exotic
objects in k-medoids clustering, respectively. k-medoids clustering has
recently grown in popularity due to the discovery of more efficient k-medoids
algorithms. In particular, recent research has proposed BanditPAM, a randomized
k-medoids algorithm with state-of-the-art complexity and clustering accuracy.
In this paper, we present BanditPAM++, which accelerates BanditPAM via two
algorithmic improvements, and is O(k) faster than BanditPAM in complexity and
substantially faster than BanditPAM in wall-clock runtime. First, we
demonstrate that BanditPAM has a special structure that allows the reuse of
clustering information within each iteration. Second, we demonstrate
that BanditPAM has additional structure that permits the reuse of information
across different iterations. These observations inspire our proposed
algorithm, BanditPAM++, which returns the same clustering solutions as
BanditPAM but often several times faster. For example, on the CIFAR10 dataset,
BanditPAM++ returns the same results as BanditPAM but runs over 10×
faster. Finally, we provide a high-performance C++ implementation of
BanditPAM++, callable from Python and R, that may be of interest to
practitioners at https://github.com/motiwari/BanditPAM. Auxiliary code to
reproduce all of our experiments via a one-line script is available at
https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.Comment: NeurIPS 202