
    Subquadratic High-Dimensional Hierarchical Clustering

    We consider the widely used average-linkage, single-linkage, and Ward's methods for computing hierarchical clusterings of high-dimensional Euclidean inputs. It is easy to show that there is no efficient implementation of these algorithms in high-dimensional Euclidean space, since it implicitly requires solving the closest pair problem, a notoriously difficult problem. However, how fast can these algorithms be implemented if we allow approximation? More precisely, these algorithms successively merge the clusters that are at closest average (for average-linkage), at minimum distance (for single-linkage), or inducing the least sum-of-squares error (for Ward's). We ask whether one could obtain a significant running-time improvement if the algorithm is allowed to merge γ-approximate closest clusters (namely, clusters whose distance — average, minimum, or sum-of-squares error — is at most γ times the distance of the closest clusters). We show that one can indeed take advantage of this relaxation and compute the approximate hierarchical clustering tree using Õ(n) γ-approximate nearest-neighbor queries. This leads to an algorithm running in time Õ(nd) + n^{1+O(1/γ)} for d-dimensional Euclidean space. We then provide experiments showing that these algorithms perform as well as the non-approximate versions for classic classification tasks while achieving a significant speed-up.
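    The γ-approximate merging rule can be illustrated with a toy sketch. The function below is a naive, hypothetical single-linkage variant (not the paper's algorithm, which uses approximate nearest-neighbor data structures to avoid the quadratic scans done here): at each step it merges any pair of clusters whose inter-cluster distance is within γ times the current closest-pair distance.

    ```python
    import numpy as np

    def approx_single_linkage(points, gamma=1.2):
        """Toy sketch of gamma-approximate single-linkage clustering.

        Instead of always merging the exact closest pair of clusters, any
        pair whose minimum inter-point distance is within gamma times the
        closest-pair distance may be merged. Naive O(n^2) scan per merge,
        for illustration only; the paper's speed-up comes from answering
        these queries with approximate nearest-neighbor structures.
        """
        clusters = [[i] for i in range(len(points))]
        merges = []
        while len(clusters) > 1:
            # Single-linkage distance for every pair of current clusters.
            best = None
            dists = {}
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(np.linalg.norm(points[i] - points[j])
                            for i in clusters[a] for j in clusters[b])
                    dists[(a, b)] = d
                    if best is None or d < best:
                        best = d
            # Merge the first pair within the gamma-approximation slack.
            a, b = next(p for p, d in dists.items() if d <= gamma * best)
            merges.append((clusters[a], clusters[b]))
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return merges
    ```

    With γ = 1 this reduces to exact single-linkage; larger γ widens the set of admissible merges, which is what permits faster query structures.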

    Objective-Based Hierarchical Clustering of Deep Embedding Vectors

    We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embeddings (ImageNet, ImageNetV2, NaBirds), word embeddings (Twitter, Wikipedia), and sentence embeddings (SST-2) from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to 4.5 million entries and embedding dimensions up to 2048. In order to address the challenge of scaling up hierarchical clustering to such large datasets, we propose a new practical hierarchical clustering algorithm, B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm, B2SAT&C, which achieves a 0.74-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial 2/3-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation, of ≈ 2/3 + 0.0004, was due to Charikar et al. (SODA'19).
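    For concreteness, the Moseley-Wang revenue that these methods optimize can be evaluated directly: each pair of points (i, j) with similarity w_ij contributes w_ij · (n − |leaves at their least common ancestor|). The sketch below scores a hierarchical clustering represented as nested tuples; the helper names and tree encoding are illustrative assumptions, not from the paper.

    ```python
    def leaves(tree):
        """Collect the leaf labels of a nested-tuple binary tree."""
        if isinstance(tree, tuple):
            return leaves(tree[0]) | leaves(tree[1])
        return {tree}

    def moseley_wang(tree, w, n):
        """Moseley-Wang revenue of a binary hierarchical clustering tree.

        w maps frozenset({i, j}) -> similarity weight; n is the number of
        leaves. A pair (i, j) is split at its LCA, whose subtree size
        determines the pair's contribution w_ij * (n - subtree size).
        Illustrative sketch; tree encoding is a hypothetical convention.
        """
        total = 0.0
        def walk(t):
            nonlocal total
            if not isinstance(t, tuple):
                return {t}
            left, right = walk(t[0]), walk(t[1])
            size = len(left) + len(right)
            # Every left-right pair has this node as its LCA.
            for i in left:
                for j in right:
                    total += w.get(frozenset((i, j)), 0.0) * (n - size)
            return left | right
        walk(tree)
        return total
    ```

    A tree that separates similar pairs low in the hierarchy (small LCA subtrees) scores higher, which is the intuition behind the MW objective.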