Many fields are experiencing a Big Data explosion, with data collection rates
outpacing the rate of computing performance improvements predicted by Moore's
Law.
Researchers are often interested in similarity search on such data.
We present CAKES (CLAM-Accelerated K-NN Entropy Scaling Search), a novel
algorithm for k-nearest-neighbor (k-NN) search that leverages geometric
and topological properties inherent in large datasets.
CAKES assumes the manifold hypothesis and performs best when data occupy a
low-dimensional manifold, even if the embedding space is very
high-dimensional.
We demonstrate speedups ranging from hundreds to tens of thousands of times
over state-of-the-art approaches such as FAISS and HNSW when benchmarked on
5 standard datasets.
Unlike locality-sensitive hashing approaches, CAKES can work with any
user-defined distance function.
When data occupy a metric space, CAKES exhibits perfect recall.

Comment: As submitted to IEEE Big Data 202