Binary Adaptive Embeddings from Order Statistics of Random Projections
We use some of the largest order statistics of the random projections of a
reference signal to construct a binary embedding that is adapted to signals
correlated with that reference. The embedding is characterized analytically
and shown to provide improved performance on tasks such as classification in
a reduced-dimensionality space.
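A minimal NumPy sketch of the construction as the abstract describes it may help; selecting by projection magnitude, the dimensions, and all names here are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 128, 512, 64           # ambient dim, projections drawn, bits retained

A = rng.standard_normal((m, d))  # random projection matrix
x_ref = rng.standard_normal(d)   # reference signal the embedding adapts to

# keep the k projections with the largest magnitude for the reference signal
idx = np.argsort(np.abs(A @ x_ref))[-k:]

def embed(y):
    # adapted binary embedding: sign pattern of the retained projections
    return (A[idx] @ y >= 0).astype(np.uint8)

# a signal correlated with x_ref agrees with its embedding on most bits
y = 0.9 * x_ref + 0.1 * rng.standard_normal(d)
print(np.mean(embed(y) == embed(x_ref)))   # close to 1.0
```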
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with
HPC), a similarity search system for ultra-high-dimensional datasets on a
single machine that does not require similarity computations and is tailored
for high-performance computing platforms. By leveraging an LSH-style
randomized indexing procedure and combining it with several principled
techniques, such as reservoir sampling, recent advances in one-pass minwise
hashing, and count-based estimation, we reduce the computational and
parallelization costs of similarity search while retaining sound theoretical
guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URLs, click-through prediction, and social
networks. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail at the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset using brute-force pairwise computation would require at least 20
teraflops. We provide CPU and GPU implementations of FLASH for replicability
of our results.
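The abstract's ingredients (LSH-style indexing, reservoir-sampled buckets, count-based estimation) can be sketched in a few lines. This is a hedged toy version assuming plain MinHash over sparse binary data; FLASH itself uses densified one-pass minwise hashing and HPC-specific engineering not shown here.

```python
import random
from collections import defaultdict, Counter

L, R = 32, 64                      # number of tables, reservoir size per bucket
random.seed(0)
SEEDS = [random.getrandbits(32) for _ in range(L)]

def minhash(nonzeros, seed):
    # MinHash of a set of nonzero feature ids under one random permutation
    return min(hash((seed, j)) for j in nonzeros)

tables = [defaultdict(list) for _ in range(L)]
seen = [defaultdict(int) for _ in range(L)]   # arrivals per bucket, for sampling

def insert(item_id, nonzeros):
    for t in range(L):
        key = minhash(nonzeros, SEEDS[t])
        seen[t][key] += 1
        bucket = tables[t][key]
        if len(bucket) < R:
            bucket.append(item_id)
        else:
            # reservoir sampling: keep a uniform sample of all bucket arrivals
            j = random.randrange(seen[t][key])
            if j < R:
                bucket[j] = item_id

def query(nonzeros, k=10):
    # rank candidates by collision counts across tables: no distance computations
    votes = Counter()
    for t in range(L):
        votes.update(tables[t].get(minhash(nonzeros, SEEDS[t]), ()))
    return [item for item, _ in votes.most_common(k)]

insert(0, {1, 5, 9}); insert(1, {1, 5, 8}); insert(2, {40, 41})
print(query({1, 5, 9}, k=2))       # item 0 first; item 1 collides about half the time
```

Ranking by collision counts is what lets a query avoid explicit similarity computations entirely.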
A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms"
Data valuation is a growing research field that studies the influence of
individual data points for machine learning (ML) models. Data Shapley, inspired
by cooperative game theory and economics, is an effective method for data
valuation. However, it is well-known that the Shapley value (SV) can be
computationally expensive. Fortunately, Jia et al. (2019) showed that for
K-Nearest Neighbors (KNN) models, the computation of Data Shapley is
surprisingly simple and efficient.
In this note, we revisit the work of Jia et al. (2019) and propose a more
natural and interpretable utility function that better reflects the performance
of KNN models. We derive the corresponding calculation procedure for the Data
Shapley of KNN classifiers/regressors with the new utility functions. Our new
approach, dubbed soft-label KNN-SV, achieves the same time complexity as the
original method. We further provide an efficient approximation algorithm for
soft-label KNN-SV based on locality sensitive hashing (LSH). Our experimental
results demonstrate that soft-label KNN-SV outperforms the original method on
most datasets in the task of mislabeled data detection, making it a better
baseline for future work on data valuation.
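For context, here is a hedged sketch of the exact KNN-Shapley recursion of Jia et al. (2019) for a single test point, using the original hard-label utility; the note's soft-label variant swaps in a different utility while keeping the same sort-then-scan structure. Names and the tiny demo are illustrative.

```python
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, K):
    """Exact Data Shapley of each training point for one test point (hard-label KNN)."""
    N = len(y_train)
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))  # nearest first
    match = (y_train[order] == y_test).astype(float)              # 1[label agrees]

    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N                       # value of the farthest point
    for i in range(N - 2, -1, -1):                    # recurse from far to near
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)

    values = np.empty(N)
    values[order] = s                                 # map back to original indexing
    return values

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
print(knn_shapley(X, y, np.array([0.1]), y_test=0, K=2))  # [0.5, 0.5, 0.0, 0.0]
```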
The role of local dimensionality measures in benchmarking nearest neighbor search
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion make it possible to choose query sets spanning a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running time of implementations is studied empirically. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
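As a rough illustration, the two difficulty measures can be estimated per query as below; the MLE form of the LID estimator and the choices of k and metric are our assumptions about standard practice, not code from the paper.

```python
import numpy as np

def lid_mle(knn_dists):
    """MLE of local intrinsic dimensionality from distances to the k nearest neighbors."""
    r = np.sort(knn_dists)
    return -len(r) / np.sum(np.log(r[:-1] / r[-1]))  # log(r_k / r_k) = 0 is dropped

def relative_contrast(query, data):
    """Mean distance to the dataset divided by the distance to the nearest neighbor."""
    d = np.linalg.norm(data - query, axis=1)
    return d.mean() / d.min()

rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 8))
q = rng.standard_normal(8)
knn = np.sort(np.linalg.norm(data - q, axis=1))[:100]
print(lid_mle(knn), relative_contrast(q, data))  # LID near 8; low RC means a hard query
```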
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding the data
items whose distances to a query item are smallest in a large database.
Various methods have been developed to address this problem, and recently much
effort has been devoted to approximate search. In this paper, we present a
survey of one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality sensitive hashing. We divide the hashing
algorithms into two main categories: locality sensitive hashing, which designs
hash functions without exploring the data distribution, and learning to hash,
which learns hash functions according to the data distribution. We review them
from various aspects, including hash function design, distance measures, and
search schemes in the hash-coding space.
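A textbook instance of the first category (data-independent hash functions) is random-hyperplane hashing for cosine similarity; this sketch is illustrative, with arbitrary parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 16
H = rng.standard_normal((n_bits, d))   # random hyperplanes, drawn blind to the data

def hash_code(v):
    # sign pattern of v against the hyperplanes: a short binary code
    return tuple((H @ v >= 0).astype(np.uint8))

# per bit, the collision probability is 1 - theta/pi for angle theta between
# vectors, so similar items share far more code bits than dissimilar ones
u = rng.standard_normal(d)
v = u + 0.1 * rng.standard_normal(d)       # a near-duplicate of u
w = rng.standard_normal(d)                 # an unrelated vector
print(sum(a == b for a, b in zip(hash_code(u), hash_code(v))))  # high agreement
print(sum(a == b for a, b in zip(hash_code(u), hash_code(w))))  # about n_bits / 2
```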
AdANNS: A Framework for Adaptive Semantic Search
Web-scale search systems learn an encoder to embed a given query which is
then hooked into an approximate nearest neighbor search (ANNS) pipeline to
retrieve similar data points. To accurately capture tail queries and data
points, learned representations typically are rigid, high-dimensional vectors
that are generally used as-is in the entire ANNS pipeline and can lead to
computationally expensive retrieval. In this paper, we argue that instead of
rigid representations, different stages of ANNS can leverage adaptive
representations of varying capacities to achieve significantly better
accuracy-compute trade-offs, i.e., stages of ANNS that can get away with more
approximate computation should use a lower-capacity representation of the same
data point. To this end, we introduce AdANNS, a novel ANNS design framework
that explicitly leverages the flexibility of Matryoshka Representations. We
demonstrate state-of-the-art accuracy-compute trade-offs using novel
AdANNS-based key ANNS building blocks like search data structures (AdANNS-IVF)
and quantization (AdANNS-OPQ). For example, on ImageNet retrieval, AdANNS-IVF is
up to 1.5% more accurate than the rigid representations-based IVF at the same
compute budget; and matches accuracy while being up to 90x faster in wall-clock
time. For Natural Questions, 32-byte AdANNS-OPQ matches the accuracy of the
64-byte OPQ baseline constructed using rigid representations -- same accuracy
at half the cost! We further show that the gains from AdANNS translate to
modern-day composite ANNS indices that combine search structures and
quantization. Finally, we demonstrate that AdANNS can enable inference-time
adaptivity for compute-aware search on ANNS indices built non-adaptively on
matryoshka representations. Code is open-sourced at
https://github.com/RAIVNLab/AdANNS.
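To illustrate the core idea, here is a hedged two-stage sketch in which the coarse stage reads only a low-dimensional prefix of each vector and only the final rerank touches the full vector; random unit vectors stand in for matryoshka representations, and the prefix length and shortlist size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_coarse, shortlist = 10_000, 256, 32, 100
db = rng.standard_normal((n, d))
db /= np.linalg.norm(db, axis=1, keepdims=True)   # stand-in for MR embeddings

def adaptive_search(q, k=10):
    # stage 1: cheap candidate generation on the first d_coarse dimensions
    coarse_scores = db[:, :d_coarse] @ q[:d_coarse]
    cand = np.argpartition(-coarse_scores, shortlist)[:shortlist]
    # stage 2: exact rerank of the shortlist with the full-capacity vectors
    exact = db[cand] @ q
    return cand[np.argsort(-exact)[:k]]

q = db[0] + 0.05 * rng.standard_normal(d)
print(adaptive_search(q / np.linalg.norm(q)))     # index 0 should rank first
```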