Faster Cover Trees
Abstract: The cover tree data structure speeds up exact nearest-neighbor queries over arbitrary metric spaces. On standard benchmark datasets, we reduce the number of distance computations by 10-50%. On a large-scale bioinformatics dataset, we reduce the number of distance computations by 71%. On a large-scale image dataset, our parallel algorithm with 16 cores reduces tree construction time from 3.5 hours to 12 minutes.
CAPS: A Practical Partition Index for Filtered Similarity Search
With the surging popularity of approximate near-neighbor search (ANNS),
driven by advances in neural representation learning, the ability to serve
queries accompanied by a set of constraints has become an area of intense
interest. While the community has recently proposed several algorithms for
constrained ANNS, almost all of these methods focus on integration with
graph-based indexes, the predominant class of algorithms achieving
state-of-the-art performance in latency-recall tradeoffs. In this work, we take
a different approach and focus on developing a constrained ANNS algorithm via
space partitioning as opposed to graphs. To that end, we introduce Constrained
Approximate Partitioned Search (CAPS), an index for ANNS with filters via space
partitions that not only retains the benefits of a partition-based algorithm
but also outperforms state-of-the-art graph-based constrained search techniques
in recall-latency tradeoffs, with only 10% of the index size. Comment: 14 pages.
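As a rough illustration of partition-based filtered search (a simplified stand-in for CAPS, not its actual construction; the centroid choice and the `nprobe` parameter are assumptions), one can bucket vectors by nearest centroid and, at query time, scan only the closest buckets for points passing the filter:

```python
import math

def build_index(points, labels, centroids):
    """Assign each (vector, label) pair to the partition of its nearest centroid."""
    parts = [[] for _ in centroids]
    for p, lab in zip(points, labels):
        j = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        parts[j].append((p, lab))
    return parts

def filtered_search(query, want, parts, centroids, nprobe=2):
    """Scan the nprobe partitions closest to the query, keeping only label matches."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    cands = [(math.dist(query, p), p)
             for j in probe for p, lab in parts[j] if lab == want]
    return min(cands, default=None)       # (distance, point) or None
```

Because the filter is applied inside each partition scan, no separate post-filtering pass over unconstrained neighbors is needed.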
CoverBLIP: accelerated and scalable iterative matched-filtering for Magnetic Resonance Fingerprint reconstruction
Current popular methods for Magnetic Resonance Fingerprint (MRF) recovery are
bottlenecked by the heavy computations of a matched-filtering step due to the
growing size and complexity of the fingerprint dictionaries in multi-parametric
quantitative MRI applications. We address this shortcoming by arranging
dictionary atoms in the form of cover tree structures and adopting the
corresponding fast approximate nearest-neighbour searches to accelerate
matched-filtering. For datasets lying on smooth low-dimensional manifolds,
cover trees offer search complexity logarithmic in the data population.
With this motivation we propose an iterative reconstruction algorithm, named
CoverBLIP, to address large-size MRF problems where the fingerprint dictionary,
i.e., the discrete manifold of Bloch responses, encodes several intrinsic NMR
parameters. We study different forms of convergence for this algorithm and we
show that, provided with a notion of embedding, the inexact and non-convex
iterations of CoverBLIP linearly converge toward a near-global solution with
the same order of accuracy as using exact brute-force searches. Our further
examinations on both synthetic and real-world datasets, using different
sampling strategies, indicate a reduction of two to three orders of magnitude
in total search computations. Cover trees are robust against the
curse of dimensionality, and therefore CoverBLIP provides a notion of
scalability -- a consistent gain in time-accuracy performance -- for searching
high-dimensional atoms which may not be easily preprocessed (i.e., for
dimensionality reduction) due to the increasing degrees of non-linearity
appearing in the emerging multi-parametric MRF dictionaries.
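The matched-filtering step being accelerated amounts to, for each measured signal, picking the dictionary atom with the largest normalized correlation; CoverBLIP replaces the inner brute-force maximization with a cover-tree search. A minimal brute-force version of that inner step (toy real-valued signals; names are hypothetical):

```python
import math

def matched_filter(signal, atoms):
    """Return the index of the dictionary atom maximizing the normalized inner product."""
    def score(atom):
        dot = sum(x * y for x, y in zip(signal, atom))
        norm = math.sqrt(sum(x * x for x in atom))
        return dot / norm
    return max(range(len(atoms)), key=lambda i: score(atoms[i]))
```

In MRF the atoms are simulated Bloch responses and this maximization runs once per voxel per iteration, which is why its cost dominates the reconstruction.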
Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach
Finding joinable tables in data lakes is a key procedure in many applications
such as data integration, data augmentation, data analysis, and data markets.
Traditional approaches that find equi-joinable tables are unable to deal with
misspellings and different formats, nor do they capture any semantic joins. In
this paper, we propose PEXESO, a framework for joinable table discovery in data
lakes. We embed textual values as high-dimensional vectors and join columns
under similarity predicates on these vectors, thereby addressing the
limitations of equi-join approaches and identifying more meaningful results. To
efficiently find joinable tables with similarity, we propose a block-and-verify
method that utilizes pivot-based filtering. A partitioning technique is
developed to cope with the case when the data lake is large and the index
cannot fit in main memory. An experimental evaluation on real datasets shows
that our solution identifies substantially more tables than equi-joins and
outperforms other similarity-based options, and the join results are useful in
data enrichment for machine learning tasks. The experiments also demonstrate
the efficiency of the proposed method. Comment: Full version of paper in ICDE 202
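The similarity-join predicate can be pictured as follows: a column joins another if enough of its embedded values have a close match on the other side. The sketch below is illustrative only (the threshold names `tau` and `t` are assumptions, and PEXESO additionally uses pivot-based filtering to skip most of these exact distance computations):

```python
import math

def joinability(col_a, col_b, tau):
    """Fraction of vectors in col_a having a neighbor within distance tau in col_b."""
    hits = sum(1 for a in col_a if any(math.dist(a, b) <= tau for b in col_b))
    return hits / len(col_a)

def is_joinable(col_a, col_b, tau, t):
    """Declare the two columns joinable if the matched fraction reaches threshold t."""
    return joinability(col_a, col_b, tau) >= t
```

The naive version above is quadratic in column size per pair; the paper's block-and-verify method exists precisely to avoid that scan over a whole data lake.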
Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining
Dual encoder models are ubiquitous in modern classification and retrieval.
Crucial for training such dual encoders is an accurate estimation of gradients
from the partition function of the softmax over the large output space; this
requires finding negative targets that contribute most significantly ("hard
negatives"). Since dual encoder model parameters change during training, the
use of traditional static nearest neighbor indexes can be sub-optimal. These
static indexes (1) periodically require expensive re-building of the index,
which in turn requires (2) expensive re-encoding of all targets using updated
model parameters. This paper addresses both of these challenges. First, we
introduce an algorithm that uses a tree structure to approximate the softmax
with provable bounds and that dynamically maintains the tree. Second, we
approximate the effect of a gradient update on target encodings with an
efficient Nyström low-rank approximation. In our empirical study on datasets
with over twenty million targets, our approach cuts error by half in relation
to oracle brute-force negative mining. Furthermore, our method surpasses prior
state-of-the-art while using 150x less accelerator memory. Comment: To appear at AISTATS 202
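The second ingredient, a Nyström low-rank approximation, reconstructs a large similarity matrix from a few landmark columns. The sketch below shows the generic identity K ≈ C W⁺ Cᵀ, not the paper's specific update rule for target encodings:

```python
import numpy as np

def nystrom_approx(K, landmarks):
    """Approximate a PSD matrix K from its columns at the given landmark indices."""
    C = K[:, landmarks]                   # n x m: sampled columns
    W = C[landmarks, :]                   # m x m: landmark-landmark block
    return C @ np.linalg.pinv(W) @ C.T    # rank-m reconstruction of K
```

When K truly has rank m and the landmark block is non-singular, the reconstruction is exact; in general it degrades gracefully, which is what makes it usable as a cheap proxy for re-encoding all targets after a gradient step.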
Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees
Gaussian processes are frequently deployed as part of larger machine learning
and decision-making systems, for instance in geospatial modeling, Bayesian
optimization, or in latent Gaussian models. Within a system, the Gaussian
process model needs to perform in a stable and reliable manner to ensure it
interacts correctly with other parts of the system. In this work, we study the
numerical stability of scalable sparse approximations based on inducing points.
To do so, we first review numerical stability, and illustrate typical
situations in which Gaussian process models can be unstable. Building on
stability theory originally developed in the interpolation literature, we
derive sufficient and in certain cases necessary conditions on the inducing
points for the computations performed to be numerically stable. For
low-dimensional tasks such as geospatial modeling, we propose an automated
method for computing inducing points satisfying these conditions. This is done
via a modification of the cover tree data structure, which is of independent
interest. We additionally propose an alternative sparse approximation for
regression with a Gaussian likelihood which trades off a small amount of
performance to further improve stability. We provide illustrative examples
showing the relationship between stability of calculations and predictive
performance of inducing point methods on spatial tasks.