17 research outputs found

    Faster Cover Trees

    The cover tree data structure speeds up exact nearest neighbor queries over arbitrary metric spaces. On standard benchmark datasets, we reduce the number of distance computations by 10-50%. On a large-scale bioinformatics dataset, we reduce the number of distance computations by 71%. On a large-scale image dataset, our parallel algorithm with 16 cores reduces tree construction time from 3.5 hours to 12 minutes.

    CAPS: A Practical Partition Index for Filtered Similarity Search

    With the surging popularity of approximate near-neighbor search (ANNS), driven by advances in neural representation learning, the ability to serve queries accompanied by a set of constraints has become an area of intense interest. While the community has recently proposed several algorithms for constrained ANNS, almost all of these methods focus on integration with graph-based indexes, the predominant class of algorithms achieving state-of-the-art performance in latency-recall tradeoffs. In this work, we take a different approach and focus on developing a constrained ANNS algorithm via space partitioning as opposed to graphs. To that end, we introduce Constrained Approximate Partitioned Search (CAPS), an index for ANNS with filters via space partitions that not only retains the benefits of a partition-based algorithm but also outperforms state-of-the-art graph-based constrained search techniques in recall-latency tradeoffs, with only 10% of the index size. Comment: 14 page

    CoverBLIP: accelerated and scalable iterative matched-filtering for Magnetic Resonance Fingerprint reconstruction

    Current popular methods for Magnetic Resonance Fingerprint (MRF) recovery are bottlenecked by the heavy computations of a matched-filtering step due to the growing size and complexity of the fingerprint dictionaries in multi-parametric quantitative MRI applications. We address this shortcoming by arranging dictionary atoms in the form of cover tree structures and adopting the corresponding fast approximate nearest neighbour searches to accelerate matched-filtering. For datasets belonging to smooth low-dimensional manifolds, cover trees offer search complexities logarithmic in terms of data population. With this motivation we propose an iterative reconstruction algorithm, named CoverBLIP, to address large-size MRF problems where the fingerprint dictionary, i.e. the discrete manifold of Bloch responses, encodes several intrinsic NMR parameters. We study different forms of convergence for this algorithm and we show that, provided with a notion of embedding, the inexact and non-convex iterations of CoverBLIP converge linearly toward a near-global solution with the same order of accuracy as using exact brute-force searches. Our further examinations on both synthetic and real-world datasets, using different sampling strategies, indicate a reduction of 2 to 3 orders of magnitude in total search computations. Cover trees are robust against the curse of dimensionality, and therefore CoverBLIP provides a notion of scalability (a consistent gain in time-accuracy performance) for searching high-dimensional atoms which may not be easily preprocessed (i.e. for dimensionality reduction) due to the increasing degrees of non-linearities appearing in the emerging multi-parametric MRF dictionaries.
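
    As a rough illustration of the matched-filtering step that this abstract describes as the bottleneck, the sketch below shows the brute-force baseline: every dictionary atom is normalized and correlated with the measured signal, and the best-scoring atom wins. This is not the CoverBLIP algorithm. The function names and the real-valued toy dictionary are assumptions made for illustration (MRF fingerprints are generally complex-valued), and a cover tree would replace the linear scan with an approximate nearest neighbour query.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def matched_filter(signal, dictionary):
    """Brute-force matched filtering (hypothetical helper).

    Returns the index of the atom whose normalized form has the largest
    absolute inner product with the signal, together with that score.
    The cost is one inner product per atom, which is the linear scan that
    tree-based searches aim to avoid.
    """
    best_index, best_score = -1, -math.inf
    for index, atom in enumerate(dictionary):
        unit_atom = normalize(atom)
        score = abs(sum(s * a for s, a in zip(signal, unit_atom)))
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# Toy real-valued dictionary with three atoms (illustrative data only).
toy_dictionary = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
toy_signal = [0.9, 1.1]
best_atom, best_correlation = matched_filter(toy_signal, toy_dictionary)
```

    For unit-norm atoms, maximizing the correlation is equivalent to finding the nearest atom to the normalized signal in Euclidean distance, which is why a nearest neighbour structure can stand in for the scan.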

    Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

    Finding joinable tables in data lakes is a key procedure in many applications such as data integration, data augmentation, data analysis, and data markets. Traditional approaches that find equi-joinable tables are unable to deal with misspellings and different formats, nor do they capture any semantic joins. In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. We embed textual values as high-dimensional vectors and join columns under similarity predicates on high-dimensional vectors, thereby addressing the limitations of equi-join approaches and identifying more meaningful results. To efficiently find joinable tables with similarity, we propose a block-and-verify method that utilizes pivot-based filtering. A partitioning technique is developed to cope with the case when the data lake is large and the index cannot fit in main memory. An experimental evaluation on real datasets shows that our solution identifies substantially more tables than equi-joins and outperforms other similarity-based options, and the join results are useful in data enrichment for machine learning tasks. The experiments also demonstrate the efficiency of the proposed method. Comment: Full version of paper in ICDE 202
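
    The general idea behind pivot-based filtering, which the abstract above mentions, can be sketched as follows. This is a generic illustration of the technique under assumed names and toy data, not PEXESO's block-and-verify implementation. Distances from every vector to a few pivots are precomputed. By the triangle inequality, |d(q, p) - d(x, p)| <= d(q, x), so any vector whose pivot distance differs from the query's by more than the threshold can be skipped without computing d(q, x).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_pivot_table(vectors, pivots):
    """Precompute the distance from every vector to every pivot."""
    return [[euclidean(v, p) for p in pivots] for v in vectors]


def range_search(vectors, pivots, table, query, tau):
    """Return (indices within distance tau of query, number of verifications).

    A vector is pruned when some pivot gives a lower bound on its distance
    to the query that already exceeds tau; only survivors are verified with
    an exact distance computation.
    """
    query_to_pivots = [euclidean(query, p) for p in pivots]
    matches, verified = [], 0
    for index, row in enumerate(table):
        if any(abs(qp - vp) > tau for qp, vp in zip(query_to_pivots, row)):
            continue  # lower bound exceeds tau, so this vector cannot match
        verified += 1
        if euclidean(query, vectors[index]) <= tau:
            matches.append(index)
    return matches, verified


# Toy data (illustrative only).
toy_vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
toy_pivots = [(0.0, 0.0), (10.0, 10.0)]
toy_table = build_pivot_table(toy_vectors, toy_pivots)
found, checked = range_search(toy_vectors, toy_pivots, toy_table, (0.5, 0.0), 1.0)
```

    The filter never discards a true match, because the pruning condition is only a lower bound; its benefit depends on how well the chosen pivots separate the data.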

    Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining

    Dual encoder models are ubiquitous in modern classification and retrieval. Crucial for training such dual encoders is an accurate estimation of gradients from the partition function of the softmax over the large output space; this requires finding negative targets that contribute most significantly ("hard negatives"). Since dual encoder model parameters change during training, the use of traditional static nearest neighbor indexes can be sub-optimal. These static indexes (1) periodically require expensive re-building of the index, which in turn requires (2) expensive re-encoding of all targets using updated model parameters. This paper addresses both of these challenges. First, we introduce an algorithm that uses a tree structure to approximate the softmax with provable bounds and that dynamically maintains the tree. Second, we approximate the effect of a gradient update on target encodings with an efficient Nyström low-rank approximation. In our empirical study on datasets with over twenty million targets, our approach cuts error by half in relation to oracle brute-force negative mining. Furthermore, our method surpasses prior state-of-the-art while using 150x less accelerator memory. Comment: To appear at AISTATS 202

    Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

    Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
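
    To make the notion of minimum separation concrete, the sketch below selects a subset of points in which every pair is at least a given distance apart. It is a simple greedy pass with quadratic cost, not the cover-tree-based method the abstract refers to. The function names, the separation value, and the toy inputs are all assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_separated(points, min_separation):
    """Greedily keep points that are at least min_separation from all kept points.

    Every pair in the returned list is separated by at least min_separation,
    and every discarded point lies closer than min_separation to a kept one.
    """
    chosen = []
    for point in points:
        if all(distance(point, kept) >= min_separation for kept in chosen):
            chosen.append(point)
    return chosen


# Toy one-dimensional inputs (illustrative only).
toy_inputs = [(0.0,), (0.05,), (0.2,), (0.25,), (1.0,)]
inducing_candidates = select_separated(toy_inputs, 0.1)
```

    Keeping candidate inducing points apart avoids near-duplicate rows in the kernel matrix, which is one typical source of ill-conditioning.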