3,590 research outputs found

    Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing

    Get PDF
    We prove a tight lower bound for the exponent ρ\rho for data-dependent Locality-Sensitive Hashing schemes, recently used to design efficient solutions for the cc-approximate nearest neighbor search. In particular, our lower bound matches the bound of ρ12c1+o(1)\rho\le \frac{1}{2c-1}+o(1) for the 1\ell_1 space, obtained via the recent algorithm from [Andoni-Razenshteyn, STOC'15]. In recent years it emerged that data-dependent hashing is strictly superior to the classical Locality-Sensitive Hashing, when the hash function is data-independent. In the latter setting, the best exponent has been already known: for the 1\ell_1 space, the tight bound is ρ=1/c\rho=1/c, with the upper bound from [Indyk-Motwani, STOC'98] and the matching lower bound from [O'Donnell-Wu-Zhou, ITCS'11]. We prove that, even if the hashing is data-dependent, it must hold that ρ12c1o(1)\rho\ge \frac{1}{2c-1}-o(1). To prove the result, we need to formalize the exact notion of data-dependent hashing that also captures the complexity of the hash functions (in addition to their collision properties). Without restricting such complexity, we would allow for obviously infeasible solutions such as the Voronoi diagram of a dataset. To preclude such solutions, we require our hash functions to be succinct. This condition is satisfied by all the known algorithmic results.Comment: 16 pages, no figure

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Full text link
    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection is available online

    Entropy-scaling search of massive biological data

    Get PDF
    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
    corecore