699 research outputs found

    Indexing Metric Spaces for Exact Similarity Search

    With the continued digitalization of societal processes, we are seeing an explosion in available data, commonly referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while far fewer concern variety. Metric spaces are well suited to addressing variety because they can accommodate any type of data as long as the associated distance notion satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search by i) summarizing all the existing partitioning, pruning, and validation techniques used for metric indexes, ii) providing a time and storage complexity analysis of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparisons are used to evaluate index performance during search because differences are hard to see in a complexity analysis of similarity query processing, and because query performance depends on pruning and validation abilities that are tied to the data distribution. This article aims to reveal the strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting and to direct future research on metric indexes.
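
The role of the triangle inequality in pruning can be illustrated with a minimal pivot-table sketch. This is not taken from the survey: the class name `PivotTable`, the Euclidean distance, and the choice of pivots are assumptions made purely for illustration. For any pivot p, |d(q, p) - d(o, p)| is a lower bound on d(q, o), so an object whose bound exceeds the query radius can be discarded without computing its real distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance, used here only as an example metric."""
    return math.dist(a, b)


class PivotTable:
    """Hypothetical pivot table: stores the distance of every object to each pivot."""

    def __init__(self, data, pivots, dist=euclidean):
        self.data = list(data)
        self.pivots = list(pivots)
        self.dist = dist
        # Precomputed at build time: one row of pivot distances per object.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.data]

    def range_query(self, q, r):
        """Return (objects within distance r of q, number of verified objects)."""
        q_to_p = [self.dist(q, p) for p in self.pivots]
        results, verified = [], 0
        for o, row in zip(self.data, self.table):
            # Triangle inequality: |d(q,p) - d(o,p)| <= d(q,o) for every pivot p.
            lower = max((abs(dq - do) for dq, do in zip(q_to_p, row)), default=0.0)
            if lower > r:
                continue  # pruned without computing d(q, o)
            verified += 1
            if self.dist(q, o) <= r:
                results.append(o)
        return results, verified


data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
index = PivotTable(data, pivots=[(0.0, 0.0)])
hits, verified = index.range_query((0.5, 0.0), 1.0)
print(hits, verified)
```

In this toy run the two distant points are pruned by the pivot filter alone, so only two of the four objects need a real distance computation.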

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
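
For orientation, the exact brute-force baseline that the abstract describes as slow can be sketched as follows. The Jaccard similarity over term sets is only a stand-in assumption; the paper's own similarity function is not given in the abstract, and the function names here are hypothetical.

```python
import heapq


def jaccard(a, b):
    """Placeholder similarity over term sets (an assumption, not the paper's function)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def brute_force_knn(query, records, k, similarity=jaccard):
    """Score every record against the query and keep the k most similar.

    Cost is one similarity evaluation per record, which is what makes the
    exact search slow on large collections.
    """
    scored = ((similarity(query, rec), idx) for idx, rec in enumerate(records))
    best = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in best]


records = [
    ["python", "list", "sort"],
    ["java", "array", "sort"],
    ["python", "dict"],
]
print(brute_force_knn(["python", "sort"], records, k=2))
```

An approximate method would replace the exhaustive scan with an index that examines only a fraction of the records, trading a small loss in recall for speed.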

    Using parallel pivot vs. clustering-based techniques for web engines

    Web engines are a useful tool for searching for information on the Web. However, a large part of this information is non-textual, and in that case a metric space is used. A metric space is a set on which a notion of distance (called a metric) between elements is defined. In this paper we present an efficient parallelization of a pivot-based method devised for this purpose, the Sparse Spatial Selection (SSS) strategy, and we compare it with a clustering-based method, a parallel implementation of the Spatial Approximation Tree (SAT). We show that SAT compares favourably against the pivot-based SSS data structure. The experimental results, obtained on a high-performance cluster using several metric spaces, demonstrate load-balanced parallel strategies for the SAT. The implementations are built upon the BSP parallel computing model, which shows efficient performance for this application domain and allows a precise evaluation of algorithms. VIII Workshop de Procesamiento Distribuido y Paralelo. Red de Universidades con Carreras en Informática (RedUNCI)


    A Learned Index for Exact Similarity Search in Metric Spaces

    Indexing is an effective way to support efficient query processing in large databases. Recently, the concept of a learned index has been actively explored as a way to replace or supplement traditional index structures with machine learning models in order to reduce storage and search costs. However, accurate and efficient similarity query processing in high-dimensional metric spaces remains an open challenge. In this paper, a novel indexing approach called LIMS is proposed, which uses data clustering and pivot-based data transformation techniques to build learned indexes for efficient similarity query processing in metric spaces. The underlying data is partitioned into clusters such that each cluster follows a relatively uniform data distribution. Data redistribution is achieved by utilizing a small number of pivots for each cluster. Similar data are mapped into compact regions and the mapped values are totally ordinal. Machine learning models are developed to approximate the position of each data record on the disk. Efficient algorithms are designed for processing range queries and nearest neighbor queries based on LIMS, and for index maintenance with dynamic updates. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes and state-of-the-art learned indexes. Comment: 14 pages, 14 figures, submitted to Transactions on Knowledge and Data Engineering
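
The general idea of mapping objects to ordered one-dimensional values and letting a model predict their position can be sketched as below. This is a heavily simplified illustration and not the LIMS design: it assumes a single pivot, no clustering, an in-memory sorted list instead of disk pages, and a least-squares line as the "model". The class name `LearnedPivotIndex` is hypothetical, and the data set is assumed to be non-empty.

```python
import bisect
import math


class LearnedPivotIndex:
    """Toy learned index over pivot distances (illustrative assumptions only)."""

    def __init__(self, data, pivot, dist=math.dist):
        self.dist, self.pivot = dist, pivot
        # Map every object to a 1-D key (its distance to the pivot) and sort.
        pairs = sorted(((dist(o, pivot), o) for o in data), key=lambda t: t[0])
        self.keys = [k for k, _ in pairs]
        self.objs = [o for _, o in pairs]
        # Fit position ~ slope * key + intercept by ordinary least squares.
        n = len(self.keys)
        mean_k, mean_p = sum(self.keys) / n, (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Largest prediction error over the stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return self.slope * key + self.intercept

    def _window(self, key):
        """Positions that must contain the insertion point of `key`."""
        n = len(self.keys)
        pred, slack = self._predict(key), self.max_err + 1
        start = min(n, max(0, math.floor(pred - slack)))
        end = min(n, max(start, math.ceil(pred + slack)))
        return start, end

    def range_query(self, q, r):
        dq = self.dist(q, self.pivot)
        # Triangle inequality: any answer has a key in [dq - r, dq + r].
        lo = bisect.bisect_left(self.keys, dq - r, *self._window(dq - r))
        hi = bisect.bisect_right(self.keys, dq + r, *self._window(dq + r))
        return [o for o in self.objs[lo:hi] if self.dist(q, o) <= r]


points = [(float(x), float(y)) for x in range(10) for y in range(10)]
index = LearnedPivotIndex(points, pivot=(0.0, 0.0))
print(sorted(index.range_query((4.2, 4.2), 1.0)))
```

Because the fitted line is non-decreasing over sorted keys, the model's maximum error on stored keys bounds how far the true position can be from the prediction, so the final binary search only needs to inspect a small window rather than the whole array.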

    HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

    Nearest neighbor search over large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. Some form of approximation is therefore necessary to solve the nearest neighbor search problem in practice. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to the triangle inequality, we also use the Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion-scale) high-dimensional (up to 1000+ dimensions) datasets show that HD-Index is effective, efficient, and scalable. Comment: PVLDB 11(8):906-919, 201
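
The two distance filters mentioned in the abstract can be compared in a small sketch. The function names and the sample points are assumptions for illustration and do not come from the HD-Index code. Given stored distances from a query q and an object o to two reference objects p and s, the triangle inequality gives d(q, o) >= |d(q, p) - d(o, p)|, while the Ptolemaic inequality, which holds for Euclidean distance but not for every metric, gives d(q, o) >= |d(q, p) d(o, s) - d(q, s) d(o, p)| / d(p, s).

```python
import math


def triangle_lower_bound(dq_p, do_p):
    """Lower bound on d(q, o) from one reference object p."""
    return abs(dq_p - do_p)


def ptolemaic_lower_bound(dq_p, dq_s, do_p, do_s, dp_s):
    """Lower bound on d(q, o) from two reference objects p and s.

    Valid only when the distance satisfies Ptolemy's inequality
    (Euclidean distance does); dp_s must be non-zero.
    """
    return abs(dq_p * do_s - dq_s * do_p) / dp_s


# Hypothetical sample configuration in the plane.
q, o = (0.0, 0.0), (4.0, 0.0)
p, s = (0.0, 1.0), (4.0, 1.0)

dq_p, dq_s = math.dist(q, p), math.dist(q, s)
do_p, do_s = math.dist(o, p), math.dist(o, s)
dp_s = math.dist(p, s)

tri = max(triangle_lower_bound(dq_p, do_p), triangle_lower_bound(dq_s, do_s))
pto = ptolemaic_lower_bound(dq_p, dq_s, do_p, do_s, dp_s)
combined = max(tri, pto)  # both are valid bounds, so the larger one is kept

print(tri, pto, math.dist(q, o))
```

In this particular configuration the Ptolemaic bound is the tighter of the two, but that is not guaranteed in general, which is why a filter would typically keep the maximum of both bounds.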