262 research outputs found

    Sparse spatial selection for novelty-based search result diversification

    Novelty-based diversification approaches aim to produce a diverse ranking by directly comparing the retrieved documents. However, since such approaches are typically greedy, they require O(n^2) document-document comparisons in order to diversify a ranking of n documents. In this work, we propose to model novelty-based diversification as a similarity search in a sparse metric space. In particular, we exploit the triangle inequality property of metric spaces in order to drastically reduce the number of required document-document comparisons. Thorough experiments using three TREC test collections show that our approach is at least as effective as existing novelty-based diversification approaches, while improving their efficiency by an order of magnitude.
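
    As a rough illustration of how the triangle inequality can save comparisons, the sketch below filters a range query with a single pivot. It is not the method of the paper above: the function name `range_query_with_pivot`, the random 2-D points, the pivot at the origin, and the radius are all assumptions made for this example.

```python
import math
import random


def range_query_with_pivot(points, pivot_dists, pivot, query, radius, dist=math.dist):
    """Return (matches, number of distance evaluations made at query time).

    pivot_dists[i] must hold the precomputed distance dist(points[i], pivot).
    """
    d_qp = dist(query, pivot)  # one evaluation against the pivot
    evaluations = 1
    matches = []
    for x, d_xp in zip(points, pivot_dists):
        # Triangle inequality: dist(query, x) >= |dist(query, pivot) - dist(x, pivot)|.
        # If that lower bound already exceeds the radius, x cannot match.
        if abs(d_qp - d_xp) > radius:
            continue
        evaluations += 1
        if dist(query, x) <= radius:
            matches.append(x)
    return matches, evaluations


# Hypothetical data: 1000 random points in the unit square, pivot at the origin.
random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]
pivot = (0.0, 0.0)
pivot_dists = [math.dist(x, pivot) for x in points]  # computed once, at index time

query, radius = (0.5, 0.5), 0.05
matches, evaluations = range_query_with_pivot(points, pivot_dists, pivot, query, radius)
brute_force = [x for x in points if math.dist(query, x) <= radius]

print(len(matches), "matches using", evaluations, "distance evaluations")
```

    The filtered query returns the same matches as the brute-force scan; only candidates whose lower bound does not rule them out are compared directly.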

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.

    Ptolemaic Indexing

    This paper discusses a new family of bounds for use in similarity search, related to those used in metric indexing, but based on Ptolemy's inequality, rather than the metric axioms. Ptolemy's inequality holds for the well-known Euclidean distance, but is also shown here to hold for quadratic form metrics in general, with Mahalanobis distance as an important special case. The inequality is examined empirically on both synthetic and real-world data sets and is also found to hold approximately, with a very low degree of error, for important distances such as the angular pseudometric and several Lp norms. Indexing experiments demonstrate a highly increased filtering power compared to existing, triangular methods. It is also shown that combining the Ptolemaic and triangular filtering can lead to better results than using either approach on its own.
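
    To make the idea of a Ptolemaic bound concrete, the sketch below compares a triangular lower bound with a lower bound derived from Ptolemy's inequality, d(q,p)·d(o,s) <= d(q,o)·d(p,s) + d(q,s)·d(o,p), using two pivots p and s. The function names, the random Gaussian points, and the choice of two pivots are assumptions made for this example and are not taken from the paper.

```python
import math
import random

dist = math.dist  # Euclidean distance, for which Ptolemy's inequality holds


def triangle_lower_bound(d_qp, d_op):
    """Lower bound on d(q, o) from one pivot p: |d(q, p) - d(o, p)|."""
    return abs(d_qp - d_op)


def ptolemaic_lower_bound(d_qp, d_qs, d_op, d_os, d_ps):
    """Lower bound on d(q, o) from two pivots p and s.

    Rearranging Ptolemy's inequality gives
    d(q, o) >= |d(q, p) * d(o, s) - d(q, s) * d(o, p)| / d(p, s).
    """
    return abs(d_qp * d_os - d_qs * d_op) / d_ps


def random_point(dim=8):
    return tuple(random.gauss(0.0, 1.0) for _ in range(dim))


# Hypothetical data: two fixed pivots and 1000 random (query, object) pairs.
random.seed(1)
p, s = random_point(), random_point()
d_ps = dist(p, s)

results = []  # (triangular bound, Ptolemaic bound, true distance)
for _ in range(1000):
    q, o = random_point(), random_point()
    tri = max(
        triangle_lower_bound(dist(q, p), dist(o, p)),
        triangle_lower_bound(dist(q, s), dist(o, s)),
    )
    pto = ptolemaic_lower_bound(dist(q, p), dist(q, s), dist(o, p), dist(o, s), d_ps)
    results.append((tri, pto, dist(q, o)))

# Taking the larger of the two bounds combines both kinds of filtering.
combined = [max(tri, pto) for tri, pto, _ in results]
```

    Both bounds never exceed the true distance, so an object can be discarded whenever either bound is larger than the query radius.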

    Efficient Document Indexing Using Pivot Tree

    We present a novel method for efficiently searching for the top-k neighbors of documents represented in a high-dimensional space of terms, based on cosine similarity. Documents are mostly stored in a bag-of-words tf-idf representation. One of the most widely used ways of computing the similarity between a pair of documents is the cosine similarity between their vector representations, but cosine similarity is not a metric distance measure because it does not satisfy the triangle inequality; therefore, most metric search methods cannot be applied directly. We propose an efficient method for indexing documents using a pivot tree that leads to efficient retrieval. We also study the relation between precision and efficiency for the proposed method and compare it with the state of the art in document search based on inner product. Comment: 6 pages, 2 figures.
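
    The claim that cosine-based dissimilarity breaks the triangle inequality can be checked on a tiny example. The sketch below uses three hand-picked 2-D vectors (an assumption made for illustration) and contrasts the cosine distance 1 - cos with the angular distance arccos(cos), which does satisfy the inequality here. It does not reproduce the pivot tree proposed in the paper.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def cosine_distance(u, v):
    """1 - cosine similarity; not a metric in general."""
    return 1.0 - cosine_similarity(u, v)


def angular_distance(u, v):
    """Angle between u and v in radians."""
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    return math.acos(max(-1.0, min(1.0, cosine_similarity(u, v))))


# Hypothetical vectors: b lies halfway (45 degrees) between a and c.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

direct = cosine_distance(a, c)                         # going straight from a to c
via_b = cosine_distance(a, b) + cosine_distance(b, c)  # detour through b
print("cosine distance: direct", direct, "via b", via_b)

direct_angle = angular_distance(a, c)
via_b_angle = angular_distance(a, b) + angular_distance(b, c)
print("angular distance: direct", direct_angle, "via b", via_b_angle)
```

    With cosine distance the detour through b is shorter than the direct path, which is exactly the violation that prevents metric indexes from being applied directly.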

    Dynamic selection of suitable pivots for similarity search in metric spaces

    This paper presents a data structure based on Sparse Spatial Selection (SSS) for similarity searching. It introduces an algorithm that periodically tries to adjust the pivots to the way the database index is used, which makes the index dynamic. In this way, the discrimination achieved by the pivots can be improved, so the primary objective of an index is met: reducing the number of distance function evaluations, as the experiments show. VI Workshop Bases de Datos y Minería de Datos (WBD). Red de Universidades con Carreras en Informática (RedUNCI).
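
    The abstract does not spell out how SSS picks pivots. The sketch below follows the selection rule as it is commonly described for SSS: an object becomes a pivot when it is at least alpha * M away from every pivot chosen so far, where M is the maximum distance between objects. The value alpha = 0.4, the unit-square data, and the use of the square's diagonal as M are assumptions, and the periodic pivot adjustment proposed in the paper is not modelled.

```python
import math
import random


def sss_select_pivots(objects, dist, max_dist, alpha=0.4):
    """Scan the objects once and keep those far from every pivot chosen so far."""
    threshold = alpha * max_dist
    pivots = []
    for obj in objects:
        # The first object always qualifies because all() of an empty sequence is True.
        if all(dist(obj, p) >= threshold for p in pivots):
            pivots.append(obj)
    return pivots


# Hypothetical data: random points in the unit square, whose diameter is sqrt(2).
random.seed(2)
points = [(random.random(), random.random()) for _ in range(500)]
max_dist = math.sqrt(2)
alpha = 0.4

pivots = sss_select_pivots(points, math.dist, max_dist, alpha)
print(len(pivots), "pivots selected")
```

    Because the rule depends only on distances, the number of pivots adapts to the data rather than being fixed in advance.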

    Modelling Efficient Novelty-based Search Result Diversification in Metric Spaces

    Novelty-based diversification provides a way to tackle ambiguous queries by re-ranking a set of retrieved documents. Current approaches are typically greedy, requiring O(n^2) document-document comparisons in order to diversify a ranking of n documents. In this article, we introduce a new approach for novelty-based search result diversification to reduce the overhead incurred by document-document comparisons. To this end, we model novelty promotion as a similarity search in a metric space, exploiting the properties of this space to efficiently identify novel documents. We investigate three different approaches: pivoting-based, clustering-based, and permutation-based. In the first two, a novel document is one that lies outside the range of a pivot or outside a cluster. In the last, a novel document is one that has a different signature (i.e., the document's relative distance to a distinguished set of fixed objects called permutants) compared to previously selected documents. Thorough experiments using two TREC test collections for diversity evaluation, as well as a large sample of the query stream of a commercial search engine, show that our approaches perform at least as effectively as well-known novelty-based diversification approaches in the literature, while dramatically improving their efficiency. Affiliation: Gil Costa, Graciela Verónica. Yahoo; Mexico. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico San Luis; Argentina. Affiliation: Santos, Rodrygo L. T. University of Glasgow; United Kingdom. Affiliation: Macdonald, Craig. University of Glasgow; United Kingdom. Affiliation: Ounis, Iadh. University of Glasgow; United Kingdom.
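
    The permutation-based notion of novelty can be sketched in a few lines. In the example below a document's signature is the order in which it sees the permutants, from nearest to farthest, and a document counts as novel when no previously selected document has the same signature. The function names, the exact-match test on signatures, and the 2-D toy data are assumptions for illustration; the article may compare signatures differently.

```python
import math


def signature(doc, permutants, dist):
    """Indices of the permutants ordered from nearest to farthest from doc."""
    return tuple(sorted(range(len(permutants)), key=lambda i: dist(doc, permutants[i])))


def diversify(ranked_docs, permutants, dist, k):
    """Walk the ranking and keep the first k documents with unseen signatures."""
    selected, seen = [], set()
    for doc in ranked_docs:
        sig = signature(doc, permutants, dist)
        if sig in seen:
            continue  # same signature as a selected document: treated as redundant
        seen.add(sig)
        selected.append(doc)
        if len(selected) == k:
            break
    return selected


# Hypothetical data: documents as 2-D points, already sorted by relevance.
permutants = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ranked_docs = [(0.1, 0.05), (0.12, 0.06), (0.9, 0.2), (0.2, 0.9)]

top = diversify(ranked_docs, permutants, math.dist, k=3)
print(top)
```

    Each document needs only one distance computation per permutant, so the cost grows with the number of permutants rather than with the number of documents already selected.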

    Using parallel pivot vs. clustering-based techniques for web engines

    Web engines are a useful tool for searching for information on the Web. However, a large part of this information is non-textual, and in that case a metric space is used. A metric space is a set where a notion of distance (called a metric) between elements of the set is defined. In this paper we present an efficient parallelization of a pivot-based method devised for this purpose, called the Sparse Spatial Selection (SSS) strategy, and we compare it with a clustering-based method, a parallel implementation of the Spatial Approximation Tree (SAT). We show that SAT compares favourably against the pivot-based SSS data structure. The experimental results, obtained on a high-performance cluster using several metric spaces, show load-balanced parallel strategies for the SAT. The implementations are built upon the BSP parallel computing model, which shows efficient performance for this application domain and allows a precise evaluation of algorithms. VIII Workshop de Procesamiento Distribuido y Paralelo. Red de Universidades con Carreras en Informática (RedUNCI).

    Using Apache Lucene to Search Vector of Locally Aggregated Descriptors

    Surrogate Text Representation (STR) is a profitable solution for efficient similarity search in metric spaces using conventional text search engines, such as Apache Lucene. This technique is based on comparing the permutations of some reference objects in place of the original metric distance. However, the Achilles heel of the STR approach is the need to reorder the result set of the search according to the metric distance. This forces the use of a support database to store the original objects, which requires efficient random I/O on fast secondary memory (such as flash-based storage). In this paper, we propose to extend the Surrogate Text Representation to specifically address a class of visual metric objects known as Vector of Locally Aggregated Descriptors (VLAD). This approach is based on representing the individual sub-vectors forming the VLAD vector with the STR, providing a finer representation of the vector and enabling us to get rid of the reordering phase. The experiments on a publicly available dataset show that the extended STR outperforms the baseline STR, achieving satisfactory performance close to that obtained with the original VLAD vectors. Comment: In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) - Volume 4: VISAPP, p. 383-39
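
    One way to picture a surrogate text is as a string in which each reference object becomes a term that is repeated more often the closer it is to the encoded object, so that a term-frequency based text engine ends up comparing permutations. The sketch below builds such a string in plain Python. The term naming scheme `R<i>`, the repetition rule (k repetitions for the nearest reference, k - 1 for the next, and so on), and the toy data are assumptions for illustration; no Lucene API is used, and the VLAD-specific extension from the paper is not modelled.

```python
import math


def surrogate_text(obj, references, dist, k):
    """Encode obj as space-separated terms built from its k nearest references."""
    nearest = sorted(range(len(references)), key=lambda i: dist(obj, references[i]))[:k]
    terms = []
    for position, ref_index in enumerate(nearest):
        # The nearest reference is repeated k times, the next one k - 1 times, etc.
        terms.extend([f"R{ref_index}"] * (k - position))
    return " ".join(terms)


# Hypothetical data: three reference objects and one object to encode.
references = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
text = surrogate_text((0.9, 0.2), references, math.dist, k=2)
print(text)
```

    The resulting string could then be indexed as an ordinary text field by any full-text engine.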