20 research outputs found

    Enhancing Knowledge Bases with Quantity Facts

    Get PDF

    The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K–List Similarity Search

    Get PDF
    We consider the problem of processing similarity queries over a set of top-k rankings where the query ranking and the similarity threshold are provided at query time. Spearman’s Footrule distance is used to compute the similarity between rankings, considering how well rankings agree on the positions (ranks) of ranked items (i.e., the L1 distance). This setup allows the application of metric index structures such as M- or BK-trees and, alternatively, enables the use of traditional inverted indices for retrieving rankings that overlap (in items) with the query. Although both techniques are reasonable, they come with individual drawbacks for our specific problem. In this paper, we propose a hybrid indexing strategy, which blends inverted indices and metric space indexing, resulting in a structure that resembles both indexing methods with tunable emphasis on one or the other. To find the sweet spot, we propose an assumption-lean but highly accurate (empirically validated) cost model through theoretical analysis. We further present optimizations to the inverted index component, for early termination and minimizing bookkeeping. The performance of the proposed algorithms, hybrid variants, and competitors is studied in a comprehensive evaluation using real-world benchmark data consisting of Web-search–result rankings and entity rankings based on Wikipedia

    Entity Recommendation Based on {Wikipedia}

    No full text

    Similarity Search Algorithms over Top-k Rankings and Class-Constrained Objects

    No full text
    In this thesis, we consider the problem of processing similarity queries over a dataset of top-k rankings and class constrained objects. Top-k rankings are the most natural and widely used technique to compress a large amount of information into a concise form. Spearman’s Footrule distance is used to compute the similarity between rankings, considering how well rankings agree on the positions (ranks) of ranked items. This setup allows the application of metric distance-based pruning strategies, and, alternatively, enables the use of traditional inverted indices for retrieving rankings that overlap in items. Although both techniques can be individually applied, we hypothesize that blending these two would lead to better performance. First, we formulate theoretical bounds over the rankings, based on Spearman's Footrule distance, which are essential for adapting existing, inverted index based techniques to the setting of top-k rankings. Further, we propose a hybrid indexing strategy, designed for efficiently processing similarity range queries, which incorporates inverted indices and metric space indices, such as M- or BK-trees, resulting in a structure that resembles both indexing methods with tunable emphasis on one or the other. Moreover, optimizations to the inverted index component are presented, for early termination and minimizing bookkeeping. As vast amounts of data are being generated on a daily bases, we further present a distributed, highly tunable, approach, implemented in Apache Spark, for efficiently processing similarity join queries over top-k rankings. To combine distance-based filtering with inverted indices, the algorithm works in several phases. The partial results are joined for the computation of the final result set. As the last contribution of the thesis, we consider processing k-nearest-neighbor (k-NN) queries over class-constrained objects, with the additional requirement that the result objects are of a specific type. We introduce the MISP index, which first indexes the objects by their (combination of) class belonging, followed by a similarity search sub index for each subset of objects. The number of such subsets can combinatorially explode, thus, we provide a cost model that analyzes the performance of the MISP index structure under different configurations, with the aim of finding the most efficient one for the dataset being searched

    Similarity Search Algorithms over Top-k Rankings and Class-Constrained Objects

    No full text
    In this thesis, we consider the problem of processing similarity queries over a dataset of top-k rankings and class constrained objects. Top-k rankings are the most natural and widely used technique to compress a large amount of information into a concise form. Spearman’s Footrule distance is used to compute the similarity between rankings, considering how well rankings agree on the positions (ranks) of ranked items. This setup allows the application of metric distance-based pruning strategies, and, alternatively, enables the use of traditional inverted indices for retrieving rankings that overlap in items. Although both techniques can be individually applied, we hypothesize that blending these two would lead to better performance. First, we formulate theoretical bounds over the rankings, based on Spearman's Footrule distance, which are essential for adapting existing, inverted index based techniques to the setting of top-k rankings. Further, we propose a hybrid indexing strategy, designed for efficiently processing similarity range queries, which incorporates inverted indices and metric space indices, such as M- or BK-trees, resulting in a structure that resembles both indexing methods with tunable emphasis on one or the other. Moreover, optimizations to the inverted index component are presented, for early termination and minimizing bookkeeping. As vast amounts of data are being generated on a daily bases, we further present a distributed, highly tunable, approach, implemented in Apache Spark, for efficiently processing similarity join queries over top-k rankings. To combine distance-based filtering with inverted indices, the algorithm works in several phases. The partial results are joined for the computation of the final result set. As the last contribution of the thesis, we consider processing k-nearest-neighbor (k-NN) queries over class-constrained objects, with the additional requirement that the result objects are of a specific type. We introduce the MISP index, which first indexes the objects by their (combination of) class belonging, followed by a similarity search sub index for each subset of objects. The number of such subsets can combinatorially explode, thus, we provide a cost model that analyzes the performance of the MISP index structure under different configurations, with the aim of finding the most efficient one for the dataset being searched

    {X-REC}: Cross-category Entity Recommendation

    No full text

    Linking {Wikipedia} Events to Past News

    No full text
    We consider the task of linking Wikipedia events to rele-vant news articles from the past. Descriptions of events are abundant in Wikipedia and systematically curated in year, decade, and century articles. To address this task, we develop a two-stage cascade approach that builds a query model from temporal expressions in a set of initially re-trieved documents. As baselines we consider several meth-ods that integrate publication dates and/or temporal expres-sions into a language modeling approach. Our experimen-tal evaluation on 50 randomly sampled Wikipedia events with crowd-sourced relevance assessments shows that the two-stage cascade approach outperforms the baselines. Our experimental testbed of queries and relevance assessments is made publicly available
    corecore