
    The State-of-the-arts in Focused Search

    The continuous influx of diverse text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for results relevant to a user's topic of interest has gone beyond searching for domain- or type-specific documents to more focused results (e.g., document fragments or answers to a query). The introduction of XML provides a format standard for data representation, storage, and exchange, and allows focused search to be carried out at different granularities of a structured document with XML markup. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes with a highlight of open problems.

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
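
    As a rough illustration of the candidate-generation step described above, the sketch below runs an exact brute-force k-NN scan over dense document vectors, followed by re-ranking. This is a minimal sketch under assumptions, not the authors' implementation: it presumes documents and queries are already mapped to L2-normalized vectors (the paper's similarity function models subtler term associations), and `rerank_fn` is a hypothetical placeholder. In practice an approximate index (e.g., HNSW) would replace the exhaustive scan to obtain the reported speedup.

```python
import numpy as np

def knn_candidates(query_vec, doc_matrix, k=100):
    """Exact brute-force k-NN: score every document, keep the top k.
    doc_matrix: (n_docs, dim) document vectors, assumed L2-normalized.
    query_vec: (dim,) query vector, assumed L2-normalized."""
    scores = doc_matrix @ query_vec        # cosine similarity via dot product
    top = np.argpartition(-scores, k)[:k]  # unordered top-k in O(n_docs)
    return top[np.argsort(-scores[top])]   # order candidates by score

def retrieve(query_vec, doc_matrix, rerank_fn, k=100):
    """Two-stage pipeline: k-NN candidate generation, then re-ranking.
    rerank_fn is a hypothetical stand-in for the second-stage scorer."""
    candidates = knn_candidates(query_vec, doc_matrix, k)
    return sorted(candidates, key=rerank_fn, reverse=True)
```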

    Term-Specific Eigenvector-Centrality in Multi-Relation Networks

    Fuzzy matching and ranking are two information retrieval techniques widely used in web search. Their application to structured data, however, remains an open problem. This article investigates how eigenvector centrality can be used for approximate matching in multi-relation graphs, that is, graphs where connections of many different types may exist. Based on an extension of the PageRank matrix, eigenvectors representing the distribution of a term after propagating term weights between related data items are computed. The result is an index which takes the document structure into account and can be used with standard document retrieval techniques. As the scheme takes the shape of an index transformation, all necessary calculations are performed during index time.
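
    One way to read the propagation step is as a personalized PageRank computation per term: the term's initial weights over data items form the teleport distribution, and power iteration spreads them along graph edges. The sketch below is a simplified, assumption-laden rendering, not the article's exact scheme: it collapses the multi-relation graph into a single adjacency matrix, and `term_weights` and `alpha` are illustrative names and values.

```python
import numpy as np

def term_eigenvector(adj, term_weights, alpha=0.85, iters=50):
    """Propagate one term's weight distribution through the data graph.
    adj: (n, n) adjacency matrix; the multi-relation graph is collapsed
         into a single matrix here for simplicity.
    term_weights: (n,) initial per-item term weights (e.g., tf-idf)."""
    # Column-normalize so each node distributes its weight to neighbors.
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    M = adj / col_sums
    # Teleport vector: the term's own weight distribution, normalized.
    v = term_weights / term_weights.sum()
    x = v.copy()
    for _ in range(iters):
        # Follow edges with probability alpha; jump back to the term's
        # own distribution with probability 1 - alpha.
        x = alpha * (M @ x) + (1 - alpha) * v
    return x  # per-item score for this term, stored as the new index entry
```

    Since each such eigenvector is computed once at indexing time, the resulting per-term score vectors can be stored like ordinary postings and queried with standard document-retrieval machinery.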

    The citation wake of publications detects Nobel laureates' papers

    For several decades, a leading paradigm for quantitatively assessing scientific research has been the analysis of aggregated citation information in a set of scientific publications. Although the representation of this information as a citation network was already conceived in the 1960s, it took the systematic indexing of scientific literature to enable impact metrics that actually made use of this network as a whole, improving on the then-prevailing metrics that were almost exclusively based on the number of direct citations. However, besides focusing on the assignment of credit, the paper citation network can also be studied in terms of the proliferation of scientific ideas. Here we introduce a simple measure based on the shortest paths in a paper's in-component or, simply speaking, on the shape and size of the wake of a paper within the citation network. Applied to a citation network containing Physical Review publications from more than a century, our approach is able to detect seminal articles that introduced concepts of obvious importance to the further development of physics. We observe a large fraction of papers co-authored by Nobel Prize laureates in physics among the top-ranked publications.
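
    A plausible reading of the measure, sketched under assumptions the abstract leaves open: reverse the citation edges, collect BFS layers from the focal paper (its in-component, i.e., every paper citing it directly or transitively), and summarize the wake by how many papers sit at each shortest-path distance. The layer damping `gamma` below is an illustrative choice, not the published weighting.

```python
import networkx as nx

def citation_wake(G, paper, gamma=0.5):
    """Shape and size of a paper's wake in a citation network.
    G: DiGraph where an edge u -> v means 'u cites v'.
    Returns (layers, score): papers per shortest-path distance, and a
    damped sum over layers (gamma is an illustrative parameter)."""
    # BFS over reversed edges reaches every paper that cites `paper`,
    # directly or transitively (the paper's in-component).
    dist = nx.single_source_shortest_path_length(G.reverse(copy=False), paper)
    layers = {}
    for node, d in dist.items():
        if node != paper:
            layers[d] = layers.get(d, 0) + 1
    score = sum(count * gamma ** (d - 1) for d, count in layers.items())
    return layers, score
```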