The State-of-the-arts in Focused Search
The continuous influx of text data on the Web requires search engines to improve their retrieval of more specific information. The need for results relevant to a user’s topic of interest has gone beyond searching for domain- or type-specific documents to more focused results (e.g., document fragments or answers to a query). The introduction of XML provides a standard format for data representation, storage, and exchange. It allows focused search to be carried out at different granularities of a structured document with XML markup. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes by highlighting open problems.
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection are available online.
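The exact brute-force baseline mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the Jaccard word overlap is a simple stand-in for the paper's similarity function, which the abstract does not specify. The approximate algorithm is not shown.

```python
import heapq


def jaccard(a, b):
    """Stand-in similarity (an assumption): word-set overlap of two strings."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def brute_force_knn(query, records, similarity, k):
    """Exact k-NN: score every record against the query and keep the top k.

    Any similarity function can be plugged in, which is what makes the
    search generic, but the cost grows linearly with the collection size.
    """
    return heapq.nlargest(k, records, key=lambda r: similarity(query, r))


records = [
    "how to sort a list in python",
    "python list sorting example",
    "install java on linux",
]
top = brute_force_knn("sort python list", records, jaccard, k=2)
```

Because every record is scored for every query, this baseline is slow on large collections, which is the cost an approximate k-NN index is meant to reduce.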
Term-Specific Eigenvector-Centrality in Multi-Relation Networks
Fuzzy matching and ranking are two information retrieval techniques widely used in web search. Their application to structured data, however, remains an open problem. This article investigates how eigenvector centrality can be used for approximate matching in multi-relation graphs, that is, graphs where connections of many different types may exist. Based on an extension of the PageRank matrix, eigenvectors representing the distribution of a term after propagating term weights between related data items are computed. The result is an index which takes the document structure into account and can be used with standard document retrieval techniques. As the scheme takes the shape of an index transformation, all necessary calculations are performed at index time.
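The idea of propagating a term's weights between related items can be illustrated with a personalized PageRank power iteration. This is a simplified sketch, not the article's method: it assumes a single relation type, and the function name, damping value, and iteration count are all hypothetical choices.

```python
def propagate_term_weights(links, term_weights, damping=0.85, iterations=50):
    """Spread one term's weights over a graph by power iteration.

    links: dict mapping each item to the list of items it points to
           (assumption: every target also appears as a key).
    term_weights: dict mapping items to the raw weight of the term in them.
    """
    nodes = list(links)
    total = sum(term_weights.get(n, 0.0) for n in nodes)
    if total == 0:
        return {n: 0.0 for n in nodes}
    # The normalized term weights act as the teleport (seed) distribution.
    seed = {n: term_weights.get(n, 0.0) / total for n in nodes}
    score = dict(seed)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * seed[n] for n in nodes}
        for n in nodes:
            targets = links[n]
            if targets:
                share = damping * score[n] / len(targets)
                for m in targets:
                    nxt[m] += share
            else:
                # Dangling item: return its mass to the seed distribution.
                for m in nodes:
                    nxt[m] += damping * score[n] * seed[m]
        score = nxt
    return score


links = {"a": ["b"], "b": ["c"], "c": []}
scores = propagate_term_weights(links, {"a": 1.0})
```

In this toy graph the term occurs only in item "a", yet "b" and "c" receive non-zero scores through the links, so a query for the term can also match related items.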
The citation wake of publications detects Nobel laureates' papers
For several decades, a leading paradigm of how to quantitatively assess
scientific research has been the analysis of the aggregated citation
information in a set of scientific publications. Although the representation of
this information as a citation network was already proposed in the 1960s, it
took the systematic indexing of scientific literature to allow for impact
metrics that actually made use of this network as a whole, improving on the then
prevailing metrics that were almost exclusively based on the number of direct
citations. However, besides focusing on the assignment of credit, the paper
citation network can also be studied in terms of the proliferation of
scientific ideas. Here we introduce a simple measure based on the
shortest paths in the paper's in-component or, simply speaking, on the shape
and size of the wake of a paper within the citation network. Applied to a
citation network containing Physical Review publications from more than a
century, our approach is able to detect seminal articles which have introduced
concepts of obvious importance to the further development of physics. We
observe a large fraction of papers co-authored by Nobel Prize laureates in
physics among the top-ranked publications. Comment: 11 pages, 3 figures
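The shape and size of a paper's wake can be explored with a breadth-first search over the papers that cite it, directly or indirectly. The sketch below only counts papers at each shortest-path distance. How the abstract's measure combines these counts is not stated, so no score is computed. The function name and the toy network are hypothetical.

```python
from collections import deque


def wake_profile(cited_by, paper):
    """Count the papers in the in-component of `paper` by citation distance.

    cited_by: dict mapping a paper to the papers that cite it directly.
    Returns a dict {distance: number of papers at that shortest-path distance}.
    """
    distance = {paper: 0}
    queue = deque([paper])
    while queue:
        current = queue.popleft()
        for citing in cited_by.get(current, ()):
            if citing not in distance:  # first visit gives the shortest path
                distance[citing] = distance[current] + 1
                queue.append(citing)
    profile = {}
    for p, d in distance.items():
        if d > 0:  # skip the paper itself
            profile[d] = profile.get(d, 0) + 1
    return profile


# Toy network: B and C cite A, D cites both B and C, E cites D.
cited_by = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
profile = wake_profile(cited_by, "A")
```

The number of layers indicates how far the wake reaches, and the layer sizes indicate how wide it is at each distance.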