Scalable and Effective Generative Information Retrieval
Recent research has shown that transformer networks can be used as
differentiable search indexes by representing each document as a sequence of
document ID tokens. These generative retrieval models cast the retrieval
problem to a document ID generation problem for each given query. Despite their
elegant design, existing generative retrieval models only perform well on
artificially-constructed and small-scale collections. This has led to serious
skepticism in the research community on their real-world impact. This paper
represents an important milestone in generative retrieval research by showing,
for the first time, that generative retrieval models can be trained to perform
effectively on large-scale standard retrieval benchmarks. To this end, we
propose RIPOR, an optimization framework for generative retrieval that can be
adopted by any encoder-decoder architecture. RIPOR is designed based on two
often-overlooked fundamental design considerations in generative retrieval.
First, given the sequential decoding nature of document ID generation,
assigning accurate relevance scores to documents based on the whole document ID
sequence is not sufficient. To address this issue, RIPOR introduces a novel
prefix-oriented ranking optimization algorithm. Second, initial document IDs
should be constructed based on relevance associations between queries and
documents, instead of the syntactic and semantic information in the documents.
RIPOR addresses this issue using a relevance-based document ID construction
approach that quantizes relevance-based representations learned for documents.
Evaluation on MS MARCO and the TREC Deep Learning Track reveals that RIPOR
surpasses state-of-the-art generative retrieval models by a large margin (e.g.,
30.5% MRR improvement on the MS MARCO Dev Set) and performs on par with popular
dense retrieval models.
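The core mechanism described above, casting retrieval as document ID generation, can be illustrated with a toy sketch: each document is identified by a short sequence of ID tokens, and retrieval is constrained beam search over the trie of valid ID sequences, so every prefix scored corresponds to some subset of documents. The scoring function, document IDs, and all names below are illustrative stand-ins, not RIPOR's actual components.

```python
# Toy sketch of generative retrieval: documents are identified by short
# sequences of ID tokens; retrieval is beam search constrained to the trie
# of valid ID sequences. toy_score stands in for a trained encoder-decoder.

from math import log

# Hypothetical document IDs as token sequences (e.g., from quantized embeddings).
DOC_IDS = {
    "doc_a": (3, 1, 4),
    "doc_b": (3, 1, 5),
    "doc_c": (2, 7, 0),
}

def build_trie(doc_ids):
    """Map every valid prefix to the set of tokens that may follow it."""
    trie = {}
    for seq in doc_ids.values():
        for i in range(len(seq)):
            trie.setdefault(seq[:i], set()).add(seq[i])
    return trie

def beam_search(score_fn, trie, length, beam=2):
    """Return (sequence, log-score) pairs, best first, over valid prefixes only."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, s in beams:
            for tok in trie.get(prefix, ()):
                candidates.append((prefix + (tok,), s + score_fn(prefix, tok)))
        beams = sorted(candidates, key=lambda x: -x[1])[:beam]
    return beams

# A toy "model" that prefers doc_b's ID sequence for the current query.
target = DOC_IDS["doc_b"]
def toy_score(prefix, tok):
    return log(0.9) if target[len(prefix)] == tok else log(0.05)

trie = build_trie(DOC_IDS)
best_seq, _ = beam_search(toy_score, trie, length=3)[0]
print(best_seq)  # (3, 1, 5)
```

The prefix-oriented point in the abstract maps onto this sketch directly: because the beam prunes on partial sequences, a model scored only on complete ID sequences can lose the right document before decoding finishes.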
Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes
Full-text search engines are important tools for information retrieval. Term
proximity is an important factor in relevance score measurement. In a proximity
full-text search, we assume that a relevant document contains query terms near
each other, especially if the query terms are frequently occurring words. A
methodology for high-performance full-text query execution is discussed. We
build additional indexes to achieve better efficiency. For a word that occurs
in the text, we include in the indexes some information about nearby words.
What types of additional indexes do we use? How do we use them? These questions
are discussed in this work. We present the results of experiments showing that
the average time of search query execution is 44-45 times less than that
required when using ordinary inverted indexes.
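The "additional index" idea, storing information about nearby words alongside the ordinary inverted index, can be sketched minimally as an index keyed by word pairs that co-occur within a small window, so a proximity query becomes a single lookup rather than an intersection of two long posting lists. The window size, structure, and names below are illustrative assumptions, not the paper's actual index design.

```python
# Minimal sketch: alongside an ordinary inverted index, build an index keyed
# by word *pairs* co-occurring within WINDOW positions, so proximity queries
# are answered by one lookup instead of intersecting posting lists.

from collections import defaultdict

WINDOW = 3  # max distance between terms treated as "near" (illustrative)

def build_indexes(docs):
    inverted = defaultdict(list)    # term -> [(doc_id, position)]
    pair_index = defaultdict(list)  # (term_a, term_b) -> [(doc_id, pos_a, pos_b)]
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            inverted[tok].append((doc_id, i))
            for j in range(i + 1, min(i + 1 + WINDOW, len(tokens))):
                key = tuple(sorted((tok, tokens[j])))
                pair_index[key].append((doc_id, i, j))
    return inverted, pair_index

def proximity_search(pair_index, term_a, term_b):
    """Documents where the two terms occur within WINDOW of each other."""
    key = tuple(sorted((term_a, term_b)))
    return sorted({doc for doc, _, _ in pair_index.get(key, [])})

docs = {
    1: "full text search engines are important tools",
    2: "full scale search is a very hard text problem",
}
_, pairs = build_indexes(docs)
print(proximity_search(pairs, "full", "text"))  # [1]
```

The trade-off this illustrates is the one the paper exploits: pair indexes cost extra space, especially for frequently occurring words, in exchange for large query-time savings.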
This is a pre-print of a contribution "Veretennikov A.B. Proximity Full-Text
Search with a Response Time Guarantee by Means of Additional Indexes" published
in "Arai K., Kapoor S., Bhatia R. (eds) Intelligent Systems and Applications.
IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868"
published by Springer, Cham. The final authenticated version is available
online at: https://doi.org/10.1007/978-3-030-01054-6_66. The work was supported
by Act 211 Government of the Russian Federation, contract no. 02.A03.21.0006.
Comment: Alexander B. Veretennikov, Chair of Calculation Mathematics and
Computer Science, INSM, Ural Federal University.
Spatio-textual indexing for geographical search on the web
Many web documents refer to specific geographic localities and many
people include geographic context in queries to web search engines. Standard
web search engines treat the geographical terms in the same way as other terms.
This can result in failure to find relevant documents that refer to the place of
interest using alternative related names, such as those of included or nearby
places. This can be overcome by associating text indexing with spatial indexing
methods that exploit geo-tagging procedures to categorise documents with
respect to geographic space. We describe three methods for spatio-textual
indexing based on multiple spatially indexed text indexes, attaching spatial
indexes to the document occurrences of a text index, and merging text index
access results with results of access to a spatial index of documents. These
schemes are compared experimentally with a conventional text index search
engine, using a collection of geo-tagged web documents, and are shown to be
able to compete in speed and storage performance with pure text indexing.
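One of the three schemes described, merging text-index results with access to a spatial index of documents, can be sketched as a plain inverted index whose matches are filtered against per-document geographic footprints produced by geo-tagging. The bounding-box representation, footprint values, and function names below are illustrative assumptions.

```python
# Toy spatio-textual search: a text inverted index filtered against
# per-document geographic footprints (bounding boxes from geo-tagging).

from collections import defaultdict

# doc_id -> (min_lat, min_lon, max_lat, max_lon); values are made up.
FOOTPRINTS = {
    1: (51.3, -0.5, 51.7, 0.3),   # roughly London
    2: (55.8, -4.4, 56.0, -4.1),  # roughly Glasgow
}

def build_text_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.lower().split():
            index[tok].add(doc_id)
    return index

def boxes_overlap(a, b):
    """True if two (min_lat, min_lon, max_lat, max_lon) boxes intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def geo_search(index, term, query_box):
    """Documents matching the term whose footprint intersects the query region."""
    return sorted(d for d in index.get(term.lower(), ())
                  if d in FOOTPRINTS and boxes_overlap(FOOTPRINTS[d], query_box))

docs = {1: "hotels near the river", 2: "hotels in the city centre"}
index = build_text_index(docs)
print(geo_search(index, "hotels", (51.0, -1.0, 52.0, 1.0)))  # [1]
```

Filtering by footprint rather than by place-name strings is what lets a query region match documents that mention included or nearby places under different names.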
Building a domain-specific document collection for evaluating metadata effects on information retrieval
This paper describes the development of a structured document collection containing user-generated text and numerical metadata for exploring the exploitation of metadata in information retrieval (IR). The collection consists of more than 61,000 documents extracted from YouTube video pages on basketball in general and NBA (National Basketball Association) in particular, together with a set of 40 topics and their relevance judgements. In addition, a collection of nearly 250,000 user profiles related to the NBA collection is available. Several baseline IR experiments report the effect of using video-associated metadata on retrieval effectiveness. The results
surprisingly show that searching only the video titles performs significantly better than searching additional metadata text fields of the videos, such as the tags or the description.
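The kind of fielded baseline experiment this collection supports can be sketched as scoring documents against either the title field alone or the title plus metadata fields. The field names mirror the YouTube-derived fields mentioned above, but the documents and the overlap-count scoring are toy illustrations, not the paper's retrieval model.

```python
# Toy fielded search: rank documents by a simple query-term overlap count,
# summed over a chosen subset of fields (title only vs. title + metadata).

def score(query, text):
    """Count how many tokens of `text` appear in the query (toy measure)."""
    q = set(query.lower().split())
    return sum(1 for tok in text.lower().split() if tok in q)

def search(docs, query, fields):
    """Rank docs by summed toy scores over the chosen fields."""
    ranked = sorted(docs, key=lambda d: -sum(score(query, d.get(f, "")) for f in fields))
    return [d["id"] for d in ranked]

docs = [
    {"id": "v1", "title": "nba finals highlights", "tags": "basketball"},
    {"id": "v2", "title": "cooking pasta",
     "tags": "nba highlights nba highlights basketball"},
]
print(search(docs, "nba highlights", ["title"]))          # ['v1', 'v2']
print(search(docs, "nba highlights", ["title", "tags"]))  # ['v2', 'v1']
```

The second query shows how noisy metadata fields can flip a ranking, which is one mechanism consistent with the finding that title-only search outperformed searching the extra fields.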
Index ordering by query-independent measures
Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming.
A solution to this problem is to search only a limited portion of the collection at query time, speeding up the retrieval process while also limiting the loss in retrieval efficacy (in terms of accuracy of results). We achieve this by first identifying the most “important” documents within the collection and sorting the documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient but also limits the loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of the number of postings examined, without significant loss of effectiveness, based on several measures of importance used in isolation and in combination. Our results point to several ways in which the computational cost of searching large collections of documents can be significantly reduced.
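The ordering scheme above can be sketched in a few lines: each posting list is sorted by a query-independent importance score, and query processing examines only a fixed budget of postings per list. The importance values and scoring below are made up for illustration; the paper evaluates several real importance measures.

```python
# Sketch of importance-ordered postings with early termination: sort each
# inverted list by a static, query-independent score and scan only the
# first `postings_budget` entries per query term.

from collections import defaultdict

IMPORTANCE = {1: 0.9, 2: 0.2, 3: 0.7, 4: 0.4}  # hypothetical static scores

def build_ordered_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for tok in set(text.lower().split()):
            index[tok].append(doc_id)
    # Sort each posting list by descending static importance.
    for postings in index.values():
        postings.sort(key=lambda d: -IMPORTANCE[d])
    return index

def search(index, terms, postings_budget=2):
    """Score docs using only the top `postings_budget` postings per term."""
    scores = defaultdict(float)
    for term in terms:
        for doc_id in index.get(term, [])[:postings_budget]:
            scores[doc_id] += IMPORTANCE[doc_id]  # toy scoring
    return sorted(scores, key=lambda d: -scores[d])

docs = {
    1: "terabyte scale retrieval",
    2: "retrieval experiments",
    3: "retrieval at terabyte scale",
    4: "scale matters",
}
index = build_ordered_index(docs)
print(search(index, ["retrieval", "scale"]))  # [1, 3]
```

Because low-importance documents sit at the tails of the lists, truncating each scan discards exactly the postings least likely to affect the top of the ranking, which is why effectiveness degrades so little.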
Using NLP to build the hypertextual network of a back-of-the-book index
Relying on the idea that back-of-the-book indexes are traditional devices for
navigation through large documents, we have developed a method to build a
hypertextual network that helps navigation within a document. Building such a
hypertextual network requires selecting a list of descriptors, identifying the
relevant text segments to associate with each descriptor and finally ranking
the descriptors and reference segments by relevance order. We propose a
specific document segmentation method and a relevance measure for information
ranking. The algorithms are tested on 4 corpora (of different types and
domains) without human intervention or any semantic knowledge.
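The pipeline described, segment the document, then for each descriptor rank the segments that mention it, can be sketched with a naive fixed-size segmenter and a frequency-density relevance measure. Both are illustrative stand-ins: the paper proposes its own segmentation method and relevance measure.

```python
# Toy back-of-the-book index pipeline: segment a document, then rank each
# descriptor's segments by descriptor frequency normalised by segment length.

def segment(text, size=8):
    """Naive fixed-size segmentation (the paper uses a dedicated method)."""
    tokens = text.lower().split()
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def index_descriptors(segments, descriptors):
    """descriptor -> segment indexes, ranked by the toy relevance score."""
    result = {}
    for d in descriptors:
        scored = [(seg_id, seg.count(d) / len(seg))
                  for seg_id, seg in enumerate(segments) if d in seg]
        result[d] = [seg_id for seg_id, _ in sorted(scored, key=lambda x: -x[1])]
    return result

text = ("navigation through large documents needs an index "
        "an index maps descriptors to relevant segments segments are "
        "ranked for navigation")
segs = segment(text)
links = index_descriptors(segs, ["index", "navigation"])
print(links["navigation"])  # [2, 0]
```

Each entry in `links` is the hypertextual structure in miniature: a descriptor pointing at its reference segments in relevance order.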