36,324 research outputs found
The State-of-the-arts in Focused Search
The continuous influx of various text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for relevant results to a userās topic of interest has gone beyond search for domain or type specific documents to more focused result (e.g. document fragments or answers to a query). The introduction of XML provides a format standard for data representation, storage, and exchange. It helps focused search to be carried out at different granularities of a structured document with XML markups. This report aims at reviewing the state-of-the-arts in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It is concluded with highlight of open problems
Deeper Text Understanding for IR with Contextual Neural Language Modeling
Neural networks provide new possibilities to automatically learn complex
language patterns and query-document relations. Neural IR models have achieved
promising results in learning query-document relevance patterns, but few
explorations have been done on understanding the text content of a query or a
document. This paper studies leveraging a recently-proposed contextual neural
language model, BERT, to provide deeper text understanding for IR. Experimental
results demonstrate that the contextual text representations from BERT are more
effective than traditional word embeddings. Compared to bag-of-words retrieval
models, the contextual language model can better leverage language structures,
bringing large improvements on queries written in natural languages. Combining
the text understanding ability with search knowledge leads to an enhanced
pre-trained BERT model that can benefit related search tasks where training
data are limited.Comment: In proceedings of SIGIR 201
Table Search Using a Deep Contextualized Language Model
Pretrained contextualized language models such as BERT have achieved
impressive results on various natural language processing benchmarks.
Benefiting from multiple pretraining tasks and large scale training corpora,
pretrained models can capture complex syntactic word relations. In this paper,
we use the deep contextualized language model BERT for the task of ad hoc table
retrieval. We investigate how to encode table content considering the table
structure and input length limit of BERT. We also propose an approach that
incorporates features from prior literature on table retrieval and jointly
trains them with BERT. In experiments on public datasets, we show that our best
approach can outperform the previous state-of-the-art method and BERT baselines
with a large margin under different evaluation metrics.Comment: Accepted at SIGIR 2020 (Long
Retrieve-and-Read: Multi-task Learning of Information Retrieval and Reading Comprehension
This study considers the task of machine reading at scale (MRS) wherein,
given a question, a system first performs the information retrieval (IR) task
of finding relevant passages in a knowledge source and then carries out the
reading comprehension (RC) task of extracting an answer span from the passages.
Previous MRS studies, in which the IR component was trained without considering
answer spans, struggled to accurately find a small number of relevant passages
from a large set of passages. In this paper, we propose a simple and effective
approach that incorporates the IR and RC tasks by using supervised multi-task
learning in order that the IR component can be trained by considering answer
spans. Experimental results on the standard benchmark, answering SQuAD
questions using the full Wikipedia as the knowledge source, showed that our
model achieved state-of-the-art performance. Moreover, we thoroughly evaluated
the individual contributions of our model components with our new Japanese
dataset and SQuAD. The results showed significant improvements in the IR task
and provided a new perspective on IR for RC: it is effective to teach which
part of the passage answers the question rather than to give only a relevance
score to the whole passage.Comment: 10 pages, 6 figure. Accepted as a full paper at CIKM 201
Training Curricula for Open Domain Answer Re-Ranking
In precision-oriented tasks like answer ranking, it is more important to rank
many relevant answers highly than to retrieve all relevant answers. It follows
that a good ranking strategy would be to learn how to identify the easiest
correct answers first (i.e., assign a high ranking score to answers that have
characteristics that usually indicate relevance, and a low ranking score to
those with characteristics that do not), before incorporating more complex
logic to handle difficult cases (e.g., semantic matching or reasoning). In this
work, we apply this idea to the training of neural answer rankers using
curriculum learning. We propose several heuristics to estimate the difficulty
of a given training sample. We show that the proposed heuristics can be used to
build a training curriculum that down-weights difficult samples early in the
training process. As the training process progresses, our approach gradually
shifts to weighting all samples equally, regardless of difficulty. We present a
comprehensive evaluation of our proposed idea on three answer ranking datasets.
Results show that our approach leads to superior performance of two leading
neural ranking architectures, namely BERT and ConvKNRM, using both pointwise
and pairwise losses. When applied to a BERT-based ranker, our method yields up
to a 4% improvement in MRR and a 9% improvement in P@1 (compared to the model
trained without a curriculum). This results in models that can achieve
comparable performance to more expensive state-of-the-art techniques.Comment: Accepted at SIGIR 2020 (long
Exploring accumulative query expansion for relevance feedback
For the participation of Dublin City University (DCU) in the Relevance Feedback (RF) track of INEX 2010, we investigated the relation between the length of relevant text passages and the number of RF terms. In our experiments, relevant passages are segmented into non-overlapping windows of xed length which are sorted by similarity with the query. In each retrieval iteration, we extend the current query with the most frequent terms extracted from these word windows. The number of feedback terms corresponds to a constant number, a number proportional to the length of relevant passages, and a number inversely proportional to the length of relevant passages, respectively. Retrieval experiments show a signicant increase in MAP for INEX 2008 training data and improved precisions at early recall levels for the 2010 topics as compared to the baseline Rocchio feedback
- ā¦