PACRR: A Position-Aware Neural IR Model for Relevance Matching
In order to adopt deep learning for information retrieval, models are needed
that can capture all relevant information required to assess the relevance of a
document to a given user query. While previous works have successfully captured
unigram term matches, how to fully employ position-dependent information such
as proximity and term dependencies has been insufficiently explored. In this
work, we propose a novel neural IR model named PACRR aiming at better modeling
position-dependent interactions between a query and a document. Extensive
experiments on six years' TREC Web Track data confirm that the proposed model
yields better results under multiple benchmarks. Comment: To appear in EMNLP201
Index ordering by query-independent measures
Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming.
A solution to this problem is to search only a limited portion of the collection at query time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). We achieve this by first identifying the most “important” documents within the collection, and then sorting documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation, and in combination. Our results point to several ways in which the computation cost of searching large collections of documents can be significantly reduced.
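The idea of importance-ordered postings lists can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_index` and `search`, the `max_postings` cutoff, and the simple term-match scoring are all assumptions made for illustration, and the importance scores are taken as given rather than computed from any particular measure.

```python
from collections import defaultdict


def build_index(docs, importance):
    """Build an inverted index whose postings lists are sorted by importance.

    docs: {doc_id: text}
    importance: {doc_id: float}, a query-independent score (assumed given).
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    # Most important documents come first in every postings list.
    for term in index:
        index[term].sort(key=lambda d: importance[d], reverse=True)
    return dict(index)


def search(index, query, max_postings):
    """Score documents by number of matching query terms,
    reading at most max_postings entries from each postings list.

    Returns (ranked doc_ids, number of postings examined).
    """
    scores = defaultdict(int)
    examined = 0
    for term in query.lower().split():
        postings = index.get(term, [])[:max_postings]  # truncate the list
        examined += len(postings)
        for doc_id in postings:
            scores[doc_id] += 1
    # Higher score first; ties broken by doc_id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked, examined
```

Lowering `max_postings` reduces the number of postings examined per query term; because each list is sorted by importance, the documents that are skipped are the ones judged least important.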
Detecting missing content queries in an SMS-Based HIV/AIDS FAQ retrieval system
Automated Frequently Asked Question (FAQ) answering systems use pre-stored sets of question-answer pairs as an information source to answer natural language questions posed by the users. The main problem with this kind of information source is that there is no guarantee that there will be a relevant question-answer pair for all user queries. In this paper, we propose to deploy a binary classifier in an existing SMS-Based HIV/AIDS FAQ retrieval system to detect user queries that do not have a relevant question-answer pair in the FAQ document collection. Before deploying such a classifier, we first evaluate different feature sets for training in order to determine the sets of features that can build a model that yields the best classification accuracy. We carry out our evaluation using seven different feature sets generated from a query log before and after retrieval by the FAQ retrieval system. Our results suggest that combining different feature sets markedly improves the classification accuracy.
Content-Based Weak Supervision for Ad-Hoc Re-Ranking
One challenge with neural ranking is the need for a large amount of
manually-labeled relevance judgments for training. In contrast with prior work,
we examine the use of weak supervision sources for training that yield pseudo
query-document pairs that already exhibit relevance (e.g., newswire
headline-content pairs and encyclopedic heading-paragraph pairs). We also
propose filtering techniques to eliminate training samples that are too far out
of domain using two techniques: a heuristic-based approach and a novel supervised
filter that re-purposes a neural ranker. Using several leading neural ranking
architectures and multiple weak supervision datasets, we show that these
sources of training pairs are effective on their own (outperforming prior weak
supervision techniques), and that filtering can further improve performance. Comment: SIGIR 2019 (short paper)
Modeling Temporal Evidence from External Collections
Newsworthy events are broadcast through multiple mediums and prompt the
crowds to produce comments on social media. In this paper, we propose to
leverage these behavioral dynamics to estimate the most relevant time periods
for an event (i.e., query). Recent advances have shown how to improve the
estimation of the temporal relevance of such topics. In this approach, we build
on two major novelties. First, we mine temporal evidence from hundreds of
external sources into topic-based external collections to improve the
robustness of the detection of relevant time periods. Second, we propose a
formal retrieval model that generalizes the use of the temporal dimension
across different aspects of the retrieval process. In particular, we show that
temporal evidence of external collections can be used to (i) infer a topic's
temporal relevance, (ii) select the query expansion terms, and (iii) re-rank
the final results for improved precision. Experiments with TREC Microblog
collections show that the proposed time-aware retrieval model makes an
effective and extensive use of the temporal dimension to improve search results
over the most recent temporal models. Interestingly, we observe a strong
correlation between precision and the temporal distribution of retrieved and
relevant documents. Comment: To appear in WSDM 201