One-Shot Labeling for Automatic Relevance Estimation
Dealing with unjudged documents ("holes") in relevance assessments is a
perennial problem when evaluating search systems with offline experiments.
Holes can reduce the apparent effectiveness of retrieval systems during
evaluation and introduce biases in models trained with incomplete data. In this
work, we explore whether large language models can help us fill such holes to
improve offline evaluations. We examine an extreme, albeit common, evaluation
setting wherein only a single known relevant document per query is available
for evaluation. We then explore various approaches for predicting the relevance
of unjudged documents with respect to a query and the known relevant document,
including nearest neighbor, supervised, and prompting techniques. We find that
although the predictions of these One-Shot Labelers (1SL) frequently disagree
with human assessments, the labels they produce yield a far more reliable
ranking of systems than the single labels do alone. Specifically, the strongest
approaches can consistently reach system ranking correlations of over 0.86 with
the full rankings over a variety of measures. Meanwhile, the approach
substantially increases the reliability of t-tests by filling holes in
relevance assessments, giving researchers more confidence in results they find
to be significant. Alongside this work, we release an easy-to-use software
package to enable the use of 1SL for evaluation of other ad-hoc collections or
systems.
Comment: SIGIR 2023
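To make the idea concrete, below is a minimal sketch of the nearest-neighbor flavor of a one-shot labeler: an unjudged document is labeled relevant when its embedding lies close to that of the query's single known relevant document. The encoder name and threshold are illustrative assumptions; this is not the API of the authors' released package.

```python
# Minimal sketch of a nearest-neighbor One-Shot Labeler (1SL).
# The model name and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder works

def one_shot_label(known_relevant_doc: str, unjudged_doc: str,
                   threshold: float = 0.7) -> int:
    """Return 1 (relevant) if the unjudged document is sufficiently
    similar to the query's single known relevant document, else 0."""
    emb = model.encode([known_relevant_doc, unjudged_doc],
                       convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    return int(similarity >= threshold)
```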
CEDR: Contextualized Embeddings for Document Ranking
Although considerable attention has been given to neural ranking
architectures recently, far less attention has been paid to the term
representations that are used as input to these models. In this work, we
investigate how two pretrained contextualized language models (ELMo and BERT)
can be utilized for ad-hoc document ranking. Through experiments on TREC
benchmarks, we find that several existing neural ranking architectures can
benefit from the additional context provided by contextualized language models.
Furthermore, we propose a joint approach that incorporates BERT's
classification vector into existing neural models and show that it outperforms
state-of-the-art ad-hoc ranking baselines. We call this joint approach CEDR
(Contextualized Embeddings for Document Ranking). We also address practical
challenges in using these models for ranking, including the maximum input
length imposed by BERT and runtime performance impacts of contextualized
language models.
Comment: Appeared in SIGIR 2019, 4 pages
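The joint approach can be pictured with the following hedged sketch. The inner ranking model here is a stand-in linear layer, not one of the neural architectures studied in the paper; the point is how a score from the contextualized term representations is combined with a score from BERT's classification vector.

```python
# Sketch of a CEDR-style joint ranker: BERT's [CLS] classification
# vector is scored alongside the contextualized term representations,
# and the two signals are combined. The term scorer is a placeholder
# for an existing neural ranking architecture.
import torch
import torch.nn as nn
from transformers import BertModel

class CedrStyleRanker(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.cls_scorer = nn.Linear(hidden, 1)   # scores the [CLS] vector
        self.term_scorer = nn.Linear(hidden, 1)  # stand-in neural ranker
        self.combine = nn.Linear(2, 1)           # joint combination

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]    # classification vector
        term_vecs = out.last_hidden_state[:, 1:] # contextualized term reps
        cls_score = self.cls_scorer(cls_vec)
        term_score = self.term_scorer(term_vecs.mean(dim=1))
        return self.combine(torch.cat([cls_score, term_score], dim=1))
```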
Adapting Learned Sparse Retrieval for Long Documents
Learned sparse retrieval (LSR) is a family of neural retrieval methods that
transform queries and documents into sparse weight vectors aligned with a
vocabulary. While LSR approaches like Splade work well for short passages, it
is unclear how well they handle longer documents. We investigate existing
aggregation approaches for adapting LSR to longer documents and find that
proximal scoring is crucial for LSR to handle long documents. To leverage this
property, we propose two adaptations of the Sequential Dependence Model (SDM)
to LSR: ExactSDM and SoftSDM. ExactSDM assumes only exact query term
dependence, while SoftSDM uses potential functions that model the dependence of
query terms and their expansion terms (i.e., terms identified using a
transformer's masked language modeling head).
Experiments on the MSMARCO Document and TREC Robust04 datasets demonstrate
that both ExactSDM and SoftSDM outperform existing LSR aggregation approaches
for different document length constraints. Surprisingly, SoftSDM does not
provide any performance benefits over ExactSDM. This suggests that soft
proximity matching is not necessary for modeling term dependence in LSR.
Overall, this study provides insights into handling long documents with LSR,
proposing adaptations that improve its performance.
Comment: SIGIR 2023
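As a rough illustration of the SDM-style scoring being adapted, the sketch below mixes a unigram score over LSR term weights with exact ordered-bigram and windowed co-occurrence counts, in the spirit of ExactSDM. The interpolation weights and window size are conventional SDM defaults assumed here, not the paper's trained parameters.

```python
# Illustrative ExactSDM-flavored score over learned sparse weights.
# Lambdas and window size are assumed SDM-style defaults.
from typing import Dict, List

def exact_sdm_score(query_terms: List[str],
                    doc_weights: Dict[str, float],  # LSR term -> weight
                    doc_tokens: List[str],
                    lambdas=(0.8, 0.1, 0.1),
                    window: int = 8) -> float:
    l_uni, l_ord, l_unord = lambdas

    # Unigram component: the document's sparse weights for query terms.
    unigram = sum(doc_weights.get(t, 0.0) for t in query_terms)

    # Ordered component: adjacent query-term bigrams appearing in order.
    ordered = 0
    for a, b in zip(query_terms, query_terms[1:]):
        ordered += sum(1 for i in range(len(doc_tokens) - 1)
                       if doc_tokens[i] == a and doc_tokens[i + 1] == b)

    # Unordered component: query-term pairs co-occurring within a window.
    unordered = 0
    for a, b in zip(query_terms, query_terms[1:]):
        for i, tok in enumerate(doc_tokens):
            if tok == a and b in doc_tokens[i + 1:i + window]:
                unordered += 1

    return l_uni * unigram + l_ord * ordered + l_unord * unordered
```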
A Unified Framework for Learned Sparse Retrieval
Learned sparse retrieval (LSR) is a family of first-stage retrieval methods
that are trained to generate sparse lexical representations of queries and
documents for use with an inverted index. Many LSR methods have been recently
introduced, with Splade models achieving state-of-the-art performance on
MSMarco. Despite similarities in their model architectures, many LSR methods
show substantial differences in effectiveness and efficiency. Differences in
the experimental setups and configurations used make it difficult to compare
the methods and derive insights. In this work, we analyze existing LSR methods
and identify key components to establish an LSR framework that unifies all LSR
methods under the same perspective. We then reproduce all prominent methods
using a common codebase and re-train them in the same environment, which allows
us to quantify how components of the framework affect effectiveness and
efficiency. We find that (1) including document term weighting is most
important for a method's effectiveness, (2) including query weighting has a
small positive impact, and (3) document expansion and query expansion have a
cancellation effect. As a result, we show how removing query expansion from a
state-of-the-art model can reduce latency significantly while maintaining
effectiveness on MSMarco and TripClick benchmarks. Our code is publicly
available at https://github.com/thongnt99/learned-sparse-retrieval
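The unified view rests on the scoring interface all LSR methods share: queries and documents become sparse term-to-weight maps, and relevance is their dot product over the vocabulary. Query/document weighting sets each term's value, and expansion adds terms that do not occur in the text. A minimal sketch (illustrative, not the released codebase):

```python
# Sketch of the common LSR scoring interface: the dot product of two
# sparse lexical vectors represented as term -> weight maps.
from typing import Dict

def lsr_score(query_vec: Dict[str, float],
              doc_vec: Dict[str, float]) -> float:
    """Dot product of two sparse lexical vectors."""
    # Iterate over the smaller map for efficiency.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

# Removing query expansion means query_vec holds only the original query
# terms (possibly reweighted), so fewer posting lists are traversed and
# retrieval latency drops.
q = {"sparse": 1.4, "retrieval": 1.1}  # no expansion terms
d = {"sparse": 0.9, "retrieval": 0.7, "index": 0.5}
print(lsr_score(q, d))  # 1.4*0.9 + 1.1*0.7 = 2.03
```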
Re-Rank - Expand - Repeat: Adaptive Query Expansion for Document Retrieval Using Words and Entities
Sparse and dense pseudo-relevance feedback (PRF) approaches perform poorly on
challenging queries due to low precision in first-pass retrieval. However,
recent advances in neural language models (NLMs) can re-rank relevant documents
to top ranks, even when few are in the re-ranking pool. This paper first
addresses the problem of poor pseudo-relevance feedback by simply applying
re-ranking prior to query expansion and re-executing the expanded query. We find
this change alone can improve the retrieval effectiveness of sparse and dense
PRF approaches by 5-8%. Going further, we propose a new expansion model, Latent
Entity Expansion (LEE), a fine-grained word- and entity-based relevance
model that incorporates localized features. Finally, we add an "adaptive"
component to the retrieval process, which iteratively refines the re-ranking
pool during scoring using the expansion model, i.e. we "re-rank - expand -
repeat". Using LEE, we achieve (to our knowledge) the best NDCG, MAP and R@1000
results on the TREC Robust 2004 and CODEC ad-hoc document datasets,
demonstrating a significant advancement in expansion effectiveness.
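A pseudocode-level sketch of the loop reads as follows. Here retrieve, rerank, and build_expansion are hypothetical stand-ins for a first-stage retriever, a neural re-ranker, and an expansion model such as LEE; the iteration count and pool depths are illustrative, not the paper's settings.

```python
# Sketch of the "re-rank - expand - repeat" loop. All callables and
# depths are hypothetical stand-ins for the components named above.
def adaptive_retrieval(query, retrieve, rerank, build_expansion,
                       rounds: int = 2, depth: int = 100):
    # Re-rank the first-pass results before any expansion.
    pool = rerank(query, retrieve(query, k=depth))
    for _ in range(rounds):
        # Build the expansion model from the top re-ranked documents.
        expanded = build_expansion(query, pool[:20])
        # Re-execute the expanded query, then re-rank the new pool.
        pool = rerank(query, retrieve(expanded, k=depth))
    return pool
```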
Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
Deep pretrained transformer networks are effective at various ranking tasks,
such as question answering and ad-hoc document ranking. However, their
computational expense makes them cost-prohibitive in practice. Our proposed
approach, called PreTTR (Precomputing Transformer Term Representations),
considerably reduces the query-time latency of deep transformer networks (up to
a 42x speedup on web document ranking), making these networks more practical to
use in a real-time ranking scenario. Specifically, we precompute part of the
document term representations at indexing time (without a query), and merge
them with the query representation at query time to compute the final ranking
score. Due to the large size of the token representations, we also propose an
effective approach to reduce the storage requirement by training a compression
layer to match attention scores. Our compression technique reduces the
required storage by up to 95% and can be applied without substantial
degradation in ranking performance.
Comment: Accepted at SIGIR 2020 (long paper)
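The split of computation can be sketched as below, with hypothetical callables standing in for the lower transformer layers, the trained compression/decompression layers, and the remaining upper layers; it illustrates the idea rather than the paper's exact architecture.

```python
# Sketch of PreTTR-style split computation: the document passes the
# lower transformer layers once, offline; at query time only the query
# passes those layers, and the upper layers score the merged sequence.
# lower_layers, compress, decompress, upper_layers, and score_head are
# hypothetical stand-ins.
import torch

def index_time(doc_tokens, lower_layers, compress):
    """Run once per document at indexing time (no query needed);
    store the compressed term representations."""
    doc_reps = lower_layers(doc_tokens)
    return compress(doc_reps)  # trained compression layer

def query_time(query_tokens, stored_doc_reps, lower_layers,
               decompress, upper_layers, score_head):
    """Cheap per-query work: only the query passes the lower layers."""
    query_reps = lower_layers(query_tokens)
    joint = torch.cat([query_reps, decompress(stored_doc_reps)], dim=1)
    return score_head(upper_layers(joint))  # final ranking score
```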
RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses
Self-reported diagnosis statements have been widely employed in studying
language related to mental health in social media. However, existing research
has largely ignored the temporality of mental health diagnoses. In this work,
we introduce RSDD-Time: a new dataset of 598 manually annotated self-reported
depression diagnosis posts from Reddit that include temporal information about
the diagnosis. Annotations include whether a mental health condition is present
and how recently the diagnosis happened. Furthermore, we include exact temporal
spans that relate to the date of diagnosis. This information is valuable for
various computational methods to examine mental health through social media
because one's mental health state is not static. We also test several baseline
classification and extraction approaches, which suggest that extracting
temporal information from self-reported diagnosis statements is challenging.
Comment: 6 pages, accepted for publication at the CLPsych workshop at
NAACL-HLT 2018
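As a purely hypothetical illustration of the kind of information each annotation carries (the field names are invented for illustration, not the dataset's actual schema):

```python
# Invented example record mirroring the annotation types described
# above: condition presence, diagnosis recency, and the temporal span.
example_annotation = {
    "post_text": "I was diagnosed with depression two years ago ...",
    "condition_present": True,          # is the condition current?
    "diagnosis_recency": "more_than_a_year",
    "time_span": (31, 44),              # character span of "two years ago"
}
```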