How Different are Pre-trained Transformers for Text Ranking?
In recent years, large pre-trained transformers have led to substantial gains
in performance over traditional retrieval models and feedback approaches.
However, these results are primarily based on the MS MARCO/TREC Deep Learning
Track setup, which is quite particular, and our understanding of why and how
these models work better is fragmented at best. We analyze effective
BERT-based cross-encoders versus traditional BM25 ranking for the passage
retrieval task, where the largest gains have been observed, and investigate
two main questions. On the one hand, what is similar? To what extent does the
neural ranker already encompass the capacity of traditional rankers? Is the
gain in performance due to a better ranking of the same documents
(prioritizing precision)? On the other hand, what is different? Can it
effectively retrieve documents missed by traditional systems (prioritizing
recall)? We discover substantial differences in the notion of relevance,
identifying strengths and weaknesses of BERT that may inspire research on
future improvements. Our results contribute to our understanding of
(black-box) neural rankers relative to (well-understood) traditional rankers
and help characterize the particular
experimental setting of MS MARCO-based test collections.
Comment: ECIR 2022
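To make the contrast concrete, here is a minimal sketch of scoring the same
passages with both ranker families; it assumes the rank_bm25 and
sentence-transformers packages and a public MS MARCO cross-encoder
checkpoint, with toy passages standing in for the actual collection.

    # Score identical passages with BM25 and a BERT cross-encoder to compare
    # their rankings (illustrative sketch, not the paper's experimental code).
    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder

    passages = [
        "BM25 is a lexical ranking function based on term and document frequencies.",
        "Cross-encoders jointly encode the query and passage with BERT.",
        "Relevance feedback expands the query with terms from top-ranked documents.",
    ]
    query = "how do neural rankers differ from lexical ranking"

    # Traditional ranker: BM25 over whitespace-tokenized passages.
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    bm25_scores = bm25.get_scores(query.lower().split())

    # Neural ranker: a BERT cross-encoder trained on MS MARCO.
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ce_scores = ce.predict([(query, p) for p in passages])

    # Comparing the two rankings passage by passage surfaces where the
    # rankers agree (better ordering of shared hits) and where the neural
    # model scores passages highly that BM25 misses.
    for p, b, c in zip(passages, bm25_scores, ce_scores):
        print(f"BM25={b:.2f}  CE={c:.2f}  {p[:55]}")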
How Well Do Text Embedding Models Understand Syntax?
Text embedding models have significantly contributed to advancements in
natural language processing by adeptly capturing the semantic properties of
textual data. However, the ability of these models to generalize across a
wide range of syntactic contexts remains under-explored. In this paper, we
first develop an evaluation set, named SR, to scrutinize the
syntax-understanding capability of text embedding models from two crucial
syntactic aspects: Structural heuristics and Relational understanding among
concepts, as revealed by the performance gaps reported in previous studies.
Our findings reveal that existing text embedding models have not sufficiently
addressed these syntactic understanding challenges, and this ineffectiveness
becomes even more apparent when the models are evaluated against existing
benchmark datasets. Furthermore, we conduct a rigorous analysis to unearth
the factors behind these limitations and examine why previous evaluations
failed to detect them. Lastly, we propose strategies to augment the
generalization ability of text embedding models in diverse syntactic
scenarios. This study highlights the hurdles associated with syntactic
generalization and provides pragmatic guidance for boosting model performance
across varied syntactic contexts.
Comment: Accepted to EMNLP-Findings 2023; datasets and code are released
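As a hedged illustration of this kind of syntactic probe (not the SR
evaluation set itself), one can check whether an off-the-shelf embedding
model rates a structure-changing paraphrase above a role-swapping edit; the
sentence-transformers package and the all-MiniLM-L6-v2 checkpoint are
assumptions of this sketch.

    # Probe an embedding model's sensitivity to syntax: a passive-voice
    # paraphrase keeps the meaning, while swapping subject and object
    # reverses it despite reusing the same words.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    anchor = "The dog chased the cat."
    paraphrase = "The cat was chased by the dog."  # same relation, new structure
    role_swap = "The cat chased the dog."          # same words, reversed roles

    emb = model.encode([anchor, paraphrase, role_swap], convert_to_tensor=True)
    print("anchor vs paraphrase:", util.cos_sim(emb[0], emb[1]).item())
    print("anchor vs role swap: ", util.cos_sim(emb[0], emb[2]).item())
    # A syntax-aware model should score the paraphrase higher; a model that
    # leans on lexical overlap alone may prefer the role-swapped sentence.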
ABNIRML: Analyzing the Behavior of Neural IR Models
Numerous studies have demonstrated the effectiveness of pretrained
contextualized language models such as BERT and T5 for ad-hoc search. However,
it is not well-understood why these methods are so effective, what makes some
variants more effective than others, and what pitfalls they may have. We
present a new comprehensive framework for Analyzing the Behavior of Neural IR
ModeLs (ABNIRML), which includes new types of diagnostic tests that allow us to
probe several characteristics---such as sensitivity to word order---that are
not addressed by previous techniques. To demonstrate the value of the
framework, we conduct an extensive empirical study that yields insights into
the factors contributing to neural models' gains and identifies potential
unintended biases the models exhibit. We find evidence that recent neural
ranking models have fundamentally different characteristics from prior
ranking models. For instance, these models can be highly influenced by
altered document word order, sentence order, and inflectional endings. They
can also exhibit unexpected behaviors when additional content is added to
documents, or when documents are expressed with different levels of fluency
or formality. We find that these differences can depend on the architecture
and not just the underlying language model.
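A minimal sketch of one such diagnostic in the spirit of the framework's
word-order test (not ABNIRML's actual code); it assumes sentence-transformers
and the same public MS MARCO cross-encoder as above.

    # Word-order diagnostic: does shuffling a document's words change the
    # relevance score a neural ranker assigns? A bag-of-words model such as
    # BM25 would be unaffected by this perturbation.
    import random
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "effects of caffeine on sleep"
    doc = ("Caffeine consumed late in the day can delay sleep onset "
           "and reduce total sleep time.")

    random.seed(0)
    words = doc.split()
    shuffled = " ".join(random.sample(words, len(words)))

    scores = model.predict([(query, doc), (query, shuffled)])
    print("original:", scores[0], "shuffled:", scores[1])
    # A large score gap is evidence the ranker uses word order, one of the
    # characteristics the diagnostic tests are designed to isolate.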
Towards Debiasing Fact Verification Models
Fact verification requires validating a claim in the context of evidence. We
show, however, that in the popular FEVER dataset this might not necessarily be
the case. Claim-only classifiers perform competitively with top evidence-aware
models. In this paper, we investigate the cause of this phenomenon, identifying
strong cues for predicting labels solely based on the claim, without
considering any evidence. We create an evaluation set that avoids those
idiosyncrasies. The performance of FEVER-trained models significantly drops
when evaluated on this test set. Therefore, we introduce a regularization
method which alleviates the effect of bias in the training data, obtaining
improvements on the newly created test set. This work is a step towards a more
sound evaluation of reasoning capabilities in fact verification models.
Comment: EMNLP-IJCNLP 2019
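A hedged sketch of the claim-only baseline idea, substituting a simple
bag-of-words classifier for the paper's stronger models; the claims and
labels below are invented placeholders, not FEVER data, and scikit-learn is
assumed.

    # Claim-only baseline: predict SUPPORTS / REFUTES / NOT ENOUGH INFO from
    # the claim text alone, never looking at evidence (toy sketch).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_claims = [
        "Paris is the capital of France.",        # hypothetical examples,
        "The moon is not made of rock.",          # not drawn from FEVER
        "Some novels were written in the 1800s.",
    ]
    train_labels = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_claims, train_labels)

    # If a classifier like this approaches evidence-aware accuracy on the
    # real dataset, the claims carry label-predictive cues (e.g. negation
    # phrases), which is exactly the bias the paper regularizes against.
    print(clf.predict(["The earth is not round."]))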
Explainable Information Retrieval: A Survey
Explainable information retrieval is an emerging research area that aims to
make information retrieval systems transparent and trustworthy. Given the
increasing
use of complex machine learning models in search systems, explainability is
essential in building and auditing responsible information retrieval models.
This survey fills a vital gap in the otherwise topically diverse literature of
explainable information retrieval. It categorizes and discusses recent
explainability methods developed for different application domains in
information retrieval, providing a common framework and unifying perspectives.
In addition, it reflects on the common concern of evaluating explanations and
highlights open challenges and opportunities.
Comment: 35 pages, 10 figures. Under review