Dense Text Retrieval based on Pretrained Language Models: A Survey
Text retrieval is a long-standing research topic in information seeking,
where a system is required to return relevant information resources in
response to users' queries in natural language. From classic retrieval
methods to learning-based ranking functions, the underlying retrieval models
have continually evolved with ongoing technical innovation. To design effective
retrieval models, a key point lies in how to learn the text representation and
model the relevance matching. The recent success of pretrained language models
(PLMs) sheds light on developing more capable text retrieval approaches by
leveraging the excellent modeling capacity of PLMs. With powerful PLMs, we can
effectively learn the representations of queries and texts in the latent
representation space, and further construct the semantic matching function
between the dense vectors for relevance modeling. Such a retrieval approach is
referred to as dense retrieval, since it employs dense vectors (a.k.a.,
embeddings) to represent the texts. Considering the rapid progress on dense
retrieval, in this survey, we systematically review the recent advances on
PLM-based dense retrieval. Different from previous surveys on dense retrieval,
we take a new perspective, organizing the related work around four major
aspects (architecture, training, indexing, and integration) and summarizing
the mainstream techniques for each. We thoroughly survey the literature, and
include 300+ related reference papers on dense retrieval. To support our
survey, we create a website providing useful resources, and release a code
repository and toolkit for implementing dense retrieval models. This survey
aims to provide a comprehensive, practical reference focused on the major
progress in dense text retrieval.
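The core operation the abstract describes, embedding queries and documents into a shared latent space and ranking by similarity between dense vectors, can be sketched as follows. This is an illustrative toy, not the survey's toolkit: the hash-based `embed` is a stand-in for a real PLM encoder such as BERT, and all names are hypothetical.

```python
import numpy as np

def embed(text, dim=64):
    # Toy stand-in for a PLM encoder: each token contributes a
    # deterministic pseudo-random vector. A real dense retriever
    # would use a trained encoder (e.g., BERT) here instead.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))
        vec += rng.standard_normal(dim)
    return vec / (np.linalg.norm(vec) + 1e-9)

def dense_retrieve(query, docs, k=2):
    # Relevance is modeled as the inner product between the query
    # embedding and each document embedding (cosine similarity,
    # since the vectors are L2-normalized).
    doc_mat = np.stack([embed(d) for d in docs])
    scores = doc_mat @ embed(query)
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

docs = ["dense retrieval with language models",
        "classic lexical matching with BM25",
        "dense vectors for semantic matching"]
results = dense_retrieve("dense semantic retrieval", docs, k=2)
```

In a production system the document embeddings would be precomputed and stored in an approximate nearest-neighbor index rather than re-encoded per query, which is exactly the "indexing" aspect the survey covers.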
NeuralMind-UNICAMP at 2022 TREC NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval
This paper reports on a study of cross-lingual information retrieval (CLIR)
using the mT5-XXL reranker on the NeuCLIR track of TREC 2022. Perhaps the
biggest contribution of this study is the finding that, despite being
fine-tuned only on query-document pairs in the same language, the mT5 model
proved viable for CLIR tasks, where query and document are in different
languages, even in the presence of suboptimal first-stage retrieval
performance. The results of the study show outstanding performance across all
tasks and languages, leading to a high number of winning positions. Finally,
this study provides valuable insights into the use of mT5 in CLIR tasks and
highlights its potential as a viable solution. For reproduction refer to
https://github.com/unicamp-dl/NeuCLIR22-mT
Modeling the Multi-mode Distribution in Self-Supervised Language Models
Self-supervised large language models (LMs) have become a highly-influential and foundational tool for many NLP models. For this reason, their expressivity is an important topic of study. In near-universal practice, given the language context, the model predicts a word from the vocabulary using a single embedded vector representation of both context and dictionary entries. Note that the context sometimes implies that the distribution over predicted words should be multi-modal in embedded space. However, the context’s single-vector representation provably fails to capture such a distribution. To address this limitation, we propose to represent context with multiple vector embeddings, which we term facets. This is distinct from previous work on multi-sense vocabulary embeddings, which employs multiple vectors for the dictionary entries, not the context.
In this dissertation, we first present the theoretical limitations of the single context embedding in LMs and how the theoretical analyses suggest new alternative softmax layers that encode a context as multiple embeddings. The proposed alternatives achieve better perplexity than the mixture of softmax (MoS), especially given an ambiguous context, without adding significant computational cost to LMs. Our approaches also let GPT-2 learn to properly copy the entities from the context, which increases the coherence of the generated text without requiring any labels.
In addition to predicting the next word, we also use multiple CLS embeddings to improve state-of-the-art pretraining methods for BERT on natural language understanding (NLU) benchmarks without introducing significant extra parameters or computation, especially when the training datasets are small. Furthermore, we show that our multi-facet embeddings improve sequential recommendation, scientific paper embeddings, measurement of sentence similarity, distantly supervised relation extraction, unsupervised text pattern entailment detection, and cold-start citation recommendation. Finally, we use the multiple vector embeddings to predict the future topics of a context and, building on this basis, propose a novel interactive language generation framework.
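The single-vector limitation the abstract describes can be made concrete with a small sketch: a standard softmax head scores the vocabulary with one context vector, so its distribution peaks around a single region of embedding space, while a multi-facet head (in the mixture-of-softmax style) mixes several per-facet distributions. All names, shapes, and values here are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def single_facet_lm_head(context, word_emb):
    # Standard LM head: one context vector scores every vocabulary
    # word, yielding a distribution centered on one mode in space.
    return softmax(word_emb @ context)

def multi_facet_lm_head(facets, facet_weights, word_emb):
    # Multi-facet head: each facet produces its own softmax over the
    # vocabulary, and the final distribution is their weighted
    # mixture, so probability mass can sit on several separated
    # word clusters at once.
    dists = np.stack([softmax(word_emb @ f) for f in facets])
    return facet_weights @ dists

rng = np.random.default_rng(0)
word_emb = rng.standard_normal((10, 8))   # 10-word toy vocabulary
facets = rng.standard_normal((3, 8))      # 3 context facets
weights = np.array([0.5, 0.3, 0.2])

p_multi = multi_facet_lm_head(facets, weights, word_emb)
p_single = single_facet_lm_head(facets[0], word_emb)
```

Because the mixture is a convex combination of valid distributions, the multi-facet output remains a valid distribution while being able to express multi-modal shapes that no single context vector can produce.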
No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval
Recent work has shown that small distilled language models are strong
competitors to models that are orders of magnitude larger and slower in a wide
range of information retrieval tasks. Due to latency constraints, this has
made distilled and dense models the go-to choice for deployment in real-world
retrieval applications. In this work, we question this practice by showing that
the number of parameters and early query-document interaction play a
significant role in the generalization ability of retrieval models. Our
experiments show that increasing model size results in marginal gains on
in-domain test sets, but much larger gains in new domains never seen during
fine-tuning. Furthermore, we show that rerankers largely outperform dense
retrievers of similar size in several tasks. Our largest reranker reaches the
state of the art in 12 of the 18 datasets of the Benchmarking-IR (BEIR) suite
and surpasses the previous state of the art by 3 average points. Finally, we
confirm that
in-domain effectiveness is not a good indicator of zero-shot effectiveness.
Code is available at
https://github.com/guilhermemr04/scaling-zero-shot-retrieval.gi
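The "early query-document interaction" the abstract credits for rerankers' generalization can be illustrated structurally: a bi-encoder (dense retriever) encodes query and document separately and compares them only at the end, whereas a cross-encoder (reranker) encodes the concatenated pair jointly, letting their terms interact from the first layer. The toy `encode` below is a hypothetical stand-in, not the paper's models.

```python
import numpy as np

def encode(text, dim=16):
    # Toy deterministic encoder standing in for a (distilled) PLM.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def bi_encoder_score(query, doc):
    # Dense retriever: query and document are encoded independently,
    # so there is no early query-document interaction; only the
    # final vectors meet, via a dot product.
    return float(encode(query) @ encode(doc))

def cross_encoder_score(query, doc, w):
    # Reranker: the concatenated pair is encoded as one sequence, so
    # query and document tokens can interact inside the encoder.
    return float(w @ encode(query + " [SEP] " + doc))

w = np.ones(16)
s_bi = bi_encoder_score("zero-shot retrieval", "domain transfer")
s_cross = cross_encoder_score("zero-shot retrieval", "domain transfer", w)
```

The trade-off the paper measures follows from this structure: the bi-encoder's document vectors can be precomputed (fast, index-friendly), while the cross-encoder must run once per query-document pair (slow, but with richer interaction that generalizes better out of domain).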
Multiword expressions at length and in depth
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. Through extensive communal work, the shared task has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expression modelling and processing, and will be a point of reference for future work.
CoRE-CoG: Conversational Recommendation of Entities using Constrained Generation
End-to-end conversational recommendation systems (CRS) generate responses by
leveraging both dialog history and a knowledge base (KB). A CRS mainly faces
three key challenges: (1) at each turn, it must decide if recommending a KB
entity is appropriate; if so, it must identify the most relevant KB entity to
recommend; and finally, it must recommend the entity in a fluent utterance that
is consistent with the conversation history. Recent CRSs do not pay sufficient
attention to these desiderata, often generating unfluent responses or not
recommending (relevant) entities at the right turn. We introduce a new CRS we
call CoRE-CoG. CoRE-CoG addresses the limitations in prior systems by
implementing (1) a recommendation trigger that decides if the system utterance
should include an entity, (2) a type pruning module that improves the relevance
of recommended entities, and (3) a novel constrained response generator to make
recommendations while maintaining fluency. Together, these modules ensure
simultaneous accurate recommendation decisions and fluent system utterances.
Experiments with recent benchmarks show the superiority of CoRE-CoG,
particularly on conditional generation sub-tasks, with gains of close to 10
F1 points and 4 Recall@1 points over baselines.
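The three modules the abstract lists (recommendation trigger, type pruning, constrained generation) can be sketched as a hypothetical pipeline. The keyword-based trigger, the type filter, and the string template below are illustrative placeholders, not CoRE-CoG's actual learned components.

```python
def should_recommend(dialog_history):
    # (1) Recommendation trigger: decide whether this turn's utterance
    # should include a KB entity. A real system would use a learned
    # classifier over the dialog state; a keyword check stands in here.
    last_turn = dialog_history[-1].lower()
    return any(w in last_turn for w in ("recommend", "suggest"))

def prune_by_type(candidates, expected_type):
    # (2) Type pruning: keep only KB entities whose type matches what
    # the conversation calls for, improving recommendation relevance.
    return [e for e in candidates if e["type"] == expected_type]

def constrained_generate(template, entity):
    # (3) Constrained response generation: force the chosen entity to
    # appear verbatim inside an otherwise fluent utterance.
    return template.format(entity=entity["name"])

history = ["I loved Inception.", "Can you recommend a similar movie?"]
kb = [{"name": "Interstellar", "type": "movie"},
      {"name": "Hans Zimmer", "type": "person"}]

reply = None
if should_recommend(history):
    picks = prune_by_type(kb, "movie")
    reply = constrained_generate("You might enjoy {entity}!", picks[0])
```

Running the three stages in sequence is what lets the system both decide *when* to recommend and keep the recommended entity embedded in a fluent response.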