NeuralMind-UNICAMP at 2022 TREC NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval
This paper reports on a study of cross-lingual information retrieval (CLIR)
using the mT5-XXL reranker on the NeuCLIR track of TREC 2022. Perhaps the
biggest contribution of this study is the finding that, despite being
fine-tuned only on query-document pairs in the same language, the mT5 model
proved viable for CLIR tasks, in which queries and documents are in different
languages, even when first-stage retrieval performance was suboptimal. The
results of the study show outstanding performance across all tasks and
languages, yielding a high number of winning positions. Finally, this study
provides valuable insights into the use of mT5 for CLIR tasks and highlights
its potential as a viable solution. For reproduction, refer to
https://github.com/unicamp-dl/NeuCLIR22-mT
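As a rough illustration of how such a reranker scores cross-lingual pairs, the sketch below applies the monoT5-style scoring convention (probability of "yes" versus "no" as the first generated token) with a publicly available multilingual checkpoint; the model name, prompt format, and example passages are assumptions for illustration, not details taken from the paper or its repository.

```python
# Hedged sketch of monoT5-style reranking with a multilingual T5 checkpoint.
# The checkpoint below is an assumed small stand-in for the paper's mT5-XXL.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "unicamp-dl/mt5-base-mmarco-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

def rerank(query: str, docs: list[str]) -> list[tuple[float, str]]:
    """Score each candidate document for the query and sort by estimated relevance."""
    yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("no", add_special_tokens=False)[0]
    scored = []
    for doc in docs:
        prompt = f"Query: {query} Document: {doc} Relevant:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        # One decoding step: relevance is the probability of "yes" vs. "no"
        # as the first generated token (the monoT5 scoring convention).
        decoder_input = torch.full((1, 1), model.config.decoder_start_token_id)
        with torch.no_grad():
            logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, 0]
        prob_yes = torch.softmax(logits[[no_id, yes_id]], dim=0)[1].item()
        scored.append((prob_yes, doc))
    return sorted(scored, reverse=True)

# An English query scored against Portuguese passages: the cross-lingual setting
# described above, with no translation step in between.
print(rerank("what causes rain", [
    "A chuva forma-se quando o vapor de água condensa na atmosfera.",
    "O futebol é o esporte mais popular do Brasil.",
]))
```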
In Defense of Cross-Encoders for Zero-Shot Retrieval
Bi-encoders and cross-encoders are widely used in many state-of-the-art
retrieval pipelines. In this work we study the generalization ability of these
two types of architectures across a wide range of parameter counts in both
in-domain and out-of-domain scenarios. We find that the number of parameters and early
query-document interactions of cross-encoders play a significant role in the
generalization ability of retrieval models. Our experiments show that
increasing model size results in marginal gains on in-domain test sets, but
much larger gains in new domains never seen during fine-tuning. Furthermore, we
show that cross-encoders largely outperform bi-encoders of similar size in
several tasks. In the BEIR benchmark, our largest cross-encoder surpasses a
state-of-the-art bi-encoder by more than 4 average points. Finally, we show
that using bi-encoders as first-stage retrievers provides no gains in
comparison to a simpler retriever such as BM25 on out-of-domain tasks. The code
is available at
https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git
Comment: arXiv admin note: substantial text overlap with arXiv:2206.0287
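To make the architectural contrast concrete, the hedged sketch below scores the same query-document pairs with a bi-encoder (independent embeddings compared by dot product) and a cross-encoder (joint encoding with early query-document interaction); the checkpoints are common public examples, not the models evaluated in this work.

```python
# Illustrative contrast between bi-encoder and cross-encoder scoring.
# Both checkpoints are assumed public stand-ins, not the paper's models.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "symptoms of vitamin d deficiency"
docs = [
    "Fatigue and bone pain are common signs of low vitamin D.",
    "Vitamin C is abundant in citrus fruits.",
]

# Bi-encoder: encode query and documents independently, then compare embeddings.
bi_encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
d_emb = bi_encoder.encode(docs, convert_to_tensor=True)
bi_scores = util.dot_score(q_emb, d_emb)[0]

# Cross-encoder: feed each (query, document) pair through one transformer,
# so query and document interact at every attention layer.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross_encoder.predict([(query, d) for d in docs])

print("bi-encoder scores:   ", bi_scores.tolist())
print("cross-encoder scores:", ce_scores.tolist())
```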
InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval
Recently, InPars introduced a method to efficiently use large language models
(LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced
to generate relevant queries for documents. These synthetic query-document
pairs can then be used to train a retriever. However, InPars and, more
recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to
generate such datasets. In this work we introduce InPars-v2, a dataset
generator that uses open-source LLMs and existing powerful rerankers to select
synthetic query-document pairs for training. A simple BM25 retrieval pipeline
followed by a monoT5 reranker finetuned on InPars-v2 data achieves new
state-of-the-art results on the BEIR benchmark. To allow researchers to further
improve our method, we open source the code, synthetic data, and finetuned
models: https://github.com/zetaalphavector/inPars/tree/master/tp
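The sketch below illustrates the general InPars-style generation step under stated assumptions: a few-shot prompt induces an open-source LLM to write a query for a document, after which (as in InPars-v2) a reranker would score the resulting pairs and keep only the highest-scoring ones for training; the stand-in model, prompt, and documents are illustrative only.

```python
# Hedged sketch of few-shot synthetic query generation with an open-source LLM.
# The model, prompt, and selection step are assumptions for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

FEW_SHOT = (
    "Document: The Amazon rainforest produces about 20% of Earth's oxygen.\n"
    "Relevant query: how much oxygen does the amazon rainforest produce\n\n"
)

def generate_query(document: str) -> str:
    """Generate one synthetic query for a document via few-shot prompting."""
    prompt = FEW_SHOT + f"Document: {document}\nRelevant query:"
    out = generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"]
    # Keep only the first line the model produces after the prompt.
    return out[len(prompt):].strip().split("\n")[0]

doc = "Caffeine is a central nervous system stimulant found in coffee and tea."
query = generate_query(doc)
# In InPars-v2, a strong reranker would now score (query, doc) and the pair
# would be kept for training only if the score clears a selection cutoff.
print(query)
```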
No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval
Recent work has shown that small distilled language models are strong
competitors to models that are orders of magnitude larger and slower in a wide
range of information retrieval tasks. Due to latency constraints, this has made
distilled and dense models the go-to choice for deployment in real-world
retrieval applications. In this work, we question this practice by showing that
the number of parameters and early query-document interaction play a
significant role in the generalization ability of retrieval models. Our
experiments show that increasing model size results in marginal gains on
in-domain test sets, but much larger gains in new domains never seen during
fine-tuning. Furthermore, we show that rerankers largely outperform dense retrievers
of similar size in several tasks. Our largest reranker reaches the state of the
art in 12 of the 18 datasets of the Benchmark-IR (BEIR) and surpasses the
previous state of the art by 3 average points. Finally, we confirm that
in-domain effectiveness is not a good indicator of zero-shot effectiveness.
Code is available at
https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git
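For context, a minimal sketch of the two-stage setup this line of work evaluates: a cheap lexical first stage (BM25) produces candidates, and a reranker with full query-document interaction reorders them; the toy corpus and checkpoint below are assumptions, not the paper's benchmark data or models.

```python
# Minimal two-stage retrieval sketch: BM25 candidates, then cross-encoder reranking.
# Corpus, query, and checkpoint are illustrative stand-ins only.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Sleep deprivation impairs memory consolidation and attention.",
    "The Amazon river has the largest discharge volume of any river.",
    "Chronic lack of sleep is linked to reduced hippocampal function.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "effects of sleep deprivation on memory"
scores = bm25.get_scores(query.lower().split())
# First stage: keep the top-2 BM25 candidates.
top = sorted(range(len(corpus)), key=lambda i: -scores[i])[:2]
candidates = [corpus[i] for i in top]

# Second stage: rerank the candidate pool; it is the reranker's size and early
# query-document interaction that drive the zero-shot gains described above.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(query, c) for c in candidates])
best = max(zip(ce_scores, candidates))
print(best[1])
```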
Catálogo Taxonômico da Fauna do Brasil: setting the baseline knowledge on the animal diversity in Brazil
The limited temporal completeness and taxonomic accuracy of species lists, made available in a traditional manner in scientific publications, have always represented a problem. These lists are invariably limited to a few taxonomic groups and do not represent up-to-date knowledge of all species and classifications. In this context, the Brazilian megadiverse fauna is no exception, and the Catálogo Taxonômico da Fauna do Brasil (CTFB) (http://fauna.jbrj.gov.br/), made public in 2015, represents a database on biodiversity anchored in a list of valid and expertly recognized scientific names of animals in Brazil. The CTFB is updated in near real time by a team of more than 800 specialists. By January 1, 2024, the CTFB compiled 133,691 nominal species, of which 125,138 were considered valid. Most of the valid species were arthropods (82.3%, with more than 102,000 species) and chordates (7.69%, with over 11,000 species). These taxa were followed by a cluster composed of Mollusca (3,567 species), Platyhelminthes (2,292 species), Annelida (1,833 species), and Nematoda (1,447 species). All remaining groups had fewer than 1,000 species reported in Brazil, with Cnidaria (831 species), Porifera (628 species), Rotifera (606 species), and Bryozoa (520 species) representing those with more than 500 species. Analysis of the CTFB database can facilitate and direct efforts towards the discovery of new species in Brazil, but it is also fundamental in providing the best available list of valid nominal species to users, including those in science, health, conservation efforts, and any initiative involving animals. The importance of the CTFB is evidenced by the elevated number of citations in the scientific literature in diverse areas of biology, law, anthropology, education, forensic science, and veterinary science, among others.