InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

Abonizio, Hugo; Bonifacio, Luiz; Fadaee, Marzieh; Jeronymo, Vitor; Lotufo, Roberto; Nogueira, Rodrigo; Zavrel, Jakub


Authors: Hugo Abonizio, Luiz Bonifacio, Marzieh Fadaee, Vitor Jeronymo, Roberto Lotufo, Rodrigo Nogueira, Jakub Zavrel
Publication date: 26 May 2023

Abstract

Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tp
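
The selection step described in the abstract, where a reranker decides which synthetic query-document pairs are kept for training, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `overlap_score` is a toy token-overlap scorer standing in for a real reranker, and `select_pairs` and the cutoff `k` are hypothetical and not taken from the released code.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (synthetic query, document)


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a reranker (assumption): fraction of query tokens found in the document."""
    q_tokens = query.lower().split()
    if not q_tokens:
        return 0.0
    d_tokens = set(document.lower().split())
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)


def select_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    k: int,
) -> List[Pair]:
    """Keep the k pairs that the scorer rates as most relevant (hypothetical helper)."""
    ranked = sorted(pairs, key=lambda p: score_fn(p[0], p[1]), reverse=True)
    return ranked[:k]


# Three synthetic pairs; the second query does not match its document.
pairs = [
    ("what is bm25", "bm25 is a ranking function used by search engines"),
    ("how tall is everest", "bm25 is a ranking function used by search engines"),
    ("monot5 reranker", "monot5 is a t5 based reranker for passage ranking"),
]
selected = select_pairs(pairs, overlap_score, k=2)
```

In a realistic setting the scorer would be a trained reranker instead of token overlap; passing it in as `score_fn` keeps the selection logic unchanged when the scorer is swapped.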

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2301.01820

Last time updated on 02/02/2023