Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling

Chen, Yun-Nung; Hsu, Chen-Yu; Hsu, Tsu-Yuan; Huang, Chao-Wei; Li, Chen-An

Publication date: 6 March 2024

Abstract

Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, which poses even greater challenges in multilingual scenarios. This paper introduces UMR, an Unsupervised Multilingual dense Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework which iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. Our source code, data, and models are publicly available at https://github.com/MiuLab/UMR

Comment: Accepted to Findings of EACL 202
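The pseudo-labeling idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: that the language model scores each candidate passage by the log-likelihood of the query given that passage, that these scores are normalized with a softmax into soft labels, and that the retriever is trained to match them under a KL-divergence loss. The function names, the toy log-likelihoods, and the toy embeddings are hypothetical and are not taken from the UMR code base.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of real-valued scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_labels(query_log_likelihoods, temperature=1.0):
    """Turn per-passage log p(query | passage) scores into soft labels.

    Assumption: the scores come from a multilingual language model;
    here they are plain numbers supplied by the caller.
    """
    return softmax(query_log_likelihoods, temperature)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retriever_distribution(query_vec, passage_vecs):
    """Softmax over dense-retriever similarity scores (inner products)."""
    return softmax([dot(query_vec, p) for p in passage_vecs])


def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical log-likelihoods of one query given three candidate passages.
lm_scores = [-12.0, -15.5, -20.0]
labels = pseudo_labels(lm_scores)

# Hypothetical query and passage embeddings from a dense retriever.
query_vec = [0.2, 0.1]
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
predicted = retriever_distribution(query_vec, passage_vecs)

# Training signal: how far the retriever is from the pseudo labels.
loss = kl_divergence(labels, predicted)
print(labels, predicted, loss)
```

In a real training loop the loss would be minimized with respect to the retriever's parameters, and the improved retriever could then supply better candidates for the next round of labeling, which is one plausible reading of the iterative two-stage framework the abstract mentions.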

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2403.03516

Last time updated on 26/09/2024