Automatic Spanish translation of SQuAD dataset for multi-lingual question answering

Carrino, Casimiro Pio; Rodríguez Fonollosa, José Adrián; Ruiz Costa-Jussà, Marta

Authors: Casimiro Pio Carrino, José Adrián Rodríguez Fonollosa, Marta Ruiz Costa-Jussà
Publication date: 1 January 2020
Publisher: European Language Resources Association (ELRA)

Abstract

Recently, multilingual question answering has become a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then use this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluate our QA models on the recently proposed MLQA and XQuAD benchmarks for cross-lingual extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving new state-of-the-art values of 68.1 F1 on the Spanish MLQA corpus and 77.6 F1 on the Spanish XQuAD corpus. The resulting synthetically generated SQuAD-es v1.1 corpus, which contains almost 100% of the data in the original English version, is, to the best of our knowledge, the first large-scale QA training resource for Spanish.

This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal (FEDER/MINECO) and the project PCIN-2017-079 (AEI/MINECO).

Peer Reviewed

Postprint (published version)
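The F1 values quoted in the abstract are, in SQuAD-style extractive QA evaluation, token-overlap scores between a predicted answer span and a reference answer. The sketch below illustrates that computation for a single prediction/reference pair. It is a simplified illustration, not the official MLQA or XQuAD scoring script: the function name `token_f1` is hypothetical, and the normalization here is assumed to be only lowercasing and whitespace splitting, whereas the official scripts also strip punctuation and articles and take the best score over several reference answers.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one predicted answer and one reference answer.

    Simplified normalization (assumption): lowercase + whitespace split only.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Exact match, partial overlap, and no overlap.
    print(token_f1("Madrid", "Madrid"))
    print(token_f1("en Madrid", "Madrid"))
    print(token_f1("Barcelona", "Madrid"))
```

A benchmark-level figure such as 68.1 F1 would then correspond to the average of per-question scores of this kind, scaled to a 0-100 range.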

Available Versions

UPCommons. Portal del coneixement obert de la UPC

oai:upcommons.upc.edu:2117/329...

Last time updated on 29/09/2020