Automatic Spanish translation of SQuAD dataset for multi-lingual question answering
Recently, multilingual question answering has become a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then use this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluate our QA models with the recently proposed MLQA and XQuAD benchmarks for cross-lingual Extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving new state-of-the-art values of 68.1 F1 on the Spanish MLQA corpus and 77.6 F1 on the Spanish XQuAD corpus. The resulting, synthetically generated SQuAD-es v1.1 corpus, with almost 100% of the data contained in the original English version, is, to the best of our knowledge, the first large-scale QA training resource for Spanish. This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal (FEDER/MINECO) and the project PCIN-2017-079 (AEI/MINECO).
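As context for the fine-tuning step described above, the sketch below shows the shape of an extractive QA forward pass with a Multilingual-BERT checkpoint from the Hugging Face transformers library. It is a minimal illustration, not the authors' code: the checkpoint name, the Spanish example, and the span-decoding step are assumptions, and the QA head gives meaningful spans only after fine-tuning on SQuAD-es.

# Minimal sketch of extractive QA with Multilingual-BERT (illustrative only;
# the QA head must first be fine-tuned on SQuAD-es to predict useful spans).
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")

question = "¿Dónde se encuentra la Torre Eiffel?"
context = "La Torre Eiffel se encuentra en París, Francia."

inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Decode the highest-scoring answer span from the start/end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))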
Data Augmentation for Conversational AI
Advancements in conversational systems have revolutionized information
access, surpassing the limitations of single queries. However, developing
dialogue systems requires a large amount of training data, which is a challenge
in low-resource domains and languages. Traditional data collection methods like
crowd-sourcing are labor-intensive and time-consuming, making them ineffective
in this context. Data augmentation (DA) is an effective approach to alleviating
the data scarcity problem in conversational systems. This tutorial provides a
comprehensive and up-to-date overview of DA approaches in the context of
conversational systems. It highlights recent advances in conversation
augmentation, open domain and task-oriented conversation generation, and
different paradigms of evaluating these models. We also discuss current
challenges and future directions to help researchers and practitioners
further advance this area.
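One concrete technique in this family is back-translation: paraphrasing existing dialogue turns by translating them to a pivot language and back. The sketch below is a minimal illustration rather than anything from the tutorial; the MarianMT checkpoint names and the example utterance are assumptions.

# Back-translation (en -> de -> en) as a simple DA method for dialogue turns.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return [tokenizer.decode(ids, skip_special_tokens=True)
            for ids in model.generate(**batch)]

utterances = ["Can you book me a table for two tonight?"]
pivot = translate(utterances, "Helsinki-NLP/opus-mt-en-de")
paraphrases = translate(pivot, "Helsinki-NLP/opus-mt-de-en")
print(paraphrases)  # Paraphrased variants to add to the training set.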
Grounded and Consistent Question Answering
This thesis describes advancements in question answering along three general directions: model architecture extensions, explainable question answering, and data augmentation.
Chapter 2 describes the first state-of-the-art model for the Natural Questions dataset based on pretrained transformers. Chapters 3 and 4 describe extensions to the model architecture designed to accommodate long textual inputs and multimodal text+image inputs, establishing new state-of-the-art results on the Natural Questions and VCR datasets.
Chapter 5 shows that significant improvements can be obtained with data augmentation on the SQuAD and Natural Questions datasets, introducing roundtrip consistency (sketched below) as a simple heuristic to improve the quality of synthetic data. In Chapters 6 and 7 we explore explainable question answering, demonstrating the usefulness of a new concrete kind of structured explanation, QED, and proposing a semantic analysis of why-questions in the Natural Questions as a way of better understanding the nature of real-world explanations.
Finally, in Chapters 8 and 9 we delve into more exploratory data augmentation techniques for question answering. We look respectively at how straight-through gradients can be utilized to optimize roundtrip consistency in a pipeline of models on the fly, and at how very recent large language models like PaLM can be used to generate synthetic question answering datasets for new languages given as few as five representative examples per language.
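As a rough illustration of the roundtrip-consistency heuristic from Chapter 5, the sketch below keeps a synthetic question-answer pair only when an extractive QA model recovers the same answer from the context. The checkpoint and the exact string-match criterion are assumptions, not the thesis implementation.

# Roundtrip-consistency filtering for synthetic QA pairs (illustrative).
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def roundtrip_consistent(context, question, answer):
    predicted = qa(question=question, context=context)["answer"]
    # Simple string match; a real pipeline may use a softer criterion.
    return predicted.strip().lower() == answer.strip().lower()

context = "The Eiffel Tower is located in Paris, France."
print(roundtrip_consistent(context, "Where is the Eiffel Tower?", "Paris"))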
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale
Existing question answering (QA) systems owe much of their success to large,
high-quality training data. Such annotation efforts are costly, and the
difficulty compounds in the cross-lingual setting. Therefore, prior
cross-lingual QA work has focused on releasing evaluation datasets, and then
applying zero-shot methods as baselines. In this work, we propose a synthetic
data generation method for cross-lingual QA which leverages indirect
supervision from existing parallel corpora. Our method, termed PAXQA
(Projecting annotations for cross-lingual (x) QA), decomposes
cross-lingual QA into two stages. In the first stage, we apply a question
generation (QG) model to the English side. In the second stage, we apply
annotation projection to translate both the questions and answers. To better
translate questions, we propose a novel use of lexically-constrained machine
translation, in which constrained entities are extracted from the parallel
bitexts. We release cross-lingual QA datasets across 4 languages, totaling 662K
QA examples. We then show that extractive QA models fine-tuned on these
datasets outperform both zero-shot and prior synthetic data generation models,
demonstrating the quality of our generations. We find that the largest
performance gains are for cross-lingual directions with non-English questions
and English contexts. Ablation studies show that our dataset generation method
is relatively robust to noise from automatic word alignments.
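The second-stage projection can be pictured with a toy example. The sketch below assumes word alignments between the English and target sentences are already available (PAXQA obtains them automatically from the bitext); the sentences, alignment pairs, and helper name are hand-written for illustration.

# Project an English answer span onto the target side via word alignments.
def project_span(alignment, src_start, src_end):
    # alignment: list of (src_word_idx, tgt_word_idx) pairs.
    positions = [t for s, t in alignment if src_start <= s <= src_end]
    if not positions:
        return None  # No aligned target words: discard this QA example.
    return min(positions), max(positions)

src = "The Eiffel Tower is in Paris".split()
tgt = "La Torre Eiffel está en París".split()
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 4), (5, 5)]

span = project_span(alignment, 5, 5)  # English answer span: "Paris"
if span is not None:
    print(" ".join(tgt[span[0] : span[1] + 1]))  # -> "París"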
Exploring the Viability of Synthetic Query Generation for Relevance Prediction
Query-document relevance prediction is a critical problem in Information
Retrieval systems. This problem has increasingly been tackled using
(pretrained) transformer-based models which are finetuned using large
collections of labeled data. However, in specialized domains such as e-commerce
and healthcare, the viability of this approach is limited by the dearth of
large in-domain data. To address this paucity, recent methods leverage these
powerful models to generate high-quality task and domain-specific synthetic
data. Prior work has largely explored synthetic data generation or query
generation (QGen) for Question-Answering (QA) and binary (yes/no) relevance
prediction, where for instance, the QGen models are given a document, and
trained to generate a query relevant to that document. However in many
problems, we have a more fine-grained notion of relevance than a simple yes/no
label. Thus, in this work, we conduct a detailed study into how QGen approaches
can be leveraged for nuanced relevance prediction. We demonstrate that --
contrary to claims from prior works -- current QGen approaches fall short of
the more conventional cross-domain transfer-learning approaches. Via empirical
studies spanning 3 public e-commerce benchmarks, we identify new shortcomings
of existing QGen approaches -- including their inability to distinguish between
different grades of relevance. To address this, we introduce label-conditioned
QGen models that incorporate knowledge about the different relevance grades. While
our experiments demonstrate that these modifications help improve performance
of QGen techniques, we also find that QGen approaches struggle to capture the
full nuance of the relevance label space and as a result the generated queries
are not faithful to the desired relevance label.
Comment: In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR eCom 23).
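One way to realize label conditioning, assumed here purely as an illustration, is to prepend the target relevance grade to the document before feeding it to a seq2seq QGen model. The prompt format, grade names, and checkpoint below are not from the paper, and an off-the-shelf t5-small would need fine-tuning on (document, grade, query) triples before its outputs mean anything.

# Label-conditioned query generation: condition the input on a relevance grade.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

document = "Waterproof hiking boots with ankle support, sizes 7-13."
for grade in ["exact", "substitute", "irrelevant"]:
    prompt = f"relevance: {grade} document: {document}"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=16)
    print(grade, "->", tokenizer.decode(out[0], skip_special_tokens=True))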