Automatic Spanish translation of SQuAD dataset for multi-lingual question answering
Recently, multilingual question answering has become a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then use this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluate our QA models with the recently proposed MLQA and XQuAD benchmarks for cross-lingual Extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving new state-of-the-art values of 68.1 F1 on the Spanish MLQA corpus and 77.6 F1 on the Spanish XQuAD corpus. The resulting, synthetically generated SQuAD-es v1.1 corpus, with almost 100% of the data contained in the original English version, is, to the best of our knowledge, the first large-scale QA training resource for Spanish. This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal (FEDER/MINECO) and the project PCIN-2017-079 (AEI/MINECO).
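As context for the fine-tuning step described above, the sketch below shows the shape of an extractive QA forward pass with a Multilingual-BERT checkpoint from the Hugging Face transformers library. It is a minimal illustration, not the authors' code: the checkpoint name, the Spanish example, and the span-decoding step are assumptions, and the QA head gives meaningful spans only after fine-tuning on SQuAD-es.

# Minimal sketch of extractive QA with Multilingual-BERT (illustrative only;
# the QA head must first be fine-tuned on SQuAD-es to predict useful spans).
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")

question = "¿Dónde se encuentra la Torre Eiffel?"
context = "La Torre Eiffel se encuentra en París, Francia."

inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Decode the highest-scoring answer span from the start/end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))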
Data Augmentation for Conversational AI
Advancements in conversational systems have revolutionized information
access, surpassing the limitations of single queries. However, developing
dialogue systems requires a large amount of training data, which is a challenge
in low-resource domains and languages. Traditional data collection methods like
crowd-sourcing are labor-intensive and time-consuming, making them ineffective
in this context. Data augmentation (DA) is an effective approach to alleviating
the data scarcity problem in conversational systems. This tutorial provides a
comprehensive and up-to-date overview of DA approaches in the context of
conversational systems. It highlights recent advances in conversation
augmentation, open domain and task-oriented conversation generation, and
different paradigms of evaluating these models. We also discuss current
challenges and future directions to help researchers and practitioners
further advance this area.
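One concrete technique in this family is back-translation: paraphrasing existing dialogue turns by translating them to a pivot language and back. The sketch below is a minimal illustration rather than anything from the tutorial; the MarianMT checkpoint names and the example utterance are assumptions.

# Back-translation (en -> de -> en) as a simple DA method for dialogue turns.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return [tokenizer.decode(ids, skip_special_tokens=True)
            for ids in model.generate(**batch)]

utterances = ["Can you book me a table for two tonight?"]
pivot = translate(utterances, "Helsinki-NLP/opus-mt-en-de")
paraphrases = translate(pivot, "Helsinki-NLP/opus-mt-de-en")
print(paraphrases)  # Paraphrased variants to add to the training set.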
Grounded and Consistent Question Answering
This thesis describes advancements in question answering along three general directions: model architecture extensions, explainable question answering, and data augmentation.
Chapter 2 describes the first state-of-the-art model for the Natural Questions dataset based on pretrained transformers. Chapters 3 and 4 describe extensions to the model architecture designed to accommodate long textual inputs and multimodal text+image inputs, establishing new state-of-the-art results on the Natural Questions and VCR datasets.
Chapter 5 shows that significant improvements can be obtained with data augmentation on the SQuAD and Natural Questions datasets, introducing roundtrip consistency (sketched below) as a simple heuristic to improve the quality of synthetic data. In Chapters 6 and 7 we explore explainable question answering, demonstrating the usefulness of a new concrete kind of structured explanation, QED, and proposing a semantic analysis of why-questions in the Natural Questions as a way of better understanding the nature of real-world explanations.
Finally, in Chapters 8 and 9 we delve into more exploratory data augmentation techniques for question answering. We look respectively at how straight-through gradients can be utilized to optimize roundtrip consistency in a pipeline of models on the fly, and at how very recent large language models like PaLM can be used to generate synthetic question answering datasets for new languages given as few as five representative examples per language.
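As a rough illustration of the roundtrip-consistency heuristic from Chapter 5, the sketch below keeps a synthetic question-answer pair only when an extractive QA model recovers the same answer from the context. The checkpoint and the exact string-match criterion are assumptions, not the thesis implementation.

# Roundtrip-consistency filtering for synthetic QA pairs (illustrative).
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def roundtrip_consistent(context, question, answer):
    predicted = qa(question=question, context=context)["answer"]
    # Simple string match; a real pipeline may use a softer criterion.
    return predicted.strip().lower() == answer.strip().lower()

context = "The Eiffel Tower is located in Paris, France."
print(roundtrip_consistent(context, "Where is the Eiffel Tower?", "Paris"))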
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale
Existing question answering (QA) systems owe much of their success to large,
high-quality training data. Such annotation efforts are costly, and the
difficulty compounds in the cross-lingual setting. Therefore, prior
cross-lingual QA work has focused on releasing evaluation datasets, and then
applying zero-shot methods as baselines. In this work, we propose a synthetic
data generation method for cross-lingual QA which leverages indirect
supervision from existing parallel corpora. Our method, termed PAXQA
(Projecting annotations for cross-lingual (x) QA), decomposes
cross-lingual QA into two stages. In the first stage, we apply a question
generation (QG) model to the English side. In the second stage, we apply
annotation projection to translate both the questions and answers. To better
translate questions, we propose a novel use of lexically-constrained machine
translation, in which constrained entities are extracted from the parallel
bitexts. We release cross-lingual QA datasets across 4 languages, totaling 662K
QA examples. We then show that extractive QA models fine-tuned on these
datasets outperform both zero-shot and prior synthetic data generation models,
demonstrating the quality of our generations. We find that the largest
performance gains are for cross-lingual directions with non-English questions
and English contexts. Ablation studies show that our dataset generation method
is relatively robust to noise from automatic word alignments.
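The second-stage projection can be pictured with a toy example. The sketch below assumes word alignments between the English and target sentences are already available (PAXQA obtains them automatically from the bitext); the sentences, alignment pairs, and helper name are hand-written for illustration.

# Project an English answer span onto the target side via word alignments.
def project_span(alignment, src_start, src_end):
    # alignment: list of (src_word_idx, tgt_word_idx) pairs.
    positions = [t for s, t in alignment if src_start <= s <= src_end]
    if not positions:
        return None  # No aligned target words: discard this QA example.
    return min(positions), max(positions)

src = "The Eiffel Tower is in Paris".split()
tgt = "La Torre Eiffel está en París".split()
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 4), (5, 5)]

span = project_span(alignment, 5, 5)  # English answer span: "Paris"
if span is not None:
    print(" ".join(tgt[span[0] : span[1] + 1]))  # -> "París"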
Exploring the Viability of Synthetic Query Generation for Relevance Prediction
Query-document relevance prediction is a critical problem in Information
Retrieval systems. This problem has increasingly been tackled using
(pretrained) transformer-based models which are finetuned using large
collections of labeled data. However, in specialized domains such as e-commerce
and healthcare, the viability of this approach is limited by the dearth of
large in-domain data. To address this paucity, recent methods leverage these
powerful models to generate high-quality task and domain-specific synthetic
data. Prior work has largely explored synthetic data generation or query
generation (QGen) for Question-Answering (QA) and binary (yes/no) relevance
prediction, where for instance, the QGen models are given a document, and
trained to generate a query relevant to that document. However in many
problems, we have a more fine-grained notion of relevance than a simple yes/no
label. Thus, in this work, we conduct a detailed study into how QGen approaches
can be leveraged for nuanced relevance prediction. We demonstrate that --
contrary to claims from prior works -- current QGen approaches fall short of
the more conventional cross-domain transfer-learning approaches. Via empirical
studies spanning 3 public e-commerce benchmarks, we identify new shortcomings
of existing QGen approaches -- including their inability to distinguish between
different grades of relevance. To address this, we introduce label-conditioned
QGen models that incorporate knowledge about the different relevance grades. While
our experiments demonstrate that these modifications help improve performance
of QGen techniques, we also find that QGen approaches struggle to capture the
full nuance of the relevance label space and as a result the generated queries
are not faithful to the desired relevance label.
Comment: In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR eCom 23).
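One way to realize label conditioning, assumed here purely as an illustration, is to prepend the target relevance grade to the document before feeding it to a seq2seq QGen model. The prompt format, grade names, and checkpoint below are not from the paper, and an off-the-shelf t5-small would need fine-tuning on (document, grade, query) triples before its outputs mean anything.

# Label-conditioned query generation: condition the input on a relevance grade.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

document = "Waterproof hiking boots with ankle support, sizes 7-13."
for grade in ["exact", "substitute", "irrelevant"]:
    prompt = f"relevance: {grade} document: {document}"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=16)
    print(grade, "->", tokenizer.decode(out[0], skip_special_tokens=True))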