The TechQA Dataset
We introduce TechQA, a domain-adaptation question answering dataset for the
technical support domain. The TechQA corpus highlights two real-world issues
from the automated customer support domain. First, it contains actual questions
posed by users on a technical forum, rather than questions generated
specifically for a competition or a task. Second, it has a real-world size --
600 training, 310 dev, and 490 evaluation question/answer pairs -- thus
reflecting the cost of creating large labeled datasets with actual data.
Consequently, TechQA is meant to stimulate research in domain adaptation rather
than being a resource to build QA systems from scratch. The dataset was
obtained by crawling the IBM Developer and IBM DeveloperWorks forums for
questions with accepted answers that appear in a published IBM Technote---a
technical document that addresses a specific technical issue. We also release a
collection of the 801,998 Technotes publicly available as of April 4, 2019, as a
companion resource that might be used for pretraining to learn representations
of the IT domain language.
Comment: Long version of conference paper to be submitted
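To illustrate the shape of the data, here is a minimal sketch of a TechQA-style question/answer record; the field names and values are illustrative assumptions rather than the official schema.

```python
# Hypothetical TechQA-style record; field names are assumptions, not the
# official schema released with the dataset.
import json

example = {
    "QUESTION_TITLE": "MQ channel fails to start after upgrade",  # forum question title
    "QUESTION_TEXT": "After upgrading, the channel stops with an error...",
    "DOCUMENT_ID": "swg21234567",   # Technote containing the accepted answer
    "ANSWER": "Apply the latest fix pack and restart the channel.",
    "ANSWERABLE": "Y",              # some questions have no answer in the corpus
}

print(json.dumps(example, indent=2))
```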
Domain-Specific Fast Pre-training Technique using Document-Level Metadata and Taxonomy
As the demand for sophisticated Natural Language Processing (NLP) models
continues to grow, so does the need for efficient pre-training techniques.
Current NLP models undergo resource-intensive pre-training. In response, we
introduce a Fast Pre-training Technique using Document-Level Metadata
and Taxonomy, a novel approach designed to significantly reduce computational
demands. The technique leverages document metadata and a domain-specific taxonomy as
supervision signals. It involves continual pre-training of an open-domain
transformer encoder using sentence-level embeddings, followed by fine-tuning
using token-level embeddings. We evaluate the technique on six tasks across nine
datasets spanning three distinct domains. Remarkably, it achieves compute
reductions of approximately 1,000x, 4,500x, and 500x compared to competitive
approaches in the Customer Support, Scientific, and Legal domains,
respectively. Importantly, these efficiency gains do not compromise performance
relative to competitive baselines. Furthermore, the reduced pre-training data
mitigates catastrophic forgetting, ensuring consistent performance in
open-domain scenarios. The technique offers a promising solution for
resource-efficient pre-training, with potential applications spanning various
domains.
Comment: 38 pages, 7 figures
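To make the two-phase recipe concrete, the sketch below shows one plausible form of the first phase: continual pre-training of an open-domain encoder where mean-pooled sentence embeddings are supervised by document taxonomy labels. The BERT checkpoint, pooling choice, linear head, and label count are assumptions, not details from the paper.

```python
# Continual pre-training sketch: sentence-level embeddings supervised by
# taxonomy labels. All architectural choices here are illustrative.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
num_taxonomy_labels = 50  # size of the domain taxonomy (assumed)
head = nn.Linear(encoder.config.hidden_size, num_taxonomy_labels)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=2e-5
)
loss_fn = nn.CrossEntropyLoss()

def train_step(sentences, taxonomy_labels):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # Mean-pool token states into one sentence-level embedding per input.
    hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    sent_emb = (hidden * mask).sum(1) / mask.sum(1)
    # The taxonomy label of the source document supervises the sentence embedding.
    loss = loss_fn(head(sent_emb), torch.tensor(taxonomy_labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(train_step(["How do I reset my VPN token?"], [3]))
```

The second phase would then fine-tune the adapted encoder with ordinary token-level objectives on the downstream task.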
Improving Low-Resource Question Answering using Active Learning in Multiple Stages
Neural approaches have become very popular in the domain of Question
Answering; however, they require a large amount of annotated data. Furthermore,
they often yield very good performance, but only in the domain they were trained
on. In this work we propose a novel approach that combines data augmentation
via question-answer generation with Active Learning to improve performance in
low-resource settings, where the target domains are diverse in terms of
difficulty and similarity to the source domain. We also investigate Active
Learning for question answering at different stages, reducing the overall
human annotation effort. For this purpose, we consider target domains in
realistic settings, with an extremely small number of annotated samples but with
many unlabeled documents, which we assume can be obtained with little effort.
Additionally, we assume a sufficient amount of labeled data from the source
domain is available. We perform extensive experiments to find the best setup
for incorporating domain experts. Our findings show that our novel approach,
where humans are incorporated as early as possible in the process, boosts
performance in the low-resource, domain-specific setting, allowing for
low-labeling-effort question answering systems in new, specialized domains.
They further demonstrate how human annotation affects the performance of QA
depending on the stage at which it is performed.
Comment: 16 pages, 8 figures
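As one concrete reading of the setup, the sketch below implements a pool-based active learning round with uncertainty sampling, where the least-confident documents go to human annotators first. The confidence scorer and the annotation budget are stand-ins, not the paper's actual components.

```python
# Pool-based active learning round with uncertainty sampling (illustrative).
import random

def model_confidence(document):
    # Placeholder for a QA model's confidence in its own answer,
    # e.g. the probability of the predicted answer span.
    return random.random()

def active_learning_round(unlabeled_docs, budget, annotate):
    # Rank unlabeled documents from least to most confident and send
    # the most uncertain ones to the human annotator first.
    ranked = sorted(unlabeled_docs, key=model_confidence)
    newly_labeled = [annotate(doc) for doc in ranked[:budget]]
    remaining = ranked[budget:]
    return newly_labeled, remaining

docs = [f"doc-{i}" for i in range(100)]
labeled, pool = active_learning_round(docs, budget=10,
                                      annotate=lambda d: (d, "gold answer"))
print(len(labeled), len(pool))
```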
Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges
Large Language Models (LLMs) have demonstrated impressive zero-shot
performance on a wide range of NLP tasks, showing the ability to reason
and apply commonsense. A relevant application is to use them for creating
high-quality synthetic datasets for downstream tasks. In this work, we probe whether
GPT-4 can be used to augment existing extractive reading comprehension
datasets. Automating data annotation processes has the potential to save the
large amounts of time, money, and effort that go into manually labelling
datasets. In this paper, we evaluate the performance of GPT-4 as a replacement
for human annotators on low-resource reading comprehension tasks, by comparing
performance after fine-tuning, along with the cost associated with annotation.
This work serves as the first analysis of LLMs as synthetic data augmenters for
QA systems, highlighting the unique opportunities and challenges. Additionally,
we release augmented versions of low-resource datasets, which will allow the
research community to create further benchmarks for evaluation of generated
datasets.
Comment: 5 pages, 1 figure, 3 tables
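A minimal sketch of the augmentation idea, using the OpenAI chat completions API to draft extractive QA pairs for a passage; the prompt wording and output handling are assumptions, and generated answers would still need to be verified as exact spans of the passage before use.

```python
# Drafting extractive QA pairs with GPT-4 (illustrative prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(passage: str) -> str:
    prompt = (
        "Write two question/answer pairs about the passage below. "
        "Each answer must be an exact substring of the passage.\n\n"
        f"Passage: {passage}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_qa_pairs("The Hubble Space Telescope was launched in 1990."))
```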
MPMQA: Multimodal Question Answering on Product Manuals
Visual contents, such as illustrations and images, play an important role in product
manual understanding. Existing Product Manual Question Answering (PMQA)
datasets tend to ignore visual contents and only retain textual parts. In this
work, to emphasize the importance of multimodal contents, we propose a
Multimodal Product Manual Question Answering (MPMQA) task. For each question,
MPMQA requires the model not only to process multimodal contents but also to
provide multimodal answers. To support MPMQA, a large-scale dataset PM209 is
constructed with human annotations, which contains 209 product manuals from 27
well-known consumer electronic brands. Human annotations include 6 types of
semantic regions for manual contents and 22,021 question-answer pairs.
Notably, each answer consists of a textual sentence and related visual
regions from manuals. Taking into account the length of product manuals and the
fact that a question is always related to a small number of pages, MPMQA can be
naturally split into two subtasks: retrieving the most related pages and then
generating multimodal answers. We further propose a unified model that can
perform both subtasks together and achieves performance comparable to
multiple task-specific models. The PM209 dataset is available at
https://github.com/AIM3-RUC/MPMQA
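The retrieve-then-generate decomposition can be sketched as below, with a simple lexical-overlap retriever and a placeholder answer stage standing in for the paper's unified multimodal model.

```python
# Two-stage MPMQA skeleton: retrieve related pages, then produce a
# multimodal answer. Both stages are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    text: str
    regions: list  # annotated visual regions on the page

def score_page(question: str, page: Page) -> float:
    # Stand-in lexical overlap score; a real system would use a
    # learned multimodal retriever.
    q_tokens = set(question.lower().split())
    return len(q_tokens & set(page.text.lower().split()))

def answer(question: str, manual: list[Page], top_k: int = 2):
    # Stage 1: retrieve the few pages most related to the question.
    pages = sorted(manual, key=lambda p: score_page(question, p), reverse=True)[:top_k]
    # Stage 2: generate a textual answer plus the supporting visual regions.
    best = pages[0]
    return best.text, best.regions  # placeholder "multimodal answer"

manual = [Page(1, "press the reset button for five seconds", [("image", "fig. 3")]),
          Page(2, "battery installation and charging", [])]
print(answer("how do I reset the device?", manual))
```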
Exploring the Utility of Dutch Question Answering Datasets for Human Resource Contact Centres
We explore the use case of question answering (QA) by a contact centre for 130,000 Dutch government employees in the domain of questions about human resources (HR). HR questions can be answered using personnel files or general documentation, with the latter being the focus of the current research. We created a Dutch HR QA dataset with over 300 questions in the format of the SQuAD 2.0 dataset, which distinguishes between answerable and unanswerable questions. We applied various BERT-based models, either directly or after fine-tuning on the new dataset. The F1-scores reached 0.47 for unanswerable questions and 1.0 for answerable questions depending on the topic; however, large variations in scores were observed. We conclude that more data are needed to further improve performance on this task.
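A minimal sketch of the kind of evaluation described, running a SQuAD 2.0-style extractive model with unanswerable-question handling on a Dutch HR example; the multilingual checkpoint named here is an assumption, not necessarily one of the models used in the study.

```python
# SQuAD 2.0-style extractive QA with "no answer" handling (illustrative).
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",  # assumed multilingual SQuAD 2.0 model
)

result = qa(
    question="Hoeveel vakantiedagen krijg ik per jaar?",
    context="Medewerkers hebben recht op 24 vakantiedagen per kalenderjaar.",
    handle_impossible_answer=True,  # allow the model to predict "no answer"
)
# An empty answer string signals the model judged the question unanswerable.
print(result["answer"] or "<unanswerable>", result["score"])
```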
Leveraging Feedback in Conversational Question Answering Systems
172 pages.
The goal of this thesis is to exploit the interaction that deployed systems have with humans, using human feedback as a learning and adaptation signal for those systems. We focus on the domain shift that conversational systems undergo once deployed. To this end, we study the case of explicit binary feedback, as it is the easiest feedback signal for humans to provide. To improve systems after deployment, we first built DoQA, a dataset of question-answering conversations. The dataset contains 2,437 dialogues collected via crowdsourcing. Compared to previous work, DoQA reflects real information needs, and its conversations are more natural and coherent. After creating the dataset, we designed an algorithm called feedback-weighted learning (FWL), which is able to improve a pretrained supervised system using only binary feedback. Finally, we analyse the limits of this algorithm in cases where the collected feedback is noisy and adapt FWL to cope with the noisy scenario. The negative results we obtain in this case show the challenge of modelling noisy feedback from users, which remains an open research question.
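The intuition behind feedback-weighted learning can be sketched as follows: continue training the deployed model on its own predictions, weighted by the binary feedback each prediction received. The exact FWL weighting in the thesis differs from this simplification.

```python
# Simplified feedback-weighted update: reinforce predictions that users accepted.
import torch
from torch import nn

model = nn.Linear(8, 2)  # stand-in for a pretrained QA model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def fwl_step(inputs, predicted_labels, feedback):
    # feedback[i] is 1 if the user accepted prediction i, 0 otherwise.
    logits = model(inputs)
    per_example = nn.functional.cross_entropy(logits, predicted_labels,
                                              reduction="none")
    # Reinforce predictions with positive feedback; ignore the rest.
    loss = (per_example * feedback.float()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

x = torch.randn(4, 8)
preds = torch.tensor([0, 1, 1, 0])
fb = torch.tensor([1, 0, 1, 1])
print(fwl_step(x, preds, fb))
```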
Mitigating Data Scarcity for Neural Language Models
In recent years, pretrained neural language models (PNLMs) have taken the field of natural language processing by storm, achieving new benchmarks and state-of-the-art performances. These models often rely heavily on annotated data, which may not always be available. Data scarcity is commonly found in specialized domains, such as the medical domain, or in low-resource languages that are underexplored by AI research. In this dissertation, we focus on mitigating data scarcity using data augmentation and neural ensemble learning techniques for neural language models. In both research directions, we implement neural network algorithms and evaluate their impact on assisting neural language models in downstream NLP tasks. Specifically, for data augmentation, we explore two techniques: 1) creating positive training data by moving an answer span around its original context and 2) using text simplification techniques to introduce a variety of writing styles to the original training data. Our results indicate that these simple and effective solutions improve the performance of neural language models considerably in low-resource NLP domains and tasks. For neural ensemble learning, we use a multi-label neural classifier to select the best prediction outcome from a variety of individual pretrained neural language models trained for a low-resource medical text simplification task.
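A minimal sketch of the first augmentation technique, creating an extra positive example by relocating the answer span within its context; whitespace tokenization and uniform random placement are simplifying assumptions.

```python
# Answer-span relocation augmentation: move the answer to a new position
# in its context to create an additional positive training example.
import random

def relocate_answer(context: str, answer: str):
    # Remove the answer span, then re-insert it at a random word boundary.
    before, _, after = context.partition(answer)
    words = (before + after).split()
    pos = random.randrange(len(words) + 1)
    new_context = " ".join(words[:pos] + [answer] + words[pos:])
    return new_context, new_context.index(answer)  # new context and answer offset

ctx = "The Eiffel Tower was completed in 1889 in Paris ."
print(relocate_answer(ctx, "1889"))
```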