Development of a Corpus for User-based Scientific Question Answering
Master's thesis, Bioinformática e Biologia Computacional, Universidade de Lisboa, Faculdade de Ciências, 2021
In recent years, Question Answering (QA) tasks have become particularly relevant in the research field of natural language understanding. However, the lack of good-quality datasets has been an important limiting factor in the quest for better models. In the biomedical domain in particular, the scarcity of gold-standard labelled datasets is a recognized obstacle, since the domain's idiosyncrasies and complexities often require the participation of skilled domain-specific experts to produce such datasets. To address this issue, a method for automatically gathering question-answer pairs from online biomedical QA forums has been suggested, yielding a corpus named BiQA. The authors describe several strategies to validate this new dataset, but a manual human verification has not been conducted. With this in mind, this dissertation set out to perform a manual verification of a sample of 1,200 BiQA questions and to expand these questions, by adding features, into a new text corpus, BiQA2, with the goal of contributing a new corpus for biomedical QA research. Regarding the manual verification of BiQA, a methodology for its characterization was laid out, which allowed the identification of an array of potential problems related to the nature of its questions and the aptness of its answers, for which possible improvements were presented. Concomitantly, the proposed BiQA2 corpus, created from the validated questions and answers of the perused BiQA samples, adds new features similar to those observed in other biomedical corpora such as the BioASQ dataset. Both BiQA and BiQA2 were applied to deep learning strategies previously submitted to the BioASQ competition to assess their performance as a source of training data.
Although the results achieved with the models created using BiQA2 exhibit limited capability on the BioASQ challenge, they also show some potential to contribute positively to model training in tasks such as document re-ranking and answering 'yes/no' questions.
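The harvesting step described in the abstract can be sketched as a simple pairing rule: keep only forum questions that received an accepted, sufficiently upvoted answer. This is a minimal sketch; the field names (`is_accepted`, `question_id`, `score`) mirror a Stack Exchange-style export and are assumptions, not the thesis's actual schema.

```python
# Hypothetical sketch of a BiQA-style harvesting step: pair forum
# questions with their accepted answers. Field names are assumptions
# modelled on Stack Exchange-style data dumps.

def build_qa_pairs(questions, answers, min_score=1):
    """Keep only questions with an accepted, sufficiently upvoted
    answer, and emit (question title, answer text) pairs."""
    accepted = {
        a["question_id"]: a["body"]
        for a in answers
        if a.get("is_accepted") and a.get("score", 0) >= min_score
    }
    return [
        (q["title"], accepted[q["question_id"]])
        for q in questions
        if q["question_id"] in accepted
    ]

questions = [
    {"question_id": 1, "title": "What does BMI measure?"},
    {"question_id": 2, "title": "Is aspirin an NSAID?"},
]
answers = [
    {"question_id": 1, "body": "Body mass relative to height.",
     "is_accepted": True, "score": 5},
    {"question_id": 2, "body": "No idea.",
     "is_accepted": False, "score": 0},
]
pairs = build_qa_pairs(questions, answers)
# Only question 1 survives: question 2 has no accepted answer.
```

Filtering on acceptance and score is one cheap proxy for answer quality; the manual verification discussed above is precisely what such automatic proxies cannot replace.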
Poisoning Retrieval Corpora by Injecting Adversarial Passages
Dense retrievers have achieved state-of-the-art performance in various
information retrieval tasks, but to what extent can they be safely deployed in
real-world applications? In this work, we propose a novel attack for dense
retrieval systems in which a malicious user generates a small number of
adversarial passages by perturbing discrete tokens to maximize similarity with
a provided set of training queries. When these adversarial passages are
inserted into a large retrieval corpus, we show that this attack is highly
effective in fooling these systems to retrieve them for queries that were not
seen by the attacker. More surprisingly, these adversarial passages can
directly generalize to out-of-domain queries and corpora with a high success
attack rate -- for instance, we find that 50 generated passages optimized on
Natural Questions can mislead >94% of questions posed in financial documents or
online forums. We also benchmark and compare a range of state-of-the-art dense
retrievers, both unsupervised and supervised. Although different systems
exhibit varying levels of vulnerability, we show they can all be successfully
attacked by injecting up to 500 passages, a small fraction compared to a
retrieval corpus of millions of passages.
Comment: EMNLP 2023. Our code is available at
https://github.com/princeton-nlp/corpus-poisonin
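The core of the attack above is a discrete search: repeatedly swap tokens in a passage so the retriever scores it highly for a set of training queries. The paper does this with gradient-guided (HotFlip-style) swaps against dense retrievers; the toy sketch below substitutes an exhaustive greedy search and a bag-of-words "retriever", both of which are simplifying assumptions for illustration only.

```python
# Toy sketch of corpus poisoning: greedily swap tokens in a seed
# passage to maximize its mean similarity to a set of training
# queries. The bag-of-words scorer and tiny vocabulary stand in for
# a dense retriever and its token embedding table.

from collections import Counter

def score(passage_tokens, queries):
    """Mean token-overlap similarity between passage and queries."""
    bag = Counter(passage_tokens)
    return sum(
        sum(bag[t] for t in q.split()) / len(passage_tokens)
        for q in queries
    ) / len(queries)

def craft_adversarial(seed, queries, vocab, steps=10):
    tokens = seed.split()
    for _ in range(steps):
        best = (score(tokens, queries), None, None)
        for i in range(len(tokens)):
            for cand in vocab:  # try every single-token swap
                trial = tokens[:i] + [cand] + tokens[i + 1:]
                s = score(trial, queries)
                if s > best[0]:
                    best = (s, i, cand)
        if best[1] is None:  # no swap improves the score: converged
            break
        tokens[best[1]] = best[2]
    return " ".join(tokens)

queries = ["what is insulin", "insulin dose for diabetes"]
vocab = ["insulin", "diabetes", "dose", "banana"]
adv = craft_adversarial("the quick brown fox", queries, vocab)
```

Against a real dense retriever the inner loop ranks candidate swaps by the gradient of the query-passage similarity with respect to the token embedding, rather than trying every vocabulary item.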
Task-specific Objectives of Pre-trained Language Models for Dialogue Adaptation
Pre-trained Language Models (PrLMs) have been widely used as backbones in
lots of Natural Language Processing (NLP) tasks. The common process of
utilizing PrLMs is first pre-training on large-scale general corpora with
task-independent LM training objectives, then fine-tuning on task datasets with
task-specific training objectives. Pre-training in a task-independent way
enables the models to learn language representations that are universal to
some extent, but it fails to capture crucial task-specific features at the
same time. This leads to an incompatibility between pre-training and
fine-tuning. To address this issue, we introduce task-specific pre-training on
in-domain task-related corpora with task-specific objectives. This procedure is
placed between the original two stages to enhance the model understanding
capacity of specific tasks. In this work, we focus on Dialogue-related Natural
Language Processing (DrNLP) tasks and design a Dialogue-Adaptive Pre-training
Objective (DAPO) based on some important qualities for assessing dialogues
which are usually ignored by general LM pre-training objectives. PrLMs with
DAPO on a large in-domain dialogue corpus are then fine-tuned for downstream
DrNLP tasks. Experimental results show that models with DAPO surpass those with
general LM pre-training objectives and other strong baselines on downstream
DrNLP tasks.
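One concrete dialogue-adaptive signal of the kind described above is next-utterance coherence: from each dialogue, build positive (context, true next turn) pairs and negative pairs using a turn drawn from a different dialogue. The actual DAPO combines several dialogue qualities; this single objective and the negative-sampling scheme below are illustrative assumptions.

```python
# Sketch of one dialogue-adaptive pre-training signal: next-utterance
# coherence. Positives pair a context with its true next turn;
# negatives pair it with a turn from another dialogue.

import random

def coherence_examples(dialogues, seed=0):
    rng = random.Random(seed)  # seeded for reproducible negatives
    examples = []
    for d_idx, turns in enumerate(dialogues):
        for i in range(1, len(turns)):
            context, positive = turns[:i], turns[i]
            examples.append((context, positive, 1))
            # negative: a randomly chosen turn from another dialogue
            other = dialogues[(d_idx + 1) % len(dialogues)]
            examples.append((context, rng.choice(other), 0))
    return examples

dialogues = [
    ["hi", "hello, how can I help?", "my order is late"],
    ["what time do you close?", "we close at 9pm"],
]
examples = coherence_examples(dialogues)
# Each context yields one coherent and one incoherent candidate.
```

A PrLM trained to classify these pairs between generic pre-training and fine-tuning sees exactly the intermediate, task-specific stage the abstract argues for.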
A Survey on Biomedical Text Summarization with Pre-trained Language Model
The exponential growth of biomedical texts such as biomedical literature and
electronic health records (EHRs), provides a big challenge for clinicians and
researchers to access clinical information efficiently. To address the problem,
biomedical text summarization has been proposed to support clinical information
retrieval and management, aiming at generating concise summaries that distill
key information from single or multiple biomedical documents. In recent years,
pre-trained language models (PLMs) have been the de facto standard of various
natural language processing tasks in the general domain. Most recently, PLMs
have been further investigated in the biomedical field and brought new insights
into the biomedical text summarization task. In this paper, we systematically
summarize recent advances that explore PLMs for biomedical text summarization,
to help understand recent progress, challenges, and future directions. We
categorize PLMs-based approaches according to how they utilize PLMs and what
PLMs they use. We then review available datasets, recent approaches and
evaluation metrics of the task. We finally discuss existing challenges and
promising future directions. To facilitate the research community, we line up
open resources including available datasets, recent approaches, codes,
evaluation metrics, and the leaderboard in a public project:
https://github.com/KenZLuo/Biomedical-Text-Summarization-Survey/tree/master.
Comment: 19 pages, 6 figures, TKDE under review
Designing a Healthcare QA Assistant: A Knowledge Based Approach
Question answering (QA) assistants are vital tools for addressing users' information needs in healthcare. Knowledge graphs (KGs) and language models (LMs) have shown promise in building QA systems, but they face challenges in integration and performance. Motivated by this, we take the case of a specific disease, skin eczema, to design a QA system combining KG and LM approaches. We present design iterations for systematically developing the KG, then fine-tuning an LM, and finally carrying out joint reasoning over both. We observe that while KGs are effective for fact finding, fine-tuned LMs perform better at answering complex queries. Initial results suggest that combining KG and LM approaches can improve the performance of the system. Our study contributes by laying out the design steps and developing a QA system that addresses various gaps in the related literature. Our future plan is to refine these techniques towards building a full-fledged healthcare QA assistant.
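The division of labour observed above, KGs for fact finding and LMs for complex queries, suggests a simple routing rule: answer from the graph when a matching triple exists, otherwise defer to the language model. The triple schema and routing logic below are assumptions for illustration, not the study's exact design, and the LM is stubbed out.

```python
# Hedged sketch of KG-first, LM-fallback question answering.
# Simple factual lookups are served from a triple store; anything
# the graph cannot resolve goes to a (stubbed) language model.

KG = {
    ("eczema", "symptom"): "itchy, inflamed skin",
    ("eczema", "common_trigger"): "dry skin and irritants",
}

def lm_answer(question):
    """Stand-in for a fine-tuned LM handling complex queries."""
    return f"[LM] generated answer for: {question}"

def answer(entity, relation, question):
    fact = KG.get((entity, relation))
    if fact is not None:        # fact finding: trust the KG
        return fact
    return lm_answer(question)  # complex query: defer to the LM

print(answer("eczema", "symptom", "What are eczema symptoms?"))
```

A fuller system would also verify LM outputs against the graph (joint reasoning) rather than treating the two sources as a strict fallback chain.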
Rhetorical relations for information retrieval
Typically, every part in most coherent text has some plausible reason for its
presence, some function that it performs to the overall semantics of the text.
Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts
of a text are linked to each other. Knowledge about this so-called discourse
structure has been applied successfully to several natural language processing
tasks. This work studies the use of rhetorical relations for Information
Retrieval (IR): Is there a correlation between certain rhetorical relations and
retrieval performance? Can knowledge about a document's rhetorical relations be
useful to IR? We present a language model modification that considers
rhetorical relations when estimating the relevance of a document to a query.
Empirical evaluation of different versions of our model on TREC settings shows
that certain rhetorical relations can benefit retrieval effectiveness notably
(>10% in mean average precision over a state-of-the-art baseline).
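The modification described above can be pictured as a query-likelihood model in which term counts coming from spans covered by "useful" rhetorical relations are up-weighted before smoothing. The boost factor, the relation set, and the smoothing constant below are assumptions for illustration, not the paper's fitted parameters.

```python
# Sketch of rhetorical-relation-aware query likelihood: term counts
# from spans labelled with favoured relations (e.g. cause, contrast)
# are up-weighted before computing a smoothed unigram score.

from collections import Counter

def score(query, spans, boost=2.0, mu=1.0, vocab_size=1000):
    """spans: list of (text, relation) pairs for one document."""
    counts = Counter()
    for text, relation in spans:
        w = boost if relation in {"cause", "contrast"} else 1.0
        for tok in text.lower().split():
            counts[tok] += w
    total = sum(counts.values())
    s = 1.0
    for tok in query.lower().split():  # smoothed unigram likelihood
        s *= (counts[tok] + mu) / (total + mu * vocab_size)
    return s

doc = [("rain causes flooding", "cause"),
       ("the town is small", "elaboration")]
# Terms inside the cause span now contribute more to relevance
# than terms inside the elaboration span.
```

In this sketch, matching a query term inside a cause span counts double, which is one way a document's discourse structure can shift its relevance estimate.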