MDDial: A Multi-turn Differential Diagnosis Dialogue Dataset with Reliability Evaluation
Dialogue systems for Automatic Differential Diagnosis (ADD) have a wide range
of real-life applications. These dialogue systems are promising for providing
easy access and reducing medical costs. Building end-to-end ADD dialogue
systems requires dialogue training datasets. However, to the best of our
knowledge, there is no publicly available ADD dialogue dataset in English
(although non-English datasets exist). Driven by this, we introduce MDDial, the first differential diagnosis dialogue dataset in English, which can aid in building and evaluating end-to-end ADD dialogue systems. Additionally, earlier studies
present the accuracy of diagnosis and symptoms either individually or as a
combined weighted score. This method overlooks the connection between the
symptoms and the diagnosis. We introduce a unified score for the ADD system
that takes into account the interplay between symptoms and diagnosis. This
score also indicates the system's reliability. To this end, we train two moderate-size language models on MDDial. Our experiments suggest that while these language models can perform well on many natural language understanding tasks, including dialogue tasks in the general domain, they struggle to relate relevant symptoms and diseases and thus perform poorly on MDDial. MDDial will be released publicly to aid ADD dialogue research.
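The abstract does not give the exact formulation of the unified score, but a minimal sketch of the underlying idea, granting diagnosis credit only in proportion to the relevant symptoms actually elicited, might look as follows (the multiplicative combination and the data fields are assumptions for illustration, not MDDial's actual metric):

```python
# Hypothetical sketch of a "unified" diagnosis score that couples diagnosis
# correctness with the symptoms the system actually asked about.
# The combination rule below is an illustrative assumption, not MDDial's metric.

def unified_score(pred_diagnosis: str,
                  gold_diagnosis: str,
                  elicited_symptoms: set[str],
                  gold_symptoms: set[str]) -> float:
    """Return a score in [0, 1] that rewards a correct diagnosis only to the
    extent that the supporting symptoms were actually elicited in the dialogue."""
    if not gold_symptoms:
        symptom_recall = 0.0
    else:
        symptom_recall = len(elicited_symptoms & gold_symptoms) / len(gold_symptoms)
    diagnosis_correct = 1.0 if pred_diagnosis == gold_diagnosis else 0.0
    # A diagnosis reached without eliciting the relevant symptoms is treated as
    # unreliable, so the two factors are multiplied rather than averaged.
    return diagnosis_correct * symptom_recall


# Example: correct diagnosis, but only half of the relevant symptoms were asked about.
print(unified_score("influenza", "influenza", {"fever"}, {"fever", "cough"}))  # 0.5
```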
How Many Data Samples is an Additional Instruction Worth?
The recently introduced instruction paradigm empowers non-expert users to
leverage NLP resources by defining a new task in natural language.
Instruction-tuned models have significantly outperformed multitask learning models (without instructions); however, they are far from state-of-the-art task-specific models. Conventional approaches to improving model performance, such as creating datasets with a large number of task instances or making architectural changes to the model, may not be feasible for non-expert users. However, such users can write alternative instructions to represent a task. Is instruction augmentation helpful? We augment a subset of tasks in the expanded
version of NATURAL INSTRUCTIONS with additional instructions and find that it
significantly improves model performance (up to 35%), especially in the
low-data regime. Our results indicate that an additional instruction can be
equivalent to ~200 data samples on average across tasks. Comment: EACL 2023 Findings
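For illustration, a minimal sketch of instruction augmentation, pairing each task instance with every available instruction paraphrase to enlarge the training set, could look like this (the field names and prompt template are assumptions, not the NATURAL INSTRUCTIONS schema):

```python
# Illustrative instruction augmentation: each alternate instruction for a task
# is paired with the task's instances, multiplying the number of training
# examples without collecting new data. Field names are assumptions.

def augment_with_instructions(instances, instructions):
    """Build (prompt, target) pairs from task instances and instruction paraphrases."""
    examples = []
    for instruction in instructions:
        for instance in instances:
            prompt = f"{instruction}\n\nInput: {instance['input']}\nOutput:"
            examples.append((prompt, instance["output"]))
    return examples


task_instances = [{"input": "The movie was dull.", "output": "negative"}]
task_instructions = [
    "Classify the sentiment of the review as positive or negative.",
    "Decide whether the reviewer liked the movie; answer positive or negative.",
]
print(len(augment_with_instructions(task_instances, task_instructions)))  # 2
```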
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models
Logical reasoning is fundamental for humans yet presents a substantial
challenge in the domain of Artificial Intelligence. Initially, researchers used
Knowledge Representation and Reasoning (KR) systems that did not scale and
required non-trivial manual effort. Recently, the emergence of large language
models (LLMs) has demonstrated the ability to overcome various limitations of
formal Knowledge Representation (KR) systems. Consequently, there's a growing
interest in using LLMs for logical reasoning via natural language. This work
strives to understand the proficiency of LLMs in logical reasoning by offering
a brief review of the latest progress in this area, with a focus on the logical
reasoning datasets, tasks, and the methods adopted to utilize LLMs for
reasoning. To offer a thorough analysis, we have compiled a benchmark titled
LogiGLUE. This includes 24 varied datasets encompassing deductive, abductive,
and inductive reasoning. Utilizing LogiGLUE as a foundation, we have trained an
instruction fine-tuned language model, resulting in LogiT5. We study
single-task training, multi-task training, and a "chain-of-thought" knowledge distillation fine-tuning technique to assess the model's performance across the different logical reasoning categories. We also assess various LLMs using
LogiGLUE, and the findings indicate that LLMs excel most in abductive
reasoning, followed by deductive reasoning, while they are least effective at
inductive reasoning. We aim to shed light on the capabilities and potential
pathways for enhancing logical reasoning proficiency in LLMs, paving the way
for more advanced and nuanced developments in this critical field. Comment: Work in progress
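As a rough illustration of the kind of multi-task instruction tuning described above, the sketch below casts heterogeneous logical-reasoning examples into a single instruction-style text-to-text format (the templates, category names, and fields are illustrative assumptions rather than LogiGLUE's actual schema):

```python
# Sketch of casting heterogeneous logical-reasoning examples into one
# instruction-style text-to-text format for multi-task fine-tuning of a
# seq2seq model. Dataset categories, fields, and templates are illustrative
# assumptions, not LogiGLUE's actual schema.

TEMPLATES = {
    "deductive": "Given the premises, does the conclusion follow? Answer yes or no.",
    "abductive": "Which hypothesis best explains the observations?",
    "inductive": "What general rule is most consistent with the examples?",
}

def to_seq2seq(example: dict, category: str) -> tuple[str, str]:
    """Format a raw example as an (input, target) pair for instruction tuning."""
    source = (f"{TEMPLATES[category]}\n\n{example['context']}\n\n"
              f"Question: {example['question']}")
    return source, example["answer"]


pair = to_seq2seq(
    {"context": "All birds have wings. Tweety is a bird.",
     "question": "Does Tweety have wings?",
     "answer": "yes"},
    "deductive",
)
print(pair[0])
```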
TarGEN: Targeted Data Generation with Large Language Models
The rapid advancement of large language models (LLMs) has sparked interest in
data synthesis techniques, aiming to generate diverse and high-quality
synthetic datasets. However, these synthetic datasets often suffer from a lack
of diversity and added noise. In this paper, we present TarGEN, a multi-step
prompting strategy for generating high-quality synthetic datasets utilizing an LLM. An advantage of TarGEN is its seedless nature; it does not require
specific task instances, broadening its applicability beyond task replication.
We augment TarGEN with a method known as self-correction, which empowers LLMs to
rectify inaccurately labeled instances during dataset creation, ensuring
reliable labels. To assess our technique's effectiveness, we emulate 8 tasks
from the SuperGLUE benchmark and finetune various language models, including
encoder-only, encoder-decoder, and decoder-only models on both synthetic and
original training sets. Evaluation on the original test set reveals that models
trained on datasets generated by TarGEN perform approximately 1-2 percentage points better than those trained on the original datasets (82.84% on synthetic vs. 81.12% on original data using Flan-T5). When incorporating instruction tuning, the performance increases to 84.54% on synthetic data vs. 81.49% on original data with Flan-T5. A
comprehensive analysis of the synthetic dataset compared to the original
dataset reveals that the synthetic dataset demonstrates similar or higher
levels of dataset complexity and diversity. Furthermore, the synthetic dataset
displays a bias level that aligns closely with the original dataset. Finally,
when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive
results on the OpenLLM leaderboard, surpassing the model trained on the
Self-Instruct dataset by 4.14 percentage points. We hope that TarGEN can be helpful for quality data generation and for reducing the human effort needed to create complex benchmarks. Comment: 10 pages, 6 tables, 5 figures, 5 pages of references, a 17-page appendix
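A minimal sketch of a seedless, multi-step generation loop with a self-correction pass might look like the following; the `call_llm` placeholder, prompts, and pipeline steps are assumptions for illustration, not TarGEN's actual prompts:

```python
# Sketch of a seedless multi-step generation loop with a self-correction pass.
# `call_llm` is a placeholder for any text-completion API; the prompts and
# pipeline steps are illustrative assumptions, not TarGEN's exact method.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call (any hosted or local model)."""
    raise NotImplementedError

def generate_instance(task_description: str, label: str) -> str:
    """Generate a fresh instance for a target label, without seed examples."""
    return call_llm(
        f"Task: {task_description}\nWrite one new input whose correct label is '{label}'."
    )

def self_correct(task_description: str, instance: str, label: str) -> str:
    """Ask the model to verify the proposed label and fix it if it is wrong."""
    verdict = call_llm(
        f"Task: {task_description}\nInput: {instance}\nProposed label: {label}\n"
        "Is the proposed label correct? Reply with the correct label only."
    )
    return verdict.strip()

def synthesize(task_description: str, labels: list[str], n_per_label: int) -> list[dict]:
    """Build a labeled synthetic dataset, one generation + correction pass per item."""
    data = []
    for label in labels:
        for _ in range(n_per_label):
            instance = generate_instance(task_description, label)
            data.append({"input": instance,
                         "label": self_correct(task_description, instance, label)})
    return data
```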
Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer?
Though state-of-the-art (SOTA) NLP systems have achieved remarkable
performance on a variety of language understanding tasks, they primarily focus
on questions that have a correct and a definitive answer. However, in
real-world applications, users often ask questions that don't have a definitive
answer. Incorrectly answering such questions certainly hampers a system's
reliability and trustworthiness. Can SOTA models accurately identify such
questions and provide a reasonable response?
To investigate the above question, we introduce QnotA, a dataset consisting
of five different categories of questions that don't have definitive answers.
Furthermore, for each QnotA instance, we also provide a corresponding QA
instance, i.e., an alternate question that "can be" answered. With this data,
we formulate three evaluation tasks that test a system's ability to 'identify',
'distinguish', and 'justify' QnotA questions. Through comprehensive
experiments, we show that even SOTA models including GPT-3 and Flan T5 do not
fare well on these tasks and lag considerably behind the human performance
baseline. We conduct a thorough analysis which further leads to several
interesting findings. Overall, we believe our work and findings will encourage
and facilitate further research in this important area and help develop more
robust models. Comment: TrustNLP Workshop at ACL 2023
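As an illustration of the 'identify' style of evaluation, a hedged sketch could prompt a model to say whether each question has a definitive answer and score the replies against gold labels (the prompt wording, data fields, and `call_llm` placeholder are assumptions, not the paper's exact setup):

```python
# Sketch of an 'identify' evaluation: ask a model whether a question has a
# definitive answer and score the predictions against gold labels.
# The prompt wording and `call_llm` placeholder are assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any LLM completion call

def identify_accuracy(questions: list[dict]) -> float:
    """Each item carries 'text' and a boolean 'answerable' gold label."""
    correct = 0
    for q in questions:
        reply = call_llm(
            f"Question: {q['text']}\n"
            "Does this question have a single definitive answer? Answer yes or no."
        )
        predicted_answerable = reply.strip().lower().startswith("yes")
        correct += int(predicted_answerable == q["answerable"])
    return correct / len(questions) if questions else 0.0
```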
Is a Question Decomposition Unit All We Need?
Large Language Models (LMs) have achieved state-of-the-art performance on
many Natural Language Processing (NLP) benchmarks. With the growing number of
new benchmarks, we build bigger and more complex LMs. However, building new LMs
may not be an ideal option owing to the cost, time and environmental impact
associated with it. We explore an alternative route: can we modify data by
expressing it in terms of the model's strengths, so that a question becomes
easier for models to answer? We investigate if humans can decompose a hard
question into a set of simpler questions that are relatively easier for models
to solve. We analyze a range of datasets involving various forms of reasoning
and find that it is indeed possible to significantly improve model performance
(24% for GPT-3 and 29% for RoBERTa-SQuAD along with a symbolic calculator) via
decomposition. Our approach provides a viable option to involve people in NLP
research in a meaningful way. Our findings indicate that Human-in-the-loop
Question Decomposition (HQD) can potentially provide an alternate path to
building large LMs. Comment: 16 pages
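To illustrate the general shape of answering a human-written decomposition step by step, the sketch below routes arithmetic sub-questions to a symbolic calculator and the rest to a QA model; the 'CALC:' routing convention and the `qa_model` placeholder are assumptions, not the paper's protocol:

```python
# Sketch of answering a human-written question decomposition step by step,
# routing arithmetic sub-steps to a symbolic calculator and everything else to
# a QA model. The routing rule and `qa_model` placeholder are assumptions.

import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> str:
    """Safely evaluate a simple arithmetic expression such as '12 * 4 + 3'."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

def qa_model(question: str, context: str) -> str:
    raise NotImplementedError  # stand-in for, e.g., a RoBERTa-SQuAD reader

def answer_decomposition(sub_questions: list[str], context: str) -> str:
    """Answer each sub-question in turn; '#1', '#2', ... refer to earlier answers."""
    answers = []
    for sub in sub_questions:
        for i, prev in enumerate(answers, start=1):
            sub = sub.replace(f"#{i}", prev)
        if sub.startswith("CALC:"):
            answers.append(calculator(sub[len("CALC:"):]))
        else:
            answers.append(qa_model(sub, context))
    return answers[-1]
```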
Control of Collagen Stability and Heterotrimer Specificity through Repulsive Electrostatic Interactions
Charge-pair interactions between acidic and basic residues on the surface of collagen can promote stability as well as control specificity of molecular recognition. Heterotrimeric collagen peptides have been engineered de novo using either rational or computational methods, which in both cases optimize networks of favorable charge-pair interactions in the target structure. Less understood is the role of electrostatic repulsion between groups of like charge in destabilizing structure or directing molecular recognition. To study this, we apply a “charge crowding” approach, in which repulsive interactions between multiple aspartate side chains are found to destabilize the homotrimer states in a triple-helical peptide system and can be utilized to promote the formation of heterotrimers. Neutralizing surface charge by increasing the salt concentration or decreasing the pH can enhance homotrimer stability, confirming the role of charge crowding in the destabilization of homotrimers via electrostatic repulsion. Charge crowding may be used in conjunction with other approaches to create specific collagen heterotrimers.