Understanding BLOOM: An empirical study on diverse NLP tasks
In this work, we present an evaluation of smaller BLOOM model variants
(350m/560m and 1b3/1b7) on various natural language processing tasks. This
includes GLUE (language understanding), prompt-based zero-shot and few-shot
text classification and extraction, question answering, prompt-based text
generation, and multilingual text classification, to understand the models'
strengths, weaknesses, and behavior. Empirical results show that the BLOOM variants
underperform on all GLUE tasks (except WNLI), on question answering, and on text
generation. The variants bloom for WNLI, with an accuracy of 56.3%, and for
prompt-based few-shot text extraction on the MIT Movies and ATIS datasets. On
average, the BLOOM variants achieve 7% greater accuracy than GPT-2 and GPT-Neo
models on Director and Airline Name extraction from the MIT Movies and ATIS
datasets, respectively.
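As an illustration of the prompt-based few-shot extraction setting evaluated above, the sketch below runs a small BLOOM variant through the Hugging Face text-generation pipeline. The prompt template and example sentences are illustrative assumptions, not the paper's exact prompts.

# Minimal sketch of prompt-based few-shot extraction with a small BLOOM
# variant. The prompt format and example sentences are assumptions made
# for illustration, not the templates used in the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

# Few-shot prompt: two labeled examples followed by the query.
prompt = (
    "Sentence: show me films directed by Steven Spielberg\n"
    "Director: Steven Spielberg\n"
    "Sentence: list movies that Christopher Nolan directed\n"
    "Director: Christopher Nolan\n"
    "Sentence: what has Quentin Tarantino directed recently\n"
    "Director:"
)

output = generator(prompt, max_new_tokens=8, do_sample=False)
# The extracted entity is whatever the model continues after "Director:".
print(output[0]["generated_text"][len(prompt):].strip())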
An Empirical Study on Instance Selection Strategies in Self-training for Sentiment Analysis
Sentiment analysis is a crucial task in natural language processing that
involves identifying and extracting subjective sentiment from text.
Self-training has recently emerged as an economical and efficient technique for
developing sentiment analysis models by leveraging a small amount of labeled
data and a larger amount of unlabeled data. However, the performance of a
self-training procedure heavily relies on the choice of the instance selection
strategy, which has not been studied thoroughly. This paper presents an
empirical study on various instance selection strategies for self-training on
two public sentiment datasets, and investigates the influence of the strategy
and hyper-parameters on the performance of self-training in various few-shot
settings.
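For concreteness, the sketch below shows one common instance selection strategy for self-training: confidence thresholding, where each round pseudo-labels only the unlabeled examples the current model is most confident about. The toy data, TF-IDF features, logistic regression model, and threshold value are stand-ins, not the datasets or strategies studied in the paper.

# Minimal sketch of self-training with confidence-threshold instance selection.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["great movie, loved it", "terrible plot, boring",
                 "what a fantastic film", "awful acting"]
labels = np.array([1, 0, 1, 0])
unlabeled_texts = ["one of the best films this year", "a dull and boring watch",
                   "the soundtrack was fine", "absolutely wonderful"]

vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_texts)
X_unlabeled = vectorizer.transform(unlabeled_texts)

threshold = 0.8  # confidence threshold: a key hyper-parameter of this strategy
for round_ in range(3):
    clf = LogisticRegression().fit(X_labeled, labels)
    probs = clf.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) >= threshold
    if not confident.any():
        break
    # Move confidently pseudo-labeled instances into the labeled set.
    X_labeled = vstack([X_labeled, X_unlabeled[confident]])
    labels = np.concatenate([labels, probs[confident].argmax(axis=1)])
    X_unlabeled = X_unlabeled[~confident]
    if X_unlabeled.shape[0] == 0:
        break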
HeySQuAD: A Spoken Question Answering Dataset
Human-spoken questions are critical to evaluating the performance of spoken
question answering (SQA) systems that serve several real-world use cases
including digital assistants. We present a new large-scale community-shared SQA
dataset, HeySQuAD, which consists of 76k human-spoken questions and 97k
machine-generated questions and corresponding textual answers derived from the
SQuAD QA dataset. The goal of HeySQuAD is to measure the ability of machines to
understand noisy spoken questions and answer the questions accurately. To this
end, we run extensive benchmarks on the human-spoken and machine-generated
questions to quantify the differences in noise from both sources and their
subsequent impact on model performance and answering accuracy. Importantly, for the
task of SQA, where we want to answer human-spoken questions, we observe that
training using the transcribed human-spoken and original SQuAD questions leads
to significant improvements (12.51%) over training using only the original
SQuAD textual questions.
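The training mix described above can be sketched roughly as below: combine transcribed human-spoken questions with the original SQuAD training questions before fine-tuning a QA model. The file name "heysquad_human_transcribed.json" is a hypothetical placeholder; the actual HeySQuAD release format may differ.

# Minimal sketch of combining transcribed spoken questions with original SQuAD.
from datasets import load_dataset, concatenate_datasets

original = load_dataset("squad", split="train")
transcribed = load_dataset(
    "json", data_files="heysquad_human_transcribed.json", split="train"
)

# Both splits are assumed to share SQuAD-style fields:
# id, title, context, question, answers.
combined = concatenate_datasets([original, transcribed])
print(len(original), len(transcribed), len(combined))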
Entity-Enriched Neural Models for Clinical Question Answering
We explore state-of-the-art neural models for question answering on
electronic medical records and improve their ability to generalize to
previously unseen (paraphrased) questions at test time. We enable this by
learning to predict logical forms as an auxiliary task along with the main task
of answer span detection. The predicted logical forms also serve as a rationale
for the answer. Further, we incorporate medical entity information into
these models via the ERNIE architecture. We train our models on the large-scale
emrQA dataset and observe that our multi-task entity-enriched models generalize
to paraphrased questions ~5% better than the baseline BERT model.
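A minimal sketch of the multi-task architecture described above follows: a shared encoder with an answer-span head plus an auxiliary head for the logical form. The auxiliary task is simplified here to classification over a fixed set of logical-form templates, and a plain BERT encoder stands in for the entity-enriched ERNIE variant; both simplifications are assumptions for illustration.

# Minimal sketch of multi-task QA: span detection + auxiliary logical-form prediction.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskQAModel(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_logical_forms=50):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.span_head = nn.Linear(hidden, 2)  # start/end logits per token
        self.logical_form_head = nn.Linear(hidden, num_logical_forms)

    def forward(self, input_ids, attention_mask,
                start_positions=None, end_positions=None, logical_form_labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state  # (batch, seq, hidden)
        start_logits, end_logits = self.span_head(token_states).split(1, dim=-1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
        lf_logits = self.logical_form_head(token_states[:, 0])  # [CLS] representation

        loss = None
        if start_positions is not None:
            ce = nn.CrossEntropyLoss()
            # Main span loss plus the auxiliary logical-form loss.
            loss = (ce(start_logits, start_positions)
                    + ce(end_logits, end_positions)
                    + ce(lf_logits, logical_form_labels))
        return {"loss": loss, "start_logits": start_logits,
                "end_logits": end_logits, "logical_form_logits": lf_logits}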