22 research outputs found

    Understanding BLOOM: An empirical study on diverse NLP tasks

    Full text link
    In this work, we present an evaluation of smaller BLOOM model variants (350m/560m and 1b3/1b7) on various natural language processing tasks. This includes GLUE - language understanding, prompt-based zero-shot and few-shot text classification and extraction, question answering, prompt-based text generation, and multi-lingual text classification to understand model strengths/weaknesses and behavior. Empirical results show that BLOOM variants under-perform on all GLUE tasks (except WNLI), question-answering, and text generation. The variants bloom for WNLI, with an accuracy of 56.3%, and for prompt-based few-shot text extraction on MIT Movies and ATIS datasets. The BLOOM variants on average have 7% greater accuracy over GPT-2 and GPT-Neo models on Director and Airline Name extraction from MIT Movies and ATIS datasets, respectively

    An Empirical Study on Instance Selection Strategies in Self-training for Sentiment Analysis

    Full text link
    Sentiment analysis is a crucial task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training has recently emerged as an economical and efficient technique for developing sentiment analysis models by leveraging a small amount of labeled data and a larger amount of unlabeled data. However, the performance of a self-training procedure heavily relies on the choice of the instance selection strategy, which has not been studied thoroughly. This paper presents an empirical study on various instance selection strategies for self-training on two public sentiment datasets, and investigates the influence of the strategy and hyper-parameters on the performance of self-training in various few-shot settings.Comment: 6 pages, 2 figure

    HeySQuAD: A Spoken Question Answering Dataset

    Full text link
    Human-spoken questions are critical to evaluating the performance of spoken question answering (SQA) systems that serve several real-world use cases including digital assistants. We present a new large-scale community-shared SQA dataset, HeySQuAD that consists of 76k human-spoken questions and 97k machine-generated questions and corresponding textual answers derived from the SQuAD QA dataset. The goal of HeySQuAD is to measure the ability of machines to understand noisy spoken questions and answer the questions accurately. To this end, we run extensive benchmarks on the human-spoken and machine-generated questions to quantify the differences in noise from both sources and its subsequent impact on the model and answering accuracy. Importantly, for the task of SQA, where we want to answer human-spoken questions, we observe that training using the transcribed human-spoken and original SQuAD questions leads to significant improvements (12.51%) over training using only the original SQuAD textual questions

    Entity-Enriched Neural Models for Clinical Question Answering

    Full text link
    We explore state-of-the-art neural models for question answering on electronic medical records and improve their ability to generalize better on previously unseen (paraphrased) questions at test time. We enable this by learning to predict logical forms as an auxiliary task along with the main task of answer span detection. The predicted logical forms also serve as a rationale for the answer. Further, we also incorporate medical entity information in these models via the ERNIE architecture. We train our models on the large-scale emrQA dataset and observe that our multi-task entity-enriched models generalize to paraphrased questions ~5% better than the baseline BERT model

    Advancing Seq2seq with Joint Paraphrase Learning

    No full text
    corecore