Crowdsourcing Multiple Choice Science Questions
We present a novel method for obtaining high-quality, domain-targeted
multiple choice questions from crowd workers. Generating these questions can be
difficult without trading away originality, relevance or diversity in the
answer options. Our method addresses these problems by leveraging a large
corpus of domain-specific text and a small set of existing questions. It
produces model suggestions for document selection and answer distractor choice
which aid the human question generation process. With this method we have
assembled SciQ, a dataset of 13.7K multiple choice science exam questions
(Dataset available at http://allenai.org/data.html). We demonstrate that the
method produces in-domain questions by providing an analysis of this new
dataset and by showing that humans cannot distinguish the crowdsourced
questions from original questions. When using SciQ as additional training data
alongside existing questions, we observe accuracy improvements on real science exams.
Comment: accepted for the Workshop on Noisy User-generated Text (W-NUT) 2017
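The paper pairs model suggestions with human authoring; as a rough illustration of the distractor-suggestion step, the sketch below ranks candidate terms from a domain corpus by surface similarity to the correct answer and keeps near-but-not-equal terms. The character n-gram TF-IDF scorer and the helper names (suggest_distractors, candidate_terms) are illustrative assumptions, not the paper's actual model.

```python
# Illustrative sketch: rank corpus terms by surface similarity to the
# correct answer and surface near-but-not-equal terms as distractor
# suggestions for crowd workers. The scorer is a simple stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def suggest_distractors(answer, candidate_terms, k=3):
    # Character n-gram TF-IDF picks up lookalike terms (e.g. shared roots).
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    mat = vec.fit_transform([answer] + candidate_terms)
    sims = cosine_similarity(mat[0], mat[1:]).ravel()
    ranked = sorted(zip(candidate_terms, sims), key=lambda pair: -pair[1])
    # Drop exact duplicates of the answer; keep plausible lookalikes.
    return [term for term, _ in ranked if term.lower() != answer.lower()][:k]

print(suggest_distractors("mitochondria",
                          ["chloroplast", "mitosis", "ribosome", "gravity"]))
```

In the paper's setting the candidate pool would come from the domain-specific corpus, and the ranked suggestions are shown to crowd workers as aids rather than used verbatim.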
Revita: a System for Language Learning and Supporting Endangered Languages
We describe a computational system for language learning and for supporting endangered languages. The platform gives users the opportunity to improve their competency through active language use. It currently works with several endangered Finno-Ugric languages, as well as Yakut, Finnish, Swedish, and Russian. This paper describes the current stage of ongoing development.
Peer reviewed
DISTO: Evaluating Textual Distractors for Multi-Choice Questions using Negative Sampling based Approach
Multiple choice questions (MCQs) are an efficient and common way to assess
reading comprehension (RC). Every MCQ needs a set of distractor answers that
are incorrect, but plausible enough to test student knowledge. Distractor
generation (DG) models have been proposed, and their performance is typically
evaluated using machine translation (MT) metrics. However, MT metrics often
misjudge the suitability of generated distractors. We propose DISTO: the first
learned evaluation metric for generated distractors. We validate DISTO by
showing its scores correlate highly with human ratings of distractor quality.
At the same time, DISTO ranks the performance of state-of-the-art DG models
very differently from MT-based metrics, showing that MT metrics should not be
used for distractor evaluation.
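As a rough sketch of how a learned metric in this spirit can be trained with negative sampling: human-authored distractors serve as positives, and answers drawn from unrelated questions serve as sampled negatives. The input format and helper below are assumptions for illustration, not DISTO's published recipe.

```python
# Illustrative sketch: build training pairs for a learned distractor
# metric via negative sampling. Real distractors are positives; answers
# from other questions act as sampled negatives.
import random

def build_training_pairs(mcqs, seed=0):
    """mcqs: list of dicts with keys 'question', 'answer', 'distractors'."""
    rng = random.Random(seed)
    pairs = []
    for i, q in enumerate(mcqs):
        context = f"{q['question']} [SEP] {q['answer']}"
        for d in q["distractors"]:
            pairs.append((context, d, 1))            # human distractor -> positive
        other = rng.choice([m for j, m in enumerate(mcqs) if j != i])
        pairs.append((context, other["answer"], 0))  # sampled negative
    return pairs
```

A text-pair classifier (for example, a fine-tuned BERT) trained on such pairs can then score a generated distractor by its positive-class probability.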
Training Datasets for Machine Reading Comprehension and Their Limitations
Neural networks are a powerful model class for learning machine Reading Comprehension (RC), yet they crucially depend on the availability of suitable training datasets. In this thesis we describe methods for data collection, evaluate the performance of established models, and examine a number of model behaviours and dataset limitations.

We first describe the creation of a data resource for the science exam QA domain, and compare existing models on the resulting dataset. The collected questions are plausible – non-experts can distinguish them from real exam questions with 55% accuracy – and using them as additional training data leads to improved model scores on real science exam questions.

Second, we describe and apply a distant supervision dataset construction method for multi-hop RC across documents. We identify and mitigate several dataset assembly pitfalls – a lack of unanswerable candidates, label imbalance, and spurious correlations between documents and particular candidates – which often leave shallow predictive cues for the answer. Furthermore, we demonstrate that selecting relevant document combinations is a critical performance bottleneck on the datasets created. We thus investigate Pseudo-Relevance Feedback, which leads to improvements over TF-IDF-based document combination selection in both retrieval metrics and answer accuracy.

Third, we investigate model undersensitivity: model predictions do not change when given adversarially altered questions in SQuAD 2.0 and NewsQA, even though they should. We characterise affected samples, and show that the phenomenon is related to a lack of structurally similar but unanswerable samples during training: data augmentation reduces the adversarial error rate, e.g. from 51.7% to 20.7% for a BERT model on SQuAD 2.0, and also improves robustness in other settings.

Finally, we explore efficient formal model verification via Interval Bound Propagation (IBP) to measure and address model undersensitivity, and show that using an IBP-derived auxiliary loss can improve verification rates, e.g. from 2.8% to 18.4% on the SNLI test set.
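As a rough illustration of the Pseudo-Relevance Feedback idea used for document combination selection: retrieve with TF-IDF, expand the query with the strongest terms from the top-ranked documents, and retrieve again. This is a minimal sketch of generic PRF, not the thesis's exact pipeline; the function name and parameters are assumptions.

```python
# Illustrative sketch of Pseudo-Relevance Feedback over TF-IDF retrieval:
# retrieve once, expand the query with top terms from the best-ranked
# documents, then retrieve again with the expanded query.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def prf_retrieve(query, docs, k_feedback=2, n_terms=5, top_k=3):
    vec = TfidfVectorizer(stop_words="english")
    doc_mat = vec.fit_transform(docs)
    scores = (doc_mat @ vec.transform([query]).T).toarray().ravel()
    # Treat the best-ranked documents as (pseudo-)relevant feedback.
    feedback = doc_mat[np.argsort(-scores)[:k_feedback]].sum(axis=0)
    weights = np.asarray(feedback).ravel()
    vocab = vec.get_feature_names_out()
    expansion = " ".join(vocab[t] for t in weights.argsort()[::-1][:n_terms])
    # Re-retrieve with the expanded query.
    scores2 = (doc_mat @ vec.transform([query + " " + expansion]).T).toarray().ravel()
    return np.argsort(-scores2)[:top_k]
```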
A Survey of Natural Language Generation
This paper offers a comprehensive review of the research on Natural Language
Generation (NLG) over the past two decades, especially in relation to
data-to-text generation and text-to-text generation deep learning methods, as
well as new applications of NLG technology. This survey aims to (a) give the
latest synthesis of deep learning research on the NLG core tasks, as well as
the architectures adopted in the field; (b) detail meticulously and
comprehensively various NLG tasks and datasets, and draw attention to the
challenges in NLG evaluation, focusing on different evaluation methods and
their relationships; and (c) highlight directions of future emphasis and relatively
recent research issues that arise from the increasing synergy between NLG and other
artificial intelligence areas, such as computer vision, text, and computational
creativity.
Comment: Accepted by ACM Computing Surveys (CSUR), 2022
CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering
The task of zero-shot commonsense question answering evaluates models on
their capacity to reason about general scenarios beyond those presented in
specific datasets. Existing approaches for tackling this task leverage external
knowledge from CommonSense Knowledge Bases (CSKBs) by pretraining the model on
synthetic QA pairs constructed from CSKBs. In these approaches, negative
examples (distractors) are formulated by randomly sampling from CSKBs using
fairly primitive keyword constraints. However, two bottlenecks limit these
approaches: the inherent incompleteness of CSKBs limits the semantic coverage
of synthetic QA pairs, and the lack of human annotations makes the sampled
negative examples potentially uninformative and contradictory. To tackle these
limitations above, we propose Conceptualization-Augmented Reasoner (CAR), a
zero-shot commonsense question-answering framework that fully leverages the
power of conceptualization. Specifically, CAR abstracts a commonsense knowledge
triple to many higher-level instances, which increases the coverage of CSKB and
expands the ground-truth answer space, reducing the likelihood of selecting
false-negative distractors. Extensive experiments demonstrate that CAR more
robustly generalizes to answering questions about zero-shot commonsense
scenarios than existing methods, including large language models, such as
GPT-3.5 and ChatGPT. Our code, data, and model checkpoints are available at
https://github.com/HKUST-KnowComp/CAR
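A minimal sketch of the underlying filtering idea: abstract the gold answer to higher-level concepts and reject any sampled negative that falls inside this expanded answer space. The taxonomy dictionary stands in for a real conceptualization resource, and the filtering rule is an assumption rather than CAR's exact procedure.

```python
# Illustrative sketch of conceptualization-based distractor filtering:
# expand the gold answer with its higher-level concepts, then reject
# negative samples that land inside the expanded answer space.
def expand_answer_space(answer, taxonomy):
    """Return the answer plus its higher-level conceptualizations."""
    return {answer} | set(taxonomy.get(answer, []))

def filter_negatives(candidates, answer, taxonomy):
    valid_space = expand_answer_space(answer, taxonomy)
    # Keep only candidates outside the expanded answer space,
    # reducing the chance of false-negative distractors.
    return [c for c in candidates if c not in valid_space]

taxonomy = {"coffee": ["beverage", "drink", "stimulant"]}
print(filter_negatives(["tea", "beverage", "rock"], "coffee", taxonomy))
# ['tea', 'rock'] -- 'beverage' is excluded as a plausible true answer
```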
Generalized Relation Modeling for Transformer Tracking
Compared with previous two-stream trackers, the recent one-stream tracking
pipeline, which allows earlier interaction between the template and search
region, has achieved a remarkable performance gain. However, existing
one-stream trackers always let the template interact with all parts inside the
search region throughout all the encoder layers. This could potentially lead to
target-background confusion when the extracted feature representations are not
sufficiently discriminative. To alleviate this issue, we propose a generalized
relation modeling method based on adaptive token division. The proposed method
is a generalized formulation of attention-based relation modeling for
Transformer tracking, which inherits the merits of both previous two-stream and
one-stream pipelines whilst enabling more flexible relation modeling by
selecting appropriate search tokens to interact with template tokens. An
attention masking strategy and the Gumbel-Softmax technique are introduced to
facilitate the parallel computation and end-to-end learning of the token
division module. Extensive experiments show that our method is superior to the
two-stream and one-stream pipelines and achieves state-of-the-art performance
on six challenging benchmarks with a real-time running speed.
Comment: Accepted by CVPR 2023. Code and models are publicly available at
https://github.com/Little-Podi/GR
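A rough PyTorch sketch of the adaptive token division idea: score each search token, draw a hard (straight-through) assignment with Gumbel-Softmax, and use it as a mask deciding which search tokens may interact with template tokens. Module names and dimensions are assumptions for illustration, not the released GRM code.

```python
# Illustrative sketch of adaptive token division with Gumbel-Softmax:
# each search token is assigned to "search-only" or "interact" and the
# hard sample drives an attention mask over template<->search relations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenDivision(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 2)  # logits: [search-only, interact]

    def forward(self, search_tokens, tau=1.0):
        logits = self.score(search_tokens)           # (B, N, 2)
        # Hard one-hot in the forward pass; gradients flow through the
        # straight-through Gumbel-Softmax estimator, enabling end-to-end
        # learning of the division module.
        assign = F.gumbel_softmax(logits, tau=tau, hard=True)
        return assign[..., 1]                        # (B, N) in {0, 1}

# Usage: block template interaction for "search-only" tokens.
div = TokenDivision(dim=256)
search = torch.randn(2, 100, 256)   # B=2, N=100 search tokens
interact = div(search)
blocked = interact == 0             # True where template attention is masked
```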