Constructing Datasets for Multi-hop Reading Comprehension Across Documents
Most Reading Comprehension methods limit themselves to queries which can be
answered using a single sentence, paragraph, or document. Enabling models to
combine disjoint pieces of textual evidence would extend the scope of machine
comprehension methods, but currently there exist no resources to train and test
this capability. We propose a novel task to encourage the development of models
for text understanding across multiple documents and to investigate the limits
of existing methods. In our task, a model learns to seek and combine evidence -
effectively performing multi-hop (alias multi-step) inference. We devise a
methodology to produce datasets for this task, given a collection of
query-answer pairs and thematically linked documents. Two datasets from
different domains are induced, and we identify potential pitfalls and devise
circumvention strategies. We evaluate two previously proposed competitive
models and find that one can integrate information across documents. However,
both models struggle to select relevant information, as providing documents
guaranteed to be relevant greatly improves their performance. While the models
outperform several strong baselines, their best accuracy reaches 42.9% compared
to human performance at 74.0% - leaving ample room for improvement.
Comment: This paper directly corresponds to the TACL version
(https://transacl.org/ojs/index.php/tacl/article/view/1325) apart from minor
changes in wording, additional footnotes, and appendices.
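The abstract keeps the dataset-construction methodology at a high level. As a loose illustration only of the underlying idea (chaining thematically linked documents through shared entities until the answer is reachable), here is a minimal sketch; the corpus format, the entities_of index, and all function names are assumptions, not the authors' procedure.

```python
# Hypothetical sketch of multi-hop example construction: chain thematically
# linked documents via shared entities until a document containing the answer
# is reached. Data structures and names are illustrative assumptions.
from collections import deque

def find_document_chain(query_entity, answer, docs, entities_of):
    """docs: {doc_id: text}; entities_of: {doc_id: set of entity strings}."""
    start = [d for d, ents in entities_of.items() if query_entity in ents]
    queue = deque([(d, [d]) for d in start])
    seen = set(start)
    while queue:
        doc_id, chain = queue.popleft()
        if answer in docs[doc_id]:            # answer reachable via this chain
            return chain
        for other, ents in entities_of.items():
            if other not in seen and ents & entities_of[doc_id]:
                seen.add(other)
                queue.append((other, chain + [other]))
    return None                               # no multi-hop support found
```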
Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection
Grammatical error correction, like other machine learning tasks, greatly
benefits from large quantities of high quality training data, which is
typically expensive to produce. While writing a program to automatically
generate realistic grammatical errors would be difficult, one could learn the
distribution of naturally occurring errors and attempt to introduce them into
other datasets. Initial work on inducing errors in this way using statistical
machine translation has shown promise; we investigate cheaply constructing
synthetic samples, given a small corpus of human-annotated data, using an
off-the-rack attentive sequence-to-sequence model and a straightforward
post-processing procedure. Our approach yields error-filled artificial data
that helps a vanilla bi-directional LSTM to outperform the previous state of
the art at grammatical error detection, and a previously introduced model to
gain further improvements of over 5% F0.5 score. When attempting to
determine if a given sentence is synthetic, a human annotator at best achieves
39.39 F0.5 score, indicating that our model generates mostly human-like
instances.
Comment: Accepted as a short paper at EMNLP 2018.
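As a hedged sketch of the general recipe described above (learn to introduce errors with a sequence-to-sequence model, then derive token-level detection labels for the corrupted output), the snippet below aligns a clean sentence with its machine-corrupted version; noise_model.generate is an assumed interface, not the paper's code.

```python
# Hypothetical sketch: corrupt a clean sentence with a trained
# "correct -> errorful" seq2seq model, then mark altered tokens as errors.
from difflib import SequenceMatcher

def make_synthetic_example(clean_tokens, noise_model):
    noisy_tokens = noise_model.generate(clean_tokens)    # assumed seq2seq API
    labels = ["c"] * len(noisy_tokens)                    # "c" = correct token
    matcher = SequenceMatcher(a=clean_tokens, b=noisy_tokens)
    for op, _, _, j1, j2 in matcher.get_opcodes():
        if op != "equal":                                 # token was altered
            for j in range(j1, j2):
                labels[j] = "i"                           # "i" = incorrect token
    return noisy_tokens, labels
```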
Convolutional 2D Knowledge Graph Embeddings
Link prediction for knowledge graphs is the task of predicting missing
relationships between entities. Previous work on link prediction has focused on
shallow, fast models which can scale to large knowledge graphs. However, these
models learn less expressive features than deep, multi-layer models -- which
potentially limits performance. In this work, we introduce ConvE, a multi-layer
convolutional network model for link prediction, and report state-of-the-art
results for several established datasets. We also show that the model is highly
parameter efficient, yielding the same performance as DistMult and R-GCN with
8x and 17x fewer parameters. Analysis of our model suggests that it is
particularly effective at modelling nodes with high indegree -- which are
common in highly-connected, complex knowledge graphs such as Freebase and
YAGO3. In addition, it has been noted that the WN18 and FB15k datasets suffer
from test set leakage, due to inverse relations from the training set being
present in the test set -- however, the extent of this issue has so far not
been quantified. We find this problem to be severe: a simple rule-based model
can achieve state-of-the-art results on both WN18 and FB15k. To ensure that
models are evaluated on datasets where simply exploiting inverse relations
cannot yield competitive results, we investigate and validate several commonly
used datasets -- deriving robust variants where necessary. We then perform
experiments on these robust datasets for our own and several previously
proposed models and find that ConvE achieves state-of-the-art Mean Reciprocal
Rank across most datasets.
Comment: Extended AAAI 2018 paper.
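For readers unfamiliar with the model family, here is a minimal PyTorch-style sketch of the 2D-convolution scoring idea behind ConvE: subject and relation embeddings are reshaped into a 2D grid, convolved, projected back to the embedding space, and scored against all candidate objects. The embedding size, reshape dimensions, and filter count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a ConvE-style scorer with assumed hyperparameters
# (embedding size 200 reshaped to 10x20, 32 conv filters); not the authors' code.
import torch
import torch.nn as nn

class ConvEScorer(nn.Module):
    def __init__(self, num_entities, num_relations, dim=200, h=10, w=20):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        self.conv = nn.Conv2d(1, 32, kernel_size=3)
        self.fc = nn.Linear(32 * (2 * h - 2) * (w - 2), dim)
        self.h, self.w = h, w

    def forward(self, subj_idx, rel_idx):
        s = self.ent(subj_idx).view(-1, 1, self.h, self.w)
        r = self.rel(rel_idx).view(-1, 1, self.h, self.w)
        x = torch.cat([s, r], dim=2)          # stack embeddings into a 2D "image"
        x = torch.relu(self.conv(x))
        x = torch.relu(self.fc(x.flatten(1)))
        # score every candidate object entity at once (1-N scoring)
        return x @ self.ent.weight.t()
```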
Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
Ideally Open-Domain Question Answering models should exhibit a number of
competencies, ranging from simply memorizing questions seen at training time,
to answering novel question formulations with answers seen during training, to
generalizing to completely novel questions with novel answers. However, single
aggregated test set scores do not show the full picture of what capabilities
models truly have. In this work, we perform a detailed study of the test sets
of three popular open-domain benchmark datasets with respect to these
competencies. We find that 60-70% of test-time answers are also present
somewhere in the training sets. We also find that 30% of test-set questions
have a near-duplicate paraphrase in their corresponding training sets. Using
these findings, we evaluate a variety of popular open-domain models to obtain
greater insight into the extent to which they can actually generalize, and what drives
their overall performance. We find that all models perform dramatically worse
on questions that cannot be memorized from training sets, with a mean absolute
performance difference of 63% between repeated and non-repeated data. Finally,
we show that simple nearest-neighbor models out-perform a BART closed-book QA
model, further highlighting the role that training set memorization plays in
these benchmarks.
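The nearest-neighbour finding above is easy to reproduce in spirit: answer a test question with the stored answer of its most similar training question. The sketch below uses TF-IDF cosine similarity as an assumed similarity function; it is not the paper's exact baseline.

```python
# Minimal sketch of a question nearest-neighbour baseline: return the answer of
# the most similar training question. TF-IDF is an assumed similarity choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_nn_qa(train_questions, train_answers):
    vectoriser = TfidfVectorizer().fit(train_questions)
    train_matrix = vectoriser.transform(train_questions)

    def answer(question):
        sims = cosine_similarity(vectoriser.transform([question]), train_matrix)
        return train_answers[sims.argmax()]   # answer of nearest training question
    return answer
```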
R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason
Recent studies have revealed that reading comprehension (RC) systems learn to
exploit annotation artifacts and other biases in current datasets. This
prevents the community from reliably measuring the progress of RC systems. To
address this issue, we introduce R4C, a new task for evaluating RC systems'
internal reasoning. R4C requires giving not only answers but also derivations:
explanations that justify predicted answers. We present a reliable,
crowdsourced framework for scalably annotating RC datasets with derivations. We
create and publicly release the R4C dataset, the first, quality-assured dataset
consisting of 4.6k questions, each of which is annotated with 3 reference
derivations (i.e. 13.8k derivations). Experiments show that our automatic
evaluation metrics using multiple reference derivations are reliable, and that
R4C assesses different skills from an existing benchmark.
Comment: Accepted by ACL 2020. See https://naoya-i.github.io/r4c/ for more information.
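As an illustration only of evaluating against multiple reference derivations (the details of R4C's official metric are not given in the abstract and may differ), one simple scheme scores the predicted set of derivation facts against each reference and keeps the best F1:

```python
# Illustrative sketch (not the official R4C metric): score a predicted set of
# derivation facts against each of several references and keep the best F1.
def best_f1(predicted_facts, reference_derivations):
    def f1(pred, ref):
        if not pred or not ref:
            return 0.0
        overlap = len(set(pred) & set(ref))
        p, r = overlap / len(pred), overlap / len(ref)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(f1(predicted_facts, ref) for ref in reference_derivations)
```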
Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations
Learning good representations on multi-relational graphs is essential to knowledge base completion (KBC). In this paper, we propose a new self-supervised training objective for multi-relational
graph representation learning, via simply incorporating relation prediction into the commonly used
1vsAll objective. The new training objective contains not only terms for predicting the subject
and object of a given triple, but also a term for predicting the relation type. We analyse how this
new objective impacts multi-relational learning in KBC: experiments on a variety of datasets and
models show that relation prediction can significantly improve entity ranking, the most widely
used evaluation task for KBC, yielding a 6.1% increase in MRR and 9.9% increase in Hits@1
on FB15k-237 as well as a 3.1% increase in MRR and 3.4% in Hits@1 on Aristo-v4. Moreover,
we observe that the proposed objective is especially effective on highly multi-relational datasets,
i.e. datasets with a large number of predicates, and generates better representations when larger
embedding sizes are used.
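A minimal sketch of the training objective described above, assuming a model that exposes 1vsAll-style scoring functions for objects, subjects, and relations; the method names and the relation-term weight are assumptions, not the paper's implementation.

```python
# Sketch: augment 1vsAll subject/object prediction with a relation-prediction
# term. `model.score_*` interfaces and the weight `lam` are assumed.
import torch.nn.functional as F

def training_loss(model, subj, rel, obj, lam=1.0):
    loss_obj = F.cross_entropy(model.score_objects(subj, rel), obj)    # (s, r, ?)
    loss_subj = F.cross_entropy(model.score_subjects(rel, obj), subj)  # (?, r, o)
    loss_rel = F.cross_entropy(model.score_relations(subj, obj), rel)  # (s, ?, o)
    return loss_obj + loss_subj + lam * loss_rel
```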
Training Adaptive Computation for Open-Domain Question Answering with Computational Constraints
Adaptive Computation (AC) has been shown to be effective in improving the
efficiency of Open-Domain Question Answering (ODQA) systems. However, current
AC approaches require tuning of all model parameters, and training
state-of-the-art ODQA models requires significant computational resources that
may not be available for most researchers. We propose Adaptive Passage Encoder,
an AC method that can be applied to an existing ODQA model and can be trained
efficiently on a single GPU. It keeps the parameters of the base ODQA model
fixed, but it overrides the default layer-by-layer computation of the encoder
with an AC policy that is trained to optimise the computational efficiency of
the model. Our experimental results show that our method improves upon a
state-of-the-art model on two datasets, and is also more accurate than previous
AC methods due to the stronger base ODQA model. All source code and datasets
are available at https://github.com/uclnlp/APE.
Comment: 7 pages, 1 figure, to be published in ACL-IJCNLP 2021.
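A rough sketch of the general idea, assuming a frozen base encoder and a small trainable halting policy; the interfaces (encoder.layers, policy) and the thresholded halting rule are assumptions, not the released APE code.

```python
# Illustrative sketch: keep the base encoder frozen and let a small policy
# decide after each layer whether a passage needs further computation.
# Assumes one passage per call.
import torch

def encode_adaptively(encoder, policy, passage_hidden, threshold=0.5):
    for layer in encoder.layers:
        with torch.no_grad():                 # base ODQA encoder parameters stay fixed
            passage_hidden = layer(passage_hidden)
        halt_prob = torch.sigmoid(policy(passage_hidden.mean(dim=1)))
        if halt_prob.item() > threshold:      # stop early: skip remaining layers
            break
    return passage_hidden
```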
- …