3 research outputs found
NLITrans at SemEval-2018 Task 12: Transfer of Semantic Knowledge for Argument Comprehension
The Argument Reasoning Comprehension Task requires significant language
understanding and complex reasoning over world knowledge. We focus on transfer
of a sentence encoder to bootstrap more complicated models given the small size
of the dataset. Our best model uses a pre-trained BiLSTM to encode input
sentences, learns task-specific features for the argument and warrants, then
performs independent argument-warrant matching. This model achieves mean test
set accuracy of 64.43%. Encoder transfer yields a significant gain to our best
model over random initialization. Independent warrant matching effectively
doubles the size of the dataset and provides additional regularization. We
demonstrate that this regularization comes from ignoring statistical correlations
between warrant features and position. We also report an experiment with our
best model that matches warrants only to reasons, ignoring claims. The relatively
low performance degradation suggests that our model is not necessarily learning
the intended task.
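A minimal sketch (not the authors' released code) of the independent argument-warrant matching step described above, assuming sentence vectors from the pre-trained BiLSTM encoder are already computed; the dimensions, projection layers and MLP scorer are illustrative choices, not values from the paper.

import torch
import torch.nn as nn

class WarrantMatcher(nn.Module):
    def __init__(self, enc_dim=600, feat_dim=300):  # hypothetical sizes
        super().__init__()
        # task-specific projections on top of the transferred sentence encoder
        self.arg_proj = nn.Linear(enc_dim, feat_dim)
        self.warrant_proj = nn.Linear(enc_dim, feat_dim)
        # scores a single (argument, warrant) pair
        self.scorer = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1)
        )

    def score(self, arg_enc, warrant_enc):
        a = torch.relu(self.arg_proj(arg_enc))
        w = torch.relu(self.warrant_proj(warrant_enc))
        return self.scorer(torch.cat([a, w], dim=-1)).squeeze(-1)

    def forward(self, arg_enc, warrant0_enc, warrant1_enc):
        # each warrant is scored against the argument independently, so the
        # scorer never sees which slot (0 or 1) a warrant came from
        s0 = self.score(arg_enc, warrant0_enc)
        s1 = self.score(arg_enc, warrant1_enc)
        return torch.stack([s0, s1], dim=-1)  # argmax picks the predicted warrant

Because both warrants pass through the same scorer, every instance contributes two (argument, warrant) training pairs, which is one way to read the abstract's point about effectively doubling the dataset while ignoring warrant position.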
Probing Neural Network Comprehension of Natural Language Arguments
We are surprised to find that BERT's peak performance of 77% on the Argument
Reasoning Comprehension Task falls just three points below the average
untrained human baseline. However, we show that this result is entirely
accounted for by exploitation of spurious statistical cues in the dataset. We
analyze the nature of these cues and demonstrate that a range of models all
exploit them. This analysis informs the construction of an adversarial dataset
on which all models achieve random accuracy. Our adversarial dataset provides a
more robust assessment of argument comprehension and should be adopted as the
standard in future work.Comment: ACL 2019 (Updated Version
A Systematic Review of Reproducibility Research in Natural Language Processing
Against the background of what has been termed a reproducibility crisis in
science, the NLP field is becoming increasingly interested in, and
conscientious about, the reproducibility of its results. The past few years
have seen an impressive range of new initiatives, events and active research in
the area. However, the field is far from reaching a consensus about how
reproducibility should be defined, measured and addressed, with diversity of
views currently increasing rather than converging. With this focused
contribution, we aim to provide a wide-angle, and as near as possible complete,
snapshot of current work on reproducibility in NLP, delineating differences and
similarities, and providing pointers to common denominators.Comment: To be published in proceedings of EACL'2