Semi-supervised sequence tagging with bidirectional language models
Pre-trained word embeddings learned from unlabeled text have become a
standard component of neural network architectures for NLP tasks. However, in
most cases, the recurrent network that operates on word-level representations
to produce context-sensitive representations is trained on relatively little
labeled data. In this paper, we demonstrate a general semi-supervised approach
for adding pre-trained context embeddings from bidirectional language models
to NLP systems and apply it to sequence labeling tasks. We evaluate our model
on two standard datasets for named entity recognition (NER) and chunking, and
in both cases achieve state-of-the-art results, surpassing previous systems
that use other forms of transfer or joint learning with additional labeled data
and task-specific gazetteers.
Comment: To appear in ACL 2017.
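A minimal sketch of the approach, assuming PyTorch and illustrative names and
dimensions (it shows the core idea of concatenating frozen bidirectional-LM
states with word embeddings, not the paper's exact architecture):

    import torch
    import torch.nn as nn

    class LMAugmentedTagger(nn.Module):
        """Sequence tagger whose input is [word embedding ; pre-trained
        bidirectional-LM state] per token (a simplified sketch)."""

        def __init__(self, vocab_size, word_dim, lm_dim, hidden_dim, num_tags):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.rnn = nn.LSTM(word_dim + lm_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
            self.tag_proj = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, token_ids, lm_states):
            # token_ids: (batch, seq); lm_states: (batch, seq, lm_dim),
            # produced by a frozen pre-trained biLM and detached from its graph.
            x = torch.cat([self.word_emb(token_ids), lm_states], dim=-1)
            h, _ = self.rnn(x)
            return self.tag_proj(h)  # per-token tag scores

    # Toy usage with random stand-ins for the frozen biLM states.
    model = LMAugmentedTagger(vocab_size=100, word_dim=50, lm_dim=64,
                              hidden_dim=128, num_tags=9)
    tokens = torch.randint(0, 100, (2, 7))
    lm_states = torch.randn(2, 7, 64)   # would come from the pre-trained biLM
    scores = model(tokens, lm_states)   # shape (2, 7, 9)

Keeping the LM frozen is what makes this work with little labeled data: only
the small task RNN and output projection are trained on the tagging corpus.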
Commonsense Knowledge Base Completion with Structural and Semantic Context
Automatic KB completion for commonsense knowledge graphs (e.g., ATOMIC and
ConceptNet) poses unique challenges compared to the much-studied conventional
knowledge bases (e.g., Freebase). Commonsense knowledge graphs use free-form
text to represent nodes, resulting in orders of magnitude more nodes compared
to conventional KBs (18x more nodes in ATOMIC compared to Freebase
(FB15K-237)). Importantly, this implies significantly sparser graph structures
-- a major challenge for existing KB completion methods that assume densely
connected graphs over a relatively smaller set of nodes. In this paper, we
present novel KB completion models that can address these challenges by
exploiting the structural and semantic context of nodes. Specifically, we
investigate two key ideas: (1) learning from local graph structure, using graph
convolutional networks and automatic graph densification, and (2) transfer
learning from pre-trained language models to knowledge graphs for enhanced
contextual representation of knowledge. We describe our method to incorporate
information from both these sources in a joint model and provide the first
empirical results for KB completion on ATOMIC and evaluation with ranking
metrics on ConceptNet. Our results demonstrate the effectiveness of language
model representations in boosting link prediction performance and the
advantages of learning from local graph structure (+1.5 points in MRR for
ConceptNet) when training on subgraphs for computational efficiency. Further
analysis of model predictions sheds light on the types of commonsense
knowledge that language models capture well.
Comment: AAAI 2020.
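A compact sketch of how the two ideas can be combined for link prediction; the
one-layer graph convolution, DistMult-style scorer, and row-normalized
adjacency below are illustrative assumptions, not the paper's exact model:

    import torch
    import torch.nn as nn

    class GraphContextScorer(nn.Module):
        """Scores (head, relation, tail) triples from (1) a graph
        convolution over local structure and (2) node features standing in
        for pre-trained LM encodings of the free-form node text."""

        def __init__(self, lm_dim, hidden_dim, num_relations):
            super().__init__()
            self.gcn = nn.Linear(lm_dim, hidden_dim)            # shared GCN weight
            self.rel = nn.Embedding(num_relations, hidden_dim)  # DistMult-style

        def encode(self, node_feats, adj):
            # node_feats: (N, lm_dim) LM encodings of node text (frozen here);
            # adj: (N, N) row-normalized adjacency, which could include
            # synthetic edges added by graph densification.
            return torch.relu(self.gcn(adj @ node_feats))

        def score(self, h, rel_ids, heads, tails):
            # DistMult: elementwise product of head, relation, and tail
            # vectors, summed over the hidden dimension.
            return (h[heads] * self.rel(rel_ids) * h[tails]).sum(-1)

    # Toy usage: 5 nodes, 3 relation types, random stand-in LM features.
    feats = torch.randn(5, 32)
    adj = torch.softmax(torch.randn(5, 5), dim=-1)  # stand-in normalized adjacency
    model = GraphContextScorer(lm_dim=32, hidden_dim=16, num_relations=3)
    h = model.encode(feats, adj)
    s = model.score(h, torch.tensor([0, 2]), torch.tensor([0, 1]), torch.tensor([3, 4]))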
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
Commonsense AI has long been seen as a near-impossible goal -- until
recently. Now, research interest has sharply increased with an influx of new
benchmarks and models.
We propose two new ways to evaluate commonsense models, emphasizing their
generality on new tasks and building on diverse, recently introduced
benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote
research on commonsense models that generalize well over multiple tasks and
datasets. Second, we propose a novel evaluation, the cost equivalent curve,
that sheds new light on how the choice of source datasets, pretrained
language models, and transfer learning methods impacts performance and data
efficiency.
We perform extensive experiments -- over 200 experiments encompassing 4800
models -- and report multiple valuable and sometimes surprising findings, e.g.,
that transfer almost always leads to better or equivalent performance if
following a particular recipe; that QA-based commonsense datasets transfer well
with each other, while commonsense knowledge graphs do not; and that, perhaps
counter-intuitively, larger models benefit more from transfer than smaller
ones.
Last but not least, we introduce a new universal commonsense reasoning model,
UNICORN, that establishes new state-of-the-art performance across 8 popular
commonsense benchmarks: aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA
(90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%), and CommonsenseQA
(79.3%).
Comment: 27 pages, 19 figures, 34 tables. Accepted to AAAI 2021. For
associated code and data see https://github.com/allenai/rainbow
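The cost equivalent curve can be read as inverting two learning curves: for
each baseline training-set size, it reports how much data the transfer
approach needs to reach the same score. A minimal sketch on toy, assumed
learning curves (the paper's exact estimation procedure may differ):

    import numpy as np

    def cost_equivalent_curve(baseline_sizes, baseline_scores,
                              transfer_sizes, transfer_scores):
        """For each baseline size, estimate the number of examples the
        transfer approach needs to match the baseline's score. Assumes
        both score arrays increase monotonically with size."""
        return np.array([np.interp(s, transfer_scores, transfer_sizes)
                         for s in baseline_scores])

    # Toy curves: transfer reaches the same scores with less data.
    base_n = np.array([1000, 2000, 4000, 8000])
    base_s = np.array([0.60, 0.66, 0.71, 0.75])
    xfer_n = np.array([1000, 2000, 4000, 8000])
    xfer_s = np.array([0.68, 0.73, 0.77, 0.80])
    print(cost_equivalent_curve(base_n, base_s, xfer_n, xfer_s))
    # -> [1000. 1000. 1600. 3000.]; scores below the transfer curve's
    #    range clamp to its smallest size (np.interp's boundary behavior).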
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011),
a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun
resolution problems originally designed to be unsolvable for statistical models
that rely on selectional preferences or word associations. However, recent
advances in neural language models have already reached around 90% accuracy on
variants of WSC. This raises an important question: have these models truly
acquired robust commonsense capabilities, or do they rely on spurious biases in
the datasets that lead to an overestimation of the true capabilities of machine
commonsense? To investigate this question, we introduce WinoGrande,
a large-scale dataset of 44k problems, inspired by the original WSC design, but
adjusted to improve both the scale and the hardness of the dataset. The key
steps of the dataset construction consist of (1) a carefully designed
crowdsourcing procedure, followed by (2) systematic bias reduction using a
novel AfLite algorithm that generalizes human-detectable word associations to
machine-detectable embedding associations. The best state-of-the-art methods on
WinoGrande achieve 59.4-79.1%, which is 15-35 percentage points below human
performance of 94.0%, depending on the amount of training data allowed. Furthermore, we
establish new state-of-the-art results on five related benchmarks: WSC
(90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%).
These results have dual implications: on one hand, they demonstrate the
effectiveness of WinoGrande when used as a resource for transfer learning. On
the other hand, they raise a concern that we are likely to be overestimating
the true capabilities of machine commonsense across all these benchmarks. We
emphasize the importance of algorithmic bias reduction in existing and future
benchmarks to mitigate such overestimation.
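A simplified sketch of the AfLite idea: repeatedly fit cheap linear
classifiers on precomputed embeddings over random splits, then discard
instances that are solved too reliably. The thresholds, round counts, and
removal schedule below are illustrative assumptions, not the paper's settings:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def aflite_sketch(X, y, n_rounds=10, n_models=8, train_frac=0.8,
                      predictability_cutoff=0.75, n_remove=500):
        """X: precomputed embeddings (n, d); y: labels (n,).
        Returns indices of instances that survive filtering."""
        keep = np.arange(len(y))
        for _ in range(n_rounds):
            correct = np.zeros(len(keep))
            counts = np.zeros(len(keep))
            for _ in range(n_models):
                # Random train/eval split over the surviving instances.
                perm = np.random.permutation(len(keep))
                cut = int(train_frac * len(keep))
                tr, ev = perm[:cut], perm[cut:]
                clf = LogisticRegression(max_iter=1000)
                clf.fit(X[keep[tr]], y[keep[tr]])
                correct[ev] += clf.predict(X[keep[ev]]) == y[keep[ev]]
                counts[ev] += 1
            # Predictability: how often a simple linear model solved the
            # instance when it was held out.
            scores = np.where(counts > 0, correct / np.maximum(counts, 1), 0.0)
            doomed = np.argsort(-scores)[:n_remove]
            doomed = doomed[scores[doomed] > predictability_cutoff]
            if len(doomed) == 0:
                break                      # nothing left that is too easy
            keep = np.delete(keep, doomed)
        return keep

The intuition is that instances a linear probe can solve from embeddings alone
are likely exploiting dataset-specific associations rather than requiring
commonsense reasoning.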
ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
We present ATOMIC, an atlas of everyday commonsense reasoning, organized
through 877k textual descriptions of inferential knowledge. Compared to
existing resources that center around taxonomic knowledge, ATOMIC focuses on
inferential knowledge organized as typed if-then relations with variables
(e.g., "if X pays Y a compliment, then Y will likely return the compliment").
We propose nine if-then relation types to distinguish causes vs. effects,
agents vs. themes, voluntary vs. involuntary events, and actions vs. mental
states. By generatively training on the rich inferential knowledge described in
ATOMIC, we show that neural models can acquire simple commonsense capabilities
and reason about previously unseen events. Experimental results demonstrate
that multitask models that incorporate the hierarchical structure of if-then
relation types lead to more accurate inference compared to models trained in
isolation, as measured by both automatic and human evaluation.
Comment: AAAI 2019.
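Generative training on such a resource is commonly set up by casting each
(event, relation, inference) triple as a text-to-text pair for a seq2seq
model. A minimal formatting sketch; the relation verbalizations are
illustrative (xIntent, xEffect, and oReact are among ATOMIC's nine relation
types):

    # Illustrative natural-language verbalizations of three relation types.
    RELATIONS = {
        "xIntent": "because PersonX wanted",
        "xEffect": "as a result, PersonX",
        "oReact": "as a result, others feel",
    }

    def to_seq2seq_pair(event, relation, inference):
        """Encode one if-then triple as (source, target) training text."""
        # Verbalizing the typed relation gives the model natural-language
        # context for the kind of inference it should generate.
        return f"{event} {RELATIONS[relation]}", inference

    triples = [
        ("PersonX pays PersonY a compliment", "oReact", "flattered"),
        ("PersonX pays PersonY a compliment", "xIntent", "to be nice"),
    ]
    for event, relation, inference in triples:
        source, target = to_seq2seq_pair(event, relation, inference)
        print(f"{source!r} -> {target!r}")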