Natural Language Inference is a challenging task that has received
substantial attention, and state-of-the-art models now achieve impressive test
set performance in the form of accuracy scores. Here, we go beyond this single
evaluation metric to examine robustness to semantically valid alterations to
the input data. We identify three factors - insensitivity, polarity and unseen
pairs - and compare their impact on three SNLI models under a variety of
conditions. Our results demonstrate a number of strengths and weaknesses in the
models' ability to generalise to new in-domain instances. In particular, while
strong performance is possible on unseen hypernyms, unseen antonyms are more
challenging for all the models. More generally, the models suffer from an
insensitivity to certain small but semantically significant alterations, and
are also often influenced by simple statistical correlations between words and
training labels. Overall, we show that evaluations of NLI models can benefit
from studying the influence of factors intrinsic to the models or found in the
dataset used.