Language models (LMs) may appear insensitive to word order changes in natural
language understanding (NLU) tasks. In this paper, we propose that linguistic
redundancy can explain this phenomenon, whereby word order and other linguistic
cues such as case markers provide overlapping and thus redundant information.
Our hypothesis is that models exhibit insensitivity to word order when the
order provides redundant information, and the degree of insensitivity varies
across tasks. We quantify how informative word order is using mutual
information (MI) between unscrambled and scrambled sentences. Our results show
that the less informative word order is, the more consistent the model's
predictions are between unscrambled and scrambled sentences. We also
find that the effect varies across tasks: for some tasks, such as SST-2, LMs'
predictions are almost always consistent with the original ones even when the
pointwise MI (PMI) changes, while for others, such as RTE, consistency drops to
near chance as the PMI decreases, i.e., word order is highly informative.
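As background, the PMI referred to above follows the standard information-theoretic definition; writing $x$ for an unscrambled sentence and $\tilde{x}$ for its scrambled counterpart (the paper's particular estimator for these probabilities is not specified here), it takes the form:

\[
\mathrm{PMI}(x;\tilde{x}) = \log \frac{p(x,\tilde{x})}{p(x)\,p(\tilde{x})}
\]

A low PMI indicates that the scrambled sentence shares little information with the original, i.e., word order carries information that the remaining cues do not recover.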