BiLSTM has been widely used as a core module for NER in a
sequence-labeling setup. State-of-the-art approaches augment BiLSTM with
additional resources such as gazetteers, language modeling, or multi-task
supervision to further improve NER. This paper instead takes a step back and
focuses on analyzing the problems of BiLSTM itself and on how exactly
self-attention brings improvements.
improvements. We formally show the limitation of (CRF-)BiLSTM in modeling
cross-context patterns for each word -- the XOR limitation. Then, we show that
two types of simple cross-structures -- self-attention and Cross-BiLSTM -- can
effectively remedy the problem. We test the practical impacts of the deficiency
on real-world NER datasets, OntoNotes 5.0 and WNUT 2017, with clear and
consistent improvements over the baseline, up to 8.7% on some of the
multi-token entity mentions. We give in-depth analyses of the improvements
across several aspects of NER, especially the identification of multi-token
mentions. This study should lay a sound foundation for future improvements on
sequence-labeling NER. (Source code:
https://github.com/jacobvsdanniel/cross-ner)

In proceedings of AAAI 2020.
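To make the abstract's terms concrete, the following is a minimal, hypothetical PyTorch sketch, not the paper's exact architecture: the class name, hyperparameters, and layer choices are illustrative assumptions. It shows the general idea of placing a self-attention layer on top of a BiLSTM tagger, so that each word's tag score can draw on forward and backward context jointly rather than only on their per-word concatenation.

# Illustrative sketch (not the authors' exact model): BiLSTM tagger with an
# added self-attention layer over the BiLSTM outputs.
import torch
import torch.nn as nn

class BiLSTMSelfAttnTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200,
                 num_tags=10, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Plain BiLSTM: forward and backward states are only concatenated per word.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Self-attention lets every position attend to all others, adding the
        # cross-context interaction that per-word concatenation alone lacks.
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        # Scorer sees both the BiLSTM state and the attention summary.
        self.out = nn.Linear(4 * hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)               # (batch, seq, emb_dim)
        h, _ = self.bilstm(x)                   # (batch, seq, 2*hidden_dim)
        a, _ = self.attn(h, h, h)               # (batch, seq, 2*hidden_dim)
        return self.out(torch.cat([h, a], -1))  # per-token tag scores

# Usage: score a dummy batch of 3 sentences of length 12.
tagger = BiLSTMSelfAttnTagger(vocab_size=5000)
scores = tagger(torch.randint(0, 5000, (3, 12)))
print(scores.shape)  # torch.Size([3, 12, 10])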