Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER
BiLSTM has been widely used as a core module for NER in a
sequence-labeling setup. State-of-the-art approaches use BiLSTM with additional
resources such as gazetteers, language-modeling, or multi-task supervision to
further improve NER. This paper instead takes a step back and focuses on
analyzing problems of BiLSTM itself and how exactly self-attention can bring
improvements. We formally show the limitation of (CRF-)BiLSTM in modeling
cross-context patterns for each word -- the XOR limitation. Then, we show that
two types of simple cross-structures -- self-attention and Cross-BiLSTM -- can
effectively remedy the problem. We test the practical impact of the deficiency on the real-world NER datasets OntoNotes 5.0 and WNUT 2017 and obtain clear and consistent improvements over the baseline, up to 8.7% on some multi-token entity mentions. We give in-depth analyses of the improvements
across several aspects of NER, especially the identification of multi-token
mentions. This study should lay a sound foundation for future improvements on
sequence-labeling NER. (Source code: https://github.com/jacobvsdanniel/cross-ner)
Comment: In proceedings of AAAI 2020
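For intuition, the following is a minimal PyTorch sketch of the self-attention cross-structure described above: a self-attention layer placed over the BiLSTM states so that each word's tag prediction can jointly condition on its left and right contexts, rather than merely concatenating the two directions (the root of the XOR limitation). All names, dimensions, and the attention configuration are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class AttentiveBiLSTMTagger(nn.Module):
    """Hypothetical BiLSTM tagger with a self-attention cross-structure."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # A plain BiLSTM can only concatenate two independently computed
        # directional contexts per word, the source of the XOR limitation.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Self-attention lets each position mix information from both sides
        # jointly, acting as the cross-structure remedy.
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))  # (B, T, 2H)
        ctx, _ = self.attn(h, h, h)                # cross-context mixing
        return self.classifier(ctx)                # per-token tag logits

# Toy usage: a batch of 2 sentences, 20 tokens each, 9 BIO tags.
tagger = AttentiveBiLSTMTagger(vocab_size=10000, emb_dim=100,
                               hidden_dim=128, num_tags=9)
logits = tagger(torch.randint(0, 10000, (2, 20)))  # shape (2, 20, 9)

The paper's other remedy, Cross-BiLSTM, would instead interleave the two directions across stacked layers; the attention variant is shown here because it is the simpler of the two to sketch.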
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies have failed to find a truly effective MVM strategy that substantially benefits downstream performance. In this work, we systematically
examine the potential of MVM in the context of VidL learning. Specifically, we
base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where
the supervision from MVM training can be backpropagated to the video pixel
space. In total, eight different reconstructive targets of MVM are explored,
from low-level pixel values and oriented gradients to high-level depth maps,
optical flow, discrete visual tokens, and latent visual features. We conduct
comprehensive experiments and provide insights into the factors leading to
effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
Comment: CVPR'23; the first two authors contributed equally; code is available at https://github.com/tsujuifu/pytorch_empirical-mvm
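As a rough illustration of an MVM objective of the kind examined above, the sketch below masks a fraction of video-patch embeddings and regresses one reconstructive target (raw pixel values, one of the eight targets explored) on the masked positions only, so that supervision flows back through the video features. The encoder, head, shapes, and mask ratio are assumptions for illustration, not VIOLETv2's actual implementation.

import torch
import torch.nn as nn

def mvm_loss(patch_emb, pixel_target, encoder, head, mask_ratio=0.15):
    """patch_emb: (B, N, D) video-patch embeddings; pixel_target: (B, N, P)."""
    B, N, _ = patch_emb.shape
    mask = torch.rand(B, N, device=patch_emb.device) < mask_ratio  # (B, N)
    # Zero out the masked patches (a learned mask token is common in
    # practice; zeroing keeps the sketch short).
    masked = patch_emb.masked_fill(mask.unsqueeze(-1), 0.0)
    hidden = encoder(masked)          # contextualize with the transformer
    pred = head(hidden)               # (B, N, P) reconstruction
    # Compute the regression loss on masked positions only.
    return ((pred - pixel_target) ** 2)[mask].mean()

# Toy usage with a small transformer encoder and linear reconstruction head.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2)
head = nn.Linear(256, 768)           # 768 = e.g. 16x16x3 pixels per patch
emb = torch.randn(2, 49, 256)        # 2 clips, 49 patches, 256-dim embeddings
target = torch.randn(2, 49, 768)
loss = mvm_loss(emb, target, encoder, head)

Swapping pixel_target for depth maps, optical flow, discrete visual tokens, or latent features changes the reconstructive target while leaving this masking-and-regression scaffold intact, which is the axis the paper's eight-target comparison varies.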