Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER
BiLSTM has been widely used as a core module for NER in a
sequence-labeling setup. State-of-the-art approaches use BiLSTM with additional
resources such as gazetteers, language-modeling, or multi-task supervision to
further improve NER. This paper instead takes a step back and focuses on
analyzing problems of BiLSTM itself and how exactly self-attention can bring
improvements. We formally show the limitation of (CRF-)BiLSTM in modeling
cross-context patterns for each word -- the XOR limitation. Then, we show that
two types of simple cross-structures -- self-attention and Cross-BiLSTM -- can
effectively remedy the problem. We test the practical impact of the deficiency on the real-world NER datasets OntoNotes 5.0 and WNUT 2017 and obtain clear and consistent improvements over the baseline, up to 8.7% on some multi-token entity mentions. We give in-depth analyses of the improvements
across several aspects of NER, especially the identification of multi-token
mentions. This study should lay a sound foundation for future improvements on
sequence-labeling NER. (Source code: https://github.com/jacobvsdanniel/cross-ner)
Comment: In proceedings of AAAI 2020
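For intuition, the following is a minimal PyTorch sketch of the self-attention cross-structure described above: a self-attention layer placed over the BiLSTM states so that each word's tag prediction can jointly condition on its left and right contexts, rather than merely concatenating the two directions (the root of the XOR limitation). All names, dimensions, and the attention configuration are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class AttentiveBiLSTMTagger(nn.Module):
    """Hypothetical BiLSTM tagger with a self-attention cross-structure."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # A plain BiLSTM can only concatenate two independently computed
        # directional contexts per word, the source of the XOR limitation.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Self-attention lets each position mix information from both sides
        # jointly, acting as the cross-structure remedy.
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))  # (B, T, 2H)
        ctx, _ = self.attn(h, h, h)                # cross-context mixing
        return self.classifier(ctx)                # per-token tag logits

# Toy usage: a batch of 2 sentences, 20 tokens each, 9 BIO tags.
tagger = AttentiveBiLSTMTagger(vocab_size=10000, emb_dim=100,
                               hidden_dim=128, num_tags=9)
logits = tagger(torch.randint(0, 10000, (2, 20)))  # shape (2, 20, 9)

The paper's other remedy, Cross-BiLSTM, would instead interleave the two directions across stacked layers; the attention variant is shown here because it is the simpler of the two to sketch.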
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies have failed to find a truly effective MVM strategy that substantially benefits downstream performance. In this work, we systematically
examine the potential of MVM in the context of VidL learning. Specifically, we
base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where
the supervision from MVM training can be backpropagated to the video pixel
space. In total, eight different reconstructive targets of MVM are explored,
from low-level pixel values and oriented gradients to high-level depth maps,
optical flow, discrete visual tokens, and latent visual features. We conduct
comprehensive experiments and provide insights into the factors leading to
effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
Comment: CVPR'23; the first two authors contributed equally; code is available at https://github.com/tsujuifu/pytorch_empirical-mvm
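As a rough illustration of an MVM objective of the kind examined above, the sketch below masks a fraction of video-patch embeddings and regresses one reconstructive target (raw pixel values, one of the eight targets explored) on the masked positions only, so that supervision flows back through the video features. The encoder, head, shapes, and mask ratio are assumptions for illustration, not VIOLETv2's actual implementation.

import torch
import torch.nn as nn

def mvm_loss(patch_emb, pixel_target, encoder, head, mask_ratio=0.15):
    """patch_emb: (B, N, D) video-patch embeddings; pixel_target: (B, N, P)."""
    B, N, _ = patch_emb.shape
    mask = torch.rand(B, N, device=patch_emb.device) < mask_ratio  # (B, N)
    # Zero out the masked patches (a learned mask token is common in
    # practice; zeroing keeps the sketch short).
    masked = patch_emb.masked_fill(mask.unsqueeze(-1), 0.0)
    hidden = encoder(masked)          # contextualize with the transformer
    pred = head(hidden)               # (B, N, P) reconstruction
    # Compute the regression loss on masked positions only.
    return ((pred - pixel_target) ** 2)[mask].mean()

# Toy usage with a small transformer encoder and linear reconstruction head.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2)
head = nn.Linear(256, 768)           # 768 = e.g. 16x16x3 pixels per patch
emb = torch.randn(2, 49, 256)        # 2 clips, 49 patches, 256-dim embeddings
target = torch.randn(2, 49, 768)
loss = mvm_loss(emb, target, encoder, head)

Swapping pixel_target for depth maps, optical flow, discrete visual tokens, or latent features changes the reconstructive target while leaving this masking-and-regression scaffold intact, which is the axis the paper's eight-target comparison varies.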