Introduction to Transformers: an NLP Perspective
Transformers have dominated empirical machine learning models of natural
language processing. In this paper, we introduce basic concepts of Transformers
and present key techniques that form the recent advances of these models. This
includes a description of the standard Transformer architecture, a series of
model refinements, and common applications. Given that Transformers and related
deep learning techniques might be evolving in ways we have never seen, we
cannot dive into all the model details or cover all the technical areas.
Instead, we focus on just those concepts that are helpful for gaining a good
understanding of Transformers and their variants. We also summarize the key
ideas that impact this field, thereby yielding some insights into the strengths
and limitations of these models. Comment: 119 pages and 21 figures
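The standard Transformer architecture this survey describes is built on scaled dot-product attention. As a minimal illustrative sketch (pure Python with toy dimensions, not code from the survey itself):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: the core operation of the standard Transformer."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Output is the attention-weighted sum of value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# 2 query positions attending over 3 key/value positions, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, which is why attention outputs stay within the range spanned by the values.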
Towards Bidirectional Hierarchical Representations for Attention-Based Neural Machine Translation
This paper proposes a hierarchical attentional neural translation model which
focuses on enhancing source-side hierarchical representations by covering both
local and global semantic information using a bidirectional tree-based encoder.
To maximize the predictive likelihood of target words, a weighted variant of an
attention mechanism is used to balance the attentive information between
lexical and phrase vectors. Using a tree-based rare word encoding, the proposed
model is extended to sub-word level to alleviate the out-of-vocabulary (OOV)
problem. Empirical results reveal that the proposed model significantly
outperforms sequence-to-sequence attention-based and tree-based neural
translation models in English-Chinese translation tasks. Comment: Accepted for publication at EMNLP 201
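The abstract does not give the exact parameterization of the weighted attention variant that balances lexical and phrase vectors; one plausible gated form, in which the function name `combine_contexts` and the gate weights `w` are hypothetical, looks like this:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def combine_contexts(c_lex, c_phrase, w, b=0.0):
    # Hypothetical gate computed from the concatenated context vectors:
    # g = sigmoid(w . [c_lex; c_phrase] + b), then c = g * c_lex + (1 - g) * c_phrase
    concat = list(c_lex) + list(c_phrase)
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b)
    return [g * l + (1.0 - g) * p for l, p in zip(c_lex, c_phrase)], g

c_lex = [0.2, 0.8]     # context from attention over word-level annotations
c_phrase = [0.6, 0.1]  # context from attention over phrase-level (tree node) annotations
w = [0.5, -0.3, 0.1, 0.4]
c, g = combine_contexts(c_lex, c_phrase, w)
```

Because the gate `g` lies in (0, 1), each component of the combined context interpolates between its lexical and phrase counterparts.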
Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation
Pre-training and fine-tuning is a paradigm for alleviating the data scarcity
problem in end-to-end speech translation (E2E ST). The commonplace "modality
gap" between speech and text data often leads to inconsistent inputs between
pre-training and fine-tuning. However, we observe that this gap occurs in the
early stages of fine-tuning, but does not have a major impact on the final
performance. On the other hand, we find that there is another gap, which we
call the "capacity gap": high-resource tasks (such as ASR and MT) always
require a large model to fit; when that model is reused for a low-resource
task (E2E ST), it yields sub-optimal performance due to over-fitting. In a
case study, we find that regularization plays a more important role than
the well-designed modality adaption method, which achieves 29.0 for en-de and
40.3 for en-fr on the MuST-C dataset. Code and models are available at
https://github.com/hannlp/TAB. Comment: ACL 2023 Main Conferenc
An Efficient Transformer Decoder with Compressed Sub-layers
The large attention-based encoder-decoder network (Transformer) has recently
become prevalent due to its effectiveness. However, the high computational
complexity of its decoder makes it inefficient. By examining the
mathematical formulation of the decoder, we show that under some mild conditions,
the architecture can be simplified by compressing its sub-layers, the basic
building blocks of the Transformer, achieving higher parallelism. We thereby
propose Compressed Attention Network, whose decoder layer consists of only one
sub-layer instead of three. Extensive experiments on 14 WMT machine translation
tasks show that our model is 1.42x faster with performance on par with a strong
baseline. This strong baseline is already 2x faster than the widely used
standard baseline without loss in performance. Comment: Accepted by AAAI202
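For context, the three sub-layers that this work compresses into one are the standard decoder layer's masked self-attention, encoder-decoder attention, and feed-forward network. A structural sketch of that baseline layer (stand-in functions on a scalar for illustration, not the paper's implementation):

```python
def decoder_layer(x, memory, self_attn, cross_attn, ffn, norm):
    """Standard Transformer decoder layer: three residual sub-layers in sequence."""
    x = norm(x + self_attn(x))           # sub-layer 1: masked self-attention
    x = norm(x + cross_attn(x, memory))  # sub-layer 2: encoder-decoder attention
    x = norm(x + ffn(x))                 # sub-layer 3: position-wise feed-forward
    return x

# Stand-ins operating on a single float, just to make the structure executable;
# real sub-layers transform vectors of hidden states.
self_attn = lambda x: 0.1 * x
cross_attn = lambda x, m: 0.1 * m
ffn = lambda x: 0.1 * x
norm = lambda x: x  # residual connection with a trivial (identity) layer norm

y = decoder_layer(1.0, 2.0, self_attn, cross_attn, ffn, norm)
```

The decoder's inefficiency partly stems from running these three sub-layers sequentially per layer; collapsing them into a single sub-layer, as the paper proposes, shortens this serial chain.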