On the differences between BERT and MT encoder spaces and how to address them in translation tasks
Various studies show that pretrained language models such as BERT cannot straightforwardly replace encoders in neural machine translation, despite their enormous success in other tasks. This is surprising given the similarities between the two architectures. This paper sheds light on the embedding spaces they create, comparing them with average cosine similarity, contextuality metrics, and measures of representational similarity, and reveals that BERT and NMT encoder representations differ significantly from one another. To address this, we propose a supervised transformation from one space into the other using explicit alignment and fine-tuning. Our results demonstrate the need for such a transformation to improve the applicability of BERT in MT.
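As an illustration of the kind of comparison this abstract describes, the sketch below computes the average pairwise cosine similarity within a set of token representations, one of the metrics mentioned above. The tensor shapes and the `bert_states` / `nmt_states` names in the usage comment are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def avg_cosine_similarity(embeddings: torch.Tensor) -> float:
    """Mean pairwise cosine similarity over a (num_tokens, dim) matrix."""
    normed = F.normalize(embeddings, dim=-1)
    sims = normed @ normed.T                       # (n, n) cosine matrix
    n = sims.size(0)
    off_diag = sims.sum() - sims.diagonal().sum()  # drop self-similarity
    return (off_diag / (n * (n - 1))).item()

# Hypothetical usage: compare the two spaces on token vectors collected
# from each encoder over the same corpus.
# print(avg_cosine_similarity(bert_states), avg_cosine_similarity(nmt_states))
```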
Optimizing Deep Transformers for Chinese-Thai Low-Resource Translation
In this paper, we study a deep Transformer translation model for the
CCMT 2022 Chinese-Thai low-resource machine translation task. We first explore
the experiment settings (including the number of BPE merge operations, dropout
probability, embedding size, etc.) for the low-resource scenario with a
6-layer Transformer. Since increasing the number of layers also
increases the regularization on new model parameters (additional dropout
modules are introduced with the extra layers), we keep the best-performing
setting but increase the depth of the Transformer to 24 layers to obtain
improved translation quality. Our work obtains state-of-the-art performance on
Chinese-to-Thai translation in the constrained evaluation.
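To make the tuned dimensions concrete, here is a hypothetical configuration sketch covering the settings the abstract lists; the values are illustrative placeholders, not the paper's reported hyperparameters.

```python
# Illustrative placeholders only; the paper tunes these for Chinese-Thai.
deep_transformer_config = {
    "bpe_merge_operations": 10_000,  # smaller subword vocabulary for low-resource data
    "embedding_size": 512,
    "dropout": 0.3,                  # extra dropout arrives with the extra layers
    "encoder_layers": 24,            # deepened from the 6-layer baseline
    "decoder_layers": 24,
}
```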
Syntactic Knowledge via Graph Attention with BERT in Machine Translation
Although the Transformer model can effectively acquire context features via its
self-attention mechanism, it still does not effectively model deeper syntactic
knowledge. To alleviate this problem, we propose Syntactic knowledge via
Graph attention with BERT (SGB) for Machine Translation (MT). A Graph
Attention Network (GAT) and BERT jointly represent syntactic dependency
features as explicit knowledge of the source language to enrich source-language
representations and guide target-language generation. Our experiments use gold
syntax-annotated sentences and a Quality Estimation (QE) model to interpret
the improvements in translation quality attributable to syntactic knowledge,
rather than relying on BLEU alone. Experiments show that the proposed SGB
engines improve translation quality across the three MT tasks without
sacrificing BLEU scores. We investigate which source-sentence lengths benefit
the most and which dependencies are better identified by the SGB engines. We
also find that the GAT's learning of specific dependency relations is
reflected in the translation quality of sentences containing those relations,
and that syntax on the graph leads to new modeling of syntactic aspects of
source sentences in the middle and bottom layers of BERT.
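Below is a minimal single-head graph-attention layer over a dependency graph, sketching how syntactic structure might be combined with BERT token states. This is a generic GAT reconstruction under our own assumptions, not the paper's SGB implementation; the `DependencyGAT` name and its shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DependencyGAT(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_tokens, dim) BERT states; adj: (num_tokens, num_tokens)
        # 0/1 dependency adjacency (head-dependent arcs plus self-loops).
        z = self.proj(h)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)  # attend only along arcs
        return F.elu(alpha @ z)                # syntax-aware token states
```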
Language Model Prior for Low-Resource Neural Machine Translation
The scarcity of large parallel corpora is an important obstacle for neural
machine translation. A common solution is to exploit the knowledge of language
models (LMs) trained on abundant monolingual data. In this work, we propose a
novel approach to incorporate an LM as a prior in a neural translation model
(TM). Specifically, we add a regularization term that pushes the output
distributions of the TM to be probable under the LM prior, while avoiding wrong
predictions when the TM "disagrees" with the LM. This objective relates to
knowledge distillation, where the LM can be viewed as teaching the TM about the
target language. The proposed approach does not compromise decoding speed,
because the LM is used only at training time, unlike previous work that
requires it during inference. We present an analysis of the effects that
different methods have on the distributions of the TM. Results on two
low-resource machine translation datasets show clear improvements even with
limited monolingual data.
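A condensed sketch of such a training objective: standard cross-entropy plus a divergence term pulling the TM's output distribution toward a frozen LM. The function name, the temperature-scaled KL, and its direction are our assumptions for illustration; the paper's exact regularizer differs, notably in how it handles TM-LM disagreements.

```python
import torch.nn.functional as F

def lm_prior_loss(tm_logits, lm_logits, targets, beta=0.5, tau=2.0):
    # tm_logits, lm_logits: (batch, vocab); targets: (batch,) gold token ids.
    ce = F.cross_entropy(tm_logits, targets)
    # F.kl_div(log_q, p) computes KL(p || q); the frozen LM acts as teacher,
    # as in knowledge distillation (hence the temperature tau).
    kl = F.kl_div(
        F.log_softmax(tm_logits / tau, dim=-1),
        F.softmax(lm_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    )
    return ce + beta * (tau ** 2) * kl
```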
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
Transfer learning has fundamentally changed the landscape of natural language
processing (NLP) research. Many existing state-of-the-art models are first
pre-trained on a large text corpus and then fine-tuned on downstream tasks.
However, due to limited data resources from downstream tasks and the extremely
large capacity of pre-trained models, aggressive fine-tuning often causes the
adapted model to overfit the downstream data and forget the knowledge of the
pre-trained model. To address this issue in a more principled manner, we
propose a new computational framework for robust and efficient fine-tuning of
pre-trained language models. Specifically, our framework contains two important
ingredients: 1. smoothness-inducing regularization, which effectively manages
the capacity of the model; 2. Bregman proximal point optimization, which is a
class of trust-region methods and can prevent knowledge forgetting. Our
experiments demonstrate that the proposed method achieves state-of-the-art
performance on multiple NLP benchmarks. (Accepted at the 58th Annual Meeting
of the Association for Computational Linguistics, ACL 2020.)
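The following is a simplified sketch of the first ingredient, smoothness-inducing regularization: penalizing how much predictions move under a small perturbation of the input embeddings. SMART obtains the perturbation via an inner maximization; the random-noise variant below, and the assumption that `model` consumes embeddings directly, are simplifications of ours.

```python
import torch
import torch.nn.functional as F

def smoothness_penalty(model, embeds, eps=1e-3):
    # embeds: (batch, seq, dim) input embeddings; model maps them to logits.
    p = F.log_softmax(model(embeds), dim=-1)
    q = F.log_softmax(model(embeds + eps * torch.randn_like(embeds)), dim=-1)
    # Symmetric KL between clean and perturbed predictions.
    return (F.kl_div(q, p.exp(), reduction="batchmean")
            + F.kl_div(p, q.exp(), reduction="batchmean"))
```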
From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding
The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual (x) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax, and translation for transfer. We study two setups that differ in the type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
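A minimal sketch of such a joint step: main-task losses on English SLU data combined with a weighted auxiliary loss on raw non-English text. The `model.*_loss` helpers and the weighting are hypothetical; they only illustrate the multi-task combination described above.

```python
def joint_training_step(model, slu_batch, aux_batch, aux_weight=1.0):
    # Main tasks: intent classification and slot filling on English data.
    main_loss = model.intent_loss(slu_batch) + model.slot_loss(slu_batch)
    # Auxiliary task on raw target-language text, e.g. masked LM or translation.
    aux_loss = model.auxiliary_loss(aux_batch)
    return main_loss + aux_weight * aux_loss
```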