Dense Information Flow for Neural Machine Translation
Recently, neural machine translation has achieved remarkable progress by
introducing well-designed deep neural networks into its encoder-decoder
framework. From the optimization perspective, residual connections are adopted
to improve learning performance for both encoder and decoder in most of these
deep architectures, and advanced attention connections are applied as well.
Inspired by the success of the DenseNet model in computer vision problems, in
this paper, we propose a densely connected NMT architecture (DenseNMT) that
trains more efficiently. The proposed DenseNMT not only allows
dense connection in creating new features for both encoder and decoder, but
also uses the dense attention structure to improve attention quality. Our
experiments on multiple datasets show that the DenseNMT structure is more
competitive and efficient.
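The dense-connection idea borrowed from DenseNet can be illustrated with a toy numpy sketch (not the paper's implementation; `dense_block` and the sizes are invented for illustration): each layer consumes the concatenation of the input and all previous layer outputs.

```python
import numpy as np

def dense_block(x, weights):
    # Each layer sees the concatenation of the input and all earlier outputs.
    features = [x]
    for W in weights:
        h = np.tanh(np.concatenate(features, axis=-1) @ W)
        features.append(h)
    # The block's output also concatenates everything, so downstream modules
    # (e.g. attention) have direct access to every intermediate feature.
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
d = 4
# layer i receives (i + 1) * d input features
weights = [0.1 * rng.standard_normal((d * (i + 1), d)) for i in range(3)]
out = dense_block(rng.standard_normal((2, d)), weights)
print(out.shape)  # (2, 16): the 4-dim input plus three 4-dim layer outputs
```

Because earlier features are never discarded, gradients reach shallow layers directly, which is the training-efficiency argument behind dense connectivity.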
Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input
Non-autoregressive translation (NAT) models, which remove the dependence on
previous target tokens from the inputs of the decoder, achieve significant
inference speedup but at the cost of inferior accuracy compared to
autoregressive translation (AT) models. Previous work shows that the quality of
the inputs of the decoder is important and largely impacts the model accuracy.
In this paper, we propose two methods to enhance the decoder inputs so as to
improve NAT models. The first one directly leverages a phrase table generated
by conventional SMT approaches to translate source tokens to target tokens,
which are then fed into the decoder as inputs. The second one transforms
source-side word embeddings to target-side word embeddings through
sentence-level alignment and word-level adversary learning, and then feeds the
transformed word embeddings into the decoder as inputs. Experimental results
show that our method largely outperforms the NAT baseline~\citep{gu2017non} in
BLEU score on the WMT14 English-German and WMT16 English-Romanian tasks.
Comment: AAAI 2019
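The first method can be sketched as a greedy longest-match lookup in an SMT-style phrase table that produces token-level decoder inputs (the function name, table, and matching strategy here are invented for illustration, not the paper's exact procedure):

```python
# Toy phrase table mapping source phrases to target phrases.
phrase_table = {("guten", "morgen"): ["good", "morning"], ("welt",): ["world"]}

def phrase_table_decoder_inputs(source, table, max_phrase_len=2):
    out, i = [], 0
    while i < len(source):
        for n in range(max_phrase_len, 0, -1):  # prefer longer phrases
            key = tuple(source[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            out.append(source[i])  # copy tokens the table does not cover
            i += 1
    return out

print(phrase_table_decoder_inputs(["guten", "morgen", "welt"], phrase_table))
# ['good', 'morning', 'world']
```

The resulting target-side tokens give the NAT decoder a much more informative starting point than copied source embeddings.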
Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement
With the promising progress of deep neural networks, layer aggregation has
been used to fuse information across layers in various fields, such as computer
vision and machine translation. However, most of the previous methods combine
layers in a static fashion in that their aggregation strategy is independent of
specific hidden states. Inspired by recent progress on capsule networks, in
this paper we propose to use routing-by-agreement strategies to aggregate
layers dynamically. Specifically, the algorithm learns the probability of a
part (individual layer representations) assigned to a whole (aggregated
representations) in an iterative way and combines parts accordingly. We
implement our algorithm on top of the state-of-the-art neural machine
translation model TRANSFORMER and conduct experiments on the widely-used WMT14
English-German and WMT17 Chinese-English translation datasets. Experimental
results across language pairs show that the proposed approach consistently
outperforms the strong baseline model and a representative static aggregation
model. Comment: AAAI 2019
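A minimal sketch of routing-by-agreement over layer representations, under simplifying assumptions (a single aggregated "whole" and softmax routing; the paper's algorithm is richer):

```python
import numpy as np

def route_layers(layer_reprs, n_iters=3):
    # layer_reprs: (num_layers, dim) -- one vector per encoder/decoder layer
    L = layer_reprs.shape[0]
    b = np.zeros(L)                                 # routing logits, one per "part"
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum()             # part-to-whole probabilities
        s = (c[:, None] * layer_reprs).sum(axis=0)  # aggregated "whole"
        b = b + layer_reprs @ s                     # agreement raises a layer's logit
    return s

reprs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
agg = route_layers(reprs)
print(agg.shape)  # (2,)
```

The assignment probabilities depend on the hidden states themselves, which is what makes the aggregation dynamic rather than static.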
Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation
Non-autoregressive translation (NAT) models remove the dependence on previous
target tokens and generate all target tokens in parallel, resulting in
significant inference speedup but at the cost of inferior translation accuracy
compared to autoregressive translation (AT) models. Considering that AT models
have higher accuracy and are easier to train than NAT models, and both of them
share the same model configurations, a natural idea to improve the accuracy of
NAT models is to transfer a well-trained AT model to an NAT model through
fine-tuning. However, since AT and NAT models differ greatly in training
strategy, straightforward fine-tuning does not work well. In this work, we
introduce curriculum learning into fine-tuning for NAT. Specifically, we design
a curriculum in the fine-tuning process to progressively switch the training
from autoregressive generation to non-autoregressive generation. Experiments on
four benchmark translation datasets show that the proposed method achieves
substantial BLEU improvements over previous NAT baselines in terms of
translation accuracy, and greatly speeds up the inference process over AT
baselines. Comment: AAAI 2020
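One simple way to realize such a curriculum, sketched here under the assumption of a linear schedule (the schedule and mask-based mixing are illustrative, not the paper's exact design): replace a growing fraction of teacher-forced decoder inputs with a mask token, moving from fully autoregressive toward fully parallel training.

```python
import random

def curriculum_decoder_inputs(target, step, total_steps, mask="<mask>", seed=0):
    # Fraction of positions trained non-autoregressively grows linearly.
    p = min(1.0, step / total_steps)
    rng = random.Random(seed)
    return [mask if rng.random() < p else tok for tok in target]

tokens = ["we", "love", "mt"]
print(curriculum_decoder_inputs(tokens, step=0, total_steps=10))   # unchanged (AT-style)
print(curriculum_decoder_inputs(tokens, step=10, total_steps=10))  # all masked (NAT-style)
```

At step 0 the model still trains exactly as the well-trained AT model did; by the end, every position is predicted in parallel.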
Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning
In sequence-to-sequence learning, the decoder relies on the attention
mechanism to efficiently extract information from the encoder. While it is
common practice to draw information from only the last encoder layer, recent
work has proposed to use representations from different encoder layers for
diversified levels of information. Nonetheless, the decoder still obtains only
a single view of the source sequences, which might lead to insufficient
training of the encoder layer stack due to the hierarchy bypassing problem. In
this work, we propose layer-wise cross-view decoding, where for each decoder
layer, together with the representations from the last encoder layer, which
serve as a global view, those from other encoder layers are supplemented for a
stereoscopic view of the source sequences. Systematic experiments show that we
successfully address the hierarchy bypassing problem and substantially improve
the performance of sequence-to-sequence learning with deep representations on
diverse tasks. Comment: 9 pages, 6 figures
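The cross-view idea can be sketched under a strong simplification (the real model combines views through attention; here we just average, and the round-robin pairing of decoder layers with auxiliary encoder layers is an assumption):

```python
import numpy as np

def cross_view_context(encoder_layers, decoder_layer_idx, alpha=0.5):
    # Global view: the last encoder layer; auxiliary view: one of the
    # lower encoder layers, chosen per decoder layer (round-robin here).
    global_view = encoder_layers[-1]
    aux_view = encoder_layers[decoder_layer_idx % (len(encoder_layers) - 1)]
    return alpha * global_view + (1 - alpha) * aux_view

layers = [np.full((3, 2), float(i)) for i in range(4)]  # 4 toy encoder layers
ctx = cross_view_context(layers, decoder_layer_idx=1)
print(ctx[0])  # [2. 2.]: mix of last layer (3.0) and layer 1 (1.0)
```

Because every encoder layer now contributes to some decoder layer's context, gradients flow into the full encoder stack instead of bypassing its lower levels.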
Transformer-based NMT : modeling, training and implementation
International trade and industrial collaborations enable countries and regions to concentrate their development on specific industries while making the most of other countries' specializations, which significantly accelerates global development. However, globalization also increases the demand for cross-region communication. Language barriers between the many languages spoken worldwide make deep collaboration between groups speaking different languages difficult, increasing the need for translation. Language technology, specifically Machine Translation (MT), holds the promise of enabling efficient, real-time communication between languages at minimal cost. Even though modern computers can perform computation in parallel very fast, providing machine translation users with translations at very low latency, and although the evolution from Statistical Machine Translation (SMT) to Neural Machine Translation (NMT) with advanced deep learning algorithms has significantly boosted translation quality, current machine translation algorithms are still far from translating all input accurately. Thus, how to further improve the performance of state-of-the-art NMT algorithms remains a valuable open research question that has received wide attention.

In the research presented in this thesis, we first investigate the long-distance relation modeling ability of the state-of-the-art NMT model, the Transformer. We propose to learn source phrase representations and incorporate them into the Transformer translation model, aiming to enhance its ability to capture long-distance dependencies. Second, although previous work (Bapna et al., 2018) suggests that deep Transformers have difficulty converging, we empirically find that the convergence of deep Transformers depends on the interaction between the layer normalization and the residual connections employed to stabilize training.
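The layer-norm/residual interaction at issue can be illustrated with the two standard orderings (toy numpy functions, not the thesis's actual Transformer sublayers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual sum,
    # so the identity path passes through every layer norm.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm ordering: the residual path stays a pure identity,
    # commonly credited with easing convergence of deep Transformers.
    return x + sublayer(layer_norm(x))

x = np.array([[1.0, 2.0, 3.0]])
zero = lambda h: np.zeros_like(h)
print(np.allclose(pre_norm_block(x, zero), x))  # True: identity path preserved
```

With a zero sublayer, the pre-norm block reduces exactly to the identity, while the post-norm block still rescales its input, which is why the two orderings behave so differently as depth grows.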
We conduct a theoretical study of how to ensure the convergence of Transformers, especially deep Transformers, and propose to guarantee it by putting a Lipschitz constraint on the parameter initialization. Finally, we investigate how to dynamically determine proper and efficient batch sizes during the training of the Transformer model. We find that the gradient direction stabilizes with increasing batch size during gradient accumulation. We therefore propose to dynamically adjust batch sizes during training by monitoring the change in gradient direction within gradient accumulation, and to obtain a proper and efficient batch size by stopping the accumulation once the gradient direction starts to fluctuate.

For the research in this thesis, we also implement our own NMT toolkit, the Neutron implementation of the Transformer and its variants. In addition to providing the fundamental features underlying the approaches presented in this thesis, it supports many advanced features from recent cutting-edge research. Implementations of all our approaches are included and open-sourced in the toolkit.

To compare with previous approaches, we mainly conducted our experiments on data from the WMT 14 English to German (En-De) and English to French (En-Fr) news translation tasks, except when studying the convergence of deep Transformers, where we replaced the WMT 14 En-Fr task with the WMT 15 Czech to English (Cs-En) news translation task to compare with Bapna et al. (2018). The sizes of these datasets vary from medium (WMT 14 En-De, ~4.5M sentence pairs) to very large (WMT 14 En-Fr, ~36M sentence pairs), so we expect our approaches to help improve translation quality between popular language pairs that are widely used and have sufficient data.

China Scholarship Council
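The dynamic batch-size idea from the thesis abstract can be sketched as follows (function name, cosine criterion, and threshold are illustrative assumptions, not the thesis's exact procedure): keep accumulating micro-batch gradients while the accumulated direction is still changing, and stop once it has stabilized.

```python
import numpy as np

def accumulate_until_stable(micro_grads, cos_threshold=0.99, max_micro_batches=64):
    acc = None
    for step, g in enumerate(micro_grads, start=1):
        new_acc = g if acc is None else acc + g
        if acc is not None:
            # Cosine similarity between the old and new accumulated gradients:
            # close to 1 means adding more micro-batches no longer changes direction.
            cos = acc @ new_acc / (np.linalg.norm(acc) * np.linalg.norm(new_acc))
            if cos >= cos_threshold or step >= max_micro_batches:
                return new_acc, step  # direction stabilized: use this batch size
        acc = new_acc
    return acc, step

grads = [np.array([1.0, 0.0])] * 8  # perfectly consistent micro-batch gradients
total, used = accumulate_until_stable(grads)
print(used)  # 2: with identical gradients the direction is stable immediately
```

Noisy gradients would keep the cosine below the threshold longer, so the effective batch size automatically grows when the optimization signal is unstable and shrinks when it is clean.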