Dynamic Sentence Sampling for Efficient Training of Neural Machine Translation
Traditional neural machine translation (NMT) involves a fixed training
procedure where each sentence is sampled once during each epoch. In practice,
some sentences are well learned during the first few epochs; under this
procedure, however, they continue to be trained alongside the sentences that
are not yet well learned for 10-30 epochs, which wastes time. Here, we propose
an efficient method to dynamically
sample the sentences in order to accelerate the NMT training. In this approach,
a weight is assigned to each sentence based on the measured difference between
the training costs of two iterations. Further, in each epoch, a certain
percentage of sentences are dynamically sampled according to their weights.
Empirical results on the NIST Chinese-to-English and the WMT
English-to-German tasks show that the proposed method can significantly
accelerate NMT training and improve NMT performance.
Comment: Revised version of ACL-201
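
The sampling idea in the abstract above can be illustrated with a minimal sketch. The abstract states only that each sentence's weight is based on the difference between its training costs in two iterations, and that a certain percentage of sentences is sampled according to those weights. Everything else here is an assumption made for illustration: the relative cost drop as the difference measure, the min-max normalisation, the `floor` smoothing term, the weighted sampling-without-replacement scheme, and the function names `sentence_weights` and `sample_sentences`. The paper's actual formulation may differ.

```python
import random


def sentence_weights(prev_costs, curr_costs, eps=1e-8):
    """Weight each sentence by its relative cost drop between two iterations.

    Assumption: the weight is the relative drop, min-max normalised to [0, 1].
    """
    diffs = [(p - c) / (p + eps) for p, c in zip(prev_costs, curr_costs)]
    lo, hi = min(diffs), max(diffs)
    if hi - lo < eps:
        # All sentences changed by the same amount: treat them equally.
        return [1.0] * len(diffs)
    return [(d - lo) / (hi - lo) for d in diffs]


def sample_sentences(weights, ratio, rng, floor=1e-3):
    """Pick round(ratio * N) distinct sentence indices, favouring large weights.

    Uses weighted sampling without replacement (key = u ** (1 / w), keep the
    largest keys). `floor` is a hypothetical smoothing term that gives
    zero-weight sentences a small chance of being selected.
    """
    k = max(1, round(ratio * len(weights)))
    keys = [(rng.random() ** (1.0 / (w + floor)), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])


# Toy per-sentence training costs from two consecutive iterations.
prev = [4.0, 4.0, 4.0, 4.0]
curr = [1.0, 2.0, 3.0, 4.0]
weights = sentence_weights(prev, curr)
picked = sample_sentences(weights, ratio=0.5, rng=random.Random(0))
```

In this toy run, `picked` holds the indices of the half of the corpus that would be trained on in the next epoch. Whether sentences with a large or a small cost change should be favoured is a design choice that the abstract does not settle; the sketch favours large changes.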
Unfolding and Shrinking Neural Machine Translation Ensembles
Ensembling is a well-known technique in neural machine translation (NMT) to
improve system performance. Instead of a single neural net, multiple neural
nets with the same topology are trained separately, and the decoder generates
predictions by averaging over the individual models. Ensembling often improves
the quality of the generated translations drastically. However, it is not
suitable for production systems because it is cumbersome and slow. This work
aims to reduce the runtime to be on par with a single system without
compromising the translation quality. First, we show that the ensemble can be
unfolded into a single large neural network which imitates the output of the
ensemble system. We show that unfolding can already improve the runtime in
practice since more work can be done on the GPU. We proceed by describing a set
of techniques to shrink the unfolded network by reducing the dimensionality of
layers. On Japanese-English, we report that the resulting network has the size
and decoding speed of a single NMT network but performs at the level of a
3-ensemble system.
Comment: Accepted at EMNLP 201
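
The unfolding step described in the abstract above can be sketched on a toy model. The example below is not the paper's NMT architecture: it assumes small feed-forward networks with one tanh hidden layer and a softmax output, and the helper names (`model_probs`, `ensemble_probs`, `unfold`, `unfolded_probs`) are hypothetical. It shows only that several same-topology networks can be merged into one wider network, with stacked first-layer weights and a block-diagonal output layer, whose averaged output matches the ensemble average. The shrinking step (reducing layer dimensionality) is not shown.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def model_probs(x, W1, b1, W2, b2):
    # One toy network: tanh hidden layer followed by a softmax output.
    h = np.tanh(W1 @ x + b1)
    return softmax(W2 @ h + b2)


def ensemble_probs(x, models):
    # Standard ensembling: run every model and average the distributions.
    return np.mean([model_probs(x, *m) for m in models], axis=0)


def unfold(models):
    # Stack the hidden layers and place the output layers on a block diagonal,
    # so a single wide network computes every member in one pass.
    n = len(models)
    vocab, hidden = models[0][2].shape
    W1 = np.concatenate([m[0] for m in models], axis=0)
    b1 = np.concatenate([m[1] for m in models], axis=0)
    W2 = np.zeros((n * vocab, n * hidden))
    for k, m in enumerate(models):
        W2[k * vocab:(k + 1) * vocab, k * hidden:(k + 1) * hidden] = m[2]
    b2 = np.concatenate([m[3] for m in models], axis=0)
    return W1, b1, W2, b2, n


def unfolded_probs(x, W1, b1, W2, b2, n):
    # One forward pass through the wide network, then average the n
    # per-member softmax blocks.
    h = np.tanh(W1 @ x + b1)
    logits = (W2 @ h + b2).reshape(n, -1)
    return softmax(logits).mean(axis=0)


rng = np.random.default_rng(0)
n_in, n_hidden, n_vocab = 4, 5, 6
models = [
    (rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden),
     rng.normal(size=(n_vocab, n_hidden)), rng.normal(size=n_vocab))
    for _ in range(3)
]
x = rng.normal(size=n_in)
p_ensemble = ensemble_probs(x, models)
p_unfolded = unfolded_probs(x, *unfold(models))
```

Here `p_ensemble` and `p_unfolded` agree up to floating-point error. The unfolded network replaces three separate forward passes with one pass over larger matrices, which is the kind of restructuring that lets more of the work run on the GPU.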