Unfolding and Shrinking Neural Machine Translation Ensembles
Ensembling is a well-known technique in neural machine translation (NMT) to
improve system performance. Instead of a single neural net, multiple neural
nets with the same topology are trained separately, and the decoder generates
predictions by averaging over the individual models. Ensembling often improves
the quality of the generated translations drastically. However, it is not
suitable for production systems because it is cumbersome and slow. This work
aims to reduce the runtime to be on par with a single system without
compromising the translation quality. First, we show that the ensemble can be
unfolded into a single large neural network which imitates the output of the
ensemble system. We show that unfolding can already improve the runtime in
practice since more work can be done on the GPU. We proceed by describing a set
of techniques to shrink the unfolded network by reducing the dimensionality of
layers. On Japanese-English we report that the resulting network has the size
and decoding speed of a single NMT network but performs on the level of a
3-ensemble system.
Comment: Accepted at EMNLP 2017
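The unfolding idea can be illustrated with a toy sketch: for linear layers, stacking the weight matrices of all ensemble members into one block matrix and appending a fixed averaging projection reproduces the ensemble output exactly in a single larger network. This is a minimal illustration, not the paper's full construction, which handles complete nonlinear NMT networks; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n_models = 4, 3, 2

# Two "ensemble members" with the same topology (one linear layer each).
Ws = [rng.normal(size=(d_hid, d_in)) for _ in range(n_models)]
bs = [rng.normal(size=d_hid) for _ in range(n_models)]

x = rng.normal(size=d_in)

# Ensemble prediction: average over the individual models' outputs.
ensemble_out = np.mean([W @ x + b for W, b in zip(Ws, bs)], axis=0)

# "Unfolded" network: one large layer computes all members at once;
# a fixed projection P then averages the stacked outputs.
W_big = np.vstack(Ws)                                  # (n_models*d_hid, d_in)
b_big = np.concatenate(bs)
P = np.hstack([np.eye(d_hid)] * n_models) / n_models   # averaging layer

unfolded_out = P @ (W_big @ x + b_big)

assert np.allclose(ensemble_out, unfolded_out)
```

Running all members inside one large network is what lets more of the work stay on the GPU; the shrinking techniques then reduce the dimensionality of the enlarged layers.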
Pieces of Eight: 8-bit Neural Machine Translation
Neural machine translation has achieved levels of fluency and adequacy that
would have been surprising a short time ago. Output quality is extremely
relevant for industry purposes; however, it is equally important to produce
results in the shortest time possible, mainly for latency-sensitive
applications and to control cloud hosting costs. In this paper we show the
effectiveness of translating with 8-bit quantization for models that have been
trained using 32-bit floating point values. Results show that 8-bit translation
yields a non-negligible speed improvement with no degradation in accuracy
and adequacy.
Comment: To appear at NAACL 2018 Industry Track
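A common way to run a 32-bit-trained model in 8 bits is post-training symmetric quantization: scale each weight matrix so its values fit the int8 range, round, and fold the scale back into the matrix multiply. The abstract does not specify the paper's exact scheme, so this is a generic hedged sketch; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4)).astype(np.float32)  # weights trained in float32

# Symmetric 8-bit quantization: map [-max|W|, max|W|] onto [-127, 127].
scale = float(np.abs(W).max()) / 127.0
W_q = np.round(W / scale).astype(np.int8)       # stored/compute in int8

# At inference, the int8 product is rescaled back to float.
W_deq = W_q.astype(np.float32) * scale

x = rng.normal(size=4).astype(np.float32)
err = float(np.abs(W @ x - W_deq @ x).max())    # quantization error
```

With 127 levels per side, the per-weight rounding error is at most half a scale step, which is why the output degradation stays negligible while the smaller integer representation speeds up the matrix multiplies.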