60,494 research outputs found
Fast Neural Machine Translation Implementation
This paper describes the submissions to the efficiency track for GPUs at the
Workshop for Neural Machine Translation and Generation by members of the
University of Edinburgh, Adam Mickiewicz University, Tilde and University of
Alicante. We focus on efficient implementation of the recurrent deep-learning
model as implemented in Amun, the fast inference engine for neural machine
translation. We improve the performance with an efficient mini-batching
algorithm, and by fusing the softmax operation with the k-best extraction
algorithm. Submissions using Amun were first, second and third fastest in the
GPU efficiency track
Fast machine translation on parallel and massively parallel hardware
Parallel systems have been widely adopted in the field of machine translation, because
the raw computational power they offer is well suited to this computationally intensive
task. However programming for parallel hardware is not trivial as it requires redesign
of the existing algorithms. In my thesis I design efficient algorithms for machine translation
on parallel hardware. I identify memory accesses as the biggest bottleneck to
processing speed and propose novel algorithms that minimize them. I present three distinct
case studies in which minimizing memory access substantially improves speed:
Starting with statistical machine translation, I design a phrase table that makes decoding
ten times faster on a multi-threaded CPU. Next, I design a GPU-based n-gram
language model that is twice as fast per £ as a highly optimized CPU implementation.
Turning to neural machine translation, I design new stochastic gradient descent techniques
that make end-to-end training twice as fast. The work in this thesis has been
incorporated in two popular machine translation toolkits: Moses and Marian
FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA
Autoregressive convolutional neural networks (CNNs) have been widely
exploited for sequence generation tasks such as audio synthesis, language
modeling and neural machine translation. WaveNet is a deep autoregressive CNN
composed of several stacked layers of dilated convolution that is used for
sequence generation. While WaveNet produces state-of-the art audio generation
results, the naive inference implementation is quite slow; it takes a few
minutes to generate just one second of audio on a high-end GPU. In this work,
we develop the first accelerator platform~\textit{FastWave} for autoregressive
convolutional neural networks, and address the associated design challenges. We
design the Fast-Wavenet inference model in Vivado HLS and perform a wide range
of optimizations including fixed-point implementation, array partitioning and
pipelining. Our model uses a fully parameterized parallel architecture for fast
matrix-vector multiplication that enables per-layer customized latency
fine-tuning for further throughput improvement. Our experiments comparatively
assess the trade-off between throughput and resource utilization for various
optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA that uses
only on-chip memory, achieves 66 faster generation speed compared to CPU
implementation and 11 faster generation speed than GPU implementation.Comment: Published as a conference paper at ICCAD 201
RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition
We compare the fast training and decoding speed of RETURNN of attention
models for translation, due to fast CUDA LSTM kernels, and a fast pure
TensorFlow beam search decoder. We show that a layer-wise pretraining scheme
for recurrent attention models gives over 1% BLEU improvement absolute and it
allows to train deeper recurrent encoder networks. Promising preliminary
results on max. expected BLEU training are presented. We are able to train
state-of-the-art models for translation and end-to-end models for speech
recognition and show results on WMT 2017 and Switchboard. The flexibility of
RETURNN allows a fast research feedback loop to experiment with alternative
architectures, and its generality allows to use it on a wide range of
applications.Comment: accepted as demo paper on ACL 201
Simple Recurrent Units for Highly Parallelizable Recurrence
Common recurrent neural architectures scale poorly due to the intrinsic
difficulty in parallelizing their state computations. In this work, we propose
the Simple Recurrent Unit (SRU), a light recurrent unit that balances model
capacity and scalability. SRU is designed to provide expressive recurrence,
enable highly parallelized implementation, and comes with careful
initialization to facilitate training of deep models. We demonstrate the
effectiveness of SRU on multiple NLP tasks. SRU achieves 5--9x speed-up over
cuDNN-optimized LSTM on classification and question answering datasets, and
delivers stronger results than LSTM and convolutional models. We also obtain an
average of 0.7 BLEU improvement over the Transformer model on translation by
incorporating SRU into the architecture.Comment: EMNL
- …