Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that well
configured Transformer models outperform our baseline models based on a
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention-based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. The positional encoding is an essential
augmentation for the self-attention mechanism which is invariant to sequence
ordering. However, in an autoregressive setup, as is the case for language
modeling, the amount of information increases along the position dimension,
which is a positional signal on its own. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.
Comment: To appear in the proceedings of INTERSPEECH 201
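The positional argument in the abstract above can be illustrated with a toy example. The following is a minimal NumPy sketch, not the authors' model: a single causal self-attention head with no positional encoding, where the all-zero queries and keys and the marker value on the first token are assumptions chosen purely for illustration. Because position i attends to i + 1 tokens, the output differs at every position even though no positional input was supplied.

```python
import numpy as np


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    No positional encoding is added anywhere in this function.
    """
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # True above the diagonal marks future positions, which are masked out.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the visible (non-future) positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy setup (assumption): identical queries and keys, so attention is uniform
# over the visible prefix. Only the first token carries a non-zero value.
t, d = 5, 4
q = np.zeros((t, d))
k = np.zeros((t, d))
v = np.zeros((t, 1))
v[0, 0] = 1.0

out, weights = causal_attention(q, k, v)
# Position i averages over i + 1 tokens, so out[i, 0] equals 1 / (i + 1).
print(out[:, 0])
```

In this toy case the output at position i is 1 / (i + 1), so a later layer could in principle recover the position from it. This only shows that such a signal exists; how a trained deep model actually uses it is the subject of the attention-weight analysis mentioned in the abstract.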
Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition
Transformer-based models excel in speech recognition. Existing efforts to
optimize Transformer inference, typically for long-context applications, center
on simplifying attention score calculations. However, streaming speech
recognition models usually process a limited number of tokens each time, making
attention score calculation less of a bottleneck. Instead, the bottleneck lies
in the linear projection layers of multi-head attention and feedforward
networks, constituting a substantial portion of the model size and contributing
significantly to computation, memory, and power usage.
To address this bottleneck, we propose folding attention, a technique
targeting these linear layers, significantly reducing model size and improving
memory and power efficiency. Experiments on on-device Transformer-based
streaming speech recognition models show that folding attention reduces model
size (and corresponding memory consumption) by up to 24% and power consumption
by up to 23%, all without compromising model accuracy or increasing computation overhead.
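The bottleneck claim in the abstract above can be checked with rough arithmetic. The sketch below is not folding attention itself, whose details the abstract does not give. It only counts multiply-accumulate operations in a standard Transformer layer, under assumed sizes (8 tokens per step, model width 512, feedforward width 2048) and with attention restricted to the current tokens. The function name `layer_macs` is hypothetical.

```python
def layer_macs(num_tokens, d_model, d_ff):
    """Approximate multiply-accumulate counts for one Transformer layer.

    Assumes standard multi-head attention (Q, K, V and output projections)
    followed by a two-layer feedforward network. Biases, softmax and
    normalization are ignored.
    """
    projections = 4 * num_tokens * d_model * d_model   # Q, K, V, output
    feedforward = 2 * num_tokens * d_model * d_ff      # two linear layers
    # Attention scores (Q K^T) plus the weighted sum over values.
    attention_scores = 2 * num_tokens * num_tokens * d_model
    return {
        "linear": projections + feedforward,
        "attention_scores": attention_scores,
    }


# Hypothetical streaming configuration: few tokens per step.
costs = layer_macs(num_tokens=8, d_model=512, d_ff=2048)
ratio = costs["linear"] / costs["attention_scores"]
print(costs, ratio)
```

With these assumed sizes the linear layers cost several hundred times more than the attention score computation, which is consistent with the abstract's point that score-level optimizations help little when each step processes few tokens.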
English Broadcast News Speech Recognition by Humans and Machines
With recent advances in deep learning, considerable attention has been given
to achieving automatic speech recognition performance close to human
performance on tasks like conversational telephone speech (CTS) recognition. In
this paper we evaluate the usefulness of these proposed techniques on broadcast
news (BN), a similarly challenging task. We also perform a set of recognition
measurements to understand how close the achieved automatic speech recognition
results are to human performance on this task. On two publicly available BN
test sets, DEV04F and RT04, our speech recognition system using LSTM and
residual network based acoustic models with a combination of n-gram and neural
network language models performs at 6.5% and 5.9% word error rate. By achieving
new performance milestones on these test sets, our experiments show that
techniques developed on other related tasks, like CTS, can be transferred to
achieve similar performance. In contrast, the best measured human word error
rate on these test sets is much lower, at 3.6% and 2.8% respectively,
indicating that there is still room for new techniques and improvements in this
space, to reach human performance levels.
Comment: © 2019 IEEE.
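The results in the abstract above are reported as word error rate (WER). As a reminder of how that metric is commonly computed, here is a minimal sketch: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name and the example sentences are illustrative assumptions, and real scoring tools also apply text normalization that is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

Under this definition, a WER of 6.5% means roughly 6.5 word-level errors per 100 reference words.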