RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition
We demonstrate the fast training and decoding speed of RETURNN for attention
models in translation, enabled by fast CUDA LSTM kernels and a fast pure
TensorFlow beam search decoder. We show that a layer-wise pretraining scheme
for recurrent attention models gives over 1% absolute BLEU improvement and
allows training deeper recurrent encoder networks. Promising preliminary
results on maximum expected BLEU training are presented. We are able to train
state-of-the-art models for translation and end-to-end models for speech
recognition, and show results on WMT 2017 and Switchboard. The flexibility of
RETURNN enables a fast research feedback loop for experimenting with alternative
architectures, and its generality allows it to be used in a wide range of
applications.
Comment: accepted as a demo paper at ACL 201
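
As a rough illustration of the layer-wise pretraining idea above: training
starts with a shallow recurrent encoder, and layers are added as pretraining
progresses. The sketch below is a minimal Python schematic; the schedule
function and its parameters are hypothetical, not RETURNN's actual
configuration API.

    # Layer-wise pretraining sketch: grow the encoder depth over the
    # first pretraining steps (all names hypothetical).

    def encoder_depth_schedule(pretrain_step, start_layers=2, final_layers=6):
        """Number of encoder LSTM layers to use at a given pretrain step."""
        return min(start_layers + pretrain_step, final_layers)

    for step in range(6):
        depth = encoder_depth_schedule(step)
        print(f"pretrain step {step}: encoder has {depth} LSTM layers")
        # In a real setup, the parameters of the previous, shallower
        # network would be copied into the grown network before
        # training continues.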
Improved training of end-to-end attention models for speech recognition
Sequence-to-sequence attention-based models on subword units allow simple
open-vocabulary end-to-end speech recognition. In this work, we show that such
models can achieve competitive results on the Switchboard 300h and LibriSpeech
1000h tasks. In particular, we report the state-of-the-art word error rates
(WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets
of LibriSpeech. We introduce a new pretraining scheme by starting with a high
time reduction factor and lowering it during training, which is crucial both
for convergence and final performance. In some experiments, we also use an
auxiliary CTC loss function to aid convergence. In addition, we train long
short-term memory (LSTM) language models on subword units. With shallow fusion,
we report up to 27% relative improvement in WER over the attention baseline
without a language model.
Comment: submitted to Interspeech 201
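
Shallow fusion, used above to combine the attention model with the external
LSTM language model, interpolates the two models' log-probabilities at each
beam-search step. A minimal numpy sketch under that reading; the fusion weight
lam is a hypothetical placeholder, in practice tuned on held-out data.

    import numpy as np

    def shallow_fusion_scores(log_p_att, log_p_lm, lam=0.3):
        """Combine per-subword log-probs of the attention model and the
        language model; lam is the (hypothetical) LM weight."""
        return log_p_att + lam * log_p_lm

    # Toy example with a 4-subword vocabulary.
    log_p_att = np.log([0.5, 0.2, 0.2, 0.1])
    log_p_lm = np.log([0.1, 0.6, 0.2, 0.1])
    print(shallow_fusion_scores(log_p_att, log_p_lm))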
Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias
As one popular modeling approach for end-to-end speech recognition,
attention-based encoder-decoder models are known to suffer from the length bias
and the corresponding beam problem. Different approaches have been applied in
simple beam search to ease the problem, most of which are heuristic-based and
require considerable tuning. We show that such heuristics are not a proper
modeling refinement and lead to severe performance degradation as the beam size
grows. We propose a novel beam search derived from reinterpreting the sequence
posterior with explicit length modeling. Applying the reinterpreted probability
together with beam pruning yields a robust model modification that allows
reliable comparison among output sequences of different lengths. Experimental
verification on the LibriSpeech corpus shows that the proposed approach solves
the length bias problem without heuristics or additional tuning effort. It
provides robust decision making and consistently good performance under both
small and very large beam sizes. Compared with the best results of the
heuristic baseline, the proposed approach achieves the same WER on the 'clean'
sets and 4% relative improvement on the 'other' sets. We also show that it is
more efficient with the additionally derived early-stopping criterion.
Comment: accepted at INTERSPEECH 202
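
The abstract does not spell the reinterpretation out, but the general shape of
length-aware scoring can be sketched: a finished hypothesis is scored by its
label log-probabilities plus an explicit length term, so that hypotheses of
different lengths become directly comparable without heuristics such as length
normalization. All names below are hypothetical; this only illustrates the
idea, not the paper's exact derivation.

    import math

    def length_aware_score(label_log_probs, log_p_length):
        """Score a finished hypothesis: label log-probs plus an explicit
        length-model term (hypothetical interface)."""
        return sum(label_log_probs) + log_p_length

    # A short and a long hypothesis become comparable once the length
    # probability is modeled explicitly.
    short = [math.log(0.9), math.log(0.8)]
    long_ = [math.log(0.9)] * 5
    print(length_aware_score(short, math.log(0.2)))
    print(length_aware_score(long_, math.log(0.3)))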
RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition
Modern public ASR tools usually provide rich support for training various
sequence-to-sequence (S2S) models, but rather simple decoding support limited
to open-vocabulary scenarios. For closed-vocabulary scenarios, public tools
supporting lexical-constrained decoding usually target classical ASR only, or
do not support all S2S models. To eliminate this restriction on research
possibilities, such as the choice of modeling unit, we present RASR2 in this
work: a research-oriented generic S2S decoder implemented in C++. It offers
strong flexibility and compatibility across S2S models, language models, label
units/topologies, and neural network architectures. It provides efficient
decoding for both open- and closed-vocabulary scenarios, based on a generalized
search framework with rich support for different search modes and settings. We
evaluate RASR2 with a wide range of experiments on both the Switchboard and
LibriSpeech corpora. Our source code is publicly available online.
Comment: accepted at Interspeech 202
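
The lexical-constrained (closed-vocabulary) decoding that RASR2 supports is
commonly organized around a prefix tree over the lexicon, so the search expands
only label sequences that can still complete to a word. The Python sketch below
illustrates that idea only; it is not RASR2's actual C++ implementation.

    # Prefix tree over a toy subword lexicon; partial hypotheses that
    # leave the tree are pruned during lexicon-constrained search.

    def build_prefix_tree(lexicon):
        root = {}
        for word, units in lexicon.items():
            node = root
            for unit in units:
                node = node.setdefault(unit, {})
            node["<word>"] = word  # marks a complete word
        return root

    def allowed_continuations(tree, prefix):
        node = tree
        for unit in prefix:
            if unit not in node:
                return []  # prefix cannot complete to any word
            node = node[unit]
        return [u for u in node if u != "<word>"]

    lexicon = {"cat": ["c", "a", "t"], "car": ["c", "a", "r"]}
    tree = build_prefix_tree(lexicon)
    print(allowed_continuations(tree, ["c", "a"]))  # ['t', 'r']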
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that
well-configured Transformer models outperform our baseline models based on the
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. The positional encoding is an essential
augmentation for the self-attention mechanism, which is otherwise invariant to
sequence ordering. However, in the autoregressive setup, as is the case for
language modeling, the amount of information increases along the position
dimension, which is a positional signal in its own right. The analysis of
attention weights shows that deep autoregressive self-attention models can
automatically make use of such positional information. We find that removing
the positional encoding even slightly improves the performance of these models.
Comment: To appear in the proceedings of INTERSPEECH 201
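
The positional-signal argument above can be made concrete with a toy
computation: under a causal mask, the attention distribution at position t
spans a prefix of length t+1, so the attended context itself encodes position
even when no positional encoding is added. A minimal numpy sketch (single
head, with queries and keys taken directly from the inputs for brevity):

    import numpy as np

    def causal_attention_weights(x):
        """Attention weights of single-head self-attention with a causal
        mask and no positional encoding."""
        T, d = x.shape
        scores = x @ x.T / np.sqrt(d)
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return w / w.sum(axis=-1, keepdims=True)

    # Even with identical inputs at every position, row t of the weight
    # matrix is supported on a prefix of length t+1: the amount of
    # attended context grows with position, a positional signal on its own.
    x = np.ones((4, 2))
    print(causal_attention_weights(x))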
Deep Audio Analyzer: a Framework to Industrialize the Research on Audio Forensics
Deep Audio Analyzer is an open source speech framework that aims to simplify
the research and the development process of neural speech processing pipelines,
allowing users to conceive, compare and share results in a fast and
reproducible way. This paper describes the core architecture, designed to
support several tasks of common interest in the audio forensics field, and
shows the possibility of creating new tasks and thus customizing the framework.
By means of Deep Audio Analyzer, forensic examiners (e.g. from Law Enforcement
Agencies) and researchers will be able to visualize audio features, easily
evaluate the performance of pretrained models, and create, export, and share
new audio analysis workflows by combining deep neural network models with a few
clicks. One of the advantages of this tool is that it speeds up research and
practical experimentation in the field of audio forensics, and it also improves
experimental reproducibility by allowing pipelines to be exported and shared.
All features are developed in modules accessible to the user through a
Graphical User Interface.
Index Terms: Speech Processing, Deep Learning Audio, Deep Learning Audio
Pipeline creation, Audio Forensics