End-to-End Speech Recognition From the Raw Waveform
State-of-the-art speech recognition systems rely on fixed, hand-crafted
features such as mel-filterbanks to preprocess the waveform before the training
pipeline. In this paper, we study end-to-end systems trained directly from the
raw waveform, building on two alternatives for trainable replacements of
mel-filterbanks that use a convolutional architecture. The first one is
inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015),
and the second one by the scattering transform (Zeghidour et al., 2017). We
propose two modifications to these architectures and systematically compare
them to mel-filterbanks, on the Wall Street Journal dataset. The first
modification is the addition of an instance normalization layer, which greatly
improves on the gammatone-based trainable filterbanks and speeds up the
training of the scattering-based filterbanks. The second one relates to the
low-pass filter used in these approaches. These modifications consistently
improve performance for both approaches, and remove the need for a careful
initialization in scattering-based trainable filterbanks. In particular, we
show a consistent improvement in word error rate of the trainable filterbanks
relative to comparable mel-filterbanks. This is the first time end-to-end
models trained from the raw signal significantly outperform mel-filterbanks on
a large vocabulary task under clean recording conditions.

Comment: Accepted for presentation at Interspeech 201
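The instance normalization layer credited above with stabilizing trainable filterbanks is a standard operation: each channel of an utterance's feature map is normalized to zero mean and unit variance over time, independently of other utterances. The sketch below is a minimal illustration of that operation, not code from the paper; the array shapes and the `instance_norm` helper are assumptions for demonstration.

```python
import numpy as np

def instance_norm(features, eps=1e-6):
    """Normalize each channel of a (channels, time) feature map to
    zero mean and unit variance over the time axis, per utterance.
    (Illustrative helper; not the paper's implementation.)"""
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)

# Toy filterbank output: 40 channels, 100 frames.
rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=3.0, size=(40, 100))
normed = instance_norm(feats)
```

Because the statistics are computed per utterance, this removes channel-level offset and scale differences without requiring dataset-wide mean/variance estimates, which is one plausible reason it eases training of randomly initialized filterbanks.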
Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition
The success of self-attention in NLP has led to recent applications in
end-to-end encoder-decoder architectures for speech recognition. Separately,
connectionist temporal classification (CTC) has matured as an alignment-free,
non-autoregressive approach to sequence transduction, either by itself or in
various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully
self-attentional network for CTC, and show it is tractable and competitive for
end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing
CTC models and most encoder-decoder models, with character error rates (CERs)
of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean,
with a fixed architecture and one GPU. Similar improvements hold for WERs after
LM decoding. We motivate the architecture for speech, evaluate position and
downsampling approaches, and explore how label alphabets (character, phoneme,
subword) affect attention heads and performance.

Comment: Accepted to ICASSP 201
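The "alignment-free" property of CTC mentioned above comes from its collapsing rule: a per-frame label path is mapped to an output sequence by merging repeated labels and then removing blanks. A minimal sketch of that greedy decoding step (not tied to any particular model in the paper):

```python
def ctc_collapse(path, blank=0):
    """Greedy CTC decoding: merge consecutive repeated labels,
    then drop the blank symbol. `path` is a per-frame label list."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# With 0 as blank, the frame path 1,1,blank,2,2,blank,2 yields [1, 2, 2]:
# the blank between the last two 2s is what lets CTC emit a repeat.
decoded = ctc_collapse([1, 1, 0, 2, 2, 0, 2])
```

In practice decoding is applied to the argmax (or beam) over per-frame label posteriors produced by the network; the collapsing rule itself is what makes the approach non-autoregressive.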
Exploring Filterbank Learning for Keyword Spotting
Despite their great performance over the years, handcrafted speech features
are not necessarily optimal for any particular speech application.
Consequently, with greater or lesser success, optimal filterbank learning has
been studied for different speech processing tasks. In this paper, we fill in a
gap by exploring filterbank learning for keyword spotting (KWS). Two approaches
are examined: filterbank matrix learning in the power spectral domain and
parameter learning of a psychoacoustically-motivated gammachirp filterbank.
Filterbank parameters are optimized jointly with a modern deep residual neural
network-based KWS back-end. Our experimental results reveal that, in general,
there are no statistically significant differences, in terms of KWS accuracy,
between using a learned filterbank and handcrafted speech features. Thus, while
we conclude that the latter are still a wise choice when using modern KWS
back-ends, we also hypothesize that this could be a symptom of information
redundancy, which opens up new research possibilities in the field of
small-footprint KWS.
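The first of the two approaches examined above, filterbank matrix learning in the power spectral domain, amounts to a linear projection of each power-spectrum frame through a trainable matrix, typically followed by log compression. The sketch below illustrates only that forward pass; the dimensions, random initialization, and `apply_filterbank` helper are assumptions for illustration (in the paper, the matrix would be optimized jointly with the KWS back-end by backpropagation).

```python
import numpy as np

def apply_filterbank(power_spec, fbank, floor=1e-10):
    """Project a (frames, fft_bins) power spectrogram through a
    (fft_bins, n_filters) filterbank matrix and log-compress.
    (Forward pass only; learning the matrix is not shown.)"""
    return np.log(power_spec @ fbank + floor)

# Hypothetical setup: 257 FFT bins, 40 filters, 100 frames.
rng = np.random.default_rng(0)
fbank = rng.uniform(0.0, 1.0, size=(257, 40))       # trainable in the paper
power_spec = rng.uniform(0.0, 1.0, size=(100, 257))  # toy power spectrogram
fbank_feats = apply_filterbank(power_spec, fbank)
```

Initializing `fbank` from a mel filterbank instead of random weights recovers standard log-mel features at step zero, which is one common way such learned front-ends are warm-started.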