Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition
The success of self-attention in NLP has led to recent applications in
end-to-end encoder-decoder architectures for speech recognition. Separately,
connectionist temporal classification (CTC) has matured as an alignment-free,
non-autoregressive approach to sequence transduction, either by itself or in
various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully
self-attentional network for CTC, and show it is tractable and competitive for
end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing
CTC models and most encoder-decoder models, with character error rates (CERs)
of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean,
with a fixed architecture and one GPU. Similar improvements hold for WERs after
LM decoding. We motivate the architecture for speech, evaluate position and
downsampling approaches, and explore how label alphabets (character, phoneme,
subword) affect attention heads and performance.
Comment: Accepted to ICASSP 201
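The abstract above relies on two standard ideas: CTC's alignment-free output, where a frame-level label path is collapsed into a label sequence, and error rates (CER, WER) computed from edit distance. A minimal sketch of both follows, assuming greedy (best-path) collapsing and a plain Levenshtein distance. The blank symbol `_` and the function names are hypothetical choices for illustration and are not taken from the SAN-CTC paper or its code.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this illustration only


def ctc_collapse(path, blank=BLANK):
    """Collapse a frame-level CTC path: merge repeated labels, then drop blanks."""
    return "".join(label for label, _ in groupby(path) if label != blank)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if labels match)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by reference length.

    Pass strings for CER, or lists of words for WER.
    """
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `ctc_collapse("hh_e_ll_lo__")` yields `"hello"`: the blank between the two `l` runs is what keeps the double letter. Calling `error_rate("hello", "hallo")` gives a CER of 0.2, and passing `"the cat sat".split()` with `"the cat sit".split()` gives a WER of one third.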
SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech
Progress in speech processing has been facilitated by shared datasets and
benchmarks. Historically these have focused on automatic speech recognition
(ASR), speaker identification, or other lower-level tasks. Interest has been
growing in higher-level spoken language understanding tasks, including using
end-to-end models, but there are fewer annotated datasets for such tasks. At
the same time, recent work shows the possibility of pre-training generic
representations and then fine-tuning for several tasks using relatively little
labeled data. We propose to create a suite of benchmark tasks for Spoken
Language Understanding Evaluation (SLUE) consisting of limited-size labeled
training sets and corresponding evaluation sets. This resource would allow the
research community to track progress, evaluate pre-trained representations for
higher-level tasks, and study open questions such as the utility of pipeline
versus end-to-end approaches. We present the first phase of the SLUE benchmark
suite, consisting of named entity recognition, sentiment analysis, and ASR on
the corresponding datasets. We focus on naturally produced (not read or
synthesized) speech, and freely available datasets. We provide new
transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli
datasets, evaluation metrics and results for baseline models, and an
open-source toolkit to reproduce the baselines and evaluate new models.
Comment: Updated preprint (Sentiment annotation on test set was updated).
Toolkit link https://github.com/asappresearch/slue-toolki