24,027 research outputs found
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the amount of training data is relatively small, this approach can allow an end-to-end model to reach the quality of hybrid systems. For an artificial low-to-medium-resource setup, we compare the proposed augmentation with a semi-supervised learning technique. We also investigate the influence of vocoder usage on final ASR performance by comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an external language model, our approach outperforms a semi-supervised setup on LibriSpeech test-clean and is only 33% worse than a comparable supervised setup. Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
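The vocoder comparison above contrasts the classical Griffin-Lim algorithm with a modified LPCNet. A minimal, self-contained sketch of the simpler option, assuming librosa is available; the 440 Hz test tone and the STFT parameters are illustrative placeholders, not the paper's configuration:

import numpy as np
import librosa

def griffin_lim_vocoder(magnitude, n_iter=60, hop_length=256, win_length=1024):
    """Iteratively estimate phase and invert a magnitude STFT back to audio."""
    return librosa.griffinlim(magnitude, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)

# Placeholder input: in a TTS pipeline the magnitude spectrogram would come from
# the acoustic model; here it is derived from a synthetic 440 Hz tone.
sr = 22050
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
magnitude = np.abs(librosa.stft(tone, n_fft=1024, hop_length=256, win_length=1024))
audio = griffin_lim_vocoder(magnitude)
print(audio.shape)

Griffin-Lim needs no training, which is why it serves as a convenient baseline against a learned vocoder such as LPCNet.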
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that are
capable of directly mapping speech inputs to transcriptions or translations.
However, the mechanism for modeling positions in this model was tailored for text and is thus less suited to acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the
self-attention network. As a result, the network can better adapt to the
variable distributions present in speech data. Our experiments show that our
resulting model achieves the best recognition result on the Switchboard
benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.
Comment: Submitted to Interspeech 202
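A minimal numpy sketch of the mechanism described above, under the common formulation in which each self-attention logit receives an extra term that depends only on the (clipped) relative distance between query and key positions; the clipping range, dimensions, and random inputs are illustrative assumptions, not the paper's exact Speech Transformer setup:

import numpy as np

def relative_self_attention(q, k, v, rel_emb, max_dist=16):
    """q, k, v: (T, d) arrays; rel_emb: (2 * max_dist + 1, d) distance embeddings."""
    T, d = q.shape
    content = q @ k.T                                         # content-content scores
    # One embedding per clipped relative distance j - i, looked up for every pair.
    dist = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None],
                   -max_dist, max_dist) + max_dist            # (T, T) index matrix
    position = np.einsum('id,ijd->ij', q, rel_emb[dist])      # content-position scores
    scores = (content + position) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
T, d = 50, 64
q, k, v = rng.standard_normal((3, T, d))
rel_emb = 0.02 * rng.standard_normal((2 * 16 + 1, d))
print(relative_self_attention(q, k, v, rel_emb).shape)        # (50, 64)

Because the bias is indexed by distance rather than absolute position, the same lookup table covers inputs of any length, which is one reason relative encodings adapt more gracefully to variable-length acoustic inputs.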
A Speech Recognizer based on Multiclass SVMs with HMM-Guided Segmentation
Automatic Speech Recognition (ASR) is essentially a problem of pattern classification; however, the time dimension of the speech signal has prevented ASR from being posed as a simple static classification problem. Support
Vector Machine (SVM) classifiers could provide an appropriate solution,
since they are very well adapted to high-dimensional classification problems.
Nevertheless, the use of SVMs for ASR is by no means straightforward, mainly because SVM classifiers require fixed-dimension inputs. In this paper we study the use of an HMM-based segmentation as a means to obtain the fixed-dimension input vectors required by SVMs, in a problem of isolated-digit recognition. Different configurations of all the parameters involved have been tested. We also deal with the problem of multi-class classification (as SVMs are inherently binary classifiers), studying two of the most popular approaches: 1-vs-all and 1-vs-1.
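A minimal scikit-learn sketch of the fixed-dimension idea, where uniform segmentation and per-segment averaging stand in for the HMM-guided alignment used in the paper; the frame features and digit labels are synthetic placeholders. Both multi-class strategies mentioned above are shown:

import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

def segment_average(frames, n_segments=10):
    """Collapse a (T, d) frame matrix into a fixed (n_segments * d,) vector."""
    pieces = np.array_split(frames, n_segments, axis=0)
    return np.concatenate([p.mean(axis=0) for p in pieces])

rng = np.random.default_rng(0)
n_utts, d = 200, 13                                  # e.g. 13 MFCCs per frame
utterances = [rng.standard_normal((rng.integers(40, 120), d)) for _ in range(n_utts)]
labels = rng.integers(0, 10, size=n_utts)            # ten digit classes

X = np.stack([segment_average(u) for u in utterances])

one_vs_one = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, labels)
one_vs_all = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, labels)
print(one_vs_one.predict(X[:5]), one_vs_all.predict(X[:5]))

With ten digit classes, 1-vs-1 trains 45 pairwise SVMs while 1-vs-all trains 10, which is the main practical trade-off between the two schemes.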
- …
