Search CORE

10,611 research outputs found

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

Author: Andrusenko Andrei
Korostik Roman
Laptev Aleksandr
Medennikov Ivan
Rybin Sergey
Svischev Aleksey
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/07/2020
Field of study

Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. We argue that, when the training data amount is relatively low, this approach can allow an end-to-end model to reach hybrid systems' quality. For an artificial low-to-medium-resource setup, we compare the proposed augmentation with the semi-supervised learning technique. We also investigate the influence of vocoder usage on final ASR performance by comparing Griffin-Lim algorithm with our modified LPCNet. When applied with an external language model, our approach outperforms a semi-supervised setup for LibriSpeech test-clean and only 33% worse than a comparable supervised setup. Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other

arXiv.org e-Print Archive

Crossref

On Using Backpropagation for Speech Texture Generation and Voice Conversion

Author: Bengio Samy
Chorowski Jan
Saurous Rif A.
Weiss Ron J.
Publication venue
Publication date: 08/03/2018
Field of study

Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice.Comment: Accepted to ICASSP 201

arXiv.org e-Print Archive

Crossref