7 research outputs found
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing Griffin-Lim algorithm with our modified LPCNet. When applied with an
external language model, our approach outperforms a semi-supervised setup for
LibriSpeech test-clean and only 33% worse than a comparable supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on
LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for
test-other
Accurate synthesis of Dysarthric Speech for ASR data augmentation
Dysarthria is a motor speech disorder often characterized by reduced speech
intelligibility through slow, uncoordinated control of speech production
muscles. Automatic Speech recognition (ASR) systems can help dysarthric talkers
communicate more effectively. However, robust dysarthria-specific ASR requires
a significant amount of training speech, which is not readily available for
dysarthric talkers. This paper presents a new dysarthric speech synthesis
method for the purpose of ASR training data augmentation. Differences in
prosodic and acoustic characteristics of dysarthric spontaneous speech at
varying severity levels are important components for dysarthric speech
modeling, synthesis, and augmentation. For dysarthric speech synthesis, a
modified neural multi-talker TTS is implemented by adding a dysarthria severity
level coefficient and a pause insertion model to synthesize dysarthric speech
for varying severity levels. To evaluate the effectiveness for synthesis of
training data for ASR, dysarthria-specific speech recognition was used. Results
show that a DNN-HMM model trained on additional synthetic dysarthric speech
achieves WER improvement of 12.2% compared to the baseline, and that the
addition of the severity level and pause insertion controls decrease WER by
6.5%, showing the effectiveness of adding these parameters. Overall results on
the TORGO database demonstrate that using dysarthric synthetic speech to
increase the amount of dysarthric-patterned speech for training has significant
impact on the dysarthric ASR systems. In addition, we have conducted a
subjective evaluation to evaluate the dysarthric-ness and similarity of
synthesized speech. Our subjective evaluation shows that the perceived
dysartrhic-ness of synthesized speech is similar to that of true dysarthric
speech, especially for higher levels of dysarthriaComment: arXiv admin note: text overlap with arXiv:2201.1157
SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DSYARTHRIC SPEECH RECOGNITION
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech is required, which is not readily available for dysarthric talkers.
In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation has introduced a modified neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate more dysarthric speech with a broader range.
To evaluate their effectiveness for synthesis of training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decrease WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on the dysarthric ASR systems