You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an
external language model, our approach outperforms a semi-supervised setup for
LibriSpeech test-clean and is only 33% worse than a comparable supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on
LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for
test-other.
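The WER figures quoted above are word error rates. As a rough illustration of the metric only (not the authors' scoring pipeline, which the abstract does not describe), a minimal Python sketch follows. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 4.3% corresponds to roughly 4.3 word-level substitutions, deletions, or insertions per 100 reference words.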
Cross-Utterance Conditioned VAE for Speech Generation
Speech synthesis systems powered by neural networks hold promise for
multimedia production, but frequently face issues with producing expressive
speech and seamless editing. In response, we present the Cross-Utterance
Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to
enhance prosody and ensure natural speech generation. This framework leverages
the powerful representational capabilities of pre-trained language models and
the re-expression abilities of variational autoencoders (VAEs). The core
component of the CUC-VAE S2 framework is the cross-utterance CVAE, which
extracts acoustic, speaker, and textual features from surrounding sentences to
generate context-sensitive prosodic features, more accurately emulating human
prosody generation. We further propose two practical algorithms tailored for
distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and
CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the
framework, designed to generate audio with contextual prosody derived from
surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real
mel spectrogram sampling conditioned on contextual information, producing audio
that closely mirrors real sound and thereby facilitating flexible speech
editing based on text such as deletion, insertion, and replacement.
Experimental results on the LibriTTS datasets demonstrate that our proposed
models significantly enhance speech synthesis and editing, producing more
natural and expressive speech.
Comment: 13 pages
Model-based Parametric Prosody Synthesis with Deep Neural Network
Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations and computes probability densities at the HMM state level, yielding statistical acoustic models combined with decision trees. It is therefore a purely statistical, data-driven approach that does not explicitly integrate any of the articulatory mechanisms identified in speech production research. The present study explores an alternative paradigm, model-based parametric prosody synthesis (MPPS), which integrates dynamic mechanisms of human speech production as a core component of F0 generation. In this paradigm, contextual variations in prosody are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. Here the motor model is target approximation (TA), which generates syllable-sized F0 contours with only three motor parameters that are associated with linguistic functions. In this study, we simulate this two-stage process by linking the TA model to a deep neural network (DNN), which learns the “linguistic-motor” mapping given the “motor-acoustic” mapping provided by TA-based syllable-wise F0 production. The proposed prosody modeling system outperforms the HMM-based baseline system in both objective and subjective evaluations.
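The abstract does not give the TA equations. As a hedged illustration of the "motor to acoustic" stage, the sketch below assumes the commonly used quantitative TA (qTA) formulation, in which F0 approaches a linear pitch target through a third-order critically damped system. The three per-syllable parameters are taken to be target slope `m`, target height `b`, and strength `lam`. The function name `qta_syllable`, the state layout, and the sampling step are assumptions made for this example, not the paper's implementation.

```python
import math


def qta_syllable(m, b, lam, duration, state0, step=0.005):
    """Generate one syllable's F0 contour under the assumed qTA form.

    m, b     : slope and height of the linear pitch target x(t) = m*t + b
    lam      : strength (rate) of target approximation
    duration : syllable duration in seconds
    state0   : (f0, velocity, acceleration) at syllable onset
    Returns (times, f0_values, final_state).
    """
    f0_0, v0, a0 = state0
    # Transient coefficients chosen so F0, its velocity, and its
    # acceleration are continuous at the syllable onset.
    c1 = f0_0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    def state(t):
        poly = c1 + c2 * t + c3 * t * t
        dpoly = c2 + 2.0 * c3 * t
        decay = math.exp(-lam * t)
        f0 = m * t + b + poly * decay
        vel = m + (dpoly - lam * poly) * decay
        acc = (2.0 * c3 - 2.0 * lam * dpoly + lam ** 2 * poly) * decay
        return f0, vel, acc

    n = int(round(duration / step))
    times = [i * step for i in range(n + 1)]
    f0_values = [state(t)[0] for t in times]
    # The final state is handed to the next syllable as its state0.
    return times, f0_values, state(duration)
```

In the two-stage setup the abstract describes, a DNN would predict the three parameters for each syllable from linguistic features, and a generator of this kind would turn them into an F0 contour, passing each syllable's final state to the next as its onset state.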
- …