Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis
Neural waveform models such as WaveNet have demonstrated better performance
than conventional vocoders for statistical parametric speech synthesis. As an
autoregressive (AR) model, WaveNet is limited by a slow sequential waveform
generation process. Some new models that use the inverse-autoregressive flow
(IAF) can generate a whole waveform in a one-shot manner. However, these
IAF-based models require sequential transformation during training, which
severely slows down the training speed. Other models such as Parallel WaveNet
and ClariNet bring together the benefits of AR and IAF-based models and train
an IAF model by transferring the knowledge from a pre-trained AR teacher to an
IAF student without any sequential transformation. However, both models require
additional training criteria, and their implementation is prohibitively
complicated.
We propose a framework for neural source-filter (NSF) waveform modeling that uses neither AR nor IAF-based approaches. This framework requires only three
components for waveform generation: a source module that generates a sine-based
signal as excitation, a non-AR dilated-convolution-based filter module that
transforms the excitation into a waveform, and a conditional module that
pre-processes the acoustic features for the source and filter modules. This
framework minimizes spectral-amplitude distances for model training, which can
be efficiently implemented by using short-time Fourier transform routines.
Under this framework, we designed three NSF models and compared them with
WaveNet. It was demonstrated that the NSF models generated waveforms at least 100 times faster than WaveNet, and that the quality of the synthetic speech from the best NSF model was comparable to or better than that from WaveNet.

Comment: Accepted to IEEE/ACM TASLP. Note: this paper describes follow-up work on our ICASSP paper. Based on the h-NSF model introduced in this work, we proposed a h-sinc-NSF model and published a third paper at SSW 10 (https://www.isca-speech.org/archive/SSW_2019/pdfs/SSW10_O_1-1.pdf).
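
To make the source module concrete, here is a minimal sketch of a sine-based excitation generator in Python: the instantaneous phase is accumulated from an upsampled F0 contour, and unvoiced regions fall back to noise. The sampling rate, hop size, amplitudes, and function name are illustrative assumptions rather than the paper's exact formulation.

    import numpy as np

    def sine_excitation(f0_frames, sr=16000, hop=80, noise_std=0.003):
        # Sine-based excitation from frame-level F0 in Hz; 0 marks unvoiced.
        # All constants here are illustrative assumptions.
        f0 = np.repeat(f0_frames, hop)            # upsample F0 to sample rate
        phase = 2.0 * np.pi * np.cumsum(f0 / sr)  # instantaneous phase
        voiced = f0 > 0.0
        excitation = np.where(voiced, 0.1 * np.sin(phase), 0.0)
        excitation += noise_std * np.random.randn(len(f0))  # noise component
        return excitation

    # Example: 120 Hz frames with an unvoiced stretch in the middle
    f0 = np.full(100, 120.0)
    f0[40:60] = 0.0
    e = sine_excitation(f0)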
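The training criterion can be sketched just as briefly: a mean squared distance between log-STFT amplitudes of the generated and natural waveforms, averaged over several FFT resolutions. The FFT sizes, hop lengths, and windowing below are assumptions for illustration, not the configurations used in the paper.

    import torch

    def spectral_amplitude_loss(y_hat, y, fft_sizes=(512, 1024, 2048)):
        # Multi-resolution log-spectral-amplitude distance (a sketch).
        loss = 0.0
        for n_fft in fft_sizes:
            hop = n_fft // 4
            window = torch.hann_window(n_fft, device=y.device)
            def logmag(x):
                spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                                  return_complex=True)
                return torch.log(spec.abs() + 1e-7)
            loss = loss + torch.mean((logmag(y_hat) - logmag(y)) ** 2)
        return loss / len(fft_sizes)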
Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder
It was recently shown in the Silent Speech Interface (SSI) field that F0 can be predicted from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher
quality speech when using a continuous pitch estimate, which takes non-zero
pitch values even when voicing is not present. Therefore, in this paper on
UTI-based SSI, we use a simple continuous F0 tracker that does not apply a strict voiced/unvoiced decision. Continuous vocoder parameters (ContF0,
Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a
convolutional neural network, with UTI as input. The results demonstrate that
during the articulatory-to-acoustic mapping experiments, the continuous F0 is
predicted with lower error, and the continuous vocoder produces slightly more
natural synthesized speech than the baseline vocoder using standard
discontinuous F0.

Comment: 5 pages, 3 figures, accepted for publication at Interspeech 2019.
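
As a rough illustration of the mapping described above, the sketch below regresses frame-level continuous vocoder parameters (ContF0, Maximum Voiced Frequency, and MGC coefficients) from a single ultrasound tongue image with a small CNN. The image size, channel counts, layer sizes, and output dimensionality are assumptions; the paper's architecture may differ.

    import torch
    import torch.nn as nn

    class UTI2Vocoder(nn.Module):
        # CNN sketch: one grayscale UTI frame (assumed 64x128 pixels) in,
        # ContF0 (1) + MVF (1) + MGC (n_mgc) parameters out.
        def __init__(self, n_mgc=24):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                    # 64x128 -> 32x64
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                    # 32x64 -> 16x32
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 16 * 32, 256), nn.ReLU(),
                nn.Linear(256, 2 + n_mgc),          # ContF0, MVF, MGC
            )

        def forward(self, x):
            return self.head(self.features(x))

    # Example: a batch of 8 ultrasound frames
    model = UTI2Vocoder()
    params = model(torch.randn(8, 1, 64, 128))      # shape (8, 26)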
Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system
Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used as excitation model input differ from the original inputs on which the model was trained. Furthermore, due to errors in predicting the vocal tract filter, the original excitation waveforms do not provide perfect reconstruction of the speech waveform even if predicted without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both modifications improve performance measured in MSE and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.

Peer reviewed
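
The first modification can be sketched in a few lines: the excitation model is trained on acoustic features generated by a frozen acoustic model rather than on natural features, so that training matches synthesis conditions. The model interfaces, tensor shapes, and plain MSE criterion below are illustrative assumptions.

    import torch

    def connected_training_step(acoustic_model, excitation_model, optimizer,
                                text_feats, target_glottal):
        # One connected-training step (a sketch): the frozen acoustic model
        # produces the features the excitation model will see at synthesis.
        with torch.no_grad():
            acoustic_feats = acoustic_model(text_feats)  # generated features
        pred_glottal = excitation_model(acoustic_feats)  # glottal waveform
        loss = torch.mean((pred_glottal - target_glottal) ** 2)  # MSE
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()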