Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder
Recently it was shown that within the Silent Speech Interface (SSI) field,
the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the
articulatory input, using Deep Neural Networks for articulatory-to-acoustic
mapping. Moreover, text-to-speech synthesizers were shown to produce higher
quality speech when using a continuous pitch estimate, which takes non-zero
pitch values even when voicing is not present. Therefore, in this paper on
UTI-based SSI, we use a simple continuous F0 tracker which does not apply a
strict voiced / unvoiced decision. Continuous vocoder parameters (ContF0,
Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a
convolutional neural network, with UTI as input. The results demonstrate that
during the articulatory-to-acoustic mapping experiments, the continuous F0 is
predicted with lower error, and the continuous vocoder produces slightly more
natural synthesized speech than the baseline vocoder using standard
discontinuous F0.
Comment: 5 pages, 3 figures, accepted for publication at Interspeech 201
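The difference between a discontinuous and a continuous F0 contour can be illustrated with a minimal sketch. This is not the tracker used in the paper: it assumes unvoiced frames are marked with 0 and simply fills them by linear interpolation between neighbouring voiced frames. The function name `fill_unvoiced` and the sample values are hypothetical.

```python
def fill_unvoiced(f0):
    """Return a copy of f0 in which zero (unvoiced) frames get non-zero values.

    Assumption for illustration only: gaps are filled by linear interpolation
    between the nearest voiced frames; edges are padded with the nearest
    voiced value. A real continuous F0 tracker works differently.
    """
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)  # nothing to interpolate from
    out = list(f0)
    first, last = voiced[0], voiced[-1]
    # Pad leading and trailing unvoiced frames with the nearest voiced value.
    for i in range(first):
        out[i] = f0[first]
    for i in range(last + 1, len(f0)):
        out[i] = f0[last]
    # Interpolate linearly inside each gap between consecutive voiced frames.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            out[i] = f0[a] + (f0[b] - f0[a]) * (i - a) / (b - a)
    return out


# Hypothetical frame-level F0 values in Hz; 0.0 marks an unvoiced frame.
discontinuous = [0.0, 120.0, 0.0, 0.0, 150.0, 0.0]
continuous = fill_unvoiced(discontinuous)
print(continuous)
```

A contour of this kind has no jumps to zero, which gives a regression network a smoother target than the discontinuous one.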
Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis
For articulatory-to-acoustic mapping using deep neural networks, typically
spectral and excitation parameters of vocoders have been used as the training
targets. However, vocoding often results in buzzy and muffled final speech
quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic
conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a
large amount of English and Hungarian speech data. The inputs of the
convolutional neural network are ultrasound tongue images. The training target
is the 80-dimensional mel-spectrogram, which results in a more finely detailed
spectral representation than the previously used 25-dimensional Mel-Generalized
Cepstrum. From the output of the ultrasound-to-mel-spectrogram prediction,
WaveGlow inference results in synthesized speech. We compare the proposed
WaveGlow-based system with a continuous vocoder which does not use a strict
voiced/unvoiced decision when predicting F0. The results demonstrate that
during the articulatory-to-acoustic mapping experiments, the WaveGlow neural
vocoder produces significantly more natural synthesized speech than the
baseline system. A further advantage of WaveGlow is that F0 is included in
the mel-spectrogram representation, so it is not necessary to predict the
excitation separately.
Comment: 5 pages, accepted for publication at Interspeech 2020. arXiv admin
note: substantial text overlap with arXiv:1906.0988
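To give a feel for what an 80-dimensional mel representation covers, the sketch below computes the centre frequencies of 80 mel-spaced bands. It relies on assumptions not stated in the abstract: the common formula mel = 2595 log10(1 + f/700), and a hypothetical upper limit of 11025 Hz (a 22.05 kHz sampling rate). The helper names are illustrative only.

```python
import math


def hz_to_mel(f):
    """Convert frequency in Hz to mel (common 2595 * log10 formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(n_bands, f_min, f_max):
    """Centre frequencies (Hz) of n_bands bands equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_bands)]


# Assumed settings for illustration: 80 bands from 0 Hz to 11025 Hz.
centers = mel_centers(80, 0.0, 11025.0)
print(round(centers[0], 1), round(centers[-1], 1))
```

The bands are spaced densely at low frequencies and sparsely at high frequencies, so the region where F0 and its lower harmonics lie is sampled most finely.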