Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder
It was recently shown in the Silent Speech Interface (SSI) field that F0 can be
predicted from Ultrasound Tongue Images (UTI) as the articulatory input, using
Deep Neural Networks for articulatory-to-acoustic mapping. Moreover,
text-to-speech synthesizers were shown to produce higher-quality speech when
using a continuous pitch estimate, which takes non-zero pitch values even when
voicing is not present. Therefore, in this paper on UTI-based SSI, we use a
simple continuous F0 tracker which does not apply a strict voiced/unvoiced
decision. Continuous vocoder parameters (ContF0,
Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a
convolutional neural network, with UTI as input. The results demonstrate that
during the articulatory-to-acoustic mapping experiments, the continuous F0 is
predicted with lower error, and the continuous vocoder produces slightly more
natural synthesized speech than the baseline vocoder using standard
discontinuous F0.
Comment: 5 pages, 3 figures, accepted for publication at Interspeech 2019
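As a rough illustration of this setup, the sketch below shows the kind of
convolutional network that could map a single ultrasound tongue image to the
continuous vocoder parameters (ContF0, Maximum Voiced Frequency, and a
25-dimensional MGC). The layer sizes, the 64x128 input resolution, and the
output layout are illustrative assumptions, not the paper's exact
configuration.

    # Minimal PyTorch sketch: one UTI frame -> continuous vocoder parameters.
    # All shapes below are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class UTI2ContVocoder(nn.Module):
        def __init__(self, n_mgc=25):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Flatten(),
            )
            # +2 outputs: ContF0 and Maximum Voiced Frequency
            self.head = nn.Linear(32 * 16 * 32, n_mgc + 2)

        def forward(self, x):       # x: (batch, 1, 64, 128) ultrasound frames
            return self.head(self.features(x))

    model = UTI2ContVocoder()
    frames = torch.randn(8, 1, 64, 128)   # batch of 8 dummy UTI frames
    params = model(frames)                # (8, 27): [ContF0, MVF, MGC...]

A regression loss (e.g. MSE) against vocoder-extracted parameters would train
such a network; the paper's actual topology may differ.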
Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input
Articulatory information has been shown to be effective in improving the
performance of HMM-based and DNN-based text-to-speech synthesis. Speech
synthesis research has traditionally focused on text-to-speech conversion,
where the input is text or an estimated linguistic representation and the
target is synthesized speech. However, a research field that has emerged over
the last decade is articulation-to-speech synthesis (with a target application
of a Silent Speech Interface, SSI), where the goal is to synthesize speech from some
representation of the movement of the articulatory organs. In this paper, we
extend traditional (vocoder-based) DNN-TTS with articulatory input, estimated
from ultrasound tongue images. We compare text-only, ultrasound-only, and
combined inputs. Using data from eight speakers, we show that the combined
text and articulatory input can have advantages in limited-data scenarios,
namely, it may increase the naturalness of synthesized speech compared to
text-only input. In addition, we analyze the ultrasound tongue recordings of
several speakers, and show that misalignments in the ultrasound transducer
positioning can have a negative effect on the final synthesis performance.
Comment: accepted at SSW11 (11th Speech Synthesis Workshop)
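The combined-input idea can be sketched as a per-frame feed-forward network
that concatenates text-derived linguistic features with an embedding of the
ultrasound image before predicting the vocoder parameters. The feature
dimensions below are hypothetical placeholders, not the paper's values.

    # Hedged PyTorch sketch of combining text and articulatory inputs.
    import torch
    import torch.nn as nn

    class CombinedInputTTS(nn.Module):
        def __init__(self, n_ling=300, n_uti=128, n_vocoder=27):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_ling + n_uti, 512), nn.ReLU(),
                nn.Linear(512, 512), nn.ReLU(),
                nn.Linear(512, n_vocoder),
            )

        def forward(self, ling, uti_emb):
            # ling: (batch, n_ling) text-derived linguistic features
            # uti_emb: (batch, n_uti) CNN embedding of the ultrasound frame
            return self.net(torch.cat([ling, uti_emb], dim=-1))

    model = CombinedInputTTS()
    ling = torch.randn(4, 300)             # dummy linguistic feature vectors
    uti_emb = torch.randn(4, 128)          # dummy ultrasound embeddings
    vocoder_params = model(ling, uti_emb)  # (4, 27)

Text-only and ultrasound-only baselines then correspond, conceptually, to
feeding only one of the two branches.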
Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis
For articulatory-to-acoustic mapping using deep neural networks, the spectral
and excitation parameters of vocoders have typically been used as the training
targets. However, vocoding often results in buzzy and muffled final speech
quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic
conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a
large amount of English and Hungarian speech data. The inputs of the
convolutional neural network are ultrasound tongue images. The training target
is the 80-dimensional mel-spectrogram, which provides a more finely detailed
spectral representation than the previously used 25-dimensional Mel-Generalized
Cepstrum. WaveGlow inference then synthesizes speech from the output of the
ultrasound-to-mel-spectrogram prediction. We compare the proposed
WaveGlow-based system with a continuous vocoder which does not use strict
voiced/unvoiced decision when predicting F0. The results demonstrate that
during the articulatory-to-acoustic mapping experiments, the WaveGlow neural
vocoder produces significantly more natural synthesized speech than the
baseline system. In addition, an advantage of WaveGlow is that F0 is included
in the mel-spectrogram representation, so it is not necessary to predict the
excitation separately.
Comment: 5 pages, accepted for publication at Interspeech 2020. arXiv admin
note: substantial text overlap with arXiv:1906.0988
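A minimal sketch of the two-stage inference path: a CNN maps ultrasound frames
to 80-dimensional mel-spectrogram columns, and a pre-trained WaveGlow vocoder
converts the mel-spectrogram into a waveform. The CNN layout is a placeholder,
and loading NVIDIA's public English WaveGlow checkpoint via torch.hub is an
assumption here; the paper's vocoder was additionally trained on Hungarian
data.

    # Hedged sketch: ultrasound frames -> mel-spectrogram -> WaveGlow audio.
    import torch
    import torch.nn as nn

    uti2mel = nn.Sequential(                 # stand-in for the paper's CNN
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2), nn.Flatten(),
        nn.Linear(16 * 32 * 64, 80),         # one 80-dim mel column per frame
    )

    frames = torch.randn(200, 1, 64, 128)    # 200 dummy UTI frames
    mel = uti2mel(frames).T.unsqueeze(0)     # (1, 80, 200) mel-spectrogram

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                              'nvidia_waveglow', model_math='fp32')
    waveglow = waveglow.remove_weightnorm(waveglow).to(device).eval()
    with torch.no_grad():
        audio = waveglow.infer(mel.to(device))   # synthesized waveform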
Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
For articulatory-to-acoustic mapping, typically only limited parallel
training data is available, making it impossible to apply fully end-to-end
solutions like Tacotron2. In this paper, we experimented with transfer learning
and adaptation of a Tacotron2 text-to-speech model to improve the final
synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a
limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a
pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion
contains three steps: 1) from a sequence of ultrasound tongue image recordings,
a 3D convolutional neural network predicts the inputs of the pre-trained
Tacotron2 model, 2) the Tacotron2 model converts this intermediate
representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model
is applied for final inference. This generated speech contains the timing of
the original articulatory data from the ultrasound recording, but the F0
contour and the spectral information are predicted by the Tacotron2 model. The
F0 values are independent of the original ultrasound images, but represent the
target speaker, as they are inferred from the pre-trained Tacotron2 model. In
our experiments, we demonstrated that the synthesized speech is more natural
with the proposed solutions than with our earlier model.
Comment: accepted at SSW11. arXiv admin note: text overlap with arXiv:2008.0315
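The three-step pipeline can be outlined as below. Only the overall data flow
(ultrasound, then a Tacotron2-style intermediate representation, then
mel-spectrogram, then WaveGlow) follows the abstract; the 3D CNN layout and
its 512-dimensional output are illustrative assumptions.

    # Hedged sketch of step 1: a 3D CNN over a stack of ultrasound frames
    # producing a sequence of vectors that stand in for the inputs of the
    # pre-trained Tacotron2 model.
    import torch
    import torch.nn as nn

    class UTI3DEncoder(nn.Module):
        def __init__(self, d_out=512):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 4, 4)),   # pool space, keep time
            )
            self.proj = nn.Linear(8 * 4 * 4, d_out)

        def forward(self, x):                # x: (batch, 1, time, H, W)
            h = self.conv(x)                         # (batch, 8, time, 4, 4)
            h = h.permute(0, 2, 1, 3, 4).flatten(2)  # (batch, time, 128)
            return self.proj(h)                      # (batch, time, d_out)

    enc = UTI3DEncoder()
    video = torch.randn(1, 1, 100, 64, 128)  # 100 dummy ultrasound frames
    tacotron_inputs = enc(video)             # step 1 output
    # Step 2 (assumed wiring): the adapted Tacotron2 turns this sequence into
    # an 80-dim mel-spectrogram; step 3: WaveGlow infers the final waveform.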