Articulatory information has been shown to be effective in improving the
performance of HMM-based and DNN-based text-to-speech synthesis. Speech
synthesis research traditionally focuses on text-to-speech conversion, where the
input is text or an estimated linguistic representation and the target is
synthesized speech. However, a research field that has emerged in the last decade
is articulation-to-speech synthesis (with a target application of a Silent
Speech Interface, SSI), where the goal is to synthesize speech from some
representation of the movement of the articulatory organs. In this paper, we
extend traditional (vocoder-based) DNN-TTS with articulatory input, estimated
from ultrasound tongue images. We compare text-only, ultrasound-only, and
combined inputs. Using data from eight speakers, we show that the combined
text and articulatory input can have advantages in limited-data scenarios:
it may increase the naturalness of synthesized speech compared to
text-only input. In addition, we analyze the ultrasound tongue recordings of
several speakers, and show that misalignments in the ultrasound transducer
positioning can have a negative effect on the final synthesis performance.

Comment: accepted at SSW11 (11th Speech Synthesis Workshop)