2 research outputs found
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper, we
investigate under what conditions neural sequence-to-sequence TTS works well
in Japanese and English, comparing it with deep neural network (DNN) based
pipeline TTS systems. Unlike past comparative studies, the pipeline
systems also use autoregressive probabilistic modeling and a neural vocoder. We
investigated systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high-quality speech, b) it should also use a powerful
encoder when it takes characters as inputs, and c) the encoder still has room
for improvement and needs a better architecture to learn supra-segmental
features more appropriately.
Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features
Neural sequence-to-sequence text-to-speech synthesis (TTS), such as
Tacotron-2, transforms text into high-quality speech. However, generating
speech with natural prosody remains a challenge. Yasuda et al. show that,
unlike natural speech, Tacotron-2's encoder does not fully represent prosodic
features (e.g. syllable stress in English) from characters, resulting in flat
fundamental frequency variations.
In this work, we propose a carefully designed strategy for conditioning
Tacotron-2 on two fundamental prosodic features in English -- syllable stress
and pitch accent -- that help achieve more natural prosody. To this end, we use
a classifier to learn these features in an end-to-end fashion, and apply
feature conditioning at three parts of Tacotron-2's text-to-mel-spectrogram
network: pre-encoder, post-encoder, and intra-decoder. Further, we show that
conditioning features jointly at the pre-encoder and intra-decoder stages
results in more prosodically natural synthesized speech than Tacotron-2 and
allows the model to produce speech with more accurate pitch accent and stress
patterns.
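The pre-encoder conditioning point described above can be illustrated with a minimal pure-Python sketch. The embedding tables, dimensions, and the `condition_pre_encoder` helper are illustrative assumptions for exposition, not the authors' implementation (which uses learned classifier outputs inside a neural network).

```python
# Sketch of pre-encoder conditioning: a prosodic feature embedding
# (e.g. a stress label) is concatenated with each character embedding
# before the encoder consumes the sequence. All lookup tables and
# dimensions here are illustrative assumptions.

CHAR_EMB = {"a": [0.1, 0.2], "b": [0.3, 0.4]}   # char -> 2-dim embedding
PROSODY_EMB = {0: [0.0], 1: [1.0]}              # stress label -> 1-dim embedding

def condition_pre_encoder(chars, stress_labels):
    """Concatenate each character embedding with its prosodic-feature
    embedding, yielding the conditioned input sequence for the encoder."""
    assert len(chars) == len(stress_labels)
    return [CHAR_EMB[c] + PROSODY_EMB[s] for c, s in zip(chars, stress_labels)]

seq = condition_pre_encoder(["a", "b"], [1, 0])
# each vector is now 3-dimensional: 2 character dims + 1 prosody dim
```

Post-encoder and intra-decoder conditioning would apply the same concatenation to encoder outputs and decoder inputs, respectively.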
Quantitative evaluations show that our formulation achieves higher
fundamental frequency contour correlation and lower Mel Cepstral Distortion
between synthesized and natural speech. Subjective evaluation shows that the
proposed method's Mean Opinion Score of 4.14 is higher than the baseline
Tacotron-2's 3.91, with natural speech (LJSpeech corpus) scoring 4.28.