CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
The prosodic aspects of speech signals produced by current text-to-speech
systems are typically averaged over training material, and as such lack the
variety and liveliness found in natural speech. To avoid monotony and averaged
prosody contours, it is desirable to have a way of modeling the variation in
the prosodic aspects of speech, so audio signals can be synthesized in multiple
ways for a given text. We present a new, hierarchically structured conditional
variational autoencoder to generate prosodic features (fundamental frequency,
energy and duration) suitable for use with a vocoder or a generative model like
WaveNet. At inference time, an embedding representing the prosody of a sentence
may be sampled from the variational layer to allow for prosodic variation. To
efficiently capture the hierarchical nature of the linguistic input (words,
syllables and phones), both the encoder and decoder parts of the autoencoder
are hierarchical, in line with the linguistic structure, with layers being
clocked dynamically at the respective rates. We show in our experiments that
our dynamic hierarchical network outperforms a non-hierarchical
state-of-the-art baseline, and, additionally, that prosody transfer across
sentences is possible by employing the prosody embedding of one sentence to
generate the speech signal of another.
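The two mechanisms the abstract describes, sampling a sentence-level prosody embedding at inference time and running layers at word, syllable and phone rates, can be illustrated with a minimal NumPy sketch. This is not the CHiVE architecture: the function names, the latent size, the toy sentence, and the use of simple repetition in place of recurrent layers clocked at each rate are all assumptions made for illustration.

```python
import numpy as np

def sample_prosody_embedding(mu, log_var, rng):
    """Draw z = mu + sigma * eps from a diagonal Gaussian (reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def upsample_to_finer_rate(coarse, counts):
    """Repeat each coarse-level vector once per child unit (e.g. word -> syllables)."""
    return np.repeat(coarse, counts, axis=0)

rng = np.random.default_rng(0)
latent_dim = 4  # hypothetical embedding size

# At inference time, sample from a standard normal prior N(0, I).
z = sample_prosody_embedding(np.zeros(latent_dim), np.zeros(latent_dim), rng)

# Hypothetical sentence: 2 words with 2 and 1 syllables; the 3 syllables
# contain 2, 1 and 3 phones respectively.
syllables_per_word = np.array([2, 1])
phones_per_syllable = np.array([2, 1, 3])

# Stand-in for the hierarchical decoder: the sentence embedding is copied to
# each word, then passed down to syllable rate and phone rate.
word_states = np.tile(z, (len(syllables_per_word), 1))
syllable_states = upsample_to_finer_rate(word_states, syllables_per_word)
phone_states = upsample_to_finer_rate(syllable_states, phones_per_syllable)

print(word_states.shape, syllable_states.shape, phone_states.shape)
```

Drawing a different `z` for the same text would give a different prosodic rendering, and reusing the `z` obtained from another sentence corresponds to the prosody transfer mentioned above. In a real model each level would transform its states rather than merely copy them.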
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper we
investigate under what conditions neural sequence-to-sequence TTS can work
well in Japanese and English, and compare it with deep neural network
(DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline
systems also use autoregressive probabilistic modeling and a neural vocoder. We
investigate systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high quality speech, b) it should also use a powerful
encoder when it takes characters as inputs, and c) the encoder still has room
for improvement and needs an improved architecture to learn
supra-segmental features more appropriately.