Sampling-based speech parameter generation using moment-matching networks
This paper presents sampling-based speech parameter generation using
moment-matching networks for Deep Neural Network (DNN)-based speech synthesis.
Although humans never produce exactly the same speech twice, even when expressing
the same linguistic and para-linguistic information, typical statistical speech
synthesis produces exactly the same speech every time, i.e., there is no
inter-utterance variation in synthetic speech. To give synthetic speech natural
inter-utterance variation, this paper builds DNN acoustic models that make it
possible to randomly sample speech parameters. The DNNs are trained so that
they make the moments of generated speech parameters close to those of natural
speech parameters. Since the variation of speech parameters is compressed into
a simple, low-dimensional prior noise vector, our algorithm has a lower
computation cost than direct sampling of speech parameters. As the first step
towards generating synthetic speech that has natural inter-utterance variation,
this paper investigates whether the proposed sampling-based generation degrades
synthetic speech quality. In the evaluation, we compare the speech quality of
conventional maximum likelihood-based generation with that of the proposed
sampling-based generation. The results demonstrate that the proposed generation
causes no degradation in speech quality.
Comment: Submitted to INTERSPEECH 201
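As a concrete illustration of the moment-matching idea described above, the following Python sketch shows a Gaussian-kernel maximum mean discrepancy (MMD) loss applied to generated and natural speech-parameter frames, with the generator conditioned on linguistic features plus a low-dimensional noise vector. The network interface, noise dimensionality, and kernel width here are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Squared maximum mean discrepancy between generated frames x and natural
    frames y, using a Gaussian kernel of width sigma. Minimising this pushes
    the moments of the generated distribution towards those of natural speech."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def train_step(net, optimiser, linguistic, natural_params, noise_dim=8):
    """One hypothetical training step: the DNN maps [linguistic features, noise]
    to speech parameters, so sampling different noise vectors at synthesis time
    yields different parameter trajectories."""
    noise = torch.randn(linguistic.size(0), noise_dim)       # simple prior noise vector
    generated = net(torch.cat([linguistic, noise], dim=-1))  # sampled speech parameters
    loss = gaussian_mmd(generated, natural_params)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```

At synthesis time, drawing fresh noise vectors for each utterance would then give inter-utterance variation without re-running any expensive sampling over the high-dimensional speech parameters themselves.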
Using generative modelling to produce varied intonation for speech synthesis
Unlike human speakers, typical text-to-speech (TTS) systems are unable to
produce multiple distinct renditions of a given sentence. This has previously
been addressed by adding explicit external control. In contrast, generative
models are able to capture a distribution over multiple renditions and thus
produce varied renditions using sampling. Typical neural TTS models learn the
average of the data because they minimise mean squared error. In the context of
prosody, taking the average produces flatter, more boring speech: an "average
prosody". A generative model that can synthesise multiple prosodies will, by
design, not model average prosody. We use variational autoencoders (VAEs), which
explicitly place the most "average" data close to the mean of the Gaussian
prior. We propose that by moving towards the tails of the prior distribution,
the model will transition towards generating more idiosyncratic, varied
renditions. Focusing here on intonation, we investigate the trade-off between
naturalness and intonation variation, and find that typical acoustic models can
be either natural or varied, but not both. However, sampling from the tails of
the VAE prior produces much more varied intonation than the traditional
approaches, whilst maintaining the same level of naturalness.
Comment: Accepted for the 10th ISCA Speech Synthesis Workshop (SSW10)
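To make the "sampling from the tails of the prior" idea concrete, the following Python sketch draws latent vectors from a standard Gaussian VAE prior at a chosen radius quantile: under N(0, I) the squared norm of a latent vector follows a chi-squared distribution, so fixing the radius at a high quantile moves samples towards the tails. The latent dimensionality, quantile values, and decoder interface are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from scipy.stats import chi2

def sample_prior_tail(dim, quantile, rng=None):
    """Draw a latent vector from a standard Gaussian prior N(0, I) at a given
    radius quantile. A direction is drawn uniformly on the unit sphere, then
    scaled so its squared norm sits at `quantile` of the chi-squared
    distribution with `dim` degrees of freedom. quantile=0.5 gives a typical
    sample; quantiles near 1 move towards the tails, i.e. more varied prosody."""
    rng = np.random.default_rng() if rng is None else rng
    direction = rng.standard_normal(dim)
    direction /= np.linalg.norm(direction)         # uniform direction on the sphere
    radius = np.sqrt(chi2.ppf(quantile, df=dim))   # radius at the chosen quantile
    return radius * direction

# Hypothetical usage with a trained VAE decoder that maps a latent prosody
# vector (plus linguistic features) to an intonation (F0) contour:
# z_typical = sample_prior_tail(dim=16, quantile=0.5)
# z_varied  = sample_prior_tail(dim=16, quantile=0.95)
# f0_contour = decoder(z_varied, linguistic_features)   # decoder is assumed
```

Sampling several such vectors at different quantiles for the same sentence would then trade off between "average" renditions near the prior mean and more idiosyncratic renditions from the tails.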