Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Generating versatile and appropriate synthetic speech requires control over
the output expression separate from the spoken text. Important non-textual
speech variation is seldom annotated, in which case output control must be
learned in an unsupervised fashion. In this paper, we perform an in-depth study
of methods for unsupervised learning of control in statistical speech
synthesis. For example, we show that popular unsupervised training heuristics
can be interpreted as variational inference in certain autoencoder models. We
additionally connect these models to VQ-VAEs, another recently proposed class
of deep variational autoencoders, which we show can be derived from a very
similar mathematical argument. The implications of these new probabilistic
interpretations are discussed. We illustrate the utility of the various
approaches with an application to acoustic modelling for emotional speech
synthesis, where the unsupervised methods for learning expression control
(without access to emotional labels) are found to give results that in many
aspects match or surpass the previous best supervised approach.
Comment: 17 pages, 4 figures
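To make the idea of unsupervised expression control concrete, below is a minimal sketch of the general approach the abstract describes: a variational autoencoder whose encoder compresses each utterance's acoustic features into a small latent "control" vector, and whose decoder reconstructs the acoustics from that latent together with a text-derived conditioning vector. This is not the paper's actual architecture; all dimensions, layer choices, and names (e.g. `ControlVAE`, `latent_dim`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControlVAE(nn.Module):
    """Sketch of a VAE that learns an unlabelled expression latent z."""

    def __init__(self, acoustic_dim=80, text_dim=256, latent_dim=8, hidden=256):
        super().__init__()
        # Reference encoder: acoustic features -> posterior over latent control z
        self.encoder = nn.Sequential(
            nn.Linear(acoustic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # outputs mean and log-variance
        )
        # Decoder: text conditioning + sampled control -> reconstructed acoustics
        self.decoder = nn.Sequential(
            nn.Linear(text_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, acoustics, text_cond):
        mu, logvar = self.encoder(acoustics).chunk(2, dim=-1)
        # Reparameterisation trick: sample z while keeping gradients
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([text_cond, z], dim=-1))
        # Negative ELBO: reconstruction error plus KL to a standard normal prior
        recon_loss = F.mse_loss(recon, acoustics)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, recon_loss + kl

# Toy usage with random tensors; at synthesis time z can be set by hand
# to steer the output expression without any emotion labels.
model = ControlVAE()
acoustics = torch.randn(4, 80)   # e.g. frame-averaged mel features per utterance
text_cond = torch.randn(4, 256)  # e.g. pooled linguistic/text embeddings
recon, loss = model(acoustics, text_cond)
loss.backward()
```

A VQ-VAE variant, as mentioned in the abstract, would replace the Gaussian sampling step with a lookup into a learned codebook of discrete control vectors.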