Uncovering Latent Style Factors for Expressive Speech Synthesis
Prosodic modeling is a core problem in speech synthesis. The key challenge is
producing desirable prosody from textual input containing only phonetic
information. In this preliminary study, we introduce the concept of "style
tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis
model. Using style tokens, we aim to extract independent prosodic styles from
training data. We show that without annotation data or an explicit supervision
signal, our approach can automatically learn a variety of prosodic variations
in a purely data-driven way. Importantly, each style token corresponds to a
fixed style factor regardless of the given text sequence. As a result, we can
control the prosodic style of synthetic speech in a somewhat predictable and
globally consistent way.
Comment: Submitted to NIPS ML4Audio workshop and ICASS