Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-speech Synthesizer
Emotion is not limited to discrete categories such as happiness, sadness, anger,
fear, disgust, and surprise. Instead, each emotion category can be projected onto
a set of nearly independent dimensions, namely pleasure (or valence), arousal,
and dominance, known collectively as PAD. Each dimension ranges from -1 to 1,
and the neutral emotion sits at the center, where all three values are zero.
Training a continuous emotional text-to-speech (TTS) synthesizer on these
independent dimensions makes it possible to synthesize emotional speech for an
unlimited number of emotion categories. Our end-to-end neural speech synthesizer
is based on the well-known Tacotron. Empirically, we found the optimal network
architecture for injecting the 3D PAD values. Moreover, the PAD values are
adjusted for speech synthesis.
Comment: Interspeech 2019, Show and Tell demonstration
https://www.youtube.com/watch?v=MAOk_ZxuA0I&feature=youtu.b
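The PAD representation described above can be illustrated with a minimal Python sketch. It shows a 3D point constrained to [-1, 1] on each axis, the all-zero neutral point, and linear blending between two points as one way to reach emotions in between categories. The abstract does not say where or how the PAD vector enters Tacotron, so `condition_encoder_outputs` assumes one common conditioning choice: concatenating the same utterance-level PAD vector to every encoder time step. All names (`PAD`, `NEUTRAL`, `blend`, `condition_encoder_outputs`) and the example coordinates are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


def _clamp(x: float) -> float:
    """Clamp a value to the PAD range [-1, 1] (guards against float rounding)."""
    return max(-1.0, min(1.0, x))


@dataclass(frozen=True)
class PAD:
    """A point in pleasure-arousal-dominance space; each axis lies in [-1, 1]."""
    pleasure: float
    arousal: float
    dominance: float

    def __post_init__(self) -> None:
        # Reject values outside the range stated for each PAD dimension.
        for name in ("pleasure", "arousal", "dominance"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside [-1, 1]")

    def as_list(self) -> List[float]:
        return [self.pleasure, self.arousal, self.dominance]


# The neutral emotion is the origin: all three dimensions are zero.
NEUTRAL = PAD(0.0, 0.0, 0.0)


def blend(a: PAD, b: PAD, t: float) -> PAD:
    """Linearly interpolate from a (t=0) to b (t=1), staying inside [-1, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return PAD(*[_clamp((1.0 - t) * x + t * y)
                 for x, y in zip(a.as_list(), b.as_list())])


def condition_encoder_outputs(encoder_outputs: Sequence[Sequence[float]],
                              pad: PAD) -> List[List[float]]:
    """Hypothetical injection point: append the same PAD vector to every
    encoder time step (this is an assumption, not the paper's architecture)."""
    vec = pad.as_list()
    return [list(frame) + vec for frame in encoder_outputs]


if __name__ == "__main__":
    target = PAD(0.5, -0.5, 1.0)  # arbitrary illustrative point, not from the paper
    halfway = blend(NEUTRAL, target, 0.5)
    print(halfway)  # PAD(pleasure=0.25, arousal=-0.25, dominance=0.5)
    frames = [[0.1, 0.2], [0.3, 0.4]]  # toy 2-dim "encoder outputs"
    print(condition_encoder_outputs(frames, halfway))
```

Broadcasting one vector across all time steps keeps the emotion constant over an utterance; the actual model may inject PAD elsewhere (for example into the decoder), which is precisely the design choice the abstract reports exploring empirically.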