Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems

Hwang, Min-Jae; Kim, Jae-Min; Kwon, Ohsung; Lee, Hoyeon; Song, Eunwoo; Yamamoto, Ryuichi; Yoon, Hyun-Wook

Authors: Min-Jae Hwang, Jae-Min Kim, Ohsung Kwon, Hoyeon Lee, Eunwoo Song, Ryuichi Yamamoto, Hyun-Wook Yoon
Publication date: 30 June 2022

Abstract

This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained language model (LM)-based emotion prediction method. Unlike conventional systems that require auxiliary inputs such as manually defined emotion classes, our system directly estimates emotion-related attributes from the input text. Specifically, we utilize generative pre-trained transformer (GPT)-3 to jointly predict both an emotion class and its strength, which represent the coarse and fine properties of the emotion, respectively. These attributes are then combined in the emotional embedding space and used as conditional features of the TTS model for generating output speech signals. Consequently, the proposed system can produce emotional speech from text alone, without any auxiliary inputs. Furthermore, because GPT-3 can capture the emotional context among consecutive sentences, the proposed method can effectively handle the paragraph-level generation of emotional speech.

Comment: Accepted by INTERSPEECH202
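The abstract says that the predicted emotion class and strength are combined in the emotional embedding space, but it does not say how. The sketch below illustrates one plausible reading, under stated assumptions: the strength is taken to lie in [0, 1] and is used to interpolate linearly from a neutral embedding toward the embedding of the predicted class. The embedding table, its values, and the function name `combine_emotion` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embedding table (assumption, not from the paper).
# A real system would learn these vectors jointly with the TTS model.
EMOTION_EMBEDDINGS: Dict[str, List[float]] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.5, 0.0],
    "sad": [-1.0, 0.0, 0.5],
}


def combine_emotion(emotion_class: str, strength: float) -> List[float]:
    """Interpolate from the neutral embedding toward the class embedding.

    Assumes strength is in [0, 1]: 0 yields the neutral embedding and
    1 yields the full class embedding.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    neutral = EMOTION_EMBEDDINGS["neutral"]
    target = EMOTION_EMBEDDINGS[emotion_class]
    return [n + strength * (t - n) for n, t in zip(neutral, target)]


# Example: a class and strength as they might be predicted from input text.
conditioning = combine_emotion("happy", 0.5)
```

In this reading, the resulting vector would serve as the conditional feature passed to the TTS model; the coarse attribute selects the direction in the embedding space and the fine attribute sets how far to move along it.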

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2206.15067

Last time updated on 28/08/2022