3 research outputs found
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions
We present JVNV, a Japanese emotional speech corpus with verbal content
and nonverbal vocalizations whose scripts are generated by a large-scale
language model. Existing emotional speech corpora lack not only proper
emotional scripts but also nonverbal vocalizations (NVs), which are essential
for expressing emotions in spoken language. We propose an automatic
script generation method to produce emotional scripts by providing seed words
with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using
prompt engineering. We select 514 scripts with balanced phoneme coverage from
the generated candidate scripts with the assistance of emotion confidence
scores and language fluency scores. We demonstrate the effectiveness of JVNV by
showing that JVNV has better phoneme coverage and emotion recognizability than
previous Japanese emotional speech corpora. We then benchmark JVNV on emotional
text-to-speech synthesis using discrete codes to represent NVs. We show that
there still exists a gap between the performance of synthesizing read-aloud
speech and emotional speech, and adding NVs in the speech makes the task even
harder, which brings new challenges for this task and makes JVNV a valuable
resource for related work in the future. To the best of our knowledge, JVNV is the
first speech corpus whose scripts are generated automatically using large language
models.
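The script selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a plain greedy maximum-coverage heuristic, leaves out the emotion confidence and language fluency scores, and uses hypothetical names (`select_scripts`, `candidates`) and toy phoneme lists.

```python
def select_scripts(candidates, k):
    """Greedily pick up to k scripts that add the most uncovered phonemes.

    candidates: list of (script_id, phonemes) pairs; both the structure and
    the greedy criterion are assumptions made for illustration only.
    """
    covered = set()   # phonemes covered by the scripts chosen so far
    chosen = []       # ids of selected scripts, in selection order
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        # Take the candidate contributing the most phonemes not yet covered.
        best = max(remaining, key=lambda c: len(set(c[1]) - covered))
        chosen.append(best[0])
        covered |= set(best[1])
        remaining.remove(best)
    return chosen, covered


# Toy input: three candidate scripts with made-up phoneme sequences.
toy_candidates = [
    ("a", ["k", "a"]),
    ("b", ["k", "a", "s", "i"]),
    ("c", ["s", "u"]),
]
chosen_ids, covered_phonemes = select_scripts(toy_candidates, k=2)
```

A real pipeline would presumably filter or re-rank candidates by the emotion and fluency scores before or during this step; the sketch only shows the coverage side.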
Corpus design for expressive speech: impact of the utterance length
The voice corpus plays a crucial role in the quality of synthetic speech generation, especially under a length constraint. Creating a new voice is costly, and recording script selection for an expressive TTS task is generally treated as an optimization problem whose goal is a rich and parsimonious corpus. To vocalize a given book using a TTS system, we investigate four script selection approaches. Based on preliminary observations, we propose simply selecting the shortest utterances of the book and compare the results of this method with state-of-the-art ones for two books with different utterance lengths and styles, using two kinds of concatenation-based TTS systems. The study of the TTS costs indicates that selecting the shortest utterances could result in better synthetic quality, which is confirmed by a perceptual test. An investigation of the usual corpus design criteria in the literature, such as unit coverage or distribution similarity of units, shows that they are not pertinent metrics in the framework of this study.
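The shortest-utterance selection mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: length is measured in characters, the length constraint is a simple total budget, and the names `select_shortest` and `budget` are hypothetical.

```python
def select_shortest(utterances, budget):
    """Pick utterances from shortest to longest until the budget is reached.

    Length is counted in characters here, which is an assumption; a real
    system might count phonemes or recording time instead.
    """
    chosen = []
    total = 0
    # sorted() is stable, so utterances of equal length keep their book order.
    for utt in sorted(utterances, key=len):
        if total + len(utt) > budget:
            break  # every later utterance is at least as long
        chosen.append(utt)
        total += len(utt)
    return chosen


# Toy "book" of four utterances and a small character budget.
book = ["Hello there.", "Yes.", "No way!", "This is a much longer sentence."]
script = select_shortest(book, budget=12)
```

Because the candidates are visited in order of increasing length, the loop can stop at the first utterance that does not fit.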
Towards optimal TTS corpora