We present the JVNV, a Japanese emotional speech corpus with verbal content
and nonverbal vocalizations whose scripts are generated by a large-scale
language model. Existing emotional speech corpora lack not only proper
emotional scripts but also nonverbal vocalizations (NVs) that are essential
expressions in spoken language to express emotions. We propose an automatic
script generation method to produce emotional scripts by providing seed words
with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using
prompt engineering. We select 514 scripts with balanced phoneme coverage from
the generated candidate scripts with the assistance of emotion confidence
scores and language fluency scores. We demonstrate the effectiveness of JVNV by
showing that JVNV has better phoneme coverage and emotion recognizability than
previous Japanese emotional speech corpora. We then benchmark JVNV on emotional
text-to-speech synthesis using discrete codes to represent NVs. We show that
there still exists a gap between the performance of synthesizing read-aloud
speech and emotional speech, and adding NVs in the speech makes the task even
harder, which brings new challenges for this task and makes JVNV a valuable
resource for relevant works in the future. To our best knowledge, JVNV is the
first speech corpus that generates scripts automatically using large language
models