3 research outputs found

JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions

Authors: Aizawa Akiko, Jiang Junfeng, Saito Yuki, Saruwatari Hiroshi, Takamichi Shinnosuke, Xin Detai
Publication date: 09/10/2023

We present JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs), which are essential for expressing emotions in spoken language. We propose an automatic script generation method that produces emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidate scripts with the assistance of emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that it has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis using discrete codes to represent NVs. We show that there is still a gap between the performance of synthesizing read-aloud speech and emotional speech, and that adding NVs to the speech makes the task even harder, which brings new challenges for this task and makes JVNV a valuable resource for future work. To the best of our knowledge, JVNV is the first speech corpus whose scripts are generated automatically using large language models.

arXiv.org e-Print Archive

Corpus design for expressive speech: impact of the utterance length

Authors: Barbot Nelly, Chevelu Jonathan, Lolive Damien, Shamsi Meysam
Publication venue: International Speech Communication Association
Publication date: 25/05/2020

The voice corpus plays a crucial role in the quality of synthetic speech generation, especially under a length constraint. Creating a new voice is costly, and recording script selection for an expressive TTS task is generally treated as an optimization problem whose goal is a rich and parsimonious corpus. In order to vocalize a given book using a TTS system, we investigate four script selection approaches. Based on preliminary observations, we simply propose to select the shortest utterances of the book and compare the results of this method with state-of-the-art ones for two books, with different utterance lengths and styles, using two kinds of concatenation-based TTS systems. The study of the TTS costs indicates that selecting the shortest utterances could result in better synthetic quality, which is confirmed by a perceptual test. An investigation of the usual criteria for corpus design in the literature, such as unit coverage or distribution similarity of units, shows that they are not pertinent metrics in the framework of this study.

HAL-CentraleSupelec

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

Towards optimal TTS corpora

Authors: Boidin C., Cadic D., d'Alessandro Christophe
Publication venue: HAL CCSD
Publication date: 01/01/2010
