Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control

Nakata, Wataru; Saito, Yuki; Saruwatari, Hiroshi; Takamichi, Shinnosuke; Watanabe, Aya; Xin, Detai

Authors: Wataru Nakata, Yuki Saito, Hiroshi Saruwatari, Shinnosuke Takamichi, Aya Watanabe, Detai Xin
Publication date: 23 September 2023

Abstract

In text-to-speech, controlling voice characteristics is important for achieving speech synthesis for various purposes. Given the success of text-conditioned generation, such as text-to-image, free-form text instructions should be useful for intuitive and complex control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form descriptions can advance research on such control. However, neither an open corpus nor a scalable method is currently available. To this end, we develop Coco-Nut, a new corpus of diverse Japanese utterances with text transcriptions and free-form descriptions of voice characteristics. Our methodology for constructing this corpus consists of 1) automatic collection of voice-related audio data from the Internet, 2) quality assurance, and 3) manual annotation using crowdsourcing. Additionally, we benchmark our corpus on a prompt embedding model trained by contrastive speech-text learning.

Comment: Submitted to ASRU202
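
The abstract mentions a prompt embedding model trained by contrastive speech-text learning but gives no details of the objective. As a rough illustration only, the sketch below shows a generic CLIP-style symmetric contrastive loss over paired speech and text embeddings. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    # Hypothetical symmetric contrastive loss: the i-th speech embedding
    # and the i-th text embedding are treated as the only positive pair.
    speech = [l2_normalize(v) for v in speech_emb]
    text = [l2_normalize(v) for v in text_emb]
    n = len(speech)

    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in text]
        for s in speech
    ]

    # Speech-to-text direction: each row should pick its own column.
    s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-speech direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2s = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n

    return 0.5 * (s2t + t2s)


# Toy data (illustrative only): two speech clips and two descriptions.
speech_toy = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(speech_toy, text_matched)
loss_swapped = contrastive_loss(speech_toy, text_swapped)
```

In this toy setting, correctly paired embeddings yield a much smaller loss than swapped ones, which is the behavior a contrastive objective of this kind is meant to reward.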

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2309.13509

Last time updated on 12/10/2023