In text-to-speech, controlling voice characteristics is important in
achieving various-purpose speech synthesis. Considering the success of
text-conditioned generation, such as text-to-image, free-form text instruction
should be useful for intuitive and complicated control of voice
characteristics. A sufficiently large corpus of high-quality and diverse voice
samples with corresponding free-form descriptions can advance such control
research. However, neither an open corpus nor a scalable method is currently
available. To this end, we develop Coco-Nut, a new corpus including diverse
Japanese utterances, along with text transcriptions and free-form voice
characteristics descriptions. Our methodology to construct this corpus consists
of 1) automatic collection of voice-related audio data from the Internet, 2)
quality assurance, and 3) manual annotation using crowdsourcing. Additionally,
we benchmark our corpus on the prompt embedding model trained by contrastive
speech-text learning.Comment: Submitted to ASRU202