Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control
In text-to-speech, controlling voice characteristics is important for achieving
speech synthesis that serves various purposes. Given the success of
text-conditioned generation, such as text-to-image, free-form text instructions
should be useful for intuitive and complex control of voice
characteristics. A sufficiently large corpus of high-quality and diverse voice
samples with corresponding free-form descriptions can advance such control
research. However, neither an open corpus nor a scalable method is currently
available. To this end, we develop Coco-Nut, a new corpus including diverse
Japanese utterances, along with text transcriptions and free-form voice
characteristics descriptions. Our methodology to construct this corpus consists
of 1) automatic collection of voice-related audio data from the Internet, 2)
quality assurance, and 3) manual annotation using crowdsourcing. Additionally,
we benchmark our corpus on a prompt embedding model trained with contrastive
speech-text learning.
Comment: Submitted to ASRU202
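
As a rough illustration of the contrastive speech-text learning mentioned above, the sketch below shows a symmetric InfoNCE-style objective over paired speech and description embeddings. The embedding dimensions, the temperature, and the `contrastive_speech_text_loss` helper are illustrative assumptions, not the benchmark model itself.

```python
# Minimal sketch of a contrastive speech-text objective, assuming precomputed
# speech and prompt (description) embeddings; the actual Coco-Nut benchmark
# model, encoders, and hyperparameters are not specified here.
import torch
import torch.nn.functional as F

def contrastive_speech_text_loss(speech_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (speech, description) embeddings."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))             # matched pairs lie on the diagonal
    loss_s2t = F.cross_entropy(logits, targets)        # speech -> text retrieval
    loss_t2s = F.cross_entropy(logits.t(), targets)    # text -> speech retrieval
    return 0.5 * (loss_s2t + loss_t2s)

# Example with random embeddings standing in for encoder outputs
speech = torch.randn(8, 512)   # e.g., pooled speech-encoder outputs
text = torch.randn(8, 512)     # e.g., pooled prompt-encoder outputs
print(contrastive_speech_text_loss(speech, text).item())
```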
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
We present the UTokyo-SaruLab mean opinion score (MOS) prediction system
submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS
values of speech samples collected from previous Blizzard Challenges and Voice
Conversion Challenges for two tracks: a main track for in-domain prediction and
an out-of-domain (OOD) track for which there is less labeled data from
different listening tests. Our system is based on ensemble learning of strong
and weak learners. Strong learners incorporate several improvements over
previous approaches that fine-tune self-supervised learning (SSL) models, while
weak learners use basic machine learning methods to predict scores from SSL
features. In the Challenge, our system had the highest score on several metrics
for both the main and OOD tracks. In addition, we conducted ablation studies to
investigate the effectiveness of our proposed methods.
Comment: Submitted to INTERSPEECH 202
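
To make the strong/weak-learner ensembling idea concrete, here is a minimal sketch that averages a placeholder fine-tuned-SSL-style regressor with a ridge regressor over pooled SSL features. The `StrongLearner` module, the dummy features, and the simple averaging are stand-ins for the actual UTMOS models and stacking scheme.

```python
# A minimal sketch of strong/weak-learner ensembling for MOS prediction:
# a (placeholder) SSL regressor and a ridge regressor on pooled SSL features,
# averaged at inference. Names and shapes below are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge

class StrongLearner(nn.Module):
    """Stand-in for a fine-tuned SSL model with a regression head."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, ssl_feats: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, frames, feat_dim) -> mean-pool, then predict a MOS value
        return self.head(ssl_feats.mean(dim=1)).squeeze(-1)

# Weak learner: basic machine learning on utterance-level SSL features
weak = Ridge(alpha=1.0)
train_feats = np.random.randn(100, 768)             # pooled SSL features (dummy)
train_mos = np.random.uniform(1.0, 5.0, size=100)   # dummy MOS labels
weak.fit(train_feats, train_mos)

strong = StrongLearner()
test_frames = torch.randn(4, 200, 768)              # dummy frame-level SSL features
with torch.no_grad():
    strong_pred = strong(test_frames).numpy()
weak_pred = weak.predict(test_frames.mean(dim=1).numpy())

ensemble_pred = 0.5 * strong_pred + 0.5 * weak_pred  # simple average as a stand-in for stacking
print(ensemble_pred)
```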
How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics
We examine the speech modeling potential of generative spoken language
modeling (GSLM), which involves using learned symbols derived from data rather
than phonemes for speech analysis and synthesis. Since GSLM facilitates
textless spoken language processing, exploring its effectiveness is critical
for paving the way for novel paradigms in spoken-language processing. This
paper presents the findings of GSLM's encoding and decoding effectiveness at
the spoken-language and speech levels. Through speech resynthesis experiments,
we revealed that resynthesis errors occur at levels ranging from phonology to
syntactics, and that GSLM frequently resynthesizes natural but content-altered
speech.
Comment: Accepted to INTERSPEECH 202
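
For readers unfamiliar with GSLM-style "learned symbols", the sketch below quantizes frame-level features into discrete units with k-means and collapses consecutive repeats. The feature dimension and cluster count are assumptions, and the SSL feature extractor and unit-to-speech vocoder used in practice are omitted.

```python
# A minimal sketch of discretizing speech features into learned units:
# cluster frame-level features with k-means and collapse repeated unit ids.
# Real GSLM pipelines use SSL features (e.g., HuBERT) and a unit-based
# decoder/vocoder, which are not reproduced here.
import numpy as np
from sklearn.cluster import KMeans
from itertools import groupby

def features_to_units(feats: np.ndarray, kmeans: KMeans) -> list[int]:
    """Map frame-level features to a deduplicated sequence of discrete units."""
    frame_units = kmeans.predict(feats)                    # one unit id per frame
    return [u for u, _ in groupby(frame_units.tolist())]   # collapse consecutive repeats

# Dummy frame-level features standing in for SSL representations
train_feats = np.random.randn(5000, 256)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_feats)

utterance_feats = np.random.randn(300, 256)                # one utterance (300 frames)
units = features_to_units(utterance_feats, kmeans)
print(len(units), units[:20])   # such unit sequences would feed a unit LM / unit-to-speech decoder
```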
Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system
that leverages multimodal context information of preceding acoustic context and
bilateral textual context to improve the prosody of synthetic speech. Previous
work uses either unilateral or single-modality context, which does not fully
represent the context information. The proposed method uses an acoustic context
encoder and a textual context encoder to aggregate context information and
feeds it to the TTS model, which enables the model to predict context-dependent
prosody. We conducted comprehensive objective and subjective evaluations on a
multi-speaker Japanese audiobook dataset. Experimental results demonstrate that
the proposed method significantly outperforms two previous works. Additionally,
we present insights about different choices of context (modalities, lateral
information, and length) for audiobook TTS that have not been discussed in the
literature before.
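
A minimal sketch of the acoustic/textual context aggregation idea follows: a GRU encodes the preceding utterance's mel-spectrogram, sentence embeddings of the surrounding text supply bilateral textual context, and the fused vector would condition the TTS model. The `ContextAggregator` module, its dimensions, and the pooling choices are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of combining a preceding-acoustic-context encoder and a
# bilateral textual-context encoder into one conditioning vector for TTS.
# Module sizes and encoder choices are assumptions, not the proposed system.
import torch
import torch.nn as nn

class ContextAggregator(nn.Module):
    def __init__(self, acoustic_dim=80, text_dim=768, ctx_dim=256):
        super().__init__()
        self.acoustic_enc = nn.GRU(acoustic_dim, ctx_dim, batch_first=True)
        self.text_proj = nn.Linear(text_dim, ctx_dim)
        self.fuse = nn.Linear(3 * ctx_dim, ctx_dim)

    def forward(self, prev_mel, prev_text_emb, next_text_emb):
        # prev_mel: (B, T, 80) mel-spectrogram of the preceding utterance
        # prev_text_emb / next_text_emb: (B, text_dim) sentence embeddings of the
        # preceding and following text (bilateral textual context)
        _, h = self.acoustic_enc(prev_mel)          # h: (1, B, ctx_dim)
        acoustic_ctx = h.squeeze(0)
        text_ctx_prev = self.text_proj(prev_text_emb)
        text_ctx_next = self.text_proj(next_text_emb)
        fused = torch.cat([acoustic_ctx, text_ctx_prev, text_ctx_next], dim=-1)
        return self.fuse(fused)                     # conditioning vector fed to the TTS model

agg = ContextAggregator()
ctx = agg(torch.randn(2, 400, 80), torch.randn(2, 768), torch.randn(2, 768))
print(ctx.shape)   # torch.Size([2, 256])
```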
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions
We present the JVNV, a Japanese emotional speech corpus with verbal content
and nonverbal vocalizations whose scripts are generated by a large-scale
language model. Existing emotional speech corpora lack not only proper
emotional scripts but also nonverbal vocalizations (NVs), which are essential
expressions for conveying emotions in spoken language. We propose an automatic
script generation method to produce emotional scripts by providing seed words
with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using
prompt engineering. We select 514 scripts with balanced phoneme coverage from
the generated candidate scripts with the assistance of emotion confidence
scores and language fluency scores. We demonstrate the effectiveness of JVNV by
showing that JVNV has better phoneme coverage and emotion recognizability than
previous Japanese emotional speech corpora. We then benchmark JVNV on emotional
text-to-speech synthesis using discrete codes to represent NVs. We show that
there still exists a performance gap between synthesizing read-aloud speech and
emotional speech, and that adding NVs to the speech makes the task even harder;
this brings new challenges and makes JVNV a valuable resource for future work.
To the best of our knowledge, JVNV is the first speech corpus whose scripts are
automatically generated using large language models.
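
To illustrate the two steps described above, the sketch below builds a ChatGPT-style prompt from a sentiment-tagged seed word plus NV phrases, and greedily selects scripts to maximize phoneme coverage. The prompt wording, the `build_prompt`/`greedy_select` helpers, and the toy phoneme sets are assumptions rather than the JVNV pipeline itself.

```python
# A minimal sketch of (1) prompt construction from a seed word with sentiment
# polarity plus candidate NV phrases, and (2) greedy script selection for
# phoneme coverage. All names and the prompt text are illustrative assumptions.
def build_prompt(seed_word: str, polarity: str, nv_phrases: list[str]) -> str:
    return (
        f"Write a short Japanese utterance expressing a {polarity} emotion "
        f"about '{seed_word}'. Insert one of these nonverbal vocalizations "
        f"where natural: {', '.join(nv_phrases)}."
    )

def greedy_select(scripts: list[str], phonemes_of, target_size: int) -> list[str]:
    """Pick scripts one by one, each time adding the one covering the most new phonemes."""
    selected, covered = [], set()
    pool = list(scripts)
    while pool and len(selected) < target_size:
        best = max(pool, key=lambda s: len(phonemes_of(s) - covered))
        covered |= phonemes_of(best)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage with whitespace-split tokens standing in for phoneme sets
print(build_prompt("sunset", "positive", ["waa", "hehe"]))
cands = ["ka ki ku", "sa shi su", "ka sa ta", "na ni nu"]
print(greedy_select(cands, phonemes_of=lambda s: set(s.split()), target_size=2))
```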