Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control
In text-to-speech, controlling voice characteristics is important for achieving
speech synthesis that serves various purposes. Given the success of
text-conditioned generation, such as text-to-image, free-form text instructions
should be useful for intuitive and complex control of voice
characteristics. A sufficiently large corpus of high-quality and diverse voice
samples with corresponding free-form descriptions can advance such control
research. However, neither an open corpus nor a scalable method is currently
available. To this end, we develop Coco-Nut, a new corpus including diverse
Japanese utterances, along with text transcriptions and free-form voice
characteristics descriptions. Our methodology to construct this corpus consists
of 1) automatic collection of voice-related audio data from the Internet, 2)
quality assurance, and 3) manual annotation using crowdsourcing. Additionally,
we benchmark our corpus on a prompt embedding model trained with contrastive
speech-text learning.
Comment: Submitted to ASRU202
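
As a rough illustration of the contrastive speech-text learning mentioned above, the sketch below shows a symmetric InfoNCE-style objective over paired speech and description embeddings. The embedding dimensions, the temperature, and the `contrastive_speech_text_loss` helper are illustrative assumptions, not the benchmark model itself.

```python
# Minimal sketch of a contrastive speech-text objective, assuming precomputed
# speech and prompt (description) embeddings; the actual Coco-Nut benchmark
# model, encoders, and hyperparameters are not specified here.
import torch
import torch.nn.functional as F

def contrastive_speech_text_loss(speech_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (speech, description) embeddings."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))             # matched pairs lie on the diagonal
    loss_s2t = F.cross_entropy(logits, targets)        # speech -> text retrieval
    loss_t2s = F.cross_entropy(logits.t(), targets)    # text -> speech retrieval
    return 0.5 * (loss_s2t + loss_t2s)

# Example with random embeddings standing in for encoder outputs
speech = torch.randn(8, 512)   # e.g., pooled speech-encoder outputs
text = torch.randn(8, 512)     # e.g., pooled prompt-encoder outputs
print(contrastive_speech_text_loss(speech, text).item())
```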
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
We present the UTokyo-SaruLab mean opinion score (MOS) prediction system
submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS
values of speech samples collected from previous Blizzard Challenges and Voice
Conversion Challenges for two tracks: a main track for in-domain prediction and
an out-of-domain (OOD) track for which there is less labeled data from
different listening tests. Our system is based on ensemble learning of strong
and weak learners. Strong learners incorporate several improvements over
previous approaches that fine-tune self-supervised learning (SSL) models, while
weak learners use basic machine learning methods to predict scores from SSL
features. In the Challenge, our system had the highest score on several metrics
for both the main and OOD tracks. In addition, we conducted ablation studies to
investigate the effectiveness of our proposed methods.
Comment: Submitted to INTERSPEECH 202
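
To make the strong/weak-learner ensembling idea concrete, here is a minimal sketch that averages a placeholder fine-tuned-SSL-style regressor with a ridge regressor over pooled SSL features. The `StrongLearner` module, the dummy features, and the simple averaging are stand-ins for the actual UTMOS models and stacking scheme.

```python
# A minimal sketch of strong/weak-learner ensembling for MOS prediction:
# a (placeholder) SSL regressor and a ridge regressor on pooled SSL features,
# averaged at inference. Names and shapes below are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge

class StrongLearner(nn.Module):
    """Stand-in for a fine-tuned SSL model with a regression head."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, ssl_feats: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, frames, feat_dim) -> mean-pool, then predict a MOS value
        return self.head(ssl_feats.mean(dim=1)).squeeze(-1)

# Weak learner: basic machine learning on utterance-level SSL features
weak = Ridge(alpha=1.0)
train_feats = np.random.randn(100, 768)             # pooled SSL features (dummy)
train_mos = np.random.uniform(1.0, 5.0, size=100)   # dummy MOS labels
weak.fit(train_feats, train_mos)

strong = StrongLearner()
test_frames = torch.randn(4, 200, 768)              # dummy frame-level SSL features
with torch.no_grad():
    strong_pred = strong(test_frames).numpy()
weak_pred = weak.predict(test_frames.mean(dim=1).numpy())

ensemble_pred = 0.5 * strong_pred + 0.5 * weak_pred  # simple average as a stand-in for stacking
print(ensemble_pred)
```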
How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics
We examine the speech modeling potential of generative spoken language
modeling (GSLM), which involves using learned symbols derived from data rather
than phonemes for speech analysis and synthesis. Since GSLM facilitates
textless spoken language processing, exploring its effectiveness is critical
for paving the way for novel paradigms in spoken-language processing. This
paper presents the findings of GSLM's encoding and decoding effectiveness at
the spoken-language and speech levels. Through speech resynthesis experiments,
we revealed that resynthesis errors occur at levels ranging from phonology to
syntactics, and that GSLM frequently resynthesizes natural but content-altered
speech.
Comment: Accepted to INTERSPEECH 202
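
For readers unfamiliar with GSLM-style "learned symbols", the sketch below quantizes frame-level features into discrete units with k-means and collapses consecutive repeats. The feature dimension and cluster count are assumptions, and the SSL feature extractor and unit-to-speech vocoder used in practice are omitted.

```python
# A minimal sketch of discretizing speech features into learned units:
# cluster frame-level features with k-means and collapse repeated unit ids.
# Real GSLM pipelines use SSL features (e.g., HuBERT) and a unit-based
# decoder/vocoder, which are not reproduced here.
import numpy as np
from sklearn.cluster import KMeans
from itertools import groupby

def features_to_units(feats: np.ndarray, kmeans: KMeans) -> list[int]:
    """Map frame-level features to a deduplicated sequence of discrete units."""
    frame_units = kmeans.predict(feats)                    # one unit id per frame
    return [u for u, _ in groupby(frame_units.tolist())]   # collapse consecutive repeats

# Dummy frame-level features standing in for SSL representations
train_feats = np.random.randn(5000, 256)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_feats)

utterance_feats = np.random.randn(300, 256)                # one utterance (300 frames)
units = features_to_units(utterance_feats, kmeans)
print(len(units), units[:20])   # such unit sequences would feed a unit LM / unit-to-speech decoder
```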
Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system
that leverages multimodal context information of preceding acoustic context and
bilateral textual context to improve the prosody of synthetic speech. Previous
work uses either unilateral or single-modality context, which does not fully
represent the context information. The proposed method uses an acoustic context
encoder and a textual context encoder to aggregate context information and
feeds it to the TTS model, which enables the model to predict context-dependent
prosody. We conducted comprehensive objective and subjective evaluations on a
multi-speaker Japanese audiobook dataset. Experimental results demonstrate that
the proposed method significantly outperforms two previous works. Additionally,
we present insights about different choices of context (modalities, lateral
information, and length) for audiobook TTS that have not been discussed in the
literature before.
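
A minimal sketch of the acoustic/textual context aggregation idea follows: a GRU encodes the preceding utterance's mel-spectrogram, sentence embeddings of the surrounding text supply bilateral textual context, and the fused vector would condition the TTS model. The `ContextAggregator` module, its dimensions, and the pooling choices are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of combining a preceding-acoustic-context encoder and a
# bilateral textual-context encoder into one conditioning vector for TTS.
# Module sizes and encoder choices are assumptions, not the proposed system.
import torch
import torch.nn as nn

class ContextAggregator(nn.Module):
    def __init__(self, acoustic_dim=80, text_dim=768, ctx_dim=256):
        super().__init__()
        self.acoustic_enc = nn.GRU(acoustic_dim, ctx_dim, batch_first=True)
        self.text_proj = nn.Linear(text_dim, ctx_dim)
        self.fuse = nn.Linear(3 * ctx_dim, ctx_dim)

    def forward(self, prev_mel, prev_text_emb, next_text_emb):
        # prev_mel: (B, T, 80) mel-spectrogram of the preceding utterance
        # prev_text_emb / next_text_emb: (B, text_dim) sentence embeddings of the
        # preceding and following text (bilateral textual context)
        _, h = self.acoustic_enc(prev_mel)          # h: (1, B, ctx_dim)
        acoustic_ctx = h.squeeze(0)
        text_ctx_prev = self.text_proj(prev_text_emb)
        text_ctx_next = self.text_proj(next_text_emb)
        fused = torch.cat([acoustic_ctx, text_ctx_prev, text_ctx_next], dim=-1)
        return self.fuse(fused)                     # conditioning vector fed to the TTS model

agg = ContextAggregator()
ctx = agg(torch.randn(2, 400, 80), torch.randn(2, 768), torch.randn(2, 768))
print(ctx.shape)   # torch.Size([2, 256])
```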
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions
We present the JVNV, a Japanese emotional speech corpus with verbal content
and nonverbal vocalizations whose scripts are generated by a large-scale
language model. Existing emotional speech corpora lack not only proper
emotional scripts but also nonverbal vocalizations (NVs), which are essential
expressions for conveying emotions in spoken language. We propose an automatic
script generation method to produce emotional scripts by providing seed words
with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using
prompt engineering. We select 514 scripts with balanced phoneme coverage from
the generated candidate scripts with the assistance of emotion confidence
scores and language fluency scores. We demonstrate the effectiveness of JVNV by
showing that JVNV has better phoneme coverage and emotion recognizability than
previous Japanese emotional speech corpora. We then benchmark JVNV on emotional
text-to-speech synthesis using discrete codes to represent NVs. We show that
there still exists a performance gap between synthesizing read-aloud speech and
emotional speech, and that adding NVs to the speech makes the task even harder;
this brings new challenges and makes JVNV a valuable resource for future work.
To the best of our knowledge, JVNV is the first speech corpus whose scripts are
automatically generated using large language models.
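
To illustrate the two steps described above, the sketch below builds a ChatGPT-style prompt from a sentiment-tagged seed word plus NV phrases, and greedily selects scripts to maximize phoneme coverage. The prompt wording, the `build_prompt`/`greedy_select` helpers, and the toy phoneme sets are assumptions rather than the JVNV pipeline itself.

```python
# A minimal sketch of (1) prompt construction from a seed word with sentiment
# polarity plus candidate NV phrases, and (2) greedy script selection for
# phoneme coverage. All names and the prompt text are illustrative assumptions.
def build_prompt(seed_word: str, polarity: str, nv_phrases: list[str]) -> str:
    return (
        f"Write a short Japanese utterance expressing a {polarity} emotion "
        f"about '{seed_word}'. Insert one of these nonverbal vocalizations "
        f"where natural: {', '.join(nv_phrases)}."
    )

def greedy_select(scripts: list[str], phonemes_of, target_size: int) -> list[str]:
    """Pick scripts one by one, each time adding the one covering the most new phonemes."""
    selected, covered = [], set()
    pool = list(scripts)
    while pool and len(selected) < target_size:
        best = max(pool, key=lambda s: len(phonemes_of(s) - covered))
        covered |= phonemes_of(best)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage with whitespace-split tokens standing in for phoneme sets
print(build_prompt("sunset", "positive", ["waa", "hehe"]))
cands = ["ka ki ku", "sa shi su", "ka sa ta", "na ni nu"]
print(greedy_select(cands, phonemes_of=lambda s: set(s.split()), target_size=2))
```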