A Survey on Evaluation Metrics for Backchannel Prediction Models
In this paper, we give an overview of the evaluation metrics used to measure the performance of backchannel prediction models. Both objective and subjective evaluation metrics are discussed. The survey shows that almost every backchannel prediction model is evaluated with a different metric. This makes comparisons between the developed models unreliable, even apart from the other variables in play, such as differences in corpora, language, conversational setting, amount of data, and the definition of the term backchannel.
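As an illustration of the kind of objective metric this survey covers, the sketch below computes margin-based precision, recall, and F1 over predicted backchannel onset times. The function name, the 0.5-second margin, and the greedy matching scheme are illustrative assumptions rather than a metric drawn from any particular model in the survey.

    def margin_f1(predicted_times, reference_times, margin=0.5):
        """Precision/recall/F1 for predicted backchannel onsets (in seconds).

        A prediction counts as a hit if it falls within `margin` seconds of a
        not-yet-matched reference backchannel; each reference is matched once.
        """
        predicted = sorted(predicted_times)
        reference = sorted(reference_times)
        matched = [False] * len(reference)
        hits = 0
        for p in predicted:
            for i, r in enumerate(reference):
                if not matched[i] and abs(p - r) <= margin:
                    matched[i] = True
                    hits += 1
                    break
        precision = hits / len(predicted) if predicted else 0.0
        recall = hits / len(reference) if reference else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        return precision, recall, f1

    # Example: three predicted onsets scored against two reference backchannels.
    print(margin_f1([1.2, 3.9, 7.0], [1.0, 4.1], margin=0.5))

Tightening or widening the margin changes the reported scores substantially, which is one reason results computed with different metric settings are hard to compare across models.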
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Scaling text-to-speech to large and wild datasets has proven highly effective for generalizing timbre and speaking style, particularly in zero-shot TTS. However, previous works usually encode speech into latents with an audio codec and generate them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase), and that each of them should be modeled by a module with an appropriate inductive bias. From this perspective, we carefully design a novel and large zero-shot TTS system called Mega-TTS, which is trained with large-scale wild data and models the different attributes in different ways: 1) Instead of latents encoded by an audio codec, we keep the spectrogram as the intermediate feature, since it separates phase from the other attributes well; phase can be reconstructed by a GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre with global vectors, since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent-code language model to fit the distribution of prosody, since prosody changes quickly over time within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity thanks to the appropriate inductive bias of each module. Audio samples are available at
https://mega-tts.github.io/demo-page
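The Mega-TTS modules are not specified beyond the abstract; as a rough sketch of the "timbre as a global vector" idea, the code below mean-pools a reference encoder's frame-level outputs over time so that each utterance yields a single timbre embedding. The module name, layer choices, and dimensions are assumptions made for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class GlobalTimbreEncoder(nn.Module):
        """Illustrative reference encoder: frame-level features are averaged
        over time, so the timbre embedding is one global vector per utterance."""

        def __init__(self, n_mels=80, hidden=256, timbre_dim=256):
            super().__init__()
            self.frame_net = nn.Sequential(
                nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            self.proj = nn.Linear(hidden, timbre_dim)

        def forward(self, mel):            # mel: (batch, n_mels, frames)
            h = self.frame_net(mel)        # (batch, hidden, frames)
            pooled = h.mean(dim=-1)        # average over time -> global vector
            return self.proj(pooled)       # (batch, timbre_dim)

    # Example: a reference utterance of 240 mel frames with 80 mel bins.
    encoder = GlobalTimbreEncoder()
    timbre = encoder(torch.randn(1, 80, 240))
    print(timbre.shape)                    # torch.Size([1, 256])

Because the pooled vector discards temporal detail, it captures slowly varying speaker identity while leaving fast-changing prosody to be modeled separately, which mirrors the decomposition the abstract describes.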
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis
In human speech, a speaker's attitude cannot be fully expressed by the textual content alone; it is also conveyed through intonation. Declarative questions are common in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems cannot synthesize rising intonation for these sentences because the relevant semantic information is lost. Although it has become more common to complement such systems with extra language models, their performance in modeling rising intonation has not been well studied. In this paper, we propose to complement a Cantonese TTS model with a BERT-based statement/question classifier. We design different training strategies and compare their performance. We conduct our experiments on a Cantonese corpus named CanTTS. Empirical results show that the separate training approach obtains the best generalization performance and feasibility.
Comment: Accepted by INTERSPEECH 202
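As a hedged sketch of how a BERT-based statement/question classifier could sit in front of a TTS model, the snippet below scores a sentence for the probability that it is a declarative question using the Hugging Face transformers API. The bert-base-chinese checkpoint, the two-label setup, and the function name are placeholders; the paper's actual classifier, training data, and integration with the Cantonese TTS model are not described in the abstract.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Placeholder checkpoint: any BERT-style model covering written Cantonese
    # (ideally fine-tuned on labelled statement/question pairs) could be used.
    MODEL_NAME = "bert-base-chinese"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    classifier = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2)      # label 0 = statement, 1 = question

    def question_probability(text: str) -> float:
        """Return the probability that `text` should be read with rising
        (question) intonation; this score could then condition synthesis."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            logits = classifier(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

    # The classification head is randomly initialized here; it would need
    # fine-tuning before its output is meaningful for driving intonation.
    print(question_probability("你今日返工"))

In a separate-training setup of the kind the abstract favors, such a classifier would be trained on its own and only its prediction would be passed to the TTS model at synthesis time.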