Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Scaling text-to-speech to large, in-the-wild datasets has proven highly
effective in achieving timbre and speaking-style generalization, particularly
in zero-shot TTS. However, previous works usually encode speech into latents
with an audio codec and use autoregressive language models or diffusion models
to generate them, which ignores the intrinsic nature of speech and may lead to
inferior or uncontrollable results. We argue that speech can be
decomposed into several attributes (e.g., content, timbre, prosody, and phase)
and each of them should be modeled using a module with appropriate inductive
biases. From this perspective, we carefully design a novel and large zero-shot
TTS system called Mega-TTS, which is trained with large-scale wild data and
models different attributes in different ways: 1) Instead of using latents
encoded by an audio codec as the intermediate feature, we choose the
spectrogram, since it separates phase from the other attributes well. Phase
can be reconstructed by a GAN-based vocoder and does not need to be modeled by
the language model. 2) We model timbre using global vectors
since timbre is a global attribute that changes slowly over time. 3) We further
use a VQGAN-based acoustic model to generate the spectrogram and a latent code
language model to fit the distribution of prosody, since prosody changes
quickly over time in a sentence, and language models can capture both local and
long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K
hours of speech and evaluate its performance on unseen speakers. Experimental
results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on
zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior
naturalness, robustness, and speaker similarity due to the proper inductive
bias of each module. Audio samples are available at
https://mega-tts.github.io/demo-page
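The decomposition described above (spectrogram as the intermediate feature, a global timbre vector, and a language model over discrete prosody codes) can be illustrated with a minimal PyTorch sketch. All module names, dimensions, and the mean-pooling timbre encoder below are illustrative assumptions, not the authors' implementation; the plain linear decoder stands in for the paper's VQGAN-based acoustic model.

```python
# Illustrative sketch of the attribute decomposition; shapes and names are
# invented for this example, not taken from Mega-TTS.
import torch
import torch.nn as nn


class TimbreEncoder(nn.Module):
    """Timbre as a single global vector: mean-pool a reference mel-spectrogram."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)

    def forward(self, ref_mel):                   # ref_mel: (B, T_ref, n_mels)
        return self.proj(ref_mel).mean(dim=1)     # (B, dim)


class ProsodyCodeLM(nn.Module):
    """Autoregressive LM over discrete prosody codes from a VQ codebook."""
    def __init__(self, n_codes=1024, dim=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, n_codes)

    def forward(self, codes):                     # codes: (B, T) integer tokens
        x = self.embed(codes)
        T = codes.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(x, mask=causal)
        return self.head(h)                       # next-code logits, (B, T, n_codes)


class SpectrogramDecoder(nn.Module):
    """Fuses content, timbre, and prosody into a mel-spectrogram.

    Phase is deliberately absent: it is left to a GAN vocoder at synthesis time.
    """
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.out = nn.Linear(dim, n_mels)

    def forward(self, content, timbre, prosody):  # (B, T, dim), (B, dim), (B, T, dim)
        fused = content + prosody + timbre.unsqueeze(1)
        return self.out(fused)                    # predicted mel, (B, T, n_mels)
```

The point of the sketch is the inductive-bias split: timbre is compressed to one slowly varying vector, prosody is a fast-varying token sequence handled by a language model, and phase never enters any of the learned modules.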
Generation of multi-modal dialogue for a net environment
In this paper, an architecture and a special-purpose markup language for simulated affective face-to-face communication are presented. In systems based on this architecture, users will be able to watch embodied conversational agents interact with each other in virtual locations on the internet. The markup language, the Rich Representation Language (RRL), has been designed to provide an integrated representation of speech, gesture, posture, and facial animation.
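The abstract does not show RRL's actual tag set, so the following is a purely hypothetical Python sketch of the kind of integrated record such a language might encode for one dialogue turn; every field name below is invented for illustration.

```python
# Hypothetical integrated representation of one dialogue turn, inspired by the
# channels the abstract lists (speech, gesture, posture, facial animation).
# None of these field names come from RRL itself.
from dataclasses import dataclass, field


@dataclass
class DialogueTurn:
    """One agent utterance with its synchronized non-verbal channels."""
    speaker: str
    text: str                               # what is said (drives speech synthesis)
    emotion: str = "neutral"                # affective state coloring all channels
    gestures: list[str] = field(default_factory=list)  # e.g. ["nod", "wave"]
    posture: str = "upright"
    facial_animation: str = "smile"


turn = DialogueTurn(
    speaker="agent_a",
    text="Welcome to the virtual lobby!",
    emotion="happy",
    gestures=["wave"],
)
```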
Robust Speaker Recognition Using Speech Enhancement And Attention Model
In this paper, a novel architecture for speaker recognition is proposed by
cascading speech enhancement and speaker processing. Its aim is to improve
speaker recognition performance when speech signals are corrupted by noise.
Instead of individually processing speech enhancement and speaker recognition,
the two modules are integrated into one framework by a joint optimisation using
deep neural networks. Furthermore, to increase robustness against noise, a
multi-stage attention mechanism is employed to highlight speaker-related
features learned from contextual information in the time and frequency
domains. To
evaluate speaker identification and verification performance of the proposed
approach, we test it on VoxCeleb1, one of the most widely used benchmark
datasets. Moreover, the robustness of the proposed approach is also tested on
VoxCeleb1 data corrupted by three types of interference (general noise, music,
and babble) at different signal-to-noise ratio (SNR) levels. The results show
that the proposed approach, using speech enhancement and multi-stage attention
models, outperforms two strong baselines without them in most acoustic
conditions in our experiments.
Comment: Accepted by Odyssey 2020
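As a rough illustration of the cascade described above, here is a minimal PyTorch sketch in which an enhancement front-end and an attention-pooled speaker network are optimised jointly under a single loss. The architecture details (mask-based enhancer, a single attention stage, layer sizes) are assumptions, not the paper's; only the speaker count, 1,251, reflects VoxCeleb1.

```python
# Joint optimisation of enhancement and speaker recognition: one loss,
# gradients flow through both modules. Details are illustrative only.
import torch
import torch.nn as nn


class Enhancer(nn.Module):
    """Masking-based enhancement: predicts a [0, 1] mask over the noisy spectrogram."""
    def __init__(self, n_freq=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU(),
                                 nn.Linear(256, n_freq), nn.Sigmoid())

    def forward(self, noisy):                 # noisy: (B, T, n_freq)
        return noisy * self.net(noisy)        # enhanced spectrogram


class AttentiveSpeakerNet(nn.Module):
    """Attention-pooled speaker embedding over the enhanced features."""
    def __init__(self, n_freq=257, dim=192, n_speakers=1251):
        super().__init__()
        self.frame = nn.Linear(n_freq, dim)
        self.score = nn.Linear(dim, 1)        # per-frame attention weights
        self.classify = nn.Linear(dim, n_speakers)

    def forward(self, feats):                 # feats: (B, T, n_freq)
        h = torch.tanh(self.frame(feats))
        w = torch.softmax(self.score(h), dim=1)   # normalise over time
        emb = (w * h).sum(dim=1)              # utterance-level embedding
        return emb, self.classify(emb)


enhancer, spk = Enhancer(), AttentiveSpeakerNet()
opt = torch.optim.Adam(list(enhancer.parameters()) + list(spk.parameters()), lr=1e-4)

noisy = torch.rand(8, 200, 257)               # dummy batch of noisy spectrograms
labels = torch.randint(0, 1251, (8,))
_, logits = spk(enhancer(noisy))              # enhancement feeds the speaker net
loss = nn.functional.cross_entropy(logits, labels)
loss.backward(); opt.step()
```

Because the classification loss back-propagates through the enhancer, the front-end is tuned for speaker discrimination rather than for perceptual quality alone, which is the motivation for joint rather than separate training.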