Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
Cross-speaker style transfer in speech synthesis aims to transfer a style
from a source speaker to synthesised speech in a target speaker's timbre. Most
previous approaches rely on data with style labels, but manually annotated
labels are expensive and not always reliable. In response to this problem, we
propose Style-Label-Free, a cross-speaker style transfer method that can
realize style transfer from a source speaker to a target speaker without style
labels. Firstly, a reference encoder structure based on quantized variational
autoencoder (Q-VAE) and style bottleneck is designed to extract discrete style
representations. Secondly, a speaker-wise batch normalization layer is proposed
to reduce the source speaker leakage. In order to improve the style extraction
ability of the reference encoder, a style-invariant and contrastive data
augmentation method is proposed. Experimental results show that the method
outperforms the baseline. We provide a website with audio samples. Comment: Published to ISCSLP 202
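The speaker-wise normalization idea can be pictured with a small PyTorch sketch. This is an illustrative assumption rather than the authors' implementation: the class name, the choice of one BatchNorm1d per speaker, and the tensor shapes are hypothetical, but it shows how keeping separate normalization statistics per source speaker can strip speaker-dependent bias out of the reference-encoder style features.

```python
# Illustrative sketch only (not the paper's code): speaker-wise batch normalization
# keeps separate BatchNorm statistics for every training speaker, so that
# speaker-dependent offsets are normalized away from the extracted style features.
import torch
import torch.nn as nn


class SpeakerWiseBatchNorm(nn.Module):
    def __init__(self, num_speakers: int, num_features: int):
        super().__init__()
        # Hypothetical design: one independent BatchNorm1d per speaker.
        self.norms = nn.ModuleList(
            [nn.BatchNorm1d(num_features) for _ in range(num_speakers)]
        )

    def forward(self, x: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features) style features; speaker_ids: (batch,) integer ids.
        out = torch.empty_like(x)
        for spk in speaker_ids.unique():
            mask = speaker_ids == spk
            out[mask] = self.norms[int(spk)](x[mask])
        return out


# Usage: normalize reference-encoder features per source speaker before the
# style bottleneck. Each speaker group needs more than one sample in training mode.
layer = SpeakerWiseBatchNorm(num_speakers=4, num_features=256)
feats = torch.randn(8, 256)
speakers = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
normalized = layer(feats, speakers)
```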
DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin
While the performance of cross-lingual TTS based on monolingual corpora has
been significantly improved recently, generating cross-lingual speech still
suffers from the foreign accent problem, leading to limited naturalness.
Besides, current cross-lingual methods ignore emotion modeling, even though
emotion is indispensable paralinguistic information in speech delivery. In this paper, we
propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer
method that can transfer emotion from a source speaker to the intra- and
cross-lingual target speakers. Specifically, to relieve the foreign accent
problem while improving the emotion expressiveness, the terminal distribution
of the forward diffusion process is parameterized into a speaker-irrelevant but
emotion-related linguistic prior by a prior text encoder with the emotion
embedding as a condition. To address the weaker emotional expressiveness
problem caused by speaker disentanglement in emotion embedding, a novel
orthogonal projection based emotion disentangling module (OP-EDM) is proposed
to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover,
a condition-enhanced DPM decoder is introduced to strengthen the modeling
ability of the speaker and the emotion in the reverse diffusion process to
further improve emotion expressiveness in speech delivery. Cross-lingual
emotion transfer experiments show the superiority of DiCLET-TTS over various
competitive models and the effectiveness of OP-EDM in learning speaker-irrelevant
but emotion-discriminative embeddings. Comment: Accepted by TASL
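The orthogonal-projection idea behind OP-EDM can be sketched in a few lines. The function below is a hedged illustration, not the paper's module: it simply removes the component of an emotion embedding that lies along the speaker embedding, leaving a vector orthogonal to the speaker direction and therefore carrying less speaker information.

```python
# Hedged sketch of the orthogonal-projection idea (not the paper's OP-EDM code):
# subtract from the emotion embedding its component along the speaker embedding.
import torch
import torch.nn.functional as F


def project_out_speaker(emotion: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
    # emotion, speaker: (batch, dim)
    spk_dir = F.normalize(speaker, dim=-1)                           # unit speaker direction
    along = (emotion * spk_dir).sum(dim=-1, keepdim=True) * spk_dir  # component along speaker
    return emotion - along                                           # speaker-orthogonal residual


emotion_emb = torch.randn(4, 128)
speaker_emb = torch.randn(4, 128)
disentangled = project_out_speaker(emotion_emb, speaker_emb)
# The result has (numerically) zero inner product with the speaker direction.
print((disentangled * F.normalize(speaker_emb, dim=-1)).sum(dim=-1))
```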
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
Zero-shot speaker cloning aims to synthesize speech for any target speaker
unseen during TTS system building, given only a single speech reference of the
speaker at hand. Although more practical in real applications, current
zero-shot methods still produce speech with unsatisfactory naturalness and speaker
similarity. Moreover, endowing the target speaker with arbitrary speaking
styles in the zero-shot setup has not been considered. This is because the
unique challenge of zero-shot speaker and style cloning is to learn the
disentangled speaker and style representations from only short references
representing an arbitrary speaker and an arbitrary style. To address this
challenge, we propose U-Style, which employs Grad-TTS as the backbone,
particularly cascading a speaker-specific encoder and a style-specific encoder
between the text encoder and the diffusion decoder. Thus, leveraging signal
perturbation, U-Style is explicitly decomposed into speaker- and style-specific
modeling parts, achieving better speaker and style disentanglement. To improve
unseen speaker and style modeling ability, these two encoders conduct
multi-level speaker and style modeling by skip-connected U-nets, incorporating
the representation extraction and information reconstruction process. Besides,
to improve the naturalness of synthetic speech, we adopt mean-based instance
normalization and style adaptive layer normalization in these encoders to
perform representation extraction and condition adaptation, respectively.
Experiments show that U-Style significantly surpasses the state-of-the-art
methods in unseen speaker cloning regarding naturalness and speaker similarity.
Notably, U-Style can transfer the style from an unseen source speaker to
another unseen target speaker, achieving flexible combinations of desired
speaker timbre and style in zero-shot voice cloning.
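The style-adaptive layer normalization mentioned above can be sketched in PyTorch as follows. The class name, shapes, and use of a single linear layer to predict the affine parameters are assumptions for illustration, not U-Style's actual module: the key point is that the post-normalization gain and bias are predicted from a style embedding instead of being fixed learned parameters.

```python
# Minimal sketch (assumed interface, not U-Style's code): style-adaptive layer
# normalization, where the scale and shift applied after normalization are
# predicted from a style embedding.
import torch
import torch.nn as nn


class StyleAdaptiveLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)  # predicts gain and bias

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return (1 + gamma).unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)


saln = StyleAdaptiveLayerNorm(hidden_dim=256, style_dim=128)
hidden = torch.randn(2, 50, 256)
style_vec = torch.randn(2, 128)
conditioned = saln(hidden, style_vec)
```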
Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling
Neural text-to-speech (TTS) can provide quality close to natural speech if an
adequate amount of high-quality speech material is available for training.
However, acquiring speech data for TTS training is costly and time-consuming,
especially if the goal is to generate different speaking styles. In this work,
we show that we can transfer speaking style across speakers and improve the
quality of synthetic speech by training a multi-speaker multi-style (MSMS)
model with long-form recordings, in addition to regular TTS recordings. In
particular, we show that 1) multi-speaker modeling improves the overall TTS
quality, 2) the proposed MSMS approach outperforms the pre-training and fine-tuning
approach when utilizing additional multi-speaker data, and 3) the long-form
speaking style is highly rated regardless of the target text domain. Comment: Accepted to 12th ISCA Speech Synthesis Workshop (SSW
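As a rough illustration of how one acoustic model can be conditioned on both speaker and style, the sketch below uses independent embedding tables fused by concatenation; the class, dimensions, and fusion scheme are assumptions for illustration rather than the paper's architecture, but they show why a style learned from long-form recordings of one speaker can be paired with another speaker's identity at synthesis time.

```python
# Hypothetical sketch (not the paper's model): condition a shared acoustic model
# on independent speaker and style embeddings, so any trained style can be
# combined with any trained speaker identity.
import torch
import torch.nn as nn


class MultiSpeakerMultiStyleConditioner(nn.Module):
    def __init__(self, num_speakers: int, num_styles: int, text_dim: int, emb_dim: int = 64):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, emb_dim)
        self.style_emb = nn.Embedding(num_styles, emb_dim)
        self.proj = nn.Linear(text_dim + 2 * emb_dim, text_dim)

    def forward(self, text_hidden, speaker_id, style_id):
        # text_hidden: (batch, time, text_dim); speaker_id, style_id: (batch,)
        cond = torch.cat([self.speaker_emb(speaker_id), self.style_emb(style_id)], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        return self.proj(torch.cat([text_hidden, cond], dim=-1))


conditioner = MultiSpeakerMultiStyleConditioner(num_speakers=8, num_styles=3, text_dim=256)
h = torch.randn(2, 100, 256)
# e.g. apply the "long-form" style (id 1 or 2) to either speaker identity.
out = conditioner(h, torch.tensor([0, 5]), torch.tensor([1, 2]))
```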
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Due to its wide range of
applications and the demonstrated potential of recent works, AIGC developments
have been attracting significant attention recently, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. Moreover, we discuss the challenges and
potential future research directions.