Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations
Recent text-to-speech models can generate natural speech that closely
resembles human speech, but they still have limitations in terms of
expressiveness. Existing emotional speech synthesis models have shown
controllability by interpolating features with scaling parameters in an
emotional latent space. However, continuous emotional intensity is difficult
to control in the latent space these models produce because features such as
emotion and speaker identity are entangled. In this paper, we propose a novel
method to control the continuous intensity of emotions using semi-supervised
learning. The model learns emotions of intermediate intensity from
pseudo-labels generated from phoneme-level sequences of speech information.
The embedding space built by the proposed model satisfies a uniform grid
geometry with an emotional basis. Experimental results show that the proposed
method is superior in controllability and naturalness.

Comment: Accepted by Interspeech 202
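To make the intensity control this abstract describes concrete, below is a
minimal sketch in which intermediate intensities are obtained by scaling a
unit emotion basis vector in a disentangled latent space. The names
(EMOTION_BASIS, emotion_embedding) and the dimensionality are hypothetical
illustrations, not taken from the paper.

```python
# Hypothetical sketch: continuous intensity control by scaling an
# emotion basis vector in a disentangled latent space. Names and
# dimensions are illustrative, not the paper's actual model.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 64

# Assume a disentangled space where each emotion has a unit basis vector.
EMOTION_BASIS = {
    "happy": rng.standard_normal(LATENT_DIM),
    "sad": rng.standard_normal(LATENT_DIM),
}
for name, vec in EMOTION_BASIS.items():
    EMOTION_BASIS[name] = vec / np.linalg.norm(vec)

def emotion_embedding(emotion: str, intensity: float) -> np.ndarray:
    """Move from the neutral origin along the emotion basis.

    Under a uniform grid geometry, equal steps in `intensity`
    correspond to equal steps in emotional strength.
    """
    assert 0.0 <= intensity <= 1.0
    return intensity * EMOTION_BASIS[emotion]

# Intermediate intensities lie on the line between neutral (0.0)
# and the full emotion (1.0), e.g. a mildly happy rendering:
z = emotion_embedding("happy", 0.3)
```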
EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data
Speech emotion conversion is the task of converting the expressed emotion of
a spoken utterance to a target emotion while preserving the lexical content and
speaker identity. While most existing works in speech emotion conversion rely
on acted-out datasets and parallel data samples, in this work we specifically
focus on more challenging in-the-wild scenarios and do not rely on parallel
data. To this end, we propose a diffusion-based generative model for speech
emotion conversion, EmoConv-Diff, which is trained to reconstruct an input
utterance while also conditioning on its emotion. At inference, a
target emotion embedding is employed to convert the emotion of the input
utterance to the given target emotion. As opposed to performing emotion
conversion on categorical representations, we use a continuous arousal
dimension to represent emotions while also achieving intensity control. We
validate the proposed methodology on a large in-the-wild dataset, the
MSP-Podcast v1.10. Our results show that the proposed diffusion model is indeed
capable of synthesizing speech with a controllable target emotion. Crucially,
the proposed approach shows improved performance along the extreme values of
arousal and thereby addresses a common challenge in the speech emotion
conversion literature.

Comment: Submitted to ICASSP 202
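To illustrate the inference-time procedure described above, here is a minimal
sketch of a conditional reverse diffusion loop in which the emotion condition
is set to a target arousal value. The interfaces (denoiser, content_encoder)
and the DDPM-style noise schedule are assumptions for illustration, not the
paper's actual implementation.

```python
# Hypothetical sketch of inference-time emotion conversion with a
# conditional diffusion model: generate the utterance while conditioning
# on source content and a target arousal scalar. All interfaces are
# illustrative, not EmoConv-Diff's actual API.
import torch

def convert_emotion(denoiser, content_encoder, source_mel,
                    target_arousal, steps=50):
    """Run a DDPM-style reverse process conditioned on
    (source content, target arousal in [0, 1])."""
    content = content_encoder(source_mel)      # lexical/speaker information
    cond = torch.tensor([target_arousal])      # continuous arousal condition
    x = torch.randn_like(source_mel)           # start from Gaussian noise
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(x, t, content, cond)    # predicted noise at step t
        # posterior mean of x_{t-1} given the predicted noise
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # converted mel-spectrogram
```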
MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
In addition to conveying the linguistic content from source speech to
converted speech, maintaining the speaking style of source speech also plays an
important role in the voice conversion (VC) task, which is essential in many
scenarios with highly expressive source speech, such as dubbing and data
augmentation. Previous work generally used explicit prosodic features or a
fixed-length style embedding extracted from the source speech to model its
speaking style, which is insufficient for comprehensive style modeling and
target speaker timbre preservation. Inspired by the multi-scale nature of
speaking style in human speech, this paper proposes a multi-scale style
modeling method for the VC task, referred to as MSM-VC. MSM-VC models the
speaking style of source speech at different levels. To effectively convey
the speaking style while preventing timbre leakage from source speech into
the converted speech, the style at each level is modeled by a specific
representation. Specifically, prosodic features, bottleneck features from a
pre-trained ASR model, and features extracted by a self-supervised model are
adopted to model the frame-, local-, and global-level styles, respectively.
In addition, to balance source style modeling against target speaker timbre
preservation, an explicit constraint module consisting of a pre-trained
speech emotion recognition model and a speaker classifier is introduced to
MSM-VC. This constraint module also makes it possible to simulate the style
transfer inference process during training, which improves disentanglement
and alleviates the mismatch between training and inference. Experiments on a
highly expressive speech corpus demonstrate that MSM-VC is superior to
state-of-the-art VC methods at modeling source speech style while maintaining
good speech quality and speaker similarity.

Comment: This work was submitted on April 10, 2022 and accepted on August 29,
202
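To illustrate how style representations from three levels might be combined,
here is a minimal sketch in which frame-level prosody, local bottleneck
features, and a global self-supervised embedding are projected to a shared
width and summed. The module, dimensions, and additive fusion are hypothetical
and are not claimed to match the MSM-VC architecture.

```python
# Hypothetical sketch of multi-scale style conditioning in the spirit
# of MSM-VC: fuse frame-level prosody, local ASR-bottleneck features,
# and a global self-supervised style embedding before the decoder.
# Module names, shapes, and additive fusion are illustrative only.
import torch
import torch.nn as nn

class MultiScaleStyleFusion(nn.Module):
    def __init__(self, d_model=256, prosody_dim=4, bn_dim=256, global_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(prosody_dim, d_model)   # e.g. F0/energy per frame
        self.local_proj = nn.Conv1d(bn_dim, d_model, kernel_size=5, padding=2)
        self.global_proj = nn.Linear(global_dim, d_model)   # utterance-level style

    def forward(self, prosody, bottleneck, global_style):
        # prosody:      (B, T, prosody_dim)  frame-level style
        # bottleneck:   (B, T, bn_dim)       local (phoneme-ish) style
        # global_style: (B, global_dim)      utterance-level style
        frame = self.frame_proj(prosody)
        local = self.local_proj(bottleneck.transpose(1, 2)).transpose(1, 2)
        glob = self.global_proj(global_style).unsqueeze(1)  # broadcast over time
        return frame + local + glob  # (B, T, d_model) style conditioning
```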