Accurate emotion strength assessment for seen and unseen speech based on data-driven deep learning
Emotion classification of speech and assessment of the emotion strength are required in applications such as emotional text-to-speech and voice conversion. The emotion attribute ranking function based on a Support Vector Machine (SVM) was proposed to predict emotion strength for an emotional speech corpus. However, the trained ranking function does not generalize to new domains, which limits the scope of applications, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech. This is achieved by the fusion of emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground-truth scores for both seen and unseen speech. We release the source code at: https://github.com/ttslr/StrengthNet
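For intuition only, a minimal PyTorch sketch of the multi-task layout described above is given below: a shared acoustic encoder feeding a strength regressor and an auxiliary emotion classifier. Layer types, sizes and the pooling step are assumptions made for this sketch and are not taken from the released StrengthNet code.

import torch
import torch.nn as nn

class MultiTaskStrengthSketch(nn.Module):
    # Shared BiLSTM acoustic encoder; one head regresses a scalar emotion
    # strength, the auxiliary head classifies the emotion category.
    def __init__(self, n_mels=80, hidden=128, n_emotions=5):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.strength_head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1))
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, mel):                        # mel: (batch, frames, n_mels)
        frames, _ = self.encoder(mel)              # (batch, frames, 2 * hidden)
        pooled = frames.mean(dim=1)                # average over time
        return self.strength_head(pooled).squeeze(-1), self.emotion_head(pooled)

# Joint objective: strength regression plus auxiliary emotion classification.
model = MultiTaskStrengthSketch()
mel = torch.randn(4, 200, 80)                      # toy batch of mel-spectrograms
strength, emotion_logits = model(mel)
loss = nn.functional.mse_loss(strength, torch.rand(4)) \
     + nn.functional.cross_entropy(emotion_logits, torch.randint(0, 5, (4,)))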
Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion
Emotional voice conversion (EVC) traditionally targets the transformation of
spoken utterances from one emotional state to another, with previous research
mainly focusing on discrete emotion categories. This paper departs from the
norm by introducing a novel perspective: a nuanced rendering of mixed emotions
and enhanced control over emotional expression. To achieve this, we propose a
novel EVC framework, Mixed-EVC, which only leverages discrete emotion training
labels. We construct an attribute vector that encodes the relationships among
these discrete emotions, which is predicted using a ranking-based support
vector machine and then integrated into a sequence-to-sequence (seq2seq) EVC
framework. Mixed-EVC not only learns to characterize the input emotional style
but also quantifies its relevance to other emotions during training. As a
result, users have the ability to assign these attributes to achieve their
desired rendering of mixed emotions. Objective and subjective evaluations
confirm the effectiveness of our approach in terms of mixed emotion synthesis
and control while surpassing traditional baselines in the conversion of
discrete emotions from one to another.
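As a rough illustration of the ranking-based SVM mentioned above, the sketch below learns a relative-attribute scoring direction with a linear SVM on pairwise feature differences; the toy features, pairing scheme and hyperparameters are placeholders, not Mixed-EVC's actual recipe. Scores of this kind, one per emotion, could then be stacked into the attribute vector that conditions the seq2seq model.

import numpy as np
from sklearn.svm import LinearSVC

def train_rank_svm(feats_high, feats_low):
    # Learn a direction w such that w.x ranks utterances with more of the
    # attribute (e.g. clearly Happy speech) above utterances with less of it.
    diffs = (feats_high[:, None, :] - feats_low[None, :, :]).reshape(-1, feats_high.shape[1])
    X = np.vstack([diffs, -diffs])                 # ordered pairs as feature differences
    y = np.concatenate([np.ones(len(diffs)), -np.ones(len(diffs))])
    svm = LinearSVC(C=1.0, fit_intercept=False, max_iter=10000).fit(X, y)
    return svm.coef_.ravel()

rng = np.random.default_rng(0)
happy = rng.normal(1.0, 1.0, size=(50, 16))        # toy acoustic features
neutral = rng.normal(0.0, 1.0, size=(50, 16))
w = train_rank_svm(happy, neutral)
attribute_scores = happy @ w                        # one dimension of the attribute vector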
Controllable Accented Text-to-Speech Synthesis
Accented text-to-speech (TTS) synthesis seeks to generate speech with an
accent (L2) as a variant of the standard version (L1). Accented TTS synthesis
is challenging, as L2 differs from L1 in terms of both phonetic
rendering and prosody pattern. Furthermore, there is no easy solution to the
control of the accent intensity in an utterance. In this work, we propose a
neural TTS architecture that allows us to control the accent and its intensity
during inference. This is achieved through three novel mechanisms: 1) an accent
variance adaptor to model the complex accent variance with three prosody
controlling factors, namely pitch, energy and duration; 2) an accent intensity
modeling strategy to quantify the accent intensity; 3) a consistency constraint
module to encourage the TTS system to render the expected accent intensity at a
fine level. Experiments show that the proposed system attains superior
performance to the baseline models in terms of accent rendering and intensity
control. To our best knowledge, this is the first study of accented TTS
synthesis with explicit intensity control.
Comment: To be submitted for possible journal publication.
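To make the first two mechanisms concrete, here is an illustrative variance-adaptor sketch in PyTorch that predicts accent-specific pitch, energy and (log-)duration deviations and scales them by a user-chosen intensity in [0, 1]; the module names, sizes and the simple scaling rule are assumptions for this sketch, not the paper's implementation.

import torch
import torch.nn as nn

class AccentVarianceAdaptorSketch(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.pitch_pred = nn.Linear(d_model, 1)
        self.energy_pred = nn.Linear(d_model, 1)
        self.duration_pred = nn.Linear(d_model, 1)
        self.pitch_embed = nn.Linear(1, d_model)
        self.energy_embed = nn.Linear(1, d_model)

    def forward(self, h, intensity):               # h: (batch, phones, d_model)
        # Scale the predicted accent-specific deviations by the intensity.
        pitch = intensity * self.pitch_pred(h)
        energy = intensity * self.energy_pred(h)
        log_duration = intensity * self.duration_pred(h)
        h = h + self.pitch_embed(pitch) + self.energy_embed(energy)
        return h, log_duration

adaptor = AccentVarianceAdaptorSketch()
h = torch.randn(2, 40, 256)                        # toy phoneme encodings
out, log_dur = adaptor(h, intensity=0.5)           # 0 = native (L1), 1 = full accent (L2)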
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis
There has been significant progress in emotional Text-To-Speech (TTS)
synthesis technology in recent years. However, existing methods primarily focus
on the synthesis of a limited number of emotion types and have achieved
unsatisfactory performance in intensity control. To address these limitations,
we propose EmoMix, which can generate emotional speech with specified intensity
or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS
model based on a diffusion probabilistic model and a pre-trained speech emotion
recognition (SER) model used to extract emotion embeddings. Mixed emotion
synthesis is achieved by combining the noises predicted by the diffusion model
conditioned on different emotions during a single sampling process at
run-time. We further mix Neutral and a specific primary emotion in varying
degrees to control the intensity. Experimental results validate the
effectiveness of EmoMix for mixed emotion synthesis and intensity control.
Comment: Accepted by the 24th Annual Conference of the International Speech
Communication Association (INTERSPEECH 2023).
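A minimal sketch of the run-time idea follows: during a single DDPM reverse pass, the noise prediction is a weighted blend of predictions conditioned on two emotions; eps_model, the schedule terms and the weight gamma are placeholders rather than EmoMix's exact formulation, and intensity control follows the same pattern with one of the two conditions set to Neutral.

import torch

def mixed_denoise_step(eps_model, x_t, t, cond_a, cond_b, gamma, alpha_t, alpha_bar_t, beta_t):
    # Blend the noise predicted under two emotion conditions, then take a
    # standard DDPM reverse step using the blended prediction.
    eps = gamma * eps_model(x_t, t, cond_a) + (1.0 - gamma) * eps_model(x_t, t, cond_b)
    mean = (x_t - beta_t / (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + beta_t ** 0.5 * noise

# Toy usage with a dummy noise predictor standing in for the diffusion decoder.
eps_model = lambda x, t, cond: torch.zeros_like(x)
x = torch.randn(1, 80, 100)                        # a mel-spectrogram being denoised
x = mixed_denoise_step(eps_model, x, t=10, cond_a="Happy", cond_b="Sad",
                       gamma=0.6, alpha_t=0.98, alpha_bar_t=0.5, beta_t=0.02)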
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis
Recent expressive text-to-speech (TTS) models focus on synthesizing emotional
speech, but some fine-grained styles such as intonation are neglected. In this
paper, we propose QI-TTS which aims to better transfer and control intonation
to further deliver the speaker's questioning intention while transferring
emotion from reference speech. We propose a multi-style extractor to extract
style embedding from two different levels. While the sentence level represents
emotion, the final syllable level represents intonation. For fine-grained
intonation control, we use relative attributes to represent intonation
intensity at the syllable level. Experiments have validated the effectiveness of
QI-TTS for improving intonation expressiveness in emotional speech synthesis.
Comment: Accepted by ICASSP 2023
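For illustration, a minimal two-level style extractor could look like the PyTorch sketch below: an utterance-level pooled embedding for emotion and a pooled embedding of the final syllable for intonation; the pooling choices, sizes and the way the final syllable is located are assumptions for this sketch.

import torch
import torch.nn as nn

class MultiStyleExtractorSketch(nn.Module):
    def __init__(self, n_mels=80, d_style=128):
        super().__init__()
        self.frame_enc = nn.GRU(n_mels, d_style, batch_first=True)
        self.sentence_proj = nn.Linear(d_style, d_style)    # sentence level: emotion
        self.syllable_proj = nn.Linear(d_style, d_style)    # final syllable: intonation

    def forward(self, mel, final_syllable_frames):
        h, _ = self.frame_enc(mel)                          # (batch, frames, d_style)
        emotion = self.sentence_proj(h.mean(dim=1))         # pool the whole utterance
        tail = h[:, -final_syllable_frames:, :].mean(dim=1) # pool the final syllable
        intonation = self.syllable_proj(tail)
        return emotion, intonation

extractor = MultiStyleExtractorSketch()
emotion_emb, intonation_emb = extractor(torch.randn(2, 300, 80), final_syllable_frames=25)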
EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance
Although current neural text-to-speech (TTS) models are able to generate
high-quality speech, intensity controllable emotional TTS is still a
challenging task. Most existing methods need external optimizations for
intensity calculation, leading to suboptimal results or degraded quality. In
this paper, we propose EmoDiff, a diffusion-based TTS model where emotion
intensity can be manipulated by a proposed soft-label guidance technique
derived from classifier guidance. Specifically, instead of being guided with a
one-hot vector for the specified emotion, EmoDiff is guided with a soft label
where the value of the specified emotion and \textit{Neutral} is set to $\alpha$
and $1-\alpha$ respectively. The $\alpha$ here represents the emotion
intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can
precisely control the emotion intensity while maintaining high voice quality.
Moreover, diverse speech with specified emotion intensity can be generated by
sampling in the reverse denoising process.
Comment: Accepted to ICASSP 2023
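A compact sketch of soft-label classifier guidance is given below: the guidance gradient is taken with respect to a weighted log-likelihood that puts weight $\alpha$ on the chosen emotion and $1-\alpha$ on Neutral; the classifier interface, pooling and guidance scale are placeholders for illustration, not EmoDiff's implementation.

import torch
import torch.nn.functional as F

def soft_label_gradient(classifier, x_t, t, emotion_idx, neutral_idx, alpha, scale=1.0):
    # Gradient of the soft-label log-likelihood w.r.t. the noisy sample x_t;
    # during sampling it would be added to the model's mean/score as in
    # standard classifier guidance.
    x_t = x_t.detach().requires_grad_(True)
    log_probs = F.log_softmax(classifier(x_t, t), dim=-1)   # (batch, n_emotions)
    objective = alpha * log_probs[:, emotion_idx] + (1.0 - alpha) * log_probs[:, neutral_idx]
    grad, = torch.autograd.grad(objective.sum(), x_t)
    return scale * grad

# Toy usage: a dummy classifier that pools frames and maps mel bins to 5 emotions.
clf_layer = torch.nn.Linear(80, 5)
classifier = lambda x, t: clf_layer(x.mean(dim=-1))         # x: (batch, n_mels, frames)
guidance = soft_label_gradient(classifier, torch.randn(2, 80, 100), t=10,
                               emotion_idx=3, neutral_idx=0, alpha=0.7)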
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive overview of this exciting field.
Comment: Submitted to the Proceedings of IEEE