Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child Multimodal Interaction Dataset
Automatic speech-based affect recognition of individuals in dyadic
conversation is a challenging task, in part because of its heavy reliance on
manual pre-processing. Traditional approaches frequently require hand-crafted
speech features and segmentation of speaker turns. In this work, we design
end-to-end deep learning methods to recognize each person's affective
expression in an audio stream with two speakers, automatically discovering
features and time regions relevant to the target speaker's affect. We integrate
a local attention mechanism into the end-to-end architecture and compare the
performance of three attention implementations -- one mean pooling and two
weighted pooling methods. Our results show that the proposed weighted-pooling
attention solutions learn to focus on the regions containing the target
speaker's affective information and successfully extract the individual's
valence and arousal intensity. Here we introduce and use a "dyadic affect in
multimodal interaction - parent to child" (DAMI-P2C) dataset collected in a
study of 34 families, where a parent and a child (3-7 years old) engage in
reading storybooks together. In contrast to existing public datasets for affect
recognition, each instance for both speakers in the DAMI-P2C dataset is
annotated for the perceived affect by three labelers. To encourage more
research on the challenging task of multi-speaker affect sensing, we make the
annotated DAMI-P2C dataset publicly available, including acoustic features of
the dyads' raw audio, affect annotations, and a diverse set of developmental,
social, and demographic profiles of each dyad.
Comment: Accepted by the 2020 International Conference on Multimodal Interaction (ICMI'20)
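The contrast between uniform mean pooling and learned weighted pooling is the technical core here. Below is a minimal PyTorch sketch of the two pooling heads, assuming frame-level features from some upstream encoder; the module names and dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanPoolingHead(nn.Module):
    """Baseline: average frame-level features uniformly over time."""
    def forward(self, h):                         # h: (batch, time, dim)
        return h.mean(dim=1)                      # (batch, dim)

class WeightedPoolingHead(nn.Module):
    """Local attention: learn a per-frame relevance score and pool
    frames by their normalized weights, so regions carrying the target
    speaker's affect dominate the utterance-level representation."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # scalar score per frame

    def forward(self, h):                         # h: (batch, time, dim)
        w = F.softmax(self.score(h), dim=1)       # (batch, time, 1)
        return (w * h).sum(dim=1)                 # (batch, dim)

# Hypothetical usage: pooled features feed a regressor for the target
# speaker's valence and arousal intensity.
frames = torch.randn(8, 300, 128)                 # assumed encoder output
pooled = WeightedPoolingHead(128)(frames)         # (8, 128)
```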
An improved StarGAN for emotional voice conversion: enhancing voice quality and data augmentation
Emotional Voice Conversion (EVC) aims to convert the emotional style of a
source speech signal to a target style while preserving its content and speaker
identity information. Previous emotional conversion studies do not disentangle
emotional information from the emotion-independent information that should be
preserved, and thus transform everything monolithically, generating
low-quality audio with linguistic distortions. To address this distortion
problem, we propose a novel StarGAN framework along with a two-stage training
process that separates emotional features from those independent of emotion by
using an autoencoder with two encoders as the generator of the Generative
Adversarial Network (GAN). The proposed model achieves favourable results in
both objective and subjective evaluations, showing that it can effectively
reduce distortion. Furthermore, in data augmentation experiments for
end-to-end speech emotion recognition, the proposed StarGAN model achieves an
increase of 2% in Micro-F1 and 5% in Macro-F1 over the baseline StarGAN model,
indicating that it is more valuable for data augmentation.
Comment: Accepted by Interspeech 202
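The disentangling generator described above can be pictured as an autoencoder with two encoders feeding one decoder. The sketch below is a minimal PyTorch rendering under assumed shapes (mel features, GRU layers, reference-based emotion conditioning); the paper's actual architecture and conditioning scheme may differ.

```python
import torch
import torch.nn as nn

class TwoEncoderGenerator(nn.Module):
    """Autoencoder-style GAN generator: one encoder keeps
    emotion-independent content, the other extracts emotional style
    from a reference utterance; the decoder recombines them."""
    def __init__(self, feat_dim=80, content_dim=64, emo_dim=16):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, content_dim, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, emo_dim, batch_first=True)
        self.decoder = nn.GRU(content_dim + emo_dim, feat_dim, batch_first=True)

    def forward(self, mel, ref_mel):              # (B, T, feat_dim) each
        content, _ = self.content_enc(mel)        # emotion-independent stream
        emo_seq, _ = self.emotion_enc(ref_mel)    # emotional style stream
        emo = emo_seq.mean(dim=1, keepdim=True).expand(-1, mel.size(1), -1)
        out, _ = self.decoder(torch.cat([content, emo], dim=-1))
        return out                                 # converted features

# One plausible reading of the two-stage training: stage 1 trains the
# autoencoder for reconstruction, stage 2 adds the StarGAN adversarial
# and emotion-classification losses.
```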
Enhancing transferability of black-box adversarial attacks via lifelong learning for speech emotion recognition models
Published at INTERSPEECH 2020.
Attention-augmented end-to-end multi-task learning for emotion prediction from speech
Despite the increasing research interest in end-to-end learning systems for
speech emotion recognition, conventional systems either suffer from
overfitting, due in part to limited training data, or do not explicitly
consider the different contributions of automatically learnt representations
to a specific task. In this contribution, we propose a novel end-to-end
framework enhanced by auxiliary task learning and an attention mechanism.
That is, we jointly train an end-to-end network on several different but
related emotion prediction tasks, i.e., arousal, valence, and dominance
prediction, to extract representations shared across tasks that are more
robust than those of traditional systems, in the hope of relieving the
overfitting problem. Meanwhile, an attention layer is implemented on top of
the layers for each task, with the aim of capturing how different segments
contribute to each individual task. To evaluate the effectiveness of the
proposed system, we conducted a set of experiments on the widely used IEMOCAP
database. The empirical results show that the proposed systems significantly
outperform the corresponding baseline systems.
Comment: Accepted by ICASSP 201
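A minimal PyTorch sketch of the described setup: a shared encoder whose frame-level outputs feed three task-specific attention heads for arousal, valence, and dominance. The encoder type, dimensions, and head design are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAttentionHead(nn.Module):
    """Per-task attention over shared frame features, then a regressor,
    so each task weights segments by its own contribution pattern."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, 1)

    def forward(self, h):                         # h: (B, T, dim)
        w = F.softmax(self.score(h), dim=1)       # task-specific weights
        return self.out((w * h).sum(dim=1))       # (B, 1)

class MultiTaskEmotionNet(nn.Module):
    def __init__(self, feat_dim=40, dim=128):
        super().__init__()
        self.shared = nn.GRU(feat_dim, dim, batch_first=True)  # shared encoder
        self.heads = nn.ModuleDict(
            {t: TaskAttentionHead(dim) for t in ("arousal", "valence", "dominance")}
        )

    def forward(self, x):                         # x: (B, T, feat_dim)
        h, _ = self.shared(x)
        return {t: head(h) for t, head in self.heads.items()}

# Training would sum the three task losses so the shared layers learn
# representations useful across all three emotion dimensions.
```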
Emotion Embeddings -- Learning Stable and Homogeneous Abstractions from Heterogeneous Affective Datasets
Human emotion is expressed in many communication modalities and media formats,
and so its computational study is equally diversified, into natural language
processing, audio signal analysis, computer vision, etc. Similarly, the large
variety of representation formats used in previous research to describe
emotions (polarity scales, basic emotion categories, dimensional approaches,
appraisal theory, etc.) has led to an ever-proliferating diversity of
datasets, predictive models, and software tools for emotion analysis. Because
of these two distinct types of heterogeneity, at the expressional and
representational level, there is a dire need to unify previous work on
increasingly diverging data and label types. This article presents such a
unifying computational model. We propose a training procedure that learns a
shared latent representation for emotions, so-called emotion embeddings,
independent of different natural languages, communication modalities, media or
representation label formats, and even disparate model architectures.
Experiments on a wide range of heterogeneous affective datasets indicate that
this approach yields the desired interoperability for the sake of reusability,
interpretability, and flexibility, without penalizing prediction quality. Code
and data are archived under https://doi.org/10.5281/zenodo.7405327 .
Comment: 18 pages, 6 figures
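The unifying model can be read as a plug-in interface: modality- or dataset-specific encoders map into one shared embedding space, and label-format-specific heads decode from it. Below is a minimal PyTorch sketch of that interface; every name and dimension is hypothetical, and the authors' actual code is archived at the DOI above.

```python
import torch
import torch.nn as nn

EMB_DIM = 64  # shared emotion-embedding size (an assumed value)

class TextEncoder(nn.Module):
    """Maps one input type (here: pre-extracted text features) into the
    shared emotion space; audio or vision encoders plug in the same way."""
    def __init__(self, in_dim=300):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                  nn.Linear(128, EMB_DIM))

    def forward(self, x):
        return self.proj(x)

class PolarityHead(nn.Module):
    """Decodes the shared embedding into one label format (a polarity
    scale); basic-emotion or valence-arousal heads attach identically."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(EMB_DIM, 1)

    def forward(self, z):
        return torch.tanh(self.out(z))            # polarity in [-1, 1]

# Any (encoder, head) pair composes through the same latent space, which
# is what makes datasets, label formats, and models interoperable.
z = TextEncoder()(torch.randn(4, 300))            # (4, EMB_DIM)
polarity = PolarityHead()(z)                      # (4, 1)
```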