Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child Multimodal Interaction Dataset
Automatic speech-based affect recognition of individuals in dyadic
conversation is a challenging task, in part because of its heavy reliance on
manual pre-processing. Traditional approaches frequently require hand-crafted
speech features and segmentation of speaker turns. In this work, we design
end-to-end deep learning methods to recognize each person's affective
expression in an audio stream with two speakers, automatically discovering
features and time regions relevant to the target speaker's affect. We integrate
a local attention mechanism into the end-to-end architecture and compare the
performance of three attention implementations -- one mean pooling and two
weighted pooling methods. Our results show that the proposed weighted-pooling
attention solutions learn to focus on the regions containing the target
speaker's affective information and successfully extract the individual's
valence and arousal intensity. Here we introduce and use a "dyadic affect in
multimodal interaction - parent to child" (DAMI-P2C) dataset collected in a
study of 34 families, where a parent and a child (3-7 years old) engage in
reading storybooks together. In contrast to existing public datasets for affect
recognition, each instance for both speakers in the DAMI-P2C dataset is
annotated for the perceived affect by three labelers. To encourage more
research on the challenging task of multi-speaker affect sensing, we make the
annotated DAMI-P2C dataset publicly available, including acoustic features of
the dyads' raw audio, affect annotations, and a diverse set of developmental,
social, and demographic profiles of each dyad.
Comment: Accepted by the 2020 International Conference on Multimodal Interaction (ICMI'20)
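The contrast between uniform mean pooling and learned weighted pooling is the technical core here. Below is a minimal PyTorch sketch of the two pooling heads, assuming frame-level features from some upstream encoder; the module names and dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanPoolingHead(nn.Module):
    """Baseline: average frame-level features uniformly over time."""
    def forward(self, h):                         # h: (batch, time, dim)
        return h.mean(dim=1)                      # (batch, dim)

class WeightedPoolingHead(nn.Module):
    """Local attention: learn a per-frame relevance score and pool
    frames by their normalized weights, so regions carrying the target
    speaker's affect dominate the utterance-level representation."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # scalar score per frame

    def forward(self, h):                         # h: (batch, time, dim)
        w = F.softmax(self.score(h), dim=1)       # (batch, time, 1)
        return (w * h).sum(dim=1)                 # (batch, dim)

# Hypothetical usage: pooled features feed a regressor for the target
# speaker's valence and arousal intensity.
frames = torch.randn(8, 300, 128)                 # assumed encoder output
pooled = WeightedPoolingHead(128)(frames)         # (8, 128)
```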
An improved StarGAN for emotional voice conversion: enhancing voice quality and data augmentation
Emotional Voice Conversion (EVC) aims to convert the emotional style of a
source speech signal to a target style while preserving its content and speaker
identity information. Previous emotional conversion studies do not disentangle
emotional information from the emotion-independent information that should be
preserved, and thus transform everything monolithically, generating
low-quality audio with linguistic distortions. To address this distortion
problem, we propose a novel StarGAN framework along with a two-stage training
process that separates emotional features from those independent of emotion by
using an autoencoder with two encoders as the generator of the Generative
Adversarial Network (GAN). The proposed model achieves favourable results in
both objective and subjective evaluations, showing that it can effectively
reduce distortion. Furthermore, in data augmentation experiments for
end-to-end speech emotion recognition, the proposed StarGAN model achieves an
increase of 2% in Micro-F1 and 5% in Macro-F1 over the baseline StarGAN model,
indicating that it is more valuable for data augmentation.
Comment: Accepted by Interspeech 202
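The disentangling generator described above can be pictured as an autoencoder with two encoders feeding one decoder. The sketch below is a minimal PyTorch rendering under assumed shapes (mel features, GRU layers, reference-based emotion conditioning); the paper's actual architecture and conditioning scheme may differ.

```python
import torch
import torch.nn as nn

class TwoEncoderGenerator(nn.Module):
    """Autoencoder-style GAN generator: one encoder keeps
    emotion-independent content, the other extracts emotional style
    from a reference utterance; the decoder recombines them."""
    def __init__(self, feat_dim=80, content_dim=64, emo_dim=16):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, content_dim, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, emo_dim, batch_first=True)
        self.decoder = nn.GRU(content_dim + emo_dim, feat_dim, batch_first=True)

    def forward(self, mel, ref_mel):              # (B, T, feat_dim) each
        content, _ = self.content_enc(mel)        # emotion-independent stream
        emo_seq, _ = self.emotion_enc(ref_mel)    # emotional style stream
        emo = emo_seq.mean(dim=1, keepdim=True).expand(-1, mel.size(1), -1)
        out, _ = self.decoder(torch.cat([content, emo], dim=-1))
        return out                                 # converted features

# One plausible reading of the two-stage training: stage 1 trains the
# autoencoder for reconstruction, stage 2 adds the StarGAN adversarial
# and emotion-classification losses.
```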
Enhancing transferability of black-box adversarial attacks via lifelong learning for speech emotion recognition models
Published at INTERSPEECH 2020.
Attention-augmented end-to-end multi-task learning for emotion prediction from speech
Despite the increasing research interest in end-to-end learning systems for
speech emotion recognition, conventional systems either suffer from
overfitting, due in part to limited training data, or do not explicitly
consider the different contributions of automatically learnt representations
to a specific task. In this contribution, we propose a novel end-to-end
framework enhanced by auxiliary task learning and an attention mechanism.
That is, we jointly train an end-to-end network on several different but
related emotion prediction tasks, i.e., arousal, valence, and dominance
prediction, to extract representations shared across tasks that are more
robust than those of traditional systems, in the hope of relieving the
overfitting problem. Meanwhile, an attention layer is implemented on top of
the layers for each task, with the aim of capturing how different segments
contribute to each individual task. To evaluate the effectiveness of the
proposed system, we conducted a set of experiments on the widely used IEMOCAP
database. The empirical results show that the proposed systems significantly
outperform the corresponding baseline systems.
Comment: Accepted by ICASSP 201
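A minimal PyTorch sketch of the described setup: a shared encoder whose frame-level outputs feed three task-specific attention heads for arousal, valence, and dominance. The encoder type, dimensions, and head design are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAttentionHead(nn.Module):
    """Per-task attention over shared frame features, then a regressor,
    so each task weights segments by its own contribution pattern."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, 1)

    def forward(self, h):                         # h: (B, T, dim)
        w = F.softmax(self.score(h), dim=1)       # task-specific weights
        return self.out((w * h).sum(dim=1))       # (B, 1)

class MultiTaskEmotionNet(nn.Module):
    def __init__(self, feat_dim=40, dim=128):
        super().__init__()
        self.shared = nn.GRU(feat_dim, dim, batch_first=True)  # shared encoder
        self.heads = nn.ModuleDict(
            {t: TaskAttentionHead(dim) for t in ("arousal", "valence", "dominance")}
        )

    def forward(self, x):                         # x: (B, T, feat_dim)
        h, _ = self.shared(x)
        return {t: head(h) for t, head in self.heads.items()}

# Training would sum the three task losses so the shared layers learn
# representations useful across all three emotion dimensions.
```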
Emotion Embeddings -- Learning Stable and Homogeneous Abstractions from Heterogeneous Affective Datasets
Human emotion is expressed in many communication modalities and media formats,
and so its computational study is equally diversified, into natural language
processing, audio signal analysis, computer vision, etc. Similarly, the large
variety of representation formats used in previous research to describe
emotions (polarity scales, basic emotion categories, dimensional approaches,
appraisal theory, etc.) has led to an ever-proliferating diversity of
datasets, predictive models, and software tools for emotion analysis. Because
of these two distinct types of heterogeneity, at the expressional and
representational level, there is a dire need to unify previous work on
increasingly diverging data and label types. This article presents such a
unifying computational model. We propose a training procedure that learns a
shared latent representation for emotions, so-called emotion embeddings,
independent of different natural languages, communication modalities, media or
representation label formats, and even disparate model architectures.
Experiments on a wide range of heterogeneous affective datasets indicate that
this approach yields the desired interoperability for the sake of reusability,
interpretability, and flexibility, without penalizing prediction quality. Code
and data are archived under https://doi.org/10.5281/zenodo.7405327 .
Comment: 18 pages, 6 figures
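The unifying model can be read as a plug-in interface: modality- or dataset-specific encoders map into one shared embedding space, and label-format-specific heads decode from it. Below is a minimal PyTorch sketch of that interface; every name and dimension is hypothetical, and the authors' actual code is archived at the DOI above.

```python
import torch
import torch.nn as nn

EMB_DIM = 64  # shared emotion-embedding size (an assumed value)

class TextEncoder(nn.Module):
    """Maps one input type (here: pre-extracted text features) into the
    shared emotion space; audio or vision encoders plug in the same way."""
    def __init__(self, in_dim=300):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                  nn.Linear(128, EMB_DIM))

    def forward(self, x):
        return self.proj(x)

class PolarityHead(nn.Module):
    """Decodes the shared embedding into one label format (a polarity
    scale); basic-emotion or valence-arousal heads attach identically."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(EMB_DIM, 1)

    def forward(self, z):
        return torch.tanh(self.out(z))            # polarity in [-1, 1]

# Any (encoder, head) pair composes through the same latent space, which
# is what makes datasets, label formats, and models interoperable.
z = TextEncoder()(torch.randn(4, 300))            # (4, EMB_DIM)
polarity = PolarityHead()(z)                      # (4, 1)
```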