Speech emotion recognition using semantic information
Speech emotion recognition is a crucial problem manifesting in a multitude of
applications such as human computer interaction and education. Although several
advancements have been made in recent years, especially with the advent of
Deep Neural Networks (DNN), most of the studies in the literature fail to
consider the semantic information in the speech signal. In this paper, we
propose a novel framework that can capture both the semantic and the
paralinguistic information in the signal. In particular, our framework is
comprised of a semantic feature extractor, that captures the semantic
information, and a paralinguistic feature extractor, that captures the
paralinguistic information. Both semantic and paralinguistic features are then
combined into a unified representation using a novel attention mechanism. The
unified feature vector is passed through an LSTM to capture the temporal
dynamics in the signal, before the final prediction. To validate the
effectiveness of our framework, we use the popular SEWA dataset of the AVEC
challenge series and compare with the three winning papers. Our model provides
state-of-the-art results in the valence and liking dimensions.
Comment: ICASSP 202
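The fusion step described above (attention over a semantic and a paralinguistic stream, with the fused sequence then fed to an LSTM) could be sketched roughly as follows. The abstract does not specify the actual attention mechanism or feature dimensions, so the scoring vector `w`, all shapes, and the random stand-in features here are purely illustrative assumptions; the LSTM stage is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(sem, para, w):
    """Per-frame attention over the two feature streams: score each
    stream with an (illustrative) vector w, softmax the two scores,
    and take the weighted sum as the unified representation."""
    scores = np.stack([sem @ w, para @ w], axis=-1)  # (T, 2) stream scores
    alpha = softmax(scores, axis=-1)                 # weights sum to 1 per frame
    return alpha[:, :1] * sem + alpha[:, 1:] * para  # (T, d) fused sequence

rng = np.random.default_rng(0)
T, d = 5, 8                          # illustrative sequence length / feature dim
sem = rng.normal(size=(T, d))        # stand-in semantic features
para = rng.normal(size=(T, d))       # stand-in paralinguistic features
fused = attention_fuse(sem, para, rng.normal(size=d))
print(fused.shape)                   # an LSTM would consume this (T, d) sequence
```

The attention weights act as a learned, per-frame gate deciding how much each modality contributes to the unified vector at that time step.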
The ACII 2022 Affective Vocal Bursts Workshop & Competition: understanding a critically understudied modality of emotional expression
The ACII Affective Vocal Bursts Workshop & Competition is focused on understanding multiple affective dimensions of vocal bursts: laughs, gasps, cries, screams, and many other non-linguistic vocalizations central to the expression of emotion and to human communication more generally. This year's competition comprises four tracks using a large-scale and in-the-wild dataset of 59,299 vocalizations from 1,702 speakers. The first, the A-VB-High task, requires competition participants to perform a multi-label regression on a novel model for emotion, utilizing ten classes of richly annotated emotional expression intensities, including Awe, Fear, and Surprise. The second, the A-VB-Two task, utilizes the more conventional two-dimensional model of emotion: arousal and valence. The third, the A-VB-Culture task, requires participants to explore the cultural aspects of the dataset, training native-country-dependent models. Finally, for the fourth task, A-VB-Type, participants should recognize the type of vocal burst (e.g., laughter, cry, grunt) as an 8-class classification. This paper describes the four tracks and baseline systems, which use state-of-the-art machine learning methods. The baseline performance for each track is obtained using an end-to-end deep learning model and is as follows: for A-VB-High, a mean (over the 10 dimensions) Concordance Correlation Coefficient (CCC) of 0.5687; for A-VB-Two, a mean (over the 2 dimensions) CCC of 0.5084; for A-VB-Culture, a mean CCC over the four cultures of 0.4401; and for A-VB-Type, a baseline Unweighted Average Recall (UAR) over the 8 classes of 0.4172.
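The Concordance Correlation Coefficient reported for these baselines has a standard closed form (Lin's CCC); a minimal NumPy version, independent of any challenge code and using made-up example values:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Lin's Concordance Correlation Coefficient: covariance scaled by a
    penalty for any mean and variance mismatch between the two signals."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

gold = np.array([0.1, 0.4, 0.35, 0.8])
print(ccc(gold, gold))        # identical signals -> 1.0
print(ccc(gold, gold + 0.5))  # same shape but shifted mean -> below 1.0
```

Unlike plain Pearson correlation, CCC is reduced by any constant offset or scale difference between prediction and gold standard, which is why it is favoured for continuous emotion annotation.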
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive picture of this exciting field.
Comment: Submitted to the Proceedings of IEE
The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and Stress
The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to
multimodal sentiment and emotion recognition. For this year's challenge, we
feature three datasets: (i) the Passau Spontaneous Football Coach Humor
(Passau-SFCH) dataset that contains audio-visual recordings of German football
coaches, labelled for the presence of humour; (ii) the Hume-Reaction dataset in
which reactions of individuals to emotional stimuli have been annotated with
respect to seven emotional expression intensities, and (iii) the Ulm-Trier
Social Stress Test (Ulm-TSST) dataset, comprising audio-visual data labelled
with continuous emotion values (arousal and valence) of people in stressful
dispositions. Using the introduced datasets, MuSe 2022 addresses three
contemporary affective computing problems: in the Humor Detection Sub-Challenge
(MuSe-Humor), spontaneous humour has to be recognised; in the Emotional
Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained `in-the-wild'
emotions have to be predicted; and in the Emotional Stress Sub-Challenge
(MuSe-Stress), a continuous prediction of stressed emotion values is featured.
The challenge is designed to attract different research communities,
encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the
communities of audio-visual emotion recognition, health informatics, and
symbolic sentiment analysis. This baseline paper describes the datasets as well
as the feature sets extracted from them. A recurrent neural network with LSTM
cells is used to set competitive baseline results on the test partitions for
each sub-challenge. We report an Area Under the Curve (AUC) of .8480 for
MuSe-Humor; a mean (over the 7 classes) Pearson's Correlation Coefficient of
.2801 for MuSe-Reaction; and Concordance Correlation Coefficients (CCC) of
.4931 and .4761 for valence and arousal in MuSe-Stress, respectively.
Comment: Preliminary baseline paper for the 3rd Multimodal Sentiment Analysis
Challenge (MuSe) 2022, a full-day workshop at ACM Multimedia 202
MuSe 2020 challenge and workshop: multimodal sentiment analysis, emotion-target engagement and trustworthiness detection in real-life media: emotional car reviews in-the-wild
ABSTRACT
Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection, by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based) and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CAR, the first-of-its-kind in-the-wild database utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.
Funding from the EPSRC Grant No. 2021037, and the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B). We thank the sponsors of the Challenge, BMW Group and audEERING.
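The MuSe-Topic metric is stated explicitly in the abstract as a weighted combination, 0.34 * UAR + 0.66 * F1. A trivial sketch of that arithmetic (the example UAR and F1 values are made up for illustration):

```python
def muse_topic_score(uar, f1):
    """Combined MuSe-Topic metric as stated in the abstract:
    0.34 * UAR + 0.66 * F1 (both given in percent)."""
    return 0.34 * uar + 0.66 * f1

# hypothetical UAR / F1 values, purely for illustration
print(muse_topic_score(70.0, 80.0))  # ~76.6
```

Weighting F1 more heavily than UAR rewards overall per-class precision/recall balance while still penalising models that neglect rare classes.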
Deep affect prediction in-the-wild: aff-wild database and challenge, deep architectures, and beyond
Automatic understanding of human affect using visual signals is of great importance in everyday human–machine interactions. Appraising human emotional states, behaviors, and reactions displayed in real-world settings can be accomplished using latent continuous dimensions (e.g., the circumplex model of affect). Valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the activation of the emotion) constitute popular and effective representations for affect. Nevertheless, the majority of datasets collected thus far, although containing naturalistic emotional states, have been captured in highly controlled recording conditions. In this paper, we introduce the Aff-Wild benchmark for training and evaluating affect recognition algorithms. We also report on the results of the First Affect-in-the-wild Challenge (Aff-Wild Challenge), which was recently organized in conjunction with CVPR 2017 on the Aff-Wild database and was the first ever challenge on the estimation of valence and arousal in-the-wild. Furthermore, we design and extensively train an end-to-end deep neural architecture which performs prediction of continuous emotion dimensions based on visual cues. The proposed deep learning architecture, AffWildNet, includes convolutional and recurrent neural network layers, exploiting the invariant properties of convolutional features while also modeling the temporal dynamics that arise in human behavior via the recurrent layers. AffWildNet produced state-of-the-art results on the Aff-Wild Challenge. We then exploit the Aff-Wild database for learning features, which can be used as priors for achieving the best performance for both dimensional and categorical emotion recognition on the RECOLA, AFEW-VA, and EmotiW 2017 datasets, compared to all other methods designed for the same goal. The database and emotion recognition models are available at http://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge