A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech
Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research, and Human-Computer Interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets have been used with machine-learning (ML) algorithms for discrete emotion classification, but there is no consensus on which low-level descriptors and classifiers are optimal. Therefore, we aimed to compare the performance of ML algorithms across several different feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG), and Multinomial Logistic Regression (MLR), with 10-fold cross-validation and four openSMILE feature sets (i.e., IS-09, emobase, GeMAPS, and eGeMAPS). Results indicated that SMO, MLP, and LOG performed better (reaching accuracies of 87.85%, 84.00%, and 83.74%, respectively) than RF, DT, MLR, and KNN (with minimum accuracies of 73.46%, 53.08%, 70.65%, and 58.69%, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention, and HCI.
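To make the comparison concrete, the following Python sketch shows how such a benchmark could be set up with scikit-learn. It is only an illustration of the general procedure, not the study's actual pipeline (which used WEKA-style implementations such as SMO and J48), and it assumes a feature matrix X of precomputed openSMILE descriptors (e.g., the emobase set) and discrete emotion labels y.

# Illustrative sketch only: scikit-learn analogues stand in for the WEKA classifiers
# named in the abstract; X and y are assumed to hold precomputed openSMILE features
# and EmoDB emotion labels.
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def compare_classifiers(X, y):
    """Run 10-fold cross-validation for several classifiers and report mean accuracy."""
    models = {
        "SMO (linear SVC analogue)": SVC(kernel="linear"),
        "MLP": MLPClassifier(max_iter=1000),
        "LOG": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=100),
        "DT (J48 analogue)": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in models.items():
        pipe = make_pipeline(StandardScaler(), model)
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")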
Improving Unimodal Inference with Multimodal Transformers
This paper proposes an approach for improving the performance of unimodal models with multimodal training. Our approach involves a multi-branch architecture that incorporates unimodal models with a multimodal transformer-based branch. By co-training these branches, the stronger multimodal branch can transfer its knowledge to the weaker unimodal branches through a multi-task objective, thereby improving the performance of the resulting unimodal models. We evaluate our approach on dynamic hand gesture recognition based on RGB and depth, audiovisual emotion recognition based on speech and facial video, and audio-video-text sentiment analysis. Our approach outperforms the conventionally trained unimodal counterparts. Interestingly, we also observe that optimization of the unimodal branches improves the multimodal branch, compared to a similar multimodal model trained from scratch.
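A minimal Python/PyTorch sketch of the co-training idea follows; it is not the authors' exact objective. It assumes each branch outputs class logits for the same label and adds a hypothetical distillation term that pulls the unimodal predictions toward the (detached) multimodal ones.

# Sketch under stated assumptions: every branch is trained on the shared label
# (the multi-task part), and a soft-target KL term transfers knowledge from the
# multimodal branch to each unimodal branch.
import torch
import torch.nn.functional as F

def co_training_loss(unimodal_logits, multimodal_logits, labels,
                     task_weight=1.0, distill_weight=0.5, temperature=2.0):
    """unimodal_logits: list of [batch, classes]; multimodal_logits: [batch, classes]."""
    loss = F.cross_entropy(multimodal_logits, labels)
    soft_target = F.softmax(multimodal_logits.detach() / temperature, dim=-1)
    for logits in unimodal_logits:
        loss = loss + task_weight * F.cross_entropy(logits, labels)
        loss = loss + distill_weight * F.kl_div(
            F.log_softmax(logits / temperature, dim=-1),
            soft_target, reduction="batchmean") * temperature ** 2
    return loss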
Learning Speech Emotion Representations in the Quaternion Domain
The modeling of human emotion expression in speech signals is an important yet challenging task. The high resource demand of speech emotion recognition models and the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel with a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits the optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimensions to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in test accuracy for all datasets, while drastically reducing the models' resource demands. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach.
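A structural sketch of the described architecture, under simplifying assumptions, is given below in Python/PyTorch: a real-valued encoder maps a mono spectrogram to a four-axis embedding, one classification head reads each latent axis (valence, arousal, dominance, overall emotion), and a decoder reconstructs the input. The paper's decoder is quaternion-valued; a standard transposed-convolution decoder stands in for it here, and the layer sizes and class counts are placeholders.

# Sketch only: a plain convolutional decoder replaces the quaternion-valued decoder
# used in RH-emo, and head output sizes are placeholders rather than the paper's setup.
import torch
import torch.nn as nn

class RHEmoSketch(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 4, 3, stride=2, padding=1),            # four latent axes
        )
        # one head per emotion-related characteristic, each reading a single latent axis
        self.heads = nn.ModuleList([nn.LazyLinear(n_classes) for _ in range(4)])
        self.decoder = nn.Sequential(                             # quaternion-valued in the paper
            nn.ConvTranspose2d(4, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, spec):                                      # spec: [batch, 1, freq, time]
        z = self.encoder(spec)                                     # [batch, 4, f', t']
        preds = [head(z[:, i].flatten(1)) for i, head in enumerate(self.heads)]
        recon = self.decoder(z)
        return z, preds, recon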
An Empirical Analysis of the Role of Amplifiers, Downtoners, and Negations in Emotion Classification in Microblogs
The effect of amplifiers, downtoners, and negations has been studied in general and particularly in the context of sentiment analysis. However, there is only limited work that aims at transferring these results and methods to discrete classes of emotions, e.g., joy, anger, fear, sadness, surprise, and disgust. For instance, it is not straightforward to interpret which emotion the phrase "not happy" expresses. With this paper, we aim at obtaining a better understanding of such modifiers in the context of emotion-bearing words and their impact on document-level emotion classification, namely of microposts on Twitter. We select an appropriate scope detection method for modifiers of emotion words, incorporate it into a document-level emotion classification model as an additional bag of words, and show that this approach improves the performance of emotion classification. In addition, we build a term weighting approach based on the different modifiers into a lexical model for the analysis of the semantics of modifiers and their impact on emotion meaning. We show that amplifiers separate emotions expressed with an emotion-bearing word more clearly from other secondary connotations, while downtoners have the opposite effect. In addition, we discuss the meaning of negations of emotion-bearing words. For instance, we show empirically that "not happy" is closer to sadness than to anger and that fear-expressing words in the scope of downtoners often express surprise.
Comment: Accepted for publication at The 5th IEEE International Conference on Data Science and Advanced Analytics (DSAA), https://dsaa2018.isi.it
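The following Python sketch illustrates the "additional bag of words" idea: emotion words that fall inside a modifier's scope are emitted as extra, prefixed tokens alongside the ordinary unigrams. The paper evaluates dedicated scope detection methods; a fixed window over the next few tokens and the toy lexicons below are assumptions made only to keep the example short.

# Sketch only: fixed-window scope and toy lexicons stand in for the scope detection
# method and lexical resources chosen in the paper.
NEGATIONS = {"not", "never", "no"}
AMPLIFIERS = {"very", "really", "extremely"}
DOWNTONERS = {"slightly", "somewhat", "barely"}
EMOTION_WORDS = {"happy", "sad", "angry", "afraid", "surprised"}

def augmented_bag_of_words(tokens, scope=3):
    """Return unigrams plus modifier-scoped emotion features, e.g. 'NEG_happy'."""
    features = list(tokens)
    for i, tok in enumerate(tokens):
        prefix = ("NEG" if tok in NEGATIONS else
                  "AMP" if tok in AMPLIFIERS else
                  "DOWN" if tok in DOWNTONERS else None)
        if prefix is None:
            continue
        for emo in tokens[i + 1:i + 1 + scope]:
            if emo in EMOTION_WORDS:
                features.append(f"{prefix}_{emo}")
    return features

# Example: augmented_bag_of_words("i am not happy about this".split())
# returns the original tokens plus the extra feature 'NEG_happy'.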
- …