Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers
Group emotion recognition in the wild is a challenging problem, due to the
unstructured environments in which everyday life pictures are taken. Some of
the obstacles for an effective classification are occlusions, variable lighting
conditions, and image quality. In this work we present a solution based on a
novel combination of deep neural networks and Bayesian classifiers. The neural
network works on a bottom-up approach, analyzing emotions expressed by isolated
faces. The Bayesian classifier estimates a global emotion integrating top-down
features obtained through a scene descriptor. In order to validate the system
we tested the framework on the dataset released for the Emotion Recognition in
the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test
set, significantly outperforming the 53.62% competition baseline. Comment:
accepted by the Fifth Emotion Recognition in the Wild (EmotiW) Challenge 2017
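The bottom-up/top-down combination described above can be sketched as a naive-Bayes-style product of per-face emotion probabilities with a scene-level prior. This is a toy illustration only; the function name, class count, and numbers are hypothetical, not the authors' implementation:

```python
import numpy as np

def fuse_group_emotion(face_probs, scene_prior):
    """Combine per-face emotion probabilities (bottom-up) with a
    scene-level prior (top-down) via a naive-Bayes-style product.
    face_probs: (n_faces, n_classes); scene_prior: (n_classes,)."""
    # Work in log space for numerical stability, then renormalize
    log_post = np.log(scene_prior) + np.log(face_probs).sum(axis=0)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Toy example: two detected faces, three group-emotion classes
faces = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
prior = np.array([0.3, 0.4, 0.3])   # hypothetical scene-descriptor prior
print(fuse_group_emotion(faces, prior))
```

In this sketch the scene prior can tilt the decision when the face-level evidence is weak, mirroring the top-down role the abstract assigns to the Bayesian classifier.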
Semi-Supervised Speech Emotion Recognition with Ladder Networks
Speech emotion recognition (SER) systems find applications in various fields
such as healthcare, education, and security and defense. A major drawback of
these systems is their lack of generalization across different conditions. This
problem can be solved by training models on large amounts of labeled data from
the target domain, which is expensive and time-consuming. Another approach is
to increase the generalization of the models. An effective way to achieve this
goal is by regularizing the models through multitask learning (MTL), where
auxiliary tasks are learned along with the primary task. These methods often
require labeled data for the auxiliary tasks (gender, speaker identity, age,
or other emotional descriptors), which is expensive to collect. This study
proposes ladder networks for emotion recognition, which utilize an
unsupervised auxiliary task. The primary task is
a regression problem to predict emotional attributes. The auxiliary task is the
reconstruction of intermediate feature representations using a denoising
autoencoder. This auxiliary task does not require labels so it is possible to
train the framework in a semi-supervised fashion with abundant unlabeled data
from the target domain. This study shows that the proposed approach creates a
powerful framework for SER, outperforming fully supervised single-task
learning (STL) and MTL baselines. The approach is
implemented with several acoustic features, showing that ladder networks
generalize significantly better in cross-corpus settings. Compared to the STL
baselines, the proposed approach achieves relative gains in concordance
correlation coefficient (CCC) between 3.0% and 3.5% for within corpus
evaluations, and between 16.1% and 74.1% for cross corpus evaluations,
highlighting the power of the architecture.
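The gains above are reported in concordance correlation coefficient (CCC), which rewards both correlation and agreement in scale and location. A minimal sketch of the standard metric (Lin's CCC; not the paper's exact implementation):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient:
    2*cov / (var_t + var_p + (mean_t - mean_p)^2)."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(ccc(y, y))        # perfect agreement -> 1.0
print(ccc(y, y + 1.0))  # same shape but shifted -> penalized below 1.0
```

Unlike Pearson correlation, CCC penalizes the constant shift in the second call, which is why it is a common choice for continuous emotional-attribute prediction.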
M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues
We present M3ER, a learning-based method for emotion recognition from
multiple input modalities. Our approach combines cues from multiple
co-occurring modalities (such as face, text, and speech) and is more robust
than other methods to sensor noise in any individual modality. M3ER uses a
novel, data-driven multiplicative fusion method to combine the modalities,
which learns to emphasize the more reliable cues and suppress the others
on a per-sample basis. By introducing a check step that uses Canonical
Correlation Analysis to differentiate between ineffective and effective
modalities, M3ER is robust to sensor noise. M3ER also generates proxy features
in place of the ineffectual modalities. We demonstrate the efficiency of our
network through experimentation on two benchmark datasets, IEMOCAP and
CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on
CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.
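The per-sample emphasize/suppress idea can be illustrated with a toy reliability-weighted product over modality probability distributions. The function name, weights, and numbers here are hypothetical stand-ins, not M3ER's learned fusion:

```python
import numpy as np

def multiplicative_fusion(modality_probs, reliability):
    """Toy multiplicative fusion: a reliability-weighted geometric mean
    of per-modality class probabilities. Low-reliability modalities
    (e.g. a noisy sensor) contribute less to the fused distribution."""
    log_p = np.stack([w * np.log(p)
                      for p, w in zip(modality_probs, reliability)])
    fused = np.exp(log_p.sum(axis=0))
    return fused / fused.sum()

face   = np.array([0.6, 0.3, 0.1])
text   = np.array([0.5, 0.4, 0.1])
speech = np.array([0.2, 0.2, 0.6])   # hypothetical noisy channel
# Down-weight the unreliable speech modality on this sample
print(multiplicative_fusion([face, text, speech], [1.0, 1.0, 0.2]))
```

In M3ER the down-weighting is learned per sample rather than hand-set, and flagged-ineffective modalities are replaced by generated proxy features; this sketch only shows the multiplicative-combination principle.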