Group-Level Emotion Recognition Using a Unimodal Privacy-Safe Non-Individual Approach
This article presents our unimodal, privacy-safe, and non-individual proposal
for the audio-video group emotion recognition subtask at the Emotion
Recognition in the Wild (EmotiW) Challenge 2020. This sub-challenge aims to
classify in-the-wild videos into three categories: Positive, Neutral, and
Negative. Recent deep learning models have shown tremendous advances in
analyzing interactions between people, predicting human behavior, and
performing affective evaluation. Nonetheless, their performance comes from
individual-based analysis: scores from individual detections are summed or
averaged, which inevitably raises privacy issues. In this research, we
investigated a frugal approach towards a model able to capture the global mood
from the whole image, without using face or pose detection or any
individual-based feature as input. The proposed methodology mixes
state-of-the-art and dedicated synthetic corpora as training sources. With an
in-depth exploration of neural network architectures for group-level emotion
recognition, we built a VGG-based model achieving 59.13% accuracy on the VGAF
test set (eleventh place in the challenge). Given that the analysis is
unimodal, based only on global features, and that the performance is evaluated
on a real-world dataset, these results are promising and let us envision
extending this model to multimodality for classroom ambiance evaluation, our
final target application.
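The abstract does not give the exact architecture, so the following is only a
minimal sketch of the idea it describes: a VGG backbone applied to the whole
frame, mapping directly to the three VGAF classes with no face or pose
detection. The head layer sizes, pooling choice, and input resolution are
assumptions for illustration, not the authors' configuration.

```python
# Sketch of a whole-image, privacy-safe group-mood classifier.
# Assumptions: VGG-16 backbone, 224x224 inputs, a small custom head.
import torch
import torch.nn as nn
from torchvision import models


class GlobalMoodClassifier(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        vgg = models.vgg16()            # VGG-16 backbone (weights untrained here)
        self.features = vgg.features    # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 512),   # hypothetical head size
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),   # Positive / Neutral / Negative
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a batch of whole frames, shape (B, 3, 224, 224);
        # no per-person crops or face detections are ever used.
        return self.classifier(self.pool(self.features(x)))


if __name__ == "__main__":
    model = GlobalMoodClassifier()
    frames = torch.randn(2, 3, 224, 224)   # two dummy frames
    print(model(frames).shape)             # torch.Size([2, 3])
```

The point of the design is that only global scene features ever enter the
network, so no individual-level representation exists to leak.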
EmoNets: Multimodal deep learning approaches for emotion recognition in video
The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to
assign one of seven emotions to short video clips extracted from
Hollywood-style movies. The videos depict acted-out emotions under realistic
conditions with a large degree of variation in attributes such as pose and
illumination, making it worthwhile to explore approaches which consider
combinations of features from multiple modalities for label assignment. In
this paper we present our approach to learning several specialist models using
deep learning techniques, each focusing on one modality. Among these are a
convolutional neural network, focusing on capturing visual information in
detected faces; a deep belief net, focusing on the representation of the audio
stream; a K-Means-based "bag-of-mouths" model, which extracts visual features
around the mouth region; and a relational autoencoder, which addresses
spatio-temporal aspects of videos. We explore multiple methods for combining
the cues from these modalities into one common classifier, which achieves
considerably greater accuracy than our strongest single-modality classifier.
Our method was the winning submission in the 2013 EmotiW challenge and
achieved a test set accuracy of 47.67% on the 2014 dataset.
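The paper explores several fusion methods; the sketch below shows only one
simple instance of the general idea, a weighted average of the per-modality
class-probability vectors. The modality names, weights, and random placeholder
scores are illustrative assumptions, not the EmoNets pipeline itself.

```python
# Sketch of late fusion: average each specialist model's class-probability
# matrix with a per-modality weight. Weights would normally be tuned on a
# validation set; here they are assumed values.
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def fuse_scores(scores_by_modality: dict, weights: dict) -> np.ndarray:
    """Weighted average of (num_clips, 7) probability matrices."""
    total = sum(weights[m] for m in scores_by_modality)
    fused = sum(weights[m] * p for m, p in scores_by_modality.items())
    return fused / total


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def fake_probs(n: int = 4) -> np.ndarray:
        # Placeholder for a specialist model's softmax outputs.
        p = rng.random((n, len(EMOTIONS)))
        return p / p.sum(axis=1, keepdims=True)

    scores = {
        "faces_cnn": fake_probs(),       # CNN on detected faces
        "audio_dbn": fake_probs(),       # deep belief net on audio
        "bag_of_mouths": fake_probs(),   # K-Means mouth-region model
    }
    weights = {"faces_cnn": 0.5, "audio_dbn": 0.3, "bag_of_mouths": 0.2}
    fused = fuse_scores(scores, weights)
    print([EMOTIONS[i] for i in fused.argmax(axis=1)])
```

A common design rationale for this kind of late fusion is that each modality's
classifier can be trained and validated independently, and the combination
step then exploits their complementary failure modes.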