2,370 research outputs found
Group-level Emotion Recognition using Transfer Learning from Face Identification
In this paper, we describe our algorithmic approach, which was used for
submissions in the fifth Emotion Recognition in the Wild (EmotiW 2017)
group-level emotion recognition sub-challenge. We extracted feature vectors of
detected faces using the Convolutional Neural Network trained for face
identification task, rather than traditional pre-training on emotion
recognition problems. In the final pipeline an ensemble of Random Forest
classifiers was learned to predict emotion score using available training set.
In case when the faces have not been detected, one member of our ensemble
extracts features from the whole image. During our experimental study, the
proposed approach showed the lowest error rate when compared to other explored
techniques. In particular, we achieved 75.4% accuracy on the validation data,
which is 20% higher than the handcrafted feature-based baseline. The source
code using Keras framework is publicly available.Comment: 5 pages, 3 figures, accepted for publication at ICMI17 (EmotiW Grand
Challenge
Spatio-Temporal Facial Expression Recognition Using Convolutional Neural Networks and Conditional Random Fields
Automated Facial Expression Recognition (FER) has been a challenging task for
decades. Many of the existing works use hand-crafted features such as LBP, HOG,
LPQ, and Histogram of Optical Flow (HOF) combined with classifiers such as
Support Vector Machines for expression recognition. These methods often require
rigorous hyperparameter tuning to achieve good results. Recently Deep Neural
Networks (DNN) have shown to outperform traditional methods in visual object
recognition. In this paper, we propose a two-part network consisting of a
DNN-based architecture followed by a Conditional Random Field (CRF) module for
facial expression recognition in videos. The first part captures the spatial
relation within facial images using convolutional layers followed by three
Inception-ResNet modules and two fully-connected layers. To capture the
temporal relation between the image frames, we use linear chain CRF in the
second part of our network. We evaluate our proposed network on three publicly
available databases, viz. CK+, MMI, and FERA. Experiments are performed in
subject-independent and cross-database manners. Our experimental results show
that cascading the deep network architecture with the CRF module considerably
increases the recognition of facial expressions in videos and in particular it
outperforms the state-of-the-art methods in the cross-database experiments and
yields comparable results in the subject-independent experiments.Comment: To appear in 12th IEEE Conference on Automatic Face and Gesture
Recognition Worksho
Temporal contextual descriptors and applications to emotion analysis.
The current trends in technology suggest that the next generation of services and devices allows smarter customization and automatic context recognition. Computers learn the behavior of the users and can offer them customized services depending on the context, location, and preferences. One of the most important challenges in human-machine interaction is the proper understanding of human emotions by machines and automated systems. In the recent years, the progress made in machine learning and pattern recognition led to the development of algorithms that are able to learn the detection and identification of human emotions from experience. These algorithms use different modalities such as image, speech, and physiological signals to analyze and learn human emotions. In many settings, the vocal information might be more available than other modalities due to widespread of voice sensors in phones, cars, and computer systems in general. In emotion analysis from speech, an audio utterance is represented by an ordered (in time) sequence of features or a multivariate time series. Typically, the sequence is further mapped into a global descriptor representative of the entire utterance/sequence. This descriptor is used for classification and analysis. In classic approaches, statistics are computed over the entire sequence and used as a global descriptor. This often results in the loss of temporal ordering from the original sequence. Emotion is a succession of acoustic events. By discarding the temporal ordering of these events in the mapping, the classic approaches cannot detect acoustic patterns that lead to a certain emotion. In this dissertation, we propose a novel feature mapping framework. The proposed framework maps temporally ordered sequence of acoustic features into data-driven global descriptors that integrate the temporal information from the original sequence. The framework contains three mapping algorithms. These algorithms integrate the temporal information implicitly and explicitly in the descriptor\u27s representation. In the rst algorithm, the Temporal Averaging Algorithm, we average the data temporally using leaky integrators to produce a global descriptor that implicitly integrates the temporal information from the original sequence. In order to integrate the discrimination between classes in the mapping, we propose the Temporal Response Averaging Algorithm which combines the temporal averaging step of the previous algorithm and unsupervised learning to produce data driven temporal contextual descriptors. In the third algorithm, we use the topology preserving property of the Self-Organizing Maps and the continuous nature of speech to map a temporal sequence into an ordered trajectory representing the behavior over time of the input utterance on a 2-D map of emotions. The temporal information is integrated explicitly in the descriptor which makes it easier to monitor emotions in long speeches. The proposed mapping framework maps speech data of different length to the same equivalent representation which alleviates the problem of dealing with variable length temporal sequences. This is advantageous in real time setting where the size of the analysis window can be variable. Using the proposed feature mapping framework, we build a novel data-driven speech emotion detection and recognition system that indexes speech databases to facilitate the classification and retrieval of emotions. We test the proposed system using two datasets. The first corpus is acted. We showed that the proposed mapping framework outperforms the classic approaches while providing descriptors that are suitable for the analysis and visualization of humans’ emotions in speech data. The second corpus is an authentic dataset. In this dissertation, we evaluate the performances of our system using a collection of debates. For that purpose, we propose a novel debate collection that is one of the first initiatives in the literature. We show that the proposed system is able to learn human emotions from debates
Audio Deepfake Detection: A Survey
Audio deepfake detection is an emerging active topic. A growing number of
literatures have aimed to study deepfake detection algorithms and achieved
effective performance, the problem of which is far from being solved. Although
there are some review literatures, there has been no comprehensive survey that
provides researchers with a systematic overview of these developments with a
unified evaluation. Accordingly, in this survey paper, we first highlight the
key differences across various types of deepfake audio, then outline and
analyse competitions, datasets, features, classifications, and evaluation of
state-of-the-art approaches. For each aspect, the basic techniques, advanced
developments and major challenges are discussed. In addition, we perform a
unified comparison of representative features and classifiers on ASVspoof 2021,
ADD 2023 and In-the-Wild datasets for audio deepfake detection, respectively.
The survey shows that future research should address the lack of large scale
datasets in the wild, poor generalization of existing detection methods to
unknown fake attacks, as well as interpretability of detection results
Real-time Convolutional Neural Networks for Emotion and Gender Classification
In this paper we propose an implement a general convolutional neural network
(CNN) building framework for designing real-time CNNs. We validate our models
by creating a real-time vision system which accomplishes the tasks of face
detection, gender classification and emotion classification simultaneously in
one blended step using our proposed CNN architecture. After presenting the
details of the training procedure setup we proceed to evaluate on standard
benchmark sets. We report accuracies of 96% in the IMDB gender dataset and 66%
in the FER-2013 emotion dataset. Along with this we also introduced the very
recent real-time enabled guided back-propagation visualization technique.
Guided back-propagation uncovers the dynamics of the weight changes and
evaluates the learned features. We argue that the careful implementation of
modern CNN architectures, the use of the current regularization methods and the
visualization of previously hidden features are necessary in order to reduce
the gap between slow performances and real-time architectures. Our system has
been validated by its deployment on a Care-O-bot 3 robot used during
RoboCup@Home competitions. All our code, demos and pre-trained architectures
have been released under an open-source license in our public repository.Comment: Submitted to ICRA 201
- …