Multi-velocity neural networks for gesture recognition in videos
We present a new action recognition deep neural network that adaptively
learns the best action velocities in addition to the classification. While deep
neural networks have reached maturity for image understanding tasks, we are
still exploring network topologies and features to handle the richer
environment of video clips. Here, we tackle the problem of multiple velocities
in action recognition and provide state-of-the-art results for gesture
recognition on known and newly collected datasets. We further provide the
training steps for our semi-supervised network, suited to learning from huge
unlabeled datasets with only a fraction of labeled examples.
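The abstract does not spell out the mechanism, so the following is a hedged PyTorch sketch of one plausible reading of "multiple velocities": the clip is subsampled at several temporal strides, a shared encoder processes each stream, and learned softmax weights select the useful velocities. All class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiVelocityNet(nn.Module):
    """Hypothetical multi-velocity wrapper; a sketch, not the paper's model."""
    def __init__(self, encoder: nn.Module, feat_dim: int, n_classes: int,
                 strides=(1, 2, 4)):
        super().__init__()
        self.encoder = encoder            # shared frame encoder: (N,C,H,W) -> (N, feat_dim)
        self.strides = strides
        self.velocity_logits = nn.Parameter(torch.zeros(len(strides)))
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, clip):              # clip: (B, T, C, H, W)
        feats = []
        for s in self.strides:
            sub = clip[:, ::s]                   # subsample frames at stride s
            f = self.encoder(sub.flatten(0, 1))  # encode every kept frame
            f = f.view(clip.size(0), -1, f.size(-1)).mean(dim=1)  # pool over time
            feats.append(f)
        w = torch.softmax(self.velocity_logits, dim=0)  # learned velocity weights
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.classifier(fused)
```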
Deep video gesture recognition using illumination invariants
In this paper we present architectures based on deep neural nets for gesture
recognition in videos that are invariant to local scaling. We amalgamate
autoencoder and predictor architectures using an adaptive weighting scheme to
cope with a reduced-size labeled dataset, while enriching our models from
enormous unlabeled sets. We further improve robustness to lighting conditions
by introducing a new adaptive filter based on temporal local scale
normalization. We provide superior results over known methods, including
recently reported approaches based on neural nets.
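As a hedged illustration of a temporal local normalization filter in the spirit of the one the abstract mentions, the sketch below rescales each pixel by the mean and variance of a sliding temporal window, suppressing slow illumination changes. The window size and epsilon are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def temporal_local_normalize(clip: torch.Tensor, window: int = 9,
                             eps: float = 1e-5) -> torch.Tensor:
    """clip: (B, C, T, H, W); window should be odd. Illustrative only."""
    c, pad = clip.size(1), window // 2
    # depthwise averaging kernel acting over the temporal axis only
    kernel = torch.ones(c, 1, window, 1, 1,
                        device=clip.device, dtype=clip.dtype) / window
    mean = F.conv3d(clip, kernel, padding=(pad, 0, 0), groups=c)
    var = F.conv3d((clip - mean) ** 2, kernel, padding=(pad, 0, 0), groups=c)
    return (clip - mean) / torch.sqrt(var + eps)
```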
Going Deeper in Facial Expression Recognition using Deep Neural Networks
Automated Facial Expression Recognition (FER) has remained a challenging and
interesting problem. Despite efforts made in developing various methods for
FER, existing approaches traditionally lack generalizability when applied to
unseen images or those captured in the wild. Most existing approaches are based
on engineered features (e.g., HOG, LBPH, and Gabor) where the classifier's
hyperparameters are tuned to give the best recognition accuracies on a single
database, or a small collection of similar databases; the results degrade when
these methods are applied to novel data. This paper proposes a deep neural
network architecture to address the FER problem across multiple well-known
standard face datasets. Specifically, our network consists of two convolutional
layers, each followed by max pooling, and then four Inception layers. The
network is a single-component architecture that takes registered facial images
as input and classifies them into one of the six basic expressions or the
neutral expression. We conducted comprehensive experiments on seven publicly
available facial expression databases, viz. MultiPIE, MMI, CK+, DISFA, FERA,
SFEW, and FER2013. The results of the proposed architecture are comparable to
or better than the state-of-the-art methods, and better than traditional
convolutional neural networks in both accuracy and training time.
Comment: To appear in IEEE Winter Conference on Applications of Computer Vision (WACV), 2016 (accepted in first-round submission).
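The topology is described concretely enough to sketch: two conv + max-pool stages followed by four Inception-style blocks and a seven-way head (six basic expressions plus neutral). The PyTorch rendering below uses assumed filter widths and grayscale input; it is not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified GoogLeNet-style block; widths are illustrative."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 4
        self.b1 = nn.Conv2d(c_in, c, 1)
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c, 1), nn.Conv2d(c, c, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c, 1), nn.Conv2d(c, c, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, 1, 1), nn.Conv2d(c_in, c, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

class FERNet(nn.Module):
    def __init__(self, n_classes=7):     # six basic expressions + neutral
        super().__init__()
        self.stem = nn.Sequential(       # two conv layers, each with max pooling
            nn.Conv2d(1, 64, 7, stride=2, padding=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 96, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.inception = nn.Sequential(  # four Inception-style blocks
            *[InceptionBlock(96 if i == 0 else 128, 128) for i in range(4)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, n_classes))

    def forward(self, x):                # x: registered face, (B, 1, H, W)
        return self.head(self.inception(self.stem(x)))
```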
Non-Volume Preserving-based Feature Fusion Approach to Group-Level Expression Recognition on Crowd Videos
Group-level emotion recognition (ER) is a growing research area, as the demand
for assessing crowds of all sizes is rising in both the security arena and
social media. This work extends earlier ER investigations, which focused on
group-level ER either on single images or within a video, by fully
investigating group-level expression recognition on crowd videos. In this
paper, we propose an effective deep feature-level fusion mechanism to model
the spatial-temporal information in crowd videos. In our approach, the fusion
is performed in the deep feature domain by a generative probabilistic model,
Non-Volume Preserving Fusion (NVPF), which models spatial relationships among
features. Furthermore, we extend the proposed spatial NVPF approach to a
spatial-temporal NVPF (TNVPF) approach that learns the temporal information
between frames. To demonstrate the robustness and effectiveness of each
component in the proposed approach, three experiments were conducted: (i)
evaluation on the AffectNet database to benchmark the proposed EmoNet for
recognizing facial expressions; (ii) evaluation on EmotiW2018 to benchmark the
proposed deep feature-level fusion mechanism NVPF; and (iii) evaluation of the
proposed TNVPF on a new Group-level Emotion on Crowd Videos (GECV) dataset
composed of 627 videos collected from publicly available sources. The GECV
dataset is a collection of videos containing crowds of people; each video is
labeled with emotion categories at three levels: individual faces, groups of
people, and the entire video frame.
Comment: Under review at Pattern Recognition.
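The name "Non-Volume Preserving" suggests RealNVP-style flow layers; below is a heavily hedged sketch of a single affine coupling layer over concatenated per-face deep features, intended only to illustrate that family of transforms, not the paper's actual NVPF formulation.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer; dimensions are illustrative."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * (dim - self.half)))

    def forward(self, z):                     # z: fused feature vector (B, dim)
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)             # keep the scaling well-conditioned
        y2 = z2 * torch.exp(log_s) + t        # non-volume-preserving transform
        log_det = log_s.sum(dim=-1)           # Jacobian log-determinant
        return torch.cat([z1, y2], dim=-1), log_det
```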
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition
Automatic emotion recognition (AER) is a challenging task due to the abstract
nature of emotion and its multiple forms of expression. Although there is no
consensus on a definition, human emotional states can usually be perceived
through the auditory and visual systems. Inspired by this cognitive process in
human beings, it is natural to utilize audio and visual information
simultaneously in AER. However, most traditional fusion approaches only build a
linear paradigm, such as feature concatenation and multi-system fusion, which
hardly captures the complex associations between audio and video. In this
paper, we introduce factorized bilinear pooling (FBP) to deeply integrate the
features of audio and video. Specifically, the features are selected through an
embedded attention mechanism in each modality to obtain the emotion-related
regions. The whole pipeline can be completed within a single neural network.
Validated on the AFEW database of the audio-video sub-challenge in EmotiW2018,
the proposed approach achieves an accuracy of 62.48%, outperforming the
state-of-the-art result.
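Factorized bilinear pooling is a known technique (as in multi-modal factorized bilinear pooling), so a minimal sketch is possible: the full bilinear interaction between an audio feature and a video feature is factorized into low-rank projections, an element-wise product, sum pooling, and signed-sqrt/L2 normalization. Dimensions below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBP(nn.Module):
    def __init__(self, dim_a, dim_v, factor_dim=1024, out_dim=256):
        super().__init__()
        assert factor_dim % out_dim == 0
        self.k = factor_dim // out_dim              # sum-pooling window
        self.proj_a = nn.Linear(dim_a, factor_dim)  # low-rank projection U
        self.proj_v = nn.Linear(dim_v, factor_dim)  # low-rank projection V

    def forward(self, x_audio, x_video):            # (B, dim_a), (B, dim_v)
        joint = self.proj_a(x_audio) * self.proj_v(x_video)    # low-rank bilinear
        joint = joint.view(joint.size(0), -1, self.k).sum(-1)  # sum pooling
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)  # signed sqrt
        return F.normalize(joint, dim=-1)           # L2 normalization
```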
Real-time Facial Expression Recognition "In The Wild" by Disentangling 3D Expression from Identity
Human emotion analysis has been the focus of many studies, especially in the
field of Affective Computing, and is important for many applications, e.g.,
human-computer intelligent interaction, stress analysis, interactive games, and
animation. Solutions for automatic emotion analysis have also benefited from
the development of deep learning approaches and the availability of vast
amounts of visual facial data on the internet. This paper proposes a novel
method for human emotion recognition from a single RGB image. We construct a
large-scale dataset of facial videos (FaceVid), rich in facial dynamics,
identities, expressions, appearance, and 3D pose variations. We use this
dataset to train a deep Convolutional Neural Network for estimating expression
parameters of a 3D Morphable Model and combine it with an effective back-end
emotion classifier. Our proposed framework runs at 50 frames per second and is
capable of robustly estimating parameters of 3D expression variation and
accurately recognizing facial expressions from in-the-wild images. We present
an extensive experimental evaluation showing that the proposed method
outperforms the compared techniques in estimating the 3D expression parameters
and achieves state-of-the-art performance in recognising the basic emotions
from facial images, as well as recognising stress from facial videos.
Comment: To be published in the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020).
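A hedged sketch of the two-stage idea follows: a CNN regresses 3DMM expression coefficients from an RGB face crop, and a lightweight classifier maps those largely identity-free coefficients to emotion labels. The ResNet-18 backbone and the coefficient count (64) are assumptions, not the paper's choices.

```python
import torch.nn as nn
from torchvision.models import resnet18

class ExpressionPipeline(nn.Module):
    """Illustrative two-stage pipeline; not the paper's implementation."""
    def __init__(self, n_exp_params=64, n_emotions=7):
        super().__init__()
        self.backbone = resnet18(weights=None)
        # replace the classification head with a 3DMM-coefficient regressor
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, n_exp_params)
        self.emotion_head = nn.Sequential(
            nn.Linear(n_exp_params, 128), nn.ReLU(), nn.Linear(128, n_emotions))

    def forward(self, img):                 # img: (B, 3, H, W) registered face
        exp_params = self.backbone(img)     # 3DMM expression coefficients
        return self.emotion_head(exp_params), exp_params
```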
Deep generative-contrastive networks for facial expression recognition
As the expressive depth of an emotional face differs across individuals and
expressions, recognizing an expression from a single facial image at a given
moment is difficult. A relative expression of a query face compared to a
reference face might alleviate this difficulty. In this paper, we propose to
utilize a contrastive representation that embeds a distinctive expressive
factor for discriminative purposes. The contrastive representation is
calculated at the embedding layer of deep networks by comparing a given (query)
image with a reference image. We utilize a generative reference image that is
estimated from the given image. Consequently, we deploy deep neural networks
that combine a generative model, a contrastive model, and a discriminative
model, trained end to end. In the proposed networks, we disentangle the facial
expressive factor in two steps: learning a generator network and learning a
contrastive encoder network. We conducted extensive experiments on publicly
available facial expression databases (CK+, MMI, Oulu-CASIA, and in-the-wild
databases) that have been widely adopted in the recent literature. The proposed
method outperforms the known state-of-the-art methods in terms of recognition
accuracy.
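The contrastive-representation idea can be sketched compactly: a generator produces a reference image from the query (e.g., an expression-neutralized face), and the expressive factor is taken as the difference between the two embeddings. The generator and encoder below are placeholders for whatever networks the paper trains.

```python
import torch.nn as nn

class GenerativeContrastive(nn.Module):
    """Sketch under assumed generator/encoder modules; names are illustrative."""
    def __init__(self, generator, encoder, feat_dim, n_classes):
        super().__init__()
        self.generator = generator          # query image -> reference image
        self.encoder = encoder              # image -> embedding (B, feat_dim)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, query_img):
        ref_img = self.generator(query_img)
        # contrastive representation: difference at the embedding layer
        contrastive = self.encoder(query_img) - self.encoder(ref_img)
        return self.classifier(contrastive)
```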
Multi-task, multi-label and multi-domain learning with residual convolutional networks for emotion recognition
Automated emotion recognition in the wild from facial images remains a
challenging problem. Although recent advances in Deep Learning have brought a
significant breakthrough in this area, strong changes in pose, orientation, and
point of view severely harm current approaches. In addition, the acquisition of
labeled datasets is costly, and current state-of-the-art deep learning
algorithms cannot model all the aforementioned difficulties. In this paper, we
propose to apply a multi-task learning loss function to share a common feature
representation with other related tasks. In particular, we show that emotion
recognition benefits from jointly learning a model with a detector of facial
Action Units (collective muscle movements). The proposed loss function
addresses the problem of learning multiple tasks with heterogeneously labeled
data, improving on previous multi-task approaches. We validate the proposal on
two datasets acquired in uncontrolled environments, and on an application to
predicting compound facial emotion expressions.
Comment: Preprint submitted to IJC
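A common way to realize a multi-task loss over heterogeneously labeled data, and one plausible reading of the abstract, is per-sample masking: each sample contributes the emotion term and/or the Action Unit term only when that annotation exists. The sketch below assumes a categorical emotion head and a multi-label AU head; the weighting lambda is an illustrative hyperparameter.

```python
import torch
import torch.nn.functional as F

def multitask_loss(emo_logits, emo_labels, emo_mask,
                   au_logits, au_labels, au_mask, lam=1.0):
    """emo_mask/au_mask: (B,) float, 1 where that annotation exists, else 0.
    Labels of masked-out samples must still be valid placeholders (e.g., 0)."""
    emo = F.cross_entropy(emo_logits, emo_labels, reduction="none")
    emo = (emo * emo_mask).sum() / emo_mask.sum().clamp(min=1)
    au = F.binary_cross_entropy_with_logits(au_logits, au_labels,
                                            reduction="none").mean(dim=1)
    au = (au * au_mask).sum() / au_mask.sum().clamp(min=1)
    return emo + lam * au
```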
Probabilistic Attribute Tree in Convolutional Neural Networks for Facial Expression Recognition
In this paper, we propose a novel Probabilistic Attribute Tree-CNN (PAT-CNN)
to explicitly deal with the large intra-class variations caused by
identity-related attributes, e.g., age, race, and gender. Specifically, a novel
PAT module with an associated PAT loss learns features in a hierarchical tree
structure organized according to attributes, where the final features are less
affected by the attributes. Expression-related features are then extracted from
leaf nodes. Samples are probabilistically assigned to tree nodes at different
levels, so that expression-related features can be learned from all samples,
weighted by their probabilities. We further propose a semi-supervised strategy
to learn the PAT-CNN from limited attribute-annotated samples, making the best
use of available data. Experimental results on five facial expression datasets
demonstrate that the proposed PAT-CNN outperforms the baseline models by
explicitly modeling attributes. More impressively, the PAT-CNN using a single
model achieves the best performance for faces in the wild on the SFEW dataset,
compared with state-of-the-art methods that use an ensemble of hundreds of
CNNs.
Comment: 10 pages.
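A hedged sketch of the probabilistic-assignment idea: attribute-specific branches produce expression features, and each sample's feature is the branch outputs weighted by a predicted attribute posterior, so every sample contributes to every branch in proportion to its probability. This flattens the paper's hierarchical tree to a single level for brevity; all names are illustrative.

```python
import torch
import torch.nn as nn

class SoftAttributeBranches(nn.Module):
    """Single-level stand-in for the PAT idea; not the paper's architecture."""
    def __init__(self, feat_dim, n_branches, n_classes):
        super().__init__()
        self.attr_head = nn.Linear(feat_dim, n_branches)   # attribute posterior
        self.branches = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(n_branches)])
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, feat):                               # feat: (B, feat_dim)
        p = torch.softmax(self.attr_head(feat), dim=-1)    # (B, n_branches)
        outs = torch.stack([b(feat) for b in self.branches], dim=1)
        fused = (p.unsqueeze(-1) * outs).sum(dim=1)        # probability-weighted
        return self.classifier(fused)
```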