Mode Variational LSTM Robust to Unseen Modes of Variation: Application to Facial Expression Recognition
Spatio-temporal feature encoding is essential for capturing the dynamics in
video sequences. Recurrent neural networks, particularly long short-term memory
(LSTM) units, have been popular as an efficient tool for encoding
spatio-temporal features in sequences. In this work, we investigate the effect
of mode variations on the encoded spatio-temporal features using LSTMs. We show
that the LSTM retains information related to the mode variation in the
sequence, which is irrelevant to the task at hand (e.g., classifying facial
expressions). In particular, the LSTM forget mechanism is not robust to mode
variations and preserves information that could negatively affect the encoded
spatio-temporal features. We propose the mode variational LSTM to encode
spatio-temporal features robust to unseen modes of variation. The mode
variational LSTM modifies the original LSTM structure by adding an additional
cell state that focuses on encoding the mode variation in the input sequence.
To efficiently regulate what features should be stored in the additional cell
state, additional gating functionality is also introduced. The effectiveness of
the proposed mode variational LSTM is verified using the facial expression
recognition task. Comparative experiments on publicly available datasets
verified that the proposed mode variational LSTM outperforms existing methods.
Moreover, a new dynamic facial expression dataset covering several modes of
variation, including pose and illumination, was collected to comprehensively
evaluate the proposed mode variational LSTM.
Experimental results confirmed that the proposed mode variational LSTM encodes
spatio-temporal features robust to unseen modes of variation. Comment: Accepted in AAAI-1
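
As a rough illustration of the architectural idea, the following is a minimal PyTorch sketch of an LSTM cell with an added cell state and gate. The exact gate layout and all names here (ModeVariationalLSTMCell, the mode gate r) are assumptions for illustration, not the paper's equations.

    import torch
    import torch.nn as nn

    class ModeVariationalLSTMCell(nn.Module):
        # Hypothetical sketch: an LSTM cell with a second cell state m for mode
        # variation, regulated by an extra gate r (not the paper's exact equations).
        def __init__(self, input_size, hidden_size):
            super().__init__()
            # i, f, o, g as in a standard LSTM, plus a mode gate r
            self.gates = nn.Linear(input_size + hidden_size, 5 * hidden_size)

        def forward(self, x, state):
            h, c, m = state                      # hidden, task cell, mode cell
            i, f, o, r, g = self.gates(torch.cat([x, h], dim=-1)).chunk(5, dim=-1)
            i, f, o, r = map(torch.sigmoid, (i, f, o, r))
            g = torch.tanh(g)
            m = r * m + (1 - r) * g              # mode gate regulates the mode cell
            c = f * c + i * (g - torch.tanh(m))  # task cell sees mode-suppressed input
            h = o * torch.tanh(c)
            return h, (h, c, m)

In use, h, c and m would start as zero tensors of shape (batch, hidden_size), with the cell applied frame by frame and the expression classified from the final h.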
Machine Analysis of Facial Expressions
No abstract
Automatic analysis of facial actions: a survey
As one of the most comprehensive and objective ways to describe facial expressions, the Facial Action Coding System (FACS) has recently received significant attention. Over the past 30 years, extensive research has been conducted by psychologists and neuroscientists on various aspects of facial expression analysis using FACS. Automating FACS coding would make this research faster and more widely applicable, opening up new avenues to understanding how we communicate through facial expressions. Such an automated process can also potentially increase the reliability, precision and temporal resolution of coding. This paper provides a comprehensive survey of research into machine analysis of facial actions. We systematically review all components of such systems: pre-processing, feature extraction and machine coding of facial actions. In addition, the existing FACS-coded facial expression databases are summarised. Finally, challenges that have to be addressed to make automatic facial action analysis applicable in real-life situations are extensively discussed. There are two underlying motivations for us to write this survey paper: the first is to provide an up-to-date review of the existing literature, and the second is to offer some insights into the future of machine recognition of facial actions: what are the challenges and opportunities that researchers in the field face?
Discriminatively Trained Latent Ordinal Model for Video Classification
We study the problem of video classification for facial analysis and human
action recognition. We propose a novel weakly supervised learning method that
models the video as a sequence of automatically mined, discriminative
sub-events (e.g., onset and offset phases for "smile", running and jumping for
"highjump"). The proposed model is inspired by the recent works on Multiple
Instance Learning and latent SVM/HCRF -- it extends such frameworks to
approximately model the ordinal aspect of the videos. We obtain consistent
improvements over relevant competitive baselines on four challenging and
publicly available video-based facial analysis datasets for prediction of
expression, clinical pain and intent in dyadic conversations and on three
challenging human action datasets. We also validate the method with qualitative
results and show that they largely support the intuitions behind the method. Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text
overlap with arXiv:1604.0150
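
The ordinal aspect lends itself to a small dynamic program over latent sub-event positions. The following numpy sketch is a hedged illustration, assuming per-frame descriptors and K linear sub-event templates; the function ordinal_score and the non-decreasing-position constraint are assumptions, not the paper's exact model.

    import numpy as np

    def ordinal_score(features, templates):
        # features: (T, D) per-frame descriptors; templates: (K, D) sub-event filters.
        # Each template fires on exactly one frame, and firing positions must be
        # non-decreasing: sub-event k occurs no earlier than sub-event k-1.
        responses = features @ templates.T             # (T, K) per-frame responses
        best = np.maximum.accumulate(responses[:, 0])  # best score for template 0 up to frame t
        for k in range(1, templates.shape[0]):
            best = np.maximum.accumulate(best + responses[:, k])
        return best[-1]                                # best ordered assignment overall

With K = 2 this amounts to onset/offset-style scoring for "smile"; training would alternate between inferring the latent positions and updating the templates, in the spirit of latent SVM.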
Every Smile is Unique: Landmark-Guided Diverse Smile Generation
Each smile is unique: a single person smiles in different ways (e.g.,
closing/opening the eyes or mouth). Given one input image of a neutral face,
can we generate multiple smile videos with distinctive characteristics? To
tackle this one-to-many video generation problem, we propose a novel deep
learning architecture named Conditional Multi-Mode Network (CMM-Net). To better
encode the dynamics of facial expressions, CMM-Net explicitly exploits facial
landmarks for generating smile sequences. Specifically, a variational
auto-encoder is used to learn a facial landmark embedding. This single
embedding is then exploited by a conditional recurrent network which generates
a landmark embedding sequence conditioned on a specific expression (e.g.,
spontaneous smile). Next, the generated landmark embeddings are fed into a
multi-mode recurrent landmark generator, producing a set of landmark sequences
still associated with the given smile class but clearly distinct from each other.
Finally, these landmark sequences are translated into face videos. Our
experimental results demonstrate the effectiveness of our CMM-Net in generating
realistic videos of multiple smile expressions. Comment: Accepted as a poster at the Conference on Computer Vision and Pattern
Recognition (CVPR), 201
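
To make the pipeline concrete, here is a condensed PyTorch skeleton of the three stages as described (landmark VAE, class-conditional recurrent network, multi-mode generators). Every name and dimension (CMMNetSketch, emb=128, steps=16) is an illustrative assumption, and the final landmark-to-video translation stage is omitted.

    import torch
    import torch.nn as nn

    class CMMNetSketch(nn.Module):
        # Hypothetical skeleton of the described pipeline, not the authors' code.
        def __init__(self, n_landmarks=68, emb=128, n_classes=2, n_modes=3):
            super().__init__()
            # (1) VAE over single landmark frames (x, y per landmark)
            self.enc = nn.Linear(2 * n_landmarks, 2 * emb)  # outputs mu and logvar
            self.dec = nn.Linear(emb, 2 * n_landmarks)
            # (2) recurrent network conditioned on the expression class
            self.cond_rnn = nn.LSTM(emb + n_classes, emb, batch_first=True)
            # (3) one recurrent generator per mode, to yield distinct sequences
            self.mode_rnns = nn.ModuleList(
                [nn.LSTM(emb, emb, batch_first=True) for _ in range(n_modes)])

        def forward(self, neutral_landmarks, class_onehot, steps=16):
            mu, logvar = self.enc(neutral_landmarks).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
            z_seq = z.unsqueeze(1).expand(-1, steps, -1)
            cond = class_onehot.unsqueeze(1).expand(-1, steps, -1)
            emb_seq, _ = self.cond_rnn(torch.cat([z_seq, cond], dim=-1))
            # each mode generator re-generates a distinct landmark sequence
            return [self.dec(rnn(emb_seq)[0]) for rnn in self.mode_rnns]

In the full system, the returned landmark sequences would then drive an image-space translation network to render the smile videos.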