10,851 research outputs found
Deep Multimodal Speaker Naming
Automatic speaker naming is the problem of localizing as well as identifying
each speaking character in a TV/movie/live show video. This is a challenging
problem mainly attributes to its multimodal nature, namely face cue alone is
insufficient to achieve good performance. Previous multimodal approaches to
this problem usually process the data of different modalities individually and
merge them using handcrafted heuristics. Such approaches work well for simple
scenes, but fail to achieve high performance for speakers with large appearance
variations. In this paper, we propose a novel convolutional neural networks
(CNN) based learning framework to automatically learn the fusion function of
both face and audio cues. We show that without using face tracking, facial
landmark localization or subtitle/transcript, our system with robust multimodal
feature extraction is able to achieve state-of-the-art speaker naming
performance evaluated on two diverse TV series. The dataset and implementation
of our algorithm are publicly available online
Distinguishing Posed and Spontaneous Smiles by Facial Dynamics
Smile is one of the key elements in identifying emotions and present state of
mind of an individual. In this work, we propose a cluster of approaches to
classify posed and spontaneous smiles using deep convolutional neural network
(CNN) face features, local phase quantization (LPQ), dense optical flow and
histogram of gradient (HOG). Eulerian Video Magnification (EVM) is used for
micro-expression smile amplification along with three normalization procedures
for distinguishing posed and spontaneous smiles. Although the deep CNN face
model is trained with large number of face images, HOG features outperforms
this model for overall face smile classification task. Using EVM to amplify
micro-expressions did not have a significant impact on classification accuracy,
while the normalizing facial features improved classification accuracy. Unlike
many manual or semi-automatic methodologies, our approach aims to automatically
classify all smiles into either `spontaneous' or `posed' categories, by using
support vector machines (SVM). Experimental results on large UvA-NEMO smile
database show promising results as compared to other relevant methods.Comment: 16 pages, 8 figures, ACCV 2016, Second Workshop on Spontaneous Facial
Behavior Analysi
Shallow Triple Stream Three-dimensional CNN (STSTNet) for Micro-expression Recognition
In the recent year, state-of-the-art for facial micro-expression recognition
have been significantly advanced by deep neural networks. The robustness of
deep learning has yielded promising performance beyond that of traditional
handcrafted approaches. Most works in literature emphasized on increasing the
depth of networks and employing highly complex objective functions to learn
more features. In this paper, we design a Shallow Triple Stream
Three-dimensional CNN (STSTNet) that is computationally light whilst capable of
extracting discriminative high level features and details of micro-expressions.
The network learns from three optical flow features (i.e., optical strain,
horizontal and vertical optical flow fields) computed based on the onset and
apex frames of each video. Our experimental results demonstrate the
effectiveness of the proposed STSTNet, which obtained an unweighted average
recall rate of 0.7605 and unweighted F1-score of 0.7353 on the composite
database consisting of 442 samples from the SMIC, CASME II and SAMM databases.Comment: 5 pages, 1 figure, Accepted and published in IEEE FG 201
A Novel Apex-Time Network for Cross-Dataset Micro-Expression Recognition
The automatic recognition of micro-expression has been boosted ever since the
successful introduction of deep learning approaches. As researchers working on
such topics are moving to learn from the nature of micro-expression, the
practice of using deep learning techniques has evolved from processing the
entire video clip of micro-expression to the recognition on apex frame. Using
the apex frame is able to get rid of redundant video frames, but the relevant
temporal evidence of micro-expression would be thereby left out. This paper
proposes a novel Apex-Time Network (ATNet) to recognize micro-expression based
on spatial information from the apex frame as well as on temporal information
from the respective-adjacent frames. Through extensive experiments on three
benchmarks, we demonstrate the improvement achieved by learning such temporal
information. Specially, the model with such temporal information is more robust
in cross-dataset validations.Comment: 6 pages, 3 figures, 3 tables, code available, accepted in ACII 201
- …