Enriched Long-term Recurrent Convolutional Network for Facial Micro-Expression Recognition
Facial micro-expression (ME) recognition has posed a huge challenge to
researchers due to the subtlety of the motions involved and the limited
databases available. Recently,
handcrafted techniques have achieved superior performance in micro-expression
recognition, but at the cost of domain specificity and cumbersome parameter
tuning. In this paper, we propose an Enriched Long-term Recurrent
Convolutional Network (ELRCN) that first encodes each micro-expression frame
into a feature vector through CNN module(s), then predicts the micro-expression
by passing the feature vector through a Long Short-term Memory (LSTM) module.
The framework contains two different network variants: (1) Channel-wise
stacking of input data for spatial enrichment, (2) Feature-wise stacking of
features for temporal enrichment. We demonstrate that the proposed approach is
able to achieve reasonably good performance, without data augmentation. In
addition, we also present ablation studies conducted on the framework and
visualizations of what the CNN "sees" when predicting the micro-expression
classes.
Comment: Published in Micro-Expression Grand Challenge 2018, Workshop of 13th
IEEE Facial & Gesture 2018
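To make the pipeline concrete, here is a minimal PyTorch sketch of the CNN-then-LSTM idea the abstract describes: each frame is encoded into a feature vector by a small CNN, and the sequence of vectors is classified by an LSTM. The layer sizes, class count, and the `CnnLstmClassifier` name are illustrative assumptions, not the ELRCN architecture itself; the channel-wise variant would additionally stack extra inputs (e.g., optical flow) along the input channel dimension.

```python
# Minimal sketch of a CNN-to-LSTM pipeline in the spirit of ELRCN
# (hypothetical layer sizes and class count; not the authors' exact network).
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=5):
        super().__init__()
        # Per-frame spatial encoder (stand-in for the paper's CNN module).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim),
        )
        # Temporal model over the sequence of per-frame feature vectors.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))  # (B*T, feat_dim)
        feats = feats.view(b, t, -1)          # (B, T, feat_dim)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])          # classify from the last step

logits = CnnLstmClassifier()(torch.randn(2, 8, 3, 64, 64))  # (2, num_classes)
```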
Ordered Pooling of Optical Flow Sequences for Action Recognition
Training of Convolutional Neural Networks (CNNs) on long video sequences is
computationally expensive due to the substantial memory requirements and the
massive number of parameters that deep architectures demand. Early fusion of
video frames is thus a standard technique, in which several consecutive frames
are first agglomerated into a compact representation, and then fed into the CNN
as an input sample. For this purpose, a summarization approach was recently
proposed that represents a set of consecutive RGB frames by a single dynamic
image capturing pixel dynamics. In this paper, we introduce a novel ordered
representation of consecutive optical flow frames as an alternative and argue
that this representation captures the action dynamics more effectively than RGB
frames. We provide intuitions on why such a representation is better for action
recognition. We validate our claims on standard benchmark datasets and
demonstrate that using summaries of flow images leads to significant
improvements over RGB frames while achieving accuracy comparable to the
state of the art on the UCF101 and HMDB datasets.
Comment: Accepted in WACV 2017
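The frame summarization discussed above is commonly approximated by rank pooling with fixed linear weights (a "dynamic image"). Below is a minimal sketch of that idea applied to a stack of optical-flow frames; the weighting 2t - T - 1 is one simple linear approximation used in the dynamic-image literature, and the function name and array shapes are assumptions rather than this paper's exact method.

```python
# Minimal sketch of approximate rank pooling ("dynamic image") over a stack
# of optical-flow frames, assuming frames are equal-shape numpy arrays.
import numpy as np

def dynamic_image(frames):
    """Collapse T frames of shape (T, H, W, C) into one (H, W, C) summary."""
    t = np.arange(1, len(frames) + 1, dtype=np.float64)
    weights = 2.0 * t - len(frames) - 1.0    # linear weights favoring later frames
    # Weighted sum over the time axis yields the single summary image.
    return np.tensordot(weights, np.asarray(frames, dtype=np.float64), axes=1)

flow_clip = np.random.randn(10, 64, 64, 2)   # 10 two-channel (x/y) flow frames
print(dynamic_image(flow_clip).shape)        # (64, 64, 2)
```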
MoSculp: Interactive Visualization of Shape and Time
We present a system that allows users to visualize complex human motion via
3D motion sculptures---a representation that conveys the 3D structure swept by
a human body as it moves through space. Given an input video, our system
computes the motion sculpture and provides a user interface for rendering it
in different styles, including the options to insert the sculpture back into
the original video, render it in a synthetic scene or physically print it.
To provide this end-to-end workflow, we introduce an algorithm that estimates
the human's 3D geometry over time from a set of 2D images and develop a
3D-aware image-based rendering approach that embeds the sculpture back into the
scene. By automating the process, our system takes motion sculpture creation
out of the realm of professional artists, and makes it applicable to a wide
range of existing video material.
By providing viewers with 3D information, motion sculptures reveal space-time
motion information that is difficult to perceive with the naked eye, and allow
viewers to interpret how different parts of the object interact over time. We
validate the effectiveness of this approach with user studies, finding that our
motion sculpture visualizations are significantly more informative about motion
than existing stroboscopic and space-time visualization methods.
Comment: UIST 2018. Project page: http://mosculp.csail.mit.edu
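As a rough illustration of the "shape swept through time" idea, the sketch below merges per-frame 3D point sets into one space-time cloud, tagging each point with its frame index so a renderer could color the sculpture along the motion. The point sets here are random stand-ins; MoSculp's actual pipeline estimates body geometry from video and renders the sculpture back into the scene.

```python
# Minimal sketch of sweeping per-frame 3D geometry into one motion-sculpture
# point cloud; the random points stand in for an estimated body mesh.
import numpy as np

def sweep_to_sculpture(per_frame_points):
    """per_frame_points: list of (N_i, 3) arrays -> (sum N_i, 4) [x, y, z, t]."""
    tagged = [
        np.concatenate([pts, np.full((len(pts), 1), t)], axis=1)
        for t, pts in enumerate(per_frame_points)   # append time index to each point
    ]
    return np.concatenate(tagged, axis=0)

# A body-like blob drifting along x over 30 frames.
frames = [np.random.rand(100, 3) + [0.02 * t, 0.0, 0.0] for t in range(30)]
sculpture = sweep_to_sculpture(frames)   # single space-time point cloud
print(sculpture.shape)                   # (3000, 4)
```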
Going Deeper into First-Person Activity Recognition
We bring together ideas from recent work on feature design for egocentric
action recognition under one framework by exploring the use of deep
convolutional neural networks (CNN). Recent work has shown that features such
as hand appearance, object attributes, local hand motion and camera ego-motion
are important for characterizing first-person actions. To integrate these ideas
under one framework, we propose a twin stream network architecture, where one
stream analyzes appearance information and the other stream analyzes motion
information. Our appearance stream encodes prior knowledge of the egocentric
paradigm by explicitly training the network to segment hands and localize
objects. By visualizing certain neuron activations of our network, we show that
our proposed architecture naturally learns features that capture object
attributes and hand-object configurations. Our extensive experiments on
benchmark egocentric action datasets show that our deep architecture enables
recognition rates that significantly outperform state-of-the-art techniques,
with an average increase in accuracy across all datasets. Furthermore, by
learning to recognize objects, actions and activities jointly, the performance
of the individual action and object recognition tasks also increases. We also
include the results of extensive ablative analysis to
highlight the importance of network design decisions.
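For intuition, here is a minimal PyTorch sketch of a twin-stream design in the spirit of the one described: an appearance stream over an RGB frame and a motion stream over stacked optical-flow fields, fused before a shared classifier. The channel counts, feature sizes, and fusion-by-concatenation choice are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a two-stream ("twin stream") network: appearance (RGB)
# and motion (stacked optical flow) streams fused before classification.
import torch
import torch.nn as nn

def make_stream(in_channels, feat_dim=128):
    # Tiny stand-in CNN; each stream maps its input to a feature vector.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        nn.Linear(32 * 4 * 4, feat_dim),
    )

class TwinStreamNet(nn.Module):
    def __init__(self, num_actions=20):
        super().__init__()
        self.appearance = make_stream(3)       # one RGB frame
        self.motion = make_stream(2 * 10)      # 10 stacked x/y flow fields
        self.classifier = nn.Linear(2 * 128, num_actions)

    def forward(self, rgb, flow_stack):
        # Late fusion: concatenate the two stream features, then classify.
        fused = torch.cat([self.appearance(rgb), self.motion(flow_stack)], dim=1)
        return self.classifier(fused)

net = TwinStreamNet()
logits = net(torch.randn(4, 3, 64, 64), torch.randn(4, 20, 64, 64))  # (4, 20)
```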