Action Recognition with Dynamic Image Networks
We introduce the concept of "dynamic image", a novel compact representation
of videos useful for video analysis, particularly in combination with
convolutional neural networks (CNNs). A dynamic image encodes temporal data
such as RGB or optical flow videos by using the concept of "rank pooling". The
idea is to learn a ranking machine that captures the temporal evolution of the
data and to use the parameters of the latter as a representation. When a linear
ranking machine is used, the resulting representation is in the form of an
image, which we call dynamic because it summarizes the video dynamics in
addition to appearance. This is a powerful idea because it allows converting
any video to an image so that existing CNN models pre-trained for the analysis
of still images can be immediately extended to videos. We also present an
efficient and effective approximate rank pooling operator, accelerating
standard rank pooling algorithms by orders of magnitude, and formulate that as
a CNN layer. This new layer allows generalizing dynamic images to dynamic
feature maps. We demonstrate the power of the new representations on standard
benchmarks in action recognition, achieving state-of-the-art performance.
Comment: 14 pages, 9 figures, 9 tables
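To make the idea of collapsing a video into a single image concrete, here is a minimal sketch of a simplified linear rank pooling approximation. It assumes the pooled image is proportional to the sum of (later frame minus earlier frame) over all ordered frame pairs, which gives each frame t (1-indexed, T frames in total) the fixed weight 2t - T - 1. This is an illustration only: the operator in the paper may differ, for example by smoothing frames over time first. The function names and the min-max rescaling step are assumptions, not taken from the abstract.

```python
import numpy as np


def rank_pooling_coefficients(num_frames):
    """Weights alpha_t = 2t - T - 1 for t = 1..T (simplified variant, an assumption)."""
    t = np.arange(1, num_frames + 1)
    return (2 * t - num_frames - 1).astype(np.float64)


def dynamic_image(frames):
    """Collapse a video of shape (T, H, W, C) into one (H, W, C) array.

    Later frames get positive weights and earlier frames negative weights,
    so the result encodes the direction in which pixels change over time.
    """
    frames = np.asarray(frames, dtype=np.float64)
    alpha = rank_pooling_coefficients(frames.shape[0])
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=(0, 0))


def to_uint8(image):
    """Min-max rescale to 0..255 so a still-image CNN can consume the result."""
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A static video yields a constant image; map it to zeros.
        return np.zeros(image.shape, dtype=np.uint8)
    return np.round(255.0 * (image - lo) / (hi - lo)).astype(np.uint8)
```

Because the weights sum to zero, a video with no motion maps to an all-zero image under this sketch, so only temporal change survives the pooling.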
Action recognition from RGB-D data
In recent years, action recognition based on RGB-D data has attracted increasing attention. Unlike the data used in traditional 2D action recognition, RGB-D data contains extra depth and skeleton modalities, each with its own characteristics. This thesis presents seven novel methods that take advantage of the three modalities for action recognition.
First, effective handcrafted features are designed, and a frequent pattern mining method is employed to mine the most discriminative, representative and non-redundant features for skeleton-based action recognition. Second, to take advantage of powerful Convolutional Neural Networks (ConvNets), it is proposed to represent the spatio-temporal information carried in 3D skeleton sequences in three 2D images by encoding the joint trajectories and their dynamics into the color distribution of the images; ConvNets are then adopted to learn discriminative features for human action recognition. Third, for depth-based action recognition, three data augmentation strategies are proposed so that ConvNets can be applied to small training datasets. Fourth, to take full advantage of the 3D structural information offered by the depth modality and its insensitivity to illumination variations, three simple, compact yet effective image-based representations are proposed, and ConvNets are adopted for feature extraction and classification. However, both of the previous two methods are sensitive to noise and cannot differentiate fine-grained actions well. Fifth, to deal with this issue, it is proposed to represent a depth map sequence as three pairs of structured dynamic images at the body, part and joint levels respectively through bidirectional rank pooling. The structured dynamic image preserves the spatio-temporal information, enhances the structural information across both body parts/joints and different temporal scales, and takes advantage of ConvNets for action recognition. Sixth, it is proposed to extract scene flow from RGB and depth data and use it for action recognition. Last, to exploit the joint information in multi-modal features arising from heterogeneous sources (RGB, depth), it is proposed to cooperatively train a single ConvNet (referred to as c-ConvNet) on both RGB features and depth features, and to deeply aggregate the two modalities to achieve robust action recognition.
Describing Videos by Exploiting Temporal Structure
Recent progress in using recurrent neural networks (RNNs) for image
description has motivated the exploration of their application for video
description. However, while images are static, working with videos requires
modeling their dynamic temporal structure and then properly integrating that
information into a natural language description. In this context, we propose an
approach that successfully takes into account both the local and global
temporal structure of videos to produce descriptions. First, our approach
incorporates a spatial temporal 3-D convolutional neural network (3-D CNN)
representation of the short temporal dynamics. The 3-D CNN representation is
trained on video action recognition tasks, so as to produce a representation
that is tuned to human motion and behavior. Second, we propose a temporal
attention mechanism that goes beyond local temporal modeling and learns
to automatically select the most relevant temporal segments given the
text-generating RNN. Our approach exceeds the current state of the art for both
BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on
a new, larger and more challenging dataset of paired video and natural language
descriptions.
Comment: Accepted to ICCV15. This version comes with code release and
supplementary material
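The temporal attention step described above can be illustrated with a minimal sketch of additive soft attention. It assumes each temporal segment has a feature vector, that the decoder RNN exposes its current hidden state, and that relevance is scored by a small tanh layer followed by a softmax. The parameter names (W, U, w, b) and shapes are hypothetical, chosen for illustration; the abstract does not specify the exact scoring function.

```python
import numpy as np


def temporal_attention(segment_feats, decoder_state, W, U, w, b):
    """Soft attention over temporal segments (illustrative sketch).

    segment_feats: (n, d_v) one feature vector per temporal segment
    decoder_state: (d_h,)   current hidden state of the text-generating RNN
    W: (d_a, d_h), U: (d_a, d_v), w: (d_a,), b: (d_a,)  assumed parameters
    Returns the context vector (d_v,) and the attention weights (n,).
    """
    # Score every segment against the current decoder state.
    hidden = np.tanh(decoder_state @ W.T + segment_feats @ U.T + b)  # (n, d_a)
    scores = hidden @ w                                              # (n,)
    # Numerically stable softmax over segments.
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    # Weighted average of segment features, fed to the decoder at this step.
    context = weights @ segment_feats
    return context, weights
```

The weights are recomputed at every decoding step, so in this sketch each generated word can draw on a different part of the video.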
Deep recurrent spiking neural networks capture both static and dynamic representations of the visual cortex under movie stimuli
In the real world, visual stimuli received by the biological visual system
are predominantly dynamic rather than static. A better understanding of how the
visual cortex represents movie stimuli could provide deeper insight into the
information processing mechanisms of the visual system. Although some progress
has been made in modeling neural responses to natural movies with deep neural
networks, the visual representations of static and dynamic information under
such time-series visual stimuli remain to be further explored. In this work,
considering abundant recurrent connections in the mouse visual system, we
design a recurrent module based on the hierarchy of the mouse cortex and add it
into Deep Spiking Neural Networks, which have been demonstrated to be a more
compelling computational model for the visual cortex. Using Time-Series
Representational Similarity Analysis, we measure the representational
similarity between networks and mouse cortical regions under natural movie
stimuli. Subsequently, we conduct a comparison of the representational
similarity across recurrent/feedforward networks and image/video training
tasks. Trained on the video action recognition task, the recurrent SNN achieves the
highest representational similarity and significantly outperforms feedforward
SNN trained on the same task by 15% and the recurrent SNN trained on the image
classification task by 8%. We investigate how static and dynamic
representations of SNNs influence the similarity, as a way to explain the
importance of these two forms of representations in biological neural coding.
Taken together, our work is the first to apply deep recurrent SNNs to model the
mouse visual cortex under movie stimuli, and we establish that these networks
can capture both static and dynamic representations and contribute
to understanding the movie information processing mechanisms of
the visual cortex.
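For readers unfamiliar with representational similarity analysis, the sketch below shows the plain, static form of the technique, not the Time-Series variant used in the paper. It assumes responses are arranged as a stimuli-by-units matrix, builds a representational dissimilarity matrix (RDM) from 1 minus the Pearson correlation between stimuli, and compares two RDMs with a Spearman-style rank correlation that ignores ties. All function names are hypothetical.

```python
import numpy as np


def rdm(responses):
    """Representational dissimilarity matrix for a (n_stimuli, n_units) array."""
    # Each row is one stimulus; dissimilarity is 1 - Pearson correlation.
    return 1.0 - np.corrcoef(responses)


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries as a flat vector."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]


def ranks(values):
    """Rank positions of the values (simplification: ties are not averaged)."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=np.float64)
    result[order] = np.arange(len(values))
    return result


def representational_similarity(model_responses, brain_responses):
    """Rank correlation between the RDMs of two systems shown the same stimuli."""
    a = ranks(upper_triangle(rdm(model_responses)))
    b = ranks(upper_triangle(rdm(brain_responses)))
    return float(np.corrcoef(a, b)[0, 1])
```

The two systems may have different numbers of units, since only the stimulus-by-stimulus geometry is compared.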
Ordered Pooling of Optical Flow Sequences for Action Recognition
Training of Convolutional Neural Networks (CNNs) on long video sequences is
computationally expensive due to the substantial memory requirements and the
massive number of parameters that deep architectures demand. Early fusion of
video frames is thus a standard technique, in which several consecutive frames
are first agglomerated into a compact representation, and then fed into the CNN
as an input sample. For this purpose, a summarization approach was recently
proposed that represents a set of consecutive RGB frames by a single dynamic
image to capture pixel dynamics. In this paper, we introduce a novel ordered
representation of consecutive optical flow frames as an alternative and argue
that this representation captures the action dynamics more effectively than RGB
frames. We provide intuitions on why such a representation is better for action
recognition. We validate our claims on standard benchmark datasets and
demonstrate that using summaries of flow images leads to significant
improvements over RGB frames while achieving accuracy comparable to the
state of the art on the UCF101 and HMDB datasets.
Comment: Accepted in WACV 201
Im2Flow: Motion Hallucination from Static Images for Action Recognition
Existing methods to recognize actions in static images take the images at
their face value, learning the appearances (objects, scenes, and body
poses) that distinguish each action class. However, such models are deprived
of the rich dynamic structure and motions that also define human activity. We
propose an approach that hallucinates the unobserved future motion implied by a
single snapshot to help static-image action recognition. The key idea is to
learn a prior over short-term dynamics from thousands of unlabeled videos,
infer the anticipated optical flow on novel static images, and then train
discriminative models that exploit both streams of information. Our main
contributions are twofold. First, we devise an encoder-decoder convolutional
neural network and a novel optical flow encoding that can translate a static
image into an accurate flow map. Second, we show the power of hallucinated flow
for recognition, successfully transferring the learned motion into a standard
two-stream network for activity recognition. On seven datasets, we demonstrate
the power of the approach. It not only achieves state-of-the-art accuracy for
dense optical flow prediction, but also consistently enhances recognition of
actions and dynamic scenes.
Comment: Published in CVPR 2018, project page:
http://vision.cs.utexas.edu/projects/im2flow