Spatio-Temporal Fusion Networks for Action Recognition
Video-based CNN approaches have focused on effective ways to fuse appearance
and motion networks, but they typically fail to exploit temporal information
across video frames. In this work, we present a novel spatio-temporal fusion
network (STFN) that integrates temporal dynamics of appearance and motion
information from entire videos. The captured temporal dynamic information is
then aggregated for a better video level representation and learned via
end-to-end training. The spatio-temporal fusion network consists of two sets
of Residual Inception blocks that extract temporal dynamics and a fusion
connection for appearance and motion features. The benefits of STFN are: (a) it
captures local and global temporal dynamics of complementary data to learn
video-wide information; and (b) it is applicable to any network for video
classification to boost performance. We explore a variety of design choices for
STFN and verify, through ablation studies, how performance varies with each
choice. We perform experiments on two challenging human activity datasets,
UCF101 and HMDB51, and achieve state-of-the-art results with the best
network.
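The abstract gives no implementation details, but the core idea, extracting local temporal dynamics from each stream and aggregating them into a video-level representation, can be sketched in plain Python. The difference-and-pool functions below are illustrative stand-ins of my own, not the paper's Residual Inception blocks or its fusion connection:

```python
def temporal_dynamics(features):
    """Local temporal dynamics as frame-to-frame feature differences.
    `features` is a list of per-frame feature vectors (lists of floats).
    This is a simple stand-in for the learned dynamics extractor."""
    return [[b - a for a, b in zip(f1, f2)]
            for f1, f2 in zip(features, features[1:])]

def video_level_representation(appearance, motion):
    """Fuse the two streams by mean-pooling each stream's local dynamics
    over time (a crude global summary) and concatenating the results."""
    def mean_pool(seq):
        t = len(seq)
        return [sum(frame[d] for frame in seq) / t for d in range(len(seq[0]))]
    return (mean_pool(temporal_dynamics(appearance))
            + mean_pool(temporal_dynamics(motion)))
```

For example, appearance features `[[0, 0], [1, 2], [2, 4]]` and motion features `[[0, 0], [2, 2]]` yield the fused video-level vector `[1.0, 2.0, 2.0, 2.0]`.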
Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition
Human action recognition remains an important yet challenging task. This work
proposes a novel action recognition system. It uses a Multiple View Region
Adaptive Multi-resolution-in-time Depth Motion Map (MV-RAMDMM)
formulation combined with appearance information. Multiple stream 3D
Convolutional Neural Networks (CNNs) are trained on the different views and
time resolutions of the region adaptive Depth Motion Maps. Multiple views are
synthesised to enhance the view invariance. The region adaptive weights, based
on localised motion, accentuate and differentiate parts of actions possessing
faster motion. Dedicated 3D CNN streams for multi-time resolution appearance
information (RGB) are also included. These help to identify and differentiate
between small object interactions. A pre-trained 3D-CNN is used here with
fine-tuning for each stream, along with multi-class Support Vector Machines
(SVMs). Average score fusion is used on the output. The developed approach is
capable of recognising both human action and human-object interaction. Three
public domain datasets: MSR 3D Action, Northwestern UCLA multi-view actions,
and MSR 3D daily activity are used to evaluate the proposed solution.
The experimental results demonstrate the robustness of this approach compared
with state-of-the-art algorithms.
Comment: 14 pages, 6 figures, 13 tables. Submitted
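A depth motion map accumulates absolute frame-to-frame differences of projected depth frames, and the multi-resolution-in-time idea computes such maps over several temporal windows. A minimal sketch, assuming depth frames as plain lists of lists and ignoring the paper's view synthesis and region-adaptive weighting (function names are hypothetical):

```python
def depth_motion_map(frames):
    """Accumulate absolute frame-to-frame depth differences into one map.
    `frames` is a list of 2D depth frames (lists of lists of numbers)."""
    h, w = len(frames[0]), len(frames[0][0])
    dmm = [[0.0] * w for _ in range(h)]
    for prev, curr in zip(frames, frames[1:]):
        for i in range(h):
            for j in range(w):
                dmm[i][j] += abs(curr[i][j] - prev[i][j])
    return dmm

def multi_resolution_dmms(frames, window_lengths):
    """Compute DMMs over several temporal windows: short windows emphasise
    fast motion, the full sequence summarises the whole action."""
    return [depth_motion_map(frames[:n]) for n in window_lengths]
```

Pixels that change quickly across many frames accumulate large values, which is what lets the region-adaptive weights in the abstract accentuate fast-moving action parts.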
Two-Stream Convolutional Networks for Action Recognition in Videos
We investigate architectures of discriminatively trained deep Convolutional
Networks (ConvNets) for action recognition in video. The challenge is to
capture the complementary information on appearance from still frames and
motion between frames. We also aim to generalise the best performing
hand-crafted features within a data-driven learning framework.
Our contribution is three-fold. First, we propose a two-stream ConvNet
architecture which incorporates spatial and temporal networks. Second, we
demonstrate that a ConvNet trained on multi-frame dense optical flow is able to
achieve very good performance in spite of limited training data. Finally, we
show that multi-task learning, applied to two different action classification
datasets, can be used to increase the amount of training data and improve the
performance on both.
Our architecture is trained and evaluated on the standard video actions
benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of
the art. It also exceeds by a large margin previous attempts to use deep nets
for video classification.
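One common way to combine the spatial and temporal streams at test time is late fusion: average the two streams' softmax outputs and take the arg-max. A minimal sketch in plain Python (averaging is only one of the fusion schemes such papers consider; the function names are illustrative):

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(spatial_scores, temporal_scores):
    """Late fusion: average the per-class softmax outputs of the spatial
    (appearance) and temporal (optical-flow) streams, then pick the arg-max."""
    p_s = softmax(spatial_scores)
    p_t = softmax(temporal_scores)
    fused = [(a + b) / 2.0 for a, b in zip(p_s, p_t)]
    return fused.index(max(fused))  # predicted class index
```

For example, with spatial scores `[2.0, 1.0, 1.5]` (weakly favouring class 0) and temporal scores `[0.0, 0.0, 3.0]` (strongly favouring class 2), the confident temporal stream dominates and the fused prediction is class 2.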