Hidden Two-Stream Convolutional Networks for Action Recognition
Analyzing videos of human actions involves understanding the temporal
relationships among video frames. State-of-the-art action recognition
approaches rely on traditional optical flow estimation methods to pre-compute
motion information for CNNs. Such a two-stage approach is computationally
expensive, storage demanding, and not end-to-end trainable. In this paper, we
present a novel CNN architecture that implicitly captures motion information
between adjacent frames. We name our approach hidden two-stream CNNs because it
only takes raw video frames as input and directly predicts action classes
without explicitly computing optical flow. Our end-to-end approach is 10x
faster than its two-stage baseline. Experimental results on four challenging
action recognition datasets (UCF101, HMDB51, THUMOS14, and ActivityNet v1.2) show
that our approach significantly outperforms the previous best real-time
approaches. Comment: Accepted at ACCV 2018, camera ready. Code available at
https://github.com/bryanyzhu/Hidden-Two-Strea
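
For readers who want a concrete picture of the hidden two-stream idea, the PyTorch sketch below stacks a small flow-predicting CNN in front of a temporal-stream classifier and fuses it with a spatial stream, so only raw frames are ever consumed. The module names, layer sizes, and 11-frame clip length are illustrative assumptions, not the authors' implementation.

# A minimal sketch of the hidden two-stream idea, assuming PyTorch and
# hypothetical module names: a small CNN predicts optical-flow-like fields
# from stacked raw frames, which feed a temporal stream fused with a spatial stream.
import torch
import torch.nn as nn

class MotionNet(nn.Module):
    """Predicts flow-like motion fields (2 channels per adjacent frame pair) from raw frames."""
    def __init__(self, num_frames=11):
        super().__init__()
        in_ch = 3 * num_frames                  # stacked RGB frames
        out_ch = 2 * (num_frames - 1)           # (u, v) per adjacent frame pair
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )
    def forward(self, frames):                  # frames: (B, 3*T, H, W)
        return self.net(frames)                 # (B, 2*(T-1), H, W)

class HiddenTwoStream(nn.Module):
    """End-to-end: raw frames -> implicit motion -> temporal stream, fused with a spatial stream."""
    def __init__(self, num_classes=101, num_frames=11):
        super().__init__()
        self.motion_net = MotionNet(num_frames)
        self.temporal_stream = nn.Sequential(   # stand-in for a deep temporal CNN
            nn.Conv2d(2 * (num_frames - 1), 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes),
        )
        self.spatial_stream = nn.Sequential(    # stand-in for a deep appearance CNN
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes),
        )
    def forward(self, frames):
        motion = self.motion_net(frames)                     # no precomputed optical flow
        temporal_logits = self.temporal_stream(motion)
        spatial_logits = self.spatial_stream(frames[:, :3])  # appearance from the first frame
        return (temporal_logits + spatial_logits) / 2        # late fusion by averaging

x = torch.randn(2, 33, 112, 112)   # batch of 2 clips, 11 RGB frames each
print(HiddenTwoStream()(x).shape)  # torch.Size([2, 101])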
Large-Scale Mapping of Human Activity using Geo-Tagged Videos
This paper is the first work to perform spatio-temporal mapping of human
activity using the visual content of geo-tagged videos. We utilize a recent
deep-learning based video analysis framework, termed hidden two-stream
networks, to recognize a range of activities in YouTube videos. This framework
is efficient and can run in real time or faster, which is important for
recognizing events as they occur in streaming video and for reducing latency
when analyzing already-captured video. This is, in turn, important for using video
in smart-city applications. We perform a series of experiments to show our
approach is able to accurately map activities both spatially and temporally. We
also demonstrate the advantages of using the visual content over the
tags/titles. Comment: Accepted at ACM SIGSPATIAL 201
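
As a rough illustration of the mapping step, the snippet below aggregates per-video activity predictions into coarse space-time bins. The record format, 0.5-degree cell size, and hourly binning are assumptions made for this example, not details from the paper.

# A minimal sketch (not from the paper) of aggregating per-video activity
# predictions into a spatio-temporal map, assuming each geo-tagged video has
# already been classified by the recognition framework.
from collections import Counter, defaultdict

def build_activity_map(records, cell_deg=0.5):
    """records: iterable of (lat, lon, hour_of_day, activity_label).
    Returns {(lat_bin, lon_bin, hour): Counter of activity labels}."""
    grid = defaultdict(Counter)
    for lat, lon, hour, activity in records:
        cell = (round(lat / cell_deg) * cell_deg,
                round(lon / cell_deg) * cell_deg,
                hour)
        grid[cell][activity] += 1
    return grid

# Toy usage: three classified geo-tagged videos.
records = [
    (40.71, -74.01, 18, "dancing"),
    (40.72, -74.00, 18, "dancing"),
    (34.05, -118.24, 9, "surfing"),
]
for cell, counts in build_activity_map(records).items():
    print(cell, counts.most_common(1))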
Spatio-Temporal Fusion Networks for Action Recognition
Video-based CNN works have focused on effective ways to fuse appearance
and motion networks, but they typically fail to exploit temporal information
across video frames. In this work, we present a novel spatio-temporal fusion
network (STFN) that integrates temporal dynamics of appearance and motion
information from entire videos. The captured temporal dynamics are then
aggregated into a better video-level representation and learned via
end-to-end training. The spatio-temporal fusion network consists of two sets of
Residual Inception blocks that extract temporal dynamics and a fusion
connection for appearance and motion features. The benefits of STFN are: (a) it
captures local and global temporal dynamics of complementary data to learn
video-wide information; and (b) it is applicable to any network for video
classification to boost performance. We explore a variety of design choices for
STFN and verify through ablation studies how each affects network performance.
We perform experiments on two challenging human activity datasets,
UCF101 and HMDB51, and achieve state-of-the-art results with the best
network.
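
The sketch below gives one possible reading of the STFN design in PyTorch: parallel 1-D temporal "Residual Inception" blocks over per-frame appearance and motion features, a 1x1 fusion connection, and temporal averaging into a video-level representation. The block structure and dimensions are illustrative stand-ins, not the paper's exact architecture.

# A minimal sketch of the STFN idea, assuming PyTorch; the simple 1-D residual
# "inception" modules below are illustrative stand-ins for the paper's blocks.
import torch
import torch.nn as nn

class TemporalResInception(nn.Module):
    """Residual block with parallel 1-D temporal convolutions over a feature sequence."""
    def __init__(self, dim):
        super().__init__()
        self.branch3 = nn.Conv1d(dim, dim // 2, kernel_size=3, padding=1)
        self.branch5 = nn.Conv1d(dim, dim // 2, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)
    def forward(self, x):                       # x: (B, dim, T)
        y = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        return self.relu(x + y)                 # residual connection

class STFN(nn.Module):
    """Fuses per-frame appearance and motion features into a video-level prediction."""
    def __init__(self, dim=512, num_classes=101):
        super().__init__()
        self.app_block = TemporalResInception(dim)
        self.mot_block = TemporalResInception(dim)
        self.fusion = nn.Conv1d(2 * dim, dim, kernel_size=1)   # fusion connection
        self.classifier = nn.Linear(dim, num_classes)
    def forward(self, app_feats, mot_feats):    # both: (B, T, dim) per-frame features
        a = self.app_block(app_feats.transpose(1, 2))
        m = self.mot_block(mot_feats.transpose(1, 2))
        fused = self.fusion(torch.cat([a, m], dim=1))          # (B, dim, T)
        video_repr = fused.mean(dim=2)                         # aggregate over time
        return self.classifier(video_repr)

app = torch.randn(2, 16, 512)   # 16 frames of appearance features
mot = torch.randn(2, 16, 512)   # 16 frames of motion features
print(STFN()(app, mot).shape)   # torch.Size([2, 101])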
Texture-Based Input Feature Selection for Action Recognition
The performance of video action recognition has been significantly boosted by
using motion representations within a two-stream Convolutional Neural Network
(CNN) architecture. However, there are a few challenging problems in action
recognition in real scenarios, e.g., the variations in viewpoints and poses,
and the changes in backgrounds. The domain discrepancy between the training
data and the test data causes a performance drop. To improve model
robustness, we propose a novel method to identify the task-irrelevant content
in the inputs that increases this domain discrepancy. The method is based on a
human parsing (HP) model that jointly conducts dense correspondence
labelling and semantic part segmentation. The predictions from the HP model
are also used to re-render the human regions in each video with the same
set of textures, so that human appearance is the same across all classes. A
revised dataset is generated for training and testing, making the action
recognition model invariant to the irrelevant content in the inputs.
Moreover, the predictions from the HP model are used to enrich the inputs to
the action recognition (AR) model during both training and testing. Experimental
results show that our proposed model outperforms existing action recognition
models on the HMDB-51 dataset and the Penn Action dataset.
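
The snippet below sketches the two uses of the HP-model prediction described in the abstract: re-rendering human regions with one shared texture, and enriching the input with the part-segmentation map. It is an assumed NumPy illustration, not the authors' pipeline; the label convention and channel layout are made up for the example.

# A minimal sketch (an assumption, not the authors' code) of the two uses of a
# human-parsing prediction: re-rendering human regions with one shared texture,
# and concatenating the part-segmentation map as an extra input channel.
import numpy as np

def rerender_human(frame, part_seg, texture):
    """Replace pixels belonging to any human part with a fixed texture.
    frame: (H, W, 3) uint8; part_seg: (H, W) int part labels (0 = background);
    texture: (H, W, 3) uint8 shared across all videos/classes."""
    out = frame.copy()
    human_mask = part_seg > 0
    out[human_mask] = texture[human_mask]
    return out

def enrich_input(frame, part_seg, num_parts=15):
    """Concatenate a normalized part-segmentation channel to the RGB input."""
    seg_channel = (part_seg.astype(np.float32) / num_parts)[..., None]
    rgb = frame.astype(np.float32) / 255.0
    return np.concatenate([rgb, seg_channel], axis=-1)   # (H, W, 4)

# Toy usage with random data standing in for a frame and an HP-model prediction.
frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
part_seg = np.random.randint(0, 15, (224, 224))
texture = np.full((224, 224, 3), 128, dtype=np.uint8)
print(rerender_human(frame, part_seg, texture).shape, enrich_input(frame, part_seg).shape)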