Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition
Motion representation plays a vital role in human action recognition in
videos. In this study, we introduce a novel compact motion representation for
video action recognition, named Optical Flow guided Feature (OFF), which
enables the network to distill temporal information through a fast and robust
approach. The OFF is derived from the definition of optical flow and is
orthogonal to the optical flow. The derivation also provides theoretical
support for using the difference between two frames. By directly calculating
pixel-wise spatiotemporal gradients of the deep feature maps, the OFF could be
embedded in any existing CNN based video action recognition framework with only
a slight additional cost. It enables the CNN to extract spatiotemporal
information, especially the temporal information between frames simultaneously.
This simple but powerful idea is validated by experimental results. The network
with OFF fed only by RGB inputs achieves a competitive accuracy of 93.3% on
UCF-101, which is comparable with the result obtained by two streams (RGB and
optical flow), but is 15 times faster in speed. Experimental results also show
that OFF is complementary to other motion modalities such as optical flow. When
the proposed method is plugged into the state-of-the-art video action
recognition framework, it has 96.0% and 74.2% accuracy on UCF-101 and HMDB-51
respectively. The code for this project is available at
https://github.com/kevin-ssy/Optical-Flow-Guided-Feature.

Comment: CVPR 2018. Code available at
https://github.com/kevin-ssy/Optical-Flow-Guided-Feature
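The abstract describes OFF as pixel-wise spatiotemporal gradients of deep feature maps: spatial gradients within one frame's feature map plus the difference between two frames' feature maps. A minimal NumPy sketch of that idea, assuming 2D per-channel feature maps and simple finite differences (the function name and gradient operators are illustrative, not the paper's implementation):

```python
import numpy as np

def optical_flow_guided_feature(feat_t, feat_t1):
    """Sketch of an OFF-style descriptor for one feature channel.

    feat_t, feat_t1: 2D feature maps (H, W) from two consecutive frames.
    Returns a (3, H, W) stack: spatial gradients of the current map
    and the temporal (frame-to-frame) difference.
    """
    # Spatial gradients via central finite differences along H and W
    gx = np.gradient(feat_t, axis=0)
    gy = np.gradient(feat_t, axis=1)
    # Temporal gradient: difference between consecutive feature maps,
    # which is the theoretically supported frame-difference term
    gt = feat_t1 - feat_t
    return np.stack([gx, gy, gt], axis=0)
```

In the paper's framing, this three-component vector is orthogonal to the optical-flow vector by construction from the brightness-constancy (optical flow) equation; here it is computed directly on feature maps rather than estimated as a flow field.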
Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor
We investigate video classification via a two-stream convolutional neural
network (CNN) design that directly ingests information extracted from
compressed video bitstreams. Our approach begins with the observation that all
modern video codecs divide the input frames into macroblocks (MBs). We
demonstrate that selective access to MB motion vector (MV) information within
compressed video bitstreams can also provide for selective, motion-adaptive, MB
pixel decoding (a.k.a., MB texture decoding). This in turn allows for the
derivation of spatio-temporal video activity regions at extremely high speed in
comparison to conventional full-frame decoding followed by optical flow
estimation. In order to evaluate the accuracy of a video classification
framework based on such activity data, we independently train two CNN
architectures on MB texture and MV correspondences and then fuse their scores
to derive the final classification of each test video. Evaluation on two
standard datasets shows that the proposed approach is competitive to the best
two-stream video classification approaches found in the literature. At the same
time: (i) a CPU-based realization of our MV extraction is over 977 times faster
than GPU-based optical flow methods; (ii) selective decoding is up to 12 times
faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs
perform inference at 5 to 49 times lower cloud computing cost than the fastest
methods from the literature.

Comment: Accepted in IEEE Transactions on Circuits and Systems for Video
Technology. Extension of ICIP 2017 conference paper
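The abstract describes training two CNNs independently (on MB texture and on motion-vector correspondences) and fusing their scores for the final classification. A minimal sketch of that late-fusion step, assuming per-class logits from each stream and a simple weighted average of softmax scores (the function name and the equal weighting are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def fuse_two_stream_scores(texture_logits, mv_logits, w_texture=0.5):
    """Weighted late fusion of two-stream classifier outputs.

    texture_logits: per-class logits from the MB-texture (spatial) CNN.
    mv_logits:      per-class logits from the motion-vector (temporal) CNN.
    Returns fused per-class probabilities.
    """
    def softmax(x):
        e = np.exp(x - x.max())  # subtract max for numerical stability
        return e / e.sum()

    return (w_texture * softmax(texture_logits)
            + (1.0 - w_texture) * softmax(mv_logits))
```

The predicted class of a test video would then be the argmax of the fused probability vector.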