Motion-aware Memory Network for Fast Video Salient Object Detection
Previous methods based on 3D CNNs, ConvLSTM, or optical flow have achieved
great success in video salient object detection (VSOD). However, they still
suffer from high computational costs or poor quality of the generated saliency
maps. To address these problems, we design a space-time memory (STM)-based network whose temporal branch extracts useful temporal information for the current frame from adjacent frames. Furthermore, previous methods considered only single-frame prediction without temporal association, so the model may not attend sufficiently to temporal information. We therefore introduce inter-frame object motion prediction into VSOD for the first time.
Our model follows a standard encoder-decoder architecture. In the encoding stage, we generate high-level temporal features from the high-level features of the current frame and its adjacent frames. This approach is more efficient than
the optical flow-based methods. In the decoding stage, we propose an effective
fusion strategy for spatial and temporal branches. The semantic information of
the high-level features is used to fuse the object details in the low-level
features, and then the spatiotemporal features are obtained step by step to
reconstruct the saliency maps. Moreover, inspired by the boundary supervision
commonly used in image salient object detection (ISOD), we design a
motion-aware loss for predicting object boundary motion and simultaneously
perform multitask learning for VSOD and object motion prediction, which further helps the model extract spatiotemporal features accurately and maintain object integrity. Extensive experiments on several datasets demonstrate the effectiveness of our method, which achieves state-of-the-art results on some of them. The proposed model requires no optical flow or other preprocessing and runs at nearly 100 FPS during inference.
Comment: 12 pages, 10 figures
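The STM-style temporal branch described above can be pictured as an attention read: high-level features of the current (query) frame are matched against keys and values computed from adjacent (memory) frames, and the attention-weighted read-out serves as the temporal feature passed to the decoder. Below is a minimal PyTorch sketch of such a memory read; it is an illustration under assumed shapes, not the authors' implementation, and the module name, projection layers, and key dimension are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryRead(nn.Module):
    """Sketch of a space-time memory read (attention over adjacent frames)."""
    def __init__(self, channels: int, key_dim: int = 64):
        super().__init__()
        # 1x1 convs project encoder features to key/value spaces (assumed sizes)
        self.to_key = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.to_value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, query_feat, memory_feats):
        # query_feat:   (B, C, H, W)    current-frame high-level features
        # memory_feats: (B, T, C, H, W) high-level features of T adjacent frames
        B, T, C, H, W = memory_feats.shape
        q = self.to_key(query_feat).flatten(2)                 # (B, D, HW)
        mem = memory_feats.reshape(B * T, C, H, W)
        k = self.to_key(mem).reshape(B, T, -1, H * W)          # (B, T, D, HW)
        v = self.to_value(mem).reshape(B, T, C, H * W)         # (B, T, C, HW)
        k = k.permute(0, 2, 1, 3).reshape(B, -1, T * H * W)    # (B, D, THW)
        v = v.permute(0, 2, 1, 3).reshape(B, C, T * H * W)     # (B, C, THW)
        # Affinity between every query location and every memory location
        attn = torch.einsum('bdq,bdm->bqm', q, k) / (k.shape[1] ** 0.5)
        attn = F.softmax(attn, dim=-1)                         # (B, HW, THW)
        read = torch.einsum('bqm,bcm->bcq', attn, v)           # (B, C, HW)
        return read.reshape(B, C, H, W)                        # temporal feature
```

In the paper's terms, this read-out would then be fused step by step with the spatial branch in the decoder to reconstruct the saliency maps.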
Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor
We investigate video classification via a two-stream convolutional neural
network (CNN) design that directly ingests information extracted from
compressed video bitstreams. Our approach begins with the observation that all
modern video codecs divide the input frames into macroblocks (MBs). We
demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams can also enable selective, motion-adaptive MB pixel decoding (a.k.a. MB texture decoding). This in turn allows for the
derivation of spatio-temporal video activity regions at extremely high speed in
comparison to conventional full-frame decoding followed by optical flow
estimation. In order to evaluate the accuracy of a video classification
framework based on such activity data, we independently train two CNN
architectures on MB texture and MV correspondences and then fuse their scores
to derive the final classification of each test video. Evaluation on two
standard datasets shows that the proposed approach is competitive with the best
two-stream video classification approaches found in the literature. At the same
time: (i) a CPU-based realization of our MV extraction is over 977 times faster
than GPU-based optical flow methods; (ii) selective decoding is up to 12 times
faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs
perform inference at 5 to 49 times lower cloud computing cost than the fastest
methods from the literature.
Comment: Accepted in IEEE Transactions on Circuits and Systems for Video Technology. Extension of an ICIP 2017 conference paper.
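As a rough illustration of the score-fusion step (a sketch under stated assumptions, not the paper's code): two independently trained classifiers score each test video from its MB texture and MV inputs, and their softmax scores are combined into the final prediction. The fusion weight, model handles, and input tensors below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_video(texture_cnn, mv_cnn, texture_clip, mv_clip, w_texture=0.5):
    # texture_clip / mv_clip: preprocessed inputs for one test video (assumed)
    texture_scores = F.softmax(texture_cnn(texture_clip), dim=-1)  # (1, classes)
    mv_scores = F.softmax(mv_cnn(mv_clip), dim=-1)                 # (1, classes)
    # Weighted late fusion of the two streams' class-score vectors
    fused = w_texture * texture_scores + (1.0 - w_texture) * mv_scores
    return fused.argmax(dim=-1).item()  # predicted class index
```

The fusion weight would typically be tuned on a validation split; equal weighting is shown only as a default.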