Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Despite the steady progress in video analysis led by the adoption of
convolutional neural networks (CNNs), the relative improvement has been less
drastic than that in 2D static image classification. Three main challenges exist:
spatial (image) feature representation, temporal information
representation, and model/computation complexity. It was recently shown by
Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained
on ImageNet, could be a promising way for spatial and temporal representation
learning. However, as for model/computation complexity, 3D CNNs are much more
expensive than 2D CNNs and prone to overfitting. We seek a balance between speed
and accuracy by building an effective and efficient video classification system
through systematic exploration of critical network design choices. In
particular, we show that it is possible to replace many of the 3D convolutions
by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed
and accuracy) is achieved when replacing the 3D convolutions at the bottom of
the network, suggesting that temporal representation learning on high-level
semantic features is more useful. Our conclusion generalizes to datasets with
very different properties. When combined with several other cost-effective
designs including separable spatial/temporal convolution and feature gating,
our system results in an effective video classification system that
produces very competitive results on several action classification benchmarks
(Kinetics, Something-something, UCF101 and HMDB), as well as two action
detection (localization) benchmarks (JHMDB and UCF101-24).
Comment: ECCV 2018 camera read
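To make the cost argument in this abstract concrete, the following is a minimal sketch of how weight counts compare for a dense 3D convolution, a separable spatial/temporal pair, and a plain 2D convolution. The function names, the channel width of 64, and the 3x3x3 kernel size are illustrative assumptions, not values from the paper, and biases are ignored.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a dense 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def separable_params(c_in, c_out, kt, kh, kw, c_mid=None):
    """Weight count of a spatial (1 x kh x kw) convolution followed by
    a temporal (kt x 1 x 1) convolution.

    c_mid is the hypothetical channel width between the two steps;
    it defaults to c_out here purely for illustration.
    """
    c_mid = c_out if c_mid is None else c_mid
    spatial = c_in * c_mid * kh * kw
    temporal = c_mid * c_out * kt
    return spatial + temporal


# Illustrative layer: 64 input channels, 64 output channels, 3x3x3 kernel.
full_3d = conv3d_params(64, 64, 3, 3, 3)
separable = separable_params(64, 64, 3, 3, 3)
plain_2d = conv3d_params(64, 64, 1, 3, 3)  # a 2D conv is the kt = 1 case

print(full_3d, separable, plain_2d)
```

Under these assumed sizes the separable pair uses fewer than half the weights of the dense 3D kernel, and the 2D layer uses a third. This only counts parameters; it says nothing about the accuracy trade-offs the abstract reports.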
Collaborative Spatio-temporal Feature Learning for Video Action Recognition
Spatio-temporal feature learning is of central importance for action
recognition in videos. Existing deep neural network models either learn spatial
and temporal features independently (C2D) or jointly with unconstrained
parameters (C3D). In this paper, we propose a novel neural operation which
encodes spatio-temporal features collaboratively by imposing a weight-sharing
constraint on the learnable parameters. In particular, we perform 2D
convolution along three orthogonal views of volumetric video data, which learns
spatial appearance and temporal motion cues respectively. By sharing the
convolution kernels of different views, spatial and temporal features are
collaboratively learned and thus benefit from each other. The complementary
features are subsequently fused by a weighted summation whose coefficients are
learned end-to-end. Our approach achieves state-of-the-art performance on
large-scale benchmarks and won 1st place in the Moments in Time Challenge
2018. Moreover, based on the learned coefficients of different views, we are
able to quantify the contributions of spatial and temporal features. This
analysis sheds light on the interpretability of the model and may also guide the
future design of algorithms for video recognition.
Comment: CVPR 201
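The operation described in this abstract (one shared 2D kernel applied along three orthogonal views, then fused by a weighted sum) can be sketched in plain Python as below. This is a single-channel toy under stated assumptions, not the authors' implementation: the names `conv_along_view` and `collaborative_conv` are hypothetical, zero padding is assumed, and the fusion coefficients are fixed numbers here, whereas the abstract says they are learned end-to-end.

```python
def conv_along_view(vol, kernel, axes):
    """Slide a 2D kernel over two chosen axes of a T x H x W volume.

    Uses zero padding and cross-correlation (no kernel flip), as is
    common in deep learning frameworks.
    """
    dims = (len(vol), len(vol[0]), len(vol[0][0]))
    kh, kw = len(kernel), len(kernel[0])
    out = [[[0.0] * dims[2] for _ in range(dims[1])] for _ in range(dims[0])]
    for t in range(dims[0]):
        for h in range(dims[1]):
            for w in range(dims[2]):
                acc = 0.0
                for i in range(kh):
                    for j in range(kw):
                        pos = [t, h, w]
                        pos[axes[0]] += i - kh // 2
                        pos[axes[1]] += j - kw // 2
                        # Positions outside the volume count as zero.
                        if all(0 <= p < d for p, d in zip(pos, dims)):
                            acc += kernel[i][j] * vol[pos[0]][pos[1]][pos[2]]
                out[t][h][w] = acc
    return out


def collaborative_conv(vol, kernel, coeffs):
    """Apply the same kernel to the H-W, T-W and T-H views and fuse them."""
    views = [(1, 2), (0, 2), (0, 1)]  # H-W, T-W, T-H
    outs = [conv_along_view(vol, kernel, axes) for axes in views]
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    return [[[sum(c * o[t][h][w] for c, o in zip(coeffs, outs))
              for w in range(W)] for h in range(H)] for t in range(T)]


# Toy input: a 3 x 3 x 3 volume with a single impulse at the centre.
vol = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
vol[1][1][1] = 1.0
kernel = [[1.0] * 3 for _ in range(3)]   # shared across all three views
coeffs = (0.5, 0.3, 0.2)                 # assumed fixed; learned in the paper
fused = collaborative_conv(vol, kernel, coeffs)
```

With the impulse input, each view spreads the response over a different plane through the centre, so the fused value at any position shows which views contributed there and with what coefficient.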