Temporal Bilinear Networks for Video Action Recognition
Temporal modeling in videos is a fundamental yet challenging problem in
computer vision. In this paper, we propose a novel Temporal Bilinear (TB) model
to capture the temporal pairwise feature interactions between adjacent frames.
Compared with some existing temporal methods, which are limited to linear
transformations, our TB model considers explicit quadratic bilinear
transformations in the temporal domain for motion evolution and sequential
relation modeling. We further leverage the factorized bilinear model in linear
complexity and a bottleneck network design to build our TB blocks, which also
constrains the parameters and computation cost. We consider two schemes in
terms of the incorporation of TB blocks and the original 2D spatial
convolutions, namely wide and deep Temporal Bilinear Networks (TBN). Finally,
we perform experiments on several widely adopted datasets including Kinetics,
UCF101 and HMDB51. The effectiveness of our TBNs is validated by comprehensive
ablation analyses and comparisons with various state-of-the-art methods.
Comment: Accepted by AAAI 201
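To make the "factorized bilinear model in linear complexity" idea concrete, here is a minimal sketch of a generic rank-k factorized bilinear interaction between the feature vectors of two adjacent frames. It is not the authors' TB block: the function names, the rank-k factors `U` and `V`, and the toy vectors are all assumptions chosen for illustration. The point it shows is that a full bilinear form needs a d x d weight matrix, while the factorized form only needs 2k vectors of length d, so its cost grows linearly in d for a fixed rank.

```python
def dot(u, x):
    """Plain dot product of two equal-length lists."""
    return sum(ui * xi for ui, xi in zip(u, x))


def factorized_bilinear(x_prev, x_next, U, V):
    """Rank-k bilinear interaction: sum_k (U[k] . x_prev) * (V[k] . x_next).

    U and V are hypothetical factor lists (k vectors of length d each).
    Cost is O(k * d) instead of O(d^2) for a full bilinear matrix.
    """
    return sum(dot(u, x_prev) * dot(v, x_next) for u, v in zip(U, V))


def full_bilinear(x_prev, x_next, W):
    """Reference full bilinear form x_prev^T W x_next, for comparison only."""
    return sum(
        x_prev[i] * W[i][j] * x_next[j]
        for i in range(len(x_prev))
        for j in range(len(x_next))
    )


def expand_factors(U, V):
    """Build the d x d matrix W = sum_k outer(U[k], V[k]) implied by the factors."""
    d = len(U[0])
    W = [[0.0] * d for _ in range(d)]
    for u, v in zip(U, V):
        for i in range(d):
            for j in range(d):
                W[i][j] += u[i] * v[j]
    return W


# Toy features for frame t and frame t+1 (assumed values).
x_t = [1.0, 2.0]
x_t1 = [3.0, 4.0]
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 1.0], [1.0, -1.0]]

y_factorized = factorized_bilinear(x_t, x_t1, U, V)
y_full = full_bilinear(x_t, x_t1, expand_factors(U, V))
```

Both `y_factorized` and `y_full` evaluate to 5.0 here, which confirms that the factorized form computes the same pairwise interaction as the expanded matrix without ever materializing it.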
A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for
video analysis and study their effects on action recognition. Our motivation
stems from the observation that 2D CNNs applied to individual frames of the
video have remained solid performers in action recognition. In this work we
empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within
the framework of residual learning. Furthermore, we show that factorizing the
3D convolutional filters into separate spatial and temporal components yields
significant advantages in accuracy. Our empirical study leads to the design
of a new spatiotemporal convolutional block "R(2+1)D" which gives rise to CNNs
that achieve results comparable or superior to the state-of-the-art on
Sports-1M, Kinetics, UCF101 and HMDB51.
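The factorization described above can be illustrated with simple parameter arithmetic. The sketch below compares a full t x d x d 3D convolution with a spatial 1 x d x d convolution followed by a temporal t x 1 x 1 convolution. The abstract does not state how the intermediate channel width is chosen; the `mid_channels` formula here is an assumption that picks the width so the factorized block has roughly the same number of weights as the full 3D filter. Biases are ignored and all function names are hypothetical.

```python
def params_3d(n_in, n_out, t, d):
    """Weights in a full 3D convolution with a t x d x d kernel (no bias)."""
    return n_in * n_out * t * d * d


def mid_channels(n_in, n_out, t, d):
    """Assumed intermediate width that roughly matches the 3D parameter count."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_2plus1d(n_in, n_out, t, d, m):
    """Weights in a spatial (1 x d x d) conv to m channels, then a temporal (t x 1 x 1) conv."""
    spatial = n_in * m * d * d
    temporal = m * n_out * t
    return spatial + temporal


# Example: 64 -> 64 channels with a 3 x 3 x 3 kernel.
full = params_3d(64, 64, 3, 3)
m = mid_channels(64, 64, 3, 3)
factored = params_2plus1d(64, 64, 3, 3, m)
```

For this configuration `full` is 110592, `m` is 144, and `factored` is also 110592, so the factorized block spends the same budget while inserting an extra nonlinearity point between the spatial and temporal steps.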
Cross-Modal Message Passing for Two-stream Fusion
Processing and fusing information across multiple modalities is a very useful
technique for achieving high performance in many computer vision problems. In
order to tackle multi-modal information more effectively, we introduce a novel
framework for multi-modal fusion: Cross-modal Message Passing (CMMP).
Specifically, we propose a cross-modal message passing mechanism to fuse a
two-stream network for action recognition, which is composed of an appearance
modal network (RGB image) and a motion modal network (optical flow image). The
objectives of individual networks in this framework are two-fold: a standard
classification objective and a competing objective. The classification objective
ensures that each modal network predicts the true action category while the
competing objective encourages each modal network to outperform the other one.
We quantitatively show that the proposed CMMP fuses the traditional two-stream
network more effectively, and outperforms all existing two-stream fusion methods
on the UCF-101 and HMDB-51 datasets.
Comment: 2018 IEEE International Conference on Acoustics, Speech and Signal Processin
Memory-Augmented Temporal Dynamic Learning for Action Recognition
Human actions captured in video sequences contain two crucial factors for
action recognition, i.e., visual appearance and motion dynamics. To model these
two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are
adopted in most existing successful methods for recognizing actions. However,
CNN based methods are limited in modeling long-term motion dynamics. RNNs are
able to learn temporal motion dynamics but lack effective ways to tackle
unsteady dynamics in long-duration motion. In this work, we propose a
memory-augmented temporal dynamic learning network, which learns to write the
most evident information into an external memory module and ignore irrelevant
information. In particular, we present a differential memory controller to make a
discrete decision on whether the external memory module should be updated with
the current feature. The discrete memory controller takes in the memory history,
context embedding and current feature as inputs and controls information flow
into the external memory module. Additionally, we train this discrete memory
controller using a straight-through estimator. We evaluate this end-to-end system
on benchmark datasets (UCF101 and HMDB51) of human action recognition. The
experimental results show consistent improvements on both datasets over prior
works and our baselines.
Comment: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19
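The straight-through estimator mentioned above can be sketched for a single scalar write gate. This is a generic illustration, not the paper's controller: the function names, the 0.5 threshold, and the choice to backpropagate through the sigmoid as if the hard threshold were the identity are all assumptions. The forward pass makes a hard 0/1 decision, and the backward pass substitutes a usable surrogate gradient for the threshold, whose true derivative is zero almost everywhere.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_forward(logit):
    """Forward pass: hard write decision (1.0 = write, 0.0 = skip).

    Returns the discrete decision and the soft probability, which is
    kept for the backward pass.
    """
    prob = sigmoid(logit)
    decision = 1.0 if prob >= 0.5 else 0.0
    return decision, prob


def gate_backward(grad_decision, prob):
    """Backward pass with a straight-through estimator.

    The hard threshold is treated as the identity (d decision / d prob := 1),
    so the incoming gradient flows through the sigmoid derivative only.
    """
    return grad_decision * prob * (1.0 - prob)


# A positive logit writes to memory, a negative one skips the update.
write, p_write = gate_forward(2.0)
skip, p_skip = gate_forward(-1.0)

# Gradient with respect to the logit at prob = 0.5, for an upstream gradient of 1.
grad_at_half = gate_backward(1.0, 0.5)
```

Here `write` is 1.0, `skip` is 0.0, and `grad_at_half` is 0.25, so the controller stays trainable with ordinary gradient descent even though its output is discrete.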
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Despite the steady progress in video analysis led by the adoption of
convolutional neural networks (CNNs), the relative improvement has been less
drastic than that in 2D static image classification. Three main challenges exist,
including spatial (image) feature representation, temporal information
representation, and model/computation complexity. It was recently shown by
Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained
on ImageNet, could be a promising way for spatial and temporal representation
learning. However, as for model/computation complexity, 3D CNNs are much more
expensive than 2D CNNs and prone to overfit. We seek a balance between speed
and accuracy by building an effective and efficient video classification system
through systematic exploration of critical network design choices. In
particular, we show that it is possible to replace many of the 3D convolutions
by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed
and accuracy) is achieved when replacing the 3D convolutions at the bottom of
the network, suggesting that temporal representation learning on high-level
semantic features is more useful. Our conclusion generalizes to datasets with
very different properties. When combined with several other cost-effective
designs including separable spatial/temporal convolution and feature gating,
our system results in an effective video classification system that
produces very competitive results on several action classification benchmarks
(Kinetics, Something-something, UCF101 and HMDB), as well as two action
detection (localization) benchmarks (JHMDB and UCF101-24).
Comment: ECCV 2018 camera read
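The abstract names feature gating as one of its cost-effective designs without defining it. One plausible reading, sketched below under that assumption, is a channel-wise self-gating step: each channel is average-pooled over all spatiotemporal positions, a learned linear map plus sigmoid turns the pooled vector into one gate per channel, and each channel is rescaled by its gate. The function name, the weight layout, and the toy values are hypothetical.

```python
import math


def feature_gating(features, weights, bias):
    """Channel-wise self-gating (assumed form).

    features: list of channels, each a flat list of activations over
              all spatiotemporal positions.
    weights:  square matrix mapping the pooled channel vector to gate logits.
    bias:     one bias per channel.
    """
    # Global average pool: one scalar summary per channel.
    pooled = [sum(channel) / len(channel) for channel in features]

    gated = []
    for c, channel in enumerate(features):
        logit = sum(w * p for w, p in zip(weights[c], pooled)) + bias[c]
        gate = 1.0 / (1.0 + math.exp(-logit))  # gate in (0, 1)
        gated.append([gate * value for value in channel])
    return gated


# Two channels with two positions each; zero weights give a neutral gate of 0.5.
features = [[2.0, 4.0], [1.0, 3.0]]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]

gated = feature_gating(features, zero_weights, zero_bias)
```

With zero weights and biases every gate equals 0.5, so `gated` is `[[1.0, 2.0], [0.5, 1.5]]`. With trained weights the gates would differ per channel and depend on the pooled context of the whole clip.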