93 research outputs found
Memory-Augmented Temporal Dynamic Learning for Action Recognition
Human actions captured in video sequences contain two crucial factors for
action recognition, i.e., visual appearance and motion dynamics. To model these
two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are
adopted in most existing successful methods for recognizing actions. However,
CNN based methods are limited in modeling long-term motion dynamics. RNNs are
able to learn temporal motion dynamics but lack effective ways to tackle
unsteady dynamics in long-duration motion. In this work, we propose a
memory-augmented temporal dynamic learning network, which learns to write the
most salient information into an external memory module and to ignore
irrelevant information. In particular, we present a differential memory
controller that makes a discrete decision on whether the external memory
module should be updated with the current feature. This discrete memory
controller takes the memory history, the context embedding, and the current
feature as inputs and controls the information flow into the external memory
module. Additionally, we train the discrete memory controller using the
straight-through estimator. We evaluate this end-to-end system
on benchmark datasets (UCF101 and HMDB51) of human action recognition. The
experimental results show consistent improvements on both datasets over prior
works and our baselines.
Comment: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
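The straight-through trick used to train the discrete controller can be illustrated in a few lines: the forward pass keeps the hard 0/1 write decision, while the backward pass pretends the threshold was the identity and lets the gradient flow through the underlying sigmoid. The sketch below is a minimal, hand-written illustration of that idea; the function names and the sigmoid relaxation are our assumptions, not the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass: a hard 0/1 write gate -- the discrete decision of
# whether the current feature is written into the external memory.
def gate_forward(logit):
    p = sigmoid(logit)
    return (1.0 if p > 0.5 else 0.0), p

# Backward pass: straight-through estimator -- the non-differentiable
# threshold is treated as identity, so the incoming gradient is routed
# to the sigmoid's gradient with respect to the logit.
def gate_backward(grad_out, p):
    return grad_out * p * (1.0 - p)

gate, p = gate_forward(0.3)   # p ~ 0.574 > 0.5, so the gate opens (1.0)
grad = gate_backward(1.0, p)  # gradient reaching the controller logit
```

In an autograd framework the same effect is usually obtained with a custom backward that copies the upstream gradient past the thresholding operation.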
Cross-Modal Message Passing for Two-stream Fusion
Processing and fusing information across multiple modalities is a useful
technique for achieving high performance in many computer vision problems. To
handle multi-modal information more effectively, we introduce a novel
framework for multi-modal fusion: Cross-modal Message Passing (CMMP).
Specifically, we propose a cross-modal message passing mechanism to fuse a
two-stream network for action recognition, which consists of an appearance
modal network (RGB images) and a motion modal network (optical flow images). The
objectives of individual networks in this framework are two-fold: a standard
classification objective and a competing objective. The classification objective
ensures that each modal network predicts the true action category while the
competing objective encourages each modal network to outperform the other one.
We quantitatively show that the proposed CMMP fuses the traditional two-stream
network more effectively and outperforms all existing two-stream fusion methods
on the UCF-101 and HMDB-51 datasets.
Comment: 2018 IEEE International Conference on Acoustics, Speech and Signal
Processing
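The two objectives per stream can be sketched as a standard cross-entropy term plus a competing term. In the sketch below the competing term penalizes a stream whenever the other stream assigns higher probability to the true class; this particular form is only one plausible reading of "encourages each modal network to outperform the other", not the paper's exact loss.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, y):
    return -math.log(probs[y])

# Hypothetical competing term: a stream pays a penalty when the other
# stream is more confident on the ground-truth class.
def competing(p_self, p_other, y):
    return max(0.0, p_other[y] - p_self[y])

y = 2                                # ground-truth action class
p_rgb = softmax([0.1, 0.2, 1.5])     # appearance (RGB) stream output
p_flow = softmax([0.4, 0.1, 0.9])    # motion (optical flow) stream output

loss_rgb = cross_entropy(p_rgb, y) + competing(p_rgb, p_flow, y)
loss_flow = cross_entropy(p_flow, y) + competing(p_flow, p_rgb, y)
```

Here the weaker flow stream incurs both a larger classification loss and a nonzero competing penalty, which is the intended pressure: each stream is pushed to at least match the other on the true class.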