VideoCapsuleNet: A Simplified Network for Action Detection
The recent advances in Deep Convolutional Neural Networks (DCNNs) have shown
extremely good results for video human action classification; however, action
detection is still a challenging problem. The current action detection
approaches follow a complex pipeline which involves multiple tasks such as tube
proposals, optical flow, and tube classification. In this work, we present a
more elegant solution for action detection based on the recently developed
capsule network. We propose a 3D capsule network for videos, called
VideoCapsuleNet: a unified network for action detection which can jointly
perform pixel-wise action segmentation along with action classification. The
proposed network is a generalization of capsule network from 2D to 3D, which
takes a sequence of video frames as input. The 3D generalization drastically
increases the number of capsules in the network, making capsule routing
computationally expensive. We introduce capsule-pooling in the convolutional
capsule layer to address this issue, which makes the voting algorithm tractable.
The routing-by-agreement in the network inherently models the action
representations, and various action characteristics are captured by the
predicted capsules. This inspired us to utilize the capsules for action
localization: the class-specific capsules predicted by the network are used
to determine a pixel-wise localization of actions. The localization is further
improved by parameterized skip connections with the convolutional capsule
layers, and the network is trained end-to-end with a classification as well as a
localization loss. The proposed network achieves state-of-the-art performance on
multiple action detection datasets including UCF-Sports, J-HMDB, and UCF-101
(24 classes), with an impressive ~20% improvement on UCF-101 and ~15%
improvement on J-HMDB in terms of v-mAP scores.
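
To illustrate why capsule-pooling makes voting tractable, the following is a minimal sketch and not the authors' implementation. It assumes that pooling means averaging the capsules of each type over one receptive field before they vote, and that each capsule pose is a flat vector. The abstract does not specify the pooling operator or the pose layout, and the names `capsule_pool` and `vote_count` are hypothetical.

```python
from typing import List

Pose = List[float]  # assumption: a capsule pose flattened to a plain vector


def capsule_pool(field: List[List[Pose]]) -> List[Pose]:
    """Average capsule poses over the positions of one receptive field.

    field[k][c] is the pose of capsule type c at position k.
    Returns one mean pose per capsule type (assumed pooling operator: mean).
    """
    num_positions = len(field)
    num_types = len(field[0])
    dim = len(field[0][0])
    sums = [[0.0] * dim for _ in range(num_types)]
    for position in field:
        for c, pose in enumerate(position):
            for d, value in enumerate(pose):
                sums[c][d] += value
    return [[s / num_positions for s in type_sum] for type_sum in sums]


def vote_count(num_positions: int, num_types: int,
               num_out_types: int, pooled: bool) -> int:
    """Number of votes cast from one receptive field to the next layer."""
    # Without pooling every capsule at every position votes;
    # with pooling only one mean capsule per type votes.
    voters = num_types if pooled else num_positions * num_types
    return voters * num_out_types


if __name__ == "__main__":
    # Two positions, two capsule types, 2-D poses.
    field = [
        [[1.0, 2.0], [0.0, 0.0]],
        [[3.0, 4.0], [2.0, 2.0]],
    ]
    print(capsule_pool(field))
    # A 3x3x3 receptive field (27 positions), 8 input types, 4 output types.
    print(vote_count(27, 8, 4, pooled=False), vote_count(27, 8, 4, pooled=True))
```

Under these assumptions, the number of votes per receptive field no longer grows with the number of positions in the field, which is the quantity that increases most sharply when moving from 2D to 3D.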