VideoCapsuleNet: A Simplified Network for Action Detection
The recent advances in Deep Convolutional Neural Networks (DCNNs) have shown
extremely good results for video human action classification; action detection,
however, remains a challenging problem. Current action detection approaches
follow a complex pipeline involving multiple tasks such as tube proposals,
optical flow, and tube classification. In this work, we present a
more elegant solution for action detection based on the recently developed
capsule network. We propose a 3D capsule network for videos, called
VideoCapsuleNet: a unified network for action detection that jointly
performs pixel-wise action segmentation and action classification. The
proposed network generalizes the capsule network from 2D to 3D and
takes a sequence of video frames as input. The 3D generalization drastically
increases the number of capsules in the network, making capsule routing
computationally expensive. To address this issue, we introduce capsule-pooling
in the convolutional capsule layer, which makes the voting algorithm tractable.
The routing-by-agreement in the network inherently models action
representations, and the predicted capsules capture various action
characteristics. This inspired us to use the capsules for action localization:
the class-specific capsules predicted by the network determine a pixel-wise
localization of actions. The localization is further improved by parameterized
skip connections from the convolutional capsule layers, and the network is
trained end-to-end with both a classification and a localization loss. The
proposed network achieves state-of-the-art performance on
multiple action detection datasets including UCF-Sports, J-HMDB, and UCF-101
(24 classes), with impressive improvements of ~20% on UCF-101 and ~15% on
J-HMDB in terms of v-mAP scores.
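The abstract does not spell out the capsule-pooling mechanism, so the following is a minimal sketch under an assumed reading: before routing, the votes of same-type capsules within each 3D receptive field are averaged, so routing-by-agreement sees one mean vote per capsule type per location rather than one vote per input capsule. The tensor layout and the capsule_pooling helper are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def capsule_pooling(votes, kernel=(3, 3, 3), stride=(1, 1, 1)):
    """Average same-type capsule votes over each 3D receptive field.

    votes: (B, num_types * pose_dim, T, H, W) -- assumed layout in which
    all capsules of one type occupy a contiguous channel block, so
    channel-preserving average pooling mixes votes only within a type.
    """
    pad = tuple(k // 2 for k in kernel)
    # One 3D average pool collapses every receptive field to its mean vote,
    # which is what keeps the subsequent routing step tractable.
    return F.avg_pool3d(votes, kernel_size=kernel, stride=stride, padding=pad)

# Example: 32 capsule types with 16-dim poses over an 8-frame clip.
votes = torch.randn(2, 32 * 16, 8, 28, 28)
pooled = capsule_pooling(votes)  # each location now carries one mean vote per type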
Action Recognition, Temporal Localization and Detection in Trimmed and Untrimmed Video
Automatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human-computer interaction, video sports analysis, virtual and augmented reality, video retrieval, etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatio-temporal action detection, and video object/action segmentation, and we make the following contributions. First, for video action recognition, we propose a category-level feature learning method. The method automatically identifies pairs of similar categories using a criterion of mutual pairwise proximity in the (kernelized) feature space and a category-level similarity matrix in which each entry corresponds to the one-vs-one SVM margin for a pair of categories. Second, for temporal action localization, we exploit the temporal structure of actions by modeling an action as a sequence of sub-actions, and we present a computationally efficient approach. Third, we propose a 3D Tube Convolutional Neural Network (TCNN)-based pipeline for action detection. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features; it generalizes the popular Faster R-CNN framework from images to videos. Last, we propose an end-to-end encoder-decoder 3D convolutional neural network pipeline that segments foreground objects from the background; the action label can then be obtained by passing the segmented foreground object into an action classifier. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for video understanding compared to the state of the art.
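The abstract names the similarity matrix but gives no construction details, so the sketch below is one plausible reading, stated as an assumption: fit a one-vs-one linear SVM for every pair of categories and record its margin, 2/||w||, as the entry for that pair; pairs with small margins are the mutually close categories the feature-learning step singles out. The category_similarity_matrix helper and the choice of scikit-learn's LinearSVC are illustrative.

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def category_similarity_matrix(X, y):
    """Fill S[i, j] with the one-vs-one SVM margin 2 / ||w|| for each
    pair of categories; a small margin marks a confusable pair."""
    classes = np.unique(y)
    S = np.zeros((len(classes), len(classes)))
    for i, j in combinations(range(len(classes)), 2):
        mask = np.isin(y, [classes[i], classes[j]])  # samples of the two classes only
        clf = LinearSVC(C=1.0).fit(X[mask], y[mask])
        margin = 2.0 / np.linalg.norm(clf.coef_)     # hard-margin width proxy
        S[i, j] = S[j, i] = margin
    return S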
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos
Temporal action localization is an important yet challenging problem. Given a
long, untrimmed video consisting of multiple action instances and complex
background content, we need not only to recognize the action categories, but
also to localize the start and end time of each instance. Many
state-of-the-art systems use segment-level classifiers to select and rank
proposal segments of pre-determined boundaries. However, a desirable model
should move beyond segment-level and make dense predictions at a fine
granularity in time to determine precise temporal boundaries. To this end, we
design a novel Convolutional-De-Convolutional (CDC) network that places CDC
filters on top of 3D ConvNets, which have been shown to be effective for
abstracting action semantics but reduce the temporal length of the input data.
The proposed CDC filter performs the required temporal upsampling and spatial
downsampling operations simultaneously to predict actions at the frame-level
granularity. It is unique in jointly modeling action semantics in space-time
and fine-grained temporal dynamics. The CDC network is trained efficiently in
an end-to-end manner. Our model not only achieves superior performance in
detecting actions in every frame, but also significantly boosts the precision
of localizing temporal boundaries. Finally, the CDC network demonstrates very
high efficiency, processing 500 frames per second on a single GPU server. We
will update the camera-ready version and publish the source code online soon.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
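As a rough sketch of the CDC idea: the paper's CDC filter fuses spatial downsampling and temporal upsampling into one joint operation, whereas the illustrative CDCFilterSketch module below approximates it with two separable steps, a strided spatial convolution followed by a temporal transposed convolution; the module name, channel counts, and kernel sizes are assumptions.

import torch
import torch.nn as nn

class CDCFilterSketch(nn.Module):
    """Downsample space while upsampling time, CDC-style (a two-step
    approximation of the paper's single joint filter)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 1x3x3 conv, stride 2 in H and W: spatial downsampling.
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2), padding=(0, 1, 1))
        # Transposed conv, stride 2 in T: temporal upsampling toward frame level.
        self.temporal = nn.ConvTranspose3d(out_ch, out_ch, kernel_size=(4, 1, 1),
                                           stride=(2, 1, 1), padding=(1, 0, 0))

    def forward(self, x):              # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

x = torch.randn(1, 64, 8, 28, 28)
y = CDCFilterSketch(64, 128)(x)        # -> (1, 128, 16, 14, 14): T doubled, H and W halved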