Two-Stream 3D Convolutional Neural Network for Skeleton-Based Action Recognition
It remains a challenge to efficiently extract spatiotemporal information
from skeleton sequences for 3D human action recognition. Although most recent
action recognition methods are based on Recurrent Neural Networks which present
outstanding performance, one of the shortcomings of these methods is the
tendency to overemphasize the temporal information. Since a 3D convolutional
neural network (3D CNN) is a powerful tool to simultaneously learn features from
both spatial and temporal dimensions through capturing the correlations between
three dimensional signals, this paper proposes a novel two-stream model using
3D CNN. To the best of our knowledge, this is the first application of 3D CNN in
skeleton-based action recognition. Our method consists of three stages. First,
skeleton joints are mapped into a 3D coordinate space, and the spatial and
temporal information is then encoded separately. Second, 3D CNN models are
separately adopted to extract deep features from the two streams. Third, to enhance
the ability of deep features to capture global relationships, we extend every
stream into a multitemporal version. Extensive experiments on the SmartHome
dataset and the large-scale NTU RGB-D dataset demonstrate that our method
outperforms most RNN-based methods, which verifies the complementary property
between spatial and temporal information and the robustness to noise.
Comment: 5 pages, 6 figures, 3 tables
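To illustrate why a 3D convolution captures spatial and temporal structure at once, here is a minimal NumPy sketch of a single "valid" 3D cross-correlation. It is not the paper's two-stream model: the function name `conv3d_valid`, the toy clip shape, and the averaging kernel are assumptions chosen only for illustration.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation.

    volume: array of shape (T, H, W), e.g. frames x height x width.
    kernel: array of shape (t, h, w).
    Each output value mixes t consecutive frames and an h x w spatial
    neighbourhood, so one filter sees space and time together.
    """
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                window = volume[i:i + t, j:j + h, k:k + w]
                out[i, j, k] = np.sum(window * kernel)
    return out

# Hypothetical toy input: 8 frames of a 6x6 grid (not real skeleton data).
rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 6, 6))
kernel = np.ones((3, 3, 3)) / 27.0  # simple spatiotemporal averaging filter
features = conv3d_valid(clip, kernel)
print(features.shape)  # (6, 4, 4)
```

A real 3D CNN stacks many such filters with learned weights, nonlinearities, and pooling; the sketch only shows the sliding spatiotemporal window.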
Dynamic Kernel Distillation for Efficient Pose Estimation in Videos
Existing video-based human pose estimation methods apply large networks to
every frame in the video to localize body joints, which incurs a high
computational cost and can hardly meet the low-latency requirements of
realistic applications. To address this issue, we propose a novel Dynamic
Kernel Distillation (DKD) model that enables small networks to estimate
human poses in videos, thus significantly improving efficiency. In
particular, DKD introduces a lightweight distillator that distills pose
kernels online by leveraging temporal cues from the previous frame in a one-shot
feed-forward manner. Then, DKD simplifies body joint localization into a
matching procedure between the pose kernels and the current frame, which can be
efficiently computed via simple convolution. In this way, DKD quickly transfers
pose knowledge from one frame to provide compact guidance for body joint
localization in the following frame, which enables the use of small
networks in video-based pose estimation. To facilitate the training process,
DKD exploits a temporally adversarial training strategy that introduces a
temporal discriminator to help generate temporally coherent pose kernels and
pose estimation results over a long range. Experiments on the Penn Action and
Sub-JHMDB benchmarks demonstrate the superior efficiency of DKD, specifically a
10x reduction in FLOPs and a 2x speedup over the previous best model, as well as
its state-of-the-art accuracy.
Comment: To appear in ICCV 201
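The matching step described above, in which pose kernels are convolved with the current frame to localize joints, can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the kernels here are hand-made rather than produced by DKD's distillator, and the names `match_pose_kernels` and `locate_joints`, the feature shapes, and the toy data are all hypothetical.

```python
import numpy as np

def match_pose_kernels(features, kernels):
    """Cross-correlate frame features with one kernel per joint.

    features: (C, H, W) feature map of the current frame.
    kernels:  (K, C, kh, kw), one pose kernel per joint (assumed given).
    Returns heatmaps of shape (K, H - kh + 1, W - kw + 1).
    """
    C, H, W = features.shape
    K, Ck, kh, kw = kernels.shape
    assert C == Ck, "kernel and feature channel counts must match"
    heatmaps = np.zeros((K, H - kh + 1, W - kw + 1))
    for y in range(H - kh + 1):
        for x in range(W - kw + 1):
            patch = features[:, y:y + kh, x:x + kw]
            # Dot product of every joint kernel with the same patch.
            heatmaps[:, y, x] = np.tensordot(
                kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return heatmaps

def locate_joints(heatmaps):
    """Return the (row, col) of the peak response for each joint."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat, (H, W)), axis=1)

# Toy data: two joints, each marked by a 3x3 blob in its own channel.
features = np.zeros((2, 16, 16))
features[0, 2:5, 5:8] = 1.0     # joint 0 pattern, top-left corner (2, 5)
features[1, 10:13, 7:10] = 1.0  # joint 1 pattern, top-left corner (10, 7)

kernels = np.zeros((2, 2, 3, 3))
kernels[0, 0] = 1.0  # kernel for joint 0 responds to channel 0
kernels[1, 1] = 1.0  # kernel for joint 1 responds to channel 1

heatmaps = match_pose_kernels(features, kernels)
joints = locate_joints(heatmaps)
print(joints.tolist())  # [[2, 5], [10, 7]]
```

The reported positions are the top-left corners of the best-matching windows. Because matching reduces to a single convolution per frame, the per-frame cost depends on the kernel and feature sizes rather than on a large network.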