Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos
Taking advantage of human pose data for understanding human activities has
attracted much attention these days. However, state-of-the-art pose estimators
struggle to obtain high-quality 2D or 3D pose data due to occlusion,
truncation, and low resolution in real-world un-annotated videos. Hence, in this
work, we propose 1) a Selective Spatio-Temporal Aggregation mechanism, named
SST-A, that refines and smooths the keypoint locations extracted by multiple
expert pose estimators, 2) an effective weakly-supervised self-training
framework which leverages the aggregated poses as pseudo ground-truth instead
of handcrafted annotations for real-world pose estimation. Extensive
experiments are conducted for evaluating not only the upstream pose refinement
but also the downstream action recognition performance on four datasets, Toyota
Smarthome, NTU-RGB+D, Charades, and Kinetics-50. We demonstrate that the
skeleton data refined by our Pose-Refinement system (SSTA-PRS) is effective at
boosting various existing action recognition models, which achieve competitive
or state-of-the-art performance.
Comment: WACV202
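The abstract describes refining and smoothing keypoints extracted by several pose estimators. The following is only a minimal sketch of that general idea, not the SST-A mechanism itself: it assumes a simple selection rule (keep the most confident estimator per frame and joint) followed by a moving-average temporal filter. The function name `aggregate_poses`, the array layout, and the `window` parameter are hypothetical choices made for illustration.

```python
import numpy as np

def aggregate_poses(poses, conf, window=3):
    """Combine keypoints from several estimators, then smooth them over time.

    poses: array of shape (E, T, J, 2), 2D keypoints from E estimators
    conf:  array of shape (E, T, J), per-keypoint confidence scores
    window: odd length of the moving-average filter (assumed smoothing rule)
    """
    poses = np.asarray(poses, dtype=float)
    conf = np.asarray(conf, dtype=float)

    # Selection step (assumption): per frame and joint, keep the keypoint
    # from the estimator that reports the highest confidence.
    best = np.argmax(conf, axis=0)                      # shape (T, J)
    selected = np.take_along_axis(
        poses, best[None, :, :, None], axis=0
    )[0]                                                # shape (T, J, 2)

    # Temporal step (assumption): moving average with edge padding.
    pad = window // 2
    padded = np.pad(selected, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    frames = selected.shape[0]
    return np.stack([padded[t:t + window].mean(axis=0) for t in range(frames)])
```

The smoothed output could then serve as pseudo ground truth for self-training, as the abstract outlines. The selection and smoothing rules in the actual system may differ substantially from this sketch.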
Learning Graph Convolutional Network for Skeleton-based Human Action Recognition by Neural Searching
Human action recognition from skeleton data, fueled by the Graph
Convolutional Network (GCN), has attracted lots of attention, due to its
powerful capability of modeling non-Euclidean structure data. However, many
existing GCN methods provide a pre-defined graph and fix it through the entire
network, which can lose implicit joint correlations. Besides, the mainstream
spectral GCN is approximated by a first-order hop, so higher-order connections
are not well captured. Therefore, huge efforts are required to explore a better
GCN architecture. To address these problems, we turn to Neural Architecture
Search (NAS) and propose the first automatically designed GCN for
skeleton-based action recognition. Specifically, we enrich the search space by
providing multiple dynamic graph modules after fully exploring the
spatial-temporal correlations between nodes. Besides, we introduce multiple-hop
modules and expect to break the limitation of representational capacity caused
by the first-order approximation. Moreover, a sampling- and memory-efficient
evolution strategy is proposed to search for an optimal architecture for this task.
The resulting architecture proves the effectiveness of the higher-order
approximation and the dynamic graph modeling mechanism with temporal
interactions, which has barely been discussed before. To evaluate the performance of
the searched model, we conduct extensive experiments on two very large-scale
datasets, and the results show that our model achieves state-of-the-art results.
Comment: Accepted by AAAI202
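To illustrate why multiple-hop modules reach beyond a first-order approximation, here is a minimal sketch of a graph convolution that sums terms over powers of a normalized adjacency matrix. It is an assumption-laden illustration, not the searched architecture or the NAS procedure from the paper: the function names, the symmetric normalization with self-loops, and the per-hop weight list are all hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize an adjacency matrix after adding self-loops."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def multi_hop_gcn_layer(x, adj, weights):
    """Compute ReLU(sum_k A_hat^k @ x @ weights[k]) for k = 0..K.

    x:       (J, C_in) joint features
    adj:     (J, J) skeleton adjacency matrix
    weights: list of (C_in, C_out) matrices, one per hop order (assumed layout)
    """
    a_hat = normalize_adjacency(adj)
    out = np.zeros((x.shape[0], weights[0].shape[1]))
    a_power = np.eye(adj.shape[0])       # A_hat^0
    for w in weights:
        out += a_power @ x @ w           # contribution of the current hop order
        a_power = a_power @ a_hat        # move to the next hop order
    return np.maximum(out, 0.0)
```

On a three-joint chain, a layer restricted to the first-order term passes no information between the two end joints, while adding the second-order term connects them. This is the kind of higher-order connection the abstract refers to.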