Self-Supervised 3D Action Representation Learning with Skeleton Cloud Colorization
3D Skeleton-based human action recognition has attracted increasing attention
in recent years. Most of the existing work focuses on supervised learning which
requires a large number of labeled action sequences that are often expensive
and time-consuming to annotate. In this paper, we address self-supervised 3D
action representation learning for skeleton-based action recognition. We
investigate self-supervised representation learning and design a novel skeleton
cloud colorization technique that is capable of learning spatial and temporal
skeleton representations from unlabeled skeleton sequence data. We represent a
skeleton action sequence as a 3D skeleton cloud and colorize each point in the
cloud according to its temporal and spatial orders in the original
(unannotated) skeleton sequence. Leveraging the colorized skeleton point cloud,
we design an auto-encoder framework that can learn spatial-temporal features
from the artificial color labels of skeleton joints effectively. Specifically,
we design a two-stream pretraining network that leverages fine-grained and
coarse-grained colorization to learn multi-scale spatial-temporal features. In
addition, we design a Masked Skeleton Cloud Repainting task that can pretrain
the designed auto-encoder framework to learn informative representations. We
evaluate our skeleton cloud colorization approach with linear classifiers
trained under different configurations, including unsupervised,
semi-supervised, fully-supervised, and transfer learning settings. Extensive
experiments on NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and UWA3D datasets
show that the proposed method outperforms existing unsupervised and
semi-supervised 3D action recognition methods by large margins and achieves
competitive performance in supervised 3D action recognition as well.
Comment: This work is an extension of our ICCV 2021 paper [arXiv:2108.01959]
https://openaccess.thecvf.com/content/ICCV2021/html/Yang_Skeleton_Cloud_Colorization_for_Unsupervised_3D_Action_Representation_Learning_ICCV_2021_paper.htm
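The colorization idea above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: each point of the flattened skeleton cloud receives color channels encoding its temporal order (frame index) and spatial order (joint index), normalized to [0, 1]; the function name and channel assignment are assumptions.

```python
import numpy as np

def colorize_skeleton_cloud(seq):
    """Hypothetical sketch: seq has shape (T, J, 3) -- T frames, J joints,
    3D coordinates. Flatten the sequence into a point cloud of T*J points
    and attach color channels that encode each point's temporal order
    (normalized frame index) and spatial order (normalized joint index)."""
    T, J, _ = seq.shape
    points = seq.reshape(T * J, 3)
    t_order = np.repeat(np.arange(T), J) / max(T - 1, 1)  # temporal order
    j_order = np.tile(np.arange(J), T) / max(J - 1, 1)    # spatial order
    colors = np.stack([t_order, j_order, np.zeros(T * J)], axis=1)
    return np.concatenate([points, colors], axis=1)       # shape (T*J, 6)
```

An auto-encoder pretrained to repaint these artificial colors from masked or uncolored input must recover each point's position in time and in the skeleton, which is what makes the colors a self-supervision signal.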
Interaction-aware spatio-temporal pyramid attention networks for action classification
For CNN-based visual action recognition, accuracy can be increased by focusing on local key action regions. Self-attention serves exactly this purpose: it concentrates on key features and ignores irrelevant information, which makes it useful for action recognition. However, current self-attention methods usually ignore the correlations among local feature vectors at spatial positions in CNN feature maps. In this paper, we propose an effective interaction-aware self-attention model that extracts information about the interactions between feature vectors to learn attention maps. Since different layers in a network capture feature maps at different scales, we introduce a spatial pyramid over the feature maps of multiple layers for attention modeling. The multi-scale information yields more accurate attention scores, which are used to weight the local feature vectors in the feature maps and compute the attention feature maps. Since the number of feature maps input to the spatial pyramid attention layer is unrestricted, we easily extend this attention layer to a spatial-temporal version. Our model can be embedded into any general CNN to form a video-level end-to-end attention network for action recognition. Besides using the RGB stream alone, several methods are investigated to combine the RGB and flow streams for the final prediction of human action classes. Experimental results show that our method achieves state-of-the-art results on the UCF101, HMDB51, Kinetics-400, and untrimmed Charades datasets.
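The core mechanism described above, deriving attention weights from interactions between local feature vectors rather than from each vector alone, can be illustrated with a minimal single-scale sketch. This is an assumed simplification of the paper's model (no pyramid, numpy instead of a CNN framework); the function name and the mean-pooling of interactions are choices made here for illustration.

```python
import numpy as np

def interaction_attention(fmap):
    """Hypothetical single-scale sketch of interaction-aware attention.
    fmap: (C, H, W) feature map. The C-dim vector at each spatial position
    is a local feature vector; pairwise dot products between all positions
    capture their interactions, which are pooled into one score per
    position, softmaxed into an attention map, and used to reweight the
    feature vectors."""
    C, H, W = fmap.shape
    x = fmap.reshape(C, H * W)               # N = H*W local feature vectors
    interactions = x.T @ x                    # (N, N) pairwise interactions
    scores = interactions.mean(axis=1)        # pool interactions per position
    scores -= scores.max()                    # stabilize the softmax
    attn = np.exp(scores) / np.exp(scores).sum()
    weighted = x * attn[None, :]              # attention-weighted features
    return weighted.reshape(C, H, W), attn.reshape(H, W)
```

The spatial-temporal extension mentioned in the abstract would stack feature maps from several frames (and several pyramid levels) along the position axis before computing the interaction matrix.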
Multi-scale 3D Convolution Network for Video Based Person Re-Identification
This paper proposes a two-stream convolution network to extract spatial and
temporal cues for video based person Re-Identification (ReID). A temporal
stream in this network is constructed by inserting several Multi-scale 3D (M3D)
convolution layers into a 2D CNN network. The resulting M3D convolution network
introduces a fraction of parameters into the 2D CNN, but gains the ability of
multi-scale temporal feature learning. With this compact architecture, M3D
convolution network is also more efficient and easier to optimize than existing
3D convolution networks. The temporal stream further involves Residual
Attention Layers (RAL) to refine the temporal features. By jointly learning
spatial-temporal attention masks in a residual manner, RAL identifies the
discriminative spatial regions and temporal cues. The other stream in our
network is implemented with a 2D CNN for spatial feature extraction. The
spatial and temporal features from two streams are finally fused for the video
based person ReID. Evaluations on three widely used benchmark datasets, i.e.,
MARS, PRID2011, and iLIDS-VID, demonstrate the substantial advantages of our
method over existing 3D convolution networks and state-of-the-art methods.
Comment: AAAI, 201