Act-VIT: A Representationally Robust Attention Architecture for Skeleton Based Action Recognition Using Vision Transformer
Skeleton-based action recognition has attracted the attention of many researchers, as it is robust to viewpoint and illumination changes and processing skeleton data is much more efficient than processing video frames. With the emergence of deep learning models, it has become very popular to represent skeleton data in pseudo-image form and apply Convolutional Neural Networks (CNNs) for action recognition. Subsequent studies concentrated on finding effective methods for forming these pseudo-images. Recently, attention networks, more specifically transformers, have provided promising results in various vision problems. In this study, the effectiveness of vision transformers for skeleton-based action recognition is examined, and their robustness to the pseudo-image representation scheme is investigated. To this end, a three-level architecture, Act-VIT, is proposed, which forms a set of pseudo-images, applies a classifier to each representation, and combines their results to find the final action class. The classifiers of Act-VIT are realized first by CNNs and then by VITs, and their performances are compared. Experimental studies reveal that the vision transformer is less sensitive to the initial pseudo-image representation than the CNN. Nevertheless, even with the vision transformer, recognition performance can be further improved by a consensus of classifiers.
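
As a rough sketch of the consensus scheme described above (one classifier per pseudo-image representation, results combined by averaging class scores), the following PyTorch snippet may help. It assumes the timm library for a standard ViT backbone; the names PseudoImageEnsemble and n_reps, the averaging rule, and the input sizes are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import timm  # assumed here only to obtain a standard ViT backbone

    class PseudoImageEnsemble(nn.Module):
        """One classifier per pseudo-image representation; class scores are
        combined by simple averaging (one possible consensus rule)."""
        def __init__(self, n_reps: int, n_classes: int):
            super().__init__()
            self.branches = nn.ModuleList(
                timm.create_model("vit_base_patch16_224", num_classes=n_classes)
                for _ in range(n_reps)
            )

        def forward(self, pseudo_images):  # list of (B, 3, 224, 224) tensors
            logits = [branch(x) for branch, x in zip(self.branches, pseudo_images)]
            return torch.stack(logits).mean(dim=0)  # consensus by averaging

    # Three hypothetical pseudo-image encodings of the same skeleton sequence:
    model = PseudoImageEnsemble(n_reps=3, n_classes=60)
    reps = [torch.randn(2, 3, 224, 224) for _ in range(3)]
    action_scores = model(reps)  # (2, 60) class scores after consensus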
Recognition and 3D Localization of Pedestrian Actions from Monocular Video
Understanding and predicting pedestrian behavior is an important and
challenging area of research for realizing safe and effective navigation
strategies in automated and advanced driver assistance technologies in urban
scenes. This paper focuses on monocular pedestrian action recognition and 3D
localization from an egocentric view for the purpose of predicting intention
and forecasting future trajectory. A challenge in addressing this problem in urban traffic scenes is the unpredictable behavior of pedestrians, whose actions and intentions are constantly in flux and depend on the pedestrian's pose, their 3D spatial relations, and their interactions with other agents as well as with the environment. To partially address these challenges, we consider the importance of pose for the recognition and 3D localization of pedestrian actions. In particular, we propose an action recognition framework using a two-stream temporal relation network whose inputs are the raw RGB image sequence of the tracked pedestrian and the pedestrian's pose. The proposed method outperforms methods using a single-stream temporal relation network in evaluations on the public JAAD dataset. The estimated pose and the associated body key-points are also used as input to a network that estimates the pedestrian's 3D location using a unique loss function. Evaluation of our 3D localization method on the KITTI dataset shows an improvement in average localization error compared to existing state-of-the-art methods. Finally, we conduct qualitative tests of action recognition and 3D localization on HRI's H3D driving dataset.
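
To make the two-stream idea concrete, here is a heavily simplified PyTorch sketch of late fusion over an RGB-feature stream and a pose stream. The temporal relation module is approximated by mean pooling over time, and all names and dimensions (TwoStreamActionNet, rgb_dim, 17 keypoints) are assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class TwoStreamActionNet(nn.Module):
        """Late fusion of per-frame RGB features and 2D pose keypoints."""
        def __init__(self, rgb_dim=512, pose_dim=34, hidden=256, n_actions=4):
            super().__init__()
            self.rgb_head = nn.Linear(rgb_dim, hidden)    # per-frame RGB features
            self.pose_head = nn.Linear(pose_dim, hidden)  # per-frame keypoints (17 x 2)
            self.classifier = nn.Linear(2 * hidden, n_actions)

        def forward(self, rgb_feats, pose_kpts):
            # rgb_feats: (B, T, rgb_dim); pose_kpts: (B, T, pose_dim)
            r = self.rgb_head(rgb_feats).mean(dim=1)   # mean pooling stands in
            p = self.pose_head(pose_kpts).mean(dim=1)  # for the temporal relation net
            return self.classifier(torch.cat([r, p], dim=-1))

    model = TwoStreamActionNet()
    scores = model(torch.randn(2, 8, 512), torch.randn(2, 8, 34))  # (2, 4)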
DeepHuMS: Deep Human Motion Signature for 3D Skeletal Sequences
3D human motion indexing and retrieval is an interesting problem due to the rise of several data-driven applications aimed at analyzing and/or re-utilizing 3D human skeletal data, such as data-driven animation, analysis of sports biomechanics, and human surveillance. Spatio-temporal articulations of humans, noisy/missing data, and different speeds of the same motion make the problem challenging, and several existing state-of-the-art methods use hand-crafted features along with optimization-based or histogram-based comparison to perform retrieval. Further, they demonstrate results only on very small datasets with few classes. We make a case for using a learned representation that should recognize the motion as well as enforce a discriminative ranking. To that end, we propose a 3D human motion descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data, addressing the aforementioned challenges, and further enables sub-motion search in its embedding space using another network. Our model exploits inter-class similarity using trajectory cues and performs far better in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human motion datasets: NTU RGB+D and HDM05.
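
As a minimal sketch of "recognize the motion and enforce a discriminative ranking", the snippet below pairs a small recurrent encoder with a triplet margin loss, one common way to impose such a ranking in PyTorch. The encoder, loss, and dimensions (MotionEncoder, 25 joints x 3 coordinates) are assumptions and do not reproduce the paper's network.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MotionEncoder(nn.Module):
        """Maps a skeletal sequence (B, T, joints*3) to a unit-norm embedding."""
        def __init__(self, in_dim=75, emb_dim=128):
            super().__init__()
            self.gru = nn.GRU(in_dim, emb_dim, batch_first=True)

        def forward(self, seq):
            _, h = self.gru(seq)
            return F.normalize(h[-1], dim=-1)  # normalized for retrieval

    enc = MotionEncoder()
    triplet = nn.TripletMarginLoss(margin=0.2)  # enforces a discriminative ranking
    anchor, pos, neg = (enc(torch.randn(4, 30, 75)) for _ in range(3))
    loss = triplet(anchor, pos, neg)  # pulls same-motion pairs together

    # Retrieval then reduces to nearest neighbours in the embedding space:
    gallery = enc(torch.randn(100, 30, 75))
    query = enc(torch.randn(1, 30, 75))
    ranked = (gallery @ query.T).squeeze(1).argsort(descending=True)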