
    Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance

    This paper presents a novel end-to-end method for skeleton-based unsupervised human action recognition. We propose a new architecture built on a convolutional autoencoder that uses graph Laplacian regularization to model the skeletal geometry across the temporal dynamics of actions. Our approach is robust to viewpoint variations thanks to a self-supervised gradient reversal layer that encourages generalization across camera views. The proposed method is validated on the large-scale NTU-60 and NTU-120 datasets, on which it outperforms all prior unsupervised skeleton-based approaches under the cross-subject, cross-view, and cross-setup protocols. Although unsupervised, our learned representation even allows our method to surpass a few supervised skeleton-based action recognition methods. The code is available at: www.github.com/IIT-PAVIS/UHAR_Skeletal_Laplacia
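    As a rough illustration of the two ingredients named in this abstract, the sketch below combines a graph Laplacian smoothness penalty on per-joint features with a gradient reversal layer. It is a minimal PyTorch sketch under assumed shapes; the names GradReverse and laplacian_regularizer and the toy four-joint chain are our own illustration, not the authors' released code.

        import torch

        class GradReverse(torch.autograd.Function):
            """Identity in the forward pass; negates (and scales) gradients backward."""
            @staticmethod
            def forward(ctx, x, lambd):
                ctx.lambd = lambd
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad_output):
                return -ctx.lambd * grad_output, None

        def laplacian_regularizer(feats, adj):
            """Penalize features that vary sharply across connected joints.

            feats: (batch, joints, dim) features per skeleton joint
            adj:   (joints, joints) binary adjacency of the skeleton graph
            """
            deg = torch.diag(adj.sum(dim=1))
            lap = deg - adj                      # combinatorial graph Laplacian
            # tr(F^T L F), averaged over the batch
            return torch.einsum('bjd,jk,bkd->', feats, lap, feats) / feats.shape[0]

        # Toy usage: 2 skeletons, 4 joints in a chain, 8-dim features.
        adj = torch.tensor([[0., 1., 0., 0.],
                            [1., 0., 1., 0.],
                            [0., 1., 0., 1.],
                            [0., 0., 1., 0.]])
        feats = torch.randn(2, 4, 8, requires_grad=True)
        loss = laplacian_regularizer(feats, adj)
        view_branch_input = GradReverse.apply(feats, 1.0)  # would feed a view classifier
        loss.backward()

    The gradient reversal trick makes the shared encoder adversarial to a view classifier: the classifier learns to predict the camera view, while the reversed gradients push the encoder toward view-invariant features.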

    Human Action Recognition from Various Data Modalities: A Review

    Human Action Recognition (HAR), which aims to understand human behaviors and assign category labels to them, has a wide range of applications and has therefore been attracting increasing attention in computer vision. Human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared sequences, point clouds, event streams, audio, acceleration, radar, and WiFi, which encode different sources of useful yet distinct information and suit different application scenarios. Consequently, many existing works have investigated different types of approaches to HAR using these modalities. In this paper, we give a comprehensive survey of HAR from the perspective of the input data modalities. Specifically, we review both hand-crafted feature-based and deep learning-based methods for single data modalities, as well as methods based on multiple modalities, including fusion-based frameworks and co-learning-based approaches. The current benchmark datasets for HAR are also introduced. Finally, we discuss some potentially important research directions in this area.
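    In their simplest "late fusion" form, the fusion-based frameworks this survey covers reduce to combining per-modality class scores. The sketch below is a generic two-stream illustration of that idea, not any specific method from the survey; the class name, feature dimensions, and equal-weight averaging are all assumptions.

        import torch
        import torch.nn as nn

        class LateFusionHAR(nn.Module):
            """Toy two-stream late fusion: each modality gets its own head,
            and the class logits are averaged at the end."""
            def __init__(self, rgb_dim, skel_dim, num_classes):
                super().__init__()
                self.rgb_head = nn.Sequential(nn.Linear(rgb_dim, 128), nn.ReLU(),
                                              nn.Linear(128, num_classes))
                self.skel_head = nn.Sequential(nn.Linear(skel_dim, 128), nn.ReLU(),
                                               nn.Linear(128, num_classes))

            def forward(self, rgb_feat, skel_feat):
                # Average the two streams' logits (simple late fusion).
                return 0.5 * (self.rgb_head(rgb_feat) + self.skel_head(skel_feat))

        model = LateFusionHAR(rgb_dim=512, skel_dim=75, num_classes=60)
        logits = model(torch.randn(4, 512), torch.randn(4, 75))  # batch of 4 clips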

    Bidirectional skeleton-based isolated sign recognition using graph convolution networks

    To improve computer-based recognition of isolated American Sign Language (ASL) signs from video, we propose a new skeleton-based method that explicitly detects the start and end frames of signs, trained on the ASLLVD dataset, and that uses linguistically relevant parameters derived from the skeleton input. Our method employs a bidirectional learning approach within a Graph Convolutional Network (GCN) framework. We apply this method to the WLASL dataset, with corrections to the gloss labeling to ensure consistency in the labels assigned to different signs; a one-to-one correspondence between signs and text-based gloss labels is important. We achieve top-1 accuracy of 77.43% and top-5 accuracy of 94.54% on this modified WLASL dataset. Our method, which does not require multi-modal data input, outperforms other state-of-the-art approaches on the same modified WLASL dataset, demonstrating the importance of both attention to the start and end frames of signs and the use of bidirectional data streams in GCNs for isolated sign recognition.
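    A minimal sketch of the "bidirectional data streams" idea: run the same spatial graph convolution over the frame sequence both forward and time-reversed, then pool the two passes for classification. The class names, the single-layer design, and the placeholder adjacency are our assumptions for illustration, not the authors' architecture.

        import torch
        import torch.nn as nn

        class SimpleGCNLayer(nn.Module):
            """One spatial graph convolution over skeleton joints: X' = relu(A X W)."""
            def __init__(self, adj, in_dim, out_dim):
                super().__init__()
                self.register_buffer('adj', adj)      # (joints, joints), normalized
                self.lin = nn.Linear(in_dim, out_dim)

            def forward(self, x):                     # x: (batch, time, joints, dim)
                return torch.relu(self.lin(torch.einsum('jk,btkd->btjd', self.adj, x)))

        class BiDirectionalGCN(nn.Module):
            """Apply the GCN to the sequence forward and time-reversed, then pool."""
            def __init__(self, adj, in_dim, hid_dim, num_classes):
                super().__init__()
                self.gcn = SimpleGCNLayer(adj, in_dim, hid_dim)
                self.cls = nn.Linear(2 * hid_dim, num_classes)

            def forward(self, x):                     # x: (batch, time, joints, dim)
                fwd = self.gcn(x).mean(dim=(1, 2))             # pool time and joints
                bwd = self.gcn(x.flip(dims=[1])).mean(dim=(1, 2))
                return self.cls(torch.cat([fwd, bwd], dim=-1))

        adj = torch.eye(27)                           # placeholder skeleton graph
        model = BiDirectionalGCN(adj, in_dim=3, hid_dim=64, num_classes=100)
        logits = model(torch.randn(2, 50, 27, 3))     # 2 clips, 50 frames, 27 joints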

    Neural Information Processing Techniques for Skeleton-Based Action Recognition

    Human action recognition is one of the core research problems in human-centered computing and computer vision. It lays the technical foundation for a wide range of applications, such as human-robot interaction, virtual reality, and sports analysis. Recently, skeleton-based action recognition, as a subarea of action recognition, has been rapidly gaining attention and popularity. The task is to recognize actions from the motion of human articulation points. Compared with other data modalities, 3D human skeleton representations have many desirable characteristics, including succinctness, robustness, and racial impartiality. Current research on skeleton-based action recognition primarily concentrates on designing new spatial and temporal neural network operators to extract action features more thoroughly. In this thesis, by contrast, we aim to propose methods that can be combined with existing approaches: we seek to strengthen current algorithms rather than compete with them. To this end, we propose five techniques and one large-scale human skeleton dataset. First, we fuse higher-order spatial features, in the form of angular encodings, into modern architectures to robustly capture the relationships between joints and body parts. Many skeleton-based action recognizers are confused by actions with similar motion trajectories; the proposed angular features resolve such confusions, achieving new state-of-the-art accuracy on two large benchmarks, NTU60 and NTU120, while using fewer parameters and less run time. Second, we design two temporal accessories that help existing skeleton-based action recognizers capture motion patterns more richly. Specifically, the two proposed modules alleviate the adverse influence of signal noise and guide networks to explicitly capture the sequence's chronological order; together they enable a simple skeleton-based action recognizer to achieve new state-of-the-art (SOTA) accuracy on two large benchmark datasets. Third, we devise a new form of graph neural network as a potential new backbone for extracting topological information from skeletonized human sequences. The proposed graph neural network learns relative positions between the nodes within a graph, substantially improving performance on various synthetic and real-world graph datasets while scaling stably. Fourth, we propose an information-theoretic technique for imbalanced datasets, i.e., those whose categorical distribution of class labels is non-uniform. The proposed method improves classification accuracy when the training dataset is imbalanced, and our result offers an alternative view: neural network classifiers are mutual information estimators. Fifth, we present a neural crowdsourcing method to correct human annotation errors. When annotating skeleton-based actions, human annotators may not agree on an action category because skeleton motion trajectories from different actions can be ambiguous; the proposed method helps unify the differing annotations into a single label. Sixth, we collect ANUBIS, a large-scale human skeleton dataset, for benchmarking existing methods and defining new problems toward the commercialization of skeleton-based action recognition; using ANUBIS, we evaluate the performance of current skeleton-based action recognizers.
    At the end of this thesis, we summarize the proposed methods and identify four technical problems that may need to be solved before skeleton-based action recognition can be commercialized in practice.
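    As a rough illustration of the angular-encoding idea in the first contribution, the sketch below computes, for each chosen joint, the angle between the two bones meeting at it from raw 3D coordinates. The joint indices, function name, and triple format are our assumptions for illustration, not the thesis's exact feature definition.

        import numpy as np

        def joint_angles(skel, triples):
            """Angle at joint j between bones (j->a) and (j->b), per frame.

            skel:    (frames, joints, 3) array of 3D joint coordinates
            triples: list of (a, j, b) joint index triples
            returns: (frames, len(triples)) array of angles in radians
            """
            out = []
            for a, j, b in triples:
                u = skel[:, a] - skel[:, j]                  # bone j->a
                v = skel[:, b] - skel[:, j]                  # bone j->b
                cos = (u * v).sum(-1) / (
                    np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + 1e-8)
                out.append(np.arccos(np.clip(cos, -1.0, 1.0)))
            return np.stack(out, axis=-1)

        # Toy usage: elbow angle (shoulder=4, elbow=5, wrist=6) over 30 frames.
        skel = np.random.randn(30, 25, 3)
        angles = joint_angles(skel, [(4, 5, 6)])             # shape (30, 1)

    Angles of this kind are invariant to the skeleton's global position and orientation, which is one reason they help disambiguate actions whose joint trajectories look similar in raw coordinates.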

    Action recognition in depth videos using nonparametric probabilistic graphical models

    Action recognition involves automatically labelling videos that contain human motion with action classes. It has applications in diverse areas such as smart surveillance, human-computer interaction, and content retrieval. The recent advent of depth sensing technology that produces depth image sequences has created opportunities to address the challenging action recognition problem. Depth images facilitate robust estimation of a human skeleton's 3D joint positions, and a high-level action can be inferred from a sequence of these joint positions. A natural way to model a sequence of joint positions is a graphical model that describes probabilistic dependencies between the observed joint positions and some hidden state variables. A problem with these models is that the number of hidden states must be fixed a priori, even though for many applications this number is not known in advance. This thesis proposes nonparametric variants of graphical models in which the number of hidden states is automatically inferred from data. Inference is performed in a fully Bayesian setting by using the Dirichlet Process as a prior over the model's infinite-dimensional parameter space. The thesis describes three original constructions of nonparametric graphical models, applied to the classification of actions in depth videos. First, the action classes are represented by a Hidden Markov Model (HMM) with an unbounded number of hidden states; the formulation enables information sharing and discriminative learning of parameters. Second, a hierarchical HMM with an unbounded number of actions and poses is used to represent activities; this construction yields a simplified model for activity classification by using logistic regression to capture the relationship between action states and activity labels. Finally, the action classes are modelled by a Hidden Conditional Random Field (HCRF) with the number of intermediate hidden states learned from data. Tractable inference procedures based on Markov Chain Monte Carlo (MCMC) techniques are derived for all these constructions. Experiments with multiple benchmark datasets confirm the efficacy of the proposed approaches for action recognition.
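    The Dirichlet Process prior mentioned above is commonly realized through a stick-breaking construction, which is what lets the number of hidden states be inferred rather than fixed. Below is a generic truncated stick-breaking sampler in NumPy, an illustration of the DP prior itself rather than of the thesis's MCMC inference procedures.

        import numpy as np

        def stick_breaking(alpha, num_sticks, rng=None):
            """Sample mixture weights from a (truncated) Dirichlet Process prior.

            Each weight takes a fraction v_k ~ Beta(1, alpha) of the remaining
            stick, so the number of effectively used components is inferred
            from alpha rather than fixed in advance.
            """
            rng = np.random.default_rng() if rng is None else rng
            v = rng.beta(1.0, alpha, size=num_sticks)
            remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
            return v * remaining

        weights = stick_breaking(alpha=2.0, num_sticks=50)
        print(weights.sum())             # close to 1 for a large enough truncation
        print((weights > 1e-3).sum())    # "effective" number of hidden states

    A small concentration parameter alpha puts most mass on a few sticks (few hidden states); a larger alpha spreads mass across many, which is how the model's complexity adapts to the data.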