Neural Information Processing Techniques for Skeleton-Based Action Recognition

Abstract

Human action recognition is one of the core research problems in human-centered computing and computer vision. It lays the technical foundation for a wide range of applications, such as human-robot interaction, virtual reality, and sports analysis. Recently, skeleton-based action recognition, a subarea of action recognition, has been rapidly gaining attention and popularity. The task is to recognize human actions from the motion of body articulation points. Compared with other data modalities, 3D human skeleton representations have many desirable characteristics, including succinctness, robustness, and racial impartiality. Current research on skeleton-based action recognition primarily concentrates on designing new spatial and temporal neural network operators to extract action features more thoroughly. In this thesis, by contrast, we aim to propose methods that can be readily combined with existing approaches; that is, we seek to strengthen current algorithms rather than compete with them. To this end, we propose five techniques and one large-scale human skeleton dataset. First, we propose fusing higher-order spatial features, in the form of angular encodings, into modern architectures to robustly capture the relationships between joints and body parts. Many skeleton-based action recognizers are confused by actions with similar motion trajectories; the proposed angular features disambiguate such actions, achieving new state-of-the-art (SOTA) accuracy on two large benchmarks, NTU60 and NTU120, while using fewer parameters and less run time. Second, we design two temporal accessories that help existing skeleton-based action recognizers capture motion patterns more richly. Specifically, the two modules alleviate the adverse influence of signal noise and guide networks to explicitly capture the chronological order of a sequence. These accessories enable a simple skeleton-based action recognizer to achieve new SOTA accuracy on two large benchmark datasets. Third, we devise a new form of graph neural network as a potential backbone for extracting topological information from skeletonized human sequences. The proposed graph neural network learns the relative positions of nodes within a graph, substantially improving performance on various synthetic and real-world graph datasets while scaling stably. Fourth, we propose an information-theoretic technique for imbalanced datasets, i.e., datasets whose class-label distribution is non-uniform. The proposed method improves classification accuracy when the training dataset is imbalanced, and our result offers an alternative view: neural network classifiers are mutual information estimators. Fifth, we present a neural crowdsourcing method to correct human annotation errors. When annotating skeleton-based actions, human annotators may fail to agree on an action category because the skeleton motion trajectories of different actions can be ambiguous; the proposed method unifies divergent annotations into a single label. Sixth, we collect ANUBIS, a large-scale human skeleton dataset, for benchmarking existing methods and defining new problems toward the commercialization of skeleton-based action recognition. Using ANUBIS, we evaluate the performance of current skeleton-based action recognizers.
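
To make the angular-encoding idea concrete, the following is a minimal sketch, assuming a (frames, joints, 3) coordinate layout and illustrative joint triples; it is not the thesis's exact formulation. It shows how angles spanned at a center joint by two neighboring joints can be computed per frame and appended to the raw coordinates as extra input channels.

```python
# Minimal, illustrative sketch of angular features from 3D joints.
# The joint triples and array layout are assumptions for illustration only.
import numpy as np

# Hypothetical (center, end1, end2) triples: the angle at `center` spanned by
# the vectors toward `end1` and `end2` (indices follow no particular dataset).
ANGLE_TRIPLES = [
    (3, 2, 4),   # e.g., angle at an elbow between shoulder and wrist
    (6, 5, 7),
]

def angular_encoding(joints: np.ndarray) -> np.ndarray:
    """joints: (T, J, 3) array of 3D joint positions over T frames.
    Returns a (T, len(ANGLE_TRIPLES)) array of angles in radians."""
    feats = []
    for center, end1, end2 in ANGLE_TRIPLES:
        v1 = joints[:, end1] - joints[:, center]   # (T, 3)
        v2 = joints[:, end2] - joints[:, center]   # (T, 3)
        cos = np.sum(v1 * v2, axis=-1) / (
            np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1) + 1e-8)
        feats.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.stack(feats, axis=-1)

# Usage: concatenate these angles with raw joint coordinates before feeding
# the sequence to an existing skeleton-based action recognizer.
angles = angular_encoding(np.random.rand(64, 25, 3))  # 64 frames, 25 joints
```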
At the end of this thesis, we summarize the proposed methods and identify four technical problems that may need to be solved before skeleton-based action recognition can be commercialized in real-world settings.