Learning Scene Flow With Skeleton Guidance For 3D Action Recognition

Author: Magoulianitis Vasileios
Psaltis Athanasios
Publication date: 23/06/2023

Among the existing modalities for 3D action recognition, 3D flow has been poorly examined, although it conveys rich motion cues for human actions. Presumably, its susceptibility to noise renders it intractable, which complicates the learning process within deep models. This work demonstrates the use of 3D flow sequences by a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided by the skeleton domain, that emphasizes motion features close to the body joint areas according to their informativeness. To this end, an extended deep skeleton model is also introduced to learn the most discriminant action motion dynamics, so as to estimate an informativeness score for each joint. Subsequently, a late fusion scheme is adopted between the two models to learn the high-level cross-modal correlations. Experimental results on NTU RGB+D, currently the largest and most challenging dataset, demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art results. Comment: 18 pages, 3 figures, 3 tables, conferenc

arXiv.org e-Print Archive
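
The abstract above describes attention that emphasizes motion features near body joints, weighted by a per-joint informativeness score. The following is only a rough sketch of that idea, not the authors' two-level mechanism: the Gaussian falloff, the `sigma` parameter, the peak normalization, and the function names `joint_attention_map` and `apply_attention` are all assumptions made for illustration.

```python
import math

def joint_attention_map(height, width, joints, scores, sigma=1.5):
    """Build a 2D attention map from joint positions (hypothetical helper).

    joints: list of (row, col) joint locations projected onto the flow grid.
    scores: one non-negative informativeness score per joint (assumed given).
    Each joint contributes a Gaussian bump scaled by its score.
    """
    if len(joints) != len(scores):
        raise ValueError("need exactly one score per joint")
    attn = [[0.0] * width for _ in range(height)]
    for (jr, jc), score in zip(joints, scores):
        for r in range(height):
            for c in range(width):
                dist_sq = (r - jr) ** 2 + (c - jc) ** 2
                attn[r][c] += score * math.exp(-dist_sq / (2 * sigma ** 2))
    # Normalize so the strongest location has weight 1.0.
    peak = max(max(row) for row in attn)
    if peak > 0:
        attn = [[v / peak for v in row] for row in attn]
    return attn

def apply_attention(flow_magnitude, attn):
    """Reweight a per-pixel flow feature map element-wise by the attention map."""
    return [[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(flow_magnitude, attn)]

# Two joints on a 5x5 grid; the first is assumed twice as informative.
attn = joint_attention_map(5, 5, joints=[(1, 1), (3, 3)], scores=[1.0, 0.5])
flow = [[1.0] * 5 for _ in range(5)]
weighted = apply_attention(flow, attn)
```

In this sketch, flow values far from every joint are suppressed, while values near the more informative joint are retained almost unchanged.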

Asymmetric 3D Convolutional Neural Networks for Action Recognition

Author: Du Y.
Hu W.
Li B.
Maybank Stephen J.
Xing J.
Yang H.
Yuan C.
Publication venue: 'Elsevier BV'
Publication date: 24/07/2018

Convolutional Neural Network based action recognition methods have achieved significant improvements in recent years. The 3D convolution extends the 2D convolution from operating on a single frame to operating on a video clip, so it is able to extract effective spatial-temporal features for better analysis of human activities in videos. The 3D convolution, however, involves many more parameters than the 2D convolution. Thus, it is computationally expensive, costly to store, and difficult to train. In this work, we propose efficient asymmetric one-directional 3D convolutions to approximate the traditional 3D convolution. To improve the feature learning capacity of asymmetric 3D convolutions, we design a set of local 3D convolutional networks, i.e. MicroNets, to incorporate multi-scale 3D convolution branches. Then, we design an asymmetric 3D-CNN deep model, constructed from MicroNets, for the action recognition task. Moreover, to avoid training two networks on RGB and optical flow fields separately as most works do, we propose a simple but effective multi-source enhanced input, which fuses the useful information of the RGB frame and the optical flow field at the pre-processing stage. We evaluate our asymmetric 3D-CNN models on two of the most challenging action recognition benchmarks, UCF-101 and HMDB-51. Our model outperforms all the traditional 3D-CNN models in both effectiveness and efficiency, and is comparable with the recent state-of-the-art action recognition methods on both benchmarks.

Birkbeck Institutional Research Online
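
To make the parameter argument in the abstract above concrete, the sketch below counts the weights of a full k×k×k 3D convolution against a stack of three one-directional convolutions (k×1×1, 1×k×1, 1×1×k). The exact factorization used in the paper is not given in the abstract, so this stacking order, the constant channel width, and the helper names are assumptions.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w, bias=False):
    """Number of learnable parameters in one 3D convolution layer."""
    weights = c_in * c_out * k_t * k_h * k_w
    return weights + (c_out if bias else 0)

def asymmetric_params(c_in, c_out, k):
    """Parameters of three stacked one-directional convolutions (assumed layout)."""
    return (conv3d_params(c_in, c_out, k, 1, 1)      # temporal direction
            + conv3d_params(c_out, c_out, 1, k, 1)   # vertical direction
            + conv3d_params(c_out, c_out, 1, 1, k))  # horizontal direction

full = conv3d_params(64, 64, 3, 3, 3)  # 64 * 64 * 27 weights
asym = asymmetric_params(64, 64, 3)    # 3 layers of 64 * 64 * 3 weights
print(full, asym, full / asym)
```

With equal input and output channels, the full kernel has k³ weights per channel pair and the stacked version has 3k, so the saving grows as k²/3 with kernel size.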

Improved two-stream model for human action recognition

Author: Guan SU
Man KL
Siddique K
Smith J
Zhao Y
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/06/2020

This paper addresses the recognition of human actions in videos. Human action recognition can be seen as the automatic labeling of a video according to the actions occurring in it. It has become one of the most challenging and attractive problems in the pattern recognition and video classification fields. The problem is difficult to solve with traditional video processing methods because of challenges such as background noise, the varying sizes of subjects in different videos, and the speed of actions. Building on the progress of deep learning methods, several directions have been developed to recognize a human action from a video, such as the long short-term memory (LSTM)-based model, the two-stream convolutional neural network (CNN) model, and the convolutional 3D model. In this paper, we focus on the two-stream structure. The traditional two-stream CNN addresses the problem that CNNs do not perform satisfactorily on temporal features. By training a temporal stream, which uses the optical flow as the input, a CNN gains the ability to extract temporal features. However, the optical flow contains only limited temporal information because it records only the movements of pixels along the x-axis and the y-axis. Therefore, we attempt to design and implement a new two-stream model by using an LSTM-based model in its spatial stream to extract both spatial and temporal features from RGB frames. In addition, we implement a DenseNet in the temporal stream to improve the recognition accuracy. This is in contrast to traditional approaches, which typically use the spatial stream to extract only spatial features. The quantitative evaluation and experiments are conducted on the UCF-101 dataset, which is a well-developed public video dataset. For the temporal stream, we choose the optical flow of UCF-101. Images in the optical flow are provided by the Graz University of Technology. The experimental result shows that the proposed method outperforms the traditional two-stream CNN method by at least 3% in accuracy. For both spatial and temporal streams, the proposed model also achieves higher recognition accuracies. In addition, compared with state-of-the-art methods, the new model still achieves the best recognition performance.

University of Liverpool Repository
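
The two-stream structure discussed in the abstract above combines predictions from a spatial stream and a temporal stream. The abstract does not state how the two outputs are merged, so the sketch below assumes a common choice, a weighted average of per-class softmax scores; `fuse_two_streams`, `predict`, and `temporal_weight` are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_streams(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of class probabilities from the two streams (assumed scheme)."""
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1 - w) * ps + w * pt for ps, pt in zip(p_spatial, p_temporal)]

def predict(fused_scores):
    """Index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)

# The spatial stream favors class 0, the temporal stream strongly favors class 1.
fused = fuse_two_streams([2.0, 1.0, 0.1], [0.1, 3.0, 0.2])
label = predict(fused)
```

Here the temporal stream's more confident score outweighs the spatial stream's preference, so the fused prediction is class 1.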

Going Deeper into Action Recognition: A Survey

Author: Harandi Mehrtash
Herath Samitha
Porikli Fatih
Publication date: 01/01/2017

Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis has evolved from earlier schemes that were often limited to controlled environments to today's advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications, from video surveillance to human-computer interaction, scientific milestones in action recognition are being reached ever more rapidly, so that methods once considered good are quickly superseded. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then move on to deep learning based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable fallbacks, in the hope of raising fresh questions and motivating new research directions for the reader.

arXiv.org e-Print Archive

The Australian National University