47,469 research outputs found

    Learning to recognise 3D human action from a new skeleton-based representation using deep convolutional neural networks

    Get PDF
    Recognising human actions in untrimmed videos is an important challenging task. An effective three-dimensional (3D) motion representation and a powerful learning model are two key factors influencing recognition performance. In this study, the authors introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a colour encoding process. By normalising the 3D joint coordinates and dividing each skeleton frame into five parts, where the joints are concatenated according to the order of their physical connections, the colour-coded representation is able to represent spatio-temporal evolutions of complex 3D motions, independently of the length of each sequence. They then design and train different deep convolutional neural networks based on the residual network architecture on the obtained image-based representations to learn 3D motion features and classify them into classes. Their proposed method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches while requiring less computation for training and prediction

    From Line Drawings to Human Actions: Deep Neural Networks for Visual Data Representation

    No full text
    In recent years, deep neural networks have been very successful in computer vision, speech recognition, and artificial intelligent systems. The rapid growth of data and fast increasing computational tools provide solid foundations for the applications which rely on the learning of large scale deep neural networks with millions of parameters. The deep learning approaches have been proved to be able to learn powerful representations of the inputs in various tasks, such as image classification, object recognition, and scene understanding. This thesis demonstrates the generality and capacity of deep learning approaches through a series of case studies including image matching and human activity understanding. In these studies, I explore the combinations of the neural network models with existing machine learning techniques and extend the deep learning approach for each task. Four related tasks are investigated: 1) image matching through similarity learning; 2) human action prediction; 3) finger force estimation in manipulation actions; and 4) bimodal learning for human action understanding. Deep neural networks have been shown to be very efficient in supervised learning. Further, in some tasks, one would like to group the features of the samples in the same category close to each other, in additional to the discriminative representation. Such kind of properties is desired in a number of applications, such as semantic retrieval, image quality measurement, and social network analysis, etc. My first study is to develop a similarity learning method based on deep neural networks for image matching between sketch images and 3D models. In this task, I propose to use Siamese network to learn similarities of sketches and develop a novel method for sketch based 3D shape retrieval. The proposed method can successfully learn the representations of sketch images as well as the similarities, then the 3D shape retrieval problem can be solved with off-the-shelf nearest neighbor methods. After studying the representation learning methods for static inputs, my focus turns to learning the representations of sequential data. To be specific, I focus on manipulation actions, because they are widely used in the daily life and play important parts in the human-robot collaboration system. Deep neural networks have been shown to be powerful to represent short video clips [Donahue et al., 2015]. However, most existing methods consider the action recognition problem as a classification task. These methods assume the inputs are pre-segmented videos and the outputs are category labels. In the scenarios such as the human-robot collaboration system, the ability to predict the ongoing human actions at an early stage is highly important. I first attempt to address this issue with a fast manipulation action prediction method. Then I build the action prediction model based on Long Short-Term Memory (LSTM) architecture. The proposed approach processes the sequential inputs as continuous signals and keeps updating the prediction of the intended action based on the learned action representations. Further, I study the relationships between visual inputs and the physical information, such as finger forces, that involved in the manipulation actions. This is motivated by recent studies in cognitive science which show that the subject’s intention is strongly related to the hand movements during an action execution. Human observers can interpret other’s actions in terms of movements and forces, which can be used to repeat the observed actions. If a robot system has the ability to estimate the force feedbacks, it can learn how to manipulate an object by watching human demonstrations. In this work, the finger forces are estimated by only watching the movement of hands. A modified LSTM model is used to regress the finger forces from video frames. To facilitate this study, a specially designed sensor glove has been used to collect data of finger forces, and a new dataset has been collected to provide synchronized streams of videos and finger forces. Last, I investigate the usefulness of physical information in human action recognition, which is an application of bimodal learning, where both the vision inputs and the additional information are used to learn the action representation. My study demonstrates that, by combining additional information with the vision inputs, the accuracy of human action recognition can be improved steadily. I extend the LSTM architecture to accept both video frames and sensor data as bimodal inputs to predict the action. A hallucination network is jointly trained to approximate the representations of the additional inputs. During the testing stage, the hallucination network generates approximated representations that used for classification. In this way, the proposed method does not rely on the additional inputs for testing

    Going Deeper into Action Recognition: A Survey

    Full text link
    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis evolved from earlier schemes that are often limited to controlled environments to nowadays advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved more rapidly, eventually leading to the demise of what used to be good in a short time. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then, navigate into the realm of deep learning based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable fallbacks, in the hope of raising fresh questions and motivating new research directions for the reader

    Deep Learning on Lie Groups for Skeleton-based Action Recognition

    Full text link
    In recent years, skeleton-based action recognition has become a popular 3D classification problem. State-of-the-art methods typically first represent each motion sequence as a high-dimensional trajectory on a Lie group with an additional dynamic time warping, and then shallowly learn favorable Lie group features. In this paper we incorporate the Lie group structure into a deep network architecture to learn more appropriate Lie group features for 3D action recognition. Within the network structure, we design rotation mapping layers to transform the input Lie group features into desirable ones, which are aligned better in the temporal domain. To reduce the high feature dimensionality, the architecture is equipped with rotation pooling layers for the elements on the Lie group. Furthermore, we propose a logarithm mapping layer to map the resulting manifold data into a tangent space that facilitates the application of regular output layers for the final classification. Evaluations of the proposed network for standard 3D human action recognition datasets clearly demonstrate its superiority over existing shallow Lie group feature learning methods as well as most conventional deep learning methods.Comment: Accepted to CVPR 201

    An original framework for understanding human actions and body language by using deep neural networks

    Get PDF
    The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. While the processing of body movements play a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements; both are essential tasks in many computer vision applications, including event recognition, and video surveillance. In this Ph.D. thesis, an original framework for understanding Actions and body language is presented. The framework is composed of three main modules: in the first one, a Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) based method for the Recognition of Sign Language and Semaphoric Hand Gestures is proposed; the second module presents a solution based on 2D skeleton and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition by using 3D skeleton and Deep Neural Networks (DNNs) is provided. The performances of RNN-LSTMs are explored in depth, due to their ability to model the long term contextual information of temporal sequences, making them suitable for analysing body movements. All the modules were tested by using challenging datasets, well known in the state of the art, showing remarkable results compared to the current literature methods

    Unsupervised Learning of Long-Term Motion Dynamics for Videos

    Get PDF
    We present an unsupervised representation learning approach that compactly encodes the motion dependencies in videos. Given a pair of images from a video clip, our framework learns to predict the long-term 3D motions. To reduce the complexity of the learning framework, we propose to describe the motion as a sequence of atomic 3D flows computed with RGB-D modality. We use a Recurrent Neural Network based Encoder-Decoder framework to predict these sequences of flows. We argue that in order for the decoder to reconstruct these sequences, the encoder must learn a robust video representation that captures long-term motion dependencies and spatial-temporal relations. We demonstrate the effectiveness of our learned temporal representations on activity classification across multiple modalities and datasets such as NTU RGB+D and MSR Daily Activity 3D. Our framework is generic to any input modality, i.e., RGB, Depth, and RGB-D videos.Comment: CVPR 201

    Appearance-and-Relation Networks for Video Classification

    Full text link
    Spatiotemporal feature learning in videos is a fundamental problem in computer vision. This paper presents a new architecture, termed as Appearance-and-Relation Network (ARTNet), to learn video representation in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called as SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented based on the linear combination of pixels or filter responses in each frame, while the relation branch is designed based on the multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART blocks obtain an evident improvement over 3D convolutions for spatiotemporal feature learning. Under the same training setting, ARTNets achieve superior performance on these three datasets to the existing state-of-the-art methods.Comment: CVPR18 camera-ready version. Code & models available at https://github.com/wanglimin/ARTNe
    • …
    corecore