
    An original framework for understanding human actions and body language by using deep neural networks

    The evolution of the fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements it is possible to recognize gestures, which people often use to communicate information non-verbally. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, meanwhile, plays a key role in the action recognition and affective computing fields: the former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements. Both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first, a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) is proposed for the recognition of sign language and semaphoric hand gestures; the second module presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, the last module provides a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs). The performance of LSTM-RNNs is explored in depth, due to their ability to model the long-term contextual information of temporal sequences, which makes them suitable for analysing body movements. All the modules were tested on challenging datasets, well known in the state of the art, showing remarkable results compared to current literature methods.
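    The two-branch stacked LSTM design mentioned for the action recognition module can be illustrated roughly as follows. This is a minimal sketch, not the thesis implementation: the joint split between branches, hidden sizes, sequence length, and class count are all assumptions.

```python
# Minimal sketch of a two-branch stacked LSTM over 2D skeleton sequences.
# All hyperparameters (joint split, hidden size, number of classes) are
# illustrative assumptions, not the thesis configuration.
import torch
import torch.nn as nn

class TwoBranchStackedLSTM(nn.Module):
    def __init__(self, joints_a=9, joints_b=9, hidden=128, num_classes=10):
        super().__init__()
        # Each branch models one subset of the 2D joints (x, y per joint),
        # using two stacked LSTM layers to capture temporal context.
        self.branch_a = nn.LSTM(joints_a * 2, hidden, num_layers=2, batch_first=True)
        self.branch_b = nn.LSTM(joints_b * 2, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden * 2, num_classes)

    def forward(self, seq_a, seq_b):
        # seq_a: (batch, time, joints_a * 2), seq_b: (batch, time, joints_b * 2)
        out_a, _ = self.branch_a(seq_a)
        out_b, _ = self.branch_b(seq_b)
        # Take the last time step of each branch and fuse by concatenation.
        fused = torch.cat([out_a[:, -1], out_b[:, -1]], dim=-1)
        return self.classifier(fused)

model = TwoBranchStackedLSTM()
logits = model(torch.randn(4, 30, 18), torch.randn(4, 30, 18))  # 30-frame clips
print(logits.shape)  # torch.Size([4, 10])
```

    Splitting the skeleton across two branches lets each LSTM specialise on one group of joints before fusion, which is one common motivation for two-stream skeleton models.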

    On human motion prediction using recurrent neural networks

    Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which result in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.
    Comment: Accepted at CVPR 17
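    The "simple baseline that does not attempt to model motion at all" can be read as a zero-velocity predictor: repeat the last observed pose for every future frame. A minimal sketch, with illustrative array shapes assumed:

```python
# Zero-velocity baseline for short-term motion prediction: predict the last
# observed pose for every future frame. Shapes are illustrative assumptions.
import numpy as np

def zero_velocity_baseline(observed, horizon):
    """observed: (time, dof) pose sequence; returns (horizon, dof)."""
    last_pose = observed[-1]
    # Repeat the final pose for each of the `horizon` future frames.
    return np.tile(last_pose, (horizon, 1))

history = np.random.randn(50, 54)                       # 50 observed frames, 54 DoF
prediction = zero_velocity_baseline(history, horizon=25)
print(prediction.shape)                                 # (25, 54)
```

    That such a constant prediction is competitive over short horizons is the paper's motivating observation, and it explains why evaluation methodology matters so much in this literature.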

    Recognizing Human-Object Interactions in Videos

    Understanding human actions that involve interacting with objects is important for a wide range of real-world applications, such as security surveillance and healthcare. In this thesis, three different approaches are presented for addressing the problem of recognizing human-object interactions (HOIs) in videos. Firstly, we propose a hierarchical framework for analyzing human-object interactions in a video sequence. The framework comprises Long Short-Term Memory (LSTM) networks that capture human motion and temporal object information independently. These pieces of information are then combined through a bilinear layer and fed into a global deep LSTM to learn high-level information about HOIs. To concentrate on the key components of human and object temporal information, the proposed approach incorporates an attention mechanism into the LSTMs. Secondly, we aim to achieve a holistic understanding of HOIs by exploiting both their local and global contexts through knowledge distillation. The local context graphs are used to learn the relationship between humans and objects at the frame level by capturing their co-occurrence at a specific time step. The global relation graph, on the other hand, is constructed at the video level, identifying long-term relations between humans and objects throughout a video sequence. We investigate how knowledge from these context graphs can be distilled to their counterparts to improve HOI recognition. Lastly, we propose the Spatio-Temporal Interaction Transformer-based (STIT) network to reason about spatio-temporal changes of humans and objects. Specifically, the spatial transformers learn the local context of humans and objects at specific frame times. The temporal transformer then learns the relations at a higher level between the spatial context representations at different time steps, capturing long-term dependencies across frames. We further investigate multiple hierarchy designs for learning human interactions. The effectiveness of each proposed method is evaluated on various video action datasets that include human-object interactions, such as Charades, CAD-120, and Something-Something V1.
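    The first approach's pipeline (two independent LSTM streams, bilinear fusion, then a global LSTM) can be sketched as below. This is a minimal illustration under assumed feature sizes and class count, without the attention mechanism, and is not the thesis code.

```python
# Sketch of the hierarchical HOI framework: two LSTMs encode human-motion and
# object features independently, a bilinear layer fuses them per time step,
# and a global LSTM learns high-level interaction information. Feature sizes
# and the class count are illustrative assumptions.
import torch
import torch.nn as nn

class HOIBilinearLSTM(nn.Module):
    def __init__(self, human_dim=64, obj_dim=64, hidden=128, num_classes=20):
        super().__init__()
        self.human_lstm = nn.LSTM(human_dim, hidden, batch_first=True)
        self.obj_lstm = nn.LSTM(obj_dim, hidden, batch_first=True)
        # Bilinear layer combines the two streams at every time step.
        self.fuse = nn.Bilinear(hidden, hidden, hidden)
        self.global_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, human_feats, obj_feats):
        # human_feats, obj_feats: (batch, time, feature_dim)
        h, _ = self.human_lstm(human_feats)
        o, _ = self.obj_lstm(obj_feats)
        fused = self.fuse(h, o)            # per-step bilinear fusion
        g, _ = self.global_lstm(fused)
        return self.classifier(g[:, -1])   # classify from the final step

model = HOIBilinearLSTM()
logits = model(torch.randn(2, 16, 64), torch.randn(2, 16, 64))
print(logits.shape)  # torch.Size([2, 20])
```

    Bilinear fusion captures multiplicative interactions between the human and object streams, which concatenation alone would miss; the global LSTM then reasons over the fused sequence.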