    Video-based human action recognition using deep learning: a review

    Human action recognition is an important application domain in computer vision. Its primary aim is to accurately describe human actions and their interactions from a previously unseen data sequence acquired by sensors. The ability to recognize, understand, and predict complex human actions enables the construction of many important applications such as intelligent surveillance systems, human-computer interfaces, health care, security, and military applications. In recent years, deep learning has received particular attention from the computer vision community. This paper presents an overview of the current state of the art in action recognition using video analysis with deep learning techniques. We present the most important deep learning models for recognizing human actions and analyze them to chart the current progress of deep learning algorithms applied to human action recognition in realistic videos, highlighting their advantages and disadvantages. Based on a quantitative analysis of recognition accuracies reported in the literature, our study identifies state-of-the-art deep architectures in action recognition and then outlines current trends and open problems for future work in this field. This work was supported by the Centre d'études et d'expertise sur les risques, l'environnement, la mobilité et l'aménagement (CEREMA) and the UC3M Conex-Marie Curie Program.

    Deep learning architectures for human action recognition in monocular RGB-D video sequences: application to surveillance in public transport

    This thesis deals with automatic recognition of human actions from monocular RGB-D video sequences. Our main goal is to recognize which human actions occur in unknown videos. This task is challenging because of the variability of acquisition conditions, including the lighting and the position, orientation, and field of view of the camera, as well as the variability in how actions are performed, notably in terms of speed. To tackle these problems, we first review and evaluate the most prominent state-of-the-art techniques to identify the current state of human action recognition in videos. We then propose a new approach for skeleton-based action recognition using Deep Neural Networks (DNNs). Two key questions are addressed. First, how to efficiently represent the spatio-temporal patterns of skeletal data so as to fully exploit the capacity of Deep Convolutional Neural Networks (D-CNNs) to learn high-level representations. Second, how to design a D-CNN architecture able to learn discriminative features from the proposed representation for the classification task. To this end, we introduce two new 3D motion representations, SPMF (Skeleton Posture-Motion Feature) and Enhanced-SPMF, that encode skeleton poses and their motions into color images. For the learning and classification tasks, we design and train different D-CNN architectures based on the Residual Network (ResNet), Inception-ResNet-v2, Densely Connected Convolutional Network (DenseNet), and Efficient Neural Architecture Search (ENAS) to extract robust features from the color-coded images and classify them. Experimental results on public and challenging human action recognition datasets (MSR Action3D, Kinect Activity Recognition Dataset, SBU Kinect Interaction, and NTU-RGB+D) show that the proposed approach outperforms the current state of the art. We also study 3D human pose estimation from monocular RGB video sequences and exploit the estimated 3D poses for the recognition task. Specifically, the deep learning-based model OpenPose is deployed to detect 2D human poses, and a DNN is then trained to learn a 2D-to-3D mapping that lifts the detected 2D keypoints to 3D poses. Our experiments on the Human3.6M dataset verify the effectiveness of the proposed method. These results open a new research direction for human action recognition from 3D skeletal data without relying on depth sensors such as the Kinect. In addition, we collect and introduce the CEMEST database, a new RGB-D dataset depicting passengers' behaviors in public transport. It consists of 203 untrimmed real-world surveillance videos, collected in a metro station, of realistic "normal" and "abnormal" events. We achieve promising results on CEMEST with the support of data augmentation and transfer learning techniques. This enables the construction of real-world applications based on deep learning for enhancing public transport management services.
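
A minimal sketch of the color-coding idea behind SPMF-style representations: normalized 3D joint coordinates are written into the channels of an RGB image, with joints as rows and frames as columns, so that a standard 2D CNN can consume the sequence. The normalization and layout here are illustrative assumptions, not the exact SPMF construction from the thesis.

```python
import numpy as np

def skeleton_sequence_to_image(joints):
    """Encode a 3D skeleton sequence as an RGB image.

    joints: array of shape (T, J, 3) -- T frames, J joints, (x, y, z).
    Returns a uint8 image of shape (J, T, 3): joints as rows, frames as
    columns, with the x, y, z coordinates stored in the three channels.
    """
    lo = joints.min(axis=(0, 1), keepdims=True)  # per-axis minimum over the sequence
    hi = joints.max(axis=(0, 1), keepdims=True)  # per-axis maximum over the sequence
    norm = (joints - lo) / (hi - lo + 1e-8)      # rescale coordinates to [0, 1]
    img = (255 * norm).astype(np.uint8)          # quantise to 8-bit color values
    return img.transpose(1, 0, 2)                # (J, T, 3)

# Toy usage: 40 frames of a 20-joint skeleton, ready for a 2D CNN
# such as ResNet or DenseNet.
sequence = np.random.rand(40, 20, 3).astype(np.float32)
image = skeleton_sequence_to_image(sequence)
print(image.shape)  # (20, 40, 3)
```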

    Structured representation learning from complex data

    This thesis advances several theoretical and practical aspects of the recently introduced restricted Boltzmann machine (RBM), a powerful probabilistic and generative framework for modelling data and learning representations. The contributions of this study represent a systematic and common theme: learning structured representations from complex data.
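
For reference, the standard binary RBM is defined by the energy function and Gibbs distribution below (the textbook formulation, with visible units v, hidden units h, biases a and b, weights W, and partition function Z; not specific to this thesis):

```latex
E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h},
\qquad
p(\mathbf{v}, \mathbf{h}) = \frac{\exp\!\big(-E(\mathbf{v}, \mathbf{h})\big)}{Z}.
```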

    Emotion-aware cross-modal domain adaptation in video sequences


    Detection and prediction of urban archetypes at the pedestrian scale: computational toolsets, morphological metrics, and machine learning methods

    Granular, dense, and mixed-use urban morphologies are hallmarks of walkable and vibrant streets. However, urban systems are notoriously complex, and planned urban development, which grapples with varied, interdependent, and often conflicting criteria, may despite best intentions yield aberrant morphologies fundamentally at odds with the needs of pedestrians and the resiliency of neighbourhoods. This work addresses the measurement, detection, and prediction of pedestrian-friendly urban archetypes by developing techniques for high-resolution urban analytics at the pedestrian scale. A spatial-analytic computational toolset, the cityseer-api Python package, is created to assess localised centrality, land-use, and statistical metrics using contextually sensitive workflows applied directly over the street network. cityseer-api subsequently facilitates a review of mixed-use and street network centrality methods to improve their utility for granular urban analysis. Unsupervised machine learning methods, including Principal Component Analysis, Variational Autoencoders, and clustering, are applied to recover 'signatures', or urban archetypes, from a high-resolution multi-variable and multi-scalar dataset of centralities, land-uses, and population densities for Greater London. Supervised deep-learning methods applied to a similar dataset developed for 931 towns and cities in Great Britain demonstrate how, with the aid of domain knowledge, machine-learning classifiers can learn to discriminate between 'artificial' and 'historical' urban archetypes. These methods take complex-systems thinking as a departure point and illustrate how high-resolution spatial-analytic quantitative methods can be combined with machine learning to extrapolate benchmarks in keeping with more qualitatively framed urban morphological conceptions. Such tools may aid urban design professionals in better anticipating the outcomes of varied design scenarios as part of iterative and scalable workflows. These techniques may likewise provide robust and demonstrable feedback as part of planning review and approvals processes.
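
As a rough illustration of the unsupervised pipeline described above, the sketch below standardises a feature matrix, compresses it with PCA, and clusters the result into candidate archetypes. It uses scikit-learn with a random stand-in for the multi-scalar metrics; the feature set, component count, and cluster count are assumptions, not the values used in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per street-network location,
# columns holding multi-scalar metrics (centralities, land-use mixes,
# population densities).
X = np.random.rand(10_000, 32)

X_std = StandardScaler().fit_transform(X)                 # zero mean, unit variance per metric
Z = PCA(n_components=8).fit_transform(X_std)              # compress correlated metrics
labels = KMeans(n_clusters=6, n_init=10).fit_predict(Z)   # candidate archetypes

print(np.bincount(labels))  # locations assigned to each archetype
```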

    Marginalised Stacked Denoising Autoencoders for Robust Representation of Real-Time Multi-View Action Recognition

    Multi-view action recognition has gained great interest in video surveillance, human-computer interaction, and multimedia retrieval, where multiple cameras of different types are deployed to provide complementary fields of view. Fusing multiple camera views evidently leads to more robust decisions on both tracking multiple targets and analysing complex human activities, especially where there are occlusions. In this paper, we incorporate the marginalised stacked denoising autoencoders (mSDA) algorithm to further improve the bag-of-words (BoW) representation in terms of robustness and usefulness for multi-view action recognition. The resulting representations are fed into three simple fusion strategies, as well as a multiple kernel learning algorithm, at the classification stage. Based on the internal evaluation, the codebook size of BoW and the number of layers of mSDA may not significantly affect recognition performance. According to results on three multi-view benchmark datasets, the proposed framework improves recognition performance across all three datasets and achieves record recognition performance, beating the state-of-the-art algorithms in the literature. It is also capable of performing real-time action recognition at 33 to 45 frames per second, which could be further improved by using more powerful machines in future applications.
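
For context, the mSDA building block admits the closed-form solution of Chen et al. (2012): with dropout noise marginalised out analytically, each denoising layer reduces to a linear least-squares map followed by a tanh nonlinearity. The sketch below is a generic implementation of that layer, not the paper's exact pipeline; the BoW dimensions are illustrative.

```python
import numpy as np

def mda_layer(X, p):
    """One marginalised denoising autoencoder layer (Chen et al., 2012).

    X: (d, n) data matrix, one column per example (e.g. BoW histograms).
    p: feature corruption (dropout) probability, marginalised in closed form.
    Returns the (d, d+1) mapping W and the nonlinear hidden representation.
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias feature
    q = np.full(d + 1, 1.0 - p)            # survival probability of each feature
    q[-1] = 1.0                            # the bias is never corrupted
    S = Xb @ Xb.T                          # scatter matrix
    Q = S * np.outer(q, q)                 # E[x~ x~^T], off-diagonal terms
    np.fill_diagonal(Q, q * np.diag(S))    # diagonal survives with probability q_i
    P = S[:d, :] * q                       # E[x x~^T] (drop the bias row)
    W = P @ np.linalg.pinv(Q)              # least-squares reconstruction map
    return W, np.tanh(W @ Xb)              # tanh squashing between layers

# Stack layers: feed each hidden output into the next layer.
X = np.random.rand(500, 200)               # 500-word codebook, 200 video clips
H = X
for _ in range(3):                         # e.g. a 3-layer mSDA
    _, H = mda_layer(H, p=0.5)
```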

    Machine-learning methods for weak lensing analysis of the ESA Euclid sky survey

    A clear picture has emerged from the last three decades of research: our Universe is expanding at an accelerated rate. The cause of this expansion remains elusive but in essence acts as a repulsive force. This so-called dark energy represents about 69% of the energy content of the Universe. A further 26% of the energy is contained in dark matter, a form of matter that is electromagnetically invisible. Understanding the nature of these two major components of the Universe is at the top of the list of unsolved problems. To unveil answers, ambitious experiments are devised to survey an ever larger and deeper fraction of the sky. One such project is the European Space Agency (ESA) telescope Euclid, which will probe dark matter and infer desperately needed information about dark energy. Because light bundles follow null geodesics, their trajectories are affected by the mass distribution along the line of sight, which includes dark matter. This is gravitational lensing. In the vast majority of cases, deformations of the source objects are weak, and profiles are only slightly sheared. The nature of the dark components can be fathomed by measuring the shear over a large fraction of the sky; the shear can be recovered by a statistical analysis of a large number of objects. In this thesis, we take on the development of the necessary tools to measure the shear. Shear measurement techniques have been developed and improved for more than two decades; their performance, however, does not meet the unprecedented requirements imposed by future surveys, which trickle down from the targeted determination of the cosmological parameters. We aim to prepare novel and innovative methods, which are tested against the Euclid requirements. Our contributions fall into two major themes. A key step in the processing of weak gravitational lensing data is the correction of image deformations generated by the instrument itself; this point spread function (PSF) correction is the first theme. The second is the shear measurement itself, and in particular producing accurate measurements. We explore machine-learning methods, notably artificial neural networks. These methods are, for the most part, data-driven: schemes must first be trained against a representative sample of data, and crafting optimal training sets and choosing the method parameters can be crucial for performance. We dedicate an important fraction of this dissertation to describing the simulations behind the datasets and motivating our parameter choices. In the first part of this thesis, we propose schemes to build a clean selection of stars and to model the PSF to the Euclid requirements. Shear measurements are notoriously biased because the measured galaxies are small and faint. We introduce an approach that produces unbiased estimates of shear by processing the output of any shape measurement technique with artificial neural networks and predicting corrected estimates of the galaxy shapes, or the shear directly. We demonstrate that simple networks with simple trainings are sufficient to reach the Euclid requirements on shear measurements.
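
A toy sketch of the calibration idea discussed above: train a small network to map biased shape measurements, together with observables such as signal-to-noise and size, to unbiased shear estimates. The bias model and data below are synthetic stand-ins for the simulation-based training sets described in the thesis, and the network architecture is an arbitrary choice.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a simulation-based training set.
n = 20_000
true_shear = rng.uniform(-0.05, 0.05, size=(n, 2))   # (g1, g2) per galaxy
snr = rng.uniform(10.0, 100.0, size=(n, 1))          # signal-to-noise ratio
size = rng.uniform(0.5, 2.0, size=(n, 1))            # galaxy size relative to the PSF
bias = 1.0 - 2.0 / snr                               # toy multiplicative bias at low S/N
measured = true_shear * bias + rng.normal(0.0, 0.02, size=(n, 2))

X = np.hstack([measured, snr, size])                 # inputs: biased shapes + observables
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
net.fit(X, true_shear)                               # learn the correction

residual = net.predict(X) - true_shear
print(residual.mean(axis=0))                         # mean residual bias, ideally near zero
```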