
    Real-time human action and gesture recognition using skeleton joints information towards medical applications

    Full text link
    There have been significant efforts in the direction of improving accuracy in detecting human actions using skeleton joints. Recognizing human activities in a noisy environment is still challenging, since the Cartesian coordinates of the skeleton joints provided by a depth camera depend on the camera position and the skeleton position. In some human-computer interaction applications, the skeleton position and the camera position keep changing. The proposed method recommends using relative positional values instead of actual Cartesian coordinate values. Recent advancements in Convolutional Neural Networks (CNNs) help us achieve higher prediction accuracy using input in image format. To represent skeleton joints in image format, we need to represent the skeleton information as a matrix with equal height and width. With some depth cameras, the number of skeleton joints provided is limited, and we need to depend on relative positional values to obtain a matrix representation of the skeleton joints. With the help of the new representation of skeleton joints, we achieve state-of-the-art prediction accuracy on the MSR dataset. We have used frame shifting instead of interpolation between frames, which also helps us achieve state-of-the-art performance.
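    A minimal sketch (not the authors' code) of the representation idea: express each joint relative to a reference joint so the coordinates no longer depend on camera or body position, then flatten the frames into a square matrix a CNN can consume as an image. The joint count, reference joint and the pad/crop step standing in for the paper's frame shifting are assumptions.

```python
import numpy as np

def joints_to_image(frames, ref_idx=0):
    """Convert a skeleton sequence into a square, CNN-ready matrix.

    frames: array of shape (T, J, 3) -- T frames, J joints, (x, y, z).
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Relative coordinates: subtract a reference joint (e.g. hip centre)
    # to remove the dependence on camera and body position.
    rel = frames - frames[:, ref_idx:ref_idx + 1, :]
    T, J, C = rel.shape
    W = J * C                          # width of one flattened frame
    img = rel.reshape(T, W)
    # Repeat the last frame (or crop) until height == width; a simple
    # stand-in for the paper's frame-shifting step, which the abstract
    # does not fully specify.
    if T < W:
        img = np.vstack([img, np.tile(img[-1], (W - T, 1))])
    else:
        img = img[:W]
    # Normalise to [0, 1] so the matrix can be treated as an image.
    return (img - img.min()) / (img.max() - img.min() + 1e-8)
```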

    Two-Stage Human Activity Recognition Using 2D-ConvNet

    Get PDF
    There is a huge requirement for continuous intelligent monitoring systems for human activity recognition in various domains such as public places, automated teller machines and the healthcare sector. The increasing demand for automatic recognition of human activity in these sectors, and the need to reduce the cost involved in manual surveillance, have motivated the research community towards deep learning techniques so that a smart monitoring system for the recognition of human activities can be designed and developed. Because of the low cost, high resolution and ready availability of surveillance cameras, the authors developed a new two-stage intelligent framework for the detection and recognition of human activity types inside the premises. This paper introduces a novel framework to recognize single-limb and multi-limb human activities using a Convolutional Neural Network. In the first phase, single-limb and multi-limb activities are separated. Next, these separated single- and multi-limb activities are recognized using sequence classification. For training and validation of our framework we have used the UTKinect-Action dataset, comprising 199 action sequences performed by 10 users. We have achieved an overall accuracy of 97.88% in real-time recognition of the activity sequences.
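    A schematic sketch of how such a two-stage pipeline could be wired together; `limb_classifier`, `single_limb_model` and `multi_limb_model` are hypothetical stand-ins for the trained 2D-ConvNet models described above.

```python
import numpy as np

def recognize(sequence, limb_classifier, single_limb_model, multi_limb_model):
    """sequence: (frames, features) array for one activity clip.

    Stage 1 separates single-limb from multi-limb activity; stage 2
    routes the clip to the matching sequence classifier.
    """
    # Stage 1: a clip-level decision from a summary of the sequence.
    is_multi = limb_classifier.predict(sequence.mean(axis=0, keepdims=True))[0]
    # Stage 2: sequence classification with the selected model.
    model = multi_limb_model if is_multi else single_limb_model
    return model.predict(sequence[np.newaxis, ...])[0]
```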

    A new framework for deep learning video based Human Action Recognition on the edge

    Get PDF
    Nowadays, video surveillance systems are commonly found in most public and private spaces. These systems typically consist of a network of cameras that feed into a central node. However, the processing aspect is evolving towards distributed approaches, leveraging edge computing. These distributed systems are capable of effectively addressing the detection of people or events at each individual node. Most of these systems rely on deep-learning and segmentation algorithms, which enable them to achieve high performance, but usually at a significant computational cost, hindering real-time execution. This paper presents an approach for people detection and action recognition in the wild, optimized for running on the edge, that is able to work in real time on an embedded platform. Human Action Recognition (HAR) is performed by using a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM). The input to the LSTM is an ad-hoc, lightweight feature vector obtained from the bounding box of each detected person in the video surveillance image. The resulting system is highly portable and easily scalable, providing a powerful tool for real-world video surveillance applications (in-the-wild, real-time action recognition). The proposal has been exhaustively evaluated and compared against other state-of-the-art (SOTA) proposals on five datasets, including four widely used ones (KTH, WEIZMAN, WVU, IXMAX) and a novel one (GBA) recorded in the wild, which includes several people performing different actions simultaneously. The obtained results validate the proposal, since it achieves SOTA accuracy in a much more complicated real video surveillance scenario, and using lightweight embedded hardware. Funding: European Commission; Agencia Estatal de Investigación; Universidad de Alcalá.
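    A minimal sketch of the HAR stage under stated assumptions: an LSTM classifying a lightweight per-person feature vector derived from the detection bounding box. The feature dimension, hidden size and action count are illustrative, not the paper's values.

```python
import torch
import torch.nn as nn

class BBoxActionLSTM(nn.Module):
    """LSTM action classifier over a lightweight per-person feature
    vector derived from the detection bounding box (the exact features
    used in the paper are not specified here)."""

    def __init__(self, feat_dim=8, hidden=64, n_actions=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):              # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # classify from the last time step

# Example: features could be box centre, size and their frame-to-frame deltas.
feats = torch.randn(1, 30, 8)          # 30 frames of one tracked person
logits = BBoxActionLSTM()(feats)
```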

    Exemplar-Based Human Action Recognition with Template Matching from a Stream of Motion Capture

    Get PDF
    Recent works on human action recognition have focused on representing and classifying articulated body motion. These methods require a detailed knowledge of the action composition in both the spatial and temporal domains, which is a difficult task, most notably under real-time conditions. As such, there has been a recent shift towards the exemplar paradigm as an efficient low-level and invariant modelling approach. Motivated by recent success, we believe a real-time solution to the problem of human action recognition can be achieved. In this work, we present an exemplar-based approach where only a single action sequence is used to model an action class. Notably, rotations for each pose are parameterised in Exponential Map form. Delegate exemplars are selected using k-means clustering, where the clustering criterion is selected automatically. For each cluster, a delegate is identified and denoted as the exemplar by means of a similarity function. The number of exemplars is adaptive, based on the complexity of the action sequence. For recognition, Dynamic Time Warping and template matching are employed to compare the similarity between a streamed observation and the action model. Experimental results using motion capture demonstrate that our approach is superior to the current state of the art, with the additional ability to handle large and varied action sequences.
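    A minimal sketch of the recognition step: Dynamic Time Warping compares a streamed observation against each class's delegate exemplars, and the closest class wins. Pose vectors would be exponential-map joint rotations in the paper; here any fixed-length vector works.

```python
import numpy as np

def dtw_distance(seq, exemplar):
    """Dynamic Time Warping cost between two pose sequences, each an
    array of shape (frames, pose_dim)."""
    n, m = len(seq), len(exemplar)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq[i - 1] - exemplar[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

def classify(seq, exemplars):
    """exemplars: dict mapping action label -> list of delegate
    exemplars (e.g. k-means cluster delegates)."""
    return min(exemplars,
               key=lambda a: min(dtw_distance(seq, e) for e in exemplars[a]))
```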

    Towards practical automated human action recognition

    Full text link
    University of Technology, Sydney. Faculty of Engineering and Information Technology. Modern video surveillance requires addressing high-level concepts such as humans' actions and activities. Automated human action recognition is an interesting research area, as well as one of the main trends in the automated video surveillance industry. The typical goal of action recognition is that of labelling an image sequence (video) using one out of a set of action labels. In general, it requires the extraction of a feature set from the relevant video, followed by the classification of the extracted features. Despite the many approaches for feature set extraction and classification proposed to date, some barriers to practical action recognition still exist. We argue that recognition accuracy, speed, robustness and the required hardware are the main factors in building a practical human action recognition system to be run on a typical PC for a real-time video surveillance application. For example, a computationally heavy set of measurements may prevent practical implementation on common platforms. The main focus of this thesis is challenging the main difficulties and proposing solutions towards a practical action recognition system. The main outstanding difficulties that we have challenged in this thesis include: 1) initialisation issues with model training; 2) feature sets of limited computational weight suitable for real-time application; 3) model robustness to outliers; and 4) pending issues with the standardisation of software interfaces. In the following, we provide a description of our contributions to the resolution of these issues. Amongst the different classification approaches for classifying actions, graphical models such as the hidden Markov model (HMM) have been widely exploited by many researchers. Such models include observation probabilities which are generally modelled by mixtures of Gaussian components. When learning an HMM by way of Expectation-Maximisation (EM) algorithms, arbitrary choices must be made for their initial parameters. The initial choices have a major impact on the parameters at convergence and, in turn, on the recognition accuracy. This dependence forces us to repeat training with different initial parameters until satisfactory cross-validation accuracy is attained. Such a process is overall empirical and time consuming. We argue that one-off initialisation can offer a better trade-off between training time and accuracy, and as one of the main contributions of this thesis, we propose two methods for deterministic initialisation of the Gaussian components' centres. The first method is a time segmentation-based approach which divides each training sequence into the requested number of clusters (the product of the number of HMM states and the number of Gaussian components in each state) in the time domain. Then, cluster centres are averaged among all the training sequences to compute the initial centre for each Gaussian component. The second is a histogram-based approach which tries to initialise the components' centres with the more popular values among the training data in terms of density (similar to mode-seeking approaches). The histogram-based approach is performed incrementally, considering one feature at a time. Either centre initialisation approach is followed by dispatching the resulting Gaussian components onto HMM states. The reference component dispatching method exploits an arbitrary order for dispatching.
In contrast, we again propose two more intelligent methods, based on the effort to put components with closer centres in the same state, which can improve the correct recognition rate. Experiments over three human action video datasets (Weizmann [1], MuHAVi [2] and Hollywood [3]) prove that our proposed deterministic initialisation methods are capable of achieving accuracy above the average of repeated random initialisations (by about 1 to 3 per cent in a 6-random-run experiment) and comparable to the best. At the same time, one-off deterministic initialisation can save the required training time substantially compared to repeated random initialisations, e.g. up to 83% in the case of 6 runs of random initialisation. The proposed methods are general, as they naturally extend to other models where observation densities are conditioned on discrete latent variables, such as dynamic Bayesian networks (DBNs) and switching models. As another contribution, we propose a simple and computationally lightweight feature set, named sectorial extreme points, which requires only 1.6 ms per frame for extraction on a reference PC. We believe a lightweight feature set is more appropriate for the task of action recognition in real-time surveillance applications, with the usual requirement of processing 25 frames per second (PAL video rate). The proposed feature set represents the coordinates of the extreme points in the contour of a subject's foreground mask. Various experiments prove the strength of the proposed feature set in terms of classification accuracy, compared to similar feature sets such as the star skeleton [4] (by more than 3%) and the well-known projection histograms (up to 7%). Another main issue in density modelling of the extracted features is the outlier problem. The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model, since it is short-tailed and highly sensitive to outliers. Hence, outliers can affect the classification accuracy of HMM-based action recognition approaches that exploit the Gaussian distribution as the base component. In contrast, the Student's t-distribution is more robust to outliers thanks to its longer tail, and can be exploited for density modelling to improve the recognition rate in the presence of abnormal data. As another main contribution, we present an HMM which uses mixtures of t-distributions as observation probabilities and apply it to the recognition task. The experiments conducted over the Weizmann and MuHAVi datasets with various feature sets report a remarkable improvement of up to 9% in classification accuracy by using an HMM with mixtures of t-distributions instead of mixtures of Gaussians. Using our own proposed sectorial extreme points feature set, we have achieved the maximum possible classification accuracy (100%) over the Weizmann dataset. This achievement should be considered jointly with the fact that we have used a lightweight feature set. On a different ground, and from the implementation viewpoint, surveillance software for automated human action recognition requires portability over a variety of platforms, from servers to mobile devices. Current products mainly target low-level video analysis tasks, e.g. video annotation, instead of higher-level ones, such as action recognition.
Therefore, we explore the potential of the MPEG-7 standard to provide a standard interface platform (through descriptors and architectures) for human action recognition from surveillance cameras. As the last contribution of this work, we present two novel MPEG-7 descriptors, one symbolic and the other feature-based, alongside two different architectures: the server-intensive, which is more suitable for "thin" client devices such as PDAs, and the client-intensive, which is more appropriate for "thick" clients, such as desktops. We evaluate the proposed descriptors and architectures by way of a scenario analysis. We believe that through the four contributions of this thesis, human action recognition systems have become more practical. While some contributions are specific to generative models such as the HMM, other contributions are more general and can be exploited with other classification approaches. We acknowledge that the entire area of human action recognition is progressing at an enormous pace, and that other outstanding issues are being resolved by research groups world-wide. We hope that the reader will enjoy the content of this work.
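    A minimal sketch of the first deterministic initialisation method described above (time segmentation), assuming each training sequence is at least n_states * n_mix frames long; this is an illustration of the idea, not the thesis' reference code.

```python
import numpy as np

def time_segment_init(train_seqs, n_states, n_mix):
    """Deterministic Gaussian-centre initialisation by time segmentation.

    Each training sequence (frames x feat_dim) is split in time into
    n_states * n_mix equal segments; segment means are then averaged
    across sequences to give one initial centre per Gaussian component.
    """
    n_clusters = n_states * n_mix
    centres = np.zeros((n_clusters, train_seqs[0].shape[1]))
    for seq in train_seqs:
        segments = np.array_split(seq, n_clusters)      # split along time
        centres += np.array([s.mean(axis=0) for s in segments])
    centres /= len(train_seqs)
    # Reshape so consecutive components are dispatched to the same state.
    return centres.reshape(n_states, n_mix, -1)
```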

    Towards Full-Body Gesture Analysis and Recognition

    Get PDF
    With computers being embedded in every walk of our life, there is an increasing demand for intuitive devices for human-computer interaction. As human beings use gestures as an important means of communication, devices based on gesture recognition systems will be effective for human interaction with computers. However, it is very important to keep such a system as non-intrusive as possible, to reduce the limitations of interactions. Designing such a non-intrusive, intuitive, camera-based real-time gesture recognition system has been an active area of research in the field of computer vision. Gesture recognition invariably involves tracking body parts. We find many research works on tracking body parts such as the eyes, lips, face, etc. However, there is relatively little work being done on full-body tracking. Full-body tracking is difficult because it is expensive to model the full body as either a 2D or 3D model and to track its movements. In this work, we propose a monocular gesture recognition system that focuses on recognizing a set of arm movements commonly used to direct traffic, guide aircraft landings and communicate over long distances. This is an attempt towards implementing gesture recognition systems that require full-body tracking, e.g. an automated semaphore flag-signalling recognition system. We have implemented a robust full-body tracking system, which forms the backbone of our gesture analyzer. The tracker makes use of a two-dimensional link-joint (LJ) model, which represents the human body, for tracking. Currently, we track the movements of the arms in a video sequence; however, we have future plans to make the system real-time. We use distance transform techniques to track the movements by fitting the parameters of the LJ model in every frame of the captured video. The tracker's output is fed to a state machine which identifies the gestures made. We have implemented this system using four sub-systems, namely:
    1. a background subtraction sub-system, using Gaussian models and median filters;
    2. a full-body tracker, using LJ model APIs;
    3. a quantizer, that converts the tracker's output into defined alphabets;
    4. a gesture analyzer, that reads the alphabets into the action performed.
    Currently, our gesture vocabulary contains gestures involving arms moving up and down, which can be used for detecting semaphore, a flag-signalling system. We can also detect gestures like clapping and waving of arms.
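    A toy sketch of sub-systems 3 and 4: tracked arm angles are quantized into alphabet symbols, and the gesture analyzer maps symbol patterns to actions. The symbols, thresholds and gesture patterns are illustrative, not the thesis' actual vocabulary.

```python
def quantize(arm_angle_deg):
    """Map a tracked arm angle (degrees) to a coarse alphabet symbol."""
    if arm_angle_deg < 30:
        return "DOWN"
    elif arm_angle_deg < 120:
        return "MID"
    return "UP"

# Illustrative gesture vocabulary: symbol patterns -> actions.
GESTURES = {
    ("DOWN", "UP", "DOWN", "UP"): "waving",
    ("UP", "UP"): "both_arms_up",
}

def analyze(angle_stream):
    """Quantize a stream of tracked angles and look up the gesture."""
    symbols = tuple(quantize(a) for a in angle_stream)
    return GESTURES.get(symbols, "unknown")
```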

    Simple yet efficient real-time pose-based action recognition

    Full text link
    Recognizing human actions is a core challenge for autonomous systems, as they directly share the same space with humans. Systems must be able to recognize and assess human actions in real time. In order to train the corresponding data-driven algorithms, a significant amount of annotated training data is required. We demonstrate a pipeline to detect humans, estimate their pose, track them over time and recognize their actions in real time with standard monocular camera sensors. For action recognition, we encode the human pose into a new data format called Encoded Human Pose Image (EHPI) that can then be classified using standard methods from the computer vision community. With this simple procedure we achieve competitive state-of-the-art performance in pose-based action detection and can ensure real-time performance. In addition, we show a use case in the context of autonomous driving to demonstrate how such a system can be trained to recognize human actions using simulation data. Comment: Submitted to IEEE Intelligent Transportation Systems Conference (ITSC) 2019. Code will be available soon at https://github.com/noboevbo/ehpi_action_recognitio
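    A minimal sketch in the spirit of the EHPI encoding described above: joints index one image axis, time the other, and normalised x/y coordinates fill the colour channels, so a pose sequence becomes an image classifiable by standard CNNs. The exact normalisation and image size in the paper may differ.

```python
import numpy as np

def encode_ehpi(poses, width=32):
    """Encode a 2D-pose sequence as an image.

    poses: (T, J, 2) array of 2D keypoints for one tracked person.
    Returns an image of shape (J, width, 3) with values in [0, 1].
    """
    poses = np.asarray(poses, dtype=np.float32)
    T, J, _ = poses.shape
    # Normalise coordinates to [0, 1] over the whole sequence.
    mins, maxs = poses.min(axis=(0, 1)), poses.max(axis=(0, 1))
    poses = (poses - mins) / (maxs - mins + 1e-8)
    img = np.zeros((J, width, 3), dtype=np.float32)
    cols = min(T, width)
    img[:, :cols, 0] = poses[:cols, :, 0].T    # red channel   <- x
    img[:, :cols, 1] = poses[:cols, :, 1].T    # green channel <- y
    return img                                  # classify with any CNN
```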

    Action recognition using instrumented objects for stroke rehabilitation

    Get PDF
    Assisting patients to perform activities of daily living (ADLs) is a challenging task for both human and machine. Hence, developing a computer-based rehabilitation system to re-train patients to carry out daily activities is an essential step towards facilitating the rehabilitation of stroke patients with apraxia and action disorganization syndrome (AADS). This thesis presents a real-time Hidden Markov Model (HMM) based human activity recognizer, and proposes a technique to reduce the time delay incurred during the decoding stage. Results are reported for complete tea-making trials. In this study, the input features are recorded using sensors attached to the objects involved in the tea-making task, plus hand coordinate data captured using a Kinect sensor. A coaster of sensors, comprising an accelerometer and three force-sensitive resistors, is packaged in a unit which can be easily attached to the base of an object. A parallel, asynchronous set of detectors, each responsible for the detection of one sub-goal in the tea-making task, is used to address challenges arising from overlaps between human actions. In this work, HMMs are used to exploit temporal dependencies between actions, and emission distributions are modelled by two generative and discriminative modelling techniques, namely Gaussian Mixture Models (GMMs) and Deep Neural Networks (DNNs). Our experimental results show that the HMM-DNN based systems outperform the GMM-HMM based systems by 18%. The proposed activity recognition system with the modified HMM topology provides a practical solution to the action recognition problem and reduces the time delay by 64% with no loss in accuracy.
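    A minimal sketch of the parallel, asynchronous sub-goal detectors, assuming each detector is an HMM over a feature vector combining the coaster's accelerometer and force-sensitive resistors with Kinect hand coordinates. Gaussian emissions via hmmlearn stand in for the thesis' GMM/DNN emission models, and the state count and thresholds are assumptions.

```python
from hmmlearn import hmm

def build_detector(n_states=4):
    # One sub-goal detector; diagonal-covariance Gaussian emissions
    # stand in for the thesis' GMM/DNN emission models.
    return hmm.GaussianHMM(n_components=n_states, covariance_type="diag")

def detect_subgoals(window, detectors, thresholds):
    """Run the parallel, asynchronous detectors over a feature window
    (frames x feat_dim); each detector fires independently, which
    tolerates overlapping actions. Models are assumed already fitted
    on examples of their own sub-goal."""
    return [name for name, model in detectors.items()
            if model.score(window) > thresholds[name]]
```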