25 research outputs found

    Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks

    This paper addresses the problem of continuous gesture recognition from sequences of depth maps using convolutional neural networks (ConvNets). The proposed method first segments individual gestures from a depth sequence based on quantity of movement (QOM). For each segmented gesture, an Improved Depth Motion Map (IDMM), which converts the depth sequence into a single image, is constructed and fed to a ConvNet for recognition. The IDMM effectively encodes both spatial and temporal information and allows fine-tuning of existing ConvNet models for classification without introducing millions of parameters to learn. The proposed method was evaluated on the Large-scale Continuous Gesture Recognition task of the ChaLearn Looking at People (LAP) challenge 2016, where it achieved a mean Jaccard index of 0.2655 and ranked 3rd.
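    The paper's exact IDMM construction is given in the full text. As a rough sketch of the general idea (collapsing a depth sequence into one motion image that a pretrained ConvNet can consume), the accumulation rule below is a simplified stand-in, not the paper's formula:

```python
import numpy as np

def improved_depth_motion_map(depth_seq):
    """Collapse a depth sequence of shape (T, H, W) into one motion image.

    Simplified stand-in for the paper's IDMM: accumulate absolute depth
    differences between consecutive frames, then rescale to an 8-bit
    image suitable for a ConvNet pretrained on ordinary images.
    """
    depth_seq = np.asarray(depth_seq, dtype=np.float32)
    motion = np.abs(np.diff(depth_seq, axis=0)).sum(axis=0)  # (H, W)
    # Normalize to [0, 255] so the map matches an image-model input range.
    motion -= motion.min()
    if motion.max() > 0:
        motion /= motion.max()
    return (motion * 255).astype(np.uint8)

# Example: 40 frames of 240x320 depth data.
idmm = improved_depth_motion_map(np.random.rand(40, 240, 320))
print(idmm.shape, idmm.dtype)  # (240, 320) uint8
```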

    Large-scale Isolated Gesture Recognition Using Convolutional Neural Networks

    This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI). These dynamic images are constructed from a sequence of depth maps using bidirectional rank pooling to effectively capture the spatio-temporal information. Such image-based representations enable fine-tuning of existing ConvNet models trained on image data for classification of depth sequences, without introducing a large number of parameters to learn. Upon the proposed representations, a convolutional neural network (ConvNet) based method is developed for gesture recognition and evaluated on the Large-scale Isolated Gesture Recognition task of the ChaLearn Looking at People (LAP) challenge 2016. The method achieved 55.57% classification accuracy and ranked 2nd in this challenge, coming very close to the best performance even though only depth data was used.
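    The abstract does not reproduce the rank pooling equations. One widely used closed-form variant, the approximate rank pooling of Bilen et al., serves as an illustrative stand-in here; applying it to the sequence and to its reverse yields the bidirectional pair of dynamic images:

```python
import numpy as np

def approx_rank_pooling(frames):
    """Approximate rank pooling (Bilen et al.): a weighted sum of frames
    whose weights alpha_t encode temporal order.

    frames: array of shape (T, H, W) or (T, H, W, C).
    """
    frames = np.asarray(frames, dtype=np.float32)
    T = frames.shape[0]
    # Harmonic numbers H_0..H_T, with H_0 = 0.
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
    # Contract over time: result has the spatial shape of a single frame.
    return np.tensordot(alpha, frames, axes=1)

def bidirectional_dynamic_images(frames):
    """Forward and backward dynamic images, as in bidirectional rank pooling."""
    return approx_rank_pooling(frames), approx_rank_pooling(frames[::-1])

fwd, bwd = bidirectional_dynamic_images(np.random.rand(32, 112, 112))
print(fwd.shape, bwd.shape)  # (112, 112) (112, 112)
```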

    Depth Pooling Based Large-scale 3D Action Recognition with Convolutional Neural Networks

    This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI), for both isolated and continuous action recognition. These dynamic images are constructed from a segmented sequence of depth maps using hierarchical bidirectional rank pooling to effectively capture the spatio-temporal information. Specifically, DDI exploits the dynamics of postures over time, while DDNI and DDMNI exploit the 3D structural information captured by depth maps. Upon the proposed representations, a ConvNet based method is developed for action recognition. The image-based representations enable fine-tuning of existing Convolutional Neural Network (ConvNet) models trained on image data without training a large number of parameters from scratch. The proposed method achieved state-of-the-art results on three large datasets, namely the Large-scale Continuous Gesture Recognition Dataset (mean Jaccard index 0.4109), the Large-scale Isolated Gesture Recognition Dataset (59.21%), and the NTU RGB+D Dataset (87.08% cross-subject and 84.22% cross-view), even though only the depth modality was used.
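    The hierarchical variant is not fully specified in the abstract. One plausible reading, pooling short overlapping windows first and then rank-pooling the window-level results, can be sketched on top of the approx_rank_pooling helper from the previous sketch (the window and stride values here are illustrative, not the paper's settings):

```python
import numpy as np

def hierarchical_rank_pooling(frames, window=8, stride=4):
    """Two-level rank pooling: pool overlapping windows into intermediate
    'dynamic' frames, then rank-pool those intermediates into one image.

    A sketch of one plausible hierarchy; the paper's exact scheme
    (window sizes, hierarchy depth) is given in the full text.
    Reuses approx_rank_pooling from the sketch above.
    """
    frames = np.asarray(frames, dtype=np.float32)
    T = frames.shape[0]
    mids = [approx_rank_pooling(frames[s:s + window])
            for s in range(0, max(T - window + 1, 1), stride)]
    return approx_rank_pooling(np.stack(mids))

ddi = hierarchical_rank_pooling(np.random.rand(64, 112, 112))
print(ddi.shape)  # (112, 112)
```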

    Gesture Recognition Using Hidden Markov Models Augmented with Active Difference Signatures

    With the recent advent of depth sensors, human gesture recognition has gained significant interest in the fields of computer vision and human-computer interaction. Robust gesture recognition is a difficult problem because of spatiotemporal variations in gesture formation, subject size, subject location, image fidelity, and subject occlusion. Gesture boundary detection, or the automatic detection of the onset and offset of a gesture in a sequence of gestures, is critical to achieving robust gesture recognition. Existing gesture recognition methods perform gesture segmentation either using resting frames in a gesture sequence or by using additional information such as audio, depth images, or RGB images. This ancillary information introduces high latency in gesture segmentation and recognition, making such methods inappropriate for real-time applications. This thesis proposes a novel method to recognize time-varying human gestures from continuous video streams. The proposed method passes skeleton joint information into a Hidden Markov Model augmented with active difference signatures to achieve state-of-the-art gesture segmentation and recognition. Active body parts are used to calculate the likelihood of previously unseen data to facilitate gesture segmentation. Active difference signatures describe temporal motion as well as static differences from a canonical resting position. Geometric features, such as joint angles and joint topological distances, are used along with active difference signatures as salient feature descriptors. These feature descriptors serve as unique signatures that identify hidden states in the Hidden Markov Model. The Hidden Markov Model identifies gestures in a robust fashion that is tolerant to spatiotemporal and human-to-human variation in gesture articulation. The proposed method is evaluated on both isolated and continuous datasets. An accuracy of 80.7% is achieved on the isolated MSR3D dataset, and a mean Jaccard index of 0.58 is achieved on the continuous ChaLearn dataset; these results improve upon existing gesture recognition methods, which achieve a Jaccard index of 0.43 on the ChaLearn dataset. Comprehensive experiments investigate feature selection, parameter optimization, and algorithmic choices to help understand the contributions of the proposed method.
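    The thesis's feature pipeline is richer than this, but the classify-by-likelihood core (one HMM per gesture class; pick the model that best explains the observed sequence) can be sketched with hmmlearn. The feature extractor below is a simplified stand-in for active difference signatures:

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

def difference_features(joints, resting_pose):
    """Simplified stand-in for active difference signatures: per-frame
    offset from a canonical resting pose, concatenated with
    frame-to-frame motion deltas.

    joints: (T, J*3) flattened skeleton coordinates per frame.
    """
    static = joints - resting_pose  # difference from the resting position
    motion = np.vstack([np.zeros_like(joints[0]), np.diff(joints, axis=0)])
    return np.hstack([static, motion])  # (T, 2*J*3)

def train_gesture_hmms(sequences_by_class, n_states=5):
    """Fit one Gaussian HMM per gesture class from its training sequences."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, features):
    """Label = the HMM under which the feature sequence is most likely."""
    return max(models, key=lambda lbl: models[lbl].score(features))
```

    Segmentation could then be approximated by also scoring a dedicated resting-pose model and cutting the stream where it dominates, though the thesis's boundary detection is more involved than that.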

    Improving gesture recognition through spatial focus of attention

    Gestures are a common form of human communication and important for human-computer interfaces (HCI). Most recent approaches to gesture recognition use deep learning within multi-channel architectures. We show that when spatial attention is focused on the hands, gesture recognition improves significantly, particularly when the channels are fused using a sparse network. We propose an architecture (FOANet) that divides processing among four modalities (RGB, depth, RGB flow, and depth flow) and three spatial focus-of-attention regions (global, left hand, and right hand). The resulting 12 channels are fused using sparse networks. This architecture improves performance on the ChaLearn IsoGD dataset from a previous best of 67.71% to 82.07%, and on the NVIDIA dynamic hand gesture dataset from 83.8% to 91.28%.
    We extend FOANet to perform gesture recognition on continuous streams of data. We show that the best temporal fusion strategies for multi-channel networks depend on the modality (RGB vs. depth vs. flow field) and the target (global vs. left hand vs. right hand) of the channel. The extended architecture achieves optimum performance using Gaussian pooling for global channels, LSTMs for focused (left hand or right hand) flow field channels, and late pooling for focused RGB and depth channels. The resulting system achieves a mean Jaccard index of 0.7740, compared to the previous best result of 0.6103, on the ChaLearn ConGD dataset without first pre-segmenting the videos into single gesture clips.
    Human vision has α and β channels for processing different modalities, in addition to spatial attention similar to FOANet. However, unlike FOANet, attention is not implemented through separate neural channels; instead, attention is implemented through top-down excitation of neurons corresponding to specific spatial locations within the α and β channels. Motivated by covert attention in human vision, we propose a new architecture called CANet (Covert Attention Net) that merges spatial attention channels while preserving the concept of attention. The focus layers of CANet allow it to focus attention on hands without having dedicated attention channels. CANet outperforms FOANet, achieving an accuracy of 84.79% on the ChaLearn IsoGD dataset while being efficient (≈35% of FOANet's parameters and ≈70% of FOANet's operations).
    In addition to producing state-of-the-art results on multiple gesture recognition datasets, this thesis also tries to understand the behavior of multi-channel networks (à la FOANet). Multi-channel architectures are becoming increasingly common, setting the state of the art in gesture recognition and other domains, yet we lack a clear explanation of why they outperform single-channel ones. This thesis considers two hypotheses. The Bagging hypothesis says that multi-channel architectures succeed because they average the results of multiple unbiased weak estimators in the form of different channels. The Society of Experts (SoE) hypothesis suggests that multi-channel architectures succeed because the channels differentiate themselves, developing expertise with regard to different aspects of the data; fusion layers then get to combine complementary information. This thesis presents two sets of experiments to distinguish between these hypotheses, and both sets support the SoE hypothesis, suggesting that multi-channel architectures succeed because their channels become specialized.
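    As a concrete illustration of the kind of sparse, per-channel score fusion the abstract describes (one learned weight per channel-class pair, normalized across channels), a minimal sketch follows. The layer is an assumption modeled on the description, not the thesis's exact fusion network:

```python
import torch
import torch.nn as nn

class SparseFusion(nn.Module):
    """Fuse per-channel class scores with one learned weight per
    (channel, class) pair: a sketch of the sparse fusion idea the
    abstract describes, not the thesis's exact layer.
    """
    def __init__(self, n_channels, n_classes):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_channels, n_classes))

    def forward(self, scores):
        # scores: (batch, n_channels, n_classes), one softmax per channel.
        w = torch.softmax(self.weights, dim=0)  # normalize across channels
        return (scores * w).sum(dim=1)          # (batch, n_classes)

# 12 channels (4 modalities x 3 attention regions); 249 IsoGD classes.
fusion = SparseFusion(n_channels=12, n_classes=249)
out = fusion(torch.rand(2, 12, 249))
print(out.shape)  # torch.Size([2, 249])
```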
    Finally, we demonstrate the practical impact of the gesture recognition techniques discussed in this thesis in the context of a sophisticated human-computer interaction system. We developed a prototype system with a limited form of peer-to-peer communication in the context of blocks world. The prototype allows users to communicate with the avatar using gestures and speech and to make the avatar build virtual block structures.