
    Video-based Sign Language Recognition without Temporal Segmentation

    Millions of hearing-impaired people around the world routinely use some variant of sign language to communicate, so automatic sign language translation is both meaningful and important. Sign Language Recognition (SLR) currently comprises two sub-problems: isolated SLR, which recognizes signs word by word, and continuous SLR, which translates entire sentences. Existing continuous SLR methods typically use isolated SLR as a building block, with an extra preprocessing layer (temporal segmentation) and a post-processing layer (sentence synthesis). Unfortunately, temporal segmentation is itself non-trivial and inevitably propagates errors into subsequent steps. Worse still, isolated SLR methods typically require laborious labeling of each word in a sentence separately, severely limiting the amount of attainable training data. To address these challenges, we propose a novel continuous sign recognition framework, the Hierarchical Attention Network with Latent Space (LS-HAN), which eliminates the temporal segmentation preprocessing. The proposed LS-HAN consists of three components: a two-stream Convolutional Neural Network (CNN) for video feature representation, a Latent Space (LS) for bridging the semantic gap, and a Hierarchical Attention Network (HAN) for latent-space-based recognition. Experiments carried out on two large-scale datasets demonstrate the effectiveness of the proposed framework. Comment: 32nd AAAI Conference on Artificial Intelligence (AAAI-18), Feb. 2-7, 2018, New Orleans, Louisiana, US.
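
    As a rough illustration of the latent-space idea (a sketch under our own assumptions, not the authors' code): pooled video features from the two-stream CNN and sentence embeddings can be projected into a shared latent space and scored by cosine similarity, with a hinge ranking loss pulling matched video-sentence pairs together. All dimensions and names below are hypothetical.

```python
# Minimal sketch of latent-space bridging between video and sentence features.
# Not the LS-HAN implementation; dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSpaceBridge(nn.Module):
    def __init__(self, video_dim=1024, word_dim=300, latent_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)  # maps CNN features
        self.word_proj = nn.Linear(word_dim, latent_dim)    # maps word embeddings

    def forward(self, video_feat, sent_emb):
        # video_feat: (B, video_dim) pooled two-stream CNN features
        # sent_emb:   (B, word_dim) mean-pooled word embeddings of the sentence
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        s = F.normalize(self.word_proj(sent_emb), dim=-1)
        return (v * s).sum(-1)  # cosine relevance score per pair

def ranking_loss(scores_pos, scores_neg, margin=0.2):
    # Hinge loss: matching (video, sentence) pairs should outscore mismatched ones.
    return F.relu(margin - scores_pos + scores_neg).mean()
```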

    A Survey of Applications and Human Motion Recognition with Microsoft Kinect

    Microsoft Kinect, a low-cost motion sensing device, enables users to interact with computers or game consoles naturally through gestures and spoken commands, without any other peripheral equipment. As such, it has attracted intense interest in research and development on the Kinect technology. In this paper, we present a comprehensive survey of Kinect applications and of the latest research and development on motion recognition using data captured by the Kinect sensor. On the applications front, we review the applications of the Kinect technology in a variety of areas, including healthcare, education and performing arts, robotics, sign language recognition, retail services, workplace safety training, and 3D reconstruction. On the technology front, we provide an overview of the main features of both versions of the Kinect sensor together with the depth sensing technologies used, and review the literature on human motion recognition techniques used in Kinect applications. We provide a classification of motion recognition techniques to highlight the different approaches used in human motion recognition. Furthermore, we compile a list of publicly available Kinect datasets. These datasets are valuable resources for researchers investigating better methods for human motion recognition and lower-level computer vision tasks such as segmentation, object detection, and human pose estimation.
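
    Many of the surveyed motion recognition techniques operate on the skeleton streams that the Kinect sensor provides. As a minimal illustration (not tied to any specific surveyed method, and assuming joints are available as 3D coordinates, e.g. from one of the public datasets mentioned above), a common low-level feature is the angle formed at a joint:

```python
# Illustrative only: joint-angle feature from three 3D joint positions
# (e.g., shoulder-elbow-wrist). Joint names and coordinates are made up.
import numpy as np

def joint_angle(parent, joint, child):
    """Angle (radians) at `joint` between the parent->joint and child->joint bones."""
    u = np.asarray(parent) - np.asarray(joint)
    v = np.asarray(child) - np.asarray(joint)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical coordinates for shoulder, elbow, wrist:
print(joint_angle([0.0, 0.4, 0.0], [0.0, 0.1, 0.1], [0.2, 0.1, 0.3]))
```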

    An original framework for understanding human actions and body language by using deep neural networks

    The evolution of the fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements, it is possible to recognize gestures, which people often use to communicate information non-verbally. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, in turn, plays a key role in action recognition and affective computing: the former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements. Both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first, a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) is proposed for the recognition of sign language and semaphoric hand gestures; the second module presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, the last module provides a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs). The performance of LSTM-RNNs is explored in depth, due to their ability to model the long-term contextual information of temporal sequences, which makes them suitable for analysing body movements. All the modules were tested on challenging datasets that are well known in the state of the art, showing remarkable results compared with current literature methods.
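
    A minimal sketch of the second module's two-branch idea, assuming one stacked-LSTM branch sees raw 2D joint coordinates and the other their frame-to-frame differences, with the final hidden states concatenated for classification (the exact dimensions and fusion scheme here are assumptions, not the thesis' design):

```python
# Hypothetical two-branch stacked LSTM over 2D skeleton sequences.
import torch
import torch.nn as nn

class TwoBranchLSTM(nn.Module):
    def __init__(self, n_joints=18, hidden=128, n_classes=10):
        super().__init__()
        in_dim = n_joints * 2  # (x, y) per joint
        self.pos_branch = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.mot_branch = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, skel):  # skel: (B, T, n_joints*2)
        motion = skel[:, 1:] - skel[:, :-1]  # frame-to-frame joint displacement
        _, (h_pos, _) = self.pos_branch(skel)
        _, (h_mot, _) = self.mot_branch(motion)
        fused = torch.cat([h_pos[-1], h_mot[-1]], dim=-1)  # last-layer hidden states
        return self.classifier(fused)

model = TwoBranchLSTM()
logits = model(torch.randn(4, 30, 36))  # 4 clips, 30 frames, 18 joints
```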

    Improving gesture recognition through spatial focus of attention

    Gestures are a common form of human communication and are important for human-computer interfaces (HCI). Most recent approaches to gesture recognition use deep learning within multi-channel architectures. We show that when spatial attention is focused on the hands, gesture recognition improves significantly, particularly when the channels are fused using a sparse network. We propose an architecture (FOANet) that divides processing among four modalities (RGB, depth, RGB flow, and depth flow) and three spatial focus-of-attention regions (global, left hand, and right hand). The resulting 12 channels are fused using sparse networks. This architecture improves performance on the ChaLearn IsoGD dataset from a previous best of 67.71% to 82.07%, and on the NVIDIA dynamic hand gesture dataset from 83.8% to 91.28%. We extend FOANet to perform gesture recognition on continuous streams of data. We show that the best temporal fusion strategies for multi-channel networks depend on the modality (RGB vs. depth vs. flow field) and target (global vs. left hand vs. right hand) of the channel. The extended architecture achieves its best performance using Gaussian pooling for global channels, LSTMs for focused (left hand or right hand) flow-field channels, and late pooling for focused RGB and depth channels. The resulting system achieves a mean Jaccard Index of 0.7740, compared to the previous best result of 0.6103 on the ChaLearn ConGD dataset, without first pre-segmenting the videos into single-gesture clips. Human vision has α and β channels for processing different modalities, in addition to spatial attention similar to FOANet. However, unlike FOANet, attention is not implemented through separate neural channels; instead, attention is implemented through top-down excitation of neurons corresponding to specific spatial locations within the α and β channels. Motivated by covert attention in human vision, we propose a new architecture called CANet (Covert Attention Net) that merges spatial attention channels while preserving the concept of attention. The focus layers of CANet allow it to focus attention on hands without dedicated attention channels. CANet outperforms FOANet, achieving an accuracy of 84.79% on the ChaLearn IsoGD dataset while being efficient (≈35% of FOANet's parameters and ≈70% of FOANet's operations). In addition to producing state-of-the-art results on multiple gesture recognition datasets, this thesis also tries to understand the behavior of multi-channel networks (à la FOANet). Multi-channel architectures are becoming increasingly common, setting the state of the art for performance in gesture recognition and other domains. Unfortunately, we lack a clear explanation of why multi-channel architectures outperform single-channel ones. This thesis considers two hypotheses. The Bagging hypothesis says that multi-channel architectures succeed because they average the results of multiple unbiased weak estimators in the form of different channels. The Society of Experts (SoE) hypothesis suggests that multi-channel architectures succeed because the channels differentiate themselves, developing expertise with regard to different aspects of the data; fusion layers then combine complementary information. This thesis presents two sets of experiments to distinguish between these hypotheses; both support the SoE hypothesis, suggesting that multi-channel architectures succeed because their channels become specialized.
Finally, we demonstrate the practical impact of the gesture recognition techniques discussed in this thesis in the context of a sophisticated human-computer interaction system. We developed a prototype system with a limited form of peer-to-peer communication in the context of blocks world. The prototype allows users to communicate with the avatar through gestures and speech and to make the avatar build virtual block structures.
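
    A minimal sketch of the sparse-fusion idea described above (an illustration in the spirit of FOANet, not the thesis code): instead of a dense layer over all channel-class pairs, each class score is a learned per-channel weighted sum of that same class's scores across the 12 channels. The class count below assumes the ChaLearn IsoGD label set.

```python
# Hypothetical sparse fusion of per-channel class scores.
import torch
import torch.nn as nn

class SparseFusion(nn.Module):
    def __init__(self, n_channels=12, n_classes=249):
        super().__init__()
        # One weight per (channel, class) pair: sparse compared to a dense
        # (n_channels * n_classes) -> n_classes layer.
        self.weight = nn.Parameter(torch.ones(n_channels, n_classes) / n_channels)

    def forward(self, channel_logits):  # (B, n_channels, n_classes)
        return (channel_logits * self.weight).sum(dim=1)  # (B, n_classes)

fused = SparseFusion()(torch.randn(2, 12, 249))  # 249 classes in ChaLearn IsoGD
```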

    Machine learning methods for sign language recognition: a critical review and analysis.

    Sign language is an essential tool to bridge the communication gap between hearing and hearing-impaired people. However, the diversity of over 7,000 present-day sign languages, with variability in motion, hand shape, and the position of body parts, makes automatic sign language recognition (ASLR) a complex problem. To overcome this complexity, researchers have been investigating intelligent solutions for ASLR systems and have demonstrated remarkable success. This paper aims to analyse the research published on intelligent systems in sign language recognition over the past two decades. A total of 649 publications related to decision support and intelligent systems on sign language recognition (SLR) are extracted from the Scopus database and analysed. The extracted publications are analysed using the bibliometric software VOSviewer to (1) obtain the publications' temporal and regional distributions, and (2) map the cooperation networks between affiliations and authors and identify productive institutions in this context. Moreover, techniques for vision-based sign language recognition are reviewed, and the various feature extraction and classification techniques used to achieve good results in SLR are discussed. The literature review presented in this paper shows the importance of incorporating intelligent solutions into sign language recognition systems and reveals that a perfect intelligent system for sign language recognition is still an open problem. Overall, it is expected that this study will facilitate knowledge accumulation and the creation of intelligent-based SLR, and provide readers, researchers, and practitioners with a roadmap to guide future directions.

    Gesture Recognition Using Hidden Markov Models Augmented with Active Difference Signatures

    With the recent advent of depth sensors, human gesture recognition has gained significant interest in the fields of computer vision and human-computer interaction. Robust gesture recognition is a difficult problem because of spatiotemporal variations in gesture formation, subject size, subject location, image fidelity, and subject occlusion. Gesture boundary detection, i.e., the automatic detection of the onset and offset of a gesture in a sequence of gestures, is critical to achieving robust gesture recognition. Existing gesture recognition methods perform gesture segmentation either using resting frames in a gesture sequence or using additional information such as audio, depth images, or RGB images. This ancillary information introduces high latency into gesture segmentation and recognition, making it inappropriate for real-time applications. This thesis proposes a novel method to recognize time-varying human gestures from continuous video streams. The proposed method passes skeleton joint information into a Hidden Markov Model augmented with active difference signatures to achieve state-of-the-art gesture segmentation and recognition. Active body parts are used to calculate the likelihood of previously unseen data to facilitate gesture segmentation. Active difference signatures describe temporal motion as well as static differences from a canonical resting position. Geometric features, such as joint angles and joint topological distances, are used along with active difference signatures as salient feature descriptors. These feature descriptors serve as unique signatures that identify hidden states in the Hidden Markov Model. The Hidden Markov Model identifies gestures robustly, tolerating spatiotemporal and human-to-human variation in gesture articulation. The proposed method is evaluated on both isolated and continuous datasets. An accuracy of 80.7% is achieved on the isolated MSR3D dataset, and a mean Jaccard index of 0.58 is achieved on the continuous ChaLearn dataset, improving upon existing gesture recognition methods, which achieve a Jaccard index of 0.43 on the ChaLearn dataset. Comprehensive experiments investigate feature selection, parameter optimization, and algorithmic choices to clarify the contributions of the proposed method.
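
    A minimal sketch of the classic HMM-per-gesture setup this method builds on (using the hmmlearn library for illustration; the thesis' augmented model and active difference signatures are not reproduced here): one model is trained per gesture class on feature sequences such as joint angles, and a test sequence is assigned to the class whose model scores it highest.

```python
# Hypothetical HMM-per-class gesture classifier using hmmlearn.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_models(sequences_by_class, n_states=5):
    models = {}
    for label, seqs in sequences_by_class.items():  # seqs: list of (T_i, D) arrays
        X = np.concatenate(seqs)          # stack all sequences of this class
        lengths = [len(s) for s in seqs]  # per-sequence lengths for the EM fit
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):  # seq: (T, D) feature sequence
    # Pick the class whose HMM assigns the highest log-likelihood.
    return max(models, key=lambda label: models[label].score(seq))
```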