98 research outputs found

    Out-of-plane action unit recognition using recurrent neural networks

    A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science. Johannesburg, 2015. The face is a fundamental tool in interpersonal communication and interaction. Humans use facial expressions to consciously or subconsciously express their emotional states, such as anger or surprise. As humans, we can easily identify changes in facial expressions even in complicated scenarios, but facial expression recognition and analysis is a complex and challenging task for a computer. The automatic analysis of facial expressions by computers has applications in several scientific fields such as psychology, neurology, pain assessment, lie detection, intelligent environments, psychiatry, and emotion and paralinguistic communication. We look at methods of facial expression recognition, and in particular, the recognition of the Facial Action Coding System's (FACS) Action Units (AUs). FACS encodes the movements of individual facial muscles from slight, instantaneous changes in facial appearance; contractions of specific facial muscles are mapped to a set of units called AUs. We make use of Speeded Up Robust Features (SURF) to extract keypoints from the face and use the SURF descriptors to create feature vectors. SURF provides smaller feature vectors than other commonly used feature extraction techniques, is comparable to or outperforms other methods with respect to distinctiveness, robustness, and repeatability, and is much faster than other feature detectors and descriptors. The SURF descriptor is scale and rotation invariant and is unaffected by small viewpoint or illumination changes. We use the SURF feature vectors to train a recurrent neural network (RNN) to recognize AUs from the Cohn-Kanade database. An RNN is able to handle temporal data received from image sequences in which an AU or combination of AUs is shown to develop from a neutral face. We recognize AUs because they provide a more fine-grained means of measurement that is independent of age, ethnicity, gender, and differences in expression appearance. In addition to recognizing FACS AUs from the Cohn-Kanade database, we use our trained RNNs to recognize the development of pain in human subjects. We make use of the UNBC-McMaster pain database, which contains image sequences of people experiencing pain. In some cases, the pain causes the face to move out of plane or produces some degree of in-plane movement. The temporal processing ability of RNNs can assist in classifying AUs where the face is occluded or not frontal for some part of the sequence. Results are promising when tested on the Cohn-Kanade database, with higher overall recognition rates for upper face AUs than for lower face AUs. Since keypoints are globally extracted from the face in our system, local feature extraction could improve recognition results in future work. We also see satisfactory recognition results when tested on samples with out-of-plane head movement, demonstrating the temporal processing ability of RNNs.
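
    As a rough illustration of the pipeline this abstract describes, the sketch below pools per-frame SURF descriptors into fixed-length feature vectors and feeds the resulting sequence to a small recurrent network with one sigmoid output per AU. It is a minimal sketch, not the thesis's implementation: SURF requires an opencv-contrib-python build with the non-free modules enabled, and the pooling step, network sizes, and names (`frame_features`, `AUClassifier`) are illustrative assumptions.

```python
# A minimal sketch, assuming an opencv-contrib-python build with the
# non-free SURF module enabled; frame_features and AUClassifier are
# illustrative names, not the thesis's code.
import cv2
import numpy as np
import torch
import torch.nn as nn

def frame_features(gray_frame, dim=64):
    """Mean-pool the frame's SURF descriptors into one fixed-length vector."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    _, desc = surf.detectAndCompute(gray_frame, None)
    if desc is None:                               # no keypoints detected
        return np.zeros(dim, dtype=np.float32)
    return desc.mean(axis=0).astype(np.float32)    # standard SURF: 64 dims

class AUClassifier(nn.Module):
    """GRU over one sequence of frame features; one sigmoid output per AU
    (multi-label, since AUs can co-occur in an expression)."""
    def __init__(self, in_dim=64, hidden=128, n_aus=11):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_aus)

    def forward(self, seq):                        # seq: (batch, frames, in_dim)
        _, h = self.rnn(seq)                       # h: (layers, batch, hidden)
        return torch.sigmoid(self.head(h[-1]))

# usage on a dummy 20-frame sequence of pooled SURF features
seq = torch.randn(1, 20, 64)
print(AUClassifier()(seq).shape)                   # torch.Size([1, 11])
```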

    HAND GESTURE RECOGNITION: A LITERATURE REVIEW

    ABSTRACT

    Recognition of facial action units from video streams with recurrent neural networks: a new paradigm for facial expression recognition

    Philosophiae Doctor - PhD. This research investigated the application of recurrent neural networks (RNNs) to the recognition of facial expressions based on the Facial Action Coding System (FACS). Support vector machines (SVMs) were used to validate the results obtained by RNNs. In this approach, instead of recognizing whole facial expressions, the focus was on the recognition of the action units (AUs) defined in FACS. Recurrent neural networks are capable of gaining knowledge from temporal data, while SVMs, which are time invariant, are known to be very good classifiers. The research thus consists of four components: a comparison of image sequences against single static images, benchmarking of feature selection and network optimization approaches, a study of inter-AU correlations by implementing multiple-output RNNs, and a study of difference images as an approach for performance improvement. In the comparative studies, image sequences were classified using a combination of Gabor filters and RNNs, while single static images were classified using Gabor filters and SVMs. Sets of 11 FACS AUs were classified by both approaches, with a single RNN/SVM classifier used for each AU. Results indicated that classifying FACS AUs using image sequences yielded better results than using static images. The average recognition rate (RR) and false alarm rate (FAR) using image sequences were 82.75% and 7.61%, respectively, while classification using single static images yielded an RR and FAR of 79.47% and 9.22%, respectively. The better performance of image sequences can be attributed to the ability of RNNs, as stated above, to extract knowledge from time-series data. Subsequent research investigated benchmarking of dimensionality reduction, feature selection, and network optimization techniques, in order to improve on the performance obtained with image sequences. Results showed that an optimized network, using weight decay, gave the best RR and FAR of 85.38% and 6.24%, respectively. The next study examined the inter-AU correlations existing in the Cohn-Kanade database and their effect on classification models. To accomplish this, a model was developed for the classification of a set of AUs by a single multiple-output RNN. Results indicated that high inter-AU correlations do in fact help classification models gain more knowledge and, thus, perform better. However, this was limited to AUs that start and reach apex at almost the same time, which suggests the need for a larger database of AUs providing both individual AUs and AU combinations for further investigation. The final part of this research investigated the use of difference images to track the motion of image pixels. Difference images provide both noise and feature reduction, an aspect that was studied. The use of difference image sequences provided the best results, with an RR and FAR of 87.95% and 3.45%, respectively, a significant improvement over normal image sequences classified using RNNs. In conclusion, the research demonstrates that the use of RNNs for the classification of image sequences is a new and improved paradigm for facial expression recognition.
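
    The Gabor-filter feature step used by both the RNN and SVM branches in this abstract can be sketched as below: a small bank of orientations and scales is convolved with a face image, and the downsampled response magnitudes are concatenated into one feature vector per frame. All parameter values are illustrative assumptions, not the thesis's actual settings.

```python
# A minimal sketch of a Gabor feature extractor; all filter parameters
# and the downsampling step are illustrative, not the thesis's settings.
import cv2
import numpy as np

def gabor_features(gray, n_thetas=4, sigmas=(4.0, 8.0), step=8):
    """Concatenate downsampled response magnitudes from a small filter bank."""
    feats = []
    for sigma in sigmas:
        for k in range(n_thetas):
            theta = k * np.pi / n_thetas           # evenly spaced orientations
            kern = cv2.getGaborKernel((31, 31), sigma, theta,
                                      lambd=10.0, gamma=0.5)
            resp = cv2.filter2D(gray.astype(np.float32), -1, kern)
            feats.append(np.abs(resp)[::step, ::step].ravel())
    return np.concatenate(feats)                   # one vector per frame

face = np.random.randint(0, 255, (96, 96), dtype=np.uint8)   # stand-in frame
print(gabor_features(face).shape)                  # (1152,) with these settings
```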

    On recognition of gestures arising in flight deck officer (FDO) training

    This thesis presents an on-line recognition machine (RM) for the continuous and isolated recognition of dynamic and static gestures that arise in Flight Deck Officer (FDO) training. The thesis considers 18 distinct and commonly used dynamic and static FDO gestures. Tracker-based and computer-vision-based systems are used to acquire the gestures. The recognition machine is based on the generic pattern recognition framework. The gestures are represented as templates using summary statistics. The proposed recognition algorithm exploits the temporal and spatial characteristics of the gestures via dynamic programming and a Markovian process. The algorithm predicts the corresponding index of incremental input data in the templates in an on-line mode. Accumulated consistency in the sequence of predictions provides a similarity measurement (Score) between the input data and the templates. Once the Score has been estimated, heuristics are employed to control the declaration in the final stages. The recognition machine addresses general gesture recognition issues: recognizing dynamic gestures in real time, the absence of defined start/end points, and inter- and intra-personal temporal and spatial variance. The first two issues and temporal variance are addressed by the proposed algorithm; spatial variance is addressed by introducing independent units to construct gesture models. An important aspect of the algorithm is that it provides an intuitive mechanism for the automatic detection of start/end frames of continuous gestures. The algorithm has the additional advantage of providing timely feedback for training purposes. In this thesis, we consider isolated and continuous gestures. The performance of RM is evaluated using six datasets: artificial (W_TTest), hand motion (Yang, Perrotta), Gesture Panel, and FDO (tracker, vision). Hidden Markov Models (HMM) and Dynamic Time Warping (DTW) are used as baselines for RM's results. Various data analysis techniques are deployed to reveal the complexity and inter-similarity of the datasets before the experiments are conducted. In the isolated recognition experiments, the recognition machine obtains results comparable with HMM and outperforms DTW. In the continuous experiments, RM surpasses HMM in terms of sentence and word recognition. In addition to these experiments, a multilayer perceptron neural network (MLPNN) is introduced into the prediction process of RM to validate the modularity of RM. The overall conclusion of the thesis is that RM achieves results comparable with, and in agreement with, HMM and DTW. Furthermore, the recognition machine provides more reliable and accurate recognition in the case of missing and noisy data, and it addresses some common limitations of these algorithms and of general temporal pattern recognition in the context of FDO training. The recognition algorithm is thus suited for on-line recognition.
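
    Since DTW serves as one of the baselines above, a minimal dynamic-programming formulation of it is sketched below; the templates, the Euclidean local cost, and the nearest-template classification rule are illustrative assumptions, not the thesis's RM algorithm.

```python
# A minimal dynamic time warping sketch with a Euclidean local cost and
# nearest-template classification; the templates are illustrative.
import numpy as np

def dtw_distance(a, b):
    """a, b: (frames, features) trajectories; returns the warped path cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]

# classify a query gesture by its nearest template under DTW
templates = {"wave": np.random.randn(30, 3), "stop": np.random.randn(25, 3)}
query = np.random.randn(28, 3)
print(min(templates, key=lambda g: dtw_distance(query, templates[g])))
```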

    Learning object behaviour models

    The human visual system is capable of interpreting a remarkable variety of often subtle, learnt, characteristic behaviours. For instance, we can determine the gender of a distant walking figure from their gait, interpret a facial expression as one of surprise, or identify suspicious behaviour in the movements of an individual within a car park. Machine vision systems wishing to exploit such behavioural knowledge have been limited by the inaccuracies inherent in hand-crafted models and the absence of a unified framework for the perception of powerful behaviour models. The research described in this thesis attempts to address these limitations, using a statistical modelling approach to provide a framework in which detailed behavioural knowledge is acquired from the observation of long image sequences. The core of the behaviour modelling framework is an optimised sample-set representation of the probability density in a behaviour space defined by a novel temporal pattern formation strategy. This representation of behaviour is both concise and accurate, and it facilitates the recognition of actions or events and the assessment of behaviour typicality. Generative capabilities are achieved via the addition of a learnt stochastic process model, facilitating the generation of predictions and realistic sample behaviours. Experimental results demonstrate the acquisition of behaviour models and suggest a variety of possible applications, including automated visual surveillance, object tracking, gesture recognition, and the generation of realistic object behaviours within animations, virtual worlds, and computer-generated film sequences. The utility of the behaviour modelling framework is further extended through the modelling of object interaction. Two separate approaches are presented, and a technique is developed which, using learnt models of joint behaviour together with a stochastic tracking algorithm, can equip a virtual object with the ability to interact in a natural way. Experimental results demonstrate the simulation of a plausible virtual partner during interaction between a user and the machine.
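
    A hedged sketch of the central idea follows, with a Gaussian kernel density estimate standing in for the thesis's optimised sample-set representation: behaviour is represented by samples in a behaviour space, and the typicality of a new observation is scored by its estimated likelihood.

```python
# A minimal sketch, with scipy's Gaussian KDE standing in for the
# thesis's optimised sample-set density representation; the behaviour
# space and samples are illustrative.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
behaviour_samples = rng.standard_normal((2, 500))  # (dims, n_samples)
density = gaussian_kde(behaviour_samples)          # density over behaviour space

observation = np.array([[0.1], [-0.3]])            # one new behaviour-space point
typicality = density(observation)[0]               # low likelihood => atypical
print(f"typicality score: {typicality:.4f}")
```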

    AI-generated Content for Various Data Modalities: A Survey

    AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Due to its wide range of applications and the demonstrated potential of recent works, AIGC has been attracting considerable attention, and AIGC methods have been developed for various data modalities, such as image, video, text, 3D shape (as voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human avatar (body and head), 3D motion, and audio -- each presenting different characteristics and challenges. Furthermore, there have been many significant developments in cross-modality AIGC methods, where generative methods receive conditioning input in one modality and produce outputs in another. Examples include going from various modalities to image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar), and audio modalities. In this paper, we provide a comprehensive review of AIGC methods across different data modalities, including both single-modality and cross-modality methods, highlighting the various challenges, representative works, and recent technical directions in each setting. We also survey the representative datasets throughout the modalities and present comparative results for various modalities. Finally, we discuss the challenges and potential future research directions.

    State of the Art in Face Recognition

    Notwithstanding the tremendous effort devoted to the face recognition problem, it is not yet possible to design a face recognition system with performance close to that of humans. New computer vision and pattern recognition approaches need to be investigated, and new knowledge and perspectives from fields such as psychology and neuroscience must be incorporated into the current field of face recognition to design a robust face recognition system. Indeed, much more effort is required to arrive at a human-like face recognition system. This book attempts to reduce the gap between the current state of face recognition research and the state it must reach in the future.

    Action recognition from RGB-D data

    In recent years, action recognition based on RGB-D data has attracted increasing attention. Unlike traditional 2D action recognition, RGB-D data contains extra depth and skeleton modalities, each with its own characteristics. This thesis presents seven novel methods that take advantage of the three modalities for action recognition. First, effective handcrafted features are designed and a frequent pattern mining method is employed to mine the most discriminative, representative, and non-redundant features for skeleton-based action recognition. Second, to take advantage of powerful Convolutional Neural Networks (ConvNets), it is proposed to represent the spatio-temporal information carried in 3D skeleton sequences as three 2D images, by encoding the joint trajectories and their dynamics into colour distributions in the images; ConvNets are then adopted to learn discriminative features for human action recognition. Third, for depth-based action recognition, three data augmentation strategies are proposed to apply ConvNets to small training datasets. Fourth, to take full advantage of the 3D structural information offered by the depth modality and its insensitivity to illumination variations, three simple, compact, yet effective image-based representations are proposed, with ConvNets adopted for feature extraction and classification. However, both of the previous two methods are sensitive to noise and cannot differentiate fine-grained actions well. Fifth, to address this issue, it is proposed to represent a depth map sequence as three pairs of structured dynamic images at the body, part, and joint levels respectively, through bidirectional rank pooling. The structured dynamic image preserves spatial-temporal information, enhances structure information across both body parts/joints and different temporal scales, and takes advantage of ConvNets for action recognition. Sixth, it is proposed to extract and use scene flow for action recognition from RGB and depth data. Last, to exploit the joint information in multi-modal features arising from heterogeneous sources (RGB, depth), it is proposed to cooperatively train a single ConvNet (referred to as c-ConvNet) on both RGB and depth features, deeply aggregating the two modalities to achieve robust action recognition.
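
    The second method above, encoding joint trajectories into colour images for a ConvNet, can be sketched roughly as follows: each joint's (x, y, z) coordinates over time are normalised and mapped to RGB, turning a skeleton sequence into a joints-by-frames image. The normalisation and layout here are illustrative assumptions, not the thesis's exact encoding.

```python
# A minimal sketch; the per-axis min-max normalisation and the
# joints-by-frames layout are illustrative assumptions.
import numpy as np

def skeleton_to_image(seq):
    """seq: (frames, joints, 3) joint coordinates.
    Returns a (joints, frames, 3) uint8 image: rows are joints, columns
    are time, and the channels hold the normalised x/y/z coordinates."""
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    norm = (seq - lo) / (hi - lo + 1e-8)            # scale each axis to [0, 1]
    return (np.transpose(norm, (1, 0, 2)) * 255).astype(np.uint8)

seq = np.random.rand(60, 25, 3)                    # e.g. 60 frames, 25 joints
print(skeleton_to_image(seq).shape)                # (25, 60, 3), ConvNet-ready
```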

    Deep learning-based EEG emotion recognition: Current trends and future perspectives

    Automatic electroencephalogram (EEG) emotion recognition is a challenging component of human–computer interaction (HCI). Inspired by the powerful feature learning ability of recently emerged deep learning techniques, various advanced deep learning models have been increasingly employed to learn high-level feature representations for EEG emotion recognition. This paper aims to provide an up-to-date and comprehensive survey of EEG emotion recognition, with a focus on the deep learning techniques used in this area. We provide the preliminaries and basic knowledge from the literature, and we briefly review EEG emotion recognition benchmark data sets. We review deep learning techniques in detail, including deep belief networks, convolutional neural networks, and recurrent neural networks, and describe the state-of-the-art applications of these techniques to EEG emotion recognition. Finally, we analyze the challenges and opportunities in this field and point out its future directions.
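
    As a concrete (and deliberately simplified) example of the model family this survey covers, the sketch below builds a small 1-D ConvNet over multi-channel EEG windows with an emotion-class output head; the channel count, window length, class count, and architecture are illustrative assumptions.

```python
# A minimal sketch, assuming 32-channel EEG windows of 512 samples and
# four emotion classes; the architecture is illustrative.
import torch
import torch.nn as nn

class EEGEmotionNet(nn.Module):
    def __init__(self, channels=32, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=7, padding=3),   # temporal conv
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                # pool over the time axis
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                           # x: (batch, channels, samples)
        return self.head(self.features(x).squeeze(-1))

windows = torch.randn(8, 32, 512)                   # a batch of EEG windows
print(EEGEmotionNet()(windows).shape)               # torch.Size([8, 4])
```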

    Deep audio-visual speech recognition

    Decades of research in acoustic speech recognition have led to systems that we use in our everyday lives. However, even the most advanced speech recognition systems fail in the presence of noise. The degraded performance can be compensated for by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of architectures and annotations. This thesis contributes to the problem of Audio-Visual Speech Recognition (AVSR) from several directions. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods, which consist of a two-step approach of feature extraction and recognition, we present an End-to-End (E2E) approach inside a deep neural network, which leads to a significant improvement in audio-only, visual-only, and audio-visual experiments. We further replace Bi-directional Gated Recurrent Units (BGRUs) with Temporal Convolutional Networks (TCNs) to greatly simplify the training procedure. Secondly, we extend our AVSR model to continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/Attention model that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentation. Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged for word-level and sentence-level lip-reading. We also investigate the influence of the Lombard effect in an end-to-end AVSR system; this is the first such work using end-to-end deep architectures, and it presents results on unseen speakers. We show that adding even a relatively small amount of Lombard speech to the training set can significantly improve performance in a real scenario where noisy Lombard speech is present. Lastly, we propose a detection method against adversarial examples in an AVSR system, leveraging the strong correlation between the audio and visual streams. The synchronisation confidence score is used as a proxy for audio-visual correlation, and based on it we can detect adversarial attacks. We apply recent adversarial attacks to two AVSR models, and the experimental results demonstrate that the proposed approach is an effective way of detecting such attacks.
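
    The adversarial detection idea in the final paragraph can be sketched as follows: embed the two streams, treat their similarity as a synchronisation confidence score, and flag inputs whose score falls below a threshold calibrated on clean data. Both encoders are stand-ins here, and the cosine-similarity proxy and threshold are illustrative assumptions rather than the thesis's actual model.

```python
# A minimal sketch; the embeddings, cosine-similarity proxy, and
# threshold are illustrative assumptions, not the thesis's model.
import torch
import torch.nn.functional as F

def sync_confidence(audio_emb, visual_emb):
    """Per-frame cosine similarity, averaged over the clip.
    Embeddings: (frames, dim), assumed time-aligned."""
    return F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean()

def is_adversarial(audio_emb, visual_emb, threshold=0.5):
    # adversarial perturbations tend to break cross-modal correlation,
    # so a low synchronisation score is treated as evidence of an attack
    return sync_confidence(audio_emb, visual_emb) < threshold

audio = torch.randn(75, 256)    # e.g. 3 s of audio features at 25 fps
visual = torch.randn(75, 256)   # time-aligned visual (lip) features
print(bool(is_adversarial(audio, visual)))
```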