    Video-based infant discomfort detection

    Masked Student Dataset of Expressions

    Facial expression recognition (FER) algorithms work well in constrained environments with little or no occlusion of the face. However, real-world face occlusion is prevalent, most notably with the need to use a face mask in the current Covid-19 scenario. While there are works on the problem of occlusion in FER, little has been done before on the particular face mask scenario. Moreover, the few works in this area largely use synthetically created masked FER datasets. Motivated by these challenges posed by the pandemic to FER, we present a novel dataset, the Masked Student Dataset of Expressions or MSD-E, consisting of 1,960 real-world non-masked and masked facial expression images collected from 142 individuals. Along with the issue of obfuscated facial features, we illustrate how other subtler issues in masked FER are represented in our dataset. We then provide baseline results using ResNet-18, finding that its performance dips in the non-masked case when trained for FER in the presence of masks. To tackle this, we test two training paradigms: contrastive learning and knowledge distillation, and find that they increase the model's performance in the masked scenario while maintaining its non-masked performance. We further visualise our results using t-SNE plots and Grad-CAM, demonstrating that these paradigms capitalise on the limited features available in the masked scenario. Finally, we benchmark SOTA methods on MSD-E. Comment: Thirteenth Indian Conference on Computer Vision, Graphics and Image Processing, ACM, 2022, Gandhinagar, India
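
The abstract above names knowledge distillation as one of its two training paradigms but does not state the loss that was used. As a rough illustration only, the sketch below shows the common soft-target formulation: a hard-label cross-entropy term combined with a temperature-softened KL term between teacher and student. The function names and the `temperature` and `alpha` parameters are assumptions made for this example, not details taken from the MSD-E work.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Hypothetical soft-target distillation loss for a single sample.

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence between softened teacher and student distributions.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Soft-label term: KL(teacher || student) at the given temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))

    # The T^2 factor keeps the gradient scale of the soft term comparable
    # to that of the hard term when the temperature changes.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


if __name__ == "__main__":
    # Illustrative logits for a three-class problem (made-up numbers).
    print(distillation_loss([1.0, 0.2, -0.5], [2.0, 0.1, -1.0], label=0))
```

In a masked FER setting, one plausible arrangement is a teacher that sees the full face and a student that sees the masked face, but that pairing is likewise an assumption here.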

    Automatic inference of latent emotion from spontaneous facial micro-expressions

    Emotional states exert a profound influence on individuals' overall well-being, impacting them both physically and psychologically. Accurate recognition and comprehension of human emotions represent a crucial area of scientific exploration. Facial expressions, vocal cues, body language, and physiological responses provide valuable insights into an individual's emotional state, with facial expressions being universally recognised as dependable indicators of emotions. This thesis centres around three vital research aspects concerning the automated inference of latent emotions from spontaneous facial micro-expressions, seeking to enhance and refine our understanding of this complex domain. Firstly, the research aims to detect and analyse activated Action Units (AUs) during the occurrence of micro-expressions. AUs correspond to facial muscle movements. Although previous studies have established links between AUs and conventional facial expressions, no such connections have been explored for micro-expressions. Therefore, this thesis develops computer vision techniques to automatically detect activated AUs in micro-expressions, bridging a gap in existing studies. Secondly, the study explores the evolution of micro-expression recognition techniques, ranging from early handcrafted feature-based approaches to modern deep-learning methods. These approaches have significantly contributed to the field of automatic emotion recognition. However, existing methods primarily focus on capturing local spatial relationships, neglecting global relationships between different facial regions. To address this limitation, a novel third-generation architecture is proposed. This architecture can concurrently capture both short and long-range spatiotemporal relationships in micro-expression data, aiming to enhance the accuracy of automatic emotion recognition and improve our understanding of micro-expressions. 
Lastly, the thesis investigates the integration of multimodal signals to enhance emotion recognition accuracy. Depth information complements conventional RGB data by providing enhanced spatial features for analysis, while the integration of physiological signals with facial micro-expressions improves emotion discrimination. By incorporating multimodal data, the objective is to enhance machines' understanding of latent emotions and improve latent emotion recognition accuracy in spontaneous micro-expression analysis.


    Over the past five years, methods based on deep features have taken over the computer vision field. While dramatic performance improvements have been achieved for tasks such as face detection and verification, these methods usually need large amounts of annotated data. In practice, not all computer vision tasks have access to large amounts of annotated data. Facial expression analysis is such a task. In this dissertation, we focus on facial expression recognition and editing problems with small datasets. In addition, to cope with challenging conditions like pose and occlusion, we also study unaligned facial attribute detection and occluded expression recognition problems. This dissertation has been divided into four parts. In the first part, we present FaceNet2ExpNet, a novel idea to train a light-weight and high-accuracy classification model for expression recognition with small datasets. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed. In the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully-connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics. Evaluations on four public expression databases demonstrate that our method achieves better results than the state of the art. In the second part, we focus on robust facial expression recognition under occlusion and propose a landmark-guided attention branch to find and discard corrupted feature elements from recognition. An attention map is first generated to indicate if a specific facial part is occluded and guide our model to attend to the non-occluded regions. 
To further increase robustness, we propose a facial region branch to partition the feature maps into non-overlapping facial blocks and enforce each block to predict the expression independently. Depending on the synergistic effect of the two branches, our occlusion adaptive deep network significantly outperforms state-of-the-art methods on two challenging in-the-wild benchmark datasets and three real-world occluded expression datasets. In the third part, we propose a cascade network that simultaneously learns to localize face regions specific to attributes and performs attribute classification without alignment. First, a weakly-supervised face region localization network is designed to automatically detect regions (or parts) specific to attributes. Then multiple part-based networks and a whole-image-based network are separately constructed and combined together by the region switch layer and attribute relation layer for final attribute classification. A multi-net learning method and hint-based model compression are further proposed to get an effective localization model and a compact classification model, respectively. Our approach achieves significantly better performance than state-of-the-art methods on the unaligned CelebA dataset, reducing the classification error by 30.9%. In the final part of this dissertation, we propose an Expression Generative Adversarial Network (ExprGAN) for photo-realistic facial expression editing with controllable expression intensity. An expression controller module is specially designed to learn an expressive and compact expression code in addition to the encoder-decoder network. This novel architecture enables the expression intensity to be continuously adjusted from low to high. We further show that our ExprGAN can be applied for other tasks, such as expression transfer, image retrieval, and data augmentation for training improved face expression recognition models. 
To tackle the small size of the training database, an effective incremental learning scheme is proposed. Quantitative and qualitative evaluations on the widely used Oulu-CASIA dataset demonstrate the effectiveness of ExprGAN.
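
The facial region branch described in the second part of the abstract above partitions feature maps into non-overlapping blocks, each of which then predicts the expression on its own. The sketch below illustrates only the partitioning step, on a single-channel grid stored as a list of lists. The function name and the `rows` and `cols` parameters are hypothetical, and the dissertation's actual block layout and per-block classifiers are not specified in the abstract.

```python
def partition_blocks(feature_map, rows, cols):
    """Split a 2-D grid into rows * cols non-overlapping blocks.

    Blocks are returned in row-major order. The grid height and width
    must divide evenly, so every cell lands in exactly one block.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % rows or width % cols:
        raise ValueError("grid size must be divisible by the block layout")

    block_h, block_w = height // rows, width // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            # Slice out the rows of this block, then its columns.
            band = feature_map[r * block_h:(r + 1) * block_h]
            blocks.append([row[c * block_w:(c + 1) * block_w] for row in band])
    return blocks


if __name__ == "__main__":
    # A 4x4 grid holding the values 0..15, split into a 2x2 layout.
    grid = [[4 * i + j for j in range(4)] for i in range(4)]
    for block in partition_blocks(grid, rows=2, cols=2):
        print(block)
```

In a real network each block would be a multi-channel tensor feeding its own classifier head; the plain lists here only make the non-overlapping property easy to inspect.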

    Affect Analysis and Membership Recognition in Group Settings

    PhD Thesis. Emotions play an important role in our day-to-day life in various ways, including, but not limited to, how we humans communicate and behave. Machines can interact with humans more naturally and intelligently if they are able to recognise and understand humans’ emotions and express their own emotions. To achieve this goal, in the past two decades, researchers have been paying a lot of attention to the analysis of affective states, which has been studied extensively across various fields, such as neuroscience, psychology, cognitive science, and computer science. Most of the existing works focus on affect analysis in individual settings, where there is one person in an image or in a video. However, in the real world, people are very often with others, or interact in group settings. In this thesis, we will focus on affect analysis in group settings. Affect analysis in group settings is different from that in individual settings and provides more challenges due to dynamic interactions between the group members, various occlusions among people in the scene, and the complex context, e.g., who people are with, where people are staying and the mutual influences among people in the group. Because of these challenges, there are still a number of open issues that need further investigation in order to advance the state of the art, and explore the methodologies for affect analysis in group settings. These open topics include but are not limited to (1) is it possible to transfer the methods used for the affect recognition of a person in individual settings to the affect recognition of each individual in group settings? (2) is it possible to recognise the affect of one individual using the expressed behaviours of another member in the same group (i.e., cross-subject affect recognition)? (3) can non-verbal behaviours be used for the recognition of contextual information in group settings? 
In this thesis, we investigate affect analysis in group settings and propose methods to explore the aforementioned research questions step by step. Firstly, we propose a method for individual affect recognition in both individual and group videos, which is also used for social context prediction, i.e., whether a person is alone or within a group. Secondly, we introduce a novel framework for cross-subject affect analysis in group videos. Specifically, we analyse the correlation of the affect among group members and investigate the automatic recognition of the affect of one subject using the behaviours expressed by another subject in the same group or in a different group. Furthermore, we propose methods for contextual information prediction in group settings, i.e., group membership recognition - to recognise which group the person belongs to. Comprehensive experiments are conducted using two datasets, one containing individual videos and one containing group videos. The experimental results show that (1) the methods used for affect recognition of a person in individual settings can be transferred to group settings; (2) the affect of one subject in a group can be better predicted using the expressive behaviours of another subject within the same group than using that of a subject from a different group; and (3) contextual information (i.e., whether a person is staying alone or within a group, and group membership) can be predicted successfully using non-verbal behaviours.

    Machine learning approaches to video activity recognition: from computer vision to signal processing

    244 p. The research presented focuses on classification techniques for two different but related tasks, such that the second can be considered part of the first: human action recognition in videos and sign language recognition. In the first part, the starting hypothesis is that transforming the signals of a video with the Common Spatial Patterns algorithm (CSP, commonly used in electroencephalography systems) can yield new features that are useful for the subsequent classification of the videos with supervised classifiers. Several experiments were carried out on various databases, including one created during this research from the point of view of a humanoid robot, with the aim of implementing the developed recognition system to improve human-robot interaction. In the second part, the previously developed techniques were applied to sign language recognition; in addition, a method based on decomposing signs is proposed for recognising them, which adds the possibility of better explainability. The final goal is to develop a sign language tutor capable of guiding users through the learning process, informing them of the errors they make and the reason for those errors.
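
The abstract above relies on the Common Spatial Patterns (CSP) algorithm to turn video signals into features for a supervised classifier. The sketch below is a minimal two-class CSP in NumPy, assuming each trial is already an array of shape (channels, samples). How the thesis derives such signals from video is not stated in the abstract, and the function names and the `n_pairs` parameter are hypothetical.

```python
import numpy as np


def csp_filters(class_a, class_b, n_pairs=1):
    """Two-class CSP spatial filters.

    class_a, class_b: arrays of shape (trials, channels, samples).
    Returns 2 * n_pairs filters as rows: the first n_pairs favour the
    variance of class A, the last n_pairs favour the variance of class B.
    """
    def mean_covariance(trials):
        covs = []
        for x in trials:
            x = x - x.mean(axis=1, keepdims=True)
            c = x @ x.T
            covs.append(c / np.trace(c))  # trace-normalised covariance
        return np.mean(covs, axis=0)

    cov_a, cov_b = mean_covariance(class_a), mean_covariance(class_b)

    # Whiten the composite covariance so that it becomes the identity.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whiten = np.diag(evals ** -0.5) @ evecs.T

    # Eigenvectors of the whitened class-A covariance, sorted so the
    # directions with the largest class-A variance share come first.
    d, u = np.linalg.eigh(whiten @ cov_a @ whiten.T)
    order = np.argsort(d)[::-1]
    filters = u[:, order].T @ whiten

    picks = list(range(n_pairs)) + list(range(-n_pairs, 0))
    return filters[picks]


def log_variance_features(trial, filters):
    """Normalised log-variance of each spatially filtered signal."""
    projected = filters @ trial
    variances = projected.var(axis=1)
    return np.log(variances / variances.sum())


if __name__ == "__main__":
    # Synthetic data: class A is loud on channel 0, class B on channel 1.
    rng = np.random.default_rng(0)
    a = rng.normal(size=(20, 3, 200)) * np.array([3.0, 1.0, 1.0])[:, None]
    b = rng.normal(size=(20, 3, 200)) * np.array([1.0, 3.0, 1.0])[:, None]
    w = csp_filters(a, b)
    print(log_variance_features(a[0], w), log_variance_features(b[0], w))
```

The resulting log-variance vectors would then be passed to any supervised classifier, which is the role the abstract assigns to the CSP features.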
