
    Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition

    Over the past few years, deep learning methods have shown remarkable results in many face-related tasks, including automatic facial expression recognition (FER) in-the-wild. Meanwhile, the psychology community has proposed numerous models describing human emotional states. However, there is no clear evidence as to which representation is more appropriate, and the majority of FER systems use either the categorical or the dimensional model of affect. Inspired by recent work in multi-label classification, this paper proposes a novel multi-task learning (MTL) framework that exploits the dependencies between these two models using a Graph Convolutional Network (GCN) to recognize facial expressions in-the-wild. Specifically, a shared feature representation is learned for both discrete and continuous recognition in an MTL setting. Moreover, the facial expression classifiers and the valence-arousal regressors are learned through a GCN that explicitly captures the dependencies between them. To evaluate the performance of our method under real-world conditions, we perform extensive experiments on the AffectNet and Aff-Wild2 datasets. The results show that our method improves performance across different datasets and backbone architectures. Finally, we also surpass the previous state-of-the-art methods on the categorical model of AffectNet. (Comment: 9 pages, 8 figures, 5 tables, revised submission to the 16th IEEE International Conference on Automatic Face and Gesture Recognition.)
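    As a rough illustration of the idea (not the authors' code), the sketch below wires a shared CNN backbone to a GCN head whose nodes correspond to the expression classes plus the valence and arousal targets; propagating learnable node embeddings over a dependency graph yields one classifier/regressor weight vector per output. The node count, embedding size, backbone, and learnable adjacency are all illustrative assumptions.

```python
# Rough sketch (not the authors' implementation) of a shared backbone with a
# GCN head that couples the categorical expression classifiers and the
# valence-arousal regressors. Node count, embedding size, backbone, and the
# learnable adjacency are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class GCNLayer(nn.Module):
    """Single graph convolution: H' = A @ H @ W."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        return adj @ self.proj(h)


class MultiTaskGCNHead(nn.Module):
    def __init__(self, num_expressions=8, feat_dim=512, node_dim=300):
        super().__init__()
        num_nodes = num_expressions + 2                  # expression classes + valence + arousal
        self.node_embed = nn.Parameter(torch.randn(num_nodes, node_dim))
        self.adj = nn.Parameter(torch.eye(num_nodes))    # dependency graph, learned here (assumption)
        self.gcn1 = GCNLayer(node_dim, feat_dim)
        self.gcn2 = GCNLayer(feat_dim, feat_dim)
        self.num_expressions = num_expressions

    def forward(self, features):
        # Propagate the task/label embeddings over the graph to obtain one
        # weight vector per output (each expression class, valence, arousal).
        w = self.gcn2(torch.relu(self.gcn1(self.node_embed, self.adj)), self.adj)
        logits = features @ w.t()                        # (batch, num_nodes)
        expr_logits = logits[:, :self.num_expressions]   # categorical head
        valence_arousal = torch.tanh(logits[:, self.num_expressions:])  # continuous head in [-1, 1]
        return expr_logits, valence_arousal


# Shared feature representation for both tasks (backbone choice is an assumption).
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()

head = MultiTaskGCNHead()
images = torch.randn(4, 3, 224, 224)
expr_logits, va = head(backbone(images))
print(expr_logits.shape, va.shape)                       # torch.Size([4, 8]) torch.Size([4, 2])
```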

    An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild

    In this work, we tackle the task of video-based audio-visual emotion recognition within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW2). Poor illumination, head/body orientation, and low image resolution can hinder the performance of methodologies that rely solely on the extraction and analysis of facial features. To alleviate this problem, we leverage both bodily and contextual features as part of a broader emotion recognition framework. We use a standard CNN-RNN cascade as the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the RGB input modality, we construct an aural stream which operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Aff-Wild2 dataset verify the validity of our intuitive multi-stream and multi-modal approach towards emotion recognition in-the-wild. Emphasis is laid on the beneficial influence of the human body and scene context, aspects of the emotion recognition process that have so far been left relatively unexplored. All the code was implemented using PyTorch and is publicly available. (Comment: 7 pages, 1 figure, 3 tables, accepted to the 2nd Workshop and Competition on Affective Behavior Analysis In-the-Wild (ABAW2).)
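    A minimal PyTorch sketch of such a multi-stream CNN-RNN cascade is given below, with one encoder for RGB frame sequences and one for mel-spectrogram sequences, fused by concatenation before a per-timestep prediction head; the body and scene-context streams described above would be added analogously. The backbones, hidden sizes, and fusion scheme are assumptions, not the authors' exact configuration.

```python
# Minimal sketch (assumed configuration, not the released code) of a
# seq2seq CNN-RNN cascade with an RGB stream and an aural stream that
# operates on sequences of mel-spectrograms.
import torch
import torch.nn as nn
import torchvision.models as models


class StreamEncoder(nn.Module):
    """Per-timestep CNN features followed by a GRU (seq2seq)."""
    def __init__(self, cnn, feat_dim, hidden=256):
        super().__init__()
        self.cnn = cnn
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, x):                                 # x: (batch, time, C, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(b, t, -1)      # CNN applied frame by frame
        out, _ = self.rnn(f)
        return out                                        # (batch, time, hidden)


def make_resnet(in_channels=3):
    net = models.resnet18(weights=None)
    if in_channels != 3:                                  # e.g. 1-channel spectrograms
        net.conv1 = nn.Conv2d(in_channels, 64, 7, 2, 3, bias=False)
    net.fc = nn.Identity()
    return net


class AudioVisualEmotionModel(nn.Module):
    def __init__(self, num_classes=7, hidden=256):
        super().__init__()
        self.rgb = StreamEncoder(make_resnet(3), 512, hidden)
        self.aural = StreamEncoder(make_resnet(1), 512, hidden)
        self.head = nn.Linear(2 * hidden, num_classes + 2)   # expressions + valence/arousal

    def forward(self, frames, spectrograms):
        fused = torch.cat([self.rgb(frames), self.aural(spectrograms)], dim=-1)
        return self.head(fused)                              # per-timestep predictions


model = AudioVisualEmotionModel()
frames = torch.randn(2, 8, 3, 112, 112)        # short video clips
spectrograms = torch.randn(2, 8, 1, 64, 64)    # per-frame mel-spectrogram windows
print(model(frames, spectrograms).shape)       # torch.Size([2, 8, 9])
```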

    Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting

    In this paper, we introduce a neural rendering pipeline for transferring the facial expressions, head pose, and body movements of one person in a source video to another person in a target video. We apply our method to the challenging case of Sign Language videos: given a source video of a sign language user, we can faithfully transfer the performed manual (e.g., handshape, palm orientation, movement, location) and non-manual (e.g., eye gaze, facial expressions, mouth patterns, head and body movements) signs to a target video in a photo-realistic manner. Our method can be used for Sign Language Anonymization and Sign Language Production (synthesis module), as well as for reenacting other types of full-body activities (dancing, acting, exercising, etc.). We conduct detailed qualitative and quantitative evaluations and comparisons, which demonstrate the particularly promising and realistic results we obtain and the advantages of our method over existing approaches. (Comment: Accepted at AI4CC Workshop at CVPR 202)

    Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos

    Recent state-of-the-art methods for monocular 3D face reconstruction from image data have made impressive advances, thanks to the advent of Deep Learning. However, they have mostly focused on input coming from a single RGB image, overlooking two important factors: a) the vast majority of facial image data of interest nowadays do not originate from single images but rather from videos, which contain rich dynamic information; b) these videos typically capture individuals in some form of verbal communication (public talks, teleconferences, audiovisual human-computer interactions, interviews, monologues/dialogues in movies, etc.). When existing 3D face reconstruction methods are applied to such videos, the artifacts in the reconstruction of the shape and motion of the mouth area are often severe, since they do not match the speech audio well. To overcome these limitations, we present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions. We do this by proposing a "lipread" loss, which guides the fitting process so that the perception elicited by the reconstructed 3D talking head resembles that of the original video footage. We demonstrate that, interestingly, the lipread loss is better suited for 3D reconstruction of mouth movements than traditional landmark losses, and even direct 3D supervision. Furthermore, the devised method does not rely on any text transcriptions or corresponding audio, rendering it ideal for training on unlabeled datasets. We verify the effectiveness of our method through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies.
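    The core idea of a lipread-style loss can be sketched as a perceptual loss computed with a frozen, pretrained lip-reading feature extractor: mouth crops rendered from the 3D reconstruction and crops from the original footage should yield matching lip-reading features. The handle `lipread_net`, the crop format, and the cosine distance below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a "lipread"-style perceptual loss. `lipread_net` is a
# hypothetical handle for a frozen, pretrained lip-reading feature extractor;
# the crop layout and distance metric are assumptions.
import torch
import torch.nn.functional as F


def lipread_loss(rendered_mouths, real_mouths, lipread_net):
    """Encourage the rendered mouth sequence to elicit the same lip-reading
    features as the original footage.

    rendered_mouths, real_mouths: (batch, time, C, H, W) mouth crops.
    lipread_net: frozen network mapping a mouth-crop sequence to a feature vector.
    """
    with torch.no_grad():
        target = lipread_net(real_mouths)     # features of the real video, no gradients
    pred = lipread_net(rendered_mouths)       # gradients flow back into the 3D fitting
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```

    In a fitting pipeline, such a term would typically be combined with the usual landmark or photometric objectives rather than used in isolation.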

    E-Prevention: Advanced Support System for Monitoring and Relapse Prevention in Patients with Psychotic Disorders Analyzing Long-Term Multimodal Data from Wearables and Video Captures

    Wearable technologies and digital phenotyping foster unique opportunities for designing novel intelligent electronic services that can address various well-being issues in patients with mental disorders (i.e., schizophrenia and bipolar disorder), and thus have the potential to revolutionize psychiatry and its clinical practice. In this paper, we present e-Prevention, an innovative integrated system for medical support that facilitates effective monitoring and relapse prevention in patients with mental disorders. The technologies offered through e-Prevention include: (i) long-term continuous recording of biometric and behavioral indices through a smartwatch; (ii) video recordings of patients while being interviewed by a clinician, using a tablet; (iii) automatic and systematic storage of these data in a dedicated Cloud server; and (iv) the ability to detect and predict relapses. This paper focuses on the description of the e-Prevention system and the methodologies developed for identifying feature representations that correlate with, and can predict, psychopathology and relapses in patients with mental disorders. Specifically, we tackle the problem of relapse detection and prediction using Machine and Deep Learning techniques on all collected data. The results are promising, indicating that such predictions can be made and eventually leading to the prediction of psychopathology and the prevention of relapses.
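    As a simplified illustration of relapse detection from such data (not the deployed e-Prevention pipeline), the sketch below scores each day as an anomaly relative to a patient's baseline, using an autoencoder trained on stable days; the daily feature set, window, and flagging threshold are assumptions.

```python
# Simplified illustration (assumed pipeline, not the e-Prevention system):
# daily feature vectors aggregated from smartwatch signals are scored as
# anomalies with an autoencoder trained on a patient's stable days.
import torch
import torch.nn as nn


class DailyFeatureAutoencoder(nn.Module):
    """Trained on non-relapse days; high reconstruction error flags
    behaviour that deviates from the patient's baseline."""
    def __init__(self, n_features=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(), nn.Linear(8, 4))
        self.decoder = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))


def relapse_scores(model, daily_features):
    """daily_features: (days, n_features) tensor of e.g. heart-rate, sleep and
    actigraphy statistics; returns one anomaly score per day."""
    with torch.no_grad():
        recon = model(daily_features)
    return ((recon - daily_features) ** 2).mean(dim=1)


model = DailyFeatureAutoencoder()
scores = relapse_scores(model, torch.randn(30, 16))    # one month of daily summaries
flagged = scores > scores.mean() + 2 * scores.std()    # illustrative threshold
```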