1,079 research outputs found

    Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos

    When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense-trajectory-based motion features to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
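The four-quadrant labelling and the fusion step mentioned in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the threshold, the label names, or the fusion rule, so the midpoint of 5.0 (assuming a 1 to 9 rating scale), the `HVHA`-style labels, and the weighted score-level fusion in `fuse_scores` are hypothetical choices, not the authors' method.

```python
from typing import Sequence


def va_quadrant(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Map a (valence, arousal) rating to one of four quadrant labels.

    The midpoint is an assumed threshold, not a value taken from the paper.
    """
    v = "HV" if valence >= midpoint else "LV"  # high or low valence
    a = "HA" if arousal >= midpoint else "LA"  # high or low arousal
    return v + a


def fuse_scores(
    audio_visual: Sequence[float],
    motion: Sequence[float],
    weight: float = 0.5,
) -> int:
    """Weighted late fusion of two per-class score vectors.

    Returns the index of the class with the highest fused score.
    The linear weighting is one common fusion rule, assumed here.
    """
    if len(audio_visual) != len(motion):
        raise ValueError("score vectors must have the same length")
    fused = [weight * s1 + (1.0 - weight) * s2
             for s1, s2 in zip(audio_visual, motion)]
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    print(va_quadrant(7.0, 8.0))
    print(fuse_scores([0.1, 0.6, 0.2, 0.1], [0.3, 0.2, 0.4, 0.1]))
```

In practice the per-class scores would come from classifiers trained on each representation; the sketch only shows how two such score vectors could be combined into a single decision.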

    Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction

    Frame-level visual features are generally aggregated in time with techniques such as LSTM, Fisher Vectors, and NetVLAD to produce a robust video-level representation. We here introduce a learnable aggregation technique whose primary objective is to retain the short-time temporal structure between frame-level features and their spatial interdependencies in the representation. It can also be easily adapted to cases where training samples are very scarce. We evaluate the method on a real-fake expression prediction dataset to demonstrate its superiority. Our method obtains a 65% score on the test dataset in the official MAP evaluation, which differs from the best reported result in the Chalearn Challenge (i.e., 66.7%) by only one misclassified decision. Lastly, we believe that this method can be extended to different problems such as action/event recognition in the future. Comment: Submitted to International Conference on Computer Vision Workshop

    Automatic Recognition of Facial Displays of Unfelt Emotions

    Humans modify their facial expressions in order to communicate their internal states and sometimes to mislead observers regarding their true emotional states. Evidence in experimental psychology shows that discriminative facial responses are short and subtle. This suggests that such behavior would be easier to distinguish when captured in high resolution at an increased frame rate. We propose SASE-FE, the first dataset of facial expressions that are either congruent or incongruent with underlying emotion states. We show that, overall, the problem of recognizing whether facial movements are expressions of authentic emotions or not can be successfully addressed by learning spatio-temporal representations of the data. For this purpose, we propose a method that aggregates features along fiducial trajectories in a deeply learnt space. Performance of the proposed model shows that, on average, it is easier to distinguish among genuine facial expressions of emotion than among unfelt facial expressions of emotion, and that certain emotion pairs such as contempt and disgust are more difficult to distinguish than the rest. Furthermore, the proposed methodology improves state-of-the-art results on the CK+ and OULU-CASIA datasets for video emotion recognition, and achieves competitive results when classifying facial action units on the BP4D dataset.

    Collaborative Learning in Computer Vision

    The science of designing machines to extract meaningful information from digital images, videos, and other visual inputs is known as Computer Vision (CV). Deep learning algorithms cope with CV problems by automatically learning task-specific features. In particular, Deep Neural Networks (DNNs) have become an essential component in CV solutions due to their ability to encode large amounts of data and their capacity to manipulate billions of model parameters. Unlike machines, humans learn by rapidly constructing abstract models. This is undoubtedly due to the fact that good teachers supply their students with much more than just the correct answer; they also provide intuitive comments, comparisons, and explanations. In deep learning, the availability of such auxiliary information at training time (but not at test time) is referred to as learning by Privileged Information (PI). Typically, predictions (e.g., soft labels) produced by a bigger and better teacher network are used as structured knowledge to supervise the training of a smaller student network, helping the student network to generalize better than one trained from scratch. This dissertation focuses on the category of deep learning systems known as Collaborative Learning, where one DNN model helps other models, or several models help each other, during training to achieve strong generalization and thus high performance. The question we address here is thus the following: how can we take advantage of PI for training a deep learning model, knowing that, at test time, such PI might be missing? In this context, we introduce new methods to tackle several challenging real-world computer vision problems. First, we propose a method for model compression that leverages PI in a teacher-student framework along with customizable block-wise optimization for learning a target-specific lightweight structure of the neural network. 
In particular, the proposed resource-aware optimization is employed on suitable parts of the student network while respecting the expected resource budget (e.g., floating-point operations per inference and model parameters). In addition, soft predictions produced by the teacher network are leveraged as a source of PI, forcing the student to preserve baseline performance during network structure optimization. Second, we propose a multiple-model learning method for action recognition, specifically devised for challenging video footage in which actions are not explicitly visualized but only implicitly referred to. We use such videos as stimuli and involve a large sample of subjects to collect a high-definition EEG and video dataset. Next, we employ collaborative learning in a multi-modal setting, i.e., the EEG (teacher) model helps the video (student) model by distilling the knowledge (the implicit meaning of the visual stimuli) to it, sharply boosting the recognition performance. The goal of Unsupervised Domain Adaptation (UDA) methods is to use the labeled source data together with the unlabeled target domain data to train a model that generalizes well on the target domain. In contrast, we cast UDA as a pseudo-label refinery problem in the challenging source-free scenario, i.e., in cases where the source domain data is inaccessible during training. We propose the Negative Ensemble Learning (NEL) technique, a unified method for adaptive noise filtering and progressive pseudo-label refinement. In particular, the ensemble members collaboratively learn with a Disjoint Set of Residual Labels, an outcome of the output prediction consensus, to refine the challenging noise associated with the inferred pseudo-labels. A single model trained with the refined pseudo-labels leads to superior performance on the target domain, without using source data samples at all. 
We conclude this dissertation with a method extending our previous study by incorporating Continual Learning in the Source-Free UDA. Our new method comprises two stages: a Source-Free UDA pipeline based on pseudo-label refinement, and a procedure for extracting class-conditioned source-style images by leveraging the pre-trained source model. While stage 1 holds the same collaborative peculiarities, in stage 2 the collaboration exists in an indirect manner, i.e., it is the source model that provides the only possibility to generate source-style synthetic images, which eventually help the final model in preserving good performance on both source and target domains. In each study, we consider heterogeneous CV tasks. Nevertheless, with an extensive pool of experiments on various benchmarks carrying diverse complexities and challenges, we show that the collaborative learning framework outperforms the related state-of-the-art methods by a considerable margin.
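The teacher-student supervision with soft labels described in this abstract can be sketched with the widely used temperature-scaled distillation loss. This is a generic illustration under stated assumptions, not the dissertation's implementation: the function names, the temperature of 4.0, and the mixing weight `alpha` are hypothetical, and the soft term is written as a cross-entropy against the teacher distribution (which differs from the KL divergence only by a term that does not depend on the student).

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    true_label: int,
    temperature: float = 4.0,  # assumed value, for illustration only
    alpha: float = 0.5,        # assumed weight between soft and hard terms
) -> float:
    """Combine a soft-label term (teacher as privileged information,
    available only at training time) with the usual hard-label term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # Cross-entropy of the student against the teacher's soft labels.
    soft_term = -sum(pt * math.log(ps)
                     for pt, ps in zip(p_teacher, p_student_soft))
    # Standard cross-entropy against the ground-truth label.
    hard_term = -math.log(softmax(student_logits)[true_label])
    # The T^2 factor keeps the soft term's gradient scale comparable.
    return alpha * temperature ** 2 * soft_term + (1.0 - alpha) * hard_term


if __name__ == "__main__":
    teacher = [5.0, 1.0, 0.0]
    print(distillation_loss([5.0, 1.0, 0.0], teacher, true_label=0))
    print(distillation_loss([0.0, 1.0, 5.0], teacher, true_label=0))
```

A student whose logits agree with the teacher incurs a lower loss than one that disagrees; at test time only the student is evaluated, so the teacher's predictions are not needed.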

    Non-contact Multimodal Indoor Human Monitoring Systems: A Survey

    Indoor human monitoring systems leverage a wide range of sensors, including cameras, radio devices, and inertial measurement units, to collect extensive data from users and the environment. These sensors contribute diverse data modalities, such as video feeds from cameras, received signal strength indicators and channel state information from WiFi devices, and three-axis acceleration data from inertial measurement units. In this context, we present a comprehensive survey of multimodal approaches for indoor human monitoring systems, with a specific focus on their relevance in elderly care. Our survey primarily highlights non-contact technologies, particularly cameras and radio devices, as key components in the development of indoor human monitoring systems. Throughout this article, we explore well-established techniques for extracting features from multimodal data sources. Our exploration extends to methodologies for fusing these features and harnessing multiple modalities to improve the accuracy and robustness of machine learning models. Furthermore, we conduct a comparative analysis across different data modalities in diverse human monitoring tasks and undertake a comprehensive examination of existing multimodal datasets. This extensive survey not only highlights the significance of indoor human monitoring systems but also affirms their versatile applications. In particular, we emphasize their critical role in enhancing the quality of elderly care, offering valuable insights into the development of non-contact monitoring solutions applicable to the needs of aging populations. Comment: 19 pages, 5 figures

    Survey on Emotional Body Gesture Recognition

    Automatic emotion recognition has become a trending research topic in the past decade. While works based on facial expressions or speech abound, recognizing affect from body gestures remains a less explored topic. We present a new comprehensive survey hoping to boost research in the field. We first introduce emotional body gestures as a component of what is commonly known as "body language" and comment on general aspects such as gender differences and culture dependence. We then define a complete framework for automatic emotional body gesture recognition. We introduce person detection and comment on static and dynamic body pose estimation methods in both RGB and 3D. We then comment on the recent literature related to representation learning and emotion recognition from images of emotionally expressive gestures. We also discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition. While pre-processing methodologies (e.g., human detection and pose estimation) are nowadays mature technologies fully developed for robust large-scale analysis, we show that for emotion recognition the quantity of labelled data is scarce. There is no agreement on clearly defined output spaces, and the representations are shallow and largely based on naive geometrical representations.