
    Acume: A New Visualization Tool for Understanding Facial Expression and Gesture Data

    Facial and head actions contain significant affective information. To date, these actions have mostly been studied in isolation because the space of naturalistic combinations is vast. Interactive visualization tools could enable new explorations of dynamically changing combinations of actions as people interact with natural stimuli. This paper describes a new open-source tool that enables navigation of and interaction with dynamic face and gesture data across large groups of people, making it easy to see when multiple facial actions co-occur, and how these patterns compare and cluster across groups of participants. We share two case studies that demonstrate how the tool allows researchers to quickly view an entire corpus of data for single or multiple participants, stimuli and actions. Acume yielded patterns of actions across participants and across stimuli, and helped give insight into how our automated facial analysis methods could be better designed. The results of these case studies are used to demonstrate the efficacy of the tool. The open-source code is designed to directly address the needs of the face and gesture research community, while also being extensible and flexible for accommodating other kinds of behavioral data. Source code, application and documentation are available at http://affect.media.mit.edu/acume.
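
    As an illustration of the kind of summary Acume is built to visualize, the hedged sketch below computes how often pairs of facial action units (AUs) co-occur across frames pooled over participants. The array layout, AU subset and random data are assumptions for illustration, not Acume's internal format.

```python
# Sketch of the kind of summary Acume visualizes: given frame-by-frame binary
# action-unit (AU) labels for each participant, count how often pairs of AUs
# co-occur. The array shapes and AU names here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_participants, n_frames, n_aus = 5, 300, 4
au_names = ["AU2", "AU4", "AU12", "AU15"]          # hypothetical subset

# labels[p, t, a] == 1 if AU `a` is present for participant `p` at frame `t`
labels = (rng.random((n_participants, n_frames, n_aus)) > 0.8).astype(int)

# Co-occurrence matrix: fraction of all frames (pooled over participants)
# in which both AUs are active at the same time.
flat = labels.reshape(-1, n_aus)                   # (participants*frames, AUs)
co_occurrence = (flat.T @ flat) / flat.shape[0]

for i, a in enumerate(au_names):
    for j, b in enumerate(au_names):
        if i < j:
            print(f"{a} & {b}: {co_occurrence[i, j]:.3f}")
```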

    Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

    Automatic emotion recognition (ER) has recently gained a lot of interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance (over unimodal approaches) by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities, allowing it to effectively leverage the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. By deploying the joint A-V feature representation into the cross-attention module, it helps to simultaneously leverage both the intra- and inter-modal relationships, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent.
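
    The joint cross-attention idea can be sketched in a few lines: concatenate the audio and video features into a joint representation, then derive per-modality attention weights from the correlation between each modality and that joint representation. The following is a minimal, hedged sketch of that general mechanism, not the authors' exact architecture; the feature sizes, single-layer projections and scaling are assumptions.

```python
# Minimal sketch of joint cross-attention for audio-visual fusion (not the
# authors' exact model): the joint representation J = [Xa; Xv] is correlated
# with each modality to produce attention weights that re-weight that modality.
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    def __init__(self, d_a: int, d_v: int, d_k: int = 64):
        super().__init__()
        d_j = d_a + d_v                       # joint feature size
        self.q_a = nn.Linear(d_a, d_k)        # project audio features
        self.q_v = nn.Linear(d_v, d_k)        # project video features
        self.k_j = nn.Linear(d_j, d_k)        # project joint features

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor):
        # x_a: (batch, time, d_a), x_v: (batch, time, d_v)
        x_j = torch.cat([x_a, x_v], dim=-1)   # joint A-V representation
        k = self.k_j(x_j)                     # (batch, time, d_k)

        # Correlation of each modality with the joint representation.
        att_a = torch.softmax(self.q_a(x_a) @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        att_v = torch.softmax(self.q_v(x_v) @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)

        # Attended features keep intra-modal content but are re-weighted by
        # their agreement with the joint (inter-modal) representation.
        return att_a @ x_a, att_v @ x_v

a, v = torch.randn(2, 50, 128), torch.randn(2, 50, 512)
fused_a, fused_v = JointCrossAttention(128, 512)(a, v)
print(fused_a.shape, fused_v.shape)
```

    The attended outputs keep each modality's own features (intra-modal) while re-weighting them by their agreement with the joint representation (inter-modal), which is the complementarity the abstract describes.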

    Assessing the Effectiveness of Automated Emotion Recognition in Adults and Children for Clinical Investigation

    Recent success stories in automated object or face recognition, partly fuelled by deep learning artificial neural network (ANN) architectures, have led to the advancement of biometric research platforms and, to some extent, the resurrection of Artificial Intelligence (AI). In line with this general trend, inter-disciplinary approaches have taken place to automate the recognition of emotions in adults or children for the benefit of various applications, such as the identification of children's emotions prior to a clinical investigation. Within this context, it turns out that automating emotion recognition is far from straightforward, with several challenges arising for both science (e.g., methodology underpinned by psychology) and technology (e.g., the iMotions biometric research platform). In this paper, we present a methodology, experiment and interesting findings, which raise the following research questions for the recognition of emotions and attention in humans: a) the adequacy of well-established techniques such as the International Affective Picture System (IAPS), b) the adequacy of state-of-the-art biometric research platforms, c) the extent to which emotional responses may differ between children and adults. Our findings, and first attempts to answer some of these research questions, are based on a mixed sample of adults and children who took part in the experiment, resulting in a statistical analysis of numerous variables. These variables relate to both automatically and interactively captured responses of participants to a sample of IAPS pictures.
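
    The statistical analysis described above boils down to comparing captured response variables between the adult and child groups. A minimal, hedged example of one such comparison is sketched below; the variable, sample sizes and choice of Welch's t-test are illustrative assumptions, not the study's actual analysis.

```python
# Illustrative group comparison of the sort described (adults vs. children on a
# single captured response variable). Data here are synthetic; the real study's
# variables, sample sizes and chosen tests are not specified by this abstract.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
adult_valence = rng.normal(loc=5.2, scale=1.0, size=30)   # hypothetical ratings
child_valence = rng.normal(loc=5.8, scale=1.3, size=25)

t, p = stats.ttest_ind(adult_valence, child_valence, equal_var=False)  # Welch's t-test
print(f"Welch t = {t:.2f}, p = {p:.3f}")
```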

    Measuring Voter's Candidate Preference Based on Affective Responses to Election Debates

    In this paper we present the first analysis of facial responses to electoral debates measured automatically over the Internet. We show that significantly different responses can be detected from viewers with different political preferences and that similar expressions at significant moments can have very different meanings depending on the actions that appear subsequently. We used an Internet-based framework to collect 611 naturalistic and spontaneous facial responses to five video clips from the 3rd presidential debate during the 2012 American presidential election campaign. Using this framework we were able to collect over 60% of these video responses (374 videos) within one day of the live debate and over 80% within three days. No participants were compensated for taking the survey. We present and evaluate a method for predicting independent voter preference based on automatically measured facial responses and self-reported preferences from the viewers. We predict voter preference with an average accuracy of over 73% (AUC 0.779).
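
    The prediction setup can be pictured as a standard supervised pipeline: per-viewer features summarizing facial responses are used to classify self-reported preference, scored with ROC AUC. The sketch below is a hedged illustration on synthetic data; the features, labels and logistic-regression model are assumptions, not the paper's method.

```python
# Sketch of the prediction setup described: per-viewer features summarizing
# facial responses to debate clips are used to classify candidate preference,
# evaluated with ROC AUC. Features, labels and the choice of logistic
# regression here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(374, 20))        # e.g. aggregated expression statistics per viewer
y = rng.integers(0, 2, size=374)      # self-reported preference (binary)

auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc")
print(f"mean ROC AUC: {auc.mean():.3f}")
```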

    Predicting Online Media Effectiveness Based on Smile Responses Gathered Over the Internet

    We present an automated method for classifying "liking" and "desire to view again" based on over 1,500 facial responses to media collected over the Internet. This is a very challenging pattern recognition problem that involves robust detection of smile intensities in uncontrolled settings and classification of naturalistic and spontaneous temporal data with large individual differences. We examine the manifold of responses and analyze the false positives and false negatives that result from classification. The results demonstrate the possibility for an ecologically valid, unobtrusive evaluation of commercial "liking" and "desire to view again", strong predictors of marketing success, based only on facial responses. The area under the curve for the best "liking" and "desire to view again" classifiers was 0.8 and 0.78, respectively, when using a challenging leave-one-commercial-out testing regime. The technique could be employed in personalizing video ads that are presented to people whilst they view programming over the Internet or in copy testing of ads to unobtrusively quantify effectiveness.
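
    The leave-one-commercial-out regime corresponds to grouped cross-validation, where every response to a given commercial is held out together so the classifier is always tested on an unseen ad. A hedged sketch with synthetic smile-track features is shown below; the feature set and classifier are assumptions.

```python
# Sketch of the leave-one-commercial-out evaluation regime: all responses to a
# given commercial are held out together. Smile-intensity features and labels
# below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(3)
n_responses, n_features = 1500, 16
X = rng.normal(size=(n_responses, n_features))        # e.g. smile-track summaries
y = rng.integers(0, 2, size=n_responses)               # "liking" label
commercial = rng.integers(0, 10, size=n_responses)     # which ad each response is to

aucs = []
for train, test in LeaveOneGroupOut().split(X, y, groups=commercial):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
print(f"leave-one-commercial-out mean AUC: {np.mean(aucs):.3f}")
```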

    Predicting Ad Liking and Purchase Intent: Large-Scale Analysis of Facial Responses to Ads

    Billions of online video ads are viewed every month. We present a large-scale analysis of facial responses to video content measured over the Internet and their relationship to marketing effectiveness. We collected over 12,000 facial responses from 1,223 people to 170 ads from a range of markets and product categories. The facial responses were automatically coded frame-by-frame. Collection and coding of these 3.7 million frames would not have been feasible with traditional research methods. We show that detected expressions are sparse but that aggregate responses reveal rich emotion trajectories. By modeling the relationship between the facial responses and ad effectiveness, we show that ad liking can be predicted accurately (ROC AUC = 0.85) from webcam facial responses. Furthermore, the prediction of a change in purchase intent is possible (ROC AUC = 0.78). Ad liking is shown by eliciting expressions, particularly positive expressions. Driving purchase intent is more complex than just making viewers smile: peak positive responses that are immediately preceded by a brand appearance are more likely to be effective. The results presented here demonstrate a reliable and generalizable system for predicting ad effectiveness automatically from facial responses without a need to elicit self-report responses from the viewers. In addition, we can gain insight into the structure of effective ads.
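
    The aggregate emotion trajectories mentioned above can be formed by averaging sparse frame-by-frame expression probabilities across all viewers of the same ad and smoothing the result. The sketch below illustrates this aggregation on synthetic data; the array layout and smoothing window are assumptions.

```python
# Sketch of aggregating sparse frame-by-frame expression probabilities across
# viewers of the same ad into a per-ad emotion trajectory. The array layout
# (viewers x frames) and the smoothing window are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(4)
n_viewers, n_frames = 80, 900                      # e.g. a 30 s ad at 30 fps
smile_prob = rng.random((n_viewers, n_frames)) * (rng.random((n_viewers, n_frames)) > 0.9)

trajectory = smile_prob.mean(axis=0)               # mean response per frame
window = 15
smooth = np.convolve(trajectory, np.ones(window) / window, mode="same")
peak_frame = int(np.argmax(smooth))
print(f"peak aggregate smile response at frame {peak_frame} ({peak_frame / 30:.1f} s)")
```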

    FusionSense: Emotion Classification using Feature Fusion of Multimodal Data and Deep learning in a Brain-inspired Spiking Neural Network

    Using multimodal signals to solve the problem of emotion recognition is one of the emerging trends in affective computing. Several studies have utilized state-of-the-art deep learning methods and combined physiological signals, such as the electrocardiogram (ECG), electroencephalogram (EEG) and skin temperature, along with facial expressions, voice and posture, to name a few, in order to classify emotions. Spiking neural networks (SNNs) represent the third generation of neural networks and employ biologically plausible models of neurons. SNNs have been shown to handle spatio-temporal data, which is essentially the nature of the data encountered in the emotion recognition problem, in an efficient manner. In this work, for the first time, we propose the application of SNNs to solve the emotion recognition problem with multimodal data. Specifically, we use the NeuCube framework, which employs an evolving SNN architecture to classify emotional valence, and evaluate the performance of our approach on the MAHNOB-HCI dataset. The multimodal data used in our work consist of facial expressions along with physiological signals such as ECG, skin temperature, skin conductance, respiration signal, mouth length, and pupil size. We perform classification under the Leave-One-Subject-Out (LOSO) cross-validation mode. Our results show that the proposed approach achieves an accuracy of 73.15% for classifying binary valence when applying feature-level fusion, which is comparable to other deep learning methods. We achieve this accuracy even without using EEG, which other deep learning methods have relied on to achieve this level of accuracy. In conclusion, we have demonstrated that SNNs can be successfully used to solve the emotion recognition problem with multimodal data, and we provide directions for future research utilizing SNNs for affective computing. In addition to its good accuracy, the SNN recognition system can be trained incrementally on new data in an adaptive way and requires only one pass of training, which makes it suitable for practical and online applications. These features are not manifested in other methods for this problem.
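
    NeuCube's internals are not described here, but the "biologically plausible models of neurons" that SNNs are built from can be illustrated with a minimal leaky integrate-and-fire (LIF) unit. The sketch below is a generic illustration with arbitrary parameters, not NeuCube's implementation.

```python
# Minimal leaky integrate-and-fire (LIF) neuron, the kind of biologically
# plausible unit SNNs are built from. Generic illustration only; all
# parameters are arbitrary.
import numpy as np

def lif_spikes(input_current, dt=1e-3, tau=0.02, v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Return a binary spike train for a time series of input current."""
    v = v_rest
    spikes = np.zeros_like(input_current)
    for t, i_t in enumerate(input_current):
        # Leaky integration of the membrane potential.
        v += dt * (-(v - v_rest) + i_t) / tau
        if v >= v_thresh:                 # threshold crossing -> spike, then reset
            spikes[t] = 1.0
            v = v_reset
    return spikes

current = np.concatenate([np.zeros(50), 1.5 * np.ones(200), np.zeros(50)])
print(int(lif_spikes(current).sum()), "spikes")
```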

    Affectiva-MIT Facial Expression Dataset (AM-FED): Naturalistic and Spontaneous Facial Expressions Collected In-the-Wild

    Computer classification of facial expressions requires large amounts of data and this data needs to reflect the diversity of conditions seen in real applications. Public datasets help accelerate the progress of research by providing researchers with a benchmark resource. We present a comprehensively labeled dataset of ecologically valid spontaneous facial responses recorded in natural settings over the Internet. To collect the data, online viewers watched one of three intentionally amusing Super Bowl commercials and were simultaneously filmed using their webcam. They answered three self-report questions about their experience. A subset of viewers additionally gave consent for their data to be shared publicly with other researchers. This subset consists of 242 facial videos (168,359 frames) recorded in real-world conditions. The dataset is comprehensively labeled for the following: 1) frame-by-frame labels for the presence of 10 symmetrical FACS action units, 4 asymmetric (unilateral) FACS action units, 2 head movements, smile, general expressiveness, feature tracker failures and gender; 2) the location of 22 automatically detected landmark points; 3) self-report responses of familiarity with, liking of, and desire to watch again for the stimuli videos; and 4) baseline performance of detection algorithms on this dataset. This data is available for distribution to researchers online; the EULA can be found at: http://www.affectiva.com/facial-expression-dataset-am-fed/
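
    Item 4 above refers to frame-level baseline evaluation, i.e. scoring a detector's per-frame outputs against the per-frame labels. The hedged sketch below shows one such evaluation on synthetic stand-in arrays; the dataset's actual file format and label layout are not described by this abstract.

```python
# Sketch of a frame-level baseline evaluation: scoring a smile/AU detector's
# per-frame outputs against per-frame labels. The arrays here are synthetic
# stand-ins, not the AM-FED data.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(5)
labels = (rng.random(168_359) > 0.9).astype(int)        # per-frame AU present / absent
scores = 0.6 * labels + 0.4 * rng.random(labels.shape)  # hypothetical detector scores

print("frame-level ROC AUC:", round(roc_auc_score(labels, scores), 3))
print("frame-level F1 @ 0.5:", round(f1_score(labels, (scores > 0.5).astype(int)), 3))
```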