Multimodal Local-Global Ranking Fusion for Emotion Recognition
Emotion recognition is a core research area at the intersection of artificial
intelligence and human communication analysis. It is a significant technical
challenge, since humans display their emotions through complex, idiosyncratic combinations of the language, visual, and acoustic modalities. In contrast to
traditional multimodal fusion techniques, we approach emotion recognition from
both direct person-independent and relative person-dependent perspectives. The
direct person-independent perspective follows the conventional emotion
recognition approach which directly infers absolute emotion labels from
observed multimodal features. The relative person-dependent perspective instead compares partial video segments to determine whether emotional intensity increased or decreased. Our proposed model integrates these direct and relative prediction
perspectives by dividing the emotion recognition task into three easier
subtasks. The first subtask involves a multimodal local ranking of relative
emotion intensities between two short segments of a video. The second subtask
uses local rankings to infer global relative emotion ranks with a Bayesian
ranking algorithm. The third subtask incorporates both direct predictions from
observed multimodal behaviors and relative emotion ranks from local-global
rankings for final emotion prediction. Our approach achieves excellent performance on an audio-visual emotion recognition benchmark and improves over other multimodal fusion algorithms.
Comment: ACM International Conference on Multimodal Interaction (ICMI 2018)
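To make the three-subtask pipeline concrete, below is a minimal Python sketch of the idea, with hypothetical names and shapes throughout: a linear pairwise comparator stands in for the multimodal local ranker (subtask 1), a simple Bradley-Terry-style fit stands in for the Bayesian global ranking algorithm (subtask 2), and a weighted blend fuses direct predictions with relative ranks (subtask 3). It illustrates the structure of the approach, not the authors' implementation.

    # Minimal sketch of the three subtasks (hypothetical names and shapes,
    # not the authors' released code). Assumes each video is pre-split into
    # short segments, each with a fused multimodal feature vector.
    import numpy as np

    rng = np.random.default_rng(0)

    def local_rank(feat_a, feat_b, w):
        # Subtask 1: a linear comparator stands in for the multimodal local
        # ranker; output > 0 means segment a has higher emotion intensity.
        return float(w @ (feat_a - feat_b))

    def global_ranks(features, w, iters=100, lr=0.1):
        # Subtask 2: aggregate noisy pairwise comparisons into global
        # relative ranks. A simple Bradley-Terry-style gradient fit stands
        # in here for the paper's Bayesian ranking algorithm.
        n = len(features)
        wins = {(i, j): 1.0 if local_rank(features[i], features[j], w) > 0 else 0.0
                for i in range(n) for j in range(i + 1, n)}
        s = np.zeros(n)  # latent per-segment intensity score
        for _ in range(iters):
            grad = np.zeros(n)
            for (i, j), y in wins.items():
                p = 1.0 / (1.0 + np.exp(s[j] - s[i]))  # P(i ranked above j)
                grad[i] += y - p
                grad[j] -= y - p
            s += lr * grad
        return s

    def fuse(direct_preds, ranks, alpha=0.5):
        # Subtask 3: blend direct person-independent predictions with the
        # standardized relative ranks for the final emotion prediction.
        z = (ranks - ranks.mean()) / (ranks.std() + 1e-8)
        return alpha * direct_preds + (1.0 - alpha) * z

    # Toy usage: 6 segments with 16-dim fused multimodal features.
    feats = rng.normal(size=(6, 16))
    w = rng.normal(size=16)       # comparator weights (learned in practice)
    direct = rng.normal(size=6)   # direct emotion predictions per segment
    print(fuse(direct, global_ranks(feats, w)))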
Slices of Attention in Asynchronous Video Job Interviews
The impact of non-verbal behaviour on hiring decisions remains an open question. Investigating this question is important, as it could provide a better understanding of how to train candidates for job interviews and could make recruiters aware of influential non-verbal behaviour. This research has recently been accelerated by the development of tools for the automatic analysis of social signals and by the emergence of machine learning methods. However, these studies are still mainly based on hand-engineered features, which limits the discovery of influential social signals. Deep learning methods, on the other hand, are a promising tool for discovering complex patterns without the need for feature engineering. In this paper, we focus on studying the influential non-verbal social signals that deep learning methods discover in asynchronous video job interviews. We use a previously
published deep learning system that infers the hirability of a candidate from a sequence of interview questions. One particularity of this system is its use of attention mechanisms, which identify the relevant parts of an answer. Fine-grained temporal information can thus be extracted using only global (interview-level) annotations of hirability. While most deep learning systems use attention mechanisms merely to offer a quick visualization of slices where attention rises, we perform an in-depth analysis to understand what happens during these moments.
First, we propose a methodology to automatically extract slices where attention rises (attention slices). Second, we study the content of these attention slices by comparing them with randomly sampled slices. Finally, we show that attention slices carry significantly more information about hirability than randomly sampled slices.
Comment: Accepted at the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)
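As a rough illustration of the slice-extraction step, the Python sketch below thresholds a per-frame attention curve to find contiguous spans where attention rises, then draws length-matched random spans as the comparison baseline. The thresholding rule and all names are assumptions, not the paper's exact procedure.

    # Hypothetical sketch of "attention slice" extraction: given per-frame
    # attention weights from a trained model, keep contiguous spans where
    # attention exceeds a threshold, then draw length-matched random spans
    # as the comparison baseline.
    import numpy as np

    rng = np.random.default_rng(1)

    def attention_slices(attn, k=1.0):
        # Return (start, end) spans where attention exceeds mean + k * std.
        thresh = attn.mean() + k * attn.std()
        above = attn > thresh
        spans, start = [], None
        for t, hi in enumerate(above):
            if hi and start is None:
                start = t
            elif not hi and start is not None:
                spans.append((start, t))
                start = None
        if start is not None:
            spans.append((start, len(attn)))
        return spans

    def random_slices(n_frames, spans):
        # Length-matched random spans, used as the comparison baseline.
        out = []
        for s, e in spans:
            length = e - s
            start = int(rng.integers(0, n_frames - length + 1))
            out.append((start, start + length))
        return out

    # Toy usage: one answer of 200 frames with a synthetic attention bump.
    attn = rng.random(200) * 0.1
    attn[80:95] += 0.8
    spans = attention_slices(attn)
    print("attention slices:", spans)
    print("random baseline:", random_slices(len(attn), spans))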