Automatic emotion recognition (ER) has recently gained considerable interest due to
its potential in many real-world applications. In this context, multimodal
approaches have been shown to improve performance (over unimodal approaches) by
combining diverse and complementary sources of information, providing some
robustness to noisy and missing modalities. In this paper, we focus on
dimensional ER based on the fusion of facial and vocal modalities extracted
from videos, where complementary audio-visual (A-V) relationships are explored
to predict an individual's emotional states in valence-arousal space. Most
state-of-the-art fusion techniques rely on recurrent networks or conventional
attention mechanisms that do not effectively leverage the complementary nature
of A-V modalities. To address this problem, we introduce a joint
cross-attentional model for A-V fusion that extracts the salient features
across A-V modalities, allowing the inter-modal relationships to be leveraged
effectively while retaining the intra-modal relationships. In particular, it
computes the cross-attention weights based on the correlation between the joint
feature representation and that of the individual modalities. Deploying the
joint A-V feature representation in the cross-attention module allows both
intra- and inter-modal relationships to be leveraged simultaneously, which
significantly improves the performance of the system over the vanilla
cross-attention module. The effectiveness of our proposed approach is validated
experimentally on challenging videos from the RECOLA and AffWild2 datasets.
Results indicate that our joint cross-attentional A-V fusion model provides a
cost-effective solution that can outperform state-of-the-art approaches, even
when the modalities are noisy or absent.
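
For illustration, the following is a minimal NumPy sketch of the joint cross-attention idea described above. The time-wise concatenation of the two modalities into the joint representation, the softmax normalization, the residual connections, and the projection matrices W_a and W_v are assumptions made for this sketch; the exact formulation used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(A, V, W_a, W_v):
    """Sketch of joint cross-attentional A-V fusion.

    A: (T, d) audio features; V: (T, d) visual features (T clips, d dims).
    The joint representation J stacks both modalities, so each modality
    attends over audio AND visual features at once, capturing inter-modal
    relationships, while a residual term retains intra-modal information.
    """
    J = np.concatenate([A, V], axis=0)                   # (2T, d) joint representation
    d = A.shape[1]
    # Attention weights from the correlation between each modality and J.
    C_a = softmax(A @ W_a @ J.T / np.sqrt(d), axis=-1)   # (T, 2T)
    C_v = softmax(V @ W_v @ J.T / np.sqrt(d), axis=-1)   # (T, 2T)
    # Attended features: weighted sums over J plus a residual connection.
    A_att = C_a @ J + A
    V_att = C_v @ J + V
    return np.concatenate([A_att, V_att], axis=-1)       # (T, 2d) fused features

T, d = 8, 16
A = rng.standard_normal((T, d))          # e.g., clip-level vocal embeddings
V = rng.standard_normal((T, d))          # e.g., clip-level facial embeddings
W_a = 0.1 * rng.standard_normal((d, d))  # hypothetical learnable projections
W_v = 0.1 * rng.standard_normal((d, d))
fused = joint_cross_attention(A, V, W_a, W_v)
print(fused.shape)  # (8, 32), e.g., input to a valence-arousal regression head
```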