Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos
Understanding human emotions is a crucial ability for intelligent robots to
provide better human-robot interaction. Existing works are limited to
trimmed video-level emotion classification, failing to locate the temporal
window corresponding to the emotion. In this paper, we introduce a new task,
named Temporal Emotion Localization in videos~(TEL), which aims to detect human
emotions and localize their corresponding temporal boundaries in untrimmed
videos with aligned subtitles. TEL presents three unique challenges compared to
temporal action localization: 1) emotions have extremely varied temporal
dynamics; 2) emotion cues are embedded in both appearances and complex
plots; 3) fine-grained temporal annotations are complicated and
labor-intensive to obtain. To address the first two challenges, we propose a novel
dilated context integrated network with a coarse-fine two-stream architecture.
The coarse stream captures varied temporal dynamics by modeling
multi-granularity temporal contexts. The fine stream achieves understanding
of complex plots by reasoning about the dependencies among the multi-granularity
temporal contexts from the coarse stream and adaptively integrating them into
fine-grained video segment features. To address the third challenge, we
introduce a cross-modal consensus learning paradigm, which leverages the
inherent semantic consensus between the aligned video and subtitle to achieve
weakly-supervised learning. We contribute a new testing set with 3,000
manually-annotated temporal boundaries so that future research on the TEL
problem can be quantitatively evaluated. Extensive experiments show the
effectiveness of our approach on temporal emotion localization. The repository
of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
Comment: Accepted by ACM Multimedia 2022
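To make the coarse-fine design concrete, below is a minimal PyTorch sketch of the two-stream idea: dilated 1D convolutions yield multi-granularity temporal contexts, and a fine stream attends over them per video segment. All module names, dimensions, and dilation rates are illustrative assumptions, not the authors' released implementation (see their repository for that).

import torch
import torch.nn as nn

class CoarseStream(nn.Module):
    """Captures multi-granularity temporal contexts with dilated 1D convs."""
    def __init__(self, dim=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):  # x: (B, dim, T) segment features
        # One context map per granularity -> (B, G, dim, T)
        return torch.stack([torch.relu(b(x)) for b in self.branches], dim=1)

class FineStream(nn.Module):
    """Reasons over the coarse contexts and fuses them into segment features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, seg, contexts):  # seg: (B, T, dim)
        B, G, D, T = contexts.shape
        # Flatten granularities and time into one key/value sequence.
        kv = contexts.permute(0, 1, 3, 2).reshape(B, G * T, D)
        fused, _ = self.attn(seg, kv, kv)  # reason across granularities
        return seg + fused                 # fine-grained segment features

# Usage: ctx = CoarseStream()(feats); out = FineStream()(feats.transpose(1, 2), ctx)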
Graph-based Facial Affect Analysis: A Review of Methods, Applications and Challenges
Facial affect analysis (FAA) using visual signals is important in
human-computer interaction. Early methods focus on extracting appearance and
geometry features associated with human affects, while ignoring the latent
semantic information among individual facial changes, leading to limited
performance and generalization. Recent work attempts to establish a graph-based
representation to model these semantic relationships and develop frameworks to
leverage them for various FAA tasks. In this paper, we provide a comprehensive
review of graph-based FAA, including the evolution of algorithms and their
applications. First, the FAA background knowledge is introduced, especially on
the role of the graph. We then discuss approaches that are widely used for
graph-based affective representation in the literature and summarize trends
in graph construction. For relational reasoning in graph-based FAA, existing
studies are categorized according to their usage of traditional methods or deep
models, with a special emphasis on the latest graph neural networks.
Performance comparisons of the state-of-the-art graph-based FAA methods are
also summarized. Finally, we discuss the challenges and potential directions.
As far as we know, this is the first survey of graph-based FAA methods. Our
findings can serve as a reference for future research in this field.
Comment: 20 pages, 12 figures, 5 tables
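As a generic illustration of the graph-based representation idea the survey covers (not any specific surveyed method), the sketch below treats facial landmarks as graph nodes and applies one graph-convolution step for relational reasoning; the chain adjacency and all dimensions are assumptions for illustration.

import torch
import torch.nn as nn

class LandmarkGCNLayer(nn.Module):
    """One graph-convolution step over facial-landmark nodes."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (B, N, in_dim) node features; adj: (N, N) normalized adjacency
        return torch.relu(self.linear(adj @ x))

# Example: 68 facial landmarks with self-loops and chain-neighbor edges.
N = 68
adj = torch.eye(N)
adj[torch.arange(N - 1), torch.arange(1, N)] = 1.0  # connect adjacent landmarks
adj = adj / adj.sum(-1, keepdim=True)               # row-normalize
layer = LandmarkGCNLayer(in_dim=2, out_dim=64)
out = layer(torch.randn(4, N, 2), adj)              # (4, 68, 64)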
Disentangled Variational Autoencoder for Emotion Recognition in Conversations
In Emotion Recognition in Conversations (ERC), the emotions of target
utterances are closely dependent on their context. Therefore, existing works
train the model to generate the response to the target utterance, aiming to
recognise emotions by leveraging contextual information. However, adjacent
response generation ignores long-range dependencies and provides limited
affective information in many cases. In addition, most ERC models learn a
unified distributed representation for each utterance, which lacks
interpretability and robustness. To address these issues, we propose a
VAD-disentangled Variational AutoEncoder (VAD-VAE), which first introduces a
target utterance reconstruction task based on a Variational Autoencoder and then
disentangles three affect representations, Valence, Arousal and Dominance (VAD), from
the latent space. We also enhance the disentangled representations by
introducing VAD supervision signals from a sentiment lexicon and minimising the
mutual information between VAD distributions. Experiments show that VAD-VAE
outperforms the state-of-the-art model on two datasets. Further analysis
demonstrates the effectiveness of each proposed module and the quality of the
disentangled VAD representations. The code is available at
https://github.com/SteveKGYang/VAD-VAE.
Comment: Accepted by IEEE Transactions on Affective Computing
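A hedged sketch of the training objective the abstract describes: reconstruct the target utterance with a VAE, regularize the latent space, and supervise three disentangled latent dimensions with lexicon VAD scores. Module names and loss weights are assumptions, and the mutual-information minimisation term is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VADVAESketch(nn.Module):
    def __init__(self, in_dim=768, z_dim=64):
        super().__init__()
        self.enc_mu = nn.Linear(in_dim, z_dim)
        self.enc_logvar = nn.Linear(in_dim, z_dim)
        self.dec = nn.Linear(z_dim, in_dim)
        # Read valence, arousal, dominance from the first 3 latent dims.
        self.vad_head = nn.Linear(3, 3)

    def forward(self, utt):  # utt: (B, in_dim) target-utterance features
        mu, logvar = self.enc_mu(utt), self.enc_logvar(utt)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(z)
        vad_pred = self.vad_head(z[:, :3])  # disentangled VAD factors
        return recon, mu, logvar, vad_pred

def loss_fn(model, utt, vad_lexicon_scores, beta=1.0, gamma=0.5):
    recon, mu, logvar, vad_pred = model(utt)
    rec = F.mse_loss(recon, utt)                          # reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    sup = F.mse_loss(vad_pred, vad_lexicon_scores)        # lexicon VAD supervision
    return rec + beta * kl + gamma * sup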
Context De-confounded Emotion Recognition
Context-Aware Emotion Recognition (CAER) is a crucial and challenging task
that aims to perceive the emotional states of the target person with contextual
information. Recent approaches invariably focus on designing sophisticated
architectures or mechanisms to extract seemingly meaningful representations
from subjects and contexts. However, a long-overlooked issue is that a context
bias in existing datasets leads to a significantly unbalanced distribution of
emotional states among different context scenarios. Concretely, the harmful
bias is a confounder that misleads existing models to learn spurious
correlations based on conventional likelihood estimation, significantly
limiting the models' performance. To tackle the issue, this paper provides a
causality-based perspective to disentangle the models from the impact of such
bias and formulates the causalities among variables in the CAER task via a
tailored causal graph. Then, we propose a Contextual Causal Intervention Module
(CCIM) based on the backdoor adjustment to de-confound the confounder and
exploit the true causal effect for model training. CCIM is plug-and-play and
model-agnostic, improving diverse state-of-the-art approaches by
considerable margins. Extensive experiments on three benchmark datasets
demonstrate the effectiveness of our CCIM and the significance of causal
insight.
Comment: Accepted by CVPR 2023. CCIM is available at https://github.com/ydk122024/CCIM
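The backdoor adjustment mentioned above replaces the likelihood P(Y|X) with P(Y|do(X)) = sum_z P(Y|X,z)P(z). A common practical approximation, sketched below under assumed names and sizes (this is not the released CCIM code), keeps a dictionary of context-confounder prototypes and fuses their prior-weighted combination into the subject feature before classification.

import torch
import torch.nn as nn

class BackdoorAdjustment(nn.Module):
    def __init__(self, feat_dim=512, num_confounders=128):
        super().__init__()
        # Confounder dictionary, e.g. clustered context features.
        self.dictionary = nn.Parameter(torch.randn(num_confounders, feat_dim))
        # Priors P(z_k); uniform here, could be cluster frequencies.
        self.register_buffer("prior", torch.full((num_confounders,),
                                                 1.0 / num_confounders))
        self.query = nn.Linear(feat_dim, feat_dim)

    def forward(self, x):  # x: (B, feat_dim) subject feature
        # Attention approximates the per-sample P(Y|X, z_k) weighting.
        logits = self.query(x) @ self.dictionary.t()       # (B, K)
        attn = torch.softmax(logits, dim=-1) * self.prior  # weight by P(z)
        attn = attn / attn.sum(-1, keepdim=True)
        z = attn @ self.dictionary                         # (B, feat_dim)
        return x + z  # de-confounded feature for the downstream classifier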
Utilizing External Knowledge to Enhance Semantics in Emotion Detection
Enabling machines to recognize emotions in conversation is challenging, mainly because the information in human dialogue innately conveys emotions through long-term experience, abundant knowledge, context, and the intricate patterns between affective states. We address the task of emotion recognition in conversations by using external knowledge to enhance semantics. We propose the KES model, a new framework that incorporates different elements of external knowledge and conversational semantic role labeling and builds upon them to learn the interactions between interlocutors participating in a conversation. We design a self-attention layer specialized for enhancing semantic text features with external commonsense knowledge. Then, two LSTM-based networks are responsible for tracking the individual internal state and the contextual external state. We evaluate the proposed model on three emotion detection in conversation datasets. The experimental results show that our model outperforms the state-of-the-art approaches on most of the tested datasets.
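A hedged sketch of the pipeline the abstract outlines: commonsense-enhanced utterance features pass through a self-attention layer, then two LSTM networks track the speaker's internal state and the contextual external state. Dimensions, the fusion, and the classifier head are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class KESSketch(nn.Module):
    def __init__(self, dim=300, num_emotions=7):
        super().__init__()
        self.knowledge_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                    batch_first=True)
        self.internal_lstm = nn.LSTM(dim, dim, batch_first=True)  # speaker state
        self.external_lstm = nn.LSTM(dim, dim, batch_first=True)  # context state
        self.classifier = nn.Linear(2 * dim, num_emotions)

    def forward(self, utt_feats, knowledge_feats):
        # utt_feats, knowledge_feats: (B, T, dim); knowledge_feats would come
        # from a commonsense resource, e.g. retrieved concept embeddings.
        enhanced, _ = self.knowledge_attn(utt_feats, knowledge_feats,
                                          knowledge_feats)
        internal, _ = self.internal_lstm(enhanced)   # per-speaker dynamics
        external, _ = self.external_lstm(utt_feats)  # conversational context
        return self.classifier(torch.cat([internal, external], dim=-1))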