Multimodal Sentiment Analysis Based on Deep Learning: Recent Progress
Multimodal sentiment analysis is an important research topic in the field of NLP, aiming to analyze speakers' sentiment tendencies through features extracted from textual, visual, and acoustic modalities. Its main methods are based on machine learning and deep learning. Machine learning-based methods rely heavily on labeled data, whereas deep learning-based methods can overcome this shortcoming and capture the in-depth semantic information and modal characteristics of the data, as well as the interactive information between multimodal data. In this paper, we survey the deep learning-based methods, including the fusion of text and image and the fusion of text, image, audio, and video. Specifically, we discuss the main problems of these methods and the future directions. Finally, we review the work of multimodal sentiment analysis in conversation.
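The fusion strategies surveyed above can be illustrated with a minimal late-fusion sketch: each modality is encoded separately and the resulting vectors are concatenated before a shared classifier. The `SentimentFusion` module and all dimensions below are hypothetical, not taken from any surveyed system.

```python
import torch
import torch.nn as nn

class SentimentFusion(nn.Module):
    """Toy late-fusion model: concatenate per-modality features, then classify."""
    def __init__(self, text_dim=768, image_dim=512, audio_dim=128, num_classes=3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),  # e.g. negative / neutral / positive
        )

    def forward(self, text_feat, image_feat, audio_feat):
        fused = torch.cat([text_feat, image_feat, audio_feat], dim=-1)
        return self.classifier(fused)

# Random features stand in for real encoder outputs.
model = SentimentFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 3])
```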
UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition
Multimodal sentiment analysis (MSA) and emotion recognition in conversation
(ERC) are key research topics for computers to understand human behaviors. From
a psychological perspective, emotions are the expression of affect or feelings
during a short period, while sentiments are formed and held for a longer
period. However, most existing works study sentiment and emotion separately and
do not fully exploit the complementary knowledge behind the two. In this paper,
we propose a multimodal sentiment knowledge-sharing framework (UniMSE) that
unifies MSA and ERC tasks from features, labels, and models. We perform
modality fusion at the syntactic and semantic levels and introduce contrastive
learning between modalities and samples to better capture the difference and
consistency between sentiments and emotions. Experiments on four public
benchmark datasets, MOSI, MOSEI, MELD, and IEMOCAP, demonstrate the
effectiveness of the proposed method and show consistent improvements over
state-of-the-art methods.
Comment: Accepted to EMNLP 2022 main conference.
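As a rough illustration of the cross-modal contrastive idea described above (not UniMSE's actual objective), the sketch below computes an InfoNCE-style loss that pulls paired text/audio representations of the same utterance together and pushes apart representations from different utterances. The projection size and temperature are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """InfoNCE-style loss: matching (text, audio) pairs are positives,
    all other pairs in the batch are negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0))          # i-th text matches i-th audio
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```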
GraphMFT: A Graph Network based Multimodal Fusion Technique for Emotion Recognition in Conversation
Multimodal machine learning is an emerging area of research, which has
received a great deal of scholarly attention in recent years. To date, however,
there have been few studies on multimodal Emotion Recognition in Conversation (ERC). Since
Graph Neural Networks (GNNs) possess the powerful capacity of relational
modeling, they have an inherent advantage in the field of multimodal learning.
GNNs leverage the graph constructed from multimodal data to perform intra- and
inter-modal information interaction, which effectively facilitates the
integration and complementation of multimodal data. In this work, we propose a
novel Graph network based Multimodal Fusion Technique (GraphMFT) for emotion
recognition in conversation. Multimodal data can be modeled as a graph, where
each data object is regarded as a node, and both intra- and inter-modal
dependencies existing between data objects can be regarded as edges. GraphMFT
utilizes multiple improved graph attention networks to capture intra-modal
contextual information and inter-modal complementary information. In addition,
the proposed GraphMFT attempts to address the challenges of existing
graph-based multimodal conversational emotion recognition models such as MMGCN.
Empirical results on two public multimodal datasets reveal that our model
outperforms the state-of-the-art (SOTA) approaches with accuracies of 67.90%
and 61.30%, respectively.
Comment: Accepted by Neurocomputing.
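The graph construction described above (per-modality utterance features as nodes, intra- and inter-modal dependencies as edges) can be sketched roughly as follows. This is an illustrative adjacency-building routine under assumed inputs, not GraphMFT's published implementation; the context window and modality names are invented for the example.

```python
import itertools

def build_multimodal_graph(num_utterances, modalities=("text", "audio", "visual"), context=1):
    """Nodes are (utterance index, modality) pairs; edges connect nearby
    utterances within a modality (intra-modal) and the different modalities
    of the same utterance (inter-modal)."""
    nodes = [(i, m) for i in range(num_utterances) for m in modalities]
    edges = set()
    # Intra-modal edges: link each utterance to its neighbours in the context window.
    for m in modalities:
        for i in range(num_utterances):
            for j in range(max(0, i - context), min(num_utterances, i + context + 1)):
                if i != j:
                    edges.add(((i, m), (j, m)))
    # Inter-modal edges: link the modalities of the same utterance to each other.
    for i in range(num_utterances):
        for m1, m2 in itertools.permutations(modalities, 2):
            edges.add(((i, m1), (i, m2)))
    return nodes, sorted(edges)

nodes, edges = build_multimodal_graph(3)
print(len(nodes), len(edges))  # 9 nodes for 3 utterances x 3 modalities
```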
LineConGraphs: Line Conversation Graphs for Effective Emotion Recognition using Graph Neural Networks
Emotion Recognition in Conversations (ERC) is a critical aspect of affective
computing, and it has many practical applications in healthcare, education,
chatbots, and social media platforms. Earlier approaches for ERC analysis
involved modeling both speaker and long-term contextual information using graph
neural network architectures. However, speaker-independent models are preferable
for real-world deployment. Additionally, long
context windows can potentially create confusion in recognizing the emotion of
an utterance in a conversation. To overcome these limitations, we propose novel
line conversation graph convolutional network (LineConGCN) and graph attention
(LineConGAT) models for ERC analysis. These models are speaker-independent and
built using a graph construction strategy for conversations -- line
conversation graphs (LineConGraphs). The conversational context in
LineConGraphs is short-term -- limited to one previous and one future utterance,
and speaker information is not part of the graph. We evaluate the performance
of our proposed models on two benchmark datasets, IEMOCAP and MELD, and show
that our LineConGAT model outperforms the state-of-the-art methods with
F1-scores of 64.58% and 76.50%, respectively. Moreover, we demonstrate that
embedding sentiment-shift information into line conversation graphs further
enhances ERC performance in the case of GCN models.
Comment: 13 pages, 6 figures.
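A line conversation graph as described above (each utterance linked only to its immediately preceding and following utterances, with no speaker nodes) reduces to a simple path graph over utterance indices. The 2 x num_edges index-tensor format below is an assumption for illustration, chosen because it is a common edge-list convention in graph libraries.

```python
import torch

def line_conversation_edges(num_utterances):
    """Edges of a line conversation graph: utterance i <-> i+1, both directions."""
    src, dst = [], []
    for i in range(num_utterances - 1):
        src += [i, i + 1]
        dst += [i + 1, i]
    return torch.tensor([src, dst], dtype=torch.long)

edge_index = line_conversation_edges(5)
print(edge_index)
# tensor([[0, 1, 1, 2, 2, 3, 3, 4],
#         [1, 0, 2, 1, 3, 2, 4, 3]])
```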
GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition
Emotion Recognition in Conversation (ERC) plays a significant part in
Human-Computer Interaction (HCI) systems since it can provide empathetic
services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches.
Recently, Graph Neural Networks (GNNs) have been widely used in a variety of
fields due to their superior performance in relation modeling. In multimodal
ERC, GNNs are capable of extracting both long-distance contextual information
and inter-modal interactive information. Unfortunately, since existing methods
such as MMGCN directly fuse multiple modalities, redundant information may be
generated and diverse information may be lost. In this work, we present a
directed Graph based Cross-modal Feature Complementation (GraphCFC) module that
can efficiently model contextual and interactive information. GraphCFC
alleviates the heterogeneity-gap problem in multimodal fusion by utilizing
multiple subspace extractors and a Pair-wise Cross-modal Complementary (PairCC)
strategy. We extract various types of edges from the constructed graph for
encoding, thus enabling GNNs to extract crucial contextual and interactive
information more accurately when performing message passing. Furthermore, we
design a GNN structure called GAT-MLP, which can provide a new unified network
framework for multimodal learning. The experimental results on two benchmark
datasets show that our GraphCFC outperforms the state-of-the-art (SOTA)
approaches.
Comment: 13 pages.
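The GAT-MLP idea mentioned above (a graph-attention step followed by a feed-forward step, roughly in the spirit of a Transformer block applied on a graph) might look something like the block below. This is a speculative reconstruction rather than the paper's architecture; it assumes PyTorch Geometric's GATConv and invents all dimensions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GATMLPBlock(nn.Module):
    """Illustrative block: graph attention over the utterance graph,
    followed by a position-wise MLP, each with a residual connection."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.gat = GATConv(dim, dim // heads, heads=heads)  # output dim = dim
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, edge_index):
        x = self.norm1(x + self.gat(x, edge_index))
        return self.norm2(x + self.mlp(x))

x = torch.randn(10, 256)                    # 10 graph nodes with 256-d features
edge_index = torch.randint(0, 10, (2, 30))  # random edges for the example
print(GATMLPBlock()(x, edge_index).shape)   # torch.Size([10, 256])
```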
Incorporating Learner Emotions through Sentiment Analysis in Adaptive E-learning Systems: A Pilot Study
This research explores the incorporation of learner emotions into adaptive E-learning systems through sentiment analysis techniques. In a pilot study with 40 undergraduate computer science students, we investigated the ability of an adaptive system to detect boredom and frustration in learner forum posts and subsequently personalize content or offer support based on these emotional states. This approach proved successful: learners in the experimental group who received emotion-based adaptation exhibited both increased engagement (reflected in higher time spent on tasks) and improved learning outcomes (evidenced by higher post-test scores). Furthermore, qualitative feedback revealed positive responses to the personalized interventions, indicating that learners appreciated the tailored support provided by the system. While acknowledging limitations such as the small sample size and single subject area, this study highlights the promising potential of emotion-aware adaptive systems. By addressing the emotional dynamics of the learning process, such systems can pave the way for truly personalized and responsive E-learning environments that cater to individual learner needs and foster deeper engagement, positive learning experiences, and ultimately, success for all students.
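As a toy illustration of the detect-then-adapt loop described above (not the system studied in the pilot), the snippet below flags boredom or frustration in a forum post with a naive keyword heuristic and picks a corresponding intervention. The keyword lists and intervention messages are invented for the example.

```python
BOREDOM_CUES = {"boring", "tedious", "pointless", "dull"}
FRUSTRATION_CUES = {"stuck", "confusing", "frustrated", "can't", "impossible"}

def detect_state(post: str) -> str:
    """Naive keyword-based detector for a learner's affective state."""
    words = set(post.lower().split())
    if words & FRUSTRATION_CUES:
        return "frustration"
    if words & BOREDOM_CUES:
        return "boredom"
    return "neutral"

def adapt(post: str) -> str:
    """Map the detected state to a (hypothetical) adaptive intervention."""
    return {
        "frustration": "Offer a worked example and a hint for the current exercise.",
        "boredom": "Suggest a more challenging, gamified task.",
        "neutral": "Continue with the standard learning path.",
    }[detect_state(post)]

print(adapt("I'm stuck on recursion and it's really confusing"))
```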
CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition
Multimodal Emotion Recognition in Conversation (ERC) has garnered growing
attention from research communities in various fields. In this paper, we
propose a cross-modal fusion network with emotion-shift awareness (CFN-ESA) for
ERC. Extant approaches employ each modality equally without distinguishing the
amount of emotional information, rendering it hard to adequately extract
complementary and associative information from multimodal data. To cope with
this problem, in CFN-ESA, textual modalities are treated as the primary source
of emotional information, while visual and acoustic modalities are taken as the
secondary sources. Besides, most multimodal ERC models ignore emotion-shift
information and overfocus on contextual information, leading to the failure of
emotion recognition under emotion-shift scenarios. We elaborate an emotion-shift
module to address this challenge. CFN-ESA mainly consists of a unimodal
encoder (RUME), a cross-modal encoder (ACME), and an emotion-shift module (LESM).
RUME is applied to extract conversation-level contextual emotional cues while
pulling together the data distributions between modalities; ACME is utilized to
perform multimodal interaction centered on textual modality; LESM is used to
model emotion shift and capture related information, thereby guiding the learning
of the main task. Experimental results demonstrate that CFN-ESA can effectively
improve ERC performance and remarkably outperform the state-of-the-art
models.
Comment: 13 pages, 10 figures.
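The text-as-primary-modality idea above (visual and acoustic features attended to from the textual representation) can be loosely illustrated with standard cross-attention, where text provides the queries and each secondary modality provides keys and values. This is a generic sketch using PyTorch's MultiheadAttention, not the ACME encoder itself; all dimensions are made up.

```python
import torch
import torch.nn as nn

class TextCenteredFusion(nn.Module):
    """Text queries attend to audio and visual features; the results are
    added back to the text representation (a generic cross-attention sketch)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, audio, visual):
        a, _ = self.attn_audio(query=text, key=audio, value=audio)
        v, _ = self.attn_visual(query=text, key=visual, value=visual)
        return text + a + v

fusion = TextCenteredFusion()
text, audio, visual = (torch.randn(2, 10, 256) for _ in range(3))  # (batch, seq, dim)
print(fusion(text, audio, visual).shape)  # torch.Size([2, 10, 256])
```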
Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition
It has been a hot research topic to enable machines to understand human
emotions in multimodal contexts under dialogue scenarios, a task known as
multimodal emotion recognition in conversation (MM-ERC). MM-ERC has received
consistent attention in recent years, where a diverse range of methods has been
proposed for securing better task performance. Most existing works treat MM-ERC
as a standard multimodal classification problem and perform multimodal feature
disentanglement and fusion for maximizing feature utility. Yet after revisiting
the characteristic of MM-ERC, we argue that both the feature multimodality and
conversational contextualization should be properly modeled simultaneously
during the feature disentanglement and fusion steps. In this work, we aim to
further push the task performance by taking full account of the above
insights. On the one hand, during feature disentanglement, based on the
contrastive learning technique, we devise a Dual-level Disentanglement
Mechanism (DDM) to decouple the features into both the modality space and
utterance space. On the other hand, during the feature fusion stage, we propose
a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism
(CRM) for multimodal and context integration, respectively. They together
schedule the proper integrations of multimodal and context features.
Specifically, CFM explicitly manages the multimodal feature contributions
dynamically, while CRM flexibly coordinates the introduction of dialogue
contexts. On two public MM-ERC datasets, our system achieves new
state-of-the-art performance consistently. Further analyses demonstrate that
all our proposed mechanisms greatly facilitate the MM-ERC task by making full
use of the multimodal and context features adaptively. Note that our proposed
methods have great potential to facilitate a broader range of other
conversational multimodal tasks.
Comment: Accepted by ACM MM 2023.
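A contribution-aware fusion in the spirit of the CFM described above can be sketched as a learned gating over modality features: the network predicts a weight per modality for each utterance and mixes the features accordingly. This is an illustrative gating layer under assumed dimensions, not the paper's CFM.

```python
import torch
import torch.nn as nn

class ContributionAwareFusion(nn.Module):
    """Predict a softmax weight for each modality and return the weighted sum."""
    def __init__(self, dim=256, num_modalities=3):
        super().__init__()
        self.gate = nn.Linear(num_modalities * dim, num_modalities)

    def forward(self, feats):  # feats: list of (batch, dim) tensors, one per modality
        stacked = torch.stack(feats, dim=1)                                    # (batch, M, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)   # (batch, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                    # (batch, dim)

fusion = ContributionAwareFusion()
text, audio, visual = (torch.randn(4, 256) for _ in range(3))
print(fusion([text, audio, visual]).shape)  # torch.Size([4, 256])
```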