Semi-supervised Deep Generative Modelling of Incomplete Multi-Modality Emotional Data
There are three main challenges in emotion recognition. First, it is difficult
to recognize humans' emotional states when considering only a single modality.
Second, it is expensive to manually annotate the emotional data. Third,
emotional data often suffers from missing modalities due to unforeseeable
sensor malfunction or configuration issues. In this paper, we address all these
problems under a novel multi-view deep generative framework. Specifically, we
propose to model the statistical relationships of multi-modality emotional data
using multiple modality-specific generative networks with a shared latent
space. By imposing a Gaussian mixture assumption on the posterior approximation
of the shared latent variables, our framework can learn the joint deep
representation from multiple modalities and evaluate the importance of each
modality simultaneously. To solve the labeled-data-scarcity problem, we extend
our multi-view model to the semi-supervised learning scenario by casting the
semi-supervised classification problem as a specialized missing data imputation
task. To address the missing-modality problem, we further extend our
semi-supervised multi-view model to deal with incomplete data, where a missing
view is treated as a latent variable and integrated out during inference. This
way, the proposed overall framework can utilize all available (both labeled and
unlabeled, as well as both complete and incomplete) data to improve its
generalization ability. The experiments conducted on two real multi-modal
emotion datasets demonstrated the superiority of our framework. Comment: arXiv admin note: text overlap with arXiv:1704.07548, 2018 ACM Multimedia Conference (MM'18).
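To make the modeling idea concrete, here is a minimal sketch (not the authors' code) of modality-specific generative networks around one shared latent space, with per-modality posterior statistics combined through learnable mixture weights that act as modality-importance scores; all layer sizes, names, and the simplified combination rule are illustrative assumptions.

import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Maps one modality to the mean/log-variance of a shared latent space."""
    def __init__(self, in_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class MultiViewVAE(nn.Module):
    """Modality-specific encoders/decoders sharing one latent variable."""
    def __init__(self, modality_dims, latent_dim=32):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(d, latent_dim) for d in modality_dims)
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, d))
            for d in modality_dims
        )
        # Unnormalised mixture weights; their softmax reflects modality importance.
        self.mix_logits = nn.Parameter(torch.zeros(len(modality_dims)))

    def forward(self, views):
        # views: list of (batch, dim_m) tensors, one per modality
        stats = [enc(v) for enc, v in zip(self.encoders, views)]
        w = torch.softmax(self.mix_logits, dim=0)
        mu = sum(wi * m for wi, (m, _) in zip(w, stats))         # combined posterior mean
        logvar = sum(wi * lv for wi, (_, lv) in zip(w, stats))   # combined log-variance
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        recons = [dec(z) for dec in self.decoders]                # reconstruct every modality
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
        return recons, kl, w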
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels, allowing it to produce
meaningful predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio. Comment: 14 pages, 7 figures.
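As a rough illustration of the ModDrop idea (a sketch under assumed input shapes, not the authors' implementation), the layer below independently zeroes out whole modality channels during training so that the downstream fusion network learns cross-modality correlations while remaining usable when channels are missing.

import torch
import torch.nn as nn

class ModDrop(nn.Module):
    """Randomly drops entire modality channels during training."""
    def __init__(self, p=0.2):
        super().__init__()
        self.p = p  # probability of dropping a modality for a given sample

    def forward(self, modalities):
        # modalities: list of (batch, feat_dim_m) tensors, one per channel
        if not self.training:
            return modalities
        dropped = []
        for x in modalities:
            # one keep/drop decision per sample, applied to the whole modality
            keep = (torch.rand(x.size(0), 1, device=x.device) > self.p).float()
            dropped.append(x * keep)
        return dropped

In a fusion network such a layer would sit between the per-modality feature extractors and the shared layers, so the cross-modality weights are learned under randomly missing channels.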
Missing Modality meets Meta Sampling (M3S): An Efficient Universal Approach for Multimodal Sentiment Analysis with Missing Modality
Multimodal sentiment analysis (MSA) is an important way of observing mental
activities with the help of data captured from multiple modalities. However,
due to recording or transmission errors, some modalities may contain
incomplete data. Most existing works that address missing modalities usually
assume a particular modality is completely missing and seldom consider a
mixture of missing data across multiple modalities. In this paper, we propose a
simple yet effective meta-sampling approach for multimodal sentiment analysis
with missing modalities, namely Missing Modality-based Meta Sampling (M3S). To
be specific, M3S formulates a missing-modality sampling strategy within the
model-agnostic meta-learning (MAML) framework. M3S can be treated as an efficient
add-on training component for existing models, significantly improving their
performance on multimodal data with a mixture of missing modalities. We
conduct experiments on the IEMOCAP, SIMS, and CMU-MOSI datasets and achieve
superior performance compared with recent state-of-the-art methods.
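The following is a simplified, first-order sketch of the meta-sampling idea: inner-loop adaptation and outer-loop evaluation are each performed under randomly sampled missing-modality masks. The mask distribution, the model interface (a model taking a list of modality tensors), and the first-order gradient approximation are illustrative assumptions, not the paper's exact procedure.

import copy
import random
import torch

def sample_missing_mask(num_modalities, p_missing=0.3):
    """Mark each modality present (1) or missing (0), keeping at least one."""
    mask = [0 if random.random() < p_missing else 1 for _ in range(num_modalities)]
    if sum(mask) == 0:
        mask[random.randrange(num_modalities)] = 1
    return mask

def meta_step(model, loss_fn, support, query, inner_lr=1e-3, meta_opt=None):
    # support / query: (list_of_modality_tensors, labels)
    fast = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)

    # Inner loop: adapt on the support set with a sampled missing-modality mask.
    mask = sample_missing_mask(len(support[0]))
    xs = [x * m for x, m in zip(support[0], mask)]
    inner_opt.zero_grad()
    loss_fn(fast(xs), support[1]).backward()
    inner_opt.step()

    # Outer loop: evaluate the adapted model on the query set under a new mask.
    mask = sample_missing_mask(len(query[0]))
    xq = [x * m for x, m in zip(query[0], mask)]
    fast.zero_grad()
    meta_loss = loss_fn(fast(xq), query[1])
    meta_loss.backward()

    if meta_opt is not None:
        # First-order approximation: reuse the adapted model's gradients.
        meta_opt.zero_grad()
        for p, fp in zip(model.parameters(), fast.parameters()):
            if fp.grad is not None:
                p.grad = fp.grad.clone()
        meta_opt.step()
    return meta_loss.item()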
Multimodal Sentiment Analysis: A Survey
Multimodal sentiment analysis has become an important research area in the
field of artificial intelligence. With the latest advances in deep learning,
this technology has reached new heights. It has great potential for both
application and research, making it a popular research topic. This review
provides an overview of the definition, background, and development of
multimodal sentiment analysis. It also covers recent datasets and advanced
models, emphasizing the challenges and future prospects of this technology.
Finally, it looks ahead to future research directions, offering constructive
suggestions for promising directions and for building better-performing
multimodal sentiment analysis models, which can help researchers in this
field. Comment: It needs to be returned for major modification.
Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition
Recently, wearable emotion recognition based on peripheral physiological
signals has drawn massive attention due to its less invasive nature and its
applicability in real-life scenarios. However, how to effectively fuse
multimodal data remains a challenging problem. Moreover, traditional
fully-supervised approaches suffer from overfitting given limited labeled
data. To address the above issues, we propose a novel self-supervised learning
(SSL) framework for wearable emotion recognition, where efficient multimodal
fusion is realized with temporal convolution-based modality-specific encoders
and a transformer-based shared encoder, capturing both intra-modal and
inter-modal correlations. Extensive unlabeled data is automatically assigned
labels by five signal transforms, and the proposed SSL model is pre-trained
with signal transformation recognition as a pretext task, allowing the
extraction of generalized multimodal representations for emotion-related
downstream tasks. For evaluation, the proposed SSL model was first pre-trained
on a large-scale self-collected physiological dataset and the resulting encoder
was subsequently frozen or fine-tuned on three public supervised emotion
recognition datasets. Ultimately, our SSL-based method achieved
state-of-the-art results in various emotion classification tasks. Meanwhile,
the proposed model proved to be more accurate and robust than
fully-supervised methods in low-data regimes. Comment: Accepted to IEEE Transactions on Affective Computing.
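A rough sketch of the described architecture and pretext task follows (the sizes, number of transforms, and pooling choice are placeholder assumptions, not the paper's implementation): temporal-convolution encoders per signal, a shared transformer encoder over the concatenated token sequences, and a head that predicts which signal transformation was applied.

import torch
import torch.nn as nn

class SignalEncoder(nn.Module):
    """Temporal convolution over one physiological channel."""
    def __init__(self, d_model=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, padding=3), nn.ReLU(),
        )

    def forward(self, x):               # x: (batch, time)
        h = self.conv(x.unsqueeze(1))   # (batch, d_model, time)
        return h.transpose(1, 2)        # (batch, time, d_model)

class SSLPretextModel(nn.Module):
    """Modality-specific encoders + shared transformer + transform-ID head."""
    def __init__(self, num_modalities=3, num_transforms=5, d_model=64):
        super().__init__()
        self.encoders = nn.ModuleList(SignalEncoder(d_model) for _ in range(num_modalities))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_transforms)  # which transform was applied?

    def forward(self, signals):
        # signals: list of (batch, time) tensors, one per modality
        tokens = torch.cat([enc(x) for enc, x in zip(self.encoders, signals)], dim=1)
        fused = self.shared(tokens)           # intra- and inter-modal attention
        return self.head(fused.mean(dim=1))   # pooled logits over the transform labels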
Evaluating Temporal Patterns in Applied Infant Affect Recognition
Agents must monitor their partners' affective states continuously in order to
understand and engage in social interactions. However, methods for evaluating
affect recognition do not account for changes in classification performance
that may occur during occlusions or transitions between affective states. This
paper addresses temporal patterns in affect classification performance in the
context of an infant-robot interaction, where infants' affective states
contribute to their ability to participate in a therapeutic leg movement
activity. To support robustness to facial occlusions in video recordings, we
trained infant affect recognition classifiers using both facial and body
features. Next, we conducted an in-depth analysis of our best-performing models
to evaluate how performance changed over time as the models encountered missing
data and changing infant affect. During time windows when features were
extracted with high confidence, a unimodal model trained on facial features
achieved the same optimal performance as multimodal models trained on both
facial and body features. However, multimodal models outperformed unimodal
models when evaluated on the entire dataset. Additionally, model performance
was weakest when predicting an affective state transition and improved after
multiple predictions of the same affective state. These findings emphasize the
benefits of incorporating body features in continuous affect recognition for
infants. Our work highlights the importance of evaluating variability in model
performance both over time and in the presence of missing data when applying
affect recognition to social interactions. Comment: 8 pages, 6 figures, 10th International Conference on Affective Computing and Intelligent Interaction (ACII 2022).
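As an illustration of this kind of temporal evaluation (a sketch; the paper's exact windowing and metrics may differ), the helper below splits frame-level accuracy into frames where the true affective state has just changed versus frames inside a stable segment.

import numpy as np

def transition_vs_steady_accuracy(y_true, y_pred):
    """Compare accuracy at state-transition frames with steady-state frames."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = (y_true == y_pred)
    # a frame is a "transition" if its true label differs from the previous frame
    is_transition = np.concatenate(([False], y_true[1:] != y_true[:-1]))
    return {
        "transition_acc": correct[is_transition].mean() if is_transition.any() else float("nan"),
        "steady_acc": correct[~is_transition].mean(),
    }

# Example: accuracy drops at the frames where the affective state changes.
print(transition_vs_steady_accuracy([0, 0, 1, 1, 1, 2], [0, 0, 0, 1, 1, 2]))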