Temporal Cross-Media Retrieval with Soft-Smoothing
Multimedia information has strong temporal correlations that shape the way
modalities co-occur over time. In this paper we study the dynamic nature of
multimedia and social-media information, where the temporal dimension emerges
as a strong source of evidence for learning the temporal correlations across
visual and textual modalities. So far, cross-media retrieval models have
explored the correlations between different modalities (e.g. text and image)
to learn a common subspace, in which semantically similar instances lie in
the same neighbourhood. Building on such knowledge, we propose a novel
temporal cross-media neural architecture that departs from standard
cross-media methods by explicitly accounting for the temporal dimension
through temporal subspace learning. The model is softly constrained with
temporal and inter-modality constraints that guide the new subspace learning
task by favouring temporal correlations between semantically similar and
temporally close instances. Experiments on three distinct datasets show that
accounting for time is important for cross-media retrieval: the proposed
method outperforms a set of baselines on the task of temporal cross-media
retrieval, demonstrating its effectiveness for temporal subspace learning.
Comment: To appear in ACM MM 201
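The abstract describes soft temporal and inter-modality constraints on a
learned common subspace. As a rough illustration, here is a minimal PyTorch
sketch of one way such a soft temporal constraint could be combined with a
standard cross-modal ranking loss; the exponential temporal weighting, the
margin, and all names are assumptions for illustration, not the authors'
exact formulation.

```python
# Minimal sketch of a softly temporally-constrained cross-media loss.
# The weighting function and margin are illustrative assumptions.
import torch
import torch.nn.functional as F

def temporal_weight(t_a, t_b, tau=1.0):
    """Soft weight that decays with the temporal distance between instances."""
    return torch.exp(-torch.abs(t_a - t_b) / tau)

def temporal_cross_media_loss(img_emb, txt_emb, t_img, t_txt, margin=0.2):
    """img_emb, txt_emb: (N, d) L2-normalised embeddings; pairs at the same
    index are semantically matched. t_img, t_txt: (N,) timestamps."""
    sim = img_emb @ txt_emb.t()               # (N, N) cross-modal similarities
    pos = sim.diag().unsqueeze(1)             # matched-pair similarities
    off_diag = ~torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    rank = (F.relu(margin + sim - pos) * off_diag).mean()   # hinge ranking
    # Soft temporal constraint: pull matched pairs together more strongly
    # the closer they are in time.
    w = temporal_weight(t_img, t_txt)
    temporal = (w * (1.0 - sim.diag())).mean()
    return rank + temporal
```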
A Deep Multi-Level Attentive network for Multimodal Sentiment Analysis
Multimodal sentiment analysis has attracted increasing attention and has
broad application prospects. Existing methods focus on a single modality,
which fails to capture social-media content that spans multiple modalities.
Moreover, in multimodal learning, most works simply combine the two
modalities without exploring the complicated correlations between them,
which results in unsatisfactory performance for multimodal sentiment
classification. Motivated by this status quo, we propose a Deep Multi-Level
Attentive network, which exploits the correlation between the image and text
modalities to improve multimodal learning. Specifically, we generate a
bi-attentive visual map along the spatial and channel dimensions to magnify
the representational power of CNNs. We then model the correlation between
image regions and word semantics by applying semantic attention to extract
the textual features related to the bi-attentive visual features. Finally,
self-attention is employed to automatically fetch the sentiment-rich
multimodal features for classification. Extensive evaluations on four
real-world datasets, namely MVSA-Single, MVSA-Multiple, Flickr, and Getty
Images, verify the superiority of our method.
Comment: 11 pages, 7 figures
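To make the multi-level attention pipeline concrete, the following PyTorch
sketch shows the two earlier stages named in the abstract: channel-plus-
spatial "bi-attentive" visual attention, then semantic attention from the
attended visual vector over word features. Layer sizes, names, and wiring
are assumptions, not the authors' code.

```python
# Illustrative sketch of the attention stages named in the abstract.
import torch
import torch.nn as nn

class BiAttentiveVisual(nn.Module):
    """Channel attention followed by spatial attention over CNN features."""
    def __init__(self, c):
        super().__init__()
        self.channel = nn.Sequential(
            nn.Linear(c, c // 4), nn.ReLU(),
            nn.Linear(c // 4, c), nn.Sigmoid())
        self.spatial = nn.Conv2d(c, 1, kernel_size=1)

    def forward(self, feat):                      # feat: (B, C, H, W)
        ch = self.channel(feat.mean(dim=(2, 3)))  # (B, C) channel gates
        feat = feat * ch[:, :, None, None]
        sp = torch.sigmoid(self.spatial(feat))    # (B, 1, H, W) spatial map
        return feat * sp

class SemanticAttention(nn.Module):
    """Attend over word features, using the visual summary as the query."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, vis_vec, words):  # vis_vec: (B, d), words: (B, T, d)
        scores = self.score(torch.tanh(words + vis_vec.unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)      # attention over T words
        return (alpha * words).sum(dim=1)         # (B, d) text summary
```

The final stage described in the abstract, self-attention over the fused
multimodal features before classification, could be realised with a standard
module such as `torch.nn.MultiheadAttention`.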
General Debiasing for Multimodal Sentiment Analysis
Existing work on Multimodal Sentiment Analysis (MSA) utilizes multimodal
information for prediction yet unavoidably suffers from fitting the spurious
correlations between multimodal features and sentiment labels. For example, if
most videos with a blue background have positive labels in a dataset, the model
will rely on such correlations for prediction, while "blue background" is not
a sentiment-related feature. To address this problem, we define a general
debiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD)
generalization ability of MSA models by reducing their reliance on spurious
correlations. To this end, we propose a general debiasing framework based on
Inverse Probability Weighting (IPW), which adaptively assigns small weights to
the samples with larger bias (i.e., those with more severe spurious
correlations). The key to this debiasing framework is to estimate the bias of
each sample, which is achieved in two steps: 1) disentangling the robust and
biased features in each modality, and 2) utilizing the biased features to
estimate the bias.
Finally, we employ IPW to reduce the effects of large-biased samples,
facilitating robust feature learning for sentiment prediction. To examine the
model's generalization ability, we keep the original testing sets on two
benchmarks and additionally construct multiple unimodal and multimodal OOD
testing sets. The empirical results demonstrate the superior generalization
ability of our proposed framework. We have released the code and data to
facilitate reproduction.
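As a rough illustration of the IPW reweighting described above, the sketch
below downweights samples whose biased features alone predict the label
confidently. The bias estimator and the exact weighting form are assumptions
for illustration, not the released code.

```python
# Hypothetical IPW-style debiasing sketch: samples whose *biased* features
# alone predict the label well get smaller training weights.
import torch
import torch.nn.functional as F

def ipw_weights(biased_logits, labels, eps=1e-6):
    """Inverse-probability weights from a bias-only classifier's confidence."""
    p_bias = torch.softmax(biased_logits, dim=1)
    p_bias = p_bias.gather(1, labels.unsqueeze(1)).squeeze(1)  # (N,)
    w = 1.0 / (p_bias + eps)          # high bias confidence -> small weight
    return w / w.mean()               # normalise within the batch

def debiased_loss(main_logits, biased_logits, labels):
    w = ipw_weights(biased_logits.detach(), labels)  # no grad through weights
    per_sample = F.cross_entropy(main_logits, labels, reduction="none")
    return (w * per_sample).mean()
```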
A Quantum-Inspired Multimodal Sentiment Analysis Framework
Multimodal sentiment analysis aims to capture the diversified sentiment information implied in data of different modalities (e.g., an image associated with a textual description or a set of textual labels). The key challenge is rooted in the “semantic gap” between low-level content features and high-level semantic information. Existing approaches generally utilize a combination of multimodal features in a somewhat heuristic way; however, how to effectively employ and combine information from different sources remains an important yet largely unsolved problem. To address it, in this paper we propose a Quantum-inspired Multimodal Sentiment Analysis (QMSA) framework. The framework consists of a Quantum-inspired Multimodal Representation (QMR) model, which aims to bridge the “semantic gap” and model the correlations between different modalities via density matrices, and a Multimodal decision Fusion strategy inspired by Quantum Interference (QIMF) in the double-slit experiment, in which the sentiment label is analogous to a photon and the data modalities are analogous to slits. Extensive experiments are conducted on two large-scale datasets collected from the Getty Images and Flickr photo-sharing platforms. The experimental results show that our approach significantly outperforms a wide range of baselines and state-of-the-art methods.
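For readers unfamiliar with the quantum analogy, the toy NumPy sketch below
shows the two ingredients named in the abstract: a density matrix over
modality "states", and an interference term in a fused score (the double-slit
analogy). It is purely illustrative; the real QMR/QIMF formulation is more
involved.

```python
# Toy sketch of a density matrix and a double-slit-style interference term.
import numpy as np

def density_matrix(states, probs):
    """rho = sum_i p_i |s_i><s_i| over unit-norm modality state vectors."""
    d = states.shape[1]
    rho = np.zeros((d, d))
    for s, p in zip(states, probs):
        s = s / np.linalg.norm(s)
        rho += p * np.outer(s, s)
    return rho                        # trace(rho) == 1 when probs sum to 1

def interference_score(a_text, a_image):
    """Double-slit-style fusion: |a1 + a2|^2 = p1 + p2 + cross term.
    Scores over all labels would need renormalising to be probabilities."""
    p1, p2 = abs(a_text) ** 2, abs(a_image) ** 2
    cross = 2 * (a_text * np.conj(a_image)).real
    return p1 + p2 + cross

states = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rho = density_matrix(states, probs=[0.5, 0.5])
print(np.trace(rho))                               # -> 1.0
print(interference_score(0.6 + 0.2j, 0.5 - 0.1j))  # fused score w/ cross term
```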