
    Temporal Cross-Media Retrieval with Soft-Smoothing

    Multimedia information has strong temporal correlations that shape the way modalities co-occur over time. In this paper we study the dynamic nature of multimedia and social-media information, where the temporal dimension emerges as a strong source of evidence for learning the temporal correlations across visual and textual modalities. So far, cross-media retrieval models have explored the correlations between different modalities (e.g. text and image) to learn a common subspace in which semantically similar instances lie in the same neighbourhood. Building on such knowledge, we propose a novel temporal cross-media neural architecture that departs from standard cross-media methods by explicitly accounting for the temporal dimension through temporal subspace learning. The model is softly constrained with temporal and inter-modality constraints that guide the new subspace learning task by favouring temporal correlations between semantically similar and temporally close instances. Experiments on three distinct datasets show that accounting for time is important for cross-media retrieval. Namely, the proposed method outperforms a set of baselines on the task of temporal cross-media retrieval, demonstrating its effectiveness for performing temporal subspace learning. Comment: To appear in ACM MM 201

    A Deep Multi-Level Attentive Network for Multimodal Sentiment Analysis

    Multimodal sentiment analysis has attracted increasing attention with broad application prospects. Existing methods focus on a single modality and therefore fail to capture social media content that spans multiple modalities. Moreover, in multimodal learning, most works have focused on simply combining the two modalities, without exploring the complicated correlations between them. This results in unsatisfactory performance for multimodal sentiment classification. Motivated by the status quo, we propose a Deep Multi-Level Attentive network, which exploits the correlation between image and text modalities to improve multimodal learning. Specifically, we generate the bi-attentive visual map along the spatial and channel dimensions to magnify the CNN's representation power. Then we model the correlation between the image regions and the semantics of the words by applying semantic attention to extract the textual features related to the bi-attentive visual features. Finally, self-attention is employed to automatically fetch the sentiment-rich multimodal features for the classification. We conduct extensive evaluations on four real-world datasets, namely, MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, which verify the superiority of our method. Comment: 11 pages, 7 figure
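
    The final self-attention step mentioned in this abstract can be illustrated with a generic scaled dot-product self-attention sketch. This is not the authors' network: it assumes the fused multimodal features are already available as plain vectors, it uses no learned projections, and the function names `softmax` and `self_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: the same vectors act as queries, keys, and values; a real
    model would apply learned projections first.
    """
    d = len(features[0])
    outputs = []
    for query in features:
        # Similarity of this vector to every vector, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in features
        ]
        weights = softmax(scores)
        # Each output is a convex combination of all input vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)]
        )
    return outputs
```

    Each output vector is a weighted average of all inputs, with larger weights on the inputs most similar to the query, which is the sense in which attention "fetches" the most relevant features.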

    General Debiasing for Multimodal Sentiment Analysis

    Existing work on Multimodal Sentiment Analysis (MSA) utilizes multimodal information for prediction yet unavoidably suffers from fitting the spurious correlations between multimodal features and sentiment labels. For example, if most videos with a blue background have positive labels in a dataset, the model will rely on such correlations for prediction, even though "blue background" is not a sentiment-related feature. To address this problem, we define a general debiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD) generalization ability of MSA models by reducing their reliance on spurious correlations. To this end, we propose a general debiasing framework based on Inverse Probability Weighting (IPW), which adaptively assigns small weights to the samples with larger bias (i.e., more severe spurious correlations). The key to this debiasing framework is to estimate the bias of each sample, which is achieved in two steps: 1) disentangling the robust features and biased features in each modality, and 2) utilizing the biased features to estimate the bias. Finally, we employ IPW to reduce the effects of large-biased samples, facilitating robust feature learning for sentiment prediction. To examine the model's generalization ability, we keep the original testing sets on two benchmarks and additionally construct multiple unimodal and multimodal OOD testing sets. The empirical results demonstrate the superior generalization ability of our proposed framework. We have released the code and data to facilitate the reproduction
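
    The weighting idea in this abstract can be sketched as follows. This is a minimal illustration, not the released code: it assumes each sample already has a bias score in (0, 1] (for example, the probability a biased-feature classifier assigns to the true label), and the names `ipw_weights` and `weighted_nll` are hypothetical.

```python
import math


def ipw_weights(bias_scores, eps=1e-6):
    """Inverse-probability weights: a larger bias score yields a smaller weight.

    Assumption: bias_scores lie in (0, 1]. Weights are rescaled to mean 1
    so the overall loss magnitude stays comparable to the unweighted case.
    """
    raw = [1.0 / max(b, eps) for b in bias_scores]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def weighted_nll(true_label_probs, weights, eps=1e-12):
    """Weighted average negative log-likelihood of the true labels."""
    total = sum(
        w * -math.log(max(p, eps)) for p, w in zip(true_label_probs, weights)
    )
    return total / sum(weights)
```

    With bias scores `[0.9, 0.5, 0.1]`, the heavily biased first sample receives the smallest weight and the least biased third sample the largest, so training emphasizes samples that cannot be solved through the spurious correlation.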