Multimodal Prediction based on Graph Representations
This paper proposes a learning model, based on rank-fusion graphs, for
general applicability in multimodal prediction tasks, such as multimodal
regression and image classification. Rank-fusion graphs encode information from
multiple descriptors and retrieval models, thus being able to capture
underlying relationships between modalities, samples, and the collection
itself. The solution is based on the encoding of multiple ranks for a query (or
test sample), defined according to different criteria, into a graph. Later, we
project the generated graph into an induced vector space, creating fusion
vectors, targeting broader generality and efficiency. A fusion vector estimator is then built to infer whether a multimodal input object belongs to a given class. Our method yields a fusion model that outperforms both early-fusion and late-fusion alternatives. Experiments on multiple multimodal and visual datasets, with several descriptors and retrieval models, demonstrate that our learning model is highly effective across prediction scenarios involving visual, textual, and multimodal features, and more effective than state-of-the-art methods.
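To make the pipeline concrete, here is a minimal sketch of the rank-fusion idea described above: several ranked lists for one query are fused into a weighted graph, which is then projected onto a fixed item vocabulary to form a fusion vector that any off-the-shelf estimator can consume. The function names, weighting scheme, and vocabulary handling are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
import numpy as np

def rank_fusion_graph(ranked_lists, k=10):
    """Fuse several top-k ranked lists (one per descriptor/retrieval model)
    into a weighted graph: nodes are retrieved items, and an edge between two
    items gains weight when they co-occur near the top of the same list."""
    edges = defaultdict(float)
    for ranking in ranked_lists:
        top = ranking[:k]
        for i, a in enumerate(top):
            for j, b in enumerate(top):
                if a < b:  # undirected edge, counted once per list
                    # reciprocal-rank style weight: earlier ranks contribute more
                    edges[(a, b)] += 1.0 / (1 + i) + 1.0 / (1 + j)
    return edges

def fusion_vector(edges, vocabulary):
    """Project the graph into the vector space induced by a fixed item
    vocabulary, so a standard estimator (SVM, logistic regression, ...) can
    consume it as a feature vector."""
    index = {item: p for p, item in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary))
    for (a, b), w in edges.items():
        if a in index:
            vec[index[a]] += w
        if b in index:
            vec[index[b]] += w
    return vec

# Toy usage: two retrieval models return item ids for the same query.
lists = [[3, 1, 7, 2], [1, 3, 2, 9]]
graph = rank_fusion_graph(lists, k=4)
vector = fusion_vector(graph, vocabulary=list(range(10)))
print(vector.nonzero()[0], vector.sum())
```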
Unsupervised Visual and Textual Information Fusion in Multimedia Retrieval - A Graph-based Point of View
Multimedia collections are more than ever growing in size and diversity.
Effective multimedia retrieval systems are thus critical to access these
datasets from the end-user perspective and in a scalable way. We are interested
in repositories of image/text multimedia objects and we study multimodal
information fusion techniques in the context of content-based multimedia information retrieval. We focus on graph-based methods, which have been shown to provide state-of-the-art performance. We particularly examine two such methods: cross-media similarities and random-walk-based scores. From a theoretical viewpoint, we propose a unifying graph-based framework that encompasses the two aforementioned approaches. Our proposal allows us to highlight the core features one should consider when using a graph-based technique for the combination of visual and textual information. We compare cross-media and random-walk-based results using three different real-world datasets. From a practical standpoint, our extended empirical analysis allows us to provide insights and guidelines on the use of graph-based methods for multimodal information fusion in content-based multimedia information retrieval.
Comment: An extended version of the paper "Visual and Textual Information Fusion in Multimedia Retrieval using Semantic Filtering and Graph based Methods", by J. Ah-Pine, G. Csurka and S. Clinchant, submitted to ACM Transactions on Information Systems.
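As a rough illustration of one of the graph-based scores discussed above, the sketch below runs a random walk with restart over a fused visual+textual similarity graph, a standard way to obtain graph-based multimodal relevance scores. The similarity matrices, the mixing weight alpha, and the damping factor are assumptions for the example, not the paper's exact formulation.

```python
import numpy as np

def random_walk_scores(S_visual, S_textual, restart, alpha=0.5, damping=0.85, iters=100):
    """Combine two similarity matrices, row-normalize the result into a
    transition matrix, and iterate a personalized-PageRank-style walk from the
    query's restart distribution."""
    S = alpha * S_visual + (1 - alpha) * S_textual
    P = S / S.sum(axis=1, keepdims=True)            # row-stochastic transitions
    scores = restart.copy()
    for _ in range(iters):
        scores = damping * scores @ P + (1 - damping) * restart
    return scores

n = 4
rng = np.random.default_rng(0)
S_vis, S_txt = rng.random((n, n)), rng.random((n, n))
restart = np.zeros(n)
restart[0] = 1.0                                    # node 0 plays the role of the query
print(random_walk_scores(S_vis, S_txt, restart))
```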
Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition
Enabling machines to understand human emotions in multimodal dialogue scenarios, a task known as multimodal emotion analysis in conversation (MM-ERC), has been a hot research topic. MM-ERC has received
consistent attention in recent years, where a diverse range of methods has been
proposed for securing better task performance. Most existing works treat MM-ERC
as a standard multimodal classification problem and perform multimodal feature
disentanglement and fusion for maximizing feature utility. Yet after revisiting
the characteristic of MM-ERC, we argue that both the feature multimodality and
conversational contextualization should be properly modeled simultaneously
during the feature disentanglement and fusion steps. In this work, we aim to further improve task performance by taking full account of these insights. On the one hand, during feature disentanglement, based on the
contrastive learning technique, we devise a Dual-level Disentanglement
Mechanism (DDM) to decouple the features into both the modality space and
utterance space. On the other hand, during the feature fusion stage, we propose
a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism
(CRM) for multimodal and context integration, respectively. They together
schedule the proper integrations of multimodal and context features.
Specifically, CFM explicitly manages the multimodal feature contributions
dynamically, while CRM flexibly coordinates the introduction of dialogue
contexts. On two public MM-ERC datasets, our system achieves new
state-of-the-art performance consistently. Further analyses demonstrate that
all our proposed mechanisms greatly facilitate the MM-ERC task by making full
use of the multimodal and context features adaptively. Our proposed mechanisms also have great potential to benefit a broader range of conversational multimodal tasks.
Comment: Accepted by ACM MM 202
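For intuition about the contribution-aware fusion step, here is a minimal sketch of one way such a mechanism can be realized: a learned gate scores each modality per utterance, and the fused representation is the gate-weighted sum of the modality features. The module name, dimensions, and gating form are assumptions for illustration, not the authors' CFM implementation.

```python
import torch
import torch.nn as nn

class ContributionAwareFusion(nn.Module):
    """Gate-weighted sum of per-utterance modality features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)   # one contribution score per modality

    def forward(self, modality_feats):
        # modality_feats: (batch, n_modalities, dim), e.g. text/audio/visual
        scores = self.gate(modality_feats).squeeze(-1)           # (batch, n_modalities)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (batch, n_modalities, 1)
        return (weights * modality_feats).sum(dim=1)             # (batch, dim)

feats = torch.randn(8, 3, 256)               # a batch of 8 utterances, 3 modalities
fused = ContributionAwareFusion(256)(feats)
print(fused.shape)                           # torch.Size([8, 256])
```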
A multimodal mixture-of-experts model for dynamic emotion prediction in movies
This paper addresses the problem of continuous emotion prediction in movies from multimodal cues. The rich emotion content in movies is inherently multimodal, where emotion is evoked through both audio (music, speech) and video modalities. To capture such affective information, we put forth a set of audio and video features that includes several novel features, such as Video Compressibility and Histogram of Facial Area (HFA). We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the audio and video modalities for predicting the emotion evoked in movies. A learning module, based on the hard Expectation-Maximization (EM) algorithm, is presented for the MoE model. Experiments on a database of popular movies demonstrate that our MoE-based fusion method outperforms popular fusion strategies (e.g., early and late fusion) in the context of dynamic emotion prediction.
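The following is a small sketch of the general hard-EM training loop for a mixture of experts on a regression task of this kind: each sample is assigned to the expert that currently predicts it best, and the experts are refit on their assigned samples. The linear experts, feature sizes, and toy data are assumptions for illustration; the paper's experts and gating are not reproduced here.

```python
import numpy as np

def fit_expert(X, y):
    # least-squares linear expert with a bias term
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def hard_em_moe(X, y, n_experts=2, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_experts, size=len(X))   # random initial hard assignment
    for _ in range(iters):
        experts = []
        for k in range(n_experts):
            mask = assign == k
            # M-step: refit each expert on the samples currently assigned to it
            experts.append(fit_expert(X[mask], y[mask]) if mask.any() else fit_expert(X, y))
        errors = np.stack([(predict(w, X) - y) ** 2 for w in experts], axis=1)
        assign = errors.argmin(axis=1)              # hard E-step: best expert wins each sample
    return experts, assign

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                       # toy audio+video feature vectors
y = np.where(X[:, 0] > 0, X @ rng.normal(size=6), -(X @ rng.normal(size=6)))
experts, assign = hard_em_moe(X, y)
print(len(experts), np.bincount(assign, minlength=2))
```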
Component attention network for multimodal dance improvisation recognition
Dance improvisation is an active research topic in the arts. Motion analysis
of improvised dance can be challenging due to its unique dynamics. Data-driven
dance motion analysis, including recognition and generation, is often limited
to skeletal data. However, data of other modalities, such as audio, can be
recorded and benefit downstream tasks. This paper explores the application and
performance of multimodal fusion methods for human motion recognition in the
context of dance improvisation. We propose an attention-based model, component
attention network (CANet), for multimodal fusion on three levels: 1) feature
fusion with CANet, 2) model fusion with CANet and graph convolutional network
(GCN), and 3) late fusion with a voting strategy. We conduct thorough
experiments to analyze the impact of each modality in different fusion methods
and distinguish critical temporal or component features. We show that our
proposed model outperforms the two baseline methods, demonstrating its
potential for analyzing improvisation in dance.
Comment: Accepted to the 25th ACM International Conference on Multimodal Interaction (ICMI 2023).
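As a concrete illustration of the third fusion level mentioned above (late fusion with a voting strategy), the sketch below combines per-modality class probabilities by majority vote and breaks ties with the highest average confidence. The function and the toy probabilities are assumptions, not the CANet code.

```python
import numpy as np

def late_fusion_vote(prob_per_modality):
    """prob_per_modality: list of (n_samples, n_classes) probability arrays,
    one per modality model (e.g. a skeleton-based and an audio-based model)."""
    votes = np.stack([p.argmax(axis=1) for p in prob_per_modality], axis=1)
    avg_conf = np.mean(prob_per_modality, axis=0)
    fused = []
    for i, row in enumerate(votes):
        counts = np.bincount(row, minlength=avg_conf.shape[1])
        winners = np.flatnonzero(counts == counts.max())
        # majority vote; break ties with the highest average confidence
        fused.append(winners[avg_conf[i, winners].argmax()])
    return np.array(fused)

rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(4), size=5) for _ in range(3)]   # 3 modalities, 4 classes
print(late_fusion_vote(probs))
```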
A Proposal for Processing and Fusioning Multiple Information Sources in Multimodal Dialog Systems
Proceedings of: PAAMS 2014 International Workshops. Agent-based Approaches for the Transportation Modelling and Optimisation (AATMO'14) & Intelligent Systems for Context-based Information Fusion (ISCIF'14). Salamanca, Spain, June 4-6, 2014.
Multimodal dialog systems can be defined as computer systems that process two or more user input modes and combine them with multimedia system output. This paper focuses on the multimodal input, providing a proposal to process and fuse the multiple input modalities in the dialog manager of the system, so that a single combined input is used to select the next system action. We describe an application of our technique to build multimodal systems that process the user's spoken utterances, tactile and keyboard inputs, and information related to the context of the interaction. In our proposal, this information is divided into external and internal context; the user's internal context is represented in our contribution by the detection of their intention during the dialog and their emotional state.
This work was supported in part by Projects MINECO TEC2012-37832-C02-01, CICYT TEC2011-28626-C02-02, CAM CONTEXTS (S2009/TIC-1485).
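To illustrate the kind of fusion a dialog manager can perform over such inputs, here is a hypothetical sketch in which each modality produces (slot, value, confidence) hypotheses and the most confident value per slot is kept, together with the interaction context. The data structures and slot names are assumptions for illustration, not the authors' fusion technique.

```python
def fuse_inputs(hypotheses, context):
    """hypotheses: list of (modality, slot, value, confidence) tuples produced
    by the speech, tactile and keyboard recognizers."""
    best = {}
    for modality, slot, value, conf in hypotheses:
        # keep the most confident value proposed for each slot
        if slot not in best or conf > best[slot][1]:
            best[slot] = (value, conf)
    combined = {slot: value for slot, (value, _) in best.items()}
    combined["context"] = context   # external + internal (intention, emotional state)
    return combined

hyps = [("speech", "destination", "Salamanca", 0.72),
        ("keyboard", "destination", "Salamanca", 0.95),
        ("tactile", "date", "2014-06-04", 0.88)]
print(fuse_inputs(hyps, {"intention": "book_trip", "emotion": "neutral"}))
```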
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction
We study the task of object interaction anticipation in egocentric videos.
Successful prediction of future actions and objects requires an understanding
of the spatio-temporal context formed by past actions and object relationships.
We propose TransFusion, a multimodal transformer-based architecture that
effectively makes use of the representational power of language by summarizing
past actions concisely. TransFusion leverages pre-trained image captioning
models and summarizes the caption, focusing on past actions and objects. This
action context together with a single input frame is processed by a multimodal
fusion module to forecast the next object interactions. Our model enables more
efficient end-to-end learning by replacing dense video features with language
representations, allowing us to benefit from knowledge encoded in large
pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 show the
effectiveness of our multimodal fusion model and the benefits of using
language-based context summaries. Our method outperforms state-of-the-art
approaches by 40.4% in overall mAP on the Ego4D test set. We show the
generality of TransFusion via experiments on EPIC-KITCHENS-100. Video and code
are available at: https://eth-ait.github.io/transfusion-proj/
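The sketch below shows the general pattern described above: an embedding of the language summary of past actions is fused with features of a single current frame to score possible next interactions. The stand-in dimensions, the simple MLP fusion head, and all names are assumptions; this is not the TransFusion architecture.

```python
import torch
import torch.nn as nn

class CaptionFrameFusion(nn.Module):
    """Fuse a text summary embedding with single-frame features and predict
    logits over possible next object interactions."""
    def __init__(self, text_dim=384, img_dim=512, hidden=256, n_classes=50):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + img_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),      # next-interaction logits
        )

    def forward(self, summary_emb, frame_emb):
        # summary_emb: embedding of the summarized captions of past actions
        # frame_emb:   features extracted from the single current frame
        return self.fuse(torch.cat([summary_emb, frame_emb], dim=-1))

model = CaptionFrameFusion()
logits = model(torch.randn(2, 384), torch.randn(2, 512))
print(logits.shape)     # torch.Size([2, 50])
```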
Multimodal biometrics score level fusion using non-confidence information
Multimodal biometrics refers to automatic authentication methods that depend on multiple modalities of measurable physical characteristics. It alleviates most of the restrictions of single biometrics. To combine multimodal biometric scores, three categories of fusion approaches are available: rule-based, classification-based, and density-based. When choosing an approach, one has to consider not only the fusion performance, but also system requirements and other circumstances. In the context of verification, classification errors arise from samples in the overlapping region (or non-confidence region) between genuine users and impostors. In score space, a further separation of the samples outside the non-confidence region does not yield further verification improvements. Therefore, information contained in the non-confidence region might be useful for improving the fusion process. To date, no attempts have been reported in the literature to enhance the fusion process using this additional information. In this work, the use of this information is explored in the rule-based and density-based approaches mentioned above.
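As a rough illustration of how the non-confidence region could enter a rule-based fusion scheme, the sketch below estimates each modality's overlap region from training scores and down-weights a modality whenever its score for a probe falls inside that region. The overlap estimate, the weighting rule, and the toy data are assumptions, not the method studied in the paper.

```python
import numpy as np

def overlap_region(genuine, impostor):
    # interval where genuine and impostor training scores overlap
    return max(impostor.min(), genuine.min()), min(impostor.max(), genuine.max())

def fuse_scores(scores, regions, low_weight=0.3):
    """scores: per-modality match scores for one probe; regions: per-modality
    overlap intervals obtained from overlap_region()."""
    weights = np.array([low_weight if lo <= s <= hi else 1.0
                        for s, (lo, hi) in zip(scores, regions)])
    return float(np.dot(weights, scores) / weights.sum())

rng = np.random.default_rng(0)
gen = rng.normal(0.7, 0.1, 500)                    # genuine scores, modality 1
imp = rng.normal(0.4, 0.1, 500)                    # impostor scores, modality 1
regions = [overlap_region(gen, imp),
           overlap_region(gen - 0.1, imp)]         # a second, weaker modality
print(fuse_scores(np.array([0.55, 0.80]), regions))
```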