Multimodal Prediction based on Graph Representations
This paper proposes a learning model, based on rank-fusion graphs, for
general applicability in multimodal prediction tasks, such as multimodal
regression and image classification. Rank-fusion graphs encode information from
multiple descriptors and retrieval models, thus being able to capture
underlying relationships between modalities, samples, and the collection
itself. The solution is based on the encoding of multiple ranks for a query (or
test sample), defined according to different criteria, into a graph. Later, we
project the generated graph into an induced vector space, creating fusion
vectors, targeting broader generality and efficiency. A fusion vector estimator is then built to infer whether a multimodal input object belongs to a given class. Our method yields a fusion model that outperforms both early-fusion and late-fusion alternatives. Experiments on multiple multimodal and visual datasets, with several descriptors and retrieval models, demonstrate that our learning model is highly effective across prediction scenarios involving visual, textual, and multimodal features, and more effective than state-of-the-art methods.
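To make the pipeline concrete, here is a minimal sketch of the rank-fusion idea described above: several ranked lists for one query are fused into a weighted graph, which is then projected onto a fixed item vocabulary to form a fusion vector that any off-the-shelf estimator can consume. The function names, weighting scheme, and vocabulary handling are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
import numpy as np

def rank_fusion_graph(ranked_lists, k=10):
    """Fuse several top-k ranked lists (one per descriptor/retrieval model)
    into a weighted graph: nodes are retrieved items, and an edge between two
    items gains weight when they co-occur near the top of the same list."""
    edges = defaultdict(float)
    for ranking in ranked_lists:
        top = ranking[:k]
        for i, a in enumerate(top):
            for j, b in enumerate(top):
                if a < b:  # undirected edge, counted once per list
                    # reciprocal-rank style weight: earlier ranks contribute more
                    edges[(a, b)] += 1.0 / (1 + i) + 1.0 / (1 + j)
    return edges

def fusion_vector(edges, vocabulary):
    """Project the graph into the vector space induced by a fixed item
    vocabulary, so a standard estimator (SVM, logistic regression, ...) can
    consume it as a feature vector."""
    index = {item: p for p, item in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary))
    for (a, b), w in edges.items():
        if a in index:
            vec[index[a]] += w
        if b in index:
            vec[index[b]] += w
    return vec

# Toy usage: two retrieval models return item ids for the same query.
lists = [[3, 1, 7, 2], [1, 3, 2, 9]]
graph = rank_fusion_graph(lists, k=4)
vector = fusion_vector(graph, vocabulary=list(range(10)))
print(vector.nonzero()[0], vector.sum())
```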
Unsupervised Visual and Textual Information Fusion in Multimedia Retrieval - A Graph-based Point of View
Multimedia collections are more than ever growing in size and diversity.
Effective multimedia retrieval systems are thus critical to access these
datasets from the end-user perspective and in a scalable way. We are interested
in repositories of image/text multimedia objects and we study multimodal
information fusion techniques in the context of content-based multimedia information retrieval. We focus on graph-based methods, which have been shown to provide state-of-the-art performance. We particularly examine two such methods: cross-media similarities and random-walk-based scores. From a theoretical viewpoint, we propose a unifying graph-based framework that encompasses the two aforementioned approaches. Our proposal allows us to highlight the core features one should consider when using a graph-based technique for the combination of visual and textual information. We compare cross-media and random-walk-based results using three different real-world datasets. From a practical standpoint, our extended empirical analysis allows us to provide insights and guidelines on the use of graph-based methods for multimodal information fusion in content-based multimedia information retrieval.
Comment: An extended version of the paper "Visual and Textual Information Fusion in Multimedia Retrieval using Semantic Filtering and Graph based Methods", by J. Ah-Pine, G. Csurka and S. Clinchant, submitted to ACM Transactions on Information Systems.
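As a rough illustration of one of the graph-based scores discussed above, the sketch below runs a random walk with restart over a fused visual+textual similarity graph, a standard way to obtain graph-based multimodal relevance scores. The similarity matrices, the mixing weight alpha, and the damping factor are assumptions for the example, not the paper's exact formulation.

```python
import numpy as np

def random_walk_scores(S_visual, S_textual, restart, alpha=0.5, damping=0.85, iters=100):
    """Combine two similarity matrices, row-normalize the result into a
    transition matrix, and iterate a personalized-PageRank-style walk from the
    query's restart distribution."""
    S = alpha * S_visual + (1 - alpha) * S_textual
    P = S / S.sum(axis=1, keepdims=True)            # row-stochastic transitions
    scores = restart.copy()
    for _ in range(iters):
        scores = damping * scores @ P + (1 - damping) * restart
    return scores

n = 4
rng = np.random.default_rng(0)
S_vis, S_txt = rng.random((n, n)), rng.random((n, n))
restart = np.zeros(n)
restart[0] = 1.0                                    # node 0 plays the role of the query
print(random_walk_scores(S_vis, S_txt, restart))
```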
Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition
Enabling machines to understand human emotions in multimodal dialogue scenarios, a task known as multimodal emotion analysis in conversation (MM-ERC), has been a hot research topic. MM-ERC has received
consistent attention in recent years, where a diverse range of methods has been
proposed for securing better task performance. Most existing works treat MM-ERC
as a standard multimodal classification problem and perform multimodal feature
disentanglement and fusion for maximizing feature utility. Yet after revisiting
the characteristic of MM-ERC, we argue that both the feature multimodality and
conversational contextualization should be properly modeled simultaneously
during the feature disentanglement and fusion steps. In this work, we aim to further improve task performance by taking full account of these insights. On the one hand, during feature disentanglement, based on the
contrastive learning technique, we devise a Dual-level Disentanglement
Mechanism (DDM) to decouple the features into both the modality space and
utterance space. On the other hand, during the feature fusion stage, we propose
a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism
(CRM) for multimodal and context integration, respectively. They together
schedule the proper integrations of multimodal and context features.
Specifically, CFM explicitly manages the multimodal feature contributions
dynamically, while CRM flexibly coordinates the introduction of dialogue
contexts. On two public MM-ERC datasets, our system achieves new
state-of-the-art performance consistently. Further analyses demonstrate that
all our proposed mechanisms greatly facilitate the MM-ERC task by making full
use of the multimodal and context features adaptively. Our proposed mechanisms also have great potential to benefit a broader range of conversational multimodal tasks.
Comment: Accepted by ACM MM 202
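For intuition about the contribution-aware fusion step, here is a minimal sketch of one way such a mechanism can be realized: a learned gate scores each modality per utterance, and the fused representation is the gate-weighted sum of the modality features. The module name, dimensions, and gating form are assumptions for illustration, not the authors' CFM implementation.

```python
import torch
import torch.nn as nn

class ContributionAwareFusion(nn.Module):
    """Gate-weighted sum of per-utterance modality features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)   # one contribution score per modality

    def forward(self, modality_feats):
        # modality_feats: (batch, n_modalities, dim), e.g. text/audio/visual
        scores = self.gate(modality_feats).squeeze(-1)           # (batch, n_modalities)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (batch, n_modalities, 1)
        return (weights * modality_feats).sum(dim=1)             # (batch, dim)

feats = torch.randn(8, 3, 256)               # a batch of 8 utterances, 3 modalities
fused = ContributionAwareFusion(256)(feats)
print(fused.shape)                           # torch.Size([8, 256])
```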
A multimodal mixture-of-experts model for dynamic emotion prediction in movies
This paper addresses the problem of continuous emotion prediction in movies from multimodal cues. The rich emotion content in movies is inherently multimodal, where emotion is evoked through both audio (music, speech) and video modalities. To capture such affective information, we put forth a set of audio and video features that includes several novel features, such as Video Compressibility and Histogram of Facial Area (HFA). We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the audio and video modalities for predicting the emotion evoked in movies. A learning module, based on the hard Expectation-Maximization (EM) algorithm, is presented for the MoE model. Experiments on a database of popular movies demonstrate that our MoE-based fusion method outperforms popular fusion strategies (e.g., early and late fusion) in the context of dynamic emotion prediction.
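The following is a small sketch of the general hard-EM training loop for a mixture of experts on a regression task of this kind: each sample is assigned to the expert that currently predicts it best, and the experts are refit on their assigned samples. The linear experts, feature sizes, and toy data are assumptions for illustration; the paper's experts and gating are not reproduced here.

```python
import numpy as np

def fit_expert(X, y):
    # least-squares linear expert with a bias term
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def hard_em_moe(X, y, n_experts=2, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_experts, size=len(X))   # random initial hard assignment
    for _ in range(iters):
        experts = []
        for k in range(n_experts):
            mask = assign == k
            # M-step: refit each expert on the samples currently assigned to it
            experts.append(fit_expert(X[mask], y[mask]) if mask.any() else fit_expert(X, y))
        errors = np.stack([(predict(w, X) - y) ** 2 for w in experts], axis=1)
        assign = errors.argmin(axis=1)              # hard E-step: best expert wins each sample
    return experts, assign

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                       # toy audio+video feature vectors
y = np.where(X[:, 0] > 0, X @ rng.normal(size=6), -(X @ rng.normal(size=6)))
experts, assign = hard_em_moe(X, y)
print(len(experts), np.bincount(assign, minlength=2))
```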
Component attention network for multimodal dance improvisation recognition
Dance improvisation is an active research topic in the arts. Motion analysis
of improvised dance can be challenging due to its unique dynamics. Data-driven
dance motion analysis, including recognition and generation, is often limited
to skeletal data. However, data of other modalities, such as audio, can be
recorded and benefit downstream tasks. This paper explores the application and
performance of multimodal fusion methods for human motion recognition in the
context of dance improvisation. We propose an attention-based model, component
attention network (CANet), for multimodal fusion on three levels: 1) feature
fusion with CANet, 2) model fusion with CANet and graph convolutional network
(GCN), and 3) late fusion with a voting strategy. We conduct thorough
experiments to analyze the impact of each modality in different fusion methods
and distinguish critical temporal or component features. We show that our
proposed model outperforms the two baseline methods, demonstrating its
potential for analyzing improvisation in dance.
Comment: Accepted to the 25th ACM International Conference on Multimodal Interaction (ICMI 2023).
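As a concrete illustration of the third fusion level mentioned above (late fusion with a voting strategy), the sketch below combines per-modality class probabilities by majority vote and breaks ties with the highest average confidence. The function and the toy probabilities are assumptions, not the CANet code.

```python
import numpy as np

def late_fusion_vote(prob_per_modality):
    """prob_per_modality: list of (n_samples, n_classes) probability arrays,
    one per modality model (e.g. a skeleton-based and an audio-based model)."""
    votes = np.stack([p.argmax(axis=1) for p in prob_per_modality], axis=1)
    avg_conf = np.mean(prob_per_modality, axis=0)
    fused = []
    for i, row in enumerate(votes):
        counts = np.bincount(row, minlength=avg_conf.shape[1])
        winners = np.flatnonzero(counts == counts.max())
        # majority vote; break ties with the highest average confidence
        fused.append(winners[avg_conf[i, winners].argmax()])
    return np.array(fused)

rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(4), size=5) for _ in range(3)]   # 3 modalities, 4 classes
print(late_fusion_vote(probs))
```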
A Proposal for Processing and Fusioning Multiple Information Sources in Multimodal Dialog Systems
Proceedings of: PAAMS 2014 International Workshops. Agent-based Approaches for the Transportation Modelling and Optimisation (AATMO'14) & Intelligent Systems for Context-based Information Fusion (ISCIF'14). Salamanca, Spain, June 4-6, 2014.
Multimodal dialog systems can be defined as computer systems that process two or more user input modes and combine them with multimedia system output. This paper focuses on the multimodal input, providing a proposal to process and fuse the multiple input modalities in the dialog manager of the system, so that a single combined input is used to select the next system action. We describe an application of our technique to build multimodal systems that process the user's spoken utterances, tactile and keyboard inputs, and information related to the context of the interaction. In our proposal, this information is divided into external and internal context; the user's internal context is represented in our contribution by the detection of their intention during the dialog and their emotional state.
This work was supported in part by Projects MINECO TEC2012-37832-C02-01, CICYT TEC2011-28626-C02-02, CAM CONTEXTS (S2009/TIC-1485).
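To illustrate the kind of fusion a dialog manager can perform over such inputs, here is a hypothetical sketch in which each modality produces (slot, value, confidence) hypotheses and the most confident value per slot is kept, together with the interaction context. The data structures and slot names are assumptions for illustration, not the authors' fusion technique.

```python
def fuse_inputs(hypotheses, context):
    """hypotheses: list of (modality, slot, value, confidence) tuples produced
    by the speech, tactile and keyboard recognizers."""
    best = {}
    for modality, slot, value, conf in hypotheses:
        # keep the most confident value proposed for each slot
        if slot not in best or conf > best[slot][1]:
            best[slot] = (value, conf)
    combined = {slot: value for slot, (value, _) in best.items()}
    combined["context"] = context   # external + internal (intention, emotional state)
    return combined

hyps = [("speech", "destination", "Salamanca", 0.72),
        ("keyboard", "destination", "Salamanca", 0.95),
        ("tactile", "date", "2014-06-04", 0.88)]
print(fuse_inputs(hyps, {"intention": "book_trip", "emotion": "neutral"}))
```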
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction
We study the task of object interaction anticipation in egocentric videos.
Successful prediction of future actions and objects requires an understanding
of the spatio-temporal context formed by past actions and object relationships.
We propose TransFusion, a multimodal transformer-based architecture that
effectively makes use of the representational power of language by summarizing
past actions concisely. TransFusion leverages pre-trained image captioning
models and summarizes the caption, focusing on past actions and objects. This
action context together with a single input frame is processed by a multimodal
fusion module to forecast the next object interactions. Our model enables more
efficient end-to-end learning by replacing dense video features with language
representations, allowing us to benefit from knowledge encoded in large
pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 show the
effectiveness of our multimodal fusion model and the benefits of using
language-based context summaries. Our method outperforms state-of-the-art
approaches by 40.4% in overall mAP on the Ego4D test set. We show the
generality of TransFusion via experiments on EPIC-KITCHENS-100. Video and code
are available at: https://eth-ait.github.io/transfusion-proj/
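The sketch below shows the general pattern described above: an embedding of the language summary of past actions is fused with features of a single current frame to score possible next interactions. The stand-in dimensions, the simple MLP fusion head, and all names are assumptions; this is not the TransFusion architecture.

```python
import torch
import torch.nn as nn

class CaptionFrameFusion(nn.Module):
    """Fuse a text summary embedding with single-frame features and predict
    logits over possible next object interactions."""
    def __init__(self, text_dim=384, img_dim=512, hidden=256, n_classes=50):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + img_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),      # next-interaction logits
        )

    def forward(self, summary_emb, frame_emb):
        # summary_emb: embedding of the summarized captions of past actions
        # frame_emb:   features extracted from the single current frame
        return self.fuse(torch.cat([summary_emb, frame_emb], dim=-1))

model = CaptionFrameFusion()
logits = model(torch.randn(2, 384), torch.randn(2, 512))
print(logits.shape)     # torch.Size([2, 50])
```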
Multimodal biometrics score level fusion using non-confidence information
Multimodal biometrics refers to automatic authentication methods that depend on multiple modalities of measurable physical characteristics. It alleviates most of the restrictions of single biometrics. To combine multimodal biometric scores, three categories of fusion approaches are available: rule-based, classification-based, and density-based. When choosing an approach, one has to consider not only the fusion performance, but also system requirements and other circumstances. In the context of verification, classification errors arise from samples in the overlapping region (or non-confidence region) between genuine users and impostors. In score space, a further separation of the samples outside the non-confidence region does not yield further verification improvements. Therefore, information contained in the non-confidence region might be useful for improving the fusion process. To date, no attempts have been reported in the literature to enhance the fusion process using this additional information. In this work, the use of this information is explored in the rule-based and density-based approaches mentioned above.
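As a rough illustration of how the non-confidence region could enter a rule-based fusion scheme, the sketch below estimates each modality's overlap region from training scores and down-weights a modality whenever its score for a probe falls inside that region. The overlap estimate, the weighting rule, and the toy data are assumptions, not the method studied in the paper.

```python
import numpy as np

def overlap_region(genuine, impostor):
    # interval where genuine and impostor training scores overlap
    return max(impostor.min(), genuine.min()), min(impostor.max(), genuine.max())

def fuse_scores(scores, regions, low_weight=0.3):
    """scores: per-modality match scores for one probe; regions: per-modality
    overlap intervals obtained from overlap_region()."""
    weights = np.array([low_weight if lo <= s <= hi else 1.0
                        for s, (lo, hi) in zip(scores, regions)])
    return float(np.dot(weights, scores) / weights.sum())

rng = np.random.default_rng(0)
gen = rng.normal(0.7, 0.1, 500)                    # genuine scores, modality 1
imp = rng.normal(0.4, 0.1, 500)                    # impostor scores, modality 1
regions = [overlap_region(gen, imp),
           overlap_region(gen - 0.1, imp)]         # a second, weaker modality
print(fuse_scores(np.array([0.55, 0.80]), regions))
```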