More than Vanilla Fusion: a Simple, Decoupling-free, Attention Module for Multimodal Fusion Based on Signal Theory
Vanilla fusion methods still dominate a large percentage of mainstream
audio-visual tasks. However, the effectiveness of vanilla fusion from a
theoretical perspective is still worth discussing. Thus, this paper reconsiders
the signal fused in the multimodal case from a bionics perspective and proposes
a simple, plug-and-play, attention module for vanilla fusion based on
fundamental signal theory and uncertainty theory. In addition, previous work on
multimodal dynamic gradient modulation still relies on decoupling the
modalities. So, a decoupling-free gradient modulation scheme has been designed
in conjunction with the aforementioned attention module, which has various
advantages over the decoupled one. Experimental results show that just a few
lines of code can improve the performance of several multimodal classification
methods by up to 2.0%. Finally, quantitative evaluation of other
fusion tasks reveals the potential for additional application scenarios.
TACOformer:Token-channel compounded Cross Attention for Multimodal Emotion Recognition
Recently, emotion recognition based on physiological signals has emerged as a
field with intensive research. The utilization of multi-modal, multi-channel
physiological signals has significantly improved the performance of emotion
recognition systems, due to their complementarity. However, effectively
integrating emotion-related semantic information from different modalities and
capturing inter-modal dependencies remains a challenging issue. Many existing
multimodal fusion methods ignore either token-to-token or channel-to-channel
correlations of multichannel signals from different modalities, which limits
the classification capability of the models to some extent. In this paper, we
propose a comprehensive perspective of multimodal fusion that integrates
channel-level and token-level cross-modal interactions. Specifically, we
introduce a unified cross attention module called Token-chAnnel COmpound (TACO)
Cross Attention to perform multimodal fusion, which simultaneously models
channel-level and token-level dependencies between modalities. Additionally, we
propose a 2D position encoding method to preserve information about the spatial
distribution of EEG signal channels, and we use two transformer encoders ahead
of the fusion module to capture long-term temporal dependencies from the EEG
signal and the peripheral physiological signal, respectively.
Subject-independent experiments on the emotion datasets DEAP and Dreamer
demonstrate that the proposed model achieves state-of-the-art performance.
Comment: Accepted by IJCAI 2023 - AI4TS workshop
Multimodal person recognition for human-vehicle interaction
Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, borne out in two case studies.
Multimodal music information processing and retrieval: survey and future challenges
Towards improving the performance in various music information processing
tasks, recent studies exploit different modalities able to capture diverse
aspects of music. Such modalities include audio recordings, symbolic music
scores, mid-level representations, motion and gestural data, video recordings,
editorial or cultural tags, lyrics, and album cover art. This paper critically
reviews the various approaches adopted in Music Information Processing and
Retrieval and highlights how multimodal algorithms can help Music Computing
applications. First, we categorize the related literature based on the
application they address. Subsequently, we analyze existing information fusion
approaches, and we conclude with the set of challenges that the Music Information
Retrieval and Sound and Music Computing research communities should focus on in
the coming years.