MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning
Learning multimodal representations involves integrating information from
multiple heterogeneous sources of data. In order to accelerate progress towards
understudied modalities and tasks while ensuring real-world robustness, we
release MultiZoo, a public toolkit consisting of standardized implementations
of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark
spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
Together, these provide an automated end-to-end machine learning pipeline that
simplifies and standardizes data loading, experimental setup, and model
evaluation. To enable holistic evaluation, we offer a comprehensive methodology
to assess (1) generalization, (2) time and space complexity, and (3) modality
robustness. MultiBench paves the way towards a better understanding of the
capabilities and limitations of multimodal models, while ensuring ease of use,
accessibility, and reproducibility. Our toolkits are publicly available, will
be regularly updated, and welcome inputs from the community.
Comment: JMLR Open Source Software 2023. Code available at https://github.com/pliang279/MultiBench
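To make the kind of end-to-end pipeline described above concrete, here is a minimal, hypothetical sketch of a multimodal late-fusion model with per-modality encoders, written in PyTorch. None of the names below come from MultiZoo/MultiBench's actual API; they are illustrative assumptions only.

```python
# Hypothetical sketch of the kind of pipeline a multimodal toolkit standardizes:
# encode each modality separately, fuse by concatenation, and predict.
# All class and variable names are illustrative, not MultiBench's API.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, modality_dims, hidden_dim, num_classes):
        super().__init__()
        # One small encoder per modality, then concatenation-based fusion.
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU()) for d in modality_dims]
        )
        self.head = nn.Linear(hidden_dim * len(modality_dims), num_classes)

    def forward(self, modalities):
        # modalities: list of tensors, one per modality, each of shape (batch, dim_i)
        encoded = [enc(x) for enc, x in zip(self.encoders, modalities)]
        return self.head(torch.cat(encoded, dim=-1))

# Toy usage with two synthetic modalities (e.g. text and audio feature vectors).
model = LateFusionClassifier(modality_dims=[300, 74], hidden_dim=64, num_classes=2)
text = torch.randn(8, 300)
audio = torch.randn(8, 74)
logits = model([text, audio])
print(logits.shape)  # torch.Size([8, 2])
```

Under the holistic evaluation the abstract describes, such a model would additionally be profiled for time and space complexity and tested under modality perturbations, not only for accuracy.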
Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences
Multimodal Sentiment Analysis (MSA) aims to mine sentiment information from
text, visual, and acoustic modalities. Previous works have focused on
representation learning and feature fusion strategies. However, most of these
efforts have ignored the disparity in semantic richness across modalities and
treated each modality in the same manner, which can lead to strong modalities
being neglected and weak ones being overvalued. Motivated by
these observations, we propose a Text-oriented Modality Reinforcement Network
(TMRN), which focuses on the dominance of the text modality in MSA. More
specifically, we design a Text-Centered Cross-modal Attention (TCCA) module to
enable full interaction between the text/acoustic and text/visual pairs, and a Text-Gated
Self-Attention (TGSA) module to guide the self-reinforcement of the other two
modalities. Furthermore, we present an adaptive fusion mechanism to decide the
proportion of different modalities involved in the fusion process. Finally, we
combine the feature matrices into vectors to get the final representation for
the downstream tasks. Experimental results show that our TMRN outperforms the
state-of-the-art methods on two MSA benchmarks.
Comment: Accepted by CICAI 2023 (Finalist of Best Student Paper Award)
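As a rough illustration of the text-centered cross-modal attention idea, the sketch below uses the text sequence as queries and another modality as keys and values. It is a hedged approximation assuming standard multi-head attention, not the paper's exact TCCA or TGSA modules.

```python
# Minimal sketch of text-centered cross-modal attention in the spirit of TCCA:
# text provides the queries, another modality provides keys/values.
# This is an illustrative approximation, not the paper's exact module.
import torch
import torch.nn as nn

class TextCenteredCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text, other):
        # text:  (batch, text_len, dim)   -> queries
        # other: (batch, other_len, dim)  -> keys and values (acoustic or visual)
        reinforced, _ = self.attn(query=text, key=other, value=other)
        return reinforced  # text representation reinforced by the other modality

# Toy usage: text reinforced separately by acoustic and visual streams.
dim = 128
tcca_a = TextCenteredCrossAttention(dim)
tcca_v = TextCenteredCrossAttention(dim)
text = torch.randn(4, 20, dim)
acoustic = torch.randn(4, 50, dim)
visual = torch.randn(4, 30, dim)
text_a = tcca_a(text, acoustic)
text_v = tcca_v(text, visual)
print(text_a.shape, text_v.shape)  # both (4, 20, 128)
```

The two reinforced text representations would then be combined by the adaptive fusion mechanism the abstract mentions; that mechanism is not sketched here.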
Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method
Feature alignment is the primary means of fusing multimodal data. We propose
a feature alignment method that fully fuses multimodal information by
alternately shifting and expanding feature information from different
modalities so that they share a consistent representation in a common feature
space. The proposed method can
robustly capture high-level interactions between features of different
modalities, thus significantly improving the performance of multimodal
learning. We also show that the proposed method outperforms other popular
multimodal schemes on multiple tasks. Experimental evaluation on the ETT and
MIT-BIH-Arrhythmia datasets shows that the proposed method achieves
state-of-the-art performance.
Comment: 8 pages, 7 figures
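The abstract does not detail the alternating shift-and-expand operations, so the sketch below only illustrates the generic prerequisite of such alignment: projecting two modalities into a shared feature space where they can be compared and fused. All layer choices and names are assumptions for illustration only.

```python
# Rough sketch of the generic idea behind multimodal feature alignment:
# project features from two modalities into a common space so they can be
# compared and fused. The specific "alternating shift and expand" operations
# are not described in the abstract, so the layers below are assumptions.
import torch
import torch.nn as nn

class SharedSpaceAligner(nn.Module):
    def __init__(self, dim_a, dim_b, shared_dim):
        super().__init__()
        # Each modality gets its own projection into the shared feature space.
        self.proj_a = nn.Sequential(nn.Linear(dim_a, shared_dim), nn.ReLU(),
                                    nn.Linear(shared_dim, shared_dim))
        self.proj_b = nn.Sequential(nn.Linear(dim_b, shared_dim), nn.ReLU(),
                                    nn.Linear(shared_dim, shared_dim))

    def forward(self, feat_a, feat_b):
        za, zb = self.proj_a(feat_a), self.proj_b(feat_b)
        # A simple alignment signal: cosine similarity in the shared space.
        align = nn.functional.cosine_similarity(za, zb, dim=-1)
        return za, zb, align

aligner = SharedSpaceAligner(dim_a=64, dim_b=32, shared_dim=48)
za, zb, align = aligner(torch.randn(16, 64), torch.randn(16, 32))
print(za.shape, zb.shape, align.shape)  # (16, 48) (16, 48) (16,)
```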
Neuro-Inspired Hierarchical Multimodal Learning
Integrating and processing information from various sources or modalities are
critical for obtaining a comprehensive and accurate perception of the real
world. Drawing inspiration from neuroscience, we develop the
Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the
concept of information bottleneck. Distinct from most traditional fusion models
that aim to incorporate all modalities as input, our model designates the prime
modality as input, while the remaining modalities act as detectors in the
information pathway. Our proposed perception model focuses on constructing an
effective and compact information flow by achieving a balance between the
minimization of mutual information between the latent state and the input modal
state, and the maximization of mutual information between the latent states and
the remaining modal states. This approach leads to compact latent state
representations that retain relevant information while minimizing redundancy,
thereby substantially enhancing the performance of downstream tasks.
Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate
that our model consistently distills crucial information in multimodal learning
scenarios, outperforming state-of-the-art benchmarks.
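The description corresponds to a layer-wise information-bottleneck objective. A hedged sketch of the form it suggests is given below; the notation and the exact hierarchical decomposition are assumptions for illustration, not taken from the paper.

```latex
% Illustrative hierarchical information-bottleneck objective.
% X_0: prime (input) modality; X_1, X_2: remaining "detector" modalities;
% Z_1, Z_2: successive latent states; beta_l > 0 trades compression for relevance.
\begin{align}
  \mathcal{L}_1 &= I(Z_1; X_0) - \beta_1\, I(Z_1; X_1), \\
  \mathcal{L}_2 &= I(Z_2; Z_1) - \beta_2\, I(Z_2; X_2).
\end{align}
```

Minimizing each loss compresses the latent state with respect to its input (low mutual information with the preceding state) while keeping it informative about the next detector modality, which yields the compact, redundancy-reduced information flow the abstract describes.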