MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning
Learning multimodal representations involves integrating information from
multiple heterogeneous sources of data. In order to accelerate progress towards
understudied modalities and tasks while ensuring real-world robustness, we
release MultiZoo, a public toolkit consisting of standardized implementations
of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark
spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
Together, these provide an automated end-to-end machine learning pipeline that
simplifies and standardizes data loading, experimental setup, and model
evaluation. To enable holistic evaluation, we offer a comprehensive methodology
to assess (1) generalization, (2) time and space complexity, and (3) modality
robustness. MultiBench paves the way towards a better understanding of the
capabilities and limitations of multimodal models, while ensuring ease of use,
accessibility, and reproducibility. Our toolkits are publicly available, will
be regularly updated, and welcome inputs from the community.
Comment: JMLR Open Source Software 2023. Code available at https://github.com/pliang279/MultiBench
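To make the kind of end-to-end pipeline described above concrete, here is a minimal, hypothetical sketch of a multimodal late-fusion model with per-modality encoders, written in PyTorch. None of the names below come from MultiZoo/MultiBench's actual API; they are illustrative assumptions only.

```python
# Hypothetical sketch of the kind of pipeline a multimodal toolkit standardizes:
# encode each modality separately, fuse by concatenation, and predict.
# All class and variable names are illustrative, not MultiBench's API.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, modality_dims, hidden_dim, num_classes):
        super().__init__()
        # One small encoder per modality, then concatenation-based fusion.
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU()) for d in modality_dims]
        )
        self.head = nn.Linear(hidden_dim * len(modality_dims), num_classes)

    def forward(self, modalities):
        # modalities: list of tensors, one per modality, each of shape (batch, dim_i)
        encoded = [enc(x) for enc, x in zip(self.encoders, modalities)]
        return self.head(torch.cat(encoded, dim=-1))

# Toy usage with two synthetic modalities (e.g. text and audio feature vectors).
model = LateFusionClassifier(modality_dims=[300, 74], hidden_dim=64, num_classes=2)
text = torch.randn(8, 300)
audio = torch.randn(8, 74)
logits = model([text, audio])
print(logits.shape)  # torch.Size([8, 2])
```

Under the holistic evaluation the abstract describes, such a model would additionally be profiled for time and space complexity and tested under modality perturbations, not only for accuracy.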
Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences
Multimodal Sentiment Analysis (MSA) aims to mine sentiment information from
text, visual, and acoustic modalities. Previous works have focused on
representation learning and feature fusion strategies. However, most of these
efforts have ignored the disparity in semantic richness across modalities and
treated each modality in the same manner, which can lead to strong modalities
being neglected and weak ones being overvalued. Motivated by
these observations, we propose a Text-oriented Modality Reinforcement Network
(TMRN), which focuses on the dominance of the text modality in MSA. More
specifically, we design a Text-Centered Cross-modal Attention (TCCA) module to
enable full interaction between the text/acoustic and text/visual pairs, and a Text-Gated
Self-Attention (TGSA) module to guide the self-reinforcement of the other two
modalities. Furthermore, we present an adaptive fusion mechanism to decide the
proportion of different modalities involved in the fusion process. Finally, we
combine the feature matrices into vectors to get the final representation for
the downstream tasks. Experimental results show that our TMRN outperforms the
state-of-the-art methods on two MSA benchmarks.
Comment: Accepted by CICAI 2023 (Finalist of Best Student Paper Award)
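As a rough illustration of the text-centered cross-modal attention idea, the sketch below uses the text sequence as queries and another modality as keys and values. It is a hedged approximation assuming standard multi-head attention, not the paper's exact TCCA or TGSA modules.

```python
# Minimal sketch of text-centered cross-modal attention in the spirit of TCCA:
# text provides the queries, another modality provides keys/values.
# This is an illustrative approximation, not the paper's exact module.
import torch
import torch.nn as nn

class TextCenteredCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text, other):
        # text:  (batch, text_len, dim)   -> queries
        # other: (batch, other_len, dim)  -> keys and values (acoustic or visual)
        reinforced, _ = self.attn(query=text, key=other, value=other)
        return reinforced  # text representation reinforced by the other modality

# Toy usage: text reinforced separately by acoustic and visual streams.
dim = 128
tcca_a = TextCenteredCrossAttention(dim)
tcca_v = TextCenteredCrossAttention(dim)
text = torch.randn(4, 20, dim)
acoustic = torch.randn(4, 50, dim)
visual = torch.randn(4, 30, dim)
text_a = tcca_a(text, acoustic)
text_v = tcca_v(text, visual)
print(text_a.shape, text_v.shape)  # both (4, 20, 128)
```

The two reinforced text representations would then be combined by the adaptive fusion mechanism the abstract mentions; that mechanism is not sketched here.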
Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method
Feature alignment is the primary means of fusing multimodal data. We propose
a feature alignment method that fully fuses multimodal information by
alternately shifting and expanding feature information from different
modalities so that they share a consistent representation in a common feature
space. The proposed method can
robustly capture high-level interactions between features of different
modalities, thus significantly improving the performance of multimodal
learning. We also show that the proposed method outperforms other popular
multimodal schemes on multiple tasks. Experimental evaluation on the ETT and
MIT-BIH-Arrhythmia datasets shows that the proposed method achieves
state-of-the-art performance.
Comment: 8 pages, 7 figures
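The abstract does not detail the alternating shift-and-expand operations, so the sketch below only illustrates the generic prerequisite of such alignment: projecting two modalities into a shared feature space where they can be compared and fused. All layer choices and names are assumptions for illustration only.

```python
# Rough sketch of the generic idea behind multimodal feature alignment:
# project features from two modalities into a common space so they can be
# compared and fused. The specific "alternating shift and expand" operations
# are not described in the abstract, so the layers below are assumptions.
import torch
import torch.nn as nn

class SharedSpaceAligner(nn.Module):
    def __init__(self, dim_a, dim_b, shared_dim):
        super().__init__()
        # Each modality gets its own projection into the shared feature space.
        self.proj_a = nn.Sequential(nn.Linear(dim_a, shared_dim), nn.ReLU(),
                                    nn.Linear(shared_dim, shared_dim))
        self.proj_b = nn.Sequential(nn.Linear(dim_b, shared_dim), nn.ReLU(),
                                    nn.Linear(shared_dim, shared_dim))

    def forward(self, feat_a, feat_b):
        za, zb = self.proj_a(feat_a), self.proj_b(feat_b)
        # A simple alignment signal: cosine similarity in the shared space.
        align = nn.functional.cosine_similarity(za, zb, dim=-1)
        return za, zb, align

aligner = SharedSpaceAligner(dim_a=64, dim_b=32, shared_dim=48)
za, zb, align = aligner(torch.randn(16, 64), torch.randn(16, 32))
print(za.shape, zb.shape, align.shape)  # (16, 48) (16, 48) (16,)
```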
Neuro-Inspired Hierarchical Multimodal Learning
Integrating and processing information from various sources or modalities are
critical for obtaining a comprehensive and accurate perception of the real
world. Drawing inspiration from neuroscience, we develop the
Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the
concept of information bottleneck. Distinct from most traditional fusion models
that aim to incorporate all modalities as input, our model designates the prime
modality as input, while the remaining modalities act as detectors in the
information pathway. Our proposed perception model focuses on constructing an
effective and compact information flow by achieving a balance between the
minimization of mutual information between the latent state and the input modal
state, and the maximization of mutual information between the latent states and
the remaining modal states. This approach leads to compact latent state
representations that retain relevant information while minimizing redundancy,
thereby substantially enhancing the performance of downstream tasks.
Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate
that our model consistently distills crucial information in multimodal learning
scenarios, outperforming state-of-the-art benchmarks.
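The description corresponds to a layer-wise information-bottleneck objective. A hedged sketch of the form it suggests is given below; the notation and the exact hierarchical decomposition are assumptions for illustration, not taken from the paper.

```latex
% Illustrative hierarchical information-bottleneck objective.
% X_0: prime (input) modality; X_1, X_2: remaining "detector" modalities;
% Z_1, Z_2: successive latent states; beta_l > 0 trades compression for relevance.
\begin{align}
  \mathcal{L}_1 &= I(Z_1; X_0) - \beta_1\, I(Z_1; X_1), \\
  \mathcal{L}_2 &= I(Z_2; Z_1) - \beta_2\, I(Z_2; X_2).
\end{align}
```

Minimizing each loss compresses the latent state with respect to its input (low mutual information with the preceding state) while keeping it informative about the next detector modality, which yields the compact, redundancy-reduced information flow the abstract describes.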