Integrating and processing information from various sources or modalities is
critical for obtaining a comprehensive and accurate perception of the real
world. Drawing inspiration from neuroscience, we develop the
Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the
concept of the information bottleneck. Unlike most traditional fusion models,
which take all modalities as input, our model designates the prime
modality as input, while the remaining modalities act as detectors in the
information pathway. The model constructs an effective and compact information
flow by balancing two objectives: minimizing the mutual information between the
latent state and the input modal state, and maximizing the mutual information
between the latent states and the remaining modal states. This approach leads
to compact latent state
representations that retain relevant information while minimizing redundancy,
thereby substantially enhancing the performance of downstream tasks.
Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate
that our model consistently distills crucial information in multimodal learning
scenarios, outperforming state-of-the-art benchmarks.
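
In its simplest single-level form, the trade-off between compressing the input and retaining information about the other modalities can be written as an information-bottleneck Lagrangian. This is a sketch, not necessarily the model's exact objective. The symbols are notation assumed here for illustration: X_0 is the prime modality, X_1 is one remaining modality, B is the latent state, I(.;.) is mutual information, and beta > 0 is a trade-off weight.

```latex
% Single-level information-bottleneck objective (illustrative notation):
% compress X_0 into B while keeping B informative about X_1.
\begin{equation*}
  \min_{p(B \mid X_0)} \; \mathcal{L}
  \;=\; I(X_0; B) \;-\; \beta \, I(B; X_1),
  \qquad \beta > 0 .
\end{equation*}
```

In this form, a larger beta favors keeping information relevant to X_1, while a smaller beta favors stronger compression of X_0.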