Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation
Multimodal learning seeks to utilize data from multiple sources to improve
the overall performance of downstream tasks. It is desirable for redundancies
in the data to make multimodal systems robust to missing or corrupted
observations in some correlated modalities. However, we observe that the
performance of several existing multimodal networks significantly deteriorates
if one or multiple modalities are absent at test time. To enable robustness to
missing modalities, we propose simple and parameter-efficient adaptation
procedures for pretrained multimodal networks. In particular, we exploit
low-rank adaptation and modulation of intermediate features to compensate for
the missing modalities. We demonstrate that such adaptation can partially
bridge the performance drop due to missing modalities and, in some cases,
outperform independent, dedicated networks trained for the available modality
combinations. The proposed adaptation requires an extremely small number of
parameters (e.g., fewer than 0.7% of the total parameters in most experiments). We conduct
a series of experiments to highlight the robustness of our proposed method
using diverse datasets for RGB-thermal and RGB-Depth semantic segmentation,
multimodal material segmentation, and multimodal sentiment analysis tasks. Our
proposed method demonstrates versatility across various tasks and datasets, and
outperforms existing methods for robust multimodal learning with missing
modalities.
Comment: 18 pages, 3 figures, 11 tables
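The low-rank adaptation and feature modulation mentioned in the abstract can be illustrated with a minimal sketch. The names, shapes, and initializations below are assumptions for illustration only, not the paper's actual implementation: a frozen pretrained weight W receives a trainable low-rank update A @ B, and the layer output is modulated with a trainable scale and shift.

```python
import numpy as np

# Hypothetical sketch (not the paper's code): a frozen linear layer
# adapted via a low-rank update plus feature modulation.
rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4   # rank << d keeps trainable params small

W = rng.standard_normal((d_out, d_in))          # frozen pretrained weight
A = rng.standard_normal((d_out, rank)) * 0.01   # trainable low-rank factor
B = rng.standard_normal((rank, d_in)) * 0.01    # trainable low-rank factor
scale = np.ones(d_out)                          # trainable modulation scale
shift = np.zeros(d_out)                         # trainable modulation shift

def adapted_layer(x):
    """Forward pass: low-rank-adapted projection, then modulation."""
    h = x @ (W + A @ B).T
    return scale * h + shift

x = rng.standard_normal((2, d_in))
y = adapted_layer(x)
print(y.shape)  # (2, 64)

# Only A, B, scale, and shift would be trained; W stays frozen.
trainable = A.size + B.size + scale.size + shift.size
print(f"trainable fraction: {trainable / (W.size + trainable):.3%}")
```

Because only the low-rank factors and the modulation vectors are trained, the trainable-parameter count scales with the rank rather than the full weight size, which is the source of the parameter efficiency the abstract claims.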
MMSFormer: Multimodal Transformer for Material and Semantic Segmentation
Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. Starting with only one input modality, performance improves progressively as additional modalities are incorporated, showcasing the effectiveness of the fusion block in combining useful information from diverse input modalities. Ablation studies show that the different modules in the fusion block are crucial for overall model performance, and also highlight the capacity of different input modalities to improve the identification of different types of materials.
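A fusion block that accepts varying modality combinations can be sketched as follows. This is a simplified assumption for illustration, not MMSFormer's actual architecture: each available modality's features are projected to a shared dimension, summed, and mixed with a nonlinearity, so the same block works whether one or several modalities are present.

```python
import numpy as np

# Hypothetical sketch (not MMSFormer's code): fuse whichever modalities
# are available by projecting each to a shared space and mixing.
rng = np.random.default_rng(0)
d_feat, d_shared = 32, 16

def make_proj():
    return rng.standard_normal((d_shared, d_feat)) * 0.1

projections = {"rgb": make_proj(), "depth": make_proj(), "thermal": make_proj()}
W_mix = rng.standard_normal((d_shared, d_shared)) * 0.1

def fuse(features):
    """Fuse any subset of modalities present in `features` (name -> array)."""
    shared = sum(features[m] @ projections[m].T for m in features)
    return np.maximum(shared @ W_mix.T, 0.0)  # simple ReLU mixing

rgb = rng.standard_normal((5, d_feat))
depth = rng.standard_normal((5, d_feat))

# The same block handles one modality or several.
out_one = fuse({"rgb": rgb})
out_two = fuse({"rgb": rgb, "depth": depth})
print(out_one.shape, out_two.shape)  # (5, 16) (5, 16)
```

The design choice worth noting is that fusion by summation over per-modality projections makes the block's output shape independent of how many modalities are supplied, which matches the abstract's progressive single-to-multi-modality evaluation.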