Understanding human perceptions presents a formidable multimodal challenge
for computers, encompassing aspects such as sentiment tendencies and sense of
humor. While various methods have recently been introduced to extract
modality-invariant and modality-specific information from diverse modalities, with the
goal of enhancing the efficacy of multimodal learning, few works emphasize this
aspect in large language models. In this paper, we introduce a novel multimodal
prompt strategy tailored for tuning large language models. Our method assesses
the correlation among different modalities and isolates the modality-invariant
and modality-specific components, which are then used for prompt tuning. This
approach enables large language models to efficiently and effectively
assimilate information from various modalities. Furthermore, our strategy is
designed with scalability in mind, allowing the integration of features from
any modality into pretrained large language models. Experimental results on
public datasets demonstrate that our proposed method significantly improves
performance compared to previous methods.