Multimodal Sentiment Analysis leverages multimodal signals to detect the
sentiment of a speaker. Previous approaches focus on multimodal fusion and
representation learning based on general knowledge obtained from pretrained
models, neglecting the effect of domain-specific knowledge.
knowledge. In this paper, we propose Contrastive Knowledge Injection (ConKI)
for multimodal sentiment analysis, where specific-knowledge representations for
each modality are learned together with general-knowledge representations via
knowledge injection based on an adapter architecture. In addition, ConKI uses a
hierarchical contrastive learning procedure performed at three levels: between
knowledge types within each modality, across modalities within each sample, and
across samples. This facilitates effective learning of the proposed
representations and hence improves multimodal sentiment predictions.
Experiments on three popular multimodal sentiment analysis benchmarks show that
ConKI outperforms all prior methods on a variety of performance metrics.

Comment: Accepted by ACL Findings 202
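The abstract does not spell out the architecture or losses, but the following PyTorch sketch illustrates the general idea of adapter-based knowledge injection and one level of a hierarchical contrastive objective. The KnowledgeAdapter module, the InfoNCE-style loss, the bottleneck size, and all tensor shapes are illustrative assumptions rather than the paper's actual design; the remaining levels (across modalities within a sample and across samples) would reuse the same loss with different query/key pairings.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class KnowledgeAdapter(nn.Module):
        """Bottleneck adapter that injects a specific-knowledge representation
        on top of a (frozen) pretrained encoder's general representation."""

        def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)

        def forward(self, general_repr: torch.Tensor) -> torch.Tensor:
            # Residual bottleneck transform; the result serves as the
            # specific-knowledge representation for this modality.
            return general_repr + self.up(F.relu(self.down(general_repr)))


    def contrastive_loss(queries: torch.Tensor, keys: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
        """InfoNCE-style loss: the i-th query's positive is the i-th key;
        all other keys in the batch act as negatives."""
        q = F.normalize(queries, dim=-1)
        k = F.normalize(keys, dim=-1)
        logits = q @ k.t() / temperature       # (B, B); diagonal entries are positives
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)


    if __name__ == "__main__":
        batch, dim = 8, 768                    # e.g., pooled text-encoder features
        general_text = torch.randn(batch, dim) # general-knowledge representation
        adapter = KnowledgeAdapter(dim)
        specific_text = adapter(general_text)  # specific-knowledge representation
        # One level of the hierarchy (within-modality, between knowledge types):
        # align the two knowledge types of the same sample, contrast against others.
        loss = contrastive_loss(specific_text, general_text)
        loss.backward()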