Multimodal information extraction is attracting increasing research attention,
as it requires aggregating representations from different modalities. In this
paper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM)
method for this task, which contains two modules. Firstly, the intra-sample
relationship modeling module operates on a single sample and aims to learn
effective representations. Embeddings from textual and visual modalities are
shifted to bridge the modality gap caused by distinct pre-trained language and
image models. Secondly, the inter-sample relationship modeling module considers
relationships among multiple samples and focuses on capturing cross-sample interactions.
An AttnMixup strategy is proposed, which not only enables collaboration among
samples but also augments data to improve generalization. We conduct extensive
experiments on the multimodal named entity recognition datasets Twitter-2015
and Twitter-2017, and the multimodal relation extraction dataset MNRE. Our
proposed method I2SRM achieves competitive results: 77.12% F1-score on
Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.
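To make the AttnMixup idea more concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: each sample in a batch is paired with its most-attended partner, and both representations and labels are linearly interpolated, Mixup-style. The function name attn_mixup, the Beta parameter alpha, and the one-hot labels_onehot argument are assumptions introduced here purely for illustration.

```python
import torch

def attn_mixup(reps, labels_onehot, alpha=0.2):
    """Sketch of attention-guided mixup over a batch of fused sample representations.

    reps:          (B, D) multimodal sample representations
    labels_onehot: (B, C) one-hot label distributions
    """
    # Scaled dot-product attention scores between samples in the batch (B, B).
    scores = torch.matmul(reps, reps.t()) / reps.size(-1) ** 0.5
    scores.fill_diagonal_(float("-inf"))   # never mix a sample with itself
    partner = scores.argmax(dim=-1)        # most-attended partner for each sample

    # Mixup coefficient drawn from a Beta distribution, one per sample.
    lam = torch.distributions.Beta(alpha, alpha).sample((reps.size(0), 1)).to(reps.device)
    mixed_reps = lam * reps + (1.0 - lam) * reps[partner]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[partner]
    return mixed_reps, mixed_labels
```

Under these assumptions, the mixed representations and soft labels would simply replace (or be appended to) the original batch before the task loss is computed, which is how such a strategy could both couple samples and act as data augmentation.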