How can we better extract entities and relations from text? Using multimodal
extraction with images and text obtains more signals for entities and
relations, and aligns them through graphs or hierarchical fusion, aiding in
extraction. Despite attempts at various fusions, previous works have overlooked
many unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes
innovative pre-training objectives for entity-object and relation-image
alignment, extracting objects from images and aligning them with entity and
relation prompts for soft pseudo-labels. These labels are used as
self-supervised signals for pre-training, enhancing the ability to extract
entities and relations. Experiments on three datasets show an average 3.41% F1
improvement over prior SOTA. Additionally, our method is orthogonal to previous
multimodal fusions, and using it on prior SOTA fusions further improves 5.47%
F1.Comment: Accepted to ACM Multimedia 202