Vision language pre-training aims to learn alignments between vision and
language from a large amount of data. We proposed multi-grained vision language
pre-training, a unified approach which can learn vision language alignments at
multiple granularities. This paper advances the proposed method by unifying image
and video encoding in one model and scaling up the model with large-scale data.
We present X2-VLM, a pre-trained VLM with a modular architecture for both
image-text tasks and video-text tasks. Experimental results show that X2-VLM
performs the best at both base and large scale for image-text and video-text
tasks, striking a good trade-off between performance and model scale. Moreover,
we show that the modular design of X2-VLM results in high transferability,
allowing it to be utilized in any language or domain. For example, by simply
replacing the text encoder with XLM-R, X2-VLM outperforms state-of-the-art
multilingual multi-modal pre-trained models without any multilingual
pre-training. The code and pre-trained models will be available at
github.com/zengyan-97/X2-VLM.
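
To illustrate the modular design described above, the following is a minimal, hypothetical sketch (not the released X2-VLM code) of a vision-language model whose text encoder is a swappable component, so that an English text encoder such as BERT can be replaced with XLM-R for multilingual transfer. The class names and the toy vision and fusion modules are placeholders introduced only for illustration.

```python
import torch.nn as nn
from transformers import BertModel, XLMRobertaModel


class StubVisionEncoder(nn.Module):
    """Toy patch-embedding stand-in for the real vision transformer."""
    def __init__(self, dim=768, patch=32):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, pixel_values):
        x = self.proj(pixel_values)            # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)


class StubFusion(nn.Module):
    """Toy cross-attention fusion stand-in for the real fusion module."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=heads, batch_first=True)

    def forward(self, vision_feats, text_feats):
        fused, _ = self.attn(text_feats, vision_feats, vision_feats)
        return fused


class ModularVLM(nn.Module):
    """Hypothetical modular VLM: vision encoder, text encoder, and
    cross-modal fusion are independent, interchangeable modules."""
    def __init__(self, vision_encoder, text_encoder, fusion):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder
        self.fusion = fusion

    def forward(self, pixel_values, input_ids, attention_mask):
        vision_feats = self.vision_encoder(pixel_values)
        text_feats = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.fusion(vision_feats, text_feats)


# English setup: a BERT-style text encoder.
english_model = ModularVLM(
    StubVisionEncoder(), BertModel.from_pretrained("bert-base-uncased"), StubFusion()
)

# Multilingual transfer: swap in XLM-R without any multilingual pre-training.
multilingual_model = ModularVLM(
    StubVisionEncoder(), XLMRobertaModel.from_pretrained("xlm-roberta-base"), StubFusion()
)
```

In this sketch both text encoders produce 768-dimensional hidden states, so the vision and fusion modules can be reused unchanged when the text encoder is swapped, which is the property the multilingual transfer claim relies on.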