Pre-trained Vision-Language Foundation Models (VLMs) trained on extensive image-text
paired data have demonstrated unprecedented image-text association
capabilities, achieving remarkable results across various downstream tasks. A
critical challenge is how to leverage existing large-scale pre-trained VLMs,
which are trained on common objects, for domain-specific transfer to
domain-related downstream tasks. In this paper, we propose a
new framework that includes the Domain Foundation Model (DFM), bridging the gap
between the General Foundation Model (GFM) and domain-specific downstream
tasks. Moreover, we present an image-text paired dataset in the field of remote
sensing (RS), RS5M, which has 5 million RS images with English descriptions.
The dataset is obtained by filtering publicly available image-text paired
datasets and by captioning label-only RS datasets with a pre-trained VLM,
yielding the first large-scale RS image-text paired dataset.
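As a rough illustration of this captioning-plus-filtering step (the captioning model, CLIP checkpoint, image path, and similarity threshold below are assumptions for the sketch, not necessarily the exact pipeline used for RS5M):

import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

# Pre-trained captioner (assumed: BLIP) and a CLIP model for similarity filtering.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("rs_tile.png").convert("RGB")  # hypothetical label-only RS image

# 1) Generate a candidate English caption for the label-only RS image.
ids = blip.generate(**blip_proc(images=image, return_tensors="pt"), max_new_tokens=30)
caption = blip_proc.decode(ids[0], skip_special_tokens=True)

# 2) Keep the pair only if CLIP image-text cosine similarity is high enough.
with torch.no_grad():
    img_emb = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    txt_emb = clip.get_text_features(**clip_proc(text=[caption], return_tensors="pt", padding=True))
similarity = torch.cosine_similarity(img_emb, txt_emb).item()
if similarity > 0.25:  # assumed threshold
    print(caption, similarity)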
Additionally, we applied several Parameter-Efficient Fine-Tuning (PEFT) methods
on RS5M to implement the DFM.
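A minimal sketch of one such PEFT option, assuming LoRA adapters (via the peft library) on a general CLIP checkpoint trained with the standard contrastive loss on RS5M batches; the base model, target modules, and hyperparameters are illustrative assumptions rather than the paper's exact setup:

import torch
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel, CLIPProcessor

# Start from a general-domain CLIP (the GFM) and add low-rank adapters.
base = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
                      target_modules=["q_proj", "v_proj"])  # attention projections
model = get_peft_model(base, lora_cfg)  # only the LoRA weights stay trainable
model.print_trainable_parameters()

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

def training_step(images, texts):
    # One contrastive step on an RS5M batch: images is a list of PIL images,
    # texts the matching captions; return_loss=True computes CLIP's symmetric
    # cross-entropy over the image-text similarity logits.
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    loss = model(**batch, return_loss=True).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()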
Experimental results show that our proposed dataset is highly effective for
various tasks, improving upon the baseline by 8%∼16% in zero-shot
classification, and performing well in both Vision-Language Retrieval and
Semantic Localization tasks.
\url{https://github.com/om-ai-lab/RS5M}