RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

Abstract

Pre-trained Vision-Language Foundation Models (VLMs) trained on extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to leverage existing large-scale pre-trained VLMs, which are trained on common objects, for domain-specific transfer to accomplish domain-related downstream tasks. In this paper, we propose a new framework that includes a Domain Foundation Model (DFM), bridging the gap between the General Foundation Model (GFM) and domain-specific downstream tasks. Moreover, we present RS5M, an image-text paired dataset in the field of remote sensing (RS) containing 5 million RS images with English descriptions. The dataset is built by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM, and it constitutes the first large-scale RS image-text paired dataset. In addition, we tried several Parameter-Efficient Fine-Tuning (PEFT) methods on RS5M to implement the DFM. Experimental results show that our proposed dataset is highly effective for various tasks, improving upon the baseline by 8%∼16% in zero-shot classification, and obtaining good results in both vision-language retrieval and semantic localization tasks. Dataset and code: \url{https://github.com/om-ai-lab/RS5M}
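As an illustration of the zero-shot classification setting evaluated above, the sketch below runs zero-shot RS scene classification with a CLIP-style model fine-tuned on RS5M, using the open_clip library. It is a minimal sketch, not the released evaluation code: the checkpoint path, class list, prompt template, and image file are assumed placeholders.

```python
# Minimal sketch: zero-shot remote-sensing scene classification with a
# CLIP-style model fine-tuned on RS5M. The checkpoint path, class names,
# prompt template, and image file are illustrative placeholders.
import torch
import open_clip
from PIL import Image

# Load a ViT-B/32 CLIP backbone and initialize it from a local
# (hypothetical) RS5M-fine-tuned checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="path/to/rs5m_finetuned_checkpoint.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate RS scene categories and simple text prompts for each class.
classes = ["airport", "farmland", "forest", "harbor", "residential area"]
prompts = [f"a satellite image of a {c}" for c in classes]

image = preprocess(Image.open("scene.jpg")).unsqueeze(0)
text = tokenizer(prompts)

with torch.no_grad():
    # Encode image and prompts, normalize, and score by cosine similarity.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# The class with the highest image-text similarity is the zero-shot prediction.
print(classes[probs.argmax().item()])
```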
