In this work, we investigate extending the comprehension of Multi-modal Large
Language Models (MLLMs) to regional objects. To this end, we propose to extract
features corresponding to regional objects as soft prompts for the LLM, which
provides a straightforward and scalable approach and eliminates the need for
fine-tuning the LLM. To effectively extract regional features from regular image
features and irregular point cloud features, we present a novel and unified
position-assisted feature extraction module. Furthermore, since training an MLLM
from scratch is highly time-consuming, we propose to incrementally extend
existing pre-trained MLLMs to new modalities and to the regional objects of
those modalities. Specifically, we freeze the Q-Former from BLIP-2, an
impressive MLLM, and optimize modality-specific LoRA parameters in the Q-Former
and the LLM for each newly introduced modality. Freezing the Q-Former eliminates
the need for extensive pre-training on massive image-text data, and the frozen
Q-Former, already pre-trained on massive image-text data, is also beneficial for
pre-training on image-region-text data. We name our
framework RegionBLIP. We pre-train RegionBLIP on image-region-text,
point-cloud-text, and point-cloud-region-text data. Experimental results verify
that \Ours{} preserves the image comprehension capability of BLIP-2 and further
gains comprehension of the newly introduced point cloud modality and regional
objects. The data, code, and pre-trained models will be available at
https://github.com/mightyzau/RegionBLIP
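
The following is a minimal sketch, not the authors' released code, of the idea of freezing a pre-trained Q-Former and attaching trainable modality-specific LoRA adapters to its linear layers. All names here (ToyQFormer, LoRALinear, add_modality_lora) are illustrative assumptions; the actual RegionBLIP implementation may differ.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class ToyQFormer(nn.Module):
    """Stand-in for a pre-trained Q-Former block with q/k/v projections."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # A real Q-Former would run cross-attention here; this is a placeholder.
        return self.q_proj(x) + self.k_proj(x) + self.v_proj(x)


def add_modality_lora(module: nn.Module, rank: int = 8) -> None:
    """Replace every nn.Linear in the frozen module with a LoRA-wrapped copy."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            add_modality_lora(child, rank)


qformer = ToyQFormer()
add_modality_lora(qformer, rank=8)       # one such LoRA set per new modality
out = qformer(torch.randn(1, 4, 768))    # forward still uses the frozen base weights
trainable = [n for n, p in qformer.named_parameters() if p.requires_grad]
print(trainable)                         # only lora_a / lora_b weights get optimized
```

Under this assumed setup, only the small LoRA matrices are updated for each newly introduced modality, while the shared pre-trained Q-Former weights stay untouched, which mirrors the incremental-extension strategy described above.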