In this work, we investigate extending the comprehension of Multi-modal Large
Language Models (MLLMs) to regional objects. To this end, we propose to extract
features corresponding to regional objects as soft prompts for the LLM, which
provides a straightforward and scalable approach and eliminates the need for
fine-tuning the LLM. To effectively extract regional features from regular image
features and irregular point cloud features, we present a novel and unified
position-assisted feature extraction module. Furthermore, since training an MLLM
from scratch is highly time-consuming, we propose to incrementally extend
existing pre-trained MLLMs to new modalities and to the regional objects of
those modalities. Specifically, we freeze the Q-Former from BLIP-2, an
impressive MLLM, and optimize modality-specific LoRA parameters in the Q-Former
and the LLM for each newly introduced modality. Freezing the Q-Former eliminates
the need for extensive pre-training on massive image-text data, and the frozen
Q-Former, already pre-trained on massive image-text data, is also beneficial for
pre-training on image-region-text data. We name our
framework RegionBLIP. We pre-train RegionBLIP on image-region-text,
point-cloud-text, and point-cloud-region-text data. Experimental results verify
that \Ours{} preserves the image comprehension capability of BLIP-2 and further
gains comprehension of the newly introduced point cloud modality and regional
objects. The data, code, and pre-trained models will be available at
https://github.com/mightyzau/RegionBLIP
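
The following is a minimal sketch, not the authors' released code, of the idea of freezing a pre-trained Q-Former and attaching trainable modality-specific LoRA adapters to its linear layers. All names here (ToyQFormer, LoRALinear, add_modality_lora) are illustrative assumptions; the actual RegionBLIP implementation may differ.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class ToyQFormer(nn.Module):
    """Stand-in for a pre-trained Q-Former block with q/k/v projections."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # A real Q-Former would run cross-attention here; this is a placeholder.
        return self.q_proj(x) + self.k_proj(x) + self.v_proj(x)


def add_modality_lora(module: nn.Module, rank: int = 8) -> None:
    """Replace every nn.Linear in the frozen module with a LoRA-wrapped copy."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            add_modality_lora(child, rank)


qformer = ToyQFormer()
add_modality_lora(qformer, rank=8)       # one such LoRA set per new modality
out = qformer(torch.randn(1, 4, 768))    # forward still uses the frozen base weights
trainable = [n for n, p in qformer.named_parameters() if p.requires_grad]
print(trainable)                         # only lora_a / lora_b weights get optimized
```

Under this assumed setup, only the small LoRA matrices are updated for each newly introduced modality, while the shared pre-trained Q-Former weights stay untouched, which mirrors the incremental-extension strategy described above.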