ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation
Modern supervised semantic segmentation methods are usually fine-tuned from
supervised or self-supervised models pre-trained on ImageNet. Recent
work shows that transferring knowledge from CLIP to semantic segmentation
via prompt learning can achieve promising performance. The performance boost
comes from feature enhancement through multimodal alignment, i.e., the dot
product between vision and text embeddings. However, how to improve
multimodal alignment for better transfer performance in dense tasks remains
underexplored. In this work, we focus on improving the quality of vision-text
alignment along two axes, prompt design and loss function, and present an
instance-conditioned prompting with contrastive learning (ICPC) framework.
First, we show that, compared with static prompt designs, dynamic
prompting conditioned on image content uses the text encoder more effectively
for complex dense tasks. Second, we propose an align-guided contrastive
loss to refine the alignment of vision and text embeddings. We further propose
lightweight multi-scale alignment for better performance. Extensive experiments
on three large-scale datasets (ADE20K, COCO-Stuff10k, and ADE20K-Full)
demonstrate that ICPC brings consistent improvements across diverse backbones.
Taking ResNet-50 as an example, ICPC outperforms the state-of-the-art
counterpart by 1.71%, 1.05%, and 1.41% mIoU on the three datasets,
respectively.
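
A minimal PyTorch sketch of the two ingredients the abstract describes, dot-product vision-text alignment with instance-conditioned prompts and an alignment-guided contrastive objective; all names, shapes, and the offset-based prompt design are assumptions for illustration, not the authors' code:

    import torch
    import torch.nn.functional as F

    class InstanceConditionedPrompt(torch.nn.Module):
        # Hypothetical dynamic prompting: static learnable tokens plus an
        # image-conditioned offset, so prompts adapt to each input image.
        def __init__(self, n_tokens, dim):
            super().__init__()
            self.static = torch.nn.Parameter(0.02 * torch.randn(n_tokens, dim))
            self.proj = torch.nn.Linear(dim, n_tokens * dim)

        def forward(self, image_feat):  # image_feat: (B, dim) global image feature
            offset = self.proj(image_feat).view(-1, *self.static.shape)
            return self.static + offset  # (B, n_tokens, dim) dynamic prompts

    def alignment_logits(pixel_feats, text_feats, tau=0.07):
        # pixel_feats: (B, C, H, W) vision embeddings; text_feats: (K, C) class text embeddings.
        pixel = F.normalize(pixel_feats, dim=1)
        text = F.normalize(text_feats, dim=1)
        # Multimodal alignment as a dot product: a per-pixel score for each of K classes.
        return torch.einsum("bchw,kc->bkhw", pixel, text) / tau

    def align_guided_contrastive_loss(pixel_feats, text_feats, labels, tau=0.07):
        # labels: (B, H, W) class indices. Pulls each pixel embedding toward its
        # class text embedding and pushes it away from the other classes.
        logits = alignment_logits(pixel_feats, text_feats, tau)  # (B, K, H, W)
        return F.cross_entropy(logits, labels, ignore_index=255)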
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
In this work, we investigate extending the comprehension of Multi-modal Large
Language Models (MLLMs) to regional objects. To this end, we propose to extract
features corresponding to regional objects as soft prompts for the LLM, a
straightforward and scalable approach that eliminates the need for
LLM fine-tuning. To effectively extract regional features from regular image
features and irregular point cloud features, we present a novel and unified
position-assisted feature extraction module. Furthermore, training an MLLM from
scratch is highly time-consuming. Thus, we propose incrementally extending
existing pre-trained MLLMs to comprehend more modalities and the regional
objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2,
an impressive MLLM, and optimize modality-specific LoRA parameters in the
Q-Former and the LLM for each newly introduced modality. Freezing the
Q-Former eliminates the need for extensive pre-training on massive image-text
data, and the frozen Q-Former, already pre-trained on such data, also
benefits pre-training on image-region-text data. We name our
framework RegionBLIP. We pre-train RegionBLIP on image-region-text,
point-cloud-text, and point-cloud-region-text data. Experimental results verify
that RegionBLIP preserves the image comprehension capability of BLIP-2 and
further gains comprehension of the newly introduced point cloud modality and
regional objects. Data, code, and pre-trained models will be available at
https://github.com/mightyzau/RegionBLIP.
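
A hedged sketch of the soft-prompt idea: pool features inside a region, add a positional cue computed from the box, and project into the LLM's token-embedding space so the frozen LLM can attend to the region without fine-tuning. Module names, shapes, and the mean-pooling stand-in for region pooling are illustrative assumptions:

    import torch
    import torch.nn as nn

    class RegionSoftPrompt(nn.Module):
        # Hypothetical position-assisted extraction of region features as LLM soft prompts.
        def __init__(self, feat_dim, llm_dim, n_prompt_tokens=8):
            super().__init__()
            self.box_pos = nn.Linear(4, feat_dim)  # positional cue from (x1, y1, x2, y2)
            self.proj = nn.Linear(feat_dim, n_prompt_tokens * llm_dim)
            self.n_tokens, self.llm_dim = n_prompt_tokens, llm_dim

        def forward(self, feats, boxes):
            # feats: (B, N, feat_dim) image-patch or point features;
            # boxes: (B, 4) normalized region boxes.
            pooled = feats.mean(dim=1)          # crude stand-in for region pooling
            region = pooled + self.box_pos(boxes)
            return self.proj(region).view(-1, self.n_tokens, self.llm_dim)

    # The resulting (B, n_tokens, llm_dim) tensor is concatenated in front of the
    # text embeddings and fed to the frozen LLM via its inputs_embeds interface.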
Improved Neural Radiance Fields Using Pseudo-depth and Fusion
Since the advent of Neural Radiance Fields, novel view synthesis has received
tremendous attention. Existing approaches for generalizable radiance field
reconstruction primarily construct an encoding volume from nearby source
images as additional input. However, these approaches cannot efficiently
encode the geometric information of real scenes containing objects and
structures at various scales. In this work, we propose constructing multi-scale encoding
volumes and providing multi-scale geometry information to NeRF models. To make
the constructed volumes as close as possible to the surfaces of objects in the
scene and the rendered depth more accurate, we propose to perform depth
prediction and radiance field reconstruction simultaneously. The predicted
depth map is used to supervise the rendered depth, narrow the depth range,
and guide point sampling. Finally, the geometric information contained in
point volume features may be inaccurate due to occlusion, lighting, etc. To
this end, we propose enhancing the point volume features via depth-guided
neighbor feature fusion. Experiments demonstrate the superior performance of
our method in both novel view synthesis and dense geometry modeling, without
per-scene optimization.
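
A minimal sketch of how a predicted pseudo-depth can supervise the rendered depth and narrow the sampling range along each ray; the variable names, L1 loss choice, and linear band sampler are assumptions for illustration, not the paper's implementation:

    import torch

    def rendered_depth(weights, z_vals):
        # weights: (R, S) volume-rendering weights; z_vals: (R, S) sample depths.
        return (weights * z_vals).sum(dim=-1)  # expected depth per ray, shape (R,)

    def depth_loss(weights, z_vals, pseudo_depth):
        # Supervise the rendered depth with the predicted (pseudo) depth map.
        return torch.abs(rendered_depth(weights, z_vals) - pseudo_depth).mean()

    def narrowed_samples(pseudo_depth, n_samples, margin=0.1):
        # Sample depths in a band around the pseudo-depth instead of the
        # full near/far range, concentrating points near likely surfaces.
        near = (pseudo_depth - margin).clamp(min=1e-3)
        far = pseudo_depth + margin
        t = torch.linspace(0.0, 1.0, n_samples, device=pseudo_depth.device)
        return near[..., None] * (1.0 - t) + far[..., None] * t  # (R, n_samples)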
- …