See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data
Zero-shot point cloud segmentation aims to make deep models capable of
recognizing novel objects in point clouds that are unseen in the training phase.
Recent trends favor pipelines that transfer knowledge from seen classes with
labels to unseen classes without labels. These methods typically align visual
features with semantic features obtained from word embeddings under the
supervision of seen classes' annotations. However, point clouds contain limited information
to fully match with semantic features. In fact, the rich appearance information
of images is a natural complement to the textureless point cloud, which is not
well explored in previous literature. Motivated by this, we propose a novel
multi-modal zero-shot learning method to better utilize the complementary
information of point clouds and images for more accurate visual-semantic
alignment. Extensive experiments on two popular benchmarks, i.e., SemanticKITTI
and nuScenes, show that our method outperforms current state-of-the-art methods
by 52% and 49% on average in unseen-class mIoU, respectively.
Comment: Accepted by ICCV 2023
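
To make the visual-semantic alignment idea concrete, below is a minimal sketch of how fused point and image features might be matched against class word embeddings under seen-class supervision. It is an illustration only; the module and function names (MultiModalAligner, seen_class_loss) and the simple additive fusion are assumptions, not the paper's actual implementation.

# Minimal sketch of multi-modal visual-semantic alignment for zero-shot
# segmentation, assuming per-point visual features, paired per-point image
# features, and fixed class word embeddings (e.g. from word2vec).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalAligner(nn.Module):
    def __init__(self, point_dim, image_dim, embed_dim):
        super().__init__()
        self.point_proj = nn.Linear(point_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)

    def forward(self, point_feats, image_feats, class_embeds):
        # Fuse textureless point features with appearance-rich image features
        # before matching against the semantic (word) embeddings.
        fused = self.point_proj(point_feats) + self.image_proj(image_feats)
        fused = F.normalize(fused, dim=-1)
        class_embeds = F.normalize(class_embeds, dim=-1)
        return fused @ class_embeds.t()          # (N, num_classes) similarity logits

def seen_class_loss(logits, labels, seen_mask):
    # Training uses only seen-class annotations; unseen classes are masked out.
    keep = seen_mask[labels]                     # keep points whose label is a seen class
    return F.cross_entropy(logits[keep], labels[keep])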
Point Cloud Pre-training with Diffusion Models
Pre-training a model and then fine-tuning it on downstream tasks has
demonstrated significant success in the 2D image and NLP domains. However, due
to the unordered and non-uniform density characteristics of point clouds, it is
non-trivial to explore the prior knowledge of point clouds and pre-train a
point cloud backbone. In this paper, we propose a novel pre-training method
called Point cloud Diffusion pre-training (PointDif). We consider the point
cloud pre-training task as a conditional point-to-point generation problem and
introduce a conditional point generator. This generator aggregates the features
extracted by the backbone and employs them as the condition to guide the
point-to-point recovery from the noisy point cloud, thereby assisting the
backbone in capturing both local and global geometric priors as well as the
global point density distribution of the object. We also present a recurrent
uniform sampling optimization strategy, which enables the model to uniformly
recover from various noise levels and learn from balanced supervision. Our
PointDif achieves substantial improvement across various real-world datasets
for diverse downstream tasks such as classification, segmentation and
detection. Specifically, PointDif attains 70.0% mIoU on S3DIS Area 5 for the
segmentation task and achieves an average improvement of 2.4% on ScanObjectNN
for the classification task compared to TAP. Furthermore, our pre-training
framework can be flexibly applied to diverse point cloud backbones and bring
considerable gains.
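
As a rough illustration of conditional point-to-point generation for pre-training, the sketch below noises a point cloud, conditions a generator on backbone features, and supervises recovery of the clean points. The backbone, generator, and linear noise schedule are placeholders and the single-step loss is a simplification; this is not PointDif's actual architecture or schedule.

# Hedged sketch of diffusion-style point cloud pre-training: the backbone's
# features condition a generator that recovers clean points from a noised copy.
import torch
import torch.nn.functional as F

def pretrain_step(backbone, generator, points, num_steps=1000):
    # points: (B, N, 3); backbone and generator are placeholder modules.
    B = points.shape[0]
    t = torch.randint(0, num_steps, (B,), device=points.device)   # noise level per sample
    alpha = 1.0 - (t.float() + 1) / num_steps                     # toy linear schedule
    alpha = alpha.view(B, 1, 1)
    noise = torch.randn_like(points)
    noisy = alpha.sqrt() * points + (1 - alpha).sqrt() * noise    # corrupt the cloud

    cond = backbone(points)                 # geometric features used as the condition
    recovered = generator(noisy, t, cond)   # point-to-point recovery guided by the condition
    return F.mse_loss(recovered, points)    # supervise recovery of the clean points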
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
Contrastive Language-Image Pre-training (CLIP) achieves promising results in
2D zero-shot and few-shot learning. Despite the impressive performance in 2D,
applying CLIP to aid learning in 3D scene understanding has yet to be
explored. In this paper, we make the first attempt to investigate how CLIP
knowledge benefits 3D scene understanding. We propose CLIP2Scene, a simple yet
effective framework that transfers CLIP knowledge from 2D image-text
pre-trained models to a 3D point cloud network. We show that the pre-trained 3D
network yields impressive performance on various downstream tasks, i.e.,
annotation-free semantic segmentation and fine-tuning with labelled data.
Specifically, built upon CLIP, we design a Semantic-driven Cross-modal
Contrastive Learning framework that pre-trains a 3D network via semantic and
spatial-temporal consistency regularization. For the former, we first leverage
CLIP's text semantics to select the positive and negative point samples and
then employ the contrastive loss to train the 3D network. In terms of the
latter, we force the consistency between the temporally coherent point cloud
features and their corresponding image features. We conduct experiments on
SemanticKITTI, nuScenes, and ScanNet. For the first time, our pre-trained
network achieves annotation-free 3D semantic segmentation with 20.8% and 25.08%
mIoU on nuScenes and ScanNet, respectively. When fine-tuned with 1% or 100%
labelled data, our method significantly outperforms other self-supervised
methods, with improvements of 8% and 1% mIoU, respectively. Furthermore, we
demonstrate the generalizability of our method to cross-domain datasets. Code is
publicly available at https://github.com/runnanchen/CLIP2Scene.
Comment: CVPR 2023
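
The following sketch illustrates the semantic-driven contrastive idea: CLIP text embeddings assign each point a pseudo class through its paired image feature, and the 3D network's features are then contrasted against the class text embeddings. It assumes pre-computed CLIP image and text features projected onto the points and is not the paper's exact loss.

# Minimal sketch of semantic-driven cross-modal contrastive learning with CLIP.
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(point_feats, clip_image_feats, clip_text_feats, tau=0.07):
    # point_feats:      (N, D) features from the 3D network
    # clip_image_feats: (N, D) CLIP image features projected onto the points
    # clip_text_feats:  (C, D) CLIP text embeddings of class prompts
    point_feats = F.normalize(point_feats, dim=-1)
    clip_image_feats = F.normalize(clip_image_feats, dim=-1)
    clip_text_feats = F.normalize(clip_text_feats, dim=-1)

    # CLIP text semantics pick positive/negative samples: each point's pseudo
    # class is the text embedding closest to its paired image feature.
    pseudo_labels = (clip_image_feats @ clip_text_feats.t()).argmax(dim=-1)   # (N,)

    # Contrast the 3D point features against all class text embeddings.
    logits = point_feats @ clip_text_feats.t() / tau                          # (N, C)
    return F.cross_entropy(logits, pseudo_labels)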
Rethinking Range View Representation for LiDAR Segmentation
LiDAR segmentation is crucial for autonomous driving perception. Recent
trends favor point- or voxel-based methods as they often yield better
performance than the traditional range view representation. In this work, we
unveil several key factors in building powerful range view models. We observe
that the "many-to-one" mapping, semantic incoherence, and shape deformation are
possible impediments to effective learning from range view projections. We
present RangeFormer -- a full-cycle framework comprising novel designs across
network architecture, data augmentation, and post-processing -- that better
handles the learning and processing of LiDAR point clouds from the range view.
We further introduce a Scalable Training from Range view (STR) strategy that
trains on arbitrary low-resolution 2D range images, while still maintaining
satisfactory 3D segmentation accuracy. We show that, for the first time, a
range view method is able to surpass the point, voxel, and multi-view fusion
counterparts on competitive LiDAR semantic and panoptic segmentation
benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI.
Comment: ICCV 2023; 24 pages, 10 figures, 14 tables; Webpage at
https://ldkong.com/RangeFormer
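
For reference, a standard spherical range view projection is sketched below; it also shows where the "many-to-one" mapping arises, since several points can land on the same pixel. The field-of-view values are typical 64-beam settings, not necessarily RangeFormer's configuration.

# Standard spherical ("range view") projection of a LiDAR sweep into a 2D image.
import numpy as np

def range_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    # points: (N, 3) xyz coordinates in the sensor frame
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = abs(fov_up) + abs(fov_down)

    depth = np.linalg.norm(points, axis=1)
    yaw = -np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.clip(depth, 1e-8, None))

    u = 0.5 * (yaw / np.pi + 1.0) * W                 # horizontal image coordinate
    v = (1.0 - (pitch + abs(fov_down)) / fov) * H     # vertical image coordinate
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    range_image = np.full((H, W), -1.0, dtype=np.float32)
    # Write far points first so nearer points overwrite them: this is where the
    # many-to-one mapping discards points that share a pixel.
    order = np.argsort(depth)[::-1]
    range_image[v[order], u[order]] = depth[order]
    return range_image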
Inter-Region Affinity Distillation for Road Marking Segmentation
We study the problem of distilling knowledge from a large deep teacher
network to a much smaller student network for the task of road marking
segmentation. In this work, we explore a novel knowledge distillation (KD)
approach that can transfer 'knowledge' on scene structure more effectively from
a teacher to a student model. Our method is known as Inter-Region Affinity KD
(IntRA-KD). It decomposes a given road scene image into different regions and
represents each region as a node in a graph. An inter-region affinity graph is
then formed by establishing pairwise relationships between nodes based on their
similarity in feature distribution. To learn structural knowledge from the
teacher network, the student is required to match the graph generated by the
teacher. The proposed method shows promising results on three large-scale road
marking segmentation benchmarks, i.e., ApolloScape, CULane and LLAMAS, by
taking various lightweight models as students and ResNet-101 as the teacher.
IntRA-KD consistently brings higher performance gains on all lightweight
models, compared to previous distillation methods. Our code is available at
https://github.com/cardwing/Codes-for-IntRA-KD.
Comment: 10 pages, 10 figures; This paper is accepted by CVPR 2020; Our code
is available at https://github.com/cardwing/Codes-for-IntRA-KD
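
A minimal sketch of the inter-region affinity idea follows: feature maps are pooled into region nodes, pairwise cosine affinities form a graph, and the student is trained to match the teacher's graph. The grid-based region decomposition used here is a simplification of the paper's scene decomposition, not its actual scheme.

# Hedged sketch of inter-region affinity distillation between teacher and student.
import torch
import torch.nn.functional as F

def region_nodes(feat, grid=4):
    # feat: (B, C, H, W) -> (B, grid*grid, C) by average pooling each grid region
    pooled = F.adaptive_avg_pool2d(feat, (grid, grid))
    return pooled.flatten(2).transpose(1, 2)

def affinity_graph(nodes):
    # Pairwise cosine similarity between region nodes: (B, R, R)
    nodes = F.normalize(nodes, dim=-1)
    return nodes @ nodes.transpose(1, 2)

def intra_kd_loss(student_feat, teacher_feat, grid=4):
    a_s = affinity_graph(region_nodes(student_feat, grid))
    a_t = affinity_graph(region_nodes(teacher_feat, grid))
    return F.mse_loss(a_s, a_t.detach())    # student matches the teacher's affinity graph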