We study the task of weakly-supervised point cloud semantic segmentation with
sparse annotations (e.g., fewer than 0.1% of the points are labeled), aiming to
reduce the expensive cost of dense annotations. Unfortunately, with such
extremely sparse annotations, it is very difficult to extract the contextual and
object information needed for scene understanding tasks such as semantic
segmentation. Motivated by masked modeling (e.g., MAE) in image and video
representation learning, we seek to harness the power of masked modeling to
learn contextual information from sparsely annotated point clouds. However,
directly applying MAE to 3D point clouds with sparse annotations may fail to
work. First, it is nontrivial to
effectively mask out the informative visual context from 3D point clouds.
Second, how to fully exploit the sparse annotations for context modeling
remains an open question. In this paper, we propose a simple yet effective
Contextual Point Cloud Modeling (CPCM) method that consists of two parts: a
region-wise masking (RegionMask) strategy and a contextual masked training
(CMT) method. Specifically, RegionMask masks the point cloud continuously in
geometric space to construct a meaningful masked prediction task for subsequent
context learning. CMT disentangles the learning of supervised segmentation from
unsupervised masked context prediction, so that the model learns effectively
from the very limited labeled points and the massive number of unlabeled points,
respectively. Extensive
experiments on the widely-tested ScanNet V2 and S3DIS benchmarks demonstrate
the superiority of CPCM over the state-of-the-art.
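
To make the two components more concrete, below is a minimal illustrative sketch (not the authors' released implementation) of how a region-wise masking step and a disentangled CMT-style objective could be written in PyTorch. The function names (region_wise_mask, cmt_loss), the voxel-grid definition of a region, the mask ratio, the ignore label of -1, and the choice of an L2 feature-reconstruction loss for the unsupervised branch are all assumptions introduced here for illustration.

    # Illustrative sketch only: mask contiguous geometric regions and combine a
    # supervised segmentation loss on the few labeled points with an
    # unsupervised masked-prediction loss on unlabeled points. Details assumed.
    import torch
    import torch.nn.functional as F

    def region_wise_mask(xyz, region_size=0.5, mask_ratio=0.75):
        """Group points into voxel-grid regions and drop a random subset of
        whole regions. xyz: (N, 3) coordinates. Returns a (N,) boolean mask
        that is True for points whose region is masked out."""
        region_ids = torch.floor(xyz / region_size).long()        # (N, 3) grid cell per point
        _, region_idx = torch.unique(region_ids, dim=0, return_inverse=True)
        num_regions = int(region_idx.max().item()) + 1
        num_masked = int(mask_ratio * num_regions)
        masked_regions = torch.randperm(num_regions)[:num_masked]  # regions to hide
        region_is_masked = torch.zeros(num_regions, dtype=torch.bool)
        region_is_masked[masked_regions] = True
        return region_is_masked[region_idx]                        # (N,) per-point mask

    def cmt_loss(seg_logits, labels, pred_feats, target_feats, point_mask,
                 ignore_label=-1, lam=1.0):
        """Supervised cross-entropy on labeled points plus an unsupervised
        masked-prediction term (here a simple L2 feature reconstruction) on
        masked, unlabeled points."""
        sup = F.cross_entropy(seg_logits, labels, ignore_index=ignore_label)
        unsup_points = point_mask & (labels == ignore_label)
        unsup = F.mse_loss(pred_feats[unsup_points], target_feats[unsup_points])
        return sup + lam * unsup

The key design point this sketch tries to capture is that masking whole geometric regions, rather than individual points, removes enough local evidence that the network must rely on surrounding context to reconstruct the masked content, while the two loss terms keep the scarce labels and the abundant unlabeled points on separate learning signals.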