Semantic segmentation of point clouds usually requires exhaustive human
annotation, which has drawn wide attention to the challenging topic of learning
from unlabeled data or weaker forms of annotation. In this paper, we make the
first attempt at fully unsupervised semantic segmentation of point clouds,
which aims to delineate semantically meaningful objects without any form of
annotation. Previous unsupervised pipelines for 2D images fail on point
clouds, due to: 1) Clustering Ambiguity caused by the limited amount of data
and imbalanced class distribution; 2) Irregularity Ambiguity caused by the
irregular sparsity of point clouds. Therefore, we propose a novel
framework, PointDC, which comprises two stages that handle the aforementioned
problems respectively: Cross-Modal Distillation (CMD) and Super-Voxel
Clustering (SVC). In the first stage, CMD, multi-view visual
features are back-projected to 3D space and aggregated into a unified point
feature that guides the training of the point representation through
distillation. In the second stage, SVC, the point features are aggregated
into super-voxels and then fed to an iterative clustering process to discover
semantic classes. PointDC
yields significant improvements over prior state-of-the-art unsupervised
methods on both the ScanNet-v2 (+18.4 mIoU) and S3DIS (+11.5 mIoU) semantic
segmentation benchmarks.
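
The SVC stage described above can be illustrated with a minimal sketch. This
is not the PointDC implementation: it assumes a regular voxel grid as a
stand-in for super-voxels and plain k-means as the iterative clustering step,
and the names `voxel_ids`, `pool_features`, `kmeans`, and `voxel_size` are
hypothetical, chosen only for illustration.

```python
import numpy as np

def voxel_ids(xyz, voxel_size):
    """Assign each point the index of the grid cell that contains it.

    Assumption: a regular grid stands in for super-voxels.
    """
    cells = np.floor(xyz / voxel_size).astype(np.int64)
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    return inverse.reshape(-1)

def pool_features(feats, ids):
    """Average the features of all points that share a cell."""
    n_cells = ids.max() + 1
    sums = np.zeros((n_cells, feats.shape[1]))
    np.add.at(sums, ids, feats)  # unbuffered add handles repeated ids
    counts = np.bincount(ids, minlength=n_cells)[:, None]
    return sums / counts

def kmeans(x, k, n_iter=20, seed=0):
    """Plain k-means (an assumed clustering step); returns a label per row."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every row to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave empty clusters where they are
                centers[j] = members.mean(0)
    return labels
```

Given point coordinates `xyz` of shape (N, 3) and point features `feats` of
shape (N, D), per-point pseudo-labels follow from `ids = voxel_ids(xyz, 0.05)`
and `kmeans(pool_features(feats, ids), k)[ids]`. Clustering pooled features
rather than raw points reduces the number of samples and smooths over
irregular point density.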