Object discovery is a core task in computer vision. While rapid progress
has been made in supervised object detection, its unsupervised counterpart
remains largely unexplored. As data volumes grow, the high cost
of annotation is the major limitation hindering further study. Therefore,
discovering objects without annotations is of great significance. However, this
task seems impractical on still images or point clouds alone due to the lack of
discriminative information. Previous studies overlook the crucial temporal
information and constraints naturally present in multi-modal inputs. In this paper,
we propose 4D unsupervised object discovery, jointly discovering objects from
4D data -- 3D point clouds and 2D RGB images with temporal information. We
present the first practical approach for this task by proposing a ClusterNet on
3D point clouds, which is jointly and iteratively optimized with a 2D localization
network. Extensive experiments on the large-scale Waymo Open Dataset suggest
that the localization network and ClusterNet achieve competitive performance on
both class-agnostic 2D object detection and 3D instance segmentation, bridging
the gap between unsupervised methods and fully supervised ones. Code and models
will be made available at https://github.com/Robertwyq/LSMOL.

Comment: Accepted by NeurIPS 2022. 17 pages, 6 figures