Self-supervised Modal and View Invariant Feature Learning
Most existing self-supervised feature learning methods for 3D data
learn 3D features either from point cloud data or from multi-view images. By
exploring the inherent multi-modality attributes of 3D objects, in this paper,
we propose to jointly learn modal-invariant and view-invariant features from
different modalities, including image, point cloud, and mesh, with heterogeneous
networks for 3D data. In order to learn modal- and view-invariant features, we
propose two types of constraints: a cross-modal invariance constraint and a
cross-view invariance constraint. The cross-modal invariance constraint forces the
network to maximize the agreement of features from different modalities of the
same object, while the cross-view invariance constraint forces the network to
maximize the agreement of features from different image views of the same object.
The quality of the learned features has been tested on different downstream tasks
with three modalities of data, including point cloud, multi-view images, and
mesh. Furthermore, the invariance across different modalities and views is
evaluated with the cross-modal retrieval task. Extensive evaluation results
demonstrate that the learned features are robust and generalize well
across different tasks.
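The abstract does not specify the exact form of the two constraints, so the sketch below only illustrates one common way to implement "maximizing agreement" across modalities and views, assuming an InfoNCE-style contrastive loss over pre-computed, object-aligned features; the function names, encoder outputs, and temperature value are hypothetical.

    # Minimal sketch of cross-modal and cross-view invariance constraints
    # (assumed InfoNCE-style agreement loss; not the paper's exact formulation).
    import torch
    import torch.nn.functional as F

    def agreement_loss(z_a, z_b, temperature=0.07):
        # z_a, z_b: (batch, dim) features; row i of each describes the same object.
        z_a = F.normalize(z_a, dim=1)
        z_b = F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / temperature                 # pairwise similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, targets)              # matched rows are positives

    def invariance_loss(img_feat_v1, img_feat_v2, pc_feat, mesh_feat):
        # Cross-modal invariance: image / point cloud / mesh features of the same objects.
        cross_modal = (agreement_loss(img_feat_v1, pc_feat)
                       + agreement_loss(img_feat_v1, mesh_feat)
                       + agreement_loss(pc_feat, mesh_feat))
        # Cross-view invariance: features of two different image views of the same objects.
        cross_view = agreement_loss(img_feat_v1, img_feat_v2)
        return cross_modal + cross_view

Minimizing such a loss pulls features of the same object together across modalities and views, which is the kind of invariance the cross-modal retrieval experiments probe.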
P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding
Self-supervised representation learning is a critical problem in computer
vision, as it provides a way to pretrain feature extractors on large unlabeled
datasets that can be used as an initialization for more efficient and effective
training on downstream tasks. A promising approach is to use contrastive
learning to learn a latent space where features are close for similar data
samples and far apart for dissimilar ones. This approach has demonstrated
tremendous success for pretraining both image and point cloud feature
extractors, but it has been barely investigated for multi-modal RGB-D scans,
especially with the goal of facilitating high-level scene understanding. To
solve this problem, we propose contrasting "pairs of point-pixel pairs", where
positives include pairs of RGB-D points in correspondence, and negatives
include pairs where one of the two modalities has been disturbed and/or the two
RGB-D points are not in correspondence. This provides extra flexibility in
constructing hard negatives and helps networks learn features from both
modalities, not just the more discriminative one of the two. Experiments show
that the proposed approach yields better performance on three large-scale
RGB-D scene understanding benchmarks (ScanNet, SUN RGB-D, and 3RScan) than
previous pretraining approaches.
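As a rough illustration of contrasting "pairs of point-pixel pairs", the sketch below represents each corresponding point-pixel pair by a joint embedding and builds hard negatives by disturbing one of the two modalities, i.e. shuffling it out of correspondence. The concatenation-based pair embedding, the way negatives are drawn, and the cross-entropy loss form are illustrative assumptions rather than the paper's exact construction.

    # Hedged sketch of contrasting pairs of point-pixel pairs (names hypothetical).
    import torch
    import torch.nn.functional as F

    def pair_embed(point_feat, pixel_feat):
        # Represent a point-pixel pair by its concatenated, normalized features.
        return F.normalize(torch.cat([point_feat, pixel_feat], dim=-1), dim=-1)

    def pair_contrast_loss(point_feat, pixel_feat, temperature=0.07):
        # point_feat, pixel_feat: (N, d) features of N corresponding RGB-D points.
        n = point_feat.size(0)
        anchors = pair_embed(point_feat, pixel_feat)             # matched pairs (positives)
        perm = torch.randperm(n, device=point_feat.device)       # breaks correspondence
        neg_pixels = pair_embed(point_feat, pixel_feat[perm])    # pixel side disturbed
        neg_points = pair_embed(point_feat[perm], pixel_feat)    # point side disturbed
        candidates = torch.cat([anchors, neg_pixels, neg_points], dim=0)
        logits = anchors @ candidates.t() / temperature          # (N, 3N) similarities
        targets = torch.arange(n, device=point_feat.device)      # anchor i must pick itself
        return F.cross_entropy(logits, targets)

Because each negative keeps one modality intact, the network cannot score it low by attending to only the stronger modality, which matches the motivation the abstract gives for this construction.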
Self-Supervised Pretraining of 3D Features on any Point-Cloud
Pretraining on large labeled datasets is a prerequisite for achieving good
performance in many computer vision tasks such as 2D object recognition and video
classification. However, pretraining is not widely used for 3D recognition
tasks where state-of-the-art methods train models from scratch. A primary
reason is the lack of large annotated datasets because 3D data is both
difficult to acquire and time-consuming to label. We present a simple
self-supervised pretraining method that can work with any 3D data - single- or
multi-view, indoor or outdoor, acquired by varied sensors, without 3D
registration. We pretrain standard point-cloud- and voxel-based model
architectures, and show that joint pretraining further improves performance. We
evaluate our models on 9 benchmarks for object detection, semantic
segmentation, and object classification, where they achieve state-of-the-art
results and can outperform supervised pretraining. We set a new
state of the art for object detection on ScanNet (69.0% mAP) and SUN RGB-D (63.5%
mAP). Our pretrained models are label-efficient and improve performance for
classes with few examples.
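The abstract leaves the pretraining objective unspecified, so the following is only a hedged sketch of how joint pretraining of a point-based and a voxel-based encoder on unlabeled point clouds could be wired up, assuming two random augmentations per scene and an InfoNCE loss applied within and across the two input formats; augment, voxelize, and the two encoders are placeholder callables.

    # Hedged sketch of joint point/voxel self-supervised pretraining (assumed loss).
    import torch
    import torch.nn.functional as F

    def info_nce(z_a, z_b, temperature=0.1):
        z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / temperature
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, targets)

    def joint_pretraining_loss(points, augment, point_encoder, voxelize, voxel_encoder):
        # points: a batch of unlabeled point clouds from any sensor, no registration needed.
        view1, view2 = augment(points), augment(points)           # two random augmentations
        zp1, zp2 = point_encoder(view1), point_encoder(view2)     # point-based features
        zv1, zv2 = voxel_encoder(voxelize(view1)), voxel_encoder(voxelize(view2))
        # Agreement within each format plus across formats ties the two
        # architectures together during joint pretraining.
        return (info_nce(zp1, zp2) + info_nce(zv1, zv2)
                + info_nce(zp1, zv2) + info_nce(zv1, zp2))

The cross-format terms are one plausible reading of why pretraining the two architectures jointly could help; the abstract itself only reports that it does.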