Multi-View Representation is What You Need for Point-Cloud Pre-Training
A promising direction for pre-training 3D point clouds is to leverage the
massive amount of data in 2D, but the domain gap between 2D and 3D poses a
fundamental challenge. This paper proposes a novel approach to point-cloud
pre-training that learns 3D representations by leveraging pre-trained 2D
networks. Different from the popular practice of predicting 2D features first
and then obtaining 3D features through dimensionality lifting, our approach
directly uses a 3D network for feature extraction. We train the 3D feature
extraction network with the help of a novel 2D knowledge transfer loss, which
enforces the 2D projections of the 3D features to be consistent with the
outputs of pre-trained 2D networks. To prevent the features from discarding 3D
signals,
we introduce the multi-view consistency loss that additionally encourages the
projected 2D feature representations to capture pixel-wise correspondences
across different views. Such correspondences induce 3D geometry and effectively
retain 3D features in the projected 2D features. Experimental results
demonstrate that our pre-trained model can be successfully transferred to
various downstream tasks, including 3D shape classification, part segmentation,
3D object detection, and semantic segmentation, achieving state-of-the-art
performance.
Comment: 14 pages, 6 figures
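The two training objectives lend themselves to a compact sketch. The following is a minimal illustration in PyTorch under simplifying assumptions (per-pixel features already projected and flattened per view, correspondence index maps precomputed from known camera geometry); it is not the paper's released code:

```python
import torch
import torch.nn.functional as F

def knowledge_transfer_loss(feat3d_proj, feat2d_teacher):
    """2D knowledge transfer: the 2D projection of the 3D features
    should agree with the output of a frozen pre-trained 2D network.
    Both tensors: (B, C, H, W)."""
    return 1 - F.cosine_similarity(feat3d_proj, feat2d_teacher, dim=1).mean()

def multiview_consistency_loss(feat_view_a, feat_view_b, corr_a, corr_b):
    """Multi-view consistency: projected features of corresponding
    pixels across two views should match, which implicitly encodes
    3D geometry into the projected 2D features.
    feat_view_*: (N, C) per-pixel features, flattened per view
    corr_*:      (M,)   indices of mutually corresponding pixels."""
    fa = feat_view_a[corr_a]  # (M, C)
    fb = feat_view_b[corr_b]  # (M, C)
    return 1 - F.cosine_similarity(fa, fb, dim=1).mean()
```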
Point2Sequence: Learning the Shape Representation of 3D Point Clouds with an Attention-based Sequence to Sequence Network
Exploring contextual information in local regions is important for shape
understanding and analysis. Existing studies often employ hand-crafted or
explicit schemes to encode the contextual information of local regions.
However, such schemes struggle to capture fine-grained contextual information,
such as the correlations between different areas within a local region, which
limits the discriminative ability of the learned features. To resolve this
issue, we propose a novel deep learning model for 3D point clouds, named
Point2Sequence, to learn 3D shape features by capturing fine-grained contextual
information in a novel implicit way. Point2Sequence employs a novel sequence
learning model for point clouds to capture the correlations by aggregating
multi-scale areas of each local region with attention. Specifically,
Point2Sequence first learns the feature of each area scale in a local region.
Then, it captures the correlation between area scales in the process of
aggregating all area scales using a recurrent neural network (RNN) based
encoder-decoder structure, where an attention mechanism is proposed to
highlight the importance of different area scales. Experimental results show
that Point2Sequence achieves state-of-the-art performance in shape
classification and segmentation tasks.
Comment: To be published in AAAI 2019
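As a rough sketch of the core aggregation step (assuming PyTorch, with a GRU standing in for the paper's LSTM-based encoder-decoder; the attention form here is a hypothetical simplification):

```python
import torch
import torch.nn as nn

class ScaleAttentionAggregator(nn.Module):
    """Treat the features of T area scales in a local region as a
    sequence, encode them with an RNN, and fuse the encoder states
    with learned attention weights into one region feature."""
    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)  # attention score per scale

    def forward(self, scale_feats):                  # (B, T, C)
        states, _ = self.encoder(scale_feats)        # (B, T, C)
        attn = torch.softmax(self.score(states), 1)  # (B, T, 1)
        return (attn * states).sum(dim=1)            # (B, C)

# usage: aggregate 4 area scales of 128-d features for 8 local regions
agg = ScaleAttentionAggregator(128)
region_feat = agg(torch.randn(8, 4, 128))  # -> (8, 128)
```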
Spherical Transformer: Adapting Spherical Signal to CNNs
Convolutional neural networks (CNNs) have been widely used in various vision
tasks, e.g., image classification and semantic segmentation. Unfortunately,
standard 2D CNNs are not well suited for spherical signals such as panorama
images or spherical projections, as the sphere is an unstructured grid. In this
paper, we present Spherical Transformer which can transform spherical signals
into vectors that can be directly processed by standard CNNs such that many
well-designed CNN architectures can be reused across tasks and datasets by
pretraining. To this end, the proposed method first uses locally structured
sampling methods such as HEALPix to construct a transformer grid from the
information of each spherical point and its adjacent points, and then
transforms the spherical signals into vectors through the grid. By building the
Spherical Transformer module, we can use multiple CNN architectures directly.
We evaluate our approach on the tasks of spherical MNIST recognition, 3D object
classification and omnidirectional image semantic segmentation. For 3D object
classification, we further propose a rendering-based projection method to
improve performance and a rotation-equivariant model to improve robustness to
rotations. Experimental results on the three tasks show that our approach
achieves superior performance over state-of-the-art methods.
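The transformation itself reduces to a structured gather. Below is a minimal sketch (assuming PyTorch and a neighbor-index table precomputed from a HEALPix-style sampling; the function name and shapes are illustrative, not the paper's API):

```python
import torch

def spherical_to_grid(signal, neighbor_idx):
    """For every spherical sample, gather the values of the point and
    its K-1 structured neighbors into one vector, so the result can be
    consumed by standard CNN layers as a regular feature map.
    signal:       (N, C) values at N spherical sample points
    neighbor_idx: (N, K) precomputed indices; column 0 is the point itself
    returns:      (N, K*C) transformed vectors."""
    N, K = neighbor_idx.shape
    gathered = signal[neighbor_idx.reshape(-1)]      # (N*K, C)
    return gathered.reshape(N, K * signal.shape[1])  # (N, K*C)

# usage: a 4-neighbor grid over 12 sample points with 3 channels
vecs = spherical_to_grid(torch.randn(12, 3), torch.randint(0, 12, (12, 4)))
```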
Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis
Although recent point cloud analysis has achieved impressive progress, the
paradigm of learning representations from a single modality is gradually
reaching its bottleneck. In this work, we take a step towards more
discriminative 3D point cloud representations by taking full advantage of
images, which inherently contain richer appearance information, e.g., texture,
color, and shading.
Specifically, this paper introduces a simple but effective point cloud
cross-modality training (PointCMT) strategy, which utilizes view-images, i.e.,
rendered or projected 2D images of the 3D object, to boost point cloud
analysis. In practice, to effectively acquire auxiliary knowledge from view
images, we develop a teacher-student framework and formulate the cross-modal
learning as a knowledge distillation problem. PointCMT eliminates the
distribution discrepancy between different modalities through novel feature and
classifier enhancement criteria and avoids potential negative transfer
effectively. Note that PointCMT effectively improves the point-only
representation without architecture modification. Extensive experiments verify
significant gains on various datasets using popular backbones: equipped with
PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two
benchmarks, i.e., 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN,
respectively. Code will be made available at
https://github.com/ZhanHeshen/PointCMT.
Comment: To appear in NeurIPS 2022
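The distillation view of the problem can be sketched compactly. The following is a hypothetical simplification (plain soft-logit distillation plus feature mimicry in PyTorch), not PointCMT's exact feature and classifier enhancement criteria:

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_logits,
                                  student_feat, teacher_feat,
                                  tau=4.0, alpha=0.5):
    """The point-cloud student mimics a frozen image teacher at two
    levels: softened classifier outputs (KL) and global features (MSE)."""
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau  # rescale gradients for the temperature
    feat = F.mse_loss(student_feat, teacher_feat.detach())
    return alpha * kd + (1 - alpha) * feat
```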
PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition
As two fundamental representation modalities of 3D objects, 3D point clouds
and multi-view 2D images record shape information from different domains of
geometric structures and visual appearances. In the current deep learning era,
remarkable progress in processing such two data modalities has been achieved
through respectively customizing compatible 3D and 2D network architectures.
However, unlike multi-view image-based 2D visual modeling paradigms, which have
shown leading performance in several common 3D shape recognition benchmarks,
point cloud-based 3D geometric modeling paradigms are still highly limited by
insufficient learning capacity, due to the difficulty of extracting
discriminative features from irregular geometric signals. In this paper, we
explore the possibility of boosting deep 3D point cloud encoders by
transferring visual knowledge extracted from deep 2D image encoders under a
standard teacher-student distillation workflow. Generally, we propose PointMCD,
a unified multi-view cross-modal distillation architecture, including a
pretrained deep image encoder as the teacher and a deep point encoder as the
student. To perform heterogeneous feature alignment between 2D visual and 3D
geometric domains, we further investigate visibility-aware feature projection
(VAFP), by which point-wise embeddings are reasonably aggregated into
view-specific geometric descriptors. By pair-wise alignment of multi-view
visual and geometric descriptors, we can obtain more powerful deep point
encoders without exhaustive and complicated network modifications. Experiments
on 3D
shape classification, part segmentation, and unsupervised learning strongly
validate the effectiveness of our method. The code and data will be publicly
available at https://github.com/keeganhk/PointMCD.
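A minimal sketch of the visibility-aware aggregation idea (assuming PyTorch and per-view boolean visibility masks precomputed during rendering; this simplifies VAFP to masked averaging):

```python
import torch

def visibility_aware_projection(point_feats, visible):
    """Aggregate point-wise embeddings into view-specific geometric
    descriptors by pooling exactly the points visible in each view.
    point_feats: (N, C) point-wise embeddings from the student encoder
    visible:     (V, N) boolean visibility mask for V rendered views
    returns:     (V, C) view-specific descriptors, ready for pair-wise
                 alignment with the teacher's multi-view features."""
    mask = visible.float()                               # (V, N)
    counts = mask.sum(dim=1, keepdim=True).clamp(min=1)  # avoid div-by-zero
    return (mask @ point_feats) / counts                 # (V, C)

# usage: 6 views over a cloud of 1024 points with 256-d embeddings
desc = visibility_aware_projection(torch.randn(1024, 256),
                                   torch.rand(6, 1024) > 0.5)
```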