Frozen CLIP Model is An Efficient Point Cloud Backbone
The pretraining-finetuning paradigm has demonstrated great success in NLP and
2D vision because of the strong representation ability and transferability of
pretrained models. However, pretraining such a strong
model is difficult in the 3D point cloud field since the training data is
limited and point cloud collection is expensive. This paper introduces
Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud
learner for directly training high-quality point cloud models with a frozen
CLIP model. Our EPCL connects the 2D and 3D modalities by semantically aligning
the 2D features and point cloud features without paired 2D-3D data.
Specifically, the input point cloud is divided into a sequence of tokens and
directly fed into the frozen CLIP model to learn point cloud representation.
Furthermore, we design a task token to narrow the gap between 2D images and 3D
point clouds. Comprehensive experiments on 3D detection, semantic segmentation,
classification and few-shot learning demonstrate that the 2D CLIP model can be
an efficient point cloud backbone and our method achieves state-of-the-art
accuracy on both real-world and synthetic downstream tasks. Code will be
available.
Comment: Technical report
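The token-level interface described above is easy to picture in code. Below is
a minimal PyTorch sketch of the idea, not the authors' implementation: the
point tokenizer, task token, and head are trainable, while the reused CLIP
transformer blocks stay frozen. The `clip_blocks` argument and the simple MLP
tokenizer are assumptions for illustration (EPCL's actual tokenizer embeds
local patches of points).

```python
import torch
import torch.nn as nn

class EPCLSketch(nn.Module):
    """Illustrative sketch: trainable point tokenizer + task token wrapped
    around frozen CLIP transformer blocks (names here are assumptions)."""

    def __init__(self, clip_blocks, embed_dim=768, num_classes=40):
        super().__init__()
        # Trainable tokenizer: embeds each point patch into the CLIP width.
        self.tokenizer = nn.Sequential(
            nn.Linear(3, 128), nn.GELU(), nn.Linear(128, embed_dim)
        )
        # Trainable task token meant to narrow the 2D-3D gap.
        self.task_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.blocks = clip_blocks  # frozen, pretrained 2D CLIP blocks
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)  # trainable task head

    def forward(self, patch_centers):  # (B, N, 3) point-patch centers
        tokens = self.tokenizer(patch_centers)
        task = self.task_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([task, tokens], dim=1)
        for blk in self.blocks:  # representation comes from the frozen model
            x = blk(x)
        return self.head(x[:, 0])  # predict from the task token
```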
Ponder: Point Cloud Pre-training via Neural Rendering
We propose a novel approach to self-supervised learning of point cloud
representations by differentiable neural rendering. Motivated by the fact that
informative point cloud features should be able to encode rich geometry and
appearance cues and render realistic images, we train a point-cloud encoder
within a devised point-based neural renderer by comparing the rendered images
with real images on massive RGB-D data. The learned point-cloud encoder can be
easily integrated into various downstream tasks, including not only high-level
tasks like 3D detection and segmentation, but also low-level tasks like 3D
reconstruction and image synthesis. Extensive experiments on various tasks
demonstrate the superiority of our approach compared to existing pre-training
methods.
Comment: Project page: https://dihuang.me/ponder
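As a rough illustration of this pre-training objective, here is a hedged
PyTorch sketch of one training step: an encoder produces per-point features, a
differentiable point-based renderer (an assumed callable standing in for the
paper's devised renderer) produces an image and depth map from a camera pose,
and the result is compared against the real RGB-D frame.

```python
import torch.nn.functional as F

def ponder_pretrain_step(encoder, renderer, optimizer,
                         points, rgb, depth, camera):
    """One hypothetical pre-training step; `renderer` and `camera` are
    assumed interfaces, not the paper's actual API."""
    feats = encoder(points)  # per-point features from the trainable encoder
    pred_rgb, pred_depth = renderer(points, feats, camera)  # differentiable
    # Supervise the rendering against the real RGB-D observation.
    loss = F.l1_loss(pred_rgb, rgb) + F.l1_loss(pred_depth, depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```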
Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Structural re-parameterization is a general training scheme for Convolutional
Neural Networks (CNNs), which achieves performance improvement without
increasing inference cost. As Vision Transformers (ViTs) are gradually
surpassing CNNs in various visual tasks, a natural question arises: does a
training scheme designed specifically for ViTs exist that can also improve
performance without increasing inference cost? Recently, Mixture-of-Experts (MoE) has
attracted increasing attention, as it can efficiently scale up the capacity of
Transformers at a fixed cost through sparsely activated experts. Considering
that MoE can also be viewed as a multi-branch structure, can we utilize MoE to
implement a ViT training scheme similar to structural re-parameterization? In
this paper, we affirmatively answer these questions, with a new general
training strategy for ViTs. Specifically, we decouple the training and
inference phases of ViTs. During training, we replace some Feed-Forward
Networks (FFNs) of the ViT with specially designed, more efficient MoEs that
assign tokens to experts by random uniform partition, and perform Experts
Weights Averaging (EWA) on these MoEs at the end of each iteration. After
training, we convert each MoE into an FFN by averaging the experts,
transforming the model back into the original ViT for inference. We further provide
a theoretical analysis to show why and how it works. Comprehensive experiments
across various 2D and 3D visual tasks, ViT architectures, and datasets validate
the effectiveness and generalizability of the proposed training scheme.
Besides, our training scheme can also be applied to improve performance when
fine-tuning ViTs. Last but not least, the proposed EWA technique can
significantly improve the effectiveness of naive MoE on various small 2D
datasets and 3D visual tasks.
Comment: 12 pages, 2 figures
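Two of the steps above, the per-iteration Experts Weights Averaging and the
final MoE-to-FFN conversion by averaging experts, are simple enough to sketch.
The PyTorch snippet below is a hedged reading of the abstract, not the paper's
code; the averaging rate `alpha` is an assumed hyperparameter.

```python
import torch

@torch.no_grad()
def experts_weights_averaging(experts, alpha=0.999):
    """EWA sketch: after each training iteration, pull every expert's
    weights toward the mean of all experts (alpha is an assumption)."""
    per_expert = [list(e.parameters()) for e in experts]
    for same_param in zip(*per_expert):  # the same tensor across experts
        mean = torch.mean(torch.stack(same_param), dim=0)
        for t in same_param:
            t.mul_(alpha).add_(mean, alpha=1.0 - alpha)

@torch.no_grad()
def moe_to_ffn(experts, ffn):
    """After training: collapse the MoE back into a single FFN by
    averaging the experts' weights, so inference cost is unchanged."""
    for ffn_p, *expert_ps in zip(ffn.parameters(),
                                 *[e.parameters() for e in experts]):
        ffn_p.copy_(torch.mean(torch.stack(expert_ps), dim=0))
```

Intuitively, the random uniform token-to-expert partition means every expert
sees statistically similar tokens, which is what makes averaging their weights
a reasonable way to merge them.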
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency
Masked Modeling (MM) has demonstrated widespread success in various vision
challenges, by reconstructing masked visual patches. Yet, applying MM for
large-scale 3D scenes remains an open problem due to the data sparsity and
scene complexity. The conventional random masking paradigm used in 2D images
often causes a high risk of ambiguity when recovering the masked region of 3D
scenes. To address this, we propose a novel informative-preserved reconstruction,
which explores local statistics to discover and preserve the representative
structured points, effectively enhancing the pretext masking task for 3D scene
understanding. Integrated with a progressive reconstruction scheme, our method
can concentrate on modeling regional geometry and enjoy less ambiguity for
masked reconstruction. Besides, scenes under progressive masking ratios can
also serve to self-distill their intrinsic spatial consistency, requiring the
model to learn consistent representations from unmasked areas. By elegantly
combining informative-preserved reconstruction on masked areas and consistency
self-distillation from unmasked areas, we obtain a unified framework called
MM-3DScene. We conduct comprehensive experiments on a host of downstream tasks.
The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2%
mIoU on semantic segmentation) demonstrates the superiority of our approach.
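The informative-preserved masking step lends itself to a small sketch. The
snippet below is one plausible instantiation, not the paper's exact statistic:
it scores each point by the geometric variance of its local neighborhood and
shields the highest-scoring, most structured points from masking. `k` and
`keep_ratio` are assumed parameters; the paper's progressive scheme would vary
the masking ratio over training, while it is fixed here for simplicity.

```python
import torch

def informative_preserved_mask(points, k=16, keep_ratio=0.3):
    """Sketch: mask points whose local statistics are least informative.
    points: (N, 3) float tensor. Returns a bool mask (True = masked)."""
    dists = torch.cdist(points, points)            # (N, N) pairwise distances
    knn = dists.topk(k, largest=False).indices     # k nearest neighbors
    local_var = points[knn].var(dim=1).sum(dim=-1) # neighborhood variance
    n_keep = max(1, int(keep_ratio * points.size(0)))
    keep = local_var.topk(n_keep).indices          # preserve structured points
    mask = torch.ones(points.size(0), dtype=torch.bool)
    mask[keep] = False
    return mask
```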
Incremental Feature Selection Oriented for Data with Hierarchical Structure
In the big data era, sample sizes are becoming increasingly large and data
dimensionality extremely high; moreover, hierarchical structure often exists
among class labels. This paper investigates incremental feature selection for
hierarchical classification based on the dependency degree of the inclusive
strategy, and solves the hierarchical classification problem where labels are
distributed at arbitrary nodes of a tree structure. Firstly, the inclusive
strategy is used to reduce the negative sample space by exploiting the
hierarchical label structure. Secondly, a new fuzzy rough set model based on
the inclusive strategy is introduced, along with a dependency calculation
algorithm based on that strategy and a non-incremental feature selection
algorithm. Then, an incremental mechanism for updating the inclusive-strategy
dependency degree is proposed. Building on these, two incremental feature
selection frameworks based on the two strategies are designed. Lastly, a
comparative study with the method based on the sibling strategy is performed.
The feasibility and efficiency of the proposed algorithms are verified by
numerical experiments.
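The non-incremental selection loop such frameworks build on is a standard
dependency-driven forward search, sketched below in Python. The `dependency`
callable stands in for the paper's inclusive-strategy fuzzy-rough dependency
degree and is an assumption here; the incremental variants would update that
quantity as samples arrive rather than recomputing it from scratch.

```python
def forward_feature_selection(features, dependency):
    """Generic dependency-based forward selection; `dependency(subset)`
    is assumed to return the dependency degree of the labels on a
    candidate feature subset (higher is better)."""
    selected, best = [], 0.0
    candidates = set(features)
    while candidates:
        # Greedily pick the feature that most raises the dependency degree.
        gains = {f: dependency(selected + [f]) for f in candidates}
        f_star = max(gains, key=gains.get)
        if gains[f_star] <= best:
            break  # no remaining feature improves the dependency degree
        selected.append(f_star)
        best = gains[f_star]
        candidates.remove(f_star)
    return selected
```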