X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition
Numerous prior studies predominantly emphasize constructing relation vectors
for individual neighborhood points, generating a dynamic kernel for each
vector, and embedding these into high-dimensional spaces to capture implicit
local structures. However, we contend that such implicit high-dimensional
structure modeling approaches inadequately represent the local geometric
structure of point clouds due to the absence of explicit structural
information. Hence, we introduce X-3D, an explicit 3D structure modeling
approach. X-3D functions by capturing the explicit local structural information
within the input 3D space and employing it to produce dynamic kernels with
shared weights for all neighborhood points within the current local region.
This modeling approach introduces effective geometric prior and significantly
diminishes the disparity between the local structure of the embedding space and
the original input point cloud, thereby improving the extraction of local
features. Experiments show that our approach can be applied to a variety of
methods and achieves state-of-the-art performance on segmentation,
classification, and detection tasks at lower extra computational cost, such as
\textbf{90.7\%} on ScanObjectNN for classification, \textbf{79.2\%} on S3DIS
6-fold and \textbf{74.3\%} on S3DIS Area 5 for segmentation, \textbf{76.3\%} on
ScanNetV2 for segmentation, and \textbf{64.5\%} mAP, \textbf{46.9\%} mAP on SUN
RGB-D and \textbf{69.0\%} mAP, \textbf{51.1\%} mAP on ScanNetV2 for detection.
Our code is available
at
\href{https://github.com/sunshuofeng/X-3D}{https://github.com/sunshuofeng/X-3D}
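The core idea above — summarizing a local region's explicit geometry and generating one dynamic kernel shared by every neighbor in that region — can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the descriptor (center plus bounding-box extent), the kernel generator, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def explicit_structure(neighborhood):
    # neighborhood: (k, 3) points; summarize explicit local geometry
    # (an illustrative descriptor, not the paper's exact one)
    center = neighborhood.mean(axis=0)
    rel = neighborhood - center                    # relative coordinates
    extent = rel.max(axis=0) - rel.min(axis=0)     # bounding-box extent
    return np.concatenate([center, extent])        # (6,) descriptor

def dynamic_kernel(descriptor, w, b):
    # one kernel per local region, generated from the explicit descriptor
    return np.tanh(w @ descriptor + b)             # flattened (c_out * c_in,)

# toy region: k=8 neighbors carrying c_in=4 features each
k, c_in, c_out = 8, 4, 5
pts = rng.normal(size=(k, 3))
feats = rng.normal(size=(k, c_in))

w = rng.normal(size=(c_out * c_in, 6)) * 0.1
b = np.zeros(c_out * c_in)

kernel = dynamic_kernel(explicit_structure(pts), w, b).reshape(c_out, c_in)
out = feats @ kernel.T   # the SAME kernel is applied to every neighbor
print(out.shape)         # (8, 5)
```

The point of contrast with per-point dynamic kernels is the last two lines: the kernel depends only on the region's explicit structure, so all k neighbors share it.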
UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
Diffusion probabilistic models (DPMs) have demonstrated a very promising
ability in high-resolution image synthesis. However, sampling from a
pre-trained DPM is time-consuming due to the multiple evaluations of the
denoising network, making it more and more important to accelerate the sampling
of DPMs. Despite recent progress in designing fast samplers, existing methods
still cannot generate satisfying images in many applications where fewer steps
(e.g., 10) are favored. In this paper, we develop a unified corrector (UniC)
that can be applied after any existing DPM sampler to increase the order of
accuracy without extra model evaluations, and derive a unified predictor (UniP)
that supports arbitrary order as a byproduct. Combining UniP and UniC, we
propose a unified predictor-corrector framework called UniPC for the fast
sampling of DPMs, which has a unified analytical form for any order and can
significantly improve the sampling quality over previous methods, especially in
extremely few steps. We evaluate our methods through extensive experiments
including both unconditional and conditional sampling using pixel-space and
latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional)
and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function
evaluations. Code is available at https://github.com/wl-zhao/UniPC.
Comment: Accepted by NeurIPS 2023. Project page:
https://unipc.ivg-research.xy
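The predictor-corrector idea — a low-order predictor proposes the next state, and a corrector reuses the same function evaluations to raise the order of accuracy — can be shown on a toy ODE. This is a generic Euler-predictor/trapezoidal-corrector sketch, not UniPC's actual formulas; the drift `f` is a stand-in for the learned denoising network.

```python
import numpy as np

def f(x, t):
    # stand-in for the denoising-network drift of the probability-flow ODE
    return -x

def pc_step(x, t, h):
    f0 = f(x, t)
    x_pred = x + h * f0               # Euler predictor (order 1)
    f1 = f(x_pred, t + h)             # evaluation reused by the corrector
    x_corr = x + 0.5 * h * (f0 + f1)  # trapezoidal corrector (order 2)
    return x_corr

# integrate dx/dt = -x from x(0) = 1 over t in [0, 1] with 10 steps
x, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    x = pc_step(x, t, h)
    t += h
print(x)  # close to the exact solution exp(-1) ≈ 0.3679
```

Note the corrector costs no extra evaluation of `f` beyond what the next predictor step would need; this mirrors UniPC's claim of increasing the order of accuracy without extra model evaluations.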
Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
In this paper, we present a new approach for model acceleration by exploiting
spatial sparsity in visual data. We observe that the final prediction in vision
Transformers is only based on a subset of the most informative tokens, which is
sufficient for accurate image recognition. Based on this observation, we
propose a dynamic token sparsification framework to prune redundant tokens
progressively and dynamically based on the input to accelerate vision
Transformers. Specifically, we devise a lightweight prediction module to
estimate the importance score of each token given the current features. The
module is added to different layers to prune redundant tokens hierarchically.
While the framework is inspired by our observation of the sparse attention in
vision Transformers, we find the idea of adaptive and asymmetric computation
can be a general solution for accelerating various architectures. We extend our
method to hierarchical models including CNNs and hierarchical vision
Transformers as well as more complex dense prediction tasks that require
structured feature maps by formulating a more generic dynamic spatial
sparsification framework with progressive sparsification and asymmetric
computation for different spatial locations. By applying lightweight fast paths
to less informative features and using more expressive slow paths to more
important locations, we can maintain the structure of feature maps while
significantly reducing the overall computations. Extensive experiments
demonstrate the effectiveness of our framework on various modern architectures
and different visual recognition tasks. Our results clearly demonstrate that
dynamic spatial sparsification offers a new and more effective dimension for
model acceleration. Code is available at
https://github.com/raoyongming/DynamicViT
Comment: Accepted to T-PAMI. Journal version of our NeurIPS 2021 work:
arXiv:2106.02034. Code is available at
https://github.com/raoyongming/DynamicVi
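The token-sparsification mechanism described above — a lightweight module scores each token's importance and redundant tokens are pruned progressively across layers — can be sketched in a few lines of numpy. Everything here is illustrative: the linear scoring head, the keep ratio, and the three stages are assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def token_scores(tokens, w):
    # lightweight prediction module: a linear head producing one
    # importance logit per token (stand-in for the paper's scorer)
    return tokens @ w

def prune_tokens(tokens, w, keep_ratio):
    scores = token_scores(tokens, w)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]       # most informative tokens
    return tokens[np.sort(keep)]         # preserve original token order

n, d = 16, 8
tokens = rng.normal(size=(n, d))
w = rng.normal(size=d)

# prune hierarchically: 3 stages, each keeping ~70% of surviving tokens
for _ in range(3):
    tokens = prune_tokens(tokens, w, 0.7)
print(tokens.shape[0])  # 16 -> 11 -> 7 -> 4 tokens survive
```

The asymmetric-computation extension in the abstract replaces hard pruning with routing: low-score locations take a cheap fast path instead of being dropped, which keeps the feature map structured for dense prediction.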
Prompt Learning with Optimal Transport for Vision-Language Models
With the increasing attention to large vision-language models such as CLIP,
there has been a significant amount of effort dedicated to building efficient
prompts. Unlike conventional methods of only learning one single prompt, we
propose to learn multiple comprehensive prompts to describe diverse
characteristics of categories such as intrinsic attributes or extrinsic
contexts. However, directly matching each prompt to the same visual feature is
problematic, as it pushes the prompts to converge to one point. To solve this
problem, we propose to apply optimal transport to match the vision and text
modalities. Specifically, we first model images and the categories with visual
and textual feature sets. Then, we apply a two-stage optimization strategy to
learn the prompts. In the inner loop, we optimize the optimal transport
distance to align visual features and prompts by the Sinkhorn algorithm, while
in the outer loop, we learn the prompts by this distance from the supervised
data. Extensive experiments are conducted on the few-shot recognition task, and
the improvement demonstrates the superiority of our method.
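The inner-loop computation described above — aligning a set of visual features with a set of prompts via entropic optimal transport solved by the Sinkhorn algorithm — is concrete enough to sketch. This is a generic Sinkhorn implementation with cosine cost; the feature dimensions, set sizes, and regularization strength are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    # entropic-regularized OT: alternately rescale rows/columns of the
    # Gibbs kernel until the plan's marginals match a and b
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

rng = np.random.default_rng(0)
m, n = 6, 4                              # visual features vs. prompts (toy sizes)
vis = rng.normal(size=(m, 8))
txt = rng.normal(size=(n, 8))

# cosine-distance cost between the two feature sets
vis_n = vis / np.linalg.norm(vis, axis=1, keepdims=True)
txt_n = txt / np.linalg.norm(txt, axis=1, keepdims=True)
cost = 1.0 - vis_n @ txt_n.T

plan = sinkhorn(cost, np.full(m, 1 / m), np.full(n, 1 / n))
ot_dist = (plan * cost).sum()            # OT distance used to learn the prompts
print(plan.shape)                        # (6, 4)
```

Because the plan spreads mass across all prompts rather than matching every prompt to one visual feature, the prompts are not pushed to collapse to a single point, which is the failure mode the abstract identifies.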
- …