Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
In this paper, we present a new approach for model acceleration by exploiting
spatial sparsity in visual data. We observe that the final prediction in vision
Transformers is based on only a subset of the most informative tokens, which is
sufficient for accurate image recognition. Based on this observation, we
propose a dynamic token sparsification framework to prune redundant tokens
progressively and dynamically based on the input to accelerate vision
Transformers. Specifically, we devise a lightweight prediction module to
estimate the importance score of each token given the current features. The
module is added to different layers to prune redundant tokens hierarchically.
While the framework is inspired by our observation of the sparse attention in
vision Transformers, we find the idea of adaptive and asymmetric computation
can be a general solution for accelerating various architectures. We extend our
method to hierarchical models including CNNs and hierarchical vision
Transformers as well as more complex dense prediction tasks that require
structured feature maps by formulating a more generic dynamic spatial
sparsification framework with progressive sparsification and asymmetric
computation for different spatial locations. By applying lightweight fast paths
to less informative features and using more expressive slow paths to more
important locations, we can maintain the structure of feature maps while
significantly reducing the overall computations. Extensive experiments
demonstrate the effectiveness of our framework on various modern architectures
and different visual recognition tasks. Our results clearly demonstrate that
dynamic spatial sparsification offers a new and more effective dimension for
model acceleration. Code is available at
https://github.com/raoyongming/DynamicViT
Comment: Accepted to T-PAMI. Journal version of our NeurIPS 2021 work: arXiv:2106.02034.
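The core idea of the framework above — score each token with a lightweight head, keep only the most informative ones, and repeat at several layers — can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the random linear scorer stands in for the learned prediction module, and the 0.7 keep ratio per stage is an assumed value for the example.

```python
import numpy as np

def prune_tokens(tokens, keep_ratio, rng):
    """Keep the top-scoring fraction of tokens.

    tokens: (N, D) array of token features.
    keep_ratio: fraction of tokens retained at this stage.
    The scorer below is a stand-in for the paper's learned
    lightweight prediction module.
    """
    n, d = tokens.shape
    w = rng.standard_normal(d)          # hypothetical scoring head
    scores = tokens @ w                 # one importance score per token
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(scores)[-k:]      # indices of the k highest-scoring tokens
    return tokens[keep]

rng = np.random.default_rng(0)
x = rng.standard_normal((196, 64))      # e.g. 14x14 patch tokens from a ViT
# Progressive sparsification: prune hierarchically at three stages.
for ratio in (0.7, 0.7, 0.7):
    x = prune_tokens(x, ratio, rng)
print(x.shape)                          # roughly 196 * 0.7**3 ≈ 67 tokens remain
```

The per-stage ratio controls the accuracy/speed trade-off: fewer kept tokens means fewer attention and MLP FLOPs in all subsequent layers.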
UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
Diffusion probabilistic models (DPMs) have demonstrated a very promising
ability in high-resolution image synthesis. However, sampling from a
pre-trained DPM is time-consuming due to the multiple evaluations of the
denoising network, making it increasingly important to accelerate the sampling
of DPMs. Despite recent progress in designing fast samplers, existing methods
still cannot generate satisfactory images in many applications where fewer steps
(e.g., 10) are favored. In this paper, we develop a unified corrector (UniC)
that can be applied after any existing DPM sampler to increase the order of
accuracy without extra model evaluations, and derive a unified predictor (UniP)
that supports arbitrary order as a byproduct. Combining UniP and UniC, we
propose a unified predictor-corrector framework called UniPC for the fast
sampling of DPMs, which has a unified analytical form for any order and can
significantly improve the sampling quality over previous methods, especially in
extremely few steps. We evaluate our methods through extensive experiments
including both unconditional and conditional sampling using pixel-space and
latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional)
and 7.51 FID on ImageNet 256×256 (conditional) with only 10 function
evaluations. Code is available at https://github.com/wl-zhao/UniPC.
Comment: Accepted by NeurIPS 2023. Project page:
https://unipc.ivg-research.xy
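UniPC's actual update rules are derived in the paper and have a unified analytical form for any order; as a hedged illustration of the generic predictor-corrector pattern it builds on, here is a classic Heun step for an ODE, where a predictor proposes the next state and a corrector refines it with a second evaluation to raise the order of accuracy. (This is not the UniPC formula — UniPC's corrector notably reuses evaluations so it adds no extra model calls.)

```python
import numpy as np

def heun_step(f, t, x, h):
    """One explicit predictor-corrector step (Heun's method).

    Predictor: a first-order Euler proposal for the next state.
    Corrector: a trapezoidal refinement using one more evaluation,
    lifting the local accuracy to second order.
    """
    k1 = f(t, x)
    x_pred = x + h * k1               # predictor (order 1)
    k2 = f(t + h, x_pred)
    return x + h * (k1 + k2) / 2.0    # corrector (order 2)

# Sanity check on dx/dt = -x with x(0) = 1; exact solution is exp(-t).
x, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    x = heun_step(lambda t, x: -x, t, x, h)
    t += h
print(abs(x - np.exp(-1.0)))          # truncation error, well below 1e-3
```

In diffusion sampling the role of `f` is played by the (expensive) denoising network, which is why raising the order of accuracy without extra evaluations, as UniC does, directly translates into fewer sampling steps.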
Prompt Learning with Optimal Transport for Vision-Language Models
With the increasing attention to large vision-language models such as CLIP,
there has been a significant amount of effort dedicated to building efficient
prompts. Unlike conventional methods that learn only a single prompt, we
propose to learn multiple comprehensive prompts to describe diverse
characteristics of categories such as intrinsic attributes or extrinsic
contexts. However, directly matching each prompt to the same visual feature is
problematic, as it pushes the prompts to converge to one point. To solve this
problem, we propose to apply optimal transport to match the vision and text
modalities. Specifically, we first represent images and categories as visual
and textual feature sets. Then, we apply a two-stage optimization strategy to
learn the prompts. In the inner loop, we optimize the optimal transport
distance to align visual features and prompts by the Sinkhorn algorithm, while
in the outer loop, we learn the prompts by this distance from the supervised
data. Extensive experiments are conducted on the few-shot recognition task and
the improvements demonstrate the superiority of our method.
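The inner loop described above — aligning a set of local visual features with a set of prompt features by entropy-regularized optimal transport — can be sketched with plain Sinkhorn iterations. This is a minimal sketch, not the paper's code: the feature dimensions, the cosine-distance cost, uniform marginals, and the regularization strength eps=0.1 are all assumed for the example.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (M, N) cost matrix between M visual features and N prompts.
    a, b: marginal weights (each summing to 1).
    Returns the (M, N) transport plan.
    """
    K = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):          # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
M, N = 49, 4                          # e.g. a 7x7 feature map vs. 4 prompts
visual = rng.standard_normal((M, 16))
prompts = rng.standard_normal((N, 16))
visual /= np.linalg.norm(visual, axis=1, keepdims=True)
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)
cost = 1.0 - visual @ prompts.T       # cosine distance as the transport cost
plan = sinkhorn(cost, np.full(M, 1 / M), np.full(N, 1 / N))
ot_distance = (plan * cost).sum()     # OT distance used as the matching score
print(plan.shape, ot_distance)
```

Because the plan distributes each prompt's mass across many visual features rather than matching every prompt to the single same feature, the learned prompts are free to specialize to different attributes instead of collapsing to one point.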