Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
In this paper, we present a new approach for model acceleration by exploiting
spatial sparsity in visual data. We observe that the final prediction in vision
Transformers is based only on a subset of the most informative tokens, which is
sufficient for accurate image recognition. Based on this observation, we
propose a dynamic token sparsification framework to prune redundant tokens
progressively and dynamically based on the input to accelerate vision
Transformers. Specifically, we devise a lightweight prediction module to
estimate the importance score of each token given the current features. The
module is added to different layers to prune redundant tokens hierarchically.
While the framework is inspired by our observation of the sparse attention in
vision Transformers, we find the idea of adaptive and asymmetric computation
can be a general solution for accelerating various architectures. We extend our
method to hierarchical models including CNNs and hierarchical vision
Transformers as well as more complex dense prediction tasks that require
structured feature maps by formulating a more generic dynamic spatial
sparsification framework with progressive sparsification and asymmetric
computation for different spatial locations. By applying lightweight fast paths
to less informative features and using more expressive slow paths to more
important locations, we can maintain the structure of feature maps while
significantly reducing the overall computations. Extensive experiments
demonstrate the effectiveness of our framework on various modern architectures
and different visual recognition tasks. Our results clearly demonstrate that
dynamic spatial sparsification offers a new and more effective dimension for
model acceleration. Code is available at
https://github.com/raoyongming/DynamicViT

Comment: Accepted to T-PAMI. Journal version of our NeurIPS 2021 work: arXiv:2106.02034.
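As a rough illustration of the core idea (not the authors' implementation), one pruning step can be sketched as follows. The paper's lightweight prediction module is stubbed here as a simple linear scorer, and `keep_ratio` is an assumed hyperparameter; in the actual framework the selection is learned end-to-end with a differentiable relaxation rather than a hard top-k.

```python
import numpy as np

def prune_tokens(tokens, w_score, keep_ratio=0.5):
    """Keep only the highest-scoring tokens.

    tokens:  (N, D) array of token features.
    w_score: (D,) weights of a toy linear importance predictor
             (stands in for the paper's lightweight prediction module).
    """
    scores = tokens @ w_score                  # importance score per token
    k = max(1, int(len(tokens) * keep_ratio))  # number of tokens to keep
    keep = np.argsort(scores)[-k:]             # indices of the top-k tokens
    return tokens[np.sort(keep)]               # preserve original token order

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                    # 8 tokens, 4-dim features
w = rng.normal(size=4)
pruned = prune_tokens(x, w, keep_ratio=0.5)
print(pruned.shape)  # (4, 4)
```

Inserting such a step at several depths prunes tokens hierarchically: each stage discards a fraction of the tokens that survived the previous stage.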
Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, and sometimes even better than, the original dense networks. Sparsity promises to reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.
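The simplest of the removal schemes this survey covers is one-shot magnitude pruning: zero out the smallest-magnitude fraction of weights. A minimal sketch, with the function name and the sparsity level chosen for illustration:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude entries of a weight array.

    sparsity: fraction of entries to remove (0.9 keeps the largest 10%).
    Ties at the threshold magnitude are all pruned.
    """
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep strictly larger entries
    return weights * mask

w = np.arange(1, 11, dtype=float)                 # magnitudes 1..10
pruned = magnitude_prune(w, sparsity=0.5)
print(pruned)  # [0. 0. 0. 0. 0. 6. 7. 8. 9. 10.]
```

In practice this step is usually interleaved with retraining (iterative pruning), and the resulting mask only yields real speedups on hardware or kernels that exploit the sparsity pattern, which is a central theme of the survey.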
STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition
Existing methods of privacy-preserving action recognition (PPAR) mainly focus
on frame-level (spatial) privacy removal through 2D CNNs. Unfortunately, they
have two major drawbacks. First, they may compromise temporal dynamics in input
videos, which are critical for accurate action recognition. Second, they are
vulnerable to practical attacking scenarios where attackers probe for privacy
from an entire video rather than individual frames. To address these issues, we
propose a novel framework STPrivacy to perform video-level PPAR. For the first
time, we introduce vision Transformers into PPAR by treating a video as a
tubelet sequence, and accordingly design two complementary mechanisms, i.e.,
sparsification and anonymization, to remove privacy from a spatio-temporal
perspective. Specifically, our privacy sparsification mechanism applies adaptive
token selection to abandon action-irrelevant tubelets. Then, our anonymization
mechanism implicitly manipulates the remaining action-tubelets to erase privacy
in the embedding space through adversarial learning. These mechanisms provide
significant advantages in terms of privacy preservation for human eyes and
action-privacy trade-off adjustment during deployment. We additionally
contribute the first two large-scale PPAR benchmarks, VP-HMDB51 and VP-UCF101,
to the community. Extensive evaluations on them, as well as two other tasks,
validate the effectiveness and generalization capability of our framework.
A-ViT: Adaptive Tokens for Efficient Vision Transformer
We introduce A-ViT, a method that adaptively adjusts the inference cost of
vision transformer (ViT) for images of different complexity. A-ViT achieves
this by automatically reducing the number of tokens in vision transformers that
are processed in the network as inference proceeds. We reformulate Adaptive
Computation Time (ACT) for this task, extending halting to discard redundant
spatial tokens. The appealing architectural properties of vision transformers
enable our adaptive token reduction mechanism to speed up inference without
modifying the network architecture or inference hardware. We demonstrate that
A-ViT requires no extra parameters or sub-network for halting, as we base the
learning of adaptive halting on the original network parameters. We further
introduce distributional prior regularization that stabilizes training compared
to prior ACT approaches. On the image classification task (ImageNet1K), we show
that our proposed A-ViT yields high efficacy in filtering informative spatial
features and cutting down on the overall compute. The proposed method improves
the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3%
accuracy drop, outperforming prior art by a large margin. Project page at
https://a-vit.github.io/

Comment: CVPR'22 oral acceptance.
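A toy version of the ACT-style halting rule can be sketched as follows. The per-layer halting scores would normally be produced by the network itself (A-ViT reuses existing parameters for this), and the `1 - eps` threshold follows the usual ACT convention; all names here are illustrative.

```python
import numpy as np

def act_halt_mask(halting_scores, eps=0.01):
    """Per-token halting in the style of Adaptive Computation Time.

    halting_scores: (L, N) array in [0, 1]; halting score of each of
                    N tokens at each of L layers.
    Returns an (L, N) boolean mask: True while a token is still active.
    A token stops being processed once its cumulative halting score
    reaches 1 - eps, so deeper layers see fewer tokens.
    """
    cum = np.cumsum(halting_scores, axis=0)
    # A token is active at layer l if its cumulative score *before*
    # layer l has not yet reached the threshold.
    before = np.vstack([np.zeros(halting_scores.shape[1]), cum[:-1]])
    return before < 1.0 - eps

h = np.array([[0.2, 0.6],
              [0.3, 0.5],
              [0.6, 0.1]])
mask = act_halt_mask(h)
print(mask)  # token 1 halts before the third layer; token 0 runs to the end
```

The speedup comes from skipping the halted tokens' computation in later layers, which requires no architectural change, only masking.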
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
Vision Transformers (ViTs) with self-attention modules have recently achieved
great empirical success in many vision tasks. Due to non-convex interactions
across layers, however, theoretical learning and generalization analysis is
mostly elusive. Based on a data model characterizing both label-relevant and
label-irrelevant tokens, this paper provides the first theoretical analysis of
training a shallow ViT, i.e., one self-attention layer followed by a two-layer
perceptron, for a classification task. We characterize the sample complexity to
achieve a zero generalization error. Our sample complexity bound is positively
correlated with the inverse of the fraction of label-relevant tokens, the token
noise level, and the initial model error. We also prove that a training process
using stochastic gradient descent (SGD) leads to a sparse attention map, which
is a formal verification of the general intuition about the success of
attention. Moreover, this paper indicates that a proper token sparsification
can improve the test performance by removing label-irrelevant and/or noisy
tokens, including spurious correlations. Empirical experiments on synthetic
data and CIFAR-10 dataset justify our theoretical results and generalize to
deeper ViTs.
Full Stack Optimization of Transformer Inference: a Survey
Recent advances in state-of-the-art DNN architecture design have been moving
toward Transformer models. These models achieve superior accuracy across a wide
range of applications. This trend has been consistent over the past several
years since Transformer models were originally introduced. However, the amount
of compute and bandwidth required for inference of recent Transformer models is
growing at a significant rate, and this has made their deployment in
latency-sensitive applications challenging. As such, there has been an
increased focus on making Transformer models more efficient, with methods that
range from changing the architecture design, all the way to developing
dedicated domain-specific accelerators. In this work, we survey different
approaches for efficient Transformer inference, including: (i) analysis and
profiling of the bottlenecks in existing Transformer architectures and their
similarities and differences with previous convolutional models; (ii)
implications of Transformer architecture on hardware, including the impact of
non-linear operations such as Layer Normalization, Softmax, and GELU, as well
as linear operations, on hardware design; (iii) approaches for optimizing a
fixed Transformer architecture; (iv) challenges in finding the right mapping
and scheduling of operations for Transformer models; and (v) approaches for
optimizing Transformer models by adapting the architecture using neural
architecture search. Finally, we perform a case study by applying the surveyed
optimizations on Gemmini, the open-source, full-stack DNN accelerator
generator, and we show how each of these approaches can yield improvements,
compared to previous benchmark results on Gemmini. Among other things, we find
that a full-stack co-design approach with the aforementioned methods can result
in up to an 88.7x speedup with minimal performance degradation for Transformer
inference.
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers
Self-attention based models such as vision transformers (ViTs) have emerged
as a very competitive architecture alternative to convolutional neural networks
(CNNs) in computer vision. Despite increasingly stronger variants with
ever-higher recognition accuracies, due to the quadratic complexity of
self-attention, existing ViTs are typically demanding in computation and model
size. Although several successful design choices (e.g., the convolutions and
hierarchical multi-stage structure) of prior CNNs have been reintroduced into
recent ViTs, they are still not sufficient to meet the limited resource
requirements of mobile devices. This motivates a very recent attempt to develop
light ViTs based on the state-of-the-art MobileNet-v2, but these still leave a
performance gap. In this work, pushing further along this under-studied
direction, we introduce EdgeViTs, a new family of light-weight ViTs that, for
the first time, enable attention-based vision models to compete with the best
light-weight CNNs in the tradeoff between accuracy and on-device efficiency.
This is realized by introducing a highly cost-effective local-global-local
(LGL) information exchange bottleneck based on optimal integration of
self-attention and convolutions. For device-dedicated evaluation, rather than
relying on inaccurate proxies like the number of FLOPs or parameters, we adopt
a practical approach of focusing directly on on-device latency and, for the
first time, energy efficiency. Specifically, we show that our models are
Pareto-optimal when both accuracy-latency and accuracy-energy trade-offs are
considered, achieving strict dominance over other ViTs in almost all cases and
competing with the most efficient CNNs. Code is available at
https://github.com/saic-fi/edgevit

Comment: Accepted to ECCV 2022.
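The "global" step of an LGL-style bottleneck can be sketched roughly as follows: only a sparse grid of delegate tokens participates in self-attention, and the result is broadcast back to the full resolution. This is a simplification under stated assumptions, not the EdgeViT block itself, which also includes learned local aggregation before and local propagation after this step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_global_attention(x, stride=2):
    """Global step of an LGL-style bottleneck (simplified sketch).

    x: (H, W, D) feature map with H and W divisible by `stride`.
    Only every `stride`-th position (a 'delegate' token) takes part in
    self-attention; attention cost drops by roughly stride**4 relative
    to full attention over all H*W tokens.
    """
    H, W, D = x.shape
    delegates = x[::stride, ::stride].reshape(-1, D)   # subsampled tokens
    attn = softmax(delegates @ delegates.T / np.sqrt(D))
    out = attn @ delegates                             # attention among delegates
    # nearest-neighbor broadcast back to full resolution (stand-in for
    # the learned local propagation in the real block)
    g = out.reshape(H // stride, W // stride, D)
    return np.repeat(np.repeat(g, stride, axis=0), stride, axis=1)

x = np.random.default_rng(1).normal(size=(4, 4, 8))
y = sparse_global_attention(x, stride=2)
print(y.shape)  # (4, 4, 8)
```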