SBNet: Sparse Blocks Network for Fast Inference
Conventional deep convolutional neural networks (CNNs) apply convolution
operators uniformly in space across all feature maps for hundreds of layers -
this incurs a high computational cost for real-time applications. For many
problems such as object detection and semantic segmentation, we are able to
obtain a low-cost computation mask, either from a priori problem knowledge, or
from a low-resolution segmentation network. We show that such computation masks
can be used to reduce computation in the high-resolution main network. Variants
of sparse activation CNNs have previously been explored on small-scale tasks
and showed no degradation in terms of object classification accuracy, but often
measured gains in terms of theoretical FLOPs without realizing a practical
speed-up when compared to highly optimized dense convolution implementations.
In this work, we leverage the sparsity structure of computation masks and
propose a novel tiling-based sparse convolution algorithm. We verified the
effectiveness of our sparse CNN on LiDAR-based 3D object detection, and we
report significant wall-clock speed-ups compared to dense convolution without
noticeable loss of accuracy.
Comment: 10 pages, CVPR 201
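The gather/compute/scatter idea behind tiling-based sparse computation can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: a pointwise operation stands in for the convolution kernel, and the tile size and function names are assumptions.

```python
# Illustrative sketch of mask-driven tile gather/compute/scatter, the
# core idea behind tiling-based sparse computation. A pointwise op
# stands in for the (much heavier) convolution kernel.

def active_tiles(mask, tile):
    """Return (row, col) origins of tiles containing any active pixel."""
    h, w = len(mask), len(mask[0])
    tiles = []
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            if any(mask[i][j]
                   for i in range(r, min(r + tile, h))
                   for j in range(c, min(c + tile, w))):
                tiles.append((r, c))
    return tiles

def sparse_apply(x, mask, tile, op):
    """Apply `op` only inside active tiles; inactive regions stay zero."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r, c in active_tiles(mask, tile):          # gather
        for i in range(r, min(r + tile, h)):
            for j in range(c, min(c + tile, w)):
                out[i][j] = op(x[i][j])            # compute + scatter
    return out

mask = [[0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
x = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
y = sparse_apply(x, mask, tile=2, op=lambda v: 2.0 * v)
# Only the top-right 2x2 tile is touched; everything else stays zero.
```

The wall-clock gain in the paper comes from running dense, highly optimized kernels on the gathered tiles only, rather than from skipping individual pixels.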
RepVGG: Making VGG-style ConvNets Great Again
We present a simple but powerful architecture of convolutional neural
network, which has a VGG-like inference-time body composed of nothing but a
stack of 3x3 convolution and ReLU, while the training-time model has a
multi-branch topology. Such decoupling of the training-time and inference-time
architecture is realized by a structural re-parameterization technique so that
the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy,
which, to the best of our knowledge, is the first time for a plain model. On
NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster
than ResNet-101 with higher accuracy and show favorable accuracy-speed
trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
The code and trained models are available at
https://github.com/megvii-model/RepVGG.
Comment: CVPR 202
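Why the multi-branch training topology can be collapsed into a single 3x3 convolution follows from linearity. Below is a hedged single-channel toy (the real method also folds BatchNorm statistics into the kernel, which is omitted here; all names are illustrative):

```python
# Toy single-channel sketch of structural re-parameterization: by
# linearity of convolution, summing the outputs of a 3x3 branch, a 1x1
# branch, and an identity branch equals one convolution with a fused
# 3x3 kernel. Real RepVGG also folds BatchNorm; omitted here.

def conv3x3(x, k):
    """Naive 3x3 convolution with zero padding 1 (single channel)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        s += x[ii][jj] * k[di + 1][dj + 1]
            out[i][j] = s
    return out

def fuse(k3, k1):
    """Embed the 1x1 kernel and the identity into the 3x3 center tap."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0          # +1.0 is the identity branch
    return fused

k3 = [[0.1, 0.2, 0.1],
      [0.0, 0.5, 0.0],
      [0.1, 0.2, 0.1]]
k1 = 0.3
x = [[1.0, 2.0], [3.0, 4.0]]

multi_branch = [[conv3x3(x, k3)[i][j] + k1 * x[i][j] + x[i][j]
                 for j in range(2)] for i in range(2)]
single_branch = conv3x3(x, fuse(k3, k1))
# multi_branch == single_branch (up to float rounding)
```

At inference time only the fused plain stack remains, which is what makes the deployed model a pure 3x3-conv/ReLU body.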
Lightweight Modules for Efficient Deep Learning based Image Restoration
Low-level image restoration is an integral component of modern artificial
intelligence (AI) driven camera pipelines. Most of these frameworks are based
on deep neural networks, which impose a massive computational overhead on
resource-constrained platforms such as mobile phones. In this paper, we propose
several lightweight low-level modules which can be used to create a
computationally low-cost variant of a given baseline model. Recent work on
efficient neural network design has mainly focused on classification.
However, low-level image processing falls under the image-to-image translation
genre, which requires additional computational modules not present in
classification. This paper seeks to bridge this gap by designing generic
efficient modules which can replace essential components used in contemporary
deep learning based image restoration networks. We also present and analyse
results highlighting the drawbacks of applying depthwise separable
convolutional kernels (a popular method for efficient classification networks)
to sub-pixel convolution based upsampling (a popular upsampling strategy for
low-level vision applications). This shows that concepts from the domain of
classification cannot always be seamlessly integrated into image-to-image
translation tasks. We extensively validate our findings on three popular tasks
of image inpainting, denoising and super-resolution. Our results show that the
proposed networks consistently produce reconstructions visually similar to
those of full-capacity baselines, with significant reductions in parameters,
memory footprint, and execution time on contemporary mobile devices.
Comment: Accepted at: IEEE Transactions on Circuits and Systems for Video
Technology (Early Access Print) | Codes Available at:
https://github.com/avisekiit/TCSVT-LightWeight-CNNs | Supplementary Document
at:
https://drive.google.com/file/d/1BQhkh33Sen-d0qOrjq5h8ahw2VCUIVLg/view?usp=sharin
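The parameter trade-off the abstract alludes to can be made concrete with back-of-the-envelope arithmetic. This is an illustrative sketch (weights only, no biases; the channel counts are assumptions, not figures from the paper):

```python
# Back-of-the-envelope parameter counts for the trade-off discussed
# above. A depthwise separable convolution splits a k x k convolution
# into a per-channel spatial pass plus a 1x1 channel-mixing pass.

def standard_conv_params(c_in, c_out, k):
    """Weights of a dense k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise (k*k*c_in) plus pointwise (c_in*c_out) weights."""
    return k * k * c_in + c_in * c_out

def subpixel_out_channels(c, r):
    """A conv feeding a sub-pixel shuffle of factor r must emit c*r^2 maps."""
    return c * r * r

# Example: a 3x3 convolution mapping 64 -> 64 channels.
dense = standard_conv_params(64, 64, 3)
sep = depthwise_separable_params(64, 64, 3)
# x2 sub-pixel upsampling of a 64-channel map needs 256 output channels,
# which is why naively making that conv depthwise separable is delicate.
need = subpixel_out_channels(64, 2)
```

The roughly 8x parameter reduction of the separable variant explains its popularity in classification; the inflated channel requirement before the sub-pixel shuffle is where, per the paper's analysis, the naive combination breaks down.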
OLLIE: Derivation-based Tensor Program Optimizer
Boosting the runtime performance of deep neural networks (DNNs) is critical
due to their wide adoption in real-world tasks. Existing approaches to
optimizing the tensor algebra expression of a DNN only consider expressions
representable by a fixed set of predefined operators, missing possible
optimization opportunities between general expressions. We propose OLLIE, the
first derivation-based tensor program optimizer. OLLIE optimizes tensor
programs by leveraging transformations between general tensor algebra
expressions, enabling a significantly larger expression search space that
includes those supported by prior work as special cases. OLLIE uses a hybrid
derivation-based optimizer that effectively combines explorative and guided
derivations to quickly discover highly optimized expressions. Evaluation on
seven DNNs shows that OLLIE can outperform existing optimizers by up to
2.73× (1.46× on average) on an A100 GPU and by up to 2.68× (1.51× on
average) on a V100 GPU.
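A derivation-based search can be sketched as breadth-first exploration over algebraic rewrite rules with a cost model. The rule and cost below are toy assumptions for illustration only, not OLLIE's actual rule set or cost model:

```python
# Toy derivation-based optimizer: explore algebraic rewrites of a
# tensor expression and keep the cheapest equivalent form found.
# Expressions are nested tuples ('op', lhs, rhs) over string operands.
from collections import deque

def cost(e):
    """Count operator nodes in the expression tree."""
    if isinstance(e, str):
        return 0
    op, a, b = e
    return 1 + cost(a) + cost(b)

def rewrites(e):
    """Yield expressions one rewrite step away (factor a common operand)."""
    if isinstance(e, str):
        return
    op, a, b = e
    # Rule: A*C + B*C  ->  (A+B)*C
    if (op == '+' and not isinstance(a, str) and not isinstance(b, str)
            and a[0] == b[0] == '*' and a[2] == b[2]):
        yield ('*', ('+', a[1], b[1]), a[2])
    for ra in rewrites(a):          # rewrite inside subexpressions too
        yield (op, ra, b)
    for rb in rewrites(b):
        yield (op, a, rb)

def optimize(e):
    """Breadth-first derivation search; return the cheapest form seen."""
    best, seen, queue = e, {e}, deque([e])
    while queue:
        cur = queue.popleft()
        if cost(cur) < cost(best):
            best = cur
        for nxt in rewrites(cur):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best

expr = ('+', ('*', 'A', 'C'), ('*', 'B', 'C'))   # A*C + B*C, 3 ops
best = optimize(expr)                            # (A+B)*C, 2 ops
```

The paper's "explorative and guided" hybrid corresponds, in this framing, to interleaving such exhaustive search with heuristics that prune the (much larger) real expression space.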
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Deep Neural Networks (DNNs) are becoming an important tool in modern
computing applications. Accelerating their training is a major challenge and
techniques range from distributed algorithms to low-level circuit design. In
this survey, we describe the problem from a theoretical perspective, followed
by approaches for its parallelization. We present trends in DNN architectures
and the resulting implications on parallelization strategies. We then review
and model the different types of concurrency in DNNs: from the single operator,
through parallelism in network inference and training, to distributed deep
learning. We discuss asynchronous stochastic optimization, distributed system
architectures, communication schemes, and neural architecture search. Based on
those approaches, we extrapolate potential directions for parallelism in deep
learning.
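One concurrency pattern the survey models, data parallelism, rests on a simple identity worth seeing once: with equal-sized shards, averaging per-worker gradients of a mean loss reproduces the full-batch gradient exactly. A minimal sketch for a 1-D linear model (all names illustrative):

```python
# Minimal data-parallelism illustration: split the batch across
# "workers", compute shard gradients, then allreduce (here: average).
# For equal shards and a mean loss this equals the full-batch gradient.

def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) for a 1-D linear model."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, workers):
    """Shard the batch, compute local gradients, average them."""
    n = len(xs)
    shard = n // workers
    grads = [grad_mse(w, xs[i * shard:(i + 1) * shard],
                         ys[i * shard:(i + 1) * shard])
             for i in range(workers)]
    return sum(grads) / workers        # the "allreduce" step

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 1.5
g_full = grad_mse(w, xs, ys)
g_dist = data_parallel_grad(w, xs, ys, workers=2)
# g_full == g_dist up to float rounding
```

The asynchronous variants the survey discusses relax exactly this equivalence, trading gradient staleness for reduced synchronization cost.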
Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs
Due to the huge success and rapid development of convolutional neural networks
(CNNs), there is a growing demand for hardware accelerators that accommodate a
variety of CNNs to improve their inference latency and energy efficiency and
enable their deployment in real-time applications. Among popular platforms,
field-programmable gate arrays (FPGAs) have been widely adopted for CNN
acceleration because of their capability to provide superior energy efficiency
and low-latency processing while supporting high reconfigurability, making them
favorable for accelerating rapidly evolving CNN algorithms. This article
introduces a highly customized streaming hardware architecture that focuses on
improving compute efficiency for streaming applications by providing full-stack
acceleration of CNNs on FPGAs. The proposed accelerator maps most computational
functions, that is, convolutional and deconvolutional layers, into a single
unified module, and implements the residual and concatenative connections
between the functions with high efficiency, to support the inference of
mainstream CNNs with different topologies. This architecture is further
optimized by exploiting different levels of parallelism, layer fusion, and full
use of digital signal processing blocks (DSPs). The proposed accelerator has
been implemented on Intel's Arria 10 GX1150 hardware and evaluated with a wide
range of benchmark models. The results demonstrate high performance of over 1.3
TOP/s of throughput and up to 97% compute (multiply-accumulate, MAC)
efficiency, outperforming state-of-the-art FPGA accelerators.
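The throughput and MAC-efficiency figures quoted for such accelerators follow from simple arithmetic. The sketch below uses illustrative numbers (DSP count, MACs per DSP per cycle, and clock are assumptions, not data from the article):

```python
# Back-of-the-envelope model for FPGA accelerator throughput and
# compute (MAC) efficiency. All device figures below are illustrative
# assumptions, not measurements from the article.

def peak_ops_per_s(dsps, macs_per_dsp_cycle, clock_hz):
    """Peak ops/s: each MAC counts as 2 ops (one multiply + one add)."""
    return dsps * macs_per_dsp_cycle * 2 * clock_hz

def compute_efficiency(achieved_ops_per_s, peak):
    """Fraction of the theoretical peak actually sustained."""
    return achieved_ops_per_s / peak

# Hypothetical device: 1500 DSPs, 2 MACs per DSP per cycle, 250 MHz.
peak = peak_ops_per_s(dsps=1500, macs_per_dsp_cycle=2, clock_hz=250e6)
# -> 1.5e12 ops/s = 1.5 TOP/s under these assumptions.
eff = compute_efficiency(1.3e12, peak)   # ~0.87 for 1.3 TOP/s achieved
```

High MAC efficiency, in this framing, means the dataflow keeps nearly every DSP busy on useful multiply-accumulates every cycle, which is exactly what layer fusion and the unified compute module aim for.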