ZeroQ: A Novel Zero Shot Quantization Framework
Quantization is a promising approach for reducing the inference time and
memory footprint of neural networks. However, most existing quantization
methods require access to the original training dataset for retraining during
quantization. This is often not possible for applications with sensitive or
proprietary data, e.g., due to privacy and security concerns. Existing
zero-shot quantization methods use different heuristics to address this, but
they result in poor performance, especially when quantizing to ultra-low
precision. Here, we propose ZeroQ, a novel zero-shot quantization framework to
address this. ZeroQ enables mixed-precision quantization without any access to
the training or validation data. This is achieved by optimizing for a Distilled
Dataset, which is engineered to match the statistics of batch normalization
across different layers of the network. ZeroQ supports both uniform and
mixed-precision quantization. For the latter, we introduce a novel
Pareto-frontier-based method to automatically determine the mixed-precision bit
setting for all layers, with no manual search involved. We extensively test our
proposed method on a diverse set of models, including ResNet18/50/152,
MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3 on ImageNet, as well as
RetinaNet-ResNet50 on the Microsoft COCO dataset. In particular, we show that
ZeroQ can achieve 1.71% higher accuracy on MobileNetV2 than the
recently proposed DFQ method. Importantly, ZeroQ has very low computational
overhead: it can finish the entire quantization process in less than 30 s
(0.5% of the time of one ResNet50 training epoch on ImageNet). We have
open-sourced the ZeroQ framework (https://github.com/amirgholami/ZeroQ).
Comment: CVPR 202
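The Distilled Dataset idea can be illustrated with a toy sketch: starting from Gaussian noise, gradient descent updates the inputs (never the weights) until the batch statistics of a frozen layer's output match the mean and variance stored in its batch-normalization layer. This is a simplification under stated assumptions: a single hypothetical scalar affine "layer" `y = w * x + b`, one BN channel, and a squared-error loss on mean and variance. ZeroQ matches statistics across the BN layers of a full network, and the exact loss and solver it uses may differ. The function names and the step size are hypothetical.

```python
import random


def batch_stats(values):
    """Batch mean and biased variance, as batch norm computes them in training mode."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def distill_batch(w, b, target_mean, target_var, n=64, lr=0.1, steps=2000, seed=0):
    """Toy distilled-data search for one frozen layer y = w * x + b (hypothetical).

    Minimizes (mean(y) - target_mean)**2 + (var(y) - target_var)**2 by gradient
    descent on the inputs x; w and b are never updated.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # start from Gaussian noise
    for _ in range(steps):
        y = [w * xi + b for xi in x]
        m, v = batch_stats(y)
        # Hand-derived gradient of the loss with respect to each input:
        # dL/dx_i = 2(m - mu) * w / n + 2(v - var) * 2w(y_i - m) / n
        grad = [
            2 * (m - target_mean) * w / n + 4 * (v - target_var) * w * (yi - m) / n
            for yi in y
        ]
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


# Stored BN statistics (made up for this example): mean 1.0, variance 9.0.
x = distill_batch(w=2.0, b=0.5, target_mean=1.0, target_var=9.0)
print(batch_stats([2.0 * xi + 0.5 for xi in x]))  # approximately (1.0, 9.0)
```

In a real network the loss would be summed over every BN layer and the gradient would come from automatic differentiation instead of the hand-derived formula; the step size here is tuned only for this toy scale.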
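The Pareto-frontier bit selection can be sketched as follows, assuming (for illustration, not as ZeroQ's exact formulation) that quantizing layer i to b bits has a precomputed sensitivity score and that these scores add up across layers. For every reachable model size, a small dynamic program keeps the bit assignment with the lowest total sensitivity; dropping dominated points yields the frontier, and a size budget then selects one configuration without manual search. The sensitivity values, layer sizes, and function names are hypothetical.

```python
def pareto_frontier(sensitivity, layer_sizes, bit_choices):
    """Return non-dominated (size_in_bits, total_sensitivity, bits_per_layer) points.

    sensitivity[i][k]: hypothetical score for quantizing layer i to bit_choices[k].
    layer_sizes[i]: number of parameters in layer i.
    Assumes total sensitivity is the sum of per-layer scores.
    """
    # Dynamic program: total model size -> (lowest total sensitivity, bit assignment).
    best = {0: (0.0, ())}
    for i, n_params in enumerate(layer_sizes):
        nxt = {}
        for size, (sens, bits) in best.items():
            for k, b in enumerate(bit_choices):
                new_size = size + n_params * b
                new_sens = sens + sensitivity[i][k]
                if new_size not in nxt or new_sens < nxt[new_size][0]:
                    nxt[new_size] = (new_sens, bits + (b,))
        best = nxt
    # Walk sizes in increasing order; keep a point only if it beats every smaller one.
    frontier, lowest = [], float("inf")
    for size in sorted(best):
        sens, bits = best[size]
        if sens < lowest:
            frontier.append((size, sens, bits))
            lowest = sens
    return frontier


def pick_config(frontier, budget_bits):
    """Lowest-sensitivity frontier point that fits the size budget, or None."""
    feasible = [p for p in frontier if p[0] <= budget_bits]
    return min(feasible, key=lambda p: p[1]) if feasible else None


# Two hypothetical layers: a small sensitive one and a large robust one.
frontier = pareto_frontier([[0.5, 0.1], [0.3, 0.05]], layer_sizes=[100, 1000], bit_choices=[4, 8])
print(pick_config(frontier, budget_bits=6000))
```

Under a 6000-bit budget the selected point spends 8 bits on the small, sensitive first layer and 4 bits on the large second layer. Keying the table on exact bit counts keeps the example short; with many layers one would bucket sizes to bound the state space.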
Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs
Using FPGAs to accelerate ConvNets has attracted significant attention in
recent years. However, FPGA accelerator design has not leveraged the latest
progress in ConvNets. As a result, key application characteristics, such as
frames per second (FPS), are ignored in favor of simply counting GOPs, and
accuracy results, which are critical to application success, are often not
even reported. In this work, we adopt an algorithm-hardware co-design approach
to develop a ConvNet accelerator called Synetgy and a novel ConvNet model
called DiracDeltaNet. Both the accelerator and ConvNet are tailored
to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with
only 1×1 convolutions, while spatial convolutions are replaced by more
efficient shift operations. DiracDeltaNet achieves competitive accuracy on
ImageNet (88.7% top-5), but with 42× fewer parameters and 48×
fewer OPs than VGG16. We further quantize DiracDeltaNet's weights and
activations to 4 bits, with less than 1% accuracy loss. These quantizations
exploit the nature of FPGA hardware well. In short, DiracDeltaNet's small model
size, low computational OP count, low precision and simplified operators allow
us to co-design a highly customized computing unit for an FPGA. We implement
the computing units for DiracDeltaNet on an Ultra96 SoC system through
high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on
ImageNet is higher than that of all previously reported embedded FPGA
accelerators. In addition, the accelerator reaches an inference speed of 66.3
FPS on the ImageNet classification task, surpassing prior works with similar
accuracy by at least 11.6×.
Comment: Update to the latest result
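To make the shift idea concrete, here is a minimal pure-Python sketch of a shift operation followed by a 1×1 convolution, the pairing that takes the place of spatial convolutions in DiracDeltaNet. The shift moves each channel by a fixed per-channel offset with zero padding, using no weights and no multiplications; the 1×1 convolution then mixes channels at each pixel. Which offsets DiracDeltaNet assigns to which channels is not described here, so the offsets below, and the function names, are hypothetical.

```python
def shift(feature_map, offsets):
    """Shift each channel by its own (dy, dx); zero-fill pixels shifted in from outside.

    feature_map[c][y][x]; output[c][y][x] = input[c][y - dy][x - dx].
    Uses no weights and no multiplications.
    """
    out = []
    for channel, (dy, dx) in zip(feature_map, offsets):
        height, width = len(channel), len(channel[0])
        out.append([
            [channel[y - dy][x - dx] if 0 <= y - dy < height and 0 <= x - dx < width else 0
             for x in range(width)]
            for y in range(height)
        ])
    return out


def conv1x1(feature_map, weights):
    """Pointwise convolution: weights[o][c] mixes input channels at each pixel."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[sum(w_oc * feature_map[c][y][x] for c, w_oc in enumerate(row))
          for x in range(width)]
         for y in range(height)]
        for row in weights
    ]


fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]] * 2  # two identical 3x3 channels
shifted = shift(fmap, offsets=[(0, 1), (1, 0)])  # channel 0 right, channel 1 down
print(conv1x1(shifted, weights=[[1, 1]]))  # [[[0, 1, 2], [1, 6, 8], [4, 12, 14]]]
```

Because the shift only moves data, all arithmetic is concentrated in the 1×1 convolution, which is one reason this pairing suits a small, customized FPGA compute unit.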
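A 4-bit integer can take only 16 values, so quantizing weights and activations to 4 bits replaces each real number with a small integer code plus a shared scale. The sketch below uses symmetric, per-tensor uniform quantization with round-to-nearest, which is an assumption for illustration; the exact Synetgy scheme is not given here (non-negative activations after a ReLU, for instance, would more naturally use an unsigned range). The function names are hypothetical.

```python
def quantize_uniform(values, num_bits=4):
    """Symmetric per-tensor uniform quantization with round-to-nearest.

    Maps floats to integers in [-(2**(num_bits-1) - 1), 2**(num_bits-1) - 1]
    ([-7, 7] for 4 bits) and returns them with the shared scale factor.
    Note: Python's round() sends exact halves to the nearest even integer.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]


weights = [1.4, -0.6, 0.2, 0.0, -1.4]
q, scale = quantize_uniform(weights)
print(q)  # [7, -3, 1, 0, -7]
```

With the scale set from the largest magnitude, no value needs clamping, and each dequantized value lies within half a step (`scale / 2`) of the original.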
Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration
DNN accelerators are often developed and evaluated in isolation without
considering the cross-stack, system-level effects in real-world environments.
This makes it difficult to appreciate the impact of System-on-Chip (SoC)
resource contention, OS overheads, and programming-stack inefficiencies on
overall performance/energy-efficiency. To address this challenge, we present
Gemmini, an open-source*, full-stack DNN accelerator generator. Gemmini
generates a wide design space of efficient ASIC accelerators from a flexible
architectural template, together with flexible programming stacks and full SoCs
with shared resources that capture system-level effects. Gemmini-generated
accelerators have also been fabricated, delivering speedups of up to three
orders of magnitude over high-performance CPUs on various DNN
benchmarks.
* https://github.com/ucb-bar/gemmini
Comment: To appear at the 58th IEEE/ACM Design Automation Conference (DAC),
December 2021, San Francisco, CA, US