Reduced Precision Strategies for Deep Learning: A High Energy Physics Generative Adversarial Network Use Case
Deep learning is finding its way into high energy physics by replacing
traditional Monte Carlo simulations. However, deep learning still requires
an excessive amount of computational resources. A promising approach to
making deep learning more efficient is to quantize the parameters of the
neural networks to reduced precision. Reduced precision computing is
extensively used in modern deep learning and results in lower inference
execution time, a smaller memory footprint and less memory bandwidth. In
this paper we analyse the effects of low precision inference on a complex
deep generative adversarial network model. The use case we address is
calorimeter detector simulation of subatomic particle interactions in
accelerator-based high energy physics. We employ the novel Intel low
precision optimization tool (iLoT) for quantization and compare the results
to the quantized model from TensorFlow Lite. In the performance benchmark
we gain a speed-up of 1.73x on Intel hardware for the quantized iLoT model
compared to the initial, non-quantized model. With different
physics-inspired, self-developed metrics, we validate that the quantized
iLoT model shows a lower loss of physical accuracy than the TensorFlow Lite
model.

Comment: Submitted at ICPRAM 2021; from the CERN openlab - Intel
collaboration
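
The abstract does not include code, but the TensorFlow Lite baseline it
names follows the standard post-training integer quantization workflow.
The sketch below illustrates that workflow under stated assumptions, not
the paper's pipeline: `keras_model` and `calibration_batches` are
hypothetical stand-ins for the GAN generator and calorimeter calibration
data, and the iLoT side is omitted because its API is not described here.

import tensorflow as tf

def quantize_generator(keras_model, calibration_batches):
    """Full-integer post-training quantization with TensorFlow Lite."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    def representative_dataset():
        # Calibration batches let the converter estimate activation
        # ranges before fixing the int8 scale factors.
        for batch in calibration_batches:
            yield [tf.cast(batch, tf.float32)]

    converter.representative_dataset = representative_dataset
    # Restrict the model to int8 kernels so that weights and
    # activations are both quantized.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    return converter.convert()  # serialized TFLite flatbuffer

The resulting flatbuffer runs through the TFLite interpreter; the speed-up
and accuracy numbers quoted above come from comparing such quantized models
against the original float model.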
HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Model quantization is a widely used technique to compress and accelerate
deep neural network (DNN) inference. Emerging DNN hardware accelerators
have begun to support mixed precision (1-8 bits) to further improve
computation efficiency, which raises a great challenge: finding the optimal
bitwidth for each layer requires domain experts to explore a vast design
space, trading off accuracy, latency, energy, and model size, which is both
time-consuming and sub-optimal. Conventional quantization algorithms ignore
the different hardware architectures and quantize all layers in a uniform
way. In this paper, we introduce the Hardware-Aware Automated Quantization
(HAQ) framework, which leverages reinforcement learning to automatically
determine the quantization policy, and we take the hardware accelerator's
feedback into the design loop. Rather than relying on proxy signals such as
FLOPs and model size, we employ a hardware simulator to generate direct
feedback signals (latency and energy) for the RL agent. Compared with
conventional methods, our framework is fully automated and can specialize
the quantization policy for different neural network architectures and
hardware architectures. Our framework reduces latency by 1.4-1.95x and
energy consumption by 1.9x with negligible loss of accuracy compared with
fixed-bitwidth (8-bit) quantization. It also reveals that the optimal
policies on different hardware architectures (i.e., edge and cloud
architectures) under different resource constraints (i.e., latency, energy,
and model size) are drastically different. We interpret the implications of
the different quantization policies, which offer insights for both neural
network architecture design and hardware architecture design.

Comment: CVPR 2019. The first three authors contributed equally to this
work. Project page: https://hanlab.mit.edu/projects/haq
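
To make the hardware-in-the-loop search concrete, here is a minimal toy
sketch of the idea the abstract describes, not HAQ itself: the paper trains
a DDPG agent, whereas a random proposal stands in for the learned policy
below, and ToySimulator / evaluate are placeholder hooks for the hardware
simulator and the quantized-model accuracy measurement.

import random

BITS = range(1, 9)  # mixed-precision search space: 1-8 bits per layer

class ToySimulator:
    """Placeholder for the hardware simulator that supplies direct
    latency feedback instead of proxies such as FLOPs."""
    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes

    def latency(self, policy):
        # Toy cost model: cycles scale with weights x bitwidth per layer.
        return sum(s * b for s, b in zip(self.layer_sizes, policy))

def propose(num_layers):
    """Stand-in for the RL actor: sample one bitwidth per layer."""
    return [random.choice(list(BITS)) for _ in range(num_layers)]

def constrain(policy, sim, budget):
    """Reduce bitwidths until the latency budget is met, mirroring how
    HAQ adjusts the agent's actions to satisfy the resource constraint."""
    while sim.latency(policy) > budget and max(policy) > min(BITS):
        policy[policy.index(max(policy))] -= 1
    return policy

def evaluate(policy):
    # Placeholder for quantizing the DNN and measuring validation
    # accuracy; here accuracy simply degrades as bitwidths shrink.
    return 1.0 - 0.02 * sum(8 - b for b in policy) / len(policy)

def search(layer_sizes, budget, episodes=200):
    sim, best = ToySimulator(layer_sizes), (None, -1.0)
    for _ in range(episodes):
        policy = constrain(propose(len(layer_sizes)), sim, budget)
        acc = evaluate(policy)  # reward is driven by accuracy
        if acc > best[1]:
            best = (policy, acc)
    return best

print(search(layer_sizes=[64, 128, 256, 128], budget=2500))

The essential design point survives even in this toy: the reward loop
consumes measured latency from the (simulated) hardware rather than a
proxy metric, which is what lets the learned policy specialize to a given
accelerator.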