4,695 research outputs found
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Model quantization is a widely used technique to compress and accelerate deep
neural network (DNN) inference. Emergent DNN hardware accelerators begin to
support mixed precision (1-8 bits) to further improve the computation
efficiency, which raises a great challenge to find the optimal bitwidth for
each layer: it requires domain experts to explore the vast design space trading
off among accuracy, latency, energy, and model size, which is both
time-consuming and sub-optimal. Conventional quantization algorithm ignores the
different hardware architectures and quantizes all the layers in a uniform way.
In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ)
framework which leverages the reinforcement learning to automatically determine
the quantization policy, and we take the hardware accelerator's feedback in the
design loop. Rather than relying on proxy signals such as FLOPs and model size,
we employ a hardware simulator to generate direct feedback signals (latency and
energy) to the RL agent. Compared with conventional methods, our framework is
fully automated and can specialize the quantization policy for different neural
network architectures and hardware architectures. Our framework effectively
reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with
negligible loss of accuracy compared with the fixed bitwidth (8 bits)
quantization. Our framework reveals that the optimal policies on different
hardware architectures (i.e., edge and cloud architectures) under different
resource constraints (i.e., latency, energy and model size) are drastically
different. We interpreted the implication of different quantization policies,
which offer insights for both neural network architecture design and hardware
architecture design.Comment: CVPR 2019. The first three authors contributed equally to this work.
Project page: https://hanlab.mit.edu/projects/haq
Fast Prototyping Next-Generation Accelerators for New ML Models using MASE: ML Accelerator System Exploration
Machine learning (ML) accelerators have been studied and used extensively to
compute ML models with high performance and low power. However, designing such
accelerators normally takes a long time and requires significant effort.
Unfortunately, the pace of development of ML software models is much faster
than the accelerator design cycle, leading to frequent and drastic
modifications in the model architecture, thus rendering many accelerators
obsolete. Existing design tools and frameworks can provide quick accelerator
prototyping, but only for a limited range of models that can fit into a single
hardware device, such as an FPGA. Furthermore, with the emergence of large
language models, such as GPT-3, there is an increased need for hardware
prototyping of these large models within a many-accelerator system to ensure
the hardware can scale with the ever-growing model sizes. In this paper, we
propose an efficient and scalable approach for exploring accelerator systems to
compute large ML models. We developed a tool named MASE that can directly map
large ML models onto an efficient streaming accelerator system. Over a set of
ML models, we show that MASE can achieve better energy efficiency to GPUs when
computing inference for recent transformer models. Our tool will open-sourced
upon publication
- …