Automated flow for compressing convolution neural networks for efficient edge-computation with FPGA
Deep convolutional neural network (CNN) based solutions are the current
state-of-the-art for computer vision tasks. Due to the large size of these
models, they are typically run on clusters of CPUs or GPUs. However, power
requirements and cost budgets can be a major hindrance to the adoption of CNNs for
IoT applications. Recent research highlights that CNNs contain significant
redundancy in their structure and can be quantized to lower bit-width
parameters and activations while maintaining acceptable accuracy. Low
bit-width and especially single-bit-width (binary) CNNs are particularly
suitable for mobile applications based on FPGA implementation, due to the
bitwise logic operations involved in binarized CNNs. Moreover, the transition to
lower bit-widths opens new avenues for performance optimizations and model
improvement. In this paper, we present an automatic flow from trained
TensorFlow models to FPGA system on chip implementation of binarized CNN. This
flow involves quantization of model parameters and activations, generation of
network and model in embedded-C, followed by automatic generation of the FPGA
accelerator for binary convolutions. The automated flow is demonstrated through
implementation of binarized "YOLOV2" on the low-cost, low-power Cyclone-V FPGA
device. Experiments on object detection using binarized YOLOV2 demonstrate
significant performance benefit in terms of model size and inference speed on
FPGA as compared to CPU and mobile CPU platforms. Furthermore, the entire
automated flow from trained models to FPGA synthesis can be completed within
one hour.
Comment: 7 pages, 9 figures. Accepted and presented at MLPCD workshop, NIPS
2017 (Long Beach, California)
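The core arithmetic trick behind binarized CNNs on FPGAs is easy to illustrate in software: once weights and activations are constrained to {-1, +1}, a dot product reduces to XNOR plus popcount in hardware. A minimal NumPy sketch of that idea (the function names are illustrative and not part of the paper's TensorFlow-to-FPGA flow):

    import numpy as np

    def binarize(x):
        # Sign binarization: map full-precision values to {-1, +1}.
        return np.where(x >= 0, 1, -1).astype(np.int8)

    def binary_dot(w_bin, a_bin):
        # Dot product of two {-1, +1} vectors via agreement counting,
        # the software analogue of the XNOR + popcount used in FPGA logic.
        n = w_bin.size
        agree = np.count_nonzero((w_bin > 0) == (a_bin > 0))
        return 2 * agree - n  # equals np.dot(w_bin, a_bin)

    w, a = np.random.randn(64), np.random.randn(64)
    wb, ab = binarize(w), binarize(a)
    assert binary_dot(wb, ab) == int(np.dot(wb, ab))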
Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications such as machine learning, speech
recognition, artificial intelligence, and DNA sequencing in recent years,
computer architecture research communities are facing an explosive growth in
data scale. To achieve high efficiency in data-intensive computing, studies of
heterogeneous accelerators that target these emerging applications have become
a hot topic in the computer architecture domain. At present, heterogeneous
accelerators are mainly implemented with heterogeneous computing units such as
Application-Specific Integrated Circuits (ASIC), Graphics Processing Units
(GPU), and Field Programmable Gate Arrays (FPGA). Among these typical
heterogeneous architectures, FPGA-based reconfigurable accelerators have two
merits. First, the FPGA fabric contains a large number of reconfigurable
circuits, which satisfy the requirements of high performance and low power
consumption when specific applications are running. Second, FPGA-based
reconfigurable architectures enable rapid prototyping and offer excellent
customizability and reconfigurability. Nowadays, a wave of accelerator works
based on FPGAs and other reconfigurable architectures is emerging in the
top-tier computer architecture conferences. To better review the recent work
on reconfigurable computing accelerators, this survey takes the latest research
on reconfigurable accelerator architectures and their algorithm applications as
its basis. In this survey, we compare hot research issues and application
domains of concern, and analyze and illuminate the advantages, disadvantages,
and challenges of reconfigurable accelerators. In the end, we discuss the
future development trends of accelerator architectures, hoping to provide a
reference for computer architecture researchers.
C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Recently, significant accuracy improvements have been achieved for acoustic
recognition systems by increasing the model size of Long Short-Term Memory
(LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads to
inefficient designs on FPGAs due to the limited on-chip resources. Previous
work proposes a pruning-based compression technique to reduce the model
size and thus speed up inference on FPGAs. However, the random nature of
the pruning technique transforms the dense matrices of the model to highly
unstructured sparse ones, which leads to unbalanced computation and irregular
memory accesses and thus hurts the overall performance and energy efficiency.
In contrast, we propose to use a structured compression technique which could
not only reduce the LSTM model size but also eliminate the irregularities of
computation and memory accesses. This approach employs block-circulant instead
of sparse matrices to compress weight matrices and reduces the storage
requirement from O(k^2) to O(k). Fast Fourier Transform
algorithm is utilized to further accelerate the inference by reducing the
computational complexity from O(k^2) to O(k log k). The datapath and activation functions are
quantized to 16 bits to improve the resource utilization. More importantly, we
propose a comprehensive framework called C-LSTM to automatically optimize and
implement a wide range of LSTM variants on FPGAs. According to the experimental
results, C-LSTM achieves up to 18.8X and 33.5X gains for performance and energy
efficiency compared with the state-of-the-art LSTM implementation under the
same experimental setup, and the accuracy degradation is very small.
Comment: Proceedings of the 2018 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays
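The complexity reduction above comes from the fact that a circulant block is fully described by one length-k vector and can be applied with FFTs. A minimal sketch of that identity (a full layer would tile its weight matrix into such blocks; this is not C-LSTM's actual implementation):

    import numpy as np

    def circulant_matvec(c, x):
        # Multiply by the circulant matrix whose first column is c in
        # O(k log k) via FFT, instead of forming the k x k matrix (O(k^2)).
        return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

    k = 8
    c, x = np.random.randn(k), np.random.randn(k)
    C = np.array([[c[(i - j) % k] for j in range(k)] for i in range(k)])
    assert np.allclose(C @ x, circulant_matvec(c, x))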
CodeX: Bit-Flexible Encoding for Streaming-based FPGA Acceleration of DNNs
This paper proposes CodeX, an end-to-end framework that facilitates encoding,
bitwidth customization, fine-tuning, and implementation of neural networks on
FPGA platforms. CodeX incorporates nonlinear encoding to the computation flow
of neural networks to save memory. The encoded features demand significantly
lower storage compared to the raw full-precision activation values; therefore,
the execution flow of the CodeX hardware engine is performed entirely within the
FPGA using on-chip streaming buffers with no access to the off-chip DRAM. We
further propose a fully-automated algorithm inspired by reinforcement learning
which determines the customized encoding bitwidth across network layers. The
CodeX full-stack framework comprises a compiler that takes a high-level Python
description of an arbitrary neural network architecture. The compiler then
instantiates the corresponding elements from CodeX Hardware library for FPGA
implementation. Proof-of-concept evaluations on MNIST, SVHN, and CIFAR-10
datasets demonstrate an average of 4.65x throughput improvement compared to
stand-alone weight encoding. We further compare CodeX with six existing
full-precision DNN accelerators on ImageNet, showing an average of 3.6x and
2.54x improvement in throughput and performance-per-watt, respectively.
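As a rough illustration of why encoded activations shrink on-chip buffer traffic, the sketch below stores each activation as a small index into a codebook. CodeX learns a nonlinear encoding and picks the bitwidth per layer with reinforcement learning; the quantile codebook here is only a hypothetical stand-in:

    import numpy as np

    def build_codebook(acts, bitwidth):
        # Pick 2**bitwidth representative levels from observed activations
        # (simple quantiles here; not the learned encoding used by CodeX).
        levels = 2 ** bitwidth
        return np.quantile(acts, (np.arange(levels) + 0.5) / levels)

    def encode(acts, codebook):
        # Keep only the index of the nearest codebook entry per activation.
        return np.abs(acts[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

    acts = np.random.rand(1024).astype(np.float32)
    cb = build_codebook(acts, bitwidth=3)          # 3-bit encoding, 8 levels
    enc = encode(acts, cb)
    approx = cb[enc]                               # decoded activations
    print(enc.nbytes, "bytes encoded vs", acts.nbytes, "bytes raw")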
INsight: A Neuromorphic Computing System for Evaluation of Large Neural Networks
Deep neural networks have demonstrated impressive results in various
cognitive tasks such as object detection and image classification. In order to
execute large networks, Von Neumann computers store the large number of weight
parameters in external memories, and processing elements are time-shared,
which leads to power-hungry I/O operations and processing bottlenecks. This
paper describes a neuromorphic computing system that is designed from the
ground up for the energy-efficient evaluation of large-scale neural networks.
The computing system consists of a non-conventional compiler, a neuromorphic
architecture, and a space-efficient microarchitecture that leverages existing
integrated circuit design methodologies. The compiler factorizes a trained,
feedforward network into a sparsely connected network, compresses the weights
linearly, and generates a time-delay neural network, reducing the number of
connections. The connections and units in the simplified network are mapped to
silicon synapses and neurons. We demonstrate an implementation of the
neuromorphic computing system based on a field-programmable gate array that
performs MNIST handwritten digit classification with 97.64% accuracy.
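One simple way to see how a dense trained layer can be re-expressed with far fewer connections, in the spirit of the compiler's factorization step, is a truncated-SVD split of the weight matrix. This is only an assumed illustration; the paper's compiler additionally sparsifies the network and converts it into a time-delay form:

    import numpy as np

    def factorize_layer(W, rank):
        # Replace one dense m x n weight matrix with an m x rank and a
        # rank x n matrix; connection count drops when rank < m*n/(m+n).
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return U[:, :rank] * s[:rank], Vt[:rank, :]

    W = np.random.randn(256, 784)
    A, B = factorize_layer(W, rank=64)
    x = np.random.randn(784)
    rel_err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
    print(f"connections: {W.size} -> {A.size + B.size}, relative error {rel_err:.3f}")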
DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method
for several artificial intelligence domains in recent years. The increasingly
complex CNN models are both computation-bound and I/O-bound. FPGA-based
accelerators driven by custom instruction set architecture (ISA) achieve a
balance between generality and efficiency, but much is still left to be
optimized. We propose the full-stack compiler DNNVM, which is an integration of
optimizers for graphs, loops and data layouts, and an assembler, a runtime
supporter and a validation environment. The DNNVM works in the context of deep
learning frameworks and transforms CNN models into the directed acyclic graph:
XGraph. Based on XGraph, we transform the optimization challenges for both the
data layout and pipeline into graph-level problems. DNNVM enumerates all
potentially profitable fusion opportunities by a heuristic subgraph isomorphism
algorithm to leverage pipeline and data layout optimizations, and searches for
the best choice of execution strategies of the whole computing graph. On the
Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve performance equivalent to the
state-of-the-art on our benchmarks with naïve implementations and no optimizations,
and the throughput is further improved by up to 1.26x by leveraging heterogeneous
optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art
performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an
energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38
TOPs/s for ResNet50 and 1.41 TOPs/s for GoogleNet.
Comment: 18 pages, 9 figures, 5 tables
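The fusion search that DNNVM performs over XGraph can be pictured with a toy pattern match over a small DAG. The node names, op types, and fusible patterns below are illustrative placeholders, not DNNVM's actual IR or heuristics:

    # Toy fusion-candidate search over a small DAG.
    graph = {                     # op name -> list of consumer ops
        "conv1": ["bn1"], "bn1": ["relu1"], "relu1": ["conv2"],
        "conv2": ["add1"], "skip": ["add1"], "add1": [],
    }
    op_type = {"conv1": "conv", "bn1": "bn", "relu1": "relu",
               "conv2": "conv", "skip": "conv", "add1": "add"}
    FUSIBLE = [("conv", "bn", "relu"), ("conv", "add")]

    def chains(node, length):
        # Enumerate simple downstream paths of the given length from node.
        if length == 1:
            return [[node]]
        return [[node] + rest for nxt in graph[node] for rest in chains(nxt, length - 1)]

    def fusion_candidates():
        # Report every path whose op-type sequence matches a fusible pattern.
        return [path
                for pattern in FUSIBLE
                for node in graph
                for path in chains(node, len(pattern))
                if tuple(op_type[n] for n in path) == pattern]

    print(fusion_candidates())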
FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge
While embedded FPGAs are attractive platforms for DNN acceleration on
edge devices due to their low latency and high energy efficiency, the scarcity
of resources on edge-scale FPGA devices also makes DNN deployment challenging.
In this paper, we propose a simultaneous FPGA/DNN co-design
methodology with both bottom-up and top-down approaches: a bottom-up
hardware-oriented DNN model search for high accuracy, and a top-down FPGA
accelerator design considering DNN-specific characteristics. We also build an
automatic co-design flow, including an Auto-DNN engine to perform
hardware-oriented DNN model search, as well as an Auto-HLS engine to generate
synthesizable C code of the FPGA accelerator for explored DNNs. We demonstrate
our co-design approach on an object detection task using PYNQ-Z1 FPGA. Results
show that our proposed DNN model and accelerator outperform the
state-of-the-art FPGA designs in all aspects including Intersection-over-Union
(IoU) (6.2% higher), frames per second (FPS) (2.48X higher), power consumption
(40% lower), and energy efficiency (2.5X higher). Compared to GPU-based
solutions, our designs deliver similar accuracy but consume far less energy.
Comment: Accepted by Design Automation Conference (DAC'2019)
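A co-design loop of this kind can be sketched as a joint sweep over candidate DNN configurations, each scored by an accuracy proxy and a hardware-cost estimate, keeping only Pareto-optimal points for full training. The scoring functions below are placeholders, not the paper's Auto-DNN or Auto-HLS engines:

    def accuracy_proxy(cfg):
        # Placeholder proxy: deeper/wider candidates score higher before training.
        return 0.50 + 0.04 * cfg["depth"] + 0.005 * cfg["width"]

    def latency_estimate_ms(cfg):
        # Placeholder hardware cost: latency grows with depth x width.
        return 2.0 * cfg["depth"] * cfg["width"] / 16

    def dominates(o, s):
        # o dominates s if it is no worse on both objectives and better on one.
        return o[0] >= s[0] and o[1] <= s[1] and (o[0] > s[0] or o[1] < s[1])

    candidates = [{"depth": d, "width": w} for d in (4, 6, 8) for w in (8, 16, 32)]
    scored = [(accuracy_proxy(c), latency_estimate_ms(c), c) for c in candidates]
    pareto = [s for s in scored if not any(dominates(o, s) for o in scored)]
    for acc, lat, cfg in sorted(pareto, key=lambda t: t[1]):
        print(f"{cfg}: proxy accuracy {acc:.2f}, estimated latency {lat:.1f} ms")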
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
Convolutional Neural Networks have rapidly become the most successful machine
learning algorithms, enabling ubiquitous machine vision and intelligent
decisions even on embedded computing systems. While the underlying arithmetic
is structurally simple, compute and memory requirements are challenging. One of
the promising opportunities is leveraging reduced-precision representations for
inputs, activations and model parameters. The resulting scalability in
performance, power efficiency and storage footprint provides interesting design
compromises in exchange for a small reduction in accuracy. FPGAs are ideal for
exploiting low-precision inference engines leveraging custom precisions to
achieve the required numerical accuracy for a given application. In this
article, we describe the second generation of the FINN framework, an end-to-end
tool which enables design space exploration and automates the creation of fully
customized inference engines on FPGAs. Given a neural network description, the
tool optimizes for given platforms, design targets and a specific precision. We
introduce formalizations of resource cost functions and performance
predictions, and elaborate on the optimization algorithms. Finally, we evaluate
a selection of reduced precision neural networks ranging from CIFAR-10
classifiers to YOLO-based object detection on a range of platforms including
PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s
on AWS F1 and 5 TOp/s on embedded devices.
Comment: to be published in ACM TRETS Special Edition on Deep Learning
A Survey of FPGA-Based Neural Network Accelerator
Recent research on neural networks has shown significant advantages in
machine learning over traditional algorithms based on handcrafted features and
models. Neural networks are now widely adopted in domains such as image, speech,
and video recognition. But the high computation and storage complexity of
neural network inference poses great difficulty for their application. CPU
platforms can hardly offer enough computation capacity. GPU platforms are the
first choice for neural network processing because of their high computation
capacity and easy-to-use development frameworks.
On the other hand, FPGA-based neural network inference accelerators are
becoming a research topic. With specifically designed hardware, FPGAs are the
next possible solution to surpass GPUs in speed and energy efficiency. Various
FPGA-based accelerator designs have been proposed with software and hardware
optimization techniques to achieve high speed and energy efficiency. In this
paper, we give an overview of previous work on FPGA-based neural network
inference accelerators and summarize the main techniques used. An investigation
from software to hardware and from circuit level to system level is carried out
to complete the analysis of FPGA-based neural network inference accelerator
design and to serve as a guide for future work.
Deploying Customized Data Representation and Approximate Computing in Machine Learning Applications
Major advancements in building general-purpose and customized hardware have
been one of the key enablers of versatility and pervasiveness of machine
learning models such as deep neural networks. To sustain this ubiquitous
deployment of machine learning models and cope with their computational and
storage complexity, several solutions such as low-precision representation of
model parameters using fixed-point representation and deploying approximate
arithmetic operations have been employed. Studying the potency of such
solutions in different applications requires integrating them into existing
machine learning frameworks for high-level simulations as well as implementing
them in hardware to analyze their effects on power/energy dissipation,
throughput, and chip area. Lop is a library for design space exploration that
bridges the gap between machine learning and efficient hardware realization. It
comprises a Python module, which can be integrated with some of the existing
machine learning frameworks and implements various customizable data
representations including fixed-point and floating-point as well as approximate
arithmetic operations. Furthermore, it includes a highly-parameterized Scala
module, which allows synthesizing hardware based on the said data
representations and arithmetic operations. Lop allows researchers and designers
to quickly compare the quality of their models using various data representations
and arithmetic operations in Python and contrast the hardware cost of viable
representations by synthesizing them on their target platforms (e.g., FPGA or
ASIC). To the best of our knowledge, Lop is the first library that allows both
software simulation and hardware realization using customized data
representations and approximate computing techniques.
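As a flavor of the kind of simulation such a library enables (Lop's actual Python API is not reproduced here), the sketch below emulates signed fixed-point quantization of model parameters and reports the induced error at a few fractional bitwidths:

    import numpy as np

    def to_fixed_point(x, int_bits, frac_bits):
        # Simulate signed fixed-point Q(int_bits).(frac_bits): scale, round,
        # saturate to the representable range, then dequantize.
        scale = 2.0 ** frac_bits
        lo = -(2 ** (int_bits + frac_bits - 1))
        hi = 2 ** (int_bits + frac_bits - 1) - 1
        return np.clip(np.round(x * scale), lo, hi) / scale

    w = np.random.randn(1000)
    for frac in (2, 4, 8):
        err = np.abs(w - to_fixed_point(w, int_bits=4, frac_bits=frac)).mean()
        print(f"Q4.{frac}: mean absolute error {err:.5f}")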