ReBNet: Residual Binarized Neural Network
This paper proposes ReBNet, an end-to-end framework for training
reconfigurable binary neural networks on software and developing efficient
accelerators for execution on FPGA. Binary neural networks offer an intriguing
opportunity for deploying large-scale deep learning models on
resource-constrained devices. Binarization reduces the memory footprint and
replaces the power-hungry matrix-multiplication with light-weight XnorPopcount
operations. However, binary networks suffer from degraded accuracy compared
to their fixed-point counterparts. We show that the state-of-the-art methods
for optimizing the accuracy of binary networks significantly increase the
implementation cost and complexity. To compensate for the degraded accuracy
while adhering to the simplicity of binary networks, we devise the first
reconfigurable scheme that can adjust the classification accuracy based on the
application. Our proposition improves the classification accuracy by
representing features with multiple levels of residual binarization. Unlike
previous methods, our approach does not exacerbate the area cost of the
hardware accelerator. Instead, it provides a tradeoff between throughput and
accuracy while the area overhead of multi-level binarization is negligible.
Comment: To Appear In The 26th IEEE International Symposium on
Field-Programmable Custom Computing Machine
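The two ideas named in the abstract above, XnorPopcount dot products and multi-level residual binarization, can be illustrated with a minimal sketch. This is not the ReBNet implementation: the function names are hypothetical, and the scale factors in `gammas` are hand-picked for illustration (the abstract does not say how they are obtained).

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int; bit i is 1 when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two length-n vectors of +1/-1 values stored as bit words.

    XNOR marks the positions where the signs agree; each agreement adds +1
    and each disagreement adds -1, so the dot product is 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agreements = bin(~(a_word ^ b_word) & mask).count("1")
    return 2 * agreements - n


def residual_binarize(x, gammas):
    """Binarize x in several levels; each level binarizes the previous residual.

    gammas holds one (assumed, illustrative) scale factor per level.
    Returns a list of sign vectors, one per level.
    """
    residual = list(x)
    levels = []
    for g in gammas:
        signs = [1 if r >= 0 else -1 for r in residual]
        levels.append(signs)
        residual = [r - g * s for r, s in zip(residual, signs)]
    return levels


def reconstruct(levels, gammas):
    """Approximate the original values as a scaled sum of the sign vectors."""
    n = len(levels[0])
    return [sum(g * lvl[i] for g, lvl in zip(gammas, levels)) for i in range(n)]


def l1_error(x, approx):
    """Sum of absolute differences between x and its approximation."""
    return sum(abs(a - b) for a, b in zip(x, approx))


x = [0.9, -0.3, 0.1, -1.2]
gammas = [0.6, 0.3]
levels = residual_binarize(x, gammas)
err_one_level = l1_error(x, reconstruct(levels[:1], gammas[:1]))
err_two_levels = l1_error(x, reconstruct(levels, gammas))
```

In this sketch each extra level adds one more pass of sign bits over the same data, which matches the throughput-versus-accuracy tradeoff described in the abstract: more levels lower the approximation error at the cost of more XnorPopcount passes.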
NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps
Convolutional neural networks (CNNs) have become the dominant neural network
architecture for solving many state-of-the-art (SOA) visual processing tasks.
Even though Graphical Processing Units (GPUs) are most often used in training
and deploying CNNs, their power efficiency is less than 10 GOp/s/W for
single-frame runtime inference. We propose a flexible and efficient CNN
accelerator architecture called NullHop that implements SOA CNNs useful for
low-power and low-latency application scenarios. NullHop exploits the sparsity
of neuron activations in CNNs to accelerate the computation and reduce memory
requirements. The flexible architecture allows high utilization of available
computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can
process up to 128 input and 128 output feature maps per layer in a single pass.
We implemented the proposed architecture on a Xilinx Zynq FPGA platform and
present results showing how our implementation reduces external memory
transfers and compute time in five different CNNs ranging from small ones up to
the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using
Mentor Modelsim in a 28nm process with a clock frequency of 500 MHz show that
the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop
achieves an efficiency of 368%, maintains over 98% utilization of the MAC
units, and achieves a power efficiency of over 3 TOp/s/W in a core area of
6.3 mm². As further proof of NullHop's usability, we interfaced its FPGA
implementation with a neuromorphic event camera for real time interactive
demonstrations.
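As a rough illustration of how activation sparsity can reduce both storage and multiply-accumulate (MAC) work, the sketch below stores a feature map as a bitmask plus a list of its nonzero values and skips the zeros during a dot product. The encoding and the function names are assumptions made for this example; the abstract does not specify NullHop's actual data format or pipeline.

```python
def compress(activations):
    """Encode activations as (mask, values): mask[i] is 1 where the value is nonzero."""
    mask = [1 if a != 0 else 0 for a in activations]
    values = [a for a in activations if a != 0]
    return mask, values


def sparse_dot(mask, values, weights):
    """Dot product that performs a MAC only where the mask bit is set.

    Returns (result, number_of_macs_performed).
    """
    total = 0
    macs = 0
    nonzero = iter(values)
    for bit, w in zip(mask, weights):
        if bit:
            total += next(nonzero) * w
            macs += 1
    return total, macs


activations = [0, 3, 0, 0, 2, 0, 0, 1]   # e.g. post-ReLU outputs, mostly zero
weights = [5, 1, 4, 2, 3, 7, 6, 2]
mask, values = compress(activations)
result, macs = sparse_dot(mask, values, weights)
```

Here only three of the eight positions trigger a MAC, and only three values plus an 8-bit mask need to be stored or transferred.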
FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
Neural Network (NN) accelerators with emerging ReRAM (resistive random access
memory) technologies have been investigated as one of the promising solutions
to address the "memory wall" challenge, due to the unique capability of
processing-in-memory within ReRAM-crossbar-based processing elements
(PEs). However, the high efficiency and high density advantages of ReRAM have
not been fully utilized due to the huge communication demands among PEs and the
overhead of peripheral circuits.
In this paper, we propose a full system stack solution, composed of a
reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and
its software system including neural synthesizer, temporal-to-spatial mapper,
and placement & routing. We rely heavily on the software system to make the
hardware design compact and efficient. To satisfy the high-performance
communication demand, we optimize it with a reconfigurable routing architecture
and the placement & routing tool. To improve the computational density, we
greatly simplify the PE circuit with the spiking schema and then adopt neural
synthesizer to enable the high density computation-resources to support
different kinds of NN operations. In addition, we provide spiking memory blocks
(SMBs) and configurable logic blocks (CLBs) in hardware and leverage the
temporal-to-spatial mapper to utilize them to balance the storage and
computation requirements of NN. Owing to the end-to-end software system, we can
efficiently deploy existing deep neural networks to FPSA. Evaluations show
that, compared to one of the state-of-the-art ReRAM-based NN accelerators, PRIME,
the computational density of FPSA improves by 31x; for representative NNs, its
inference performance can achieve up to 1000x speedup.
Comment: Accepted by ASPLOS 201