FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
Research has shown that convolutional neural networks contain significant
redundancy, and high classification accuracy can be obtained even when weights
and activations are reduced from floating point to binary values. In this
paper, we present FINN, a framework for building fast and flexible FPGA
accelerators using a flexible heterogeneous streaming architecture. By
utilizing a novel set of optimizations that enable efficient mapping of
binarized neural networks to hardware, we implement fully connected,
convolutional and pooling layers, with per-layer compute resources being
tailored to user-provided throughput requirements. On a ZC706 embedded FPGA
platform drawing less than 25 W total system power, we demonstrate up to 12.3
million image classifications per second with 0.31 µs latency on the MNIST
dataset with 95.8% accuracy, and 21906 image classifications per second with
283 µs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9%
accuracy, respectively. To the best of our knowledge, ours are the fastest
classification rates reported to date on these benchmarks.
Comment: To appear in the 25th International Symposium on Field-Programmable
Gate Arrays, February 201
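Once weights and activations are reduced to binary values, a dot product no longer needs multipliers. The sketch below is an illustration only, not FINN's implementation: it assumes values in {-1, +1} packed one bit per value (bit set means +1), and the helper names `pack_bits` and `xnor_popcount_dot` are hypothetical. It shows the XNOR-plus-popcount identity commonly used for binarized layers.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int, one bit per value (1 means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors stored as bit-packed ints."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: bit is 1 where the signs match
    matches = bin(agree).count("1")     # popcount of the matching positions
    return 2 * matches - n              # matches minus mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))   # ordinary multiply-accumulate
result = xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a))
```

Here `result` equals `reference`, so the multiply-accumulate can be swapped for two bitwise operations and a bit count, which is what makes such layers cheap to map onto FPGA logic.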
ReBNet: Residual Binarized Neural Network
This paper proposes ReBNet, an end-to-end framework for training
reconfigurable binary neural networks on software and developing efficient
accelerators for execution on FPGA. Binary neural networks offer an intriguing
opportunity for deploying large-scale deep learning models on
resource-constrained devices. Binarization reduces the memory footprint and
replaces the power-hungry matrix-multiplication with light-weight XnorPopcount
operations. However, binary networks suffer from degraded accuracy compared
to their fixed-point counterparts. We show that the state-of-the-art methods
for optimizing binary network accuracy significantly increase the
implementation cost and complexity. To compensate for the degraded accuracy
while adhering to the simplicity of binary networks, we devise the first
reconfigurable scheme that can adjust the classification accuracy based on the
application. Our proposition improves the classification accuracy by
representing features with multiple levels of residual binarization. Unlike
previous methods, our approach does not exacerbate the area cost of the
hardware accelerator. Instead, it provides a tradeoff between throughput and
accuracy while the area overhead of multi-level binarization is negligible.
Comment: To appear in the 26th IEEE International Symposium on
Field-Programmable Custom Computing Machine
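The idea of representing a feature with multiple levels of residual binarization can be sketched as follows. This is a minimal illustration under stated assumptions, not ReBNet's code: the per-level scales are fixed here to hypothetical values (in practice they could be learned), and the function names are made up for the example. Each level stores one sign bit that encodes what the previous levels left unexplained.

```python
def residual_binarize(x, scales):
    """Return one sign per level; level i encodes the residual left by earlier levels."""
    signs = []
    residual = x
    for gamma in scales:
        s = 1 if residual >= 0 else -1
        signs.append(s)
        residual -= gamma * s   # remove what this level already explains
    return signs


def reconstruct(signs, scales):
    """Approximate the original value as a scaled sum of the per-level signs."""
    return sum(g * s for g, s in zip(scales, signs))


scales = [1.0, 0.5, 0.25]   # hypothetical scales, chosen only for illustration
x = 0.3
errors = []
for levels in (1, 2, 3):
    bits = residual_binarize(x, scales[:levels])
    errors.append(abs(x - reconstruct(bits, scales[:levels])))
```

For this input the approximation error shrinks as levels are added, while each extra level costs only one more bit per feature. That is the kind of throughput-versus-accuracy knob the abstract describes.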
Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework
Hardware acceleration of deep learning systems has been extensively
investigated in industry and academia. The aim of this paper is to achieve
ultra-high energy efficiency and performance for hardware implementations of
deep neural networks (DNNs). An algorithm-hardware co-optimization framework is
developed, which is applicable to different DNN types, sizes, and application
scenarios. The algorithm part adopts the general block-circulant matrices to
achieve a fine-grained tradeoff between accuracy and compression ratio. It
applies to both fully-connected and convolutional layers and contains a
mathematically rigorous proof of the effectiveness of the method. The proposed
algorithm reduces computational complexity per layer from O(n^2) to
O(n log n) and storage complexity from O(n^2) to O(n), both for training and
inference. The hardware part consists of highly efficient Field Programmable
Gate Array (FPGA)-based implementations using effective reconfiguration, batch
processing, deep pipelining, resource re-using, and hierarchical control.
Experimental results demonstrate that the proposed framework achieves at least
152X speedup and 71X energy efficiency gain compared with the IBM TrueNorth
processor under the same test accuracy. It achieves at least 31X energy
efficiency gain compared with the reference FPGA-based work.
Comment: 6 figures, AAAI Conference on Artificial Intelligence, 201
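The complexity reduction from block-circulant weights rests on a standard fact: a circulant matrix is fully determined by one column, and multiplying by it is a circular convolution that an FFT evaluates in O(n log n). The sketch below is an assumption-laden illustration, not the paper's implementation: it uses a small hand-written radix-2 FFT (block size must be a power of two), and all function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_matvec_fft(first_col, x):
    """Multiply the circulant matrix defined by first_col with x via FFTs."""
    n = len(x)
    spectrum = [a * b for a, b in zip(fft(first_col), fft(x))]
    return [v.real / n for v in fft(spectrum, inverse=True)]


def circulant_matvec_dense(first_col, x):
    """Reference O(n^2) product using C[i][j] = first_col[(i - j) % n]."""
    n = len(x)
    return [sum(first_col[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def block_circulant_matvec(blocks, x, b):
    """blocks[i][j] is the first column of the (i, j) circulant block of size b."""
    y = []
    for row in blocks:
        acc = [0.0] * b
        for j, first_col in enumerate(row):
            part = circulant_matvec_fft(first_col, x[j * b:(j + 1) * b])
            acc = [s + p for s, p in zip(acc, part)]
        y.extend(acc)
    return y


c = [1.0, 2.0, 3.0, 4.0]    # one stored column instead of a 4x4 matrix
x = [1.0, 0.0, -1.0, 2.0]
fast = circulant_matvec_fft(c, x)
slow = circulant_matvec_dense(c, x)
```

Both `fast` and `slow` hold the same product up to floating-point rounding. Storing only `c` gives the O(n) storage, and the block size `b` is the knob that trades compression against accuracy.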
Hardware-efficient on-line learning through pipelined truncated-error backpropagation in binary-state networks
Artificial neural networks (ANNs) trained using backpropagation are powerful
learning architectures that have achieved state-of-the-art performance in
various benchmarks. Significant effort has been devoted to developing custom
silicon devices to accelerate inference in ANNs. Accelerating the training
phase, however, has attracted relatively little attention. In this paper, we
describe a hardware-efficient on-line learning technique for feedforward
multi-layer ANNs that is based on pipelined backpropagation. Learning is
performed in parallel with inference in the forward pass, removing the need for
an explicit backward pass and requiring no extra weight lookup. By using binary
state variables in the feedforward network and ternary errors in
truncated-error backpropagation, the need for any multiplications in the
forward and backward passes is removed, and memory requirements for the
pipelining are drastically reduced. Further reduction in addition operations
owing to the sparsity in the forward neural and backpropagating error signal
paths contributes to highly efficient hardware implementation. For
proof-of-concept validation, we demonstrate on-line learning of MNIST
handwritten digit classification on a Spartan 6 FPGA interfacing with an
external 1Gb DDR2 DRAM, which shows a small degradation in test error performance
compared to an equivalently sized binary ANN trained off-line using standard
back-propagation and exact errors. Our results highlight an attractive synergy
between pipelined backpropagation and binary-state networks in substantially
reducing computation and memory requirements, making pipelined on-line learning
practical in deep networks.
Comment: Now also consider 0/1 binary activations. Memory access statistics
reporte
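Why binary states and ternary errors remove multiplications can be seen in a single-layer weight update. The following sketch is an assumption-heavy illustration rather than the paper's algorithm: it omits the pipelining entirely, uses a hypothetical fixed threshold to truncate errors, assumes 0/1 activations and integer weights, and defines the error as output minus target. The names `ternarize` and `update_weights` are invented for the example.

```python
def ternarize(error, threshold):
    """Truncate a real-valued error to -1, 0, or +1 (hypothetical threshold rule)."""
    if error > threshold:
        return 1
    if error < -threshold:
        return -1
    return 0


def update_weights(weights, inputs, ternary_errors, step):
    """In-place update for one layer; weights[j][i] connects input i to output j.

    Inputs are 0/1 and errors are -1/0/+1, so each weight changes by +step,
    -step, or not at all. Returns the number of additions performed.
    """
    additions = 0
    for j, e in enumerate(ternary_errors):
        if e == 0:
            continue                      # zero error: skip the whole row
        for i, x in enumerate(inputs):
            if x == 0:
                continue                  # inactive input: nothing to add
            weights[j][i] += -step if e > 0 else step   # descend the error
            additions += 1
    return additions


weights = [[0, 0, 0], [0, 0, 0]]
inputs = [1, 0, 1]
raw_errors = [0.7, -0.05]
ternary = [ternarize(e, 0.1) for e in raw_errors]
count = update_weights(weights, inputs, ternary, step=1)
```

Only two of the six weights are touched in this example, and each touch is a single addition. This reflects the sparsity argument in the abstract: zero errors and inactive inputs both skip work.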