325 research outputs found
FPGA Implementation of Convolutional Neural Networks with Fixed-Point Calculations
Neural network-based methods for image processing are becoming widely used in
practical applications. Modern neural networks are computationally expensive
and require specialized hardware, such as graphics processing units. Since such
hardware is not always available in real life applications, there is a
compelling need for the design of neural networks for mobile devices. Mobile
neural networks typically have reduced number of parameters and require a
relatively small number of arithmetic operations. However, they usually still
are executed at the software level and use floating-point calculations. The use
of mobile networks without further optimization may not provide sufficient
performance when high processing speed is required, for example, in real-time
video processing (30 frames per second). In this study, we suggest
optimizations to speed up computations in order to efficiently use already
trained neural networks on a mobile device. Specifically, we propose an
approach for speeding up neural networks by moving computation from software to
hardware and by using fixed-point calculations instead of floating-point. We
propose a number of methods for neural network architecture design to improve
the performance with fixed-point calculations. We also show an example of how
existing datasets can be modified and adapted for the recognition task in hand.
Finally, we present the design and the implementation of a floating-point gate
array-based device to solve the practical problem of real-time handwritten
digit classification from mobile camera video feed
Automatic generation of hardware Tree Classifiers
Machine Learning is growing in popularity and spreading across different fields for various applications. Due to this trend, machine learning algorithms use different hardware platforms and are being experimented to obtain high test accuracy and throughput. FPGAs are well-suited hardware platform for machine learning because of its re-programmability and lower power consumption. Programming using FPGAs for machine learning algorithms requires substantial engineering time and effort compared to software implementation. We propose a software assisted design flow to program FPGA for machine learning algorithms using our hardware library. The hardware library is highly parameterized and it accommodates Tree Classifiers. As of now, our library consists of the components required to implement decision trees and random forests. The whole automation is wrapped around using a python script which takes you from the first step of having a dataset and design choices to the last step of having a hardware descriptive code for the trained machine learning model
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
Research has shown that convolutional neural networks contain significant
redundancy, and high classification accuracy can be obtained even when weights
and activations are reduced from floating point to binary values. In this
paper, we present FINN, a framework for building fast and flexible FPGA
accelerators using a flexible heterogeneous streaming architecture. By
utilizing a novel set of optimizations that enable efficient mapping of
binarized neural networks to hardware, we implement fully connected,
convolutional and pooling layers, with per-layer compute resources being
tailored to user-provided throughput requirements. On a ZC706 embedded FPGA
platform drawing less than 25 W total system power, we demonstrate up to 12.3
million image classifications per second with 0.31 {\mu}s latency on the MNIST
dataset with 95.8% accuracy, and 21906 image classifications per second with
283 {\mu}s latency on the CIFAR-10 and SVHN datasets with respectively 80.1%
and 94.9% accuracy. To the best of our knowledge, ours are the fastest
classification rates reported to date on these benchmarks.Comment: To appear in the 25th International Symposium on Field-Programmable
Gate Arrays, February 201
Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration
State-of-the-art convolutional neural networks are enormously costly in both
compute and memory, demanding massively parallel GPUs for execution. Such
networks strain the computational capabilities and energy available to embedded
and mobile processing platforms, restricting their use in many important
applications. In this paper, we push the boundaries of hardware-effective CNN
design by proposing BCNN with Separable Filters (BCNNw/SF), which applies
Singular Value Decomposition (SVD) on BCNN kernels to further reduce
computational and storage complexity. To enable its implementation, we provide
a closed form of the gradient over SVD to calculate the exact gradient with
respect to every binarized weight in backward propagation. We verify BCNNw/SF
on the MNIST, CIFAR-10, and SVHN datasets, and implement an accelerator for
CIFAR-10 on FPGA hardware. Our BCNNw/SF accelerator realizes memory savings of
17% and execution time reduction of 31.3% compared to BCNN with only minor
accuracy sacrifices.Comment: 9 pages, 6 figures, accepted for Embedded Vision Workshop (CVPRW
Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged with memory bottlenecks that require many convolution and fully-connected layers demanding large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replicability and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification of CNNs. Both Altera and Xilinx have adopted OpenCL co-design framework from GPU for FPGA designs as a pseudo-automatic development solution. In this paper, a comprehensive evaluation and comparison of Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platforms tools, mature design community and better execution times
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation
Machine Learning (ML) is making a strong resurgence in tune with the massive
generation of unstructured data which in turn requires massive computational
resources. Due to the inherently compute- and power-intensive structure of
Neural Networks (NNs), hardware accelerators emerge as a promising solution.
However, with technology node scaling below 10nm, hardware accelerators become
more susceptible to faults, which in turn can impact the NN accuracy. In this
paper, we study the resilience aspects of Register-Transfer Level (RTL) model
of NN accelerators, in particular, fault characterization and mitigation. By
following a High-Level Synthesis (HLS) approach, first, we characterize the
vulnerability of various components of RTL NN. We observed that the severity of
faults depends on both i) application-level specifications, i.e., NN data
(inputs, weights, or intermediate), NN layers, and NN activation functions, and
ii) architectural-level specifications, i.e., data representation model and the
parallelism degree of the underlying accelerator. Second, motivated by
characterization results, we present a low-overhead fault mitigation technique
that can efficiently correct bit flips, by 47.3% better than state-of-the-art
methods.Comment: 8 pages, 6 figure
Hardware-efficient on-line learning through pipelined truncated-error backpropagation in binary-state networks
Artificial neural networks (ANNs) trained using backpropagation are powerful
learning architectures that have achieved state-of-the-art performance in
various benchmarks. Significant effort has been devoted to developing custom
silicon devices to accelerate inference in ANNs. Accelerating the training
phase, however, has attracted relatively little attention. In this paper, we
describe a hardware-efficient on-line learning technique for feedforward
multi-layer ANNs that is based on pipelined backpropagation. Learning is
performed in parallel with inference in the forward pass, removing the need for
an explicit backward pass and requiring no extra weight lookup. By using binary
state variables in the feedforward network and ternary errors in
truncated-error backpropagation, the need for any multiplications in the
forward and backward passes is removed, and memory requirements for the
pipelining are drastically reduced. Further reduction in addition operations
owing to the sparsity in the forward neural and backpropagating error signal
paths contributes to highly efficient hardware implementation. For
proof-of-concept validation, we demonstrate on-line learning of MNIST
handwritten digit classification on a Spartan 6 FPGA interfacing with an
external 1Gb DDR2 DRAM, that shows small degradation in test error performance
compared to an equivalently sized binary ANN trained off-line using standard
back-propagation and exact errors. Our results highlight an attractive synergy
between pipelined backpropagation and binary-state networks in substantially
reducing computation and memory requirements, making pipelined on-line learning
practical in deep networks.Comment: Now also consider 0/1 binary activations. Memory access statistics
reporte
Neuro-memristive Circuits for Edge Computing: A review
The volume, veracity, variability, and velocity of data produced from the
ever-increasing network of sensors connected to Internet pose challenges for
power management, scalability, and sustainability of cloud computing
infrastructure. Increasing the data processing capability of edge computing
devices at lower power requirements can reduce several overheads for cloud
computing solutions. This paper provides the review of neuromorphic
CMOS-memristive architectures that can be integrated into edge computing
devices. We discuss why the neuromorphic architectures are useful for edge
devices and show the advantages, drawbacks and open problems in the field of
neuro-memristive circuits for edge computing
- …