2,252 research outputs found
NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps
Convolutional neural networks (CNNs) have become the dominant neural network
architecture for solving many state-of-the-art (SOA) visual processing tasks.
Even though Graphical Processing Units (GPUs) are most often used in training
and deploying CNNs, their power efficiency is less than 10 GOp/s/W for
single-frame runtime inference. We propose a flexible and efficient CNN
accelerator architecture called NullHop that implements SOA CNNs useful for
low-power and low-latency application scenarios. NullHop exploits the sparsity
of neuron activations in CNNs to accelerate the computation and reduce memory
requirements. The flexible architecture allows high utilization of available
computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can
process up to 128 input and 128 output feature maps per layer in a single pass.
We implemented the proposed architecture on a Xilinx Zynq FPGA platform and
present results showing how our implementation reduces external memory
transfers and compute time in five different CNNs ranging from small ones up to
the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using
Mentor Modelsim in a 28nm process with a clock frequency of 500 MHz show that
the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop
achieves an efficiency of 368%, maintains over 98% utilization of the MAC
units, and achieves a power efficiency of over 3 TOp/s/W in a core area of
6.3 mm². As further proof of NullHop's usability, we interfaced its FPGA
implementation with a neuromorphic event camera for real-time interactive
demonstrations.
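The zero-skipping idea behind NullHop can be illustrated in plain Python (a software sketch with hypothetical names, not the actual hardware datapath): a binary sparsity map marks the non-zero activations, and multiply-accumulates are issued only for those positions, so computation scales with the number of non-zeros rather than the feature-map size.

```python
import numpy as np

def sparse_mac(activations, weights):
    """Accumulate products only where activations are non-zero.

    Software illustration of sparsity-driven MAC skipping: a binary
    sparsity map (one bit per pixel in hardware) marks non-zero
    entries, and a MAC is issued only for those positions.
    """
    sparsity_map = activations != 0
    total = 0.0
    macs_issued = 0
    for idx in np.argwhere(sparsity_map):
        i = tuple(idx)
        total += activations[i] * weights[i]
        macs_issued += 1
    return total, macs_issued

# ReLU feature maps are typically mostly zero.
acts = np.array([[0.0, 2.0],
                 [0.0, 0.0]])
w = np.array([[0.5, 0.5],
              [0.5, 0.5]])
out, macs = sparse_mac(acts, w)  # only 1 of 4 possible MACs is issued
```

A dense implementation would issue one MAC per position regardless of value; here the sparsity map lets three of the four be skipped with no change to the result.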
XNOR Neural Engine: a Hardware Accelerator IP for 21.6 fJ/op Binary Neural Network Inference
Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to
conventional deep neural networks at a fraction of the cost in terms of memory
and energy. In this paper, we introduce the XNOR Neural Engine (XNE), a fully
digital configurable hardware accelerator IP for BNNs, integrated within a
microcontroller unit (MCU) equipped with an autonomous I/O subsystem and hybrid
SRAM / standard cell memory. The XNE is able to fully compute convolutional and
dense layers in autonomy or in cooperation with the core in the MCU to realize
more complex behaviors. We show post-synthesis results in 65nm and 22nm
technology for the XNE IP and post-layout results in 22nm for the full MCU
indicating that this system can drop the energy cost per binary operation to
21.6 fJ per operation at 0.4 V, while remaining flexible and performant
enough to execute state-of-the-art BNN topologies such as ResNet-34 in less
than 2.2 mJ per frame at 8.9 fps.
Comment: 11 pages, 8 figures, 2 tables, 3 listings. Accepted for presentation
at CODES'18 and for publication in IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems (TCAD) as part of the ESWEEK-TCAD special
issue.
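The arithmetic core of a BNN engine like the XNE rests on one trick: with weights and activations constrained to {-1, +1} and encoded as bits {0, 1}, an n-element dot product collapses to a single XNOR followed by a population count. A minimal sketch (encoding and function names are illustrative, not the XNE's interface):

```python
def xnor_popcount_dot(a_bits, b_bits, n):
    """Binary dot product via XNOR + popcount.

    Encoding +1 -> 1 and -1 -> 0, each bitwise match contributes +1
    and each mismatch -1, so the dot product is
        2 * popcount(~(a XOR b)) - n.
    This replaces n multiplications with one XNOR and a popcount.
    """
    mask = (1 << n) - 1                 # keep only the n valid bits
    xnor = ~(a_bits ^ b_bits) & mask    # 1 where the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n

# a = [+1, -1, +1, -1] -> 0b1010,  b = [+1, +1, -1, -1] -> 0b1100
result = xnor_popcount_dot(0b1010, 0b1100, 4)
```

Checking against the real-valued dot product: (+1)(+1) + (-1)(+1) + (+1)(-1) + (-1)(-1) = 0, which is what the bitwise version returns.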
Direct Feedback Alignment with Sparse Connections for Local Learning
Recent advances in deep neural networks (DNNs) owe their success to training
algorithms that use backpropagation and gradient-descent. Backpropagation,
while highly effective on von Neumann architectures, becomes inefficient when
scaling to large networks. Commonly referred to as the weight transport
problem, each neuron's dependence on the weights and errors located deeper in
the network requires exhaustive data movement, which presents a key obstacle to
improving the performance and energy efficiency of machine-learning hardware.
In this work, we propose a bio-plausible alternative to backpropagation drawing
from advances in feedback alignment algorithms in which the error computation
at a single synapse reduces to the product of three scalar values. Using a
sparse feedback matrix, we show that a neuron needs only a fraction of the
information previously used by the feedback alignment algorithms. Consequently,
memory and compute can be partitioned and distributed whichever way produces
the most efficient forward pass so long as a single error can be delivered to
each neuron. Our results show orders-of-magnitude improvements in data movement
and in multiply-and-accumulate operations over backpropagation.
Like previous work, we observe that any variant of feedback
alignment suffers significant losses in classification accuracy on deep
convolutional neural networks. By transferring trained convolutional layers and
training the fully connected layers using direct feedback alignment, we
demonstrate that direct feedback alignment can obtain results competitive with
backpropagation. Furthermore, we observe that using an extremely sparse
feedback matrix, rather than a dense one, results in a small accuracy drop
while yielding hardware advantages. All the code and results are available
under https://github.com/bcrafton/ssdfa.
Comment: 15 pages, 8 figures.
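The claim that the error computation at a single synapse reduces to a product of scalars can be made concrete with a small numpy sketch (names and shapes here are hypothetical, chosen to illustrate the structure rather than reproduce the paper's code): the global output error is projected through a fixed, random, possibly sparse feedback row, and the resulting scalar multiplies the local activation derivative and the presynaptic input.

```python
import numpy as np

def dfa_update(x, e_out, B_row, phi_prime, lr=0.1):
    """Direct-feedback-alignment weight update for one neuron.

    The neuron's error term is a single scalar: the global error
    vector e_out projected through a fixed feedback row B_row.
    Each synapse's update is then a product of three scalars:
    local error * activation derivative * presynaptic input.
    """
    local_err = float(B_row @ e_out)      # one scalar per neuron
    return -lr * local_err * phi_prime * x

rng = np.random.default_rng(0)
e = rng.normal(size=8)                    # output-layer error

# Sparse feedback: a single non-zero entry means the neuron needs
# only one component of the global error, as in the sparse variant.
B_sparse = np.zeros(8)
B_sparse[3] = 1.0

dw = dfa_update(x=np.array([1.0, 0.5]), e_out=e,
                B_row=B_sparse, phi_prime=1.0)
```

With the sparse feedback row, only `e[3]` ever needs to be delivered to this neuron, which is the property that allows memory and compute to be partitioned freely around the forward pass.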
Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars
Event cameras are bio-inspired vision sensors that naturally capture the
dynamics of a scene, filtering out redundant information. This paper presents a
deep neural network approach that unlocks the potential of event cameras on a
challenging motion-estimation task: prediction of a vehicle's steering angle.
To make the best out of this sensor-algorithm combination, we adapt
state-of-the-art convolutional architectures to the output of event sensors and
extensively evaluate the performance of our approach on a publicly available
large scale event-camera dataset (~1000 km). We present qualitative and
quantitative explanations of why event cameras allow robust steering prediction
even in cases where traditional cameras fail, e.g. challenging illumination
conditions and fast motion. Finally, we demonstrate the advantages of
leveraging transfer learning from traditional to event-based vision, and show
that our approach outperforms state-of-the-art algorithms based on standard
cameras.
Comment: 9 pages, 8 figures, 6 tables. Video: https://youtu.be/_r_bsjkJTH
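Adapting convolutional architectures to an event sensor's output typically starts with a representation step. One common choice (sketched here with hypothetical names; the paper's exact preprocessing may differ) is to accumulate the asynchronous (x, y, polarity) events of a time window into a dense two-channel frame that a standard CNN can consume:

```python
import numpy as np

def events_to_frame(events, shape):
    """Accumulate (x, y, polarity) events into a 2-channel frame.

    Positive- and negative-polarity events are histogrammed into
    separate channels over a fixed time window, turning the sparse
    asynchronous event stream into a CNN-friendly dense tensor.
    """
    frame = np.zeros((2,) + shape, dtype=np.float32)
    for x, y, p in events:
        channel = 0 if p > 0 else 1
        frame[channel, y, x] += 1.0
    return frame

# Three events on a 2x2 sensor: two positive at (0, 0), one negative at (1, 1).
evts = [(0, 0, +1), (0, 0, +1), (1, 1, -1)]
frame = events_to_frame(evts, (2, 2))
```

Because only pixels that saw brightness changes receive counts, static or redundant parts of the scene contribute nothing, which is the property the abstract credits for robustness under fast motion and hard illumination.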
YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration
Convolutional neural networks (CNNs) have revolutionized the world of
computer vision over the last few years, pushing image classification beyond
human accuracy. The computational effort of today's CNNs requires power-hungry
parallel processors or GP-GPUs. Recent developments in CNN accelerators for
system-on-chip integration have reduced energy consumption significantly.
Unfortunately, even these highly optimized devices are above the power envelope
imposed by mobile and deeply embedded applications and face hard limitations
caused by CNN weight I/O and storage. This prevents the adoption of CNNs in
future ultra-low power Internet of Things end-nodes for near-sensor analytics.
Recent algorithmic and theoretical advancements enable competitive
classification accuracy even when limiting CNNs to binary (+1/-1) weights
during training. These new findings bring major optimization opportunities in
the arithmetic core by removing the need for expensive multiplications, as well
as reducing I/O bandwidth and storage. In this work, we present an accelerator
optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core
area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm² and with a power
dissipation of 895 µW in UMC 65 nm technology at 0.6 V. Our accelerator
significantly outperforms the state-of-the-art in terms of energy and area
efficiency, achieving 61.2 TOp/s/W@0.6 V and 1135 GOp/s/MGE@1.2 V, respectively.
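The "removing the need for expensive multiplications" point can be made concrete: with weights restricted to +1/-1, every product in a MAC reduces to adding or subtracting the activation. A minimal software sketch of that idea (hypothetical names, not YodaNN's datapath):

```python
import numpy as np

def binary_weight_mac(activations, weight_signs):
    """Multiplier-free MAC with binary (+1/-1) weights.

    Each weight only selects whether the activation is added or
    subtracted, so the hardware needs adders and sign muxes but
    no multipliers -- and each weight costs 1 bit of storage and
    I/O bandwidth instead of a full-precision word.
    """
    acc = 0.0
    for a, s in zip(activations, weight_signs):
        acc = acc + a if s > 0 else acc - a  # no multiplication
    return acc

acts = np.array([0.5, -1.0, 2.0])
signs = np.array([+1, -1, +1])
y = binary_weight_mac(acts, signs)  # equals acts @ signs
```

The result matches the ordinary dot product exactly; what changes is the cost per operation, which is where the energy and area advantages come from.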