1,341 research outputs found
Extended Bit-Plane Compression for Convolutional Neural Network Accelerators
After the tremendous success of convolutional neural networks in image
classification, object detection, speech recognition, etc., there is now rising
demand for deployment of these compute-intensive ML models on tightly power
constrained embedded and mobile systems at low cost as well as for pushing the
throughput in data centers. This has triggered a wave of research towards
specialized hardware accelerators. Their performance is often constrained by
I/O bandwidth and the energy consumption is dominated by I/O transfers to
off-chip memory. We introduce and evaluate a novel, hardware-friendly
compression scheme for the feature maps present within convolutional neural
networks. We show that an average compression ratio of 4.4x relative to
uncompressed data and a gain of 60% over existing method can be achieved for
ResNet-34 with a compression block requiring <300 bit of sequential cells and
minimal combinational logic
CAS-CNN: A Deep Convolutional Neural Network for Image Compression Artifact Suppression
Lossy image compression algorithms are pervasively used to reduce the size of
images transmitted over the web and recorded on data storage media. However, we
pay for their high compression rate with visual artifacts degrading the user
experience. Deep convolutional neural networks have become a widespread tool to
address high-level computer vision tasks very successfully. Recently, they have
found their way into the areas of low-level computer vision and image
processing to solve regression problems mostly with relatively shallow
networks.
We present a novel 12-layer deep convolutional network for image compression
artifact suppression with hierarchical skip connections and a multi-scale loss
function. We achieve a boost of up to 1.79 dB in PSNR over ordinary JPEG and an
improvement of up to 0.36 dB over the best previous ConvNet result. We show
that a network trained for a specific quality factor (QF) is resilient to the
QF used to compress the input image - a single network trained for QF 60
provides a PSNR gain of more than 1.5 dB over the wide QF range from 40 to 76.Comment: 8 page
Slotted ALOHA Overlay on LoRaWAN: a Distributed Synchronization Approach
LoRaWAN is one of the most promising standards for IoT applications.
Nevertheless, the high density of end-devices expected for each gateway, the
absence of an effective synchronization scheme between gateway and end-devices,
challenge the scalability of these networks. In this article, we propose to
regulate the communication of LoRaWAN networks using a Slotted-ALOHA (S-ALOHA)
instead of the classic ALOHA approach used by LoRa. The implementation is an
overlay on top of the standard LoRaWAN; thus no modification in pre-existing
LoRaWAN firmware and libraries is necessary. Our method is based on a novel
distributed synchronization service that is suitable for low-cost IoT
end-nodes. S-ALOHA supported by our synchronization service significantly
improves the performance of traditional LoRaWAN networks regarding packet loss
rate and network throughput.Comment: 4 pages, 8 figure
XNOR Neural Engine: a Hardware Accelerator IP for 21.6 fJ/op Binary Neural Network Inference
Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to
conventional deep neural networks at a fraction of the cost in terms of memory
and energy. In this paper, we introduce the XNOR Neural Engine (XNE), a fully
digital configurable hardware accelerator IP for BNNs, integrated within a
microcontroller unit (MCU) equipped with an autonomous I/O subsystem and hybrid
SRAM / standard cell memory. The XNE is able to fully compute convolutional and
dense layers in autonomy or in cooperation with the core in the MCU to realize
more complex behaviors. We show post-synthesis results in 65nm and 22nm
technology for the XNE IP and post-layout results in 22nm for the full MCU
indicating that this system can drop the energy cost per binary operation to
21.6fJ per operation at 0.4V, and at the same time is flexible and performant
enough to execute state-of-the-art BNN topologies such as ResNet-34 in less
than 2.2mJ per frame at 8.9 fps.Comment: 11 pages, 8 figures, 2 tables, 3 listings. Accepted for presentation
at CODES'18 and for publication in IEEE Transactions on Computer-Aided Design
of Circuits and Systems (TCAD) as part of the ESWEEK-TCAD special issu
YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration
Convolutional neural networks (CNNs) have revolutionized the world of
computer vision over the last few years, pushing image classification beyond
human accuracy. The computational effort of today's CNNs requires power-hungry
parallel processors or GP-GPUs. Recent developments in CNN accelerators for
system-on-chip integration have reduced energy consumption significantly.
Unfortunately, even these highly optimized devices are above the power envelope
imposed by mobile and deeply embedded applications and face hard limitations
caused by CNN weight I/O and storage. This prevents the adoption of CNNs in
future ultra-low power Internet of Things end-nodes for near-sensor analytics.
Recent algorithmic and theoretical advancements enable competitive
classification accuracy even when limiting CNNs to binary (+1/-1) weights
during training. These new findings bring major optimization opportunities in
the arithmetic core by removing the need for expensive multiplications, as well
as reducing I/O bandwidth and storage. In this work, we present an accelerator
optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core
area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm and with a power
dissipation of 895 {\mu}W in UMC 65 nm technology at 0.6 V. Our accelerator
significantly outperforms the state-of-the-art in terms of energy and area
efficiency achieving 61.2 TOp/s/[email protected] V and 1135 GOp/s/[email protected] V, respectively
- …