An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration
In recent years, neural networks have surpassed classical algorithms in areas
such as object recognition, e.g. in the well-known ImageNet challenge. As a
result, great effort is being put into developing fast and efficient
accelerators, especially for Convolutional Neural Networks (CNNs). In this work
we present ConvAix, a fully C-programmable processor, which -- contrary to many
existing architectures -- does not rely on a hard-wired array of
multiply-and-accumulate (MAC) units. Instead it maps computations onto
independent vector lanes making use of a carefully designed vector instruction
set. The presented processor is targeted towards latency-sensitive applications
and is capable of executing up to 192 MAC operations per cycle. ConvAix
operates at a target clock frequency of 400 MHz in 28nm CMOS, thereby offering
state-of-the-art performance with proper flexibility within its target domain.
Simulation results for several 2D convolutional layers from well-known CNNs
(AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector
instructions with 16-bit fixed-point arithmetic. Compared to other well-known
designs which are less flexible, ConvAix offers competitive energy efficiency
of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and
processing speed. Comment: Accepted for publication in the proceedings of the 2019 IEEE
International Symposium on Circuits and Systems (ISCAS).
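As a rough sanity check on the ConvAix figures, the sketch below converts MACs per cycle and clock frequency into peak throughput. Counting one MAC as two operations is an assumption (a common convention that the abstract does not state), and `peak_gops` is a hypothetical helper name. Applying the reported average ALU utilization to the peak is only illustrative, since that average covers selected layers.

```python
def peak_gops(macs_per_cycle: int, clock_mhz: float, ops_per_mac: int = 2) -> float:
    """Peak throughput in GOP/s.

    ops_per_mac=2 is an assumption: one multiply plus one add per MAC.
    """
    return macs_per_cycle * ops_per_mac * clock_mhz / 1000.0


# Figures taken from the abstract: 192 MACs per cycle at 400 MHz.
peak = peak_gops(192, 400)

# Illustrative only: scale the peak by the reported 72.5% average ALU utilization.
sustained = peak * 0.725

print(f"peak: {peak:.1f} GOP/s, at 72.5% utilization: {sustained:.2f} GOP/s")
```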
A 0.3-2.6 TOPS/W Precision-Scalable Processor for Real-Time Large-Scale ConvNets
A low-power precision-scalable processor for ConvNets or convolutional neural
networks (CNN) is implemented in a 40nm technology. Its 256 parallel processing
units achieve a peak 102GOPS running at 204MHz. To minimize energy consumption
while maintaining throughput, this work is the first to both exploit the
sparsity of convolutions and to implement dynamic precision-scalability
enabling supply- and energy scaling. The processor is fully C-programmable,
consumes 25-288 mW at 204 MHz and scales efficiency from 0.3 to 2.6 real TOPS/W.
This system thereby outperforms the state-of-the-art by up to 3.9x in energy
efficiency. Comment: Published at the Symposium on VLSI Circuits, 2016, Honolulu, HI, USA.
Design-Space Exploration of Mixed-precision DNN Accelerators based on Sum-Together Multipliers
Mixed-precision quantization (MPQ) is gaining momentum in academia and industry as a way to improve the trade-off between accuracy and latency of Deep Neural Networks (DNNs) in edge applications. MPQ requires dedicated hardware to support different bit-widths. One approach uses Precision-Scalable MAC units (PSMACs) based on multipliers operating in Sum-Together (ST) mode. These can be configured to compute N = 1, 2, 4 multiplications/dot-products in parallel with operands at 16/N bits. We contribute to the State of the Art (SoA) in three directions: (i) we compare for the first time the SoA ST multiplier architectures in performance, power and area; (ii) going beyond previous work, we extend the portfolio of ST-based accelerators by proposing three designs for the most common DNN algorithms: 2D-Convolution, Depth-wise Convolution and Fully-Connected; (iii) we show how these accelerators can be obtained with a High-Level Synthesis (HLS) flow. In particular, we perform a design-space exploration (DSE) in area, latency and power, varying many knobs, including PSMAC unit parallelism, clock frequency and ST multiplier type. From the DSE on a 28-nm technology we observe that, at both multiplier level and accelerator level, there is no one-size-fits-all solution for every possible scenario. Our findings allow accelerator designers to choose, out of a rich variety, the best combination of ST multiplier and HLS knobs depending on the target: high performance, low area, or low power.
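The Sum-Together behaviour described in this abstract (N = 1, 2, 4 products at 16/N bits, summed into one result) can be modelled in a few lines. This is a behavioural sketch only, assuming unsigned operands packed low subword first. `st_multiply` is a hypothetical name, and the surveyed hardware may handle signs and packing differently.

```python
def st_multiply(a: int, b: int, n: int) -> int:
    """Behavioural model of a 16-bit multiplier in Sum-Together mode.

    Each 16-bit operand is split into n subwords of 16/n bits; the subword
    products are summed into a single dot-product result.
    Assumption: operands are unsigned.
    """
    if n not in (1, 2, 4):
        raise ValueError("n must be 1, 2 or 4")
    if not (0 <= a < 1 << 16 and 0 <= b < 1 << 16):
        raise ValueError("operands must fit in 16 bits")
    width = 16 // n
    mask = (1 << width) - 1
    total = 0
    for i in range(n):
        a_sub = (a >> (i * width)) & mask
        b_sub = (b >> (i * width)) & mask
        total += a_sub * b_sub
    return total


a, b = 0x1234, 0x0102
full = st_multiply(a, b, 1)   # one 16x16 product
pairs = st_multiply(a, b, 2)  # 0x34*0x02 + 0x12*0x01
quads = st_multiply(a, b, 4)  # 4*2 + 3*0 + 2*1 + 1*0
```

The same pair of 16-bit words thus yields one, two or four multiplications per call, which is the throughput-for-precision trade that MPQ hardware exploits.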
A Reconfigurable Depth-Wise Convolution Module for Heterogeneously Quantized DNNs
In Deep Neural Networks (DNNs), the depth-wise separable convolution has often replaced the standard 2D convolution because it has far fewer parameters and operations. Another common technique to compress DNNs is heterogeneous quantization, which uses a different bit-width for each layer. In this context we propose for the first time a Reconfigurable Depth-wise convolution Module (RDM), which uses multipliers that can be reconfigured to support 1, 2 or 4 operations at the same time at increasingly lower operand precision. We leveraged High-Level Synthesis to produce five RDM variants with different channel parallelism to cover a wide range of DNNs. Comparisons with a non-configurable Standard Depth-wise convolution Module (SDM) on a CMOS FDSOI 28-nm technology show a significant latency reduction for a given silicon area for the low-precision configurations.
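To make the parameter saving of depth-wise separable convolution concrete, the sketch below counts weights for a standard convolution and for a depth-wise plus point-wise pair. The layer sizes are illustrative assumptions, not values from the paper, and biases are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depth-wise stage followed by a 1 x 1 point-wise stage."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
std = standard_conv_params(3, 64, 128)
sep = depthwise_separable_params(3, 64, 128)
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```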
Neural Network Quantisation for Faster Homomorphic Encryption
Homomorphic encryption (HE) enables calculating on encrypted data, which
makes it possible to perform privacy-preserving neural network inference. One
disadvantage of this technique is that it is several orders of magnitude
slower than calculation on unencrypted data. Neural networks are commonly
trained using floating-point, while most homomorphic encryption libraries
calculate on integers, thus requiring a quantisation of the neural network. A
straightforward approach would be to quantise to large integer sizes (e.g. 32
bit) to avoid large quantisation errors. In this work, we reduce the integer
sizes of the networks, using quantisation-aware training, to allow more
efficient computations. For the targeted MNIST architecture proposed by Badawi
et al., we reduce the integer sizes by 33% without significant loss of
accuracy, while for the CIFAR architecture, we can reduce the integer sizes by
43%. Implementing the resulting networks under the BFV homomorphic encryption
scheme using SEAL, we could reduce the execution time of an MNIST neural
network by 80% and by 40% for a CIFAR neural network. Comment: 5 pages, 2 figures, 3 tables
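To illustrate what "integer size" means here, the following sketch maps floating-point weights to signed integers of a chosen bit-width using symmetric uniform quantisation. It is plain post-training rounding, shown as an assumption-laden stand-in; the paper itself uses quantisation-aware training, which this does not implement. `quantise` is a hypothetical helper.

```python
def quantise(weights, bits):
    """Symmetric uniform quantisation to signed integers of `bits` bits.

    Returns (integer weights, scale) so that weight ~= integer * scale.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


# Illustrative weights only; fewer bits give smaller integers and coarser steps.
q4, scale4 = quantise([-1.0, 0.2, 0.9], bits=4)
q8, scale8 = quantise([-1.0, 0.2, 0.9], bits=8)
```

Smaller integers are what allow cheaper arithmetic under an integer-based scheme such as BFV.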
XNOR Neural Engine: a Hardware Accelerator IP for 21.6 fJ/op Binary Neural Network Inference
Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to
conventional deep neural networks at a fraction of the cost in terms of memory
and energy. In this paper, we introduce the XNOR Neural Engine (XNE), a fully
digital configurable hardware accelerator IP for BNNs, integrated within a
microcontroller unit (MCU) equipped with an autonomous I/O subsystem and hybrid
SRAM / standard cell memory. The XNE is able to fully compute convolutional and
dense layers in autonomy or in cooperation with the core in the MCU to realize
more complex behaviors. We show post-synthesis results in 65nm and 22nm
technology for the XNE IP and post-layout results in 22nm for the full MCU
indicating that this system can drop the energy cost per binary operation to
21.6fJ per operation at 0.4V, and at the same time is flexible and performant
enough to execute state-of-the-art BNN topologies such as ResNet-34 in less
than 2.2mJ per frame at 8.9 fps. Comment: 11 pages, 8 figures, 2 tables, 3 listings. Accepted for presentation
at CODES'18 and for publication in IEEE Transactions on Computer-Aided Design
of Circuits and Systems (TCAD) as part of the ESWEEK-TCAD special issue
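The arithmetic that gives XNOR-based BNN engines their name can be sketched as follows: with +1 encoded as bit 1 and -1 as bit 0, a dot product reduces to XNOR plus a population count. This shows the general technique under that encoding assumption; it is not a description of the XNE datapath, and `pack` and `binary_dot` are hypothetical names.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word: int, b_word: int, n: int) -> int:
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask       # XNOR: 1 where the signs match
    matches = bin(agree).count("1")         # popcount
    return 2 * matches - n                  # matches minus mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
result = binary_dot(pack(a), pack(b), len(a))
```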
YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration
Convolutional neural networks (CNNs) have revolutionized the world of
computer vision over the last few years, pushing image classification beyond
human accuracy. The computational effort of today's CNNs requires power-hungry
parallel processors or GP-GPUs. Recent developments in CNN accelerators for
system-on-chip integration have reduced energy consumption significantly.
Unfortunately, even these highly optimized devices are above the power envelope
imposed by mobile and deeply embedded applications and face hard limitations
caused by CNN weight I/O and storage. This prevents the adoption of CNNs in
future ultra-low power Internet of Things end-nodes for near-sensor analytics.
Recent algorithmic and theoretical advancements enable competitive
classification accuracy even when limiting CNNs to binary (+1/-1) weights
during training. These new findings bring major optimization opportunities in
the arithmetic core by removing the need for expensive multiplications, as well
as reducing I/O bandwidth and storage. In this work, we present an accelerator
optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core
area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm² and with a power
dissipation of 895 µW in UMC 65 nm technology at 0.6 V. Our accelerator
significantly outperforms the state-of-the-art in terms of energy and area
efficiency achieving 61.2 TOp/s/W@0.6 V and 1135 GOp/s/MGE@1.2 V, respectively.
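The claim that binary weights remove the need for multiplications can be illustrated with a minimal sketch: when every weight is +1 or -1, each multiply-accumulate becomes an add or a subtract. The function name and data are hypothetical, and this models only the arithmetic, not the accelerator's architecture.

```python
def binary_weight_dot(activations, weight_signs):
    """Dot product with weights restricted to +1/-1, using only add/subtract."""
    acc = 0
    for x, w in zip(activations, weight_signs):
        if w not in (1, -1):
            raise ValueError("weights must be +1 or -1")
        acc = acc + x if w == 1 else acc - x
    return acc


activations = [3, 5, -2, 7]
weights = [1, -1, -1, 1]
result = binary_weight_dot(activations, weights)  # 3 - 5 + 2 + 7
```

Each weight also needs only one bit of storage, which is where the reduction in I/O bandwidth and memory comes from.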
Survey of Precision-Scalable Multiply-Accumulate Units for Neural-Network Processing
The current trend for deep learning has come with an enormous computational need for billions of Multiply-Accumulate (MAC) operations per inference. Fortunately, reduced precision has demonstrated large benefits with low impact on accuracy, paving the way towards processing in mobile devices and IoT nodes. Precision-scalable MAC architectures optimized for neural networks have recently gained interest thanks to their subword parallel or bit-serial capabilities. Yet, it has been hard to make a fair judgment of their relative benefits as they have been implemented with different technologies and performance targets. In this work, run-time configurable MAC units from ISSCC 2017 and 2018 are implemented and compared objectively under diverse precision scenarios. All circuits are synthesized in a 28nm commercial CMOS process with precision ranging from 2 to 8 bits. This work analyzes the impact of scalability and compares the different MAC units in terms of energy, throughput and area, aiming to understand the optimal architectures to reduce computation costs in neural-network processing
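As a hedged illustration of the bit-serial style mentioned in this abstract, the sketch below performs a multiply-accumulate by processing one weight bit per step, so lower weight precision means fewer steps. It assumes unsigned operands and is a generic model, not one of the surveyed ISSCC designs. `bit_serial_mac` is a hypothetical name.

```python
def bit_serial_mac(acc: int, activation: int, weight: int, weight_bits: int):
    """Accumulate activation * weight, consuming one weight bit per cycle.

    Returns (new accumulator, cycles used). Assumption: unsigned operands.
    """
    if not 0 <= weight < (1 << weight_bits):
        raise ValueError("weight does not fit in weight_bits")
    cycles = 0
    for i in range(weight_bits):
        if (weight >> i) & 1:
            acc += activation << i   # shift-and-add for each set bit
        cycles += 1
    return acc, cycles


acc4, cycles4 = bit_serial_mac(0, 13, 5, weight_bits=4)  # 13 * 5 in 4 cycles
acc2, cycles2 = bit_serial_mac(0, 13, 3, weight_bits=2)  # 13 * 3 in 2 cycles
```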