Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute- and memory-intensive, which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces the computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding I/O bandwidth and system-level efficiency that are crucial for deployment of accelerators in ultra-low power devices. We present Hyperdrive: a BWN accelerator that dramatically reduces the I/O bandwidth by exploiting a novel binary-weight streaming approach, and that is capable of handling high-resolution images by virtue of its systolic-scalable architecture. We achieve a 5.9 TOp/s/W system-level efficiency (i.e. including I/Os), 2.2x higher than state-of-the-art BNN accelerators, even though our core uses resource-intensive FP16 arithmetic for increased robustness.
I. INTRODUCTION
Over the last few years, deep neural networks (DNNs) have revolutionized computer vision and data analytics. Particularly in computer vision, they have become the leading approach for the majority of tasks with rapidly growing data set sizes and problem complexity, achieving beyond-human accuracy in tasks like image classification. What started with image recognition for handwritten digits has moved to data sets with millions of images and 1000s of classes [1, 2]. What used to be image recognition on small images [3, 4] has evolved to object segmentation and detection [5-8] in high-resolution frames, and the next step, video analysis, is already starting to gain traction [9-11]. Many applications from automated surveillance to personalized interactive advertising and augmented reality have real-time constraints, such that the required computation can only be run on powerful GPU servers and data center accelerators such as Google's TPUs [12].
At the same time, we observe the trend towards the "internet of things" (IoT), where connected sensor nodes are becoming ubiquitous in our lives in the form of fitness trackers, smartphones, and surveillance cameras [13, 14]. This creates a data deluge that is never analyzed and that raises privacy concerns when collected at a central site [15]. Gathering all this data centrally is largely infeasible, as the cost of communication is very high in terms of network infrastructure, but also reliability, latency, and ultimately available energy in mobile devices [16]. Centralized analysis in the cloud also does not solve the compute problem; it merely shifts it around, and service providers might not be willing to carry the processing cost while customers do not want to share their privacy-sensitive data [17].
A viable approach to address these issues is edge computing: analyzing the vast amount of data close to the sensor and transmitting only condensed, highly informative data [13, 18]. This information is often many orders of magnitude smaller in size, e.g. a class ID instead of an image, or even only an alert every few days instead of a continuous video stream. However, this implies that the data analysis has to fit within the power constraints of IoT nodes, which are often small form-factor devices with batteries of limited capacity, or even devices deployed with a set-and-forget strategy relying on on-board energy harvesting (solar, thermal, kinetic, . . . ) [19].
Recently, several methods to train neural networks to withstand extreme quantization have been proposed, yielding the notions of binary- and ternary-weight networks (BWNs, TWNs) and binarized neural networks (BNNs) [20-22]. BWNs and TWNs allow a massive reduction of the data volume required to store the network and have been applied to recent, high-complexity networks with an almost negligible accuracy loss. In parallel, the VLSI research community has been developing specialized hardware architectures focusing on data re-use with limited resources, optimizing arithmetic precision, exploiting weight and feature map (FM) sparsity, and performing on-the-fly data compression to ultimately maximize energy efficiency [23]. However, these implementations fall into one of two categories: 1) they stream the entire or even partial FMs into and out of the accelerator, ending up in a regime where the I/O energy is far in excess of the energy spent on computation and hitting an energy efficiency wall: the state-of-the-art accelerator presented in [24] has a core energy efficiency of 59 TOp/s/W, but including I/O power it is limited to 1 TOp/s/W; or 2) they assume that the entire network's weights and intermediate FMs can be stored on-chip. This severely constrains the size of the DNNs that can be handled efficiently by a small, low-cost IoT end-node class chip. It also prevents the analysis of high-resolution images, thus precluding many relevant applications such as object detection.
In this work, we convey the following key contributions:
• Hyperdrive, a BWN accelerator based on a novel binary-weight streaming approach that keeps the entire FMs on-chip, dramatically reducing the I/O bandwidth and energy.
• A systolic-scalable architecture that can be extended to multiple tiles on-chip and to multiple chips arranged in a 2D mesh, enabling high-resolution images and applications such as object detection.
• A silicon implementation in 22 nm FDX technology achieving a system-level (i.e. including I/Os) energy efficiency of 5.9 TOp/s/W, 2.2× higher than the state-of-the-art, despite using robust FP16 arithmetic for the FMs.
II. RELATED WORK

A. Software-Programmable Platforms
The availability of cheap computing power on GPUs and of large data sets has sparked the deep learning revolution, starting with AlexNet's landslide win in the ILSVRC image recognition challenge in 2012 [25]. Since then, we have seen optimized implementations [26, 27] and algorithmic advances such as FFT-based and Winograd convolutions further raising the throughput [28, 29]. The availability of easy-to-use deep learning frameworks exploiting the power of GPUs transparently to the user has resulted in widespread use of GPU computing. With the growing market size, improved hardware has become available as well: Nvidia has introduced a product line of systems-on-chip for embedded applications, where ARM cores are co-integrated with small GPUs for a power range of 5-20 W and ≈50 GOp/s/W. The GPU architecture has also been optimized for DNN workloads, introducing tensor cores and fast half-precision floating-point (fp16) support. The latest device, Nvidia's V100, achieves 112 TFLOPS at 250 W [30], an energy efficiency of 448 GOp/s/W. Its main competitor, Google's TPU [12], works with 8-bit arithmetic and achieves 92 TOp/s at 384 W (240 GOp/s/W). With these power budgets, however, such devices are unsuitable for IoT end-nodes.
B. Co-Design of DNN Models and Hardware
Over the last few years, several approaches adapting DNNs to reduce the computation effort have been presented. One main direction is the reduction of the number of operations and of the model size. Specifically, the introduction of sparsity provides an opportunity to skip some operations. By pruning the weights, high sparsity can be achieved, particularly for the fully-connected layers found at the end of many networks. Furthermore, the ReLU activations used in most DNN models inject sparsity into the FMs, which can also be exploited [31, 32].
A different direction is research into reduced-precision computation. Standard fixed-point approaches work down to 10-16 bit number formats for many networks. It is possible to further reduce the precision to 8 bit with small accuracy losses (<1%) when retraining the network to adapt to this quantization [33]. There are limitations to this: 1) for deeper networks, higher accuracy losses (2-3% for GoogLeNet) remain, and 2) typically only the inputs to the convolutions are quantized in this format, while internal computations are performed at full precision. This implies that the internal precision is very high for large networks: e.g., for a 3×3 convolution layer with 512 input FMs, the accumulation adds 12 bit of dynamic range (log₂(3·3·512) ≈ 12.2). Further approaches include non-linearly spaced quantization in the form of mini-floats [33], and power-of-two quantization levels replacing multiplications with bit-shift operations [20].
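To make these quantization schemes concrete, the following minimal Python/NumPy sketch (not from the paper; function names and scaling choices are illustrative assumptions) shows symmetric fixed-point quantization and power-of-two quantization, where the multiplication can be replaced by a bit-shift of the activation:

```python
import numpy as np

def quantize_fixed_point(w, bits=8):
    # Symmetric uniform quantization to `bits` bits, as used when networks are
    # retrained to tolerate 8-bit inference.
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale).astype(np.int32), scale

def quantize_power_of_two(w):
    # Power-of-two levels: each weight becomes sign(w) * 2^e, so the multiply
    # in a convolution reduces to a bit-shift of the activation.
    exponent = np.round(np.log2(np.abs(w) + 1e-12))
    return np.sign(w) * 2.0 ** exponent

w = np.random.randn(3, 3, 64) * 0.1
q, s = quantize_fixed_point(w)
p = quantize_power_of_two(w)
```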
Several efforts have taken the path of extreme quantization, using binary (+1/-1) or ternary (+1/0/-1) weights while computing the FMs in floating point. This massively compresses the data volume of the weights and has been shown to be applicable even to deep networks, with an accuracy loss of approximately 1.6% for ResNet-18 [20] and thus less than the fixed-point-and-retrain strategies. The most extreme approach are (fully) binarized neural networks (BNNs), where both the weights and the FMs are binarized [34]. While this approach is attractive for extremely resource-constrained devices [18, 35], the associated accuracy loss of 16% on ResNet-18 is unacceptable for many applications.
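As an illustration of binary-weight quantization, the sketch below binarizes a filter to {-α, +α} with a per-filter scaling factor α = mean(|w|), a common choice in the BWN literature (an assumption for this example, not necessarily the exact training scheme of the cited works):

```python
import numpy as np

def binarize_weights(w):
    # Binary-weight quantization: weights collapse to {-alpha, +alpha}.
    # The per-filter scale alpha = mean(|w|) is one common choice.
    alpha = np.mean(np.abs(w))
    return alpha * np.sign(w), alpha

# A 3x3 filter over 16 input channels: 144 binary weights (18 bytes) instead of
# 144 fp16 values (288 bytes), i.e. a 16x compression of the weight volume.
w = np.random.randn(16, 3, 3).astype(np.float16)
w_bin, alpha = binarize_weights(w)
```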
C. FPGA and ASIC Accelerators
Many hardware architectures targeting DNNs have been published over the last few years. The peak compute energy efficiency of fixed-point CNN accelerators with >8 bit lies at around 50 GOp/s/W for FPGAs, 2 TOp/s/W in 65 nm, and around 10 TOp/s/W when projected to 28 nm [36-39]. However, these figures do not include the I/O energy for streaming the FMs, or they assume that intermediate results can be entirely stored in the limited-size on-chip memory. Streaming the FMs results in a device-level energy efficiency wall at around 1 TOp/s/W [24], while requiring the data to fit entirely into on-chip memory renders the device very large.
Many of the sparsity-based optimizations mentioned in Sec. II-B have been implemented in hardware accelerators [32, 40]; they achieve up to 3× higher core energy efficiency and raise the device-level energy efficiency by around 70% through data compression. Training DNNs to become BWNs has shown the biggest impact on core compute-only energy, with an energy efficiency of 60 TOp/s/W in 65 nm [24]. However, with the present architectures, the fundamental efficiency limitation imposed by the I/O energy remains.
III. HYPERDRIVE ARCHITECTURE
The Hyperdrive architecture we propose in this work is fundamentally different from previous BWN accelerators [24, 41] in two aspects: 1) it keeps the FMs on-chip and streams the weights, exploiting the binary nature of the weights and significantly reducing the I/O bandwidth; 2) its hierarchical, systolic-scalable structure allows scaling the complexity and resolution of the networks, both on-chip, by instantiating multiple computing tiles within an accelerator, and off-chip, by instantiating multiple accelerators in a 2D mesh.
The system architecture is composed of the following components, illustrated in Fig. 1: an array of Tile Processing Units (TPUs); an FM-wise batch-normalization unit shared among the TPUs of the same tile; a ReLU activation unit; and on-chip memory for optimal energy efficiency [24]. Each TPU is assigned to a tile and an output FM, and is connected to its 8 neighbors on the X,Y axes to quickly access neighbouring pixels.
IV. COMPUTATIONAL MODEL
In state-of-the-art CNNs such as ResNet-34, both the compute complexity and the number of weights are large. However, for BWNs, streaming the weights rather than the FMs (or both) is particularly attractive, since the binary weights are 16× more compact than fp16 values. Moreover, the weights are read only once from off-chip memory and do not need to be streamed out again. To exploit these features, we propose a weight-streaming approach where all the weights are streamed in only once during the execution of the overall network and the FMs are stored in on-chip memory. Weights are streamed in only for computing the first pixel of the current output FMs; they are then stored in a weight buffer and read from there for the following pixels, avoiding streaming them to the chip again. This approach significantly reduces the I/O bandwidth of the accelerator, by a factor of 5× for 224×224 or 710× for 2048×1024 sized images for several ResNet configurations, as shown in Tbl. II.
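The following back-of-the-envelope sketch illustrates why streaming binary weights is preferable to streaming fp16 FMs; the layer dimensions are illustrative examples, not the exact figures behind Tbl. II:

```python
# Off-chip traffic for one 3x3 convolution layer (illustrative sizes):
# streaming binary weights once vs. streaming fp16 feature maps in and out.
n_in, n_out = 256, 256           # input / output FMs
h, w = 56, 56                    # spatial resolution
weight_bits = n_out * n_in * 3 * 3 * 1   # binary weights, streamed once
fm_bits = (n_in + n_out) * h * w * 16    # fp16 FMs, in and out
print(f"weights: {weight_bits/8e6:.2f} MB, FMs: {fm_bits/8e6:.2f} MB, "
      f"ratio: {fm_bits/weight_bits:.1f}x")
# For high-resolution inputs (e.g. 2048x1024) the FM volume grows with h*w
# while the weight volume stays constant, hence Hyperdrive streams the weights.
```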
The computational model of Hyperdrive is based on a first phase in which the input FMs of the first layer are partitioned and loaded into the on-chip systolic array, as illustrated in Fig. 3. During this phase, the FMs are assigned to the tile processing units (TPUs), which are organized as a 3D-mesh systolic array and follow a single instruction multiple data (SIMD) execution model. Hence, the input FMs are tiled into blocks assigned along the X,Y dimensions, while the output FMs, which re-use the same input FMs, are assigned depth-wise to the TPUs along the Z dimension (Fig. 3).
Once the input FMs are loaded into the array, the execution starts, as illustrated in Tbl. I for an implementation of the architecture featuring 16×7×7 TPUs with 8×8 sized tiles and for a 3×3 convolution layer with 16×64 FMs.
Each TPU calculates one pixel of its assigned tile of its assigned output FM, with all TPUs working on the same input FM at any given time. The TPUs assigned to the same tile share the input FM pixels, which reduces memory reads by M×; the weights are read once per FM but shared among the spatially distributed TPUs, which reduces the weight buffer reads by N²×. The results are stored in the accumulation register. In more detail, the execution can be summarized as 4 nested execution phases, sketched in the reference model below. Networks with residual connections feature bypass paths (Fig. 5); hence, management of the bypass has been introduced into the architecture. When all contributions to the output FM pixels are summed up, the bypass FM is read from memory and added (if a bypass exists). Finally, biasing and batch normalization are applied and the result is stored back to the FM memory.
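The following runnable reference model sketches the nested accumulation over output FMs, input FMs and filter positions; the sizes and the exact loop ordering are assumptions inferred from the description above, not taken verbatim from the paper:

```python
import numpy as np

# Illustrative sizes: 4 input FMs, 2 output FMs, 8x8 tile, 3x3 binary kernel.
n_in, n_out, H, W, K = 4, 2, 8, 8, 3
fm_in = np.random.randn(n_in, H, W).astype(np.float16)
weights = np.sign(np.random.randn(n_out, n_in, K, K)).astype(np.float16)  # +1/-1
fm_in_pad = np.pad(fm_in, ((0, 0), (1, 1), (1, 1)))
acc = np.zeros((n_out, H, W), dtype=np.float32)

for m in range(n_out):                 # output FMs (mapped to the Z dimension)
    for c in range(n_in):              # each input FM is shared by all TPUs
        for dy in range(K):            # filter positions: weights read once,
            for dx in range(K):        # shared among the spatial TPUs
                w = weights[m, c, dy, dx]
                acc[m] += w * fm_in_pad[c, dy:dy + H, dx:dx + W]
fm_out = np.maximum(acc, 0)            # bias / batch-norm omitted for brevity
```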
When the current layer's output FMs are entirely calculated, the next layer is processed, while the FMs stay on-chip and only the new weights are streamed in. This continues until the final layer is computed, as illustrated in Fig. 4.
A. CNN Mapping
The size of the on-chip memory for intermediate FM storage has to be selected according to the convolution layer with the largest memory footprint in the network, the Worst-Case Layer (WCL). Typically, the WCL is at the beginning of the network, since a common design pattern is to double the number of FMs after a few layers while at the same time performing a 2×2-strided operation, thereby reducing the number of pixels by 4× and the total FM volume by 2×. To perform the computations layer-by-layer while avoiding power-hungry dual-port memories, we leverage a ping-pong buffering mechanism, reading from one memory bank and writing the results to a different memory bank. Hence, for a generic CNN the amount of memory required by the WCL is $\max_{l \in \text{layers}} \left( n_{in,l}\, h_{in,l}\, w_{in,l} + n_{out,l}\, h_{out,l}\, w_{out,l} \right)$ words, since all input and output FMs of that layer have to be stored to implement the described ping-pong buffering mechanism.
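As a concrete example of this sizing rule, the sketch below evaluates the WCL formula for a few illustrative ResNet-34-like layer shapes at 224×224 input resolution; the layer list is an assumption for the example, not an exhaustive enumeration:

```python
# Worst-Case Layer (WCL) sizing: max over layers of input + output FM volume.
layers = [  # (n_in, h_in, w_in, n_out, h_out, w_out)
    (64, 56, 56, 64, 56, 56),     # conv2_x blocks
    (64, 56, 56, 128, 28, 28),    # strided transition
    (128, 28, 28, 128, 28, 28),   # conv3_x blocks
    (256, 14, 14, 256, 14, 14),   # conv4_x blocks
    (512, 7, 7, 512, 7, 7),       # conv5_x blocks
]
wcl_words = max(ni * hi * wi + no * ho * wo for ni, hi, wi, no, ho, wo in layers)
print(f"WCL footprint: {wcl_words} words ({wcl_words * 16 / 1e6:.1f} Mbit at fp16)")
# -> 401408 words, i.e. the ~401 kword / 6.4 Mbit figure reported in the text.
```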
However, many networks have bypass paths, hence additional intermediate FMs have to be stored, as described in Fig. 5 for the potential WCLs of ResNet-34. This aspect has two implications: 1) In order to avoid additional memory (+50%), we perform an on-the-fly addition of the bypass path after the second 3×3 convolution (i.e. the dashed rectangle is a single operation). This is done by performing a read-add-write operation on the target memory locations.
2) The common transition pattern with the 2×2-strided convolution does not require additional memory. It temporarily needs three memory segments, but two of them are 2× smaller and can fit into what has been a single memory segment before (M2 is split into two equal-size segments M2.1 and M2.2).
For ResNet-18 and -34 on ImageNet data samples (i.e. for image recognition), the total required memory is 401 kword. Using the same procedure to determine the amount of memory for ResNet-50/-152/. . . shows that, due to their different structure (i.e. the bottleneck building block), three memory segments are required, with a total of 1.2 Mword. If the structure is fixed, the required memory size does not depend on the network depth. Note that if enough silicon area is available and scalability to arbitrarily deep networks is not required, on-chip storage of the weights should be considered, eliminating almost all of the remaining I/O transfers. For ResNet-34, this approach would require 21 Mbit and thus 6.3 mm² of SRAM, and for ResNet-18 around 3 mm² (at 0.3 μm²/bit in GF 22 nm FDX).
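A quick sanity check of the weight-storage area estimate, using only the figures quoted above:

```python
# On-chip storage of all ResNet-34 binary weights (figures from the text).
sram_bit_area_um2 = 0.3          # GF 22 nm FDX high-density SRAM, per bit
resnet34_weight_bits = 21e6      # ~21 Mbit of binary weights
area_mm2 = resnet34_weight_bits * sram_bit_area_um2 / 1e6
print(f"{area_mm2:.1f} mm^2")    # ~6.3 mm^2, matching the estimate above
```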
B. Scalability to Multiple Chips
The proposed architecture is trivially scalable to high-resolution data on a single die; however, production yield diminishes and cost explodes for large die sizes, and many chips of varying size (for different image resolutions) would have to be manufactured for cost-efficient volume production. We address this issue and allow flexible scaling of not just the image resolution (i.e. the on-chip memory), but also the performance at a fixed die size, by extending the systolic design approach to the chip level and connecting multiple Hyperdrive chips on the circuit board or an interposer.
Hyperdrive can be scaled to a systolic array of m × n chips, where the FMs are split such that every chip keeps M × N tiles and the entire FM is partitioned into M·m × N·n tiles. As the convolution window overlaps with the tiles of the neighboring chips, data needs to be exchanged between the chips. Since FM values are read repeatedly, we transfer them as soon as they are computed and then buffer them in a border memory. For 3×3 filter support, each chip has to be able to keep $\max_{l}\left( (2 M h_{tile} + 2 N w_{tile} + 4)\, n_{in,l} \right)$ values from the neighboring chips, where $h_{tile}$, $w_{tile}$ denote the spatial dimensions of each tile (e.g. 8×8). Chips at the boundary of the systolic grid perform zero-padding instead of using this memory.
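Evaluating this border-memory formula for the single-chip configuration used elsewhere in the paper (7×7 tiles of 8×8 pixels, up to 512 input FMs) gives an idea of the required buffer size; combining these particular numbers is our own example, not a figure reported in the paper:

```python
# Per-chip border memory for 3x3 filter support (illustrative evaluation).
M, N = 7, 7            # tiles per chip in X and Y
h_tile, w_tile = 8, 8  # spatial size of each tile
n_in_max = 512         # worst-case number of input FMs across all layers
border_values = (2 * M * h_tile + 2 * N * w_tile + 4) * n_in_max
print(f"{border_values} border values "
      f"({border_values * 16 / 1e6:.2f} Mbit at fp16) per chip")
```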
V. RESULTS
In the following discussion, the number of tiles is chosen as 7×7, which allows for 4× striding on 112×112 sized input FMs (as in common ResNet-like networks) while keeping all TPUs busy with at least one spatial pixel throughout the entire network. Half-precision floating-point numbers are used for the FMs, as this gives higher flexibility and greatly eases training on common frameworks.¹ The on-chip memory was sized to fit the WCL (i.e. ResNet-like networks with bypass layers and 112×112 input FMs) with 6.4 Mbit (400 kword) and is implemented with y·8 = 7·8 = 56 high-density single-port SRAMs with 1024 lines of x·16 = 7·16 = 112-bit words, where the memories are assigned to the x×y tiles. The output FM parallelism has been fixed to 16 to optimize the trade-off between area, performance and power. The weight buffer has been sized to fit up to 512 (the maximum number of input FMs) 3×3 kernels for the 16× depth-wise parallelism. If more input FMs are needed, they can be tiled into blocks of 512, and partial output FMs can be calculated and summed up on-the-fly using the bypass mode. The frequently accessed weight buffer has been implemented as a latch-based memory composed of 5×8 blocks of 128 rows of 16-bit words, reducing the access energy by 43× compared to SRAM memories.
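The memory sizing above can be cross-checked with a few lines of arithmetic, using only the values given in the text:

```python
# Sanity check of the FM memory and weight buffer capacities described above.
srams = 7 * 8                      # high-density single-port SRAM macros
fm_bits = srams * 1024 * 112       # 1024 lines of 112-bit words each
print(f"FM memory: {fm_bits/1e6:.1f} Mbit = {fm_bits // 16} fp16 words")  # ~6.4 Mbit / ~400 kword

wbuf_bits = 512 * 3 * 3 * 16       # 512 input FMs x 3x3 kernel x 16 output FMs, binary
print(f"weight buffer: {wbuf_bits/1e3:.0f} kbit")                          # ~74 kbit
```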
A. Implementation Results
Hyperdrive was implemented in 22 nm FDX technology based on the standard cells of INVECAS (8 track, LVT, V05) and has been sent for tape-out. The chip has an effective core area of 1.92 mm² (=9.6 MGE)², where 1.24 mm² are SRAM memories (6.4 Mbit), 0.115 mm² is SCM memory (74 kbit), and 0.32 mm² are arithmetic units.
The power consumption is evaluated on a reference 3×3 convolution layer with 16→16 FMs of 56×56 pixels, including bias and batch normalization, on the post-layout netlist with timing (sdf) and parasitics annotations (spef), at different supply voltages and at an operating temperature of 25 °C in typical operating conditions. The I/O energy was determined on the basis of an LPDDR3 PHY in 28 nm technology [36] and thus estimated at 21 pJ/bit; this should be considered a lower bound, as less advanced PHYs require more energy per bit, making the estimate pessimistic for our architecture, which transfers far less data than FM-streaming designs. Tbl. III gives an overview of the power consumption of the various blocks of Hyperdrive at 0.65 V (typical corner) and 160 MHz. The memory accounts for the largest share of the power, consuming 9.29 mW, while the arithmetic units use 8.25 mW. Even though the weight buffer is accessed every cycle, it consumes only 0.16 mW.
¹ Using fixed-point numbers is considered as a potential extension; however, this would need to be supported by more complex quantization-aware training.
² One 2-input NAND gate equivalent (GE) is 0.199 μm² in GF22.
Tbl. IV gives an overview of the key metrics of Hyperdrive. The design runs at 100 MHz in the low-voltage corner at 0.59 V (125 °C, slow), reaching a throughput of 156.8 GOp/s and a core energy efficiency of 6.1 TOp/s/W; at 0.9 V (25 °C, typical), a throughput of 539.4 GOp/s can be achieved.
B. Benchmarking
We evaluated Hyperdrive on ResNet-34, one of the most prominent networks. Among the residual networks, it features a good trade-off between depth and accuracy: ResNet-50 outperforms ResNet-34 by just 0.5% (Top-1), but is roughly 50% more compute-intensive and its memory footprint is even 3.3× higher (see Sec. IV-A).
The first and the last layer need to stay in full precision to keep a satisfactory accuracy and are not implemented on Hyperdrive; however, they contribute just 3% of the computation (226 MOp of 7.3 GOp) and can therefore also be evaluated on low-power compute platforms [42]. Tbl. VI gives an overview of the number of operations, number of cycles and throughput while Hyperdrive is evaluating ResNet-34. In case of batch normalization, the throughput is reduced, since just 49 multipliers are available and the normalization therefore takes more cycles. In the layers where the bypass has to be added, Hyperdrive can likewise calculate just one output FM at a time, because the memory bandwidth is limited to 49 half-precision words per cycle. Fortunately, the non-convolution operations are comparably rare, and a real throughput of 1.53 kOp/cycle or 152.8 GOp/s @ 0.65 V is achieved, leading to a very high utilization of 97.5%.
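The utilization figure can be reproduced from the reported numbers, assuming one MAC (2 Op) per TPU and cycle with the 16×7×7 TPU configuration described above:

```python
# Back-of-the-envelope check of the utilization figure.
peak_ops_per_cycle = 2 * 16 * 7 * 7          # 784 MACs = 1568 Op/cycle peak
real_ops_per_cycle = 1.53e3                  # reported effective throughput
print(f"utilization: {real_ops_per_cycle / peak_ops_per_cycle:.1%}")
# -> ~97.6%, consistent (within rounding) with the ~97.5% reported above.
```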
C. Comparison with State-of-the-Art
Tbl. V compares our architecture with state-of-the-art binary-weight CNN accelerators. In the upper part, we compare the numbers for image recognition (224×224 pixel images), for which previous works report results. In the lower part, we compare the key figures for object detection using ResNet-34 and ResNet-152 features on 2048×1024 pixel images (e.g. as found in autonomous driving data sets [5, 44]). At 0.65 V, a frame rate of 34.6 frame/s for ResNet-34 and 11.3 frame/s for ResNet-152 is achieved, independent of the image resolution, when systolically scaling the architecture accordingly.
Previous work is dominated by I/O energy, especially for spatially large feature maps. We compare our work to Wang et al. [41], approximating the power consumption according to the scaling model presented in Dreslinski et al. [43].³ Furthermore, the FM volumes are based on Tbl. II, but scaled down to 6-bit words to account for the ENQ number format. Our approach uses up to 58× less energy for I/O and increases overall energy efficiency by up to 2.2×, because just the first input FM and the weights need to be streamed to the chip, but not the intermediate FMs.
³ The energy efficiency of Wang et al. [41] has been scaled to 22 nm according to the energy scaling model presented in Dreslinski et al. [43].
Hyperdrive's core energy efficiency is 3× lower than previous work, due to: 1) fp16 operators, which are more robust than the Q12 or ENQ6 formats used in [24, 41] and have been shown to work with the most challenging deep networks. Using floating-point feature maps directly impacts the energy of the accumulation operations as well as of the memory and register read/write operations. ENQ, on the other hand, has been shown to introduce an accuracy drop of 1.6% already on CIFAR-100 [41], which is more than the difference between running ResNet-32 instead of ResNet-110 on CIFAR-10; it thus implies that a deeper network has to be computed to achieve a comparable accuracy. 2) Significantly larger on-chip memories to store the FMs.
However, optimizations such as approximate adders and strong quantization can be combined with Hyperdrive's concepts, extending the core efficiency gains to the system level by removing the non-scalable I/O bottleneck. For instance, moving from fp16 to Q12 would lead to an energy efficiency boost that can be estimated at around 3× for the core, which would translate into a system efficiency boost of 6× with respect to the state-of-the-art.
VI. CONCLUSION
We have presented Hyperdrive: a systolically-scalable hardware architecture for binary-weight neural networks, which dramatically minimizes the I/O energy consumption to achieve outstanding system-level energy efficiency. Hyperdrive achieves an energy efficiency of 5.9 TOp/s/W, which is more than 2.2× better than prior state-of-the-art architectures, by exploiting a binary-weight streaming mechanism while keeping the entire FMs on-chip. Furthermore, while previous architectures were limited to specific network sizes, Hyperdrive allows running networks that do not fit on a single die by arranging multiple chips in an on-board 2D systolic array, scaling up the supported image resolution and hence enabling a new class of applications such as object detection at the edge of the IoT.
