Abstract: Convolutional Neural Networks (CNNs) have revolutionized image classification over the last few years, pushing computer vision close to or beyond human accuracy. The computational effort of today's CNNs requires power-hungry parallel processors or GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-on-Chip (SoC) integration have achieved very promising results. Unfortunately, even these highly optimized engines are still above the power envelope imposed by mobile and deeply embedded applications, and they face hard limitations caused by CNN weight I/O and storage. On the algorithmic side, highly competitive classification accuracy can be achieved by properly training CNNs with binary weights. This novel algorithmic approach brings major optimization opportunities in the arithmetic core, by removing the need for expensive multiplications, as well as in weight storage and I/O costs. In this work, we present a hardware accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V. Our accelerator outperforms the state of the art in ASIC energy efficiency and area efficiency, with 61.2 TOp/s/W and 1135 GOp/s/MGE, respectively.
I. INTRODUCTION
Convolutional Neural Network (CNN) algorithms have been achieving outstanding classification capabilities in several complex tasks such as image recognition [2], [3], face recognition [4] and speech recognition [5]. The rather large computational workload and data intensity required by CNNs have resulted in optimized implementations on mainstream systems [6], CPUs [7] and GPUs [8]. However, these software implementations lack a key attribute needed to enable a wider variety of mobile and embedded applications, namely energy efficiency. This pushes the need for specialized architectures that achieve higher performance at lower power.
A few research groups have exploited the customization paradigm to design highly specialized hardware that enables CNN computation in the domain of embedded applications. Several approaches leverage FPGA implementations to maintain post-fabrication programmability while providing a significant boost in performance and energy efficiency [9]. However, FPGAs are still two orders of magnitude less energy efficient than ASICs [10]. Moreover, CNNs share a very reduced set of kernels but can cover a huge application domain simply by changing weights and network topology, which relaxes the constraints on NRE costs typical of ASIC design.
Among CNN ASIC implementations, the precision of arithmetic operands plays a crucial role in energy efficiency. Several reduced-precision implementations have been proposed recently, relying on 16-bit, 12-bit or 10-bit accuracy for both operands and weights [10], [11], [12], [13], [14], exploiting the intrinsic resiliency of CNNs to quantization and approximation [15], [16]. In this work we take a significant step forward in energy efficiency by exploiting recent research on binary-weight CNNs [17], [18]. BinaryConnect is a method which trains a deep neural network with binary weights during the forward and backward propagation, while retaining the precision of the stored weights in the gradient computation. This approach has the potential to bring great benefits to CNN hardware implementation by enabling the replacement of multipliers with much simpler complement operations and multiplexers, as well as by drastically reducing weight storage requirements. Interestingly, BinaryConnect leads to negligible accuracy losses on several well-known CNN benchmarks [19], [20].
In this paper, we introduce the first optimized design implementing a flexible, energy-efficient and performance-scalable convolutional engine based on BinaryConnect CNNs. We demonstrate that this approach improves the energy efficiency of the digital core of the accelerator by 5.1x, and the throughput by 1.3x, with respect to a baseline architecture based on 12-bit MAC units. To extend the performance scalability of the device we implement a latch-based Standard Cell Memory (SCM) architecture for on-chip data storage. Although SCMs are more expensive than SRAMs in terms of area, they provide better voltage scalability and energy efficiency [21], extending the operating range of the device to the very limit of the technology (0.6 V). This further improves the energy efficiency of the engine by 6x (at 55 GOp/s) with respect to the nominal operating voltage of 1.2 V. To improve the flexibility of the convolutional engine we implement support for three different kernel sizes (3x3, 5x5, 7x7), making it suitable for implementing a wide variety of CNNs. The proposed accelerator achieves a peak performance of 1510 GOp/s, a peak energy efficiency of 61.2 TOp/s/W and a peak area efficiency of 1135 GOp/s/MGE, surpassing state-of-the-art ASIC CNN engines by 2.7x, 32x and 10x, respectively.
II. RELATED WORK
Convolutional neural networks are reaching record-breaking accuracy in image recognition on small images and class sets like MNIST, SVHN and CIFAR-10, with accuracy rates of 99.79%, 98.31% and 96.53% [22], [23], [24]. Recent CNN architectures also perform extremely well on large and complex datasets such as ImageNet: GoogLeNet reached 93.33% and ResNet broke the human record of 94.9% with an accuracy of 96.43%. As the trend goes towards larger CNNs (e.g. ResNet uses 34 layers, VGG OxfordNet 19 layers with 135M parameters [25]), the memory and computational complexity rise, too. CNN-based classification is not problematic when running on large GPU clusters with kW power budgets. However, IoE edge-node applications have much tighter, mW-level power budgets. This "CNN power wall" has led to the development of many approaches to improve CNN energy efficiency, both at the algorithmic and at the hardware level.
A. Algorithm Approaches
To reduce the computational complexity, a widely exploited method is based on quantization and precision scaling. Even though the precision is reduced, such CNNs are still able to compete with full-precision CNNs. For example, on Full CIFAR-10 the accuracy decreases by just 2% for 4 bits [16] and by 0.03% for 2 bits (-1, 0, 1) [15] compared to the full-precision CNN. It has also been shown that the results can be improved by independent quantization per layer. Moons et al. [26] were able to reduce power by 30x (compared to 16-bit fixed-point) without accuracy loss, and by 225x with an accuracy of 99%, on AlexNet, CifarQuick and LeNet-5. BinaryConnect [20] proposes to binarize the weights to (-1, 1). The weights are kept at full precision, but are binarized in the forward and backward propagation. The following formula shows the deterministic and stochastic binarization function, where a "hard sigmoid" function σ is used to determine the probability distribution:
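For reference, the two binarization functions can be written as below. This is a reconstruction that assumes the standard BinaryConnect definitions from [20]; the symbol names $w_{fp}$ (full-precision weight) and $w_b$ (binarized weight) are labels chosen here for illustration.

```latex
% Deterministic binarization
w_{b,\mathrm{det}} =
\begin{cases}
  +1 & \text{if } w_{fp} \ge 0 \\
  -1 & \text{otherwise}
\end{cases}

% Stochastic binarization
w_{b,\mathrm{sto}} =
\begin{cases}
  +1 & \text{with probability } p = \sigma(w_{fp}) \\
  -1 & \text{with probability } 1 - p
\end{cases}

% "Hard sigmoid"
\sigma(x) = \mathrm{clip}\!\left(\frac{x + 1}{2},\, 0,\, 1\right)
          = \max\!\left(0,\, \min\!\left(1,\, \frac{x + 1}{2}\right)\right)
```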
In a further work [19], the same authors propose to quantize the inputs to the layers in the backward propagation to 3 or 4 bits and to replace the multiplications by shift-add operations. The resulting accuracy can even exceed that of fixed-point weights, owing to a regularization effect that lowers overfitting. Rastegari et al. binarize not only the weights but also the layer inputs, such that the proposed algorithm uses mainly binary XNOR operations. They show only 2.9% less accuracy on ImageNet for their XNOR version of AlexNet than for the full-precision one (in top-1 measure) [18]. In this work we focus on implementing a CNN accelerator for the BinaryConnect algorithm, instead of the stochastic gradient descent algorithm, as the reduction of complexity is promising in terms of energy and speed and as near state-of-the-art classification accuracy was achieved [17].
B. Hardware and Software Implementations
There are several approaches to optimize CNN computations on GPUs, which reach a high throughput of up to 6 TOp/s but need up to 250 W [8], [27]. In recent years some low-power programmable accelerators have also entered the market. A known example is Movidius' Myriad 2, which computes 100 GFLOPs and needs just 500 mW at 600 MHz [28]. Even though GPU and GPU-like implementations already exploit parallelism, they are still over the energy budget of IoT applications, and therefore several dedicated hardware architectures have been proposed to improve the energy efficiency. Several CNN systems have been presented that implement the activation layer (mainly ReLU) and pooling (i.e. max pooling) [29], [30], [31]; in this work we focus on the convolution layer, as it contributes most to the computational complexity [8]. Convolutions in fact involve a lot of repeated data, so a sliding-window scheme is very beneficial, because most of the pixels (e.g. 6 × 7 pixels for 7 × 7 kernels) in the input channel can be reused and need not be loaded from memory again [32], [12], [30], [13]. In this work we go even further and cache the values such that we can reuse them when we switch to the next tile; in this way only one pixel per cycle has to be loaded from off-chip. As the filter kernel sizes change from problem to problem, several approaches have been proposed. Zero-padding is one possibility: in Neuflow the filter kernel was fixed to 9x9 and filled with zeros for smaller filters [33]. For smaller filters this means that unused data is loaded and that the additional hardware cannot be switched off. Another approach implements several processing elements (PEs) arranged in a matrix. Chen et al. proposed an accelerator containing an array of 14x12 configurable PEs connected in a network-on-chip. The PEs can be adjusted for several filter sizes. For small filter sizes, they can be used to calculate several output channels in parallel, or they can be switched off [32].
Even though this brings flexibility, all data packets have to be labeled so that the data can be reassembled in a later step, and the system needs many multiplexers and much control logic to handle this, which decreases the energy efficiency. In this work we propose an architecture that focuses mainly on the common 3x3, 5x5 and 7x7 kernel sizes.
CNNs have large bandwidth requirements, which cost time and energy when going off-chip. In other works compression was used, applied either to the input data [32] or to the weights. For example, Jaehyeong et al. used PCA to reduce the dimension of the kernels, as they showed that the correlation between filters can be exploited with minor influence on accuracy [30]. Finally, the on-chip computational complexity can be minimized by benefiting from the fact that zero values appear quite often in CNNs due to the ReLU activation layer: Chen et al. exploit zero-skipping, where the multiplications are bypassed [32]. Many of the non-zero values in the network are small as well; therefore Jaehyeong et al. proposed a small 16-bit multiplier which triggers a stall and the calculation of the higher bits only when an overflow is detected, which already gives an improvement of 56% in energy efficiency [30]. The next step to further reduce complexity is to exploit quantization scaling as described in Section II-A. Even though most approaches work with fixed-point operations, the number of quantization bits is still kept at 24 bits ([30], [29]) or 16 bits ([12], [13], [31], [33], [34]). In this work we binarize the weights and quantize the input and output channel pixels to 12 bits. The last step consists of exploiting binary weights: BinaryConnect was implemented on an Nvidia GTX750 GPU and a run-time reduction by a factor of 7 was achieved [19]. In this work we present a first accelerator optimized for BinaryConnect, fully exploiting binary weights to boost area and energy efficiency.
III. ARCHITECTURE
A CNN consists of several stages, each of which includes a convolution, an activation and a pooling layer. We focus on the convolution layer, as it accounts for most of the total computation time [10]. A general convolution layer is drawn in Fig. 1 and is described by Eq. 1. A layer consists of $n_{in}$ input channels, $n_{out}$ output channels and $n_{in} \cdot n_{out}$ kernels with $h_k \times b_k$ weights; we denote the matrix of filter weights as $w_{k,n}$. For each output channel $k$, every input channel $n$ is convolved with a different kernel $w_{k,n}$, resulting in the terms $\tilde{o}_{k,n}$, which are accumulated into the final output channel $o_k$. We propose a new scalable CNN accelerator called YodaNN, a hardware architecture able to calculate $n_{ch} \times n_{ch}$ channels in parallel. If the number of input channels $n_{in}$ is greater than $n_{ch}$, the system has to process the network $n_{in}/n_{ch}$ times and the results are accumulated off-chip; this adds only $n_{in}/n_{ch} - 1$ operations per pixel. In the following we fix the number of output channels to $n_{ch} = 32$ and the filter kernel size to $h_k = b_k = 7$. The system is composed of the following units (an overview is given in Fig. 2):
• The Filter Bank is a shift register which contains $n_{in} \cdot n_{out} = 32^2 = 1024$ sets of $7 \times 7$ filter weights $w_{k,n}$ for the output channels $k \in \mathbb{N}_{<32}$ and input channels $n \in \mathbb{N}_{<32}$ (1 KiB), and supports a column-wise left circular shift per kernel.
• The Image Memory saves an image stripe of width $b_k = 7$ and height 1024 (10.5 KiB), which can be used to cache $1024/n_{in} = 1024/32 = 32$ rows per input channel.
• The Image Bank (ImgBnk) caches a spatial window of $h_k \times b_k = 7 \times 7$ per input channel $n$ (2.3 KiB), which is applied to the SoP units. This unit is used to reduce memory accesses, as the last $h_k - 1 = 6$ rows can be reused when we proceed in a column-wise order through the input images. Only the last row has to be loaded from the Image Memory, and the upper rows are shifted upwards.
• $n_{ch} = 32$ Sum-of-Product (SoP) Units: SoP unit $k$ calculates the sum terms $\tilde{o}_{k,n}$, where in each cycle the contribution of a new input channel $n$ is calculated.
• $n_{ch} = 32$ ChannelSummers (ChSum): ChannelSummer $k$ accumulates the sum terms $\tilde{o}_{k,n}$ for all input channels $n$, then truncates the result to the initial fixed-point format Q2.9 (12 bits) and streams it out in an interleaved manner.
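The convolution layer described at the start of this section, as computed by the SoP units and ChannelSummers, can be sketched in plain Python. This is a minimal functional sketch under stated assumptions, not the hardware dataflow: the function names `sop` and `conv_layer` are illustrative, any bias term and the truncation to Q2.9 are omitted, and the kernel is applied without flipping (as is usual in CNNs).

```python
def sop(window, kernel):
    """Sum of products of one h_k x b_k image window with one kernel.

    Models the work of one SoP unit for one input channel (one cycle).
    """
    return sum(p * w
               for row_p, row_w in zip(window, kernel)
               for p, w in zip(row_p, row_w))


def conv_layer(inputs, weights):
    """o_k(x, y) = sum_n sum_{a,b} i_n(x + a, y + b) * w_{k,n}(a, b).

    inputs:  list of n_in images, each a list of rows
    weights: weights[k][n] is the h_k x b_k kernel for output k, input n
    Returns a list of n_out output images (border pixels are dropped).
    """
    n_in = len(inputs)
    h_im, w_im = len(inputs[0]), len(inputs[0][0])
    h_k, b_k = len(weights[0][0]), len(weights[0][0][0])
    outputs = []
    for k_weights in weights:              # one SoP unit + ChannelSummer per k
        out = []
        for y in range(h_im - h_k + 1):
            row = []
            for x in range(w_im - b_k + 1):
                acc = 0                    # ChannelSummer accumulator
                for n in range(n_in):      # one input channel per cycle
                    window = [r[x:x + b_k] for r in inputs[n][y:y + h_k]]
                    acc += sop(window, k_weights[n])
                row.append(acc)
            out.append(row)
        outputs.append(out)
    return outputs
```

The innermost loop over `n` mirrors the description above: each SoP unit contributes one term per input channel, and the accumulator plays the role of the ChannelSummer.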
A. Dataflow
Fig. 3 illustrates the timing of the chip. In an initialization step, the weights $w_{k,n}$ are loaded into the filter bank for all output and input channels $k$ and $n$. Then YodaNN preloads the very first pixel for every input channel $n \in \mathbb{N}_{<32}$. The image pointer is moved to the 2nd row and this procedure is repeated until the pixel $i_{31}(0, 31)$ (last row and last channel) is read. Then the image pointer is moved to the 1st row and 2nd column, and the 32 rows for the 32 input channels are loaded. This procedure is repeated until the $b_k = 7$th column and the $h_k = 7$th row for the first input channel are loaded. At this moment a full spatial window of $b_k \times h_k = 7 \times 7$ is ready in the image bank. The SoPs start calculating the terms $\tilde{o}_{k,0}$ for all output channels $k$, followed by input channel 1 in the next cycle, and so on for the $n_{in} = 32$ input images, while the ChannelSummers accumulate the partial sums. After 32 cycles the ChannelSummers have calculated the first pixels $o_k(0, 0)$ for all their assigned output channels $k$ and stream them out channel by channel in an interleaved manner. This continues until the last line is reached; then the first $h_k - 1$ lines have to be preloaded to the image bank from the image memory, and the filters are circularly left-shifted to align with the new input images. The next column of all output channels is then calculated, and this procedure is repeated until the whole image is processed.
B. BinaryConnect Approach
As discussed in Section I, in this work we present a CNN accelerator based on BinaryConnect [17]. With respect to an equivalent 12-bit version, the first major architectural change concerns the weights, which are reduced to binary values $w_{k,n} \in \{-1, 1\}$ and remapped by the following equation:
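A plausible form of this remapping is given below. It is reconstructed from the statement in Section III-E that the stored value 0 represents −1; the assignment of the stored value 1 to +1 and the symbol names $f$ and $w'$ are assumptions.

```latex
f: \{-1, 1\} \to \{0, 1\}, \qquad
w' \mapsto w =
\begin{cases}
  0 & \text{if } w' = -1 \\
  1 & \text{if } w' = 1
\end{cases}
```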
The size of the filter bank thus decreases from $n_{ch}^2 \cdot n_{filt}^2 \cdot 12 = 37632$ bits to $n_{ch}^2 \cdot n_{filt}^2 \cdot 1 = 3136$ bits for the 12-bit MAC architecture with $8 \times 8$ channels and $7 \times 7$ filters that we consider as baseline. The $12 \times 12$-bit multipliers can be substituted by two's-complement operations and multiplexers, which reduces the size of both the "multiplier" and the adder tree, as the products have a width of 12 bits instead of 24. The SoP is fed by a 12-bit, $7 \times 7$ pixel image window and $7 \times 7$ binary weights. Figure 4 shows the impact on area of the 12-bit MAC and the BinaryConnect architectures. Considering that in the 12-bit MAC implementation 40% of the total chip area is used for the filter bank and 40% is needed for the $12 \times 12$-bit multipliers and the accumulating adder trees, this leads to an enormously reduced area cost and complexity. The critical path is reduced as well. Thus, it is possible to reduce the voltage while keeping the same speed, improving the energy efficiency even more.
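The replacement of multipliers by a complement-and-select operation, and the filter bank sizes quoted above, can be illustrated with a short sketch. The function names are hypothetical, and the bit encoding (stored 0 for −1, stored 1 for +1) is an assumption consistent with the remapping described in this section.

```python
def to_bit(weight):
    """Map a binary weight in {-1, +1} to its stored bit (assumed 0 for -1)."""
    return 1 if weight == 1 else 0


def binary_sop(pixels, weight_bits):
    """Sum of products with binary weights.

    Each "product" is either the pixel itself or its two's complement
    (negation), selected by the weight bit, so no multiplier is needed.
    """
    return sum(p if bit else -p for p, bit in zip(pixels, weight_bits))


def filterbank_bits(n_ch, n_filt, bits_per_weight):
    """Filter bank size in bits: n_ch^2 kernels of n_filt^2 weights each."""
    return n_ch ** 2 * n_filt ** 2 * bits_per_weight


pixels = [3, -5, 7]
weights = [1, -1, -1]
bits = [to_bit(w) for w in weights]
# The select-and-negate result equals the ordinary multiply-accumulate.
assert binary_sop(pixels, bits) == sum(p * w for p, w in zip(pixels, weights))
```

With 8 × 8 channels and 7 × 7 filters, `filterbank_bits` reproduces the 37632-bit and 3136-bit figures for 12-bit and 1-bit weights.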
C. Latch-Based SCM
An effective approach to optimize energy efficiency is to adapt the supply voltage of the architecture to the performance requirements of the application. However, the potential of this approach is limited by the presence of SRAMs for the implementation of the image memory, which bounds the voltage scalability to 0.8 V. To overcome this limitation, and taking advantage of the area savings achieved through the adoption of binary SoPs, we replace the SRAM-based image memory with latch-based SCMs. Although SCMs are more expensive in terms of area (Figure 4), they are able to operate over the whole operating range of the technology (0.6 V to 1.2 V) and they also feature significantly smaller read/write energy [21] at the same voltage. To reduce the area overhead of the SCMs and improve routability we propose a multi-banked implementation, where the memory consists of a 7x8 matrix of latch-based arrays of 128 rows x 12 bits. The SCMs are designed with hierarchical clock gating and address/data silencing mechanisms, so that when a bank is not accessed the whole latch array consumes no dynamic power. During a typical CNN execution, 7 SCM banks are read and one is written per cycle. Hence, only 8 of the 56 SCM banks consume dynamic power. The proposed architecture reduces the power consumption of the memory by 3.25x at 1.2 V, and extends the functional range of the whole convolutional engine down to 0.6 V.
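The banking figures above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the constant and function names are illustrative.

```python
BANK_ROWS, BANK_COLS = 7, 8   # 7x8 matrix of SCM banks
WORDS_PER_BANK = 128          # rows per latch array
WORD_BITS = 12                # bits per row


def scm_capacity_bits():
    """Total image memory capacity in bits over all banks."""
    return BANK_ROWS * BANK_COLS * WORDS_PER_BANK * WORD_BITS


def active_bank_fraction(read_banks=7, written_banks=1):
    """Fraction of banks that are accessed (and draw dynamic power) per cycle."""
    return (read_banks + written_banks) / (BANK_ROWS * BANK_COLS)


capacity_kib = scm_capacity_bits() / 8 / 1024
```

The capacity works out to 10.5 KiB, matching the image memory size given in Section III, and one seventh of the banks are active in a typical cycle.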
D. Energy Efficiency Considerations
I/O power is a primary concern of convolutional accelerators, consuming even more than 30% of the overall chip power [35] . As we decrease the computational complexity by the binary approach, the I/O power gets even more critical. The other advantage with having more SoP units on-chip is throughput which is formulated in Equation 3:
With this in mind, we increased the number of input and output channels to 8 × 8, 16 × 16 and 32 × 32.
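As a rough sketch of how peak throughput scales with the number of SoP units, assume that each of the $n_{ch}$ units evaluates a full $h_k \times b_k$ window per cycle and that each multiply-accumulate counts as two operations. The function name and the exact form are assumptions made for illustration and need not match Equation 3 term by term.

```python
def peak_throughput_ops(n_ch, h_k, b_k, freq_hz):
    """Peak operations per second for n_ch parallel SoP units.

    Assumes one h_k x b_k window per unit per cycle, 2 ops per MAC.
    """
    return 2 * h_k * b_k * n_ch * freq_hz


# Scaling from 8 to 32 SoP units at a fixed 400 MHz clock.
small = peak_throughput_ops(n_ch=8, h_k=7, b_k=7, freq_hz=400_000_000)
large = peak_throughput_ops(n_ch=32, h_k=7, b_k=7, freq_hz=400_000_000)
print(large / small)
```

Under these assumptions the throughput grows linearly with the number of SoP units at a fixed clock frequency, while the number of pixels streamed in per cycle stays the same.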
E. Support of Different Filter Sizes
Adapting the filter size to the problem provides an effective way to improve the flexibility and energy efficiency of the accelerator when executing CNNs with different requirements. Although the simplest approach is to zero-pad the filter, this is not feasible in the presented BinaryConnect architecture, as the value 0 is mapped to −1. A more power-efficient approach is to re-use parts of the architecture. We present an architecture where we re-use the binary multipliers for two $3 \times 3$ filters, two $5 \times 5$ filters or one $7 \times 7$ filter. In this work we limit the number of output channels per SoP unit to two, as we are limited in the number of pins and can afford only 12 additional pins. With respect to our baseline architecture, which supports only $7 \times 7$ filters, the number of binary operators and of weights per filter is increased from 49 to 50, such that two $5 \times 5$ filters or one $7 \times 7$ filter fit in one block. For $3 \times 3$ or $5 \times 5$ filters, the image from the image bank is mapped to both the first 25 and the latter 25 input image pixels, which are finally accumulated in an adjusted adder tree, drawn in Fig. 5. With this scheme $n_{ch} \times 2n_{ch}$ channels can be calculated for $3 \times 3$ and $5 \times 5$ filters, which improves the maximum bandwidth and energy efficiency for these two cases.
A. Computational Complexity and Energy Efficiency Measure
Research in the field of deep learning is done on various platforms; nevertheless, comparable metrics are needed. For computational complexity, the total number of multiplications and additions has been used in other publications such as [36], [11], [33], [8]. For a CNN layer with $n_{in}$ input channels and $n_{out}$ output channels, a filter kernel size of $w_k \times h_k$ and an input image size of $w_{im} \times h_{im}$, the computational complexity can be calculated as follows:
The factor of 2 arises because additions and multiplications are counted as separate operations, and the last two factors account for the fact that the output channel image is reduced at the border by $(h_k - 1)$ and $(w_k - 1)$, respectively. Memory operations are not counted as additional operations. In the following evaluation we use these metrics:
• Throughput Θ = (#Op based on Eq. 4)/t [1 GOp/s]
• Peak Throughput: Theoretically reachable throughput. This does not take into account idling, cache misses, etc.
• Energy Efficiency: throughput per unit of power, Θ/P [TOp/s/W]
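The operation count and the metrics above can be sketched as follows. The operation count is reconstructed from the description of its factors, i.e. #Op = 2 · n_out · n_in · h_k · w_k · (h_im − h_k + 1) · (w_im − w_k + 1), and energy efficiency is assumed to be throughput divided by power, as the TOp/s/W unit suggests. Function names are illustrative.

```python
def conv_ops(n_in, n_out, w_k, h_k, w_im, h_im):
    """Operations in one convolution layer; each MAC counts as 2 operations."""
    return (2 * n_out * n_in * h_k * w_k
            * (h_im - h_k + 1) * (w_im - w_k + 1))


def throughput_gops(ops, seconds):
    """Throughput in GOp/s for `ops` operations executed in `seconds`."""
    return ops / seconds / 1e9


def energy_efficiency_tops_per_w(throughput_gops_value, power_w):
    """Energy efficiency in TOp/s/W from throughput (GOp/s) and power (W)."""
    return throughput_gops_value / power_w / 1e3


# One output pixel per channel pair for a 7x7 image and a 7x7 kernel.
ops = conv_ops(n_in=32, n_out=32, w_k=7, h_k=7, w_im=7, h_im=7)

# Using the 1.2 V figures quoted in the abstract (1510 GOp/s, 153 mW).
eff = energy_efficiency_tops_per_w(1510.0, 0.153)
```

With the abstract's 1.2 V figures this gives roughly 9.9 TOp/s/W; the peak of 61.2 TOp/s/W quoted elsewhere refers to the 0.6 V operating point.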
As the architecture has not been taped out yet, all results are based on post place & route results of the design. The libraries are characterized for UMC 65nm and the design was implemented with Cadence SoC Encounter 15.20 and gate-level power simulations were performed with Synopsys PrimePower 2012.12. The I/O power was approximated by power measurement on chips of the same technology and scaled by the actual operating frequency [10] .
B. Fixed-Point vs. YodaNN
In this section we compare a fixed-point baseline implementation, which includes an SRAM, with a binary version; both use a fixed filter kernel size of 7 × 7 and 8 × 8 channels. Results are summarized in Table I. The reduced arithmetic complexity and the replacement of the SRAM by a latch-based memory helped to shorten the critical path. Three pipeline stages that were used in the fixed-point version could be removed, and the critical path moved to the memory; the peak throughput could thus still be increased from 348 GOp/s to 377 GOp/s at a core voltage of 1.2 V, and the core power could be reduced by 79% to 39 mW, which leads to a 5.1x better core energy efficiency and a 1.3x better core area efficiency. As the SRAM fails below 0.8 V, we can get even better results by reducing the supply voltage to 0.6 V thanks to our SCM implementation. Although the peak throughput drops to 15 GOp/s, the core power consumption is reduced to 260 µW and the core energy efficiency rises to 59 TOp/s/W, which is an improvement of 11.6x compared to the fixed-point architecture at 0.8 V.
Fig. 6. Comparison of core energy efficiency and throughput for the baseline architecture (fixed-point Q2.9, SRAM, 8x8 channels, fixed 7x7 filters) with the final YodaNN (binary, SCM, 32x32 channels, supporting several filters).
Fig. 6 shows the throughput and energy efficiency of YodaNN with respect to the baseline architecture for different supply voltages, while Fig. 7 shows the breakdown of the core power at the operating frequency of 400 MHz. Although the power consumption of the core increases by 3.32x when moving from 8x8 to 32x32 channels, the throughput increases by 4x, improving energy efficiency by 20%. Moreover, by taking advantage of more parallelism, voltage and frequency scaling can be exploited to improve energy efficiency for a target throughput.
Supporting different kernel sizes, which significantly improves the flexibility of the YodaNN architecture, increases the core area by 11.2% and the core power by 38% with respect to a binary design implementing only 7x7 kernels. If we now consider I/O power, increasing the number of channels becomes even more attractive, as the throughput increases but the total device power does not increase to the same extent. We estimate a fixed contribution of 328 mW for the I/O power at 400 MHz. Table II gives an overview of the device energy efficiency for different filter kernel sizes at 1.2 V core and 1.8 V pad supply. The device energy efficiency rises from 856 GOp/s/W in the 8x8 architecture to 1611 GOp/s/W in the 16x16 and to 2756 GOp/s/W in the 32x32 architecture.
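The effect of a fixed I/O power contribution on device energy efficiency can be illustrated with a small calculation. This is a back-of-the-envelope sketch: the function name is hypothetical, device efficiency is taken as throughput divided by the sum of core and I/O power, and the 39 mW core power reported in Section IV-B for the binary 8x8 design at 1.2 V is assumed to be representative of the 400 MHz operating point, which the text does not state explicitly.

```python
def device_efficiency_gain(core_mw, io_mw, throughput_gain, core_power_gain):
    """Ratio of device energy efficiency after vs. before scaling up.

    Device efficiency is modelled as throughput / (core power + I/O power),
    with the I/O power held constant while the core is scaled.
    """
    before = 1.0 / (core_mw + io_mw)
    after = throughput_gain / (core_mw * core_power_gain + io_mw)
    return after / before


# 8x8 -> 32x32 channels: 4x throughput, 3.32x core power, fixed 328 mW I/O.
# core_mw=39.0 is an assumed operating-point value (see lead-in).
with_io = device_efficiency_gain(39.0, 328.0, 4.0, 3.32)
core_only = device_efficiency_gain(39.0, 0.0, 4.0, 3.32)
print(round(with_io, 2), round(core_only, 2))
```

Under these assumptions the device-level gain is several times larger than the core-only gain, because the fixed I/O power is amortized over four times the throughput. This is in line with the increase from 856 to 2756 GOp/s/W reported above.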
Configurations compared: Q2.9-8x8, Bin-8x8, Bin-16x16, Bin-32x32.
C. Comparison with State-of-the-Art
In Section II several software and architectural approaches were introduced. Table III gives a comparative and quantitative overview. With our 32 × 32 architecture we were able to reach a peak throughput of 1.5 TOp/s, which outperforms NINEX [31] by a factor of 2.7. To evaluate the actual throughput in a real CNN, we used the Stanford backgrounds dataset [37] and our reference CNN from [10]. In core energy efficiency the design outperforms k-Brain and NINEX by 5x or more. If the supply voltage is finally reduced to 0.6 V, the throughput decreases to 55 GOp/s but the energy efficiency rises to 61.2 TOp/s/W, which beats k-Brain [29], NINEX [31] and the design of Jaehyeong et al. [30] by more than 50x, or by 32x when supporting several filter sizes.
V. CONCLUSION
In this work we presented a flexible, energy-efficient and performance-scalable CNN engine. The proposed architecture is the first ASIC design exploiting recent results on BinaryConnect CNNs, which greatly reduce the complexity of the design by replacing fixed-point MAC units with simpler complement operations and multiplexers, without harming classification accuracy. To further improve energy efficiency and extend the performance scalability of the accelerator, we have implemented latch-based SCMs for on-chip data storage. To improve its flexibility we support three different kernel sizes (3x3, 5x5, 7x7), making it suitable for implementing a wide variety of CNNs. Even though this leads to a reduction of 29% in energy efficiency, an outstanding energy efficiency of 61 TOp/s/W is still achieved. The proposed accelerator achieves a peak performance of 1510 GOp/s, a peak energy efficiency of 61.2 TOp/s/W and a peak area efficiency of 1135 GOp/s/MGE, surpassing state-of-the-art ASIC CNN engines by 2.7x, 32x and 10x, respectively.
