Abstract-Convolutional neural networks (CNNs) have revolutionized computer vision, speech recognition, and other fields requiring strong classification capabilities. These strengths make CNNs appealing in edge node Internet-of-Things (IoT) applications requiring near-sensor processing. Specialized CNN accelerators deliver significant performance per watt and satisfy the tight constraints of deeply embedded devices, but they cannot be used to implement arbitrary CNN topologies or nonconventional sensory algorithms where CNNs are only a part of the processing stack. A higher level of flexibility is desirable for next generation IoT nodes. Here, we present Mia Wallace, a 65-nm system-on-chip integrating a near-threshold parallel processor cluster tightly coupled with a CNN accelerator: it achieves peak energy efficiency of 108 GMAC/s/W at 0.72 V and peak performance of 14 GMAC/s at 1.2 V, leaving 1.2 GMAC/s available for general-purpose parallel processing.
I. INTRODUCTION

Convolutional networks are becoming increasingly popular in computer vision thanks to their outstanding accuracy and generalization capability in object detection, scene parsing, and image segmentation tasks [1]. As convolutional neural networks (CNNs) are computationally expensive, they are typically deployed in high-performance, power-hungry servers "in the cloud" and not on embedded devices. However, the ability to compress low information density data into a highly informative compressed state (e.g., a classification tag) is also attractive for low-power embedded devices. For example, smart visual sensor nodes could exploit it to minimize the amount of energy spent in data transmission, by sending to the cloud only classification tags or preclassified data. Application-specific integrated circuit (ASIC) accelerators are the standard way to cope with significant workloads in low-power embedded devices, but for the specific task of CNNs they lack the flexibility to adapt to the great variety of different topologies. Moreover, they are limited in scope to a scenario where CNNs are the sole application run on the node. This brief extends the previous abstract by Pullini et al. [2]; we propose a 65-nm energy-efficient system-on-chip (SoC) based on a hybrid HW/SW approach to CNN acceleration. We rely on a near-threshold parallel platform featuring four single-issue OpenRISC cores enhanced for efficient fixed-point computations and a hardware accelerator for convolution-accumulation operations, which constitute the bulk of the computational load of CNNs. The proposed approach joins the flexibility of software-programmable processors with the performance and energy efficiency boost of specialized hardware, suitable for a new generation of Internet-of-Things (IoT) applications based on brain-inspired computing.
A common way to accelerate CNNs on programmable hardware relies on general-purpose graphics processing units (GPGPUs), which can reach extremely high throughput (up to 6 Top/s) but consume tens or hundreds of watts [3], [4]. Embedded CNN implementations on platforms such as ODROID-XU [5] or CEVA [6] provide tens of Gop/s within a power budget of a few watts. Movidius Myriad 2 features 12 8-way VLIW SHAVE processors, claimed to work at 600 MHz within a power envelope of 0.5 W [7], for a total of up to 120 Gop/s/W. Despite its very high claimed peak efficiency, it is not a direct point of comparison for our work, as it targets more powerful embedded systems such as smartphones and UAVs, with a power envelope more than 10× higher than that of Mia Wallace.
Low-power application-specific CNN accelerators often focus on convolutional layers, as they dominate CNN execution. Origami
is a convolutional accelerator providing a peak energy efficiency of 803 Gop/s/W in 65-nm technology [8]. Although this solution is efficient, it requires additional components at the system level to implement full CNNs. Other hardware solutions implement a whole CNN including pooling, activation layers, and fully connected layers, exploiting different computational models and architectures. NeuFlow, a reconfigurable dataflow architecture implemented in IBM 45-nm SOI technology, reaches a throughput of up to 1280 Gop/s and a core energy efficiency of 490 Gop/s/W [9]. ShiDianNao exploits the 2-D structure of CNNs, reaching 128 Gop/s and an energy efficiency of 400 Gop/s/W [10]. Eyeriss includes an array of 14 × 12 reconfigurable processing elements connected through a network-on-chip. It reduces data movement and exploits data reuse and compression to lower I/O bandwidth [11]. Sim et al. [12] presented a DNN processor that uses an image tiling scheme for reducing off-chip memory access and an algorithmic approach (principal component analysis) to reduce the dimension of the kernels. Different approaches to low-power sensor data analytics have also been recently explored with promising results, using, e.g., convolutional deep belief networks (DBNs) instead of CNNs. Park et al. [13] reported up to 1939 Gop/s/W when combining DBN learning and inference.
In contrast to most of the CNN accelerators presented above: 1) our proposal is a full multicore heterogeneous SoC, working in an SW-driven fashion (in fact, the CNN accelerator accounts for only 15% of the full cluster area); 2) we focus on a technique for sustainable usage of the available memory bandwidth, which is often the scarcest resource in CNN computations; and 3) the flexible architecture we propose can combine CNNs with many other sensor data analysis techniques, using the SW cores and the CNN accelerator concurrently.
II. SOC ARCHITECTURE
The proposed SoC (see Fig. 1) implements the third-generation parallel ultralow-power (PULP) platform,1 extended with a dedicated accelerator for convolution-intensive processing [16]. The programmable computing engine of the SoC is based on a tightly coupled cluster of four OpenRISC-ISA cores, called OR10N, enhanced for energy-efficient digital signal processing [17]. The cluster features a shared 4-kB latch-based standard cell memory (SCM) [18] instruction cache that, coupled with a private per-core L0 buffer, increases energy efficiency by 30% with respect to an SRAM-based private cache architecture [15]. The cores share an explicitly managed tightly coupled data memory (TCDM). The TCDM features eight word-level interleaved banks, connected to the processors through a nonblocking interconnect to minimize the probability of banking conflicts. Each logical bank is implemented as a heterogeneous memory, composed of SRAM and SCM cuts (64 kB of SRAM and 8 kB of SCM in total); by disabling the SRAMs in the cluster entirely, it is possible to extend the operating range well below the limits imposed by SRAM scaling (down to 0.62 V in the case of Mia Wallace). A direct memory access (DMA) engine with multiple physical channels enables fast and flexible communication with 256 kB of L2 memory. The set of peripherals available on the SoC includes 200-Mbit/s SPI interfaces (master/slave and single/quad mode), I2C, 50-Mb/s I2S, GPIOs, a boot-up ROM, and a JTAG interface for debug and test purposes. To provide high energy efficiency across a wide range of workloads, the cluster and the rest of the SoC are in different clock and voltage domains, isolated by dual-clock FIFOs and level shifters. Fine-grained tuning of the SoC and cluster frequencies is performed by two frequency-locked loops [19].
1 The first-generation PULP architecture is presented in [14], while the second generation is presented in [15]. Further information regarding the PULP platform can be found on the project Web page http://www.pulp-platform.org.
A dedicated hardware convolution engine (HWCE) extends the cluster to efficiently implement convolve-accumulate operations. The computational core of the HWCE consists of two sum-of-products (SoP) units, providing a peak throughput of two 5 × 5 convolutions per cycle on 16-bit inputs. The 5 × 5 SoP enables native computation of the great majority of CNNs, which use 5 × 5 and 3 × 3 filters [20], [21]. The accelerator is integrated in the cluster with three master ports on the TCDM interconnect, using the same interconnect as the processor cores. Offload of HW-accelerated tasks is performed through a dedicated slave port mapped on a peripheral interconnect, where a shadow configuration register enables asynchronous control of the HWCE without stopping its execution.
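As an illustration of what shadow-register offload looks like from the cores' point of view, the following is a minimal sketch in C. The base address, register offsets, and field layout are hypothetical placeholders, not taken from the actual HWCE programming interface.

```c
/* Minimal sketch of an HWCE job offload via memory-mapped shadow
 * registers. HWCE_BASE and all offsets below are hypothetical. */
#include <stdint.h>

#define HWCE_BASE      0x10201000u  /* hypothetical peripheral address */
#define HWCE_REG(off)  (*(volatile uint32_t *)(HWCE_BASE + (off)))

#define HWCE_TRIGGER   0x00u  /* writing here commits the shadowed job */
#define HWCE_IN_PTR    0x20u  /* TCDM address of the input tile        */
#define HWCE_W_PTR     0x24u  /* TCDM address of the 5x5 filter        */
#define HWCE_OUT_PTR   0x28u  /* TCDM address of the output tile       */

static void hwce_enqueue_conv(uint32_t in, uint32_t w, uint32_t out)
{
    /* The shadow register set lets a core prepare the next job while
       the current one is still running, so no busy-wait is needed
       unless both job slots are already full. */
    HWCE_REG(HWCE_IN_PTR)  = in;
    HWCE_REG(HWCE_W_PTR)   = w;
    HWCE_REG(HWCE_OUT_PTR) = out;
    HWCE_REG(HWCE_TRIGGER) = 1;  /* job moves into the HWCE queue */
}
```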
III. CNNS ON MIA WALLACE
A first challenge in implementing CNNs on a small embedded device such as Mia Wallace is the size of the CNN working set, which forces maximizing the use of local memory (the TCDM) and minimizing data exchange with the other levels of the memory hierarchy. Deep CNNs are naturally divided into layers (e.g., convolutional, pooling, and fully connected). Each convolutional layer builds a 3-D space of output feature maps from a 3-D space of input feature maps, using a matrix of convolutional filters (see [21]); typically, even a single layer is too big to be stored entirely in the 72-kB TCDM.
This challenge can be addressed by noting that CNNs have a hierarchical structure that maps nicely onto the explicitly managed memory hierarchy of Mia Wallace. We assume that the CNN topology (inputs and weights for the full network) resides in an external memory accessible via quad SPI (QSPI) (see Fig. 2). The application brings one layer at a time into the L2 memory via QSPI. Since the 72-kB TCDM cannot typically host inputs, weights, and outputs for a whole layer, it is necessary to further split the workload and the working set into independent chunks that are executed one after the other from the TCDM. This can be done by tiling: considering a convolutional layer, the full input space of a layer is sliced into a grid of N_i × H × W tiles in the feature, height, and width dimensions. Similarly, the output space is split into an N_o × H × W grid. Finally, the filters are sliced into an N_i × N_o grid. The computation is performed on one tile (kept in L1) at a time.
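The tiled traversal can be summarized by the loop nest sketched below, in C. dma_in(), dma_out(), clear_tile(), and conv_tile() are hypothetical helpers standing in for DMA programming and HWCE offload; the tile buffers are assumed to be allocated in the TCDM.

```c
/* A minimal sketch of the tiling loop for one convolutional layer. */
#include <stdint.h>

extern int16_t in_tile[], wgt_tile[], out_tile[];  /* TCDM buffers */
extern void dma_in(int16_t *dst, const int16_t *src, int a, int b, int c);
extern void dma_out(int16_t *dst, const int16_t *src, int a, int b, int c);
extern void clear_tile(int16_t *t);
extern void conv_tile(int16_t *out, const int16_t *in, const int16_t *w);

void run_conv_layer(const int16_t *input, const int16_t *weights,
                    int16_t *output, int Ni, int No, int H, int W)
{
    for (int no = 0; no < No; no++)          /* output-feature tiles */
        for (int h = 0; h < H; h++)          /* height tiles         */
            for (int w = 0; w < W; w++) {    /* width tiles          */
                clear_tile(out_tile);        /* accumulator in TCDM  */
                for (int ni = 0; ni < Ni; ni++) {
                    dma_in(in_tile, input, ni, h, w);      /* L2 -> TCDM  */
                    dma_in(wgt_tile, weights, ni, no, 0);  /* filter tile */
                    conv_tile(out_tile, in_tile, wgt_tile); /* HWCE job   */
                }
                dma_out(output, out_tile, no, h, w);       /* TCDM -> L2  */
            }
}
```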
Each output tile is computed as a sum of contributions from a set of N_i input tiles, but from the tiling perspective there is significant freedom in how to organize this accumulation. Using the tiling strategy shown in Fig. 3, for each output tile the contribution of all N_i related input tiles is computed sequentially; then the next output tile is computed. This requires reloading each input tile up to N_o times. The transfer overhead can be reduced by minimizing N_i; in the limit where there is a single input tile (N_i = 1), it is no longer necessary to reload tiles multiple times. The flexible architecture of Mia Wallace enables this strategy, without being limited to it.
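To make the tradeoff explicit, a rough count of the input traffic for one spatial tile position is (our notation, with S_tile the size of one input tile in bytes):

```latex
% Output-stationary ordering of Fig. 3: every output tile
% re-fetches its N_i input tiles from L2.
T_{\mathrm{in}} = N_o \cdot N_i \cdot S_{\mathrm{tile}}
% Single input tile (N_i = 1), kept resident in the TCDM:
T_{\mathrm{in}} = S_{\mathrm{tile}}
```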
The presence of the HWCE relieves the OR10N cores of the heavy task of computing convolutions in CNNs, replacing it with the much lighter task of DMA and HWCE control. In both the DMA and the HWCE it is possible to enqueue a set of jobs (up to eight for the DMA, two for the HWCE), after which the cores are free to perform other useful work. For example, it is possible to establish a software pipeline where the cores execute activation and pooling of the previous layer while the HWCE works on the "bulk" convolution of the current layer. It is also possible to introduce double buffering to hide data transfer overheads.
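A possible shape for such a double-buffered pipeline is sketched below in C: while the HWCE convolves tile t, the DMA prefetches tile t+1 and the cores run pooling/activation on the output of tile t-1. All helpers are hypothetical placeholders for DMA/HWCE driver calls.

```c
#include <stdint.h>

extern int16_t *buf[2], *out[2];                 /* TCDM tile buffers */
extern void dma_prefetch(int16_t *dst, int tile);
extern void dma_wait(int buf_id);
extern void hwce_enqueue_conv_tile(const int16_t *in, int16_t *out);
extern void hwce_wait(void);
extern void pool_and_activate(int16_t *tile);    /* SW, on the cores  */

void conv_pipeline(int n_tiles)
{
    int cur = 0;
    dma_prefetch(buf[cur], 0);                   /* warm-up: tile 0 in */
    for (int t = 0; t < n_tiles; t++) {
        dma_wait(cur);                           /* tile t resident    */
        hwce_enqueue_conv_tile(buf[cur], out[cur]); /* non-blocking    */
        if (t + 1 < n_tiles)
            dma_prefetch(buf[1 - cur], t + 1);   /* overlap transfer   */
        if (t > 0)
            pool_and_activate(out[1 - cur]);     /* SW work on t - 1   */
        hwce_wait();                             /* join before swap   */
        cur = 1 - cur;
    }
    pool_and_activate(out[1 - cur]);             /* drain last tile    */
}
```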
An additional advantage of having both SW cores and a HW accelerator is the possibility to gate a high-effort CNN execution behind a low-effort, low-power first stage, such as a shallow, simplified nonconvolutional neural network [22], which Mia Wallace can execute in pure SW, as shown in Fig. 4. Thanks to its reduced memory requirements, the low-effort neural network (LE-NN) can also be run using only the SCM portion of the TCDM, making it possible to reach the minimum supply voltage of 0.62 V and minimize the overall power envelope. Section IV evaluates the maximum workload of this kind that Mia Wallace can support, and the related energy efficiency.
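Schematically, the cascade could be organized as in the C sketch below. All helpers and the threshold are hypothetical, and the runtime supply switching is our assumption of how the two operating points would be combined, not a mechanism described in this brief.

```c
#include <stdint.h>

#define TRIGGER_THRESHOLD 128   /* hypothetical, application-dependent */

extern const int16_t *acquire_frame(void);
extern int  le_nn_score(const int16_t *frame);   /* shallow SW NN      */
extern void set_supply_mv(int mv);               /* hypothetical DVFS  */
extern void run_full_cnn(const int16_t *frame);  /* HWCE-accelerated   */

void sensing_loop(void)
{
    for (;;) {
        const int16_t *frame = acquire_frame();
        /* Stage 1: SCM-only LE-NN at the minimum supply voltage. */
        if (le_nn_score(frame) > TRIGGER_THRESHOLD) {
            /* Stage 2: raise the supply and run the full CNN. */
            set_supply_mv(720);
            run_full_cnn(frame);
            set_supply_mv(620);  /* back to the low-power point */
        }
    }
}
```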
IV. RESULTS
In this section, we evaluate the performance and efficiency of our platform on the manufactured Mia Wallace prototype chips. Fig. 5 shows a microphotograph of one of the chips, along with its main features and parameters such as operating frequencies and power range. The total size of Mia Wallace is 3.95 mm × 1.88 mm.
We measured the baseline peak efficiency that can be expected from the platform on CNNs by letting the HWCE run while the rest of the cluster is silent and there is no data transfer. The average throughput in this case is 36.5 MAC/cycle, in other words, 1.46 5 × 5 convolutions per cycle. To put this value into perspective, a high-accuracy CNN architecture such as GoogLeNet [20] requires a total of 2.45 × 10^9 MAC operations when applied to a 320 × 240 input image (a realistic image size for low-power camera sensors); at peak, our platform could sustain a similar computational workload in real time. At 0.72 V, pure HWCE execution reaches a throughput of 3.53 GMAC/s within 15 mW, for an overall efficiency of 236 GMAC/s/W.
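As a back-of-the-envelope check (our arithmetic, assuming the 400-MHz clock reported for the high-voltage operating point):

```latex
% Peak HWCE throughput at 1.2 V, 400 MHz:
36.5\ \mathrm{MAC/cycle} \times 400\ \mathrm{MHz} \approx 14.6\ \mathrm{GMAC/s}
% Sustainable GoogLeNet-class frame rate on 320 x 240 inputs:
\frac{14.6 \times 10^{9}\ \mathrm{MAC/s}}{2.45 \times 10^{9}\ \mathrm{MAC/frame}}
  \approx 6\ \mathrm{frames/s}
```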
Sharing memory between the software cores and the HWCE introduces the opportunity to apply activation and pooling to the output set of the accelerator directly in place in the cluster, in software. These operations are typically simple, and they show a high degree of variability between different CNN topologies (e.g., max- and avg-pooling on different sizes, different types of nonlinear activations); hence, they are not good candidates for hardware acceleration. The shared-memory acceleration technique allows these kernels to be implemented without any additional performance/energy overhead to move data from the accelerator. For example, at 0.72 V, 2 × 2 max-pooling has a cost of ∼3.8 cycles and ∼640 pJ per input pixel for computation on four SW cores; if this data had to be copied from a private HWCE memory to L2 and then to the shared TCDM before being used, there would be a small time penalty (∼0.5 cycles per input pixel) and a significant energy overhead (250 pJ per pixel, a 39% increase). If the only nonlinearity applied to output pixels is a simple ReLU, the energy overhead of data movement from private accelerator memories becomes even more expensive: while data movement overheads are similar to the max-pooling case, pure computation costs only 0.7 cycles per pixel and 120 pJ; data movement would cost more than 3× the computation on such a small kernel.
As mentioned in Section III, the most critical aspect of executing CNNs on an embedded platform such as PULP is data transfer to/from the local memory, which requires tiling. As a way to understand how well convolutional layer execution can be overlapped with data transfer and other computations (e.g., subsampling in a pooling layer), we define the computation-to-communication ratio (CCR) metric as the ratio between executed MAC operations and bytes transferred in/out of the cluster TCDM. We swept the CCR from 5 (i.e., full overlap between the busy time of the DMA and the HWCE) to 100 (i.e., the DMA is active only for a fraction of the computation time).
We performed two kinds of CNN tests: in HWCE tests, one OR10N core is used only for controlling the execution and the other three are idle, while in HWCE+matrix multiplication (MMUL) tests the cores are used for a high-effort computation (a matrix multiplication), a worst case where the cores are used in a software pipeline. Finally, a third test (LE-NN) represents the continuous running of a small fully connected artificial NN such as that described in Section III.
Fig. 6(a) and (b) report, respectively, the energy efficiency in GMAC/s/W and the throughput in GMAC/s when executing the tests; Fig. 6(c) shows the efficiency variation at the 1.2-V operating point when sweeping the CCR. The results of the LE-NN tests indicate that at 0.62 V, power is below 6.3 mW at 38 MHz, and the platform can support an LE-NN of up to 150 MMAC/s, which is enough to run a very small, shallow nonconvolutional neural network. Conversely, a pure SW implementation of CNNs would be too slow to run a state-of-the-art CNN with more than 10^9 MACs on an embedded SoC. The HWCE solves part of this problem, delivering good results even when the cost of data transfers, in energy and in increased memory contention, is highest. For example, when the CCR is 5 the energy efficiency is still as high as 91 GMAC/s/W. In the more common case of a relatively high CCR, Mia Wallace reaches an even better overall efficiency of 108 GMAC/s/W at 0.72 V, or 9.26 pJ per MAC.
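For reference, the CCR metric used in this sweep can be written compactly as (our formalization of the definition above):

```latex
\mathrm{CCR} =
  \frac{\text{MAC operations executed}}
       {\text{bytes transferred in/out of the cluster TCDM}}
```

As a rough illustrative example (our arithmetic, ignoring weight traffic and halos): producing one 16-bit output pixel of a single 5 × 5 convolution from freshly transferred 16-bit input data costs 25 MACs against roughly 4 bytes of input/output traffic, i.e., CCR ≈ 6, near the low end of the sweep; input and weight reuse inside the TCDM pushes the CCR toward the upper end.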
When we consider the HWCE+MMUL tests, the cooperative execution on both the OR10N cores and the HWCE provides a 10% performance boost. Counting two operations per MAC, Fig. 6 shows that a compound CNN workload of up to 30 Gop/s can be sustained by the Mia Wallace platform. Net HWCE throughput differs by less than 5% from the performance in the HWCE tests: superposition of HWCE work with a significant SW load does not hurt CNN performance. Supporting additional workload benefits the overall CNN execution in several ways, not only by directly improving its throughput. For example, pooling layers are pure reductions: executing one in the software pipeline improves the CCR by performing more operations, and even more by reducing the amount of data to be written back to L2, improving overall throughput in turn. This key consideration helps in understanding the HWCE+MMUL results in Fig. 6: while there is of course a small net power cost in having both the cores and the HWCE run at the same time, the additional computation improves the CCR by both raising the number of operations and, in the case of pooling, reducing the amount of data to be moved, making the effective efficiency loss almost negligible.
Table I provides a final summary of our contribution and its positioning with respect to the state-of-the-art in CNN inference executed both in SW and in HW accelerators. In terms of energy efficiency, the HWCE in Mia Wallace achieves results comparable to state-of-the-art dedicated ASICs [8]-[11], and the efficiency of the full Mia Wallace cluster, including cores, DMA, and interconnects, is still in a similar range. Differently from most other platforms, Mia Wallace is able to execute full CNNs of arbitrary size using the methodology illustrated in Section III (whereas, except for [10], ASICs are mostly limited to convolutional and pooling layers). Moreover, as opposed to all the ASIC architectures compared in the table, it can also execute arbitrary code, using CNNs as part of more complex pipelines.
V. CONCLUSION
Table I reports a comparison between the current state-of-the-art in CNN SW execution and HW acceleration; moreover, we have shown how the flexibility of the Mia Wallace platform enables a plurality of working modes. On the one hand, this allows the execution of a low-performance, low-effort SW neural network within an overall power budget of 6.3 mW. On the other hand, the execution of a CNN convolutional layer can be superimposed with additional workload, for a maximum compound average throughput of ∼30 Gop/s at 400 MHz.
