Abstract-IoT end-nodes require high performance and extreme energy efficiency to cope with complex near-sensor data analytics algorithms. Processing on multiple programmable processors operating in near-threshold is emerging as a promising solution to exploit the energy boost given by low-voltage operation, while recovering the related frequency degradation with parallelism. In this work, we present a heterogeneous cluster architecture extending a traditional parallel processor cluster with a reconfigurable Integrated Programmable Array (IPA) accelerator. While programmable processors guarantee programming legacy to easily manage peripherals, radio software stacks as well as the global program flow, offloading data-intensive and control-intensive kernels to the IPA leads to much higher system-level performance and energy efficiency. Experimental results show that the proposed heterogeneous cluster outperforms an 8-core homogeneous architecture by up to 4.8x in performance and 4.5x in energy efficiency when executing a mix of control-intensive and data-intensive kernels typical of near-sensor data analytics applications.
I. INTRODUCTION
High performance and extreme energy efficiency are strict requirements for many deeply embedded near-sensor processing applications such as wireless sensor networks, end-nodes of the Internet of Things (IoT) and wearables. One of the most traditional approaches to improve the energy efficiency of deeply embedded computing systems is to exploit architectural heterogeneity by coupling general-purpose processors with application- or domain-specific accelerators in a single computing fabric [1] [11]. On the other hand, the most recent ultra-low power designs exploit multiple homogeneous programmable processors operating in near-threshold [10]. Such an approach, which joins parallelism with low-voltage computing, is emerging as an attractive way to combine performance scalability with high energy efficiency.
In this paper, we present a heterogeneous architecture which integrates a near-threshold tightly-coupled cluster of processors [10] augmented with the Integrated Programmable Array (IPA) presented in [3]. This approach joins the programming legacy of instruction processors with the flexible performance and efficiency boost of Coarse Grain Reconfigurable Arrays (CGRAs) [4]. A similar approach has been adopted in [5], which introduced an ultra-low power heterogeneous system featuring a Single Instruction Multiple Data (SIMD) CGRA as reconfigurable accelerator for bio-signal analysis. In contrast to this domain-specific architecture, where the computational kernels must be mapped manually on the CGRA, the system proposed in our work is meant for general-purpose near-sensor data analytics, and relies on an automated compilation flow that generates the configuration bitstream for the CGRA starting from general-purpose ANSI-C code [2].
978-1-5386-4881-0/18/$31.00 ©2018 IEEE
We synthesized the architecture in a 28nm FD-SOI technology, and we carried out a quantitative exploration combining physical synthesis results (i.e. frequency, area, power) and benchmarking of a set of signal processing kernels typical of IoT end-node applications. Two interesting findings of our exploration are that (1) the performance of the IPA is much less sensitive to memory bandwidth than that of parallel processor clusters and that (2) the simpler nature of its architecture allows the IPA to run twice as fast as the rest of the system. Exploiting these two features of our architecture, we show that the heterogeneous cluster achieves significant performance and energy improvements for both compute-intensive and control-intensive benchmarks with respect to the 8-core homogeneous cluster, achieving up to 4.8x speed-up (with a minimum of 1x and an average of 1.79x) and up to 4.4x better energy efficiency (with a minimum of 1x and an average of 2.24x).
II. BACKGROUND
This section presents the background technology used to design the heterogeneous reconfigurable cluster described in this work.
A. PULP Cluster Architecture
The PULP cluster features eight 32-bit RISC-V cores based on a four-stage pipeline micro-architecture optimized for energy-efficient operation [6], sharing a 64 KB multi-banked scratchpad memory through a low-latency interconnect [8]. The ISA of the cores is extended with instructions targeting energy-efficient digital signal processing, such as hardware loops, load/store with pre/post increment, and SIMD operations. The cores share a 4 KB instruction cache (I$) to boost performance and energy efficiency for tightly coupled clusters of processors typically relying on data-parallel computational models [7]. Off-cluster data transfers are managed by a lightweight multi-channel DMA optimized for energy-efficient operation [9]. Both the I$ and the DMA are connected to an AXI4 cluster bus.
IV. SOFTWARE INFRASTRUCTURE
To offload jobs to the IPA and synchronize the execution, the cores access the control registers of the IPA by memory-mapped operations. A set of memory-mapped control registers allows the cores to load a new context to the IPA array, trigger the execution of a kernel and synchronize with the other processors in the cluster.
As opposed to many CGRA architectures, the IPA can access a multi-banked shared memory through 8 master ports connected to the low-latency interconnect. This eases data sharing with the other processors of the cluster, following the computational model described in [1]. The number of ports has been chosen to optimize the trade-off between the size of the interconnect and the bandwidth requirements of the IPA. Following the analysis conducted in [2], which shows that the IPA can operate 2x faster than the processors, we have extended the architecture of the cluster so that the IPA can work at twice the frequency of the rest of the cluster. This approach allows each component in the cluster to operate at its optimal frequency, without paying the overhead of dual-clock FIFOs, which require a significant amount of logic and synchronization overhead. Instead, the hardware support for the dual-frequency mode includes a clock divider to generate the two edge-aligned clocks, and two modules that adapt the request-grant protocol of the low-latency interconnect [8] to deal with the frequency domain crossing, as shown in
B. IPA Architecture
The IPA consists of an array of 16 PEs communicating through a 2D torus interconnect [2]. Each PE can perform 32-bit ALU operations (both arithmetic and logical), 16-bit × 16-bit → 32-bit multiplications and control flow operations such as branches. The functional units of each PE feature two input operands coming from the neighbouring PEs or the internal register files. The PEs also include an instruction register file which stores the program, a regular register file to store temporary variables and a constant register file to store immediates. To reduce dynamic power consumption, each PE contains a tiny Power Management Unit (PMU) which clock gates the PE when idle [3]. A parametric number of PEs can be augmented with a load-store unit employing the same request-grant protocol as the PULP low-latency interconnect [8], which allows them to communicate with a multi-banked shared memory. The configuration of the array is generated automatically by a compilation flow which starts from ANSI-C code and generates the configuration bitstream for the IPA [2].
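As an illustration of the 2D torus interconnect and the PE multiplier described above, the following minimal sketch computes the four neighbours of a PE and a 16-bit × 16-bit → 32-bit product. It assumes that the 16 PEs are arranged as a 4×4 grid with row-major indexing and that the multiplication is signed; the text only states the PE count, the torus topology and the operand widths, so the grid shape, the indexing and the names `torus_neighbours` and `pe_mul16` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the 16 PEs form a 4x4 grid, indexed row-major (0..15). */
#define IPA_ROWS 4
#define IPA_COLS 4

typedef struct {
    int north, south, west, east;
} pe_neighbours_t;

/* Returns the four torus neighbours of PE `id`. Rows and columns wrap
 * around, so border PEs also have exactly four neighbours. */
pe_neighbours_t torus_neighbours(int id)
{
    assert(id >= 0 && id < IPA_ROWS * IPA_COLS);
    int row = id / IPA_COLS;
    int col = id % IPA_COLS;
    pe_neighbours_t n;
    n.north = ((row + IPA_ROWS - 1) % IPA_ROWS) * IPA_COLS + col;
    n.south = ((row + 1) % IPA_ROWS) * IPA_COLS + col;
    n.west  = row * IPA_COLS + (col + IPA_COLS - 1) % IPA_COLS;
    n.east  = row * IPA_COLS + (col + 1) % IPA_COLS;
    return n;
}

/* 16-bit x 16-bit -> 32-bit multiplication (signed variant assumed).
 * The widened result always fits in 32 bits. */
int32_t pe_mul16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```

Under these assumptions, the corner PE 0 is connected to PEs 12, 4, 3 and 1, which shows how the wrap-around links give every PE the same number of neighbours.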
III. HETEROGENEOUS CLUSTER ARCHITECTURE
In this work, the PULP cluster is extended with the Integrated Programmable Array accelerator, as shown in Figure 1. Figure 2 shows a detailed block diagram of the subsystem embedding the IPA array. The IPA array is configured through a global context memory (GCM), responsible for storing locally the configuration bitstream of the PEs. The GCM is connected through a DMA-capable AXI4 port to the cluster bus, enabling pre-fetching of IPA contexts from L2 memory. The GCM is dimensioned at twice the worst-case size of the IPA configuration bitstream. In this way, it is possible to employ a double-buffering mechanism and load a new bitstream from the L2 to the GCM while the current one is being loaded on the array, completely hiding the reconfiguration time. More details on the structure of the IPA array bitstream can be found in [3].
A peripheral interconnect is used to communicate with on-cluster peripherals such as a timer, a hardware synchronizer and other memory-mapped peripherals such as application-specific accelerators. To operate at the best operating point for a given workload, the cluster can be integrated in an independent voltage and frequency domain, featuring dual-clock FIFOs and level shifters at its boundary.
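The double-buffering mechanism of the GCM described in this section can be sketched in software as a ping-pong buffer. This is only a behavioural model under stated assumptions: the 2 KB worst-case context size is inferred from a 4 KB GCM holding two contexts, and the type `gcm_t` and the functions `gcm_prefetch` and `gcm_swap` are hypothetical names, not part of the actual hardware or software stack.

```c
#include <assert.h>
#include <string.h>

/* Assumption: worst-case context of 2 KB, so that a 4 KB GCM holds two. */
#define CTX_MAX_BYTES 2048
#define GCM_BYTES     (2 * CTX_MAX_BYTES)

typedef struct {
    unsigned char mem[GCM_BYTES];
    int active;   /* half currently feeding the PE array: 0 or 1 */
} gcm_t;

/* Models the DMA pre-fetch: copies the next context from (simulated) L2
 * memory into the half of the GCM that is NOT feeding the array. */
void gcm_prefetch(gcm_t *g, const unsigned char *l2_ctx, size_t size)
{
    assert(size <= CTX_MAX_BYTES);
    int idle = 1 - g->active;
    memcpy(&g->mem[idle * CTX_MAX_BYTES], l2_ctx, size);
}

/* Swaps the two halves: the pre-fetched context becomes the active one.
 * Returns a pointer to the context that the array would load next. */
const unsigned char *gcm_swap(gcm_t *g)
{
    g->active = 1 - g->active;
    return &g->mem[g->active * CTX_MAX_BYTES];
}
```

Because `gcm_prefetch` never writes to the active half, the transfer of the next context can overlap with the loading of the current one, which is the property that hides the reconfiguration time.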
V. EXPERIMENTAL RESULTS

TABLE I. LIST OF APIS FOR CONTROLLING IPA
| Function | Description |
| --- | --- |
| void load_data_l2totcdm(int DMA_CORE_ID, int size, unsigned int l2_addr, unsigned int tcdm_addr) | Writes data from L2 memory to the TCDM banks through DMA_CORE |
| void load_context_l2togcm(int DMA_IPA_ID, int size, unsigned int l2_addr, unsigned int gcm_addr) | Writes context from L2 memory to the GCM through DMA_IPA |
| int ipa_start_execution() | Initiates IPA execution by writing in the command register |
| void ipa_check_status(int id) | Core synchronization |
| void free_ipa(int id) | Releases the IPA |
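To make the intended use of the functions in Table I more concrete, the sketch below mocks them on top of simulated memories and registers and chains them into one offload sequence. Only the function names and parameter lists come from Table I; the memory sizes, the register encodings (`IPA_CMD_EXECUTE`, `IPA_STATUS_DONE`), the return value of `ipa_start_execution`, the L2 layout and the helper `offload_kernel` are assumptions made for illustration, and the mock IPA completes instantly instead of running a kernel.

```c
#include <assert.h>
#include <string.h>

/* Simulated platform: sizes and register encodings are assumptions. */
#define L2_BYTES   1024
#define TCDM_BYTES 1024
#define GCM_BYTES  1024
#define IPA_CMD_EXECUTE 1u
#define IPA_STATUS_IDLE 0u
#define IPA_STATUS_DONE 1u

static unsigned char l2[L2_BYTES], tcdm[TCDM_BYTES], gcm[GCM_BYTES];
static volatile unsigned int ipa_cmd_reg, ipa_status_reg;

/* Mock of the core DMA transfer: copies data from L2 to the TCDM. */
void load_data_l2totcdm(int DMA_CORE_ID, int size,
                        unsigned int l2_addr, unsigned int tcdm_addr)
{
    (void)DMA_CORE_ID;
    assert(size >= 0);
    assert(l2_addr + (unsigned int)size <= L2_BYTES);
    assert(tcdm_addr + (unsigned int)size <= TCDM_BYTES);
    memcpy(&tcdm[tcdm_addr], &l2[l2_addr], (size_t)size);
}

/* Mock of the IPA DMA transfer: copies a context from L2 to the GCM. */
void load_context_l2togcm(int DMA_IPA_ID, int size,
                          unsigned int l2_addr, unsigned int gcm_addr)
{
    (void)DMA_IPA_ID;
    assert(size >= 0);
    assert(l2_addr + (unsigned int)size <= L2_BYTES);
    assert(gcm_addr + (unsigned int)size <= GCM_BYTES);
    memcpy(&gcm[gcm_addr], &l2[l2_addr], (size_t)size);
}

/* Writes the execute command. The mock IPA finishes immediately;
 * real hardware would update the status register later. */
int ipa_start_execution(void)
{
    ipa_cmd_reg = IPA_CMD_EXECUTE;
    ipa_status_reg = IPA_STATUS_DONE;
    return 0;   /* assumed: 0 means the command was accepted */
}

/* Core synchronization: polls the status register until completion. */
void ipa_check_status(int id)
{
    (void)id;
    while (ipa_status_reg != IPA_STATUS_DONE) { /* busy-wait */ }
}

/* Releases the IPA so that another core may use it. */
void free_ipa(int id)
{
    (void)id;
    ipa_cmd_reg = 0u;
    ipa_status_reg = IPA_STATUS_IDLE;
}

/* One offload: context first, then data, then execute and synchronize.
 * Assumed L2 layout: context at offset 0, data right after it. */
int offload_kernel(int core_id, int ctx_size, int data_size)
{
    load_context_l2togcm(0, ctx_size, 0u, 0u);
    load_data_l2totcdm(0, data_size, (unsigned int)ctx_size, 0u);
    int rc = ipa_start_execution();
    ipa_check_status(core_id);
    free_ipa(core_id);
    return rc;
}
```

On real hardware the polling loop in `ipa_check_status` could be replaced by an event from the hardware synchronizer; the sketch uses polling only because it is the simplest behaviour consistent with a status register.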
A. Implementation
The cluster consists of 8 cores featuring 4 kB of shared I$, one IPA with 16 PEs and a GCM of 4 kB, while the TCDM is composed of 16 banks of 4 kB each, leading to an overall TCDM size of 64 kB. These architectural parameters were chosen to fit the constraints of the wide range of signal processing benchmarks presented in this paper. The SoC was synthesized with Synopsys Design Compiler 2013.12-SP3 on a STMicroelectronics 28nm UTBB FD-SOI technology library. Since the achievable frequency of the PEs in the IPA is higher than that of the RI5CY cores used in the cluster, the IPA is clocked at 100 MHz, while the rest of the cluster runs at 50 MHz (in the SS, 0.6V, -40°C corner).
In this section we present the implementation results of the heterogeneous PULP cluster. The three execution modes considered in these comparisons are: (a) single-core: running applications on a single core, (b) ipa: running applications on the IPA, where the core only takes part in offloading, (c) multi-core: running applications on parallel cores. All the benchmarks are coded in fully portable C, using the OpenMP programming model to express parallelism for PULP. Among these benchmarks, matrix multiplication, convolution, FFT, FIR, separable filter and Sobel filter are broadly used in near-sensor image and multimedia applications. The sampling scheduler of the sensors strongly depends on the computation of the greatest common divisor (GCD). Feature extraction in sensor networks widely uses CORDIC.
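The difference between the two benchmark classes can be illustrated with two small kernels in portable C. These are not the benchmark sources used in the paper; they are a minimal sketch, assuming a row-major matrix layout and the Euclidean algorithm for GCD, with the hypothetical names `matmul` and `gcd`.

```c
#include <assert.h>

/* Compute-intensive, data-parallel kernel: every output row is independent,
 * so OpenMP can distribute the outer loop across the cores. Without OpenMP
 * support the pragma is ignored and the loop simply runs sequentially. */
void matmul(const int *a, const int *b, int *c, int n)
{
    assert(n > 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int acc = 0;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Control-intensive kernel: each iteration depends on the result of the
 * previous one, so there is no data parallelism for OpenMP to exploit. */
unsigned int gcd(unsigned int x, unsigned int y)
{
    while (y != 0) {
        unsigned int t = x % y;
        x = y;
        y = t;
    }
    return x;
}
```

The loop-carried dependency in `gcd` is the reason why adding cores does not help for this class of kernels, whereas the independent rows of `matmul` scale naturally with the number of cores.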
TABLE II. SYNTHESIZED AREA INFORMATION FOR THE PULP HETEROGENEOUS CLUSTER
B. Performance and Energy Efficiency
Table III reports the execution time in nanoseconds for different benchmarks running on a single core, on 8 cores and on the IPA. The IPA execution time includes the time taken for loading the context into the PEs. Compared to single-core execution, the accelerator achieves a maximum speed-up of 8x (with a minimum of 2.49x and an average of 5.4x). Control-intensive kernels like GCD do not exhibit parallelism, hence parallel software execution does not improve the performance of the homogeneous cluster. On the other hand, execution on the IPA improves the performance by almost 5x, exploiting also instruction-level parallelism rather than data-level parallelism only. The performance gain of the accelerator for compute-intensive kernels like matrix multiplication, convolution, FIR and separable filters is limited when compared to the performance of the parallel cores. However, this relatively small performance gain over the parallel cluster is compensated by the gain in energy consumption (Table V), due to the simpler nature of the compute units of the IPA with respect to full processors, to the smaller number of power-hungry load/store operations (Table VI), and to the fine-grained power management architecture that clock gates the inactive PEs during execution (Table V). Table IV presents the performance improvement of the IPA when moving from iso-frequency to 2x-frequency execution. Although the memory bandwidth is reduced (see the loss due to additional stalls column in Table IV), since the TCDM operates at the same frequency as the rest of the cluster (i.e. half the frequency of the IPA array), an average speed-up of 1.82x (with a maximum of 1.92x and a minimum of 1.73x) can be achieved with this dual-frequency cluster architecture.
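As a reading aid, the dual-frequency speed-ups quoted above can be related to the ideal gain of 2x. The loss fraction $\ell$ below is a definition introduced here for illustration only; it is an assumption and not necessarily how the loss due to additional stalls column of Table IV is computed.

```latex
% S: measured speed-up when the IPA moves from iso-frequency to 2x frequency
% \ell: fraction of the ideal 2x gain that is lost (assumed definition)
S = \frac{T_{\mathrm{iso}}}{T_{2\times}} = 2\,(1 - \ell)
\quad\Longrightarrow\quad
\ell = 1 - \frac{S}{2}

% Applied to the reported speed-ups:
S = 1.92 \;\Rightarrow\; \ell = 0.04, \qquad
S = 1.82 \;\Rightarrow\; \ell = 0.09, \qquad
S = 1.73 \;\Rightarrow\; \ell = 0.135
```

Under this definition, the average case retains about 91% of the ideal doubling, which is consistent with the observation that the IPA is only mildly sensitive to the reduced memory bandwidth.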
The control registers are composed of a command register and a status register. We designed a simple Application Programming Interface (API) to offload tasks to the IPA and synchronize with it. The main functions are described in Table I. Before execution starts in the IPA accelerator, the cores load the corresponding context and data from the L2 memory to the GCM and the L1 memory, programming the IPA DMA and the system DMA, respectively. The context for the IPA consists of instructions and constants for the PEs, generated by the compilation flow proposed in [2]. The functions load_data_l2totcdm and load_context_l2togcm contain a set of routines to write data and context into the TCDM and the GCM, respectively. The ipa_start_execution function writes the execute command into the command register of the IPA. The completion of the execution is notified by updating the status register. The core is synchronized with the IPA execution by calling the ipa_check_status function, which checks for updates in the status register.
Synopsys PrimePower 2013.12-SP3 was used for timing and power analysis at a supply voltage of 0.6V, 25°C temperature, in typical process conditions. Table II presents the area information of the components in the cluster. Although the total area of the IPA with 16 PEs is almost the same as the area of the 8 cores combined, the area occupied by the GCM is much smaller than that of the total cache memory, which in turn provides better area efficiency while running applications on the IPA.
The power consumption profiles for the different modes of execution are presented in Figure 4, which shows the percentage of contribution by the several components in the cluster. Figure 4 (a), (b) and (c) represent the power breakdown while executing matrix multiplication in multi-core, single-core and IPA mode, respectively, and are representative of the other compute-intensive benchmarks. Similarly, (d) and (e) present the profiles for executing GCD, a control-intensive benchmark, in single-core and IPA mode, respectively. In Figure 4 (a), (b) and (c), the TCDM contributes 14.7%, 15% and 7.2% of the total power in the multi-core, single-core and IPA configurations, respectively. The reduced number of memory accesses in IPA execution helps to achieve better energy efficiency. While executing GCD in single-core mode and on the IPA, the TCDM consumes around 15.9% and 2.5% of the total power, respectively.
The simpler nature of the compute units, the low burden on the TCDM and the data exchange through the PEs explain the energy gain of 7x in the IPA execution.
VI. CONCLUSION
In this paper, we presented a novel approach towards heterogeneous computing, augmenting the PULP multi-core cluster with an ultra-low power reconfigurable accelerator. The experiments integrating the IPA in the PULP platform suggest that architectural heterogeneity is a powerful approach to improve the energy profile of computing systems. We have presented three possible executions of the benchmarks on the IPA-integrated PULP platform. The heterogeneous cluster achieves up to 4.8x speed-up and up to 4.4x better energy efficiency with respect to an 8-core homogeneous cluster.
