Abstract-IoT Edge intelligence requires Convolutional Neural Network (CNN) inference to take place in the edge device itself. The ARM big.LITTLE architecture is at the heart of common commercial edge devices. It comprises single-ISA heterogeneous multi-cores grouped into homogeneous clusters that enable performance and power trade-offs. All the cores are expected to be simultaneously employed in inference to attain maximal throughput. However, the high communication overhead involved in parallelizing the computation of a convolution kernel across clusters is detrimental to throughput. We present an alternative framework called Pipe-it that employs a pipelined design to split the convolutional layers across clusters while limiting the parallelization of their respective kernels to the assigned clusters. We develop a performance prediction model that, from convolutional layer descriptors, predicts the execution time of each layer individually on all different core types and core counts. Pipe-it then exploits the predictions to create a balanced pipeline using an efficient design space exploration algorithm. Pipe-it on average results in 39% higher throughput than the highest antecedent throughput.
Index Terms-Heterogeneous Multi-Core, Edge Inference, CNN Performance-Prediction
I. INTRODUCTION
Convolutional Neural Network (CNN) inference on an edge device has become quintessential for an enriched user experience. For example, a continuous vision task that uses inference to extract high-level semantic information from a real-time video stream is paramount in numerous edge application domains such as Advanced Driver-Assistance Systems (ADAS), Virtual Reality (VR) and Augmented Reality (AR) [16]. Inference-driven continuous vision applications project unprecedented computational requirements onto the underlying edge devices [31]. Fortunately, there has been tremendous progress in porting CNNs onto edge devices. Many network models, such as MobileNet [10], have been invented specifically for the edge to perform high-accuracy classification with considerably smaller network size. Several efficient libraries such as ARM Compute Library (ARM-CL) [1] and Tencent NCNN [5] have been constructed precisely to facilitate efficient CNN implementation for the edge. ARM-CL is highly optimized for the edge-specific ARM core architectures with inbuilt support for multi-threading and acceleration through ARM NEON vectorization technology. Single-ISA heterogeneous multi-cores, comprising processing cores that have different power-performance-area characteristics but share the same Instruction Set Architecture (ISA) [17], are now commonplace in edge devices. Heterogeneous multi-cores provide higher parallel processing potential than homogeneous multi-cores within a given power and area budget, provided all the cores can be simultaneously employed productively [23]. Figure 1 shows an abstract block diagram of the eight-core state-of-the-art ARM big.LITTLE heterogeneous multi-core present in the Hi3670 System on a Chip (SoC) used in mobile and edge devices. Hi3670 groups together four high-performance Big Cortex A73 cores and four low-performance Small Cortex A53 cores into two clusters alongside L2 caches of size 2 MB and 1 MB, respectively. The two clusters are kept fully cache-coherent via a bus-based Cache Coherent Interconnect (CCI) using a snooping broadcast protocol. Cores within a cluster are kept coherent using a bus-based Snoop Control Unit (SCU). The raw computational power provided by a heterogeneous multi-core makes inferencing on edge devices feasible.
Although inference is more efficient on edge devices with dedicated accelerators such as GPUs and dedicated IP cores, in practice, inference on CPUs is common due to their high availability [21] [28] [33]. On one hand, low-cost edge devices may not contain any inference-capable components besides the CPU; their low-end GPU might not support general-purpose computing or might provide suboptimal performance. On the other hand, CNNs are commonly used as building blocks to construct more complex systems. For applications ranging from smart classrooms [24] with person and text recognition, to autonomous drones [25] with path planning, object classification and obstacle avoidance, multiple independent inference sub-tasks are performed concurrently, which requires all the available resources (CPU, GPU) to run these inference engines in parallel. Therefore, improving inference throughput on ARM big.LITTLE-like architectures by itself is an important problem.
Motivational Example: The inherent design of a CNN mandates that, for a given image, the CNN layers are applied in a pre-ordained order, which implies that their associated convolutional kernels are also required to be processed sequentially. Nevertheless, multiple images from an image stream can indeed be processed by the CNN in parallel. Unfortunately, the default parallelization strategy, which we christen Kernel-level, employed in existing state-of-the-art deep learning libraries such as ARM-CL is designed to process an image stream sequentially, one image at a time. It is therefore forced to process only one kernel at a time, whose computation it distributes simultaneously across all the cores. The top of Figure 2 visualizes the Kernel-level strategy for a representative four-layer CNN on an eight-core heterogeneous multi-core. Further details on the Kernel-level strategy are provided in Section II. The Kernel-level strategy, which is well-suited for intra-cluster processing within one cluster, fails to scale to inter-cluster processing with multiple clusters. Figure 3 shows the inference throughput (measured in images per second) of several CNNs with an increasing number of heterogeneous cores used with the Kernel-level strategy. We observe that the throughput increases as we add more Big cores but drops sharply as we exhaust the Big cores and add an additional Small core from the other cluster to execute a kernel using Heterogeneous Multi-Processing (HMP). The drop can be explained by the inter-cluster communication overhead involved in the use of HMP for a given kernel. The throughput increases as we continue to add more Small cores but never exceeds the throughput of the execution where only all the Big cores are used without HMP. Figure 3 thereby empirically shows that we cannot improve the throughput of CNN inference on heterogeneous multi-cores with the default Kernel-level strategy alone. It is important to note that this limitation originates from the design of the Kernel-level strategy and not from the quality of its software implementation.
We observe that there are multiple convolutional layers within a CNN, of different dimensions, that project different resource requirements. We can therefore create a processing pipeline with stages composed of only homogeneous cores that still splits the convolutional-layer processing over different heterogeneous clusters. Here we use the notation {core type}{core count} to denote the core configuration of a pipeline stage. As shown in Figure 2, denoted as Layer-level, a three-stage pipeline is created to process the incoming images in a stream. Three Big cores (B3) construct the first pipeline stage processing Layer 1 and Layer 2, the remaining Big core (B1) constructs the second stage processing Layer 3, and the four Small cores (s4) construct the third pipeline stage processing Layer 4. By processing multiple images simultaneously in a pipeline, all eight heterogeneous cores are constructively engaged in the execution. Generally, the initial layers, operating on bigger inputs, require more computational power and memory than the later layers. Therefore, it is intuitive to map the initial convolutional layers to the more powerful Big cluster and the later layers to the less powerful Small cluster. However, the design space of mapping the layers to the core clusters increases exponentially with the number of layers.
Our Novel Contributions: We propose Pipe-it, a framework that partitions the CNN layers across heterogeneous cores within a heterogeneous multi-core to improve the on-chip inference throughput. We create a processing pipeline by splitting CNN layers among the heterogeneous cores, wherein a given core (or set of homogeneous cores) always processes kernels from a fixed set of layers. Still, the different pipeline stages (and the cores within) are responsible for concurrently processing different layers corresponding to consecutive images in a video stream. The pipeline improves CNN inference throughput by employing all the on-chip memory and resources of a heterogeneous multi-core more effectively than the default approach of splitting individual kernels across the heterogeneous cores.
Pipe-it in addition includes an analytical performance model that predicts the performance of a convolutional layer on different core configurations (type, count) from its network structure description. The predicted performance is then used as input to a design space exploration algorithm that navigates the design space and locates the best-fitting pipeline configuration and respective layer allocation. On average, we get a 39% improvement in throughput with the entire heterogeneous multi-core compared to using only the high-performance homogeneous Big cluster in that multi-core.
II. BACKGROUND: CNN AND ARM COMPUTE LIBRARY

ARM Compute Library (ARM-CL) [1] is a state-of-the-art framework for implementing CNNs on ARM architectures. Figure 4 shows the throughput of CNN inference implemented with the ARM-CL (version 18.05), Tencent NCNN [5] and TVM [7] frameworks running on the Big cluster using multi-threading. (The TVM results are generated with the NNVM-TVM framework using pre-trained models from the mxnet.gluon.model_zoo.vision model set [2]; GoogLeNet is not included there and is thus omitted.) Both ARM-CL and Tencent NCNN support acceleration through ARM NEON vectorization and provide NEON assembly implementations for the most computationally intensive convolution kernels of the CNN. The two frameworks present similar performance and outperform the TVM implementation without NEON acceleration. However, Tencent NCNN is not as well maintained or supported as ARM-CL, so we use ARM-CL as the foundational framework in this work.
ARM-CL is a collection of functions commonly used in machine learning. The functions are infused with hardware-specific optimizations for high performance on ARM architectures. The Graph API accompanying ARM-CL facilitates the creation of complex networks. The network is created as a graph at the frontend and then passed for execution over to the backend. Each layer in the network is a node that is connected to other nodes in the CNN sequence. Table I summarizes the architecture of several popular CNNs and their respective implementations in ARM-CL. We count the weighted layers (convolutional or fully-connected) as major layers because they are in general the most computationally expensive part of the CNNs.
Inside each node, the workload is represented as a series of compute kernels. During execution, the runtime scheduler sequentially dispatches the kernels in a node and engages the respective processing unit for execution. For example, ARM-CL implements a convolution node with NEON acceleration using im2col (Image to Column) and GEMM (GEneral Matrix Multiplication) kernels. In addition, the parallel nature of the kernels allows their computations to be distributed across multiple cores. This node-level parallelization is implemented in the form of a thread pool that spawns several new threads and distributes the computation of a kernel among them before the scheduler dispatches them for execution. We extended the default ARM-CL CNN implementations to execute multiple graphs in parallel; this allows the same network to be applied to multiple images concurrently. All the graphs share the same copy of the read-only weights and biases to improve cache utilization efficiency and reduce the main memory footprint. Each graph contains its own unique copy of the image being classified, as we assume the images in a video stream to be independent. We modify the scheduler to run under a one-thread-per-core model with minimal migrations using thread pinning for faster and more predictable execution.
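The multi-graph extension can be pictured with a minimal Python sketch (ARM-CL itself is C++; build_graph and the graph's infer() method below are hypothetical stand-ins for the extended Graph API, and are shown only to illustrate how several graph instances share one read-only weight store while each owns its activations):

import queue
from concurrent.futures import ThreadPoolExecutor

def classify_stream(images, build_graph, shared_weights, n_graphs=4):
    # All graph instances reference the same read-only shared_weights;
    # each graph owns its activation buffers, so one image is in flight
    # per graph instance at any time.
    idle_graphs = queue.Queue()
    for _ in range(n_graphs):
        idle_graphs.put(build_graph(shared_weights))

    def run(image):
        g = idle_graphs.get()          # borrow an idle graph instance
        try:
            return g.infer(image)
        finally:
            idle_graphs.put(g)         # return it for the next image

    with ThreadPoolExecutor(max_workers=n_graphs) as pool:
        return list(pool.map(run, images))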
III. CO-EXECUTION AT DIFFERENT LEVELS

A. Kernel-Level Splitting
We first exploit the ARM-CL thread pool implementation to engage all the cores in order to explore the parallelism inherent in the kernels. While the parallelization of a kernel across homogeneous cores within a cluster gives performance benefits, further parallelization across heterogeneous clusters does not improve throughput, as shown in Figure 3. Authors in [8] make a similar observation for kernel-level splitting in the context of CPU-GPU co-execution. Using multiple cores within the same cluster for processing increases the parallel L2 accesses per unit time, which are successfully handled by the cluster's Snoop Control Unit (SCU) without being overwhelmed, thus resulting in better performance. However, when engaging additional cores from a different core cluster in processing, the L2 access latency increases due to the working set being split between the two L2 caches of the two core clusters. Many of the conflict misses that occur on one cluster now get served by the L2 cache of the other cluster using the Cache Coherent Interconnect (CCI), increasing their latency. The additional L2 cache decreases the number of capacity misses going to the main memory, but it cannot compensate for the longer latency of the conflict misses.
The throughput in Figure 3 is obtained by splitting the computational workload from a kernel equally among all threads. We want to emphasize that this problem cannot be alleviated by splitting the workload disproportionately. Figure 5 shows through exhaustive search that, for most of the CNNs, no ratio of workload split between the Big and Small clusters results in statistically significantly higher throughput than when kernels run exclusively on the Big cluster. The exhaustive search indicates that the optimal execution is when the Small cluster is given little or no share of the computational work.
B. Layer-Level Splitting
Image classification CNNs are made up of multiple layers through which an image is processed sequentially. Figure 6 shows the share of processing time spent on convolutional layers in different CNNs, normalized to the total forward-pass processing time. The processing of convolutional layers dominates the overall time spent in processing the layers for all networks except the relatively older AlexNet, wherein fully-connected layers dominate the total processing. Figure 7 shows that the time taken to process convolutional layers decreases in general as we move deeper into the network. This observation can be explained by the fact that convolutional layers at the start of the network operate upon data of bigger size (and dimensionality) and produce output data of smaller size due to the application of filters. This output gets passed on to the subsequent convolutional layer, whose convolution processing time is consequently lower. This observation can be exploited in a heterogeneous multi-core wherein more powerful cores process the more processing-intensive initial layers while less powerful cores process the less processing-intensive deeper layers, creating a load-balanced processing split. Kernels from a layer can still get split among all the homogeneous cores within a cluster using kernel-level splitting without straddling different core clusters. Note that all kernels from the non-convolutional layers are considered part of the preceding convolutional layers, and get processed on the same cluster. We do not explore layer-level splitting of a CNN at non-convolutional layers unless specified otherwise.
Layer-level splitting between the clusters produces a lower number of L2 conflict misses than kernel-level splitting between the clusters, as most layers that feed data into each other are processed on the same cluster. Therefore, the amount of data required to be moved between the L2 caches of the clusters using the CCI is significantly reduced. Furthermore, layer-level splitting allows multiple images from a stream to be processed in parallel. The Big cluster can start processing layers from image Z + 1 while the Small cluster is still processing layers from image Z, whose earlier layers have previously been processed by the Big cluster. As weights and biases are shared among all images of a video stream and kernels from a given layer always get processed on the same cluster irrespective of the image, the weights and biases are not moved between clusters, unlike in a kernel-level split across heterogeneous clusters. This optimization further reduces the number of conflict misses between the clusters and thereby improves L2 cache usage efficiency.

The notation used in the remainder of the paper is summarized below.

T_multi: Execution time of a multi-threaded execution.
P = {P_1, P_2, ..., P_p}: Representation of a pipeline configuration with p stages.
P_i = (type, count): Configuration of the i-th stage in a pipeline P, e.g. (B, 3), also written as B3 for convenience.
L = {L_1, L_2, ..., L_p}: Corresponding layer allocation for pipeline P with p stages.
L_i = {l_j, ..., l_k}: A set of layers, in original order, allocated to stage P_i, also written as l_{j-k} for convenience.
T; T^{P_i}: Time matrix of execution times of a single layer on different core configurations; time array of execution times of a set of layers with core configuration P_i.
T^{P_i}_{l_j}; T^{P_i}_{L_i}: Execution time of layer l_j with stage configuration P_i; execution time of a pipeline stage P_i with its corresponding layer allocation L_i.
L_wl: A set of layers as defined in the context (workload).
IV. DESIGN SPACE
A. Split Points at Convolutional Layers
The structure of different convolutional layers can differ significantly within a network. Their performance on the Big and Small clusters with different numbers of allocated cores can also be quite different. These differences mandate non-trivial decisions on splitting the convolutional layers across the pipeline stages of Pipe-it.
To illustrate the design space, we first consider a network containing W major layers being processed with a basic two-stage layer-level split pipeline (B4-s4). The first X layers are processed on the Big cluster with a kernel-level split among all four Big cores, and the remaining (W − X) layers are processed on the Small cluster similarly. Thus, the problem here is to find the split point X that reaches the optimal performance. There are W + 1 possible split points in this pipeline. Figure 8 shows the throughput for different CNNs with the split ratio (X/W) ranging from 0 to 1. We also include the fully-connected layers of AlexNet alongside its convolutional layers as valid split points. The optimal split ranges from 0.60 for GoogLeNet to 0.90 for AlexNet.
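The two-stage search is small enough to evaluate exhaustively; a minimal sketch, assuming per-layer execution times on the full Big (B4) and Small (s4) configurations are already available (measured or predicted), is:

def best_two_stage_split(t_big, t_small):
    # t_big[j] / t_small[j]: execution time of layer j on the full Big / Small cluster.
    # Layers 0..X-1 run on Big, layers X..W-1 on Small; pipeline throughput is
    # limited by the slower of the two stages.
    W = len(t_big)
    best = None
    for X in range(W + 1):
        stage1 = sum(t_big[:X])
        stage2 = sum(t_small[X:])
        latency = max(stage1, stage2)          # bottleneck stage
        if best is None or latency < best[1]:
            best = (X, latency)
    X, latency = best
    return X, 1.0 / latency                    # split point and throughput (img/s)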
Similarly, for a three-stage pipeline, the design space is much bigger as two split points X1 and X2 need to be located. We consider a given pipeline configuration of (B4-s2-s2), which means the four Big cores, two Small cores and the remaining two Small cores are used to construct pipeline Stage 1, Stage 2 and Stage 3, respectively. Figure 9 shows the execution of ResNet50. The y axis shows the split point X1, which is the split between Stage 1 (B4) and [Stage 2 + Stage 3] (s2-s2), and is also the split between the Big and Small clusters for this pipeline configuration. The x axis shows the split point X2, which is the split between Stage 2 (s2) and Stage 3 (s2). In addition, compared with the throughput obtained by the two-stage pipeline shown in Figure 8, ResNet50 with the three-stage pipeline gives an additional 7% throughput gain, which shows the possibility of further performance gains with more pipeline stages.
B. Stages of Pipelines
We can create pipelines with many more stages (up to H on a heterogeneous multi-core with H cores) in pursuit of attaining higher throughput for CNN inference. As a kernel-level split between the clusters is not helpful, as shown in Figure 3, we eliminate pipeline designs with heterogeneous core types within a pipeline stage. Furthermore, as CNNs usually have more compute-intensive convolutional kernels at the beginning, as shown in Figure 7, we consider only pipeline configurations with the Big cores for the initial convolutional layers and the Small cores for the subsequent convolutional layers. The number of different pipelines possible, C_p, with p pipeline stages on a heterogeneous multi-core with H_B Big cores and H_s Small cores can be calculated by Equation (1). As we only consider pipeline stages made up of homogeneous cores, we use p_B and p_s to denote the number of stages constructed with the Big and Small clusters, respectively. For a given (p_B, p_s), the total number of different pipelines that can be constructed is therefore \binom{H_B-1}{p_B-1} \binom{H_s-1}{p_s-1}. However, the values of p_B and p_s must satisfy p_B + p_s = p, 1 \le p_B \le H_B and 1 \le p_s \le H_s in order to construct a meaningful p-stage pipeline. Thus the range of p_B is derived with a minimum value of \max(1, p - H_s) and a maximum value of \min(H_B, p - 1). We then go through p_B and calculate the total number of different pipelines possible with p stages as in Equation (1):

C_p = \sum_{p_B = \max(1,\, p - H_s)}^{\min(H_B,\, p - 1)} \binom{H_B - 1}{p_B - 1} \binom{H_s - 1}{p - p_B - 1}    (1)
The total number of design points for a CNN with W convolutional layers (D_W) in layer-level splitting on an H-core heterogeneous multi-core is given by Equation (2):

D_W = \sum_{p=2}^{H} C_p \binom{W}{p - 1}    (2)
For the prototype board with an eight-core heterogeneous multi-core architecture, there are in total 64 possible pipelines (with p = 2 to 8), as calculated with Equation (1). Furthermore, there are in total 5,379,616 distinct possible design points for MobileNet with its 28 convolutional layers and the respective pipeline configurations, as calculated using Equation (2).
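Both counts can be reproduced with a short script implementing Equations (1) and (2) as written above (a sketch; the helper names are our own):

from math import comb

def num_pipelines(H_B, H_s, p):
    # Number of p-stage pipelines with all Big stages before all Small stages, Equation (1).
    total = 0
    for p_B in range(max(1, p - H_s), min(H_B, p - 1) + 1):
        p_s = p - p_B
        # compositions of H_B cores into p_B stages, and of H_s cores into p_s stages
        total += comb(H_B - 1, p_B - 1) * comb(H_s - 1, p_s - 1)
    return total

def num_design_points(H_B, H_s, W):
    # Total layer-to-stage allocations summed over all pipelines, Equation (2).
    H = H_B + H_s
    return sum(num_pipelines(H_B, H_s, p) * comb(W, p - 1) for p in range(2, H + 1))

print(sum(num_pipelines(4, 4, p) for p in range(2, 9)))  # 64 pipelines on a 4+4 multi-core
print(num_design_points(4, 4, 28))                        # 5379616 design points for MobileNet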
C. The Pipe-it Framework
In order to quickly traverse the huge design space and locate the best configuration to execute a given CNN workload, we present the Pipe-it framework in two parts. Pipe-it first predicts the layer-level execution time on all possible core configurations of the multi-core platform from the statically available network layer configuration descriptors (Section V). With the timing information, Pipe-it goes through the design space heuristically and predicts the optimal pipeline configuration and the corresponding workload allocation (Section VI).
V. LAYER-LEVEL PERFORMANCE ESTIMATION
The most time-consuming part of a CNN is the execution of the convolutional layers, where the input image tensors are convolved with filters to generate the respective output image tensors, which feed into the following layers as inputs. Given the extensive calculation required, hardware-dependent implementation and optimization techniques are applied to accelerate the execution of the convolutions.
General Matrix Multiplication (GEMM) is commonly used to implement convolution. In ARM-CL, the input image tensor and filter are first converted into matrices (Im2col kernel), followed by the GEMM execution, and the result is finally transformed back from matrix to output image tensor format (Col2Im kernel). Authors in [22] show that the execution time of convolution, effectively matrix multiplication, is linearly correlated to the dimensions of the matrices. We adopt a similar approach that correlates the statically available descriptors of each convolutional layer with the layer execution times. Unlike [22], which considers the overall execution time of the network, we evaluate and model the individual convolutional layers. In addition, we consider the effect of multi-threading and predict the execution time for different core configurations.
A. Convolution as GEMM
As shown in Figure 10, consider a convolutional layer with an input image tensor of size (width, height, depth) {I_w, I_h, I_d} and filters of size (width, height, depth, number of output feature maps) {F_w, F_h, F_d, Ofm}, with padding Pad and stride S. The output tensor {O_w, O_h, O_d} generated from this convolution operation has the size shown in Equation (3):

O_w = \frac{I_w - F_w + 2 \cdot Pad}{S} + 1, \quad O_h = \frac{I_h - F_h + 2 \cdot Pad}{S} + 1, \quad O_d = Ofm    (3)

In practice, the input tensor and filter are required to have matching depth (I_d = F_d), and are usually square in shape (I_w = I_h, F_w = F_h).
The convolution is implemented as a GEMM of image and filter matrices. As shown in Figure 10 in shaded red color, the input tensor is divided into small patches of the size of one filter ({F_w, F_h, F_d}). The patches are re-arranged as rows of the image matrix. Similarly, the filters are re-arranged into columns of the filter matrix. Thus the convolution is transformed into a GEMM of an image matrix ([N × K]) and a filter matrix ([K × M]), which generates a result matrix of size [N × M] that is later reshaped into the output tensor. The sizes of the dimensions of the matrices are calculated as shown in Equation (4):

N = O_w \cdot O_h, \quad K = F_w \cdot F_h \cdot F_d, \quad M = Ofm    (4)

The total number of arithmetic operations is therefore proportional to N \cdot K \cdot M.
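A small helper can compute these quantities directly from the layer descriptors (a sketch using the standard convolution-output formula assumed in Equation (3)):

def gemm_dims(I_w, I_h, I_d, F_w, F_h, F_d, Ofm, pad, stride):
    # Map a convolutional layer descriptor to its GEMM dimensions.
    # Returns (N, K, M): image matrix is N x K, filter matrix is K x M.
    assert I_d == F_d, "input tensor and filter must have matching depth"
    O_w = (I_w - F_w + 2 * pad) // stride + 1   # Equation (3)
    O_h = (I_h - F_h + 2 * pad) // stride + 1
    N = O_w * O_h                               # one row per output position
    K = F_w * F_h * F_d                         # one column per filter element
    M = Ofm                                     # one column per output feature map
    return N, K, M

# e.g. an AlexNet-style first layer: 227x227x3 input, 11x11x3x96 filters, stride 4
print(gemm_dims(227, 227, 3, 11, 11, 3, 96, pad=0, stride=4))  # (3025, 363, 96)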
The compute time of GEMM is a complex function of the memory accesses and arithmetic computations, as well as of the inherent parallelism in the given convolutional kernel that can be exploited by multi-threading.
B. Single Core Estimation
To capture the execution behaviour, we create a set of micro-benchmarks with ARM-CL. The micro-benchmarks contain representative convolutional layers with the desired configurations of various input sizes and filter sizes commonly used in the networks. The input image and filter parameters are randomly generated for measurement purposes only. The GEMM execution time is measured across the different configuration points of these parameters. We observe a linear correlation between the dimensions of the matrices (N, K, M) and the execution time of GEMM, similar to the authors in [22]. Considering a single-core configuration, we can thereby model the execution time of a convolutional layer T using linear regression of (N, K, M) with interaction terms, as shown in Equation (5). The interaction terms can be physically interpreted as the sizes of the matrices involved in GEMM (NK, KM, NM) and the total number of arithmetic operations (NMK).

T = \beta_1 N + \beta_2 K + \beta_3 M + \beta_4 NK + \beta_5 KM + \beta_6 NM + \beta_7 NMK + \beta_8    (5)
where β_1, β_2, ..., β_8 are constants determined with the help of linear regression.
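A sketch of how such coefficients could be fitted from the micro-benchmark measurements follows (NumPy least squares; the feature ordering and helper names are our own choice, not ARM-CL code):

import numpy as np

def features(N, K, M):
    # Regressors of Equation (5): dimensions, matrix sizes, operation count, bias term.
    return [N, K, M, N * K, K * M, N * M, N * M * K, 1.0]

def fit_layer_model(samples):
    # samples: list of ((N, K, M), measured_time) pairs from the micro-benchmarks.
    X = np.array([features(*dims) for dims, _ in samples])
    y = np.array([t for _, t in samples])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta                                  # beta_1 ... beta_8

def predict_layer_time(beta, N, K, M):
    return float(np.dot(beta, features(N, K, M)))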
C. Multi-core Estimation
Several optimization techniques have been proposed to accelerate GEMM, including memory management through tiling and exploitation of parallelism on multi-core architectures. The benefit derived is largely dependent on the quality of the implementation. In ARM-CL, GEMM is implemented with a tile size (ts) determined according to the cache sizes to achieve optimal memory behaviour. For execution on an H-core architecture, H threads are created. The total workload is divided along the rows of the image matrix into chunks of "iterations" (with count n_iter = N/ts). These iterations are then dispatched either statically or dynamically to the available threads. As a result, a thread t is assigned iter_t iterations to execute sequentially, and the workload assigned to all H threads adds up to the total number of iterations (\sum_{t=1}^{H} iter_t = n_iter). For execution of GEMM with a single thread, all iterations (n_iter) are assigned to and processed sequentially on one thread, with execution time T obtained from Equation (5). Therefore, the time of each iteration can be modelled as T_iter = T / n_iter (Equation (6)). In an execution with the workload distributed among H threads, the total time taken is the execution time of the slowest thread, as shown in Equation (7), where α_1, α_2 and α_3 are constant coefficients obtained using linear regression.
Now consider a workload distribution among homogeneous cores with equal processing capability; an equal split can be expected (iter_t = n_iter/H = N/(ts · H)). The multi-threaded execution time T_multi can therefore be modelled with the matrix size N, tile size ts and number of cores H, as shown in Equation (8).
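The sketch below illustrates this multi-core estimate; since the exact regression form of Equations (7)-(8) is not reproduced here, the overhead terms attached to the α coefficients are placeholders for whatever fitted form the model uses, not the paper's exact expression:

import math

def predict_multicore_time(T_single, N, tile_size, H, alpha=(1.0, 0.0, 0.0)):
    # T_single: predicted single-core time from Equation (5).
    # N, tile_size: rows of the image matrix and GEMM tile size, giving
    # n_iter = ceil(N / tile_size) work iterations (Equation (6)).
    # H: number of homogeneous cores; iterations are split evenly among threads.
    # alpha: fitted coefficients capturing threading overheads (placeholders).
    a1, a2, a3 = alpha
    n_iter = math.ceil(N / tile_size)
    t_iter = T_single / n_iter                   # time of one iteration, Equation (6)
    iters_per_thread = math.ceil(n_iter / H)     # the slowest (most loaded) thread
    return a1 * iters_per_thread * t_iter + a2 * H + a3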
Table III shows the prediction error for all the possible homogeneous core allocations. The proposed model is able to predict the execution time of individual convolutional layers with, on average, an acceptable 13.2% and 11.4% prediction error for the Big and Small cores respectively, across all core configurations with five benchmark CNNs. As a comparison, a 13.4% prediction error is reported in [22] for the overall CNN inference time with only two CNNs.
VI. DESIGN SPACE EXPLORATION
We can design many different pipelines with different numbers of stages, with each stage having a different processing core combination, for a heterogeneous multi-core. In addition, for a given pipeline, the number of design points in allocating the workload to the different pipeline stages grows exponentially with the total number of convolutional layers. Therefore, we propose a robust heuristic approach that quickly navigates the design space to obtain a high-performing layer-level split design point for any CNN. A two-step approach is presented, wherein the splitting of the workload (layers) is first predicted for a given pipeline configuration (Section VI-B), and then the configuration is adjusted through the merging of adjacent stages to search for a better pipeline configuration (Section VI-C). The two steps are iteratively engaged to approach a high-throughput pipeline configuration and the corresponding workload distribution.
A. Definitions
Consider a CNN with W convolutional layers to be deployed on an (H_B + H_s)-core heterogeneous multi-core system with H_B Big cores and H_s Small cores. We aim to find a pipeline configuration P and the corresponding layer distribution L that give the optimal execution performance.
For a pipeline P with p stages, we use P = {P_1, ..., P_p} to define the core configuration of each pipeline stage. A pipeline stage (P_i) is defined as a tuple of the type of cores and the count of cores that construct the pipeline stage, as shown in Equation (9):

P_i = (type, count), \quad type \in \{B, s\}    (9)

Since only homogeneous cores are used to construct a pipeline stage, the core type can only be either B or s (no mixing of B and s). In total, there can be H_B core combinations for Big cores ((B, 1), (B, 2), ..., (B, H_B)) and similarly H_s core combinations for Small cores, thus (H_B + H_s) possible different pipeline stage configurations.
The corresponding layer allocation associated with the pipeline is defined as L = {L_1, ..., L_p}, where L_i is the set of layers allocated to pipeline stage P_i. For example, if all W layers are allocated to P_i, we denote it as L_i = {l_1, ..., l_W}. If no layers are allocated to P_i, we have L_i = ∅.
The execution time of a layer on a pipeline stage configuration is predicted with the performance prediction models described in Section V. Here we use a time matrix T to represent the predicted execution times. For a layer l_j executing on a core configuration P_i, the execution time is represented as T^{P_i}_{l_j}. Similarly, the execution time of a pipeline stage P_i with layer allocation L_i can be represented as shown in Equation (10):

T^{P_i}_{L_i} = \sum_{l_j \in L_i} T^{P_i}_{l_j}    (10)
B. Work-Flow Split Determination
As shown in Figure 7, we make the assumption that, in general for CNN inferencing applications, the earlier layers are more compute-intensive than the later layers and thereby require more processing power. Thus, we order the pipeline stages to have the more compute-capable core combinations at the beginning, with decreasing compute capability for stages deeper into the pipeline. In addition, such an arrangement ensures a linear expansion in layer processing time as we move down the pipeline stages. The compute capability is evaluated by the average execution time of the layers. For the heterogeneous 8-core platform, the compute capability in executing a layer l with homogeneous core combinations observed in experiments follows Equation (11):

T^{(B,4)}_{l} \le T^{(B,3)}_{l} \le T^{(B,2)}_{l} \le T^{(B,1)}_{l} \le T^{(s,4)}_{l} \le T^{(s,3)}_{l} \le T^{(s,2)}_{l} \le T^{(s,1)}_{l}    (11)
For a pipeline P with p stages and layer allocation L, the throughput of the pipeline is determined by the pipeline stage with the longest latency, as shown in Equation (12):

Throughput(P, L) = \frac{1}{\max_{i \in \{1, ..., p\}} T^{P_i}_{L_i}}    (12)

The goal is therefore to balance the workload among all the pipeline stages to achieve the minimal stage latency and thus the optimal achievable throughput.
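These two quantities translate directly into a couple of small helpers (a sketch mirroring Equations (10) and (12) as written above; T[layer][stage] denotes the predicted time of a layer on a stage configuration):

def stage_time(T, stage, layers):
    # Latency of one pipeline stage: sum of its layers' predicted times, Equation (10).
    return sum(T[layer][stage] for layer in layers)

def pipeline_throughput(T, P, L):
    # Throughput (images/s) of pipeline P with layer allocation L, Equation (12).
    slowest = max(stage_time(T, stage, layers) for stage, layers in zip(P, L))
    return 1.0 / slowest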
Consider two adjacent pipeline stages P_i and P_{i+1} in a pipeline, with a set of layers L_wl = {l_a, ..., l_b} (in the original order) to be divided and allocated, as shown in Algorithm 1. The ordering of pipeline stages ensures that any layer l_j executes faster on P_i than on P_{i+1} (T^{P_i}_{l_j} < T^{P_{i+1}}_{l_j}). Such an arrangement results in an expansion in execution time as we move deeper into the pipeline, ensuring a one-way flow of workload.
To start with, all the workload is allocated to the faster stage P_i (L_i = L_wl, L_{i+1} = ∅). Clearly, the pipeline is bottlenecked at P_i. We try to move layers to P_{i+1} to balance the workload of each stage. As the layers are ordered in the original sequence of CNN execution, we start with the last layer allocated to P_i, which is layer l_b. Moving layer l_b to P_{i+1} is helpful as long as P_i remains the bottleneck after the move (T^{P_i}_{L_i \setminus \{l_b\}} > T^{P_{i+1}}_{\{l_b\}}), because the pipeline is then still bottlenecked at P_i but with a shorter stage time, resulting in an improvement in overall throughput. We keep moving layers until layer l_k, when P_{i+1} would become the bottleneck instead. Further moving of layers would only make stage P_{i+1} even slower. Thus the best split between the two adjacent pipeline stages is L_i = {l_a, ..., l_k} and L_{i+1} = {l_{k+1}, ..., l_b}, respectively.
We then go to the next pair of adjacent pipeline stages (P_{i+1} and P_{i+2}) to balance their stage latencies. As shown in Algorithm 2, the algorithm goes through all the stages in the pipeline, balancing the workload of each stage with its immediate next stage. We symbolize the workload as water that flows from the first pipeline stage to the deeper stages. As more workload flows to the deeper stages, there is more room in the previous stages, so the algorithm is engaged iteratively to reach the final splitting configuration.

Algorithm 1: find_split: Algorithm for workload split between adjacent pipeline stages.
Initialisation: L_i = L_wl = {l_a, ..., l_b}; L_{i+1} = ∅
1: for l_j ∈ L_wl, starting from l_b and moving backwards, do
2:   T_i^new = T^{P_i}_{L_i \ {l_j}}
3:   T_{i+1}^new = T^{P_{i+1}}_{L_{i+1} ∪ {l_j}}
4:   if (T_i^new > T_{i+1}^new) then
5:     L_i = L_i \ {l_j}; L_{i+1} = L_{i+1} ∪ {l_j}   // move of l_j is helpful
6:   else
7:     break   // further flow of workload will not be helpful
8:   end if
9: end for
10: return L_i, L_{i+1}
Algorithm 2: work_flow: Algorithm for workload allocation for a multi-stage pipeline.
Initialisation: L_1 = {l_1, ..., l_W}; L_i = ∅ for i > 1
1: repeat
2:   L_prev = L
3:   for P_i, P_{i+1} ∈ P do
4:     L_i, L_{i+1} = find_split(L_temp = L_i ∪ L_{i+1}, P_i, P_{i+1})
5:   end for
6: until L = L_prev   // iterate until the allocation stabilizes
7: return L
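A compact Python rendering of Algorithms 1 and 2, under the reconstruction given above, is sketched below (layer identifiers are kept in their original order; T[layer][stage] is the predicted time of a layer on a stage configuration, as in the earlier helpers):

def find_split(T, L_wl, P_i, P_next):
    # Algorithm 1: split the ordered layer set L_wl between two adjacent stages.
    L_i, L_next = list(L_wl), []
    for layer in reversed(L_wl):                     # start from the last layer
        t_i_new = sum(T[l][P_i] for l in L_i if l != layer)
        t_next_new = sum(T[l][P_next] for l in L_next + [layer])
        if t_i_new > t_next_new:                     # P_i stays the bottleneck: move helps
            L_i.remove(layer)
            L_next.insert(0, layer)                  # preserve the original layer order
        else:
            break                                    # further flow is not helpful
    return L_i, L_next

def work_flow(T, P, layers):
    # Algorithm 2: balance an ordered layer list over all stages of pipeline P.
    L = [list(layers)] + [[] for _ in range(len(P) - 1)]
    for _ in range(100):                             # iterate to a fixed point (bounded)
        changed = False
        for i in range(len(P) - 1):
            new_i, new_next = find_split(T, L[i] + L[i + 1], P[i], P[i + 1])
            if new_i != L[i] or new_next != L[i + 1]:
                L[i], L[i + 1] = new_i, new_next
                changed = True
        if not changed:
            break
    return L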
C. Pipeline Stage Merging
As GEMM exhibits abundant data parallelism, running it with multi-threading is beneficial. However, we observe concavity in the attainable multi-threaded speedup, as shown in Figure 11, due to saturating Thread-Level Parallelism (TLP). Furthermore, different types of layers derive different levels of benefit from multi-threading. As discussed in the previous sections, co-execution across heterogeneous cores does not show performance improvement due to coherency issues, and thereby we only consider homogeneous core configurations. As shown in Algorithm 3, we consider the Big cluster first and then move on to the Small cluster.
For an (H_B + H_s)-core heterogeneous multi-core, we start with an (H_B + H_s)-stage pipeline where each stage comprises only one core. The work_flow algorithm is engaged to search for the best split of the workload (layers) for this pipeline configuration as initialisation. As single-core performance is suboptimal, the pipeline is likely to be bottlenecked by layers that require more compute capability. Thus, we merge pipeline stages to create a more compute-capable stage to alleviate the bottleneck.
Consider the merger of stages P_i and P_{i+1} into a new stage P_i^{new}, with originally allocated sets of layers L_i and L_{i+1}, and with the core configuration shown in Equation (13):

P_i^{new} = (type, P_i.count + P_{i+1}.count)    (13)

Note that P_i and P_{i+1} must be of the same core type in order to merge.
The merging is only helpful when the merged stage processes the combined workload faster than the slower of the two original stages, i.e., when Equation (14) holds true:

T^{P_i^{new}}_{L_i \cup L_{i+1}} < \max\left(T^{P_i}_{L_i},\, T^{P_{i+1}}_{L_{i+1}}\right)    (14)

Otherwise, no further merging of the involved stages to create an even more capable stage will be helpful either, due to the concavity in speedup (Figure 11).
If merging is helpful, we then update the pipeline configuration and engage the work_flow algorithm again to find a new, higher-performing layer split. As different layers respond differently to stage configurations, the merging decision depends largely on the layers allocated to the stage. Therefore, re-allocation of the workload is necessary to present the right layer information to the merging algorithm. The algorithm runs iteratively until no further merging of stages is helpful.
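A sketch of this merging loop, built on the work_flow and stage_time helpers sketched earlier, is shown below; it assumes T also carries predicted times for the merged configurations (e.g. ('B', 2)), and it simplifies the stopping rule by simply moving past a pair whose merge is not helpful:

def merge_stage(T, layers, H_B, H_s):
    # Algorithm 3 (sketch): start from single-core stages and merge while helpful.
    P = [('B', 1)] * H_B + [('s', 1)] * H_s
    L = work_flow(T, P, layers)
    i = 0
    while i < len(P) - 1:
        if P[i][0] != P[i + 1][0]:               # only same-type stages can merge
            i += 1
            continue
        merged = (P[i][0], P[i][1] + P[i + 1][1])            # Equation (13)
        t_merged = sum(T[l][merged] for l in L[i] + L[i + 1])
        t_before = max(stage_time(T, P[i], L[i]),
                       stage_time(T, P[i + 1], L[i + 1]))
        if t_merged < t_before:                              # Equation (14)
            P = P[:i] + [merged] + P[i + 2:]
            L = work_flow(T, P, layers)                      # re-balance the layers
        else:
            i += 1                    # merging this pair is not helpful; move on
    return P, L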
D. An Example
We illustrate with an example how the aforementioned algorithms work to locate the optimal pipeline configuration and workload allocation. We take the ResNet50 benchmark, as shown in Table I, with 54 major layers, to deploy on the 4-Big 4-Small heterogeneous multi-core architecture. For this architecture, in total eight different pipeline stage configurations can be created with different core combinations; thus eight different sets of layer execution times are predicted, and the time matrix T is of size (54, 8). These serve as the inputs to the merge_stage algorithm (Algorithm 3).
The algorithm initialises an 8-stage pipeline where each stage consists of a single core, and engages work_flow (Algorithm 2) to find the split for this initial 8-stage pipeline.
Algorithm 3: merge_stage: Algorithm for determining the stage configuration and corresponding workload allocation.
Initialisation: P = {(B,1), ..., (B,1), (s,1), ..., (s,1)} with (H_B + H_s) single-core stages; L = work_flow(P)
1: for each pair of adjacent same-type stages P_i, P_{i+1} ∈ P do
2:   if (Equation (14) holds true) then
3:     merge P_i and P_{i+1} into P_i^{new} as per Equation (13); update P
4:     L = work_flow(P)   // re-balance the layers on the new pipeline
5:   else
6:     continue with the next core type   // further merging will not help
7:   end if
8: end for
9: return P, L
In the work_flow algorithm, all the layers are first allocated to the first pipeline stage P_1. find_split (Algorithm 1) is then engaged to balance the workload between the first two stages P_1 and P_2. Starting with the last layer allocated to P_1, which is l_54, layers are moved to be processed by P_2 until the two stages are balanced. The find_split algorithm returns L_1 = l_{1-25} and L_2 = l_{26-54}.
Here we use l_{1-25} as a notation for {l_1, ..., l_25} for convenience. Thus the workload allocation is updated to L = {l_{1-25}, l_{26-54}, ∅, ∅, ∅, ∅, ∅, ∅}.
The work_flow algorithm then continues to balance the workload between P_2 and P_3, and then the rest of the pipeline stages. The first iteration returns L = {l_{1-25}, l_{26-38}, l_{39-46}, l_{47-50}, l_{51}, l_{52-54}, ∅, ∅}. Since the workload on P_2 has been re-balanced with P_3 and the following stages, the algorithm returns to re-balance the workload of P_1 and P_2 again. The iterative re-balancing in the end returns L = {l_{1-18}, l_{19-32}, l_{33-41}, l_{42-48}, l_{49-51}, l_{52-54}, ∅, ∅}. Note that the last two pipeline stages are not allocated any workload because of their poor compute capability. Merging of stages is therefore necessary in order to achieve higher performance.
Continuing with Algorithm 3, the merger of the first two stages P_1 and P_2 to create a stage comprising two cores ((B, 2)) is evaluated. If Equation (14) holds true, the workload allocation is recalculated with work_flow (Algorithm 2); otherwise, the merger of the stages is not helpful and the algorithm will not try further mergers. In this case, the merger is helpful and the pipeline configuration is updated to P = {(B, 2), (B, 1), (B, 1), (s, 1), (s, 1), (s, 1), (s, 1)}, with L = {l_{1-29}, l_{30-38}, l_{39-48}, l_{49-51}, l_{52}, l_{53-54}, ∅}. The algorithm then goes on to merge P_1 and P_2 to create (B, 3), and so on. The workload allocation is recalculated every time the pipeline configuration is updated. The merging then continues for the Small cluster following similar rules. At last, the pipeline configuration is updated to a 3-stage pipeline with configuration P = {(B, 4), (s, 2), (s, 2)} and workload allocation L = {l_{1-35}, l_{36-44}, l_{45-54}}.
VII. EXPERIMENTAL EVALUATION
We conduct experimental evaluations on the Hikey 970 mobile development platform [3] for the five CNN models specified in Table I. Figure 12 shows a photo of the board in use. The board features an ARM big.LITTLE octa-core CPU with Cortex A73 cores at 2.4 GHz and Cortex A53 cores at 1.8 GHz. The board is connected to a normal desktop monitor through an HDMI cable for display. It is equipped with an inbuilt WiFi module and is connected to the host machine through SSH access. A standard DC 5V USB fan is used in the experiments to eliminate unstable thermal effects.
For each model, we classify a continuous stream of 50 images and report the average throughput (images processed per second). Recall that the kernel-level split on all eight heterogeneous cores performs worse than on the four homogeneous Big cores. Therefore, our baseline configuration is a kernel-level split on the four homogeneous Big cores, which provides the best possible throughput with the default ARM-CL implementation, as shown in Table IV.
A. Resultant Configurations
The outcome of our design space exploration, in the form of pipeline stages P and layer allocation L, is shown in Table V. Here we simplify the notation for easier representation. For example, the pipeline configuration B4-s2-s2 for ResNet50 implies three pipeline stages consisting of four Big cores, two Small cores, and two Small cores, respectively. Convolutional layers 1-35 are allocated to the first pipeline stage, 36-44 to the second stage, and 45-54 to the third stage.
The respective pipelines are implemented and executed on the development platform to measure the throughputs shown in Table IV. On average, the proposed algorithm selects a layer-level split that improves throughput by 39% over the baseline.
In all cases, the throughput obtained through the pipelined configuration approaches the combined throughput of the individual clusters. Even higher throughput is obtained for GoogLeNet and ResNet50. Such improvements are due to the formation of pipeline stages with a subset of the cores in a cluster. The layers that benefit less from kernel-level splits can now execute with fewer cores, while the remaining cores concurrently execute other layers, resulting in higher resource utilization and throughput.
On the other hand, we are not able to obtain the optimal throughput on this platform because an exhaustive search is not possible. In the experiments, to ensure fairness in comparison, each benchmark is executed 50 times and the board is then left idle to cool down, resulting in approximately 10 seconds of testing time for one design point. As discussed in Section IV, the design space explodes with the number of layers in a CNN and the number of core combinations of a multi-core platform.
For an average-size CNN with 5 million design points, the exhaustive search would take hundreds of days to run.
B. Layer Performance Model
The layer performance model is constructed from the micro-benchmarks we created. As the layer execution times are used for predicting the optimal pipeline configurations, the relations between different clusters as well as the multi-core behaviours are more important than the absolute error against the actual measurements. Nevertheless, as shown in Table III, the analytical model presents good accuracy, with on average 13.2% error for the Big cluster and 11.4% for the Small cluster across all core combinations and all the convolutional layers of the CNN applications.
As a comparison, we use the same algorithm with the actual measured layer timings to predict the pipeline configuration, as shown in Table VI. Comparing with the configurations in Table V and the performance in Table IV, the layer performance prediction model is effective in supporting the prediction of optimal pipeline configurations. In most cases, the predicted layer timings give the same pipeline configuration as the actual layer timings. The layer allocations are predicted differently, which gives a mere 4% worse performance. For ResNet50, the pipeline configuration predicted with the actual layer timings gives a 5% benefit.
C. Power Efficiency
As no power sensors or power-sense resistors are documented for the Hikey 970 development board, we are not able to obtain the individual power values of each component. Thus we utilize a power measurement module [4] that supplies and measures the whole-board power consumption. For homogeneous runs, the cluster that is not engaged in the execution is turned off to eliminate its effect on the total power consumption. As the socket power is measured for the whole board, the power reading (P) includes everything on board, including peripherals such as the display and LED lights. We eliminate the effect of other components by subtracting an "idle" power (P_I). As the idle power varies with many factors, we approximate it with the power readings taken shortly before the execution of a CNN benchmark. The active power readings reported are therefore P_A = P − P_I. The power measurements and corresponding power efficiency are shown in Table VII. Note that since the whole-board power is measured, memory power is included in the measurement. The much more power-efficient Small cluster, which has very low core power consumption, therefore shows lower power-efficiency for memory-intensive CNNs like AlexNet. The pipelined configuration is expected to present a power efficiency in between those of the Big and Small clusters, as a weighted average. The drop in power-efficiency can be explained by the extra memory power induced by coherency traffic between the different core clusters.
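The active-power correction and the efficiency metric amount to the following trivial computation (a sketch; the numeric power values in the usage line are placeholders, not measurements from the paper):

def power_efficiency(throughput_img_s, board_power_w, idle_power_w):
    # Images per joule using active power P_A = P - P_I (whole-board measurement).
    active_power_w = board_power_w - idle_power_w
    return throughput_img_s / active_power_w

# usage (placeholder power readings): 8.9 img/s, 8.0 W board power, 3.0 W idle
print(power_efficiency(8.9, 8.0, 3.0))   # -> 1.78 img/J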
D. Quantization Considerations
Pipe-it aims to improve throughput by engaging all the on-chip resources in the execution. The framework is therefore orthogonal to optimization techniques such as quantization [29]. ARM-CL provides support for execution with quantized 8-bit asymmetric integers (QASYMM8). However, the benefit of quantization is largely dependent on the implementation. As observed by the authors in [26], the benefit of quantization on the convolutional layers is compromised by the induced overhead of de-quantization and re-quantization operations. We observe a similar effect by comparing the execution of F32 and QASYMM8 for MobileNet with ARM-CL, as shown in Figure 13. The execution of the convolutional layers improves by 14% but the overall execution time remains unchanged for ARM-CL v18.05. We then compare across the different implementations with the up-to-date ARM-CL version 18.11. With improved hardware-level optimization, ARM-CL v18.11 is able to execute 20% faster for F32. With quantization, the convolutional layers are 24% faster, with an overall 19% execution-time improvement.
Nevertheless, as the above-mentioned execution performance is with the homogeneous cores only, the Pipe-it technique is orthogonal and can introduce further performance improvement. We implemented Pipe-it on top of both framework versions for the original and quantized MobileNet. Figure 13 shows the effective latency per frame of Pipe-it (the inverse of the achieved throughput). For the most effective implementation, quantized MobileNet with ARM-CL v18.11, we observe an 18% performance improvement over the default implementation, reaching a throughput of 31 Imgs/s.
E. Comparison with Other Frameworks
To evaluate Pipe-it against other CNN frameworks, we compare the effective throughput with several other frameworks. We take MobileNet, which is commonly benchmarked. The reported results are the execution performance on the CPUs of the SoC, taken from benchmark comparisons done online [5] [6]. As the benchmarking is performed on different technology nodes, we scale the performance to match the platform Pipe-it is tested on according to AI-benchmark [14], which provides NN performance comparisons across the most commonly seen mobile platforms and SoCs. On the other hand, the performance of TVM, NCNN, Pipe-it and Pipe-it** is measured with actual experiments on the Hikey 970 platform. As shown in Figure 14, none of the other frameworks exploits the multi-core SoC as effectively as Pipe-it, which gives the best effective throughput.
In addition, we compare the energy efficiency of Pipe-it against DeepX ([18], see Section VIII), which aims to consume the least power within a latency requirement. The DeepX experiments are performed on a Qualcomm Snapdragon 800 SoC with a 4-core 2.3 GHz Krait CPU. With a latency requirement of 500 ms (2 Img/s), DeepX provides a configuration that consumes 444 mJ of energy per image for AlexNet, giving an equivalent energy efficiency of 2.2 Img/J. Pipe-it is able to achieve a comparable energy efficiency of 1.8 Img/J with a much higher throughput of 8.9 Img/s.

VIII. RELATED WORK

The development of CNNs is moving towards more complex network structures with moderate resource requirements. Starting with 250 MB for AlexNet [16] in 2012, the size of models has reduced to less than 0.5 MB for SqueezeNet [13] in 2016 without losing accuracy. Such advancements allow for CNN deployment on mobile platforms even with their limited computational and memory resources. To effectively deploy CNNs on embedded platforms, researchers are approaching the problem from different angles. The network structure is modified to fit on the resource-constrained mobile platform, for example through quantization [29], which accelerates the computation and reduces the memory usage, and network pruning [30], which trades a small loss in accuracy for lower resource requirements.
Accelerators that are highly energy-efficient are used to enable CNNs on mobile platforms. As GPUs are well-suited for inferencing, many works rely on the computational capability of the embedded GPU to enable CNNs through collaborative execution of the CPU, GPU and other processors. The DeepX [18] framework enables NN applications on embedded systems through co-execution on multiple processors, including the GPU and low-power processors (LPUs). DeepX first engages runtime layer compression to control the resource requirement of the NN application. It then decomposes the model into unit-blocks to be assigned to multiple processors. While gaining substantial benefits in performance and energy, the benefit for CNNs mainly comes from the fully-connected layers of AlexNet, which are not frequently used in newer CNNs. DeepSense [12] and DeepMon [11] present OpenCL-based frameworks for mobile GPUs. DeepSense adopts GPU memory management techniques that accelerate the compute-heavy executions, including the convolutional and fully-connected layers, on the GPU, while all the other operations are performed on the CPU. DeepMon extends DeepSense with further caching optimizations and an optimized convolutional layer implementation. In addition, new ASICs are designed specifically for neural network processing, such as Google's Tensor Processing Unit (TPU) and Huawei's NPU. Researchers in addition co-design the algorithm and architecture with application-specific characteristics [31] [32].
To facilitate the implementation of CNNs on mobile platforms, frameworks such as [15] have been created to ease the deployment of CNNs on embedded systems by automatically generating C code and GPU code (CUDA or OpenCL) that runs on the respective mobile platforms according to hardware specifications and optimization requirements. On the other hand, while GPUs and dedicated accelerators show exceptional performance for CNN applications, these techniques are not applicable to older technology nodes or cost-sensitive platforms that may not have accelerators. Enabling CNNs with the existing on-chip resources is essential. Graphi [28] presents a framework that accelerates deep learning models through layer-level parallelism within an NN application on many-core CPU systems. The framework leverages the inherent layer-level parallelism in an NN structure and schedules the independent layers to be executed concurrently. Graphi is beneficial for networks that present higher layer-level parallelism, such as LSTM and GoogLeNet. In comparison, Pipe-it looks at computational-kernel-level parallelism, which is applicable to general network structures, and targets CNN acceleration on the heterogeneous big.LITTLE architecture.
IX. CONCLUSION
On-chip inference using CNNs is now becoming commonplace on embedded platforms. In this work, we show that kernel-level splitting across heterogeneous core types does not help to improve inferencing latency, while layer-level splitting that minimizes cross-cluster coherency traffic can be adopted to improve inferencing throughput. Thus, we introduce Pipe-it, a layer-level splitting technique that efficiently uses the entire heterogeneous multi-core to improve CNN inference throughput. We study the design space involved and introduce a search algorithm to efficiently choose a high-performing design point within the search space. With the execution time of each individual layer predicted from an analytical execution model, Pipe-it improves the throughput on average by 39% using all the heterogeneous cores in comparison to the use of only homogeneous cores. This approach is not limited to CNN applications and can possibly be applied to other streaming applications that show similar behaviour. In future work, we will include more processors in the design space, including GPUs and NPUs, to further exploit the potential of embedded SoCs in enabling deep learning.
