High-Throughput CNN Inference on Embedded ARM big.LITTLE Multi-Core
  Processors by Wang, Siqi et al.
Siqi Wang, Gayathri Ananthanarayanan, Yifan Zeng, Neeraj Goel, Anuj Pathania, Tulika Mitra
Abstract—IoT Edge intelligence requires Convolutional Neural
Network (CNN) inference to take place on the edge devices
themselves. ARM big.LITTLE architecture is at the heart of prevalent
commercial edge devices. It comprises single-ISA heterogeneous
cores grouped into multiple homogeneous clusters that
enable power and performance trade-offs. All cores are expected
to be simultaneously employed in inference to attain maximal
throughput. However, high communication overhead involved in
parallelization of computations from convolution kernels across
clusters is detrimental to throughput. We present an alternative
framework called Pipe-it that employs pipelined design to split
convolutional layers across clusters while limiting parallelization
of their respective kernels to the assigned cluster. We develop a
performance-prediction model that utilizes only the convolutional
layer descriptors to predict the execution time of each layer
individually on all permitted core configurations (type and count).
Pipe-it then exploits the predictions to create a balanced pipeline
using an efficient design space exploration algorithm. Pipe-it on
average results in a 39% higher throughput than the highest
antecedent throughput.
Index Terms—Heterogeneous Multi-Core, Asymmetric Multi-
Core, Edge Inference, CNN Performance-Prediction
I. INTRODUCTION
CONVOLUTIONAL Neural Network (CNN) inference on edge devices has become essential for an enriched
user experience. Continuous vision tasks that use inference to
extract high-level semantic information from real-time video
streams are paramount in numerous edge application domains
such as Advanced Driver-Assistance Systems (ADAS), Virtual
Reality (VR), and Augmented Reality (AR) [15]. Inference-
driven applications project unprecedented computational re-
quirements onto underlying edge devices [34]. Fortunately,
there has been tremendous progress to port CNNs to edge
devices. Many network models such as MobileNet [9] have
been invented specifically for edge to perform high-accuracy
classifications with considerably smaller network size. Numer-
ous efficient libraries such as ARM Compute Library (ARM-
CL) [1] and Tencent NCNN [3] have been constructed precisely
to facilitate efficient CNN implementation for the edge. ARM-CL
is highly optimized for edge-specific ARM core architectures
with inbuilt support for multi-threading and acceleration
through ARM NEON vectorization technology.
Manuscript received March 14, 2019; accepted September 17, 2019. This
work was partially funded by Singapore Ministry of Education Academic
Research Fund Tier 2 MOE2015-T2-2-088. S. Wang, Y. Zeng, A. Pathania and
T. Mitra are with the Department of Computer Science, School of Computing,
National University of Singapore, SG. E-mail: {wangsq, yifan122, pathania,
tulika}@comp.nus.edu.sg. G. Ananthanarayanan is with the Department of
Computer Science and Engineering, Indian Institute of Technology Dharwad,
Karnataka, IN. E-mail: gayathri@iitdh.ac.in. N. Goel is with the Department
of Computer Science and Engineering, Indian Institute of Technology Ropar,
IN. E-mail: neeraj@iitrpr.ac.in. (Corresponding author: Tulika Mitra)
[Figure 1: block diagram. Big Cluster: four Cortex A73 cores, SCU, 2MB L2 Cache. Small Cluster: four Cortex A53 cores, SCU, 1MB L2 Cache. Both clusters attach to the CCI Bus.]
Fig. 1: An abstract block diagram of an eight-core ARM
big.LITTLE heterogeneous multi-core within Hi3670 SoC [4].
Single-ISA heterogeneous multi-cores comprise processing
cores that have different power-performance-area
characteristics but share the same Instruction Set Architecture
(ISA) [16]. Facebook [31] reported in 2019 that about half
of the mobile SoCs in the market adopt such an architecture
with two CPU clusters: a high-performance cluster and an
energy-efficient cluster. This heterogeneous configuration pro-
vides higher parallel processing potential than homogeneous
multi-cores within given power and area budget provided
all cores can be simultaneously employed productively [23],
[27]. Figure 1 shows an abstract block diagram for the eight-
core state-of-the-art ARM big.LITTLE heterogeneous multi-
core in Hi3670 System on Chip (SoC) designed for edge
devices. Hi3670 groups together four high-performance Big
Cortex A73 cores and four low-performance Small Cortex A53
cores into two clusters alongside L2 caches of size 2 MB
and 1 MB, respectively. Two clusters are kept fully cache-
coherent via bus-based Cache Coherent Interconnect (CCI)
using snooping broadcast protocol. Cores within a cluster are
kept coherent using bus-based Snoop Control Unit (SCU). This
raw computational power provided by heterogeneous multi-
core makes CNN inference on edge device feasible.
Dedicated accelerators such as GPUs and dedicated IP cores
have been proven to be more efficient than CPU for inference.
However, their applicability is constrained by the extreme
diversity of accelerators and lack of easy programming sup-
port. CPU remains the platform of choice for running ML
workloads being the most common denominator with high
availability in mobile and embedded platforms [21], [30],
[31], [36]. In addition, low-cost edge devices may not contain
dedicated accelerators, and the performance gap between CPU
arXiv:1903.05898v3 [cs.LG] 22 Jan 2020
[Figure 2: timeline diagram (horizontal axis: Time). Top panel, Kernel-level: Images 1 and 2 pass from INPUT to OUTPUT through Layers 1 to 4 one image at a time, each layer running on all four Big (B) and four Small (s) cores. Bottom panel, Layer-level: Images 1, 2, 3, ... overlap in time, with Layers 1-2 on three Big cores, Layer 3 on one Big core, and Layer 4 on four Small cores.]
Fig. 2: Visualization of the default Kernel-level and the proposed Layer-level splitting with a three-stage pipeline (B3-B1-s4)
on heterogeneous multi-core with four Big (B) cores and four Small (s) cores for a representational four-layer CNN.
and GPU is small, making CPU the favourable choice for ML
workloads. On the other hand, CNNs are more commonly used
as a building block to construct more complex systems. For
applications ranging from smart classroom [24] with person
and text recognition, to autonomous drones [26] with path
planning, object classification and obstacle avoidance, multiple
independent inference sub-tasks are performed concurrently.
Such applications require all the available resources to run
these inference engines in parallel. Therefore, improving
inference throughput on ARM big.LITTLE-like architectures is
by itself a critical problem.
Motivational Example: The layers in a CNN are in a pre-ordained
order by design, which is usually sequential. Their
associated convolutional kernels are therefore required to be
processed sequentially. Nevertheless, different images from
an image stream can potentially be processed in parallel.
Unfortunately, existing state-of-the-art deep learning libraries
such as ARM-CL are designed to process the image stream
sequentially, one image at a time. The computation of one
kernel at a time is then distributed across all cores with
the default parallelization strategy we christen Kernel-level
execution. Figure 2 (top) visualizes the Kernel-level strategy
for a representative four-layer CNN on an eight-core heterogeneous
multi-core. Section II provides further details on the Kernel-level
strategy. The Kernel-level strategy works for intra-cluster
processing but fails to scale to inter-cluster processing with
multiple clusters.
Heterogeneous Multi-Processing (HMP) allows execution
of kernels using both Big and Small Cores simultaneously.
Figure 3 shows the change in throughput (measured in images
per second) of several CNNs with the increase in the num-
ber of heterogeneous cores used with Kernel-level strategy.
Throughput increases as we add more Big cores but drops
sharply on the addition of Small cores from another cluster
for HMP. Inter-cluster communication overhead involved in
the use of HMP explains the drop. No HMP configuration
surpasses the performance of configuration with four Big
cores. Therefore, Figure 3 empirically shows that we cannot
[Figure 3: plot. x-axis: Core Configurations (1B, 2B, 3B, 4B, 4B+1s, 4B+2s, 4B+3s, 4B+4s); y-axis: Throughput [Images/Second] (ticks 5 to 20); series: AlexNet, GoogLeNet, MobileNet, ResNet50, SqueezeNet.]
Fig. 3: Throughput of different CNNs with a different number
of heterogeneous cores (B: Big core, s: Small core) using the
default Kernel-level strategy.
improve throughput on heterogeneous multi-cores with default
Kernel-Level strategy alone. This limitation originates from
the design of Kernel-level strategy and not from the quality of
its implementation.
There are multiple convolutional layers of different di-
mensions within a CNN that project different resource re-
quirements. Therefore, it is possible to create a processing
pipeline with stages composed of only homogeneous cores
that still splits CNN processing over different heterogeneous
clusters. Let notation {core type}{core count} denote the
core configuration of a pipeline stage. Figure 2 shows a three-
stage pipeline created to process incoming images in a stream
using the Layer-level strategy. Three Big cores (B3) form the
first pipeline stage, processing Layers 1 and 2. The remaining
Big core (B1) forms the second stage, processing Layer 3.
Four Small cores (s4) form the third pipeline stage, processing
Layer 4. This pipeline constructively uses all eight heterogeneous
cores in execution by processing multiple images in
parallel. Generally, initial layers operating on bigger inputs
require more computational power and memory compared to
deeper layers. Therefore, it is intuitive to map initial convolu-
tional layers to more powerful Big cluster and deeper layers
[Figure 4: plot. x-axis: AlexNet, GoogLeNet, MobileNetV1, ResNet50, SqueezeNet; y-axis: Throughput [Images/Sec] (ticks 0 to 20); series: ARM-CL [1], NCNN [3], TVM [6].]
*TVM results are generated with the NNVM-TVM framework with a pre-trained model
from the mxnet.gluon.model_zoo.vision model set [2], wherein GoogLeNet is not included.
Fig. 4: Throughput of different CNN models on Big cluster
when implemented in different deep learning frameworks.
to less powerful Small cluster. However, the design space of
mapping layers to core clusters increases exponentially with
the increase in the number of layers.
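The throughput argument behind Layer-level splitting can be illustrated with a small sketch. It assumes, as a simplification that the text does not state, that a pipeline in steady state emits one image per period of its slowest stage and that inter-stage transfer costs are negligible. The function names and the per-layer times are hypothetical and chosen only for illustration.

```python
def stage_time(layer_times):
    """Layers mapped to one stage run back to back, so their times add up."""
    return sum(layer_times)


def pipeline_throughput(stage_times):
    """Steady-state images/second: the slowest stage sets the pipeline period."""
    if not stage_times:
        raise ValueError("a pipeline needs at least one stage")
    return 1.0 / max(stage_times)


# Hypothetical per-layer times in seconds for the B3-B1-s4 example:
# Stage 1 (B3) runs Layers 1-2, Stage 2 (B1) runs Layer 3, Stage 3 (s4) runs Layer 4.
stages = [
    stage_time([0.030, 0.020]),  # B3
    stage_time([0.045]),         # B1
    stage_time([0.050]),         # s4
]
throughput = pipeline_throughput(stages)
latency = sum(stages)  # time for one image to traverse every stage

print(f"throughput = {throughput:.1f} img/s, latency = {latency * 1000:.0f} ms")
```

Under this simplified model, a balanced pipeline is one whose stage times are as close to each other as possible, which is why the choice of split points matters.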
Our Novel Contributions: We propose a framework called
Pipe-it that partitions CNN layers across heterogeneous cores
to improve throughput. Pipe-it creates a processing pipeline by
splitting layers among heterogeneous core clusters, wherein
a given set of homogeneous core(s) always process kernels
from a fixed set of layers. Different pipeline stages (and cores
within) are responsible for concurrently processing different
layers corresponding to consecutive images in a stream. The
pipelined execution improves throughput by employing all
on-chip memory and processing resources of heterogeneous
multi-core more effectively than the default approach of split-
ting individual kernels across all heterogeneous cores.
Pipe-it includes an analytical performance model that pre-
dicts the performance of convolutional layer on different core
configurations (type and count) from its network structure
description. Its Design Space Exploration (DSE) algorithm
then uses the predicted performance to locate the best fitting
pipeline configuration and respective layer allocation. On
average, we get 39% improvement in throughput from entire
heterogeneous multi-core compared to using only its high-
performance homogeneous Big cluster.
II. BACKGROUND
ARM Compute Library (ARM-CL) [1] is a state-of-the-art
framework for implementing CNNs on ARM architectures.
Figure 4 shows the throughput of CNN inference imple-
mented with ARM-CL (version 18.05), Tencent NCNN [3],
and TVM [6] frameworks running on the Big cluster using
multi-threading. Both ARM-CL and Tencent NCNN support acceleration
through ARM NEON vectorization and provide NEON
assembly implementations for the most computationally intensive
convolution kernels of CNNs. These two frameworks present
similar performance and outperform TVM implementation
without NEON acceleration. However, Tencent NCNN is not
as well maintained or supported as ARM-CL. Therefore, we
use ARM-CL as the foundational framework in this work.
ARM-CL is a collection of functions commonly used in
machine learning. The functions are infused with hardware-specific
optimizations for superior performance on ARM architectures.
The Graph API accompanying ARM-CL facilitates the
creation of complex networks. The network is written with
a dedicated API as a graph by the user at the frontend. The
execution is automatically handled at the backend. Graph
implements the layers as nodes that are connected to other
nodes in the CNN sequence as defined by the user. Table I
summarizes the architecture of several popular CNNs and their
respective implementations in ARM-CL. We count weighted
layers (convolutional or fully-connected) as major layers because
they are, in general, the most computationally expensive
part of CNNs.
TABLE I: Structure of different CNN models and the corresponding
major layer (node) counts in their default ARM-CL implementations.
CNN               Major Layers/Modules                                    ARM-CL Major / (Total) Node Count
AlexNet [15]      5 Conv + 3 FC                                           11* / (21)
GoogLeNet [29]    3 Conv + 9 Inception Modules (6 Conv Each) + 1 FC       58 / (132)
MobileNet [9]     14 Conv + 13 Conv DW + 1 FC                             28 / (58)
ResNet50 [8]      1 Conv + 4 Residual Blocks (52 Conv in Total) + 1 FC    54 / (146)
SqueezeNet [12]   2 Conv + 8 Fire Modules (3 Conv Each)                   26 / (58)
Conv: Convolutional Layers; FC: Fully-connected Layers; Conv DW:
Depthwise Convolutional Layers. *Three convolutional layers are
implemented as two nodes each for AlexNet.
Inside each node, the workload is represented as a series of
compute kernels. The runtime scheduler sequentially dispatches
the kernels in a node and engages the respective processing unit
during execution. ARM-CL implements a convolution node
with NEON acceleration using im2col (Image to Column) and
GEMM (GEneral Matrix Multiplication) kernels. In addition,
the parallel nature of the kernels allows their computations
to be distributed across multiple cores. This node-level par-
allelization is implemented in the form of a thread pool that
spawns several new threads and distributes the computation of
a kernel among them before the scheduler dispatches them for
execution.
We extended the default ARM-CL CNN implementations to
execute multiple graphs in parallel. The implementation allows
the same network to be applied to multiple images concur-
rently. All graphs share the same copy of read-only parameters
(weights and biases) and each graph contains its unique copy
of the image CNN needs to classify as we assume images in
a stream to be independent. We modify the scheduler to run
under a one-thread-per-core model with minimal migrations
using thread pinning for faster and predictable execution.
III. CO-EXECUTION AT DIFFERENT LEVELS
A. Kernel-Level Splitting
We can explore parallelism inherent in kernels by exploiting
ARM-CL thread pool implementation to engage all cores.
While the parallelization of a kernel across homogeneous
cores within a cluster gives performance benefits, further
parallelization across heterogeneous clusters does not improve
throughput as shown in Figure 3. Authors in [7] make a similar
[Figure 5: plot. x-axis: Ratio of Big Cluster Workload over Small Cluster Workload (ticks 0 to 12); y-axis: Normalized Throughput [Images/Sec] (ticks 0.8 to 1); series: AlexNet, GoogLeNet, MobileNet, ResNet50, SqueezeNet, Big Cluster Only.]
Fig. 5: Throughput of CNN models with disproportionate
kernel-level workload split between Big and Small cluster
normalized against throughput achieved using Big cluster only.
[Figure 6: plot. x-axis: AlexNet, GoogLeNet, MobileNet, ResNet50, SqueezeNet; y-axis: % of Total Layer Processing (ticks 20 to 100); series: Convolutional, Fully-Connected, Others.]
Fig. 6: The breakdown of CNN processing time between
different layer types.
observation for kernel-level splitting in the context of CPU-
GPU co-execution.
Using multiple cores within the same cluster for processing
increases parallel L2 accesses per unit time. Cluster’s SCU
successfully handles the increased accesses without being
overwhelmed and thereby improves performance. However,
when additional cores from another cluster are engaged, the
working set gets split between the L2 caches of two clusters.
Some conflict misses that occur on one cluster now get served
by L2 cache of another cluster using CCI increasing average
on-chip L2 access latency. Additional L2 cache decreases the
number of capacity misses going to main memory. However,
the decrease cannot compensate for the increased latency of
conflict misses.
Figure 3 shows the throughput obtained by splitting
the computational workload of a kernel equally among all
threads. However, distributing the workload disproportionately
does not improve throughput significantly either. Figure 5
shows through exhaustive search that, for most CNNs, no ratio
of workload split between Big and Small clusters results in
statistically significantly higher throughput than when kernels
run exclusively on the Big cluster. The exhaustive search indicates we
must give little or no share of computational work to the Small
cluster for optimal execution.
B. Layer-Level Splitting
Image classification CNNs are made up of multiple layers,
which process images sequentially. Figure 6 shows the share
of processing time spent on convolutional layers in different
[Figure 7: plot. x-axis: Convolutional Layer ID (ticks 0 to 50); y-axis: % of Total Convolutional Layer Processing (ticks 10, 20); series: AlexNet, GoogLeNet, MobileNet, ResNet50, SqueezeNet.]
Fig. 7: Distribution of total convolution processing time among
convolutional layers for different networks.
CNNs normalized to total forward pass processing time. Pro-
cessing of convolutional layers dominates overall time spent
for all networks except in relatively older AlexNet, wherein
fully-connected layers dominate.
The convolutional layer at the start of network operates upon
the original data of the biggest size (and dimensionality) and
produces output data of smaller size due to the application of
filters. This shrunken output gets passed on to the subsequent
convolutional layer as input, which reduces its convolution
processing time. Figure 7 shows that time taken to process
convolutional layers generally decreases as we move deeper
into a network.
Observations from Figure 7 can help us in creating a load-
balanced processing split on a heterogeneous multi-core. Its
high-performance cores can process more processing-intensive
initial layers, while low-performance cores can process less
processing-intensive deeper layers. Kernels from layers can
still get split among all homogeneous cores within a cluster
using Kernel-level splitting. Kernels from non-convolutional
layers are considered part of previous convolutional layers and
get processed at the same cluster. We do not explore layer-level
splitting of CNN at non-convolutional layers.
Layer-level splitting between clusters produces a lower
number of inter-cluster L2 conflict misses than Kernel-level
splitting as most layers that feed data into each other are
processed on the same cluster reducing the load on CCI.
Furthermore, it also allows for multiple images from a stream
to be processed in parallel. The Big cluster can start processing
layers from image Z+1, while Small cluster is still processing
layers from image Z. Layer-level splitting, unlike Kernel-level
splitting, also requires less movement of weights and biases between
clusters. It processes weights and biases shared between the
kernels of different images on the same cluster. This optimization
further reduces the number of conflict misses between
clusters and thereby improves L2 cache usage efficiency.
IV. DESIGN SPACE
A. Split Points at Convolutional Layers
Structure of different convolutional layers can differ signif-
icantly from each other within a network. Their performance
on Big and Small clusters with a different number of allocated
cores can also be quite different. These differences mandate
TABLE II: Description of parameters in chronological order.
Parameters                      Descriptions
W                               Workload, number of major layers (convolutional layers, with fully-connected layers for AlexNet) in a CNN.
X, X_1, X_2                     Split-point of workload, number of layers to be allocated to pipeline stages.
H, H_B, H_s                     Number of cores in a heterogeneous multi-core architecture. B: Big cores, s: Small cores.
p, p_B, p_s                     Number of pipeline stages; number of stages on Big and Small clusters.
C_p                             Number of different pipeline configurations for a pipeline with p stages.
D_W                             Number of design points for a CNN with W major layers on an H-core heterogeneous multi-core architecture.
I_w, I_h, I_d                   Input image tensor dimensions in width, height and depth.
F_w, F_h, F_d, O_fm             Filter dimensions in width, height, depth and number of output feature maps.
Pad, S                          Padding, stride information for convolution.
N, K, M                         Dimensions of matrices in convolution converted GEMM.
α, β                            Regression coefficients.
ts                              Tile size for GEMM optimization.
n_iter                          Number of iterations generated for image tensor with tile size ts.
iter_t                          Number of iterations allocated to a thread t in multi-threaded execution.
T_iter                          Execution time of a single iteration.
T_multi                         Execution time of multi-threaded execution.
P = {P_1, P_2, ..., P_p}        Representation of a pipeline configuration with p stages.
P_i = (type, count)             Representation of the configuration of the i-th stage in a pipeline P. E.g. (B, 3), also written as B3 for convenience.
L = {L_1, L_2, ..., L_p}        Corresponding layer allocation for pipeline P with p stages.
L_i = {l_j, ..., l_k}           A set of layers in original order allocated to stage P_i, also written as l_{j-k} for convenience.
T, T^{P_i}                      Time matrix for execution times of a single layer on different core configurations; time array of execution times of a set of layers with core configuration P_i.
T^{P_i}_{l_j}, T^{P_i}_{L_i}    Execution time of layer l_j with pipeline configuration P_i; execution time of a pipeline stage P_i with its corresponding layer allocation L_i.
L_wl                            A set of layers as defined in the context (workload).
[Figure 8: plot. x-axis: Ratio of Convolutional Layers Split on Cortex A73 cluster (ticks 0 to 1); y-axis: Normalized Throughput [Img/s] (ticks 0.2 to 1); series: AlexNet, GoogLeNet, MobileNet, ResNet50, SqueezeNet.]
Fig. 8: Throughput of a two-stage pipeline (B4-s4) with
workload split at different convolutional layers normalized
against the maximum throughput obtained.
non-trivial decisions on splitting convolutional layers across
pipeline stages of Pipe-it.
Consider a basic two-stage layer-level split pipeline (B4-
s4) processing a network containing W major layers. First
X layers are processed on Big cluster with Kernel-level split
among all four Big cores and rest (W−X) layers are processed
on Small cluster. The challenge is to find an optimal split point
X with maximum throughput. There are $\binom{W-1}{1} = W - 1$
possible split points in this pipeline. Figure 8 shows throughput
for different CNNs with split ratio (X/W) ranging from zero
to one. We also include fully-connected layers for AlexNet
as valid points to split. Optimal split ranges from 0.60 for
GoogLeNet to 0.90 for AlexNet.
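A brute-force search over the W − 1 split points of a two-stage pipeline can be sketched as follows. The sketch assumes that per-layer execution times on each stage are already known and that throughput is the inverse of the slower stage's total time; both the function name and the timing values are hypothetical, and this is not the search procedure used by Pipe-it itself.

```python
def best_two_stage_split(big_times, small_times):
    """Return (X, throughput) for a two-stage Big/Small pipeline.

    big_times[i] and small_times[i] are the execution times of layer i on the
    Big stage and on the Small stage. The first X layers run on the Big stage,
    the remaining W - X layers on the Small stage.
    """
    w = len(big_times)
    if w < 2 or len(small_times) != w:
        raise ValueError("need two equally long lists with at least two layers")
    best_x, best_period = None, float("inf")
    for x in range(1, w):  # the W - 1 candidate split points
        period = max(sum(big_times[:x]), sum(small_times[x:]))
        if period < best_period:
            best_x, best_period = x, period
    return best_x, 1.0 / best_period


# Hypothetical per-layer times (arbitrary time units) for a five-layer network.
big = [4.0, 3.0, 2.0, 1.0, 1.0]
small = [10.0, 8.0, 5.0, 3.0, 2.0]
x, throughput = best_two_stage_split(big, small)
print(f"best split X = {x}, split ratio X/W = {x / len(big):.2f}")
```

Exhaustive search is cheap here because there are only W − 1 candidates; the same approach becomes impractical once more stages and core configurations are allowed.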
Design space for a three-stage pipeline is much larger as
we need to locate two split points X1 and X2. Consider a
pipeline configuration (B4-s2-s2). Four Big cores, two Small
cores, and remaining two Small cores are used to construct
pipeline Stages 1, 2, and 3, respectively. Figure 9 shows the
execution of ResNet50 with different configurations. The y-axis
shows split point X1, which splits Stage 1 (B4) and
[Stage 2 + 3] (s2-s2). X1 also splits Big and Small clusters for
this pipeline configuration. The x-axis shows split point X2,
which splits Stages 2 (s2) and 3 (s2). The z-axis shows the
throughput for a workload split. Throughput peaks at 5.6 Img/s
with split points X1 and X2 at Layers 33 and 45, respectively.
The optimal three-stage pipeline for ResNet50 has 7% higher
throughput than the corresponding optimal two-stage pipeline.
[Figure 9: 3D plot. x-axis: Split between s2-s2 (ticks 20, 40); y-axis: Split between B-s (ticks 20, 40); z-axis: Throughput [Img/s] (ticks 2 to 6); scale: 1 to 5.]
Fig. 9: Throughput of ResNet50 with a three-stage pipeline
(B4-s2-s2) with workload split at different layers.
B. Stages of Pipelines
We can create pipelines with many more stages (up to
H on heterogeneous multi-core with H cores) in pursuit of
higher throughput for CNN inference. We eliminate pipeline
designs with heterogeneous core types within pipeline stage as
Kernel-level split between clusters is not helpful (Figure 3).
We only consider pipeline configurations with Big cores for
initial convolutional layers and Small cores for subsequent
convolutional layers as CNNs usually have more compute-
intensive convolutional kernels at the beginning (Figure 7).
Equation (1) gives the number of different pipelines possible,
$C_p$, with $p$ pipeline stages on a heterogeneous multi-core with
$H_B$ Big cores and $H_s$ Small cores. We use $p_B$ and $p_s$ to denote
the number of stages constructed with the Big and Small
clusters, respectively. For given $p_B$ and $p_s$,
$\binom{H_B-1}{p_B-1} \times \binom{H_s-1}{p_s-1}$ gives the total number
of different pipelines that we can construct. However, the values
of $p_B$ and $p_s$ must satisfy the following requirements to
construct a meaningful $p$-stage pipeline.
$$p_B \in [1, H_B], \quad p_s \in [1, H_s], \quad p_B + p_s = p$$
Thus, $p_B$ ranges from a minimum of $\max(1, p - H_s)$ to a
maximum of $\min(H_B, p - 1)$. We
then go through $p_B$ and calculate the total number of different
pipelines possible with $p$ stages using Equation (1).
$$C_p = \sum_{p_B=\max(1,\,p-H_s)}^{\min(H_B,\,p-1)} \binom{H_B-1}{p_B-1} \times \binom{H_s-1}{(p-p_B)-1} \tag{1}$$
Equation (2) gives the total number of design points for
a CNN with $W$ convolutional layers ($D_W$) in Layer-level splitting
on an $H$-core heterogeneous multi-core.
$$D_W = \sum_{p=2}^{H} \binom{W-1}{p-1} \times C_p \tag{2}$$
There are in total 64 possible pipelines (with p = 2 to
8) as calculated with Equation (1) for our prototype board
with eight-core heterogeneous multi-core. Furthermore, there
are in total 5,379,616 distinct possible design points for
MobileNet with its 28 convolutional layers as calculated using
Equation (2). Design space gets even larger for bigger CNNs
like GoogLeNet and ResNet50 with more layers. Therefore, it
is not possible to explore entire Layer-level splitting design
space using exhaustive search in a reasonable amount of time.
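Equations (1) and (2) are straightforward to evaluate programmatically. The following is a minimal sketch, with hypothetical function names, that transcribes both formulas and reproduces the pipeline count for a 4 Big + 4 Small platform. It is not claimed to reproduce the design-point total quoted above, since that value depends on exactly which layers are counted in W.

```python
from math import comb


def pipeline_count(p, h_big, h_small):
    """C_p from Equation (1): number of p-stage pipelines on the two clusters."""
    lowest = max(1, p - h_small)
    highest = min(h_big, p - 1)
    return sum(
        comb(h_big - 1, p_big - 1) * comb(h_small - 1, (p - p_big) - 1)
        for p_big in range(lowest, highest + 1)
    )


def design_points(w, h_big, h_small):
    """D_W from Equation (2): pipelines combined with layer split points."""
    total_cores = h_big + h_small
    return sum(
        comb(w - 1, p - 1) * pipeline_count(p, h_big, h_small)
        for p in range(2, total_cores + 1)
    )


# Pipeline counts for an eight-core platform with four Big and four Small cores.
counts = {p: pipeline_count(p, 4, 4) for p in range(2, 9)}
total_pipelines = sum(counts.values())
print(counts)
print("total pipelines:", total_pipelines)
```

The per-stage counts are symmetric around p = 5, and the design-point total grows rapidly with W because each additional layer adds candidate split points for every pipeline.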
C. The Pipe-it Framework
We present a two-part Pipe-it framework to quickly go
through huge design space and locate the best configuration to
execute a given CNN workload. Pipe-it first predicts the execution
time of all layers on all possible core configurations from
static network-layer configuration descriptors (Section V).
Pipe-it then goes through design space heuristically using
predicted timing information to obtain near-optimal pipeline
configuration and corresponding workload allocation (Sec-
tion VI).
  
[Figure 10: diagram. Input tensor (Iw × Ih, depth Id = Fd) and filters (Fw × Fh) produce the output tensor (Ow × Oh, depth Ofm = Od); equivalently, the input matrix [N × K] multiplied by the filter matrix [K × M].]
Fig. 10: Visualization of a convolutional layer with input
image tensor of size {Iw, Ih, Id} and filter of
size {Fw, Fh, Fd, Ofm} generating output tensor of size
{Ow, Oh, Od}. The execution is realized as a GEMM of the input
matrix [N × K] and the filter matrix [K × M], which generates
a result matrix of size [N × M].
V. LAYER-WISE PERFORMANCE ESTIMATION
The most time-consuming part of CNNs is the execution
of convolutional layers. Convolutional layers convolve input
tensors with filters to generate respective output tensors,
feeding into following layers as inputs. With the extensive
calculation requirements, hardware-dependent implementation
and optimization techniques are applied to accelerate the
execution of convolutions.
GEMM is commonly used to implement convolution execu-
tions. ARM-CL first converts input image tensor and filter into
a matrix (Im2col kernel). It then performs GEMM execution
and finally transforms the execution results back into output
image tensor format (Col2Im kernel). Authors in [22] show that the
execution time of convolution correlates linearly with the
dimensions of the matrices. We build on this approach, which correlates
statically available descriptors of each convolutional layer
with layer execution times. We evaluate and model individual
convolutional layers with special consideration on the effects
of multi-threading, whereas [22] only considers the overall
execution time of the network.
A. Convolution as GEMM
Figure 10 visualizes convolution using GEMM. Consider a
convolutional layer with input image tensor of size (width,
height, depth) $\{I_w, I_h, I_d\}$ and filter of size (width, height,
depth, number of output feature maps) $\{F_w, F_h, F_d, O_{fm}\}$,
with padding $Pad$ and stride $S$. The convolutional layer generates
an output tensor $\{O_w, O_h, O_d\}$ of size given by Equation (3).
Input tensor and filter are required to have matching depth
($I_d = F_d$) and are usually square ($I_w = I_h$, $O_w = O_h$).
$$
\begin{aligned}
O_w &= \lfloor (I_w - F_w + 2 \cdot Pad)/S \rfloor + 1 \\
O_h &= \lfloor (I_h - F_h + 2 \cdot Pad)/S \rfloor + 1 \\
O_d &= O_{fm}
\end{aligned}
\tag{3}
$$
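Equation (3) translates directly into integer arithmetic. The sketch below uses a hypothetical helper name; the two example layers are illustrative configurations rather than values taken from the paper's benchmarks.

```python
def conv_output_shape(i_w, i_h, f_w, f_h, o_fm, pad, stride):
    """Output tensor size (O_w, O_h, O_d) of a convolution, per Equation (3)."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    if i_w + 2 * pad < f_w or i_h + 2 * pad < f_h:
        raise ValueError("filter does not fit inside the padded input")
    o_w = (i_w - f_w + 2 * pad) // stride + 1  # floor division implements the floor
    o_h = (i_h - f_h + 2 * pad) // stride + 1
    return o_w, o_h, o_fm


# A 3x3 filter with padding 1 and stride 2 halves a 224x224 input.
print(conv_output_shape(224, 224, 3, 3, 32, pad=1, stride=2))   # (112, 112, 32)
# An 11x11 filter with no padding and stride 4 on a 227x227 input.
print(conv_output_shape(227, 227, 11, 11, 96, pad=0, stride=4))  # (55, 55, 96)
```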
ARM-CL implements convolution as GEMM of input and
filter matrices. Figure 10 shows how the input tensor is
divided into small patches of the size of one filter ($\{F_w, F_h, F_d\}$),
denoted as the red shaded region. The patches are re-arranged
as rows in the image matrix. Similarly, the filters are re-arranged
into columns in the filter matrix. Thus the convolution
is transformed into a GEMM of an image matrix ($[N \times K]$)
and a filter matrix ($[K \times M]$), which generates a result matrix
of size $[N \times M]$ that is later reshaped into an output tensor.
Equation (4) gives the dimensions of the matrices. The total number
of arithmetic operations is $(N \times K \times M)$.
$$
\begin{aligned}
N &= O_w \times O_h \\
K &= F_w \times F_h \times F_d \\
M &= O_{fm}
\end{aligned}
\tag{4}
$$
Compute time of GEMM is a complex function of memory
accesses, arithmetic computations, and inherent exploitable
parallelism in the given convolutional kernel.
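The lowering of a convolution to GEMM can be sketched in plain Python. This is an illustrative re-implementation under simplifying assumptions (no padding, nested lists instead of NEON-optimized buffers) and not ARM-CL's actual code; all function names are hypothetical.

```python
def gemm_dims(o_w, o_h, f_w, f_h, f_d, o_fm):
    """Matrix dimensions (N, K, M) of Equation (4)."""
    return o_w * o_h, f_w * f_h * f_d, o_fm


def im2col(image, f_w, f_h, stride=1):
    """Flatten every filter-sized patch of image[y][x][channel] into one row.

    Assumes no padding. Returns N rows of K = f_w * f_h * depth values each.
    """
    i_h, i_w = len(image), len(image[0])
    rows = []
    for y in range(0, i_h - f_h + 1, stride):
        for x in range(0, i_w - f_w + 1, stride):
            rows.append([value
                         for dy in range(f_h)
                         for dx in range(f_w)
                         for value in image[y + dy][x + dx]])
    return rows


def matmul(a, b):
    """Naive GEMM of an [N x K] and a [K x M] matrix given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Toy layer: 4x4 input with depth 2, five 3x3 filters, stride 1, no padding.
image = [[[y * 4 + x, 1] for x in range(4)] for y in range(4)]
input_matrix = im2col(image, f_w=3, f_h=3)          # [N x K]
n, k, m = gemm_dims(o_w=2, o_h=2, f_w=3, f_h=3, f_d=2, o_fm=5)
filter_matrix = [[1] * m for _ in range(k)]         # [K x M], all-ones filters
result = matmul(input_matrix, filter_matrix)        # [N x M]
print("N, K, M =", (n, k, m), "multiply-accumulates =", n * k * m)
```

With all-ones filters each output entry is simply the sum of its input patch, which makes the result easy to check by hand.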
B. Single Core Estimation
We create a set of micro-benchmarks with ARM-CL to
capture the execution behaviour of layers commonly used in
networks. The micro-benchmarks contain representative layers
and a convolutional layer with desired configurations (input
sizes and filter sizes). We randomly generate input images
and filter parameters for measurement purposes. The GEMM
execution time is measured for different configuration points
using the following values of the parameters:
Iw = Ih = {7, 14, 28, 56, 112}
Fw = Fh = {1, 3, 5, 7, 11}
Id = Fd = {32, 64, 92, 128, 192, 256}
Ofm = {32, 64, 92, 128, 192, 256}
We observe a linear correlation between the dimensions of
matrices ($N, K, M$) and the execution time of GEMM. Authors
in [22] made similar observations. Equation (5) models
the execution time of a convolutional layer $T$ by using linear
regression on ($N, K, M$) for a single-core configuration, where
$\beta_1, \beta_2, \ldots, \beta_8$ are constants determined by
linear regression. We can physically interpret the interaction terms
in Equation (5) as the sizes of the matrices involved in GEMM
($NK, KM, NM$) and the total arithmetic operations ($NMK$).
$$T = \beta_1 N + \beta_2 K + \beta_3 M + \beta_4 NK + \beta_5 KM + \beta_6 NM + \beta_7 NMK + \beta_8 \tag{5}$$
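Once the coefficients are known, evaluating Equation (5) is a dot product between the coefficient vector and eight regressors. The sketch below shows only this evaluation step, not the regression fit. The function names and the coefficient values are hypothetical placeholders; real coefficients would come from fitting micro-benchmark measurements for one specific core type.

```python
def gemm_features(n, k, m):
    """Regressors of Equation (5), ordered to match beta_1 .. beta_8."""
    return [n, k, m, n * k, k * m, n * m, n * m * k, 1]


def predict_single_core_time(n, k, m, beta):
    """Evaluate Equation (5) for one layer given eight fitted coefficients."""
    if len(beta) != 8:
        raise ValueError("Equation (5) needs exactly eight coefficients")
    return sum(b * x for b, x in zip(beta, gemm_features(n, k, m)))


# Hypothetical coefficients in seconds, for illustration only.
beta_example = [0.0, 0.0, 0.0, 1e-9, 1e-9, 1e-9, 2e-10, 5e-5]

# Example layer: 56x56 output, 3x3x64 filters, 128 output feature maps.
n, k, m = 56 * 56, 3 * 3 * 64, 128
t_single = predict_single_core_time(n, k, m, beta_example)
print(f"predicted single-core time: {t_single * 1000:.2f} ms")
```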
C. Multi-core Estimation
ARM-CL implements GEMM optimization by multi-
threading and tiling with tile size (ts) determined according
to the cache sizes to achieve optimal memory behaviour. It
uses H threads for execution on an H-core multi-core. As
shown in Figure 11, the total workload is divided along the
rows of the image matrix into chunks of “iterations”. The
total count of iterations is niter = N/ts. These iterations are
then dispatched either statically or dynamically to available
threads. A thread t is assigned with itert number of iterations
to execute sequentially. Workload assigned to all H threads
add up to the total number of iterations ($\sum_{t=1}^{H} iter_t = n_{iter}$).
For single-threaded execution, all iterations (niter) are as-
signed and processed sequentially on one thread, with execu-
tion time T obtained from Equation (5). We model the time of
each iteration (Titer) from the single-threaded execution time
with Equation (6), assuming identical processing time for all
iterations. For multi-thread execution, the execution time of the
slowest thread determines the total time when we distribute the
workload among H threads, as shown in Equation (7) which
models Tmulti. Constant coefficients (α1, α2, α3) are obtained
using linear regression.
Fig. 11: Visualization of iteration allocation for convolutional
layer among H threads. [Figure labels: input matrix (N × K) split
along its rows into tiles of ts rows, one tile per iteration; filter
matrix (K × M); iterations iter1, iter2, ..., iterH.]
$$T_{iter} = (T - \alpha_1)/n_{iter} + \alpha_2 \qquad (6)$$
$$T_{multi} = \max_{t \in [1,H]} (T_{iter} \cdot iter_t) + \alpha_3 \qquad (7)$$
We can expect an equal split (itert = niter/H = N/(ts ∗
H)) on the distribution of workload among homogeneous
cores. Equation (8) combines the previous two equations and
models the multi-threaded execution time Tmulti based on
matrix size N , tile size ts, and the number of cores H .
Tmulti = Titer ∗ itert + α3 = Titer ∗ niter/H + α3
= (T − α1)/H + α2 ∗N/(ts ∗H) + α3
(8)
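The relation between Equations (6), (7), and (8) can be checked with a short sketch. The function and parameter names below are chosen for this example only, and the coefficient values used with it are placeholders, not fitted values.

```python
def iteration_time(t_single, n, ts, alpha1, alpha2):
    """Equation (6): time of one iteration, with n_iter = N / ts."""
    n_iter = n / ts
    return (t_single - alpha1) / n_iter + alpha2


def multi_core_time(t_single, n, ts, iters_per_thread, alpha1, alpha2, alpha3):
    """Equation (7): the slowest thread determines the total time."""
    t_iter = iteration_time(t_single, n, ts, alpha1, alpha2)
    return max(t_iter * it for it in iters_per_thread) + alpha3


def multi_core_time_equal_split(t_single, n, ts, h, alpha1, alpha2, alpha3):
    """Equation (8): closed form for an equal split over h homogeneous cores."""
    return (t_single - alpha1) / h + alpha2 * n / (ts * h) + alpha3
```

With an equal split, passing `[n / (ts * h)] * h` as `iters_per_thread` to `multi_core_time` gives the same value as `multi_core_time_equal_split`, which is the substitution that leads from Equation (7) to Equation (8).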
Table III shows the prediction error for all the possible
homogeneous core allocations. The proposed model predicts
execution time for individual convolutional layers across all
core configurations for all five benchmark CNNs accurately.
We observed 13.2% and 11.4% prediction errors overall on
average for Big and Small cores, respectively. Our proposed
performance-prediction model is significantly more advanced
than the model presented in [22] focusing on performance
prediction for the entire neural network. Model in [22] does
not take into consideration the different number of cores
involved in CNN execution. Their model is built and tested
by profiling on all the available cores. For heavy layers, the
workload is more likely to occupy all the cores and thus
can be predicted with higher precision by [22]. However, for
light layers, such a method cannot predict the reduction in
utilization and thus results in higher errors. The authors in [22]
only evaluate the model on the entire network and do not
include layer-wise evaluations. They report 13.4% prediction
error for overall CNN inference time with only two CNNs.
We re-implement the model in [22] on four Big cores and
observe on average 15% estimation error for entire networks
across five CNNs. However, we observe an average 54% error
when using the same model to predict the execution time of
a single layer. A huge error in layer-wise predictions makes
the model in [22] unusable for Pipe-it that requires accurate
per-layer performance estimation for workload allocation.
TABLE III: GEMM execution time prediction error (%) averaged
across all convolutional layers in CNN for different possible
homogeneous core allocations.
CNN          1B     2B     3B     4B     1s     2s     3s     4s
AlexNet      11.3   11.9   12.3   13.1    9.6   10.5   10.5   11.1
GoogLeNet    13.8   15.0   15.1   15.0    8.8    9.5    9.6    8.9
MobileNet    21.5   19.5   17.2   17.7   18.6   17.1   17.2   18.5
ResNet50      8.2    7.5    8.0    8.4   11.5   10.9   11.1   12.1
SqueezeNet   18.1   17.9   18.0   17.7   13.9   13.0   11.8   12.7
Average      13.2 (Big cores)           11.4 (Small cores)
D. Fully-connected Layers
We also consider fully-connected layers apart from con-
volutional layers as major layers in Pipe-it. Figure 6 shows
older networks like AlexNet spend a significant portion of their
execution time executing fully-connected layers. However,
fully-connected layers involve a huge number of parameters
and hence result in excessive memory transfers during exe-
cution [20]. Newer CNNs usually adopt structures with no
fully-connected layers (SqueezeNet) or only one as classifier
at the end (GoogLeNet, MobileNet, ResNet50).
Fully-connected layers are matrix multiplications. AlexNet
has three fully-connected layers with 4096, 4096, and 1000
neurons respectively. Other networks employ fully-connected
layers as a classifier with 1000 neurons. We generate a set of
micro-benchmarks with various input tensor sizes and number
of neurons (4096 and 1000). Simple linearity is observed
between input tensor sizes and execution time for a given
number of neurons. Therefore, the regression-based model can
also be used to predict the execution time of fully-connected
layers. We observed 11.8% and 14.4% prediction errors overall
on average for the fully-connected layers in micro-benchmarks
and actual CNNs, respectively.
VI. DESIGN SPACE EXPLORATION
We can design many different pipelines with a different
number of stages and each stage with different processing core
combinations for a heterogeneous multi-core. In addition, for
fixed pipeline design, the number of design points in allocating
the workload to different pipeline stages grows exponentially
with the total number of convolutional layers. Therefore, we
propose a robust heuristic approach that quickly navigates
through the design space to obtain a high-performing layer-
level split design point for any CNN. The heuristic uses an iter-
ative two-step approach. The first step is to determine a work-
load split for a given pipeline configuration (Section VI-B).
The second step is to merge adjacent stages to search for better
pipeline configuration (Section VI-C). The two steps are iteratively
engaged to approach a high-throughput pipeline configuration
and corresponding workload distribution.
A. Definitions
Consider CNN with W convolutional layers to be deployed
on (HB +Hs) heterogeneous multi-core with HB Big cores
and Hs Small cores. The goal of Pipe-it is to find throughput
maximizing pipeline configuration P and corresponding layer
distribution L.
We use P = {P1, ...Pp} to define core configuration of
each pipeline stage for a pipeline P with p stages. We define
the pipeline stage as tuple Pi = (core type, core count)
depicting type and count of cores that are used to construct
it. The core type can only be either B or s since only
homogeneous cores are used to construct the pipeline stage.
There are HB and Hs core combinations for Big and Small
cores, respectively. Therefore, (HB + Hs) different pipeline
stage configurations are possible.
L = {L1, ..., Lp} defines corresponding layer allocation
associated with the pipeline, where Li is a set of layers
allocated to pipeline stage Pi. Li = {l1, ..., lW } if Pipe-it
allocates all the W layers to Pi. Li = ∅ if it allocates none
of the layers to Pi.
Section V describes performance-prediction models used to
predict the execution time of a layer. We use time matrix T to
represent predicted execution times. $T^{P_i}_{l_j}$ represents execution
time for layer lj on a core configuration Pi. Similarly, the
following equation represents the execution time of the pipeline
stage Pi with layer allocation Li.
$$T^{P_i}_{L_i} = \sum_{l_j \in L_i} T^{P_i}_{l_j} \qquad (9)$$
B. Work-Flow Split Determination
We work with an assumption based on Figure 7 that initial
CNN layers are more compute-intensive than deeper layers and
thereby require more processing power. Thus, we order the
pipeline stages to have more compute-capable core combinations
at the beginning, with decreasing compute capability
for stages deeper into the pipeline. Such an arrangement also
ensures a monotonic increase in layer processing time as we
move down pipeline stages. The compute capability of a core
combination is evaluated by the average execution time of
layers on it. Equation (10) gives the observed ordering of
execution times for a layer l on homogeneous core combinations
of our heterogeneous eight-core platform.
$$T^{(B,4)}_l < T^{(B,3)}_l < T^{(B,2)}_l \lessapprox T^{(s,4)}_l < T^{(s,3)}_l < T^{(s,2)}_l \lessapprox T^{(B,1)}_l < T^{(s,1)}_l \qquad (10)$$
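One way to obtain such an ordering from the predicted time matrix is to rank the core configurations by their mean layer time. The sketch below assumes, for this example only, that the time matrix is stored as a dictionary keyed by (core type, core count) tuples.

```python
def order_by_capability(time_matrix):
    """Sort core configurations from most to least capable.

    time_matrix maps a (core_type, core_count) tuple to a list of per-layer
    execution times; a lower mean time means a more capable configuration.
    """
    def mean_time(config):
        times = time_matrix[config]
        return sum(times) / len(times)

    return sorted(time_matrix, key=mean_time)
```

The resulting list can serve as the stage order used when arranging pipeline stages from most to least compute capable.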
Equation (11) gives the throughput of pipeline P with p
stages and layer allocation L. The pipeline stage that produces
the longest latency determines the throughput of the pipeline.
Therefore, the goal is to balance the workload among all stages
to achieve minimal latency (maximum throughput).
$$\text{Throughput} = 1 / \max_{i \in [1,p]} \left(T^{P_i}_{L_i}\right) \qquad (11)$$
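Equations (9) and (11) translate directly into a few lines of code. In this sketch the data layout is an assumption: each stage carries a list of per-layer times for its core configuration, and an allocation is a list of layer indices per stage.

```python
def stage_time(layer_times, layers):
    """Equation (9): sum of predicted times of the layers assigned to a stage."""
    return sum(layer_times[j] for j in layers)


def pipeline_throughput(stage_layer_times, allocation):
    """Equation (11): inverse of the latency of the slowest pipeline stage.

    stage_layer_times[i][j] is the time of layer j on stage i's cores;
    allocation[i] is the list of layer indices assigned to stage i.
    """
    slowest = max(
        stage_time(times, layers)
        for times, layers in zip(stage_layer_times, allocation)
    )
    return 1.0 / slowest
```

For example, two stages with latencies of 6 and 8 time units yield a throughput of 1/8 images per time unit, since the slower stage limits the pipeline.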
Algorithm 1 describes the division and allocation of a set
of layers Lwl = {la, ..., lb} (in the original order) among
two adjacent pipeline stages Pi and Pi+1. The ordering of
pipeline stages ensures that any layer lj is executed faster on
Pi than on Pi+1 ($T^{P_i}_{l_j} < T^{P_{i+1}}_{l_j}$). Such arrangement results in
an expansion in execution time as we move deeper into the
pipeline and thereby ensures one-way flow of workload.
Algorithm 1: find_split: Algorithm to split the workload
between adjacent pipeline stages.
Input: Lwl = {la, ..., lb}, T^{Pi}, T^{Pi+1}
Output: Li, Li+1
Initialisation: Li = Lwl = {la, ..., lb}; Li+1 = ∅
for lj ∈ Lwl do   // starting with the last layer lb
    T^{Pi}_{new} = T^{Pi}_{Li} − T^{Pi}_{lj}
    T^{Pi+1}_{new} = T^{Pi+1}_{Li+1} + T^{Pi+1}_{lj}
    if (T^{Pi}_{new} > T^{Pi+1}_{new}) then
        Li = Li \ {lj}; Li+1 = Li+1 ∪ {lj}   // move of lj is helpful
    else
        break   // further flow of workload will not be helpful
    end if
end for
return Li, Li+1
Algorithm 2: work_flow: Algorithm for workload allocation
for a multi-stage pipeline.
Input: P = {P1, ..., Pp}, Lwl = {l1, ..., lW}, T = {T^{P1}, ..., T^{Pp}}
Output: L
Initialisation: L = {L1, ..., Lp}; for (Li ∈ L) do Li = ∅; end for
                L1 = Lwl; Lold = ∅
LOOP: Exit when allocation stabilized
while L ≠ Lold do
    Lold = L
    for Pi, Pi+1 ∈ P do
        Ltemp = Li ∪ Li+1
        Li, Li+1 = find_split(Ltemp, T^{Pi}, T^{Pi+1})
    end for
end while
return L
The workload initially is entirely allocated to the faster stage
Pi (Li = {la, ..., lb}, Li+1 = ∅), making it the bottleneck. We
try to move layers to Pi+1 to balance workload in each stage,
starting with the last layer allocated to Pi (layer lb). Moving
layer lj to Pi+1 is helpful if
($T^{P_i}_{L_i} - T^{P_i}_{l_j} > T^{P_{i+1}}_{L_{i+1}} + T^{P_{i+1}}_{l_j}$). We
keep moving the layers until lk, when Pi+1 becomes the bottleneck
instead. Moving more layers to stage Pi+1 will make it even
slower. Thus, the best split between two adjacent pipeline stages
will be Li = {la, ..., lk} and Li+1 = {lk+1, ..., lb}.
Pipe-it then goes to the next adjacent pipeline stages
(Pi+1 and Pi+2) to continue balancing stage latency. Pipe-
it uses Algorithm 1 to go through all stages in pipeline to
balance workload with its immediate next stage. We symbolize
workload as water that flows from the first pipeline stage to
deeper stages. There will be more space available in an initial
stage once a part of workload flows from it to deeper stages.
Therefore, Pipe-it engages Algorithm 1 iteratively to reach
the final splitting configuration, wherein there is no further
workload redistribution possible.
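The two algorithms can be sketched in Python as follows. This is an illustrative reading of the pseudocode rather than the authors' implementation: layers are represented by integer indices, each stage is given as a list of per-layer predicted times, and layers are taken from the end of the earlier stage, as described in the text above.

```python
def find_split(layers, t_i, t_next):
    """Algorithm 1: split a contiguous run of layers between two adjacent stages.

    layers is a list of layer indices in network order; t_i[j] and t_next[j]
    are the predicted times of layer j on stage i and stage i+1.
    """
    left = list(layers)   # L_i starts with the whole workload
    right = []            # L_{i+1} starts empty
    time_left = sum(t_i[j] for j in left)
    time_right = 0.0
    while left:
        j = left[-1]      # candidate: last layer still on stage i
        new_left = time_left - t_i[j]
        new_right = time_right + t_next[j]
        if new_left > new_right:
            # Moving layer j is helpful: stage i remains the slower stage.
            left.pop()
            right.insert(0, j)
            time_left, time_right = new_left, new_right
        else:
            break         # further flow of workload will not be helpful
    return left, right


def work_flow(num_layers, stage_times):
    """Algorithm 2: repeat pairwise splitting until the allocation stabilizes."""
    p = len(stage_times)
    alloc = [list(range(num_layers))] + [[] for _ in range(p - 1)]
    old = None
    while alloc != old:
        old = [list(stage) for stage in alloc]
        for i in range(p - 1):
            merged = alloc[i] + alloc[i + 1]
            alloc[i], alloc[i + 1] = find_split(
                merged, stage_times[i], stage_times[i + 1]
            )
    return alloc
```

For instance, with four layers that each take 1 time unit on the first stage and 2 on the second, `work_flow(4, [[1] * 4, [2] * 4])` keeps the first three layers on the first stage and moves only the last layer, because moving a second layer would make the second stage the bottleneck.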
C. Pipeline Stage Merging
Running GEMM using multi-threading is always beneficial.
However, Figure 12 shows saturating Thread Level Paral-
lelism (TLP) can lead to concavity in multi-threaded speedup
gains with increasing core allocation. Furthermore, different
types of layers derive different levels of benefits from multi-
threading. Therefore, it is important to match the size of the
pipeline stage with speedup characteristics of layers allocated
to it. Algorithm 3 describes the process of merging pipeline
stages to create bigger pipeline stages. We consider Big cluster
first before moving on to Small cluster.
Fig. 12: The concavity in speedup for the five convolutional
layers in AlexNet with different core configurations. (a) Big
Core Configurations (1B to 4B); (b) Small Core Configurations
(1s to 4s). [Plots of speedup (1 to 4) for CONV-1 to CONV-5.]
We start with (HB + Hs)-stage pipeline for (HB + Hs)-
core heterogeneous multi-core, where each stage comprises
of only one core. Pipe-it engages Algorithm 2 to search for
the best split of workload for this pipeline configuration. The
pipeline is likely to be bottlenecked by layers that require
more compute capability given sub-optimality of single-core
performance. Thus, we merge pipeline stages to create a more
compute capable stage to alleviate the bottleneck.
Consider the merger of stages Pi = (core type, counti)
and Pi+1 = (core type, counti+1) to stage Pi′ =
(core type, counti + counti+1) with originally allocated set
of layers Li and Li+1, respectively. Note that Pi and Pi+1
must be of the same core type to merge.
The merging is only helpful when Equation (12) holds,
which implies new stage should be better in performance than
at least one of two stages combined. Otherwise, we can stop
as the concavity in speedup (Figure 12) dictates no further
merging of the involved stages to create an even bigger stage
will be helpful either.
$$T^{P_{i'}}_{L_{i'}} = T^{P_{i'}}_{L_i} + T^{P_{i'}}_{L_{i+1}} < \max\left(T^{P_i}_{L_i},\, T^{P_{i+1}}_{L_{i+1}}\right) \qquad (12)$$
Successful merge updates the pipeline configuration and
reengages Algorithm 2 to find a new higher-performing layer
split. Merging decision depends largely on layers allocated to
the stage as different layers respond differently to different
stage configurations. Therefore, the reallocation of workload
is necessary for presenting the right layer information to
the merging algorithm. Algorithm 3 runs iteratively until no
further merging of stages is helpful.
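The merge test of Equation (12) can be sketched as a small predicate. The representation below is an assumption for this example: stages are (core type, core count) tuples, and the time matrix is a dictionary from such tuples to lists of per-layer predicted times.

```python
def merge_is_helpful(time_matrix, stage_i, stage_j, layers_i, layers_j):
    """Equation (12): does merging two adjacent stages beat the slower of them?

    time_matrix[stage][l] is the predicted time of layer l on that core
    configuration; layers_i and layers_j are the current allocations.
    """
    if stage_i[0] != stage_j[0]:
        return False  # only stages of the same core type may merge

    merged = (stage_i[0], stage_i[1] + stage_j[1])

    def total(stage, layers):
        return sum(time_matrix[stage][l] for l in layers)

    merged_time = total(merged, layers_i) + total(merged, layers_j)
    return merged_time < max(total(stage_i, layers_i), total(stage_j, layers_j))
```

A merging loop would call this predicate on adjacent stages of one cluster, update the pipeline configuration on success, rerun the workload allocation, and stop at the first unhelpful merge.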
Algorithm 3: merge_stage: Algorithm for determining
stage configuration and corresponding workload allocation.
Input: Lwl = {l1, ..., lW}, HB, Hs, T
Output: P, L
Initialisation: p = HB + Hs; P = {P1, ..., Pp}; L = {L1, ..., Lp}
L = work_flow(P, Lwl, T)
LOOP: Big cluster
for (Pi, Pi+1) in P do
    if (Equation (12)) then
        merge, update P; L = work_flow(...)
    else
        break   // stop further merging
    end if
end for
LOOP: Small cluster
for (Pi, Pi+1) in P do
    if (Equation (12)) then
        merge, update P; L = work_flow(...)
    else
        break   // stop further merging
    end if
end for
return P, L
D. An Example
We illustrate with an example of how antecedent algorithms
work to locate optimal pipeline configuration and workload
allocation. The example considers deployment of ResNet50
with 54 major layers (Table I) on an eight-core heterogeneous
multi-core with four Big and four Small cores. We can create
eight different pipeline stages with different core combinations
for this architecture. Therefore, eight different sets of layer
execution time are predicted to generate time matrix T of
size (54,8). We plug the following corresponding inputs to
Algorithm 3.
Lwl = {l1, l2, ..., l54};HB = 4;Hs = 4;
Algorithm 3 initializes an eight-stage pipeline, wherein
each stage consists of only a single core. It then engages
Algorithm 2 to find split for the eight-stage pipeline.
Algorithm 2 allocates all layers to the first pipeline stage
P1 at the beginning. It then engages Algorithm 1 to balance
the workload between the first two stages (P1 and P2).
Layers starting with the last layer allocated to P1 (Layer
l54) are moved to stage P2 for processing until two stages
are balanced. Algorithm 1 returns L1 = {l1, ..., l25}, L2 =
{l26, ..., l54}. We use l1−25 as a short-hand notation for
{l1, ..., l25}. Thus, Pipe-it updates the workload allocation to
L = {l1−25, l26−54, ∅, ∅, ∅, ∅, ∅, ∅}.
Algorithm 2 then continues to balance workload be-
tween P2 and P3. Algorithm 2 repeats the process with
the remaining pipeline stages. The first iteration returns
L = {l1−25, l26−38, l39−46, l47−50, l51, l52−54, ∅, ∅}. The al-
gorithm returns to rebalance workload of P1 and P2
again once it has rebalanced P2 with P3 and other
stages. The iterative rebalancing, in the end, returns
L = {l1−18, l19−32, l33−41, l42−48, l49−51, l52−54, ∅, ∅}. The
last two pipeline stages are not allocated any workload because
of poor computation capabilities. Therefore, the merging of
stages is necessary to achieve higher performance.
Algorithm 3 evaluates a merger of the first two stages P1
and P2 to create a stage comprising two Big cores ((B, 2)).
Workload allocation is recalculated with Algorithm 2 if
Equation (12) holds. Otherwise, the merger of stages is not helpful,
and the algorithm will not try further mergers. The merger in our
case is helpful, and the algorithm updates the pipeline
configuration to P = {(B, 2), (B, 1), (B, 1), (s, 1), (s, 1), (s, 1), (s, 1)},
with L = {l1−29, l30−38, l39−48, l49−51, l52, l53−54, ∅}. The
algorithm then goes on merging P1 and P2 to create (B, 3) and
beyond. It recalculates allocation every time the pipeline stage
is updated. The merge goes on for the Small cluster afterward
following similar rules. Pipe-it finally decides upon a three-
stage pipeline with configuration P = {(B, 4), (s, 2), (s, 2)}
and workload allocation L = {l1−35, l36−44, l45−54}.
Fig. 13: Picture of Hikey 970 mobile development board.
VII. EXPERIMENTAL EVALUATION
We conduct experimental evaluations on Hikey 970 mobile
development platform [4] for five CNN models as specified in
Table I. Figure 13 shows a photo of the board in use. The board
features ARM big.LITTLE octa-core CPU with four-core A73
and four-core A53 cluster running at the maximum frequency
of 2.4 GHz and 1.8 GHz, respectively. It is connected to a
normal desktop monitor through HDMI cable for display. It
comes equipped with an inbuilt WiFi module through which
it can connect with a host machine over Secure SHell (SSH).
Standard DC 5 V USB fan is used in experiments to eliminate
unstable thermal effects.
We classify a continuous stream of 50 images and report
average throughput (images processed per second) for each
data point. The board is left idle for cooling down after each
run resulting in approximately a 10 sec run-time for each point.
An exhaustive search for an average-size CNN with five million
design points would take hundreds of days to run. Therefore, the
run-time eliminates the possibility of obtaining the optimal
configuration using an exhaustive search.
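The estimate can be checked with simple arithmetic, assuming every design point costs the roughly 10 seconds of run-time quoted above and that points are evaluated one after another on a single board.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def exhaustive_search_days(design_points, seconds_per_point):
    """Total wall-clock time, in days, to measure every design point in turn."""
    return design_points * seconds_per_point / SECONDS_PER_DAY


days = exhaustive_search_days(5_000_000, 10)
```

Under these assumptions the search needs about 579 days of continuous measurement.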
Recall that Kernel-level split on all eight heterogeneous
cores performs worse than four homogeneous Big cores.
Therefore, our baseline configuration is Kernel-level split on
four homogeneous Big cores. This baseline provides the best
possible throughput with default ARM-CL (Table IV).
A. Resultant Configurations
Table V shows the outcome of our DSE in the form
of pipeline stages P and layer allocation L. We simplify
notation for easier representation. For example, the pipeline
configuration B4-s2-s2 for ResNet50 implies three pipeline
stages consisting of four Big cores, two Small cores, and two
Small cores. Pipe-it allocates Layers 1–35, 36–44, and 45–
54 to the first, second, and third stage, respectively. Table IV
shows the throughput of the respective pipelines.
TABLE IV: CNN throughput comparison of homogeneous vs.
Pipe-it heterogeneous execution with pipelines derived from
actual measured and predicted layer execution time.
             Homogeneous            Pipe-it – Heterogeneous
             Throughput (Imgs/s)    Throughput (Imgs/s)            Percentage
CNN          Big       Small        with measured  with predicted  Benefit
             Cluster   Cluster      layer time     layer time      (%)
AlexNet       8.1       1.5           8.9            8.9             9.8
GoogLeNet     7.8       3.3          11.8           11.3            45.5
MobileNet    17.4       6.6          24.0           23.5            35.5
ResNet50      3.1       1.5           5.5            5.2            67.5
SqueezeNet   15.6       6.9          21.4           21.4            37.5
Average                                                             39.2
TABLE V: Best throughput pipeline configuration with Pipe-it
and respective layer allocations from layer performance-
prediction model.
CNN          Pipeline Config.     Layer Allocation
AlexNet      B4 - s4              [1,9] - [10,11]
GoogLeNet    B4 - s2 - s1 - s1    [1,29] - [30,41] - [42,45] - [46,58]
MobileNet    B2 - B2 - s3 - s1    [1,11] - [12,21] - [22,26] - [27,28]
ResNet50     B4 - s2 - s2         [1,35] - [36,44] - [45,54]
SqueezeNet   B4 - s4              [1,16] - [17,26]
TABLE VI: Best throughput pipeline configuration with Pipe-it
and respective layer allocations from actual measured layer
timings.
CNN          Pipeline Config.     Layer Allocation
AlexNet      B4 - s4              [1,9] - [10,11]
GoogLeNet    B4 - s2 - s1 - s1    [1,25] - [26,39] - [40,44] - [45,58]
MobileNet    B2 - B2 - s3 - s1    [1,11] - [12,19] - [20,26] - [27,28]
ResNet50     B2 - B2 - s3 - s1    [1,16] - [17,34] - [35,47] - [48,54]
SqueezeNet   B4 - s4              [1,19] - [20,26]
In general, throughput benefit of Pipe-it comes from a
deep and yet balanced pipeline configuration. Pipe-it can
create a better-balanced pipeline with a large number of major
layers in a network. Nevertheless, we still observe 20.6%
benefit even for small networks like LeNet by using a three-
stage pipeline designed by Pipe-it, compared to the default
execution with 4 cores in the big cluster. Pipe-it on average
improves throughput by 39% over baseline. Throughput ob-
tained through pipelined configuration approaches or surpasses
combined throughput of individual clusters for all CNNs.
B. Layer Performance-Prediction Model
We use micro-benchmarks to create our layer performance-
prediction model. Predicted layer execution times guide the
search for optimal configuration. Table III shows the model
has good accuracy, with an average overall prediction error of
13.2% and 11.4% for Big and Small clusters, respectively.
Table VI shows Pipe-it pipeline configurations with actual
measured layer timings instead of predicted timings. The
configurations in Table V and Table VI are the same in most
cases. There is a mere 4% difference in performance in the
worst case. Results establish the efficacy of our model.
C. General Applicability
Pipe-it is applicable across different heterogeneous multi-
cores that have at least two clusters. We run Pipe-it on the
same Hikey 970 platform, but with one Big core and two
Small cores turned off to simulate an arbitrary big.LITTLE CPU
configuration. The layer timing estimations obtained as before
are plugged into the design space exploration algorithm to
locate the best pipeline configuration with the remaining three
Big cores and two Small cores. Table VII shows the results
obtained. Pipe-it predicts pipeline configurations with both
clusters engaged. The performance benefit is not as significant
compared to results shown in Table IV with 4 Big and 4 Small
cores. This is because, in this CPU configuration, only 2 Small
cores are additionally engaged in the pipeline. Fewer additional
resources result in lower performance improvement.
TABLE VII: Benefit of Pipe-it on a non-standard configuration
with three Big cores and two Small cores.
             Throughput (Imgs/s)                        Pct. Benefit
CNN          3 Big    2 Small   Pipe-it    Config.      (%)
AlexNet       5.7      0.7       5.8       B3-s2         1.5
GoogLeNet     6.2      0.7       7.4       B3-s2        19.3
MobileNet    14.2      3.7      15.3       B3-s2         7.1
ResNet50      2.8      1.0       3.7       B3-s1-s1     31.5
SqueezeNet   11.4      3.6      13.5       B3-s2        18.1
Average                                                 15.5
TABLE VIII: Average power (W) and power-efficiency
(Imgs/J) for execution on homogeneous cores and with Pipe-it.
             Average Active Power (W)     Power Efficiency (Imgs/J)
CNN          Big     Small    Pipe-it     Big     Small    Pipe-it
AlexNet      3.8     0.7      5.1         2.1     2.1      1.8
GoogLeNet    4.6     1.1      6.6         1.7     3.1      1.7
MobileNet    4.2     1.0      5.9         4.2     6.6      4.0
ResNet50     4.0     1.0      6.5         0.8     1.5      0.8
SqueezeNet   4.9     1.3      6.9         3.2     5.5      3.1
D. Power Efficiency
We are not able to obtain the individual power values of
each CPU component due to lack of power sensors in our
development board. We instead utilize a power measurement
module [5] that supplies and measures whole board power
consumption. The cluster not engaged in execution during
homogeneous runs is turned off to eliminate its contribution to total
power consumption. Measured whole-board socket power P
includes all on-board components besides the CPU. We mitigate the
effect of non-CPU components on total power by subtracting
off idle power PI. Idle power can vary with several factors.
Therefore, we measure it again before each run. The active
power readings reported are PA = P − PI. Table VIII shows
power measurements and corresponding power-efficiency.
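The two quantities can be sketched as below. The power-efficiency formula is an assumption inferred from the unit Imgs/J (images per second divided by joules per second); the function names are chosen for this example.

```python
def active_power(board_power_w, idle_power_w):
    """P_A = P - P_I: whole-board power minus the idle power measured before the run."""
    return board_power_w - idle_power_w


def power_efficiency(throughput_imgs_per_s, active_power_w):
    """Images per joule, i.e. (images/s) divided by (joules/s)."""
    return throughput_imgs_per_s / active_power_w
```

As a consistency check under this assumption, the AlexNet Big-cluster throughput of 8.1 Imgs/s (Table IV) divided by its active power of 3.8 W (Table VIII) gives about 2.1 Imgs/J, in line with the tabulated value.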
We cannot separate active memory power from the CPU.
Therefore, power-efficient Small cluster shows lower than
expected power-efficiency for memory intensive CNNs like
AlexNet. We attribute the lower power-efficiency with Pipe-it
to extra memory power consumed due to coherency between
different core clusters.
E. Quantization Considerations
Pipe-it aims to improve throughput by engaging all on-chip
CPU resources for execution. It is orthogonal to optimization
techniques such as quantization [32]. ARM-CL provides
support for execution with quantized 8-bit using asymmetric
integers (QASYMM8). However, the benefit of quantization
is largely dependent on the implementation. Overheads
induced by de-quantization and re-quantization operations
subdue the benefits of quantization [28]. Figure 14 shows a
similar effect by comparing the execution of nonquantized F32
and QASYMM8 for MobileNet with ARM-CL. Execution of
convolutional layers is improved by 14%. However, overall
execution time remains unchanged for ARM-CL v18.05.
Fig. 14: Performance comparison of MobileNet with quantization
across two ARM-CL versions (v18.05 and v18.11). [Plot of
layer processing time [ms] for v18.05-F32, v18.05-QASYMM8,
v18.11-F32, and v18.11-QASYMM8, split into Convolutional,
DW-Convolutional, and Others, with the Pipe-it effective time.]
We also evaluate the effect of quantization on the latest
ARM-CL version 18.11. F32 implementation of MobileNet
executes 20% faster on ARM-CL v18.11 compared to ARM-CL
v18.05. Its convolutional layers are 24% faster with quantiza-
tion with an overall 19% faster execution.
The performance we report above is with homogeneous
cores only. We create pipelines for both original and quantized
MobileNet using Pipe-it across ARM-CL versions. Figure 14
shows the effective per-frame latency (inverse of throughput).
Pipe-it introduces further performance improvement in all im-
plementations. MobileNet reaches a throughput of 31 Img/sec
with Pipe-it for its quantized ARM-CL v18.11 implementation.
F. Comparison with Other Frameworks
We compare the performance of Pipe-it against other CNN
frameworks using MobileNet as the common denominator.
Figure 15 shows the performance of several frameworks. We
measure the performance of TVM, NCNN, Pipe-it, and Pipe-
it** with actual experiments on our platform. Performance
numbers for the remaining frameworks are taken from other
sources [3], [13]. The borrowed numbers are scaled
approximately to compensate for differences in platforms. Pipe-it
provides the highest performance amongst all frameworks.
We also compare the energy-efficiency of Pipe-it against
DeepX [17]. DeepX is designed to consume the least power
within a latency requirement. Authors of [17] evaluate DeepX
on Qualcomm Snapdragon 800 SoC with Krait four-core
2.3 GHz CPU. DeepX provides a configuration which con-
sumes 444 mJ of energy for AlexNet with the latency re-
quirement of 500 ms (2 Img/s) resulting in energy-efficiency
of 2.2 Img/J. Pipe-it achieves comparable energy-efficiency of
1.8 Img/J but with a much higher throughput of 8.9 Img/s.
Fig. 15: Performance comparison of MobileNet with several
frameworks. [Bar chart of effective throughput [Img/s] for
Caffe-android-lib*, mini-Caffe*, TF-lite*, TVM, NCNN, Pipe-it,
and Pipe-it**.]
* scaled performance with AI-benchmark [13].
** Pipe-it with ARM-CL v18.11 and quantization as shown in Figure 14.
VIII. RELATED WORK
The development of CNNs is moving towards more complex
network structures with moderate resource requirements.
Starting with 250MB for AlexNet [16] in 2012, the size
of models has reduced to less than 0.5MB for SqueezeNet
[12] in 2016 without losing accuracy. Such advancements
allow for CNN deployment on mobile platforms even with
their limited computational and memory resources. To
effectively deploy CNNs on embedded platforms, researchers are
approaching the problem from different angles. The network
structure is modified to fit on the resource-constrained mobile
platform, for example through quantization [32], which
accelerates the computation and reduces the memory usage, and
network pruning [33], which trades accuracy for lower resource
requirements. In addition, sparsity is exploited in NN
applications [25] to reduce the computation and improve
execution performance on edge devices.
Accelerators enable highly energy-efficient execution of
CNNs on edge devices. Several works rely on the com-
putational capability of embedded GPUs to enable CNN
with collaborative execution on CPUs and other processors.
The DeepX [17] framework enables NN on edge through co-
execution on multiple processors, including GPU and low
power processors (LPU). It first engages runtime layer
compression to control the resource requirement of an NN
workload. It then decomposes the workload into unit-blocks for
assignment to multiple processors. DeepX derives substantial
benefit in performance and energy for AlexNet mainly from its
fully-connected layers. Use of fully-connected layers is now
minimal in state-of-the-art CNNs. DeepSense [11] and Deep-
Mon [10] present an OpenCL based framework for mobile
GPUs. DeepSense adopts GPU memory management tech-
niques which accelerate compute-heavy executions including
convolutional and fully-connected layers execution on GPU.
DeepMon extends DeepSense to include further caching op-
timizations and improves convolutional layer implementation.
ASICs are now being designed specifically for neural network
processing such as Google’s Tensor Processing Unit (TPU)
and Huawei’s Neural Processing Units (NPU). Researchers
also co-design algorithm and architecture with application-
specific characteristics [34], [35].
Researchers have characterized resource requirements of
CNNs [18], [22] that provide insights on designing CNN with
resource-constraints. Efficient libraries [1], [3], [6], [19] are
created to facilitate the implementation of deep learning on
edge devices. Frameworks like C-GOOD [14] are created to
facilitate deployment of CNN on edge devices by automatically
generating C and GPU (CUDA or OpenCL) code that
runs on respective platforms with hardware specifications and
optimization requirements.
On the other hand, older technology node or cost-sensitive
platforms that lack capable GPUs and accelerators still need
to execute CNNs via their CPUs. Graphi [30] presents a
framework that accelerates deep learning models through
layer-level parallelism within NN on many-cores. It leverages
the inherent layer-level parallelism in network structure and
schedules independent layers for concurrent execution. Graphi
is beneficial for networks such as LSTM and GoogLeNet
that have high layer-level parallelism. In comparison, Pipe-
it looks at computational kernel-level parallelism. It applies to
general network structures and targets CNN acceleration on
heterogeneous multi-cores.
IX. CONCLUSION
On-chip inference using CNNs is now becoming common-
place on edge devices. We show in this work that Kernel-
level splitting across heterogeneous core types is detrimental to
throughput. Instead, Layer-level splitting that minimizes cross-
cluster coherency can be employed to improve inferencing
throughput. We introduce a layer-level splitting technique
called Pipe-it that efficiently uses entire heterogeneous multi-
cores to improve CNN inference throughput. We study the
design space involved and introduce a search algorithm to lo-
cate a high performing design point within it. Pipe-it improves
the throughput on average by 39% using all heterogeneous
cores in comparison to using only homogeneous cores. Pipe-
it is not limited to CNN applications and also applies to other
streaming applications that show similar behaviours. In future,
we plan to include more co-processors such as GPUs and
NPUs into the design space to further exploit the potential
of the embedded SoCs in enabling deep learning.
REFERENCES
[1] Compute Library: A Software Library for Computer Vision and Machine
Learning. https://developer.arm.com/technologies/compute-library.
[2] Gluon Model Zoo mxnet documentation.
https://mxnet.incubator.apache.org/api/python/gluon/model_zoo.html.
[3] NCNN: A High-Performance Neural Network Inference Framework
Optimized for the Mobile Platform. https://github.com/Tencent/ncnn.
[4] Hi3670 V100 Application Processor Data Sheet. Technical report,
HiSilicon Technologies, 2018.
[5] Keysight Technologies B2900 Series Precision Source/Measure Unit
User’s Guide. Technical report, Keysight Technologies Japan, 2019.
[6] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q
Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind
Krishnamurthy. TVM: End-to-End Optimization Stack for Deep Learn-
ing. arXiv preprint arXiv:1802.04799, pages 1–15, 2018.
[7] Marvin Damschen, Frank Mueller, and Jörg Henkel. Co-Scheduling
on Fused CPU-GPU Architectures With Shared Last Level Caches.
Transactions on Computer-Aided Design of Integrated Circuits and
Systems (TCAD), 37(11):2337–2347, 2018.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
Residual Learning for Image Recognition. In Conference on Computer
Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[9] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision
Applications. arXiv preprint arXiv:1704.04861, 2017.
[10] Loc N Huynh, Youngki Lee, and Rajesh Krishna Balan. DeepMon:
Mobile GPU-Based Deep Learning Framework for Continuous Vision
Applications. In International Conference on Mobile Systems, Applica-
tions, and Services (MobiSys), pages 82–95. ACM, 2017.
[11] Loc Nguyen Huynh, Rajesh Krishna Balan, and Youngki Lee.
DeepSense: A GPU-based Deep Convolutional Neural Network Frame-
work on Commodity Mobile Devices. In Workshop on Wearable Systems
and Applications (WearSys), pages 25–30. ACM, 2016.
[12] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf,
William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-Level Ac-
curacy with 50x Fewer Parameters and <0.5 MB Model Size. arXiv
preprint arXiv:1602.07360, 2016.
[13] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu,
Tim Hartley, and Luc Van Gool. AI Benchmark: Running Deep
Neural Networks on Android Smartphones. In European Conference
on Computer Vision (ECCV), pages 0–0, 2018.
[14] Duseok Kang, Euiseok Kim, Inpyo Bae, Bernhard Egger, and Soonhoi
Ha. C-GOOD: C-Code Generation Framework for Optimized On-
Device Deep Learning. In International Conference on Computer-Aided
Design (ICCAD), page 105. ACM, 2018.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet
Classification with Deep Convolutional Neural Networks. In Advances
in Neural Information Processing Systems (NIPS), pages 1097–1105,
2012.
[16] Rakesh Kumar, Keith I Farkas, Norman P Jouppi, Parthasarathy Ran-
ganathan, and Dean M Tullsen. Single-ISA Heterogeneous Multi-
Core Architectures: The Potential for Processor Power Reduction. In
International Symposium on Microarchitecture (MICRO), page 81. IEEE
Computer Society, 2003.
[17] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio For-
livesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. DeepX: A Software
Accelerator for Low-power Deep Learning Inference on Mobile Devices.
In International Conference on Information Processing in Sensor Net-
works (IPSN), page 23. IEEE Press, 2016.
[18] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio For-
livesi, and Fahim Kawsar. An Early Resource Characterization of Deep
Learning on Wearables, Smartphones and Internet-of-Things Devices. In
International Workshop on Internet of Things Towards Applications (IoT-
App), pages 7–12. ACM, 2015.
[19] Seyyed Salar Latifi Oskouei, Hossein Golestani, Matin Hashemi, and
Soheil Ghiasi. CNNdroid: GPU-Accelerated Execution of Trained Deep
Convolutional Neural Networks on Android. In International Conference
on Multimedia (MM), pages 1201–1205. ACM, 2016.
[20] Min Lin, Qiang Chen, and Shuicheng Yan. Network in Network. arXiv
preprint arXiv:1312.4400, 2013.
[21] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida
Wang. Optimizing CNN Model Inference on CPUs. arXiv preprint
arXiv:1809.02697, 2018.
[22] Zongqing Lu, Swati Rallapalli, Kevin Chan, and Thomas La Porta.
Modeling the Resource Requirements of Convolutional Neural Networks
on Mobile Devices. In International Conference on Multimedia (MM),
pages 1663–1671. ACM, 2017.
[23] Thannirmalai Somu Muthukaruppan, Mihai Pricopi, Vanchinathan
Venkataramani, Tulika Mitra, and Sanjay Vishin. Hierarchical Power
Management for Asymmetric Multi-Core in Dark Silicon Era. In Design
Automation Conference (DAC), page 174. ACM, 2013.
[24] Alberto Pacheco, Pablo Cano, Ever Flores, Edgar Trujillo, and Pedro
Marquez. A Smart Classroom Based on Deep Learning and Osmotic
IoT Computing. In Congreso Internacional de Innovación y Tendencias
en Ingeniería (CONIITI), pages 1–5. IEEE, 2018.
[25] Sanchari Sen, Shubham Jain, Swagath Venkataramani, and Anand
Raghunathan. SparCE: Sparsity Aware General-Purpose Core Extensions
to Accelerate Deep Neural Networks. IEEE Transactions on Computers,
68(6):912–925, 2018.
[26] Nikolai Smolyanskiy, Alexey Kamenev, Jeffrey Smith, and Stan Birch-
field. Toward Low-Flying Autonomous MAV Trail Navigation Using
Deep Neural Networks for Environmental Awareness. In International
Conference on Intelligent Robots and Systems (IROS), pages 4241–4247.
IEEE, 2017.
[27] Thannirmalai Somu Muthukaruppan, Anuj Pathania, and Tulika Mitra.
Price Theory Based Power Management for Heterogeneous Multi-Cores.
SIGPLAN Notices, 49(4):161–176, 2014.
[28] Dawei Sun, Shaoshan Liu, and Jean-Luc Gaudiot. Enabling Embedded
Inference Engine with ARM Compute Library: A Case Study. arXiv
preprint arXiv:1704.03751, 2017.
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich. Going Deeper with Convolutions. In Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
[30] Linpeng Tang, Yida Wang, Theodore L Willke, and Kai Li. Scheduling
Computation Graphs of Deep Learning Models on Manycore CPUs.
arXiv preprint arXiv:1807.09667, 2018.
[31] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choud-
hury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill
Jia, et al. Machine Learning at Facebook: Understanding Inference at
the Edge. In International Symposium on High-Performance Computer
Architecture (HPCA), pages 331–344. IEEE, 2019.
[32] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng.
Quantized Convolutional Neural Networks for Mobile Devices. In
Conference on Computer Vision and Pattern Recognition (CVPR), pages
4820–4828, 2016.
[33] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark
Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-Aware
Neural Network Adaptation for Mobile Applications. In European
Conference on Computer Vision (ECCV), pages 285–300, 2018.
[34] Yuhao Zhu, Matthew Mattina, and Paul Whatmough. Mobile Machine
Learning Hardware at ARM: A Systems-on-Chip (SoC) Perspective.
arXiv preprint arXiv:1801.06274, 2018.
[35] Yuhao Zhu, Anand Samajdar, Matthew Mattina, and Paul Whatmough.
Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Con-
tinuous Vision. In International Symposium on Computer Architec-
ture (ISCA), pages 547–560. IEEE, 2018.
[36] Aleksandar Zlateski, Kisuk Lee, and H Sebastian Seung. ZNN – A Fast
and Scalable Algorithm for Training 3D Convolutional Networks on
Multi-core and Many-Core Shared Memory Machines. In International
Parallel and Distributed Processing Symposium (IPDPS), pages 801–
811. IEEE, 2016.
