Collaborative Heterogeneous Computing on MPSoCs by Wang, Siqi
arXiv:1907.10904v1 [cs.DC] 25 Jul 2019
Collaborative Heterogeneous Computing on MPSoCs
Extended Abstract∗
Siqi Wang
1 INTRODUCTION
With the emerging demand for computations on mobile devices,
heterogeneous multi-processor system-on-chips (MPSoCs) are en-
visioned to dominate the current and future mobile computing
landscape. Heterogeneous MPSoCs usually comprise various processing
elements such as general-purpose cores (CPUs) with different
performance-power characteristics and application-specific
accelerators, examples of which are graphics processing units (GPUs),
digital signal processors (DSPs), reconfigurable accelerators
(FPGAs, etc.) and the recent neural acceleration engines (NPUs, etc.).
Such heterogeneity on the SoC enables careful matching of
computational kernels to the processing elements that are best
suited to perform the computation, which leads to substantial
improvements in performance and energy efficiency.
The heterogeneity can be broadly classified into performance
and functional heterogeneity, while commercial SoCs are trending
toward adopting both in the same chip. Performance heterogeneity
consists of cores with the same functionality (instruction-set
architecture, ISA) but with different power-performance
characteristics, an example of which is the ARM big.LITTLE CPU
architecture. The difference stems from distinct micro-architectural
features such as in-order versus out-of-order execution. The complex
cores provide better performance at the cost of higher power
consumption, while the simpler cores exhibit low-power behavior with
lower performance. Functional heterogeneity features cores with
very different functionality (different ISAs) existing on the same die.
This form of heterogeneity takes advantage of certain execution
patterns for exceptional speed-up to meet the performance requirement
under a stringent power budget. With carefully managed exploitation
of multiple forms of heterogeneity, heterogeneous MPSoCs present
great potential to sustain the performance and power requirements
of next-generation mobile computing.
While architectural heterogeneity is promising, software devel-
opment efforts are required to fully benefit from this architectural
advancement [4]. This thesis (extended abstract) presents the
software development efforts toward efficient exploitation of
heterogeneity through intricate mapping of computational kernels,
collaborative execution of multiple processing elements, and
application-specific techniques. The goal is to embrace the
heterogeneity to unleash the full potential of heterogeneous MPSoCs
towards high-performance, energy-efficient mobile computing.
2 EXPLOITATION OF HETEROGENEITY
Functional heterogeneity presents application developers with a
diverse choice of processing elements on the same chip. They now
have the opportunity and the responsibility to take advantage of
the unique characteristics of different processing elements to
improve execution performance. However, the matching of
computational kernels to processing elements is difficult as the
performance is a complex interplay among the exposed parallelism, the
compiler, and the processor architecture. Furthermore, the
application kernel needs to be implemented in different
processor-specific languages to measure the performance of each
processing element. If the performance of the applications on
different processing elements is made available at an early stage,
the developers will then be able to make an informed decision in
selecting the most appropriate processing element and concentrate on
further processor-specific languages and optimizations. CGPredict [5]
is proposed to guide developers in the early design choice without
tedious redevelopment efforts. It is an analytical framework that
accurately estimates the performance of a computational kernel on an
embedded GPU architecture from unoptimized, single-threaded C code.
∗Accepted to ACM SIGDA Ph.D. Forum at Design Automation Conference (DAC) 2019.
Siqi Wang is with the Department of Computer Science, School of Computing,
National University of Singapore, SG. E-mail: (wangsq@comp.nus.edu.sg)
CGPredict takes a computational kernel in the form of
single-threaded C code and generates its execution trace through a
Trace Extraction phase. In order to emulate the behavior of the GPU,
a Warp Formation phase is introduced to transform the single-threaded
trace into its multi-threaded equivalent. CGPredict then extracts
computation (compute instructions) and memory access information.
The compute cycle count is obtained by mapping compute instructions
to GPU instructions in the Computation Analysis stage, while the
memory cycle count is obtained by analyzing the memory access
information together with access patterns and cache behavior in the
Memory Behavior Analysis stage. The results from the two analysis
stages provide the execution characteristics of the kernel needed
for performance prediction. Lastly, together with the hardware
architectural parameters obtained from micro-benchmarking, a
comprehensive Analytical Prediction Model is engaged to predict the
final execution performance using the computation and memory
execution characteristics.
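As a rough illustration of the warp formation and memory access
analysis ideas described above, the following sketch treats each
iteration of a single-threaded loop as one emulated GPU thread, groups
consecutive iterations into warps, and counts how many memory segments
each warp touches for a given access. The warp width of 32, the
128-byte segment size, and all function names are assumptions made
for this example; they are not CGPredict's actual parameters or
interface.

```python
from typing import Callable, List

WARP_SIZE = 32        # assumed warp width
SEGMENT_BYTES = 128   # assumed memory transaction granularity


def form_warps(num_iterations: int, warp_size: int = WARP_SIZE) -> List[range]:
    """Group consecutive loop iterations (one per emulated thread) into warps."""
    return [range(start, min(start + warp_size, num_iterations))
            for start in range(0, num_iterations, warp_size)]


def transactions_per_warp(warp: range, address_of: Callable[[int], int],
                          segment_bytes: int = SEGMENT_BYTES) -> int:
    """Count the distinct memory segments one warp touches for one access."""
    return len({address_of(tid) // segment_bytes for tid in warp})


def total_transactions(num_iterations: int,
                       address_of: Callable[[int], int]) -> int:
    """Sum the per-warp transaction counts over the whole iteration space."""
    return sum(transactions_per_warp(warp, address_of)
               for warp in form_warps(num_iterations))


def unit_stride(i: int) -> int:
    """Byte address of a[i] for an array of 4-byte elements."""
    return 4 * i


def column_walk(i: int) -> int:
    """Byte address of a[i][0] in a row-major array with 64 columns."""
    return 4 * (i * 64)


coalesced = total_transactions(1024, unit_stride)
strided = total_transactions(1024, column_walk)
print(coalesced, strided)
```

In this toy setting the unit-stride access needs one transaction per
warp, while the column walk needs one transaction per thread, which is
the kind of coalescing difference a memory behavior analysis has to
capture.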
CGPredict provides accurate GPU performance estimations from
only C code with 9% error. It also provides insights regarding the
characteristics of the kernel and the GPU that influence perfor-
mance, such as coalescing of memory accesses and shared mem-
ory usage. These insights offer opportunities for the developers to
understand the intrinsic strengths and weaknesses of the architec-
ture in the context of a particular kernel that can facilitate further
code optimizations. Furthermore, CGPredict in conjunction with
an existing FPGA performance predictor from C code [6] achieves
our objective of making the perfect choice of processing elements
(CPU, GPU or FPGA) given a kernel.
3 CO-EXECUTION ON MOBILE PLATFORM
The ever-increasing processing requirements impose higher pres-
sure on mobile devices with limited processing capability. Execut-
ing an application on a single processing element may not sustain
the performance requirements, while other processing elements
that can potentially be used are not actively contributing. The con-
current co-execution of a single computational kernel on multi-
ple processing elements thus exhibits great potential in achieving
additional performance. The design space of co-execution is huge
with the exploitation of both performance and functional hetero-
geneity. In addition, the ability to vary clock frequencies enables
the compromise between the achievable performance and power
consumption which further extends the design space. We show
through exhaustive design space search [1] that by executing a
computational kernel simultaneously on all available processing
elements (big.LITTLE CPU cores, GPUs) together with suitable
voltage-frequency settings for all these cores, as much as 39%
energy savings and 19% improvement in runtime are achieved compared
to the stand-alone executions. The improvement in runtime allows
developers to have more flexibility in tuning the various
voltage-frequency settings to achieve higher performance under
certain constraints.
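The exhaustive design space search can be pictured with a small
sketch that enumerates every frequency combination, estimates runtime
and energy, and keeps the lowest-energy setting that meets a deadline.
Everything numeric here is hypothetical: the frequency tables, the
assumption that throughput scales linearly with frequency, the cubic
dynamic power term, and the ideally balanced workload split are
placeholders for illustration, not the measurements or models used
in [1].

```python
from itertools import product

# Hypothetical platform description (illustrative numbers only):
# name -> (frequencies in GHz, work items per second per GHz,
#          static power in W, dynamic power coefficient in W/GHz^3)
DEVICES = {
    "big":    ((0.8, 1.2, 1.6, 2.0), 100.0, 0.20, 0.50),
    "LITTLE": ((0.6, 1.0, 1.4),       40.0, 0.05, 0.10),
    "GPU":    ((0.2, 0.4, 0.6),      500.0, 0.10, 4.00),
}


def evaluate(freqs, work_items):
    """Return (runtime in s, energy in J) for one frequency combination.

    Assumes all devices run concurrently on an ideally balanced split,
    so the combined rate is the sum of the individual rates.
    """
    rate = 0.0
    power = 0.0
    for (_, per_ghz, p_static, k_dyn), f in zip(DEVICES.values(), freqs):
        rate += per_ghz * f
        power += p_static + k_dyn * f ** 3
    runtime = work_items / rate
    return runtime, power * runtime


def exhaustive_search(work_items, deadline_s):
    """Try every frequency combination; keep the lowest-energy feasible one."""
    best = None
    for freqs in product(*(spec[0] for spec in DEVICES.values())):
        runtime, energy = evaluate(freqs, work_items)
        if runtime <= deadline_s and (best is None or energy < best[2]):
            best = (freqs, runtime, energy)
    return best


print(exhaustive_search(work_items=1000, deadline_s=5.0))
```

With three devices and a handful of frequency levels the space is
tiny, but it grows multiplicatively with every additional device and
setting, which is why prediction models become attractive in place of
measuring every point.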
On the other hand, the inherent characteristics of mobile systems
impose more stringent power and thermal requirements than those of
server systems; this is especially so because of the lack of
active cooling measures on mobile devices. Commercial heteroge-
neous MPSoCs usually implement operating system level thermal
management techniques such as processor frequency throttling to
prevent failure of the chip at high temperatures. Engaging multiple
processing elements concurrently may expedite the heating up of
the system, necessitating frequency throttling and hence degrada-
tion of performance. Therefore, the benefit of co-execution can be
compromised by the throttling of frequency due to thermal issues.
We propose OPTiC [2] to anticipate such thermal impact on execution
when engaging multiple processing elements for performance
optimization.
OPTiC presents a static partitioning strategy to split a compu-
tational kernel across CPU and GPU cores for concurrent execu-
tion, with the voltage-frequency settings of the cores carefully de-
termined considering the thermal effects. OPTiC builds on exten-
sive and comprehensive modeling of power and runtime, resource
contention and thermal behavior. The power and runtime of the
CPU and GPU cores at all frequencies are predicted through ana-
lytical modeling from one profile run at a sample frequency. The
thermal behavior is captured through a thermal throttling model
that predicts the occurrence of OS frequency throttling and the
resultant runtime under such thermal condition. From the indi-
vidual performances, the allocation of the workload and the co-
execution performance are predicted through a co-execution model
that considers the effect of thermal frequency throttling and re-
source contention. The framework then goes through all the possi-
ble frequency settings and predicts the performance to locate the
optimal configuration and workload allocation. While the perfor-
mance of an application is largely affected by thermal conditions,
OPTiC is able to predict the configuration that presents on aver-
age 14% runtime improvement over standalone execution. OPTiC
further demonstrates great temperature control with real-life
applications. With the configuration predicted by OPTiC, the chip
runs at a much lower temperature than with the Linux frequency
governors.
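To make the interaction between static partitioning and thermal
throttling concrete, the sketch below uses a deliberately simple,
assumed throttling model: a device processes work at a high rate
until a fixed time, after which it is throttled to a lower rate. The
workload share is then chosen by bisection so that CPU and GPU finish
together. The piecewise model, the tuple layout, and the function
names are assumptions for illustration; OPTiC's actual power, thermal,
and contention models are described in [2].

```python
def throttled_runtime(work, rate_high, rate_low, time_to_throttle):
    """Runtime when a device runs at rate_high until throttling starts,
    then continues at rate_low (assumed piecewise-constant model)."""
    if work <= rate_high * time_to_throttle:
        return work / rate_high
    done_before_throttle = rate_high * time_to_throttle
    return time_to_throttle + (work - done_before_throttle) / rate_low


def balanced_split(total_work, cpu, gpu, tol=1e-9):
    """Bisect on the GPU share so that both devices finish together.

    cpu and gpu are (rate_high, rate_low, time_to_throttle) tuples.
    Returns (gpu_share, co_execution_runtime).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        share = (lo + hi) / 2
        t_gpu = throttled_runtime(total_work * share, *gpu)
        t_cpu = throttled_runtime(total_work * (1 - share), *cpu)
        if t_gpu < t_cpu:
            lo = share   # GPU finishes early, so give it more work
        else:
            hi = share   # GPU is the bottleneck, so give it less work
    share = (lo + hi) / 2
    runtime = max(throttled_runtime(total_work * share, *gpu),
                  throttled_runtime(total_work * (1 - share), *cpu))
    return share, runtime


never = float("inf")
# Hypothetical rates in work items per second.
print(balanced_split(4000, cpu=(100, 100, never), gpu=(300, 300, never)))
print(balanced_split(4000, cpu=(100, 50, 5.0), gpu=(300, 300, never)))
```

In the second call the CPU is throttled after five seconds, so the
balanced split shifts work toward the GPU and the co-execution runtime
grows relative to the unthrottled case, which is the effect a
thermally aware partitioner has to anticipate.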
4 TOWARD MACHINE LEARNING
Lastly, the rise of machine learning applications poses great
challenges to mobile platforms. Deploying neural network inferencing
on mobile platforms requires the exploitation of heterogeneity to
sustain the performance requirements given limited resources and
stringent power budgets. Although dedicated neural accelerators
(NPUs, etc.) show exceptional speed-ups for applications like
convolutional neural networks (CNNs), the technique is highly
platform dependent and not applicable to general architectures
without the accelerator. Furthermore, CNNs are more commonly used as
building blocks to construct more complex systems. We envision that,
in the near future, multiple independent inference sub-tasks will be
performed concurrently. This requires all the available processing
elements to run the inference engines in parallel. Therefore, it is
important to develop general techniques that are applicable to
existing heterogeneous MPSoCs on mobile platforms.
Commercial CNN libraries usually engage only one of the processing
elements and are often unaware of the co-execution of multiple
processing elements. ARM Compute Library (ARM-CL) provides
out-of-the-box support for parallel execution through multi-threading
for the CPU clusters. However, the concurrent co-execution of the big
and LITTLE clusters with multi-threading is harmful for performance
due to cache coherence overheads. Thus, kernel-level splitting among
processing elements fails to either reduce the end-to-end latency or
improve the throughput. We present an alternative framework,
Pipe-it [3], that employs a pipelined design to split the
convolutional layers across processing elements (different CPU
clusters) to improve throughput for streaming inferencing.
Here, the two CPU core clusters are divided into multiple sub-core-
clusters as processing elements to construct the pipeline stages to
better match the resources and workload. Pipe-it includes an ana-
lytical performance model that predicts the performance of a con-
volutional layer on different configurations (core type, count) from
its network structure descriptions. The predicted performance is
then used as input into a design space exploration algorithm that
navigates the design space and locates the best fitting pipeline
configuration and respective layer allocation. Pipe-it with the pre-
dicted multi-stage pipeline achieves on average 39% throughput
gain compared with the execution on a single processing element.
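One way to picture the layer allocation step is as a partitioning
problem: given predicted per-layer times on each pipeline stage,
assign contiguous groups of layers to the stages so that the slowest
stage, which bounds the streaming throughput, is as fast as possible.
The sketch below solves this with a memoized search over split points.
It assumes the stage configurations are already fixed and that the
per-layer times come from some predictor; the input layout and the
function name are hypothetical, and this is not the exploration
algorithm of [3], which also chooses the core type and count of each
stage.

```python
from functools import lru_cache


def allocate_layers(layer_times):
    """Split layers into contiguous groups, one per pipeline stage.

    layer_times[s][l] is the predicted time of layer l on stage s.
    Returns (bottleneck_time, boundaries), where stage s runs layers
    boundaries[s] up to (not including) boundaries[s + 1].
    Requires at least as many layers as stages.
    """
    num_stages = len(layer_times)
    num_layers = len(layer_times[0])

    # prefix[s][l] = total time of layers 0..l-1 on stage s
    prefix = [[0.0] * (num_layers + 1) for _ in range(num_stages)]
    for s in range(num_stages):
        for l in range(num_layers):
            prefix[s][l + 1] = prefix[s][l] + layer_times[s][l]

    @lru_cache(maxsize=None)
    def best(stage, start):
        """Best bottleneck when stages stage.. handle layers start.."""
        if stage == num_stages - 1:
            own = prefix[stage][num_layers] - prefix[stage][start]
            return own, (start, num_layers)
        result = None
        # Leave at least one layer for every remaining stage.
        last_end = num_layers - (num_stages - stage - 1)
        for end in range(start + 1, last_end + 1):
            own = prefix[stage][end] - prefix[stage][start]
            rest, cuts = best(stage + 1, end)
            bottleneck = max(own, rest)
            if result is None or bottleneck < result[0]:
                result = (bottleneck, (start,) + cuts)
        return result

    return best(0, 0)


# Hypothetical predicted times (ms) for four layers on two stages.
times = [
    [4.0, 2.0, 2.0, 4.0],   # stage 0, e.g. a big-core sub-cluster
    [8.0, 4.0, 4.0, 8.0],   # stage 1, e.g. a LITTLE-core sub-cluster
]
bottleneck, boundaries = allocate_layers(times)
print(bottleneck, boundaries)
```

In steady state the pipeline emits one result per bottleneck interval,
so minimizing the slowest stage maximizes throughput even though the
latency of a single inference may not improve.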
REFERENCES
[1] A. Prakash, S. Wang, A. E. Irimiea, and T. Mitra. 2015. Energy-efficient
execution of data-parallel applications on heterogeneous mobile platforms. In
2015 33rd IEEE International Conference on Computer Design (ICCD). 208–215.
https://doi.org/10.1109/ICCD.2015.7357105
[2] S. Wang, G. Ananthanarayanan, and T. Mitra. 2019. OPTiC: Optimizing Collabo-
rative CPU-GPU Computing on Mobile Devices With Thermal Constraints. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 38, 3
(March 2019), 393–406. https://doi.org/10.1109/TCAD.2018.2873210
[3] S. Wang, G. Ananthanarayanan, Y. Zeng, N. Goel, A. Pathania, and T. Mitra. 2019.
High-Throughput CNN Inference on Embedded ARM big.LITTLE Multi-Core
Processors. arXiv preprint arXiv:1903.05898 (2019).
[4] S. Wang, A. Prakash, and T. Mitra. 2018. Software Support for Heterogeneous
Computing. In 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI).
756–762. https://doi.org/10.1109/ISVLSI.2018.00142
[5] S. Wang, G. Zhong, and T. Mitra. 2017. CGPredict: Embedded GPU Performance
Estimation from Single-Threaded Applications. ACM Trans. Embed. Comput. Syst.
16, 5s, Article 146 (Sept. 2017), 22 pages. https://doi.org/10.1145/3126546
[6] G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar. 2017. Design
space exploration of FPGA-based accelerators with multi-level parallelism. In
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. 1141–1146.
https://doi.org/10.23919/DATE.2017.7927161
