Work-in-Progress: A Simulation Framework for Domain-Specific
  System-on-Chips by Arda, Samet E. et al.
Samet E. Arda1, Anish NK1, A. Alper Goksoy1, Joshua Mack2, Nirmal Kumbhare2,
Anderson L. Sartor3, Ali Akoglu2, Radu Marculescu3 and Umit Y. Ogras1
1School of Electrical Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA
2Electrical and Computer Engineering, The University of Arizona, Tucson, AZ, USA
3Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
1 INTRODUCTION AND BACKGROUND
Homogeneous general purpose processors provide flexibility to
implement a variety of applications and facilitate programmability.
In contrast, heterogeneous system-on-chips (SoCs) that combine
general purpose and specialized processors offer great potential
to achieve higher efficiency while maintaining programming flexibility.
In particular, domain-specific SoCs (DSSoC), a class of heterogeneous
architectures, tailor the architecture and processing
elements (PE) to a specific domain. Hence, they can provide superior
energy efficiency compared to general purpose processors by
exploiting the characteristics of target applications.
DSSoCs can fulll their potential only if they integrate the set
of accelerators required by the target domain and utilize these re-
sources eectively. Therefore, the rst step in the design of DSSoCs
is analyzing the domain applications to identify the most com-
monly used kernels. This analysis is necessary to determine the
set of hardware accelerators in the architecture. For example, a
DSSoC that targets wireless communications domain will most
likely have Fourier transform (FFT) accelerators. The next step is
to design the DSSoC, including the PEs and network-on-chip that
interconnects them. Then, a wide range of design- and run-time
algorithms are employed to schedule the domain applications to the
PEs in the DSSoC [7, 8]. Similarly, dynamic voltage and frequency
scaling (DVFS) policies, such as Ondemand and thermal manage-
ment techniques have been applied to eciently manage the power
and temperature of SoCs [3]. However, existing approaches are
typically evaluated in isolated environments and dierent in-house
tools. Thus, there is a strong need for a unied simulation envi-
ronment to enable design space exploration and dynamic resource
management of domain applications.
The StarPU framework provides the ability to perform run-time
scheduling and execution management for directed acyclic graph
This material is based on research sponsored by Air Force Research Laboratory (AFRL)
and Defense Advanced Research Projects Agency (DARPA) under agreement number
FA8650-18-2-7860. The U.S. Government is authorized to reproduce and distribute
reprints for Governmental purposes notwithstanding any copyright notation thereon.
The views and conclusions contained herein are those of the authors and should not be
interpreted as necessarily representing the official policies or endorsements, either
expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced
Research Projects Agency (DARPA) or the U.S. Government.
CODES/ISSS ’19, October 13–18, 2019, New York, NY, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6923-7/19/10 ... $15.00
https://doi.org/10.1145/3349567.3351719
Figure 1: Simulation framework
(DAG) based programs on heterogeneous architectures [2]. It is
integrated into the SimGrid [6] framework. However, SimGrid was developed
in the context of providing fast simulation for distributed
systems, and its authors acknowledge that it is not intended to
scale down to simulating real-time multi-threaded systems or comparing
kernel schedulers and policies [1]. A recent work [9] targets
domain-specific programmability of heterogeneous architectures
through intelligent compile- and run-time mapping of tasks across
different PEs. In that study, the authors employ three different simulators.
The proposed framework would benefit this and other similar
studies by providing a single integrated simulation framework.
In this work, we present an integrated open-source simulation
framework capable of evaluating both scheduling and dynamic
thermal-power management (DTPM) algorithms. It addresses rapid system-level
power, performance, and temperature exploration of DSSoCs. Besides
facilitating the design of new scheduling and DTPM algorithms,
the proposed framework
also features built-in DVFS governors deployed on commercial
SoCs and analytical power, performance, and temperature models.
Finally, the framework includes five reference applications from
the wireless communication and radar processing domains. These applications
are profiled on commercial heterogeneous SoC platforms
and provided as a benchmark suite integrated into our framework.
2 OVERVIEW
The goal of the proposed simulation framework is to enable rapid
development of scheduling and power management algorithms
while enabling extensive DSSoC design space exploration.
The organization of the framework, designed to accomplish these
objectives, is shown in Figure 1. The resource database contains
arXiv:1908.03664v1 [cs.AR] 10 Aug 2019
Figure 2: Block diagram of WiFi transmitter application
the list of PEs along with the expected latency of tasks in the application(s).
The simulation is driven by the job generator, which injects
instances of an application into the simulator following a given
probability distribution. As an example, Figure 2 presents a block
diagram for a WiFi transmitter (WiFi-TX) job that is composed of
multiple tasks. The dependencies among the tasks are represented
using a DAG. Hence, the job generator produces the tasks for a
transmitter job along with their dependencies.
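The job generator described above can be sketched in a few lines of Python. This is a minimal illustration and not the framework's actual code: the names `Job`, `generate_jobs`, and `ready_tasks` are hypothetical, the exponential inter-arrival distribution is only one possible choice of "a given probability distribution", and the WiFi-TX task graph is assumed to be a linear chain of the six tasks listed in Table 1.

```python
import random
from dataclasses import dataclass

# Hypothetical WiFi-TX task graph: task name -> list of predecessor tasks.
# A linear chain is ASSUMED here purely for illustration.
WIFI_TX_DAG = {
    "scrambler_encoder": [],
    "interleaver": ["scrambler_encoder"],
    "qpsk_modulation": ["interleaver"],
    "pilot_insertion": ["qpsk_modulation"],
    "inverse_fft": ["pilot_insertion"],
    "crc": ["inverse_fft"],
}


@dataclass
class Job:
    job_id: int
    arrival_us: float  # injection time in microseconds
    dag: dict          # task name -> list of predecessor task names


def generate_jobs(rate_jobs_per_ms, horizon_us, dag, seed=0):
    """Return jobs with exponentially distributed inter-arrival times.

    The exponential distribution is an assumption; any other sampler
    could be substituted for rng.expovariate.
    """
    rng = random.Random(seed)
    rate_per_us = rate_jobs_per_ms / 1000.0
    jobs, now = [], 0.0
    while True:
        now += rng.expovariate(rate_per_us)
        if now > horizon_us:
            return jobs
        jobs.append(Job(len(jobs), now, dag))


def ready_tasks(dag, finished):
    """Tasks that have not run yet and whose predecessors have all finished."""
    return [task for task, preds in dag.items()
            if task not in finished and all(p in finished for p in preds)]
```

For example, `generate_jobs(5, 10_000, WIFI_TX_DAG)` injects jobs at an average rate of 5 jobs/ms over a 10 ms window, and `ready_tasks` yields the tasks that would be handed to the scheduler at a decision epoch.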
The simulation framework invokes the scheduler at every scheduling
decision epoch with the list of tasks ready for execution. Then,
the simulation kernel simulates task execution on the corresponding
PE using execution time profiles obtained from our reference
hardware implementations. Similarly, the framework employs analytical
latency models to estimate interconnect delays on the SoC.
After each scheduling decision, the simulation kernel updates the
state of the simulation, which is used in subsequent decision epochs.
There are three built-in scheduling algorithms: 1) Minimum
execution time (MET) scheduler [5], 2) Earliest task first (ETF)
scheduler [4], and 3) a table-based scheduler which can store any
offline schedule, such as an assignment generated by an integer
linear programming (ILP) solver, in the form of a look-up table.
In addition, the framework provides a plug-and-play interface to
choose between different scheduling algorithms. Hence, developers
can implement their own algorithms and integrate them easily.
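To make the difference between the three policies concrete, the following sketch implements simplified versions of them behind a common call signature. It is an illustration under stated assumptions, not the framework's interface: `SCHEDULERS`, `register_scheduler`, and the `(task, now, pes)` signature are hypothetical, the ETF variant ignores communication cost and ranks PEs for a single task only, and the MET tie-break by earliest availability is an assumption. The latencies are the WiFi-TX values from Table 1.

```python
# Execution-time profiles in microseconds (values from Table 1).
# A missing key means the task cannot run on that PE type.
PROFILE = {
    "scrambler_encoder": {"acc_scrambler": 8, "a7": 22, "a15": 10},
    "interleaver": {"a7": 10, "a15": 4},
    "qpsk_modulation": {"a7": 15, "a15": 8},
    "pilot_insertion": {"a7": 5, "a15": 3},
    "inverse_fft": {"acc_fft": 16, "a7": 296, "a15": 118},
    "crc": {"a7": 5, "a15": 3},
}


def met(task, now, pes):
    """MET: consider only PEs of the type with the smallest execution time."""
    best_type = min(PROFILE[task], key=PROFILE[task].get)
    fastest = [name for name, pe in pes.items() if pe["type"] == best_type]
    # Tie-break by earliest availability (an assumption of this sketch).
    return min(fastest, key=lambda name: pes[name]["free_at"])


def etf(task, now, pes):
    """Simplified ETF: pick the PE on which the task would finish earliest."""
    def finish_time(name):
        pe = pes[name]
        return max(now, pe["free_at"]) + PROFILE[task][pe["type"]]
    capable = [name for name, pe in pes.items() if pe["type"] in PROFILE[task]]
    return min(capable, key=finish_time)


def make_table_scheduler(table):
    """Table-based scheduler: replay an offline task -> PE assignment."""
    def table_scheduler(task, now, pes):
        return table[task]
    return table_scheduler


# Hypothetical plug-in registry: user-defined policies share the same signature.
SCHEDULERS = {"MET": met, "ETF": etf}


def register_scheduler(name, scheduler):
    SCHEDULERS[name] = scheduler


if __name__ == "__main__":
    pes = {
        "a15_0": {"type": "a15", "free_at": 0},
        "a7_0": {"type": "a7", "free_at": 0},
        "fft_0": {"type": "acc_fft", "free_at": 200},  # busy until t = 200 us
    }
    register_scheduler("TABLE", make_table_scheduler({"inverse_fft": "fft_0"}))
    for name, scheduler in SCHEDULERS.items():
        print(name, scheduler("inverse_fft", 0, pes))
```

With the FFT accelerator busy until 200 µs, MET still selects `fft_0` (the task finishes at 216 µs), whereas the simplified ETF selects `a15_0` (the task finishes at 118 µs). The table-based policy returns whatever the offline schedule prescribes, regardless of the current PE state.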
In parallel, power and energy estimates of each schedule are
calculated by using power models [3]. Using these models, the
proposed framework aids the design space exploration of DTPM
techniques. Similarly, the memory access and on-chip intercon-
nect latency are modeled by the proposed framework. Finally, the
framework generates plots and reports of schedule, performance,
throughput, and energy consumption to aid users in analyzing the
behaviour of various algorithms.
Table 1: Execution proles ofWiFi-TX on ArmA7/A15 cores
in Odroid-XU3 and hardware accelerators
Sample
App. Task
Latency (µs)
HW Acc. Odroid A7 Odroid A15
WiFi-TX
Scrambler Enc. 8 22 10
Interleaver 10 4
QPSK Modulation 15 8
Pilot Insertion 5 3
Inverse-FFT 16 296 118
CRC 5 3
Table 2: SoC conguration for scheduling case studies
Resource Type # of Instances
Cortex-A15 ARM big Architecture 4
Cortex-A7 ARM LITTLE Architecture 4
Scrambler-Encoder Hardware Accelerator 2
FFT Hardware Accelerator 4
Figure 3: Results from different schedulers with a workload
consisting of WiFi-TX jobs
3 SCHEDULING CASE STUDY
The proposed simulation framework enables evaluation of various
scheduling algorithms for real-world applications with different
DSSoC configurations. To this end, we developed reference designs
for WiFi transmitter and receiver (WiFi-RX), low-power single-carrier,
range detection, and pulse Doppler applications on two
popular commercial heterogeneous SoC platforms, i.e., Xilinx Zynq
ZCU-102 UltraScale MPSoC and Odroid-XU3. We profiled the task
execution times for each application executed on these platforms.
As an example, the execution times for different tasks in WiFi-TX
are shown in Table 1. The SoC configuration chosen for the
scheduling case studies is shown in Table 2.
In this study, the simulations run on an SoC configuration that
mimics a typical heterogeneous SoC with a total of 14 general
purpose cores and hardware accelerators. We schedule and execute
the WiFi-TX task flow graph using the simulation framework and
plot the average job execution time trend with respect to the job
injection rate, as shown in Figure 3. To understand the performance
of scheduling algorithms, we analyze the average execution time
at varying rates of job injection.
All schedulers perform similarly at low job injection rates (less
than 5 jobs/ms). However, as the job injection rate increases, the
schedule from MET results in a higher execution time since MET uses
a naive representation of the system state for scheduling decisions,
considering only the PEs with the best execution times. On the other
hand, ILP uses a static table-based schedule which is optimal for
one job instance. At low injection rates (less than 5 jobs/ms), ILP
provides a comparable schedule as jobs do not interleave. However,
as the injection rate increases, the ILP schedule is no longer optimal. The
performance of ETF is superior in comparison to the others (see
Figure 3) since ETF utilizes the information about the communication
cost between tasks and the current status of all PEs while
making the scheduling decision.
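The mechanism behind this trend can be explored with a toy injection-rate sweep. The sketch below is a strongly simplified stand-in for the simulation kernel, not the framework itself: jobs arrive periodically rather than randomly, the WiFi-TX graph is assumed to be a linear chain, communication and memory latencies are ignored, and the function names are hypothetical. It uses the latencies of Table 1 and the PE counts of Table 2. Because of these omissions, the absolute injection rates at which the policies diverge are not comparable to those in Figure 3.

```python
import heapq

# Latencies in microseconds from Table 1; PE counts from Table 2.
PROFILE = {
    "scrambler_encoder": {"acc_scrambler": 8, "a7": 22, "a15": 10},
    "interleaver": {"a7": 10, "a15": 4},
    "qpsk_modulation": {"a7": 15, "a15": 8},
    "pilot_insertion": {"a7": 5, "a15": 3},
    "inverse_fft": {"acc_fft": 16, "a7": 296, "a15": 118},
    "crc": {"a7": 5, "a15": 3},
}
SOC = {"a15": 4, "a7": 4, "acc_scrambler": 2, "acc_fft": 4}
CHAIN = list(PROFILE)  # ASSUMED linear task order, for illustration only


def pick_met(task, ready, free_at):
    """MET: fastest PE type only; earliest-free instance of that type."""
    best_type = min(PROFILE[task], key=PROFILE[task].get)
    return min((pe for pe in free_at if pe[0] == best_type), key=free_at.get)


def pick_etf(task, ready, free_at):
    """Simplified ETF: PE with the earliest finish time (no communication cost)."""
    return min((pe for pe in free_at if pe[0] in PROFILE[task]),
               key=lambda pe: max(ready, free_at[pe]) + PROFILE[task][pe[0]])


def average_job_time(policy, rate_jobs_per_ms, n_jobs=200):
    """Average job execution time (us) for periodically injected jobs."""
    period_us = 1000.0 / rate_jobs_per_ms
    free_at = {(kind, i): 0.0 for kind, n in SOC.items() for i in range(n)}
    # Each event is (time the task becomes ready, job id, index into CHAIN).
    events = [(job * period_us, job, 0) for job in range(n_jobs)]
    heapq.heapify(events)
    total = 0.0
    while events:
        ready, job, idx = heapq.heappop(events)
        task = CHAIN[idx]
        pe = policy(task, ready, free_at)
        finish = max(ready, free_at[pe]) + PROFILE[task][pe[0]]
        free_at[pe] = finish  # non-preemptive: the PE is busy until finish
        if idx + 1 < len(CHAIN):
            heapq.heappush(events, (finish, job, idx + 1))
        else:
            total += finish - job * period_us
    return total / n_jobs


if __name__ == "__main__":
    for rate in (1, 100, 200, 300, 400):
        print(rate, average_job_time(pick_met, rate),
              average_job_time(pick_etf, rate))
```

At low rates jobs do not overlap, so both policies reduce to running every task on its fastest PE. As the rate grows, MET keeps queueing tasks on the fastest PE type, while the ETF variant may fall back to slower but idle PEs.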
To validate the proposed framework, we also implemented a
subset of the scheduling algorithms on the Xilinx Zynq FPGA and
then compared the simulation results for the applications in the benchmark
suite with hardware measurements. In summary, the experiment
presented in Figure 3 demonstrates one of the many capabilities
of the simulation environment. It allows the end user to evaluate
workload scenarios exhaustively by sweeping the configuration
space to determine the most suitable scheduling algorithm for a
given SoC architecture.
REFERENCES
[1] SimGrid 3.21 Documentation. http://simgrid.gforge.inria.fr/simgrid/3.21/doc/
intro_concepts.html#simgrid-limits Accessed 2 Apr. 2019.
[2] C. Augonnet et al. StarPU: A Unified Platform for Task Scheduling on Heterogeneous
Multicore Architectures. Concurrency and Computation: Practice and
Experience, 23(2):187–198, 2011.
[3] G. Bhat et al. Algorithmic Optimization of Thermal and Power Management for
Heterogeneous Mobile Platforms. IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
26(3):544–557, 2018.
[4] J. Blythe et al. Task Scheduling Strategies for Workflow-based Applications in
Grids. In Proc. of the IEEE Int. Symp. on Cluster Computing and the Grid, volume 2,
pages 759–767, 2005.
[5] T. D. Braun et al. A Comparison of Eleven Static Heuristics for Mapping a Class of
Independent Tasks onto Heterogeneous Distributed Computing Systems. Journal
of Parallel and Distributed Computing, 61(6):810–837, Jun 2001.
[6] H. Casanova et al. SimGrid: A Sustained Effort for the Versatile Simulation of
Large Scale Distributed Systems. arXiv preprint arXiv:1309.1630, 2013.
[7] E. L. de Souza Carvalho et al. Dynamic Task Mapping for MPSoCs. IEEE Design &
Test of Computers, 27(5):26–35, 2010.
[8] L. T. Smit et al. Run-time Mapping of Applications to a Heterogeneous SoC. In
Int. Symp. on System-on-Chip, pages 78–81. IEEE, 2005.
[9] Y. Xiao et al. Self-Optimizing and Self-Programming Computing Systems: A
Combined Compiler, Complex Networks, and Machine Learning Approach. IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., pages 1–12, 2019.
