INTRODUCTION AND BACKGROUND
Homogeneous general purpose processors provide flexibility to implement a variety of applications and facilitate programmability. In contrast, heterogeneous system-on-chips (SoCs) that combine general purpose and specialized processors offer great potential to achieve higher efficiency while maintaining programming flexibility. In particular, domain-specific SoCs (DSSoCs), a class of heterogeneous architectures, tailor the architecture and processing elements (PEs) to a specific domain. Hence, they can provide superior energy efficiency compared to general purpose processors by exploiting the characteristics of target applications.
DSSoCs can fulfill their potential only if they integrate the set of accelerators required by the target domain and utilize these resources effectively. Therefore, the first step in the design of DSSoCs is analyzing the domain applications to identify the most commonly used kernels. This analysis is necessary to determine the set of hardware accelerators in the architecture. For example, a DSSoC that targets the wireless communications domain will most likely have fast Fourier transform (FFT) accelerators. The next step is to design the DSSoC, including the PEs and the network-on-chip that interconnects them. Then, a wide range of design-time and run-time algorithms are employed to schedule the domain applications to the PEs in the DSSoC [7, 8]. Similarly, dynamic voltage and frequency scaling (DVFS) policies, such as ondemand, and thermal management techniques have been applied to efficiently manage the power and temperature of SoCs [3]. However, existing approaches are typically evaluated in isolated environments and with different in-house tools. Thus, there is a strong need for a unified simulation environment to enable design space exploration and dynamic resource management of domain applications.
The StarPU framework provides the ability to perform run-time scheduling and execution management for directed acyclic graphs (DAGs) [2]. It is integrated into the SimGrid [6] framework. However, SimGrid was developed in the context of providing fast simulation for distributed systems, and the authors acknowledge that it is not intended to scale down to simulating real-time multi-threaded systems or comparing kernel schedulers and policies [1]. A recent work [9] targets domain-specific programmability of heterogeneous architectures through intelligent compile-time and run-time mapping of tasks across different PEs. In this study, the authors employ three different simulators. The proposed framework would benefit this and other similar studies by providing a single integrated simulation framework.
This material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7860. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.
In this work, we present an integrated open-source simulation framework capable of evaluating both scheduling and dynamic thermal-power management algorithms. It addresses rapid system-level power, performance, and temperature exploration of DSSoCs. Besides facilitating the design of new scheduling and dynamic thermal-power management (DTPM) algorithms, the proposed framework also features built-in DVFS governors deployed on commercial SoCs and analytical power, performance, and temperature models. Finally, the framework includes five reference applications from the wireless communication and radar processing domains. These applications are profiled on commercial heterogeneous SoC platforms and provided as a benchmark suite integrated into our framework.
OVERVIEW
The goal of the proposed simulation framework is to enable rapid development of scheduling and power management algorithms while enabling extensive DSSoC design space exploration.
The organization of the framework, designed to accomplish these objectives, is shown in Figure 1. The resource database contains the list of PEs along with the expected latency of tasks in the application(s). The simulation is driven by the job generator, which injects instances of an application into the simulator following a given probability distribution. As an example, Figure 2 presents a block diagram for a WiFi transmitter (WiFi-TX) job that is composed of multiple tasks. The dependencies among the tasks are represented using a DAG. Hence, the job generator produces the tasks for a transmitter job along with their dependencies.
Figure 2: Block diagram of WiFi transmitter application
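The job generator and the dependency check described above can be sketched as follows. This is a minimal illustration, not the framework's code: the task names in `WIFI_TX_DAG`, the function names, and the choice of exponential inter-arrival times are all assumptions made for the example (the text only says that jobs follow "a given probability distribution").

```python
import random

# Hypothetical WiFi-TX-like task graph: task -> list of predecessor tasks.
# The task names are illustrative and are not taken from the framework.
WIFI_TX_DAG = {
    "scrambler": [],
    "encoder": ["scrambler"],
    "interleaver": ["encoder"],
    "modulation": ["interleaver"],
    "ifft": ["modulation"],
}


def generate_jobs(dag, rate_per_ms, horizon_ms, seed=0):
    """Return a list of (arrival_time_ms, job_id, dag) tuples.

    Assumption: inter-arrival times are exponentially distributed with
    mean 1 / rate_per_ms. Any other distribution could be swapped in.
    """
    rng = random.Random(seed)
    now_ms, job_id, jobs = 0.0, 0, []
    while True:
        now_ms += rng.expovariate(rate_per_ms)
        if now_ms > horizon_ms:
            return jobs
        jobs.append((now_ms, job_id, dag))
        job_id += 1


def ready_tasks(dag, finished):
    """Tasks that have not run yet and whose predecessors have all finished."""
    return [
        task
        for task, predecessors in dag.items()
        if task not in finished and all(p in finished for p in predecessors)
    ]


jobs = generate_jobs(WIFI_TX_DAG, rate_per_ms=2.0, horizon_ms=10.0)
```

Calling `ready_tasks(WIFI_TX_DAG, set())` returns only the entry task, and the list grows as predecessors are marked finished; this is the list a scheduler would receive at each decision epoch.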
The simulation framework invokes the scheduler at every scheduling decision epoch with the list of tasks ready for execution. Then, the simulation kernel simulates task execution on the corresponding PE using execution time profiles obtained from our reference hardware implementations. Similarly, the framework employs analytical latency models to estimate interconnect delays on the SoC. After each scheduling decision, the simulation kernel updates the state of the simulation, which is used in subsequent decision epochs.
There are three built-in scheduling algorithms: 1) Minimum execution time (MET) scheduler [5], 2) Earliest task first (ETF) scheduler [4], and 3) A table-based scheduler which can store any offline schedule, such as an assignment generated by an integer linear programming (ILP) solver, in the form of a look-up table.
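To make the difference between the first two policies concrete, the sketch below gives simplified versions of both. It is an illustration under stated assumptions rather than the framework's implementation: the function names, the `exec_time` layout (task -> PE -> latency), the `pe_free_at` map, and the optional `comm_cost` table are hypothetical.

```python
def met_schedule(ready, exec_time):
    """MET sketch: map each ready task to the PE with the smallest
    execution time, ignoring whether that PE is currently busy."""
    return {task: min(exec_time[task], key=exec_time[task].get) for task in ready}


def etf_schedule(ready, exec_time, pe_free_at, now=0.0, comm_cost=None):
    """ETF sketch: repeatedly pick the (task, PE) pair that finishes earliest,
    taking PE availability and an optional communication cost into account."""
    comm_cost = comm_cost or {}
    free_at = dict(pe_free_at)  # local copy so the caller's state is untouched
    remaining = list(ready)
    mapping = {}
    while remaining:
        best = None
        for task in remaining:
            for pe, duration in exec_time[task].items():
                start = max(now, free_at.get(pe, 0.0)) + comm_cost.get((task, pe), 0.0)
                finish = start + duration
                if best is None or finish < best[0]:
                    best = (finish, task, pe)
        finish, task, pe = best
        mapping[task] = pe
        free_at[pe] = finish  # the chosen PE is busy until this task finishes
        remaining.remove(task)
    return mapping


# Hypothetical latencies (ms): three identical tasks, one accelerator, one CPU.
example_exec_time = {
    "fft_0": {"accelerator": 2.0, "cpu": 5.0},
    "fft_1": {"accelerator": 2.0, "cpu": 5.0},
    "fft_2": {"accelerator": 2.0, "cpu": 5.0},
}
met_result = met_schedule(list(example_exec_time), example_exec_time)
etf_result = etf_schedule(
    list(example_exec_time), example_exec_time, {"accelerator": 0.0, "cpu": 0.0}
)
```

In this example MET sends all three tasks to the accelerator, whereas ETF moves the third task to the CPU because waiting for the accelerator would finish later.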
In addition, the framework provides a plug-and-play interface to choose between different scheduling algorithms. Hence, developers can implement their own algorithms and integrate them easily.
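One way such a plug-and-play interface could look is a small registry keyed by scheduler name. The sketch below is an assumption about the shape of the interface, not the framework's actual API: `SCHEDULERS`, `register_scheduler`, the `(ready, state)` signature, and the `state` keys are all hypothetical. It also shows a table-based scheduler that replays an offline assignment from a look-up table.

```python
SCHEDULERS = {}  # hypothetical registry: scheduler name -> callable


def register_scheduler(name):
    """Decorator that makes a scheduling function selectable by name."""
    def decorator(fn):
        SCHEDULERS[name] = fn
        return fn
    return decorator


@register_scheduler("table")
def table_scheduler(ready, state):
    """Replay an offline schedule (for example, one produced by an ILP
    solver) stored as a task -> PE look-up table."""
    return {task: state["lookup_table"][task] for task in ready}


@register_scheduler("first_pe")
def first_pe_scheduler(ready, state):
    """Trivial user-defined policy: send every ready task to the first PE."""
    return {task: state["pes"][0] for task in ready}


def schedule(name, ready, state):
    """Dispatch to whichever scheduler the user selected by name."""
    return SCHEDULERS[name](ready, state)


example_state = {
    "pes": ["cpu_0", "fft_acc_0"],
    "lookup_table": {"scrambler": "cpu_0", "ifft": "fft_acc_0"},
}
```

With this layout, adding a new algorithm amounts to writing one function and decorating it; the simulation loop only ever calls `schedule(name, ready, state)`.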
In parallel, power and energy estimates of each schedule are calculated by using power models [3] . Using these models, the proposed framework aids the design space exploration of DTPM techniques. Similarly, the memory access and on-chip interconnect latency are modeled by the proposed framework. Finally, the framework generates plots and reports of schedule, performance, throughput, and energy consumption to aid users in analyzing the behaviour of various algorithms.
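The text does not spell out the power models taken from [3]. As a rough illustration of how per-task energy could be estimated, the sketch below uses the standard first-order CMOS relation, dynamic power = C_eff · V² · f, plus a constant static term. The function names and all numeric values are assumptions chosen for the example, not parameters from the framework.

```python
def dynamic_power_w(c_eff_farads, voltage_v, freq_hz):
    """First-order dynamic power: effective capacitance * V^2 * f."""
    return c_eff_farads * voltage_v ** 2 * freq_hz


def task_energy_j(c_eff_farads, voltage_v, freq_hz, static_power_w, exec_time_s):
    """Energy of one task: (dynamic + static power) * execution time."""
    total_power_w = dynamic_power_w(c_eff_farads, voltage_v, freq_hz) + static_power_w
    return total_power_w * exec_time_s


# Hypothetical operating points for the same task. Assumption: execution
# time scales inversely with frequency, so halving f doubles the run time.
energy_high = task_energy_j(1e-9, 1.0, 1.0e9, static_power_w=0.1, exec_time_s=1.0e-3)
energy_low = task_energy_j(1e-9, 0.8, 0.5e9, static_power_w=0.1, exec_time_s=2.0e-3)
```

Comparing `energy_high` and `energy_low` shows the basic DVFS trade-off: the lower operating point spends less dynamic energy because of the reduced voltage, but pays static power for twice as long.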
SCHEDULING CASE STUDY
The proposed simulation framework enables evaluation of various scheduling algorithms for real-world applications with different DSSoC configurations. To this end, we developed reference designs for WiFi transmitter and receiver (WiFi-RX), low-power single-carrier, range detection, and pulse Doppler applications on two popular commercial heterogeneous SoC platforms, i.e., Xilinx Zynq ZCU-102 UltraScale MPSoC and Odroid-XU3. We profiled the task execution times for each application executed on these platforms. As an example, the execution times for different tasks in WiFi-TX are shown in Table 1.
Figure 3: Results from different schedulers with a workload consisting of WiFi-TX jobs
In this study, the simulations run on an SoC configuration that mimics a typical heterogeneous SoC with a total of 14 general purpose cores and hardware accelerators. We schedule and execute the WiFi-TX task flow graph using the simulation framework and plot the average job execution time trend with respect to the job injection rate, as shown in Figure 3. To understand the performance of scheduling algorithms, we analyze the average execution time at varying rates of job injection.
All schedulers perform similarly at low job injection rates (less than 5 jobs/ms). However, as the job injection rate increases, the schedule from MET results in higher execution time since MET uses a naive representation of the system state for scheduling decisions, considering only the PEs with the best execution times. On the other hand, ILP uses a static, table-based schedule which is optimal for one job instance. At low injection rates (less than 5 jobs/ms), ILP provides a comparable schedule as jobs do not interleave. However, as the injection rate increases and jobs begin to interleave, the ILP schedule is no longer optimal. The performance of ETF is superior in comparison to the others (see Figure 3) since ETF utilizes the information about the communication cost between tasks and the current status of all PEs while making the scheduling decision.
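The qualitative trend can be reproduced with a toy model. The sketch below is not the experiment behind Figure 3: it assumes single-task jobs arriving at a fixed interval, two hypothetical PEs with made-up latencies, and no communication cost. It only illustrates why a policy that ignores PE availability degrades once jobs arrive faster than the fastest PE can serve them.

```python
def average_latency_ms(policy, interval_ms, n_jobs=20, exec_time=None):
    """Average (finish - arrival) time of single-task jobs under a policy.

    policy: "MET" always picks the PE with the smallest execution time;
            "ETF" picks the PE on which the job would finish earliest.
    """
    exec_time = exec_time or {"accelerator": 1.0, "cpu": 3.0}  # hypothetical, ms
    free_at = {pe: 0.0 for pe in exec_time}
    total = 0.0
    for i in range(n_jobs):
        arrival = i * interval_ms
        if policy == "MET":
            pe = min(exec_time, key=exec_time.get)
        else:
            pe = min(exec_time, key=lambda p: max(arrival, free_at[p]) + exec_time[p])
        finish = max(arrival, free_at[pe]) + exec_time[pe]
        free_at[pe] = finish  # PE stays busy until the job completes
        total += finish - arrival
    return total / n_jobs


low_rate = {p: average_latency_ms(p, interval_ms=2.0) for p in ("MET", "ETF")}
high_rate = {p: average_latency_ms(p, interval_ms=0.5) for p in ("MET", "ETF")}
```

At the low rate both policies make the same choices and yield the same latency. At the high rate the accelerator queue grows without bound under MET, while the ETF-style policy offloads some jobs to the slower CPU and keeps the average lower.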
To validate the proposed framework, we also implemented a subset of the scheduling algorithms on the Xilinx Zynq FPGA and then compared the results for the applications in the benchmark suite with hardware measurements. In summary, the experiment presented in Figure 3 demonstrates one of the many capabilities of the simulation environment. It allows the end user to evaluate workload scenarios exhaustively by sweeping the configuration space to determine the most suitable scheduling algorithm for a given SoC architecture.
