The emergence of applications that must efficiently handle growing amounts of data has stimulated the development of new computing architectures with several Processing Units (PUs), such as CPU cores, graphics processing units (GPUs) and Intel Xeon Phi (MIC). Aiming to better exploit these architectures, recent works focus on proposing novel runtime environments that offer a variety of methods for scheduling tasks dynamically on different PUs. A main limitation of such proposals is the constrained system configurations usually adopted to tune and test them, since setting up more complete and diversified evaluation environments is costly. In this context, we present D-STHARk, a GUI tool for evaluating Dynamic Scheduling of Tasks in Hybrid Simulated ARchitectures. D-STHARk provides a complete simulated execution environment that allows evaluating dynamic scheduling strategies on simulated applications and hybrid architectures. We evaluate our tool by simulating the dynamic scheduling strategies presented in [3], using the same architecture and application. D-STHARk reached the same conclusions originally reported by the authors. Moreover, we performed an experiment varying the number of coprocessors, which had not previously been verified due to the lack of real architectures, showing that we may reduce the energy consumption while keeping the same performance.
Introduction
The current era is characterized by massive data generation and the development of new technologies. The continuous data growth, associated with more sophisticated processing in different areas of knowledge, has been pushing for significant advances in computing architectures, reflected in more efficient storage systems as well as in the use of different types of processing units (PUs) in the so-called hybrid systems. Examples of these new architectures are computers with multiple processors (multicore architecture) and different coprocessors. Two of the most popular coprocessors are the graphics processing units (GPUs) [12] and Intel Xeon Phi (MIC) [13]. These coprocessors have emerged as alternative architectures as the golden standard of frequency scaling broke down in the first decade of the century. They consist of massively parallel architectures privileging ALU operations over I/O and control flow operations. The end result is that coprocessors may yield very high compute density if certain criteria are matched to the application and its data.
Therefore, it is becoming essential for applications from different domains to be able to exploit all available PUs in a coordinated and efficient way, while taking full advantage of their processing capabilities. Several runtime environments have been proposed in the literature aiming to make the use of different PUs transparent to developers [7, 6]. Among the main methods provided by these environments, we highlight task schedulers, which are responsible for adequately distributing the various tasks that compose an application to the available PUs. Task schedulers can be: (1) static [2], in which the characteristics of the tasks are evaluated along with the capabilities of different PUs in a preprocessing stage, using global information about the application; and (2) dynamic [4], in which this evaluation is done at runtime, considering a limited view of the execution of the entire application and thus a more challenging scenario.
Most of the dynamic schedulers proposed in the literature are evaluated on restricted real system configurations, composed of a reduced number of PUs (e.g., some CPU cores, one or two GPUs and/or one or two MICs), due to the high costs associated with creating more complete evaluation environments. For example, the price of an Intel® Xeon Phi™ 7120P Henhexaconta-Core Socket PCI Express x16 Coprocessor, 1.24 GHz, is almost $5,000. As a consequence, the conclusions drawn about the scalability of these proposals might be limited. Moreover, the performance of these strategies in hybrid architectures composed of many PUs is unknown. In this context, a simulation environment in which it is possible to configure different hybrid architectures, composed of an unlimited number of different PUs, in order to evaluate dynamic scheduling strategies becomes an important contribution and is, therefore, the focus of this work.
More specifically, in this paper, we present D-STHARk, a GUI tool for evaluating Dynamic Scheduling of Tasks in Hybrid Simulated ARchitectures. The goal of this tool is to provide a complete simulated execution environment that allows evaluating dynamic scheduling strategies on simulated applications. Furthermore, D-STHARk allows users to simulate hybrid architectures, varying the types and number of PUs (CPUs, GPUs and MICs), and to create tasks, from different applications, with different characteristics (e.g., relative performance, task dependencies and volume of data to be manipulated). Our tool provides the different dynamic scheduling strategies presented in [4, 3]. Moreover, it is possible to insert new strategies through an API.
We describe D-STHARk through its three main parts. In Section 3.1, we detail the simulated execution environment and its components. In Section 3.2, we explain how the previously described components interact with each other in a simulation process. Finally, in Section 3.3, we show the main elements of the API that allows users to implement, insert and evaluate new dynamic scheduling strategies. We validate our tool by simulating the execution of a pathology image analysis application, used to investigate brain cancer morphology [3]. In this analysis, we adopted different dynamic scheduling strategies, presented in [4], varying the architecture setup. We show that D-STHARk was able to reproduce the results found in the original paper. Moreover, we present some new analyses of the scheduling strategies using different architecture configurations, evincing the usefulness of D-STHARk in providing broader analyses.
Related Work
The efficient use of hybrid systems equipped with CPUs and accelerators is a challenging problem, requiring the implementation of application codes optimized for multiple processors and scheduling of work among heterogeneous devices. Recently, a number of compiler techniques [16] , domain-specific libraries [9] , and, mainly, runtime systems [15, 17, 5, 10, 16] have been proposed to reduce the programming effort involved in porting applications to these systems.
Execution on distributed CPU-GPU-MIC platforms has been the target of several projects [5, 10, 16, 14]. Ravi et al. [16] developed a compiler-based translation of generalized reductions to CPU-GPU systems. In [10], the authors proposed OmpSs, a parallel programming model for dataflow applications that allows parallelizing codes via a compiler from user-annotated code. Augonnet et al. [5] developed StarPU, a runtime environment that expresses computations as a Directed Acyclic Graph (DAG). Similarly, XKaapi is another runtime environment that supports cooperative execution on CPU-GPU-MIC machines, using a multi-versioning scheme in which operations may have multiple implementations targeting different computing devices [14].
Recent efforts on runtime environments have given particular attention to exploiting the so-called dynamic schedulers, which distribute the tasks of a given application among the available PUs at runtime. Basically, these schedulers perform a runtime evaluation of the characteristics of each task and PU. Then, based on that evaluation, they determine the most appropriate PU to execute each task. The challenge, in this case, is how to maximize parallelization opportunities based on only a local and limited knowledge of the task set that composes an application. Schedulers should prevent events that compromise the proper distribution of tasks, such as overloading a PU, choosing PUs not suitable to perform a given task, and even excessive data transfer among non-shared memories of distinct PUs. There are several proposals of dynamic schedulers in the literature [8, 18, 4]. A main limitation of such proposals, however, is the constrained system configurations usually adopted to tune and test the proposed environments. Most of them are evaluated considering only a particular architecture, since setting up more complete and diversified evaluation environments is costly.
In this paper, we propose D-STHARk, a GUI tool for evaluating dynamic scheduling strategies by simulating different hybrid architectures. D-STHARk allows users to simulate the execution of applications, contrasting the effects of different dynamic scheduling strategies on distinct architecture configurations. To the best of our knowledge, our approach is the first simulation environment proposed in the literature that focuses on the evaluation of dynamic scheduling of tasks.
D-STHARk
In this section, we present D-STHARk, the GUI tool proposed for evaluating Dynamic Scheduling of Tasks in Hybrid Simulated ARchitectures. We divide the description of our tool into three parts. First, we present its main components. Then, we describe how these components work and interact with each other to execute a simulation. Finally, we present the dynamic scheduler API that allows users to define distinct scheduling strategies.
D-STHARk Components
Basically, D-STHARk is composed of four main components: (1) environment configuration; (2) task creation; (3) scheduling strategy definition; (4) experimental execution and evaluation. Using these components, users can adjust the simulation according to their goal, evaluating each particular scenario close to its real conditions.
1. Environment Configuration. D-STHARk allows users to create a hybrid architecture using three types of Processing Units (PUs): CPU, GPU and MIC. Each of these PUs may be instantiated many times. For example, we may create an environment with 6 CPUs, 2 GPUs and 2 MICs. Moreover, for each coprocessor (i.e., GPU and MIC) it is possible to define the bus bandwidth, which corresponds to the amount of data that can be transferred between the coprocessor and the CPU in a given time unit. This configuration makes the simulations even closer to real heterogeneous scenarios. Further, users may store each configuration in a file.
2. Task Creation. D-STHARk represents each simulated application as a set of distinct tasks and the dependencies among them. In turn, each task is defined through three main characteristics: (1) task type, (2) error rate and (3) workload size. The speedups achieved by the same PU vary widely as different operations are considered. Moreover, the relative performance among PUs varies according to the operation executed. Consequently, different PUs are more efficient for particular types of operations. Therefore, (1) Task Type represents the behavior that a group of similar tasks usually exhibits w.r.t. execution time on each type of PU. All tasks belonging to the same type will have a specific execution time on each PU, informed by the user. However, in order to allow a simulation closer to a real scenario, users can define the (2) Error Rate, an interval (i.e., minimum and maximum error value) that corresponds to how much the execution times of a task are expected to vary. After defining these two characteristics, the next step is to create the tasks themselves. D-STHARk allows users to insert all the tasks that compose the main application to be simulated, where each task must belong to one of the aforementioned task types. According to the task type and the error rate, a random time is defined for each task on each PU. Finally, the third characteristic required by D-STHARk is the (3) Workload Size in MB. This characteristic is relevant, since many scheduling strategies consider the data size and the number of operations to be executed on each PU in order to estimate the communication cost. For simplicity, we assume that the transfer cost varies linearly with the data size. In order to identify each task created, users can assign a unique ID to each task in D-STHARk. Using these IDs, users may define dependencies among the distinct tasks that compose an application, as in real scenarios. For instance, a dependence of a task t1 on a task t2 means that t2 must be executed before t1. Again, users may store all these configurations in an output file.
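The task model described above can be sketched in C as follows. This is only an illustration under stated assumptions: the struct layouts, the names `TaskType`, `Task`, `sample_exec_time` and `init_task`, and the reading of the error rate as a relative deviation drawn uniformly from the user-given interval are all hypothetical and are not taken from D-STHARk's source.

```c
#include <assert.h>
#include <stdlib.h>

enum { PU_CPU = 0, PU_GPU = 1, PU_MIC = 2, PU_KINDS = 3 };

/* Hypothetical task type: one user-supplied base time per kind of PU,
 * plus the error interval [err_min, err_max] (assumed to be relative). */
typedef struct {
    double base_time[PU_KINDS]; /* seconds on CPU, GPU and MIC */
    double err_min;             /* e.g. -0.05 for -5% */
    double err_max;             /* e.g. +0.05 for +5% */
} TaskType;

/* Hypothetical task instance. */
typedef struct {
    int id;                      /* unique ID chosen by the user */
    const TaskType *type;        /* task type this task belongs to */
    double workload_mb;          /* workload size in MB */
    double exec_time[PU_KINDS];  /* sampled time per kind of PU */
} Task;

/* Assumed rule: time = base * (1 + e), with e uniform in [err_min, err_max]. */
double sample_exec_time(const TaskType *tt, int pu_kind) {
    assert(pu_kind >= 0 && pu_kind < PU_KINDS);
    assert(tt->err_min <= tt->err_max);
    double u = (double)rand() / (double)RAND_MAX; /* u in [0, 1] */
    double e = tt->err_min + u * (tt->err_max - tt->err_min);
    return tt->base_time[pu_kind] * (1.0 + e);
}

/* Fill in a task and draw its execution time for every kind of PU. */
void init_task(Task *t, int id, const TaskType *tt, double workload_mb) {
    t->id = id;
    t->type = tt;
    t->workload_mb = workload_mb;
    for (int k = 0; k < PU_KINDS; k++) {
        t->exec_time[k] = sample_exec_time(tt, k);
    }
}
```

With an error interval of [0, 0], every task of a type receives exactly the base times given by the user, so the simulation becomes deterministic.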
3. Scheduling Strategy Definition. In this step, users select the dynamic scheduling strategies to be evaluated. Currently, D-STHARk has four strategies implemented: (1) FCFS (First Come First Served), a common strategy based on a global queue; (2) HEFT (Heterogeneous Earliest Finish Time); (3) HEFT-DA (Heterogeneous Earliest Finish Time Data-Aware); and (4) SEQ (Sequential), a straightforward serialization of all tasks to be executed. While the first three strategies were extensively evaluated in [3], the last one serves as a baseline to contrast the performance of different scheduling strategies. Moreover, D-STHARk allows users to load and evaluate their own schedulers, as further explained in Section 3.3. Finally, we highlight that users can select several schedulers for a single simulation. In this case, D-STHARk executes each selected scheduler individually, allowing users to compare the performance achieved by each one.
4. Experimental Execution and Evaluation. In this step, users can review and confirm all setup configurations. After confirmation, D-STHARk starts performing the simulation. The proposed tool also allows defining how many times each simulation will be executed. The results are reported as the average of these executions, and the standard deviation is also presented. During the simulation, D-STHARk exhibits a log visualization, detailing the simulated application and showing which task is being executed by each PU at each moment. This log visualization can be used to debug the schedulers, verifying whether they are working as expected. At the end of the simulation, D-STHARk presents detailed results for each simulated scheduler, which may be exported to an output file. More specifically, it shows (1) the speedup achieved; (2) a histogram with the task distribution among PUs, as assigned by each scheduler; and (3) charts showing the percentage of processing performed by each PU.
Execution Process of Simulations
Once all required configurations have been defined or loaded, D-STHARk is ready to execute the simulation. For this purpose, D-STHARk creates a distinct thread, named Worker Thread, to simulate each PU. Moreover, a main thread is instantiated, which is responsible for reading the task configurations (GetTask routine) and submitting the tasks for execution (SubmitTask routine). Finally, the main thread creates a queue for each Worker Thread, which is managed according to the dynamic scheduling strategy defined in the configuration process.
Meanwhile, the task management is executed concurrently with the task execution, as in real dynamic scheduling environments. After receiving a task to run, each Worker Thread verifies the task dependencies. If the task has dependencies, it is inserted in the Stuck Task List, which stores all tasks that do not have all dependencies solved. Otherwise, it is sent to the method PushTask, which inserts the task into one of the task queues, according to the scheduler policy definition.
When a specific Worker Thread becomes idle, it fetches a new task to be executed using the method PopTask, defined according to each dynamic scheduling strategy. As aforementioned, each task has a specific execution time defined for each PU. Thus, the Worker Thread calls a sleep method for the execution time corresponding to the PU that it is simulating. In addition, another sleep call is made when the thread is simulating a coprocessor. This second sleep call simulates the time needed to transfer the data from the CPU memory to the coprocessor memory. As the user has previously indicated the bus bandwidth, as well as the workload size, the Worker Thread can calculate how long to sleep in order to simulate this communication cost. For instance, the PCI-EXPRESS 16x bus transfers 4 GB per second. Hence, for each megabyte of task data, we perform a sleep of 0.000244141 s (i.e., 1/4096 s), getting closer to real execution conditions. To finalize the execution process, the Worker Thread verifies the Stuck Task List and removes all dependencies associated with the task just executed. Then, the main thread starts to fetch from the Stuck Task List the tasks that, at this point, have all their dependencies solved. In this way, we guarantee that tasks will be executed in the correct order. Figure 1 illustrates the process described above.
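The two sleep calls described above can be sketched as follows. This is a minimal illustration, not D-STHARk's code: the function names are hypothetical, the bandwidth is assumed to be given in MB/s (4 GB/s taken as 4096 MB/s, which reproduces the 0.000244141 s per MB quoted in the text), and POSIX `nanosleep` is assumed to be available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Linear transfer model: seconds needed to move workload_mb megabytes
 * over a bus that moves bandwidth_mb_per_s megabytes per second. */
double transfer_seconds(double workload_mb, double bandwidth_mb_per_s) {
    assert(bandwidth_mb_per_s > 0.0);
    return workload_mb / bandwidth_mb_per_s;
}

/* Total simulated time for one task: execution time on the simulated PU,
 * plus the CPU-to-coprocessor transfer when the PU is a GPU or a MIC. */
double simulated_seconds(double exec_s, int is_coprocessor,
                         double workload_mb, double bandwidth_mb_per_s) {
    double total = exec_s;
    if (is_coprocessor) {
        total += transfer_seconds(workload_mb, bandwidth_mb_per_s);
    }
    return total;
}

/* Block the calling worker thread for the given number of seconds. */
void sleep_seconds(double seconds) {
    assert(seconds >= 0.0);
    struct timespec ts;
    ts.tv_sec = (time_t)seconds;
    ts.tv_nsec = (long)((seconds - (double)ts.tv_sec) * 1e9);
    nanosleep(&ts, NULL);
}
```

A worker simulating a GPU would then call `sleep_seconds(simulated_seconds(exec_s, 1, workload_mb, 4096.0))`, while a worker simulating a CPU core would pass `0` as the second argument and sleep only for the execution time.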
Dynamic Scheduler API
As aforementioned, D-STHARk allows users to include their own scheduling strategies. For this purpose, our tool offers an API specifying a guideline to be followed in order to make these new strategies compatible with the system. Basically, all new schedulers must be implemented in the C language. These implementations may use different routines and file names, which must be loaded into D-STHARk through the available GUI. However, the source code must include a file named scheduler.c, with four mandatory routines:
1. InitializeScheduler: it is responsible for initializing the scheduler's components. This routine also creates the task queues, associating them with the defined PUs.
2. PushTask: it defines which strategy will be used to distribute tasks among the PU queues.
3. PopTask: this routine is called by each Worker Thread whenever it is idle, and defines the next task to be executed by the thread.
4. DestroyScheduler: it is responsible for finalizing the scheduler. In this routine, the final procedures must deallocate the memory used by the scheduler.
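As an illustration of the four mandatory routines, the sketch below implements an FCFS-like policy with a single global queue. Only the routine names come from the text; the parameter lists, return types and the `Task` type are assumptions made for this example, since the actual signatures expected by D-STHARk are not given here. Locking is also omitted for brevity, so a real scheduler called from several Worker Threads would need to protect the queue (for example with a mutex).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical task handle; the real D-STHARk task type is not shown. */
typedef struct { int id; } Task;

/* Queue node owned by the scheduler. */
typedef struct Node {
    Task *task;
    struct Node *next;
} Node;

static Node *head = NULL;  /* oldest task, popped first */
static Node *tail = NULL;  /* newest task */
static int num_pus = 0;    /* number of simulated PUs */

/* Set up scheduler state (assumed signature). */
void InitializeScheduler(int pu_count) {
    assert(pu_count > 0);
    num_pus = pu_count;
    head = tail = NULL;
}

/* Append a ready task to the global queue; returns 0 on success. */
int PushTask(Task *t) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) return -1;
    n->task = t;
    n->next = NULL;
    if (tail != NULL) tail->next = n; else head = n;
    tail = n;
    return 0;
}

/* Called by an idle worker: returns the oldest task, or NULL if none. */
Task *PopTask(int pu_id) {
    assert(pu_id >= 0 && pu_id < num_pus);
    if (head == NULL) return NULL;
    Node *n = head;
    Task *t = n->task;
    head = n->next;
    if (head == NULL) tail = NULL;
    free(n);
    return t;
}

/* Release every node still held by the scheduler. */
void DestroyScheduler(void) {
    while (head != NULL) {
        Node *n = head;
        head = n->next;
        free(n);
    }
    tail = NULL;
    num_pus = 0;
}
```

In this sketch, `PopTask` ignores which PU is asking, which is exactly what makes the policy FCFS; a HEFT-like scheduler would instead keep one queue per PU and choose the target queue inside `PushTask`.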
Experimental Evaluation
In this section, we present the experiments performed to evaluate D-STHARk. These experiments consist of simulating the same scenario originally presented in [3], in which the authors evaluated different dynamic scheduling strategies for a real application [11]. Our evaluations contrast the simulated results against the original ones, considering both execution times and the distribution of tasks among PUs. Moreover, we extended these analyses by simulating the use of more coprocessors than originally reported, which was not possible before due to the lack of real architectures.
Simulation Setup

Simulated Application
In our experiments, we simulated an application related to studies of brain cancer [11], which aims to find better tumor classifications using high-resolution Whole Tissue Slide Images (WSIs). This application partitions each WSI into multiple image tiles that can be analyzed independently. This image analysis has many phases, but the most time-consuming ones are segmentation and feature computation. Hence, these two phases have been the focus of execution optimizations in large-scale hybrid machines [3]. Each image tile is submitted to different operations, forming a dataflow graph, as shown in Figure 2.
Figure 2: Application Dataflow
In D-STHARk, we instantiated a distinct task type for each operation, represented as colored rectangles in the dataflow. The relative performance (i.e., specific times) of each PU on each task type was defined based on the values originally reported in [19]. In order to match the original experiments, we set the error rate to zero. Observe that processing several image tiles is an embarrassingly parallel problem, since each image tile determines a different dataflow. Thus, we instantiated an individual task for each different operation applied to each different image tile. Moreover, we inserted the dependencies among tasks, as exhibited in the dataflow, to simulate the correct execution order for each pair. To illustrate the above process, we consider only the operations Morph.Open and Recon.Nuclei applied to two image tiles. First, we create two task types, Type1 (i.e., Morph.Open) and Type2 (i.e., Recon.Nuclei). Then, D-STHARk automatically determines the execution times on the PUs for each task type, based on the configurations previously defined by the user. Later, we instantiate the tasks related to each image tile: (a) t1 of Type1 for the first tile; (b) t2 of Type2 for the first tile; (c) t3 of Type1 for the second tile; and (d) t4 of Type2 for the second tile. Finally, we add the dependencies from t2 to t1 and from t4 to t3, ensuring the correct execution order.
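The four-task example above can be written down as a small dependency structure. The sketch below is illustrative only: the `Task` layout and the helper names `add_dependency`, `is_ready` and `build_example` are hypothetical and do not reflect D-STHARk's internal representation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 8

/* Hypothetical task record with an explicit list of prerequisites. */
typedef struct Task {
    int id;                      /* 1..4 in the example */
    int type;                    /* 1 = Morph.Open, 2 = Recon.Nuclei */
    int done;                    /* set to 1 once the task has executed */
    size_t n_deps;               /* number of prerequisites */
    struct Task *deps[MAX_DEPS]; /* tasks that must run first */
} Task;

/* Record that `t` depends on `must_run_first`. */
void add_dependency(Task *t, Task *must_run_first) {
    assert(t->n_deps < MAX_DEPS);
    t->deps[t->n_deps++] = must_run_first;
}

/* A task is ready when all of its prerequisites have executed;
 * tasks that are not ready would wait in the Stuck Task List. */
int is_ready(const Task *t) {
    for (size_t i = 0; i < t->n_deps; i++) {
        if (!t->deps[i]->done) return 0;
    }
    return 1;
}

/* Build t1..t4: two tiles, each with Morph.Open followed by Recon.Nuclei. */
void build_example(Task t[4]) {
    for (int i = 0; i < 4; i++) {
        t[i] = (Task){ .id = i + 1, .type = (i % 2 == 0) ? 1 : 2 };
    }
    add_dependency(&t[1], &t[0]); /* t2 depends on t1 (first tile) */
    add_dependency(&t[3], &t[2]); /* t4 depends on t3 (second tile) */
}
```

Right after `build_example`, only t1 and t3 are ready, so the two tiles can start in parallel; t2 becomes ready once t1 is marked as done, independently of the second tile.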
We ran simulations with 800 image tiles, which generate 10,400 fine-grained tasks for execution. Moreover, according to [3], each tile has a resolution of 4Kx4K pixels. Considering a 256-color representation, each image tile has a size of 15.25MB. Consequently, we defined 15.25MB as the data workload that each instantiated task needs to handle.
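The figures above can be checked with a short calculation. The assumptions here are ours, not statements from [3]: 4K is read as 4000 pixels, a 256-color representation as 1 byte per pixel, 1 MB as 2^20 bytes, and the 4 GB/s bus as 4096 MB/s.

```latex
\begin{align*}
\text{tasks per tile} &= \frac{10{,}400}{800} = 13, \\
\text{tile size} &= \frac{4000 \times 4000 \times 1\ \text{B}}{2^{20}\ \text{B/MB}} \approx 15.26\ \text{MB}, \\
\text{transfer time per task} &= \frac{15.25\ \text{MB}}{4096\ \text{MB/s}} \approx 0.00372\ \text{s}.
\end{align*}
```

Under these assumptions the tile size agrees with the 15.25MB workload up to rounding, and moving the data of one task to a coprocessor costs a few milliseconds of simulated time.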
Simulated Schedulers
As the application characteristics discussed above induce quite similar behaviors in two strategies (i.e., HEFT-DA and HEFT), we restricted our analyses to two strategies evaluated in [3]:
• FCFS: it is a simple scheduling strategy based on a global queue. When a Worker Thread is idle, it fetches the first task on the queue head.
• HEFT: it maintains a task queue for each PU. The task distribution across the PUs is defined according to their processing capabilities. More specifically, the scheduler maintains a history of the execution times and thereby assigns a task to the specific PU that minimizes Equation 1.
P_i is the PU being evaluated; Avail(P_i) represents the amount of time that P_i takes to process all tasks assigned to it; and Est_{P_i} denotes the estimated time that P_i takes to run a specific task T.
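The selection step can be sketched in C. Since Equation 1 is not reproduced in this text, the sketch assumes the usual HEFT earliest-finish-time criterion, i.e., choosing the PU that minimizes Avail(P_i) + Est_{P_i}(T); this form and the function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the PU with the smallest avail[i] + est[i],
 * where avail[i] is the time PU i needs to drain its current queue and
 * est[i] is the estimated time of the new task on PU i.
 * Ties are resolved in favor of the lowest index. */
size_t heft_pick_pu(const double *avail, const double *est, size_t n_pus) {
    assert(n_pus > 0);
    size_t best = 0;
    double best_finish = avail[0] + est[0];
    for (size_t i = 1; i < n_pus; i++) {
        double finish = avail[i] + est[i];
        if (finish < best_finish) {
            best_finish = finish;
            best = i;
        }
    }
    return best;
}

/* After assigning the task to `pu`, its backlog grows by the estimate. */
void heft_assign(double *avail, const double *est, size_t pu) {
    avail[pu] += est[pu];
}
```

For example, with three idle PUs and estimates of 4, 1 and 2 seconds, the first two tasks go to the second PU; only when its backlog reaches 2 seconds does the third PU become the better choice, because the slower PU would then finish the task earlier.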
Simulated Architectures
Basically, our analyses consider three distinct hybrid architectures, originally evaluated in [3]: (1) CPU-GPU: 15 CPU cores and 1 GPU; (2) CPU-MIC: 15 CPU cores and 1 MIC; and (3) CPU-GPU-MIC: 14 CPU cores, 1 GPU, and 1 MIC. Thus, we evaluate the previously mentioned scheduling strategies on each of these architectures. Further, we consider other configurations of PUs in order to evaluate scenarios with more coprocessors than originally reported in [3], evincing the usefulness of D-STHARk in providing broader analyses. We set the bus bandwidth to 4 GB/s, matching the PCI-EXPRESS 16x bus used in the original experiments.
Analysis of Results
We evaluated D-STHARk considering three issues. First, we analyzed how similar the distribution of tasks defined by each scheduling strategy on the simulated architecture is to the distribution observed on a real architecture. Second, we investigated how close the execution times of our simulations are to the times measured on a real architecture. For these two issues, we consider the CPU-GPU-MIC architecture, as evaluated in [3]. We highlight that, through these two issues, we intend to evaluate whether the simulated results are close to those achieved using actual architectures, demonstrating the effectiveness of D-STHARk. Finally, the third issue concerns extending the original evaluation to other types of architectures, in order to reach broader conclusions.
Distribution of Tasks
In [3], the authors evaluated the impact of the number of distinct image tiles concurrently processed by each Worker Thread. They concluded that the overall performance may be improved using high levels of concurrency (i.e., 55 image tiles). In our analyses, the simulated schedulers worked without limitations on this number as well, presenting some of the results found by the authors. Figure 3 presents D-STHARk's results. First, we observe that, under FCFS, the distribution of tasks among PUs is nearly the same for all operations. Hence, FCFS is not able to take full advantage of the performance variability among operations. On the other hand, we may notice that HEFT has prioritized the use of CPU cores for most tasks, regardless of the performance/speedup of the other PUs. As presented in [19], some operations with high computational costs (e.g., PreWaterShad, ReconNuclei and RBC) may have their execution time decreased by using coprocessors. Indeed, HEFT was able to better assign each task to the processing unit that minimizes its execution time, exhibiting the smallest execution times. Further explanations of these decisions and their consequences are discussed in [3].
Execution Times
The execution times of the schedulers are presented in Figure 4 (a) and (b), corresponding to simulated and real results, respectively. Once more, the results achieved using D-STHARk are quite similar to those achieved using a real architecture, in which FCFS attains the worst performance among all configurations. As reported in [3], it is also important to highlight that the use of a GPU always leads to good performance. For instance, the best CPU-GPU execution time is about 1.26× faster than the best CPU-MIC execution time. This result motivates the experiments described next, which demonstrate the wider applicability of our tool.
Increasing Coprocessors
Although some outcomes reported in [3] indicate that, for the evaluated application, increasing the number of processors could improve the system performance, this was not verified due to the lack of real architectures. In this section, we provide this analysis by configuring different architectures in D-STHARk. Specifically, using only the HEFT dynamic scheduling strategy, we decrease the number of CPU cores and increase the number of coprocessors in a coordinated way. Our goal is to show that a small increase in the number of coprocessors may produce the same performance using fewer CPU cores. Table 1 presents the execution times reported by D-STHARk for different configurations, the first of which corresponds to the CPU-GPU-MIC architecture. As we can note, by adding just one MIC coprocessor it is possible to remove 2 CPU cores. Similarly, by adding one GPU it is possible to remove 4 CPU cores, both without degrading the system performance. These results are consistent with those expected in [3], showing that we may reduce the energy consumption (reducing CPU cores and increasing coprocessors) while keeping the same performance [1].
Table 1: Simulating different architectures, varying the number of coprocessors.
Conclusions and Future Work
In this paper, we presented D-STHARk, a GUI tool for evaluating Dynamic Scheduling of Tasks in Hybrid Simulated ARchitectures. With this tool, it is possible to evaluate new proposals of dynamic scheduling strategies, simulating applications with distinct characteristics (task dependencies, data to be manipulated, etc.) on different hybrid architectures (CPUs, GPUs and MICs). We evaluated our tool by simulating some of the dynamic scheduling strategies presented in [3], adopting the same application related to studies of brain cancer [11] and an architecture with 14 CPU cores, 1 MIC and 1 GPU. The results and conclusions achieved with D-STHARk were the same as originally reported, showing the effectiveness of our proposal. Moreover, we performed an experiment in which we varied the number of coprocessors (MIC and GPU), which had not previously been verified due to the lack of a real architecture, showing that we may reduce the energy consumption while keeping the same performance [1]. As future work, we will extend D-STHARk to simulate inner buses, in order to consider contention related to increasing the number of processing units. Moreover, we intend to include new scheduling strategies and allow the insertion of more complex tasks, with finer-grained details.
