Introduction
One of the basic problems in parallel computing is how to execute a parallel program on a collection of heterogeneous processors, that is, processors of different and possibly changing speeds. In this paper we simulate scheduling algorithms that are designed to run efficiently in heterogeneous parallel computing environments.
We model and simulate the execution of parallel jobs represented by directed acyclic graphs (DAGs). Each job is a multi-threaded parallel program. A thread is a chain of tasks ordered by their execution dependencies. Dependencies of the tasks/threads in a job are represented by a DAG. The nodes in a DAG correspond to tasks, and directed edges represent precedence relationships among tasks.
We study two online schedulers: the decentralized Enhanced Cilk Scheduler (ECS) and the Centrally Managed Scheduler (CM). Applications of decentralized adaptive online scheduling algorithms include web-based computing, utilizing idle processors within organizations, scientific computing, military applications, etc. Related work is found in the area of asynchronous parallel computing [2, 3, 1, 8] as well as in scheduling theory [7].
In our simulation, ECS runs efficiently even when processors have different and dynamically changing speeds. ECS is robust and fault tolerant, and because it is distributed it scales better for fast scheduling than a centralized scheduler. The quality of the schedules produced by ECS is almost as good as that of centrally managed ones.
Experimental Study
We study a scheduling problem in a networked system of heterogeneous processors. Each processor is described by its set of attributes: maximum speed, current speed, and steal/mug interval (the time between attempts by the processor to steal/mug work from other processors). Our model also allows for changes in the speeds of processors.
Our study compares two types of schedulers on a network of heterogeneous processors, that is, processors of different (fixed or changing) speeds. The Enhanced Cilk Scheduler (ECS) (see Fig. 1) is based on a non-centrally managed randomized model that employs steals and muggings. The Centrally Managed Scheduler, or Central Manager (CM), uses a simple greedy heuristic to assign subtasks to processors, but relies on tightly coupled centralized control.
Fig. 1. ENHANCED CILK SCHEDULER
1. Processor i chooses a victim processor j uniformly at random.
2. If the victim j's double-ended queue (deque) is not empty, it steals the thread T from the top of the deque.
3. If the victim j's deque is empty, but the victim is working on a thread T and it is slower than processor i, then i mugs j, that is, i interrupts j and takes the thread T.
4. If processor i has located a thread T, i works on T until one of four situations:
(a) Thread T spawns new threads. In this case, the processor puts T on the bottom of the ready deque and starts work on the last spawned new thread.
(b) The thread T returns or terminates. If the deque is not empty, the processor begins working on the bottom thread. If the deque is empty, the processor attempts to work steal.
(c) The thread reaches a synchronization point. In this case, the processor attempts to work steal. (Note that the deque is empty.)
The ECS attempts to complete tasks in the network as quickly as possible by using Work Stealing or Processor Mugging. Work stealing happens when an idle processor takes a ready task from a busy processor's queue and begins executing it. An idle processor PFast performs processor mugging when it encounters a slower processor PSlow executing a task T and the queue of PSlow is empty. In that case, PFast takes over the execution of task T, and processor PSlow becomes idle and starts looking for work. In ECS, the "victim" processor is chosen at random, uniformly among all other processors in the system. Attempts to steal/mug occur at regular intervals (specified by the steal/mug interval), whose lengths are inversely proportional to the speed of the processor attempting the steal/mug.
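A single steal/mug attempt, as described above, can be sketched roughly as follows. This is an illustrative Python sketch rather than the Simscript II.5 code used in the study; the names `Processor` and `steal_or_mug` are hypothetical, and the left end of the deque is assumed to be its top.

```python
import random
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Processor:
    # Hypothetical record for illustration; not taken from the simulator.
    name: str
    speed: float                                   # current speed, work units per ms
    ready: deque = field(default_factory=deque)    # ready deque; left end = top
    current: Optional[str] = None                  # thread being executed, if any


def steal_or_mug(thief: Processor, processors: list, rng: random.Random) -> Optional[str]:
    """One attempt by an idle processor; returns the thread obtained, or None."""
    # The victim is chosen uniformly among all other processors.
    victim = rng.choice([p for p in processors if p is not thief])
    if victim.ready:
        # Work stealing: take the thread at the top of the victim's deque.
        thief.current = victim.ready.popleft()
    elif victim.current is not None and victim.speed < thief.speed:
        # Processor mugging: take over the running thread; the victim becomes idle.
        thief.current, victim.current = victim.current, None
    return thief.current
```

For example, with one fast idle processor and one slow processor that is running thread "T" with an empty deque, the attempt ends in a mugging and the slow processor becomes idle.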
The CM scheduler applies a greedy strategy: it keeps a FIFO queue of the ready-to-process threads and assigns a thread to the currently fastest idle processor. When some processor Pi becomes idle, the CM scheduler possibly assigns a task to it: if the slowest active processor is slower than Pi, then the CM reassigns the work on that active processor to Pi, effectively "mugging" the slower processor by central authority.
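The greedy CM rule can be illustrated with a small sketch. The function `cm_dispatch` and its data layout (a dict of speeds, a dict of running threads, a FIFO deque) are assumptions made for this example. The choice to repeat the rule until nothing changes is also an assumption; the text only describes the decision taken when a processor becomes idle.

```python
from collections import deque


def cm_dispatch(speeds, running, ready):
    """Apply the greedy CM rules until no further assignment is possible.

    speeds:  dict processor -> current speed
    running: dict processor -> thread (active processors only); mutated in place
    ready:   FIFO deque of ready-to-process threads; mutated in place
    """
    while True:
        idle = [p for p in speeds if p not in running]
        if not idle:
            return
        fastest_idle = max(idle, key=speeds.get)
        if ready:
            # Assign the oldest ready thread to the fastest idle processor.
            running[fastest_idle] = ready.popleft()
            continue
        slowest_active = min(running, key=speeds.get, default=None)
        if slowest_active is None or speeds[slowest_active] >= speeds[fastest_idle]:
            return
        # Central "mugging": move the work from the slowest active processor.
        running[fastest_idle] = running.pop(slowest_active)


speeds = {"P1": 100, "P2": 400, "P3": 1600}
running = {"P1": "T1"}
ready = deque(["T2"])
cm_dispatch(speeds, running, ready)
print(running)
```

In this run T2 goes to P3 (the fastest idle processor), and T1 is then moved from P1 to the faster idle P2. The loop terminates because every reassignment moves a thread to a strictly faster processor.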
Both the ECS and CM use preemption: each task can be preempted, or interrupted and continued on some faster processor. (This means that we are assuming checkpointing or other support in order to enable essentially continuous preemption and restart.)
In both models, the experimental results reported here have the migration cost set to zero (i.e., there is no calculated delay in migrating a task, whether by stealing, mugging, or assignment by the central manager). Instead, we calculate these communication costs separately because different platforms have different communication costs.
Our simulation program is written in Simscript II.5. Processor features, system utilization, network topology, and characteristics of jobs (DAGs) are input for the simulation. Experiments were conducted on a Sun Ultra 30 with 512MB memory, running Solaris 2.6. We compare CM and ECS in several contexts, with promising results.
Experiment 1
In this experiment, the input DAG consists of a "fan out" (from a single node to 50 nodes) followed by a "fan in" (back to a single node). All tasks are uniform in this case: each of the nodes in the middle of the DAG corresponds to a task requiring 50000 work units. This experiment is meant to model the case in which the job is readily parallelized into equal-sized subtasks, which are readily combined into the final output. (Note that in the traditional Cilk implementation [6], each thread can fork into only two threads at a time.) The system comprises 8 networked processors. The processors have various speeds: one works at 100 work units per unit time (ms), one at 200, one at 300, two at 400, two at 800, and one at 1600 work units per unit time. Communication cost is zero along edges of the network, since we count migrations separately.
Time interval I0 denotes the initial steal/mug interval, which is approximately proportional to the reciprocal of the processor speed. Specifically, the interval is 1 ms for the processor of speed 100, 0.7 for the processor of speed 200, 0.5 for the processor of speed 300, 0.3 for the processors of speed 400, 0.1 for the processors of speed 800, and 0.05 for the processor of speed 1600. In Tables 1-2, we show the data for values of the steal/mug interval ranging from I0/64 up to 512 × I0. This broad range allows us to see how the speed approaches an asymptote as the steal/mug interval approaches zero and how the performance deteriorates as this interval increases. For each steal/mug interval, we ran the simulation 500 times; for every run, the processor initiating DAG execution is selected uniformly at random. We tabulated the minimum, average, maximum, and standard deviation of the completion time, and the average numbers of successful steals and muggings over the 500 runs.
Two main lower bounds on the time required to complete a DAG in this setting are the total work of the DAG divided by the sum of the processor speeds, and the critical path divided by the speed of the fastest processor. However, for this special case there is a better lower bound, obtained as follows. First, the sum of the speeds of the 8 processors is 4600 work units per unit time. The best we can hope to do is to process the first task (node) with the fastest processor in time 50000/1600, then the 50 middle tasks in time 50 × 50000/4600, then the final task (after fan-in) with the fastest processor in time 50000/1600. This gives a lower bound of 605.98 time units to complete the DAG.
In comparing with this lower bound, we see that the ECS method, using the most frequent steal/mug attempts (I0/64), performs within 2.5% of the lower bound on average, with the maximum completion time among the 500 runs only 3.5% worse than the lower bound for this experiment. The average number of steals decreases as the steal/mug interval increases.
Muggings decrease more steeply. This behavior is expected, since once the first task is over, 50 tasks are released and put in the queue of the processor that just completed the first task. A smaller number of attempts results in a smaller number of successful steals or muggings.
Completion times and their volatility for the given configuration increase as the rate of steal/mug attempts decreases. Because of infrequent steal/mug attempts and the random choice of the "victim", faster processors can stay idle longer while slower processors are busy, thus increasing completion time.
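The three lower bounds discussed for Experiment 1 can be reproduced with a few lines of arithmetic. The sketch below uses only the speeds and task sizes stated for this experiment; the variable names are illustrative.

```python
speeds = [100, 200, 300, 400, 400, 800, 800, 1600]   # work units per ms
task = 50_000                                         # work units per task
fan_out = 50                                          # number of middle tasks

total_speed = sum(speeds)                             # 4600
fastest = max(speeds)                                 # 1600

# General bound 1: total work divided by the sum of processor speeds.
work_bound = (fan_out + 2) * task / total_speed
# General bound 2: critical path (three tasks) on the fastest processor.
path_bound = 3 * task / fastest
# Special-case bound: first and last task on the fastest processor,
# the 50 middle tasks spread over all 8 processors.
special_bound = task / fastest + fan_out * task / total_speed + task / fastest

print(round(work_bound, 2), round(path_bound, 2), round(special_bound, 2))
# prints: 565.22 93.75 605.98
```

The special-case bound is tighter than both general bounds because only one processor can be busy during the first and last task.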
Experiment 2
In this experiment, the input task DAG (Fig. 2) consists of a task of size 16000, which fans out to 50 parallel tasks, each of size 50000, then fans in to a task of size 16000, then fans out to 6 parallel tasks of size 500,000 each, then finally fans in to a task of size 16000. This experiment is meant to model the case in which the job is initially parallelized into many equal-sized subtasks, which are then combined, and a small number of follow-up tasks are run in parallel. This DAG models some practical image-object recognition applications, as the large fan-out is done in the raw image processing, while the longer, narrower part of the DAG models the more time-consuming, less number-crunching tasks of feature matching and object recognition. This experiment uses a network of 12 processors. The processors have various speeds: one works at 100 work units per unit time (ms), one at 200, one at 300, three at 400, three at 800, and three at 1600 work units per unit time. For each processor, the product of speed and time interval I0 is 80 work units. Each batch of 500 runs increases the steal/mug interval by 20% relative to the previous batch. Each run starts on a randomly selected processor. Task migrations are counted separately.
[...] the lower bound. This lag is mostly due to the "persistence" of fast processors; i.e., once they grab a task, it will be processed completely. For example, suppose two processors, the first twice as fast as the second, work on two equally long tasks. It would be efficient to swap tasks at the halfway point of the optimal run; that way, both would be active from beginning to end. Instead, the faster processor finishes its task first and then takes over the other, leaving the slower processor idle.
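The two-processor example can be worked through numerically. The sketch below assumes zero-cost preemption and a hypothetical task size of 600 work units; the function names are illustrative.

```python
def persistent_finish(work, fast, slow):
    """Fast processor completes its own task, then takes over the slow one's task."""
    t1 = work / fast                     # time at which the fast processor is done
    remaining = work - slow * t1         # work still left on the slow processor's task
    return t1 + remaining / fast         # the slow processor is idle after the takeover


def swap_at_halftime(work, fast, slow):
    """Swap the two tasks at half of the optimal finishing time."""
    half = (2 * work / (fast + slow)) / 2
    left_fast_task = work - fast * half  # remainder of the task started by the fast one
    left_slow_task = work - slow * half  # remainder of the task started by the slow one
    # After the swap the fast processor works on the slow one's task and vice versa.
    return half + max(left_slow_task / fast, left_fast_task / slow)


print(persistent_finish(600, 2, 1))   # 450.0
print(swap_at_halftime(600, 2, 1))    # 400.0
```

With these numbers the persistent schedule takes 450 time units. The swap keeps both processors busy for the whole run and finishes in 400, which equals total work divided by total speed.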
Experiment 3
This experiment simulates the ECS system with changing processor speeds. In the following set of simulations, we tested the robustness of the ECS method to changes in processor speed, as may be expected to occur in real systems. Processors change speed in a stepwise manner (an alternating renewal process); i.e., they start working at full speed and, after a randomly generated, exponentially distributed time, the speed drops to a level randomly chosen from a uniform distribution between two input parameters: the minimum and maximum percentage of full speed. After an exponentially distributed time at reduced speed, the processor returns to full speed and the pattern repeats. We executed five simulations and compared average completion times with those of a system in which processors work at full speed all the time. Each simulation is executed for 100 DAGs of the "graphic" and "fan-out-fan-in" types. This experiment employs a network of 12 processors. The processors have various speeds: one works at 100 work units per unit time (ms), one at 200, one at 300, two at 400, four at 800, and three at 1600 work units per unit time. For each processor, the product of speed and steal/mug interval is 80 work units. Communication cost for task migration is zero along edges of the network. For each simulation run, the durations of both full-speed periods and reduced-speed periods are randomly chosen from an exponential distribution with a mean of 50 time units.
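The speed-change model can be sketched as a generator of piecewise-constant speed profiles. This is a minimal illustration, not the simulator's code; `speed_profile` and its parameters are hypothetical names. It assumes that a fresh reduced level is drawn for each slowdown period.

```python
import random


def speed_profile(full_speed, horizon, lo_pct, hi_pct, mean_period=50.0, rng=None):
    """Return [(start_time, speed), ...] for an alternating full/reduced speed process."""
    rng = rng or random.Random()
    t, at_full, profile = 0.0, True, []
    while t < horizon:
        if at_full:
            speed = full_speed
        else:
            # Reduced level: uniform between lo_pct and hi_pct percent of full speed.
            speed = full_speed * rng.uniform(lo_pct, hi_pct) / 100.0
        profile.append((t, speed))
        # Period length is exponential with the given mean (50 time units in the text).
        t += rng.expovariate(1.0 / mean_period)
        at_full = not at_full
    return profile


profile = speed_profile(1600, 1000, 40, 60, rng=random.Random(1))
```

Each entry gives the time at which a new period starts and the speed during that period. Even-indexed entries are full-speed periods and odd-indexed entries are slowdowns.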
Refer to Table 5. The top row denotes the speed range during the slowdown phase as a percentage of full speed. The first data column is the benchmark, when speeds do not change. In the first run, speeds are only reduced to between 80 and 100%. Consecutive simulations reduce processor speeds to 60-80%, 50-70%, 40-60%, and finally to 10-50%. As the overall system computing power decreases, the completion times increase linearly. Thus, ECS performance degrades gracefully with changes of processor speeds.
Experiment 4
In this experiment, we simulate a networked system of 100 processors. Their speeds range from 1600 down to 100 (17 processors of speed 1600, 29 of speed 800, 25 of speed 400, 12 of speed 300, 8 of speed 200 and 9 of speed 100). Two types of DAGs, "graphic" and "fan-out-fan-in" with uniformly-sized tasks (as described in Experiments 1 and 2), arrive into the system according to a Poisson process with some mean DAG interarrival time. At the arrival time, the type of DAG is randomly determined, according to a discrete uniform distribution. The simulation is continuous, and statistics are collected after each consecutive 1000 DAG completions. Both systems are simulated on an identical DAG arrival timeline; thus, we can directly compare the quality of the scheduling schemes. Again, we simulate the behavior of the ECS and CM systems. In the ECS system, processors attempt to steal/mug at the I0 level as described in Experiment 2.
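One way to obtain an identical arrival timeline for both schedulers is to generate it once from a seeded random stream and replay it. The sketch below is an assumption about how this could be done, not a description of the original simulator; `dag_arrivals` is a hypothetical name and the seed is arbitrary.

```python
import random


def dag_arrivals(n, mean_interarrival, rng):
    """Return n (arrival_time, dag_type) pairs for a Poisson arrival process."""
    t, arrivals = 0.0, []
    for _ in range(n):
        # Exponential interarrival times give a Poisson arrival process.
        t += rng.expovariate(1.0 / mean_interarrival)
        # The DAG type is drawn from a discrete uniform distribution.
        arrivals.append((t, rng.choice(["graphic", "fan-out-fan-in"])))
    return arrivals


# The same timeline would be fed to both the ECS and the CM simulation.
timeline = dag_arrivals(1000, 300.0, random.Random(42))
```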
The ECS scheme is implemented here by two queues on each processor. A ready queue holds the tasks of a DAG in process, and a waiting queue keeps newly arriving DAGs in FIFO order. When a DAG appears on a processor, it is processed immediately if the processor is idle. If the processor is busy, the initial task of the DAG is placed in the waiting queue. When a processor finishes a task, successor tasks from the DAG are released and placed in the ready queue, and the processor continues with the task from the bottom of the ready queue. If the ready queue is empty, the processor takes the task from the top of the waiting queue. If both queues are empty, the processor attempts to steal/mug. The steal/mug procedure is as follows: the victim's ready queue is checked first and, if it is not empty, the task from the top is stolen. If the ready queue is empty, the waiting queue is checked and the task from the top, if any, is stolen. If both queues are empty, we have a mug attempt.
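The two-queue scheme can be sketched as follows. The class `Node` and the functions `next_local_task` and `take_from_victim` are hypothetical names for this illustration. The left end of each deque is assumed to be its top, and the condition that a mugging succeeds only against a slower victim is carried over from the ECS rules in Fig. 1.

```python
from collections import deque


class Node:
    def __init__(self, speed):
        self.speed = speed
        self.ready = deque()     # tasks of DAGs in process; left = top, right = bottom
        self.waiting = deque()   # initial tasks of newly arrived DAGs (FIFO); left = top
        self.current = None      # task being executed, if any


def next_local_task(node):
    """Own work first: bottom of the ready queue, then top of the waiting queue."""
    if node.ready:
        return node.ready.pop()
    if node.waiting:
        return node.waiting.popleft()
    return None                  # both queues empty: attempt to steal/mug


def take_from_victim(thief, victim):
    """Top of the victim's ready queue, then top of its waiting queue, else try to mug."""
    if victim.ready:
        return victim.ready.popleft()
    if victim.waiting:
        return victim.waiting.popleft()
    if victim.current is not None and victim.speed < thief.speed:
        task, victim.current = victim.current, None   # mug: victim becomes idle
        return task
    return None
```

Taking local work from the bottom and stolen work from the top means the owner and a thief usually touch opposite ends of the ready queue.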
The mean DAG interarrival time is 300 time units. The system utilization¹ is low, slightly above 0.2 (0.2047); thus, the distribution of completion times is positively skewed, i.e., more than 50% of DAGs are completed in less than the average time. Longer completion times are due to peaks in DAG arrivals. Again, the CM model performs slightly better on average, but it is not penalized for polling the status of each processor in the system, which usually requires an enormous amount of network traffic. ECS steal/mug attempts are one-on-one, requiring far less polling traffic. On the other hand, task migrations from processor to processor are almost 50% more frequent in ECS than in the CM system, which will add network traffic delays to DAG completion times (we do not charge migration costs in this experiment). The lower bound on the processing time of a graphic DAG is 390.577 time units, and our statistical analysis shows that 95% of the time that kind of DAG will be processed in less than 534.03 time units, which is only 37% worse than the lower bound. Half of the time that type of DAG will be processed in less than 421.85 time units, only 8% slower than the lower bound.
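The quoted lower bound of 390.577 can be reproduced by a stage-by-stage calculation. It assumes that each stage of n equal tasks runs on the n fastest processors with free preemption. That the bound was obtained this way is an assumption, made because the calculation matches the stated figure; `staged_lower_bound` is a hypothetical name.

```python
def staged_lower_bound(stages, speeds):
    """stages: list of (task_count, task_size) executed one after another.

    A stage of n equal tasks can keep at most n processors busy, so its
    duration is at least n * size divided by the sum of the n fastest speeds.
    """
    ranked = sorted(speeds, reverse=True)
    return sum(n * size / sum(ranked[:n]) for n, size in stages)


speeds = [1600] * 17 + [800] * 29 + [400] * 25 + [300] * 12 + [200] * 8 + [100] * 9
graphic = [(1, 16_000), (50, 50_000), (1, 16_000), (6, 500_000), (1, 16_000)]

bound = staged_lower_bound(graphic, speeds)
print(round(bound, 3))                  # 390.577
print(round(534.03 / bound - 1, 2))     # 0.37, the 95th-percentile gap
print(round(421.85 / bound - 1, 2))     # 0.08, the median gap
```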
Refer to Fig. 3. CM performs slightly better than ECS on the "fan-out-fan-in" DAG, as expected, but the lag is small. The system is lightly loaded, and with high probability DAGs are completed within at most 50% above the lower bound. The vertical axis measures time units; the horizontal axis refers to the actual runs (with marks per 1000 runs).
In Fig. 4, completion times for batches of 1000 "graphic" DAGs are given. DAGs enter the system according to a Poisson process, with exponential interarrival time distribution having a mean of 100 time units. System utilization is thus increased to about 0.6 and completion times are more volatile. The CM scheduler is still better than ECS on average, but it exhibits larger spreads between minimum and maximum times, and maximum completion times are larger in CM than in ECS.
¹ The system utilization is defined to be the ratio of the time integral of the active processing power to the time integral of the available processing power in the system.
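As a rough plausibility check on the utilization figures quoted in this experiment, the long-run utilization of a stable system can be approximated by the mean work arriving per unit time divided by the total processing power. The sketch below assumes that the "fan-out-fan-in" DAG is the 52-task DAG of Experiment 1 and that both DAG types are equally likely. Both are assumptions, and `offered_load` is a hypothetical name.

```python
def offered_load(mean_work_per_dag, mean_interarrival, speeds):
    """Long-run ratio of demanded to available processing power (stable system)."""
    return mean_work_per_dag / (mean_interarrival * sum(speeds))


speeds = [1600] * 17 + [800] * 29 + [400] * 25 + [300] * 12 + [200] * 8 + [100] * 9
graphic_work = 3 * 16_000 + 50 * 50_000 + 6 * 500_000   # DAG of Experiment 2
fan_work = 52 * 50_000                                   # assumed: DAG of Experiment 1
mean_work = (graphic_work + fan_work) / 2                # assumed: equal mix of both types

print(round(offered_load(mean_work, 300, speeds), 3))    # 0.204
print(round(offered_load(mean_work, 100, speeds), 3))    # 0.613
```

Under these assumptions the estimates are close to the reported values of 0.2047 at a mean interarrival time of 300 and about 0.6 at a mean interarrival time of 100.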
Conclusions
Based on our simulation results, ECS is a viable, well-behaved scheduler in a distributed heterogeneous environment. Even though CM was equipped with a much higher degree of knowledge of the overall system state, the results show that on average ECS is only slightly slower, and measured by the QoS measure it performs even better. Another major benefit of ECS is its decentralized nature and adaptability versus CM's centralized control. Once the control unit fails or becomes unusable, the whole CM system is rendered unusable as well (unless there is some backup mechanism for the control component). In contrast, ECS participants behave according to a simple local protocol, and thus the system is highly scalable and resilient to failures of its elements.
Future work may include: (1) further understanding of parallel program types; (2) inclusion of migration costs, query costs for the CM system, faults (within the system or coming from outside), multi-layered systems, and inhomogeneous processors; (3) simulation experiments with real-world parameters and a comparison of ECS with Depth First (DF) and Depth First Deques (DFD); and (4) application of the work-stealing paradigm to graph-partitioning problems.
