The primary goal of processor scheduling is to assign the tasks in a parallel program to processors so as to minimize the execution time. Most existing approaches to processor scheduling for multiprocessors assume that the execution time of each task is fixed and is independent of processor scheduling. In this paper, we argue that the execution time of a given task is not fixed but depends critically on the performance of the caches, which have become an essential component of shared-memory multiprocessors, and we propose a scheduling algorithm called the data distribution constrained scheduling algorithm. The proposed algorithm tries to maximize the number of cache hits by scheduling the processors so that the task that brings a memory block into the cache and the tasks that subsequently access the same memory block are executed on the same processor.
1. Introduction. The desire and need for more computing power have motivated the use of MIMD (multiple instruction, multiple data) computer systems, in which several processing elements execute independent streams of instructions. One of the important issues in MIMD multiprocessors is processor scheduling, which determines which processor will execute which tasks in a parallel program. The main goal of processor scheduling is to assign the tasks in a program to processors so as to minimize the execution time. Most existing approaches to processor scheduling for multiprocessors assume that the execution time of each task is fixed and is independent of processor scheduling. In this paper, we argue that the execution time of a given task is not fixed but depends critically on the performance of the caches, which have become an essential component of shared-memory multiprocessors, and we propose a scheduling algorithm called the data distribution constrained scheduling algorithm (hereafter, the DDCS algorithm). The proposed algorithm tries to maximize the number of cache hits by scheduling the processors so that the task that brings a memory block into the cache and the tasks that subsequently access the same memory block are executed on the same processor.
The rest of the paper is organized as follows. The following section gives the notation used throughout this paper and provides a brief literature survey of the areas related to the work reported herein. In section 3, a complete description of our approach along with examples is given. Section 4 concludes and suggests possible areas of future research.
2. Background and survey of previous work. The purpose of this section is to provide context for this paper. The section begins with a brief survey of data dependence analysis, which is essential to the scheduling algorithm presented in section 3. Our model of parallel programs is then presented in section 2.2. Finally, section 2.3 briefly surveys the literature on multiprocessor scheduling.
2.1. Data dependence. When executing programs on multiprocessor systems, we would like to break up the program into tasks that can be executed concurrently. Data dependences and control dependences determine which operations may be executed in parallel and which operations must be executed sequentially.
In conventional programs, there are four kinds of data dependences: flow-dependence, anti-dependence, output-dependence, and input-dependence [5]. Let $r$ and $r'$ be read operations and $w$ and $w'$ be write operations in a program. $r$ is defined to be flow-dependent on $w$ if the memory location written by $w$ may be read by $r$. $w$ is defined to be anti-dependent on $r$ if the memory location read by $r$ may be later written by $w$. $w$ is defined to be output-dependent on $w'$ if the memory location written by $w'$ may be later rewritten by $w$. $r$ is defined to be input-dependent on $r'$ if both of them read the same memory location and $r'$ precedes $r$. Note that, unlike the first three data dependences, the input dependence does not actually constrain the order of the two read operations involved. We define four functions, each of which associates a dependence with an element of $\{true, false\}$. If a data dependence $\delta$ exists between two statements $S_1$ and $S_2$, we associate with the dependence a set of distance vectors defined as

$$D_\delta = \{(\vec{i} - \vec{j}) \mid (\exists \vec{i})(\exists \vec{j})(\vec{i} \in \tilde{I}_{S_1} \wedge \vec{j} \in \tilde{I}_{S_2} \wedge S_1^{\vec{i}} \text{ is } \delta\text{-dependent on } S_2^{\vec{j}})\}.$$

It is the set of pairwise differences of the two iteration vectors associated with two dependent statements. A dependence is defined to have a constant distance if $|D_\delta| = 1$.
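The distance-vector set can be illustrated with a brute-force enumeration over a small iteration space. The following is only a sketch under stated assumptions: the function names, the subscript functions, and the loop bounds are hypothetical, and a real compiler would derive the set symbolically rather than by enumeration. The sink iteration plays the role of $\vec{i}$ and the source iteration the role of $\vec{j}$.

```python
from itertools import product

def distance_vectors(sink_iters, source_iters, sink_addr, source_addr):
    """Enumerate i - j over all (sink, source) iteration pairs that touch the
    same memory location, where the source iteration j does not follow the
    sink iteration i in lexicographic order (tuple comparison)."""
    return {
        tuple(a - b for a, b in zip(i, j))
        for i, j in product(sink_iters, source_iters)
        if j <= i and sink_addr(i) == source_addr(j)
    }

def has_constant_distance(dist_set):
    """A dependence has a constant distance when the set has one element."""
    return len(dist_set) == 1

# Hypothetical single loop with iterations 1..8.
iters = [(i,) for i in range(1, 9)]

# A write to B(i) followed, one iteration later, by a read of B(i - 1).
regular = distance_vectors(iters, iters,
                           sink_addr=lambda i: i[0] - 1,
                           source_addr=lambda j: j[0])

# A read of B(i // 2) depends on writes at varying distances.
irregular = distance_vectors(iters, iters,
                             sink_addr=lambda i: i[0] // 2,
                             source_addr=lambda j: j[0])
```

In this sketch the first dependence has a constant distance, while the second does not, since the gap between the writing and the reading iteration grows with the loop index.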
As an example, consider the loop shown in Fig. 1.

Data dependence relationships among statements in a program can be represented graphically by a labeled directed multigraph called the data dependence graph. The nodes of the graph are the statements in the program, and $(S_i, S_j, \delta)$ is in the arc set if $S_j$ is $\delta$-dependent on $S_i$ (denoted by $S_j \,\delta\, S_i$). $S_i$ and $S_j$ are called, respectively, the source and the sink of the dependence $\delta$.
The Do loop shown in Fig. 1 has the following data-dependence graph.
[Data dependence graph of the Do loop in Fig. 1]
The flow dependence $\delta_1$ is caused by the dependence between $A(i)$ (in $S_1$) and $A(i)$ (in $S_2$), and the flow dependence $\delta_2$ is caused by the dependence between $B(i)$ (in $S_2$) and $B(i-1)$ (in $S_1$).
The process of deciding whether two given statements in a program are data-dependent is called data dependence analysis and has been a subject of much research since the mid-seventies. The studies reported in [1, 6, 10] represent only a subset of such investigations.
2.2. Parallel programs. In order to exploit the parallelism available in a shared-memory multiprocessor, programs may be written with explicit parallel constructs or conventional sequential programs may be transformed into equivalent parallel ones by a restructuring compiler or a preprocessor.
These parallel programs usually contain three types of Do loop: Serial loops, DoAll loops, and DoAcross loops. Serial loops are loops with an inter-iteration delay equal to the execution time of the whole loop body. DoAll loops are loops with a delay equal to 0. Between Serial loops and DoAll loops are DoAcross [3] loops, which have a delay between 0 and the execution time of the loop body, caused by inter-iteration dependences. DoAll and DoAcross loops will also be called parallel loops.
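The three loop types differ only in the inter-iteration delay relative to the execution time of the loop body. As a small illustration, the classification can be written as follows; the function name and the use of plain numbers for times are assumptions made for this sketch.

```python
def classify_loop(delay: float, body_time: float) -> str:
    """Classify a Do loop by its inter-iteration delay.

    delay     -- time an iteration must wait after the previous one starts
    body_time -- execution time of the whole loop body (must be positive)
    """
    if body_time <= 0:
        raise ValueError("body_time must be positive")
    if delay <= 0:
        return "DoAll"      # iterations are fully independent
    if delay >= body_time:
        return "Serial"     # each iteration waits for the previous one to finish
    return "DoAcross"       # iterations overlap partially

def is_parallel_loop(delay: float, body_time: float) -> bool:
    """DoAll and DoAcross loops are the parallel loops."""
    return classify_loop(delay, body_time) != "Serial"
```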
Data dependences can also be classified as external dependences, parallel loop-carried dependences, or internal dependences (these classes are not necessarily disjoint). External dependences are dependences that may be carried by loops outside of a parallel loop, and parallel loop-carried dependences are those that may be carried by a parallel loop. Internal dependences are defined as dependences that may be carried by loops inside of a parallel loop or that are carried independently of loops. We define, as before, three functions, each of which maps a dependence to $\{true, false\}$, namely external, parallel loop-carried, and internal.
We assume that a parallel program is composed of a set of epochs [7], which are either parallel loops or serial regions between them. Each epoch consists of a set of instances. Execution of an iteration of a parallel loop constitutes an instance of the epoch of type parallel loop. A serial region is a special type of epoch that has only one instance. We assume that only one level of nested Do loops is to be executed concurrently (i.e., only different iterations of a single parallel loop can be executed in parallel on multiple processors).
We 
2.3. Multiprocessor scheduling. The main goal of multiprocessor scheduling is to assign the tasks in a parallel program to processors so as to minimize the execution time. Many classical scheduling schemes assume that the execution time of each task in a parallel program is fixed and known a priori, and statically schedule the processors for the tasks. These assumptions of static scheduling are difficult to justify because of:
1. the wide variance in synchronization time,
2. the wide variance in memory access time due to a complex memory hierarchy with caches and local memories,
3. the wide variance in the number of instructions executed in a task due to conditional statements.
Consequently, many dynamic scheduling schemes have been proposed as an alternative to the static approach. The goal of these dynamic approaches is to balance the workload among processors by dynamically assigning tasks to idle processors. However, approaches such as those presented in [4, 8, 9] still assume that the execution time of tasks is independent of processor scheduling, which is not always true in the presence of a complex memory hierarchy with caches and local memories, as we will see later.
One notable exception is the work by Callahan in the context of a shared-memory multiprocessor without caches, but with local memory associated with each processor [2]. The goal of the scheduling algorithm proposed in [2] is to reduce the number of global memory references by replacing them with local ones.
His approach is to distribute components of shared variables across local memories and then to schedule the processors in order to maximize the number of accesses to the locally-stored copies of shared variables. In this approach, a distribution function, called allocate, is assigned for each shared composite variable such as a shared array. It maps each element of the composite variable to a particular processor. The mapped processor will have the corresponding element in its local memory during execution time.
In [2], an algorithm is presented that distributes a set of user arrays across local memories over a set of parallel loops without loop-carried flow, anti, output and input dependences. To see the approach, let us consider the following example given in [2].
In the above example, the arrays A, B and C can all be distributed, and the distribution functions for A, B and C are:

$allocated_A(i) = p_j$ where $(i - j - 1) \bmod |P| = 0$
$allocated_B(i) = p_j$ where $(i - j) \bmod |P| = 0$
$allocated_C(i) = p_j$ where $(i - j) \bmod |P| = 0$

where $allocated_X(i)$ returns the processor that has $X(i)$ in its local memory and $P$ is the set of allocated processors. Fig. 2 shows the distribution.

There are, however, some limitations to this approach. First, the scheme is applicable only to a limited set of parallel loops (i.e., parallel loops without loop-carried flow, anti, output and input dependences). This limits the applicability of the scheme, since it cannot accommodate parallel loops with inter-iteration dependences such as DoAcross and first-order linear recurrence [5] loops. Furthermore, even some parallel loops without inter-iteration dependences (i.e., DoAll loops) cannot be accommodated, since the approach excludes parallel loops with loop-carried input dependences. Second, the correctness of the execution of a parallel program depends on processor scheduling. In other words, the scheduling imposed by the compiler for data distribution must be strictly observed at execution time for the program to execute correctly. Third, the approach requires local memories with an address space distinct from the global address space, and time-consuming copy operations between local and global memories. In the next section, we propose a cache-based DDCS algorithm that overcomes these three limitations of the previous approach.
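The distribution functions quoted from [2] can be made concrete with a few lines of code. This is a minimal sketch, assuming that processors are indexed $p_0, \ldots, p_{|P|-1}$ (the indexing base is not stated here) and using hypothetical Python function names; it only evaluates the modular conditions given above.

```python
def allocated_A(i: int, num_procs: int) -> int:
    """Index j of the processor p_j with (i - j - 1) mod |P| = 0."""
    return (i - 1) % num_procs

def allocated_B(i: int, num_procs: int) -> int:
    """Index j of the processor p_j with (i - j) mod |P| = 0."""
    return i % num_procs

def allocated_C(i: int, num_procs: int) -> int:
    """Same distribution as B: (i - j) mod |P| = 0."""
    return i % num_procs

def colocated(i: int, num_procs: int) -> bool:
    """True if A(i + 1), B(i) and C(i) live in the same local memory."""
    return (allocated_A(i + 1, num_procs)
            == allocated_B(i, num_procs)
            == allocated_C(i, num_procs))
```

Under these functions the elements $A(i+1)$, $B(i)$ and $C(i)$ always land on the same processor, so an iteration that touches exactly those elements needs only local references when it runs on that processor.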
3. Cache-based data distribution constrained scheduling. In our view, the three problems associated with Callahan's algorithm originate from the following two facets of the approach.
1. The scheduling was performed based on parallel loops instead of data dependences among statements, which are more basic units of data reuse.
2. The scheduling problem was formulated based on local memories, which are not architecturally transparent.

In this section, we present our Data Distribution Constrained Scheduling (DDCS) scheme. It is based on data dependences among statements and thus, as will be seen, eliminates the undue restrictions imposed by Callahan's scheme. Furthermore, by considering a cache-based architecture rather than one with local memories, we do not have to wrestle with the correctness issues and the explicit copy operations.
In the following, we give a problem formulation of DDCS in terms of data dependences and caches. The goal of the scheduling algorithm is to maximize inter-epoch cache reuse. Cache reuse is dictated by the following three conditions.
1. It should be guaranteed that the source reference that brings the entry into the cache is not succeeded by a write reference to the same memory location in the epoch containing the source. (Otherwise, such a succeeding write reference would make the just-loaded cache entry stale.)
2. Similarly, it should be guaranteed that the sink reference that subsequently accesses the entry is not preceded by a write reference to the same memory location in the epoch that contains the sink.
3. There should be at least one path from the source to the sink in the flow graph of the parallel program that does not traverse an epoch that has writes to the same memory location.

The above three conditions, when combined, guarantee that the cache entry loaded by the source reference is reused by the sink reference without being invalidated by an intervening write reference, provided that the epoch containing the source and the epoch containing the sink are executed on the same processor and the entry is not replaced in between.
In our approach, the first and second conditions are checked through data dependence analysis and the results are expressed in terms of reference markings. The check for the third condition is a trivial matter.
3.1. Reference marking scheme. Our DDCS algorithm requires a reference marking scheme for reads and writes to determine the possibility of cache entry reuse across different epochs. The reference marking scheme is based on data dependence analysis of a parallel program. Each write operation has only one marking, which indicates whether the cache entry loaded by the given write reference can be reused in a later epoch. On the other hand, two markings are given to each read operation, since a read operation can be both a source and a sink of a cache entry reuse.
Marking of write operations. A marking is required for each write operation to decide whether the cache entry written by it can be reused in future epochs. This decision is based on whether the write reference may have a succeeding write reference to the same memory location from other processors in the same epoch. If a given write reference cannot have such a succeeding write reference, it is marked as CBR (Can Be Reused) and the cache entry loaded by it can potentially be reused in later epochs.
We now give a more formal marking scheme for write operations in a parallel loop and in a serial region.
In a parallel loop:

In a serial region: Every write operation in a serial region is marked as CBR, since the serial region is executed by only one processor from start to end and, therefore, a given write reference cannot be succeeded by another write to the same memory location from other processors.

As an example of the marking of write operations, let us consider the DoAcross loop given in Fig. 3. To simplify our presentation, the synchronizations required to satisfy the inter-iteration dependences are omitted in the example. In this example, the first write to the array A (i.e., $A(i_1)$) cannot be marked as CBR, since the second write to the array A (i.e., $A(i_1 - 1)$) is parallel loop-carried output-dependent on it. This indicates that the cache words written by the $A(i_1)$'s should be regarded as stale at the end of the above parallel loop, since the corresponding memory locations may be overwritten.
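The marking of writes described in prose above can be sketched in code. Note the assumptions: the formal rule for parallel loops is not reproduced in this text, so the check below is only one plausible reading of the informal description (a write is not CBR if some other write is parallel loop-carried output-dependent on it). The `Dependence` record, its field names, and the string labels for references are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dependence:
    source: str                  # label of the source reference
    sink: str                    # label of the sink reference
    kind: str                    # "flow", "anti", "output" or "input"
    parallel_loop_carried: bool  # may be carried by the parallel loop

def write_is_cbr(write: str, deps, in_serial_region: bool = False) -> bool:
    """Assumed rule: a write is CBR unless another write is parallel
    loop-carried output-dependent on it. Writes in a serial region are
    always CBR, as stated in the text."""
    if in_serial_region:
        return True
    return not any(
        d.source == write and d.kind == "output" and d.parallel_loop_carried
        for d in deps
    )

# Dependence described for the DoAcross example: the second write A(i1 - 1)
# is parallel loop-carried output-dependent on the first write A(i1).
deps = [Dependence("A(i1)", "A(i1-1)", "output", True)]
first_write_cbr = write_is_cbr("A(i1)", deps)
second_write_cbr = write_is_cbr("A(i1-1)", deps)
```

With this input the first write is not marked CBR and the second one is, which agrees with the discussion of the example.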
The second write to the array A (i.e., $A(i_1 - 1)$) is marked as CBR since it is guaranteed that the memory locations written by it will not be overwritten

The first marking of a read operation indicates whether the cache entry loaded by the read can be reused in later epochs (i.e., whether the entry can survive without being invalidated until the end of the epoch to which the reference belongs). This marking is necessary to handle the case in which the read operation acts as a source of a cache entry reuse, and its meaning is analogous to that of CBR for write operations (i.e., a read operation in an epoch is marked as CBR if it cannot have a succeeding write reference to the same memory location from other processors in the same epoch). If the given read operation may have such a succeeding write reference, the resulting cache entry would be invalidated and might not be reused in later epochs. The second marking, on the other hand, determines whether the given read reference can make use of cache entries loaded during a past epoch. This marking is needed when the read operation acts as a sink of a cache entry reuse. If a given read reference cannot be preceded by a write reference to the same memory location from other processors in the same epoch, it is marked as CR and can potentially utilize a cache entry loaded in a past epoch.
The above marking policy can be described formally as follows.
In a parallel loop:

In a serial region: Every read operation in a serial region is marked as CBR for the same reason as for writes. It is also marked as CR, since it cannot be preceded in the serial region by a write to the same memory location from other processors.

An example of the marking of read operations is given in Fig. 4. Again, synchronizations are omitted in the example to simplify the discussion. In the example, the read operation $A(f(i_1))$ is marked as CR if $f(k)$, $1 \le k \le n_1$, is not equal to any $g(k')$ for $1 \le k' < k$. If this condition is satisfied, the read reference $A(f(i_1))$ can access cache words loaded by CBR reads and writes during past epochs. To handle the case in which $A(f(i_1))$ acts as a source of a cache entry reuse, it is marked as CBR if $f(k)$, $1 \le k \le n_1$, is not equal to any $g(k')$ for $k < k' \le n_1$. In this case, it is guaranteed that the cache words loaded into the cache on read misses on the $A(f(i_1))$'s remain up to date, if they exist at all, at the end of the parallel loop and may be referenced by CR reads in future epochs.
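The two conditions on $f$ and $g$ stated for the read $A(f(i_1))$ can be checked by enumeration for small loop bounds. The sketch below assumes that $g$ is the subscript function of a write to A in the same parallel loop, as the conditions suggest; the function names and the sample subscript functions are hypothetical, and a compiler would decide these conditions symbolically through data dependence analysis.

```python
def read_is_cr(f, g, n1: int) -> bool:
    """CR: no f(k), 1 <= k <= n1, equals any g(k') with 1 <= k' < k."""
    return all(f(k) != g(kp)
               for k in range(1, n1 + 1)
               for kp in range(1, k))

def read_is_cbr(f, g, n1: int) -> bool:
    """CBR: no f(k), 1 <= k <= n1, equals any g(k') with k < k' <= n1."""
    return all(f(k) != g(kp)
               for k in range(1, n1 + 1)
               for kp in range(k + 1, n1 + 1))

# Hypothetical case 1: read A(i + 1), write A(i).
# A later iteration overwrites what the read loaded, so the read is CR only.
case1 = (read_is_cr(lambda k: k + 1, lambda k: k, 10),
         read_is_cbr(lambda k: k + 1, lambda k: k, 10))

# Hypothetical case 2: read A(i - 1), write A(i).
# An earlier iteration wrote the location, so the read is CBR only.
case2 = (read_is_cr(lambda k: k - 1, lambda k: k, 10),
         read_is_cbr(lambda k: k - 1, lambda k: k, 10))
```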
3.2. Overall scheme. In our approach, we first calculate, for each alignable external flow and input dependence, the number of cache words loaded into the cache by the source of the dependence and referenced by its sink without being invalidated in between. (Recall that if two statements are related by an alignable dependence, the distance of the dependence is always constant.) Also recall that three conditions should be met for the reuse of cache entries. We restate these three conditions for the reuse of cache entries along an external dependence in terms of the reference markings given in the previous subsection.
1. The source of the dependence must be a CBR write or a CBR read.
2. The sink of the dependence must be a CR read.
3. There should be at least one path from the source to the sink in the flow graph of the program that does not traverse an epoch that has writes to a memory location associated with the dependence.
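The three conditions above can be combined into a single check. The following sketch assumes a hypothetical representation in which the flow graph is a dictionary from each epoch to its successor epochs, and `writers` is the set of epochs that write a memory location associated with the dependence. Only the epochs strictly between the source epoch and the sink epoch are tested for writes, since the first two conditions already cover the source and sink epochs through the markings.

```python
from collections import deque

def has_clean_path(flow_graph, src_epoch, sink_epoch, writers) -> bool:
    """Breadth-first search for a path from src_epoch to sink_epoch whose
    intermediate epochs contain no write to the location in question."""
    queue = deque(flow_graph.get(src_epoch, []))
    seen = set()
    while queue:
        epoch = queue.popleft()
        if epoch == sink_epoch:
            return True
        if epoch in seen or epoch in writers:
            continue  # already explored, or the path is spoiled by a write
        seen.add(epoch)
        queue.extend(flow_graph.get(epoch, []))
    return False

def is_forwardable(source_is_cbr, sink_is_cr,
                   flow_graph, src_epoch, sink_epoch, writers) -> bool:
    """Conditions 1-3 for an alignable external dependence."""
    return (source_is_cbr and sink_is_cr
            and has_clean_path(flow_graph, src_epoch, sink_epoch, writers))

# Hypothetical flow graph: E1 branches to E2 or E3, and both reach E4.
graph = {"E1": ["E2", "E3"], "E2": ["E4"], "E3": ["E4"], "E4": []}
one_clean_branch = is_forwardable(True, True, graph, "E1", "E4", {"E2"})
no_clean_branch = is_forwardable(True, True, graph, "E1", "E4", {"E2", "E3"})
```

Here a write in E2 alone does not prevent forwarding, because the path through E3 remains free of writes, whereas writes in both E2 and E3 do.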
We define an alignable dependence as forwardable if it meets the above three conditions. For each forwardable dependence, we calculate the gain, in terms of cache hits relative to no caching, that can be obtained if we align the dependence. This alignment would allow the cache entries loaded into the cache by the source references (i.e., references from the source of the dependence) to be reused by the sink references. Such gain is given by $gain_{total} = \sum \cdots$

In the following, we give an algorithm that produces a schedule for each parallel loop so as to maximize the number of cache hits. It takes as input an undirected multigraph $G = (V, E)$, where $V$ consists of the parallel loops in a program and $E$ consists of labeled edges $(Loop_i, Loop_j, \delta)$, where $Loop_i, Loop_j \in V$, $\delta$ is a forwardable dependence, and $Loop_i$ and $Loop_j$ are the parallel loops enclosing the source and the sink of the dependence, respectively. The algorithm visits every forwardable dependence in order of decreasing $gain_{total}$. For each forwardable dependence, it checks whether there is a possible schedule for the two parallel loops related by the dependence that maximizes the reuse of the cache entries associated with the dependence. If so, the algorithm schedules those two parallel loops. Otherwise, the schedules for the two parallel loops are already interrelated in order to align other forwardable dependences with larger $gain_{total}$. The algorithm is given in Fig. 5.
In the algorithm, line 2 selects the forwardable dependence with the largest total gain. Lines 3-5 check whether the two parallel loops which enclose the source and the sink of the dependence belong to the same set. If not, it merges the two sets to which those two parallel loops belong in line 6. In the algorithm, each set is represented by a tree based on father pointers.
Every set element $v$ except the root has a father, denoted by father($v$), which is another element in the same set. The root of a set does not have a father and names the set to which it belongs. Being in the same set implies that the scheduling of iterations of one parallel loop in that set is constrained by the schedules of the other parallel loops in the same set. This constraint is necessary so that the iteration of a parallel loop that produces an item and the iteration of another parallel loop that consumes the same item are executed by the same processor. This constraint is represented by the function $f_{align}$ which maps

Algorithm 3.2.

To see the mechanics of the above algorithm, let us consider the example given in Fig. 6. The associated data dependence graph is shown in Fig. 7. Only forwardable dependences are depicted in the graph, since only those are relevant to our discussion. We annotate each forwardable dependence with its source and sink in the graph.
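The set bookkeeping described above (father pointers, roots naming sets, and merging in order of decreasing $gain_{total}$) can be sketched as follows. This is not the algorithm of Fig. 5 itself: the alignment function $f_{align}$ is omitted, the function names are hypothetical, and the loop names and gain values in the example are invented for illustration.

```python
def find_root(father, v):
    """Follow father pointers until the root, which names the set."""
    while father[v] is not None:
        v = father[v]
    return v

def group_loops(loops, dependences):
    """dependences: iterable of (loop_i, loop_j, gain_total) triples, one per
    forwardable dependence. Returns the father map and the dependences that
    were aligned, in the order in which they were processed."""
    father = {v: None for v in loops}   # initially each loop is its own set
    aligned = []
    for loop_i, loop_j, gain in sorted(dependences,
                                       key=lambda d: d[2], reverse=True):
        root_i = find_root(father, loop_i)
        root_j = find_root(father, loop_j)
        if root_i != root_j:
            father[root_j] = root_i     # merge the two sets
            aligned.append((loop_i, loop_j, gain))
        # Otherwise the two schedules are already interrelated by
        # dependences with larger gain, so this dependence is skipped.
    return father, aligned

# Hypothetical input: three parallel loops and three forwardable dependences.
loops = ["L1", "L2", "L3"]
deps = [("L2", "L3", 30), ("L1", "L3", 20), ("L1", "L2", 50)]
father, aligned = group_loops(loops, deps)
```

In this example the dependences with gains 50 and 30 are aligned, after which all three loops belong to one set and the dependence with gain 20 is skipped.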
Assume that $n_1 = 50$ and $n_2 = n_3 = 30$. Then the forwardable dependences would be processed in the order of $\delta_1$ ($gain_{total}$ ... Initially, each parallel loop is in its own set. So the initial partition is:
