BDSC, which extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm, is a new efficient automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size; it is based on a precise cost model and addresses both shared and distributed parallel memory architectures. We describe BDSC, its integration within the PIPS compiler infrastructure and its application to the parallelization of four well-known scientific applications: Harris, ABF, equake and IS. Our experiments suggest that BDSC's focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these four applications running on both OpenMP and MPI frameworks.
Introduction
"Anyone can build a fast CPU. The trick is to build a fast system." Attributed to Seymour Cray, this quote is even more pertinent when looking at multiprocessor systems that contain several fast processing units; parallel system architectures introduce subtle system constraints to achieve good performance. Real world applications, which operate on a large amount of data, must be able to deal with limitations such as memory requirements, code size and processor features. These constraints must also be addressed by parallelizing compilers that target such applications and translate sequential codes into efficient parallel ones.
One key issue when attempting to parallelize sequential programs is to find solutions to graph partitioning and scheduling problems, where nodes represent computational tasks and edges, data exchanges. Each node is labeled with an estimation of the time taken by the computation it performs; similarly, each edge is assigned a cost that reflects the amount of data that need to be exchanged between its adjacent nodes. Task scheduling is the process that assigns a set of tasks to a network of processors such that the completion time of the whole application is as small as possible while respecting the dependence constraints of each task. Usually, the number of tasks exceeds the number of processors; thus some processors are dedicated to multiple tasks. Since finding the optimal solution of a general scheduling problem is NP-complete [7] , providing an efficient heuristic to find a good solution is needed. Efficiency is strongly dependent here on the accuracy of the cost information encoded in the graph to be scheduled. Gathering such information is a difficult process in general, in particular in our case where tasks are automatically generated from program code.
Scheduling approaches can be categorized in many ways. In preemptive techniques, the currently executing task can be preempted by another higher-priority task, while, in a non-preemptive scheme, a task keeps its processor until termination. Preemptive scheduling algorithms are instrumental in avoiding possible deadlocks and in implementing real-time systems, where tasks must adhere to specified deadlines; however, preemptions generate run-time overhead. Moreover, when static predictions of task characteristics such as execution time and communication cost exist, task scheduling can be performed statically (offline). Otherwise, dynamic (online) schedulers must make run-time mapping decisions whenever new tasks arrive; this introduces run-time overhead.
In the context of the automatic parallelization of scientific applications we focus on in this paper, we are interested in non-preemptive static scheduling policies of parallelized code 1 . Even though the subject of static scheduling is rather mature (see Section 6), we believe the advent and widespread market use of multi-core architectures, with the constraints they impose, warrant taking a fresh look at its potential. Indeed, static scheduling mechanisms have, first, the strong advantage of reducing run-time overheads, a key factor when considering execution time and energy usage metrics. Another important advantage of these schedulers over dynamic ones, at least over those not equipped with detailed static task information, is that the existence of efficient schedules is ensured prior to program execution. This is usually not an issue when time performance is the only goal at stake, but much more so when memory constraints might prevent a task from being executed at all on a given architecture. Finally, static schedules are predictable, which helps both at the specification (if such a requirement has been introduced by designers) and debugging levels.
We thus introduce a new non-preemptive static scheduling heuristic that strives to yield schedule lengths, i.e., parallel execution times, that are as small as possible, in order to extract the task-level parallelism possibly present in sequential programs, while enforcing architecture-dependent constraints. Our approach takes into account resource constraints, i.e., the number of processors, the computational cost and memory use of each task and the communication cost for each task exchange, to provide hopefully significant speedups on realistic shared and distributed computer architectures. Our technique, called BDSC, is based on an existing best-of-breed static scheduling heuristic, namely Yang and Gerasoulis's DSC (Dominant Sequence Clustering) list-scheduling algorithm [29] [8], which we equip with new heuristics that handle resource constraints. One key advantage of DSC over other scheduling policies (see Section 6), besides its already good performance when the number of processors is unlimited, is that it has been proven optimal for fork/join graphs: this is a serious asset given our focus on the program parallelization process, since task graphs representing parallel programs often use this particular graph pattern. Even though this property may be lost when constraints are taken into account, our experiments on scientific benchmarks suggest that our extension still provides good performance speedups (see Section 5).
If scheduling algorithms are key issues in sequential program parallelization, they need to be properly integrated into compilation platforms to be used in practice. These environments are in particular expected to provide the data required for scheduling purposes, a difficult problem we already mentioned. Besides BDSC, our paper also introduces new static program analysis techniques to gather the information required to perform scheduling while ensuring that resource constraints are met, namely a static instruction and communication cost model, a data dependence graph to enforce scheduling constraints and static information regarding the volume of data exchanged between program fragments.
The main contributions of this paper, which introduces a new task-based parallelization approach that takes into account resource constraints during the static scheduling process, are:
• "Bounded DSC" (BDSC), an extension of DSC that simultaneously handles two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual parallel architectures;
• a new BDSC-based hierarchical scheduling algorithm that uses a new data structure, called the Hierarchical Clustered Dependence Graph (KDG), to represent partitioned parallel programs;
• an implementation of BDSC and KDG parallelization in the PIPS [12] source-to-source compilation framework, using a new cost model based on time complexity measures, convex polyhedral approximations of data array sizes and code instrumentation for the labeling of KDG nodes and edges;
• performance measures related to the BDSC-based parallelization of four significant programs, targeting both shared and distributed memory architectures: the image and signal processing applications Harris and ABF, the SPEC2001 benchmark equake and the NAS parallel benchmark IS.

Figure 1: A Directed Acyclic Graph (left) and its scheduling (right); starred tlevels (*) correspond to the selected clusters
This paper is organized as follows. Section 2 presents the original DSC algorithm that we intend to extend. We detail our algorithmic extension, BDSC, in Section 3. Section 4 introduces the partitioning of a source code into a Hierarchical Clustered Dependence Graph (KDG), our cost model for the labeling of this KDG and a new BDSC-based hierarchical scheduling algorithm. Section 5 provides the performance results of four scientific applications parallelized on the PIPS platform: Harris, ABF, equake and IS. We also assess the sensitivity of our parallelization technique to the accuracy of the static approximations of the code execution time used in task scheduling. Section 6 compares the main existing scheduling algorithms and parallelization platforms with our approach. Finally, Section 7 concludes the paper and addresses future work.
List Scheduling: the DSC Algorithm
In this section, we introduce the notion of list-scheduling heuristics and present the list-scheduling heuristic called DSC [29] .
List-Scheduling Processes
A labelled directed acyclic graph (DAG) G is defined as G = (T, E, D), where (1) T = nodes(G) is a set of n tasks (nodes) τ annotated with an estimation of their execution time task_time(τ), (2) E is a set of m edges e = (τi, τj) between two tasks, and (3) D is an n × n communication edge cost matrix edge_cost(e); task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 4.2. The functions successors(τ, G) and predecessors(τ, G) return the list of immediate successors and predecessors of a task τ in the DAG G. Figure 1 provides an example of a simple graph, with nodes τi; node times are listed in the node circles while edge costs label arrows.
A list scheduling process provides, from a DAG G, a sequence of its nodes that satisfies the relationship imposed by E. Various heuristics try to minimize the schedule total length, possibly allocating the various nodes in different clusters, which ultimately will correspond to different processes or threads. A cluster κ is thus a list of tasks; if τ ∈ κ, we note cluster(τ ) = κ. List scheduling is based on the notion of node priorities. The priority for each task τ is computed using the following attributes:
• The top level tlevel(τ, G) of a node τ is the length of the longest path from the entry node 2 of G to τ . The length of a path is the sum of the communication cost of the edges and the computational time of the nodes along the path. Tlevels are used to estimate the start times of nodes on processors: the tlevel is the earliest possible start time. Scheduling in an ascending order of tlevel tends to schedule nodes in a topological order. The algorithm for computing the top level of a node τ in a graph is given in Algorithm 1.
ALGORITHM 1: The tlevel of Task τ in Graph G
function tlevel(τ, G)
  tl = 0;
  foreach τi ∈ predecessors(τ, G)
    level = tlevel(τi, G) + task_time(τi) + edge_cost(τi, τ);
    if (tl < level) then tl = level;
  return tl;
end
• The bottom level blevel(τ, G) of a node τ is the length of the longest path from τ to the exit node of G. The maximum of the blevels of the nodes is the length cpl(G) of the graph's critical path, which is the longest path in the DAG G. The latest start time of a node τ is the difference (cpl(G) − blevel(τ, G)) between the critical path length and the bottom level of τ. Scheduling in a descending order of blevel tends to schedule critical path nodes first. The algorithm for computing the bottom level of τ in a graph is given in Algorithm 2.
To illustrate these notions, the tlevels and blevels of each node of the graph presented in the left of Figure 1 are provided in the adjacent table (we discuss the other entries in this table later on).
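As a concrete illustration of the two attributes defined above, here is a minimal Python sketch that computes tlevels and blevels on a small, made-up DAG (it is not the graph of Figure 1). The dictionaries `TASK_TIME` and `EDGE_COST` and the helper names are assumptions introduced for this example; the blevel of a node is assumed to include the node's own execution time, which is the usual convention and is consistent with cpl(G) being the maximum blevel.

```python
from functools import lru_cache

# Hypothetical toy DAG: node -> execution time.
TASK_TIME = {"a": 1, "b": 2, "c": 3, "d": 1}
# (source, target) -> communication cost.
EDGE_COST = {("a", "b"): 2, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 4}


def predecessors(t):
    return [u for (u, v) in EDGE_COST if v == t]


def successors(t):
    return [v for (u, v) in EDGE_COST if u == t]


@lru_cache(maxsize=None)
def tlevel(t):
    """Longest path length from an entry node to t, excluding t's own time."""
    return max(
        (tlevel(p) + TASK_TIME[p] + EDGE_COST[(p, t)] for p in predecessors(t)),
        default=0,
    )


@lru_cache(maxsize=None)
def blevel(t):
    """Longest path length from t to an exit node, including t's own time
    (assumed convention)."""
    return TASK_TIME[t] + max(
        (EDGE_COST[(t, s)] + blevel(s) for s in successors(t)), default=0
    )


def critical_path_length():
    return max(blevel(t) for t in TASK_TIME)
```

On this toy graph, the nodes a, c and d all have tlevel + blevel equal to the critical path length, which is what a priority based on both levels uses to single out critical nodes.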
The general algorithmic skeleton for list scheduling a graph G on P clusters (P can be infinite and is assumed to be always strictly positive) is provided in Algorithm 3: first, priorities priority(τ ) are computed for all currently unscheduled nodes; then, the node with the highest priority is selected for scheduling; finally, this node is allocated to the cluster that offers the earliest start time. Function f characterizes each specific heuristic, while the set of clusters already allocated to tasks is clusters. Priorities need to be computed again for (a possibly updated) graph G after each scheduling of a task: task times and communication costs change when tasks are allocated to clusters. This is performed by the update priority values function call.
ALGORITHM 3: List scheduling of Graph G on P processors
procedure list_scheduling(G, P)
  clusters = ∅;
  foreach τi ∈ nodes(G)
    priority(τi) = f(tlevel(τi, G), blevel(τi, G));
  UT = nodes(G); // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT);
    κ = select_cluster(τ, G, P, clusters);
    allocate_task_to_cluster(τ, κ, G);
    update_graph(G);
    update_priority_values(G);
    UT = UT−{τ};
end
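The skeleton of Algorithm 3 can be made concrete with a deliberately simplified Python sketch. The choices below are assumptions made for illustration only: the priority function f is the static blevel, the cluster chosen is the one offering the earliest start time, and the graph is not updated between iterations (so priorities never change). It is therefore a generic list scheduler, not DSC or BDSC.

```python
def list_schedule(task_time, edge_cost, num_clusters):
    """Greedy list scheduling; returns (cluster_of, start_of, makespan)."""
    preds = {t: [u for (u, v) in edge_cost if v == t] for t in task_time}
    succs = {t: [v for (u, v) in edge_cost if u == t] for t in task_time}

    def blevel(t):  # static priority: longest path from t to an exit node
        return task_time[t] + max(
            (edge_cost[(t, s)] + blevel(s) for s in succs[t]), default=0)

    cluster_of, start_of = {}, {}
    cluster_free = [0] * num_clusters  # time at which each cluster becomes free
    unscheduled = set(task_time)
    while unscheduled:
        # Only tasks whose predecessors are all scheduled can be selected.
        ready = [t for t in unscheduled
                 if all(p in cluster_of for p in preds[t])]
        task = max(ready, key=lambda t: (blevel(t), t))  # ties broken by name

        def start_on(k):
            # Data from a predecessor on another cluster pays the edge cost.
            data_ready = max(
                (start_of[p] + task_time[p]
                 + (0 if cluster_of[p] == k else edge_cost[(p, task)])
                 for p in preds[task]), default=0)
            return max(cluster_free[k], data_ready)

        best = min(range(num_clusters), key=start_on)
        start_of[task] = start_on(best)
        cluster_of[task] = best
        cluster_free[best] = start_of[task] + task_time[task]
        unscheduled.remove(task)
    return cluster_of, start_of, max(cluster_free)
```

A call such as `list_schedule({"a": 1, "b": 2}, {("a", "b"): 3}, 2)` places both tasks on the same cluster, because keeping b next to a avoids the communication cost of 3.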
The DSC Algorithm
DSC (Dominant Sequence Clustering) is a list-scheduling heuristic for an unbounded number of processors. The objective is to minimize the top level of each task. A DS (Dominant Sequence) is a path that has the longest length in a partially scheduled DAG; a graph critical path is thus a DS for the totally scheduled DAG. The DSC heuristic computes a Dominant Sequence (DS) after each node is processed, using tlevel(τ, G) + blevel(τ, G) as priority(τ ). A ready node τ , i.e., for which all predecessors have already been scheduled 3 , on one of the current DSs, i.e., with the highest priority, is clustered with a predecessor τ p when this reduces the tlevel of τ by zeroing, i.e., setting to zero, the cost of the incident edge (τ p , τ ).
To decide which predecessor τp to select, DSC applies the minimization procedure tlevel_decrease, which returns the predecessor that leads to the highest reduction of tlevel for τ if clustered together, and the resulting tlevel; if no zeroing is accepted, the node τ is kept in a new single node cluster 4 . More precisely, the minimization procedure tlevel_decrease for a task τ, in Algorithm 4, tries to find the cluster cluster(min_τ) of one of its predecessors τp that reduces the tlevel of τ as much as possible by zeroing the cost of the edge (min_τ, τ). All clusters start at the same time, and each cluster is characterized by its running time, cluster_time(κ), which is the cumulated time of all tasks τ scheduled into κ; idle slots within clusters may exist and are also taken into account in this accumulation process. The condition scheduled(τp) is tested on predecessors of τ in order to make it possible to apply this procedure for ready and unready τ nodes; an unready node has at least one unscheduled predecessor.
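The minimization described above can be sketched in Python as follows. This is a reading of the prose, not a transcription of Algorithm 4: the argument names, the use of plain dictionaries for tlevels, task times, edge costs and cluster assignments, and the convention of returning the task itself when no zeroing helps are all assumptions made for this example.

```python
def tlevel_decrease(task, preds, tlevel, task_time, edge_cost,
                    cluster_of, cluster_time):
    """Return (best_pred, best_tlevel).

    best_pred is `task` itself when no zeroing reduces its tlevel.
    `cluster_of` only contains scheduled tasks.
    """
    def arrival(p):
        # Time at which the data sent by p reaches `task` across clusters.
        return tlevel[p] + task_time[p] + edge_cost[(p, task)]

    min_task = task
    min_tlevel = max((arrival(p) for p in preds), default=0)
    for p in preds:
        if p not in cluster_of:  # unscheduled predecessor: nothing to zero
            continue
        k = cluster_of[p]
        # Joining cluster k: wait for the cluster to be free and for the
        # data of every predecessor that lives in another cluster.
        start = cluster_time[k]
        for q in preds:
            if cluster_of.get(q) != k:
                start = max(start, arrival(q))
        if start < min_tlevel:
            min_tlevel, min_task = start, p
    return min_task, min_tlevel
```

In a DSC-like cluster selection, a result whose first component is a scheduled predecessor designates that predecessor's cluster; otherwise a new cluster would be opened.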
DSC is the instance of Algorithm 3 where select_cluster is replaced by the code in Algorithm 5 (new_cluster extends clusters with a new empty cluster; its cluster_time is set to 0). Note that min_tlevel will be used in Section 2.3. Since priorities are updated after each iteration, DSC dynamically computes the critical path based on both tlevel and blevel information. The table in Figure 1 represents the result of scheduling the DAG in the same figure using the DSC algorithm.
Dominant Sequence Length Reduction Warranty (DSRW)
DSRW is an additional greedy heuristic within DSC that aims to further reduce the scheduling length. A node on the DS path with the highest priority can be ready or not ready. With the DSRW heuristic, DSC schedules the ready nodes first, but, if such a ready node τr is not on the DS path, DSRW verifies, using the procedure in Algorithm 6, that the corresponding zeroing does not later prevent the reduction of the tlevels of the DS nodes τu that are partially ready, i.e., such that there exists at least one unscheduled predecessor of τu. To do this, we check whether the "partial top level" of τu, which does not take into account unexamined (unscheduled) predecessors and is computed using tlevel_decrease, is still reducible once τr is scheduled.
The table in Figure 1 illustrates an example where it is useful to apply the DSRW optimization. There, the DS column provides, for the task scheduled at each step, its priority, i.e., the length of its dominant sequence, while the last column represents, for each possible zeroing, the corresponding task tlevel; starred tlevels (*) correspond to the selected clusters. Task τ4 is mapped to Cluster κ0 in the first step of DSC. Then, τ3 is selected because it is the ready task with the highest priority. The mapping of τ3 to Cluster κ0 would reduce its tlevel from 3 to 2. But the zeroing of (τ4, τ3) affects the tlevel of τ2, τ2 being the unready task with the highest priority. Since the partial tlevel of τ2 is 2 with the zeroing of (τ4, τ2) but 4 after the zeroing of (τ4, τ3), DSRW will fail, and DSC allocates τ3 to a new cluster, κ1. Then, τ1 is allocated to a new cluster, κ2, since it has no predecessors. Thus, the zeroing of (τ4, τ2) is kept thanks to the DSRW optimization; the total scheduling length is 5 (with DSRW) instead of 7 (without DSRW) (Figure 2).

      …
      start_time = max(level, start_time);
    if (min_tlevel > start_time) then
      min_tlevel = start_time;
      min_τ = τp;
  return (min_τ, min_tlevel);
end

ALGORITHM 5: DSC cluster selection for Task τ for Graph G on P processors
function select_cluster(τ, G, P, clusters)
  (min_τ, min_tlevel) = tlevel_decrease(τ, G);
  return (scheduled(min_τ)) ? cluster(min_τ) : new_cluster(clusters);
end

This section details the key ideas at the core of our new scheduling process BDSC, which extends DSC with a number of important features, namely (1) verifying predefined memory constraints, (2) targeting a bounded number of processors and (3) trying to make this number as small as possible.
DSC Weaknesses
A good scheduling solution is a solution that is built carefully, by having knowledge about previously scheduled tasks and tasks to arrive in the future. Yet, as stated in [18], "an algorithm that only considers blevel or only tlevel cannot guarantee optimal solutions". Even though DSC is a policy that uses the critical path for computing dynamic priorities based on both the blevel and the tlevel for each node, it has some limits in practice.
The key weakness of DSC for our purpose is that the number of processors cannot be predefined; DSC yields blind clusterings, disregarding resource issues. Therefore, in practice, a thresholding mechanism to limit the number of generated clusters should be introduced. When allocating new clusters, one should verify that the number of clusters does not exceed a predefined threshold P (Section 3.3). Also, zeroings should handle memory constraints, i.e., by verifying that the resulting clustering does not lead to cluster data sizes that exceed a predefined cluster memory threshold M (Section 3.3).
Finally, DSC may generate a lot of idle slots in the created clusters. It adds a new cluster when no zeroing is accepted without verifying the possible existence of gaps in existing clusters. We handle this case in Section 3.4, adding an efficient idle cluster slot allocation routine in the task-to-cluster mapping process.
Resource Modeling
Since our extension deals with computer resources, we assume that each node in a DAG is equipped with an additional piece of information, task_data(τ), which is an over-approximation of the memory space used by Task τ; its size is assumed to be always strictly less than M. A similar cluster_data function applies to clusters, where it represents the collective data space used by the tasks scheduled within it. Since BDSC, like DSC, needs execution times and communication costs to be numerical constants, we discuss in Section 4.2 how this information is computed.
Our improvement to the DSC heuristic intends to reach a tradeoff between the gained parallelism and the communication overhead between processors, under two resource constraints: a finite number of processors and a finite amount of memory. We track these resources in our implementation of allocate_task_to_cluster given in Algorithm 7; note that the aggregation function data_merge is defined in Section 4.2.
ALGORITHM 7: Task allocation of Task τ in Graph G to Cluster κ, with resource management
procedure allocate_task_to_cluster(τ, κ, G)
  scheduled(τ) = TRUE;
  cluster(τ) = κ;
  cluster_time(κ) = max(cluster_time(κ), tlevel(τ, G)) + task_time(τ);
  cluster_data(κ) = data_merge(cluster_data(κ), task_data(τ));
end

Efficiently allocating tasks on the target architecture requires reducing the communication overhead and transfer cost for both shared and distributed memory architectures. While zeroing operations, which reduce the start time of each task and nullify the corresponding edge cost, are obviously meaningful for distributed memory systems, they are also worthwhile on shared memory architectures. Merging two tasks in the same cluster keeps the data in the local memory, and possibly even the cache, of each thread and avoids copying them over the shared memory bus. Therefore, transmission costs are decreased and bus contention is reduced.
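The bookkeeping of Algorithm 7 can be sketched in Python as below. The `Cluster` class and the representation of data as sets of array names are assumptions made for this example; in particular, `data_merge` is modeled here as a plain set union, whereas the actual function is the one defined in Section 4.2.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    time: int = 0                   # cluster_time: running time, idle slots included
    data: frozenset = frozenset()   # cluster_data: data used by the cluster's tasks
    tasks: list = field(default_factory=list)


def data_merge(d1, d2):
    """Assumed aggregation: arrays shared by two tasks are kept once."""
    return d1 | d2


def allocate_task_to_cluster(task, cluster, tlevel, task_time, task_data):
    cluster.tasks.append(task)
    # The task starts at its tlevel or when the cluster becomes free,
    # whichever is later; any gap before it is an idle slot.
    cluster.time = max(cluster.time, tlevel[task]) + task_time[task]
    cluster.data = data_merge(cluster.data, task_data[task])
```

For instance, allocating a task of duration 2 at tlevel 0 and then a task of duration 1 at tlevel 5 to the same cluster yields a cluster time of 6: the idle slot between times 2 and 5 is counted in the accumulation.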
Resource Constraint Warranty
Resource usage affects speed. Thus, parallelization algorithms should try to limit the size of the memory used by tasks. BDSC introduces a new heuristic to control the amount of memory used by a cluster, via the user-defined memory upper bound parameter M. The limitation of the memory size of tasks is important when (1) executing large applications that operate on large amounts of data, (2) M represents the processor local (or cache) memory size, since, if the memory limitation is not respected, transfers between the global and local memories may occur during execution and may result in performance degradation, and (3) targeting embedded system architectures. For each task τ, BDSC computes an over-approximation of the amount of data that τ allocates to perform read and write operations; it is used to check that the memory constraint of Cluster κ is satisfied whenever τ is included in κ. Algorithm 8 implements this memory constraint warranty MCW; data_merge and size_data are functions that respectively merge data and yield the size (in bytes) of data (see Section 4.2).
Another scarce resource is the number of processors. In the original policy of DSC, when no zeroing for τ is accepted, i.e., when none would decrease its start time, τ is allocated to a new cluster. In order to limit the number of created clusters, we propose to introduce a user-defined cluster threshold P. This processor constraint warranty PCW is defined in Algorithm 8.

  …
  return |clusters| < P;
end
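The two warranties can be illustrated with a short Python sketch. Only the PCW test (`|clusters| < P`) is taken from the listing above; the MCW test is a reading of the prose, and the data representation (a dictionary mapping array names to sizes in bytes), the bodies of `data_merge` and `size_data`, and the use of a non-strict comparison against M are assumptions made for this example.

```python
def data_merge(d1, d2):
    """Assumed merge: an array used by both operands is counted once."""
    merged = dict(d1)
    merged.update(d2)
    return merged


def size_data(data):
    """Total size, in bytes, of a merged data description."""
    return sum(data.values())


def memory_constraint_warranty(task_data, cluster_data, memory_bound):
    """MCW: adding the task keeps the cluster within the memory bound M."""
    return size_data(data_merge(cluster_data, task_data)) <= memory_bound


def processor_constraint_warranty(clusters, max_clusters):
    """PCW: a new cluster may still be created."""
    return len(clusters) < max_clusters
```

With a cluster holding arrays A and B of 400 bytes each, a task using B (400 bytes) and C (200 bytes) brings the merged footprint to 1000 bytes, not 1400, since B is shared.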
Efficient task-to-cluster mapping
In the original policy of DSC, when no zeroings are accepted (because none would decrease the start time of Node τ or DSRW failed), τ is allocated to a new cluster. This cluster creation is not necessary when idle slots are present at the end of other clusters; thus, we suggest selecting instead one of these idle slots, if this can decrease the start time of τ, without affecting the scheduling of the successors of the nodes already in these clusters. To ensure this, these successors must have already been scheduled or they must be a subset of the successors of τ. Therefore, in order to use clusters efficiently and not introduce additional clusters needlessly, we propose to schedule τ to the cluster that verifies this optimizing constraint, if no zeroing is accepted.
The extension of DSC that we introduce in BDSC thus amounts to replacing each assignment of τ to a new cluster by a call to end_idle_clusters. The end_idle_clusters function given in Algorithm 9 returns, among the idle clusters, the ones that finished most recently before τ's top level, or the empty set if none is found. This assumes, of course, that τ's dependencies are compatible with this choice. To illustrate the importance of this heuristic, suppose we have the DAG presented in Figure 3. Figure 4 exhibits the difference in scheduling obtained by DSC and our extension on this graph. We observe here that the number of clusters generated using DSC is 3, with 5 idle slots, while BDSC needs only 2 clusters, with 2 idle slots. Moreover, BDSC achieves a better load balancing than DSC, since it reduces the variance of the clusters' execution loads, defined, for a given cluster, as the sum of the costs of all its tasks: 0.25, for BDSC, vs. 6, for DSC. Finally, with our efficient task-to-cluster mapping, in addition to decreasing the number of generated clusters, we also gain in total execution time, since our approach reduces communication costs by allocating tasks to the same cluster; for example, as shown in Figure 4, the execution time with DSC is 13, but is equal to 12 with BDSC.
To get a feeling for the way BDSC operates, we detail the steps taken to get this better scheduling in the table of Figure 3 . BDSC is equivalent to DSC until Step 5, where κ 0 is chosen by our cluster mapping heuristic, since successors(τ 3 , G) ⊂ successors(τ 5 , G); no new cluster needs to be allocated.
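The idle-slot selection described in this section can be sketched as follows. Algorithm 9 itself is not reproduced here, so this Python function is only a reading of the prose: the cluster representation, the argument names and the decision to ignore τ itself among the pending successors are assumptions made for this example.

```python
def end_idle_clusters(task, task_tlevel, clusters, succs, scheduled):
    """Return the set of reusable clusters that finished most recently.

    clusters:  dict name -> (cluster_time, list of tasks in the cluster)
    succs:     dict task -> list of immediate successors
    scheduled: set of already scheduled tasks
    """
    candidates = {}
    for name, (ctime, tasks) in clusters.items():
        if ctime > task_tlevel:  # the cluster is still busy at tlevel(task)
            continue
        # Successors of the cluster's tasks that are not yet scheduled.
        pending = {s for t in tasks for s in succs[t]
                   if s not in scheduled and s != task}
        # They must all be successors of `task`, so that appending `task`
        # to the cluster cannot delay anything unrelated to it.
        if pending <= set(succs[task]):
            candidates[name] = ctime
    if not candidates:
        return set()
    latest = max(candidates.values())
    return {name for name, ctime in candidates.items() if ctime == latest}
```

An empty result means that no existing cluster can be reused, in which case the scheduler falls back on its other rules.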
The BDSC Algorithm
BDSC extends the list scheduling template provided in Algorithm 3 by taking into account the various extensions discussed above. In a nutshell, the BDSC select_cluster function, which decides in which cluster κ a task τ should be allocated, tries successively the four following strategies:

1. choose κ among the clusters of τ's predecessors that decrease the start time of τ, under MCW and DSRW constraints;
2. or, assign κ using our efficient task-to-cluster mapping strategy, under the additional constraint MCW;
3. or, create a new cluster if the PCW constraint is satisfied;
4. otherwise, choose the cluster among all clusters in MCW_clusters_min under the constraint MCW. Note that, in this worst case scenario, the tlevel of τ can be increased, leading to a decrease in performance since the length of the graph critical path is also increased.

Figure 3: A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right)

Figure 4: DSC (left) and BDSC (right) cluster allocation
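The order in which the four strategies are tried can be summarized by the following Python sketch. It only captures the cascade: the candidate clusters for rules 1 and 2 are assumed to have been computed beforehand (for instance by a tlevel-decrease step and an idle-slot search), `fits_memory` stands for the MCW test, and the fallback of rule 4 is assumed here to pick the memory-feasible cluster with the smallest running time. All names are hypothetical.

```python
def select_cluster(zeroing, idle, clusters, max_clusters,
                   fits_memory, cluster_time):
    """Return (cluster, rule) or (None, 'failure').

    zeroing: predecessor cluster accepted by the zeroing/DSRW step, or None
    idle:    cluster proposed by the idle-slot mapping, or None
    """
    # Rule 1: zeroing with a predecessor's cluster, under MCW.
    if zeroing is not None and fits_memory(zeroing):
        return zeroing, "zeroing"
    # Rule 2: reuse an idle slot at the end of an existing cluster, under MCW.
    if idle is not None and fits_memory(idle):
        return idle, "idle slot"
    # Rule 3: open a new cluster while PCW holds.
    if len(clusters) < max_clusters:
        new = f"k{len(clusters)}"
        clusters.append(new)
        return new, "new cluster"
    # Rule 4: fall back on an existing cluster that satisfies MCW
    # (assumed choice: the one that finishes earliest).
    feasible = [k for k in clusters if fits_memory(k)]
    if feasible:
        return min(feasible, key=cluster_time), "fallback"
    return None, "failure"
```

The final `failure` case corresponds to the situation where no cluster has enough memory left for the task.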
BDSC is described in Algorithms 10 and 11; the entry graph Gu is the whole unscheduled program DAG, P, the maximum number of processors, and M, the maximum amount of memory available in a cluster. UT denotes the set of unexamined tasks at each BDSC iteration, RL, the set of ready tasks and URL, the set of unready ones. We schedule the nodes of G according to the four rules above in a descending order of the nodes' priorities. Each time a task τr has been scheduled, all the newly readied nodes are added to the set RL (ready list) by the update_ready_set function.
BDSC returns a scheduled graph, i.e., an updated graph where some zeroings may have been performed and for which the clusters function yields the clusters needed by the given schedule; this schedule includes, besides the new graph, the cluster allocation function on tasks, cluster. If not enough memory is available, BDSC returns the original graph, and signals its failure by setting clusters to the empty set.
We suggest applying here an additional heuristic: if multiple nodes have the same priority, the node with the greatest bottom level is chosen for τr (likewise for τu) to be scheduled first, in order to favor the successors that have the longest path from τr to the exit node. Also, an optimization could be performed when calling update_priority_values(G); indeed, after each cluster allocation, only the tlevels of the successors of τr need to be recomputed instead of those of the whole graph.
Theorem 1. The time complexity of Algorithm 10 (BDSC) is $O(n^3)$, n being the number of nodes in Graph G.
Proof. In the "while" loop of BDSC, the most expensive computation is the function end_idle_clusters used in find_cluster, which locates an existing cluster suitable for the allocation of Task τ; such reuse intends to optimize the use of the limited number of processors. Its worst-case complexity is $O(n^2)$. Thus the total cost for n iterations of the "while" loop is $O(n^3)$. Even though BDSC's worst-case complexity is larger than DSC's, which is $O(n^2 \log(n))$ [29], it remains polynomial, with a small exponent. Our experiments (see Section 5) showed this theoretical slowdown is indeed not a significant factor in practice.
BDSC-Based Hierarchical Parallelization
In this section, we detail how BDSC can be used, in practice, to schedule applications. We show how to build from an existing program source code what we call a Hierarchical Clustered Dependence Graph (KDG), which will play the role of DAG G above, how to then generate the numerical cost of nodes and edges in G and how to perform what we call Hierarchical Scheduling for KDGs. We use PIPS to illustrate how these new ideas can be integrated in an optimizing compilation platform.
PIPS [12] is a powerful, source-to-source compilation framework initially developed at MINES ParisTech in the 1990s. Thanks to its open-source nature, PIPS has been used by multiple partners over the years for analyzing and transforming C and Fortran programs, in particular when targeting vector, parallel and hybrid architectures. Its advanced static analyses provide sophisticated information about possible program behaviors, including use-def chains, preconditions, transformers, in-out array regions and worst-case code complexities. All information within PIPS is managed via specific APIs that are automatically provided from data structure specifications written with the Newgen domain specific language [14] .
ALGORITHM 10: BDSC scheduling Graph
            …ters, P)) then
            κ = new_cluster(clusters);
            allocate_task_to_cluster(τr, κ, G);
          else
            if (¬find_cluster(τr, G, clusters, P, M)) then
              return error('Not enough memory', Gu);
      else
        allocate_task_to_cluster(τr, cluster(τm), G);
        edge_cost(τm, τr) = 0;
    else if (¬find_cluster(τr, G, clusters, P, M)) then
      return error('Not enough memory', Gu);
    update_priority_values(G);
    UT = UT−{τr};
    RL = update_ready_set(RL, τr, G);
    URL = UT−RL;
  clusters(G) = clusters;
  return G;
end
Hierarchical Clustered Dependence Graph (KDG)
PIPS represents user code as abstract syntax trees S based on the following grammar, limited here to the control statements at stake in this paper: 
  …
  clusters(G) = ∅;
  return G;
end

We assume that each task τ includes a statement S = task_statement(τ), which corresponds to the code it runs when scheduled.
In order to partition real applications, which include loops, tests and other structured constructs, into DAGs, our approach is to use the hierarchy of the source code abstract syntax tree to build an acyclic hierarchical graph, which will be the input for the BDSC algorithm; this new data structure is called a Hierarchical Clustered Dependence Graph (KDG). A KDG is a data dependence DAG between nodes labeled with statements, while control dependences are encoded in the abstract syntax trees of statements. Any statement S can label a KDG node. Contrary to the usual program dependence graph defined by [6], a KDG is thus not built only on simple instructions, represented here as call statements 5 ; compound statements such as test statements (both true and false branches) and loop nests may recursively include KDGs, yielding the hierarchical nature of KDGs.
We thus extend the grammar of S to allow for KDGs, with the new statement KDG(K,S), which, besides encoding a KDG K, allows us to keep the original statement S, which will be of use later on. Figure 5 shows the KDG K of the last part of the equake code excerpt given in Figure 6: note how the body of the last loop has been converted to an inner nested KDG node K0 of S0. This KDG has been generated automatically with PIPS; we use the Graphviz tool for pretty printing [9]. For a statement S, one computes the hierarchical clustered dependence graph KDG(S) using a recursive descent of program statements S, creating a node for each statement in a sequence and building loop body, true branch and false branch statements as recursively nested graphs. To create a DAG from this set of nodes, we add an edge between two nodes if the possibly compound statements in the nodes are data dependent; we rely on one of the PIPS static analyses to provide this data dependence information. We discuss node times and edge costs in the following.
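The recursive descent described above can be illustrated with a minimal Python sketch. It is not the PIPS implementation: the statement classes (Call, Seq, Loop, Cond), the KDG record and the set-based dependence test are all hypothetical, and whole variable names stand in for the array regions that PIPS actually analyzes.

```python
from dataclasses import dataclass, field


@dataclass
class Call:                      # simple instruction (hypothetical representation)
    name: str
    reads: frozenset
    writes: frozenset


@dataclass
class Seq:                       # sequence of statements
    body: list


@dataclass
class Loop:                      # loop with a single body statement
    body: object


@dataclass
class Cond:                      # test with true and false branches
    true_branch: object
    false_branch: object


@dataclass
class KDG:
    nodes: list                                  # one node per statement of the sequence
    edges: set = field(default_factory=set)      # (i, j): node j depends on node i
    nested: dict = field(default_factory=dict)   # node index -> nested KDGs


def accesses(stmt):
    """Return the (reads, writes) sets of a possibly compound statement."""
    if isinstance(stmt, Call):
        return set(stmt.reads), set(stmt.writes)
    if isinstance(stmt, Seq):
        parts = stmt.body
    elif isinstance(stmt, Loop):
        parts = [stmt.body]
    else:
        parts = [stmt.true_branch, stmt.false_branch]
    reads, writes = set(), set()
    for part in parts:
        r, w = accesses(part)
        reads |= r
        writes |= w
    return reads, writes


def dependent(s1, s2):
    """Conservative data dependence test (flow, anti and output dependences)."""
    r1, w1 = accesses(s1)
    r2, w2 = accesses(s2)
    return bool(w1 & r2 or r1 & w2 or w1 & w2)


def nested_sequences(stmt):
    """Sequences reached by descending through loops and tests only."""
    if isinstance(stmt, Seq):
        return [stmt]
    if isinstance(stmt, Loop):
        return nested_sequences(stmt.body)
    if isinstance(stmt, Cond):
        return nested_sequences(stmt.true_branch) + nested_sequences(stmt.false_branch)
    return []


def build_kdg(seq):
    """Build the KDG of a sequence; edges only go forward, so the graph is acyclic."""
    kdg = KDG(nodes=list(seq.body))
    for j, sj in enumerate(seq.body):
        for i in range(j):
            if dependent(seq.body[i], sj):
                kdg.edges.add((i, j))
        kdg.nested[j] = [build_kdg(sub) for sub in nested_sequences(sj)]
    return kdg
```

In this sketch, a loop or test is kept as a single node of the enclosing KDG, and its body is scheduled separately through the nested graphs, which mirrors the design decisions listed below.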
Our introduction of the KDG is motivated by the following observations, which also support our design decisions:
1. The true and false statements of a test are control dependent upon the condition of the test statement, while every statement within a loop (i.e., statements of its body) is control dependent upon the loop statement header. If we define a control area as a set of statements transitively linked by the control dependence relation, our KDG construction process ensures that the control area of the statement of a given node is in the node. This way, we keep all the control dependences of a task in our KDG within itself.
2. We decided to consider test statements as single nodes in the KDG to ensure that they will be scheduled on one cluster, which guarantees the execution of the enclosed code (true or false statements), whichever branch is taken, on this cluster.
3. We do not group successive simple call instructions into a single "basic block" node in the KDG in order to let BDSC fuse the corresponding statements so as to maximize parallelism and minimize communications. Note that PIPS performs interprocedural analyses, which will allow call sequences to be efficiently scheduled whether these calls represent trivial assignments or complex function calls.
Cost Model Generation
Since the volumes of data used or communicated by KDG tasks are key factors in the BDSC scheduling process, we need to assess this information as precisely as possible. PIPS provides a precise intra- and inter-procedural analysis of array data flow called regions analysis [4] that computes dependences for each array element access. Sets of array accesses are gathered into array regions, which are represented by convex polyhedra expressions over the values of variables in the current memory store. For each statement S, four types of regions are considered: read_regions(S) and write_regions(S) contain the array elements respectively read and written by S, while in_regions(S) contains those read by S but not defined in S, and out_regions(S) those written by S and used after S.
From Convex Polyhedra to Ehrhart Polynomials
Our analysis uses the following set of operations on regions Ri (convex polyhedra):
1. regions_intersection(R1, R2) is the intersection of two convex polyhedra R1 and R2, which is also a convex polyhedron;
2. regions_difference(R1, R2) is the set difference between two convex polyhedra; since this is not generally a convex polyhedron, an under-approximation is computed in order to obtain a convex representation;
3. regions_union(R1, R2) is the union of two convex polyhedra, which is again not necessarily a convex polyhedron; a convex hull approximation is computed in order to return the smallest enclosing polyhedron.
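To make the three approximations concrete, here is a minimal Python sketch restricted to one-dimensional integer intervals, the simplest convex polyhedra. This restriction, the tuple representation and the choice of keeping the larger leftover piece in the difference are assumptions made for illustration; PIPS operates on general convex polyhedra.

```python
# A region is an inclusive integer interval (lo, hi); None is the empty region.

def size(r):
    """Number of integer points in the interval."""
    return 0 if r is None else r[1] - r[0] + 1


def regions_intersection(r1, r2):
    """Exact: the intersection of two intervals is an interval."""
    if r1 is None or r2 is None:
        return None
    lo, hi = max(r1[0], r2[0]), min(r1[1], r2[1])
    return (lo, hi) if lo <= hi else None


def regions_union(r1, r2):
    """Over-approximation: smallest interval enclosing both arguments."""
    if r1 is None:
        return r2
    if r2 is None:
        return r1
    return (min(r1[0], r2[0]), max(r1[1], r2[1]))


def regions_difference(r1, r2):
    """Under-approximation of r1 minus r2: the larger of the leftover pieces."""
    inter = regions_intersection(r1, r2)
    if r1 is None:
        return None
    if inter is None:
        return r1
    left = (r1[0], inter[0] - 1) if r1[0] < inter[0] else None
    right = (inter[1] + 1, r1[1]) if inter[1] < r1[1] else None
    return max((left, right), key=size)
```

For instance, removing (3, 4) from (0, 9) leaves the two pieces (0, 2) and (5, 9); the sketch returns (5, 9), a convex subset of the true difference.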
Since we are interested in the size of these regions to precisely assess communication costs and memory requirements, we compute Ehrhart polynomials [5], which represent the number of integer points contained in a given parameterized polyhedron, from these regions. To manipulate these polynomials, we use various operations using the Ehrhart API provided by the polylib library [24].

Edge Cost. To assess the communication cost between two KDG nodes, τ1 as source and τ2 as sink nodes, we rely on the number of bytes involved in dependences of type "read after write" (RAW) data, using the read and write regions as follows:
Rw = write_regions(task_statement(τ1));
Rr = read_regions(task_statement(τ2));
edge_cost(τ1, τ2) = ehrhart(regions_intersection(Rw, Rr))
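As an illustration of what the Ehrhart polynomial of a region counts, the following Python sketch enumerates integer points by brute force instead of calling polylib. The two regions, the helper names and the 8-byte element size are hypothetical choices for this example only.

```python
from itertools import product


def points(pred, n):
    """Integer points (i, j) of the square [0, n) x [0, n) that satisfy pred."""
    return {p for p in product(range(n), repeat=2) if pred(*p)}


def edge_cost(write_region, read_region, elem_bytes=8):
    """Bytes involved in the read-after-write dependence (element size assumed)."""
    return elem_bytes * len(write_region & read_region)


def triangle(i, j):
    """Region written by the source task: lower triangle 0 <= j <= i < N."""
    return j <= i


def band(i, j):
    """Region read by the sink task: the first two columns, j < 2."""
    return j < 2
```

For a size parameter N, the triangle contains N(N+1)/2 points and its intersection with the band contains 2N − 1 points; these closed forms are the kind of parameterized count that an Ehrhart polynomial provides symbolically.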
Task Data. To provide an estimation of the volume of data used by each node τ, we use the number of bytes of data read and written by the task statement, via the following definitions:
S = task_statement(τ);
task_data(τ) = data_merge(read_regions(S), write_regions(S))
where we define data_merge and size_data as follows:
Task Time. In order to determine an average execution time for each node in the KDG, we use a static execution time approach based on a program complexity analysis provided by PIPS. There, each statement S is automatically labeled with an expression, represented by a polynomial complexity_estimation(S) over program variables, that denotes an estimation of the execution time of this statement, assuming that each basic operation (addition, multiplication, etc.) has a fixed, architecture-dependent execution time. This sophisticated static complexity analysis is based on inter-procedural information such as preconditions. Using this approach, one can define task_time as:
task_time(τ) = complexity_estimation(task_statement(τ))
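The idea of a fixed cost per basic operation can be sketched in a few lines of Python. The cost table below is hypothetical and does not reproduce the PIPS default operation cost model; the sketch only shows how a loop nest with known trip counts yields a product of a body cost and iteration counts.

```python
# Hypothetical per-operation costs, in arbitrary time units.
OP_COST = {"add": 1, "mul": 2, "load": 3, "store": 3}


def body_cost(op_counts):
    """Cost of one execution of a loop body, given its operation counts."""
    return sum(OP_COST[op] * count for op, count in op_counts.items())


def loop_nest_time(op_counts, trip_counts):
    """Static time estimate of a perfect loop nest with the given trip counts."""
    time = body_cost(op_counts)
    for trips in trip_counts:
        time *= trips
    return time
```

With symbolic trip counts such as sizeN and sizeM, the same computation would produce a monomial c·sizeN·sizeM, where c is the body cost.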
From Polynomials to Values
We have just seen how to represent cost, data and time information in terms of polynomials; yet, running BDSC requires actual values. This is particularly a problem when computing tlevels and blevels, since cost and time are cumulated there. We decided to convert cost information into time by assuming that communication times are proportional to costs, which amounts in particular to setting communication latency to zero. When program variables used in the above-defined polynomials are numerical values, each polynomial is a constant; this happens to be the case for one of our applications, ABF.

However, when input data are unknown at compile time (as for the Harris application), we suggest using a very simple heuristic to approximate the values of the polynomials. When all polynomials at stake are monomials on the same base, we simply keep the coefficient of these monomials. Even though this heuristic appears naive at first, it actually is quite useful in the Harris application: Table 1 shows the complexities and time estimation generated for each function of Harris using the PIPS default operation cost model, where the sizeN and sizeM variables represent the input image size.

The general case deals with polynomials that are functions of many variables, such as the ones that occur in equake and that depend on the variables ARCHelems or ARCHnodes and timessteps. In such cases, we suggest first instrumenting the input sequential code and running it once in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus instructions that compute the values of the cost polynomials for each statement. BDSC is then applied, using this cost information, to yield the final parallel program. Note that this approach is sound since BDSC ensures that the value of a variable (and thus a polynomial) is the same, whichever scheduling is used.
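The three situations discussed above (known values, monomials on a common base, and the general case) can be summarized by the following Python sketch. The polynomial representation, the function names and the coefficients used in any example are hypothetical; they are not taken from Table 1.

```python
# A polynomial maps a monomial, written as a tuple of (variable, power) pairs,
# to its coefficient; the empty tuple () is the constant monomial.

def evaluate(poly, env):
    """Value of a polynomial for the variable bindings in env."""
    total = 0
    for monomial, coeff in poly.items():
        term = coeff
        for var, power in monomial:
            term *= env[var] ** power
        total += term
    return total


def approximate(polys, env=None):
    """Turn the cost polynomials of a set of tasks into numbers.

    polys maps a task name to its polynomial. If env is given, polynomials
    are evaluated; otherwise the coefficient heuristic is tried.
    """
    if env is not None:
        return {task: evaluate(p, env) for task, p in polys.items()}
    bases = {monomial for p in polys.values() for monomial in p}
    if len(bases) == 1 and all(len(p) == 1 for p in polys.values()):
        # All polynomials are monomials on the same base: keep coefficients.
        return {task: next(iter(p.values())) for task, p in polys.items()}
    raise ValueError("values unknown: instrument the code and run it once")
```

Keeping only the coefficients preserves the ratios between task costs when every cost scales with the same monomial, which is the property a scheduler relies on when comparing tasks.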
Of course, this approach will work well, as our experiments suggest, when the performance of a program does not change when some of its input parameters are modified; this is the case for many signal processing applications, where performance is mostly a function of structure parameters such as image size, and is independent of the actual signal (pixel) values upon which the program acts.
We show an example of this final case using a part of the instrumented equake code in Figure 6. The added instrumentation instructions are fprintf statements, the second parameter of which represents the statement number of the following statement, and the third, the value of its execution time for task time instrumentation. For edge cost instrumentation, the second parameter is the number of the incident statements of the edge, and the third, the edge cost polynomial. After execution of the instrumented code, the numerical results of the polynomials are printed in the file instrumented_equake.in. This file is then used as an input to the PIPS implementation of BDSC.
Hierarchical Scheduling
Now that all the information needed by the basic version of BDSC presented above has been gathered, we detail in Algorithm 12 how we adapt it to the hierarchical KDG structure introduced above in order to eventually generate nested parallel code. Hierarchically scheduling a given statement S is seen here as the definition of a hierarchical schedule σ which maps each substatement s of S to σ(s) = (s′, n). If there are enough processor and memory resources to schedule S using BDSC, (s′, n) is a pair made of a scheduled statement s′ = schedule(σ(s)) and the number n = nbclusters(σ(s)) of additional clusters the inner scheduling of s requires, in addition to the one already allocated for S. Otherwise, scheduling is impossible, and the program stops. In a scheduled statement, all sequences are replaced by scheduled KDGs.
A successful call to the hierarchical_schedule(S, P, M, σ) function, which assumes that P is strictly positive, yields a new version of σ that takes into account all substatements of S; only P clusters, with a data size of at most M each, can be scheduled. σ[S → (S′, n)] is the function equivalent to σ except for S, where its value is (S′, n). For a sequence S (other constructs are simply traversed, scheduling information being gathered in nested KDGs), one first computes its scheduled KDG K′ using BDSC; K′ is then traversed along a topological sort-ordered descent: topological_sort(K′) yields a list of stages of computation, each cluster_stage being a list of independent lists L of tasks τ, one L for each cluster κ generated by BDSC for this particular stage in the topological order. The recursive hierarchical scheduling of each statement s = task_statement(τ) may take advantage of at most P′ available clusters, since |cluster_stage| clusters are already reserved to schedule the current stage cluster_stage of tasks for Statement S; it yields a new scheduling function σs. Figure 7 illustrates the various entities involved in the computation of such a scheduling function.
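The staged traversal can be approximated by the following Python sketch, which is a simplification and not Algorithm 12: a stage is taken here to be a level of the DAG (all tasks whose predecessors belong to earlier stages), and the cluster_of mapping stands for a clustering that BDSC would have produced. All names are hypothetical.

```python
def topological_stages(nodes, edges):
    """Split a DAG into successive stages of mutually independent nodes."""
    preds = {n: set() for n in nodes}
    for src, dst in edges:
        preds[dst].add(src)
    stages, done, remaining = [], set(), list(nodes)
    while remaining:
        stage = [n for n in remaining if preds[n] <= done]
        if not stage:
            raise ValueError("cycle detected: the graph is not a DAG")
        stages.append(stage)
        done |= set(stage)
        remaining = [n for n in remaining if n not in done]
    return stages


def cluster_stages(nodes, edges, cluster_of):
    """Group the tasks of each stage into one list per cluster."""
    result = []
    for stage in topological_stages(nodes, edges):
        by_cluster = {}
        for n in stage:
            by_cluster.setdefault(cluster_of[n], []).append(n)
        result.append(list(by_cluster.values()))
    return result


def available_clusters(p, cluster_stage):
    """Clusters left for nested scheduling once a stage has reserved its own."""
    return p - len(cluster_stage)
```

With P = 4 and a stage that uses two clusters, two clusters remain for the recursive scheduling of the statements in that stage.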
...′ −= nbclusters_L;
return σ;
end

function transfer_cost(s, nbclusters)
Ri = in_regions(s);
Ro = out_regions(s);
return (nbclusters == 0) ? 0 : ehrhart(Ri) + ehrhart(Ro);
end

Our approach is top-down in order to yield tasks that are as coarse grained as possible when dealing with sequences: BDSC is called in hierarchical_schedule before substatements are hierarchically scheduled. However, a unique pass over substatements could be suboptimal, since parallelism may also exist within substatements, to be discovered by later recursive calls to hierarchical_schedule; if this parallelism had been known ahead of time, previous values of task_time used by BDSC would have been smaller, which could have had an impact on the higher-level scheduling. In order to address this issue, our hierarchical scheduling algorithm iterates the hierarchical_schedule_step top-down pass on a new KDG K′ in which BDSC takes into account these modified task complexities; iteration continues while K′ provides a smaller KDG scheduling length than K and the iteration limit MAX_ITER has not been reached. Note that one needs to be careful in hierarchical_schedule_step to ensure that each rescheduled substatement s is allocated a number of clusters consistent with the one used when computing its parallel execution time; we check the condition nbclusters′_s ≥ nbclusters_s, which ensures that the parallelism assumed when computing time complexities within s will remain available. A final check needs however to be made before accepting a new hierarchical schedule σs: the additional overhead incurred by the transfer of data, computed using the in and out regions defined in Section 4.2, to a nested parallel task must be offset by the time gained by using the available parallelism.
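The two safeguards described in this paragraph, the iteration limit and the final transfer-cost check, can be sketched in Python as follows. This is an illustrative reading of the text and not the PIPS code: region sizes are passed as plain numbers instead of Ehrhart polynomials, costs are assumed to be already converted to time, and the names accept_nested_schedule and iterate_scheduling are hypothetical.

```python
def transfer_cost(in_size, out_size, nbclusters):
    """No transfer is needed when no additional cluster is used."""
    return 0 if nbclusters == 0 else in_size + out_size


def accept_nested_schedule(seq_time, par_time, in_size, out_size, nbclusters):
    """Accept a nested schedule only if parallelism offsets the data transfer."""
    overhead = transfer_cost(in_size, out_size, nbclusters)
    return par_time + overhead < seq_time


def iterate_scheduling(schedule_once, max_iter):
    """Repeat the top-down pass while the scheduling length keeps decreasing.

    schedule_once(i) returns the scheduling length obtained by pass i.
    Returns the best length and the number of passes that were kept.
    """
    best = schedule_once(0)
    kept = 1
    while kept < max_iter:
        length = schedule_once(kept)
        if length >= best:       # no improvement: stop iterating
            break
        best = length
        kept += 1
    return best, kept
```

The strict comparison in accept_nested_schedule is a design choice of this sketch: a nested schedule that merely breaks even is rejected, since it would consume clusters without reducing the estimated time.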
Note that our top-down, iterative, hierarchical scheduling approach also helps in dealing with limited memory resources. If BDSC fails at first because not enough memory is available for a given task, the hierarchical_schedule_step function is nonetheless called to schedule nested statements, possibly loosening up the memory constraints by distributing some of the work on less memory-challenged additional clusters. This might enable the subsequent call to BDSC to succeed.
Hierarchical scheduling uses extended definitions of task_time and task_data for tasks τ whose statements S = task_statement(τ) are scheduled KDGs KDG(K,S), extending the definitions provided in Section 4.2, which still apply to non-KDG statements. For such a case, K can either be a scheduled KDG (i.e., such that S ∈ domain(σ)), in which case we define:
task_time(τ, σ) = max { cluster_time(κ) : κ ∈ clusters(K), with KDG(K,S) = schedule(σ(S)) }
task_data(τ, σ) = empty region
or, otherwise, one simply uses the previous definitions of task_time and task_data on S provided in Section 4.2. We assume that BDSC and other relevant functions take σ as an additional argument to access the KDGs associated to statement sequences and handle the modified definitions of task_time and task_data.
Theorem 2. The time complexity of Algorithm 12 (hierarchical_schedule) over a statement S is O(k^n), where n is the number of call statements in S and k a constant greater than 1.
Proof. Let t(l) be the worst-case time complexity for our hierarchical scheduling algorithm on the structured statement S of hierarchical level l. Time complexity increases significantly only in sequences; loops and tests are simply managed by straightforward recursive calls of hierarchical_schedule on substatements. For a sequence S, t(l) is proportional to the time complexity of BDSC followed by a call to hierarchical_schedule_step; the proportionality constant is k = MAX_ITER (supposed to be greater than 1).
The time complexity of BDSC for a sequence of m statements is at most O(m^3) (see Theorem 1). Assuming that all subsequences have a maximum number m of (possibly compound) statements, the time complexity for the hierarchical_schedule_step function is the time complexity of the topological sort algorithm followed by a recursive call to hierarchical_schedule, and is thus O(m^2 + m t(l−1)). Thus t(l) is at most proportional to k(m^3 + m^2 + m t(l−1)) ∼ km^3 + km t(l−1). Since t(l) is an arithmetico-geometric series, its analytical value is obtained by unrolling the recurrence t(l) = km^3 + km t(l−1), which gives t(l) = (km)^l t(0) + km^3 ((km)^l − 1)/(km − 1) ∼ (km)^l m^2. Let l_S be the level for the whole Statement S. The worst performance occurs when the structure of S is flat, i.e., when l_S ∼ n and m is O(1); here t(n) = t(l_S) ∼ k^n. Even though the worst-case time complexity of hierarchical_schedule is exponential, we expect, and our experiments suggest, that it will behave more tamely on actual, properly structured code. Indeed, note that l_S ∼ log_m(n) if S is balanced for some large constant m; in this case, t(n) ∼ (km)^log(n), showing a subexponential time complexity.
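The recurrence t(l) = km^3 + km·t(l−1) used in the proof can be checked numerically against its unrolled form. The Python sketch below does so with exact integer arithmetic; the base value t(0) = 1 is an assumption chosen for the check, since the proof does not fix it.

```python
def t_recursive(level, k, m, t0=1):
    """Iterate t(l) = k*m^3 + k*m*t(l-1), starting from t(0) = t0."""
    t = t0
    for _ in range(level):
        t = k * m**3 + k * m * t
    return t


def t_closed(level, k, m, t0=1):
    """Unrolled form: (km)^l t(0) + k m^3 ((km)^l - 1) / (km - 1), with km > 1."""
    q = k * m
    geometric_sum = (q**level - 1) // (q - 1)   # exact: q - 1 divides q^l - 1
    return q**level * t0 + k * m**3 * geometric_sum
```

For a flat statement (m = 1), the closed form reduces to k^l·t(0) + k(k^l − 1)/(k − 1), which grows like k^l and matches the exponential worst case stated in the proof.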
Experiments
The BDSC algorithm presented in this paper has been designed to offer better task parallelism extraction performance for parallelizing compilers than traditional list-scheduling techniques such as DSC. To verify its effectiveness, BDSC has been implemented in PIPS and tested on actual applications written in C. In this section, we provide preliminary experimental BDSC-vs-DSC comparison results based on the parallelization of four such applications, namely ABF, Harris, equake and IS. We chose these particular applications since they are well-known benchmarks and exhibit task parallelism that we hope our approach will be able to take advantage of. They are: (1) ABF (Adaptive Beam Forming), a 1,065-line program that performs adaptive spatial radar signal processing [10]; (2) Harris, a 105-line image processing corner detector [11]; (3) the 1,432-line SPEC benchmark equake [2], which is used in the finite element simulation of seismic wave propagation; and (4) Integer Sort (IS), one of the eleven benchmarks in the NAS Parallel Benchmarks suite [22], with 1,076 lines.
Protocol
We have extended PIPS with our implementation in C of BDSC-based hierarchical scheduling. To compute the static execution time and communication cost estimates needed by BDSC, we relied upon the PIPS run time complexity analysis and a more realistic, architecture-dependent communication cost matrix (Table 1 was computed using the simpler PIPS default cost model). For each code S of our four test applications, PIPS performed automatic parallelization, applying our hierarchical scheduling process hierarchical_schedule(S, P, M, ⊥) (using either BDSC or DSC) on these sequential programs to yield σ. PIPS automatically generated an OpenMP [23] version from the scheduled KDGs in σ(S), using omp task directives; another version, in MPI [20], was generated from the scheduled KDGs. We also applied DSC in the hierarchical scheduling process of these applications and generated the corresponding OpenMP and MPI codes. Compilation times for these applications were quite reasonable, the longest (equake) being 84 seconds. In this last instance, most of the time (79 seconds) was spent by PIPS to gather semantic information such as regions, complexities and dependences; our prototype implementation of BDSC is only responsible for the remaining 5 seconds.
We ran all these parallelized codes on two computing systems, one with shared memory and one with distributed memory. To increase available coarse-grain task parallelism in our test suite, we have used both unmodified and modified versions of our applications. We tiled and fully unrolled the four most costly loops in ABF and equake; the tiling factor for the BDSC version is the number of available processors, while we had to find the proper one for DSC, since DSC puts no constraints on the number of needed processors but returns the number of processors its scheduling requires. For Harris and IS, our experiments have looked at both tiled and untiled versions of the applications.
Experiments on Shared Memory Systems
We measured the execution time of the parallel OpenMP codes on P = 1, 2, 4, 6 and 8 cores of a host Linux machine with two quad-core AMD Opteron sockets (8 cores in total) running at 2.4 GHz, with M = 16 GB of RAM. Figure 8 shows the performance results of the generated OpenMP code for the two versions, scheduled using DSC and BDSC, of ABF and equake. The speedup data show that the DSC algorithm does not scale when the number of cores is increased; this is due to the generation of more clusters with empty slots than with BDSC, a costly decision given that, when the number of clusters exceeds P, they have to share the same core as multiple threads.

Figure 10 shows the hierarchically scheduled KDG for Harris, generated automatically with PIPS using the Graphviz tool for three cores without tiling any loops (we used three cores because the maximum parallelism in Harris is three, as can be seen in the graph). The first two columns for each subchart in Figure 11 present the speedup obtained using P = 3 for two parallel versions: BDSC with and BDSC without tiling of the kernel CoarsitY (we tiled by 3). The performance is given using three different input image sizes: 1024 × 1024, 2048 × 1024 and 2048 × 2048. The best speedup corresponds to the tiled version with BDSC because, in this case, the three cores are fully loaded. The DSC version (not shown in the figure) yields the same results as our versions because the code can be scheduled using three cores.

In Figure 12, the first five subcharts show the performance results of the generated OpenMP code on the NAS benchmark IS after applying BDSC. The maximum task parallelism without tiling in IS is two, which is shown in the first subchart; the other subcharts are obtained after tiling. The program has been run with three IS input classes (A, B and C [22]).
The subpar performance of our implementation for Class A programs is due to the large task creation overhead, which drowns the potential parallelism gains, even more limited here because of the small size of Class A data.
Experiments on Distributed Memory Systems
We measured the execution time of the parallel codes on P = 1, 2, 4 and 6 processors of a host Linux machine with 6 dual-core Intel(R) Xeon(R) processors running at 2.5 GHz, with M = 32 GB of RAM per processor. Figure 9 presents the speedups of the parallel MPI vs. sequential versions of ABF and equake using P = 2, 4 and 6 processors. As before, the DSC algorithm does not scale when the number of processors is increased, since the generation of more clusters with empty slots leads to higher process scheduling costs on processors and a larger communication volume between them.

The last two columns of each subchart of Figure 11 present the speedups of the parallel MPI vs. sequential versions of Harris using three processors. The tiled version with BDSC gives the same result as the non-tiled version since the communication overhead is so large when the three tiled loops are scheduled on three different processors that BDSC scheduled them on the same processor; this thus led to a schedule equivalent to the one of the non-tiled version. Compared to OpenMP, the speedups decrease when the image size is increased because the amount of communication between processors increases. The DSC version (not shown in the figure) gives the same results as the BDSC version because the code can be scheduled using three processors.

In Figure 12, the last four subcharts show the performance results of the generated MPI code on the NAS benchmark IS after application of BDSC. The same analysis as the one for OpenMP applies here, in addition to communication overhead issues.
Scheduling Robustness
Since our BDSC scheduling heuristic relies on numerical approximations of the execution times and communication costs of tasks, one needs to assess its sensitivity to the accuracy of these estimations. Since a mathematical analysis of this issue is made difficult by the heuristic nature of BDSC and, in fact, of scheduling processes in general, we provide below experimental data that show that our approach is rather robust.
In practice, we ran multiple versions of each application using various static execution and communication cost models:
• the naive variant, in which all execution times and communication costs are supposed constant (only data dependences are enforced during the scheduling process);
• the default BDSC cost model described above;
• a biased BDSC cost model, where we modulated each execution time and communication cost value randomly by at most ∆% (the default BDSC cost model would thus correspond to ∆ = 0).
Our intent, with the introduction of different cost models, is to assess how small to large differences in our estimation of task times and communication costs impact the performance of BDSC-scheduled parallel code. We would expect that parallelization based on the naive variant cost model would yield the worst schedules, thus motivating our use of complexity analysis for parallelization purposes if the schedules that use our default cost model are indeed better. Adding small random biases to task times and communication costs should not modify the schedules too much (to demonstrate stability), while adding larger ones might, showing the quality of the default cost model used for parallelization. Table 2 provides, for each application (Harris, ABF, equake and IS) and execution environment (OpenMP and MPI), the worst execution time obtained within batches of about 20 runs of programs scheduled using the naive, default and biased cost models. For this last case, we only kept in the table the entries corresponding to significant values of ∆, namely those at which, for at least one application, the running time changed. So, for instance, when running ABF on OpenMP, the naive approach run time is 321 ms, while BDSC clocks at 214 ms; adding random increments to the task communication and execution estimations provided by our cost model (Section 4.2) of up to, but not including, 80% does not change the scheduling, and thus the running time. At 80%, the running time increases to 230 ms, and reaches 297 ms when ∆ = 3,000.
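One possible reading of the biased cost model is sketched below in Python. The text does not specify the random distribution; the sketch assumes a uniform, non-negative relative increment of at most ∆%, and the function names and seed parameter are hypothetical.

```python
import random


def bias(value, delta_percent, rng):
    """Increase value by a random relative amount of at most delta_percent."""
    return value * (1 + rng.uniform(0, delta_percent) / 100)


def biased_model(costs, delta_percent, seed=0):
    """Apply an independent random bias to every time or cost estimate.

    costs maps a task or edge name to its default estimate; the seed makes
    a batch of experiments reproducible.
    """
    rng = random.Random(seed)
    return {name: bias(value, delta_percent, rng) for name, value in costs.items()}
```

With ∆ = 0 the biased model coincides with the default one, which matches the remark made when the three cost models were introduced.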
As expected, the naive variant always provides schedules that have the worst execution times, thus motivating the introduction of performance estimation in the scheduling process. Even more interestingly, our experiments show that one needs to introduce rather large task time and communication cost estimation errors, i.e., values of ∆, to make the BDSC-based scheduling process switch to less efficient schedules. This set of experimental data thus suggests that BDSC is a rather useful and robust heuristic, well adapted to the efficient parallelization of scientific applications.
Related Work
In this section, we survey the main existing list-scheduling algorithms and review the key approaches to automate the parallelization of programs using different scheduling policies; we compare them to BDSC and our BDSC-based parallelization process.
Scheduling Algorithms
Given the breadth of the literature on scheduling, we limit this presentation to heuristics that implement static list-scheduling processes. We compare BDSC with eight scheduling algorithms for a bounded number of clusters: HLFET, ISH, MCP, HEFT, CEFT, LPGS, LSGP and ELT.
The Highest Level First with Estimated Times (HLFET) [1] and Insertion Scheduling Heuristic (ISH) [17] algorithms use static blevels for ordering; scheduling is performed according to a descending order of blevels. To schedule a task, they select the cluster that offers the earliest execution time, using a non-insertion approach, i.e., not taking into account idle slots within existing clusters to insert that task. If scheduling a given task introduces an idle slot, ISH adds the possibility of inserting from the ready list tasks that can be scheduled to this idle slot. Since, in both algorithms, only blevels are used for scheduling purposes, optimal schedules for fork/join graphs cannot be guaranteed.
The Modified Critical Path (MCP) algorithm [28] uses the latest start times, i.e., the critical path length minus blevel, as task priorities. It constructs a list of tasks in an ascending order of latest start times, and searches for the cluster yielding the earliest execution using the insertion approach. As before, it cannot guarantee optimal schedules for fork/join structures.
The Heterogeneous Earliest-Finish-Time (HEFT) algorithm [27] selects the cluster that minimizes the earliest finish time using the insertion approach. The priority of a task, its upward rank, is the task blevel. Since this algorithm is based on blevels only, it cannot guarantee optimal schedules for fork/join structures.
The Constrained Earliest Finish Time (CEFT) [16] algorithm schedules tasks on heterogeneous systems. It uses the concept of constrained critical paths (CCPs) that represent the tasks ready at each step of the scheduling process. CEFT schedules the tasks in the CCPs using the finish time in the entire CCP. The fact that CEFT schedules critical path tasks first cannot guarantee optimal schedules for fork and join structures even if sufficient processors are provided.
Contrary to the five proposals above, BDSC preserves, when no resource constraints exist, the DSC characteristics of optimal scheduling for fork/join structures, since it uses the critical path length for computing dynamic priorities, based on blevels and tlevels. HLFET, ISH and MCP guarantee that the current critical path will not increase, but they do not attempt to decrease the critical path length; BDSC decreases the length of each task DS and starts with a ready node to simplify the computation time of new priorities. When resource scarcity is a factor, BDSC introduces a simple, two-step heuristic for task allocation: to schedule tasks, it first searches for possible idle slots in already existing clusters and, otherwise, picks a cluster with enough memory. Our experiments suggest that such an approach provides good schedules.
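The priorities compared in this section can be made concrete with a short Python sketch. It uses the usual definitions from the list-scheduling literature, which are assumed here rather than quoted from this paper: the blevel of a task is the length of the longest path from the task to an exit node (task time included), its tlevel is the length of the longest path from an entry node to the task (task time excluded), and edge costs count on both. The dictionaries and function names are hypothetical.

```python
def compute_levels(time, cost):
    """Return (tlevel, blevel) for a DAG.

    time maps a task to its execution time; cost maps an edge (src, dst)
    to its communication cost.
    """
    succs = {n: [] for n in time}
    preds = {n: [] for n in time}
    for src, dst in cost:
        succs[src].append(dst)
        preds[dst].append(src)
    tlevel, blevel = {}, {}

    def b(n):
        if n not in blevel:
            blevel[n] = time[n] + max(
                (cost[n, s] + b(s) for s in succs[n]), default=0)
        return blevel[n]

    def t(n):
        if n not in tlevel:
            tlevel[n] = max(
                (t(p) + time[p] + cost[p, n] for p in preds[n]), default=0)
        return tlevel[n]

    for n in time:
        b(n)
        t(n)
    return tlevel, blevel


def critical_path_length(tlevel, blevel):
    """Length of the longest path of the graph."""
    return max(tlevel[n] + blevel[n] for n in tlevel)


def dominant_sequence(tlevel, blevel):
    """Tasks whose priority tlevel + blevel equals the critical path length."""
    cp = critical_path_length(tlevel, blevel)
    return [n for n in tlevel if tlevel[n] + blevel[n] == cp]


def latest_start_times(tlevel, blevel):
    """MCP-style priorities: critical path length minus blevel."""
    cp = critical_path_length(tlevel, blevel)
    return {n: cp - blevel[n] for n in blevel}
```

On a fork/join graph, a blevel-only ordering cannot distinguish two branches with equal blevels, whereas the sum tlevel + blevel identifies the branch that currently lies on the dominant sequence.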
The Locally Parallel-Globally Sequential (LPGS) [19] and Locally Sequential-Globally Parallel (LSGP) [13] algorithms are two techniques that, from a schedule for an unbounded number of clusters, remap the solutions to a bounded number of clusters. In LSGP, clusters are partitioned into blocks, each block being assigned to one cluster (locally sequential). The blocks are handled separately by different clusters, which can be run in parallel (globally parallel). LPGS links each original one-block cluster to one processor (locally parallel); blocks are executed sequentially (globally sequential). BDSC computes a bounded schedule on the fly and covers many more scheduling possibilities than LPGS and LSGP.
The Extended Latency Time (ELT) algorithm [26] assigns tasks to a parallel machine with shared memory. It uses the attribute of synchronization time instead of communication time, because the latter does not exist in a machine with shared memory. BDSC targets both shared and distributed memory systems.
Parallelization Tools
In this section, we review several approaches that intend to automate the parallelization of programs using different granularities and scheduling policies. Given the breadth of literature on this subject, we limit this presentation to approaches that focus on static list-scheduling methods.
Sarkar's work on the partitioning and scheduling of parallel programs [25] for multiprocessors introduced a compile-time method where a GR (Graphical Representation) program is partitioned into parallel tasks at compile time. A GR graph has four kinds of nodes: "simple", to represent an indivisible sequential computation, "function call", "parallel", to represent parallel loops, and "compound", for conditional instructions. Sarkar presents an approximation parallelization algorithm. Starting with an initial fine granularity partition, P0, tasks (chosen by heuristics) are iteratively merged until the coarsest partition Pn (with one task containing all nodes) is reached, after n iterations. The partition Pmin with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is chosen. For scheduling, Sarkar introduces the EZ (Edge-Zeroing) algorithm that uses blevels for ordering: it is based on edge weights for clustering; all edges are examined from the largest edge weight to the smallest; it then proceeds by zeroing the highest edge weight if the completion time decreases. While this algorithm is based only on the blevel for an unbounded number of processors and does not recompute the priorities after zeroings, BDSC adds resource constraints and is based on both blevels and dynamic tlevels.
The OSCAR Fortran Compiler [15] is used as a preprocessor from Fortran to parallelized OpenMP Fortran. OSCAR partitions a program into a macro-task graph, where nodes represent macro-tasks of three kinds, namely basic, repetition and subroutine blocks. The coarse grain task parallelization proceeds as follows. First, the macro-tasks are generated by decomposition of the source program. Then, a macro-flow graph is generated to represent data and control dependences on macro-tasks. The macro-task graph is subsequently generated via the analysis of parallelism among macro-tasks using an earliest executable condition analysis that represents the conditions on which a given macro-task may begin its execution at the earliest time, assuming precedence constraints. If a macro-task graph has only data dependence edges, macro-tasks are assigned to processors by static scheduling. If a macro-task graph has both data and control dependence edges, macro-tasks are assigned to processors at run time by a dynamic scheduling routine. In addition to dealing with a richer set of resource constraints, BDSC targets both shared and distributed memory systems with a cost model based on communication, used data and time estimations.
Pedigree [21] is a compilation tool based on the program dependence graph (PDG). The PDG is extended by adding a new type of node, a Par node, which groups children nodes reachable via the same branch conditions. Pedigree proceeds by estimating a latency for each node and a weight for each data dependence edge in the PDG. The scheduling process orders the children and assigns them to a subset of the processors. For scheduling, nodes with minimum early and late times are given highest priority; the highest-priority ready node is selected for scheduling based on the synchronization overhead and latency. While Pedigree operates on assembly code, PIPS and our extension for task parallelism using BDSC offer a higher-level, source-to-source parallelization framework. Moreover, Pedigree-generated code is specialized for symmetric multiprocessors only, while BDSC targets many architecture types, thanks to its resource constraints and cost models.
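As an illustration of a priority based on early and late times, the Python fragment below computes earliest and latest start times on a small weighted DAG and orders nodes accordingly. This is only a sketch under stated assumptions: the graph encoding, the function names and the tie-breaking rule (early time first, then late time) are hypothetical and are not taken from Pedigree itself, and synchronization overhead is ignored.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def early_late_times(latency, edges):
    """Return (early, late) start times for each node of a weighted DAG."""
    preds = {n: [] for n in latency}
    succs = {n: [] for n in latency}
    for (u, v), w in edges.items():
        preds[v].append((u, w))
        succs[u].append((v, w))
    # TopologicalSorter expects a mapping node -> predecessors.
    topo = list(TopologicalSorter(
        {n: [p for p, _ in preds[n]] for n in latency}).static_order())

    # Earliest start: all predecessors finished and their data received.
    early = {}
    for n in topo:
        early[n] = max((early[p] + latency[p] + w for p, w in preds[n]),
                       default=0)
    total = max(early[n] + latency[n] for n in latency)

    # Latest start that does not delay the overall completion time.
    late = {}
    for n in reversed(topo):
        late[n] = min((late[s] - w for s, w in succs[n]),
                      default=total) - latency[n]
    return early, late


def priority_order(latency, edges):
    """Smallest early time first, ties broken by smallest late time."""
    early, late = early_late_times(latency, edges)
    return sorted(latency, key=lambda n: (early[n], late[n]))


# Illustrative graph with made-up latencies and edge weights.
latency = {"a": 2, "b": 3, "c": 3, "d": 2}
edges = {("a", "b"): 4, ("a", "c"): 1, ("b", "d"): 4, ("c", "d"): 1}
print(priority_order(latency, edges))
```

Nodes whose early and late times coincide lie on the critical path of the graph, which is why such times are a natural basis for scheduling priorities.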
The SPIR (Signal Processing Intermediate Representation) compiler [3] takes a sequential dataflow program as input and generates a multithreaded parallel program for a multicore system. First, SPIR builds a stream graph where a node corresponds to a kernel function call or to the condition of an "if" statement; an edge denotes a transfer of data between two kernel function calls or a control transfer by an "if" statement (true or false). Then, given a stream graph and a target platform, the task scheduler assigns each node to a processor in the target platform. It allocates stream buffers and generates DMAs under given memory and timing constraints. The degree of automation of BDSC is higher than SPIR's, because the latter needs several keyword extensions in addition to the C code to denote the streaming scope within applications. Also, the granularity in SPIR is a function, whereas BDSC uses several granularity levels.
Conclusion
This paper presents the resource-constrained, list-scheduling heuristic BDSC, which extends the DSC (Dominant Sequence Clustering) algorithm, and its practical use via hierarchical scheduling when parallelizing scientific applications. This extension improves upon DSC by dealing with parallel architectures constrained in both memory size and number of processors, while still yielding faster task schedules thanks to efficient task-to-cluster mapping. The use of BDSC benefits from a sophisticated execution and communication cost model, based on either static code analysis or a dynamic instrumentation-based assessment tool.
Preliminary experimental results suggest that BDSC provides more efficient schedules than DSC. We illustrate the positive impact of the integration of BDSC within the PIPS automatic parallelization compiler infrastructure using the signal processing application ABF (Adaptive Beam Forming), the image processing application Harris, the SPEC benchmark equake and the NAS parallel benchmark IS on both shared and distributed memory systems.
Future work might address the benefits that a hybrid approach mixing BDSC and dynamic scheduling could offer. It might also be interesting to look for a more efficient hierarchical processor allocation strategy, which would address load balancing and task granularity issues, in order to yield an even better BDSC-based parallelization process.
