The developmentof parallellarge-scaleapplicationcodes is a challenging problem, because it requires a combination of application knowledge, understanding of the various aspects of parallelism involved, and software engineering. Moreover, the size of largescale applications usually is input-dependent, and the parallel algorithm needs to be scalable to various numbers of processors. This paper combines the theoretical as well as the practical aspects required for the understanding, realisation, and manageability of the development process of parallel large-scale applications. It provides a formal framework in which their (partly machine-model speci c) potential parallelism can be expressed and requirements on scheduling and implementation are given. The paper further describes a practical software-engineering development approach build on this framework, and discusses and illustrates its usage in two large case studies.
INTRODUCTION
Current research on parallel software development largely concentrates on parallel language formalisms expressing parallelism, and on compiler techniques supporting mainly lowlevel parallelisations. The well-known parallel-speci c software aspects, like e ciency, portability, and scalability are studied. Speci c software engineering requirements on software (e.g. maintainability) and on the development process (low costs, structuredness, manageability) are mainly studied for the development of large sequential software. The development of high-quality parallel large-scale software, however, has to combine both parallel-speci c software requirements and software engineering requirements. It therefore demands a well balanced engineering methodology.
For the development of e cient parallel large-scale applications, it is essential to be able to fully express the potential parallelism inherent in such applications, taking the possibilities o ered by the various machine models into account. In our formal framework, the speci c possibilities o ered by the distributed memory and shared memory machine models are discussed. These two machine models are particularly wellsuited for the exploitation of the type of parallelism involved, and are implemented by many of today's high performance architectures. A formal, but understandable, notation is given to express the (partly machine-model speci c) potential parallelism inherent in a large-scale application. The restrictions and di culties of scheduling (task-allocation) of parallel algorithms are explained and formalised, and, based on this, a high-level protocol is derived giving general implementation requirements, independent of the number of processors and the speci c schedule.
Strongly supported by this formal framework, a practical development approach has been developed and tested in a number of large case studies. At the start of the proposed development process, a hierarchical description of the application in terms of subsystems and a sound execution order (sequential or parallel) is assumed available, without making any assumptions about the describing formalism. Our approach reconsiders, based on parallelism considerations, the given structure in order to achieve an e cient parallel structure. This is done by rst decomposing the original structure up to {at most{ the elementary subsystems inherent in the application problem, and then clustering them. The initial description can be derived by well-known software engineering methods and techniques (e.g. 8]). Moreover, for practical (large-scale) application problems it often is already available in the form of a sequential code (or some earlier parallel version). Our approach then fully exploits the work done in those earlier developments.
Our approach has an easy-to-understand structure, and guides the development through a stepwise analysis, parallelisation, and optimisation phase, each of which is supported by the formal framework. The rst phase analyses the subsystems, the necessary restrictions on their execution order, and other application properties necessary to evaluate parallelisation options. The detection and actual evaluation of these options takes place in the second phase, where the parallelisation choices are made, and, based on the rules given in the formal framework, stepwise implemented and veri ed. Finally, phase three detects performance problem areas and applies appropriate optimisation techniques. As an intrinsic part of the approach, the design notation and documentation are produced during these development phases. The division into phases and various steps, guided by the underlying formal framework, supports a structured and manageable development process which guards the parallel-speci c requirements mentioned above.
The applicability of the approach, the guidance o ered by the formal framework, and the manageability of the development process are illustrated by two practical case studies. A large-scale distributed memory Navier-Stokes solver has been developed by parallelising an existing sequential version of the solver ( 5] ). Both a shared and a distributed memory version of a solver of large sparse matrices have been developed ( 7] ). The analysis, parallelisation, and optimisation phases of the three developments are discussed in detail and compared.
FORMAL FRAMEWORK
It is explained that the familiar notion of DAGs is insucient to fully express potential parallelism. The notion is extended speci cally for distributed memory (to DMDAG) as well as for shared memory (SMDAG) machine models. Schedules of DMDAGs and SMDAGs are discussed within the formal framework, and, based on their formalisation, a high-level protocol is derived giving general implementation requirements, independent of the number of processors and the speci c schedule. The latter both for the case of unique allocation of parallel tasks as for the case of (possible) multiallocation, for which two di erent protocols are given.
DMDAGs
Traditionally, parallel algorithms are expressed by Directed Acyclic Graphs ( 11] ). The nodes in a DAG correspond to tasks, where the execution time of task u is given as x(u).
The arrows in a DAG correspond to a prescribed execution order. An arrow (u; v) can cause a delay (u; v) if tasks u and v are scheduled at di erent processors. This delay represents, depending on the machine model in mind, synchronisation time or communication time.
DAGs with uniform x's and uniform 's can represent, for instance, the low-level parallelism possible between elementary statements. For higher-level parallelism, the execution times of the various tasks will di er. Whether the delay can be chosen uniform depends on the machine model. For shared memory, for instance, the arrows represent synchronisation only (communication is done during execution of tasks), and, consequently, the delays can be chosen uniform. For distributed memory, the arrows represent the data communication that is required if tasks are scheduled at di erent processors. In the latter case, the delays will depend on the amount of data involved.
A number of aspects of parallel algorithms can, however, not be expressed by DAGs. The selection of tasks based on guards (which might become known only during execution), i.e. conditional execution, cannot be expressed. Furthermore, the arrows in DAGs express only one type of dependence: a strict execution order, independent of the scheduling. In order to exploit all available parallelism, it becomes important for distributed memory to distinguish two types of dependences: those that always induce an execution order (hard dependences), and those that induce an execution order only if the same memory is used (soft dependences).
Both types of dependences indicate the use of a common data set. For a soft dependence this usage is such that an execution order is required if the same copy of the common data set (the same memory) is used, and no order otherwise.
Such a soft dependence from task u to task v is required if, for instance, u is due to consume an old version of data set d and v produces a new version. A hard dependence from u to v indicates that v requires the version handed over by u. This is the case if, for instance, task u produces the version of d that is required by v. A hard dependence therefore induces an execution order`v after u', and, if a di erent copy of d is used (if u and v are allocated to di erent processors), it also induces a communication action. Notice that a common data set not necessarily induces a dependence: in case both u and v only consume d, without changing it, their execution order can be unrestricted, and no dependence is required.
These types of dependences resemble the ones used in data dependence graphs ( 2] ) with the main di erence that here they refer to high-level tasks and common data sets instead of to statements and variables respectively. The notion of DAG will be extended with these dependences, and with control dependences that deal with selections. The dependences in this extended notion enforce correct execution order and correct data-usage. The requirements on the implementation of this extended notion are discussed later on in this section. The next section discusses how these structures can be derived and transformed, and illustrates the use of the dependences (Phase I).
In order to be able to express and exploit the parallelism that is inherent in an algorithm, we de ne a DMDAG as a quintuple hT; S; tt; ts; spi, where T is the set of tasks; S is the set of selections; tt T T denotes dependences between tasks, where u?! h v denotes a hard dependence, and u?! s v a soft dependence from u to v; ts T S denotes dependences from tasks to selections. These dependences are all hard; sp S P(T S) denotes control dependences.
These arrows may be labeled with a guard. Scheduling of DMDAGs A scheduling assigns, for each selection-alternative, the tasks and selections to processors 1 to m (for natural m). For now, we assume that each task v is assigned to a unique processor. This is based on the assumption that the high-level tasks have a larger execution time than the delays of the outgoing dependences. How to deal with multi-allocation is explained later. A selection s may be allocated to multiple processors.
A scheduling may di er, to some extend, for the various selection-alternatives. We therefore denote, for a selection- A control dependence (s; U) induces an execution order between s and all tasks in U (including, possibly, a control communication). A scheduling is correct if the allocation of tasks and selections does not depend on the outcome of selections that are, according to the above rules, executed later. That is, if for any A1, task v must {according to the rules above{ be executed before selection s, then for all A2 with v 2 T(A2) : TP(A2; v) = TP(A1; v) (similar for selections instead of tasks). Roughly, this means that for the ancestors of s in the DMDAG, the allocation to processors must be equal for all outgoing arrows of s, although for soft dependence ancestors this rule is too restrictive.
Schedules that allocate tasks or selections di erently for di erent selection-alternatives are dynamic; they are, apparently, determined during execution of the algorithm.
For a given scheduling, the minimal makespan (overall execution time) can be calculated relatively easy, using techniques as explained in 11] .
Although the restrictions given above o er a straightforward means of detecting correct schedules, they do not provide e cient algorithms to detect a schedule with a good, or an optimal, makespan. Determining an optimal schedule, and thereby fully exploiting the available parallelism, is known to be NP-complete for DAGs ( 11] ). E cient algorithms to nd a schedule for DAGs with a good makespan are restricted to special cases (uniform execution times and delays; potentially unbounded number of processors 11, 16] ). For the more general structures presented here, concessions will certainly have to be made to handle the scheduling problem in a manageable way. In the 3-phase development approach (Section 3), we propose a practical scheduling strategy to do this.
Implementation Requirements
In this subsection we assume that a schedule is given (or produced during execution), and we analyse the consequences of the dependences on the execution of tasks and selections. This will lead to general high-level requirements on the implementation of the DMDAG, independent of the number of processors and the speci c schedule.
The dependences will lead to execution order restrictions, and, in case of hard dependences, also to communication actions between processors. The communication is realised by send and receive actions. In order to ensure that arriving data does not interfere with the usage of a previous version of the data, the following assumption on the communication protocol is made. The updating of the local copy of a data set as a result of a received renewed version takes place the moment the ( rst) task that requires this renewed version starts. This assumption is necessary since the arrival of data at a processor can coincide with (or even precede) the usage of the old version of the data. The implementation of the assumption (by means of, for instance, a transient local bu er-copy' of the received data) is a lower level concern related to the network{processor communication.
Assume that a schedule is available as TP(v) for tasks, Notice that the control ow (send/receive of results of selections) is purely inter-processor communication. We will see that the data ow (results of tasks) is communication between speci c tasks or between a task and a selection.
A processor is allowed to execute only tasks v with TP(v) = procnr. Whether a process v can be executed depends further on the incoming dependences tt only. As explained above, all dependences between tasks scheduled to the same processor must induce an execution order. Con- ).
The consequences of the outgoing dependences tt and ts for a task v lead to a protocol that has to be performed after task v is completed. First of all, in view of the outgoing dependences to tasks and selections scheduled on the same processor, the value Finished(v) must be set on true, which is the only requirement w.r.t. these dependences. The external outgoing soft dependences do not lead to any requirement, since the tasks are not related. For external outgoing hard dependences (to tasks as well as to selections), the data must be send to the processors to which the tasks or selections are scheduled. The above leads to the following exit protocol for task v. Multi-allocation This subsection discusses the consequences of allowing the scheduling of tasks to multiple processors. If two paths via a speci c task are critical, multi-allocation may be a sensible strategy to avoid the delay caused by the outgoing dependences of the task. In the case of possible multi-allocation, a soft dependence u?! s v induces a local execution order only between the copies of u and v that are scheduled to the same processor. In case of a hard dependence u?! h v, task v requires data produced by task u. The data produced for v by the various copies of u can arrive more than once at each processor that executes v. Let v be scheduled to (possibly among others) processor j. If processor j does not execute u itself, only the rst arrival of receive(u;v) is assumed to be regarded; all others are ignored. If u and v are both scheduled to processor j, but the execution of u has not yet been started when the rst receive(u;v) arrives, j will not execute u itself and ignores all following receive(u;v)'s. In case the execution of u has already been started, all receive(u;v)'s are ignored. Notice that the above means that hard dependences induce execution order only between the rst copy of the sender and the receiver.
In the sequel, TP(u) denotes the set of processors task u is allocated to. Whether the local execution of a task u has started is recorded as Start(u), which is initially false. Notice that the value Finished(u) can be used for this purpose only if the processors are strictly sequential in the execution of tasks and the receipt of data-messages. In order to avoid unnecessary execution of a task u, the local version of TP(u) may be changed. Based on the explanation above, the following protocol is derived for the arrival of receive(u;v) at processor procnr. exit-statement1(v) The protocol for selections need only be altered by changing condition`TP(u) = procnr' into`procnr 2 TP(u)' in G(s). Notice the di erence in handling multi-allocation for selections and for tasks. Unnecessary recomputation of tasks is avoided, since tasks have, in general, a large execution time. The double receipts are ignored, since they can interfere with the proper execution of tasks writing data. For selections, the execution time is generally small, the delay caused only by synchronisation tokens and therefore negligible, and the double execution of receives is harmless. Of course, the protocol above can also be used for selections.
SMDAGs
For shared memory systems, all dependences are hard dependences which are to be regarded as synchronisation, as a result of which the delay can be chosen uniform. The notion of DMDAG can easily be weakened into a notion of SMDAG by using only hard dependences. Each DMDAG can be transformed into the corresponding SMDAG by changing all soft dependences into hard ones. For the correctness of a scheduling, this implies that the rough rule given above is exact in this case, since all dependences are precedences. The implementation requirements with unique location of tasks are the same as given above (though the superscript h is meaningless). The only di erence is the interpretational one that instead of data, synchronisation tokens are being send and received. Multi-allocation of tasks does not seem pro table for shared memory systems, since it does not avoid any (possibly large) data communication. Furthermore, the (hard) dependences u?! h vs must induce an execution order between all copies of u and all copies of vs, since the recomputation of a v at one processor may interfere with the (shared) data used by another. That is, the execution of any vs can be started only after the slowest u is completed. Consequently, the protocols with multi-allocation given above allow too much parallelism and should be altered when used for shared memory.
THREE-PHASE DEVELOPMENT PROCESS
A hierarchical description of the application in terms of its subsystems and a sound execution order is assumed to be available. This description is used in Phase I to infer a DMDAG, which exposes the potential parallelism inherent in the application. An optimal exploitation of the exposed parallelism can only be obtained through the choice of an appropriate scheduling of the DMDAG. Phase II explains how the scheduling-framework can be used to determine an efcient scheduling, and how the implementation-framework can guide the implementation of a scheduling, both in a manageable way. The resulting DMDAG and its scheduling are (partly) based on estimated input-dependent execution times and communication times. Due to these estimations, a speci c scheduling may lead to performance problems which can only be detected and analysed during execution. Phase III discusses how to detect potential performance problem areas and, if possible, how to improve the given scheduling. It explains the guidance o ered by the formal framework in the choice, implementation, and veri cation of optimisations, and discusses the various optimisation techniques used, which are expressed in the formal framework. Figure 1 shows the development process structure. Each phase contains a local re nement iteration guarded by a stop-criterion. This section explains the various steps and their connections. Phase I: Analysis Large-scale scienti c and engineering applications usually comprise several subsystems, such as (numerical) modelings and algorithms. These subsystems are identi ed in Step 1,  and their original (sound) execution order is recorded, including conditional execution of subsystems (selections). This is done using the hierarchical description of the application (> in Figure 1 ).
Step 2 determines for which subsystems a further decomposition is impossible or meaningless, and classi es them as potential tasks. Decomposing a subsystem S is impossible if > is not su ciently detailed w.r.t. S, and is meaningless if the computational intensiveness of S is too small. The latter depends on a threshold value Tdec, which is chosen on forehand based on two ghting arguments: manageability and optimal parallelism exploitation. Computational size usually is input-dependent and therefore di cult to estimate. However, since we only need to know if this size is larger than Tdec or not, rough estimates are su cient until subsystems are decomposed far enough. Therefore, the estimates used here are based on relative execution times (if a code is available) or on complexity orders (derived from >).
For each potential task, a more accurate estimate of its computational intensiveness is made in Step 3. This estimation uses information about the internal structure (possibly from >); it can be based on functional or procedural analysis depending on the description given in >, or on experimental analysis if a code is (already) available.
Step 4 classi es the current potential tasks as follows.
If the computational size estimate (derived in Step 3) of a potential task is larger than Tdec and there is a su ciently detailed description of the potential task available in >, then this potential task is denoted as a subsystem (again). As a result, the subsystem will be analysed for further decomposition during a subsequent iteration in Phase I. If the computational size estimate is larger than Tdec and there is not a su ciently detailed description, then the potential task is denoted as a task. If a subsystem contains a number of relatively small potential tasks ( Tdec), then these subsystems must be composed. This is done for manageability purposes. Of course, a composition is not allowed to be larger than Tdec. Since a composition de nes a system which
was not yet present in >, a new description must be added.
In order to minimise the e ort involved, a composition is only allowed to contain potential tasks which are consecutive w.r.t. the given execution order. Each composition is considered to be a potential task, or a task, depending on its computational size. Of course, the size of a composition can be determined by simply adding the sizes of its comprising potential tasks. If the computational size of a potential task is estimated to be smaller than Tdec in Step 3 and it is not (and will not be) composed with surrounding potential tasks, then this potential task is denoted as a task.
Steps 1 to 4 are repeated until all subsystems are denoted as tasks. So the nal result of these steps is a description of the application in terms of tasks and their given execution order. Descriptions of the individual tasks are recorded in the design documentation required for maintanance. They follow relatively easy from >, since each task is either an original subsystem or a composition of consecutive original subsystems. Of course, the computational intensiveness estimates are included in these descriptions.
Step 5 investigates, using >, the data-usage of tasks and selections. This information is required in Step 6 to determine the dependences and the delay (cf. Table 1) , and in Step 10 to implement the send actions. For tasks, Step 5 determines the common data sets, their sizes, and the type of usage of them in terms of input and output. For selections, it is analysed which tasks produce the data used.
The information derived in the previous steps is used in Step 6 to derive the corresponding DMDAG, the execution times x of tasks, and the delays of hard dependences (Table 1 gives an overview). It is also used in Phase II to transform the derived DMDAG to gain additional potential parallelism at the costs of memory-use. These transformations are explained at the end of this subsection.
The tasks (set T), selections (set S), and the control (sp-) dependences follow directly from the analysis done in Step 4.
The ts-dependences follow from the data-ow from tasks to selections: if task t produces data for selection s, then t!s. The tt-dependences follow from the original execution order and the data-usage type with the following rules: (R1) A At the end of this subsection, these rules are explained further and illustrated in an example.
Notice that a (ts or tt) dependence u!vs can only exist if u precedes vs in the original execution order. Consequently, the soundness of the original order guarantees that the resulting tasks, selections, and dependences form an acyclic graph. As a result, the resulting structure is a DMDAG. It speci es the potential parallelism in the application.
Notice that a wrongly included (or wrongly classi ed as hard) dependence may lead to reduced e ciency. A wrongly omitted (or wrongly classi ed as soft) dependence, however, may lead to a too optimistic speci cation of the potential parallelism, and, hence, to severe bugs. Therefore, careful conservatism is required while determining the data-usage types (in Step 5) and the dependences (in Step 6 
Step 5 and for hard in/output of tasks Step 5 dep.'s (Step 6) input of selections Step 5 ts-dependences (Step 6) DMDAG
Step 6 (Phases II and III) altern. DMDAGs Phase II (Phases II and III) -- Table 1 : Information from Phase I
The execution times x and the delays are also determined in Step 6. Since the size of the data used by selections will, in general, be relatively small, communication times of this data can often be abstracted into a small uniform delay for ts-dependences (possibly 0). Similarly, the relatively small computations of selections lead to small uniform or zero execution times x for selections.
The computational intensiveness of tasks and the size of the common sets of data is used to choose the execution times (x) of tasks and the delay ( ) of hard tt-dependences.
They are used in Phase II to evaluate parallelisation options. The actual communication delay can, however, vary per machine (latency, bandwidth), per input (characteristics, size), and even per run (in case of collisions). Furthermore, execution time and delay may relate di erently per machine. Consequently, the execution times and delays will often be inaccurate measures in the resulting implementation. This is the main cause for the optimisations in Phase III. On Dependences and Transformations of the DMDAG Hard dependences (rule R1) express a data-exchange requirement. The assumption on the communication protocol made in Section 2 leads to a low-level requirement on the implementation that guarantees correctness of this data exchange. The requirement on the use of data sets expressed by soft dependences is one of memory access order only.
In the formal framework, the underlying assumption on memory usage is that the local memory of a processor contains (at most) one copy (memory-location) of each data set in which tasks and selections can read and write. In In the DMDAG derived in Step 6, renaming can eliminate each R2-soft-dependence, and, consequently, also the R3-soft-dependences. It enforces an additional copy (and a copy-action) of the data-set if the tasks are scheduled to the Obviously, in a good implementation, the two versions of d produced at processor 1 are each sent exactly once to processor 2. Their arrival order may di er from their sending order due to network delays. Notice that the implementation requirement on the communication (cf. Section 2) guarantees the usage of the correct version by the tasks of processor 2. Notice also that without the dependences derived with R3, also the order u2 ; v2 ; u3 would be allowed, which cannot be dealt with by the implementation requirement. In order to achieve more potential parallelism and more schedule exibility, an additional copy of d at processor 2 can be used, which is denoted by the renaming: u1(out:d) ; u2(in:d) ; u3(in:d) ; v1(out:d2) ; v2(in:d2)
The rules now lead to the same hard dependences as above (based on R1), but all soft dependences have been eliminated by the renaming (cf. R2, R3). The scheduling considered above now allows at both processors all orders of the allocated tasks: a gain of exibility that is important for optimisation purposes. For this schedule, the renaming leads to an additional memory-location, but not to an additional copy-action (since both versions of d are sent to processor 2). Notice also that now a schedule is possible where u1 and v1 are executed simultaneously, which is not possible in the case above: a plain gain of potential parallelism.
Phase II: Parallelisation An optimal exploitation of the parallelism exposed in Phase I can only be obtained through the choice of an appropriate scheduling of the DMDAG. In general, the main criterion for a schedule is its makespan (or global execution time), which can be estimated easily using the execution times and delays in the DMDAG (cf. techniques used in 11]). The main criterion for techniques to nd a good schedule is their e ciency. This e ciency is essential if the parallel algorithm needs to be scalable and portable to various platforms and if the schedule needs (possibly partly) to be produced dynamically due to input-dependences.
As mentioned in Section 2, e cient algorithms to nd good schedules for traditional DAGs are restricted to special cases: uniform execution times and delays; potentially unbounded number of processors. For the more general DMDAG structures used here, the scheduling problem is even more complex for several reasons. Firstly, the DMDAG contains a number of input-dependences: Selections determine the runtime choice of tasks. The execution times and delays, which are usually non-uniform, can vary per input. Further, many applications show repetitive structures which execute tasks multiple times on di erent data sets. The number of repetitions usually is input-dependent. Secondly, a scheduling strategy must be able to exploit the architectural features of a variety of platforms, including: various numbers of processors, hierarchical parallelism, and processor interconnections. These reasons imply that for each DMDAG a number of schedules must be produced. The implementation of the DMDAG must support these various schedules, and, from a software engineering point of view, must be possible in a number of well-de ned and manageable steps.
In order to deal with the complexity of the scheduling problem for DMDAGs in a manageable way, the widely used technique of clustering 4, 9, 12] is applied. This scheduling strategy consists of two phases: (1) clustering, in which tasks are grouped in clusters (tasks in each cluster are mapped to the same processor), and (2) creating the various clusterto-processor mappings (depending on input and platform). Because parts of both the clustering and mapping can be de ned input and platform dependent, algorithms will have to be designed to produce (parts of) them at runtime. These algorithms may produce (part of) the clustering and mapping in an initialisation phase (static load balancing), and/or produce (part of) them during execution of the algorithm (dynamic load balancing). The task of the developer consists of de ning the clustering and mapping and developing these algorithms. The rst criterion mentioned above is now translated to the quality of the makespans of the various resulting schedules. The second criterion indicates the importance of the e ciency of the developers task, and the essentiality of the e ciency of the clustering and mapping algorithms. An additional criterion is the way the cluster choice enables a stepwise implementation.
From these criteria, a number of secondary criteria and heuristics for the two phases can be inferred. In general, the makespan of a schedule is better for an arbitrary number of processors if the workload is evenly balanced among the processors in size and per processor in time, and if the inter-processor dependences are minimal. Secondary criteria are therefore the di erences in workloads of the clusters, the spreading of the workload per cluster, the relative timing of inter-cluster dependences, and the number of clusters relative to the number of processors. Further, the required e ciency of the algorithms mentioned above is usually better if the mapping can be done in a uniform and simple way. Consequently, it improves if a relatively large (possibly input-dependent) number of similarly formed clusters is available. This can be achieved by aiming at a cluster structure that re ects the repetitive or regular structure in the application, that is, where each cluster corresponds with one iteration and the number of clusters is input-dependent. This re ection-heuristic also guarantees an evenly balanced workload, and, since it can often be applied iteratively, helps to identify independent implementation steps. Several other valuable heuristics and criteria are discussed below in Steps 7 and 8 of the development process.
Step 7 detects the clustering (or parallelisation) options.
Obviously, it is important to keep the secondary criteria mentioned above in mind. A practical starting heuristic is the re ection-heuristic mentioned above. Considering the computation versus communication costs ratio is important when detecting potential clusters 4], or when improving the initial clustering. The`owner-computes rule' aims at minimising communication costs by clustering tasks using the same (large) data. The ratio is also used to determine whether the scheduling of independent tasks (no execution order induced by the dependences) to di erent clusters delivers useful parallelism, i.e. whether the decreased computation time is larger than the increased communication time. Another valuable heuristic is to eliminate communication bottlenecks by the multi-allocation of critical tasks if this removes crucial communication hot spots at the (less) expense of additional computation time. Finally, it is important to consider the elimination of soft dependences, increasing the number of independent tasks, as discussed in Phase I.
The detection and evaluation of the properties of each option and the actual parallelisation choice is done in Step 8.
The most important evaluation criteria are the secondary criteria on the clusters and the mapping given above. It is also important that the clustering is e cient for all inputs and for as many numbers of processors as possible. An important factor is the ability to postpone part of the clustering decisions until more information is available about the actual tasks and dependences, in particular about the computation times and delays. This dynamic load balancing is especially relevant if the DMDAG contains critical selections. Obviously, a dynamic mapping must satisfy the schedulingrequirement given in Section 2. Another evaluation property is the e ectiveness of the clustering in exploiting architectural speci cs, such as combinations of loosely/tightlycoupled vector/scalar-processors. For this, it is important that the cluster de nition contains multi-level parallelism. Finally, as explained above, the manageability of the implementation needs to be taken into account.
Step 9 detects and de nes multiple steps for the (step- and exit-statement(u)), the send and receive actions corresponding to this dependence do not have to be implemented. This means that, by choosing large clusters in the rst implementation steps, only part of the error-prone send/receive actions (and the corresponding collecting of data) have to be implemented, namely only those corresponding to intercluster dependences. By now dividing each large cluster into smaller ones, the technique can be repeated. The third heuristic proposes to use existing hierarchical structures in the cluster de nition to de ne such implementation steps. Clearly, the proposed iterative application of the re ectionheuristic helps to detect such hierarchical structures.
The actual stepwise implementation takes place in
Step 10. The implementation requirements, as de ned in the formal framework, structure each step and should be followed carefully. Note that, as explained above, due to the cluster-technique, the send-actions within clusters do not have to be implemented. A further optimisation in the formal requirements is to combine send-actions that send the same data to (distinct tasks on) the same processor. After each implementation step, the correctness of the current implementation is veri ed in Step 11. With respect to manageability, a stepwise implementation and verication based on the implementation requirements given in Section 2 is crucial, especially for large-scale parallel applications. The number of possible causes for a particular bug can be very large, due to the complex combination of computations and communications within the system. Further, parallel execution often is nondeterministic, which complicates the reproduction and investigation of bugs. Finally, correctness veri cation of a parallel implementation must consider various numbers of participating processors, a variety of inputs (if possible), and multiple runs.
Phase III: Optimisation The DMDAG and its chosen scheduling as derived in Phases I and II respectively are (partly) based on estimated (rather than exact) input-dependent execution times and communication times. Consequently, a speci c scheduling may lead to performance problems which can only be detected and analysed during execution (so only after implementation). In Phase III, we try to detect potential performance problem areas and, if possible, improve the given schedule. The formal framework is used for guiding the choice, implementation, and veri cation of optimisations.
The detection of performance problems can be done by two di erent approaches. Firstly, many performance tools exist for studying performance data, which collect this date during execution by program instrumentations, cf. for instance 1, 10, 13, 17]. These instrumentations are inserted automatically by the tool and are machine-speci c. In order to fully exploit the architectural features of a speci c platform, the parallel application must be tuned for each individual port. Since performance tools are mostly only available on one or on a small set of machines, di erent porting e orts require the use of di erent tools. Another major concern for these tools is that they necessarily introduce execution perturbation. Further, they pin-point the location of a performance bottleneck in terms of procedures, processors and data-structures 14], and do not consider applicationknowledge such as obtained in Phases I and II. In the second approach to collect performance data, additional code segments that measure execution times, idle times, and communication sizes are manually inserted. Since these instrumentations are inserted by hand and not automatically, considerably more e ort is required. However, this approach can directly focus on the potential performance bottlenecks as indicated by application-knowledge available in the form of the DMDAG, the de nition of clusters and the clusterto-processor mapping. Also, after the initial insertion, no additional e ort is required to use this`built-in' tool on successive platforms. Additionally, since this approach is only applied to subsystems where performance analysis is really needed (directed by the application-knowledge) the amount of execution perturbation is clearly less than in the rst approach.
A large number of optimisation techniques exist for the improvement of a given schedule of a DMDAG. Here, we discuss four of the most important optimisation techniques, and show how they can be incorporated in our formal framework. The next section illustrates them in the case studies.
Firstly, suppose that within one cluster a number of consecutively executed tasks exist for which no prescribed order is induced by the given soft and hard dependences, so the processor to which this cluster is allocated can correctly execute these tasks in any order. Now analyse the exitstatements of the tasks in this set. If one or more of the tasks send data to other clusters while other tasks do not, then the former tasks must obtain a higher priority w.r.t. execution order. When studying the clusters of the receiving tasks, one can also discriminate between receiving tasks which require data sooner than other receiving tasks. Also in this case, one should give a higher priority to the sending tasks which send data to the most urgent receiving tasks. These task priorities can be incorporated in the framework quite easily by extending the guard G(v) of each task v. This optimisation can (partly) eliminate earlier detected idle-times.
Secondly, suppose that one task, or, even worse, several tasks in separate clusters, send a relatively large amount of data at approximately the same moment in time. This communication is induced by the implementation requirements as given in the corresponding exit-statements (Section 2). These requirements, however, do not impose a restriction on when the send actions are actually executed. Consequently, the execution of the send actions can be postponed. Interleaving them with the subsequent tasks may solve the described communication bottleneck. However, in order to prevent the introduction of deadlocks, one additional requirement is required: a send statement should never be postponed until after a subsequent receive statement. Further, splitting the`large' send statements can increase the number of possibilities for interleaving send's with computational tasks.
Thirdly, a straightforward, but important, improvement is to exploit the interconnection network on the given machine. That is, tune the cluster-to-processor mapping such that strongly interrelated clusters are mapped to processors interconnected at a relatively small distance. Of course, this should only be done after obtaining a mapping which aims at balancing the workload and minimising the total amount of communication.
Fourthly, on distributed memory systems, the multiallocation of a given task t can possibly improve the overall e ciency. The gain depends on the amount of time spend on waiting for data from t to arrive, the execution time of t, and the additional communication involved for getting the necessary input for t. The incorporation of multi-allocation in the formal framework is discussed in Section 2.
TWO PRACTICAL CASE STUDIES
A thorough evaluation of the practical applicability of the formal framework and the 3-phase development process is given in this section. The rst case study considers the parallelisation of a large-scale sequential Navier-Stokes solver. The second case study concentrates on the development processes of both a shared and a distributed memory version of a parallel solver of large sparse matrices.
Parallelisation of a Large-Scale Navier-Stokes Solver A large-scale distributed memory Navier-Stokes solver has been developed by parallelising an existing sequential version of the solver ( 5] ). This parallel solver is a typical example of a large-scale application. It comprises 210; 000 lines FOR EACH iteration save the old ow status in all blocks; calculate the time-steps in all blocks; advance the ow-status in all blocks; average the ow-status on the boundaries of all blocks; apply the local boundary conditions in all blocks; apply the global boundary conditions in all blocks; apply the local boundary conditions in all blocks; calculate contribution to aerodynamic coe cients for the current iteration of each block; calculate the residuals in the domain, for all blocks; ENDFOR Figure 2 : Global structure of the Navier-Stokes solver of Fortran 77 code which implement many subsystems (such as ow models and numerical integration schemes).
During the analysis phase, the original sequential solver turned out to be very well-structured. The solver is applied to a physical domain which is divided into a number of three-dimensional blocks. This block-oriented structure is clearly visible in the global structure of the solver, as illustrated in Fig. 2 . First, within each iteration, for each block, the current ow status within the block is saved. Subsequently, the time-steps are calculated for each block separately. Next, the ow status is advanced within each block. This is the most computational intensive part of the SOLEQS code. The next processes, applying the local and global boundary conditions, require less computation time. Finally, convergence results are collected from all blocks. Due to this clearly de ned block-structure (which is also present at lower levels), the hierarchical decomposition into subsystems and the detection of tasks is relatively easy. Figure 3 shows the resulting DMDAG. We now discuss the most important aspects of this graph.
Since the number of blocks is input-dependent, the size of the DMDAG is also input-dependent. For example, the number of instances of the tasks within`advance ow' is equal to the number of blocks. However, the number of blocks varies per input. The existence of multiple instances of a task or subsystem is denoted by the double lined right side of its box. The same repetitive structure is present in subsequent subsystems.
The task`advance ow-status' contains a selection. Depending on certain input values, di erent ow models (Navier-Stokes or Euler) and di erent numerical integration schemes (3-or 5-stage Runge-Kutta) are applied within a block. Depending on these choices, di erent computational intensivenesses of the task`advance ow-status' must be considered.
The dependences shown in Fig. 3 are all hard dependences. Soft dependences are excluded through the use of (additional) copies of commonly used data (cf. Section 3, Phase I). For instance, when applying the global boundary conditions, multiple copies of the boundaries of a block are made, one for each of its neighbouring blocks. Consecutively, each neighbouring block can consume its associated copy of a common boundary while at the same time the (original) boundary values can be updated again.
Notice that the number of dependences is also inputdependent. Consider for instance the dependences from`advance ow-status' to`perform averaging'. The ow-status is advanced separately within each of the blocks. Therefore, Estimating the execution times and communication times turned out to be complex. For instance, the execution time of task`advance ow-status' (applied to a certain block) depends on the size of the block (number of gridpoints), which ow model is applied, and which time-integration is used. The amount of communication required for task`apply global boundary conditions' depends on which type of boundary conditions are applied, the size of the boundaries, and the allocation of blocks among the processors. These parameters only become known at runtime. In an attempt to solve this problem, an approximation function is devised based on practical experiments on a large number of di erent testcases 15]. This function takes into account the given processor speeds and the given communication bandwidth and latency times. Although this function gives reasonable results, additional research is required to include more parameters in order to estimate the timings more accurately.
During the parallelisation phase, a choice for a set of clusters and for an e cient cluster-to-processor mapping must be made, taking into account the input-dependent aspects of the DMDAG and the scalability requirements. We rst make two important observations. Firstly, consider the instances of tasks that process one speci c block. Since these instances apply to all variables associated with the block, the dependences between these instances are relatively large. If we consider instances of tasks which process di erent blocks, then the dependences only apply to the boundary variables, and are thus relatively small. Secondly, each of the most computational intensive tasks (advance the ow-status in a block) considers a single block. These two observations lead to a straightforward choice of clusters: for each block, a cluster is chosen which contains the tasks which process the block. This choice is scalable w.r.t. input since the number of clusters grows equally with the input (number of blocks). The clusters can be mapped onto the processors considering the size of the inter-block dependences and the execution time per cluster. Practical experiments on several platforms have shown that the parallel solver remains scalable w.r.t. the number of processors as long as the number of blocks (or clusters) is large enough compared to the number of processors 6].
The implementation requirements for selections can be met easily, since each selection only selects di erent tasks within one cluster. So each processor only has to deal with local decisions: no communication of control information is required. The requirements for tasks consider, for each task, a guard and an exit-statement. This is illustrated in Fig. 4 . The incoming arrows of a task represent its guard. That is, a hard dependence from task u to task v is translated to local dependences ld (`Finished(u)') if both tasks are scheduled to the same processor. If u is scheduled to a di erent processor, then the dependence is translated to global dependence gd (`Finished(receive(u; v))'). Similarly, the outgoing arrows of a task represent its exit-statement. The dependence numbering corresponds to the descriptions in the documentation of the parallel solver. These requirements are used for guiding a stepwise implementation. That is, the implementation steps are chosen such that the send and receive statements are implemented and veri ed one by one and in the order in which they appear in the cluster, using the given requirements. Of course, manageability of the implementation phase is also induced by the fact that the order in which tasks within one cluster are carried out is sustained (w.r.t. the sequential code) and the code implementing the communications is inserted at speci c phases in the code.
The optimisation phase has improved the overall performance of the Navier-Stokes code signi cantly. With respect to e ciency, the implementation as represented by Fig. 4 turned out to have two main drawbacks.
Firstly, if the workload is evenly balanced, then the processors will start to send (and receive) at approximately the same moment (in time). For example, consider the send/receive's within`advance ow'. Each processor sends all the average-variables from each of its assigned tasks, and all processors execute these sends at approximately the same moment (in time), namely at the end of`advance ow'. Performance analysis showed that this drawback results in a relatively large number of message collisions.
Secondly, each processor can start with a`perform averaging' only after it has completed`advance ow'. For example, each processor must wait until it has received all its average-variables before the processor can start`perform averaging'. Due to this implicit synchronisation, processors strongly depend on each other.
The rst drawback is solved by combining the communications with the tasks. As a result, the communications are spread (in time) more evenly, which will minimise the number of collisions. This can be done by sending (and receiving) variables per block instead of per processor: if a block is already processed, then its corresponding variables are sent before processing the next assigned block. The communications can be combined with the tasks even further by splitting up the send-processes (cf. Section 3, Phase III). That is, instead of sending all the variables of a speci c block at the same time, one can partition these variables into faces, edges, and vertices. Then these subsets of variables increase the ability to combine communications with tasks. The design and implementation of the solver already contain these multiple block-, face-, edge-, and vertex objects. Therefore, restructuring the system to enable to access these individual objects separately only required a relatively small number of changes in the design and implementation of the parallel solver.
The second drawback is solved by changing the control structure of each task: each processor monitors which variables are received in the current iteration and executes a process as soon as all of its corresponding non-local variables have arrived. Again, this is done by a simple restructuring of the existing task-structures in the solver: computational processes are now executed per block, per face, per edge, and per vertex.
These optimisations can quite easily be expressed in the formal framework. Figure 5 illustrates the resulting structure. A detailed discussion of a performance evaluation of these optimisations on an IBM workstation cluster and a IBM SP1 is given in 6]. These results show a signi cant improvement obtained by the optimisation phase.
Development of a Parallel Solver of Sparse Matrices
The second case study describes a solver of large sparse matrices, called Mcsparse, for which both a shared and a distributed memory version have been developed ( 7] ).
The solver uses a preprocessing phase to reorder the input matrix into a bordered block upper triangular form, as shown in Fig. 6 . This structure can be used during the subsequent factorisation and solve phases to exploit ne, medium, and large grain parallelism. In this case study, we concentrate on the rst two phases of the factorisation: the factorisation of the diagonal blocks Di and the elimination of the border blocks Bi.
From the structure of the matrix it is clear that when a pivot is selected in Di it will not perform any updates on the rows in Dj, for each i < j. Nor will the pivots in Dj update any of the rows in Di. As a result, the LU factorisation of the diagonal blocks can be performed in parallel. Similarly, one can see that the o -diagonal blocks can also be updated concurrently. Of course, each o -diagonal block can only be updated until after its corresponding diagonal block has been factorised. How global stability and sparsity can be maintained is described in detail in 3].
The elimination of a particular border row can be done independently of the elimination of the other border rows. Therefore, the elimination of the border can be done in parallel over its comprising border rows. A restriction within the elimination of a speci c border row is that its corresponding diagonal blocks must be applied in the order in which they appear in the reordered matrix. Of course, each diagonal block can only be applied until after its factorisation and the update of its o -diagonal block have been completed. Again, speci cs concerning stability and sparsity can be found in 3]. The DMDAG which corresponds to this parallel system is shown in Fig. 7 . The dependences between the tasks have been argued above. The sizes of the delays are di cult to estimate. The size of a factorised diagonal block is only known after its actual factorisation. The delay of sending a border block will increase as more diagonal blocks have been applied to the border block, and is therefore even more dicult to estimate. The execution times of the diagonal block factorisations will not di er much because the blocks are all close to the same size (as a result of an initial reblocking), and most of the diagonal blocks show a banded structure, con ning ll-in within each band. The execution time of a border block elimination, however, can di er substantially per border block. Although an initial reblocking is also applied to the border blocks, their elimination times depend on the sparsity of these blocks, which changes during elimination: a border row tends to become more dense near the end of the block as more diagonal blocks are already applied to the row. The numbers in Fig. 7 are merely a global indication of how the communication times of diagonal blocks and border blocks are mutually related.
For this DMDAG, a choice for a set of clusters and for an e cient cluster-to-processor mapping is extremely dicult. In order to obtain an e cient load balancing during the factorisation of the diagonal blocks, one can straightforwardly use estimates for the amount of work per block and de ne clusters accordingly. However, this distribution may not be e cient during the elimination of the border blocks. Each border block must be eliminated by its corresponding diagonal blocks, in order. Since the diagonal blocks are already distributed among the processors, the amount of elimination work per processor is also already determined. Unfortunately, the amount of elimination work per diagonal block (and thus per processor) can di er substantially. The last numbered diagonal blocks are associated more elimination work than the rst numbered diagonal blocks, for two reasons. Firstly, the former blocks have to eliminate more border blocks. Secondly, a border row tends to become more dense near the end of the block, as explained above.
Therefore, the last numbered diagonal blocks must be shared among several processors, so that their corresponding elimination work can also be shared among multiple processors. These shared blocks are called S-blocks. An S-block is copied to other processors only after the block is factorised on its initial processor. Of course, the use of S-blocks has The corresponding SMDAG of Mcsparse has the same structure as the DMDAG, only now the delays are all assumed to be zero (recall: synchronisation instead of communication). This makes the choice for an e cient cluster and cluster-to-processor mapping relatively easy. Since all the diagonal and border blocks are now allocated in shared memory, each processor is able to process these blocks without delay costs. The factorisations of the diagonal blocks and the eliminations of the border blocks form a pool of parallel tasks. Of course, the same dependences must be respected, but an evenly balanced distribution of work can be established through a dynamic load balancing scheme: a cluster consists of a number of factorisations and eliminations, and the clusters are mapped to processors runtime depending on the amount of work per cluster. An e cient implementation of this shared memory version of Mcsparse on a Cray Y-MP together with a detailed performance evaluation is described in detail in 7].
CONCLUDING REMARKS
The development of parallel large-scale application codes is complicated due to the many factors to be considered. The three-phase development process described in this paper offers a structured and manageable approach to this development. The various steps of the process describe how to attack the various problems involved, and are strongly supported by the given underlying formal framework.
In the rst phase of the development process, the application is analysed and the potential parallelism is expressed in the notion of DMDAG (or SMDAG). It is described how the soft and hard dependences can be inferred, and how the resulting graph can be transformed to gain additional parallelism at the cost of memory use. In the second phase, clustering provides a practical scheduling strategy for the complex DMDAGs that correspond with large-scale applications. The formal framework supports the heuristics used for the detection of clustering options as well as their evaluation. The scheduling-independent high-level implementation protocol structures and guides a manageable stepwise implementation. The third phase detects potential performance problem areas and, if possible, improves the algorithm. The most important optimisation techniques are discussed, and it is shown how the formal framework guides the choice, implementation, and veri cation of the optimisations.
The diverse steps of the approach are illustrated in the practical developments of a large-scale distributed memory Navier-Stokes solver and of a shared and a distributed memory version of a solver of large sparse matrices. These case studies demonstrate the applicability and manageability of the approach and the value of the formal framework, both as a conceptual tool and as formal guidelines.
