A BSP superstep is a distributed computation comprising a number of simultaneously executing processes which may generate asynchronous messages. A superstep terminates with a barrier which enforces a global synchronisation and delivers all ongoing communications. Multilevel supersteps can utilise barriers in which subsets of processes, interacting through shared memories, are locally synchronised (partitioned synchronisation). In this paper a state-based semantics, closely related to the classical sequential programming model, is derived for distributed BSP with partitioned synchronisation.
Introduction
One inherent difficulty in the development of scientific software is the reconciliation of the requirements that code be both correct and efficient. Placing too much emphasis on correctness may result in an abstract, but inefficient, programming model. Many members of the scientific programming community have unjustly rejected functional programming languages because of perceived performance inadequacies. Alternatively, striving for optimal efficiency can run the risk of compromising software correctness and can result in the employment of architecture-specific programming models. In this paper it is proposed that the BSP [Bis04, McC95, Val90, Val08] model of distributed computation combines the benefits of a well-established cost calculus (for evaluating program efficiency) with the advantages of a simple programming framework closely related to sequential program refinement.
A BSP computation comprises a sequence of supersteps. A superstep contains a set of independent processes, each with local memory, which perform computations and generate asynchronous communications. A superstep terminates with a global barrier which synchronises processes and delivers ongoing communications. The cost of a superstep S with p processors can be calculated using the following assumptions:
• a basic arithmetic operation or local memory access has unit cost;
• the maximum number of local operations performed by a process is w ;
• the maximum number of I/O operations performed by a process is h;
• the ratio of communication cost to the unit cost is g; and
• the cost of synchronising all of the processes within S is L.
The total cost of superstep S is defined to be w + h × g + L. A wide range of parallel algorithms have been developed and analysed using the BSP cost calculus.
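The cost formula can be illustrated with a short Python sketch. The names `ProcessProfile` and `superstep_cost`, and the numeric values used, are hypothetical and chosen only for illustration; the sketch simply takes the maxima w and h over the processes and evaluates w + h × g + L.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessProfile:
    """Operation counts for one superstep process (hypothetical helper type)."""
    local_ops: int  # local operations performed by the process
    io_ops: int     # I/O (communication) operations performed by the process


def superstep_cost(profiles, g, L):
    """Return w + h*g + L, with w and h the maxima over all processes."""
    w = max(p.local_ops for p in profiles)
    h = max(p.io_ops for p in profiles)
    return w + h * g + L


# Illustrative values only: three processes, g = 3, L = 23.
profiles = [ProcessProfile(100, 4), ProcessProfile(80, 10), ProcessProfile(120, 2)]
print(superstep_cost(profiles, g=3, L=23))  # 120 + 10*3 + 23 = 173
```

Note that w and h may be attained by different processes: the cost model charges for the slowest computation and the heaviest communication independently.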
A variant shared memory version of BSP, BSPRAM, has been proposed by Tiskin [Tisk98]. Again superstep processes are independent, with barriers being used to synchronise processes and update shared memory. BSPRAM has been refined to a multilevel variant [Val08] by partitioning shared memory into a hierarchy. Memory at the top of the hierarchy is assumed to have the highest access cost while cache memories at the process level have the lowest access costs. Figure 1 shows a memory configuration for four processes, $P_1, \ldots, P_4$: processes $P_1$ and $P_2$ have access to memory $M_{12}$ while processes $P_3$ and $P_4$ have access to memory $M_{34}$. All processes can (indirectly) access the memory $M_{1234}$. Processes can be synchronised in various ways: for example,
• processes $P_1$ and $P_2$ and processes $P_3$ and $P_4$ can be synchronised independently (partitioned synchronisation); $P_1$ and $P_2$ interact through $M_{12}$ while $P_3$ and $P_4$ interact through $M_{34}$. The notion of having sub-machine synchronisations in BSP was first proposed by de la Torre and Kruskal [TK96].
• all processes can be synchronised with global interactions occurring via the shared memory $M_{1234}$.
Multilevel BSP can be used to analyse the computational cost of BSP algorithms on multicore architectures. For example, the architecture in Fig. 1 has an associated 3-level cost model [Val08]. Each level i is associated with four cost parameters, $(p_i, g_i, L_i, m_i)$,
where $p_i$ is the number of level i − 1 components inside a level i component, $g_i$ is the communication bandwidth (relative to the basic computational cost) from level i to level i + 1, $L_i$ is the cost of synchronising all of a level i component's sub-components and $m_i$ is the size of memory available at level i. More realistically, consider a cost model for an architecture containing 4 identical multicores, each having 8 cores.¹ Each core has a cache memory and associated cost parameters:
Here $L_1$ refers to the cost of synchronising internal core threads. Each multicore has memory accessible by all its cores; $L_2$ is the cost of internal core synchronisation at the chip level while $g_2$ is the cost for a multicore to access level 3 memory:
$(p_2 = 8,\ g_2 = 3,\ L_2 = 23,\ m_2 = 3\ \mathrm{MB})$
¹ The example given here is based on the cost model for Sun Niagara UltraSparc T1 multicore chips given in [Val08].
The overall architecture contains a large globally accessible memory:
This example illustrates how partitioned synchronisation has the potential to be much less computationally expensive than global synchronisation (see the $L_2$ and $L_3$ costs above). Architectures comprising a number of multicores may be assembled either by building an interconnect network linking the multicores or by adding an overarching memory hierarchy (as described above). The intent of this paper is to model different forms of synchronisation patterns (rather than the details of memory hierarchies). In particular, localised barriers are distinguished from global synchronisations.
A [TK96, Val08] . In Sect. 6 a variant form of barrier, partitioned synchronisation, is defined. In Sect. 7 a superstep program environment is described and in Sect. 8 a specification of PrefixSum is refined into a BSP system which utilises a number of partitioned synchronisations. Finally, in Sect. 9 it is shown how to refine superstep systems into compositions of interfering processes.
Preliminaries: unified theories of programming
Unified theories of programming (UTP) [HH98] is a framework for refining predicates into programs. Here a brief overview of those features of UTP that are relevant to the development of a multilevel BSP programming model is given. A predicate P is defined over a finite set of (free) variables αP where αP = inαP ∪ outαP and where inαP is a set of undashed variables (initial values) and outαP is a set of dashed variables (final values). The predicate $x' = a + x \land a' = a$ with alphabet $\{x, a, x', a'\}$ corresponds to the program $x := a + x$. The operation SKIP over an alphabet $A = \{x, x'\}$ is defined as $SKIP_A =_{df} (x' = x)$.
Predicates may be composed using the operators $\lhd\, b\, \rhd$ (selection), $\parallel$ (independent parallel composition), $\sqcap$ (nondeterministic choice) and ; (sequential composition):
where αb ⊆ αP = αQ
An additional operation (separating simulation) is used to rename an output variable of a predicate (and adjust the alphabet of the predicate accordingly). For example, the simulation U0(m) renames m as 0.m:
A simulation U1(m) which renames m as 1.m can be defined in a similar way. Finally, the alphabet of a predicate R can be extended to include the variables x and x' as follows: $R_{+x} =_{df} R \land x' = x$. For example, suppose that a predicate P has variable m in its alphabet. The predicate $P ; U0(m)_{+out\alpha P \setminus \{m\}}$ has alphabet $(\alpha P \setminus \{m'\}) \cup \{0.m'\}$ and is such that the output m' of P is equivalent to the output 0.m' of $P ; U0(m)_{+out\alpha P \setminus \{m\}}$.
Barrier synchronisation
A superstep contains a set of independent processes which perform computations and generate asynchronous communications. A superstep terminates with a global barrier which synchronises processes and delivers ongoing communications. Let variable m denote a set of ongoing asynchronous communications that have been generated but have not, as yet, been delivered. A set of communications is modelled as a relationship between destinations (i.e. the variables to be updated) and values; see [HMC96, SC01, SCG04]:
Here VAR is a set of variable names and VAL is a set of values.
Superstep SS (P , Q) comprises process bodies P and Q; SS (P , Q) is well-defined if the variables occurring in P and the variables occurring in Q have, at most, the name m in common:
is a valid superstep. The definedness condition above permits superstep processes to be executed independently (see below).
A superstep process is a sequence of conventional programming instructions and asynchronous communication operations. Execution of the asynchronous communication put(x , e) evaluates the local expression e and sends the resulting value, asynchronously, to the "foreign" variable x ; semantically put(x , e) is defined to update the communication space m as follows:
An asynchronous send put(x , e) in process P is well-defined if αe ⊆ αP where αe denotes the set of variables that occur in expression e. In a superstep different processes may generate and send inconsistent messages to the same destination variable. For example, P may contain the statement put(x , 1) while Q contains put(x , 2). In this case the combined set of ongoing communications has the form
is defined in such a way that component processes utilise disjoint output communication spaces. The separate communication spaces are merged (Sect. 7.2 [HH98] ) and the resulting messages are delivered in a terminating barrier, SYNC. Let
be independent processes generated by renaming the output variable m in both P and Q. Note that P0 and Q1 are initialised using empty communication relations. Superstep SS(P, Q) first executes P0 and Q1, thereby performing local computations and generating asynchronous communications. Subsequently the asynchronous messages are delivered in parallel. Let $A = \alpha P \setminus \{m, m'\}$ and $B = \alpha Q \setminus \{m, m'\}$. Then:
Here γ m denotes the restriction of the domain of map m to the set γ :
The operation SYNC (m) γ is a variant form of parallel assignment which models the behaviour of a barrier synchronisation for a communication map m and an alphabet γ where dom(m) ⊆ γ and where dom is a function which returns the domain of its argument.
If f is a function then SYNC (f ) γ is a deterministic set of parallel assignments:
Here x is a meta-variable (rather than a conventional programming variable). It follows that
If the map m is not a function then SYNC (m) γ is defined non-deterministically:
Here f m denotes the set of functions f which cover m and are consistent with it:
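For example, if $m = \{x \mapsto 1, x \mapsto 2, y \mapsto 1, y \mapsto 2\}$ then four functions can be extracted from m: $\{x \mapsto 1, y \mapsto 1\}$, $\{x \mapsto 1, y \mapsto 2\}$, $\{x \mapsto 2, y \mapsto 1\}$ and $\{x \mapsto 2, y \mapsto 2\}$.
Synchronisation is destructive in that destination variables are overwritten; an alternative form of barrier, $SYNC^+(m)_\gamma$, updates destination variables by adding message content to destination content. For functional f: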
For non-functional m,
A superstep with barrier $SYNC^+$ is denoted $SS^+(P, Q)$ (see Sect. 8).
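The barrier semantics described in this section can be made concrete with a small executable model. The Python sketch below is an illustration under stated assumptions, not the paper's definition: states are dictionaries, a communication map m is a set of (destination, value) pairs, and the function names `covering_functions`, `sync` and `sync_plus` are hypothetical. Since the defining equation for the additive barrier on non-functional maps is not reproduced above, `sync_plus` handles only the functional case.

```python
from itertools import product


def covering_functions(m):
    """Return every function f with dom(f) = dom(m) whose pairs all occur in m."""
    by_dest = {}
    for dest, val in m:
        by_dest.setdefault(dest, set()).add(val)
    dests = sorted(by_dest)
    choices = [sorted(by_dest[d]) for d in dests]
    return [dict(zip(dests, combo)) for combo in product(*choices)]


def sync(state, m):
    """Destructive barrier: list every post-state that SYNC(m) may produce."""
    assert all(dest in state for dest, _ in m), "dom(m) must lie within the alphabet"
    return [{**state, **f} for f in covering_functions(m)]


def sync_plus(state, f):
    """Additive barrier for a functional map f (a dict): add values to destinations."""
    assert all(dest in state for dest in f), "dom(f) must lie within the alphabet"
    return {**state, **{x: state[x] + v for x, v in f.items()}}


m = {("x", 1), ("x", 2), ("y", 1), ("y", 2)}
print(len(covering_functions(m)))               # 4 candidate functions
print(sync({"x": 0, "y": 0, "z": 7}, {("x", 5)}))  # one outcome; z is unchanged
print(sync_plus({"x": 10, "y": 1}, {"x": 5}))      # x becomes 15
```

With an empty communication map `sync` returns the input state as its only outcome, which matches the expectation that a barrier with nothing to deliver behaves like SKIP.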
Superstep laws and termination
Supersteps inherit many of the properties of parallel composition:
If $\alpha P \subseteq \alpha P' \setminus \{m, m'\}$ and $\alpha Q \subseteq \alpha Q' \setminus \{m, m'\}$ then $(P \parallel Q); SS(P', Q') = SS(P; P', Q; Q')$. The semantics of supersteps can be extended to reason about termination. In UTP a design $P \vdash Q$ is such that if assumption P holds then it is guaranteed that the computation terminates in a state specified by Q [HH98].
Here the variables ok and ok' are used to record whether a program has been started and whether it has terminated, respectively. Predicate true denotes arbitrary behaviour (including possible non-termination) and can be represented by either $false \vdash false$ or $false \vdash true$. The termination properties of supersteps are as follows:
$SS(true, Q) = true$
If the component processes of a superstep terminate then so too does the superstep:
Terminating superstep processes can only generate finite communication spaces.
A weakest precondition semantics for SYNC
A weakest precondition semantics of SYNC (m) γ is derived below. Let P be a predicate and f be a finite function with dom(f ) ∈ αP . P f denotes P modified by the application of the simultaneous substitutions {x → f (x ) | x ∈ dom(f )}. Let x and y be variables, c a constant, P and Q predicates, R a relation and k a fresh variable name: P f is defined over the structure of P as follows:
Here $e_1 \lhd b \rhd e_2$ is $e_1$ if b is true and $e_2$ otherwise. It follows that P {} = P and that P {x ↦ v} is the conventional substitution of v for x in P. The weakest precondition of a deterministic barrier SYNC (f ) γ is:
A substitution over a finite mapping m is defined as:
m generalises the WP semantics of non-deterministic choice ( ):
P m is used to define the WP semantics of non-deterministic SYNC :
A weakest precondition of SYNC + can be devised by modifying the substitution rule P f to P +f :
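The weakest precondition reading of the destructive barrier can be sketched executably. The Python fragment below is a model under assumptions rather than the paper's definition: predicates are Boolean functions over state dictionaries, simultaneous substitution by f is modelled by evaluating the postcondition in the updated state, and the non-deterministic case uses the standard rule that the weakest precondition of a non-deterministic choice is the conjunction of the weakest preconditions of its alternatives. The names `covering_functions` and `wp_sync` are hypothetical.

```python
from itertools import product


def covering_functions(m):
    """Return every function f with dom(f) = dom(m) whose pairs all occur in m."""
    by_dest = {}
    for dest, val in m:
        by_dest.setdefault(dest, set()).add(val)
    dests = sorted(by_dest)
    return [dict(zip(dests, combo))
            for combo in product(*(sorted(by_dest[d]) for d in dests))]


def wp_sync(m, post):
    """Weakest precondition of SYNC(m) with respect to post, as a state predicate."""
    fs = covering_functions(m)

    def pre(state):
        # post must hold after every function the barrier might choose
        return all(post({**state, **f}) for f in fs)

    return pre


# Deterministic barrier x := 1 against the postcondition x < y.
pre = wp_sync({("x", 1)}, lambda s: s["x"] < s["y"])
print(pre({"x": 9, "y": 5}))  # True: 1 < 5 whatever x was before
print(pre({"x": 0, "y": 1}))  # False: 1 < 1 fails

# Inconsistent messages to x: the postcondition x == 2 cannot be guaranteed.
demonic = wp_sync({("x", 1), ("x", 2)}, lambda s: s["x"] == 2)
print(demonic({"x": 0}))      # False
```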
Partitioned synchronisation
Global synchronisation is often considered to be prohibitively expensive to implement. In certain situations superstep barriers can be implemented using localised forms of synchronisation. Let SS ({P i | i ∈ I }) be a superstep comprising a set of processes indexed by a set I and let PS be a partition of I :
Superstep processes can always be decoupled since they are independent:
A termination barrier can be decoupled using partition PS if the set of messages generated within each partitioned set of processes is destined for delivery within the same process set:
In such circumstances an implementation need only synchronise sets of processes within the same partition. The smallest partition partition(I , {I }) contains one index set and corresponds to complete inter-dependence. The largest partition, partition(I , {{i } | i ∈ I }) corresponds to a completely decoupled system. Consider a superstep SS ({P i | i ∈ I 1 ∪ I 2 }) where I 1 {2, 4, . . . , 2 n } and I 2 {1, 3, . . . , 2 n − 1}. If even(odd) indexed processes communicate only with other even(odd) indexed processes then
If the process sets {P i | i ∈ I 1 } and {P i | i ∈ I 2 } are assigned for execution to sets of cores which either have a shared memory or a fast localised interconnect network then SS ({P i | i ∈ I 1 ∪ I 2 }) can be implemented in a way that exploits the computing potential of the hardware. Such a partitioned synchronisation is likely to execute faster than a conventional BSP barrier which enforces a global synchronisation. Partitioned synchronisation can be characterised by the following algebraic law:
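Independently of the law itself, the decoupling condition stated earlier in this section (every message generated inside a block of the partition is destined for a process in the same block) can be checked mechanically. The Python sketch below is illustrative only; it assumes that each destination variable has a single owning process, and the names `is_partition`, `barrier_decouples`, `owner` and `messages` are hypothetical.

```python
def is_partition(index_set, blocks):
    """True if the blocks are non-empty, pairwise disjoint and cover index_set."""
    seen = set()
    for block in blocks:
        if not block or seen & block:
            return False
        seen |= block
    return seen == set(index_set)


def barrier_decouples(blocks, owner, messages):
    """Check that every message stays inside the block of its sender.

    owner maps a variable name to the index of the process that owns it;
    messages maps a sender index to the set of destination variable names.
    """
    block_of = {i: k for k, block in enumerate(blocks) for i in block}
    return all(block_of[owner[dest]] == block_of[sender]
               for sender, dests in messages.items()
               for dest in dests)


# Eight processes; process i owns variable a<i>. Even/odd partition.
indices = range(1, 9)
blocks = [{2, 4, 6, 8}, {1, 3, 5, 7}]
owner = {f"a{i}": i for i in indices}

stride2 = {i: {f"a{i + 2}"} for i in range(1, 7)}  # i sends to i + 2
stride1 = {i: {f"a{i + 1}"} for i in range(1, 8)}  # i sends to i + 1

print(is_partition(indices, blocks))               # True
print(barrier_decouples(blocks, owner, stride2))   # True: parity is preserved
print(barrier_decouples(blocks, owner, stride1))   # False: messages cross blocks
```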
Superstep systems
Supersteps may be embedded within a sequential programming framework; in this way BSP systems can be developed using conventional "sequential refinements" while the inherent parallelism of superstep programs can be used to exploit the computational potential of multiprocessor architectures (see Sect. 9). The embedding of supersteps within a sequential programming environment requires a generalisation of the definition of superstep. Consider the repetitive ( * ) construct below:
Here superstep processes P and Q share a "global" variable, i; in order to maintain superstep process independence i must be read-only for both P and Q (although i can be updated at a barrier). Let $\alpha_W P$ denote those variables which can be updated by P. For example, $\alpha_W\, put(i, e) = \{m\}$ and $\alpha\, put(i, e) = \{i, m\} \cup \alpha e$. Processes P and Q are independent if
This modified definition of independence allows P and Q to share read-only variables.
All supersteps in a system must have a fixed degree of parallelism, say k. The i th superstep process $P_i$, 1 ≤ i ≤ k, has write access to a set of local variables, $D_i$, and read-only access to a set of "global" variables, $D_0$, where
The structure of a superstep computation is outlined in Fig. 2 . The definition of a partitioned superstep is modified to ensure that "global" variables are not updated at barriers:
428 A. Stewart 
The derivation of parallel prefix sum
The superstep framework above is used to derive a parallel program from a specification. The resulting program contains barriers with varying degrees of synchronisation. The derivation starts from a predicate specification and a number of refinement steps are carried out [Bk09, Dij68] until a superstep program is derived. The program development is heavily influenced by R. Back's approach to constructing programs from invariants. Given an integer l , 1 ≤ l , and an array A with integer elements
Here $a_i$ and $a_i'$ denote the initial and final states of array element $a_i$, respectively. For example:
The first step in the development of a program satisfying the specification is the creation of a predicate I(j, l, A) which generalises PrefixSum(l, A). A program which satisfies PrefixSum can be constructed by finding a mechanism for transforming I(0, l, A) into I(l, l, A). Using composition [HH98] it can be shown (see refinement 1, Appendix) that
Thus, in a conventional way
The parallel assignment $\parallel_{2^j < i \le 2^l}\ a_i := a_i + a_{i-2^j}$ can be refined (see refinement 2, Appendix) to:
In this superstep program the set of global variables D 0 is {j } and each superstep process
Finally, the superstep within the loop body can be partitioned (see refinement 3, Appendix):
where the partition PI is:
The synchronisation cost for executing PrefixSum, for l ≥ 5, on the multicore architecture described in Sect. 1 is given under the assumption that processes are distributed over multicore chips as follows:
The first two iterations of an execution of PrefixSum (j = 0, 1 above) require two global synchronisations (cost $2L_3$).
Further assume that processes assigned to MC 1 etc. are internally distributed over cores as follows:
• Core 1 (C 1 ):
Iterations 3, 4 and 5 can be synchronised using partitioned synchronisations at the multicore level (cost $3L_2$). Finally, the remaining iterations involve synchronisations at core level (cost $(l - 5) \times L_1$, provided that the cores have sufficient memory). Thus, a non-partitioned version of the prefix sum program has associated synchronisation cost $l \times L_3$ whereas the partitioned version has associated synchronisation cost $2L_3 + 3L_2 + (l - 5) \times L_1$.
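The derivation and its cost analysis can be illustrated with a small simulation. The Python sketch below is a model under assumptions, not the refined program from the Appendix: in iteration j each process i puts its value $a_i$ to $a_{i+2^j}$ and an additive barrier then delivers the messages, which has the same effect as the parallel assignment $a_i := a_i + a_{i-2^j}$ for $2^j < i \le 2^l$. The function names are hypothetical, and the values used for $L_1$ and $L_3$ are placeholders (only $L_2 = 23$ appears in Sect. 1 as reproduced here).

```python
def prefix_sum_bsp(values):
    """Inclusive prefix sums of 2**l values using l supersteps."""
    n = len(values)
    l = n.bit_length() - 1
    if n < 2 or n != 2 ** l:
        raise ValueError("length must be a power of two, at least 2")
    a = {i: v for i, v in enumerate(values, start=1)}  # 1-based a_1 .. a_n
    for j in range(l):  # j is the read-only "global" variable
        stride = 2 ** j
        # Local phase: process i generates the message put(a_{i+stride}, a_i).
        m = {i + stride: a[i] for i in range(1, n - stride + 1)}
        # Additive barrier: each destination receives exactly one message.
        for dest, val in m.items():
            a[dest] += val
    return [a[i] for i in range(1, n + 1)]


def global_sync_cost(l, L3):
    """Every one of the l iterations uses a global barrier."""
    return l * L3


def partitioned_sync_cost(l, L1, L2, L3):
    """Two global, three chip-level and l - 5 core-level barriers (l >= 5)."""
    if l < 5:
        raise ValueError("the analysis assumes l >= 5")
    return 2 * L3 + 3 * L2 + (l - 5) * L1


print(prefix_sum_bsp([1, 2, 3, 4]))  # [1, 3, 6, 10]
# Placeholder costs: L1 = 3 and L3 = 100 are assumptions, L2 = 23 is from Sect. 1.
print(global_sync_cost(8, L3=100))                       # 800
print(partitioned_sync_cost(8, L1=3, L2=23, L3=100))     # 278
```

The messages are computed from the pre-barrier values before any destination is updated, so the in-place additions model a simultaneous delivery.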
Implementation projections
The proposed framework for developing BSP programs has a simple semantics because it is given at a system level. Unfortunately, a direct implementation of such a system would involve unnecessary synchronisations over statements containing shared variables (e.g. j := j + 1 in PrefixSum). To resolve this problem a renaming process projection is defined. Projection P {x ↦ i.x | x ∈ $D_0$ ∪ {m}} i maps a superstep system onto its i th component process where each shared variable, say x, is renamed i.x (see the definition of substitution in Sect. 5). P f i is defined by structural induction over superstep programs:
Here local synchronisation LSYNC is defined as a communicating CSP [HH98] process:
where $\parallel_{CSP}$ is the interfering parallel composition operator of CSP and $\{c_{ij} \mid i, j \in I\}$ is a set of communication channels connecting processes. LSYNC synchronises participating processes in a partition, acquires non-local communication data and updates its variables accordingly. Note that an asynchronous communication to a "global variable" is projected onto a set of communications to "local variables"; thus, supersteps with communications to "global variables" cannot be partitioned. A system S, with k processes per superstep, is implemented as a message passing system as follows:
The resulting composition can be directly implemented using an MPI-like [MPI95] programming environment.
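The behaviour attributed to LSYNC (synchronise the processes of one partition block, acquire non-local communication data, update local variables) can be sketched with threads and queues. This Python fragment is an illustration under assumptions and is not the CSP definition: `queue.Queue` objects stand in for the channels $c_{ij}$, Python threads stand in for message passing processes, and the names `run_partition`, `local_state` and `outgoing` are hypothetical. Messages from a process to itself are not modelled.

```python
import queue
import threading


def run_partition(members, local_state, outgoing):
    """Run one local synchronisation for the processes listed in members.

    local_state maps a process id to its dictionary of local variables.
    outgoing maps a sender id to {receiver id: {variable: value}}, i.e. the
    messages generated during the superstep that has just finished.
    """
    # One channel c_ij for each ordered pair of distinct processes in the block.
    chan = {(i, j): queue.Queue() for i in members for j in members if i != j}

    def lsync(i):
        # Send to every peer, possibly an empty batch, so no receive can hang.
        for j in members:
            if j != i:
                chan[(i, j)].put(outgoing.get(i, {}).get(j, {}))
        # Receiving from every peer doubles as the local synchronisation.
        for j in members:
            if j != i:
                local_state[i].update(chan[(j, i)].get())

    threads = [threading.Thread(target=lsync, args=(i,)) for i in members]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local_state


state = {1: {"a": 1}, 2: {"a": 2}, 3: {"a": 3}}
sent = {1: {2: {"a": 10}}, 2: {3: {"a": 20}}}
print(run_partition([1, 2, 3], state, sent))
# process 1 keeps a = 1, process 2 receives a = 10, process 3 receives a = 20
```

Because the queues are unbounded and every process sends before it receives, the exchange cannot deadlock; processes outside `members` take no part, which is the point of a partitioned synchronisation.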
Conclusion
Ideally a computing environment should be designed in a top-down fashion, starting with a high-level programming model and ending with an architecture which supports the required high-level operations. One example of such an environment was the CSP/OCCAM/transputer system. Unfortunately, programming languages and computer architectures are usually designed independently. Currently there are acute difficulties (particularly in high performance computing) in finding a programming model which has a simple semantics appropriate for program development and which, at the same time, can be efficiently mapped onto multicore architectures.
Co-Array Fortran (CAF) is a Single Program Multiple Data parallel language designed to exploit the computing potential of multicore architectures: processes can asynchronously read and write non-local data. Unless great care is taken when using synchronisation statements, process compositions may be associated with multiple execution paths, some of which may be unwanted (e.g. data a is read by $P_1$ from $P_2$ before $P_2$ has initialised a). It is highly undesirable that high-level languages offer the potential for programs to be subject to race conditions. It is proposed here that the BSP computational model should be re-evaluated for the following reasons:
1. Multicore chips are now being designed with 16 and 32 cores; on-chip synchronisation in such environments has the potential to be relatively inexpensive compared with synchronisation on distributed processor architectures.
2. For large scale architectures comprising numerous multicore chips the multilevel BSP model offers the potential for efficient execution of many applications (e.g. divide and conquer algorithms) through the use of partitioned synchronisation.
3. It has been demonstrated in this paper that multilevel BSP has an associated programming model with simple reasoning and refinement laws.
It is proposed that BSP programs be constructed in two phases: an initial sequential-like system development followed by a projection onto localised process computations. The first phase provides the benefits of reasoning about systems in a sequential-like way while there is potential for developing tools to carry out the second phase automatically.
