The bulk-synchronous parallel random access machine  by Tiskin, Alexandre
ELSEVIER Theoretical Computer Science 196 ( 1998) 109-l 30 
Theoretical 
Computer Science 
The bulk-synchronous parallel random access machine 
Alexandre Tiskin*,’ 
Oxford University Computing Laboratory, Wolfson Building, Parks Rd, Oxford OXI 3QD, UK 
Abstract 
The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of 
general-purpose parallel computing. Originally, BSP was defined as a distributed memory model. 
Shared-memory style BSP programming had to be provided by PRAM simulation. However, this 
approach destroys data locality and therefore may prove inefficient for many practical problems. 
In this paper we present a new BSP-type model, called BSPFCAM, which reconciles shared- 
memory style programming with efficient exploitation of data locality. BSPRAM can be optimally 
simulated by BSP for a broad range of algorithms. We identify some characteristic properties 
of such algorithms: obliviousness, slackness, granularity. Finally, we illustrate these concepts 
by presenting BSPRAM algorithms for butterfly dag computation, cube dag computation, dense 
matrix multiplication and sorting. @ 1998-Elsevier Science B.V. All rights reserved 
Keywords: BSP computing; Automatic memory management; PRAM simulation; 
Shared memory simulation; BSP algorithms 
1. Introduction 
The model of bulk-synchronous parallel (BSP) computation (see [27,18-201) is in- 
tended to provide a simple and practical framework for general-purpose parallel com- 
puting. Its main goal is to support the creation of architecture-independent and scalable 
parallel software. The key features of BSP are the treatment of the communication 
medium as an abstract fully connected network, and explicit and independent costing 
of communication and synchronisation. 
Many other communication complexity models have been proposed for parallel com- 
puting. One of the main divisions among the models is by the type of memory or- 
ganisation: distributed or shared. Models based on shared memory are appealing from 
* E-mail: tiskin@comlab.ox.ac.uk. 
‘This work was supported in part by ESPRIT Basic Research Project 9072 - GEPPCOM (Foundations 
of General Purpose Parallel Computing). 
0304-3975/98/$19.00 @ 1998 - Elsevier Science B.V. All rights reserved 
PII so304-3975(97)00197-7 
110 A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 
the theoretical point of view, because they provide the benefits of natural problem 
specification, convenient design and analysis of algorithms, and straightforward pro- 
gramming. For this reason, the PRAM model has dominated the theory of parallel 
computing. However, this model is far from being realistic, since the cost of sup- 
porting shared memory in hardware is much higher than that of distributed memory. 
Consequently, much effort was put into the development of efficient methods for sim- 
ulation of PRAM on more realistic models. 
Unlike PRAM, BSP accurately reflects main design features of most existing parallel 
computers. On the abstract level BSP is defined as a distributed memory model with 
point-to-point communication between the processors. Paper [28] shows how shared- 
memory style programming, with all the associated benefits, can be provided in BSP 
by PRAM simulation. However, this approach does not allow the algorithm designer 
to exploit data locality, and therefore in many cases may lead to inefficient algorithms. 
In this paper we propose a new model, called BSPRAM, which stands between BSP 
and PRAM. BSPRAM is based on a mixture of shared and distributed memory, and 
allows one to specify, design, analyse and program shared-memory style algorithms 
that exploit data locality. The cost models of BSPRAM and BSP are based on the 
same principles, but there are important differences connected with concurrent memory 
access in BSPRAM. The two models are related by efficient simulations for a broad 
range of algorithms. 
We identify some properties of a BSPRAM algorithm that suffice for its optimal sim- 
ulation in BSP. Algorithms possessing at least one of these properties - obliviousness, 
high slackness, high granularity - are abundant in scientific and industrial computing. 
We show the meaning and use of such properties on several examples: butterfly dag 
computation, cube dag computation, matrix multiplication, sorting. In view of our sim- 
ulation results, BSPRAM here plays a role of a methodology for generic BSP algorithm 
design. 
Algorithms presented in this paper, as well as many other BSPRAM algorithms, 
are defined for input sizes that are sufficiently large with respect to the number of 
processors. Apart from simplifying the algorithms, this condition provides slackness 
and granularity necessary for their efficient BSP simulation. A typical form of such 
condition is n >poly(p), where n is the size of the input, p is the number of processors, 
and poly is a low-degree polynomial. Practical problems usually satisfy such conditions, 
unless the number of processors is extremely large. Because of that, we present the 
algorithms in their simplest form, without trying to adapt them for lower values of n. 
Instead, we only note where such optimisation is possible, and give references to papers 
that address this problem. 
For the sake of simplicity, throughout the paper we ignore small irregularities that 
arise from imperfect matching of integer parameters. For example, when we write 
“divide an array of size n into p regular blocks”, the value n may not be an exact 
multiple of p, and therefore the blocks may differ in size by f 1. Such effects need 
not be considered in the abstract description of algorithms, since they can be easily 
accounted for during implementation. 
A. Tiskin I Theoretical Computer Science 196 (1998) 109-130 111 
2. Historical background 
The last fifty years have seen a tremendous success of sequential computing. As 
pointed out in [27, 18, 191, this was primarily due to the existence of a single model, 
the von Neumann computer, which was simple and realistic enough to serve as a 
universal basis for sequential computing. No such basis existed for parallel computing. 
Instead, there was a broad variety of hardware designs and programming models. 
One of the main traditional divisions among models of parallel programming is the 
organisation of memory: distributed versus shared. Shared memory is much costlier 
to support in hardware than distributed memory. However, shared memory has some 
important advantages: 
l natural problem specification - computational problems have well-defined input and 
output, that are assumed to reside in the shared memory. As a contrast, algorithms 
for a distributed memory model have to assume a particular distribution of input and 
output. This distribution effectively forms a part of the problem specification, thus 
restricting the practical applicability of an algorithm. 
l convenient design and analysis of algorithms - the computation can be described 
at the top level as a sequence of transformations of the global state determined by 
the contents of the shared memory. As a contrast, algorithms for distributed memory 
models have to be designed in terms of individual processors operating on their local 
memories. 
l straightforward programming - the shared memory is uniformly accessible via sin- 
gle address space and two basic primitives: reading and writing. As a contrast, pro- 
gramming for distributed memory models is more complicated, typically involving 
point-to-point communication between processors via the network. 
The computational model most widely used in the theory of parallel computing is 
the Parallel Random Access Machine (PRAM) (see e.g. [5,12,13,18]). The PRAM 
consists of a potentially infinite number of processors, each connected to a common 
memory unit with potentially infinite capacity. The computation is completely syn- 
chronous. Accessing a single value in the memory costs the same as performing an 
arithmetic or Boolean operation on a single value. 
Several variants of PRAM have been introduced. Among them are exclusive read, 
exclusive write PRAM (ERE W PRAM), which requires that every memory cell is 
accessed by not more than one processor in any one step, and concurrent read, con- 
current write PRAM (CRCW PRAM), which allows several processors to access a 
cell concurrently in one step. For CRCW PRAM, a rule for resolving concurrent writ- 
ing must be adopted. One of the possibilities, realised in combining CRCW PRAM 
(see e.g. [5, pp. 690-691]), is to write some specified combination of the values being 
written and (optionally) the value stored previously at the target cell. A typical choice 
of the combining function is some commutative and associative operator such as the 
sum or the maximum of the values. 
Another major model of parallel computation is the circuit model (see e.g. [ 13, 181). 
A circuit is a directed acyclic graph (dag) with terminal nodes labeled as constant, 
112 A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 
input or output, and nonterminal nodes labeled by arithmetic or Boolean operations. 
Algorithms that can be represented as circuits are oblivious, i.e. perform the same 
sequence of operations for any input (although the arguments and results of individual 
operations may, of course, depend on the inputs). Such algorithms are simpler to 
analyse than non-oblivious ones. Circuits also provide a useful intermediate stage in 
the design of algorithms for PRAM-type models: the problem of designing a circuit 
is separated from the problem of scheduling its underlying dag. For example, while 
the question of an optimal solution to the matrix multiplication problem remains open, 
one can find optimal scheduling for particular circuits representing the standard C3(n3) 
method, or the Strassen’s @(n“‘g7) method. In this paper we study the scheduling 
problem for several classes of dags. 
Both the PRAM and the circuit model are simple and straightforward. However, 
these models do not take into account the limited computational resources of existing 
computers, and therefore are far from being realistic. The first step in making them 
more realistic was to introduce a new complexity measure, eficiency, depending on 
the number of processors used by the algorithm (see [14]). New parallel models were 
gradually introduced to account for resources other than the number of processors. 
Currently, dozens of such models exists; see [16, 17,231 for their survey. Among the 
computer resources measured by these models are, according to [ 161, the number of 
processors, memory organisation (distributed or shared), communication latency, de- 
gree of asynchrony, bandwidth, message handling overhead, block transfer, memory 
hierarchy, memory contention, network topology, and many others. 
Models that include many different resource metrics tend to be too complex. A useful 
model should be concise and concentrate on a small number of crucial resources. One 
of the simplest and most elegant parallel models is the BSP model - see [27, 18, 191 for 
the description of BSP as an emerging paradigm for general-purpose parallel computing. 
The BSP model is defined by a few qualitative characteristics: uniform network topol- 
ogy, barrier-style bulk synchronisation, - and three quantitative parameters: the number 
of processors, communication throughput and latency. The main principle of BSP is 
to regard communication and synchronisation as separate activities, possibly performed 
by different mechanisms. The corresponding costs are independent and compositional, 
i.e. can be simply added together to obtain the total cost. It is easy to extend the BSP 
model to account for memory efficiency as well. 
In this paper we propose a variant of BSP, called BSPRAM, intended to support 
shared-memory style BSP programming. The memory of BSPRAM has two levels: local 
memory of individual processors, and a shared global memory. We compare BSPRAM 
with similar existing models. We then study the relationship between BSPRAM with 
BSP by means of simulation. Let n denote the size of the input to a program. Following 
[28], we say that a model A can optimally simulate a model B when there is a 
compilation algorithm that transforms any program with cost T(n) on B to a program 
with cost O(T(n)) on A. If the compilation algorithm yields a randomised program for 
A, we call the simulation optimal if the expected cost of the randomised program is 
O(T(n)). Sometimes the simulation may be restricted to programs from a particular 
A. Tiskin I Theoretical Computer Science 196 (1998) 109-130 113 
class. We assume that we are free to define a suitable distribution of the input and 
output data to simulate a shared memory model on a distributed memory one. 
If the described compilation is defined only for a particular class of algorithms, we 
say that A can optimally simulate B for that class of algorithms. We show that BSP 
can optimally simulate BSPRAM for several large classes of algorithms. 
3. The BSP model 
A BSP computer, introduced in [2&28], consists of p processors connected by a 
communication network (see Fig. 1). Each processor has a fast local memory. The 
processors may follow different threads of computation. A BSP computation is a se- 
quence of supersteps (see Fig. 2). A superstep consists of an input phase, a local 
computation phase and an output phase. In the input phase a processor receives data 
that were sent to it in the previous superstep; in the output phase it can send data to 
other processors, to be received in the next superstep. The processors are synchronised 
between supersteps. The computation within a superstep is asynchronous. 
The cost unit is the cost of performing a basic arithmetic operation or a local mem- 
ory access. If for a particular superstep w is the maximum number of local operations 
performed by each processor, h’ (respectively h”) is the maximum number of data 
units received (respectively sent) by each processor, and h = h’ + h”, then the cost of 
the superstep is defined as w + h. g + 1. Here g and I are parameters of the computer. 
Fig. 1. A BSP computer. 
suoerstep superstep superstep 
’ camp ’ comm ’ camp ’ comm camp 
Fig. 2. A BSP computation. 
114 A. Tiskin I Theoretical Computer Science 196 (1998) 109-130 
The value g is called communication throughput ratio (also sometimes “bandwidth 
inefficiency” or “gap”), the value I - communication latency (also sometimes “syn- 
chronisation periodicity”). We write BSP (p, g, 1) to denote a BSP computer with the 
given values of p, g and 1. The values of w and h typically depend on the number 
of processors p and on the problem size. If a computation consists of S supersteps 
with costs w, + h, . g + 1, 1 <s <S, then its total cost is W + H. g + S. I, where 
W = c,“=, w, is the local computation cost, H = Cf=, h, is the communication cost, 
and S is the synchronisation cost. (We will omit the factors g and 1 when dealing with 
communication and synchronisation separately from local computation.) 
In order to utilise the computer resources efficiently, a typical BSP program should 
regard the values p, g and 1 as configuration parameters. Algorithm design should 
aim to minimise local computation, communication and synchronisation costs for any 
realistic values of these parameters. For most problems, a balanced distribution of 
data and computation work will lead to algorithms that achieve optimal cost values 
simultaneously. However, for some other problems a need to trade off the costs will 
arise. 
An example of a communication-synchronisation tradeoff is the problem of broad- 
casting a single value from a processor: it can be performed with H = S = O(log p) 
by a balanced binary tree, or with H = O(p) and S = 0( 1) by sending the value di- 
rectly to every processor (this was observed in [27]). On the other hand, a technique 
known as two-phase broadcast allows one to achieve perfect balance for the problem 
of broadcasting n > p values from one processor. By dividing the values into p blocks 
of size n/p, scattering the blocks so that each one gets to a distinct processor, and then 
performing total exchange of the blocks, the problem can be solved with H = O(n) and 
S = 0( 1) - this is obviously optimal. Broadcasting of n = pE elements for any constant 
E, 0 <E < 1, can be performed optimally by (1 + 8-l )-phase broadcast. The values are 
scattered so that each one gets to a distinct processor, and each value is broadcast by a 
balanced tree of degree n and height 8-l. Non-leaf nodes of the broadcasting forest are 
partitioned among the processors, so that on each level each processor computes at most 
one node. The communication and synchronisation costs are H = 0(&-l . n) = O(n) and 
s=O(&-1)=0(l). 
Matrix computations provide further examples of problems with and without trade- 
offs: for instance, matrix multiplication can be done optimally in communication and 
synchronisation, but matrix inversion presents a tradeoff between communication and 
synchronisation with a polynomial range of parameters. 
The BSP model does not directly support shared memory, broadcasting or combining. 
These facilities can be obtained by simulating a PRAM on a BSP computer. Such 
simulation is also called automatic mode BSP programming, as opposed to the direct 
mode, i.e. programming with explicit control over communication and synchronisation. 
In order to achieve efficient simulation of a PRAM on a BSP computer, the PRAM 
must have more processors than the BSP computer. For a fixed value of p, we say 
that a PRAM algorithm has slackness 0, if at least op PRAM processors perform 
reading or writing at every step. Note that ap is a lower bound on the number of 
A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 115 
communicating processors, rather than the actual minimum number, which may de- 
pend on the dynamic behaviour of the algorithm. Slackness measures the “degree of 
communication parallelism” achieved by the algorithm, and is typically a function of 
the problem size n and the number of BSP processors p. 
In the automatic mode, each step of a PRAM is implemented as a superstep, with 
at least g virtual PRAM processors allocated to each of the p BSP processors. Virtual 
processor allocation is equal and non-repeating, but otherwise arbitrary. Paper [28] 
states the following result. 
Theorem 1. An optimal randomised simulation on BSP (p, g, 1) can be achieved for 
(i) any ERE W PRAM algorithm with slackness a 2 log p; 
(ii) any CRC W PRAM algorithm with slackness a > p” for some E > 0. 
Here g and I are assumed to be constant. 
Proof. See [28]. 0 
Memory access in the randomised simulation is made uniform by hashing: each 
memory cell of the simulated PRAM is represented by a cell in the local memory of 
one of the BSP processors, chosen according to some easily computable hash function 
which ensures nearly random and independent distribution of cells. 
The simulation allows one to write PRAM programs for BSP computers and to pre- 
dict their performance accurately. Most practical problems possess the slackness neces- 
sary for efficient simulation. However, the automatic mode does not allow the program- 
mer to exploit data locality, because PRAM processors do not have any substantial local 
memory. For some problems this is insignificant (e.g. multiplication of sparse matrices 
with a random pattern of nonzeros). For many other problems this can be a serious 
drawback. (e.g. multiplication of dense or regularly sparse matrices). Because of that, 
the direct mode of BSP programming is often preferable to the automatic mode. 
The next section aims to reconcile the exploitation of data locality with shared- 
memory style programming, retaining the parameters g and 1 and the bulk-synchronous 
structure of the computation, We introduce a new BSP-type model, called BSPRAM, 
in which the network is implemented as a random-access shared memory unit. The 
new model is designed to combine the best features of both automatic and direct BSP 
programming modes. We present a randomised BSP simulation of BSPRAM, based on 
a suitably adapted concept of slackness. We also describe a deterministic simulation, 
based on additional properties of obliviousness and granularity. 
4. The BSPRAM model 
In the previous section we described two alternative approaches to BSP program- 
ming. The automatic mode (PRAM simulation) enables the shared-memory style BSP 
programming with all its benefits. However, it does not allow one to exploit data 
116 A. Tiskinl Theoretical Computer Science I96 (1998) 109-130 
1 2 P 
BSPRAM (p, .g, 1) 
w 
Fig. 3. A BSPRAM. 
locality. On the other hand, the direct mode (pure BSP) allows one to exploit data 
locality, but only in a distributed memory paradigm. The aim of this section is to 
introduce a new BSP programming method, allowing both shared-memory style pro- 
gramming and exploitation of data locality. This might be called a “semi-automatic 
mode” of BSP programming. 
The new method is similar to PRAM simulation mentioned in the previous section. 
The key difference is that a BSP superstep is no longer fragmented into independent 
steps of up individual virtual PRAM processors. The structure of computation in the 
local memories of BSP processors is preserved. The simulation mechanism is used to 
model the global shared memory, which in the new model replaces the BSP commu- 
nication network. We call the new computational model BSPRAM. 
Formally, a BSPRAM consists of p processors with fast local memories (see Fig. 3). 
In addition, there is a single shared main memory. As in BSP, the computation proceeds 
by supersteps (see Fig. 2). A superstep consists of an input phase, a local computation 
phase, and an output phase. In the input phase a processor can read data from the main 
memory; in the output phase it can write data to the main memory. The processors are 
synchronised between supersteps. The computation within a superstep is asynchronous. 
Similarly to PRAM, concurrent access to the main memory in one superstep can be 
either allowed or disallowed. In this paper we consider an exclusive-read, exclusive- 
write BSPRAM (EREW BSPRAM), in which every cell of the main memory can 
be read from and written to only once in every superstep, and a concurrent-read, 
concurrent-write BSPRAM (CRCW BSPRAM), that has no restrictions on concurrent 
access to the main memory. For convenience of algorithm design we assume that if a 
value x is being written to a main memory cell containing the value y, the result may 
be determined by any prescribed function f(x, y) computable in time 0 (1). Similarly, 
if values xi , . . . ,x,,, are being written concurrently to a main memory cell containing the 
value y, the result may be determined by any prescribed function f(xi @ . . . ax,, y), 
where @ is a commutative and associative operator, and both f and @ are computable 
in time 0( 1). This corresponds to resolving concurrent writing in PRAM by combining 
(see e.g. [5]). 
The cost of a BSPRAM superstep is defined, similarly to the BSP model, as w + 
h . g+Z. Here w is the maximum number of local operations performed by each pro- 
cessor, and h = h’ + h”. The value of h’ (respectively h”) is defined as the maximum 
number of data units read from (respectively written to) the main memory by each 
processor in the superstep. As in BSP, the values g and I are fixed parameters of the 
computer. We write BSPRAM (p, g, I) to denote a BSPRAM with the given values 
A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 117 
superstep superstep superstep 
1 
Fig. 4. A BSPRAM computation. 
of p, g and 1. The cost of a computation consisting of several supersteps is defined 
as W + H. g + S. 1, where W, H and S have the same meaning as in the BSP 
model. 
One of the early models similar to BSPRAM was the LPRAM model proposed in [l]. 
The model consists of a number of synchronously working processors with large local 
memories and a global shared memory. The only mode of concurrent memory access 
considered in [l] is CREW. The model has an explicit bandwidth parameter, which 
corresponds to g in BSP and BSPRAM. There is no accounting for synchronisation cost 
(although it is suggested as a possible extension of the model). Thus, a p-processor 
LPRAM is equivalent (up to a constant factor) to CREW BSPRAM (p, g, 1). 
Another model similar to BSPRAM, called Asynchronous PRAM, was proposed in 
[7] (an earlier version of this model was called Phase PRAM). Like BSPRAM, Asyn- 
chronous PRAM consists of processor-memory pairs communicating via a global shared 
memory. The computation structure is bulk-synchronous, with EREW communication. 
The model charges a unit cost for a global read/write operation, d units for communica- 
tion startup and B units for barrier synchronisation. Thus, a p-processor Asynchronous 
PRAM is equivalent (up to a constant factor) to EREW BSPRAM (p, l,d+B). 
A bulk-synchronous parallel model QSM is proposed in [8] (an earlier version of 
this model was called QRQW PRAM). The model has a bandwidth parameter g. 
A p-processor QSM machine is similar to BSPRAM (p, g, 1) with a special mode 
of concurrent access to the main memory: any k concurrent accesses to a cell cost k 
units. Such a model is more powerful than EREW BSPRAM (p, g, 1 ), but less powerful 
than CRCW BSPRAM (p, g, 1). 
As for PRAM simulation, some “extra parallelism” is necessary for efficient 
BSPRAM simulation on BSP. We say that a BSPRAM algorithm has slackness 0, 
if the communication cost of every one of its supersteps is at least U. We adapt the 
results on PRAM simulation mentioned in the previous section to obtain an efficient 
simulation of BSPRAM. 
118 A. Tiskin I Theoretical Computer Science 196 (1998) 105130 
Theorem 2. An optimal randomised simulation on BSP (p, g, I) can be achieved for 
(i) any ERE W BSPRAM (p, g, I) algorithm with slackness o 3 log p; 
(ii) any CRC W BSPRAM (p, g, 1) algorithm with slackness (r 3 pE for some E > 0. 
Proof. Immediately follows from Theorem 1. 0 
Apart from randomised simulation by hashing, in some cases an efficient determinis- 
tic simulation of BSPRAM is possible. We consider two important classes of algorithms 
for which such deterministic simulation exists. 
We say that a BSPRAM algorithm is oblivious, if the sequence of operations ex- 
ecuted by each processor is the same for any input of a given size (although the 
arguments and results of individual operations may depend on the inputs). An oblivi- 
ous algorithm can be represented as a computation of a uniform family of circuits (for 
the definition of a uniform family of circuits, see e.g. [13]). We say that a BSPRAM 
algorithm is communication-oblivious, if the sequence of communication and synchro- 
nisation operations executed by each processor is the same for any input of a given 
size, but no such restriction is made for local computation. 
We say that a set of cells in the main memory of BSPRAM constitutes a granule, 
if in any input (output) phase each processor either does not read from (write to) any 
of these cells, or reads from (writes to) all of them. Informally, a granule is treated 
as “one whole piece of data”. We say that a BSPFL4M algorithm has granularity y if 
all main memory cells used by the algorithm can be partitioned into granules of size 
at least y. Note that both slackness and granularity are lower bounds rather than the 
actual minimum values. Slackness of a BSPRAM algorithm can always be taken to be 
equal or higher than its granularity: u > y. 
Communication-oblivious algorithms and algorithms with sufficient granularity 
for BSPRAM allow optimal deterministic BSP simulation. As we show below, ran- 
domised hashing is not necessary for communication-oblivious algorithms, since their 
communication pattern is known in advance. Therefore, an optimal distribution of 
main memory cells across BSP processor-memory pairs can be found off-line. For 
algorithms with granularity at least p, hashing is not necessary either, since every 
granule can be split up into p equal parts that are evenly distributed across BSP 
processor-memory pairs. This makes all communication uniform. In both cases ran- 
domised hashing is replaced by a simple deterministic data distribution. Moreover, 
for communication-oblivious algorithms with slackness at least p’, and for algorithms 
with granularity at least p, concurrent memory access can be simulated by mecha- 
nisms similar to the two-phase and (1 +s-l )-phase broadcast described in the previous 
section. 
Below we formally state the results on deterministic BSPRAM simulation. 
Theorem 3. An optimal deterministic simulation on BSP (p, g, 1) can be achieved for 
(i) any communication-oblivious ERE W BSPRAM (p, g, I) algorithm; 
A. Tiskin I Theoretical Computer Science 196 (1998) 109-130 119 
(ii) any communication-oblivious CRC W BSPRAM ( p, g, I) algorithm with slack- 
ness 0 2 pE for some E > 0; 
(iii) any CRCW BSPRAM (p, g, 1) algorithm with granularity y 2 p. 
Proof. (i) Since the communication pattern of a communication-oblivious algorithm is 
known in advance, we only need to show that any computation of EBEW BSPRAM 
(i.e. a particular run of an algorithm) can be performed in BSP at the same asymptotic 
cost. First, we modify each BSPRAM superstep so that each processor both reads and 
writes any main memory cell that it either reads or writes in the original superstep. 
This increases the communication cost of the computation at most by a factor of 2, 
and does not change the synchronisation cost. 
The above modification essentially transforms the computation into a form of mes- 
sage passing, in which main memory cells represent messages, and writing or reading 
a value corresponds to sending or receiving a message. This message-passing version 
of BSPRAM was referred to as “BSP+” in [25]. It differs from the direct BSP mode 
in that a message can be “delayed”, i.e. its sending and receiving may occur in non- 
adjacent supersteps. 
It remains to show that the “delayed” messages can be simulated optimally by nor- 
mal BSP messages. We represent the whole BSPRAM computation by an undirected 
graph. Each superstep is represented by two nodes, one for the input phase and the 
other for the output phase. Messages are represented by edges. Two nodes vi and v2 
are connected by an edge e, if the message represented by e is sent in the output 
phase represented by vi, and received in the input phase represented by ~2. The con- 
structed graph is bipartite, with the two parts representing all input and output phases 
respectively. If an input or output phase has cost h, then the degree of its representing 
node is at most ph. 
It is a well-known fact (see e.g. [2, p. 247]), that for any bipartite graph with 
maximum degree at most p, there is a colouring of its edges with not more than p 
colours, such that all the edges adjacent to the same node are coloured differently. As 
an easy corollary of this, for an arbitrary bipartite graph and an arbitrary p, there is 
a colouring of the edges with not more than p colours, such for an arbitrary h, any 
node of degree at most ph has at most h adjacent edges of of each colour. (This 
can be proved by splitting each node of degree at most ph into h nodes of degree at 
most p.) 
We use the above theorem to colour the computation graph. We then regard the 
colour of each edge as the identifier of a BSP processor that must obtain the corre- 
sponding message from the sending processor, keep it in its local memory for as long 
as necessary, and then transfer the message to the receiving processor. The communi- 
cation and synchronisation costs of the computation are increased at most by a factor 
of 2. 
(ii) The proof is similar to that of (i). The only difference is that, due to con- 
current reading and writing, each message has to be combined from contributions of 
several processors before being sent, and broadcast to several processors after being 
120 A. Tiskin / Theoretical Computer Science I96 (1998) 109-130 
received. Consider a particular superstep in the computation. By symmetry, we need 
to analyse only the input phase. Simultaneous broadcasting of received messages is 
done by a method which generalises the (1 + s-‘)-phase broadcast technique from 
Section 3. Without loss of generality we assume that the communication cost of the 
considered input phase is h = o = p’, 0 <E < 1. Each message is broadcast by a tree of 
maximum degree h and height at most 8-l (the tree does not have to be balanced). 
The broadcasting forest is partitioned among the processors so that on each level the 
total degree of nodes computed in any processor is at most 2h. Such partitioning can 
be easily obtained by a greedy algorithm. The communication cost of the computation 
is increased at most by a factor of 2&-l, and the synchronisation cost at most by a 
factor of E-‘. 
(iii) Partition each granule into p equal subgranules. For each granule, choose an 
arbitrary balanced distribution of its subgranules across the processors. 
An input phase of the BSPRAM algorithm is simulated by two BSP supersteps. 
In the first superstep a processor broadcasts a request for each granule that it must read. 
Note that since the subgranules of every granule are distributed evenly, all processors 
receive an identical set of requests. In the second superstep a processor satisfies the 
received requests by sending the locally stored subgranules of the requested granules 
to the requesting processors. 
An output phase of the BSPRAM algorithm is simulated by one BSP superstep. 
In this superstep a processor divides each granule that it must write into p subgran- 
ules, and sends to every processor the appropriate subgranules. Having received its 
subgranules, each processor combines any concurrently written data, and then updates 
the locally stored subgranules. 
The communication and synchronisation costs of the computation are increased at 
most by a factor of 2. 0 
On some parallel computers, a direct implementation of the BSPRAM model 
may prove practical. In any case, the proofs of Theorems 2 and 3 show that a BSP 
computer can execute most practical BSPRAM algorithms within a low constant factor 
of their BSPRAM cost. For two important classes of algorithms - communication- 
oblivious algorithms and algorithms with sufficient granularity - the simulation is 
deterministic and particularly simple. The next few sections give examples of such 
algorithms. 
5. Butterfly dag computation in BSPRAM 
The butterfly dag describes the dependence pattern of the Fast Fourier Transform, 
which is one of the most important algorithms in scientific computation. Parallel al- 
gorithms for butterfly dag computation have been proposed in various parallel models 
(see e.g. [5,12]). 
A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 121 
01234567 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
Fig. 5. Butterfly dag bpY(l6). 
Formally, the butterfly dug bJy(n) with inputs xi and outputs yi, Ofi<n, contains 
logn+l levels of nodes vf, 0 dk Glogn, Odi<n, such that 
vo takes the input xi, 
vk+’ depends on vi, 
J 
k if i=j or i@j=2k, 
‘i logn produces the output yi, 
(1) 
where i @ j denotes the bitwise x or (exclusive or) operation on the binary representa- 
tions of i and j. Fig. 5 shows the butterfly dag bfEy(l6). 
As observed in [21,27], the butterfly dag can be partitioned in a way suitable for 
bulk-synchronous parallel computation. The computation of a level in bJEy(n) consists 
in $n independent computations of bfly(2). Similarly, the computation of any k con- 
secutive levels consists in n/2k independent computations of bJly(2k). Therefore, the 
butterfly dag computation can be split into two stages, each comprising k logn levels 
and consisting of n’i2 independent tasks. If n is sufficiently large with respect to p, 
each of the two stages can be completed in one superstep. 
Fig. 6 shows the two-superstep computation of bjly( 16). Each superstep consists of 
four independent tasks computing bJy(4). In general, the algorithm is as follows. 
Algorithm 1. Computation of the butter-y dag bJy(n). 
Input: An array x = (xi), 0 d i < n. 
Output: An array y = (vi), 0 d i <n, defined by (1). 
Description. We assume n 2 p2. The computation is performed on EREW BSPRAM 
(p, g, I) and proceeds in two supersteps, each comprising i log n levels. In both super- 
steps, each processor is assigned n’i’/p independent butterfly dags of size n1i2. Data 
are communicated via the main memory. 
122 A. Tiskin I Theoretical Computer Science 196 (1998) 109-130 
I /A\ I I, 
b.fG (4 WY (4) WY (4) WY (4) 
I I I I I I I I I I I I I I I 
0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 
Fig. 6. BSPFLAM computation of bpY(l6). 
Cost analysis. The local computation, communication and synchronisation costs are 
W = O(n log n/p), H = WdP), S=O(l). 
The algorithm is oblivious, with slackness rs = n/p, and granularity y = n/p2. 
Paper [27] considers reducing the required minimum value of n for efficient BSP 
computation of a butterfly dag. 
6. Cube dag computation in BSPRAM 
The cube dag defines the dependence pattern that is characteristic for many scientific 
algorithms. Here we describe a BSPRAM version of the BSP cube dag algorithm from 
[19]. For simplicity, we consider the computation of a three-dimensional cube dag; the 
algorithm for other dimensions is similar. 
The three-dimensional cube dag cube3(n) with inputs x$‘, A$‘, xf’, and outputs 
y$‘, y!:‘, yf), 0 <i, j, k <n, contains n3 nodes z)$, such that 
Vojk, viok, V&O take respectively x$‘,x$‘,xP 
u$ contributes to each of the nodes 
Vi+l,j,k, Vi,j+],k, Vi,j,k+l whenever such node eXktS 
V,-l,j,k, Vi,n-l,k, Vi,j,n-1 produce respectively J$‘, y$‘, yi;” 
(2) 
Fig. 7 shows the cube dag tubes(4). 
The BSP algorithm for computing the dag cubes(n) is given in [19]. In this algorithm, 
the array v = (vi&) is partitioned into p3/2 regular cubic blocks of volume (n/p’/2)3. 
We denote these blocks by &, 0 <i, j, k < p ‘I2 Each block defines a dag isomorphic . 
to cube3(n/p’/2). The algorithm computes a block Vi+l,j+l,k+l as soon as the data from 
A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 123 
Fig. 7. Cube dag cube3(4). 
Fig. 8. BSPRAM computation of cube3(n). 
its predecessors K,j+l,k+l, h+l,j,k+l and &+l,j+l,k become available. The independent 
blocks computed simultaneously form a “layer”, or “wavefront” of the dag cube3(n). 
Fig. 8 shows a stage in the BSP computation of cubea( The current wavefront is 
shaded. The total number of wavefronts is 3p”* - 2, therefore the computation can be 
completed in O(p”‘) supersteps. 
Algorithm 2. Computation of the cube dag cubes(n) 
Input: Arrays x(l) = ($‘), xc2) = (x$‘), xc3) = (x1;3’), 0 < i,j,k <n. 
Output: Arrays y(l) = ($‘), yc2) = ($), y (3)=(yf’), O<i,j,k<n, defined by (2). 
Description. We assume n > p ‘I2 The computation is performed on EREW BSPRAM . 
(p, g, I) and proceeds in 3p ‘1’ - 2 stages, each comprising a constant number of super- 
steps. In stage s, 0 <s < 3p ‘1’ - 3 the blocks I& with i + j f k = s are computed. The , 
maximum number of blocks computed in any one stage is ip. Data are communicated 
between supersteps via the main memory. 
Cost analysis. The local computation cost is W = 0(n3/p). The computation of a 
block requires the communication (reading from or writing to the main memory) of 
124 A. Tiskin I Theoretical Computer Science 196 (1998) 109-130 
0(n2/p) values on the surface of the block. Therefore, the communication cost of each 
stage is h = O(n2/p). The total communication cost is H = h. p”’ = O(n2/p1”). The 
synchronisation cost is S=O(P’/~). The algorithm is oblivious, with slackness and 
granularity 0 = y = n2/p. 
7. Matrix multiplication in BSPRAM 
In this section we describe a BSPRAM algorithm for one of the most common prob- 
lems in scientific computation: dense matrix multiplication. We deal with the problem 
of computing the matrix product XY = Z, where X = (xv), Y = (yjk), Z = (zik) are ar- 
bitrary it x n matrices. 
This problem is of great importance, and, despite its simple formulation, of enormous 
theoretical complexity. Since the groundbreaking paper by Strassen [24] much work has 
been done on the complexity of sequential matrix multiplication. However, no lower 
bound asymptotically better than the trivial G(n2) has been found; nor there is any 
indication that the current O(n2.376) algorithm from [4] is close to optimal. 
We aim at parallelising the standard @(n3) method without using fast matrix mul- 
tiplication techniques. The method consists in the straightforward computation of the 
family of bilinear forms 
n 
zik = c XijYjk, 1 <i,k<n. (3) 
j=l 
Following (3), we need to set 
z&CO for i,k=l,...,n (4) 
and then compute 
vijk * xijyjk, zik + c%‘ik + V$ for all i, j, k, 1 < i, j, k <n. (5) 
Computation (5) for different triples i, j, k is independent (although it requires con- 
current reading from xv and yjk, and concurrent writing to zik), and therefore can be 
performed in parallel. 
The BSPRAM algorithm implementing this method is derived from the BSP algo- 
rithm for matrix multiplication described in [ 19,201, which in its turn is based on an 
idea from [l]. The algorithm works by a straightforward partitioning of the problem. 
The array V = (vgk) is represented as a cube of volume n3 in integer three-dimensional 
space (see Fig. 9). The arrays X, Y, Z are represented as projections of the cube V 
onto the coordinate planes k = 0, i= 0 and j = 0, respectively. The computation with 
the point vijk in (5) requires the input of its X and Y projections xii and yjk, and 
the output of its Z projection zik. In order to provide a communication-efficient BSP 
algorithm, the array V must be divided into p regular cubic blocks of size n/p113 (see 
Fig. 10). Such partitioning induces a partition of the matrices X, Y and Z into p213 
A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 125 






. . . 
Z 
Fig. 10. Matrix multiplication in BSPRAM. 
regular square blocks of size n/~‘~‘, 
(6) 
and similarly for Y and Z (see Fig. 10). The computation (4), (5) can be expressed 
in terms of blocks as 
ZiktO for i,k=l,...,p113 (7) 
126 A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 
and then 
f’$k +$qkk, zik + Zik + Kjk, for all i,j,k, 1<i,j,k<p1/3. (8) 
Each processor computes a block product I’& =X, . qk sequentially by (4), (5). The 
algorithm is as follows. 
Algorithm 3. Matrix multiplication. 
Znput: Matrices X = (xv) and Y = (J+), 1 <i, j<n. 
Output: A matrix Z = (zg), 1 Gi, j bn, defined by (3). 
Description. We assume n>p ‘I3 The computation is performed on CRCW . 
BSPRAM (P, g,Q. 
After the initialisation step (7), the computation proceeds in one superstep. Each 
processor performs the computation (8) for a particular triple i, j, k. In the input phase, 
the processor reads X, and I$. Then it computes the product &k =X, . qk by (4), 
(5). The block Kjk is then written to Z& in the main memory. Concurrent writing is 
resolved by addition of the written blocks to the previous content of Zik. The resulting 
array Z is the matrix product of X and Y. 
Cost analysis. The local computation, communication and synchronisation costs are 
W = O(n3/p), H = 0(n2/p2j3), S=O(l). 
The algorithm is oblivious, with slackness and granularity 0 = y = n2/p2j3. 
8. Sorting in BSPRAM 
Sorting is a classical problem of parallel computing. Many parallel sorting algorithms 
of different complexity have been proposed (see e.g. [3,9,12] and references therein). 
Here we consider comparison-based sorting of an array x = (xi), 1 <i <n. Without loss 
of generality we may assume that the elements of x are distinct (otherwise, we should 
attach a unique tag to each element). Let (a, b) denote an open interval, i.e. the set of 
all x in x such that a<x<b. 
Probably the simplest parallel sorting algorithm is parallel sorting by regular sampling 
(PSRS), proposed in [22] and discussed in [15]. Paper [l I] describes an optimised 
version of the algorithm, and its efficient implementation on a variety of platforms. 
The PSRS algorithm proceeds as follows. First, the array x is partitioned into p 
subarrays x’,...,xP, each of size n/p. The subarrays x4 are sorted independently by 
an optimal sequential algorithm. The problem now consists in merging the p sorted 
subarrays. 
On the first stage of merging, p + 1 regularly spaced primary samples are selected 
from each subarray (the first and the last elements of a subarray are among the sam- 
ples). We denote the samples of the subarray x4 by ??z,. . . ,Xz. The samples divide 
each subarray into p primary blocks of size n/p*. We denote the primary blocks of 
x4 by [z&x;], . . . , [FE_,,XF]. Then, p . (p + 1) primary samples are collected together 
A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 127 
Xl X2 X3 
e; 5; 2; 2; 5; zf z; 3; z; 2; 2; z; 
v.., . . . . .v. : : 
‘. ‘,.. :.., : ‘. 











‘., ,: ‘.,. ,:. 
‘.., _. _, ..: ‘. _’ 
. . ,:. .:, . . : 
: -.., -.. : ,, .. .,.. ;’ ‘_ .,.:’ 
.’ 
.‘.. .’ 
‘..’ ._ _: .._: 
‘. :. 
,’ 
‘, : ‘_ 
: : 
.’ ‘._, ._ ,. ,. ., ‘..’ ‘. 
: : 
., .-, : ‘._ .“. ‘. 
: : .:..; :..,. I ‘. ‘. 




‘. . . 
& _aL-__Dfiz ,..‘., . ‘-_ ‘.. ‘.. __A --’ Q DZTZi --_ -- 
30 fl L2 z.3 
Fig. 11. Sorting by regular sampling on three processors. 
and sorted by an arbitrary sequential algorithm. After that, we choose p + 1 regularly 
spaced secondary samples from the sorted array of primary samples (the first and the 
last elements are again included in the samples). We denote the secondary samples 
by ?e,...,$,. The secondary samples partition the elements of x into p secondary 
blocks, corresponding to the intervals (ze,?t ),. . . , (F,_ I ,zp). Efficiency of the above 
computations with samples is not critical, since the number of samples does not depend 
on n. The problem is now reduced to collecting the elements of each secondary block 
together. 
Let us show that each secondary block contains at most 3nJp elements. For a fixed 
secondary block defined by (&,Xk+t ), we divide all the primary blocks of x into three - - 
categories. We call a primaty block (Zy,Xi”,,) an inner block, if ($,jsi”,,) C (&,&+t); 
an outer block, if (X~,Y~+,) n (&,&+I) = 8; and a boundary block, if it is neither inner 
nor outer. With respect to any secondary block, there are at most p inner primary blocks 
in total (because there are only p primary samples inside the secondary block) and at 
most two boundary primary blocks in each subarray (because a boundary block must 
contain at least one of the two ends of the secondary block). Therefore, the size of 
a secondary block is at most n/p2 . (p + 2p) = 3njp. Thus, on the second stage of 
merging, the elements of each secondary block can be collected in optimal time, and 
then sorted by an optimal sequential algorithm. 
The method is illustrated in Fig. 11 for p = 3. The state of the array x after local 
sorting of the subarrays is represented by three horizontal bars at the top. Primary 
samples are shown as white dots. Dotted lines show the rearrangement of primary 
samples into a sorted array at the bottom. The dashed bars at the bottom show the 
elements of x assumed to lie between the samples; their numbers between neighbour- 
ing primary samples need not be equal. Black dots indicate the secondary samples. 
The secondary block (?t,?z) is shown by dark shading. Primary blocks that are in- 
ner, boundary and outer for (!?t,%$) are shown by dark shading, light shading and 
no shading, respectively. Only inner and boundary blocks may contain elements from 
- - 
(%,~2). 
128 A. Tiskin I Theoretical Computer Science 196 (1998) 109-130 
The sorting algorithm based on PSRS can be easily implemented in the BSPRAM 
model. We assume that the input and output arrays are stored in the main memory of 
BSPRAM. 
Algorithm 4. Sorting by regular sampling. 
Input: an array x = (xi), 0 6 i < n, where all X; are distinct. 
Output: the elements of x in increasing order. 
Description. We assume n 2 p3. The computation is performed on CRCW 
BSPRAM (p, g, I) and proceeds in three supersteps. In the first superstep a proces- 
sor picks a subarray ~4, reads it, sorts it with an optimal sequential algorithm, selects 
from it p + 1 primary samples, and writes them to the main memory. In the second 
superstep the processors perform an identical computation: read the p. (p + 1) primary 
samples, sort them and select p secondary samples. In the third superstep a processor 
picks a secondary block and collects its elements. In order to do this, a processor 
receives from other processors (via the main memory) all primary blocks that may 
intersect with the assigned secondary block; the number of such blocks is at most 3p, 
and their total size is at most 3n/p. The processor merges the primary blocks, discard- 
ing the values that do not belong to the assigned secondary block. The merged result 
is written to the main memory. 
Cost analysis. The local computation, communication and synchronisation costs are 
W = O(n log n/p) H = O(nlp) S=O(l) 
The algorithm is not communication-oblivious. Its slackness and granularity are 
(ignoring non-critical computations with samples) o = n/p, y = n/p’. 
Paper [lo] presents a more complex BSP sorting algorithm which is asymptotically 
optimal for any n 2 p. Its costs are W = O(n log n/p), H = O(n/p . log n/ log(n/p)), 
S = 0( log n/ log(n/p)). For n 2 p3, the algorithm is identical to PSRS; for smaller 
values of n it uses a pipelined tree merging technique similar to the one employed by 
Cole’s algorithm (see e.g. [3]). Despite its asymptotic optimality, the algorithm from 
[IO] is unlikely to be practical in the case of IZ M p. A more practical BSP sorting 
algorithm for small values of n is described in [6]. 
9. Conclusions 
A new model for bulk-synchronous parallel computing, the BSPRAM, has been pre- 
sented. The model enables the shared-memory style BSP programming with efficient 
exploitation of data locality. The BSP model can simulate BSPRAM optimally for a 
broad range of algorithms. The use of BSPRAM was illustrated on the examples of 
butterfly dag computation, cube dag computation, matrix multiplication and sorting. 
The corresponding values of the BSP cost, the type of BSPRAM used, and the charac- 
teristics of the obtained algorithms are summarised in Table 1. The BSPRAM approach 
A. Tiskinl Theoretical Computer Science 196 (1998) 109-130 129 
Table 1 
Summary of algorithm examples (constant factors omitted) 
Problem W H s trpe clobl? 0 Y 
bfly dag 
* 
4P 1 EREW P yes 4P dP2 
cube dag n3/P ,2 +2 P 112 EREW yes n2/P n2/P 
matr mult n31p .2 +3 1 CRCW yes .z .2 
* P2 3 
$3 
sorting nip 1 CRCW IlO P nip nlp2 
encourages natural specification of the problems: the input and output data are assumed 
to reside in the main memory, and no assumptions on data distribution are necessary. 
The design and analysis of algorithms are also simplified, since all communication is 
performed via the shared memory. 
In future we plan to develop new BSPRAM algorithms and to analyse their costs. 
This may lead to identifying new algorithm properties connecting BSPRAM and BSP, 
in addition to obliviousness, slackness and granularity. We also plan to develop a 
programming model and an implementation of BSPRAM. 
Acknowledgements 
The author thanks Bill McCall, Karma Terekhova, Alex Gerbessiotis, Constantinos 
Siniolakis, Rob Bisseling, Ben Juurlink, and the anonymous referees. 
References 
[l] A. Aggarwal, A.K. Chandra, M. Snir, Communication complexity of PRAMS, Theoret. Comput. Sci. 
71 (1990) 3-28. 
[2] C. Berge, Graphs, vol. 6, part 1 of North-Holland Mathematical Library, 2nd rev. ed., North-Holland, 
Amsterdam, 1985. 
[3] R. Cole, Parallel merge sort, in: J.H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 
Los Altos, CA, 1993, Ch. 10, pp. 453-495. 
[4] D. Coppersmith, S. Winograd, Matrix multiplication via arithmetic progressions, J. Symbol. Comput. 9 
(1990) 251-280. 
[5] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, the MIT Electrical Engineering 
and Computer Science Series, MIT Press, Cambridge, MA and McGraw-Hill, New York, 1990. 
[6] A.V. Gerbessiotis, C.J. Siniolakis, Deterministic sorting and randomized median finding on the BSP 
model, in: Proc. 8th ACM Symp. on Parallel Algorithms and Architectures, 1996, pp. 223-232. 
[7] P.B. Gibbons, Asynchronous PRAM algorithms, in: J.H. Reif (Ed.), Synthesis of Parallel Algorithms, 
Morgan Kaufmann, Los Altos, CA, 1993, Ch. 22, pp. 957-997. 
[S] P. Gibbons, Y. Matias, V. Ramachandran, Can a shared memory model serve as a bridging model for 
parallel computation?, in: Proc. 9th ACM Symp. on Parallel Algorithms and Architectures, pp. 72-83, 
1997, Extended abstract. 
[9] A. Gibbons, W. Rytter, Efficient Parallel Algorithms, Cambridge University Press, Cambridge, 1988. 
[lo] M. Goodrich, Communication-efficient parallel sorting, in: Proc. 28th ACM Symp. on Theory of 
Computing, 1996. 
130 A. Tiskin I Theoretical Computer Science I96 (1998) 109-130 
[ 1 l] D.R. Helman, J. J&I&, D.A. Bader, A new deterministic parallel sorting algorithm with an experimental 
evaluation, Technical Report CS-TR-3670 and UMIACS-TR-96-54, Institute for Advanced Computer 
Studies, University of Maryland, August 1996. 
[12] J. J&T& An Introduction to Parallel Algorithms, Addison-Wesley, Reading, MA, 1992. 
[13] R.M. Karp, V. Ramachandran, Parallel algorithms for shared memory machines, in: J. van Leeuwen 
(Ed.), Handbook of Theoretical Computer Science, Elsevier, Amsterdam, 1990, Ch. 17, pp. 869-941. 
[ 141 C.P. Kruskal, L.Rudolph, M. Snir, A complexity theory of efficient parallel algorithms, Theoret. Comput. 
Sci. 71 (1990) 95-132. 
[15] X. Li, P. Lu, J. Schaeffer, J. Shillington, P.S. Wong, Hanmao Shi, On the versatility of parallel sorting 
by regular sampling, Parallel Comput. 19 (1993) 1079-l 110. 
[16] Z. Li, P.H. Mills, J.H. Reif, Models and resource metrics for parallel and distributed computation, in: 
Proc. 28th Hawaii Intemat. Conf. on System Sciences, IEEE Press, New York, 1995. 
[17] B.M. Maggs, L.R. Matheson, R.E. Tarjan, Models of parallel computation: a survey and synthesis, in: 
Proc. 28th Hawaii Intemat. Conf. on System Sciences, IEEE Press, New York, 1995, vol. 2, pp. 61-70. 
[18] W.F. McCall, General purpose parallel computing, in: A. Gibbons, P. Spirakis (Eds.), Lectures on 
parallel computation, Cambridge International Series on Parallel Computation, vol. 4, Cambridge 
University Press, Cambridge, 1993, Ch. 13, pp. 337-391. 
[19] W.F. McCall, Scalable computing, in: J. van Leeuwen (Ed.), Computer Science Today: Recent Trends 
and Developments, Lecture Notes in Computer Science, vol. 1000, Springer, Berlin, 1995, pp. 46-61. 
[20] W.F. McCall, Universal computing, in: L. Bough, P. Fraigniaud, A. Mignotte, Y. Robert (Eds.), Proc. 
Euro-Par’96-I, Lecture Notes in Computer Science, vol. 1123, Springer, Berlin, 1996, pp. 25-36. 
[21] C.H. Papadimitriou, M. Yannakakis, Towards an architecture-independent analysis of parallel algorithms, 
in: Proc. 20th Annual Symp. on Theory of Computing, 1988, pp. 510-513. 
[22] Hamnao Shi, J. Schaeffer, Parallel sorting by regular sampling, Parallel Distrib. Comput. 14 (4) (1992) 
361-372. 
[23] D.B. Skillicom, D. Talia, Models and languages for parallel computation, October 1996. 
[24] V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969) 354-356. 
[25] A. Tiskin, The bulk-synchronous parallel random access machine, in: L. Bough, P. Friagniaud, 
A. Mignotte, Y. Robert (Eds.), Proc. Euro-Par ‘96-11, Lecture Notes in Computer Science, vol. 1124, 
Springer, Berlin, 1996, pp. 327-338. 
[26] L.G. Valiant, Bulk-synchronous parallel computers, in: M. Reeve (Ed.), Parallel Processing and Artificial 
Intelligence, Wiley, New York, 1989, Ch. 2, pp. 15-22. 
[27] L.G. Valiant, A bridging model for parallel computation, Comm. ACM 33 (8) (1990) 103-l 11. 
[28] L.G. Valiant, General purpose parallel architectures, in: J. van Leeuwen (Ed.), Handbook of Theoretical 
Computer Science, Elsevier, Amsterdam, Ch. 18, pp. 943-971. 
