Efficient automatic simulation of parallel computation on networks of workstations  by Kaklamanis, Christos et al.
Discrete Applied Mathematics 154 (2006) 1500–1509
www.elsevier.com/locate/dam
Efﬁcient automatic simulation of parallel computation on networks
of workstations
Christos Kaklamanis^{a,1,2}, Danny Krizanc^{b}, Manuela Montangero^{c,∗,1}, Giuseppe Persiano^{d,2}
^a Computer Technology Institute and Department of Computer Engineering and Informatics, University of Patras, GR26500 Rion, Greece
^b Department of Mathematics and Computer Science, Wesleyan University, Middletown CT 06459, USA
^c Dipartimento di Ingegneria dell’Informazione, Università di Modena e Reggio Emilia, Via Vignolese 905/b, 41100 Modena, Italy
^d Dipartimento di Informatica ed Applicazioni, Università di Salerno, 84081 Baronissi (Salerno), Italy
Received 14 October 2003; received in revised form 30 May 2005; accepted 29 October 2005
Available online 23 March 2006
Abstract
Andrews et al. [Automatic method for hiding latency in high bandwidth networks, in: Proceedings of the ACM Symposium on
Theory of Computing, 1996, pp. 257–265; Improved methods for hiding latency in high bandwidth networks, in: Proceedings of the
Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, 1996, pp. 52–61] introduced a number of techniques
for automatically hiding latency when performing simulations of networks with unit delay links on networks with arbitrary unequal
delay links. In their work, they assume that processors of the host network are identical in computational power to those of the guest
network being simulated. They further assume that the links of the host are able to pipeline messages, i.e., they are able to deliver P
packets in time O(P + d) where d is the delay on the link.
In this paper we examine the effect of eliminating one or both of these assumptions. In particular, we provide an efﬁcient simulation
of a linear array of homogeneous processors connected by unit-delay links on a linear array of heterogeneous processors connected
by links with arbitrary delay. We show that the slowdown achieved by our simulation is optimal. We then consider the case of
simulating cliques by cliques; i.e., a clique of heterogeneous processors with arbitrary delay links is used to simulate a clique of
homogeneous processors with unit delay links. We reduce the slowdown from the obvious bound of the maximum delay link to
the average of the link delays. In the case of the linear array we consider both links with and without pipelining. For the clique
simulation the links are not assumed to support pipelining.
The main motivation of our results (as was the case with Andrews et al.) is to mitigate the degradation of performance when
executing parallel programs designed for different architectures on a network of workstations (NOW). In such a setting it is unlikely
that the links provided by the NOW will support pipelining and it is quite probable the processors will be heterogeneous. Combining
our result on clique simulation with well-known techniques for simulating shared memory PRAMs on distributed memory machines
provides an effective automatic compilation of a PRAM algorithm on a NOW.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Parallel computation; Distributed computation; Automatic simulation
∗ Corresponding author.
E-mail addresses: kakl@cti.gr (C. Kaklamanis), dkrizanc@wesleyan.edu (D. Krizanc), montangero.manuela@unimo.it (M. Montangero),
giuper@dia.unisa.it (G. Persiano).
1 Work partially supported by the European RTN Project under contract HPRN-CT-2002-00278, COMBSTRU.
2 Work partially supported by the European Integrated Project under contract FP6-015964, AEOLUS.
0166-218X/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.dam.2005.10.016
1. Introduction
In this paper we consider the problem of executing parallel programs designed in one setting (e.g. for a homogeneous
array of processors or PRAM) in an entirely different one (e.g., a network of workstations (NOW)). A NOW is a very
attractive and widely available type of distributed system found in university departments, software houses, etc. (Even
a co-operating subset of the Internet may be thought of as a NOW.) In many situations the workstations remain idle
for signiﬁcant periods of time. By harnessing their computational power when their owner is not using them (e.g., at
night, during week-ends, lunch breaks, and meetings) they form a valid alternative to parallel machines for executing
parallel programs.
The main problem we deal with here is to determine the degradation of the performance of the algorithm introduced
by the simulation of one architecture on another. This is a classical problem in the theory of parallel algorithms and
several solutions have already been proposed including the use of redundant computation [4,6,7] and complementary
slackness [3,5,6,8–12]. These have been shown effective for hiding queueing and congestion delays introduced by the
links of distributed systems. While these approaches have been adopted in some special cases with success, they all
have an undesirable characteristic: it is always the programmer that has to tailor the parallel algorithm to the speciﬁc
distributed architecture on which the algorithm will be performed and look for an ad hoc simulation.
Automatic latency hiding: Andrews et al. [1,2] consider the possibility of automatically determining the simulation
once the parallel algorithm and the structure of the host network (e.g., a NOW) are known. This approach is interesting
because it moves the problem of adjusting programs for the speciﬁc parallel architecture of interest from software
developers to compilers or to run-time libraries. In fact, algorithm designers and software developers can, respectively,
design parallel algorithms and develop software for distributed systems assuming unitary delays on links and identical
processors; then, once the characteristics of the NOW on which to run the software are known, the software is automatically
compiled for the current architecture. Moreover, whenever the underlying NOW changes, with minimal effort
the code can be recompiled and the same algorithm can be simulated on a different NOW with no need to rewrite code.
Andrews et al. concentrate their attention on parallel algorithms for linear arrays automatically simulated by a NOW
with an embedded array structure. Their setting is the following: two n processor arrays G, the guest, and H, the host,
are given. G has unit delays on its links while H has arbitrary delays d0, . . . , dn−1 > 1 on its links and average delay
dave. They consider the case in which all processors of the host array have the same computational power as the guest
array processors and as each other, i.e., the processors are homogeneous. Furthermore, they assume that links of the
host can pipeline messages, i.e., a link with delay d can be seen as a chain of d unitary links connected by gates that
can receive and send a message at each instant of time. With such a link model, at each instant of time a new message
can be sent on a link and, after the ﬁrst d instants of time, the message can be picked up at the other end of the link.
Thus P messages can be injected in P consecutive steps into a link, and the last one is received after P + d steps.
They distinguish between two different models: the dataﬂow model and the database model. In the dataﬂow model the
computation performed by a processor p at step t depends only on the results of the computation performed by p and
its neighbors at the preceding step. In the database model each processor has its own database and at each computation
step a processor reads its memory and the messages received from its neighbors and possibly updates its database. The
size of the databases makes it infeasible for two processors to exchange databases once the simulation has started and
only updates to the databases may be exchanged. Andrews et al. [1] showed that in the dataflow model a linear array
with arbitrary delay can simulate a linear array with unitary delays with a slowdown of $O(\sqrt{d_{ave}})$. In [2] they show that
in the database model a slowdown of $O(\sqrt{d_{ave}} \cdot \log^3 n)$ can be achieved.
We believe that the assumptions of homogeneous processors and pipelined links used in [1,2] are too restrictive.
In general, in a NOW there are no constraints on the relative speed of computation of the individual workstations.
Moreover, some amount of pipelining on links may be appropriate in some situations (e.g., where delay is dominated
by processing or queueing delays on multiple physical connections between workstations) but not in most situations,
and in particular in the standard NOW setting of workstations belonging to a local network. In this case, if P messages
are sent over a link, we expect it will take time O(P · d) to deliver all of them.
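The difference between the two link models can be made concrete with a minimal sketch. The function name `delivery_time` and the convention that messages are injected one per step starting at step 1 are assumptions made for illustration; the paper only states the O(P + d) and O(P · d) bounds.

```python
def delivery_time(num_messages: int, delay: int, pipelined: bool) -> int:
    """Time by which the last of `num_messages` messages has crossed one link.

    Assumed convention (not from the paper): message i is injected at step i
    when the link pipelines; otherwise a message may be sent only after the
    previous one has arrived.
    """
    if num_messages == 0:
        return 0
    if pipelined:
        # The last message is injected at step P and needs `delay` more steps.
        return num_messages + delay
    # Without pipelining each message occupies the link for `delay` steps.
    return num_messages * delay


if __name__ == "__main__":
    for pipelined in (True, False):
        print(pipelined, delivery_time(10, 5, pipelined))
```

For long message streams the non-pipelined link is slower by a factor close to d, which is the gap the simulations in this paper are designed to absorb.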
Summary of results: In this paper, we extend the work of [1,2] by presenting simulations for parallel algorithms
designed for arrays and cliques for the cases in which processors have different speeds and/or the links do not allow
pipelining.
In the ﬁrst part, we give a simulation of a computation of a linear array of homogeneous processors connected by
unit-delay links on a linear array of heterogeneous processors connected by links of arbitrary delays in the dataﬂow
model. We also show that the slowdown achieved by our simulation is optimal. We consider both the cases of links that
allow pipelining and links that do not allow pipelining. These results are easily extended to the case where the host
network is an arbitrary bounded degree network using the same embedding technique used in [1,2].
In the second part, we consider simulations of cliques by cliques in the database model. We do not assume that links
can pipeline messages and we analyze both the cases in which processors are homogeneous and heterogeneous. In the
ﬁrst case, we achieve a slowdown that is proportional to the average link delay. In the second case, the slowdown is
adjusted by a factor related to the computational speed of the host processors.
The results of the second part combined with well-known techniques for simulating shared memory on distributed
memory architectures (see [12] for references) yield an efficient automatic method of compiling PRAM algorithms on
a NOW. Allowing compilation of PRAMs to speciﬁc architectures has the advantage of completely freeing algorithm
designers and software developers from considerations relative to the topological structure of the underlying network.
This has the potential to shortcut parallel software development cycles considerably.
2. Simulating linear arrays
In this section, we present our results about the simulation of linear arrays by NOWs with underlying linear array
topology.
2.1. Links with pipelining
In this section, we give a simulation and matching lower bound in the dataﬂow model for the case of a heterogeneous
linear array of processors with links that allow pipelining of messages.
We are given an array G of n processors qi , i = 0, . . . , n − 1. Processor qi , 0 < i <n − 1, can communicate with
its two neighbors qi−1 and qi+1, while processor q0 can communicate only with q1 and qn−1 only with qn−2. Links
between processors have unit delay, i.e., a message sent on a link needs one unit of time to arrive to its destination.
A computation of G, of length T, naturally deﬁnes a DAG with vertices (x, y), for x = 0, . . . , n − 1 and y = 0, . . . , T ,
representing the computation performed by processor qx at time step y. The computation (x, y) depends on the outcome
of the computation of qx and its neighbors at the previous time step, thus (x, y)’s incoming edges are: ((x + 1, y −
1), (x, y)), ((x, y − 1), (x, y)), ((x − 1, y − 1), (x, y)).
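As a small illustration of the dependency structure just described, the following sketch lists the incoming edges of a DAG vertex. The helper name `predecessors` is hypothetical; boundary processors simply have fewer neighbors.

```python
def predecessors(x: int, y: int, n: int) -> list[tuple[int, int]]:
    """Vertices that (x, y) depends on in the DAG of an n-processor array.

    Row 0 has no incoming edges; otherwise (x, y) depends on the vertices of
    q_{x-1}, q_x and q_{x+1} at step y - 1, where they exist.
    """
    if y == 0:
        return []
    return [(px, y - 1) for px in (x - 1, x, x + 1) if 0 <= px < n]


if __name__ == "__main__":
    print(predecessors(2, 1, 5))
    print(predecessors(0, 3, 5))
```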
Our aim is to simulate a computation on G using a host array H of processors $p_i$, $i = 0, \ldots, m-1$, with $m \le n$,
that communicate through links with delay $d > 1$. More precisely, $d_i$ is the delay on the link connecting $p_i$ to $p_{i+1}$.
Processors of H have different computational power. We associate with each processor $p_i$ its speed $s_i$, meaning that in
one step processor $p_i$ can simulate $s_i$ steps of a processor in G.
A ﬁrst attempt to simulate G with H could be to slow down every processor to the computation speed of the slowest
in H and to think that every communication requires the maximum delay on H. On the contrary, in the following, to
hide the latency introduced by non-unitary link delays, we take advantage of the fact that some processors are more
powerful than others.
The techniques presented in this section resemble those of [1] used for the case of an array of homogeneous processors.
2.1.1. Algorithm Stripes
Consider the first n steps of G’s computation. Define L as the triangle formed by the vertices $(j, t)$ of the DAG such
that $j \le n - t$ and R as the one formed by vertices $(j, t)$ such that $j \ge t$. The algorithm first simulates the first $n/2$ steps
of L, then the first $n/2$ steps of R; in this way every vertex of the first $n/2$ steps is simulated. In the same way it will
simulate the next steps of computation, $n/2$ by $n/2$.
Only a portion of array H is used to perform the simulation; for the sake of presentation and w.l.o.g. we suppose that
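we use processors in the interval $I = \{p_0, \ldots, p_{m_I-1}\} \subseteq H$, where $m_I \le m$.
We will use the following notation:
\[ S_i = \sum_{j=0}^{i} s_j, \qquad S_I = \sum_{p_j \in I} s_j, \qquad \text{and analogously} \qquad D_i = \sum_{j=0}^{i-1} d_j, \qquad D_I = \sum_{\{p_j, p_{j+1}\} \in I} d_j. \]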
Fig. 1. (a) Simulation in the case links allow pipeline. (b) Simulation in the case links do not allow pipeline.
Let $k = \lceil n/S_I \rceil$. We divide the bottom half of L into slanting stripes and each processor $p_i$ computes the stripe $L_i$
of width $k s_i$ defined in the following way (see Fig. 1(a)):
\[ L_i = \{(j, t) \mid t = 0, \ldots, T_i \text{ and } j = J_i, \ldots, kS_i - 1 - t\}, \]
where $T_i = \min\{n/2 - 1,\ kS_i - 1\}$ and $J_i = \max\{0,\ kS_{i-1} - t\}$.
Every stripe is computed row by row in a bottom-up manner and every row is computed from left to right.
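The stripe decomposition can be sketched in a few lines of Python. This is an illustrative reading of the definition above, not the authors' code: the function name `stripes` is hypothetical, n is assumed even, $k = \lceil n/S_I \rceil$, $S_{-1}$ is taken to be 0, and the interval I is assumed to be the whole list `speeds`.

```python
from math import ceil


def stripes(n: int, speeds: list[int]):
    """Return (k, [L_0, L_1, ...]) where L_i lists the vertices (j, t) of stripe i."""
    prefix = [0]                      # prefix[i + 1] = S_i = s_0 + ... + s_i
    for s in speeds:
        prefix.append(prefix[-1] + s)
    k = ceil(n / prefix[-1])          # k = ceil(n / S_I)
    half = n // 2                     # only the bottom half of L is covered
    result = []
    for i in range(len(speeds)):
        left_base = k * prefix[i]     # k * S_{i-1}
        right_base = k * prefix[i + 1]  # k * S_i
        last_row = min(half - 1, right_base - 1)   # T_i
        cells = [
            (j, t)
            for t in range(last_row + 1)
            for j in range(max(0, left_base - t), right_base - t)  # J_i .. kS_i-1-t
        ]
        result.append(cells)
    return k, result


if __name__ == "__main__":
    k, parts = stripes(8, [1, 2, 1])
    for i, part in enumerate(parts):
        print(i, part)
```

In this example processor 1, which is twice as fast as its neighbors, receives a stripe twice as wide, and the three stripes tile the bottom half of the triangle without overlap.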
Lemma 1. Processor $p_i$ ($0 \le i \le m_I - 1$) computes vertex $(kS_i - t, t)$ at a time step less than or equal to $k(t + 1) + D_i$.
Proof. First observe that the computation of processor $p_i$ depends on vertices calculated by $p_{i-1}$ and possibly, when
$k = 1$ and $s_{i-1} = 1$, by $p_{i-2}$. Then, observe that, if $p_i$ did not have to wait for information from its neighbors, it could
compute all the vertices on any row $t$ of its stripe in $k$ units of time.
The proof is by induction on $i$. The base case for $p_0$ follows easily by observing that its computation never needs
information from other processors and that $p_0$ needs at most $k$ units of time to compute every row. Thus, vertex $(kS_0, 0)$,
the last of row 0, is done at time $k$; $(kS_0 - 1, 1)$, the last of row 1, is done at time $2k$ and so on.
Vertex $P = (kS_{i-1} - t, t)$ is the first one in row $t$ to be calculated by processor $p_i$ and it depends on vertices
$P_1 = (kS_{i-1} - t - 1, t - 1)$, $P_2 = (kS_{i-1} - t, t - 1)$ and $P_3 = (kS_{i-1} - t + 1, t - 1)$ (see Fig. 1(a)); vertex $P_3$ has
been computed by $p_i$ itself at previous steps, while for the others two cases might arise:
$k s_{i-1} > 1$: $p_{i-1}$ computes $P_1$ and $P_2$, which, by induction, are ready at time step $kt + D_{i-1}$ and will arrive at $p_i$ at
time step $kt + D_{i-1} + d_{i-1} = kt + D_i$.
$k s_{i-1} = 1$: $p_{i-1}$ computes $P_2$ and $p_{i-2}$ computes $P_1$, which, by induction, are ready, respectively, at time steps $kt + D_{i-1}$
and $kt + D_{i-2}$. They will arrive at $p_i$ at time step $kt + D_{i-1} + d_{i-1} = kt + D_{i-2} + d_{i-2} + d_{i-1} = kt + D_i$.
Thus, $P_1$, $P_2$ and $P_3$ are available for $p_i$ at time $kt + D_i$ and the computation of row $t$ can be completed in $k$ steps
by time $kt + D_i + k = k(t + 1) + D_i$. □
Corollary 1. The bottom half of triangle L can be calculated in $kn/2 + D_I$ time steps.
Proof. The last processor to finish its computation is $p_{m_I-1}$ with vertex $(n/2 - 1, n/2 - 1)$. □
Theorem 1. Algorithm Stripes has slowdown
\[ O\left( \min_I \left\{ 1 + \frac{n}{S_I} + \frac{m_I d_I}{n} \right\} \right) \tag{1} \]
of host time steps per guest time step, where $m_I$ is the number of processors in I, $S_I$ is the total computation power of
interval I and $d_I = D_I/m_I$ is the average delay on the links between processors in interval I.
Proof. By Corollary 1 we know how much time is needed to compute L’s bottom half; the same time is needed to
compute R’s bottom half, and at most $D_I$ time steps are needed to exchange the information necessary to start the algorithm again
on the next $n/2$ steps. This is because, in general, processors will start the computation on the next $n/2$ steps from vertices
they have not computed in the previous ones; e.g., vertex $(n/2 - 1, n/2 - 1)$ is computed by processor $p_{m_I-1}$ at step
$n/2 - 1$, but vertex $(n/2 - 1, n/2)$ will be computed by processor $p_j$, with $j \le m_I - 1$, at step $n/2$.
Thus, the slowdown $\sigma$, computed on $n/2$ steps of computation, is upper bounded by the time needed to compute the $n/2$
steps divided by the number of steps:
\[ \sigma \le \frac{2(kn/2 + D_I) + D_I}{n/2} = 2k + \frac{6 D_I}{n} \le 2 + \frac{2n}{S_I} + \frac{6 m_I D_I}{m_I n} = 2 + \frac{2n}{S_I} + \frac{6 m_I d_I}{n}, \]
where the second inequality holds as $k \le 1 + n/S_I$. □
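The minimization over intervals in bound (1) is easy to evaluate by brute force. The sketch below is only an illustration under stated assumptions: the function name `best_interval` is hypothetical, `delays[i]` is taken to be the delay $d_i$ of the link between $p_i$ and $p_{i+1}$, and $m_I d_I$ is evaluated as $D_I$ (the sum of the delays inside the interval), following the definition $d_I = D_I/m_I$.

```python
def best_interval(n: int, speeds: list[int], delays: list[int]):
    """Return (bound, first, last) minimizing 1 + n/S_I + D_I/n over intervals."""
    best = None
    for first in range(len(speeds)):
        total_speed = 0   # S_I
        total_delay = 0   # D_I = m_I * d_I
        for last in range(first, len(speeds)):
            total_speed += speeds[last]
            if last > first:
                total_delay += delays[last - 1]
            bound = 1 + n / total_speed + total_delay / n
            if best is None or bound < best[0]:
                best = (bound, first, last)
    return best


if __name__ == "__main__":
    # Two fast processors joined by a fast link, flanked by slow links.
    print(best_interval(16, [1, 4, 4, 1], [100, 1, 100]))
```

In the example the two fast, well-connected processors in the middle are preferred to both the whole array (too much delay) and a single processor (too little computing power).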
2.1.2. Discussion
The previous theorem gives us an upper bound on the slowdown as the minimum over all possible intervals I of
a function of the speed and the number of processors and of the delay of the links connecting them. We now derive
bounds for special cases in two different settings: ﬁrst, processors in the host array all have the same speed and are
more powerful than processors in the guest array; second, processors in the host array do not necessarily have the same
speed. We denote by $d_{ave} = D_m/(m - 1)$ the average delay on links in H.
Homogeneous processors: Let $s$ be the speed of processors in H; then the total computation power of H is given by
$S_m = s \cdot m$. We distinguish the following cases:
(1) $n \le \sqrt{s \cdot d_{ave}}$: only one processor is used to perform the simulation. This processor needs at least one unit of time
and at most $\lceil n/s \rceil = O(1 + n/s)$ units of time to simulate one step of computation of G, thus we have
\[ \sigma \in O\left( 1 + \sqrt{\frac{d_{ave}}{s}} \right). \]
(2) $n > \sqrt{s \cdot d_{ave}}$: we further distinguish the following cases:
(a) If $n \ge S_m \ge m \ge n/\sqrt{s \cdot d_{ave}}$ then there must exist an interval $I \subseteq H$ of consecutive processors such that $m_I =
n/\sqrt{s \cdot d_{ave}}$ and $d_I \le d_{ave}$. Suppose by contradiction that such an interval does not exist; divide the array H into
$h = (m - 1)/(m_I - 1)$ consecutive intervals of $m_I$ processors each, such that intervals share endpoints. As we are
interested in asymptotic analysis we assume w.l.o.g. that $m_I - 1$ divides $m - 1$. Let $\delta_i > d_{ave}$ be the average delay
on links of the $i$th interval, thus
\[ d_{ave} = \frac{\sum_{i=1}^{h} (m_I - 1)\,\delta_i}{m - 1} = \frac{m_I - 1}{m - 1} \sum_{i=1}^{h} \delta_i > \frac{m_I - 1}{m - 1} \cdot \frac{m - 1}{m_I - 1}\, d_{ave}, \]
which is a contradiction. Thus, given I, we have
\[ S_I = m_I s = n \sqrt{\frac{s}{d_{ave}}} \]
and by (1)
\[ \sigma \in O\left( 1 + \sqrt{\frac{d_{ave}}{s}} \right). \]
(b) If $n \ge S_m$ and $n/\sqrt{s \cdot d_{ave}} > m \ge 1$, the whole array H or a single processor is used to carry out the simulation,
according to which solution performs better. From (1) we have that
\[ \sigma \in O\left( \min\left\{ \frac{n}{S_m} + \sqrt{\frac{d_{ave}}{s}},\ \frac{n}{s} \right\} \right). \]
Thus, the slowdown $\sigma$ can be as good as before when $S_m \in \Theta(n)$, but can be $O(n)$ in very bad (but pathological)
situations, i.e., when the host array has few ($m \in O(1)$) and not very powerful processors ($S_m \in O(1)$).
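The averaging argument of case 2(a) guarantees that some window of consecutive links has average delay no larger than the overall average. A minimal sketch of how such a window could be located follows; the function name `min_average_window` is hypothetical, and a window of `links_per_window` links corresponds to $m_I = $ `links_per_window` + 1 consecutive processors.

```python
def min_average_window(delays: list[int], links_per_window: int):
    """Return (start, average) of the window of consecutive links with least average delay."""
    w = links_per_window
    window_sum = sum(delays[:w])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(delays) - w + 1):
        # Slide the window one link to the right.
        window_sum += delays[start + w - 1] - delays[start - 1]
        if window_sum < best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_sum / w


if __name__ == "__main__":
    delays = [5, 1, 1, 9, 2, 2]
    start, avg = min_average_window(delays, 2)
    print(start, avg, sum(delays) / len(delays))
```

When the window length divides the number of links, as assumed w.l.o.g. in the text, the windows that tile the array already contain one with average at most $d_{ave}$, so the sliding minimum can only be at least as good.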
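Heterogeneous processors: Let $s_i$ be the speed of processor $p_i$ and $S_{ave} = S_m/m$ be the average speed of processors
in H.
If there exists an interval $I \subseteq H$ such that
\[ S_I m_I d_I = n^2 \qquad \text{and} \qquad \frac{d_I}{S^I_{ave}} \le \frac{d_{ave}}{S_{ave}}, \]
where $S^I_{ave} = S_I/m_I$ is the average power of processors in I, we use interval I to perform the simulation. Referring to
(1) we have that $n/S_I = m_I d_I/n$ and
\[ \frac{m_I d_I}{n} = \frac{m_I d_I}{\sqrt{S_I m_I d_I}} = \sqrt{\frac{m_I d_I}{S_I}} = \sqrt{\frac{d_I}{S^I_{ave}}} \le \sqrt{\frac{d_{ave}}{S_{ave}}}, \]
hence
\[ \sigma \in O\left( 1 + \sqrt{\frac{d_{ave}}{S_{ave}}} \right). \]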
If such an interval does not exist, one of the following cases must arise:
(1) $n > \sqrt{d_{ave}}$: If $S_m \ge m \ge n/\sqrt{d_{ave}}$, there must exist an interval $I \subseteq H$ such that $m_I = n/\sqrt{d_{ave}}$ (thus $S_m \ge n/\sqrt{d_{ave}}$)
and $d_I \le d_{ave}$ (the argument for the existence of interval I is analogous to the one in 2(a)). Using I to perform the simulation,
by (1) we have
\[ \sigma \in O(\sqrt{d_{ave}}). \]
The same slowdown is achieved when $S_m \ge n/\sqrt{d_{ave}} > m$ using the whole array H for the simulation.
(2) $n \le \sqrt{d_{ave}}$: The processor with maximum speed $s_{max}$ is used to perform the simulation. It needs at least one unit
of time and at most $n/s_{max}$ units of time to simulate one step of computation of G. As $s_{max} \ge S_m/m$, from (1) we have
\[ \sigma \in O\left( 1 + \sqrt{\frac{d_{ave}}{S_{ave}}} \right). \]
2.1.3. A lower bound
In this section, we show that the upper bound $O(\min_I \{1 + n/S_I + m_I d_I/n\})$ is asymptotically tight.
Lemma 2. Vertex $(i, n)$ cannot be computed earlier than time step
\[ \min_I \max\{ n^2/(2S_I),\ m_I d_I/2 \}. \]
Proof. Consider any simulation that uses an interval I of processors in H. There must exist a subinterval $I' =
\{p_j, \ldots, p_{j+|I'|-1}\} \subseteq I$ of consecutive processors such that $p_j$ and $p_{j+|I'|-1}$ are the farthest processors that must exchange
information during the simulation before vertex $(i, n)$ is computed. Thus, $(i, n)$ cannot be calculated in less than $m_{I'} d_{I'}/2$
time steps. Moreover, to calculate vertex $(i, n)$ we first need to calculate every vertex in the triangle $((1, 1), (n, 1), (i, n))$,
which needs at least $n^2/(2S_{I'})$ time steps to be computed. □
Corollary 2. The first $kn$ steps of computation cannot be simulated in less than
\[ k \min_I \max\{ n^2/(2S_I),\ m_I d_I/2 \} \]
time steps.
By the previous corollary we can, thus, state the following theorem:
Theorem 2. The slowdown of the best simulation of G by H is
\[ \Omega\left( \min_I \{ 1 + n/S_I + m_I d_I/n \} \right) \]
time steps.
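To see numerically why the two bounds match up to constants, the following sketch evaluates both on the same instance. It is an illustration under assumptions, not part of the paper: function names are hypothetical, $m_I d_I$ is evaluated as $D_I$, and the lower bound of Lemma 2 (stated for n guest steps) is divided by n to obtain a per-step slowdown.

```python
def interval_terms(n: int, speeds: list[int], delays: list[int]):
    """Yield (n / S_I, D_I / n) for every interval I of consecutive processors."""
    for first in range(len(speeds)):
        total_speed = total_delay = 0
        for last in range(first, len(speeds)):
            total_speed += speeds[last]
            if last > first:
                total_delay += delays[last - 1]
            yield n / total_speed, total_delay / n


def upper_bound(n, speeds, delays):
    # min over I of 1 + n/S_I + m_I d_I / n, as in (1) without the O(.)
    return min(1 + a + b for a, b in interval_terms(n, speeds, delays))


def lower_bound(n, speeds, delays):
    # min over I of max{n^2/(2 S_I), m_I d_I / 2}, divided by the n guest steps
    return min(max(a, b) / 2 for a, b in interval_terms(n, speeds, delays))


if __name__ == "__main__":
    args = (16, [1, 4, 4, 1], [100, 1, 100])
    print(lower_bound(*args), upper_bound(*args))
```

Since $a + b \le 2\max\{a, b\}$ for every interval, the upper expression is never more than one plus four times the lower one, which is the sense in which the bound is tight.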
2.2. Links without pipelining
We now analyze the case in which links between processors do not have the possibility to pipeline messages; i.e.,
a new message can be sent on a link only when the preceding one has arrived at its destination. We simulate m steps
of computation of an n-processor unit-delay linear array with a n/d-processor linear array with links of delay d in
time O(md); i.e., the simulation is work efﬁcient. We describe the simulation in detail for the case processors are
homogeneous. The case of heterogeneous processors is a straightforward generalization.
We are given a guest array G of n processors $q_i$, $i = 0, \ldots, n - 1$, with unit delays on links, that computes a DAG
$D' = \{(x, y) \mid x \le n - 1 \text{ and } 0 \le y \le m,\ m \ge n\}$ and a host array H of $h \le n$ processors $p_i$ with delay $d > 1$ on links not
supporting pipelining. W.l.o.g. assume n is a multiple of $d + 1$.
DAG D′ can be computed by H in a work-efﬁcient way using n/(d +1) processors with a slowdown of O(d): divide
D′ into n/(d + 1) vertical stripes Stk each d + 1 vertices wide and assign each stripe to one processor of the host array.
More precisely, the computation goes as follows:
• Set $St_k = \{(x, y) \mid 0 \le y \le m,\ k(d + 1) \le x < (k + 1)(d + 1)\}$, $k = 0, \ldots, n/(d + 1)$; i.e., $St_k$ is a vertical stripe of
DAG D′ of width $d + 1$ vertices.
• Set $\mathrm{left}(k, r) = (k(d + 1), r)$, $\mathrm{right}(k, r) = (k(d + 1) + d, r)$ and $\mathrm{last}(k, r) = (k(d + 1) + d - 1, r)$ for every $0 \le r \le m$
and every $0 \le k \le n/(d + 1)$; i.e., left(k, r) is the leftmost vertex of stripe $St_k$ at row r and, analogously, right(k, r)
is the rightmost of the same stripe at the same row. last(k, r) is the vertex on the left of right(k, r) and will be the
last one to be computed in row r.
• Processor pk computes stripe Stk row by row in a bottom-up manner. Vertices in row r are computed in the following
order:
if k mod 2 = 0 then
left(k, r), right(k, r), (k(d + 1) + 1, r), (k(d + 1) + 2, r), . . . , last(k, r)
if k mod 2 = 1 then
right(k, r), left(k, r), (k(d + 1) + 1, r), (k(d + 1) + 2, r), . . . , last(k, r),
i.e., the ﬁrst vertices to be computed in every row are the leftmost and the rightmost and the order in which this is
done depends on the position of the stripe. All the other vertices in the row are computed from left to right.
• All processors start computation at t = 0.
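The schedule described by the bullets above can be checked mechanically. The sketch below is an illustration under assumptions not stated in the paper: function names are hypothetical, each processor computes exactly one vertex per time step, a value sent to a neighboring processor becomes usable d steps after it is computed, and rows are numbered 0 to m (so each processor computes (m + 1)(d + 1) vertices).

```python
def schedule(n: int, m: int, d: int) -> dict:
    """Map each vertex (x, r) to the time step at which its processor finishes it."""
    width = d + 1
    assert n % width == 0, "n is assumed to be a multiple of d + 1"
    finish = {}
    for k in range(n // width):
        left, right = k * width, k * width + d
        # Even stripes start each row with left then right; odd stripes the reverse.
        ends = [left, right] if k % 2 == 0 else [right, left]
        row_order = ends + list(range(left + 1, right))
        clock = 0
        for r in range(m + 1):
            for x in row_order:
                clock += 1
                finish[(x, r)] = clock
    return finish


def is_feasible(n: int, m: int, d: int) -> bool:
    """Check that every vertex has all its inputs before the step that computes it."""
    width = d + 1
    finish = schedule(n, m, d)
    for (x, r), t in finish.items():
        if r == 0:
            continue
        for px in (x - 1, x, x + 1):
            if not 0 <= px < n:
                continue
            ready = finish[(px, r - 1)]
            if px // width != x // width:
                ready += d            # value must cross one link of delay d
            if ready > t - 1:
                return False
    return True


if __name__ == "__main__":
    print(is_feasible(12, 5, 3), max(schedule(12, 5, 3).values()))
```

The alternating order of the two end vertices is what makes the check pass with no idle steps: each boundary value is produced exactly d + 1 steps before the neighboring stripe needs it.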
We now prove that every processor always has at its disposal all the vertices needed to carry out its computation, i.e.,
it can compute a vertex at every instant of time, and that the slowdown of the computation of H is O(d). We start by
defining $t(x, y)$ as the time by which vertex $(x, y)$ is computed, and $prev(x, y)$ as the number of vertices that are
computed before vertex $(x, y)$ by the same processor.
Lemma 3. For every $0 \le r \le m$, $t(x, r) = prev(x, r) + 1$.
Proof. The claim clearly holds for $r = 0$. Suppose it holds for fixed $r \ge 0$. If we prove that, for every k, the claim
holds for right(k, r + 1) and left(k, r + 1) then it holds also for the remaining vertices in row r + 1. In fact, except for
right(k, r + 1) and left(k, r + 1), the processor itself computes all the vertices needed for vertices in row r + 1. We
prove the claim only for right(k, r + 1) and k odd, the proof for left(k, r + 1) and the cases k even are analogous.
To compute right(k, r + 1), processor pk needs to have the information relative to vertices last(k, r), right(k, r),
left(k + 1, r) (see Fig. 1(b)). As right(k, r) is computed earlier than last(k, r) then
t (right(k, r + 1)) = max{t (last(k, r)), t (left(k + 1, r)) + d} + 1,
that is, $p_k$ can compute right(k, r + 1) when it has computed all the vertices of the previous row and when left(k + 1, r),
computed by $p_{k+1}$, has arrived.
Using the inductive hypothesis we have
t (last(k, r)) = prev(last(k, r)) + 1
= prev(right(k, r + 1)),
t (left(k + 1, r)) + d = prev(left(k + 1, r)) + 1 + d
= (r − 1)(d + 1) + d + 1
= r(d + 1)
= prev(right(k, r + 1)). 
Corollary 3. The computation of stripe Stk is ﬁnished by time t = m(d + 1).
Proof. For every k, last(k, m) is the last vertex to be computed in stripe $St_k$ and, because of Lemma 3, $t(\mathrm{last}(k, m)) =
|St_k| = m(d + 1)$. □
Theorem 3. The simulation above is work-efﬁcient.
Proof. We use n/(d + 1) processors to simulate m steps of computation of n processors in m(d + 1) time. 
When processors are heterogeneous the simulation works in the same way with the only difference that stripe Stk ,
computed by processor pk with speed sk , has width sk(d + 1).
3. Simulating cliques
In this section we present an automatic method for simulating, in the database model, homogeneous processor cliques
with unit delay links on both homogeneous and heterogeneous cliques of processors with arbitrary delays on links that
disallow pipelining. In the database model, each processor p has a potentially large local database that may be accessed
only by p at each step of computation. Before the simulation starts it is possible to assign the databases of the guest
machine to the processors of the host machine. However, the size of the database makes it infeasible for two processors
to exchange databases once the simulation has started and only updates of the database can be passed.
These results have straightforward implications for simulating PRAM algorithms on an arbitrary NOW. A shared-
memory PRAM is an abstract model of parallelism which consists of n processors and a global shared memory of size
M. Each processor has its own local control and its own local memory. During each step of a shared-memory PRAM
computation, each processor is allowed to access any location of the global shared memory and to perform some
computation according to its local control and its local memory. Here we consider a variation of the shared-memory
PRAM model called the distributed-memory PRAM (also called a distributed memory machine or DMM). Here, the
memory of size M is distributed evenly among the n processors with each processor receiving a block of memory of
M/n locations. In the distributed-memory PRAM, each processor has direct access to its own memory and to every
other processor but it has only indirect access to other processors’ memory. Moreover, at each time step, each memory
block can be accessed by at most one processor. Using well-known techniques related to random hashing, a shared-
memory PRAM can be simulated by a distributed-memory PRAM with a slowdown of O(log n) with high probability
(see [12] for references). By equating the guest clique to a distributed-memory PRAM and the host clique to a NOW,
with the weight $d_{i,j}$ of edge $(i, j)$ representing the delay of the minimum-delay path in the NOW from processor
$p_i$ to processor $p_j$, we get a method of automatically compiling PRAM algorithms for NOWs.
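The clique weights $d_{i,j}$ are minimum-delay path lengths in the NOW. One standard way to obtain them, shown here only as a sketch, is the Floyd-Warshall all-pairs shortest-path algorithm; the paper does not prescribe a method, and the function name `min_delay_paths` and the matrix encoding (infinity for absent links) are assumptions.

```python
INF = float("inf")


def min_delay_paths(link_delay: list[list[float]]) -> list[list[float]]:
    """All-pairs minimum-delay paths (Floyd-Warshall) over a NOW given as a delay matrix.

    link_delay[i][j] is the delay of the direct link between workstations i and j,
    0 on the diagonal and INF where no direct link exists.
    """
    n = len(link_delay)
    dist = [row[:] for row in link_delay]
    for via in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][via] + dist[via][j] < dist[i][j]:
                    dist[i][j] = dist[i][via] + dist[via][j]
    return dist


if __name__ == "__main__":
    now = [
        [0, 2, INF],
        [2, 0, 3],
        [INF, 3, 0],
    ]
    print(min_delay_paths(now))
```

The resulting matrix is a complete weighted graph and can be used directly as the host clique C of the next subsections.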
3.1. Homogeneous processors
In this section, we are given an n-vertex host clique C and use it to simulate an n-vertex unit-delay clique, and show
that the slowdown is of the order of the average of the weights on links.
Given a weighted clique $C = (V, E)$, with n vertices and weight $d_e$ on edge $e \in E$, we define the subgraph $H = (V, E')$
such that
\[ E' = \{ e \in E \text{ such that } d_e \le 2 d_{ave} \}, \]
where $d_{ave}$ is the average weight of the edges in E. A node is said to be alive if it has degree at least $n/2$ in H; otherwise it is
dead.
The simulation works in the following way: equally distribute the databases of the guest processors among the alive
nodes in H and use alive and dead nodes for message passing. Each alive node will perform all the computation relative
to the processors assigned to it.
Lemma 4. Any two alive nodes are either adjacent or share a common (dead or alive) neighbor.
Proof. Suppose by contradiction that there exist two alive nodes $u_1$ and $u_2$ that are not adjacent and such that the intersection
of their neighbor sets $V_1$ and $V_2$ is empty. As both $u_1$ and $u_2$ are alive, $|V_1|, |V_2| \ge n/2$. As $V_1 \cap V_2 = \emptyset$, $|V_1 \cup V_2| \ge n$,
but this is a contradiction since $u_1 \notin V_2$ and $u_2 \notin V_1$, so $V_1 \cup V_2$ contains at most $n - 2$ nodes. □
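The construction of H and the property stated in Lemma 4 can be illustrated on a small instance. This is a sketch under assumptions: the function names are hypothetical and the host clique is given as a symmetric delay matrix.

```python
from itertools import combinations


def alive_nodes(delay: list[list[float]]):
    """Return (d_ave, adjacency of H, alive nodes) for a clique given as a delay matrix."""
    n = len(delay)
    pairs = list(combinations(range(n), 2))
    d_ave = sum(delay[i][j] for i, j in pairs) / len(pairs)
    adj = {v: set() for v in range(n)}
    for i, j in pairs:
        if delay[i][j] <= 2 * d_ave:      # edge kept in E'
            adj[i].add(j)
            adj[j].add(i)
    alive = [v for v in range(n) if len(adj[v]) >= n / 2]
    return d_ave, adj, alive


def within_two_hops(adj: dict, u: int, v: int) -> bool:
    """True if u and v are adjacent in H or share a common neighbor (Lemma 4)."""
    return v in adj[u] or bool(adj[u] & adj[v])


if __name__ == "__main__":
    delay = [
        [0, 1, 1, 10],
        [1, 0, 1, 1],
        [1, 1, 0, 1],
        [10, 1, 1, 0],
    ]
    d_ave, adj, alive = alive_nodes(delay)
    print(d_ave, alive, within_two_hops(adj, 0, 3))
```

Here the slow link between nodes 0 and 3 is dropped from H, yet the two nodes can still exchange messages through a common neighbor using only links of delay at most $2 d_{ave}$.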
Lemma 5. The remaining alive nodes are a constant fraction of n.
Proof. First note that the number of eliminated edges is at most $\binom{n}{2}/2$. Suppose by contradiction that $|E - E'| > \binom{n}{2}/2$;
then
\[ d_{ave} = \frac{\sum_{e \in E - E'} d_e + \sum_{e \in E'} d_e}{\binom{n}{2}} > \frac{\left( \binom{n}{2}/2 \right) 2 d_{ave}}{\binom{n}{2}} = d_{ave}. \]
Thus, the number of dead nodes is at most $2\binom{n}{2}/(2n) = (n - 1)/2$ and at least $n - (n - 1)/2 > n/2$ are alive. □
Theorem 4. An n-vertex clique with links without pipelining and with average delay dave can simulate an n-vertex
clique with a slowdown O(dave).
Proof. Since the number of alive nodes is a constant fraction of the total number of nodes, it is possible to assign guest
processors (along with their local database) to host processors so that each host is responsible for a constant number
of processors. As the distance between every pair of alive nodes is at most 2 and the delay of every used link is at most
2dave, the time spent for communicating at each step is at most O(dave). 
It is easy to see that in a NOW in which all links have the same delay $d = O(n)$, no simulation can achieve
a slowdown smaller than d and thus our simulation is asymptotically optimal. Conversely, if $d = \Omega(n)$, the trivial
simulation that assigns work only to one processor of the NOW achieves a slowdown of O(n).
3.2. Heterogeneous processors
In this section, we brieﬂy discuss the extension of the simulation of the previous section to the case in which the host
network is a clique of m heterogeneous processors. The basic idea is to expand a vertex of the clique corresponding to
a processor with computing power s into a clique of s processors connected among themselves with links of delay 0 and
then use the simulation with $O(d_{ave})$ slowdown using this new graph as host.
As before, the host consists of processors $p_0, \ldots, p_{m-1}$ and is represented by a complete graph $C = (V, E)$ on m
vertices that has weights on the edges and on the nodes. The weight $d_{i,j}$ of edge $(i, j)$ represents the delay on the
edge and the weight $s_i$ on node i represents the speed of processor $p_i$. We denote by S the sum of the speeds of all
processors. We assume that weights on both edges and nodes are integers and that $1 \le S \le n$.
We start by deﬁning a graph G′ with unweighted nodes; using this graph we can deﬁne a simulation with the technique
given in the previous section. We then observe that C can simulate G′ without any additional slowdown.
Let $G' = (V', E')$ be the edge-weighted clique defined as follows:
• $V' = \{ v_{i,j} \mid 0 \le i \le n - 1,\ 0 \le j \le s_i - 1 \}$ (thus $|V'| = S$);
• weight $d(v_{i,j}, v_{l,k})$ on edge $(v_{i,j}, v_{l,k})$ is defined in the following way:
\[ d(v_{i,j}, v_{l,k}) = \begin{cases} 0 & \text{if } i = l, \\ d_{i,l} & \text{otherwise.} \end{cases} \]
G′ can perform the simulation described in the previous section achieving slowdown $\sigma'$:
\[ \sigma' \in O\left( \frac{\sum_{e \in E'} d(e)}{|E'|} \right) = O\left( \frac{\sum_{(i,j) \in E} d_{i,j}\, s_i s_j}{S(S - 1)} \right). \]
Now, C simulates G′ in the following way: every processor $p_i$, $i = 0, \ldots, n - 1$, performs the computation of all
processors in $V_i = \{ v_{i,j} \in V' \mid 0 \le j \le s_i - 1 \}$. The slowdown of the simulation using C is still $\sigma'$.
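The expansion of C into G′ and the resulting average delay can be checked directly. The sketch below is illustrative only: function names are hypothetical, speeds are assumed to be positive integers with total at least 2, and the closed form is written with the exact edge count $|E'| = S(S-1)/2$, which differs from the expression inside the O(·) above only by a constant factor.

```python
from itertools import combinations


def expanded_average_delay(speeds: list[int], delay: list[list[float]]) -> float:
    """Average edge weight of G', built explicitly: processor i becomes s_i nodes v_{i,j}."""
    nodes = [(i, j) for i, s in enumerate(speeds) for j in range(s)]
    total, count = 0, 0
    for (i, _), (l, _) in combinations(nodes, 2):
        total += 0 if i == l else delay[i][l]   # copies of one processor are at delay 0
        count += 1
    return total / count


def closed_form_average(speeds: list[int], delay: list[list[float]]) -> float:
    """Same average from the host clique alone: sum of d_ij * s_i * s_j over |E'|."""
    total_speed = sum(speeds)
    weighted = sum(
        delay[i][j] * speeds[i] * speeds[j]
        for i, j in combinations(range(len(speeds)), 2)
    )
    return weighted / (total_speed * (total_speed - 1) / 2)


if __name__ == "__main__":
    speeds = [2, 1]
    delay = [[0, 3], [3, 0]]
    print(expanded_average_delay(speeds, delay), closed_form_average(speeds, delay))
```

Fast processors contribute many zero-delay edges to G′, so links incident to slow processors weigh less in the average than they do in C.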
4. Discussion and open problems
Our simulation of a clique by a clique guarantees a slowdown proportional to the average delay. It would be
interesting to design algorithms that map guest processors to host processors so as to guarantee optimal simulation or
to give evidence of the hardness of the problem and present approximate algorithms.
References
[1] M. Andrews, T. Leighton, P.T. Metaxas, L. Zhang, Automatic method for hiding latency in high bandwidth networks, in: Proceedings of the
ACM Symposium on Theory of Computing, 1996, pp. 257–265.
[2] M. Andrews, T. Leighton, P.T. Metaxas, L. Zhang, Improved methods for hiding latency in high bandwidth networks, in: Proceedings of the
Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, 1996, pp. 52–61.
[3] Y. Aumann, M. Ben-Or, Computing with faulty arrays, in: Proceedings of the 24th Annual ACM Symposium on Theory of Computing, 1992,
pp. 162–169.
[4] R. Cole, B. Maggs, R. Sitaraman. Multi-scale self-simulation: a technique for reconﬁguring arrays with faults, in: Proceedings of the 25th
Annual ACM Symposium on Theory of Computing, 1993, pp. 561–572.
[5] C. Kaklamanis, A.R. Karlin, F.T. Leighton, V. Milenkovic, P. Raghavan, S. Rao, C. Thomborson, A. Tsantilas, Asymptotically tight bounds
for computing with faulty array of processors, in: Proceedings of the 31st Annual Symposium on Foundation of Computer Science, 1990,
pp. 285–296.
[6] R. Koch, T. Leighton, B. Maggs, S. Rao, A. Rosenberg, Work-preserving emulations of ﬁxed-connection networks, in: Proceedings of the 21st
Annual ACM Symposium on Theory of Computing, 1989, pp. 227–240.
[7] T. Leighton, B. Maggs, R. Sitaraman, On the fault tolerance of some popular bounded degree networks, in: Proceedings of the 33rd Annual
Symposium on Foundation of Computer Science, 1992, pp. 542–552.
[8] C.E. Leiserson, S. Rao, S. Toledo, Efﬁcient out-of-core algorithms for linear relaxation using blocking covers, in: Proceedings of the 34th
Annual Symposium on Foundation of Computer Science, 1993, pp. 704–713.
[9] M.O. Rabin, Efﬁcient dispersal of information for security, load balancing and faults tolerance, J. ACM 36 (2) (1989) 335–348.
[10] L.G. Valiant, Bulk-synchronous parallel computers, Technical Report TR-08-89, Center of Research in Computing Technology, Harvard
University, 1989.
[11] L.G. Valiant, A bridging model for parallel computation, Comm. ACM 33 (8) (1990) 103–111.
[12] L.G. Valiant, General purpose parallel architectures, in: J. van Leeuwen (Ed.), Handbook of Theoretical Computer Science, Elsevier, Amsterdam,
1990.
