Automatic Methods for Hiding Latency in Parallel and Distributed Computation by Andrews, Matthew et al.
Wellesley College
Wellesley College Digital Scholarship and Archive
Computer Science Faculty Scholarship Computer Science
1999










Recommended Citation
Automatic Methods for Hiding Latency in Parallel and Distributed Computation, with M. Andrews, F.T. Leighton, Y.L.Zhang. SIAM
Journal of Computing, 29 (2): 615-647 (1999).
AUTOMATIC METHODS FOR HIDING LATENCY IN PARALLEL
AND DISTRIBUTED COMPUTATION
MATTHEW ANDREWS†, TOM LEIGHTON‡, P. TAKIS METAXAS§, AND LISA ZHANG†
SIAM J. COMPUT. © 1999 Society for Industrial and Applied Mathematics
Vol. 29, No. 2, pp. 615-647
Abstract. In this paper we describe methods for mitigating the degradation in performance
caused by high latencies in parallel and distributed networks. For example, given any "dataflow"
type of algorithm that runs in T steps on an n-node ring with unit link delays, we show how to run
the algorithm in O(T) steps on any n-node bounded-degree connected network with average link
delay O(1). This is a significant improvement over prior approaches to latency hiding, which require
slowdowns proportional to the maximum link delay. In the case when the network has average link
delay $d_{ave}$, our simulation runs in $O(\sqrt{d_{ave}}\,T)$ steps using $n/\sqrt{d_{ave}}$ processors, thereby preserving
efficiency. We also show how to efficiently simulate an $n \times n$ array with unit link delays using
slowdown $\tilde{O}(d_{ave}^{2/3})$ on a two-dimensional array with average link delay $d_{ave}$. Last, we present results
for the case in which large local databases are involved in the computation.
Key words. hiding latency, parallel and distributed computation, linear and two-dimensional
arrays, complementary slackness
AMS subject classification. 68Q22
PII. S0097539797326502
1. Introduction. Most papers describing algorithms for parallel or distributed
computation assume a model of computation in which all the links have unit delay.
Such a model is nice to work with and it is realistic for some parallel machines, but
not for most. In reality, there are often substantial delays associated with some or
all of the links. These delays can be caused by long wires, links that are realized
by paths that go through one or more intermediate switches, wires that are required
to go off-chip or off-board, communication overheads, and/or by the method which
is used to prepare a packet for entry into the network. Link delays are an even
greater concern for distributed machines and networks of workstations (NOWs). This
is because some latencies can be very high (due to the fact that some processors can
be far apart physically) and also because the variation among latencies can be high
(since some processors may be very close or even part of the same tightly coupled
parallel machine).
1.1. Traditional approaches. Since communication latency is an important
factor in the performance of a parallel or distributed algorithm, several methods have
been devised in an attempt to compensate for latency. The simplest of these methods
is to slow down the computation to the point where the latency is accommodated.
This approach is most commonly used at the circuit level, where the clock speed
is set to be slow enough so that all of the data has time to reach its destination
before the next step begins. This means that the circuit needs to be slowed down to
accommodate the highest latency. Such an approach is clearly less than desirable in
the context of a NOW with high-latency links.
Received by the editors August 27, 1997; accepted for publication (in revised form) August 11,
1998; published electronically November 23, 1999.
http://www.siam.org/journals/sicomp/29-2/32650.html
†Bell Laboratories, Murray Hill, NJ (andrews@research.bell-labs.com, ylz@research.bell-labs.com).
The research of the first author was supported by NSF contract 9302476-CCR and ARPA contract
N00014-95-1-1246. The work of these authors was performed while at MIT.
‡Department of Mathematics and Laboratory for Computer Science, MIT, Cambridge, MA
(ftl@math.mit.edu). The work of this author was supported by ARMY grant DAAH04-95-1-0607
and ARPA contract N00014-95-1-1246.
§Department of Computer Science, Wellesley College, Wellesley, MA (pmetaxas@wellesley.edu).
The work of this author was supported by NSF contract 9504421-CCR and ARPA contract
N00014-95-1-1246.
An alternative approach is to organize the network in a hierarchical fashion so
that the latencies are consistent with the hierarchy. For example, the CM-5 [1, 14]
is organized into a fat tree and the KSR consists of two levels of nested rings. In
both cases, the highest latency links are segregated into the top levels of the network
hierarchy. This type of architecture works well for applications in which most of the
computation is local since local computation can proceed using the low-level low-
latency links. Only rarely, it is hoped, would the high-latency links be needed. Thus,
only certain steps of the computation would be slow. Unfortunately, this approach is
not suitable for scenarios where the network is unstructured (which is often the case
for a NOW) or when the underlying application requires frequent communications
through the high-level links.
Redundant computation is another approach that has been used in the past [6,
11, 13] to hide the effects of latency. Here the idea is to avoid latency by recomputing
data locally instead of waiting to receive it through a high-latency link.
Probably the most generally applicable method of hiding latency is the approach
known as complementary slackness. The idea behind this approach is to load each
processor with enough work so that it stays productive while waiting for data to be
supplied by the network. There are many implementations and incarnations of this
method. For example, each processor in the CRAY YMP C-90 keeps busy by operating
on a pipeline of 128 64-bit words. Processors on the HEP machine [21] swapped
between unrelated threads while waiting for the data. The CM-1 and CM-2 were
designed to simulate much larger virtual machines so that a single processor would
perform the computation of many virtual processors [4, 22]. The technique also forms
a critical component of Valiant’s bulk synchronous model of parallel computing [23, 24]
and it has been employed in several papers [3, 10, 11, 15, 20].
Unfortunately, in all of the preceding examples, it is incumbent on the programmer
to provide the slackness or pipelining needed or to determine what part of the
computation must be redundantly duplicated and by which processors to overcome
the latencies in the network. Even in the scenario where a large virtual network is
being simulated on a small parallel machine, it is incumbent on the programmer to
find the parallelism necessary to efficiently implement the algorithm on a (potentially
very large) virtual network.
The goal of our research is to devise automatic methods for hiding latency. Our
approach falls within the broad class of methods based on complementary slackness,
but it does not require the programmer to provide slackness, pipelines, or greater
parallelism in order to hide the latency. Rather, our methods attempt to find the
slackness automatically. By automatically finding the slackness, we hope to allow the
programmer to assume that there are uniform delays on each link of the network,
thereby easing the task of writing code. Moreover, our methods will enable us to
automatically convert a program that was written for a well-structured unit-delay
machine into a program that will run with minimal degradation in performance on
a network with potentially large and variable latencies, at least for certain classes of
networks.
1.2. Model and problem. We consider the problem of simulating a network G
with unit-delay links on a network H with arbitrary delays on its links. We refer to G
as the guest and H as the host. Let $g_1, g_2, \ldots$ be the processors of G and $p_1, p_2, \ldots$ be
the processors of H. We shall use pebbles to record the computations performed by the
guest processors. In particular, pebble $(i, t)$ represents the $t$th step of computation by
processor $g_i$. In a simulation of G, H carries out the same step-by-step computation
as G. In other words, H simulates G by computing every pebble created by G in an
order that preserves the "dependency" of the pebbles. Our goal is to provide methods
that would allow H to simulate G with a minimum amount of slowdown when G is
used in a general-purpose way. Formally, slowdown is the ratio of $T_H$ to $T_G$, where $T_G$
is the time taken by G to compute all the pebbles and $T_H$ is the time taken by H to
simulate this computation. Two computation models are studied here: the dataflow
model and the database model.
Fig. 1. The computation pebbles created by a guest linear array.
Dataflow model. In the dataflow model, each computation solely depends on
the computation of the previous step. Creating a pebble $(i, t)$ involves two time units.
The first time unit is for communication, where $g_i$ obtains pebbles of the form $(j, t-1)$
from all its neighbors $g_j$. The second time unit is for computation, where $g_i$ performs
computation based on pebbles $(j, t-1)$ and records the result in pebble $(i, t)$. Take as an
example an n-node guest linear array. In 2T time steps, G creates $n \times T$ pebbles,
where pebble $(i, t)$, for $1 < i < n$ and $1 < t \le T$, depends on pebbles $(i-1, t-1)$,
$(i, t-1)$, and $(i+1, t-1)$. (See Figure 1.) Any host processor p can compute
pebble $(i, t)$ as long as p has the information in pebbles $(i-1, t-1)$, $(i, t-1)$, and
$(i+1, t-1)$, either by directly computing these pebbles or by receiving them from
neighboring processors.
The dataflow model is applicable to many computations such as matrix operations,
Fourier transform, sorting, algorithms for computational geometry, etc. A large
number of examples can be found in [12].
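The pebble dependencies of a guest linear array can be illustrated with a small sketch. The function name `run_guest_linear_array` and the `combine` callback are illustrative assumptions, not part of the paper; the first row is treated as given input, and boundary processors simply use the neighbors they have.

```python
def run_guest_linear_array(first_row, extra_rows, combine):
    """Compute rows of pebbles for a unit-delay guest linear array.

    rows[t - 1][i - 1] plays the role of pebble (i, t).  Each new pebble is
    computed only from the pebbles of the same processor and its immediate
    neighbors in the previous row, as in the dataflow model.
    """
    n = len(first_row)
    rows = [list(first_row)]
    for _ in range(extra_rows):
        prev = rows[-1]
        # The slice holds pebbles (i-1, t-1), (i, t-1), (i+1, t-1), clipped at the ends.
        rows.append([combine(prev[max(i - 1, 0): i + 2]) for i in range(n)])
    return rows


if __name__ == "__main__":
    for row in run_guest_linear_array([1, 0, 0, 0], 2, sum):
        print(row)
```

Because a pebble depends only on the previous row, any processor holding those three values could compute it, which is the property the host simulations exploit.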
Database model. In the database model each guest processor $g_i$ has a potentially
large local memory that may be accessed and updated by $g_i$ during each step.
We refer to the local memory of $g_i$ as the database, $b_i$. Each computation depends not only
on the computation of the immediate past but also on the state of the database.
For example, let G be a linear array. To create pebble $(i, t)$, $g_i$ first communicates
with its neighbors, then performs computations based on pebbles $(i-1, t-1)$, $(i, t-1)$,
and $(i+1, t-1)$ and the current state of database $b_i$. Last, $g_i$ updates database $b_i$.
Hence, creating a pebble involves two time units as in the dataflow model: one for
communication and one for computation and recording.
In the database model, a pebble records not only the result of a computation
but also the changes to the database incurred by this computation. To emphasize,
a pebble does not contain a snapshot of the whole database but rather the changes
incurred by one computation. Therefore, a pebble has small size and can be passed
along links.
In order to simulate G on H, we assume that the initial contents of each database
can be copied before the computation begins (thereby allowing redundant computa-
tions), but that the large size of a database makes it impractical to transmit a copy
of a database through the network during the computation. Suppose processor p of
H copies databases $b_i$ and $b_j$; then p only has access to $b_i$ and $b_j$ and hence can only
compute pebbles of the form $(i, t)$ and $(j, t)$ for $t \ge 1$. Moreover, if both processors p
and q decide to copy $b_i$, then p and q each maintains a copy of $b_i$, and each looks up
and updates its own copy. If p is to compute pebble $(i, t)$, then p needs an updated
copy of the database that includes all the changes incurred by the computations $(i, t')$
for all $t' < t$. Hence, p must either have directly computed all the pebbles $(i, t')$ or
else have received the information from its neighbors.
Unlike the dataflow model, the database model captures a scenario where the
computation performed by a processor depends on the state of a local memory or
where part of the computation performed by a processor is to update its local memory.
These situations could be critical in some applications involving a network of
workstations.
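The requirement that a copy of $b_i$ must absorb the changes of every earlier pebble $(i, t')$ before pebble $(i, t)$ can be computed can be sketched as follows. The class `DatabaseReplica` and its dictionary representation are hypothetical, chosen only to illustrate the bookkeeping; the paper does not prescribe a data structure.

```python
class DatabaseReplica:
    """One host processor's private copy of a guest database b_i (illustrative)."""

    def __init__(self, initial_contents):
        self.contents = dict(initial_contents)  # copied before the computation begins
        self.last_step = 0                      # largest t whose changes are applied

    def apply(self, step, changes):
        """Apply the changes carried by pebble (i, step), in step order only."""
        if step != self.last_step + 1:
            raise ValueError("changes of all earlier pebbles must be applied first")
        self.contents.update(changes)           # a pebble carries changes, not a snapshot
        self.last_step = step


if __name__ == "__main__":
    p = DatabaseReplica({"x": 0})
    q = DatabaseReplica({"x": 0})
    for step, changes in [(1, {"x": 5}), (2, {"y": 7})]:
        p.apply(step, changes)   # p computed the pebbles itself
        q.apply(step, changes)   # q received the same small pebbles over links
    print(p.contents == q.contents)
```

Two replicas stay identical as long as both see every pebble's changes in order, which is why skipping a step is rejected.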
Bandwidth. The guest network G has unit bandwidth on each link. This allows
each pebble to be passed along a unit-delay link of G in one time step. In our
simulation we assume that the link bandwidth of the host network H is w. That is,
P pebbles can be passed along a d-delay link of H in $d + \lceil P/w \rceil - 1$ steps by pipelining.
In many cases of our study, it is sufficient to assume that the host and the guest have
comparable link bandwidth; i.e., w is a constant. However, in certain situations the
bandwidth needs to be $\tilde{O}(\log n)$. Otherwise, we pay an extra factor of $\tilde{O}(\log n)$ in the
slowdown. The details are discussed in sections 3.2.4 and 4.2.4.
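The pipelining formula above is easy to evaluate directly. The helper below is a minimal sketch; the function name and the convention that sending zero pebbles costs zero steps are assumptions added for illustration.

```python
import math


def pipelined_transfer_steps(num_pebbles, delay, bandwidth):
    """Steps to pass P pebbles over a d-delay link of bandwidth w: d + ceil(P/w) - 1."""
    if num_pebbles <= 0:
        return 0  # assumption: nothing to send costs nothing
    return delay + math.ceil(num_pebbles / bandwidth) - 1


if __name__ == "__main__":
    print(pipelined_transfer_steps(10, 5, 1))
    print(pipelined_transfer_steps(10, 5, 2))
```

The first batch of w pebbles arrives after d steps, and each further batch arrives one step later, so the delay is paid once rather than once per pebble.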
1.3. Results. Table 1 summarizes our results. In the table, n is the size of the
guest, $d_{ave}$ is the average delay of the host, and "Bd-deg" stands for bounded-degree.
The ratio of n to the slowdown is the size of the host, since all the simulations
are work efficient; i.e., it takes the guest and the host the same amount of work to
compute the same result, where work is the product of the number of processors used
and the running time.
The first two results in Table 1 are proved in terms of linear arrays. An n-node
unit-delay ring is essentially the same as an n-node unit-delay linear array, since the
latter can simulate the former with a slowdown of 2 [12]. Result 1 is asymptotically
optimal in some cases. In addition, we also have a constant-approximation algorithm
for simulating rings and linear arrays in the dataflow model. Results 2 and 3 are
optimal up to a polylogarithmic factor in some cases. Result 3 is for a worst-case
model. When the delays on the host are randomly arranged, the bound can be
improved to $O(d_{ave}^{2/3})$. Results 4 and 5 are easy generalizations of results 1 and 2,
respectively. Sections 2 and 3 present latency hiding methods for the dataflow model.
Section 4 concentrates on the database model.
The methods for latency hiding in the two computation models are substantially
different. For example, we make heavy use of redundant computation in the database
model, whereas redundancy is apparently not useful for the dataflow model.
Our bounds indicate that hiding latency in the database model is more difficult
than in the dataflow model. Intuitively, this is because computation in the dataflow
model is processor independent and hence can be done by any processor with the
information of the previous computation. In the database model, computation can
only be done by the processors with the right databases. One cannot afford to pass
large databases across the links with limited bandwidth, because this will cause high
slowdown. One also cannot afford to keep many copies of the databases, because
memory is expensive and keeping every copy of the databases updated is difficult.
In section 4, we also establish limits on the degree to which the high latency can
be mitigated when each database is allowed a small number of copies. For example, if
each database has only one copy, we show that the slowdown can be as much as $d_{max}$
even if $d_{ave}$ is a constant and the best simulation is used. When each database has
at most two copies and each host processor copies a constant number of databases,
we give an example of a host whose average delay is a constant, but for which the
slowdown has a lower bound of $\Omega(\log n)$. These results demonstrate that it is easier
to overcome latencies in dataflow types of computations than in computations that
require access to large local databases.
1.4. A related scheduling problem. The problem of latency hiding in the
dataflow model can be viewed as the following scheduling problem. The pebbles
created by the guest network together with their dependencies form a directed acyclic
graph (dag), whose nodes represent computational tasks of equal execution time, and
whose arcs represent precedence. All these tasks are to be computed by the processors
in a given host network. If the same host processor computes two tasks of direct
dependence, no communication cost is incurred. Otherwise, there is a communication
cost between the two host processors that compute these two tasks, and this cost is
equal to the total delay between the processors in the host network. The goal here
is to schedule the dag (with possible repetitions of the nodes) using the given host
processors so as to minimize the makespan, i.e., the total time taken to execute all
the tasks.
Table 1
Result summary.
    Guest              Host            Model     Order of slowdown
1   Ring/linear array  Bd-deg network  Dataflow  $\sqrt{d_{ave}}$
4   2-D array          Bd-deg network  Dataflow  $n^{1/4}(\sqrt{d_{ave}} + n^{1/4})$
5   2-D array          Bd-deg network  Database  $n^{1/4} \log^3 n\,(\sqrt{d_{ave}} + n^{1/4})$
A variation of the above scheduling problem has been studied. Here, we are given
any task dag (not necessarily created by a guest network in the dataflow model). All
the arcs in the dag are associated with a fixed quantity that indicates the communication
cost. Note that, unlike our problem, the communication cost here is the same for
any processor pair. In [18] Papadimitriou and Ullman studied an $n \times n$ grid dag (which
they called a diamond dag). They showed a nontrivial time-communication tradeoff
and gave an asymptotically optimal schedule. Their result was similar to the special
case of our Result 1 stated in section 1.3, where all the link delays in our host network
are the same. In [19] Papadimitriou and Yannakakis presented a 2-approximation
algorithm for general dags where an unlimited number of processors could be used.
For well-known families of dags such as the full binary tree, the diamond dag, and
the fast Fourier transform, only a finite number of processors were needed and their
approximation algorithms were optimal (or near optimal). Redundant computation
was used in [19].
Dag scheduling has been studied in other papers, including [2, 5, 7, 8, 9, 16, 17].
Some variations of the problem are the cases in which the dags are limited to certain
topologies, the task nodes require different execution times, arcs require different
communication time, and processors have different processing powers.
2. Dataflow model: Linear arrays. We begin our presentation with the
methods for hiding latency in linear arrays. Our basic approach is to transform a
process that involves two-way communication into a process that involves one-way
communication only. (This idea is also essential for simulating two-dimensional arrays
in section 3.) We present an asymptotically tight bound on the slowdown for linear
arrays. All the results for linear arrays are applicable to rings.
2.1. Average delay: An upper bound. Let the network G be an n-processor
guest linear array with unit delay on all the edges. Let the network H be an n-processor
host linear array with arbitrary delays, where $d_i$ is the delay on the $i$th edge
of H. As discussed in section 1.2, in 2T time steps G creates $n \times T$ pebbles, where
pebble $(i, t)$, for $1 < i < n$ and $1 < t \le T$, depends on pebbles $(i-1, t-1)$, $(i, t-1)$,
and $(i+1, t-1)$. We first present algorithm Stripe in which H simulates G with a
slowdown of $O(d_{ave})$, where $d_{ave} = \sum_{k=1}^{n-1} d_k/(n-1)$ is the average delay of H.
Consider the first n/2 rows of pebbles created by G. Let L be the triangle formed
by pebbles $(i, t)$, where $i + t \le n + 1$. Let R be the triangle formed by pebbles $(i, t)$,
where $i \ge t$. (See Figure 2.) In Stripe, H first simulates the bottom half of L
and then the bottom half of R. At this point every pebble in the first n/2 rows is
simulated. If the entire computation of G is partitioned into groups each of which
consists of n/2 rows of pebbles, then H can repeat the process and simulate every
group in a similar manner.
To simulate the bottom half of L, the computation pebbles of G are divided into
n slanted stripes, and each processor of H simulates one stripe. (See Figure 2.) In
particular, processor $p_i$ of H simulates a stripe consisting of pebbles $(i-t+1, t)$ for
$1 \le t \le i$ and $t \le n/2$. Note that in the original computation by G, processor $g_i$
depends on both $g_{i-1}$ and $g_{i+1}$. However, in the simulation by H, $p_i$ depends on $p_{i-1}$
and $p_{i-2}$. Hence, Stripe transforms a process that involves two-way communication
into a process that involves only one-way communication.
Lemma 2.1. Processor $p_i$ can compute pebble $(i-t+1, t)$ by step $t + \sum_{k=1}^{i-1} d_k$.
Proof. We use induction on i. The base case for $p_1$ is obvious. Pebble $(i-t+1, t)$
depends on pebbles $(i-t, t-1)$, $(i-t+1, t-1)$, and $(i-t+2, t-1)$, which are computed
by processors $p_{i-2}$, $p_{i-1}$, and $p_i$, respectively. By induction these three pebbles are
computed at steps $(t-1) + \sum_{k=1}^{i-3} d_k$, $(t-1) + \sum_{k=1}^{i-2} d_k$, and $(t-1) + \sum_{k=1}^{i-1} d_k$,
respectively. The first two must cross edges of total delay $d_{i-2} + d_{i-1}$ and $d_{i-1}$ to
reach $p_i$, so all three are available at $p_i$ by step $(t-1) + \sum_{k=1}^{i-1} d_k$.
It follows that $(i-t+1, t)$ can be computed at step $t + \sum_{k=1}^{i-1} d_k$.
Hence, pebbles $(i+1, n/2)$, for $0 \le i \le n/2$, are computed at steps
$n/2 + \sum_{k=1}^{i+n/2-1} d_k$, and so the bottom half of L is simulated in
$n/2 + \sum_{k=1}^{n-1} d_k$ steps by H.
Fig. 2. (Left) Triangles L and R. (Right) Algorithm Stripe. Each slanted stripe is simulated
by one processor of H. Arrows correspond to communications. Dashed lines correspond to the delays
$d_i$ encountered by communications.
The bottom half of R is simulated in a similar manner. (Note that the intersection
of R and L only needs to be computed once.) Thus, H has completed simulating
the first n/2 rows of pebbles created by G. To continue the simulation, each pebble
$(i, n/2)$ is passed to processor $p_i$. With pipelining, this can be done in $\sum_{k=1}^{n-1} d_k$ steps.
The next n/2 and every subsequent n/2 rows of pebbles can be simulated in a similar
manner. Therefore, the slowdown is upper bounded by
$$s = \frac{2\left(n/2 + \sum_{k=1}^{n-1} d_k\right) + \sum_{k=1}^{n-1} d_k}{n/2} = O(d_{ave}).$$
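The timing argument for Stripe on the bottom half of L can be checked numerically. The sketch below is an illustration under simplifying assumptions: each pebble takes one step to compute once its inputs have arrived, bandwidth is not a constraint, and the function name `stripe_completion_steps` is hypothetical. It computes the earliest completion step of each pebble from the dependency rule and leaves the comparison with the bound $t + \sum_{k=1}^{i-1} d_k$ to the caller.

```python
from itertools import accumulate


def stripe_completion_steps(delays, rows):
    """Earliest step at which p_i finishes the row-t pebble of stripe i.

    delays[k - 1] is d_k, the delay of the host edge between p_k and p_{k+1}.
    The returned dict maps (i, t) to a step; the pebble is guest pebble (i - t + 1, t).
    """
    n = len(delays) + 1
    prefix = [0] + list(accumulate(delays))      # prefix[i] = d_1 + ... + d_i
    done = {}
    for t in range(1, rows + 1):
        for i in range(t, n + 1):                # stripe i has a row-t pebble only if t <= i
            if t == 1:
                done[(i, t)] = 1                 # first-row pebbles need no input
                continue
            # Inputs from row t-1: own stripe, and stripe i-1 across edge d_{i-1}.
            ready = [done[(i, t - 1)],
                     done[(i - 1, t - 1)] + prefix[i - 1] - prefix[i - 2]]
            if t <= i - 1:                       # pebble (i - t, t - 1) exists, in stripe i-2
                ready.append(done[(i - 2, t - 1)] + prefix[i - 1] - prefix[i - 3])
            done[(i, t)] = max(ready) + 1        # one step to compute the pebble
    return done


if __name__ == "__main__":
    delays = [3, 1, 4, 1, 5]
    prefix = [0] + list(accumulate(delays))
    done = stripe_completion_steps(delays, 3)
    print(all(step <= t + prefix[i - 1] for (i, t), step in done.items()))
```

Every dependency points from a lower-numbered stripe to a higher-numbered one, so delays accumulate only along one direction of the host array.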
2.2. A better upper bound. To get a better upper bound on the best achievable
slowdown, we use the idea of "complementary slackness" in our new algorithm
called FatStripe. Each host processor is loaded with enough work to balance out the
communication time. Suppose FatStripe uses an interval of m processors to carry
out the simulation. For simplicity, assume that this interval consists of processors
$p_1, \ldots, p_m$. The bottom half of L is divided into m slanted stripes, each of which has
width $\ell = n/m$. Again, $p_i$ computes every pebble in stripe i. (See Figure 3.) Within
each stripe i, $p_i$ first computes all the pebbles in the bottom row and then moves up.
Lemma 2.2. Processor $p_i$ finishes simulating stripe i by step $\ell n/2 + \sum_{k=1}^{i-1} d_k$.
Proof. We inductively show that $p_i$ can compute the pebbles in the $x$th row of
stripe i by time step $\ell x + \sum_{k=1}^{i-1} d_k$. The base of the induction holds trivially for i = 1
and x = 1, since processor $p_1$ does not depend on other processors and pebbles in the
first row do not depend on other pebbles. Let us consider the pebbles on the $(x+1)$st
row of stripe i for $x \ge 1$ and $i \ge 1$. These pebbles can only depend on pebbles
on the $x$th row of stripes $i-2$, $i-1$, and $i$, which can be computed by processors $p_{i-2}$,
$p_{i-1}$, and $p_i$ by steps $\ell x + \sum_{k=1}^{i-3} d_k$, $\ell x + \sum_{k=1}^{i-2} d_k$, and $\ell x + \sum_{k=1}^{i-1} d_k$, respectively, by
induction. Hence, $p_i$ is able to receive all the information necessary to compute its
$(x+1)$st row by step $\ell x + \sum_{k=1}^{i-1} d_k$ and therefore finish computing the $(x+1)$st row
by step $\ell(x+1) + \sum_{k=1}^{i-1} d_k$. Since each stripe contains at most n/2 rows, $p_i$ finishes
simulating stripe i by step $\ell n/2 + \sum_{k=1}^{i-1} d_k$.
Hence, the slowdown is $O(n/m + \sum_{k=1}^{m-1} d_k/n)$ in simulating the first n/2 rows
of pebbles. All the subsequent n/2 rows can be simulated in a similar manner. To
minimize the slowdown, FatStripe uses the interval I (with $m_I$ processors and $d_I$
average delay) that minimizes the quantity $n/m_I + d_I m_I/n$. Therefore, Theorem 2.3
follows.
Fig. 3. Algorithm FatStripe. Processor $p_i$ simulates stripe i, which has width $\ell = n/m$.
Theorem 2.3. FatStripe achieves slowdown of $\min_{\text{intervals } I} O(n/m_I + d_I m_I/n)$.
In the case when $\sqrt{d_{ave}} \le n$, there exists an interval I with $m_I = n/\sqrt{d_{ave}}$
processors and average delay $d_I \le d_{ave}$ by the pigeon-hole principle. For this interval,
$n/m_I = \sqrt{d_{ave}}$ and $d_I m_I/n \le \sqrt{d_{ave}}$, so Theorem 2.3
implies that the slowdown is $O(\sqrt{d_{ave}})$ when I simulates G. In the case when
$\sqrt{d_{ave}} > n$, a single host processor is used to carry out the simulation, which incurs a
slowdown of $n = O(\sqrt{d_{ave}})$. The simulation is work efficient in both cases. Therefore,
Corollary 2.4 holds.
Corollary 2.4. FatStripe efficiently simulates G on H and achieves a slowdown
of $O(\sqrt{d_{ave}})$, where $d_{ave}$ is the average delay of H.
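The interval selection in Theorem 2.3 can be illustrated by a brute-force search. This is only a sketch: the function name is hypothetical, the quantity $d_I m_I$ is approximated by the total delay of the edges inside the interval, and no attempt is made to be faster than quadratic time.

```python
from itertools import accumulate


def best_fatstripe_interval(delays):
    """Search all host intervals for the one minimizing n/m + (total delay)/n.

    delays[k] is the delay of the edge between host processors k and k + 1
    (0-based).  Returns (cost, first_processor, last_processor).
    """
    n = len(delays) + 1
    prefix = [0] + list(accumulate(delays))       # prefix[k] = sum of the first k delays
    best = None
    for start in range(n):
        for end in range(start, n):
            m = end - start + 1                   # processors in the interval
            total_delay = prefix[end] - prefix[start]   # edges inside the interval
            cost = n / m + total_delay / n
            if best is None or cost < best[0]:
                best = (cost, start, end)
    return best


if __name__ == "__main__":
    # One very slow edge in the middle: the search stays on one side of it.
    print(best_fatstripe_interval([1, 1, 1, 1000, 1, 1, 1]))
```

With uniform unit delays the whole array is the best interval, while a single expensive edge pushes the choice toward a shorter interval that avoids it.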
Let us consider the effect of bandwidth on the slowdown. In FatStripe, as long
as the stripe width is at least 2, pebbles cross the edges one at a time by using
pipelining. In Stripe (i.e., FatStripe with stripe width 1) at most two pebbles may
cross an edge at the same time. Therefore, it is sufficient for the host bandwidth to
be twice as large as the guest bandwidth. Otherwise, we pay another factor
of 2 in the slowdown.
2.3. A matching lower bound. We proceed to show that the upper bound,
$\min_I O(n/m_I + d_I m_I/n)$, in Theorem 2.3 is asymptotically tight by showing that
$\min_I \max\{n/2m_I,\ d_I m_I/2n\}$ is a lower bound on the best achievable slowdown even
if we allow redundant computation. Note that with redundant computation, a pebble
may be computed by several host processors. This technique makes it more likely for
the host to simulate the guest efficiently. However, we show below that redundancy
does not help in this case.
Lemma 2.5. Pebble $(1, n)$ cannot be computed at a step earlier than
$\tau = \min_I \max\{n^2/2m_I,\ d_I m_I/2\}$.
Proof. We consider how the pebbles in L are computed in some simulation of
G by H. In particular, we build a ternary tree $\mathcal{T}$ to keep track of the processors
that have "effectively" computed the pebbles in L. The top pebble $(1, n)$ has to be
computed by some processor of H. Call this processor q. (If more than one processor
of H has computed $(1, n)$, then we pick any one of them to be q.) We label the root of
tree $\mathcal{T}$ with $q_{(1,n)}$. Let u be a processor that has computed $(1, n-1)$ and has passed
this information to q, and let v be a processor that has computed $(2, n-1)$ and has
passed this information to q. (Note that other processors may compute $(1, n-1)$ and
$(2, n-1)$. We are only concerned with processors that pass information to q.) Now
label the children of $q_{(1,n)}$ with $u_{(1,n-1)}$ and $v_{(2,n-1)}$. We proceed to construct the
children of $u_{(1,n-1)}$ and $v_{(2,n-1)}$. In general, node $a_{(i,t)}$ in $\mathcal{T}$ has children $b_{(i-1,t-1)}$,
$c_{(i,t-1)}$, and $d_{(i+1,t-1)}$ if the following holds. Processors a, b, c, and d compute pebbles
$(i, t)$, $(i-1, t-1)$, $(i, t-1)$, and $(i+1, t-1)$, respectively, and a receives the values
of $(i-1, t-1)$, $(i, t-1)$, and $(i+1, t-1)$ from b, c, and d before a is able to compute
$(i, t)$. The leaves of $\mathcal{T}$ are nodes of the form $p_{(i,1)}$. The important observation is the
following. If $p_{(i,t)}$ is a node in $\mathcal{T}$, then information has to be passed from processor p to
q in H. The total delay from p to q lower bounds the number of steps in the simulation.
Let J be the smallest interval that contains all the processors appearing in tree $\mathcal{T}$.
If processors x and y are at the two ends of J, then there exist two nodes of the form
$x_{(i_x,t_x)}$ and $y_{(i_y,t_y)}$ in $\mathcal{T}$. Hence, information has to be passed from x and y to q in H.
This takes at least $d_J m_J/2$ steps, and pebble $(1, n)$ therefore cannot be computed at
a step earlier than $d_J m_J/2$. Since $m_J$ processors are computing $n^2/2$ pebbles, a work
argument shows that $(1, n)$ cannot be computed before step $n^2/2m_J$. Hence, $(1, n)$
cannot be computed at a step earlier than $\tau = \min_I \max\{n^2/2m_I,\ d_I m_I/2\}$.
It follows that the slowdown in simulating triangle L is lower bounded by $\tau/n$.
By a similar argument to Lemma 2.5 none of the pebbles $(i, n)$, for $1 \le i \le n$, can be
computed at a time step earlier than $\tau$. By repeating this argument the first kn rows
of G cannot be simulated in time less than $k\tau$. Therefore, we obtain Theorem 2.6.
Theorem 2.6. The slowdown of any simulation of an n-node guest linear array
G by a host linear array H is lower bounded by $\min_I (n/m_I + d_I m_I/n)$, where I is
a subarray of H and has $m_I$ processors and average delay $d_I$. Hence, FatStripe is
optimal up to a constant factor.
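The step from the max-form lower bound of this section to the sum-form bound of Theorem 2.3 can be made explicit. The short derivation below is an added worked step rather than part of the original proof; it uses only the fact that the maximum of two nonnegative numbers is at least their average.

```latex
\[
\max\left\{\frac{n}{2m_I},\ \frac{d_I m_I}{2n}\right\}
\;\ge\; \frac{1}{2}\left(\frac{n}{2m_I} + \frac{d_I m_I}{2n}\right)
\;=\; \frac{1}{4}\left(\frac{n}{m_I} + \frac{d_I m_I}{n}\right)
\quad\text{for every interval } I,
\]
\[
\text{so}\qquad
\min_I \max\left\{\frac{n}{2m_I},\ \frac{d_I m_I}{2n}\right\}
\;\ge\; \frac{1}{4}\,\min_I \left(\frac{n}{m_I} + \frac{d_I m_I}{n}\right).
\]
```

Read this way, the lower bound and the upper bound of Theorem 2.3 differ by at most a constant factor.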
2.4. Simulating linear arrays on general networks. We now consider simulating
a linear array G on a general n-node network H with average delay $d_{ave}$. We
first embed a linear array $H'$ in H and then use $H'$ to carry out the simulation of G.
Lemma 2.7. Let H be a connected n-node network with arbitrary topology. Then
an n-node linear array $H'$ can be one-to-one embedded in H such that every edge of
H is used at most twice in $H'$.
Proof. Our proof follows the approach of Theorem 3.15 in [12, page 470]. We
include the proof here for completeness. It is sufficient to embed a linear array $H'$ in
a spanning tree of H. The proof proceeds by induction on the height of the tree with
the following inductive hypothesis. For any child u of the root v, there is a one-to-one
embedding of a linear array in the tree such that v and u form two endpoints of the
array, the edge uv is used at most once, and all other edges of the tree are used at
most twice. (Note that we treat all the edges as undirected.)
Let T be any spanning tree of H. The base of the induction in which T is a single
node, i.e., the height is 0, is trivial. Otherwise, let v be the root of T and u be any
child of v. We label the children of v as $u_1, \ldots, u_d$ and assume $u = u_d$ without loss of
generality. We place the first node of the linear array at v, and we place the second
node of the array at any child w of $u_1$ (if any) using edges $vu_1$ and $u_1w$. Next, we
inductively place the nodes of the array in each node of the subtree of T rooted at
$u_1$, making sure that the last node is placed at $u_1$, the edge $u_1w$ is used at most once,
and that all other edges in the subtree are used at most twice. Therefore, edge $u_1w$ is used at
most twice in total.
We place the next node of the linear array at any child x of $u_2$ (if any), using
edges $u_1v$, $vu_2$, and $u_2x$. Again, we inductively place nodes of the linear array in the
subtree rooted at $u_2$ such that $u_2$ and x are endpoints. We continue in this fashion.
At the last subtree rooted at u, we enter this subtree at a child of u (if any) and exit
at u. This completes the embedding of the linear array. Our lemma follows from the
observation that the linear array has endpoints v and u, edges $vu_1, \ldots, vu_{d-1}$ are
used twice, and $vu_d$ is used once. (See Figure 4.)
Fig. 4. Embed a linear array one to one in a tree such that each tree edge is used at most twice.
The dotted lines indicate tree edges and the solid lines indicate array edges.
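One way to realize an embedding with the properties of Lemma 2.7 is to list a node before its subtrees at even depth and after its subtrees at odd depth. The sketch below uses that alternating traversal, which is an assumed variant and not necessarily the exact ordering produced by the induction above. The helper `edge_usage` and the dictionary-of-children tree format are illustrative choices.

```python
def embed_linear_array(children, root):
    """Order the tree nodes so that consecutive nodes are close in the tree.

    children maps a node to the list of its children.  Nodes at even depth are
    listed before their subtrees, nodes at odd depth after them.
    """
    order = []

    def visit(node, depth):
        if depth % 2 == 0:
            order.append(node)
        for child in children.get(node, []):
            visit(child, depth + 1)
        if depth % 2 == 1:
            order.append(node)

    visit(root, 0)
    return order


def edge_usage(children, root, order):
    """Count how often each tree edge lies on a path between consecutive array nodes."""
    parent, depth = {root: None}, {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            parent[child], depth[child] = node, depth[node] + 1
            stack.append(child)
    usage = {}
    for a, b in zip(order, order[1:]):
        while a != b:                      # climb from the deeper endpoint toward the other
            if depth[a] < depth[b]:
                a, b = b, a
            edge = (parent[a], a)
            usage[edge] = usage.get(edge, 0) + 1
            a = parent[a]
    return usage


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [3, 4], 3: [5], 2: [6]}
    order = embed_linear_array(tree, 0)
    print(order, max(edge_usage(tree, 0, order).values()))
```

On this small tree the array starts at the root, ends at a child of the root, and no tree edge carries more than two array edges, matching the statement of the lemma.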
Since H has n nodes and degree $\delta$, H has at most $\delta n/2$ edges, and therefore the
total delay on all edges of H is at most $d_{ave}\,\delta n/2$. By Lemma 2.7, $H'$ uses each
edge of H at most twice. Hence, the total delay on all edges of $H'$ is at most $d_{ave}\,\delta n$,
and the average delay of $H'$ is $O(\delta d_{ave})$. By Corollary 2.4, $H'$ can simulate G
with a slowdown of $O(\sqrt{\delta d_{ave}})$. When H has bounded degree, i.e., $\delta = O(1)$, we have
Theorem 2.8.
Theorem 2.8. A bounded-degree host network with average delay $d_{ave}$ can efficiently
simulate an n-processor guest linear array with a slowdown of $O(\sqrt{d_{ave}})$.
Theorem 2.8 does not hold when $H$ has unbounded degree. Consider the following
example. Let $H$ be a linear array of $\sqrt{n}$ cliques, in which each clique contains $\sqrt{n}$
nodes. If a clique edge has delay 1 and an edge connecting two adjacent cliques has
delay $n$, then $H$ has $d_{ave} < 4$. Suppose $m$ connected cliques are used to simulate $n$
steps of $G$. Lemma 2.5 implies a slowdown of $\min_m \max\{\sqrt{n}/(2m),\, m/2\}$ in simulating
every $n$ steps of computation by the guest. The first term follows from a work
argument, since $m\sqrt{n}$ processors are in $m$ cliques. The second term comes from the
communication delay, since a linear array embedded in these $m$ connected cliques has
a total delay of at least $mn$. Hence, the slowdown is at least $\min_m \max\{\sqrt{n}/(2m),\, m/2\}$,
which is $\Omega(n^{1/4})$, whereas the average delay is a constant.
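The balance between the two terms can be checked numerically. The sketch below simply evaluates the expression $\max\{\sqrt{n}/(2m), m/2\}$ for every integer $m$; the function name `best_clique_count` is an assumption made for illustration.

```python
import math


def best_clique_count(n):
    """Return (bound, m) minimizing max(sqrt(n)/(2m), m/2) over 1 <= m <= sqrt(n)."""
    root = math.isqrt(n)
    return min(
        (max(math.sqrt(n) / (2 * m), m / 2), m) for m in range(1, root + 1)
    )


bound, m = best_clique_count(65536)
print(bound, m, 65536 ** 0.25 / 2)
```

For $n = 2^{16}$ the minimum is attained at $m = n^{1/4} = 16$, where both terms equal $n^{1/4}/2$.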
3. Dataflow model: Two-dimensional arrays. In this section we present
methods for hiding latency in two-dimensional arrays. The analysis here is substantially
more complex than that for the one-dimensional case. We focus on simulating
a two-dimensional array on a two-dimensional array. Section 3.1 generalizes the
approach used for linear arrays. Section 3.2 introduces new machinery to improve
the bound. Section 3.3 discusses the case when the delays are randomly arranged.
Fig. 5. Algorithm 2d-Ray. In pyramid $P_1$ the dashed line represents the ray of pebbles computed
by processor $p_{i,j}$ (which is shown in the upper left corner). Two of those pebbles, computed at times
$t$ and $t-1$, are shown shaded. The five numbered pebbles are those upon which $(i-t+1, j-t+1, t)$
depends.
3.1. An analogue of the one-dimensional case. Let the guest network $G$ be
an $n \times n$ two-dimensional array with unit delay on all the edges. Let the host network
$H$ be an $n \times n$ two-dimensional array with arbitrary delays. Let $x_{i,j}$ be the delay
between processors $p_{i,j}$ and $p_{i+1,j}$ of $H$ for $1 \le i \le n-1$ and $1 \le j \le n$, and let $y_{i,j}$
be the delay between $p_{i,j}$ and $p_{i,j+1}$ of $H$ for $1 \le i \le n$ and $1 \le j \le n-1$. The $t$th
step of computation by processor $g_{i,j}$ of $G$ is recorded in pebble $(i, j, t)$. In $T$ steps,
$G$ creates $n \times n \times T$ pebbles, where pebble $(i, j, t)$, for $1 < i, j < n$ and $1 < t \le T$,
depends on $(i, j, t-1)$, $(i-1, j, t-1)$, $(i+1, j, t-1)$, $(i, j-1, t-1)$, and $(i, j+1, t-1)$.
In particular, pebble $(i-t+1, j-t+1, t)$ depends on $(i-t+1, j-t+1, t-1)$, $(i-t, j-t+1, t-1)$,
$(i-t+2, j-t+1, t-1)$, $(i-t+1, j-t, t-1)$, and $(i-t+1, j-t+2, t-1)$.
Consider the first $n/2$ steps of computation by $G$. We define four pyramids $P_1$, $P_2$,
$P_3$, and $P_4$ analogous to the left and right triangles in the linear array case. All four
pyramids have the square defined by vertices $(1, 1, 1)$, $(1, n, 1)$, $(n, 1, 1)$, and $(n, n, 1)$
as their base. The top vertices of $P_1$, $P_2$, $P_3$, and $P_4$ are $(1, 1, n)$, $(1, n, n)$, $(n, 1, n)$,
and $(n, n, n)$, respectively. Note that the bottom halves of the four pyramids together contain
all the pebbles created by $G$ in the first $n/2$ steps of computation.
Algorithm 2d-Ray is a two-dimensional analogue of Stripe. To simulate the
first $n/2$ steps of computation of $G$, 2d-Ray simulates $P_1$, $P_2$, $P_3$, and $P_4$ one by
one. Pyramid $P_1$ is divided into $n^2$ rays, each of which is simulated by one processor
of $H$. In particular, processor $p_{i,j}$ of $H$ simulates ray $R_{i,j}$, consisting of pebbles
$(i-t+1, j-t+1, t)$ for $1 \le t \le \min\{i, j, n/2\}$. (See Figure 5.) When every pebble
for the first $n/2$ steps of computation of $G$ is simulated, 2d-Ray repeats the process
and simulates the next $n/2$ steps of computation. In the following we bound the
slowdown in terms of the total delay on monotone paths, where a monotone path
travels in two directions, up and right. Let the length of a path be the total delay on
the path, and let $D_{i,j}$ be the length of the longest monotone path from processor $p_{1,1}$
to $p_{i,j}$ in $H$. We have Lemma 3.1.
Lemma 3.1. Processor $p_{i,j}$ of $H$ is able to compute pebble $(i-t+1, j-t+1, t)$
at step $D_{i,j} + t$.
Proof. We use induction on the indices $(i, j)$ of the processors. The base of
the induction for $p_{1,1}$ is obvious. Pebble $(i-t+1, j-t+1, t)$ depends on pebbles
$(i-t+1, j-t+1, t-1)$, $(i-t, j-t+1, t-1)$, $(i-t+2, j-t+1, t-1)$, $(i-t+1, j-t, t-1)$,
and $(i-t+1, j-t+2, t-1)$, which are computed by processors $p_{i-1,j-1}$, $p_{i-2,j-1}$, $p_{i,j-1}$,
$p_{i-1,j-2}$, and $p_{i-1,j}$, respectively. (See Figure 5.) By induction, these five pebbles are
computed at steps $D_{i-1,j-1} + (t-1)$, $D_{i-2,j-1} + (t-1)$, $D_{i,j-1} + (t-1)$, $D_{i-1,j-2} + (t-1)$,
and $D_{i-1,j} + (t-1)$, respectively. It follows that pebble $(i-t+1, j-t+1, t)$ can be
computed at step $\max\{D_{i-1,j} + x_{i-1,j},\, D_{i,j-1} + y_{i,j-1}\} + t = D_{i,j} + t$.
Hence, 2d-Ray simulates pyramid $P_1$ in $D_{n,n} + n$ steps. Since $P_2$, $P_3$, and $P_4$
can be simulated similarly, 2d-Ray simulates the first $n/2$ steps of computation of $G$
in $O(D_{n,n} + n)$ steps. The simulation is repeated for every $n/2$ steps of computation
of $G$. Therefore, Lemma 3.2 holds.
Lemma 3.2. Algorithm 2d-Ray achieves a slowdown of $O(D_{n,n}/n)$, where $D_{n,n}$
is the length of the longest monotone path in $H$.
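The quantity $D_{i,j}$ satisfies the recurrence used at the end of the proof of Lemma 3.1, so it can be computed by dynamic programming. The following is a minimal sketch; the function name `longest_monotone_path`, the 0-based indexing, and the nested-list layout of the delays are assumptions made for illustration.

```python
def longest_monotone_path(x, y):
    """Return the table D of longest monotone path lengths from p(1,1).

    x[i][j] is the delay between p(i,j) and p(i+1,j), an (n-1) x n table;
    y[i][j] is the delay between p(i,j) and p(i,j+1), an n x (n-1) table.
    Indices are 0-based here, so D[n-1][n-1] plays the role of D_{n,n}.
    """
    n = len(y)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            best = 0
            if i > 0:
                best = max(best, D[i - 1][j] + x[i - 1][j])
            if j > 0:
                best = max(best, D[i][j - 1] + y[i][j - 1])
            D[i][j] = best
    return D


# A 3 x 3 host with unit delays except for one edge of delay 10.
x = [[1, 1, 1], [1, 1, 1]]
y = [[1, 1], [1, 10], [1, 1]]
D = longest_monotone_path(x, y)
print(D[-1][-1], D[-1][-1] / len(y))  # D_{n,n} and the ratio D_{n,n}/n of Lemma 3.2
```

A single slow edge is enough to raise $D_{n,n}$ well above the all-unit value $2(n-1)$, which is the weakness that FatRay and section 3.2 address.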
Unfortunately, $D_{n,n}$ can be large compared with $d_{ave}$, the average delay of $H$.
In the worst case $D_{n,n}$ can be $\Theta(n^2 d_{ave})$, implying a slowdown of $\Theta(n d_{ave})$. We
introduce algorithm FatRay, a two-dimensional analogue of FatStripe, to achieve
a slowdown that is often better than $O(D_{n,n}/n)$. Pyramid $P_1$ is divided into $m^2$ rays,
each of which has size $\ell \times \ell = \frac{n}{m} \times \frac{n}{m}$. FatRay uses an $m \times m$ contiguous subarray
of processors in $H$ to carry out the simulation. For simplicity, assume FatRay uses
processors $p_{i,j}$ ($1 \le i, j \le m$). Again, $p_{i,j}$ computes every pebble in ray $R_{i,j}$, and $p_{i,j}$
first computes all the pebbles on the bottom plane and then moves up. The following
lemma is analogous to Lemma 2.2.
Lemma 3.3. Processor $p_{i,j}$ finishes simulating ray $R_{i,j}$ by step $\ell^2 n/2 + D_{i,j}$.
Proof. As in Lemma 2.2 we can inductively show that $p_{i,j}$ can compute all the
pebbles in the $x$th plane in ray $R_{i,j}$ by time step $\ell^2 x + D_{i,j}$. Since each ray
contains at most $n/2$ planes of pebbles, $p_{i,j}$ finishes simulating ray $R_{i,j}$ by step
$\ell^2 n/2 + D_{i,j}$.
This implies a slowdown of $O(n^2/m^2 + D_{m,m}/n)$. To minimize the slowdown,
FatRay uses the contiguous subarray $S$ that minimizes $n^2/m_S^2 + D_S/n$, where $m_S \times m_S$
is the size of $S$ and $D_S$ is the length of the longest monotone path in $S$.
Theorem 3.4. FatRay achieves a slowdown of $\min_{\text{subarrays } S} O(n^2/m_S^2 + D_S/n)$.
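For a small host array, the subarray that minimizes the expression in Theorem 3.4 can be found by exhaustive search. The sketch below is illustrative only: the function name `fatray_best_subarray` and the 0-based delay tables are assumptions, constants hidden in the $O(\cdot)$ are ignored, and the brute-force search is not claimed to be how the paper selects $S$.

```python
def fatray_best_subarray(x, y):
    """Search all square subarrays S for the minimum of n^2/m_S^2 + D_S/n.

    x[i][j] is the delay between p(i,j) and p(i+1,j); y[i][j] is the delay
    between p(i,j) and p(i,j+1), both 0-based. Returns (cost, m, row, col),
    where (row, col) is the corner of the chosen m x m subarray.
    """
    n = len(y)
    best = None
    for m in range(1, n + 1):
        for r in range(n - m + 1):
            for c in range(n - m + 1):
                # Longest monotone path inside the m x m subarray at (r, c).
                D = [[0] * m for _ in range(m)]
                for i in range(m):
                    for j in range(m):
                        if i > 0:
                            D[i][j] = max(D[i][j], D[i - 1][j] + x[r + i - 1][c + j])
                        if j > 0:
                            D[i][j] = max(D[i][j], D[i][j - 1] + y[r + i][c + j - 1])
                cost = n * n / (m * m) + D[m - 1][m - 1] / n
                if best is None or cost < best[0]:
                    best = (cost, m, r, c)
    return best


n = 4
unit_x = [[1] * n for _ in range(n - 1)]
unit_y = [[1] * (n - 1) for _ in range(n)]
print(fatray_best_subarray(unit_x, unit_y))
```

With unit delays the whole array is the best choice; with very large delays the search falls back to a single processor.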
Unfortunately, the slowdown can still be large compared with $d_{ave}$. For example,
suppose that $H$ is partitioned into $n$ squares of size $\sqrt{n} \times \sqrt{n}$ with one edge of delay
$n$ in the center of each square and unit delay on all other edges. The slowdown is
$\min_S (n^2/m_S^2 + D_S/n) = \Theta(n^{1/3})$, whereas $d_{ave}$ is a constant. Matters are better,
however, when all the delays are the same, as we show in the following theorem.
Theorem 3.5. In the case where all the delays in $H$ are $d$, FatRay efficiently
simulates $G$ on $H$ and achieves a slowdown of $\Theta(\min\{d^{2/3}, n^2\})$. The slowdown is
optimal up to a constant factor.
Proof. When $d \le n^3$, FatRay uses an $m \times m$ subarray with $m = n d^{-1/3}$, in which the
longest monotone path has length $O(md)$. Theorem 3.4 implies
a slowdown of $O(d^{2/3})$. We show that the slowdown is asymptotically tight as follows.
Consider pebble $(i, j, d^{1/3})$, and suppose processor $q$ computes it in a simulation.
Let $A$ be the set of pebbles of the form $(i', j', t)$, for $1 \le t < d^{1/3}$, on which $(i, j, d^{1/3})$
depends; i.e., $(i, j, d^{1/3})$ cannot be computed until after $(i', j', t)$ is computed. If every
pebble in $A$ is computed by $q$, then it takes at least $|A| = \Omega((d^{1/3})^3) = \Omega(d)$ time steps
to simulate $A$. Otherwise, a processor $p \ne q$ computes some pebble in $A$ and passes
this information to $q$. The delay from $p$ to $q$ is at least $d$. In either case, simulating the
first $d^{1/3}$ steps takes $\Omega(d)$ time, so the slowdown is $\Omega(d^{2/3})$. The same argument
applies for the slowdown in the next $d^{1/3}$ steps.
When $d > n^3$, FatRay uses a single host processor for the simulation and achieves
a slowdown of $O(n^2)$. This slowdown is asymptotically tight for the same reason as
in the previous case. We consider pebbles $(i, j, n)$ instead of $(i, j, d^{1/3})$. In both cases
the simulation is work efficient.
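The choice of subarray size behind the $O(d^{2/3})$ bound can be written out explicitly. The following worked step is a sketch under the assumption that every delay equals $d$, so that an $m \times m$ subarray $S$ has $D_S = 2(m-1)d$; constants hidden in the $O(\cdot)$ of Theorem 3.4 are ignored.

```latex
\[
  \frac{n^2}{m^2} + \frac{D_S}{n}
  \;\le\; \frac{n^2}{m^2} + \frac{2md}{n},
  \qquad
  m = n d^{-1/3}
  \;\Longrightarrow\;
  \frac{n^2}{m^2} = d^{2/3},
  \quad
  \frac{2md}{n} = 2 d^{2/3}.
\]
```

Both terms are $O(d^{2/3})$ provided $1 \le m \le n$, that is, $d \le n^3$. For $d > n^3$ the choice $m = 1$ gives $D_S = 0$ and a slowdown of $n^2$.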
Theorem 3.5 can be generalized to any $k$-dimensional array for $k \ge 1$.
Theorem 3.6. Suppose $G$ is an $n \times \cdots \times n$ $k$-dimensional array with unit-delay
edges, and $H$ is an $n \times \cdots \times n$ $k$-dimensional array with delay-$d$ edges; then $H$
can efficiently simulate $G$ with a slowdown of $\Theta(\min\{d^{k/(k+1)}, n^k\})$. The slowdown is
optimal up to a constant factor.
3.2. Improved bounds for worst-case delays. In order to improve the slowdown,
we observe that not all the host processors are useful. If a host processor
is surrounded by high delays, then the benefit to be gained by using its computing
power is nullified by the communication cost. We first describe criteria for removing
such host processors. We then embed guest processors onto the unremoved host
processors. Suppose that guest processor $g_{i,j}$ is mapped to host processor $p$; then $p$
computes the pebbles in ray $R_{i,j}$ in the 2d-Ray algorithm. For any arrangement of
the delays in $H$, we show how to embed $G$ on $H$ such that, for any monotone path
in $G$, its image in $H$ has length $O(d_{ave} n \log^{5/2} n)$. As a result, Lemma 3.2 implies
a slowdown of $O(d_{ave} \log^{5/2} n)$ as long as only $O(1)$ guest processors are mapped to
each host processor. By using a subarray of $H$ as in FatRay, we then improve the slowdown
to $O(d_{ave}^{2/3} \log^{5/3} n)$ and achieve work efficiency at the same time.
3.2.1. Removing useless processors. We first recursively represent $H$ using a
quadtree, in which each node corresponds to a subarray of $H$. The root represents the
entire $n \times n$ array. The four children of the root represent the four $\frac{n}{2} \times \frac{n}{2}$ subarrays,
and so on. In general, a node at depth $k$ represents an $\frac{n}{2^k} \times \frac{n}{2^k}$ subarray
of $H$. We refer to this subarray as a depth-$k$ array. The leaves represent the individual
processors of $H$. (See Figure 6.)
We describe a two-stage procedure to remove "useless" processors of $H$. A processor
is removed if it is surrounded by high delays (Stage 1) or by few unremoved processors
(Stage 2). (When a processor is removed, its incident edges remain in the network.)
For each depth $k$, we define two quantities: $D_k$, the "delay threshold," and $m_k$, the
"survival threshold." Note that $D_k$ is larger than the average delay on a row/column in a
depth-$k$ array by a factor of $\Theta(\log n)$, and $m_k$ is smaller than the number of processors
in a depth-$k$ array by a factor of $\Theta(\log n)$:
$$D_k = c\,\frac{n}{2^k}\,d_{ave} \log n, \qquad (1)$$
$$m_k = \frac{n^2}{4^k\, c \log n}. \qquad (2)$$
Fig. 6. The quad tree that represents H.
A constant $c$ is specified later. We also define a maximum depth $k_{\max}$ such that when
$k = k_{\max}$ the survival threshold $m_k$ becomes 1, that is, $4^{k_{\max}} = n^2/(c \log n)$.
• Stage 1. From depth $k = k_{\max}$ down to depth 0, if the total delay on a
row/column of a depth-$k$ array exceeds the threshold $D_k$, then all the $\frac{n}{2^k}$
processors on that row/column are removed.
• Stage 2. From depth $k = k_{\max}$ down to depth 0, if the number of unremoved
processors in a depth-$k$ array is smaller than the threshold $m_k$, then all the
processors in that array are removed. Moreover, we also remove processors
so that the number of remaining processors in any depth-$k$ array is an integer
multiple of $m_k$.
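Lemma 3.7. At most $2n^2/c$ processors are removed in Stage 1.
Proof. The total delay of $H$ is $2n^2 d_{ave}$. At most $\frac{2n2^k}{c \log n}$ depth-$k$ rows and columns
can therefore have total delay exceeding $D_k$. Each such row or column contains $\frac{n}{2^k}$
processors, so at most $\frac{2n^2}{c \log n}$ processors are removed at depth $k$. There are $\log n$ depths,
and so the lemma follows.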
Lemma 3.8. At most $n^2/c$ processors are removed at Stage 2.
Proof. Since there are $4^k$ depth-$k$ arrays, at most $\frac{n^2}{c \log n}$ processors are removed
at depth $k$.
We label each array with the number of unremoved processors contained in it.
By Lemmas 3.7 and 3.8, at most $3n^2/c$ processors of $H$ are removed. Therefore,
$H$ is labeled with $c_1 n^2$, where $c_1 \ge 1 - (3/c)$. Any constant $c > 3$ works for our
argument.
3.2.2. The embedding. For clarity of presentation, we create an intermediate
two-dimensional array $\bar{G}$ that has size $\sqrt{c_1}\,n \times \sqrt{c_1}\,n$ and unit-delay edges only.
We describe an algorithm Embed that maps the processors of $\bar{G}$ one to one to the
Fig. 7. (Left) Depth-$k$ region $R$ and depth-$(k+1)$ regions $R_1$, $R_2$, and $R_3$ of the intermediate array $\bar{G}$. (Right) Depth-$k$
and depth-$(k+1)$ arrays of $H$. Depth-$(k+1)$ region $R_i$ has size $z_i$, where $z_i$ is the number of unremoved
processors in the corresponding array of $H$. In this figure, $z_4 = 0$, and $R_4$ is therefore empty.
unremoved processors of $H$. The goal is to show that for any monotone path in the
intermediate array $\bar{G}$ its image in $H$ under Embed has length $O(d_{ave} n \log^{5/2} n)$. As a
result, $H$ can simulate $\bar{G}$ with a slowdown of $O(d_{ave} \log^{5/2} n)$. Obviously $\bar{G}$ can
simulate $G$ with constant slowdown.
Embed partitions the intermediate array $\bar{G}$ into regions recursively, and each depth-$k$
region of $\bar{G}$ corresponds to a depth-$k$ array of $H$. The depth-0 region is the entire network $\bar{G}$. By
the construction of Stage 2, $c_1 n^2$ (the number of processors in $\bar{G}$) is a multiple of $m_0$.
Hence, $\bar{G}$ can be viewed as a collection of contiguous squares of size $\sqrt{m_0} \times \sqrt{m_0}$.
We inductively assume that each depth-$k$ region consists of contiguous squares of
size $\sqrt{m_k} \times \sqrt{m_k}$, where $m_k$ is defined in equation (2). Each depth-$k$ region $R$ is
partitioned into four depth-$(k+1)$ regions $R_1$, $R_2$, $R_3$, and $R_4$ as follows. First, each
$\sqrt{m_k} \times \sqrt{m_k}$ square of $R$ is divided into four squares of size $\sqrt{m_{k+1}} \times \sqrt{m_{k+1}}$, where
$\sqrt{m_{k+1}} = \sqrt{m_k}/2$. Suppose that $R_i$ corresponds to a depth-$(k+1)$ array of $H$ that
has $z_i$ unremoved processors; then $R_i$ has size $z_i$. By the construction of Stage 2, $z_i$
is a multiple of $m_{k+1}$. Hence, $R_i$ can be formed as a collection of contiguous squares
of size $\sqrt{m_{k+1}} \times \sqrt{m_{k+1}}$. Note that if $z_i$ is 0, then the corresponding $R_i$ is empty.
(See Figure 7.)
At depth $k_{\max}$, each depth-$k_{\max}$ region consists of contiguous squares of size $1 \times 1$.
Embed maps the processors in a depth-$k_{\max}$ region of the intermediate array $\bar{G}$ to the unremoved processors
in the corresponding depth-$k_{\max}$ array of $H$ in an arbitrary one-to-one manner. Thus,
we have a one-to-one mapping from the processors of $\bar{G}$ to the unremoved processors
of $H$.
We also define the depth-$k$ boundaries in the intermediate array $\bar{G}$ to be the borders of
depth-$k$ regions of $\bar{G}$. Note that the depth-$k$ boundaries are at least $\sqrt{m_k}$ apart in both
horizontal and vertical directions.
3.2.3. Bounding monotone path length. In this section we bound the total
delay on the image of $P$ in $H$, where $P$ is any monotone path in the intermediate array $\bar{G}$.
Suppose $a$ and $b$ are two neighboring processors in $\bar{G}$; then their images $a_H$ and $b_H$ in $H$
are connected by a 1-bend route as follows. First, $a_H$ is routed along its row to $b_H$'s column and
then routed to $b_H$ along the column. We define $a$ and $b$ (resp., $a_H$ and $b_H$) to be
$k$-related if $k$ is the largest integer such that $a$ and $b$ (resp., $a_H$ and $b_H$) are in the same
depth-$k$ region (resp., depth-$k$ array). We also define $a_H$ and $b_H$ to be peers of each
other. Note that each unremoved host processor can have four peers.
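The 1-bend route between two peers can be sketched as follows. The function name `one_bend_route`, the 0-based coordinates, and the convention that the first index of a processor is its row (so that moving along a row crosses $y$-edges and moving along a column crosses $x$-edges) are assumptions made for this illustration.

```python
def one_bend_route(a, b, x, y):
    """Route from a to b along a's row and then along b's column.

    a and b are (row, column) pairs. x[i][j] is the delay between p(i,j)
    and p(i+1,j); y[i][j] is the delay between p(i,j) and p(i,j+1).
    Returns the bend point and the total delay of the route.
    """
    (ra, ca), (rb, cb) = a, b
    delay = 0
    # Along a's row, from column ca to column cb.
    for j in range(min(ca, cb), max(ca, cb)):
        delay += y[ra][j]
    # Along b's column, from row ra to row rb.
    for i in range(min(ra, rb), max(ra, rb)):
        delay += x[i][cb]
    return (ra, cb), delay


x = [[1, 2, 3], [4, 5, 6]]
y = [[1, 2], [3, 4], [5, 6]]
print(one_bend_route((0, 0), (2, 2), x, y))
print(one_bend_route((2, 2), (0, 0), x, y))
```

The route is not symmetric: routing from $b_H$ to $a_H$ bends at a different corner and can have a different total delay.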
Lemma 3.9. Let $P$ be any monotone path in the intermediate array $\bar{G}$. The image of $P$ in $H$ has a total
delay of $O(d_{ave} n \log^{5/2} n)$ under Embed.
Proof. Let $a$ and $b$ be two neighboring processors on $P$ and let $a_H$ and $b_H$ be
their images. Suppose $a$ and $b$ (resp., $a_H$ and $b_H$) are $k$-related. We first bound the
length of the 1-bend route from $a_H$ to $b_H$. By the construction of Stage 1, the total
delay on a depth-$k$ row/column that contains $a_H$ or $b_H$ is at most $D_k$. Hence, the
distance from $a_H$ to $b_H$ is at most $2D_k$.
We now bound the number of neighboring $a$'s and $b$'s that can be $k$-related. If
$k < k_{\max}$, $P$ must cross some depth-$(k+1)$ boundary in traveling from $a$ to $b$.
Since $P$ is monotone and the depth-$k$ boundaries are at least $\sqrt{m_k}$ apart in both horizontal and
vertical directions, $P$ can cross the depth-$k$ boundaries at most $2n/\sqrt{m_k}$ times. Hence,
at most $2n/\sqrt{m_{k+1}}$ neighboring $a$'s and $b$'s on $P$ can be $k$-related. This implies that
the total delay incurred by $k$-related peers on the image of $P$ is at most $2D_k \cdot \frac{2n}{\sqrt{m_{k+1}}}$
for $k < k_{\max}$. Obviously, at most $2n$ neighboring $a$'s and $b$'s can be $k_{\max}$-related.
Therefore, the total delay on the image of $P$ is at most
$$\sum_{k < k_{\max}} 2D_k \cdot \frac{2n}{\sqrt{m_{k+1}}} + 2D_{k_{\max}} \cdot 2n,$$
which is $O(d_{ave} n \log^{5/2} n)$ by the definitions of $D_k$, $m_k$, and $k_{\max}$.
Hence, we can embed $G$ on $H$ such that $O(1)$ guest processors are mapped to
each host processor and the image in $H$ of any monotone path in $G$ has length
$O(d_{ave} n \log^{5/2} n)$. Lemma 3.2 implies that $H$ can simulate $G$ with a slowdown of
$O(d_{ave} \log^{5/2} n)$. To improve the slowdown and achieve efficiency, we apply the idea of
complementary slackness and use an $m \times m$ contiguous subarray of $H$ for simulation as
in FatRay. Theorem 3.4 and Lemma 3.9 imply a slowdown of $O((d_{ave} m \log^{5/2} m)/n + n^2/m^2)$.
By choosing $m$ to be $\max\{n d_{ave}^{-1/3} \log^{-5/6} n,\, 1\}$, we have Theorem 3.10.
Theorem 3.10. Network $H$ with average delay $d_{ave}$ can efficiently simulate $G$
with a slowdown of $O(d_{ave}^{2/3} \log^{5/3} n)$.
3.2.4. Bandwidth. The preceding analysis focuses entirely on the issue of latency
and ignores bandwidth constraints. This does not present any problems if the
link bandwidth available on the host array is $\Omega(\log^{3/2} n)$ times larger than that on the
guest array. If the bandwidths of the host and guest arrays are comparable, however,
and if the guest array is fully utilizing the bandwidth on its links, then congestion
becomes an issue. Formally, the congestion on an edge equals the number of pebbles
that wish to cross this edge simultaneously. In this case, we may need to slow down
the simulation by an additional factor of $O(\log^{3/2} n)$.
In section 3.2.3, peers $a_H$ and $b_H$ are connected by a 1-bend route in $H$. To
address the congestion issue, we present a more sophisticated method of connecting
$a_H$ and $b_H$ such that each edge in $H$ has $O(\log^{3/2} n)$ routes going through it and
the distance between $a_H$ and $b_H$ remains unchanged asymptotically.
We begin with some definitions. Recall that Embed maps each depth-$k$ region
$R_k$ of the intermediate array $\bar{G}$ to a depth-$k$ array $S_k$ of $H$. A depth-$k$ row/column of $S_k$ is live if it contains
some unremoved host processors. A boundary point of $S_k$ is live if it belongs to some
live row or column of $S_k$. We first bound the number of connections from inside
$S_k$ to outside $S_k$ in terms of the number of live rows and columns of $S_k$.
Lemma 3.11. Consider any depth-$k$ array, $S_k$, of $H$. The number of processors
in $S_k$ that have peers outside $S_k$ is $O(x\sqrt{\log n})$, where $x$ is the number of live rows
and columns in $S_k$.
Proof. Let $z$ be the number of unremoved processors in $S_k$; then the number of
live rows and columns is at least $\frac{z}{n/2^k}$. The number of host processors in $S_k$ that
have peers outside $S_k$ is proportional to the perimeter of $R_k$, the depth-$k$ region that
corresponds to $S_k$. By the construction of Embed, $R_k$ consists of squares of size
$\sqrt{m_k} \times \sqrt{m_k}$. Hence, $R_k$ has perimeter $O(z/\sqrt{m_k})$, which is $O\bigl(\frac{z}{n/2^k}\sqrt{\log n}\bigr)$ by the
definition of $m_k$ in (2). Our lemma follows.
We now describe a recursive procedure that connects the peers. The following
facts are used in our routing.
Fact 3.12. Consider a routing problem on a square array of size $x \times x$.
1. If each node has $O(y)$ requests, then the routing can be done in one bend and
$O(xy)$ congestion.
2. Let the nodes on the cross divide the square array into four $\frac{x}{2} \times \frac{x}{2}$ quadrants.
If each boundary node and cross node has $O(y)$ requests and all other
nodes have no requests, then the routing can be done in $O(1)$ bends and $O(y)$
congestion.
Our recursive routing starts at depth $k = k_{\max}$. Consider all the depth-$k$ arrays
$S_k$. For all the peers that are $k$-related, we connect them through a 1-bend routing
within $S_k$. Since $S_k$ has size $\sqrt{\log n} \times \sqrt{\log n}$ and each host processor has at most 4
peers, the congestion caused by this 1-bend routing within $S_k$ is $O(\sqrt{\log n})$ by item
1 of Fact 3.12. For all the processors that have peers outside $S_k$, we route them
to live boundary points such that the following two conditions hold. First, each
live boundary point of $S_k$ receives $O(\sqrt{\log n})$ requests. This is possible because of
Lemma 3.11. Second, the routing uses 1 bend and causes a congestion of $O(\log n)$ by
item 1 of Fact 3.12.
We proceed recursively to depths $k < k_{\max}$. Consider all the depth-$k$ arrays $S_k$.
From the previous stage the host processors that are not connected to their peers are
routed to some live boundary points of depth-$(k+1)$ arrays. Hence, they are either
on the boundary or on the cross of $S_k$, and $O(\sqrt{\log n})$ host processors are routed to
the same location. For all the peers that are $k$-related, we connect them within $S_k$.
Otherwise, we route them to the live boundary points of $S_k$ such that each live point
receives $O(\sqrt{\log n})$ requests (including those from all previous stages that have not yet
been connected to their peers). This is possible by Lemma 3.11. In both cases, item 2 of
Fact 3.12 implies that the routing can be done in $O(1)$ bends and $O(\sqrt{\log n})$ congestion.
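The congestion incurred at depth $k$, for $1 \le k < k_{\max}$, is $O(\sqrt{\log n})$ and at
depth $k_{\max}$ is $O(\log n)$. Since each of the depths uses the same underlying edges, the
overall congestion is $O(\log^{3/2} n)$. The host processors are routed to live boundary
points along live rows and columns, using $O(1)$ bends at each depth. Hence, the distance
between two $k$-related peers is $O\bigl(\sum_{j \ge k} \frac{n}{2^j} d_{ave} \log n\bigr)$, which remains
$O\bigl(\frac{n}{2^k} d_{ave} \log n\bigr)$ as in Lemma 3.9. In
summary, we have Lemma 3.13.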
Lemma 3.13. In the above routing scheme the congestion is $O(\log^{3/2} n)$ on all
edges of $H$. Furthermore, for any monotone path $P$ in the intermediate array $\bar{G}$, the image of $P$ in $H$ has
length $O(d_{ave} n \log^{5/2} n)$.
3.3. Improved bounds for randomly arranged delays. In this section, we
show that the length of the longest monotone path in $H$ is often short when the delays
are randomly arranged. If $M$ is the number of edges in an $n \times n$ array $H$, then for a
given set of $M$ delays with average $d_{ave}$ the longest monotone path in $H$ has length
$O(n d_{ave})$ for most of the $M!$ permutations of the delays. That is, in the uniform
distribution of the $M!$ permutations, the longest monotone path has length $O(n d_{ave})$
with high probability, and therefore the slowdown is $O(d_{ave})$ with high probability.
Fig. 8. The bounding box and four edge-disjoint alternate paths for edge f .
Without loss of generality we assume that $d_{ave}$ is a constant. (For a nonconstant
$d_{ave}$, each delay $d$ is normalized to $\max\{d/d_{ave}, 1\}$. The normalized delays have
average $O(1)$, and the total original delay on any monotone path is at most $d_{ave}$ times
the total normalized delay.) We divide the delays in $H$ into $O(\log n)$ levels. Level $\ell$
contains the delays that are in the range $[2^\ell, 2^{\ell+1})$, and level $\ell^+$ contains the delays
that are at least $2^\ell$.
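The normalization and the level classification can be sketched as follows. The function names `normalize`, `level`, and `in_level_plus` are assumptions made for illustration; `math.frexp` is used only to obtain the exact binary exponent of a delay.

```python
import math


def normalize(delays):
    """Replace each delay d by max(d / d_ave, 1)."""
    d_ave = sum(delays) / len(delays)
    return [max(d / d_ave, 1.0) for d in delays]


def level(delay):
    """Return l such that 2**l <= delay < 2**(l + 1); requires delay >= 1."""
    # frexp returns (m, e) with delay = m * 2**e and 0.5 <= m < 1.
    return math.frexp(delay)[1] - 1


def in_level_plus(delay, l):
    """Level l+ contains every delay that is at least 2**l."""
    return delay >= 2 ** l


delays = [1, 1, 1, 1, 16]
normalized = normalize(delays)
print(normalized, [level(d) for d in normalized])
```

Because $\max\{a, 1\} \le a + 1$, the normalized delays always average at most 2, and multiplying a normalized delay by $d_{ave}$ never undercounts the original delay.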
3.3.1. Shortcuts and edge coloring. If an edge with a large delay is surrounded
by edges with small delays, we can route around this large delay. The intuition
is that in a random permutation most long delays can be shortcut. For each edge
$f$, we consider four edge-disjoint alternate paths that connect the two endpoints of $f$.
(See Figure 8.) The $3 \times 3$ box that contains these four paths is called the bounding
box of $f$. After the shortcut, the total delay on $f$ equals the shortest alternate path
length. For clarity, we shall refer to the delay before the shortcut as the original delay
and the delay after the shortcut as the shortcut delay.
For a given set of delays with a constant average, the number of level-$\ell^+$ original
delays is $O(n^2 2^{-\ell})$. Therefore, the probability for an edge $f$ to have a level-$\ell^+$ original
delay is $O(2^{-\ell})$. However, shortcutting dramatically decreases this probability, as the
following lemma shows.
Lemma 3.14. The probability for an edge $f$ to have a level-$\ell^+$ shortcut delay is
$O(2^{-4\ell})$.
Proof. If edge $f$ has a shortcut delay from level $\ell^+$, then the four alternate paths
must each have an edge whose original delay is from level $(\ell-4)^+$. For a particular
set of four edges to have level $(\ell-4)^+$ original delays, the probability is $\binom{B}{4}/\binom{M}{4}$,
where $B$ is the number of level $(\ell-4)^+$ original delays, and $M$ is the number of edges
in $H$. Since there are $3 \times 3 \times 9 \times 1 = 81$ ways to choose four edges from four alternate
paths, we derive the following from a union bound:
$$\Pr[f \text{ has a level-}\ell^+ \text{ shortcut delay}] \;\le\; 81 \cdot \frac{\binom{B}{4}}{\binom{M}{4}} \;\le\; 81 \left(\frac{B}{M}\right)^4.$$
Our lemma follows from the observation that $B = O(n^2 2^{-\ell})$ since $d_{ave} = O(1)$, and
that $M = \Theta(n^2)$.
Unfortunately, these probabilities are not independent from edge to edge for two
reasons. First, the arrangement of delays is a permutation of a given set of delays.
This does not cause a problem, however, as the analysis in Lemmas 3.15 and 3.17 will
show. Intuitively, in a permutation if one edge has a large delay, then other edges
are less likely to have large delays. Second, the bounding boxes are not necessarily
disjoint. To resolve this problem we introduce an edge coloring, so that any two
distinct edges with the same color have edge-disjoint bounding boxes. Clearly, only a
constant number of colors are needed.
We show in the following that, for any monotone path in $H$, the total delay
incurred from the edges in one particular color group is $O(n)$ with high probability.
Since there are $O(1)$ color groups, our result follows from a union bound. For each
color group we consider two cases, edges with large shortcut delays and edges with
small shortcut delays.
3.3.2. Large delays. In this section we show that, with high probability, the
total delay in $H$ due to shortcut delays from large levels is $O(n)$. Therefore, any
monotone path can pick up only $O(n)$ delay from these levels.
Lemma 3.15. With probability $1 - O(n^{-1})$, every monotone path picks up a total
delay of $O(n)$ from levels $\ell \ge L$, where $L = \frac{1}{2}\log n - \frac{1}{2}\log\log n$.
Proof. By Lemma 3.14, the probability that one particular edge has a shortcut
delay from level $(\frac{3}{4}\log n)^+$ is $O(n^{-3})$. Since $H$ has $\Theta(n^2)$ edges, with probability
$1 - O(n^{-1})$ no edge in $H$ has a shortcut delay from level $(\frac{3}{4}\log n)^+$.
We show below that, with high probability, $H$ has $O(\log^3 n)$ shortcut delays from
level $L^+$. Let $A = an^2$ be an upper bound on the number of edges in one particular
color group, where $a$ is a constant. Since $d_{ave} = O(1)$, at most $B = bn^{3/2}\log^{1/2} n$
original delays can be from levels $(L-4)^+$, where $b$ is a constant. We show that
only with small probability are more than $C = c\log^3 n$ edge delays from level $L^+$, for a
sufficiently large constant $c$.
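For a particular set of $C$ edges to have level-$L^+$ shortcut delays, at least four
edges in each of these $C$ bounding boxes must have level $(L-4)^+$ original delays. For
a particular set of four edges in each bounding box to have level $(L-4)^+$ original
delays, the probability is $\binom{B}{4C}/\binom{M}{4C}$. This is true since all the $C$ bounding
boxes are edge-disjoint. There are $\binom{A}{C}$ ways to choose $C$ edges whose shortcut
delays are from $L^+$ and $81^C$ ways to choose four edges from each of the $C$ boxes. We
therefore derive the following from a union bound:
$$\Pr[\text{more than } C \text{ edges have level-}L^+ \text{ shortcut delays}] \;\le\; \binom{A}{C}\, 81^C\, \frac{\binom{B}{4C}}{\binom{M}{4C}}. \qquad (4)$$
Using the inequalities
$$\left(\frac{x}{y}\right)^y \;\le\; \binom{x}{y} \;\le\; \left(\frac{xe}{y}\right)^y, \qquad (5)$$
where $e = 2.718\ldots$ is the base of the natural logarithm, the right-hand side of (4) is at most
$\left(\frac{Ae}{C}\right)^C 81^C \left(\frac{Be}{M}\right)^{4C}$. By the definitions of $A$, $B$, and
$C$, and since $H$ has about $2n^2$ edges, this probability is bounded by
$$\left(\frac{81 \cdot a \cdot b^4 \cdot e^5}{2^4 \cdot c \cdot \log n}\right)^{c\log^3 n}.$$
Let $c$ be a sufficiently large constant; then the above probability is bounded by $O(n^{-1})$.
Summing over all the $O(1)$ color groups, we conclude that with probability $1 - O(n^{-1})$
$H$ has no shortcut delays from level $(\frac{3}{4}\log n)^+$ and $O(\log^3 n)$ shortcut delays from level
$L^+$. Hence, any monotone path picks up a total delay of $O(n^{3/4}\log^3 n) = O(n)$ from
levels $\ell \ge L$.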
3.3.3. Small delays. In this section we show that, with high probability, the shortcut
delays from small levels do not accumulate too much on any monotone path. In
particular, with probability $1 - O(n^{-2})$, each monotone path picks up an $O(n 2^{-\ell})$
delay from each "small" level $\ell$. Summing over all $O(\log n)$ "small" levels, we can conclude
that each monotone path picks up a total of $O(n)$ small delays with probability
$1 - O(n^{-1})$.
Consider a particular level $\ell < L$, where $L = \frac{1}{2}\log n - \frac{1}{2}\log\log n$. We divide
$H$ into $m^2$ squares of size $2^{2\ell} \times 2^{2\ell}$, where $m = n 2^{-2\ell}$. There are $\binom{2m-1}{m}$ sequences
of $2m-1$ squares that some monotone path could possibly go through. We call
these sequences of $2m-1$ squares monotone sequences. If the total number of level-$\ell$
shortcut delays in each of these sequences is bounded, then the total level-$\ell$ shortcut
delay that any monotone path picks up is also bounded.
Lemma 3.16. With probability $1 - O(n^{-2})$, every monotone path picks up a total
delay of $O(n 2^{-\ell})$ from level-$\ell$ shortcut delays, where $\ell < L$ is one particular level
and $L = \frac{1}{2}\log n - \frac{1}{2}\log\log n$.
Proof. Consider one particular monotone sequence of $2m-1$ squares of size
$2^{2\ell} \times 2^{2\ell}$, where $m = n 2^{-2\ell}$. Let random variable $X$ be the number of level-$\ell$ shortcut
delays in this sequence of squares, and let random variable $X_i$ be the number of level-$\ell$
shortcut delays from the $i$th square in the sequence. We use a moment generating
function argument to upper bound $X = X_1 + \cdots + X_{2m-1}$. We first bound the
probability $\Pr[X_1 = k_1, \ldots, X_{2m-1} = k_{2m-1}]$. Let $A = a 2^{4\ell}$ be an upper bound on
the number of edges from one particular color group in each $2^{2\ell} \times 2^{2\ell}$ square, where
$a$ is a constant. Since $d_{ave} = O(1)$, at most $B = b n^2 2^{-\ell}$ original delays can be from
level $(\ell-4)^+$, where $b$ is a constant. Let $k = \sum_{i=1}^{2m-1} k_i$. By applying the same logic
as for inequality (4), we have




















By inequality (5), the probability is bounded by























We proceed to bound the expectation of eX :




































Thus $E[e^X] \le e^{y(2m-1)}$ for some constant $y$. We use Markov's inequality to bound the probability
that the total number of level-$\ell$ shortcut delays exceeds $\alpha(2m-1)$ in this particular
monotone sequence:
$$\Pr[X \ge \alpha(2m-1)] = \Pr\bigl[e^X \ge e^{\alpha(2m-1)}\bigr] \le \frac{E[e^X]}{e^{\alpha(2m-1)}} \le e^{(y-\alpha)(2m-1)}.$$
There are $\binom{2m-1}{m} < 2^{2m-1}$ monotone sequences. By a union bound, the probability
that every sequence has fewer than $\alpha(2m-1)$ level-$\ell$ shortcut delays is at least
$1 - 2^{2m-1} e^{(y-\alpha)(2m-1)}$. Let $\alpha$ be the constant $y + 2$; then this probability is at least
$1 - O(n^{-2})$, since $m = n/2^{2\ell} \ge \log n$. Therefore, every monotone path picks up a
total of $O(n 2^{-\ell})$ shortcut delay from level $\ell$ with probability $1 - O(n^{-2})$.
Summing over all levels $\ell < L$ results in a total delay that is linear in $n$, as desired.
Lemma 3.17. With probability $1 - O(n^{-1})$, all the monotone paths pick up a total
delay of $O(n)$ from levels $\ell < L$ for $L = \frac{1}{2}\log n - \frac{1}{2}\log\log n$.
For the case in which $d_{ave} = O(1)$, Lemmas 3.15 and 3.17 show that any monotone
path has a total delay of $O(n)$ with high probability. For the case in which $d_{ave}$ is
nonconstant, the discussion at the beginning of the section implies that Theorem 3.18
holds.
Theorem 3.18. Suppose $H$ is a network with average delay $d_{ave}$; then with high
probability every monotone path in $H$ has delay $O(n d_{ave})$.
To make the algorithm work efficient, we use an $m \times m$ subarray of $H$ of average
delay at most $d_{ave}$ to simulate $G$. Theorems 3.4 and 3.18 imply that the slowdown
is $O(n^2/m^2 + d_{ave} m/n)$ with high probability. By choosing $m$ to be $\max\{n d_{ave}^{-1/3},\, 1\}$,
we have Theorem 3.19.
Theorem 3.19. Suppose the delays on network $H$ are from a random permutation
of a set of delays whose average is $d_{ave}$; then with high probability $H$ can simulate
$G$ with slowdown $O(d_{ave}^{2/3})$.
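The choice of $m$ can be checked numerically. The sketch below evaluates the expression $n^2/m^2 + d_{ave}m/n$ with all constants hidden in the $O(\cdot)$ set to 1, which is an assumption; the function names `subarray_side` and `slowdown_bound` are likewise illustrative.

```python
def subarray_side(n, d_ave):
    """Side length m = max(n * d_ave**(-1/3), 1) of the simulating subarray."""
    return max(n * d_ave ** (-1 / 3), 1.0)


def slowdown_bound(n, d_ave):
    """Evaluate n^2/m^2 + d_ave*m/n at the chosen m (constants dropped)."""
    m = subarray_side(n, d_ave)
    return n * n / (m * m) + d_ave * m / n


print(subarray_side(1000, 8), slowdown_bound(1000, 8), 2 * 8 ** (2 / 3))
```

For $n = 1000$ and $d_{ave} = 8$ the two terms balance at $d_{ave}^{2/3} = 4$ each, so the expression equals $2 d_{ave}^{2/3}$.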
Congestion problems are not an issue here, since each edge of H is used O(1)
times by alternate paths in the shortcut process.
4. Database model. We switch our attention to the database model. As discussed
in section 1, simulation in the database model is more difficult than in the
dataflow model. For algorithms such as Stripe in section 2 to work for the database
model, a host processor needs $\Theta(n)$ copies of the databases on average. This is
unrealistic because of the memory requirement as well as the difficulty in updating the
databases. We therefore develop new machinery for the database model. In contrast to
the dataflow model, we make substantial use of redundant computation. Apart from
the slowdown, another important parameter for the database model is load, which is
the number of databases that a host processor copies.
The main contribution of this section is an algorithm called Overlap that simulates
linear arrays in the database model with a small load and a small slowdown.
Since Overlap is technically involved, we begin with a special case in section 4.1,
where the host linear array has delay $d$ on all edges. The simulation in this special
case is much simpler, and it conveys some intuition for using redundant computation
in the general case. Section 4.2 presents Overlap in detail. The techniques
are generalized to simulate linear and two-dimensional arrays on general networks in
sections 4.3 and 4.4. Last, in section 4.5 we discuss the lower bounds on slowdown
when each database is allowed a small number of copies.
4.1. A special case. In this section we consider a special case. Let $G$ be a
guest linear array with $n$ processors and unit-delay edges, and let $H$ be a host linear
array with $n$ processors and delay $d$ on all edges. We use redundant computation
to achieve an optimal slowdown of $O(\sqrt{d})$. Recall that in the dataflow model, the
optimal slowdown is achieved for linear arrays without using redundancy.
Theorem 4.1. In the database model, $H$ can efficiently simulate $G$ with a slowdown
and a load of $O(\sqrt{d})$. This slowdown is optimal up to a constant factor.
Proof. We consider two cases. If $n \le \sqrt{d}$, then one host processor copies all the
databases and carries out the entire computation by itself. Hence the load and the
slowdown are n, which is $O(\sqrt{d})$. Otherwise, the first $n/\sqrt{d}$ host processors are
used for the simulation. For $1 \le j \le n/\sqrt{d}$, processor $p_j$ copies $3\sqrt{d}$ databases $b_i$
and computes $3\sqrt{d}$ columns of pebbles $(i, t)$, where $(j-2)\sqrt{d} + 1 \le i \le (j+1)\sqrt{d}$
and $1 \le t$. In this way each processor shares $\sqrt{d}$ databases with its right and left
neighbors and each pebble is redundantly computed by three neighboring processors.
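As a rough illustration of this assignment, the following Python sketch lists the columns held by each processor $p_j$ and counts how many processors hold each column. The function names `special_case_columns` and `coverage`, the clamping of the ranges to $[1, n]$ at the two ends of the array, and the requirement that $d$ be a perfect square whose root divides $n$ are simplifying assumptions made here, not part of the paper.

```python
import math

def special_case_columns(n, d):
    """Columns (1-based, inclusive) held by each processor p_j, j = 1..n/sqrt(d)."""
    r = math.isqrt(d)
    # Assumptions of this sketch: d is a perfect square and sqrt(d) divides n.
    assert r * r == d and n % r == 0 and n > r
    ranges = {}
    for j in range(1, n // r + 1):
        lo = max(1, (j - 2) * r + 1)   # (j-2)*sqrt(d) + 1, clamped at the left end
        hi = min(n, (j + 1) * r)       # (j+1)*sqrt(d), clamped at the right end
        ranges[j] = (lo, hi)
    return ranges

def coverage(n, ranges):
    """coverage[i-1] = number of processors that hold column i."""
    return [sum(lo <= i <= hi for lo, hi in ranges.values())
            for i in range(1, n + 1)]
```

With `n = 64` and `d = 16`, every column from 5 to 60 is held by exactly three processors, while the columns in the first and last blocks of $\sqrt{d}$ columns are held by two because of the clamping.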
We show how to simulate the first $\sqrt{d}$ rows of pebbles created by G in $O(d)$ steps
by H. Every subsequent $\sqrt{d}$ rows of pebbles are simulated in the same manner. The
algorithm is demonstrated in Figure 9. For $1 \le j \le n/\sqrt{d}$, let
$$
\begin{aligned}
P_j &= \{\text{pebbles } (i, t) : 1 \le t \le \sqrt{d},\ -2\sqrt{d} + 1 \le i - j\sqrt{d} \le \sqrt{d}\},\\
L_j &= \{\text{pebbles } (i, t) : 1 \le t \le \sqrt{d},\ 1 \le i - (j-2)\sqrt{d} \le t\},\\
R_j &= \{\text{pebbles } (i, t) : 1 \le t \le \sqrt{d},\ -t + 1 \le i - (j+1)\sqrt{d} \le 0\},\\
T_j &= P_j - (L_j \cup R_j),\\
A_j &= \{\text{pebbles } ((j-2)\sqrt{d}, t) : 1 \le t \le \sqrt{d}\},\\
B_j &= \{\text{pebbles } ((j-1)\sqrt{d} + 1, t) : 1 \le t \le \sqrt{d}\},\\
C_j &= \{\text{pebbles } (j\sqrt{d}, t) : 1 \le t \le \sqrt{d}\},\\
D_j &= \{\text{pebbles } ((j+1)\sqrt{d} + 1, t) : 1 \le t \le \sqrt{d}\}.
\end{aligned}
$$
Processor $p_j$ of H computes all the pebbles in $P_j$. First, $p_j$ computes the pebbles in
the trapezium $T_j$ without communicating with its neighbors. There are 2d pebbles
in $T_j$, and so this takes 2d steps. Next, $p_j$ passes column $B_j$ to processor $p_{j-1}$ and
receives column $A_j$ from $p_{j-1}$. It also passes column $C_j$ to processor $p_{j+1}$ and receives
column $D_j$ from $p_{j+1}$. This communication takes $d + \sqrt{d} < 2d$ steps using pipelining.
Processor $p_j$ can now compute the pebbles in triangles $L_j$ and $R_j$ in d steps. It
is important for $p_j$ to compute the pebbles in $L_j$ and $R_j$ in order to continue the
simulation of the next $\sqrt{d}$ rows of pebbles, since databases need to be updated. This
presents a major difference between the dataflow and database models.
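The pebble counts used in this argument can be checked by direct enumeration. The sketch below builds $P_j$, $L_j$, $R_j$, and $T_j$ from their definitions and tests whether a set of pebbles can be computed without outside help. It assumes that $d$ is a perfect square and that pebble $(i, t)$ depends on pebbles $(i-1, t-1)$, $(i, t-1)$, and $(i+1, t-1)$, with row 0 given as input; the helper names `pebble_sets` and `is_self_contained` are hypothetical.

```python
import math

def pebble_sets(j, d):
    """Enumerate P_j, L_j, R_j, T_j for processor p_j (d a perfect square)."""
    r = math.isqrt(d)
    assert r * r == d, "sketch assumes d is a perfect square"
    P = {(i, t) for t in range(1, r + 1)
                for i in range((j - 2) * r + 1, (j + 1) * r + 1)}
    L = {(i, t) for (i, t) in P if 1 <= i - (j - 2) * r <= t}       # left triangle
    R = {(i, t) for (i, t) in P if -t + 1 <= i - (j + 1) * r <= 0}  # right triangle
    T = P - (L | R)                                                 # trapezium
    return P, L, R, T

def is_self_contained(S):
    """True if every pebble of S above row 1 has its three predecessors in S."""
    return all((i + di, t - 1) in S
               for (i, t) in S if t > 1
               for di in (-1, 0, 1))
```

For `d = 16` the trapezium has 28 pebbles, which is $2d - \sqrt{d}$ and therefore within the bound of 2d used in the text, and it is self-contained, whereas the left triangle is not: it needs column $A_j$ from the neighbor.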
Hence, it takes at most 5d steps in total for processor $p_j$ to compute every pebble
in $P_j$. The next $\sqrt{d}$ steps of computation can be simulated in a similar fashion. The
slowdown is therefore $O(\sqrt{d})$.
Note that during the computation of $T_j$, the pebbles in columns $B_j$ and $C_j$ can
start to travel to the neighboring processors of $p_j$ as soon as they are ready. Processor
$p_j$ can also start to compute triangles $L_j$ and $R_j$ before the entire columns $A_j$ and
$D_j$ are transferred. In this way, the communication time can be saved. Although
it does not make a difference asymptotically in this case, we take advantage of this
observation in Overlap.
Fig. 9. Simulating $\sqrt{d}$ steps of computation of G on H.
4.2. Algorithm OVERLAP. To simulate a guest linear array on a host linear
array with arbitrary delays we use an algorithm called Overlap. In Overlap, we
remove host processors that are surrounded by high delays. The motivation of this step
is similar to that of section 3.2. For the remaining processors, we decide how much
redundancy is needed for neighboring processors and how much computation each
processor is able to carry out. During the simulation, some pebbles are redundantly
computed to ensure that the communication is not too costly. We first obtain a
slowdown of $O(d_{ave} \log^3 n)$, where $d_{ave}$ is the average delay of H and n is the size of
G and H, and later improve the slowdown to $O(\sqrt{d_{ave}} \log^3 n)$ while achieving work
efficiency.
4.2.1. Removing useless processors. We recursively represent H using a binary
tree, in which each node corresponds to a subarray of H. The root represents the
entire array. The left and right children of the root represent the left and right halves
of the array, respectively. In general, a node at depth k of the binary tree corresponds
to a subarray of H that contains $n/2^k$ processors. We refer to this subarray as a depth-k
interval. The leaves represent the individual processors of H. (See Figure 10.)
We describe a two-stage process that removes the processors that are surrounded
by high delays (Stage 1) and the processors that are surrounded by few unremoved
processors (Stage 2). During Stage 2, we also label each live subarray, where a live sub-
array contains some unremoved processor. These labels indicate how many columns
of pebbles the live subarrays are able to compute.
For every depth k, we define $D_k$ to be the "delay threshold" and $m_k$ to be the
"overlap size" as follows. Note that $D_k$ is larger than the average delay in a depth-k
interval by a factor of $\Theta(\log n)$, and $m_k$ is smaller than the number of processors in
a depth-k interval by a factor of $\Theta(\log n)$. We shall use $m_k$ to indicate the size of
overlap between neighboring depth-k intervals, i.e., the number of columns of pebbles
redundantly computed by both intervals:
$$D_k = \frac{c\, n\, d_{ave} \log n}{2^k}, \qquad (7)$$
$$m_k = \frac{n}{c\, 2^k \log n}. \qquad (8)$$
Fig. 10. The binary tree that represents H. In this figure, unremoved processors of H are
represented by black circles; removed processors are represented by white circles. Arrows indicate
As we shall see, any constant $c > 5/2$ works for our argument. We also define a
maximum depth $k_{max}$ such that when $k = k_{max}$ the overlap size $m_k$ becomes 1:
$$k_{max} = \log n - \log\log n - \log c. \qquad (9)$$
• Stage 1. From depth $k = k_{max}$ down to depth 0, if the total delay in a
depth-k interval exceeds $D_k$, then all the processors in that interval are removed.
• Stage 2. At depth $k = k_{max}$, let I be a live depth-k interval and let x be
the number of unremoved processors in I. If x is smaller than $2m_k$, then we
remove all the remaining processors in I and I is no longer live. Otherwise,
we label I with x.
Suppose all the live depth-(k + 1) intervals are labeled. Now consider each
live depth-k interval I. If I has two live children $I_1$ and $I_2$ that are labeled
with $x_1$ and $x_2$, then let $x = x_1 + x_2 - m_{k+1}$. If I has one live child $I_1$ that
is labeled with $x_1$, then let $x = x_1$. If $x < 2m_k$, we remove all the remaining
processors in I and I is no longer live. Otherwise, we label I with x. We
proceed to depth $k - 1$ until reaching depth 0.
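The two stages can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the paper: the number of processors is a power of two, `delays[i]` is the delay of the edge between processors `i` and `i + 1` (0-based), the total delay of an interval is taken to be the sum of the delays of its internal edges, and the threshold $D_k$, the overlap size $m_k$, and the depth $k_{max}$ are supplied by the caller as the hypothetical arguments `D`, `m`, and `kmax`.

```python
def remove_and_label(delays, kmax, D, m):
    """Two-stage removal; returns (alive flags, labels keyed by (depth, index))."""
    n = len(delays) + 1                      # assumed to be a power of two
    alive = [True] * n

    def interval(k, idx):                    # half-open processor range [lo, hi)
        size = n >> k
        return idx * size, (idx + 1) * size

    # Stage 1: remove every interval whose total internal delay exceeds D(k).
    for k in range(kmax, -1, -1):
        for idx in range(1 << k):
            lo, hi = interval(k, idx)
            if sum(delays[lo:hi - 1]) > D(k):
                alive[lo:hi] = [False] * (hi - lo)

    # Stage 2: label live intervals bottom-up, removing those with small labels.
    labels = {}
    for k in range(kmax, -1, -1):
        for idx in range(1 << k):
            lo, hi = interval(k, idx)
            if not any(alive[lo:hi]):
                continue                     # interval is not live
            if k == kmax:
                x = sum(alive[lo:hi])
            else:
                kids = [labels[(k + 1, ch)] for ch in (2 * idx, 2 * idx + 1)
                        if (k + 1, ch) in labels]
                x = sum(kids) - (m(k + 1) if len(kids) == 2 else 0)
            if x < 2 * m(k):
                alive[lo:hi] = [False] * (hi - lo)
            else:
                labels[(k, idx)] = x
    return alive, labels
```

For example, on 16 processors with unit delays, `kmax = 1`, `m(1) = 1`, and `m(0) = 2`, both halves are labeled 8 and the root is labeled 8 + 8 - 1 = 15. If one edge in the left half has delay 50 and `D(1) = 20`, the whole left half is removed in Stage 1 and the root inherits the label 8 of its single live child.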
Lemma 4.2. At most $n/c$ processors are removed at Stage 1.
Proof. The total delay in the array H is $n d_{ave}$. At most $\frac{2^k}{c \log n}$ depth-k intervals
can have delay more than $D_k$. Each depth-k interval contains $n/2^k$ processors and so at
most $\frac{n}{c \log n}$ processors are removed at depth k. Since there are $\log n$ depths, at most
$n/c$ processors are removed at Stage 1.
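The counting in this proof can be written out as a short calculation. It assumes that the threshold has the form $D_k = c\, n\, d_{ave} \log n / 2^k$, which is the form implied by the interval count quoted in the proof, and it uses the fact that the depth-k intervals are disjoint, so their delays sum to at most $n d_{ave}$.

```latex
\begin{align*}
\#\{\text{depth-}k\text{ intervals with delay} > D_k\}
  &\le \frac{n\, d_{ave}}{D_k} = \frac{2^k}{c \log n},\\
\#\{\text{processors removed at depth } k\}
  &\le \frac{2^k}{c \log n} \cdot \frac{n}{2^k} = \frac{n}{c \log n},\\
\#\{\text{processors removed at Stage 1}\}
  &\le \log n \cdot \frac{n}{c \log n} = \frac{n}{c}.
\end{align*}
```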




n at Stage 2.
Proof. Before Stage 2, the number of remaining processors in H is at least
$(1 - 1/c)n$ by Lemma 4.2. At depth $k = k_{max}$ of Stage 2, the sum of the labels on the live
depth-k intervals is at least $(1 - 1/c)n - 2m_k 2^k$, which is $(1 - 1/c)n - \frac{2n}{c \log n}$. At each
depth $k < k_{max}$, the sum of the labels on the live depth-k intervals decreases by at
most $(2m_k + m_{k+1})2^k$, which is $\frac{5n}{2c \log n}$. Summing over all depths, we conclude that





4.2.2. Assigning databases. For clarity of presentation, we first assume that
G has n0 processors, where n0 is the label on the root interval of G and n0 is a
constant fraction of n by Lemma 4.3. We also assume the existence of pebbles (0, t)
and (n0 + 1, t), for all $t \ge 1$, which are known to H at time step 0. This ensures that
each pebble computed by G is dependent on three pebbles.
Algorithm Overlap assigns one database to each remaining processor of H so
that H has load one. In particular, a depth-k interval with label x is assigned x
databases. The depth-0 interval, i.e., H, has all the databases $b_1, \ldots, b_{n0}$. We assume
inductively that a depth-k interval I labeled x is assigned databases $b_{i+1}, \ldots, b_{i+x}$.
If I has only one child $I_1$, then Overlap assigns $b_{i+1}, \ldots, b_{i+x}$ to $I_1$. If I has two
children $I_1$ and $I_2$ that are labeled $x_1$ and $x_2$, respectively, then $x = x_1 + x_2 - m_{k+1}$
by the construction of Stage 2. Overlap assigns $b_{i+1}, \ldots, b_{i+x_1}$ to interval $I_1$ and
$b_{i+x-x_2+1}, \ldots, b_{i+x}$ to $I_2$. Note that $m_{k+1}$ databases, namely, $b_{i+x-x_2+1}, \ldots, b_{i+x_1}$,
are assigned to both $I_1$ and $I_2$. These $m_{k+1}$ columns of pebbles will be redundantly
computed by both $I_1$ and $I_2$. At depth $k_{max}$ each remaining processor of H is assigned
one database.
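The inductive assignment can be sketched as a short recursion over the tree of labels. The tree representation (a dict with a label `"x"` and a list `"kids"` of live children), the function name `assign_databases`, and the callable `m` giving the overlap size at each depth are hypothetical choices made for this illustration.

```python
def assign_databases(node, lo, depth, m, out, path=()):
    """Record in out[path] the inclusive range of database indices for each interval.

    node  -- {"x": label, "kids": [live children]}; a hypothetical representation
    lo    -- index of the first database assigned to this interval (i + 1 in the text)
    m     -- callable giving the overlap size m_k at depth k
    """
    x = node["x"]
    out[path] = (lo, lo + x - 1)                 # b_{i+1}, ..., b_{i+x}
    kids = node.get("kids", [])
    if len(kids) == 1:
        # A single live child inherits the whole range.
        assign_databases(kids[0], lo, depth + 1, m, out, path + (0,))
    elif len(kids) == 2:
        x1, x2 = kids[0]["x"], kids[1]["x"]
        assert x == x1 + x2 - m(depth + 1)       # label rule of Stage 2
        # The left child gets the first x1 databases, the right child the last x2.
        assign_databases(kids[0], lo, depth + 1, m, out, path + (0,))
        assign_databases(kids[1], lo + x - x2, depth + 1, m, out, path + (1,))
    return out
```

For a root labeled 15 with two children labeled 8 and overlap size 1, the children receive databases 1 to 8 and 8 to 15, so database 8 is the single column computed by both.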
4.2.3. The simulation. In Overlap, H recursively simulates every $m_0 = \frac{n}{c \log n}$
rows of pebbles created by G as follows. If H (the depth-0 interval) has two
live depth-1 intervals $I_1$ and $I_2$ as children, then $I_1$ and $I_2$ recursively compute the
first $m_1 = m_0/2$ rows of pebbles and then repeat for the next $m_1$ rows. In particular,
$I_1$ (resp., $I_2$) computes all the pebbles of the form $(i, t)$, where $I_1$ (resp., $I_2$) owns
database $b_i$ and $1 \le t \le m_1$. Intervals $I_1$ and $I_2$ share $m_1$ databases and therefore
redundantly compute these $m_1$ columns of pebbles. If H has one live child $I_1$, then $I_1$
recursively computes the first $m_1$ rows and then repeats for the second $m_1$ rows. At
depth $k = k_{max}$, each depth-k interval computes $m_k = 1$ row of pebbles. Theorem 4.4
explains the simulation in detail.
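Let us define a set of values $s_t^{(k)}$ for $0 \le k \le k_{max}$ and $1 \le t \le m_k$, where the
superscript k represents the depth of the recursion, and the subscript t represents
the row number. Roughly speaking, $s_t^{(k)}$ corresponds to the time by which a depth-k
interval computes row t of its pebbles. In particular, $s_{m_0}^{(0)}$ corresponds to the time
that H takes to simulate the first $m_0$ steps of computation by G. Recall that the delay
threshold $D_k$ defined in (7) is an upper bound on the total delay in any live depth-k
interval. For $0 \le k < k_{max}$, the values satisfy the recurrence
$$s_t^{(k)} = s_t^{(k+1)} + D_k \quad \text{for } 1 \le t \le m_{k+1}, \qquad (10)$$
$$s_t^{(k)} = s_{m_{k+1}}^{(k)} + D_k + s_{t-m_{k+1}}^{(k+1)} \quad \text{for } m_{k+1} + 1 \le t \le m_k. \qquad (11)$$
The base of the recurrence is defined to be
$$s_{m_k}^{(k)} = s_1^{(k)} = 1 \quad \text{for } k = k_{max}. \qquad (12)$$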
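To see how such a recurrence unfolds, the sketch below evaluates it numerically. It assumes the following reading of (10) to (12), which should be treated as an assumption of this example: $s_t^{(k)} = s_t^{(k+1)} + D_k$ for $t \le m_{k+1}$, $s_t^{(k)} = s_{m_{k+1}}^{(k)} + D_k + s_{t-m_{k+1}}^{(k+1)}$ for $t > m_{k+1}$, and $s_1^{(k_{max})} = 1$. It further assumes $m_k = 2^{k_{max}-k}$ and $D_k = D_0/2^k$ with $D_0$ divisible by $2^{k_{max}}$, so that all values are integers. The name `completion_time` is hypothetical.

```python
from functools import lru_cache

def completion_time(kmax, D0):
    """Evaluate s^{(0)}_{m_0} under the assumed recurrence."""
    assert D0 % (1 << kmax) == 0, "sketch keeps every D_k an integer"

    def D(k):                       # delay threshold, halving with depth
        return D0 >> k

    def m(k):                       # overlap size, with m(kmax) = 1
        return 1 << (kmax - k)

    @lru_cache(maxsize=None)
    def s(k, t):
        if k == kmax:
            return 1                # base case (12)
        half = m(k + 1)
        if t <= half:
            return s(k + 1, t) + D(k)                    # bottom half, as in (10)
        return s(k, half) + D(k) + s(k + 1, t - half)    # top half, as in (11)

    return s(0, m(0))
```

Under these assumptions the value at the root agrees with the closed form $2^{k} s_{m_k}^{(k)} + 2kD_0$ at $k = k_{max}$; for instance `completion_time(3, 64)` evaluates to $8 + 2 \cdot 3 \cdot 64 = 392$.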
Let the left endpoint of interval I be the leftmost unremoved processor in I, and let
the right endpoint be the rightmost unremoved processor in I. (See Figure 10.) For
notational simplicity, we assume that I is the leftmost live depth-k interval and is
assigned databases $b_1, \ldots, b_x$. Let $B_k = \{(i, t) : 1 \le i \le x,\ 1 \le t \le m_k\}$. The proof of
the following theorem describes how algorithm Overlap performs the simulation.
Theorem 4.4. For $1 \le t \le m_k$, if pebbles $(0, t)$ and $(x + 1, t)$ are known by time
step $s_t^{(k)}$ by the left and right endpoints of interval I, respectively, then by time step
$s_t^{(k)}$ every pebble $(i, t)$ in $B_k$ is computed by all the processors in interval I that have
a copy of database $b_i$.
Proof. We proceed by a backward induction on k. At level $k = k_{max}$, we have
$m_k = 1$ and box $B_k$ has size $x \times 1$. Since the remaining processors of I have load one,
each processor computes one pebble in $B_k$. By definition, $s_1^{(k)} = 1$. Hence, the base
of the induction holds.
Fig. 11. The box of pebbles $B_{k+1}$ has size $x_1 \times m_{k+1}$ and is represented by the lower left box
with a dashed boundary. $B'_{k+1}$ has size $x_2 \times m_{k+1}$ and is represented by the lower right box with a
solid boundary. $B_k$ is the union of all four boxes. For interval I to compute every pebble in $B_k$, $I_1$
and $I_2$ (the live children of I) recursively compute $B_{k+1}$ and $B'_{k+1}$. Once the bottom half of $B_k$ is
computed the top half is computed in a similar manner.
Suppose that the inductive hypothesis is true for $k + 1$. Note that the hypothesis
can be applied to any depth-(k + 1) interval. Let us concentrate on I, the leftmost live
depth-k interval. Suppose I is labeled with x. There are two cases to consider.
Case 1. Suppose I has two live children $I_1$ and $I_2$ that are labeled with $x_1$ and
$x_2$, respectively. By construction $x = x_1 + x_2 - m_{k+1}$. Let $B_{k+1} = \{(i, t) : 1 \le i \le x_1,\
1 \le t \le m_{k+1}\}$. Let $y = x_1 - m_{k+1}$ and $B'_{k+1} = \{(i, t) : y + 1 \le i \le x,\ 1 \le t \le m_{k+1}\}$.
Let column C consist of pebbles $(y, t)$ and column D consist of pebbles $(x_1 + 1, t)$,
where $1 \le t \le m_{k+1}$. Note that boxes $B_{k+1}$ and $B'_{k+1}$ have an overlap of width $m_{k+1}$;
i.e., the $m_{k+1}$ columns between C and D are common to both $B_{k+1}$ and $B'_{k+1}$. (See
Figure 11.) Two observations can be made from the inductive hypothesis.
• Observation 1. For $1 \le t \le m_{k+1}$, every pebble $(y, t)$ in column C can be
computed by $I_1$ by time step $s_t^{(k+1)}$ without any conditions on pebbles $(0, t)$
and $(x_1 + 1, t)$. Since C and D are $m_{k+1}$ columns apart and $x_1 \ge 2m_{k+1}$ by
the construction of Stage 2, the pebbles in column C therefore do not depend
on the pebbles $(0, t)$ and $(x_1 + 1, t)$. (The dotted diagonal lines in Figure 11
show the dependencies of columns C and D.)
• Observation 2. Let $z \ge 0$ be some constant. For $1 \le t \le m_{k+1}$, if the values
of pebbles $(0, t)$ and $(x_1 + 1, t)$ are known at time step $s_t^{(k+1)} + z$ by the left
and right endpoints of interval $I_1$, respectively, then by time step $s_t^{(k+1)} + z$,
every pebble $(i, t)$ in $B_{k+1}$ is computed. This is true because there is no
difference between starting the simulation at time step z and at time step 0.
Similar statements can be made about the box $B'_{k+1}$ and column D. Now suppose
that the values of pebbles $(0, t)$ and $(x + 1, t)$ are known at time step $s_t^{(k)}$ by the
left and right endpoints of interval I, respectively. Observation 1 and the inductive
hypothesis imply that any pebble $(y, t)$ in column C can be computed by $I_1$ by time
$s_t^{(k+1)}$. Since the total delay in interval I is at most $D_k$, the left endpoint of
interval $I_2$ can receive the pebble $(y, t)$ (together with any relevant database changes)
by time $s_t^{(k+1)} + D_k$, which equals $s_t^{(k)}$ by (10). Similarly, all of the pebbles in column D
can be sent to the right endpoint of interval $I_1$ by time $s_t^{(k)}$. Since $s_t^{(k)}$ is greater than
$s_t^{(k+1)}$ by a constant amount, namely, $D_k$, for $1 \le t \le m_{k+1}$, Observation 2 and the
inductive hypothesis imply that pebbles $(i, t)$ in box $B_{k+1}$ (resp., $B'_{k+1}$) are computed
by $I_1$ (resp., $I_2$) by time $s_t^{(k)}$. Therefore, pebbles $(i, t)$ in the bottom half of $B_k$ are
computed by time $s_t^{(k)}$.
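Once the bottom half of $B_k$ is simulated, I simulates the top half in a similar
manner. For $m_{k+1} + 1 \le t \le m_k$, the pebbles $(i, t)$ in the top half of $B_k$ are computed
by time $s_{m_{k+1}}^{(k)} + D_k + s_{t-m_{k+1}}^{(k+1)}$, which equals $s_t^{(k)}$ by (11).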
Case 2. The case in which I has one live child is simpler. Let $I_1$ be the child of I.
By construction, $I_1$ has label $x_1 = x$. By Observation 2 and the induction hypothesis,
if the values of the pebbles $(0, t)$ and $(x_1 + 1, t)$, for $1 \le t \le m_{k+1}$, are known at
time steps $s_t^{(k)}$ by the left and right endpoints of interval $I_1$, respectively, then every
pebble $(i, t)$ in $B_{k+1}$ (i.e., the bottom half of $B_k$) is computed by $I_1$ by time step $s_t^{(k)}$.
Since intervals I and $I_1$ have the same remaining processors (and hence the same
endpoints), the above statement holds for I. Interval I then computes the top half
of $B_k$ in a similar manner: for $m_{k+1} + 1 \le t \le m_k$, row t is computed by time
$s_{m_{k+1}}^{(k)} + D_k + s_{t-m_{k+1}}^{(k+1)}$, which equals $s_t^{(k)}$ by (11).
The inductive step is complete. Hence, given that the values of pebbles $(0, t)$ and
$(x + 1, t)$ are known at time step $s_t^{(k)}$ by the left and right endpoints of interval I, all
pebbles $(i, t)$ in box $B_k$ are computed by time step $s_t^{(k)}$.
Recall that n0 is the label of the tree root and n0 is a constant fraction of n by
Lemma 4.3. We have the following theorem.
Theorem 4.5. Suppose that guest linear array G has n0 processors and the host
linear array H has n processors and an average delay of $d_{ave}$. Algorithm Overlap
simulates G with H such that the load on H is one and the slowdown is $O(d_{ave} \log^3 n)$.
Proof. The load on H follows directly from the database assignment. The box
$B_0$ contains all of the pebbles for the first $m_0$ steps of computation by G, where
$m_0 = \frac{n}{c \log n}$. The root interval $I_0$ contains all the remaining processors of H. Since
pebbles (0, t) and (n0 + 1, t) are available at time step 0 by assumption, Theorem 4.4
implies that $I_0$, i.e., H, computes the pebbles in box $B_0$ by time $s_{m_0}^{(0)}$. We derive $s_{m_0}^{(0)}$
from the recurrence of $s_t^{(k)}$ in (10) and (11) and the definition of $D_k$ in (7):
$$s_{m_0}^{(0)} = 2^k s_{m_k}^{(k)} + 2kD_0 \quad \text{for } k = k_{max}. \qquad (13)$$
Therefore, $s_{m_0}^{(0)} \le \frac{n}{c \log n} + 2c\, d_{ave}\, n \log^2 n = O(d_{ave}\, n \log^2 n)$. Since $m_0 = \frac{n}{c \log n}$, the
slowdown is $O(d_{ave} \log^3 n)$.
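The last step of the proof divides the time bound by the number of guest steps simulated. Written out, using only the bound on $s_{m_0}^{(0)}$ and the value of $m_0$ stated in the proof:

```latex
\begin{align*}
\frac{s^{(0)}_{m_0}}{m_0}
  &\le \left(\frac{n}{c \log n} + 2c\, d_{ave}\, n \log^2 n\right) \cdot \frac{c \log n}{n}\\
  &= 1 + 2c^2 d_{ave} \log^3 n\\
  &= O(d_{ave} \log^3 n).
\end{align*}
```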
4.2.4. Bandwidth. It is clear that the bandwidth required for the communication
between depth-k intervals is at most the bandwidth of G. Therefore, congestion
is not an issue if the bandwidth on H is at least $\log n$ times the bandwidth on G. If,
however, the bandwidths on G and H are comparable, then we need to pay an extra
factor of $\log n$ in the slowdown.
4.2.5. Improvements. In this section we first modify Overlap to achieve work
efficiency. So far each host processor is assigned at most one database, and the base
of the recurrence is therefore $s_{m_k}^{(k)} = 1$ for $k = k_{max}$ as defined in (12). Observe that in
(13) the second term of $s_{m_0}^{(0)}$ dominates the first term. We can balance the two terms
by increasing the value of $s_{m_k}^{(k)}$ for the base case, i.e., increasing the load on the host
processors.
In particular, we use an m-processor subarray of the host linear array H to simulate
an n-processor guest linear array, where $m = \max\{1, \frac{n}{d_{ave}} \log^{-3} n\}$ and the
subarray has average delay at most $d_{ave}$. If m = 1, the slowdown and the load are
both n. Otherwise, we carry out the two-stage process to remove the useless processors
of the m-processor subarray as described in section 4.2.1. The only difference
is that the network size is m instead of n, and the variables such as $D_k$, $m_k$, and $k_{max}$






















from (13). Sincem = ndave log








$s_{m_0}^{(0)}/m_0$ is $O(d_{ave} \log^3 n)$. This implies that the simulation is work
preserving.
Theorem 4.6. In the database model, an n-processor guest linear array can be






, where the host has average delay dave.
Combining Theorems 4.1 and 4.6 we can improve the slowdown by a factor of
$O(\sqrt{d_{ave}})$ while preserving efficiency. Suppose that G is an n-processor guest linear
array, and H is an n-processor host linear array with average delay $d_{ave}$. We make use
of an intermediate linear array H' that has a delay of $d_{ave}$ on every edge. Theorem 4.1





, 1} processors of H' are used. In the simulation by H', every $O(d_{ave})$
steps of computation interleave with every $O(d_{ave})$ steps of communication. If we
treat every $O(d_{ave})$ steps as one time unit, then H' acts like a guest linear array with
unit-delay edges and H has a normalized average delay of O(1). Theorem 4.6 implies
that H can simulate H' with a slowdown of O(log








Theorem 4.6 is improved to the following.
Theorem 4.7. In the database model, an n-processor guest linear array can be




3 n), where the host has average delay dave.
4.3. Simulating linear arrays on general networks. We generalize algorithm
Overlap to simulate a guest linear array on an arbitrary bounded-degree
connected host network. Given a connected bounded-degree n-processor network H
with average delay $d_{ave}$, we first find a linear array that can be embedded one to
one in H and has average delay $d_{ave}$. As discussed in section 2.4 such a linear array
can be found, and it is used to carry out the simulation. Combined with Theorem 4.7,
we obtain Theorem 4.8.
Theorem 4.8. An n-processor guest linear array can be efficiently simulated by a




the host has average delay dave.
For the same reason as in section 2.4, Theorem 4.8 does not hold when H has
unbounded degree.
4.4. Simulating two-dimensional arrays on general networks. Our tech-
niques can also be generalized to simulate a two-dimensional array on any connected
bounded-degree network.
Theorem 4.9. In the database model, an $n \times n$ guest can be efficiently simulated




where the host has average delay dave.
Proof. As discussed in section 2.4 there exists a linear array that is embedded
one to one in H and has average delay $O(d_{ave})$. The simulation of G on H will be
performed by simulating G on this embedded linear array, which we also denote by H
for the remainder of the proof. We first show how to simulate G
on an intermediate linear array H', where H' has delay $d_{ave}$ on all the edges. The
size of H' depends on the relative sizes of $d_{ave}$ and n.
Case 1. If $d_{ave} < n$, then H' has n processors, each of which simulates one column
of processors of G. To simulate one step of G, a processor of H' computes n pebbles
and then communicates with both of its neighbors. The communication takes at most
$n + d_{ave}$ steps, which is O(n) steps. Hence the slowdown of H' simulating G is O(n).
Also, in this simulation every O(n) steps of computation interleave with every O(n)
steps of communication.
Since $d_{ave} < n$, if every O(n) steps are treated as one time unit, then H has a
normalized average delay O(1) and H' acts like a guest linear array with unit-delay
edges. Therefore, Theorem 4.7 implies that H can efficiently simulate H' with a
slowdown of $O(\log^3 n)$. The combined slowdown is therefore $O(n \log^3 n)$.
Case 2. If $d_{ave} \ge n$, then H' has $n/x$ processors, where $x = \sqrt{d_{ave}/n}$. Each
processor of H' simulates 3x columns of G, overlapping x columns with each neighbor.
(The redundant computation used here is similar to that in Theorem 4.1.) To
simulate x steps of G, each processor of H' computes at most $3x^2 n$ pebbles and
then communicates with both of its neighbors. The communication takes at most
$3x^2 n + d_{ave}$ steps, which is $O(d_{ave})$ steps. Hence the slowdown of simulating every x
steps is $d_{ave}/x$, which is $O(\sqrt{n d_{ave}})$. Also, in this simulation every $O(d_{ave})$ steps of
computation interleave with every $O(d_{ave})$ steps of communication.
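The quantities in Case 2 follow by substituting $x = \sqrt{d_{ave}/n}$, which the following short calculation makes explicit:

```latex
\begin{align*}
3x^2 n &= 3 \cdot \frac{d_{ave}}{n} \cdot n = 3 d_{ave},\\
3x^2 n + d_{ave} &= 4 d_{ave} = O(d_{ave}),\\
\frac{d_{ave}}{x} &= d_{ave} \sqrt{\frac{n}{d_{ave}}} = \sqrt{n\, d_{ave}}.
\end{align*}
```

So x guest steps cost $O(d_{ave})$ host steps, which is a slowdown of $O(\sqrt{n\, d_{ave}})$ per guest step.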
If every $O(d_{ave})$ steps are treated as one time unit, H has normalized average delay
O(1) and H' acts like a linear array with unit-delay edges. If $n/x$ processors of H are
used to simulate H', Theorem 4.7 implies a slowdown of $O(\log^3 \frac{n}{x})$, which is $O(\log^3 n)$.




The above technique can be applied to the dataflow model, where H' simulates
G in the same manner and H simulates H' with a slowdown of O(1) in both cases.
Theorem 4.10. In the dataflow model, an $n \times n$ guest can be efficiently simulated
by a bounded-degree host network with a slowdown of $O(n + \sqrt{n d_{ave}})$, where the host
has average delay $d_{ave}$.
4.5. Lower bounds. In this section we discuss the impact on the slowdown of
the simulation when the number of copies of each database is bounded and the load
is a constant. We consider the case in which each database can have one copy and the
case in which each database can have at most two copies. Notice that although we
are restricting the number of copies of each database to either one or two, a particular
processor in the host can have a copy of many databases.
For the case in which each database is allowed one copy we give an example to
show that the slowdown can be $d_{max}$. Let G and $H_1$ be n-processor guest and host
linear arrays. Every $\sqrt{n}$th edge of $H_1$ has a delay of $\sqrt{n}$, and all other edges have unit
delay. Therefore, $H_1$ has an average delay of O(1). If at most $\sqrt{n}$ processors of $H_1$
have copies of databases, then by a work argument the slowdown when $H_1$ simulates
G is at least $\sqrt{n}$. Otherwise, there exist databases $b_i$ and $b_{i+1}$ such that they are
assigned to processors p and q of $H_1$, respectively, and that the delay between p and
q is at least $\sqrt{n}$. Hence, for all time steps t, processor p cannot compute pebble $(i, t)$
until $\sqrt{n}$ steps after q computes $(i + 1, t - 1)$, and q cannot compute $(i + 1, t)$ until
$\sqrt{n}$ steps after p computes $(i, t - 1)$. This implies a slowdown of $d_{max} = \sqrt{n}$, whereas
$d_{ave}$ is a constant. Note that the above argument makes no assumption on the load.
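The host $H_1$ is easy to build explicitly. The sketch below constructs its edge delays and splits the array into the maximal blocks of processors joined only by unit-delay edges. It assumes n is a perfect square and numbers the edges from 1 to n - 1; the function names `h1_delays` and `fast_block_sizes` are hypothetical.

```python
import math

def h1_delays(n):
    """Edge delays of H_1: edge e (1-based) joins processors e and e + 1."""
    r = math.isqrt(n)
    assert r * r == n, "sketch assumes n is a perfect square"
    # Every sqrt(n)-th edge has delay sqrt(n); all other edges have unit delay.
    return [r if e % r == 0 else 1 for e in range(1, n)]

def fast_block_sizes(delays):
    """Sizes of the maximal runs of processors joined only by unit-delay edges."""
    sizes, run = [], 1
    for dly in delays:
        if dly == 1:
            run += 1
        else:
            sizes.append(run)
            run = 1
    sizes.append(run)
    return sizes
```

For `n = 100` the average delay is 180/99, which is below 2, while the maximum delay is 10. Each fast block holds exactly 10 processors, so once more than $\sqrt{n}$ processors hold databases, some pair of consecutive databases must sit on opposite sides of a slow edge.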
Theorem 4.11. If each database can have at most one copy, then there exists a
host with $d_{ave} = O(1)$ such that the slowdown is $\Omega(\sqrt{n})$.
For the case in which each database is allowed at most two copies we construct a
host network $H_2$ whose average delay is O(1), but for which the simulation slowdown
is $\Omega(\log n)$. Network $H_2$ is made up of $\Theta(n)$ processors and the edge delays are either
1 or d. The following is a recursive construction of $H_2$ in which we define a series of
boxes. (See Figure 12.) We regard $H_2$ as a level-k box, where $k = \log \frac{n}{d}$. Network
$H_2$ consists of two level-(k - 1) boxes that are connected by $\frac{2^k d}{\log n}$ edges of delay 1. In
general, a level-$\ell$ box, for $1 \le \ell \le k$, consists of two level-($\ell$ - 1) boxes that are connected
by $\frac{2^\ell d}{\log n}$ edges of delay 1. We say that these $\frac{2^\ell d}{\log n}$ processors are in a segment. A level-0
box consists of a single edge of delay d.
Let $d = \log n$. Since a level-$\ell$ box contains $2^\ell$ edges of delay d and $\frac{2^\ell d \ell}{\log n}$ edges
of delay 1, $H_2$ has $\Theta(n)$ processors and constant average delay $d_{ave}$. Furthermore,
Lemma 4.12 holds.
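The edge counts quoted for a level-$\ell$ box follow from unrolling the recursive construction. The sketch below does this numerically; the function name `box_edge_counts` and the callable `seg`, which gives the number of unit-delay edges joining the two level-($\ell$ - 1) boxes, are hypothetical. With $d = \log n$ that number is simply $2^\ell$.

```python
def box_edge_counts(level, seg):
    """Return (#edges of delay d, #edges of delay 1) in a level-`level` box.

    seg(l) is the number of unit-delay edges that join the two level-(l-1)
    boxes inside a level-l box.
    """
    heavy, unit = 1, 0                 # a level-0 box is a single edge of delay d
    for l in range(1, level + 1):
        heavy, unit = 2 * heavy, 2 * unit + seg(l)
    return heavy, unit
```

Taking `seg = lambda l: 2**l` reproduces the counts in the text: a level-$\ell$ box has $2^\ell$ edges of delay d and $\ell 2^\ell$ edges of delay 1. For n = 256, d = 8, and k = 5 this gives 32 slow edges and 160 unit edges, for an average delay of (32 · 8 + 160)/192 = 13/6.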
Lemma 4.12. If processors p and q are in two different segments I and J, then the







, where u and v are the numbers
of processors in segments I and J, respectively. In particular, the delay between p and
q is at least $d = \log n$.
Theorem 4.13. If each database is allowed at most two copies and the load is
a constant c, then there exists a host with $d_{ave} = O(1)$ such that the slowdown is
$\Omega(\log n)$.
Proof. We consider the following two cases when $H_2$ simulates G.
Case 1. There exists some "overlap" in the database assignment. In particular,
suppose databases $b_i, b_{i+1}, \ldots, b_{i+j}$ are assigned to processors in segment I and
$b_{i+1}, \ldots, b_{i+j}, b_{i+j+1}$ are assigned to segment $J \ne I$ for some $j \ge 1$. Suppose also
that the other copy of $b_{i+j+1}$ is assigned to $J' \ne I$ and the other copy of $b_i$ is assigned
to $I' \ne J$. Notice that pebbles of the form $(i + k, t)$, for $1 \le k \le j$, can only be
computed by processors in segment I or J. Since the load is c, the number of processors
in segment I is at least $j/c$. The same is true for segment J. We shall find a path
of 4j pebbles such that either a delay of $O(j \log n)$ occurs, or a delay of $\log n$ occurs
O(j) times during the simulation. For simplicity we assume that j is even. The case
in which j is odd is similar.
We use a triple $(i, t, p)$ to say that processor p computes pebble $(i, t)$, and we use
expressions of the form $(i, t, p) \leftarrow (i - 1, t - 1, q)$ to indicate dependency. That is,
processor p receives pebble $(i - 1, t - 1)$ from processor q before p computes $(i, t)$.
(Note that p may be the same as q.) Consider the computation of the following path
of 4j pebbles, $\sigma_1 \leftarrow \cdots \leftarrow \sigma_{4j}$, where $\sigma_k$ is a triple of the form
$$
\sigma_k =
\begin{cases}
(i + k,\ t - k,\ p_k) & \text{for } k \in A, \text{ where } A = \{k : 1 \le k \le j\},\\
(i + j + 1,\ t - k,\ p_k) & \text{for } k \in B, \text{ where } B = \{k \text{ odd} : j < k \le 2j\},\\
(i + j,\ t - k,\ p_k) & \text{for } k \in C, \text{ where } C = \{k \text{ even} : j < k \le 2j\},\\
(i - k + 3j,\ t - k,\ p_k) & \text{for } k \in D, \text{ where } D = \{k : 2j < k \le 3j\},\\
(i + 1,\ t - k,\ p_k) & \text{for } k \in E, \text{ where } E = \{k \text{ even} : 3j < k \le 4j\},\\
(i,\ t - k,\ p_k) & \text{for } k \in F, \text{ where } F = \{k \text{ odd} : 3j < k \le 4j\}.
\end{cases}
$$
This path goes backward in time and zigzags during time steps k for $k \in B \cup C \cup E \cup F$.
(See Figure 13.)
Fig. 13. A path of 4j pebbles, where j is even.
By assumption, processors $p_k$, for $k \in C \cup E$, can only belong to segment I
or J. If processors $p_k$, for $k \in C \cup E$, do not belong to the same segment, then
Lemma 4.12 implies a delay of $\frac{j}{2c} \log n$ for the communication between segments I
and J. Hence, it takes more than $\frac{j}{2c} \log n$ steps to compute this path of 4j pebbles.
Otherwise, suppose processors $p_k$, for $k \in C \cup E$, all belong to segment I. Lemma 4.12 implies
a delay of $\log n$ in computing every $\sigma_k$ for $j < k \le 2j$. This is because processors
$p_k$, for $k \in B$, cannot be in segment I by assumption. Similarly, if processors $p_k$, for
$k \in C \cup E$, all belong to segment J, then there is a delay of $\log n$ in computing every
$\sigma_k$ for $3j < k \le 4j$. Hence, it takes more than $j \log n$ steps to compute this path of
4j pebbles.
We can repeat this argument for every 4j steps. Hence the slowdown is $\Omega(\log n)$.
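The path used in Case 1 can be generated mechanically, which makes it easy to confirm that it is a legal dependency chain: each element sits one time step earlier than the previous one and at most one column away. The sketch below omits the processor component of each triple and returns only (column, time) pairs; the function name `zigzag_path` is hypothetical, and j is assumed even as in the text.

```python
def zigzag_path(i, t, j):
    """(column, time) pairs of the 4j path elements for k = 1..4j, with j even."""
    assert j >= 2 and j % 2 == 0
    path = []
    for k in range(1, 4 * j + 1):
        if k <= j:                        # set A: walk right from column i+1 to i+j
            col = i + k
        elif k <= 2 * j:                  # sets B (k odd) and C (k even): zigzag
            col = i + j + 1 if k % 2 else i + j
        elif k <= 3 * j:                  # set D: walk left down to column i
            col = i - k + 3 * j
        else:                             # sets F (k odd) and E (k even): zigzag
            col = i if k % 2 else i + 1
        path.append((col, t - k))
    return path
```

The zigzag between columns i + j and i + j + 1, and later between columns i and i + 1, is what forces the repeated crossings between segments in the argument above.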
Case 2. There exists no "overlapping" of the databases as in Case 1. Let $b_i, \ldots, b_j$,
for $j \ge i$, be the longest sequence of consecutive databases assigned to one segment.
Call this segment I and the sequence of databases $S_I$. Notice that processors in I do
not have a copy of $b_{i-1}$. Let J be a segment that is assigned a copy of $b_{i-1}$. Let $S_J$ be
the sequence of consecutive databases such that $b_{i-1}$ is a member of $S_J$ and that each
member of $S_J$ has a copy in J. If $b_i$ were a member of $S_J$, then either the database
sequences $S_J$ and $S_I$ would produce the "overlapping" pattern sufficient for Case 1 or
$S_J$ would be longer than $S_I$. This latter case contradicts the definition of $S_I$. Hence,
any segment that has a copy of $b_{i-1}$ cannot have a copy of $b_i$. This implies that the
processors computing the pebbles in the (i - 1)st and ith columns are at least $\log n$
delay apart by Lemma 4.12. Therefore, the slowdown is $\Omega(\log n)$.
5. Conclusions. In this paper we presented methods for latency hiding in simple
networks such as linear arrays and two-dimensional arrays. Ultimately, we are
interested in the efficient implementation of algorithms designed for networks that
appear often in the architectures of parallel computers, such as trees, arrays, butterflies,
and hypercubes, on a network with arbitrary topology and arbitrary link delays,
such as NOWs. The special case in which two networks have identical topology but
different link delays is a starting point where we can study the effect of latencies in
isolation. Indeed, the general case of simulating a unit-delay guest on a host with
arbitrary delays and arbitrary topology so as to minimize slowdown seems to be a
very challenging problem.
REFERENCES
[1] The Connection Machine CM-5 Technical Summary, Thinking Machines Corporation,
Cambridge, MA, 1991.
[2] F. Afrati, C. H. Papadimitriou, and G. Papageorgiou, Scheduling DAGS to Minimize
Time and Communication, Aegean Workshop on Computing (AWOC), Corfu, Greece,
1988, pp. 134-138.
[3] Y. Aumann and M. Ben-Or, Computing with faulty arrays, in Proceedings of the 24th Annual
ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, 1992,
pp. 162-169.
[4] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha, Implementation
of a portable nested data-parallel language, in Fourth ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming, San Diego, CA, ACM Press, New
York, 1993, pp. 102-112.
[5] P. Chretienne, A polynomial algorithm to optimally schedule tasks on a virtual distributed
system under tree-like precedence constraints, European J. Oper. Res., 43 (1989),
pp. 225-230.
[6] R. Cole, B. Maggs, and R. Sitaraman, Multi-scale self-simulation: A technique for
reconfiguring arrays with faults, in Proceedings of the 25th Annual ACM Symposium on Theory
of Computing, San Diego, CA, 1993, pp. 561-572.
[7] J. Y. Colin and P. Chretienne, C.P.M. scheduling with small communication delay and task
duplication, Oper. Res., 39 (1991), pp. 680-684.
[8] D. N. Jayasimha and M. C. Loui, The Communication Complexity of Parallel Algorithms,
Technical Report CSRD 629, University of Illinois at Urbana-Champaign, 1986.
[9] H. Jung, L. Kirousis, and P. Spirakis, Lower bounds and efficient algorithms for
multiprocessor scheduling for dags with communications delays, Inform. and Comput., 105 (1993),
pp. 94-104.
[10] C. Kaklamanis, A. R. Karlin, F. T. Leighton, V. Milenkovic, P. Raghavan, S. Rao,
C. Thomborson, and A. Tsantilas, Asymptotically tight bounds for computing with faulty
arrays of processors, in Proceedings of the 31st Annual Symposium on Foundations of
Computer Science, St. Louis, MO, 1990, pp. 285-296.
[11] R. Koch, T. Leighton, B. Maggs, S. Rao, and A. Rosenberg, Work-preserving emulations
of fixed-connection networks, J. ACM, 44 (1997), pp. 104-147.
[12] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees,
Hypercubes, Morgan-Kaufmann, San Mateo, CA, 1992.
[13] F. T. Leighton, B. Maggs, and R. Sitaraman, On the fault tolerance of some popular
bounded-degree networks, SIAM J. Comput., 27 (1998), pp. 1303-1333.
[14] C. E. Leiserson, Z. Abuhamdeh, D. Douglas, C. Feynman, M. Ganmukhi, J. Hill,
D. Hillis, B. Kuszmaul, M. S. Pierre, D. Wells, M. Wong, S. Yang, and R. Zak,
The network architecture of the connection machine CM-5, in Proceedings of the 4th Annual
ACM Symposium on Parallel Algorithms and Architectures, San Diego, CA, 1992,
pp. 272-285.
[15] C. E. Leiserson, S. Rao, and S. Toledo, Efficient out-of-core algorithms for linear relaxation
using blocking covers, J. Comput. System Sci., 54 (1997), pp. 332-334.
[16] M. Palis, J.-C. Liou, S. Rajasekaran, S. Shende, and D. L. Wei, On-Line Scheduling of
Dynamic Trees, manuscript, 1994.
[17] M. Palis, J.-C. Liou, and D. L. Wei, Task Clustering and Scheduling for Distributed Memory
Parallel Architectures, Technical Report Fukushima 965-80, University of Aizu, Japan,
1994.
[18] C. H. Papadimitriou and J. D. Ullman, A communication-time tradeoff, SIAM J. Comput.,
16 (1987), pp. 639-646.
[19] C. H. Papadimitriou and M. Yannakakis, Towards an architecture-independent analysis of
parallel algorithms, SIAM J. Comput., 19 (1990), pp. 322-328.
[20] M. O. Rabin, Efficient dispersal of information for security, load balancing and fault tolerance,
J. ACM, 36 (1989), pp. 335-348.
[21] B. J. Smith, Architecture and applications of the HEP multiprocessor computer system, in
Real-time Signal Processing IV, 298, SPIE, Bellingham, WA, 1981, pp. 241-248.
[22] L. W. Tucker and G. G. Robertson, Architecture and applications of the connection
machine, Computer, 21 (1988), pp. 26-38.
[23] L. G. Valiant, Bulk-Synchronous Parallel Computers, Technical Report TR-08-89, Center for
Research in Computing Technology, Harvard University, Cambridge, MA, 1989.
[24] L. G. Valiant, A bridging model for parallel computation, Commun. ACM, 33 (1990),
pp. 103-111.

