C3: A Parallel Model for Coarse-grained Machines by Hambrusch, Susanne E. & Khokhar, Ashfaq A.
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1993 
C3: A Parallel Model for Coarse-grained Machines 
Susanne E. Hambrusch 
Purdue University, seh@cs.purdue.edu 
Ashfaq A. Khokhar 
Report Number: 
93-080 
Hambrusch, Susanne E. and Khokhar, Ashfaq A., "C3: A Parallel Model for Coarse-grained Machines" 
(1993). Department of Computer Science Technical Reports. Paper 1093. 
https://docs.lib.purdue.edu/cstech/1093 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 







C3: A parallel model for coarse-grained machines *
Susanne E. Hambrusch
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907, USA
seh@cs.purdue.edu
Ashfaq A. Khokhar
School of Electrical Engineering and Department of Computer Sciences
Purdue University




In this paper, we propose a model for parallel computation, the C3-model. The C3-model
evaluates, for a given parallel algorithm and target architecture, the complexity of
computation, the pattern of communication, and the potential congestion arising during
communication. A metric for estimating the effect of link and processor congestion on the
performance of a communication operation is developed. This metric allows the evaluation
of arbitrary communication operations without the user having to specify fine scheduling
details. We describe how the C3-model can serve as a platform for the development of
coarse-grained algorithms sensitive to the parameters of a parallel machine. The initial
validation of the C3-model is discussed for the Intel Touchstone Delta. We compare predicted
and actual performance of different solutions for communication operations and of various
divide-and-conquer approaches for contour ranking on images.
Keywords: Parallel processing, coarse-grained machines, communication operations,
computation versus communication, divide-and-conquer.
* Research supported in part by ARPA under contract DABT63-92-C-0022ONR. The views and conclusions
contained in this paper are those of the authors and should not be interpreted as representing official policies,
expressed or implied, of the U.S. government. A preliminary version of this paper appeared in the 6th IEEE
Symposium on Parallel and Distributed Processing, October 1994.
1 Introduction
The development of a parallel model that bridges software and hardware has been recognized as
crucial to the success of massively parallel computation. Such a model should be simple, should
accurately reflect the constraints of a parallel machine, and should have broad applicability with
respect to existing machines. In addition, such a model should provide a platform for algorithm
development and allow accurate prediction of the performance of an algorithm. Recently, a
number of models with this goal have been proposed [3, 6, 10, 13, 17, 23, 24, 25]. In most
of these models, including the BSP model [24], the postal model [3], and the LogP model [6],
processors are assumed to communicate using a point-to-point message router. Composing more
involved communication operations by using the message router places a significant burden on
application programmers. Furthermore, the above models do not attempt to capture the effect
of link or processor congestion on communication.
In this paper, we propose a parallel computation model, the C3-model, for developing and
analyzing algorithms on coarse-grained machines. This model captures the complexity of
computation, the pattern of communication, and the potential congestion arising during
communication. We propose a metric for estimating the effect of link and processor congestion on
the performance of communication operations. Parameters of our metric include the number of
processors, the number of processor pairs communicating, the latency and the bisection width of
the communication network, the message set-up cost, and the packet length. Our metric allows
the evaluation of arbitrary communication operations, and it can be applied without having to
specify fine scheduling details. We investigate how well the C3-model serves as a platform for
the development of coarse-grained algorithms and as a tool for estimating the performance of
an algorithm. We report our initial validation results of the C3-model on the Intel Touchstone
Delta. We compare predicted and actual performance for common communication operations,
including one-to-all, all-to-one, and all-to-all routing, and for contour ranking algorithms based
on different divide-and-conquer solutions.
In our model, we assume that computation is synchronized by a barrier-style synchronization
mechanism similar to the one described in [24]. More precisely, an algorithm can be partitioned
into a sequence of supersteps, with each superstep corresponding to local computation followed
by sending and receiving messages. Synchronization occurs between supersteps. We express
the performance of a superstep, and thus of an algorithm, in terms of computation units and
communication units. Counting in units allows us to penalize certain undesirable aspects in
local computation and in communication. The number of computation units charged depends
on the amount of local computation done. The number of communication units charged depends
on the amount of data sent by a processor, the amount of data received by a processor, the
latency encountered by the messages, and the congestion arising due to the volume of
inter-processor communication. Our method for evaluating communication units estimates the effect
of these factors on the performance of a communication operation. The routing schemas and
routing protocols available on a machine also influence the performance and this is reflected in
the total number of communication units charged.
Section 2 describes the C3-model and the metric devised to determine communication and
computation units. In Section 3, we use the model and the metric to determine the
communication units for common communication operations when each operation is implemented
by processors issuing direct sends and receives. In Section 4, we consider the same
communication operations and evaluate and analyze different implementations for each operation. We
describe how machine parameters and message sizes influence the performance. In Section 5,
we use divide-and-conquer based algorithms for an image-processing problem to validate the
C3-model.
2 The C3-Model
In this section we describe the metric used by the C3-model to compute the communication
and computation units of a superstep. The parameters of the machine entering the metric are
the following:
• p, the number of processors
• h, the latency of the communication network
• b, the bisection width of the communication network
• s, the set-up cost for a message
• l, the length of a packet.
architecture       latency h                 bisection width b
linear array       $p/3$                     $1$
binary tree        $\Theta(\log p)$          $1$
square 2-D mesh    $\frac{2}{3}\sqrt{p}$     $\sqrt{p}$
fat tree           $4 \log_4 p$              $\Theta(p)$
complete graph     $1$                       $p^2/4$
Figure 1: Latency and bisection width of machines
We define the latency as the average distance between two processors. The average distance
is $(\sum_{0 \le i,j \le p-1} d_{i,j}) / p^2$, where $d_{i,j}$ is the minimum distance between processors
$P_i$ and $P_j$. A message is made up of fixed-length packets and a packet is the logical unit for
communication between two processors. The quantity l denotes the number of bytes in a packet.
The bisection width is defined as the minimum number of links that have to be removed in order
to disconnect the machine into two halves with identical numbers of processors. Figure 1 gives
latency and bisection width for various p-processor architectures.
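The latency definition above can be checked numerically. The following is a minimal Python sketch, not part of the original report, that evaluates the average distance by brute force for a linear array and a square mesh and compares it with the asymptotic entries $p/3$ and $\frac{2}{3}\sqrt{p}$ of Figure 1. The function names are hypothetical, and the sum is taken over all ordered pairs including i = j, which is how the formula above is read here.

```python
from itertools import product


def avg_distance_linear_array(p):
    """Average of |i - j| over all p*p ordered processor pairs (i == j included)."""
    total = sum(abs(i - j) for i in range(p) for j in range(p))
    return total / p**2


def avg_distance_square_mesh(side):
    """Average Manhattan distance over all ordered pairs in a side x side mesh."""
    nodes = list(product(range(side), repeat=2))
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1) in nodes for (r2, c2) in nodes)
    return total / len(nodes) ** 2


if __name__ == "__main__":
    p = 64
    # Exact average next to the asymptotic table entry for each architecture.
    print(avg_distance_linear_array(p), p / 3)
    print(avg_distance_square_mesh(8), (2 / 3) * 8)
```

The exact averages fall slightly below the table entries because the closed forms drop lower-order terms.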
We assume that algorithms on coarse-grained machines are not constrained by the amount
of local memory. In current coarse-grained machines, the computing power of a processor
is equivalent to that of a state-of-the-art workstation. Hence, for a reasonable problem size,
memory is not likely to dictate or heavily influence algorithm design. When describing our
metric, we assume that both the processor bandwidth and the network bandwidth are equal to
t. How to handle and account for different bandwidth values is described later.
A common feature of parallel algorithms and algorithm design approaches (e.g.,
divide-and-conquer) is that, at some point or other, the p processors are logically partitioned into q sets
$S_1, \ldots, S_q$, with $S_i$ containing $p_i$ processors. Communication occurs only between processors in
the same set. A programmer familiar with the architecture and the algorithm can often perform
a mapping such that communication within processor set $S_i$ does not compete for resources
with communication done in the other processor sets. This is possible, for example, when every
processor set $S_i$ corresponds to a scaled down version of size $p_i$ of the p-processor machine. An
algorithm then operates on independent submachines. The importance of being able to operate
on independent submachines has been recognized. It has been incorporated into the Message
Passing Interface (MPI) [8] and has been extended to arbitrary process groups [1]. When it is
known that a superstep operates on independent submachines, we charge communication units
based on the parameters of the associated submachines.
2.1 Computation Units
The charging of computation units in a superstep is done as follows. Assume that in one
superstep processor $P_i$ accesses $t_i$ bytes. At this point we do not distinguish between access
to the processor's registers and access to its local memory. However, such distinctions can be
incorporated. The superstep is charged $\max_{0 \le i \le p-1} \lceil t_i / l \rceil$ computation units. The reason for
normalizing computation units by $l$ is that too little computation between two communication
steps should have a negative impact on the performance. If $t_i < l$, we charge one computation
unit and thus also penalize for not accessing enough bytes to fill a packet.
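As an illustration of the charging rule above, here is a small Python sketch. The function name and argument layout are hypothetical; charging one unit to a processor that accesses zero bytes is an assumption made here, since the text only states the rule for $t_i < l$.

```python
def computation_units(bytes_accessed, packet_len):
    """Computation units charged to one superstep.

    bytes_accessed[i] is t_i, the number of bytes processor P_i accesses;
    packet_len is l, the packet length in bytes.
    """
    if not bytes_accessed:
        raise ValueError("need at least one processor")
    units = []
    for t in bytes_accessed:
        packets = -(-t // packet_len)   # integer ceiling of t / l
        units.append(max(1, packets))   # assumption: t == 0 is also charged one unit
    # The superstep is charged the maximum over all processors.
    return max(units)


if __name__ == "__main__":
    # Three processors, 512-byte packets: the busiest one determines the charge.
    print(computation_units([100, 1024, 513], 512))
```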
2.2 Communication Units
The communication units charged to one superstep reflect the time spent in sending messages,
the time spent in receiving messages, the time messages are en route under ideal conditions,
the amount of congestion that could occur, and an estimate of the resulting delay. In order
to demonstrate broad applicability of our model, we describe the evaluation of communication
units for different routing schemas and different send and receive primitives. The two routing
schemas we consider are store-and-forward and wormhole routing. Both are common and they
are conceptually quite different. We refer to [14] for details. Most existing machines support
both blocking and nonblocking protocols for send and receive primitives. These protocols differ
in implementation based on the synchronization methods used. For the sake of completeness,
we describe these protocols. A blocking send is a send operation initiated by a source processor
which does not terminate until the message is received by the destination processor. During
this time the source processor cannot perform other computations or communications. In a
nonblocking send the source processor, after filling its send buffer, has to wait only until the
message has been read out of the send buffer. Nonblocking sends thus allow overlapping of
communication and computation and pipelining of multiple send operations. Analogously,
receive operations issued by the processors can also be blocking or nonblocking. For additional
details on routing protocols we refer to [7, 14, 17].

                     nonblocking sends and receives                                  blocking sends, nonblocking receives
                     send time $s_{i,j}$                     receive time $r_{i,j}$      send time $s_{i,j}$                             receive time $r_{i,j}$
store-and-forward    $s + \lceil L_{i,j}/l \rceil \cdot h$   $\lceil L_{i,j}/l \rceil$   $2(s+h) + \lceil L_{i,j}/l \rceil \cdot h$      $s + h + \lceil L_{i,j}/l \rceil \cdot h$
wormhole             $s + \lceil L_{i,j}/l \rceil + h$       $\lceil L_{i,j}/l \rceil$   $2(s+h) + \lceil L_{i,j}/l \rceil + h$          $s + 2h + \lceil L_{i,j}/l \rceil$
Figure 2: Send and receive times when $P_i$ sends $L_{i,j}$ bytes to $P_j$
Sending a single message from Pi to Pj
We start the description of how communication units are determined by giving a cost estimation
for sending a single message between two processors. Assume processor $P_i$ sends a message
consisting of $L_{i,j}$ bytes (i.e., $\lceil L_{i,j}/l \rceil$ packets) to processor $P_j$. We charge processor $P_i$ a send
time $s_{i,j}$ and processor $P_j$ a receive time $r_{i,j}$. Send time $s_{i,j}$ is an estimate of the time needed
to send the message when it encounters no congestion. Receive time $r_{i,j}$ represents the time
processor $P_j$ is occupied with receiving the message. Send and receive times for different routing
protocols and routing methods are stated in Figure 2.
The send time includes the time elapsing between issuing the send until processor $P_i$ can
resume computation and communication. In addition, it includes the time taken by the message
to reach destination $P_j$. In the case of nonblocking sends, processor $P_i$ could be doing another
task at this time. However, a message in transit takes resources away from the machine and
the C3-model charges this to processor $P_i$. For nonblocking sends and receives with
store-and-forward we thus have $s_{i,j} = s + \lceil L_{i,j}/l \rceil \cdot h$ and $r_{i,j} = \lceil L_{i,j}/l \rceil$. For the case of blocking sends,
processor $P_i$ is charged $s + h$ to initiate communication with processor $P_j$. After that, both
processors are engaged in the sending of the message. Both send and receive time accumulate
another $s + h$ when $P_j$ sends a confirmation back to $P_i$. Processors $P_i$ and $P_j$ are charged the
number of units corresponding to the time it takes for the message to reach $P_j$, resulting in the
quantities shown in Figure 2.
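The single-message charges can be written down as a small lookup. The Python sketch below follows one reading of Figure 2 (nonblocking receives throughout); the function name, the string labels for the routing schema, and the `blocking_send` flag are hypothetical names introduced for illustration.

```python
def packets(nbytes, l):
    """Number of packets, i.e. the integer ceiling of nbytes / l."""
    return -(-nbytes // l)


def send_receive_time(nbytes, s, h, l, routing="wormhole", blocking_send=False):
    """Return (send time, receive time) for one message of nbytes bytes.

    s: set-up cost, h: latency, l: packet length. Receives are nonblocking.
    """
    k = packets(nbytes, l)
    if routing == "store-and-forward":
        if blocking_send:
            # s + h to initiate, another s + h for the confirmation, k * h in transit
            return 2 * (s + h) + k * h, s + h + k * h
        return s + k * h, k
    if routing == "wormhole":
        if blocking_send:
            return 2 * (s + h) + k + h, s + 2 * h + k
        return s + k + h, k
    raise ValueError("unknown routing schema: " + routing)


if __name__ == "__main__":
    # A 1024-byte message with 512-byte packets, set-up cost 10, latency 4.
    for routing in ("store-and-forward", "wormhole"):
        for blocking in (False, True):
            print(routing, blocking, send_receive_time(1024, 10, 4, 512, routing, blocking))
```

With several packets per message, the sketch shows store-and-forward paying the latency once per packet while wormhole routing pays it once per message.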
Sending multiple messages from Pi
For every processor sending and receiving multiple messages in a superstep, we determine total
send and receive times. The total send (resp. total receive) time measures the time a processor
is engaged in sending (resp. receiving) messages when messages are not delayed by congestion.
We assume that a processor cannot send and receive simultaneously.
Assume that in a superstep processor $P_i$ sends $L_{i,j}$ bytes to processor $P_j$, $L_{i,j} \ge 0$,
$0 \le i, j \le p-1$. Let $n_s(i)$ denote the number of processors to which $P_i$ sends a message; i.e.,
$n_s(i) = |\{j \mid L_{i,j} > 0\}|$. Let $s_{i,j}$ be as defined above (i.e., it is the cost of sending the message
from $P_i$ to $P_j$ without congestion). The total send time, $S_i$, experienced by processor $P_i$ is
an upper bound on the cost for processor $P_i$ to send all $n_s(i)$ messages in a congestion-free
environment. Let $r_{j,i}$ be the receive time, as defined above, and let $R_i$ be the total receive time
experienced by processor $P_i$. Further, let $n_r(i)$ denote the number of processors from which
$P_i$ receives a message; i.e., $n_r(i) = |\{j \mid L_{j,i} > 0\}|$. Clearly, $\sum_{0 \le i \le p-1} n_s(i) = \sum_{0 \le i \le p-1} n_r(i)$.
Total send and total receive times depend on the routing schema and the routing protocol used.
Figure 3 gives the total send and receive times experienced under different routing protocols.
Consider the case of store-and-forward routing with nonblocking sends and nonblocking
receives. Let $P_j$ be the first processor to whom $P_i$ issues a send. After $s + \lceil L_{i,j}/l \rceil$ steps,
processor $P_i$ is no longer engaged in the send process for $P_j$ and can proceed with the next
send. This allows pipelining the $n_s(i)$ sends. Let $L_{i,max} = \max_{0 \le j \le p-1} L_{i,j}$. The total send
time experienced by processor $P_i$ thus contains the $n_s(i)$ start-up costs, the sum of all the
packets sent, and $h \cdot \lceil L_{i,max}/l \rceil$. The final quantity accounts for the latency encountered by the
last message to reach its destination. The total receive time is the sum of all the individual
receive times. For wormhole routing with nonblocking sends and nonblocking receives, we again
pipeline the $n_s(i)$ sends. The latency of the last message shows up as an additive quantity of h.
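Protocol        $S_i$                                                                                               $R_i$
SF, nbs, nbr    $s \cdot n_s(i) + h \cdot \lceil L_{i,max}/l \rceil + \sum_{0 \le j \le p-1} \lceil L_{i,j}/l \rceil$    $\sum_{0 \le j \le p-1} \lceil L_{j,i}/l \rceil$
WH, nbs, nbr    $s \cdot n_s(i) + h + \sum_{0 \le j \le p-1} \lceil L_{i,j}/l \rceil$                                     $\sum_{0 \le j \le p-1} \lceil L_{j,i}/l \rceil$
SF, bs, nbr     $2(s+h) \cdot n_s(i) + h \cdot \sum_{0 \le j \le p-1} \lceil L_{i,j}/l \rceil$                            $(s+h) \cdot n_r(i) + h \cdot \sum_{0 \le j \le p-1} \lceil L_{j,i}/l \rceil$
WH, bs, nbr     $2(s+h) \cdot n_s(i) + h + \sum_{0 \le j \le p-1} \lceil L_{i,j}/l \rceil$                                $(s+h) \cdot n_r(i) + h + \sum_{0 \le j \le p-1} \lceil L_{j,i}/l \rceil$
Figure 3: Total send and receive times for processor $P_i$ under different routing protocols.
SF = store-and-forward, WH = wormhole routing, nbs = nonblocking sends, nbr = nonblocking
receives, bs = blocking sends, br = blocking receives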
The quantity $S_i + R_i$ represents a bound on the time processor $P_i$ spends in one superstep
on sending and receiving messages. Charging one superstep $\max_{0 \le i \le p-1} \{S_i + R_i\}$
communication units reflects the overall send and receive time experienced by the machine during the
communication operation, not including the delay the messages encounter because of link and
processor congestion. We point out that when stating communication units we have not scaled
the set-up cost s, but simply included the total number of set-up costs experienced. When
giving communication units for operations on specific machines, as done in Section 4, we convert
set-up costs to communication units.
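The total send and receive times for nonblocking sends and receives can be computed directly from a traffic matrix. The Python sketch below follows the first two rows of Figure 3 as read here; the function names and the matrix layout `L[i][j]` are hypothetical. Charging $S_i = 0$ to a processor that sends nothing is an assumption, since the figure only covers processors that issue sends.

```python
def packets(nbytes, l):
    """Integer ceiling of nbytes / l."""
    return -(-nbytes // l)


def total_send_receive(L, i, s, h, l, routing="wormhole"):
    """(S_i, R_i) for nonblocking sends and receives; L[i][j] = bytes P_i sends to P_j."""
    p = len(L)
    n_s = sum(1 for j in range(p) if L[i][j] > 0)
    sent = sum(packets(L[i][j], l) for j in range(p))
    received = sum(packets(L[j][i], l) for j in range(p))
    if n_s == 0:
        return 0, received          # assumption: no sends, no send time
    if routing == "wormhole":
        latency = h                 # latency of the last message only
    elif routing == "store-and-forward":
        latency = h * packets(max(L[i]), l)
    else:
        raise ValueError("unknown routing schema: " + routing)
    return s * n_s + latency + sent, received


def send_receive_bound(L, s, h, l, routing="wormhole"):
    """max over i of S_i + R_i, the congestion-free part of the charge."""
    return max(sum(total_send_receive(L, i, s, h, l, routing)) for i in range(len(L)))


if __name__ == "__main__":
    # P_0 sends 1024 bytes to P_1 and 512 bytes to P_2.
    L = [[0, 1024, 512], [0, 0, 0], [0, 0, 0]]
    print(send_receive_bound(L, s=10, h=2, l=512))
    print(send_receive_bound(L, s=10, h=2, l=512, routing="store-and-forward"))
```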
Measuring congestion
We next describe the metric used to estimate the potential congestion arising at the processors
or communication links. Congestion plays a crucial role in the time required to complete
all routings. At the same time, congestion is difficult to evaluate. Congestion is a global
phenomenon and where it occurs depends on specifics of the architecture and the routing paths
taken. A formal model to deal with congestion in a shared memory machine has recently been
proposed in [9]. Congestion depends on the amount of data sent between processor pairs and is
independent of whether we use store-and-forward or wormhole routing. During a routing step,
store-and-forward stores $K$ packets in a single processor, while wormhole stores 1 packet (or
part of a packet) at $K$ (or more) processors. In our estimation of congestion, we measure $C_l$,
the congestion over links, and $C_p$, the congestion at the processors. We measure processor and
link congestion under the assumption that all messages are routed simultaneously. Clearly, this
may not be done under a given protocol. However, delaying the sending of a message by using
blocking sends is, in some sense, a possible way of dealing with the congestion. In both cases,
the messages experience a delay.
Our metric uses two quantities related to the communication being performed in a superstep.
Let $cong$ be the total number of processor pairs communicating and let $L_a$ be the average
number of packets routed between processors. Congestion over links is closely related to the
bisection width of the machine. In a machine with a bisection width of $b$, it takes at least $\lceil K/b \rceil$
steps to send $K$ packets from processors in one half of the machine to the processors in the
other half. We set
$C_l = L_a \cdot \lceil cong / b \rceil$.
Our estimation of the link congestion $C_l$ is both optimistic and pessimistic. It is optimistic in
measuring congestion only over a single link cut (namely, the cut that separates the machine
into halves). Clearly, link congestion occurring within each half can have an impact on the
overall link congestion. It is pessimistic in assuming that all $cong$ communicating processor
pairs have the source processor in one half and the destination processor in the other half.
In order to estimate the congestion at the processors, assume that all $cong$ processor pairs
are routed simultaneously. Processor congestion is then estimated as
$C_p = L_a \cdot \lceil cong / p \rceil \cdot h$.
The quantity $\lceil cong/p \rceil$ represents the average number of messages at a processor at the beginning
of the communication operation. We use $L_a$ in estimating the slow-down a message experiences.
We argue that a message of size $L_a$ traversing a distance of $h$ links and thus competing for the
resources with other messages at each of the $h - 1$ intermediate processors is slowed down by a
factor of $\lceil cong/p \rceil$ at each processor. We do not take into account that congestion at the processors
is likely to decrease during the routing. Capturing this behavior in a simple way is difficult and
in many realistic routings (e.g., a transpose and bit reversal) the decrease in the congestion is
slow.
In summary, the total number of communication units charged in a superstep is
$\max_{0 \le i \le p-1} \{S_i + R_i\} + C_l + C_p$.
In order to estimate actual execution time of an algorithm, relative weights need to be attached
to computation and communication units. These weights should be based on the ratio between
the processor clock speed and the network clock speed as well as the ratio of the bandwidth of
the network and the bandwidth of the processors [18]. In the high-level approach taken by our
model, clock speeds and bandwidth parameters do not influence the design of an algorithm and
they are thus not included. Put in a different way, we give units for the case when the network
clock speed is equal to processor clock speed and network bandwidth is equal to processor
bandwidth. When evaluating an algorithm, the ratio of computation units and communication
units over all supersteps gives information as to whether an algorithm is computation or
communication intensive.
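The congestion metric of this section is short enough to state as code. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and the total is taken to be the send and receive bound plus the link and processor congestion terms, which is the reading consistent with the hypercube charge derived for one-to-one routing in Section 3.

```python
from math import ceil


def link_congestion(L_a, cong, b):
    """C_l: L_a packets per pair, cong pairs crossing a bisection of width b."""
    return L_a * ceil(cong / b)


def processor_congestion(L_a, cong, p, h):
    """C_p: slow-down by ceil(cong / p) at each of the h hops."""
    return L_a * ceil(cong / p) * h


def communication_units(send_receive_bound, L_a, cong, p, b, h):
    """Assumed total: max_i (S_i + R_i) + C_l + C_p."""
    return (send_receive_bound
            + link_congestion(L_a, cong, b)
            + processor_congestion(L_a, cong, p, h))


if __name__ == "__main__":
    # One-to-one routing on a 64-processor hypercube: b = p/2, h = (log p)/2.
    p, b, h = 64, 32, 3
    s, k = 10, 2                   # set-up cost and packets per message
    bound = (s + k + h) + k        # wormhole, nonblocking: S_i + R_i
    print(communication_units(bound, L_a=k, cong=p, p=p, b=b, h=h))
```

For these values the result equals $s + h + k \cdot (4 + h)$, matching the closed form given for the hypercube.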
3 Charging Examples
Our metric allows evaluation of arbitrary communication patterns. While arbitrary patterns
occur in applications, regular patterns are more common on coarse-grained machines. In this
section we give the number of communication units charged for regular patterns when each
communication operation is implemented using the naive approach of each source processor
sending messages directly to the destination processors. The communication operations we
consider include one-to-one, one-to-all, all-to-one, and all-to-all routing. The communication
units are given for wormhole routing with nonblocking sends and nonblocking receives. To
simplify the presentation, we assume that every message is of length L.
One-to-one Routing
In one-to-one routing, also known as permutation routing, every processor sends L bytes to
a unique destination (i.e., unique among all p processors). Our charging method does not
distinguish between one-to-one routings that are easy or difficult with respect to the arising
congestion. Clearly, for any particular architecture, such differences do exist. In one-to-one
routing we have $n_s(i) = 1$, $n_r(i) = 1$, $0 \le i \le p-1$, and $cong = p$. Figure 4 gives total send
and total receive times, link and processor congestion for one-to-one and other communication
operations.

              $S_i$                                                  $R_i$                                       $C_l$                                                $C_p$
one-to-one    $s + \lceil L/l \rceil + h$                            $\lceil L/l \rceil$                         $\lceil L/l \rceil \cdot \lceil p/b \rceil$          $\lceil L/l \rceil \cdot h$
one-to-all    $(p-1) \cdot (s + \lceil L/l \rceil) + h$, $i = t$     $\lceil L/l \rceil$, $i \ne t$              $\lceil L/l \rceil \cdot \lceil (p-1)/b \rceil$      $\lceil L/l \rceil \cdot h$
all-to-one    $s + \lceil L/l \rceil + h$, $i \ne t$                 $\lceil L/l \rceil \cdot (p-1)$, $i = t$    $\lceil L/l \rceil \cdot \lceil (p-1)/b \rceil$      $\lceil L/l \rceil \cdot h$
all-to-all    $(p-1) \cdot (s + \lceil L/l \rceil) + h$              $\lceil L/l \rceil \cdot (p-1)$             $\lceil L/l \rceil \cdot \lceil p(p-1)/b \rceil$     $\lceil L/l \rceil \cdot h \cdot (p-1)$
Figure 4: Communication units charged for wormhole routing with nonblocking sends and
nonblocking receives
For one-to-one routing, link and processor congestion dominate the communication units.
Whether one can expect more congestion over the links or at the processors depends on the
bisection width of the machine. Assume that one-to-one routing is done on a p-processor square
mesh with $b = \sqrt{p}$ and $h = \frac{2}{3}\sqrt{p}$. Then, processor and link congestion appear almost balanced
and we charge
$s + \frac{2}{3}\sqrt{p} + \lceil L/l \rceil \cdot (2 + \frac{5}{3}\sqrt{p})$
communication units. On a p-processor hypercube we have $b = p/2$ and $h = \frac{\log p}{2}$, and the
processor congestion dominates. In total, we charge
$s + \frac{\log p}{2} + \lceil L/l \rceil \cdot (4 + \frac{\log p}{2})$
communication units.
One-to-all Routing
In one-to-all routing, a source processor, $P_t$, sends $p - 1$ distinct messages, each to a different
destination. One-to-all is also referred to as scatter or personalized broadcast [8, 15]. We have
$n_s(t) = p - 1$, $n_r(i) = 1$ for $i \ne t$, and $cong = p - 1$. The total send time experienced by the
source processor $P_t$ dominates the number of communication units.
All-to-one Routing
All-to-one routing, also known as the gather operation, is the inverse of one-to-all: every
processor now sends a message to a common processor, say processor $P_t$. We have $n_s(i) = 1$ for
$i \ne t$, $0 \le i \le p - 1$, $n_r(t) = p - 1$, and $cong = p - 1$. The total receive time at processor $P_t$
dominates the number of communication units.
All-to-all Routing
In all-to-all routing, also known as total exchange, every processor sends a message to every
other processor. We have $n_s(i) = p-1$, $n_r(i) = p-1$, $0 \le i \le p-1$, and $cong = p(p-1)$. From
the number of communication units charged shown in Figure 4 it follows that the bisection
width of the underlying architecture greatly influences the performance.
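The four regular patterns of this section can be generated and charged mechanically. The Python sketch below is illustrative only: the helper names are hypothetical, one-to-one routing is represented by a cyclic shift (one permutation among many), and a processor that issues no send is assumed to have $S_i = 0$. The charges follow the wormhole, nonblocking entries of Figure 4 as read here.

```python
from math import ceil


def pattern(name, p, t=0):
    """List of (source, destination) pairs for a regular pattern; t is the distinguished processor."""
    if name == "one-to-one":
        return [(i, (i + 1) % p) for i in range(p)]   # assumption: a cyclic shift
    if name == "one-to-all":
        return [(t, j) for j in range(p) if j != t]
    if name == "all-to-one":
        return [(i, t) for i in range(p) if i != t]
    if name == "all-to-all":
        return [(i, j) for i in range(p) for j in range(p) if i != j]
    raise ValueError(name)


def charge(pairs, p, L, s, h, b, l):
    """Return (max_i (S_i + R_i), C_l, C_p) when every message has L bytes."""
    k = -(-L // l)                                    # packets per message
    n_s, n_r = [0] * p, [0] * p
    for i, j in pairs:
        n_s[i] += 1
        n_r[j] += 1
    S = [n_s[i] * (s + k) + h if n_s[i] else 0 for i in range(p)]
    R = [n_r[i] * k for i in range(p)]
    cong = len(pairs)
    bound = max(S[i] + R[i] for i in range(p))
    return bound, k * ceil(cong / b), k * ceil(cong / p) * h


if __name__ == "__main__":
    for name in ("one-to-one", "one-to-all", "all-to-one", "all-to-all"):
        print(name, charge(pattern(name, 16), 16, L=512, s=10, h=3, b=4, l=512))
```

On this small example the link congestion term grows from 4 to 60 between one-to-one and all-to-all routing, which illustrates the sensitivity to the bisection width noted above.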
4 Validation through Communication Operations
In the previous section we gave the communication units for communication operations when
each operation is implemented through source processors issuing direct sends. Such
implementations are likely to be used by programmers not familiar with parallel processing. Not
surprisingly, they do not always result in good performance. In this section we use the C3-model as a
platform to develop and analyze different implementations of communication operations. For
each implementation we determine computation and communication units and compare total
units to the actual performance of the algorithms on the Intel Touchstone Delta. We show that
the C3-model and its metric give an accurate prediction of the relative performance between
different implementations of the same operation. Our results also indicate that the performance of
an implementation is influenced by the relationship among parameters of the parallel machine,
as well as by the relationship of the parameters to the amount of data involved. This agrees
with other research done on the implementation of communication operations [1, 2, 4, 19].
The Intel Touchstone Delta is a coarse-grained multi-processor system with 512 nodes
organized as a 16 x 32 2-dimensional mesh. Each node is directly connected to its 4 nearest
neighbors. The communication network uses wormhole routing. Packet size is 512 bytes, with
482 bytes reserved for data and 30 bytes for the message header. The operating system supports
both blocking and nonblocking communication primitives. We give communication units and
performance for wormhole routing with nonblocking sends and nonblocking receives.
In order to classify different approaches used in our implementations, we introduce the
notion of a k-level algorithm. Intuitively, in a k-level algorithm, the machine is partitioned into
k levels of submachines, with the submachines within each level operating independently from
each other. An algorithm is a 1-level algorithm if, in the description given in terms of supersteps,
no superstep operates on different submachines. In a k-level algorithm, k > 1, at least one
superstep assumes a partition into submachines, not necessarily of identical size, and subsequent
supersteps specify a (k - 1)-level algorithm for each submachine. In our implementations,
processors belonging to the same submachine form a scaled down version of the bigger machine.
For a mesh, a scaled down version will be either a smaller mesh with the same aspect ratio or a
linear array. This is a stronger requirement than the use of process groups as proposed by the
MPI Message Passing Standard [8]. When determining communication units, we assume that
communication within a submachine occurs without interference from other submachines.
When describing our algorithms, we assume that the size of the message routed between
any two processors is L. The objective of our algorithms is to have the processors send out
their packets as fast as possible and to minimize the time between processors sending out their
last packet and receiving the last packet destined for them. In many situations this time is
minimized by combining original messages of size L into larger messages and by performing
independent routings in submachines. We refer to L as the actual message size. This is in
contrast to the effective message size, which is the size of the message routed between two
processors in a particular superstep. For all algorithms, the effective message size is never
smaller than the actual message size.
4.1 One-to-all Routing
In this section, we use the k-level concept to develop a number of different implementations
for one-to-all routing. We evaluate each implementation using the metric of the C3-model
and compare the predicted performance with the performance of the algorithms on the Intel
Touchstone Delta.
Description of Algorithms
There exist two conceptually quite different 1-level algorithms for one-to-all routing. In the
first one, Algorithm 1-lev-dir, source processor $P_t$ issues $p - 1$ direct sends (and every other
processor issues a receive). Using Figure 4, 1-lev-dir is charged
$p \cdot (s + \lceil L/l \rceil) + \lceil L/l \rceil \cdot (\lceil p/b \rceil + h)$
communication units. We point out that throughout this section, we make a number of
simplifications when giving communication units. We write p when the correct quantity is p - 1 and
we may omit additive terms of h. Another 1-level approach is to have processor $P_t$ form one
long message of size $L(p - 1)$ which is broadcast to every processor. After receiving this
message, every processor extracts the message destined for it. Our broadcasting implementation,
Algorithm 1-lev-br, uses a binomial heap as a broadcasting tree. One expects the broadcasting
approach to be efficient only when L is small and/or when the parallel machine has a control
network supporting fast broadcasts. Figure 5 gives an outline of the different algorithms for
the one-to-all operation.
We next describe a generic 2-level approach. Logically partition the p-processor machine
into $p^{\alpha}$ submachines, each containing $p^{1-\alpha}$ processors for $\frac{1}{\log p} \le \alpha < 1$. Designate one processor
in each submachine as a leader. Source processor $P_t$ then forms $p^{\alpha}$ long messages, each having
an effective message size of $L p^{1-\alpha}$. The i-th long message formed consists of the $p^{1-\alpha}$ actual
messages destined for the processors in the i-th submachine, $0 \le i \le p^{\alpha} - 1$. Next, processor $P_t$
issues $p^{\alpha}$ sends (or $p^{\alpha} - 1$ sends if $P_t$ is a leader) to route the long messages to the leaders. Once
a leader has received its long message, it divides the message into $p^{1-\alpha}$ messages of size L and
initiates a 1-level one-to-all algorithm within its submachine.

Algorithm 1-lev-dir(p)
The source processor issues p-1 sends, one to each distinct destination.

Algorithm 1-lev-br(p)
1. The source processor concatenates the p-1 messages into one long message which is
   broadcast. Algorithm 1-lev-our-br uses a broadcast based on the binomial heap pattern.
2. Each processor extracts its message from the long message received.

Algorithm 2-lev-rec(p)
1. The source processor prepares ($p^{1/2} - 1$) long messages, each containing $p^{1/2}$ messages,
   and sends one long message to each processor in its column.
2. A processor that received a long message applies Algorithm 1-lev-dir($p^{1/2}$) within its row.

Algorithm 3-lev-sq(p)
1. The machine is partitioned into $p^{1/2}$ square submachines.
2. The source processor prepares $p^{1/2} - 1$ long messages, each containing $p^{1/2}$ messages,
   and sends one long message to each leader processor in the submachines.
3. Each submachine applies Algorithm 2-lev-rec($p^{1/2}$).

Algorithm logp-lev-sq(p)
1. The machine is partitioned into 2 submachines, alternating partitions along the columns
   and rows.
2. The source processor concatenates p/2 messages into one long message and sends the long
   message to the leader processor in the other submachine.
3. Each submachine applies Algorithm logp-lev-sq(p/2).

Algorithm logp-lev-rec(p, $\gamma$)
1. The machine is partitioned into 2 submachines, one containing $\gamma p$ processors including
   the source processor, and the other containing $(1-\gamma)p$ processors.
2. The source processor concatenates $(1-\gamma)p$ messages into one long message and sends it to
   the leader processor in the other submachine.
3. The submachine with $\gamma p$ processors applies Algorithm logp-lev-rec($\gamma p$, $\gamma$), and the
   submachine with $(1-\gamma)p$ processors applies Algorithm logp-lev-rec($(1-\gamma)p$, $\gamma$).

Figure 5: Outline of one-to-all algorithms.
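The generic 2-level approach can be sketched as a message plan. The Python code below is a minimal illustration, not the authors' implementation: the function name is hypothetical, submachines are taken to be consecutive blocks of processor indices, and the leader of a block is assumed to be its lowest-numbered processor (or the source itself in the source's block). The parameter `q` plays the role of $p^{\alpha}$.

```python
def two_level_one_to_all(messages, q, source=0):
    """Plan a 2-level one-to-all: messages[j] is the actual message for P_j.

    Returns (to_leaders, within): sends of the first superstep as
    (source, leader, long_message) and sends of the second superstep as
    (leader, destination, message).
    """
    p = len(messages)
    if p % q:
        raise ValueError("q must divide the number of processors")
    size = p // q                                   # processors per submachine
    to_leaders, within = [], []
    for g in range(q):
        members = range(g * size, (g + 1) * size)
        # Assumption: the source leads its own block, otherwise the first member does.
        leader = source if source in members else members[0]
        long_msg = {j: messages[j] for j in members if j != source}
        if leader != source:
            to_leaders.append((source, leader, long_msg))
        # The leader keeps its own message and forwards the rest directly.
        within.extend((leader, j, m) for j, m in long_msg.items() if j != leader)
    return to_leaders, within


if __name__ == "__main__":
    msgs = ["m%d" % j for j in range(16)]
    first, second = two_level_one_to_all(msgs, q=4)
    print(len(first), "long messages, then", len(second), "direct sends")
```

With 16 processors and 4 submachines the source issues 3 long sends instead of 15 direct sends, at the price of an effective message size four times the actual one.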
On the Intel Delta we have implemented a 2-1evel algorithm with a = 1/2 in which each
sllbmachine is a row containing ..;p processors. We refer to it as Algorithm 2-lev-rcc. The
leaders are the pmcessors in the column containing processor Pt. We use Algorithm l-lev·dir
as the I-level algorithm within each row. In Algorithm 2-lev.re.c, the first superstep operates on
a single column of the meslL The second superstep uses Algorithm 1-1cv-dir within each row.
The number of communication units charged in both supersteps is
where $b'$ and $h'$ are the bisection width and the latency in a $\sqrt{p}$-processor linear array,
respectively.
A 3-level algorithm is obtained by applying a 2-level approach to submachines. We considered
the following 3-level algorithm, Algorithm 3-lev-sq, on the Intel Delta. The p-processor
machine is logically partitioned into $\sqrt{p}$ submachines, each being an array of size $p^{1/4} \times p^{1/4}$.
Once a leader receives its long message from $P_1$, it initiates a 2-level algorithm for one-to-all
routing (using Algorithm 2-lev-rec) within its submachine.
The value of $k = \log p$ leads to a class of interesting algorithms. A p-processor machine is
now divided into two submachines and the source processor $P_1$ issues one send to the leader in
the other submachine. If the submachines are of equal size, the effective message size is $Lp/2$.
After this send, a (k - 1)-level algorithm is invoked. If the (k - 1)-level algorithm proceeds
in the same fashion, we refer to the algorithm as a Binomial Heap algorithm (since the sends
issued induce a tree having the shape of a binomial heap). When the machine is divided into
submachines of equal size, we perform $\log p$ supersteps, with each superstep experiencing only a
single message set-up cost. Further, the total number of set-up costs experienced is minimized.
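The binomial-heap pattern can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used on the Delta: it assumes that p is a power of two, that processors are numbered 0..p-1 with processor 0 as the source, and that submachines are contiguous index ranges led by their lowest-numbered processor. The function name `binomial_scatter` is hypothetical.

```python
def binomial_scatter(p):
    """Simulate one-to-all routing along a binomial heap (illustrative sketch).

    Returns (final, supersteps). final maps each processor to the half-open
    index range of destinations whose messages it still holds. supersteps is a
    list with one entry per superstep; each entry lists the sends issued in that
    superstep as (sender, receiver, number_of_messages_in_long_message).
    """
    assert p >= 1 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    leaders = {0: (0, p)}          # the source initially holds all p messages
    supersteps = []
    while any(hi - lo > 1 for lo, hi in leaders.values()):
        sends, nxt = [], {}
        for leader, (lo, hi) in leaders.items():
            mid = (lo + hi) // 2   # split the submachine into two equal halves
            sends.append((leader, mid, hi - mid))
            nxt[leader] = (lo, mid)    # keep the messages for the own half
            nxt[mid] = (mid, hi)       # the new leader receives the other half
        supersteps.append(sends)
        leaders = nxt
    return leaders, supersteps


final, steps = binomial_scatter(256)
print(len(steps), steps[0])
```

Under these assumptions a 256-processor run uses 8 supersteps, every leader issues exactly one send per superstep, and the first long message carries p/2 = 128 messages, which corresponds to the effective message size Lp/2 mentioned above.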
Algorithm logp-lev-sq divides the mesh in half by alternating vertical and horizontal cuts;
i.e., the algorithm operates on a square mesh of size p/4 after two supersteps. Let $C_{BH}(p)$ be
the number of communication units charged to Algorithm logp-lev-sq on a p-processor machine.
Then,
$$C_{BH}(p) = 2(s + il) + \left(\left\lceil \frac{Lp}{2l} \right\rceil + \left\lceil \frac{Lp}{4l} \right\rceil\right)(h + 2) + C_{BH}(p/4).$$
For the mesh, the average distance in the p/4-processor machine reduces from $h$ to $h/2$. Hence,
the recurrence is bounded by
$$C_{BH}(p) \le s \log p + c \left\lceil \frac{Lp}{l} \right\rceil h,$$
for a constant $c \le 1.5$.
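As a plausibility check, the recurrence can be unrolled numerically. The sketch below is an illustration under stated assumptions: ceilings are dropped, only the set-up part 2s of the first term is counted, h is halved each time p is divided by 4, and p is a power of 4. The helper name `bh_units` is hypothetical.

```python
def bh_units(p, L, l, h, s):
    """Unroll the C_BH recurrence two supersteps at a time (ceilings dropped).

    Returns (setup_units, data_units). Only the 2s set-up part of the first
    term is accumulated; the data term is (Lp/2l + Lp/4l) * (h + 2) per level.
    """
    setups = 0
    data = 0.0
    while p > 1:
        setups += 2                                   # one set-up per superstep
        data += (L * p / (2 * l) + L * p / (4 * l)) * (h + 2)
        p //= 4                                       # square submesh of size p/4
        h /= 2                                        # average distance halves
    return setups * s, data


setup, data = bh_units(p=256, L=1, l=512, h=10, s=8)
print(setup, round(data, 2))
```

With p = 256, l = 512, h = 10, and s = 8, this gives 64 set-up units and a data term of about 5.28L, close to the 8s + 5.29L listed for logp-lev-sq in Figure 6.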
Algorithm logp-lev-rec($\gamma$) divides the mesh into two submachines using $\gamma$, $0.5 \le \gamma < 1$,
as the partitioning factor. The partition is made so that the submachine containing source
processor $P_1$ consists of $\gamma p$ processors and the other submachine consists of the remaining
$(1 - \gamma)p$ processors. Evaluating Algorithm logp-lev-rec($\gamma$) in the C3-model results in a larger
Algorithm      Comm. Units     Comp. Units   Comm. Units (with s = 8)
1-lev-dir      256s + 0.55L    L/512         2048 + 0.55L
1-lev-br       8s + 27L        L             64 + 27L
2-lev-rec      32s + 1.23L     L/32          256 + 1.23L
3-lev-sq       24s + 0.93L     L/25          192 + 0.93L
logp-lev-sq    8s + 5.29L      L             64 + 5.29L

Figure 6: Approximate number of units charged for one-to-all algorithms assuming a
256-processor Intel Delta with h = 10, l = 512, and b = 16.
number of communication units compared to logp-lev-sq. However, Algorithm logp-lev-rec($\gamma$)
with $\gamma = 0.75$ performs well on the Intel Delta. We discuss the reasons, and why our model fails
to capture this, when comparing actual and predicted performance.
Predicted Performance and Experimental Results
In Figure 6 we show the total number of communication and computation units charged to
the one-to-all algorithms in the C3-model for the Intel Touchstone Delta. The units are given
for nonblocking sends and nonblocking receives. Since we considered messages whose sizes are
powers of 2, the ceilings have been dropped. The units are given for p = 256, h = 10 (the precise
value would be 10.67), l = 512, and b = 16. When converting the set-up cost s to units, we
assume s = 1400 processor cycles. Assuming a 40 MHz processor clock speed and 12.5 MB/sec
network bandwidth, the number of units corresponding to one set-up cost is approximately 8.
Figure 7(a) shows the predicted performance of the algorithms in graphical form, varying the
message size from 16 bytes to 16 Kbytes. From the communication units it appears that Algorithm
3-lev-sq is the best for message sizes of up to 6 Kbytes, and that Algorithm 1-lev-dir is likely to
give reasonable performance for large message sizes. Algorithm 1-lev-br is predicted to be a
poor choice.
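The comparison behind these predictions can be reproduced with a small script. This is a sketch only: the coefficients are the approximate communication units of Figure 6 as read from this copy (rounded, with the 1-lev-br data term read as 27L and s = 8 units per set-up), so crossover points are indicative rather than exact. The names `COST`, `units`, and `best` are hypothetical.

```python
S = 8  # units per message set-up, as assumed in the text

# (number of set-ups, coefficient of L) per algorithm, read from Figure 6
COST = {
    "1-lev-dir":   (256, 0.55),
    "1-lev-br":    (8, 27.0),
    "2-lev-rec":   (32, 1.23),
    "3-lev-sq":    (24, 0.93),
    "logp-lev-sq": (8, 5.29),
}


def units(name, L, s=S):
    """Approximate communication units for message size L (in bytes)."""
    setups, per_byte = COST[name]
    return setups * s + per_byte * L


def best(L):
    """Algorithm with the fewest predicted communication units for size L."""
    return min(COST, key=lambda name: units(name, L))


for L in (256, 1024, 4096, 16384):
    print(L, best(L), round(units(best(L), L), 1))
```

Under these coefficients, 3-lev-sq is cheapest at 1 Kbyte, 1-lev-dir is cheapest at 16 Kbytes, and 1-lev-br is the most expensive choice for large messages.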







































[Figure 7 plots not reproduced; horizontal axis: Message Size (in Bytes).]
Figure 7: (a) Predicted performance (in units) and (b) experimental results (in msec) of the
one-to-all algorithms on a 256-processor Intel Touchstone Delta using blocking sends and
nonblocking receives.
One-to-All                              Message Size (in Bytes)
Algorithms     16384     8192     4096    2048    1024     512    256    128     64     32     16
1-lev-dir     420.82   226.21   130.80   75.22   52.80   37.61  29.50  28.58  26.05  26.45  26.04
1-lev-br     4377.44  2193.51  1096.23  549.12  275.79  138.88  70.67  36.73  19.90  11.29   6.78

[A further row is only partly legible (545.37, 274.02, 138.30, 70.63, 36.62, 19.62). The
remaining rows, including logp-lev-rec(0.75), are not legible in this copy.]

Figure 8: Performance results for one-to-all routing on a 256-processor Intel Touchstone Delta
using nonblocking sends and nonblocking receives (execution times are in msec).
stone Delta. We considered machine sizes from 16 to 256 processors and message sizes from
16 bytes to 16 Kbytes. The corresponding experimental results for p = 256 are shown in
Figure 7(b). For a more complete discussion on the performance of these algorithms on the
Intel Delta, we refer to [11]. Figure 7 shows that expressing each algorithm in terms of
communication and computation units gives an accurate prediction of their relative performance
on the Intel Delta. Algorithm 1-lev-dir is indeed a reasonable choice for large message sizes
(at least 4 Kbytes). Independent of the message size, 1-lev-dir always experiences a total of
p - 1 message set-up costs. In addition, since the packet length on the Intel Delta is 512 bytes,
sending a message of any size ≤ 512 bytes costs approximately the same. The broadcasting algorithm gives
the worst performance. The poor performance is partly due to the large effective message size,
as well as due to the absence of a dedicated fast broadcasting network. Algorithms 2-lev-rec
and 3-lev-sq give approximately the same performance and are the best choice among the five
algorithms listed in Figure 7. Algorithm logp-lev-sq gives good performance only for small
message sizes (≤ 256 bytes). This also agrees with its predicted performance. Figure 8 gives
detailed performance results in tabulated form.
From Figure 8 it follows that Algorithm logp-lev-rec(0.75) performs quite well. Actually,
logp-lev-rec(0.75) gives optimal or near optimal results for all machine and message sizes on
the Delta [11]. As already stated earlier, the metric of the C3-model evaluates logp-lev-rec(0.75) to
be no better than Algorithm logp-lev-sq. If Algorithm logp-lev-rec(0.75) were implemented with
a barrier-style synchronization between supersteps, we would see no improvement. However,
logp-lev-rec(0.75) was implemented with no such synchronization. The value $\gamma = 0.75$ captures
characteristics of the send and receive ratio of the Delta (the value of $\gamma = 0.75$ was obtained
through experiments). Before the leader in the other submachine has received its long message,
the source processor already starts sending the next long message to the next leader. While
exploiting such features of a machine can bring good performance results, they are difficult to
incorporate into a computational model aimed at making parallel machines easier to use.
In summary, our validation work on the Intel Delta indicates that the message-combining
algorithms which keep a balance between the total number of sends and the effective message
size perform well for small message sizes. Which one of them gives the best performance
depends on the ratio between the send and receive time, the packet length, the ratio between
the processor and network bandwidth, and the message set-up cost.
4.2 All-to-one Routing
In all-to-one routing every processor sends a message to destination processor $P_1$. Processor $P_1$
is now the bottleneck. Conceptually, all-to-one is the inverse of one-to-all. Our one-to-all
algorithms, except the algorithm based on broadcasting, have corresponding all-to-one algorithms.
Algorithm 1-lev-dir for all-to-one is one in which every processor issues a send to processor
$P_1$. Algorithms 2-lev-rec and 3-lev-sq are the corresponding 2-level and 3-level algorithms,
respectively. Algorithm logp-lev-sq is the logp-level algorithm partitioning the mesh into two
submachines by alternating horizontal and vertical cuts. Algorithm logp-lev-rec($\gamma$) partitions
the mesh into two submachines based on the value of $\gamma$, $0 < \gamma < 1$.
The number of communication units charged for each of the all-to-one algorithms is almost
identical to that charged for one-to-all, and we omit details. The difference lies in the number
of message set-ups charged. For example, the communication units charged to Algorithm 1-lev-dir
for all-to-one include only a single message set-up, compared to p - 1 for one-to-all. For all
all-to-one algorithms, the receive times are the dominating terms in the communication units.
From a practical point of view, the best one-to-all algorithms do not necessarily correspond
to the best all-to-one algorithms. We refer to [11] for a complete discussion and only state
our main observations. On a 256-processor Intel Delta, Algorithm 1-lev-dir is no longer a
reasonable choice for large message sizes. For a 256-processor machine, all algorithms that
combine messages give a comparable performance for L ≤ 512, while for L > 512 Algorithm
logp-lev-rec(0.60) gives the best performance. For messages of length < 512 bytes the all-to-one
algorithms are slightly faster than their one-to-all counterparts, while for messages of length
≥ 512 bytes the all-to-one algorithms are significantly slower. This can be explained by machine
characteristics which we do not attempt to capture in the C3-model.
4.3 All-to-all Routing
In this section, we first describe a number of different algorithms for all-to-all routing. We then
compare their predicted performance with the experimental results achieved on a 256-processor
Intel Delta.
Description of Algorithms
The most straightforward 1-level approach for all-to-all routing is to have each processor send
its p - 1 messages, one by one, regardless of what other processors are doing. The machine
is thus flooded with messages and the arising congestion is left to be handled by the system.
This approach is used in Algorithm 1-lev-dir. An approach that attempts to control congestion
implements all-to-all through p - 1 one-to-one routings; i.e., the p(p - 1) routing requests are
partitioned into permutations. Common are the linear permutations and exclusive-or
permutations. When partitioning into linear permutations, processor j sends a message to processor
(j + i) mod p in the i-th permutation, 1 ≤ i ≤ p - 1. When partitioning into exclusive-or
permutations, all-to-all is partitioned so that in the i-th permutation processor j sends
a message to processor $i \oplus j$. Implementations of these approaches on different machines have shown
exclusive-or permutations to be superior to linear permutations [19, 22]. Another interesting
approach for partitioning all-to-all routings into permutations has been introduced in [21]. We
call this approach partitioning into balanced permutations and refer to [11] for implementation
details. Balanced permutations are relevant to the mesh architecture since they minimize the
congestion over the links.
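The two common partitionings can be written down directly. The sketch below is illustrative only: it assumes processors are numbered 0..p-1, that linear destinations are taken modulo p so that each round is a permutation, and that p is a power of two for the exclusive-or variant. The function names are hypothetical.

```python
def linear_permutation(p, i):
    """Destination of every processor j in the i-th linear permutation."""
    return [(j + i) % p for j in range(p)]


def xor_permutation(p, i):
    """Destination of every processor j in the i-th exclusive-or permutation."""
    return [j ^ i for j in range(p)]


def covers_all_pairs(p, make_perm):
    """Check that rounds 1..p-1 are permutations covering all p(p-1) requests."""
    seen = set()
    for i in range(1, p):
        dest = make_perm(p, i)
        assert sorted(dest) == list(range(p))      # each round is a permutation
        seen.update((j, d) for j, d in enumerate(dest))
    return len(seen) == p * (p - 1) and all(j != d for j, d in seen)


print(covers_all_pairs(16, linear_permutation), covers_all_pairs(16, xor_permutation))
```

One property visible in the sketch is that every exclusive-or round is a pairwise exchange: if j sends to k, then k sends to j in the same round.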
We view algorithms that partition into permutations as 1-level algorithms and refer to such
Algorithm       Communication Units   Computation Units
1-lev-dir       256s + 14L            L
1-lev-perm      256s + 14L            L
2-lev-sq        33s + 23L             1.5L
2-lev-c,r       32s + 23L             L
logp-lev-bfly   4s + 28.5L            4L

Figure 9: Approximate number of units charged for all-to-all algorithms on a 256-processor
Intel Touchstone Delta with h = 10, l = 512, and b = 16, assuming nonblocking sends and
receives.
algorithms as Algorithm 1-lev-perm. The metric of the C3-model charges the same number of
communication units for each of the three permutations. This is because our metric is unable
to distinguish between easy and hard permutations without explicitly giving a partitioning
into submachines. Further, our metric charges the same number of communication units for
algorithms which partition into p permutations and Algorithm 1-lev-dir, in which every
processor issues p - 1 sends independent of what the other processors are doing. The number of
supersteps and the amount of congestion in each superstep differ between these two 1-level
approaches, but the total number of units charged is the same. Figure 9 gives the total number
of communication and computation units for the all-to-all algorithms.
Next consider the following two 2-level algorithms. The approach used in the first one,
Algorithm 2-lev-sq, is independent of the underlying architecture. The approach used in the
second one, Algorithm 2-lev-c,r, is tailored towards the mesh architecture. An idea similar to the
one used in Algorithm 2-lev-sq is described in [4] and an implementation of Algorithm 2-lev-c,r
has also been reported in [22].
In Algorithm 2-lev-sq, a p-processor machine is logically partitioned into $\sqrt{p}$ submachines,
$S_0, \ldots, S_{\sqrt{p}-1}$. Submachine $S_i$ performs an all-to-all routing within $S_i$, sending long messages of
length $\sqrt{p}L$. After this step, processor $i$ in submachine $S_j$ contains the p messages destined
for the processors in submachine $S_i$ (and which have their source processor in submachine $S_j$).
The algorithm then performs a one-to-one routing step in which processor $i$ of submachine $S_j$
sends this long message (having length $Lp$) to processor $j$ in submachine $S_i$. The third and final
step is an all-to-all routing within each submachine which routes the messages to their correct
destinations. Algorithm 2-lev-c,r uses a similar principle, but avoids a one-to-one routing step
by using different submachines in the first and second step. In the first step the $\sqrt{p}$ submachines
correspond to the $\sqrt{p}$ columns of the mesh. We perform an all-to-all routing within each column
so that processor $i$ in column $j$ receives the p messages destined for the processors in row $i$ (and
which have their source processor in column $j$). An all-to-all routing within each row completes
the operation. As shown in Figure 9, the number of communication units charged to the two
2-level algorithms is identical. In Algorithm 2-lev-sq, 14L of the 23L units charged come from
the second step, the one-to-one routing. In Algorithm 2-lev-c,r the number of communication
units charged is split evenly between the two supersteps.
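The column-then-row scheme of Algorithm 2-lev-c,r can be checked with a small simulation. This is a sketch under stated assumptions, not the Delta implementation: the mesh is n x n with processors addressed as (row, column), every processor starts with one message per destination (including one for itself, so that each intermediate processor holds exactly p messages), and messages are modeled as (source, destination) pairs. The function name `two_level_cr` is hypothetical.

```python
def two_level_cr(n):
    """Simulate the two supersteps of a column/row all-to-all on an n x n mesh."""
    procs = [(r, c) for r in range(n) for c in range(n)]
    held = {q: [(q, d) for d in procs] for q in procs}

    # Superstep 1: all-to-all within each column. A message destined for
    # row i moves to the processor in row i of the sender's column.
    after_columns = {q: [] for q in procs}
    for (r, c), msgs in held.items():
        for src, dst in msgs:
            after_columns[(dst[0], c)].append((src, dst))

    # Superstep 2: all-to-all within each row delivers to the final column.
    after_rows = {q: [] for q in procs}
    for (r, c), msgs in after_columns.items():
        for src, dst in msgs:
            after_rows[(r, dst[1])].append((src, dst))
    return after_columns, after_rows


mid, final = two_level_cr(4)
print(len(mid[(0, 0)]), len(final[(0, 0)]))
```

After the first superstep, processor i in column j holds p messages, all destined for row i and all originating in column j; after the second superstep every processor holds exactly the messages addressed to it.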
We have also considered a logp-level algorithm, Algorithm logp-lev-bfly, based on the butterfly
communication pattern. In the first superstep of this algorithm every processor $P_i$ sends
the p/2 messages destined for the p/2 processors not in its half to processor $P_{(i+p/2) \bmod p}$. After
the received messages are combined with the messages that remained in a processor, all-to-all
is performed on two p/2-processor submachines.
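The recursive structure of the butterfly pattern can be illustrated as follows. The sketch assumes p is a power of two, processors numbered 0..p-1, submachines that are contiguous index ranges, and messages modeled as (source, destination) pairs including a message from each processor to itself. The function name `butterfly_all_to_all` is hypothetical.

```python
def butterfly_all_to_all(p):
    """Simulate all-to-all routing with the butterfly pattern (sketch)."""
    assert p >= 1 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    held = {i: [(i, d) for d in range(p)] for i in range(p)}
    size, supersteps = p, 0
    while size > 1:
        half = size // 2
        nxt = {i: [] for i in range(p)}
        for i, msgs in held.items():
            base = (i // size) * size                   # start of i's submachine
            partner = base + (i - base + half) % size   # same offset, other half
            own_half = (i - base) // half
            for src, dst in msgs:
                if (dst - base) // half == own_half:
                    nxt[i].append((src, dst))           # stays in this half
                else:
                    nxt[partner].append((src, dst))     # sent in one long message
        held, size = nxt, half
        supersteps += 1
    return held, supersteps


final, steps = butterfly_all_to_all(16)
print(steps, len(final[0]))
```

Each processor issues one send per superstep, and in this model each long message combines p/2 messages, which is why the approach becomes expensive for large message sizes.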
Comparing Predicted and Experimental Results
In this section we again compare the performance predicted by the C3-model to the performance
achieved on the Intel Delta. Recall that Figure 9 gives the communication and computation
units for the algorithms described in the previous section. The C3-model predicts the 1-level
algorithms to be superior for large message sizes and it predicts message combining algorithms
to perform better for small message sizes.
We have implemented the above mentioned algorithms on a 256-processor Intel Delta.
Algorithms 1-lev-lin, 1-lev-xor, and 1-lev-bal are the three 1-level algorithms partitioning all-to-all
communication into permutations. The predicted performance and implementation results are
shown in Figure 10(a) and Figure 10(b), respectively.
Algorithm 1-lev-xor gives the best performance for large message sizes. Observe that the
advantages of Algorithm 1-lev-bal with respect to the arising congestion are not evident from
the experimental results obtained from the Delta. As already stated, the metric proposed in




































Figure 10: (a) Predicted performance (in units) and (b) experimental results of the all-to-all
algorithms on a 256-processor Intel Delta using nonblocking sends and nonblocking receives.
performance for all 1-level algorithms follows the same curve. However, in actual implementations
different permutations induce different patterns of link and processor congestion and thus give
a different performance. Capturing this behavior in the model would be difficult.
All-to-All                               Message Size (in Bytes)
Algorithms       16384     8192     4096    2048    1024     512    256    128     64     32     16
1-lev-direct   6660.21  3115.27  1494.48  598.62  316.78  169.48  82.84  73.21  70.28  68.11  69.75
1-lev-lin      5476.28  2661.56  1294.39  639.83  330.90  182.18  94.48  71.12  67.66  63.03  66.55
1-lev-xor            ?        ?        ?       ?       ?       ?      ?  63.75  59.51  59.21  61.40
1-lev-balance  4988.76  2492.90  1221.90  619.62  305.24  144.25  77.43  77.83  72.83  64.47  61.11
2-lev-sq       6561.45  3260.92  1633.19  809.42  401.09  201.35  99.75  60.03  34.43  24.18  18.69
2-lev-c,r      5632.29  2659.75  1319.53  665.28  330.50  163.02  78.58      ?      ?      ?      ?
2-lev-c,r-int        ?        ?        ?  543.55  284.23  168.85 113.30  91.23  82.76  78.81  75.96
logp-lev-bfly        ?        ?  2206.67 1112.07  569.08  298.10 163.34  97.10  74.03  43.09  31.84

[Entries marked ? are not legible in this copy. The source gives only nine values for
logp-lev-bfly; they are shown right-aligned here.]

Figure 11: Performance results for all-to-all routing on a 256-processor Intel Touchstone Delta
using nonblocking sends and nonblocking receives (execution times are in msec).
The experimental results show that Algorithm 2-lev-c,r performs best for small message
sizes (≤ 256 bytes). Since in Figure 10 it is not easy to distinguish between the performance
of the algorithms for small message sizes, we refer to Figure 11. Algorithm 2-lev-sq gave the
second best performance for small message sizes. The reason 2-lev-c,r outperformed 2-lev-sq
lies in the fact that 2-lev-sq is a 3-step algorithm (which sends out data three times), while
2-lev-c,r is a 2-step algorithm. The advantage of the 3-step algorithm is that it uses square
meshes as submachines, whereas the 2-step one uses linear arrays. The approach in Algorithm
logp-lev-bfly has consistently been judged as being expensive for large message sizes [11, 22]. Our
metric and the observed performance on the Delta confirm that as well.
5 Validation through Divide-and-Conquer Solutions
On coarse-grained machines, divide-and-conquer strategies are natural and often result in
efficient solutions. Divide-and-conquer typically contains a merging process in which results
computed by different processors are combined to obtain the final solution. Different merging
patterns have different communication and computation requirements. Depending on machine
and problem parameters, different patterns are likely to result in different performance.
In this section we use contour ranking, a low-level image-processing problem, to validate
the C3-model. Contour ranking can be viewed as performing list ranking in images. The
problem arises when edge contours generated by edge operators in a 2-dimensional image plane
are transformed into a linearized representation. Such representations are more compact for
processing performed in subsequent mid- and high-level vision tasks [5, 16, 20]. Generating the
linear representation is called contour ranking.
The algorithms we describe use divide-and-conquer and merge information about subimages
in order to compute the final values. The information needed about a subimage is proportional
to the number of edge points on the boundary of the subimage. The time needed to merge
subimages is linear in the number of edge points on the involved boundaries. A number of other
problems on images can be solved by algorithms following the same principle. These problems
include component labeling, straight line approximations, and region growing. For example,
each one of our contour ranking algorithms can be turned into a component labeling algorithm
by using a different merging procedure. The relative performance of the component labeling
algorithms so obtained will correspond to the relative performance of the contour ranking algorithms.
5.1 Problem Definition and Basic Approach
We refer to a pixel on an edge contour as an edge point. For each edge point e, succ(e) points
to either one of e's eight immediate successors on the edge contour or it is nil. An edge point
e with succ(e) = nil is called a head. The succ-relation induces linked lists and thus each edge
contour corresponds to a linked list. In contour ranking we determine, for every edge point e,
the head of the list containing e and the distance from e to this head, called the rank of e. Once
the ranks are known, a final data movement step generates the linear representation. Clearly,
by following the succ-links, heads and ranks can be determined sequentially in linear time.
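The sequential computation can be sketched as follows. This is a minimal illustration, assuming that the succ-relation is given as a dictionary mapping each edge point to its successor (or `None` for a head) and that every list ends in a head, i.e., there are no closed contours. The function name `rank_contours` and the (row, column) encoding of edge points are hypothetical.

```python
def rank_contours(succ):
    """Compute head and rank of every edge point by following succ-links.

    succ maps each edge point to its successor, or to None if it is a head.
    Each point is placed on a path at most once, so the time is linear.
    """
    head, rank = {}, {}
    for e in succ:
        path = []
        x = e
        # Walk forward until reaching a head or an already ranked point.
        while x not in rank and succ[x] is not None:
            path.append(x)
            x = succ[x]
        if x not in rank:              # x is a head seen for the first time
            head[x], rank[x] = x, 0
        # Fill in results for the points on the walked path, back to front.
        for y in reversed(path):
            head[y] = head[succ[y]]
            rank[y] = rank[succ[y]] + 1
    return head, rank


succ = {(0, 0): (0, 1), (0, 1): (1, 2), (1, 2): None, (2, 2): (3, 3), (3, 3): None}
head, rank = rank_contours(succ)
print(head[(0, 0)], rank[(0, 0)])
```

In the example, the contour starting at (0, 0) has head (1, 2), and (0, 0) has rank 2.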
Let $I$ be an image of size $m \times n$. For simplicity, we assume that $p$ is a perfect square and that
$m$ and $n$ are both multiples of $\sqrt{p}$. We assume that image $I$ is partitioned into $p$ rectangular
subimages, each of size $m/\sqrt{p} \times n/\sqrt{p}$. We number these subimages using a row-major numbering
scheme. For clarity, we assume that processor $P_{i,j}$ is assigned subimage $I_{i,j}$, $0 \le i,j \le \sqrt{p} - 1$.
For any subimage $I'$ of $I$, the information needed about image $I'$ in order to compute the
head and rank information of all edge points outside $I'$ is proportional to the number of edge
points on the boundary of $I'$. Conversely, if the final head and rank are known for every edge
point on the boundary of $I'$, then the head and rank in image $I$ can be computed for every edge
point within $I'$. In a forward phase, our algorithms merge information about the boundaries of
subimages in order to compute the boundary information of larger subimages. In a backward
phase, the final head and rank in image $I$ of edge points on the boundary of subimages are used
to determine head and rank information for the remaining edge points within the subimages.
We refer to [12] for details on how the boundary is represented and for details of the merging.
In brief, each one of our algorithms consists of the following three steps.
1. Processor $P_{i,j}$ performs contour ranking on subimage $I_{i,j}$. $P_{i,j}$ then constructs the
boundary list representing the information about subimage $I_{i,j}$ needed in future computations.
2. Determine, for each edge point on the boundary of subimage $I_{i,j}$, its rank and head in
image $I$. In order to compute this information, boundaries of subimages are merged.
3. Determine the rank and head in $I$ for every edge point in subimage $I_{i,j}$.
Steps 1 and 3 are identical for each contour ranking algorithm and can be viewed as
preprocessing and postprocessing, respectively. They require no communication between processors.
In the next section we describe different divide-and-conquer patterns for performing Step 2.
5.2 Divide-and-Conquer Patterns
In this section we describe four algorithms for performing Step 2; i.e., for determining, for each
edge point on the boundary of subimage $I_{i,j}$, $0 \le i,j \le \sqrt{p} - 1$, its head and rank in image
$I$. Assume processor $P_{i,j}$ contains the boundary list of some rectangular subimage $I_{i,j}$ and
processor $P_{k,l}$ contains the boundary list of an adjacent subimage $I_{k,l}$. Let $I' = I_{i,j} \cup I_{k,l}$. In
order to determine the boundary list of subimage $I'$, both $P_{i,j}$ and $P_{k,l}$ send their boundary
list to each other. After each processor has received the other processor's list, it proceeds to
determine the boundary list of subimage $I'$. Both processors continue to merge subimages
until each processor knows the boundary list of image $I$. At this point the forward phase of
the contour ranking algorithm is completed and the backward phase begins. The goal of the
backward phase is to determine, for every edge point on the boundary of subimage $I_{i,j}$, its
rank and head in image $I$. For the algorithms we analyze in this section, the backward phase
requires no communication between processors. Processor $P_{i,j}$ uses information about larger
subimages to update the heads and ranks of smaller subimages, proceeding until the smaller
subimage equals $I_{i,j}$.
Algorithm 1-lev-dir
1. Every processor sends its boundary list to every other processor.
2. Every processor $P_{i,j}$ merges the p boundaries and determines, for each edge point on the
   boundary of $I_{i,j}$, its rank and head information in image $I$.

Algorithm 2-lev-rc
1. Processor $P_{i,j}$ sends its boundary list to every other processor in row $i$.
2. Processor $P_{i,j}$ merges the received data, creating the boundary list of subimage $I_{i,*}$.
3. Processor $P_{i,j}$ sends the boundary list of subimage $I_{i,*}$ to every processor in column $j$.
4. Processor $P_{i,j}$ determines the boundary list of image $I$. It then determines the rank and head
   in $I$ of every edge point on the boundary of $I_{i,j}$.

Algorithm logp-lev-quad
1. Form p/4 groups, each containing 4 processors, so that processors $P_{2i,2j}$, $P_{2i+1,2j}$, $P_{2i,2j+1}$, and
   $P_{2i+1,2j+1}$, $0 \le i,j \le \sqrt{p}/2 - 1$, belong to the same group. Number the processors in a group
   from 1 to 4. Every processor sends its boundary list to every other processor in the same
   group.
2. Let $I'_{2i,2j} = I_{2i,2j} \cup I_{2i+1,2j} \cup I_{2i,2j+1} \cup I_{2i+1,2j+1}$. A processor in the same group with $P_{2i,2j}$
   determines the boundary list of subimage $I'_{2i,2j}$.
3. All the processors with number $\ell$, $1 \le \ell \le 4$, recursively merge their subimages. After the
   recursion, every processor in the group with $P_{2i,2j}$ knows the head and rank in image $I$ for
   each edge point on the boundary of subimage $I'_{2i,2j}$.
4. Processor $P_{i,j}$ determines the rank and head in $I$ of every edge point on the boundary of $I_{i,j}$.

Figure 12: Outline of contour ranking algorithms.
The communication in the forward phase is an all-to-all broadcast performed on submachines.
The sizes and types of the submachines depend on the algorithm. We again characterize
Figure 13: All-to-all broadcast patterns for Algorithm logp-lev-quad (panels: 1st iteration and
2nd iteration).
our algorithms by the number of submachine levels they employ. Figure 12 contains an outline
of three of the algorithms. In Algorithm 1-lev-dir every processor sends its boundary list to
every other processor. This is the only communication operation of the algorithm. After this
communication, every processor can determine the head and rank in image $I$ for every edge
point on the boundary of its subimage. In Algorithm 2-lev-rc, the processors first perform an
all-to-all broadcast within every row, followed by an all-to-all broadcast within every column.
The third algorithm, Algorithm logp-lev-quad, merges subimages in a quad-tree like fashion;
i.e., at every iteration the boundary lists of four adjacent subimages are merged. Figure 13
shows the all-to-all patterns arising in the first two iterations of Algorithm logp-lev-quad on a
II X " mesh. The processors communicating in the all-to-all broadcast in the second iteration
are linked with arrows of the same type. In each one of these three contour ranking algorithms,
every processor merges subimages at each iteration. At the same time, the number of processors
merging identical subimages, and thus performing identical computations, increases with every
iteration.
On mesh architectures, Algorithm logp-lev-quad experiences the following communication
imbalance. The size of the boundary of the subimages, and thus the size of the lists sent
between processors, increases in subsequent iterations. In initial phases, processors communicate
over short distances. As the algorithm proceeds, the communication distances and associated
congestion increase. This is also evident from Figure 13. This imbalance is the motivation
for our fourth contour ranking algorithm, Algorithm logp-lev-bal. In Algorithm logp-lev-bal the
imbalance is reduced by performing a permutation that sends the boundary list from processor
$P_{i,j}$ to processor $P_{rev(i),rev(j)}$, where $rev(i)$ is the index obtained by applying the bit-reversal
to the binary expansion of $i$. The result of applying this permutation is that processors
initially communicate over long distances. As the size of the subimages and thus the sizes of the
boundary lists increase, the distance between communicating processors and link congestion
decrease.
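The bit-reversal permutation can be sketched in a few lines. This illustration assumes an n x n mesh with n a power of two, so that row and column indices have log2(n) bits; the names `rev` and `bit_reversal_permutation` are hypothetical helpers.

```python
def rev(i, bits):
    """Reverse the lowest `bits` bits of the binary expansion of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # append the lowest bit of i to r
        i >>= 1
    return r


def bit_reversal_permutation(n):
    """Map each processor (i, j) of an n x n mesh to (rev(i), rev(j))."""
    assert n >= 1 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    bits = n.bit_length() - 1
    return {(i, j): (rev(i, bits), rev(j, bits))
            for i in range(n) for j in range(n)}


perm = bit_reversal_permutation(16)
print(perm[(0, 1)], perm[(1, 1)])
```

On a 16 x 16 mesh, rows 0 and 1 are mapped to rows 0 and 8, so subimages that are adjacent in the image end up on processors that are far apart, which is why the first merging iterations communicate over long distances.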
5.3 Predicted Performance and Experimental Results
We next use the C3-model to analyze the four contour ranking algorithms described in the
previous section. Clearly, the performance of each one of the algorithms depends on the size of
the boundary lists and is thus image-dependent. In order to analyze the algorithms, we need
to make assumptions about the input. We measure for every image the edge point density,
which is defined as the fraction of all pixels that are edge points. We consider images with edge
point densities from 5 to 100%. We use synthetic images consisting of vertical or diagonal lines
through the entire image. The desired edge point density dictates the spacing of these lines.
Real images with the same edge point density will give the same predicted performance and very
similar experimental results.
We again use a 256-processor Intel Delta for our analysis. Figure 14 gives the communication
and computation units of the algorithms for an image consisting of diagonal lines with an edge
point density of 100%. The quantity $B$ represents the size of the boundary list of subimage $I_{i,j}$
assigned to processor $P_{i,j}$.
Figure 15(a) and (b) show the predicted performance in terms of communication and
computation units in graphical form for four image sizes ranging from 256 × 256 to 2K × 2K.
Each image has an edge point density of 100%. The performance of the algorithms on a
256-processor Intel Delta is shown in Figure 15(c) and (d). Overall, Algorithm logp-lev-quad gave
the best predicted and actual performance. We point out that each one of our algorithms
experiences roughly $\log p$ message set-up costs. In Algorithms 1-lev-dir and 2-lev-rc these
set-ups are experienced by the all-to-all broadcast (which is implemented using a binomial heap
Algorithm       Communication Units    Computation Units
1-lev-dir       8s + 38 + 3.33B        1.5B
2-lev-rc        8s + 20 + 2.14B        0.84B
logp-lev-quad   8s + 38 + 1.8B         0.1B
logp-lev-bal    10s + 58 + 0.80B       0.11B

Figure 14: Approximate number of units charged for contour ranking algorithms on a
256-processor Intel Touchstone Delta with h = 10, l = 512, and b = 16, assuming nonblocking sends
and receives and an edge density of 100%.
structure). In the quad-tree based algorithms, each of the $\log_4 p$ iterations experiences 2
set-up costs. The execution times reported in Figure 15(c) and (d) were obtained by monitoring
one of the processors in the array. Since the edge density is uniform over the entire image,
each processor experiences the same computation and communication load. Therefore, monitoring
a single processor not only gives a reasonable approximation of the overall performance, but
also allows us to measure separately the time spent in communication and on local computation.
Comparison of the actual performance with the predicted performance for the algorithms reveals
a high degree of correlation between the two.
We conclude this section with a brief discussion on how the algorithms behave under different
edge point densities. We considered edge point densities from 5 to 100%. The obtained results
give insight into the behavior of the algorithms when the sizes of the boundary lists change.
When increasing the edge point density in large images, we observed that the communication
time of Algorithm 1-lev-dir increases much more sharply than that of the other algorithms. The
growth rate in the communication time for Algorithms logp-lev-quad and logp-lev-bal is relatively
slow. Algorithm logp-lev-bal gave the best performance for large images with a high edge-point
density. This is attributed to the fact that for large, dense images the amount of data routed
during the merging steps is significant enough to cause congestion in the routing network.
Therefore, a data movement step before the actual merging in Algorithm logp-lev-bal pays off.
For images of size 2K × 2K, Algorithm logp-lev-bal outperforms logp-lev-quad for images with an
edge point density higher than 10%. Load balancing performs better by 10-15%. On the other

Figure 15: Predicted performance ((a) communication, (b) computation) and experimental
results ((c) communication, (d) computation) of the contour ranking algorithms for images
with an edge point density of 100% on a 256-processor Intel Delta.
edge point density close to 100%, as is also evident from Figures 15(c) and (d). The analysis of
the actual performance on varying edge densities also conforms with the performance predicted
by the C3 model.
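As a rough illustration of why the communication time of 1-lev-dir grows fastest with the boundary-list size, the sketch below compares the coefficients of B in the communication-unit expressions of Figure 14. It is a simplification under stated assumptions: the coefficients are taken as read from Figure 14, B is treated as a free variable although the figure was derived for an edge density of 100%, and the set-up terms and other terms that do not depend on B are ignored. The names COMM_SLOPE, extra_comm_units, and slope_ratio are hypothetical.

```python
# Coefficient of B in the communication units of each algorithm,
# as read from Figure 14 (assumption: B-independent terms are ignored).
COMM_SLOPE = {
    "1-lev-dir": 3.33,
    "2-lev-rc": 2.14,
    "logp-lev-quad": 1.8,
    "logp-lev-bal": 0.80,
}


def extra_comm_units(algorithm: str, delta_b: float) -> float:
    """Additional communication units charged when B grows by delta_b."""
    return COMM_SLOPE[algorithm] * delta_b


def slope_ratio(a: str, b: str) -> float:
    """How many times faster the communication units of a grow than those of b."""
    return COMM_SLOPE[a] / COMM_SLOPE[b]


if __name__ == "__main__":
    # Rank the algorithms from slowest to fastest growth in B.
    for name in sorted(COMM_SLOPE, key=COMM_SLOPE.get):
        print(f"{name:14s} {COMM_SLOPE[name]:.2f} units per unit of B")
    print("1-lev-dir vs logp-lev-bal:",
          round(slope_ratio("1-lev-dir", "logp-lev-bal"), 2))
```

In this simplified view the predicted communication units of 1-lev-dir grow about four times as fast in B as those of logp-lev-bal, which matches the qualitative trend reported for increasing edge point densities.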
6 Conclusions
A computational model, the C3 model, has been proposed for developing and analyzing
algorithms on coarse-grained machines. The C3 model allows the evaluation of communication
operations without the user having to specify fine scheduling details. In addition, a metric
has been defined to estimate the link and processor congestion that arises. Coarse-grained
algorithms have been developed for common communication operations and for a low-level vision
problem solvable through divide-and-conquer algorithms. The model has been validated by
implementing the algorithms on the Intel Touchstone Delta and comparing the measured
performance with the predicted performance. This initial validation is encouraging, and it
provides insight into the interaction of various machine parameters and their effect on the
performance of coarse-grained algorithms.
7 Acknowledgements
We would like to thank Mike Atallah for helpful conversations and Farooq Hameed for his
valuable assistance in the implementation of the algorithms.
References
[1] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. I-Io, C.-T. Ho, S. Kipnis, and M. Snir,
"CCL: A Port;:Lble and Tunable Collective Communication Library for Scalable Parallel
Computers," Proceedings oj 8-lh International Parallel Processing Symposium, pp. 835-
8'14, 199'1.
[2J M. Barnett, R. Littlefield, D. G. Payne, and R. van de Geijn, "Global Combine on Meshes
Architecures with Wormhole Routing," Proceedings oj 1-th Intemational Parallel Process-
ing Symposium, pp. 156-162, HJ93.
[3J A. Bar-Noy, S. Kipnis, "Designing Broadcasting Algorithms in the Postal Model for
Message-Passing Systems," Proceedings of 4-th ACM Symp. on Parallel Algorithms and
Architectllres, pp. 13-22, 1992.
33
[4] S.H. Bokhari, "Multiphase Complete Exchange on a Circuit Switched Hypercube," Pro-
ceedings of 1991 International Conference on Parallel Processing, pp. 525-529, 1991.
[.5] L. T. Chen, L. S. Davis, and C. P. Kruskal, "EIIicient parallel processing of image contours,"
IERE' Transactions on Pallern Analysis and Machine Intelligence, Vol. 15, no. 1, pp. 69-81,
1993.
[6] D. Culler, R. Karp, D. Patterson, A. Sahay, ICE. Schauser, E. S;:tntos, R. Subramonian,
T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation," Proceedings
oJ 4-th ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming, pp.
1-12,1993.
[7J R. Cypher, E. Leu, "The Semantics of mocking and Nonblocking Send and Receive Prim-
itives," Technical Report, IBM Almaden Research Division, 1993.
[8J .T..T. Dongarra, R. Hempel, A.J .G. Hey, D.W. Walker. "A Proposal for a User-level, Message
Passing Interface in a Distributed Memory Environment", Technical Report TM 12231,
Oak Ridge National Laboratory, 1993.
[9] C. Dwork, M. Herlihy, O. Waa.rts, "Contention in Shared Memory Algorithms", Pmc. of
25-th .llCM STOC, pp. 174-183, H)93.
[10] P.B. Gibbons, "I\. More Practical PRAM Model," Proceedings of 1989 ACM Symposium
on ParnllelAlgorithms and Architectures, pp. 158-168,1989.
[11] S.E. Hambrusch, F. Hameed, and A. Khokhar, "A Study of Coarse-Grained Communica-
tion Operations on Mesh Architectures" Technical Report, Purdue University, May 1991\.
[12] F. Hameed, S.E. Hambrusch, A. Khokhar, and J.Patel, Contour Ranking on Coa.rse-
Grained Machines: A Case Study for Low-level Vision Computations, Technical Report,
Purdue University, November 1994.
[13] T. Heywood and S. Ranka, "A Practical Hierarchical Model of Parallel Computation: I.
The Model," Journal oj Parallel and Distributerl Computing, Vol. 16, pp. 212-232, 1992.
[14] K. Hwang, Advanced Computer Architecture with Parallel Programming, McGraw-Hill,
1993.
[15] S.L. Johnsson, C.-T. lIo, "Optimum Broadcasting and Personali7.ed Communication in
Hypercubes," IEEE Transactions on Computers, Vol. 38, pp_ 121\9-1268, 1989.
[16] M.H. Kim, O.II. Ibarra, "Tmnsformatlons Between iloundary Codes, Run LengLh Codes,
and Linear Quadtrees," Proceerlings of the 8th International Parallel Processing Sympo-
sium, pp. 120-125,1994.
[17] P. Liu, '-iV. Aiello, S. BhaLt, "An Atomic Model for Message Passing," Proceedings of 5-th
.!lCM Symp. on Parallel Algorithms and Architectures, pp. 154-163,1993.
[18] G. Papadopoulos, "Constant Factors Matter: Putting Communication on the Compu-
tation Power Curve/' P1'Oceedings of DIMACS Workshop on Model, Architect1lres, and
Technolo,gics for Parallel Computation, 1993.
[19] R. Ponnusamy, A. Choudhary, G. Fox, "Communication Overhead on CM5: 1\ n Experi-
mental Performance Evaluation," P1'Oceedings of -I-th Symposium on the F1'071ticl'S of Mas-
sively Parallel C'omputation, pp. 108- lIS, 1992.
[20] H. Si:Lmet, Applications of Spatial Data Structures, Computer Graphics, and Ima,ge Pro-
cessing, Addison Wesley, 1990.
[21] D.S. Scott, "Efficient I\II-to-All Communication Patterns in Hyperculle and Mesh Topolo-
gies," P1'Oceedings of 6-th Distributed Memory Computing Conference, pp. 398-403, 1991.
[22] R. Thaku r, 1\.. Chaudhary, "All-la-all Communication on Meshes with Wormhole Routing,"
Pmceedings of 8-th International Parallel P1'Ocessing Symposium, pp. 561-565, 1994.
[23] P. de let Torre and C.P. Kruskal, "Towards a Single Model of Effident Computation in Real
Pi:LraUel Machines," Future Generation Comp'uler Systems, Vol. 8, pp. 395-1\08, 1992.
[24] L.G. Valiant, "A Bridging Model for Parallel Computation," Communications of the ACM,
Vol. 33, No.8, pp. 103-111,1990.
[25] D.S. Wills and W. Dally, "Pi: A Parallel Architecture Interface," P1'Oceedings of -I-lh
Symposium on the Frontiers of Massively Parallel Computation, pp. 345-352, 1992.
35
