A Model for Analyzing Generalized Interprocessor Communication Systems by Cuny, Janice D. & Snyder, Lawrence
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1982 
A Model for Analyzing Generalized Interprocessor 
Communication Systems 




Cuny, Janice D. and Snyder, Lawrence, "A Model for Analyzing Generalized Interprocessor Communication 
Systems" (1982). Department of Computer Science Technical Reports. Paper 333. 
https://docs.lib.purdue.edu/cstech/333 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 








We present a model of parallel computation which is
parameterized by execution mode, allowing us to express
different modes within a common framework. The model
enables us to make legitimate comparisons of execution
modes. We report here on some preliminary results.
October 6, 1982
This work is part of the Blue CHiP Project. It is supported in part by the
Office of Naval Research Contracts N00014-80-K-0816 and N00014-81-K-0360.
The latter is Task SRO-100.







Algorithmically-specialized computers are likely to be parallel
machines since parallelism is an effective method of circumventing the
physical limits of switching and signal transmission delays. If so, then an
important design decision is whether the algorithmically-specialized
computer executes synchronously, asynchronously or in an intermediate
mode such as data-driven execution. The decision is crucial because it
influences cost, performance and the convenience of programming. For
asynchronous execution, there is overhead associated with processor to
processor communication because of the requisite hand-shaking protocol.
Data-driven execution must be charged for the additional circuitry
needed to buffer data arriving at a processor prior to its use and to
provide a signalling-back mechanism indicating when buffer space is
available. To their credit, both mechanisms appear to be easy to program,
although the programs are subject to possible deadlocks. Synchronous
execution has none of the overhead problems, nor is it subject to
deadlock. However, assuming (as is reasonable) that the single "steps" of
an abstract algorithm are implemented by varying numbers of more
primitive processor steps, idles will have to be inserted in some
processors so that they match the execution rate of the processors with
which they communicate. There are cases where this cannot be done.
Moreover, when it can be done, the resulting programs can be problem-size
dependent, hardware dependent, and extremely difficult to write. These
are important considerations that cannot be easily dismissed.
In order to evaluate the consequences of problems such as these, we
have developed a model for analyzing general interprocessor
communication. What makes the model unique and especially useful for the
problems mentioned above is that it is parameterized by execution mode.
This enables different execution modes to be expressed in one formalism
in which fair and accurate comparisons can be made.
The purpose of the paper is to present the model in its full generality
and to summarize our early experience with it.
A MODEL OF PARALLEL PROGRAMS
We assume that a parallel processor is composed of m processing
elements $M_1, M_2, \ldots, M_m$ which collectively implement an algorithm.
The processing elements (PEs) have local memories for program and data
storage, and they execute sequential programs under the control of their
own program counters. We are concerned only with the input/output
behavior of these machines. To avoid hiding communication costs, we
assume that the PEs do not share any common memory; instead they
communicate through read and write operations. On each time step, a
PE can attempt a set of I/O operations simultaneously. Whether or not
an operation executes when it is attempted depends on the execution
mode. An operation that does not execute is retried on the next step and
a process does not proceed with a new set of operations until all of its
current operations have completed.
We model such systems as Interprocess Communication (IC) Systems.
An IC system is completely defined by a function, A (mnemonic for
"advance"), giving the execution mode of the system and a set of
sequences $V_1, V_2, \ldots, V_m$, each describing the behavior of a single
PE. The i-th sequence describes the behavior of the i-th machine. There
are three types of operations which are represented as follows
reads: the read of value $\sigma$ from PE i is denoted $r_{i,\sigma}$;
writes: the write of value $\sigma$ to PE i is denoted $w_{i,\sigma}$;
and
time delays: a delay of n time units is denoted $d_n$ (these delays are
used only in asynchronous mode as described below).
Each symbol in a behavioral sequence is a (possibly empty) set of these
operations subject to two restrictions: there is at most one time delay
operation in any set (if there is no time delay, the operation is assumed
to require one time step); and there is no more than one read from
(write to) any PE in a single set. Figure 1(a) is an IC system representing
the systolic processor for band matrix-vector multiplication with a
bandwidth of four [6].† The sequences of operation sets for each PE are
specified by regular expressions. Since the system is synchronous, there
are no time delay operations and since the system does not have
data-dependent branches, we represent the transmitted values by a single,
generic value x. Figure 1(b) shows the communication graph for this
system; each vertex represents a PE and a directed edge from node i to
node j represents a communication link over which the i-th PE writes to
the j-th PE and the j-th PE reads from the i-th PE.
† Note that in our figures we use rectangular boxes to enclose sets rather than










Figure 1. (a) IC system representing the systolic processor for band
matrix-vector multiplication. (b) Communication graph for the IC system
of Figure 1(a).
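
The relationship between behavioral sequences and the communication
graph can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the `Op` record, the
`communication_graph` function and the three-PE pipeline are hypothetical
names and data introduced here, not the system of Figure 1, and each
behavioral sequence is represented by a finite list of operation sets.

```python
from typing import NamedTuple


class Op(NamedTuple):
    """One I/O operation (hypothetical encoding, not from the paper)."""
    kind: str          # "r" for a read, "w" for a write
    peer: int          # index of the PE read from or written to
    value: str = "x"   # transmitted value; generic "x" when it is irrelevant


def communication_graph(behaviors):
    """Collect the directed edges (i, j) such that PE i writes to PE j
    or PE j reads from PE i.

    `behaviors` maps a PE index to a finite list of operation sets.
    """
    edges = set()
    for pe, sequence in behaviors.items():
        for op_set in sequence:
            for op in op_set:
                if op.kind == "w":
                    edges.add((pe, op.peer))
                elif op.kind == "r":
                    edges.add((op.peer, pe))
    return edges


# Hypothetical three-PE pipeline: PE 1 feeds PE 2, which feeds PE 3.
behaviors = {
    1: [frozenset({Op("w", 2)})],
    2: [frozenset({Op("r", 1)}), frozenset({Op("w", 3)})],
    3: [frozenset(), frozenset({Op("r", 2)})],
}
print(sorted(communication_graph(behaviors)))
```

A write in one PE and the matching read in its neighbour contribute the
same edge, so the graph records links rather than individual operations.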
We define the execution of an IC system in terms of three sequences,
$c^1, c^2, \ldots$, $\delta^1, \delta^2, \ldots$ and $Q^1, Q^2, \ldots$:
$c^k$ describes the operations that are attempted on the k-th execution
step, $\delta^k$ describes the time needed for those operations to
complete, and $Q^k$ describes the status of communications if they all do
complete. Each element of the first sequence is an m-vector giving
program counter values (indexes into operation set sequences) for all
PEs. Each element of the second sequence is an m-vector giving timer
values (the number of steps that must elapse before the completion of
the current operation set) for all PEs. Each element of the third
sequence is an m×m matrix of strings, giving the status of
communications in terms of strings of messages and requests. The status
of communications on the link from PE i to PE j is given by $q_{i,j}$.
Values that have been written but that have not yet been read are
denoted by elements of an alphabet $\Sigma$; values that have been
requested but that have not yet been written are denoted by their
inverses.† $q^k_{i,j}$ is a queue of written values (head on the right
end) followed by a queue of requested values (head on the left end);
corresponding writes and reads cancel at the boundary between these
queues.
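
The two-queue reading of a link status can be illustrated with a small
Python sketch. The `Link` class and its method names are assumptions made
for this illustration; the sketch cancels a written value against a
request only when the two values are equal, which is one reading of the
rule that a symbol followed by its inverse is the empty string.

```python
from collections import deque


class Link:
    """Status of one link: written values followed by requested values.

    Hypothetical helper, not part of the paper's formalism.
    """

    def __init__(self):
        self.written = deque()    # head (oldest value) at the right end
        self.requested = deque()  # head (oldest request) at the left end

    def write(self, value):
        self.written.appendleft(value)
        self._cancel()

    def request(self, value):
        self.requested.append(value)
        self._cancel()

    def _cancel(self):
        # Cancel at the boundary: oldest written value against oldest request.
        while (self.written and self.requested
               and self.written[-1] == self.requested[0]):
            self.written.pop()
            self.requested.popleft()

    def status(self):
        # Inverses are rendered with a "^-1" suffix for display only.
        return list(self.written) + [v + "^-1" for v in self.requested]


link = Link()
link.write("a")
link.write("b")
first = link.status()    # two buffered values, oldest on the right
link.request("a")
second = link.status()   # the oldest value has been consumed
print(first, second)
```

A request that arrives before its write is held in the same way, so a
status containing only inverses represents reads that are still waiting.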
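To start the sequences we define, for all $i,j \in [m]$,†† $c_i^1 = 1$ and
\[
\delta_i^1 = \begin{cases} n & \text{if } d_n \in V_i(1) \\ 1 & \text{otherwise} \end{cases}
\]
and $q^1_{i,j} = a\,b$ where
\[
a = \begin{cases} \sigma & \text{if } w_{j,\sigma} \in V_i(1) \\ \lambda & \text{otherwise} \end{cases}
\qquad
b = \begin{cases} \sigma^{-1} & \text{if } r_{i,\sigma} \in V_j(1) \\ \lambda & \text{otherwise} \end{cases}
\]
with $V(j)$ denoting the j-th set of operations in the sequence $V$.

† We represent the inverse of a symbol $\sigma$ by $\sigma^{-1}$ and define
$\sigma\sigma^{-1}$ to be $\lambda$, the empty string;
$\Sigma^{-1} = \{\sigma^{-1} \mid \sigma \in \Sigma\}$.
†† $[m]$ denotes the set $\{1,2,3,\ldots,m\}$.
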
$c^1$ shows all PEs executing their first set of operations, $\delta^1$
shows all of the timer values set to their initial values, and $Q^1$ shows
that the initial reads and writes are pending. The remainder of the
sequence of $c$s is defined to reflect the fact that a PE moves to a new
set of operations only if all operations in its previous set have
completed: for k > 0
\[
c_i^{k+1} = \begin{cases} c_i^k + 1 & \text{if } A(i,k) \\ c_i^k & \text{otherwise} \end{cases}
\]
where A(i,k) is true if the i-th PE finishes the $c_i^k$-th operation set
in step k. The exact form of A depends on the mode of execution and is
discussed below. For k > 0, $\delta^k$ is defined so that the timers are
set by the execution of a d operation (default d = 1) and are decremented
by 1 on each subsequent step until they reach 0:
\[
\delta_i^{k+1} = \begin{cases}
n & \text{if } d_n \in V_i(c_i^{k+1}) \wedge c_i^k \neq c_i^{k+1} \\
1 & \text{if } d_n \notin V_i(c_i^{k+1}) \wedge c_i^k \neq c_i^{k+1} \\
\max(\delta_i^k - 1,\, 0) & \text{otherwise.}
\end{cases}
\]
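The remainder of the sequence of $Q$s is defined to reflect the execution
of read and write operations: for k > 0
\[
q^{k+1}_{i,j} = a \, q^k_{i,j} \, b
\]
where
\[
a = \begin{cases} \sigma & \text{if } w_{j,\sigma} \in V_i(c_i^{k+1}) \wedge c_i^{k+1} \neq c_i^k \\ \lambda & \text{otherwise} \end{cases}
\qquad
b = \begin{cases} \sigma^{-1} & \text{if } r_{i,\sigma} \in V_j(c_j^{k+1}) \wedge c_j^{k+1} \neq c_j^k \\ \lambda & \text{otherwise.} \end{cases}
\]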
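
One execution step can be sketched in Python under one reading of the
rules stated in the prose: a PE advances only when the predicate A holds
for it, a PE that advances sets its timer from a delay operation
(default 1), other timers count down to 0, and the reads and writes of a
newly entered operation set are appended to the link statuses. The
function names, the tuple encoding of operations (`("w", j, v)`,
`("r", i, v)`, `("d", n)`) and the example system are all assumptions
made for this illustration.

```python
def ops_at(V, i, n):
    """n-th operation set (1-based) of PE i; empty once the list runs out."""
    return V[i][n - 1] if n <= len(V[i]) else frozenset()


def step(V, c, delta, Q, A):
    """Compute the next program counters, timers and link statuses.

    Q maps a link (i, j) to a pair (written, requested) of lists, with the
    oldest written value last and the oldest request first.
    """
    new_c = {i: c[i] + 1 if A(i, c, delta, Q) else c[i] for i in V}

    new_delta = {}
    for i in V:
        if new_c[i] != c[i]:
            delays = [op[1] for op in ops_at(V, i, new_c[i]) if op[0] == "d"]
            new_delta[i] = delays[0] if delays else 1
        else:
            new_delta[i] = max(delta[i] - 1, 0)

    new_Q = {link: (list(w), list(r)) for link, (w, r) in Q.items()}
    for i in V:
        if new_c[i] == c[i]:
            continue  # PE i did not advance, so it issues nothing new
        for op in ops_at(V, i, new_c[i]):
            if op[0] == "w":
                written, requested = new_Q[(i, op[1])]
                written.insert(0, op[2])
            elif op[0] == "r":
                written, requested = new_Q[(op[1], i)]
                requested.append(op[2])
            else:
                continue
            # Matching values cancel at the boundary of the two queues.
            while written and requested and written[-1] == requested[0]:
                written.pop()
                requested.pop(0)
    return new_c, new_delta, new_Q


# Hypothetical two-PE system: PE 1 writes x twice, PE 2 idles then reads.
V = {
    1: [frozenset({("w", 2, "x")}), frozenset({("w", 2, "x")})],
    2: [frozenset(), frozenset({("r", 1, "x")})],
}
c = {1: 1, 2: 1}
delta = {1: 1, 2: 1}
Q = {(1, 2): (["x"], []), (2, 1): ([], [])}  # PE 1's first write is pending

always = lambda i, c, delta, Q: True  # stand-in predicate for the demo
c2, delta2, Q2 = step(V, c, delta, Q, always)
print(c2, delta2, Q2)
```

Passing the predicate as an argument keeps the step function independent
of the execution mode, which is the point of parameterizing the model.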
We observe that our execution rules are more general and more realistic
than those used in many models because we do not insist that all of the
operations in a set execute simultaneously. Depending on the definition
of A, it is possible, for example, to allow independent reading and writing
on different ports.
The execution of an IC system is parameterized by the predicate A.
All of the common forms of execution modes can be succinctly expressed
within our model:

Execution Mode                     A(i,k)
synchronous                        $\forall i,j \in [m]\ (q^k_{i,j} = \lambda)$
data-driven (unbounded buffer)     $\forall j \in [m]\ (q^k_{j,i} \in \Sigma^*)$
data-driven (b-bounded buffer)     $\forall j \in [m]\ (q^k_{j,i} \in \Sigma^b)$
data-driven (lazy evaluation)      $\exists j \in [m]\ (q^k_{i,j} \in (\Sigma^{-1})^+) \wedge \forall j \in [m]\ (q^k_{j,i} \in \Sigma^*)$
asynchronous                       $\delta_i^k = 1$

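
Several of these predicates can be written as ordinary Python functions.
The sketch below is an interpretation, not the paper's definition: it
assumes that the data-driven conditions refer to the links entering PE i,
that the b-bounded condition means "no outstanding request and at most b
buffered values", and that a link status is stored as a pair
(written, requested) of lists. All function names are hypothetical, and
the lazy-evaluation mode is omitted.

```python
def synchronous(i, c, delta, Q):
    """Every link is empty: all reads and writes have met."""
    return all(not w and not r for (w, r) in Q.values())


def data_driven(i, c, delta, Q):
    """No request by PE i is still outstanding (unbounded buffers)."""
    return all(not r for (src, dst), (w, r) in Q.items() if dst == i)


def data_driven_bounded(b):
    """Assumed reading of the b-bounded case: at most b buffered values."""
    def predicate(i, c, delta, Q):
        return all(not r and len(w) <= b
                   for (src, dst), (w, r) in Q.items() if dst == i)
    return predicate


def asynchronous(i, c, delta, Q):
    """The timer of PE i is about to expire."""
    return delta[i] == 1


# Hypothetical state: PE 1 has written x to PE 2; PE 1 awaits y from PE 2.
Q = {(1, 2): (["x"], []), (2, 1): ([], ["y"])}
delta = {1: 1, 2: 3}
c = {1: 1, 2: 1}
print(synchronous(1, c, delta, Q), data_driven(2, c, delta, Q))
```

Because each predicate has the same signature, a simulator can switch
execution modes by swapping one function for another.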
Parameterizing our model with the execution mode enables us to
compare modes and it distinguishes our model from previous formal models
of computation such as the model proposed by Lipton, Miller and Snyder
[7], Petri Nets [8], the vector addition system model [5] and the model
developed by Arjomandi, Fischer and Lynch [1].
PRELIMINARY RESULTS
Our initial work has been practically motivated: we would like to be
able to program algorithms for the CHiP machine [9]. In particular, we
are working with an architecture in which computational operations are
executed synchronously but I/O operations may be either asynchronous
or synchronous. In asynchronous mode, a read that occurs before the
corresponding write is delayed and a write that occurs before the
corresponding read interrupts the destination PE to buffer the
transmitted value. In synchronous mode, I/O interrupts are masked off and
corresponding reads and writes must occur simultaneously. A program
that can be run fully in synchronous mode is said to be coordinated.

We would like, whenever possible, to run coordinated programs.
Unfortunately it is extremely difficult for programmers to produce such
code and the code itself is often problem-size and hardware dependent.
To surmount these problems, we have developed and implemented
algorithms for the automatic synthesis of coordinated programs from
data-driven programs [2]. These algorithms enable the programmer to
work in the more natural data-driven environment without forfeiting any
of the advantages of a synchronous device. They apply only to loop
programs in which each PE executes a single loop of instructions. (This
restriction at first may seem quite prohibitive but, in fact, most recent
algorithms for algorithmically-specialized computers are loop programs.
In addition, many programs can be viewed as loop programs by collapsing
parallel branches that have the same input/output behavior.)
We have developed two synthesis algorithms. The first, the Wave
Algorithm, works on all data-driven loop programs for which conversion is
possible but in some cases it produces inefficient code. The second
algorithm, the Buffered Write Algorithm, works for only a subset of loop
programs and, although it often increases the length of PE code
significantly, the results are more efficient. We are currently working
to expand the class of programs that we can convert and to develop
measures for accurately evaluating the efficiency and trade-offs of our
algorithms.
For the programs that we cannot coordinate or that, for reasons of
efficiency, require manual design, we have developed and implemented
algorithms for testing coordination [3]. We have efficient algorithms for
testing the coordination of loop programs and the worst-case
coordination of arbitrary programs. The general testing question is
PSPACE-hard [4]. We expect that as libraries of parallel programs become
available,




[1] E. Arjomandi, M. Fischer, and N. Lynch, A difference in efficiency
between synchronous and asynchronous systems, Tech. Rep. #81-03-01,
University of Washington, 1981.
[2] J. Cuny and L. Snyder, Conversion from data-flow to synchronous
execution in loop programs, Tech. Rep. #CSD-TR-392, Purdue University,
1982.
[3] J. Cuny and L. Snyder, "Testing coordination for 'homogeneous'
parallel algorithms", Proceedings of the 1982 International Conference on
Parallel Processing, pp. 265-267, August, 1982.
[4] M. Garey and D. Johnson, Computers and Intractability, W. H.
Freeman and Co., San Francisco, California, 1979.
[5] R. Karp and R.E. Miller, "Properties of a model for parallel
computations: determinacy, termination, queuing", SIAM J. Appl. Math. 14,
pp. 1390-1411, November, 1966.
[6] H.T. Kung and C.E. Leiserson, Systolic arrays (for VLSI), Tech. Rep.
CMU-CS-79-103, Carnegie-Mellon University, 1979.
[7] R. J. Lipton, R. E. Miller, and L. Snyder, "Synchronization and
computing capabilities of linear asynchronous structures", JCSS 14, pp.
49-72, February, 1977.
[8] J. Peterson, "Petri nets", Comp. Surveys 9, pp. 223-252, September,
1977.
[9] L. Snyder, Introduction to the Configurable, Highly Parallel
Computer, Computer 15, pp. 65-82, January, 1982.
ERRATA
In the original version of this paper, an IC system was determined by
a function A and a set of infinite sequences $V_1, V_2, \ldots, V_m$. Using
that formulation, a PE has only one possible behavior and the system as a
whole has only one possible execution sequence. We had intended a more
versatile model.
In the corrected version, an IC system is determined by the function
A and a set of finite state machines, $M_1, M_2, \ldots, M_m$, each
describing the behavior of a single PE. It is a specific execution
sequence that is determined by the set $V_1, V_2, \ldots, V_m$. To avoid
explicit representation of process termination, we assume that each $V_i$
is of the form $\alpha(\phi)^{\omega}$ with $\alpha$ a prefix of some
element of $L(M_i)$. Execution sequences are defined as before.
Finally, we intend for the operations $r_{i,\sigma}$ and $w_{j,\sigma'}$ to
correspond only if $\sigma = \sigma'$. To enforce the matching, we consider
just legal computation sequences in which for all i, j, and k
\[
q^k_{i,j} \in \Sigma^* \cup (\Sigma^{-1})^* .
\]
This restriction allows us to express the dependency of branching on
transmitted values because, unless the corresponding reads and writes
match, some link will have a status in $\Sigma^+ (\Sigma^{-1})^+$.
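
A small worked instance may make the legality condition concrete. The
values $a$ and $b$ below are hypothetical, chosen only for illustration.
Suppose PE $i$ writes a value to PE $j$ while PE $j$ requests a value from
PE $i$, and the link was previously empty:

```latex
% Matching values: the write cancels the request, so the status is legal.
\[
  q_{i,j} = a\,a^{-1} = \lambda \in \Sigma^{*} \cup (\Sigma^{-1})^{*}
\]
% Mismatched values (a \neq b): nothing cancels, so the status is not legal.
\[
  q_{i,j} = a\,b^{-1} \in \Sigma^{+}(\Sigma^{-1})^{+},
  \qquad
  a\,b^{-1} \notin \Sigma^{*} \cup (\Sigma^{-1})^{*}
\]
```

In the second case the execution is excluded from the legal computation
sequences, which is how a mismatch between the value read and the value
written is ruled out.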
The corrected version of this paper will appear in Algorithmically-
Specialized Computers, L. Snyder, L. Siegel, H. Siegel, and D. Gannon
(eds.).
