Programming Processor Interconnection Structures by Snyder, Lawrence
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1981 




Snyder, Lawrence, "Programming Processor Interconnection Structures" (1981). Department of Computer 
Science Technical Reports. Paper 308. 
https://docs.lib.purdue.edu/cstech/308 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 












Parallel computer architecture complicates the already
difficult task of parallel programming in many ways, e.g., by
a rigid interconnection structure, addressing complexity,
and shape and size mismatches. The CHiP computer is a new
architecture that reduces these complications by permitting
the processor interconnection structure to be programmed.
This new kind of programming is explained. Algorithms are
presented for several interconnection patterns including the
torus and the complete binary tree, and general embedding
strategies are identified.
*The research described herein is part of the Blue CHiP Project. Funding is provided in part
by the Office of Naval Research under Contract No. N00014-80-K-0816 and Contract No.
N00014-81-K-0360, Special Research Opportunities Program, Task SRO-100.









Although it is a difficult task to design a sequential computer
architecture that efficiently hosts sequential algorithms, it is perhaps
even more challenging to design a parallel architecture that efficiently
hosts parallel algorithms. The aspects of parallel computation that
frustrate the harmonious match between algorithm and architecture are many:
Rigid interconnection structure: Parallel architectures tend to
provide a fixed interconnection structure between processing elements
(PE's). For example, ILLIAC IV is mesh connected; the Massively
Parallel Processor [1] has a toroidal structure. But recently
developed parallel algorithms use a variety of PE interconnection
structures. For example, there are tree algorithms for everything
from sorting to graph coloring [2] as well as applicative language
expression evaluation [3], hexagonally connected pipelined
algorithms for numeric problems [4], "double trees" for searching and
data base operations [5], and many nonstandard interconnection
graphs. (See Figure 1.) The problem is that the rigid interconnection
structure biases the architecture towards a particular class of
algorithms and makes it difficult to use for any other class of
algorithms.
Problem shape and size mismatch: Parallel algorithms tend to
require a particular number of PE's in a particular shape that is
determined by the problem's input, but the architecture provides
only one fixed size and shape. For example, an algorithm requiring
an n/2 x 2n array of PE's does not "fit" on an n x n mesh connected
architecture even though there are enough processors.
Addressing complexity: Certain parallel architectures, e.g., the Ultra
Computer [6] and the Cube connected cycles [7], provide a "universal"
interconnection structure in which a logical interconnection
structure is implemented on the physical structure by means of
packet routing operations. Time is wasted in unproductive packet
switching. More seriously, the programs stored in the PE's are
complicated by the need to compute target addresses.
Paucity of programming languages: Although languages such as APL
and Concurrent Pascal have "parallel semantics," most parallel
algorithms are specified in an ad hoc manner. Thus there is little
guidance from the programming language as to what features to
optimize for.
These and other complications explain in large measure why highly paral-


















Figure 1. Interconnection patterns for parallel algorithms: (a) mesh,
(b) hexagonally connected mesh, (c) torus, (d) binary tree, (e) double tree.
We report on a new family of architectures, the Configurable, Highly
Parallel (CHiP) computers, that respond to the demands of parallel
algorithms, especially the need for locality and flexibility. The central
concept is this:
The processing elements are embedded into a programmable switch
lattice that permits not only the programming of the PE's but also
the direct programming of their interconnection structure.
This second kind of programming not only ameliorates the difficulties
mentioned above, it also permits the convenient composition of parallel
algorithms. It has even led to the development of entirely new parallel
algorithms [8]. In this paper we give a synopsis of the CHiP architecture
and then explore the consequences of this new kind of programming,
interconnection structure programming. The main results are
algorithms for programming various interconnection structures.
Synopsis of the CHiP Computer
[Readers familiar with the CHiP Computer may wish to omit this
section.]
A CHiP Computer is composed of a set of homogeneous microprocessor
elements connected at regular intervals to the switches of the switch
lattice. The lattice is composed of programmable switches connected by
data paths to each other or to the PE's. Perimeter switches are attached
to external storage devices. Figure 2 illustrates two examples of this
structure.* Each PE has its own local program and data memory and
*Notice that the pictures are not drawn to scale. The PE's are much larger than
the switches.










Figure 2. Two lattices. Circles represent switches; squares represent
processing elements; lines represent data paths.
A configuration setting is an instruction which, when invoked, causes
the switch to form a passive connection between any combination of its
incident data paths. Notice that this is circuit switching rather than
packet switching and that fan out is possible at the switches. Figure 3(a)
shows the configuration settings for a mesh pattern for the lattice of
Figure 2(a); Figure 3(b) shows the same lattice configured as a binary tree.
To implement an interconnection pattern, the switches are loaded with
configuration settings by an external control processor via a "skeleton"
that is transparent to this discussion. This activity is usually performed
in parallel with the controller's loading of the PE programs.
Figure 3. Two configurations of the lattice in Figure 2(a): (a) mesh
pattern; (b) binary tree.
A parallel program is viewed as the composition of several parallel
algorithms each with its own processor interconnection pattern. Each of
these interconnection patterns and the associated PE code is called a
"phase." The controller loads the PE's and switches with the instructions
for several phases. Processing begins with a broadcast command from
the controller to the switches to invoke a particular stored
interconnection pattern. This also causes the PE's to begin synchronously
executing their local programs. The interconnection structure remains
static throughout the execution of the phase. When the phase completes,
another broadcast command causes a different interconnection pattern
to be invoked and a new phase to be initiated. The action continues in
this manner from phase to phase.
Several points are worthy of special emphasis. First, to implement
an interconnection pattern requires that all configuration settings be
stored in the same location in all of the switches. This is so that the
broadcast command can take the simple form "invoke the setting in
location x," thus making possible one step phase transitions. Second,
switches can provide the ability for data paths to "crossover" one
another, i.e., a setting can implement multiple data path
interconnections. Third, the PE's need not know to whom they are
connected; they simply execute instructions of the form READ EAST,
WRITE NORTHWEST, etc. The interconnection pattern explicitly implements
the routing. Fourth, the data paths are bidirectional.
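The phase mechanism just described can be mimicked in a few lines of Python. This is only an illustrative model: the class names Switch and Controller, the port letters, and the representation of a setting as a list of connected port groups are assumptions made for the sketch, not part of the CHiP design.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """A lattice switch with a small local memory of configuration settings."""
    memory: dict = field(default_factory=dict)  # location -> list of port groups
    active: list = field(default_factory=list)  # currently invoked setting

    def load(self, location, setting):
        # A setting is a list of port groups; the ports in one group are
        # passively connected (more than two ports in a group models fan out).
        self.memory[location] = [frozenset(group) for group in setting]

    def invoke(self, location):
        # A switch with nothing stored at this location stays unconnected.
        self.active = self.memory.get(location, [])

    def connected_to(self, port):
        """Ports reachable from `port` under the active setting."""
        for group in self.active:
            if port in group:
                return set(group) - {port}
        return set()


class Controller:
    """External controller that broadcasts phase changes (hypothetical model)."""

    def __init__(self, switches):
        self.switches = switches

    def broadcast(self, location):
        # One command, "invoke the setting in location x", reaches every
        # switch, so a phase transition takes a single step.
        for sw in self.switches.values():
            sw.invoke(location)


# Two switches; location 0 holds one phase, location 1 another.
switches = {(1, 2): Switch(), (2, 1): Switch()}
switches[(1, 2)].load(0, [("N", "S")])
switches[(2, 1)].load(0, [("E", "W")])
switches[(1, 2)].load(1, [("N", "E", "W")])           # fan out
switches[(2, 1)].load(1, [("N", "S"), ("E", "W")])    # crossover: two paths

controller = Controller(switches)
controller.broadcast(0)
phase0 = switches[(1, 2)].connected_to("N")
controller.broadcast(1)
phase1 = switches[(1, 2)].connected_to("N")
print(phase0, phase1)
```

Because every switch keeps the setting for a given phase at the same location, the controller needs no per-switch addressing; the PE programs never mention a destination.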
Example: Consider the problem of finding the solution to a system of
linear equations, Ax=b, where A is an n x n band matrix of width p and
b is an n vector. To solve the problem we use the Kung-Leiserson LU
decomposition pipelined (systolic) algorithm [4] and their lower
triangular system (LTS) solver algorithm. The interconnection pattern
(for p=4) is shown in Figure 4. The exact operation of the algorithms
is unimportant except to say that they are pipelined and the data
moves in the direction of the arrows. Phase 1 decomposes A into
lower and upper triangular matrices, A=LU, and at the same time
solves the lower triangular system, Ly=b. Figure 5 shows the
embedding into the lattice of Figure 2(a) of these two algorithms -- the L
matrix is transferred directly from the decomposition processor to
the LTS solver. The x vector result can be found by solving Ux=y,
















FIGURE 4. Kung-Leiserson systolic arrays [4]. (a) LU-Decomposition; (b)
FIGURE 5. The embedding of the LU-Decomposition processors (1-16)
and the lower triangular system solver (A-D) of Figure 4 in
the lattice of Figure 2(a). The embedding appears in the
North-West corner of the lattice.
the LTS algorithm, but U must be completely generated before being
rewritten. Thus, phase 1 saves the U matrix and y vector values in
preparation for phase 2 by threading them through the lattice. (See
Figure 6.) In phase 2 the values are threaded back through the lattice
in the opposite direction, which effects the rewriting operation,
and they are input to another LTS solver. (See Figure 7.) The result







Figure 6. Threading the U matrix and y vector values of Phase 1.
Switches not shown.
Figure 7. Reverse threading of the U and y values for Phase 2 together
with a second LTS solver.
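The two phases of the example can be mimicked sequentially. The sketch below is ordinary, non-systolic Python rather than the Kung-Leiserson arrays, and its function names are hypothetical. It assumes that the "rewriting" of U amounts to reversing the order of its rows and columns, which turns the upper triangular system Ux=y into a lower triangular one so that an LTS solver can be reused in phase 2.

```python
def lu_decompose(a):
    """Doolittle LU decomposition without pivoting: returns (L, U) with A = LU."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lower[i][i] = 1.0
        for j in range(i, n):          # row i of U
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        for j in range(i + 1, n):      # column i of L
            partial = sum(lower[j][k] * upper[k][i] for k in range(i))
            lower[j][i] = (a[j][i] - partial) / upper[i][i]
    return lower, upper


def solve_lower(m, rhs):
    """Lower triangular system (LTS) solver by forward substitution."""
    x = []
    for i, row in enumerate(m):
        x.append((rhs[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def reverse_system(u, y):
    """Reverse row and column order: an upper triangular U becomes lower triangular."""
    return [row[::-1] for row in u[::-1]], y[::-1]


# A small band matrix (only entries near the diagonal are nonzero).
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [5.0, 6.0, 6.0, 5.0]

# Phase 1: A = LU and Ly = b.
L, U = lu_decompose(A)
y = solve_lower(L, b)

# Phase 2: the reversed U and y feed a second LTS solve; un-reverse the result.
U_rev, y_rev = reverse_system(U, y)
x = solve_lower(U_rev, y_rev)[::-1]
print(x)
```

Reading the saved values back in the opposite order is what makes the second solver see a lower triangular system, which is why U must be complete before phase 2 begins.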
The example is specialized to a band matrix of width p=4. A general
procedure that solves this problem for arbitrary width bands would differ
only in the interconnection structure; the various PE programs required
for an arbitrary width solution are all represented in this p=4 case. Thus,




We will emphasize the specification of uniform rather than ad hoc
interconnection patterns because they are of interest in their own right
and they are often the building blocks that are used by the less regular
patterns. First, we must consider the lattice that is to host the
interconnection pattern.
As indicated in Figure 2, a variety of different lattices are possible,
although any particular architecture will use only one. Lattices differ in
complexity in several ways: corridor width, degree, and crossover
capability. The corridor width, w, is the number of switches separating two
adjacent PE's, e.g., the lattice of Figure 2(a) has w=1 and that of Figure
2(b) has w=2. Any lattice can embed an arbitrary graph, but to do so
may require leaving some PE's unused [9]. A wider corridor width uses
PE's more efficiently when embedding complex graphs. The degree, d, of
a lattice is the number of data paths incident on a PE or a switch. (If
these two numbers are different, d is the minimum.) For example,
Figure 2(a) has d=8 while Figure 2(b) has d=4. Finally, the amount of
crossover capability c is the number of distinct data paths that can
intersect at one switch. A crossover capability c=2 permits a crossover
while c=1 does not. In the interest of generality, we will assume the
"simplest" lattice suitable for an interconnection pattern.
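As a small illustration of the corridor width definition, the following Python sketch computes where PE's fall in a lattice for a given w. The indexing convention (PE's at multiples of w+1, with a corridor of width w around the perimeter) is an assumption, chosen so that w=1 gives a lattice with 2n+1 positions on a side; the function names are hypothetical.

```python
def pe_positions(n, w):
    """Lattice coordinates of the n x n PE's when w switches separate adjacent PE's."""
    pitch = w + 1
    return [(pitch * a, pitch * b) for a in range(1, n + 1) for b in range(1, n + 1)]


def lattice_side(n, w):
    """Side length in lattice positions, assuming a perimeter corridor of width w."""
    return n * (w + 1) + w


def switches_between(p, q):
    """Number of switches on the straight segment between two PE's in one row."""
    assert p[0] == q[0], "PE's must lie in the same row"
    return abs(p[1] - q[1]) - 1


for w in (1, 2):
    first, second = pe_positions(2, w)[:2]
    print(w, lattice_side(2, w), first, second, switches_between(first, second))
```

Under this convention the PE's of a w=1 lattice sit at the positions whose two indices are both even.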
Programming an interconnection pattern requires that the
configuration setting of each relevant switch be defined. For the present
discussion it suffices that we give a logical specification of the setting
since the actual bit configurations are irrelevant. Accordingly, we will










and we will assign settings as pairs of these letters. For example, EW is a
horizontal connection while ME is a 45 degree angle. The lattice will
always be n x n where n is the number of processors on a side. We name the
switches and PE's with a two value index corresponding to their matrix
position. See Figure 8. We will name the lattice "L".
Figure 8. The two index coding scheme for a lattice.
As an example of this specification method, we observe that the mesh
interconnection pattern (Figure 3(a)) can be defined* by the two
conditions:
(i) i is odd and j is even implies L[i,j] = NS
(ii) i is even and j is odd implies L[i,j] = EW
provided that the lattice is initially unconfigured. A hexagonally
connected interconnection pattern requires the further condition
(iii) i is odd and j is odd implies L[i,j] = OF
and requires a lattice of degree d=6 or (for symmetry) d=8. Notice
that this specification is somewhat more general than that used in
Figure 5.
*In our presentation of interconnection patterns, we will use a simple declarative
specification. We are presently developing a configuration programming language,
but until it is completed, we prefer the neutral declarative approach.
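The declarative conditions translate directly into a small generator. The sketch below is an illustration in Python, not the configuration language mentioned in the footnote; the function names are made up for the example, and the compass geometry assigned to the letters (O taken as northwest, F as southeast) is an assumption inferred from the conditions.

```python
# Assumed port geometry: (row step, column step) for each port letter.
STEP = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1),
        "O": (-1, -1), "F": (1, 1)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E", "O": "F", "F": "O"}


def is_pe(i, j):
    """In the w=1 lattice, PE's sit at the even/even positions."""
    return i % 2 == 0 and j % 2 == 0


def mesh_settings(n, hexagonal=False):
    """Apply conditions (i)-(iii) to an initially unconfigured lattice."""
    size = 2 * n + 1
    lattice = {}
    for i in range(1, size + 1):
        for j in range(1, size + 1):
            if i % 2 == 1 and j % 2 == 0:
                lattice[i, j] = ["NS"]          # condition (i)
            elif i % 2 == 0 and j % 2 == 1:
                lattice[i, j] = ["EW"]          # condition (ii)
            elif hexagonal and i % 2 == 1 and j % 2 == 1:
                lattice[i, j] = ["OF"]          # condition (iii)
    return lattice


def neighbor(lattice, n, pe, port):
    """Follow the data path leaving `pe` through `port`; return the PE reached, if any."""
    size = 2 * n + 1
    i, j = pe
    while True:
        di, dj = STEP[port]
        i, j = i + di, j + dj
        if not (1 <= i <= size and 1 <= j <= size):
            return None                          # path leaves the lattice
        if is_pe(i, j):
            return (i, j)
        entry = OPPOSITE[port]
        exits = [s.replace(entry, "") for s in lattice.get((i, j), []) if entry in s]
        if not exits:
            return None                          # switch does not pass this path on
        port = exits[0]


mesh = mesh_settings(3)
hexmesh = mesh_settings(3, hexagonal=True)
print(neighbor(mesh, 3, (2, 2), "E"))     # PE to the east of PE (2, 2)
print(neighbor(hexmesh, 3, (2, 2), "F"))  # diagonal neighbour in the hexagonal mesh
```

Tracing a path from a PE port through the configured switches is a convenient way to check that a set of conditions really yields the intended pattern.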
Torus Interconnection Patterns
Since the n x n torus interconnection pattern is simply an n x n
mesh with the top row and bottom row PE's connected and the left
column and right column PE's connected (see Figure 1), one might
expect a one corridor, degree 4, crossover capable (c=2) lattice to
suffice to host this pattern. Surprisingly, it does not.
Theorem. Let L be a w=1, d=4, c=2 n x n lattice. L cannot be set
to connect the PE's into an n x n torus.
The proof involves arguing that the perimeter corridors must be used
for two purposes - to support both the vertical and horizontal "wrap
around" - and thus cannot lead to an edge disjoint graph embedding.
Direct Torus Representation. Even when d=8, embedding the torus
is not trivial if we are to avoid multiple use of data paths.
Lattice. w=1, d=8, c=2.
Settings for Crossover Level 1.
First we connect the PE's in the rows. Then we run a data path from
the Northeast port of the first PE through the corridor above the row





shows the construction for conditions (i) through (iii).
(i) [PE row connections] 1<i,j<2n+1 and i is even and j is odd
imply L[i,j]=EW.
(ii) [Northeast ports] i<2n+1 and i is odd imply L[i,3]=AE and
L[i,2n+1]=AW.
(iii) [Corridor above rows] i<2n+1 and i is odd and 3<j<2n+1
imply L[i,j]=EW.
Settings for Crossover Level 2. A similar strategy is used for the
columns.
(iv) [PE column connections] 1<i,j<2n+1 and i is odd and j is
even imply L[i,j]=NS.
(v) [Southwest ports] j<2n+1 and j is odd imply L[3,j]=MS and
L[2n+1,j]=NM.
(vi) [Corridor left of columns] j<2n+1 and j is odd and 3<i<2n+1
imply L[i,j]=NS.
Figure 9 illustrates the entire construction.
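Conditions (i) through (vi) can be checked mechanically. The following Python sketch generates the settings and traces every PE port through the switches to confirm that the result is a torus. The function names are hypothetical, and the geometry of the eight port letters (M northeast, F southeast, A southwest, O northwest) is an assumption inferred from the conditions themselves.

```python
# Assumed geometry of the eight port letters: (row step, column step).
STEP = {"N": (-1, 0), "M": (-1, 1), "E": (0, 1), "F": (1, 1),
        "S": (1, 0), "A": (1, -1), "W": (0, -1), "O": (-1, -1)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E",
            "M": "A", "A": "M", "O": "F", "F": "O"}


def direct_torus(n):
    """Settings (i)-(vi) for an n x n torus on a (2n+1) x (2n+1) lattice, w=1."""
    top = 2 * n + 1
    lattice = {}

    def put(i, j, setting):
        lattice.setdefault((i, j), []).append(setting)

    for i in range(1, top + 1):                       # crossover level 1: rows
        for j in range(1, top + 1):
            if 1 < i < top and 1 < j < top and i % 2 == 0 and j % 2 == 1:
                put(i, j, "EW")                       # (i) PE row connections
            if i < top and i % 2 == 1 and 3 < j < top:
                put(i, j, "EW")                       # (iii) corridor above rows
        if i < top and i % 2 == 1:
            put(i, 3, "AE")                           # (ii) Northeast ports
            put(i, top, "AW")
    for j in range(1, top + 1):                       # crossover level 2: columns
        for i in range(1, top + 1):
            if 1 < i < top and 1 < j < top and i % 2 == 1 and j % 2 == 0:
                put(i, j, "NS")                       # (iv) PE column connections
            if j < top and j % 2 == 1 and 3 < i < top:
                put(i, j, "NS")                       # (vi) corridor left of columns
        if j < top and j % 2 == 1:
            put(3, j, "MS")                           # (v) Southwest ports
            put(top, j, "NM")
    return lattice


def neighbors(lattice, n, pe):
    """PE's reached by following every port of `pe` through the switches."""
    top = 2 * n + 1
    found = set()
    for start in STEP:
        port, (i, j) = start, pe
        while True:
            di, dj = STEP[port]
            i, j = i + di, j + dj
            if not (1 <= i <= top and 1 <= j <= top):
                break                                 # left the lattice
            if i % 2 == 0 and j % 2 == 0:
                found.add((i, j))                     # reached a PE
                break
            entry = OPPOSITE[port]
            exits = [s.replace(entry, "") for s in lattice.get((i, j), []) if entry in s]
            if not exits:
                break                                 # no setting uses this port
            port = exits[0]
    return found


def torus_neighbors(n, pe):
    """Expected neighbours of a PE in an n x n torus, in lattice coordinates."""
    a, b = pe[0] // 2 - 1, pe[1] // 2 - 1
    logical = {((a + 1) % n, b), ((a - 1) % n, b), (a, (b + 1) % n), (a, (b - 1) % n)}
    return {(2 * x + 2, 2 * y + 2) for x, y in logical}


def is_torus(n):
    lattice = direct_torus(n)
    return all(neighbors(lattice, n, (i, j)) == torus_neighbors(n, (i, j))
               for i in range(2, 2 * n + 1, 2) for j in range(2, 2 * n + 1, 2))


print(is_torus(3), is_torus(4))
```

In this model the wrap-around edge of each row leaves through the Northeast port of the end PE's and runs the full length of the corridor above the row, which is the long data path the text goes on to discuss.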
The difficulty with this interconnection pattern, of course, is that
it has long data paths that are subject to propagation delay. Some
algorithms can accept such a delay, but generally we would like to
reduce it. Accordingly, we prefer the following more intricate pattern
that interleaves the row and column processing elements so
that there is a fixed bound on the distance a signal must travel.
Figure 9. Direct embedding of the torus into
the lattice of Figure 2(a). Edges
of like color intersecting at a
switch are connected.
Figure 10. Interleaved embedding of the torus
into the lattice of Figure 2(a).
Interleaved Torus Representation
Lattice. w=1, d=8, c=2.
Settings for Crossover Level 1.
First we connect alternate PE's in rows. For example,
[Inline figure: a row with alternate PE's connected.]
The end connections are specified by
(i) [East port, end PE's] i is even implies L[i,3]=EW and
L[i,2n+1]=WO.
The westerly port connections of each PE are given by
(ii) [West port] i is even and 3<j<2n+1 and j is odd imply
L[i,j]=OE.
The connections in the corridor above the row are given by
(iii) [Northeast port] i<2n+1 and i is odd and 3≤j<2n+1 and j is
odd imply L[i,j]=AE.
(iv) i<2n+1 and i is odd and 3<j and j is even imply L[i,j]=WF.
Settings for Crossover Level 2. The columns are connected in a
manner analogous to the rows.
(i) [South port, end PE's] j is even implies L[3,j]=NS and
L[2n+1,j]=ON.
(ii) [North port] j is even and 3<i<2n+1 and i is odd imply
L[i,j]=OS.




(iv) j<2n+1 and j is odd and 3<i and i is even imply L[i,j]=NF.
The entire construction is shown in Figure 10.
Clearly the maximum number of switches that any data item must
pass through is three. We have increased the locality of the torus
embedding. It is, therefore, more amenable to VLSI implementation
and can be used in an arbitrarily large lattice with only a constant
delay.
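The benefit of interleaving can be seen on a single row, independent of the switch settings. The following Python sketch, with hypothetical function names, lays the n PE's of one torus ring along a line in ring order and in an interleaved order, and measures the longest connection in PE positions. It illustrates the general idea only and is not a transcription of the conditions above.

```python
def direct_order(n):
    """Ring nodes laid out left to right in ring order."""
    return list(range(n))


def interleaved_order(n):
    """Fold the ring: nodes 0, 1, 2, ... fill alternate positions going right,
    and nodes n-1, n-2, ... fill the positions in between."""
    layout, lo, hi = [], 0, n - 1
    while lo <= hi:
        layout.append(lo)
        if lo != hi:
            layout.append(hi)
        lo, hi = lo + 1, hi - 1
    return layout


def max_span(layout):
    """Longest distance, in PE positions, between two ring neighbours."""
    n = len(layout)
    position = {node: p for p, node in enumerate(layout)}
    return max(abs(position[k] - position[(k + 1) % n]) for k in range(n))


for n in (4, 8, 16):
    print(n, max_span(direct_order(n)), max_span(interleaved_order(n)))
```

In the direct layout the wrap-around edge spans the whole row, so its length grows with n; in the interleaved layout every ring neighbour is at most two positions away, whatever the size of the row.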
Complete Binary Trees
Although an efficient embedding of complete binary trees into
the plane is known [10], its direct application to interconnection
pattern programming is very wasteful. (See Figure 11.) In fact, since a
complete binary tree of depth m has 2^m - 1 nodes, we can expect a
lattice with 2^k x 2^k PE's to host a complete binary tree of depth 2k
with one unused node. Call this node a "spare." We can expect that
the simplest lattice hosting this pattern will not require crossover
capability, since trees are planar, and will require only degree d=4,
since trees have at most degree 3 connections. (The lattice then is
given by w=1, d=4, c=1.) But if the reader attempts to develop an
interconnection with these conditions, he will find it to be
unexpectedly difficult.
The overall strategy is to begin with small, complete binary trees
embedded in square regions of the lattice. To reduce propagation delay
the root will be placed in the center of the block. Each block will contain
a spare PE. We compose four such square blocks together to form a
larger binary tree in a larger square block. Three of the four spare PE's
will be used as nodes in the composed tree; the fourth spare will become
the spare of the new block. The goal is to place the spares so that they
will be conveniently located for the composition.
Figure 11. Hyper-H tree (Figure 1(d)) embedding [10]. Filled PE's
are unused.
Define three types of tree embeddings:
Type A blocks have their spare PE midway along one side adjacent to
the exiting edge from the block's root.
Type B blocks have their spare PE in the corner on the same side as
the exiting edge from the block's root.
Type C blocks have their spare PE in the corner on the opposite side
of the exiting edge from the root.
Figure 12 illustrates the three types of blocks and demonstrates that
















Figure 12. Schematic of blocks composed to form larger blocks. Solid
squares represent original roots; open squares represent
spares. Superscript "F" means reflect with respect to
horizontal axis (flip); superscript "R" means reflect with respect
to vertical axis (reverse).
Notice that, as part of the inductive hypothesis, we must argue that
the perimeter switches are available for routing the new edges. This is
obviously true if they are available in the basis blocks. The smallest
blocks that we have been able to find with this property are 4x4 blocks
embedding 15 node binary trees. These are illustrated in Figure 13.
The conceptual algorithm is clear. Refer to Figure 14. Begin with an
objective block type, e.g., Type B, and a lattice of size 2^k x 2^k PE's.
Recursively embed the four subtrees in lattices of size 2^(k-1) x 2^(k-1)
such that the proper block types are selected. In the basis cases
(2^2 x 2^2), use an explicit embedding. Notice that the results may
require reflection. Connect the three spares by appropriate switch
settings. This latter operation is always possible based on an inductive
argument that depends upon two facts:


























Figure 13. Basis blocks for planar binary tree embedding.
(a) After the basis connection, all spares have their origin as Type C
basis block elements, and
(b) None of the switches surrounding a Type C basis block spare is
used and so there are three directions of access.
This guarantees that the three data paths can always be assigned. The
detailed program is omitted.
Clearly, we have achieved our goal of complete PE usage of this simple
lattice. If the available lattice were more complex, e.g., had degree 8
or multiple corridors, then the same embedding would work and some
minor optimizations would be possible.
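The bookkeeping behind the recursive composition can be sketched as follows. This Python fragment tracks only sizes, depths and spare counts; it does not model the block types, reflections or switch settings, and the function and field names are hypothetical.

```python
def basis_block():
    """Basis case: a 4x4 block holding a 15 node tree (depth 4) and one spare."""
    return {"side": 4, "depth": 4, "nodes": 15, "spares": 1}


def compose(blocks):
    """Join four equal blocks; three spares become the new root and its two children."""
    assert len(blocks) == 4 and len({b["side"] for b in blocks}) == 1
    spares = sum(b["spares"] for b in blocks)       # four spares are available
    used = 3                                        # new root plus its two children
    return {
        "side": 2 * blocks[0]["side"],
        "depth": blocks[0]["depth"] + 2,
        "nodes": sum(b["nodes"] for b in blocks) + used,
        "spares": spares - used,
    }


def embed(k):
    """Recursively build the block for a 2^k x 2^k lattice, k >= 2."""
    if k == 2:
        return basis_block()
    return compose([embed(k - 1) for _ in range(4)])


for k in range(2, 6):
    block = embed(k)
    print(k, block, block["side"] ** 2 - block["nodes"])
```

At every level the composed block again holds a complete binary tree with exactly one PE left over, which is the invariant the inductive argument relies on.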
Lacing a Corridor
Although we could present many more of our embeddings - a broadcast
tree, a double tree, leaves on a line tree, shuffle exchange, etc. - it is
perhaps more instructive to illustrate a technique that gives unexpected





and it takes optimum advantage of a fixed architectural resource, the
corridor width.
Suppose one is embedding an interconnection pattern and must
move a large number of distinct data paths across a region of the lattice.
By definition, the corridor width, w, is the number of switches separating
adjacent PE's. Thus, if the degree d=4, then w distinct data paths can be
routed between a pair of PE's. It would appear that for the degree d=8
lattice, w distinct data paths are still the maximum that can be routed
down a corridor. But we can do much better.
The idea behind lacing is to begin with straight data paths down a
corridor and then to add zig-zag paths that exploit the higher degree and












Figure 15. Lacing ten distinct data paths through four switches.
shows a w=4, d=8, c=3 lattice in which ten distinct data paths have been
squeezed through the four available switches! This is the maximum
possible since the bisection width of this portion of the lattice is ten.
(Bisection width is a concept introduced by Thompson [11] referring to the
minimum number of wires cut by a line bisecting a VLSI layout.) If we
expand our scope somewhat and include the switches that bound the
corridor, then we can increase the number of distinct paths by two. (We will
ignore this optimization in the lacing definition below.)
Lattice. w>1, d=8, c=3.
The construction is limited to a region bounded by four PE's. The upper
left hand corner PE is L[r,s].
Settings for crossover level 1. [Horizontal Path]
(i) 1≤i≤w and 0≤j≤w+1 imply L[r+i,s+j]=EW.
Settings for crossover level 2. [Dotted Path]
(ii) 1≤i≤w-1 and 0≤j≤w+1 and j is even imply L[r+i,s+j]=AF.
(iii) 1≤i≤w-1 and 0≤j≤w+1 and j is odd imply L[r+i+1,s+j]=OM.
Settings for crossover level 3. [Dashed Path]
(iv) 1≤i≤w-1 and 0≤j≤w+1 and j is even imply L[r+i+1,s+j]=OM.
(v) 1≤i≤w-1 and 0≤j≤w+1 and j is odd imply L[r+i,s+j]=AF.
Notice that if the switches had even higher crossover capability c=4,
which is the maximum for degree 8 switches, then we could even route
vertical wires across the laces if they were needed.
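The lacing conditions can be exercised with a short Python sketch that generates settings (i) through (v) and follows each path from the left edge of the region to the right edge. The function names are hypothetical, and the port geometry (A southwest, F southeast, O northwest, M northeast) is an assumption inferred from the settings. Under that assumption the construction yields w straight paths plus 2(w-1) zig-zag paths, i.e. 3w-2 in total, which is ten for w=4.

```python
# Assumed geometry: row change when leaving eastward via E, F (southeast) or
# M (northeast), and the west-side port at which the next switch is entered.
EXIT_STEP = {"E": 0, "F": 1, "M": -1}
NEXT_ENTRY = {"E": "W", "F": "O", "M": "A"}
WEST = {"EW": "W", "AF": "A", "OM": "O"}   # west-side port of each setting
EAST = {"EW": "E", "AF": "F", "OM": "M"}   # east-side port of each setting


def lace(w, r=0, s=0):
    """Settings (i)-(v) for the region whose upper left corner PE is L[r, s]."""
    lattice = {}

    def put(i, j, setting):
        lattice.setdefault((i, j), []).append(setting)

    for j in range(0, w + 2):
        for i in range(1, w + 1):
            put(r + i, s + j, "EW")               # (i) level 1, horizontal
        for i in range(1, w):
            if j % 2 == 0:
                put(r + i, s + j, "AF")           # (ii) level 2
                put(r + i + 1, s + j, "OM")       # (iv) level 3
            else:
                put(r + i + 1, s + j, "OM")       # (iii) level 2
                put(r + i, s + j, "AF")           # (v) level 3
    return lattice


def trace(lattice, w, row, setting, s=0):
    """Follow one path from column s to column s+w+1; return the rows visited."""
    rows = [row]
    for j in range(0, w + 1):
        exit_port = EAST[setting]
        row += EXIT_STEP[exit_port]
        entry = NEXT_ENTRY[exit_port]
        matches = [t for t in lattice.get((row, s + j + 1), []) if WEST[t] == entry]
        if len(matches) != 1:
            return None                           # path broken or ambiguous
        setting = matches[0]
        rows.append(row)
    return rows


def laced_paths(w):
    lattice = lace(w)
    return [trace(lattice, w, i, setting)
            for (i, j), settings in sorted(lattice.items()) if j == 0
            for setting in settings]


paths = laced_paths(4)
print(len(paths), max(len(v) for v in lace(4).values()))
```

No switch carries more than three settings, which is consistent with the crossover capability c=3 required of the lattice.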
Conclusions
We have introduced the CHiP architecture and argued that its
provision for interconnection pattern programming alleviates many of the
difficulties encountered in parallel program development. This
simplification is achieved in two ways. First, the rigidity of a fixed
interconnection structure is no longer an obstacle when one wants to program
an algorithm that uses a different interconnection pattern. And
secondly, there is a clean separation between routing the data and
programming the activity of the PE's.
Additionally we have demonstrated that interconnection programming
is an interesting and challenging activity. We have shown that locality
can be increased by careful study of the torus. We have shown that it is
possible to embed the complete binary tree to achieve essentially
complete PE utilization. The result involves an interesting assignment of
spare PE's. And we have shown that there are general techniques (e.g.,
corridor lacing) to be found.
Acknowledgments
It is a pleasure to thank Ching C. Hsiao for his original use of lacing
and Paul McNabb for developing the software to produce these
embeddings and for stimulating discussions of the binary tree embedding.
Thanks are due to Paul Morrissett for programming the torus and lacing
figures and to Julie Hanover for excellent manuscript preparation.
Compound octagon-square lattice
Chengtu, Szechwan, 1825 A.D.
References
[1] P. A. Gilmore, K. E. Batcher, M. H. Davis, R. W. Lott and J. T. Burkley
Massively Parallel Processor
Technical Report GER-16684, Goodyear Aerospace Corporation,
July 1979.
[2] Sally A. Browning
The Tree Machine: A Highly Concurrent Programming Environment
Ph.D. Thesis, California Institute of Technology, January 1980
[3] Bart Locanthi
The Homogeneous Machine
Ph.D. Thesis, California Institute of Technology, 1980
[4] H. T. Kung and C. E. Leiserson
Systolic Arrays (for VLSI)
Technical Report CS-79-103, Carnegie-Mellon University, December
1979 (also in [10])
[5] Jon L. Bentley and H. T. Kung
A Tree Machine for Searching Problems
In Proceedings of the 8th International Conference on Parallel
Processing, IEEE, pp. 257-266, 1979
[6] J. T. Schwartz
Ultracomputers
Transactions on Programming Languages and Systems, ACM, 1980
[7] F. P. Preparata and Jean Vuillemin
The Cube Connected Cycles: A Versatile Network for Parallel
Computation
In Proceedings of the 20th Annual Symposium on the Foundations
of Computer Science, IEEE, October 1979
[8] D. B. Gannon and Lawrence Snyder
Linear Recurrence Algorithms for VLSI: The Configurable, Highly
Parallel Approach
In Proceedings of the 10th International Conference on Parallel
Processing, IEEE, 1981
[9] L. Snyder
Overview of the CHiP Computer
In VLSI 81, Academic Press, 1981
[10] Carver Mead and Lynn Conway
Introduction to VLSI Systems
Addison Wesley, 1980
[11] C. D. Thompson
A Complexity Theory for VLSI
Ph.D. Thesis, Carnegie-Mellon University, 1980
