Introduction to the Configurable, Highly Parallel (CHiP) Computer) by Snyder, Lawrence
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1980 




Snyder, Lawrence, "Introduction to the Configurable, Highly Parallel (CHiP) Computer)" (1980). Department 
of Computer Science Technical Reports. Paper 282. 
https://docs.lib.purdue.edu/cstech/282 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
Introduction to the Configurable, Highly
Parallel Computer
Lawrence Snyder




Abstract. The Configurable, Highly Parallel (CHiP) Computer
Family is introduced. These architectures are built around
a lattice of programmable switches and data paths that
permit processing elements to be connected in arbitrary
patterns. The approach preserves locality. The parameters
that determine various family members are discussed, including
switch configuration storage capacity, switch and processor
element degrees, and corridor width. An efficient embedding
of a complete binary tree is presented to illustrate
interconnection pattern programming. An algorithm for solving a





The research described herein is part of the Blue CHiP Project.
Funding is provided in part by the Office of Naval Research under Contract
N00014-IW-K-OSIG and Contract N00014-81-K-0360, Special Research
Opportunities Program Task SRO-100.
Introduction
polymorphism, n. (1): capability
of assuming different forms; capability
of wide variation.
- Webster's Third International Dictionary
When von Neumann computers were still new and exciting,
scientists noted in popular accounts that, unlike mechanical machines,
computers are polymorphic: their function can be radically changed
simply by changing programs. Polymorphism is fundamental, but
it quickly became familiar to the point of being obvious and has been
mentioned little since, even though it has continued to underlie
important advances such as time-sharing and programmable microcode.
Now, as we are confronted with the potential for highly parallel
computers made possible by very large scale integrated (VLSI) circuit
technology, we may ask:
What is the role of polymorphism in parallel computation?
To answer this question, we must review the characteristics of parallel
processing and the benefits and limitations of VLSI technology.
Algorithmically Specialized Processors
Perhaps the most important property of VLSI circuit technology is
that the manufacturing processes use photolithographic means to create
copies of a circuit. Fabrication by photolithography (or the newer X-ray
lithography techniques) requires a fixed number of steps to produce a
circuit, independent of the circuit's complexity. It costs no more to
make copies of a chip containing a NAND gate than to make copies of a
chip containing a microprocessor, although yields will likely be higher
for the former and wire bonding costs higher for the latter. Preparing
and debugging the lithographic masks is expensive, so the technology
favors parallel processing techniques that employ many copies of the
same, possibly complex, circuit.
Recognition of uniformity as the source of leverage in VLSI caused
a flurry of research during the past half decade. This research resulted
in a number of device proposals which we may call algorithmically
specialized processors. By focusing on computationally intensive
problems and carefully dissecting algorithms for them, researchers have
developed algorithmically specialized processors having several important
characteristics:
construction is based on a few easily tessellated processing
elements,
locality is exploited, that is, data movement is often limited
to adjacent processing elements,
pipelining is used to achieve high processor utilization.
Examples of algorithmically specialized processors include designs for LU
decomposition [2,3] (the main step in solving systems of linear equations),
the solution of linear recurrences [2], tree processors [4,5,6] (used in
searching, sorting and expression evaluation), dynamic programming [7]
(a general problem solving technique with numerous applications), join
processing [8] (for data base querying), and many others.
Algorithmically specialized processing components must be
joined together to solve a large, computationally intensive problem.
This composition step is crucial since whole problems tend to be
multiphased and these components tend to be specialized to an algorithm
used in only one phase. For example, to solve a system of linear
equations (Ax = b) one might use a processor component to form the LU
decomposition of the matrix A (A = LU) and then use a linear recurrence
solver component to perform the substitution phases (Ly = b and Ux = y).
As another example, queries in data base query languages are formed
by composing operations such as "search" and "join".
If the component processors are implemented on chips, one way to
compose them is to wire them together. This solution is inflexible since
the components are dedicated to a particular problem and cannot be used
for another problem. Another compositional scheme is to join the
processors to a bus as "peripherals." This is more flexible since a
processor can be used in different phases, but the bus becomes a
bottleneck and time is wasted in interphase data movement.
A more flexible approach is to replace the dedicated processing
elements with more general microprocessors and simply to program the
algorithmically specialized processing function. This solution is much more
flexible since different components can use the same devices by changing
programs (provided the interconnection pattern is the same). The bus
bottleneck is eliminated. There is a loss in performance with this
polymorphism, since circuit implementation of the primitive actions is
replaced by the slower process of instruction execution.
But the main problem with this approach is that algorithmically
specialized processors often use different interconnection structures (see Figure 1).
There is no guarantee that the consecutive phases of the computation can
be done efficiently in place. For example, if we have an n x n mesh
connected microprocessor structure and want to find the maximum of n^2
elements stored one per processor, 2n-1 steps are necessary and sufficient
to solve the problem. But a faster algorithmically specialized processor
for this problem uses a tree interconnection pattern to find the solution
in 2 log n steps. For large n this is a benefit worth seeking. Again,
a bus can be introduced to link several differently connected multiprocessors
including mesh and tree connected multiprocessors. Data could be transferred
when a change in the processor structure would be beneficial. But the
bottleneck is quite serious: in the example, data has to be transferred at a
rate proportional to n^2/log n words per step to make the transfer worthwhile.
What we need is a multiprocessor with more polymorphism that does not
compromise the benefits of VLSI technology.
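The step counts quoted in the mesh versus tree example can be tabulated for a few lattice sizes. The following is a minimal Python sketch, assuming logarithms are base 2 and ignoring the unstated constant of proportionality in the transfer rate; the function names are illustrative and do not come from the report.

```python
import math


def mesh_max_steps(n):
    """Steps quoted for finding the maximum on an n x n mesh: 2n - 1."""
    return 2 * n - 1


def tree_max_steps(n):
    """Steps quoted for a tree connected processor: 2 log n (base 2 assumed)."""
    return 2 * math.log2(n)


def worthwhile_transfer_rate(n):
    """Words per step needed on the bus, up to a constant: n^2 / log n."""
    return n * n / math.log2(n)


for n in (16, 64, 256):
    print(n, mesh_max_steps(n), tree_max_steps(n), round(worthwhile_transfer_rate(n)))
```

The gap between the two step counts grows linearly in n, and so does the bus bandwidth needed to exploit it, which is the bottleneck the paragraph describes.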
The Configurable, Highly Parallel (CHiP) computer is a multiprocessor
architecture that provides a programmable interconnection structure
integrated with the processing elements. Its objective is to provide the
flexibility needed to compose general problem solutions while retaining
the benefits of uniformity and locality that the algorithmically
specialized processors exploit.
The CHiP Architecture Overview





Figure 1. Interconnection patterns for algorithmically specialized
processors: (a) mesh, used for dynamic programming [7];
(b) hexagonally connected mesh used for LU decomposition [2];
(c) torus used for transitive closure [7]; (d) binary tree
used for sorting [4]; (e) double tree used for searching [5].
three components: (a) a collection of homogeneous microprocessors,
(b) a switch lattice and (c) a controller. The switch lattice is the
most important component and the main source of differences among family
members.
The switch lattice is a regular structure formed from programmable
switches connected by data paths. The microprocessors (hereafter called
processing elements or PEs) are not directly connected to each other, but
rather are connected at regular intervals to the switch lattice. Figure 2
shows three examples of switch lattices. Generally, the layout will be
square although other geometries are possible. The perimeter switches are
connected to external storage devices. A production CHiP computer might
have from 2^8 to 2^15 PEs. (With current technology only a few PEs and
switches can be placed on a single chip. As improvements in fabrication
technology permit higher device densities per unit area, a single chip can
host a larger region of the switch lattice. Moreover, as discussed below,
the CHiP architecture is quite suitable for "wafer level" fabrication.)
Each switch in the lattice contains local memory capable of storing
several configuration settings. A configuration setting enables the
switch to establish a direct, static connection among two or more of its
incident data paths. (Notice, this is circuit switching rather than
packet switching.) For example, we achieve a mesh interconnection
pattern of the PEs for the lattice in Figure 2(a) by assigning North-South
configuration settings to alternate switches in odd numbered rows and
East-West settings to switches in the odd numbered columns. Figure 3
illustrates the configuration; Figure 4 gives the configuration














Figure 3. The switch lattice of Figure 2(a) configured
into a mesh pattern.
Figure 4. The switch lattice of Figure 2(a) configured
into a binary tree.
The controller is responsible for loading the switch memory. (This
task is performed via a separate interconnection "skeleton" that is
transparent to this discussion.) The switch memory is loaded
preparatory to processing, and this loading is performed in parallel with
the PE program memory loading. Typically, program and switch settings for
several phases can be loaded together. The chief requirement is that the local
configuration settings for each phase's interconnection pattern be
assigned to the same memory location in all switches. For example, in
each switch, location 1 might be used to store the local configuration
to implement a mesh pattern, location 2 might store the local
configuration for the tree interconnection pattern, etc.
CHiP processing begins with the controller broadcasting a command
to all switches to invoke a particular configuration setting. For
example, suppose it is the setting stored at location 1 that implements
a mesh pattern. With the entire structure interconnected into a mesh,
the individual PEs synchronously execute the instructions stored in
their local memory. PEs need not know to whom they are connected; they
simply execute instructions such as READ EAST, WRITE NORTHWEST, etc.
The configuration remains static. When a new phase of processing is to
begin, the controller broadcasts a command to all switches to invoke a
new configuration setting, say the one stored at location 2 implementing
a tree. With the lattice restructured into a tree interconnection pattern,
the PEs resume processing, having spent only a single logical step in
interphase structure reconfiguration.
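The load-then-broadcast protocol just described can be modelled in a few lines. This is a minimal Python sketch under stated assumptions: the `Switch` and `Controller` classes, the representation of a setting as a tuple of connected port groups, and the particular example settings are all hypothetical and are not taken from the report.

```python
class Switch:
    """One lattice switch holding up to `capacity` configuration settings."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = {}      # memory location -> configuration setting
        self.active = None    # setting currently in force

    def load(self, location, setting):
        if not 1 <= location <= self.capacity:
            raise ValueError("location outside switch memory")
        self.memory[location] = setting

    def invoke(self, location):
        self.active = self.memory[location]


class Controller:
    """Loads switch memories and broadcasts reconfiguration commands."""

    def __init__(self, switches):
        self.switches = switches

    def load_phase(self, location, settings):
        # A phase uses the same memory location in every switch.
        for switch, setting in zip(self.switches, settings):
            switch.load(location, setting)

    def broadcast(self, location):
        # One logical step: all switches adopt the setting at `location`.
        for switch in self.switches:
            switch.invoke(location)


# Made-up local settings for four switches; each setting is a tuple of
# port groups that the switch connects.
mesh = [(("N", "S"),), (("E", "W"),), (("N", "S"),), (("E", "W"),)]
tree = [(("N", "E", "W"),), (("E", "W"),), (("N", "S"),), (("S", "E"),)]

switches = [Switch(capacity=2) for _ in range(4)]
controller = Controller(switches)
controller.load_phase(1, mesh)   # location 1: mesh pattern
controller.load_phase(2, tree)   # location 2: tree pattern

controller.broadcast(1)
phase1 = [s.active for s in switches]
controller.broadcast(2)
phase2 = [s.active for s in switches]
```

The point of the sketch is that reconfiguration costs a single broadcast regardless of how many switches change, because every local setting was loaded ahead of time.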
The overview of the CHiP computer family has been superficial, but
it has provided a context in which to present a more thorough treatment.
The next three sections are:
A closer look, giving details about switches, lattices and
the controller,
Embedding an interconnection structure, an example of how to
configure the lattice into a complete binary tree, and
Solving a system of linear equations, illustrating how a
multiphased problem might be solved.
We conclude with a Discussion section in which we mention some of the
consequences of the CHiP architecture approach.
A Closer Look
We consider some of the characteristics that distinguish members of the
family of CHiP computers.
Switches. It is convenient to think of switches as being defined by
several parameters:
m - the number of wires entering a switch on one data path, or data
path width,
d - the degree, or number of incident data paths,
c - the number of configuration settings that can be stored in a
switch.
The value of m reflects the balance struck between parallel and serial
data transmission. This balance will be influenced by several considerations,
one of which is the limited number of pins on the package containing the
chips of the CHiP lattice. Specifically, if a chip hosts a square region
of the lattice containing n PEs, then the number of pins required is
proportional to m√n.
The value of d will usually be 4, as in Figure 2(a), or 8, as
in Figure 2(c). Figure 2(b) shows a mixed strategy which exploits
the fact that switches tend to be used in two different roles. Switches
at the intersection of the vertical and horizontal switch corridors tend
to perform most of the routing while those interposed between two
adjacent PEs act more like extended PE ports for selecting data paths
from the "corridor buses". Specializing the degree of the switch to
these activities reduces the number of bits required to specify a
configuration setting and thus saves area.
The value of c is influenced by the number of configurations that are
likely to be needed for a multiphase computation and the number of bits
required per setting. This latter number depends on the degree and the
crossover capability of the switch.
"Crossover capability" is a property of switches referring to the
number of distinct data path groups that a switch can simultaneously
connect. We speak of data path "groups" rather than data path pairs
since fanout is permitted at a switch, i.e. a switch can connect more
than a pair of data paths. Crossover capability is specified by an
integer g in the range 1 to d/2. Thus 1 indicates no crossover and
d/2 is the maximum number of distinct paths intersecting at a degree d
switch. Like the three parameters mentioned above, the crossover
capability g is fixed at fabrication time.
The number of bits of storage needed for a switch is modest: dgc.
This provides a bit for each direction for each crossover group for each
configuration setting. A technique to reduce this value is to provide
for the loading of switch settings while the CHiP processor is executing.
This quality, called "asynchronous loading", permits a smaller value of c
by taking advantage of two facts: algorithms often use configurations that
differ in only a few places, and configurations often remain in effect
long enough to provide time to prepare for future settings.
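To get a feel for how modest the dgc figure is, it can be evaluated for a couple of switch designs. The following Python sketch assumes only the formula and the range of g stated above; the function name and the sample parameter values are illustrative choices, not figures from the report.

```python
def switch_storage_bits(d, g, c):
    """Bits of configuration storage for one switch: d * g * c.

    d: degree (incident data paths)
    g: crossover capability, an integer from 1 to d/2
    c: number of stored configuration settings
    """
    if not 1 <= g <= d // 2:
        raise ValueError("crossover capability must lie between 1 and d/2")
    return d * g * c


# Hypothetical design points: a degree 4 switch without crossover and a
# degree 8 switch with two crossover groups.
small = switch_storage_bits(d=4, g=1, c=8)
large = switch_storage_bits(d=8, g=2, c=4)
print(small, large)
```

Because the cost is linear in c, asynchronous loading trades storage for controller activity: halving c halves the bits per switch.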
Lattice. From Figure 2 it is clear that lattices can differ in
several characteristics. The PE degree, like the switch degree, is the
number of incident data paths. Most algorithms of interest use PEs of
degree eight or less. Larger degrees are probably not necessary since
they can be achieved either by multiplexing data paths or, with some
loss in PE utilization, by logically coupling processing elements, e.g.
two degree four PEs could be coupled to form a degree six PE where one
serves only as a buffer.
Call the number of data paths that separate two adjacent PEs the
corridor width, w. (See Figure 2(c) for a w = 2 lattice.) This is
perhaps the most significant parameter of a lattice since it influences
the efficiency of PE utilization, the convenience of interconnection
pattern embeddings, and the overhead required for the polymorphism.
To see the impact of corridor width, let us embrace graph embedding
parlance and say that a switch lattice hosts a PE interconnection pattern.
In theory, even the simplest lattice (like the one in Figure 2(a)) can
host an arbitrary interconnection pattern. But to do so may require the
PEs to be underutilized for two reasons. First, PEs may be coupled to
achieve high PE degree as mentioned at the beginning of this section.
Second, and more importantly, adjacent PEs in the (logical) guest
interconnection pattern may have to be assigned to widely spaced PEs in the
hosting lattice (i.e. separated by unused PEs) in order to provide
sufficiently many data paths for the edges. (Figure 5 shows the embedding
of the complete bipartite graph, K4,4, in the lattice of Figure 2(c),
where the center column of PEs is unused.) Increasing corridor width
improves processor utilization when complex interconnection patterns
must be embedded since it provides more data paths per unit area.
How wide should corridors be? It all depends on which interconnection
patterns are likely to be hosted and how economically necessary it is to





Figure 5. Graph K4,4 shown in (a) is embedded into the lattice of
Figure 2(c) using a switch with crossover value g = 2.
processors developed for VLSI implementation, a corridor width of two
suffices to achieve optimal or near optimal PE utilization. However,
to be sure of hosting all planar interconnection patterns of n nodes with
reasonably complete processor utilization, a width proportional to log n
suffices and may be necessary [9]. To host patterns such as the shuffle-
exchange graph with high efficiency will require still wider corridors;
on the average w must be at least proportional to n/log n [10].
Selecting a corridor width is a difficult decision, especially if
it is a nonconstant width. The benefit is higher PE utilization in some
cases; the cost is a loss of some locality in all cases, introduction of
more area overhead, and increased problems with "pin" limitations.
Preliminary evidence indicates that w = 4 provides a reasonable
cost/benefit tradeoff, but further experimentation and analysis are
required. (See reference [12] for an elaboration of this discussion.)
Embedding an Interconnection Pattern
In addition to the conventional polymorphism derived from PE
programming, we have provided for a second kind of polymorphism: the
programmable switches. This requires us to provide for interconnection
pattern programming, i.e. the specification of a global interconnection
pattern. When viewed in a programming language context, the "source
program" is a global interconnection pattern that a compiler translates
into an "object code" of individual switch settings suitable for loading
into the switches by the CHiP controller. The general programming language
and compiler issues need not concern us here, however, for we will explore
only one particular interconnection pattern: the complete binary tree.
This example will enable us to illustrate the differences between
embedding into the plane and embedding into the CHiP lattice.
The complete binary tree has 2^p - 1 PEs, one at each node. One
possible layout of this structure in the CHiP lattice is a direct
translation of the "hyper-H" strategy [1] illustrated in Figure 1(d).
Figure 6 illustrates this embedding into the lattice of Figure 2(a) and
it is clear that a significant number (approaching one half) of the PEs
are unused in this naive approach. The problem is then: although the
hyper-H is an excellent embedding on plain silicon where the placement
of PEs and data paths is arbitrary, CHiP lattice embeddings must conform
to the prespecified PE and data path sites. As we shall see, this
constraint is not onerous.
To illustrate an optimal embedding (in terms of maximizing the
use of PEs), assume that we have an n x n CHiP lattice where n = 2^k
for some integer k. This gives 2^(2k) PEs, so a binary tree of depth 2k
fits with only one unused PE, since it has 2^(2k) - 1 nodes. Call this
unused PE a "spare."
We proceed inductively by pairing two embedded subtrees to form
a new tree one level higher. For the basis of the induction it is
convenient to use a three node binary tree embedded with one spare in
a 2 x 2 portion of the lattice. Pairing square subtree embeddings
produces rectangles with sides in ratio 2:1. Pairing these rectangles
yields squares again. In general we pair two subtrees each with 2^(2k) - 1
nodes and a spare to produce a new 2^(2k+1) - 1 node tree in which one of the
subtree spares becomes the root of the new tree and the other spare
becomes the spare of the new tree. The interesting problem is to place
the spares at the proper sites for the next step in the induction.
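The bookkeeping of this induction (region shape, node count, and the single spare) can be checked mechanically. The following is a small Python sketch that tracks only sizes and counts; it does not place PEs or route data paths, and the function name and tuple layout are assumptions made for illustration.

```python
def pairing_sequence(k):
    """Region sizes produced by repeated pairing, up to a 2^k x 2^k lattice.

    Each entry is (rows, cols, tree_nodes). Every region holds exactly one
    spare, so rows * cols == tree_nodes + 1 throughout.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rows, cols, nodes = 2, 2, 3      # basis: three node tree in a 2 x 2 region
    sequence = [(rows, cols, nodes)]
    while rows < 2 ** k:
        if rows == cols:
            cols *= 2                # two squares paired: a 2:1 rectangle
        else:
            rows *= 2                # two rectangles paired: a square again
        nodes = 2 * nodes + 1        # two subtrees plus one spare promoted to root
        sequence.append((rows, cols, nodes))
    return sequence


for rows, cols, nodes in pairing_sequence(3):
    print(rows, cols, nodes, rows * cols - nodes)
```

The last column printed is the number of spares, which stays at one at every level: one of the two subtree spares is consumed as the new root and the other survives.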
Figure 6. The hyper-H tree (Figure 1(d)) directly embedded into
the switch lattice of Figure 2(a); the switches are
not shown.
If we adopt the strategy of the hyper-H embedding and locate the
root at the center of the tree, then it makes sense to place a spare at
the middle of one side so that when this tree is paired to form the next
larger tree, there is a spare at the interface ready to become the new
root. This will be in the center of the new tree as we intend. (Of
course, since the sides always have an even number of PEs, "middle"
here means adjacent to the midpoint of one side.) But we cannot
pair two trees with their spares in the middle of one side since this
will leave us with either a buried spare that is difficult to use when
forming the next larger tree or it will leave us with a spare on the
perimeter at a site inappropriate for the embedding of the next larger
tree. (See Figure 7.)
Figure 7. Pairing subtrees using spares located at the
midpoint of one side.
The solution is to pair one subtree with a spare located at the
middle of one side with a subtree whose spare is at the corner. The
spare in the middle becomes the root of the new tree and the corner spare
can be located (using reflection) to become either a middle spare or a
corner spare of the new tree depending on which is needed for the next
inductive step. Thus, at each step in the induction we must use (and
we can create) two types of embeddings: middles and corners. (See
Figure 8.) Notice that the basis tree, embedded in a 2 x 2 portion of
the lattice, actually serves as both types.
Trees, of course, are planar; that is, they can be embedded in the
plane without crossovers. But if the reader endeavors to follow the
preceding algorithm with the lattice in Figure 2(a), it will appear as
though crossovers are required, at least during the early stages of the












Figure 8. The formation of "middles" and "corners" embeddings
using a middle and corner pair.
embedded in 4 x 4 square regions of the lattice, to achieve a completely
planar embedding. A solution is shown in Figure 9 and is completely described
in reference [15].
Solving a System of Linear Equations
In order to illustrate how the CHiP processor can be used to compose
algorithms, we pose the problem of solving a system of linear equations,
i.e. to solve Ax = b for an n x n coefficient matrix A of bandwidth p


























































































due to H. T. Kung and C. E. Leiserson as described in Mead and Conway [1].
The first is an LU-decomposition systolic array processor that factors A
into upper and lower triangular matrices U and L.
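\[
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} &        & 0      \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} &        \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & \ddots \\
a_{41} & a_{42} & a_{43} & \ddots &        &        \\
       & a_{52} & a_{53} &        &        &        \\
0      &        & \ddots &        &        &
\end{bmatrix}
=
\begin{bmatrix}
1      &        &        &        &        & 0      \\
l_{21} & 1      &        &        &        &        \\
l_{31} & l_{32} & 1      &        &        &        \\
l_{41} & l_{42} & l_{43} & 1      &        &        \\
       & l_{52} & l_{53} & \ddots & \ddots &        \\
0      &        & \ddots &        &        &
\end{bmatrix}
\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14} &        & 0      \\
       & u_{22} & u_{23} & u_{24} & u_{25} &        \\
       &        & u_{33} & u_{34} & \ddots &        \\
       &        &        & \ddots &        &        \\
       &        &        &        &        &        \\
0      &        &        &        &        &
\end{bmatrix}
\]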
The second systolic processor solves a lower triangular linear system
Ly = b where L is the output from the decomposition step. (We call this
the LTS solver.) The final result vector x can be found by solving
Ux = y where U is the upper triangular matrix from the first step and y
is the vector output of the second step. By rewriting U as a lower
triangular system we can use another instance of the LTS solver. Our
approach will be to compose these pieces into a harmonious process
to solve the entire problem.
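The arithmetic of this composition can be sketched in ordinary sequential code. The Python below is a minimal illustration, not a model of the systolic arrays: it assumes a dense matrix, no pivoting, and nonzero pivots, and all function names are hypothetical. The same forward substitution routine is used twice, the second time on U renumbered in reverse order so that it is lower triangular.

```python
def lu_decompose(A):
    """Factor A = L U with L unit lower triangular (no pivoting)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[float(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def solve_lower(T, b):
    """Forward substitution for a lower triangular system T y = b."""
    y = []
    for i in range(len(b)):
        partial = sum(T[i][j] * y[j] for j in range(i))
        y.append((b[i] - partial) / T[i][i])
    return y


def reverse_system(U, y):
    """Renumber rows and columns in reverse; upper triangular becomes lower."""
    n = len(y)
    reversed_U = [[U[n - 1 - i][n - 1 - j] for j in range(n)] for i in range(n)]
    return reversed_U, y[::-1]


def solve(A, b):
    L, U = lu_decompose(A)                    # phase 1: decomposition
    y = solve_lower(L, b)                     # phase 1: L y = b
    reversed_U, reversed_y = reverse_system(U, y)
    reversed_x = solve_lower(reversed_U, reversed_y)   # phase 2: U x = y
    return reversed_x[::-1]


A = [[4, 1, 0], [1, 4, 1], [0, 1, 4]]   # small banded example
b = [6, 12, 14]
x = solve(A, b)
print(x)
```

Reusing `solve_lower` for both substitutions mirrors the reuse of one LTS solver design for both triangular systems.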
The first problem we must solve is the embedding of the Kung-Leiserson
systolic processors. These algorithmically specialized processors are
defined for n x n arrays of bandwidth p. (Figure 10 shows the LU-
decomposition processor for a p = 7 system. Figure 11 shows a suitable
lower triangular system solver processor.) Since the LU-decomposition
processor is hexagonally connected, it will be convenient to embed the
processors into the lattice shown in Figure 2(b). The obvious strategy
is to connect the processors in such a way that the lower triangular
output L of the decomposition step connects directly to the input of
the lower triangular system solver. It is also obvious that these
embeddings should be placed at the perimeter of the CHiP lattice so that
matrix A and vector b can be received from external storage. Figure 12
shows such an embedding* where the PE labellings correspond to those
given in Figures 10 and 11.







Figure 10. The Kung-Leiserson systolic array for LU-decomposition.
Labellings indicate data paths. For timings, see
reference [1].
* Although the data paths are bidirectional, we have used arrows to emphasize







Figure 11. The Kung-Leiserson systolic LTS solver for w = 4. Labellings





Figure 12. The embedding of the LU-decomposition processor and
the LTS solver in the lattice of Figure 2(b). PE
labellings correspond to Figures 10 and 11.
Several simple transformations have been employed to accomplish
the embedding. The most noticeable is that the hexagonal structure has
been slightly deformed to accommodate the rectangular CHiP lattice and
the LU-decomposition processor has been rotated clockwise 120°. The
constant inputs (0's and -1) that appear on the perimeter of the systolic
array have been suppressed since they can be generated internally to the
PEs. The output wires carrying the L matrix result have been assigned
to one of the available ports and routed to the inputs of the LTS solver.
Finally, to embed the double channel between PEs of the LTS solver we
have routed data diagonally out of the North-East port into the South-East
port. Notice that since the diagonal elements of L are all 1, they are not
explicitly produced.
The next problem to solve is the rewriting of U as a lower
triangular system suitable for input into another embedded LTS solver.
We must wait until U has been entirely produced before performing this
operation. So, rather than writing the elements of U to external storage
as they are produced, we thread them through the lattice (assuming there
is sufficient space to store them all). We also thread the y vector
output from the LTS process along with U. Then in the second phase of
our algorithm, we can process the elements through another embedded LTS
solver.
Perhaps the most elegant way to thread U and y through the lattice
is to use a graph embedding due to Aleliunas and Rosenberg [13]. The
scheme has the advantage of not requiring a large "bundle" of wires along
the perimeter of the lattice when the threads double back. (Figure 13
illustrates the embedding required for doubling back.) As the U and y
values are produced, they are passed from PE to PE. (They could be
"concentrated" by storing several per PE.) When U and y are completely
produced, the first phase is completed.






















Figure 13. The Aleliunas-Rosenberg embedding of the threads
doubling back. The arrows indicate the direction
of flow of the U and y values.
Between the first and second phases we make a minor reconfiguration.
(This reconfiguration would not have been necessary had the phase 1
configuration been somewhat more clever; but as an example, it would also
have been somewhat more confusing.) The second configuration embeds the







Figure 14. The simple phase 2 embedding
The inputs to this group of processors come from reversing the direction
of flow of the threaded values from phase 1. Notice that this reversal
of flow has the effect of renumbering the matrix U to be in lower
triangular form appropriate for the LTS solver. The appropriate values
of the y vector are also available at the proper locations. The outputs
from the second phase emanate from the western port of processor (4,1).
These are the values solving Ax = b.
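Why reversal yields a lower triangular system can be written out in a few lines. The following is a worked sketch using 1-based indices; the tilde notation is introduced here for illustration and does not appear in the report.

```latex
\[
\tilde{U}_{ij} = U_{n+1-i,\,n+1-j}, \qquad
\tilde{x}_i = x_{n+1-i}, \qquad
\tilde{y}_i = y_{n+1-i}.
\]
Since $U_{ij} = 0$ for $i > j$, we have $\tilde{U}_{ij} = 0$ whenever
$n+1-i > n+1-j$, that is, whenever $i < j$, so $\tilde{U}$ is lower triangular.
Substituting $j' = n+1-j$,
\[
\sum_{j=1}^{n} \tilde{U}_{ij}\,\tilde{x}_j
  = \sum_{j'=1}^{n} U_{n+1-i,\,j'}\,x_{j'}
  = y_{n+1-i}
  = \tilde{y}_i ,
\]
so $Ux = y$ holds exactly when $\tilde{U}\tilde{x} = \tilde{y}$.
```

Solving the renumbered system by forward substitution therefore produces the components of x in reverse order.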
To summarize, the system of linear equations Ax = b is solved in two
phases on the CHiP processor. In phase 1 an embedded LU-decomposition
processor takes A as input and produces matrices L and U as output. The
L output is immediately input to an LTS solver that also takes b as input
and solves Ly = b. The vector y and the matrix U are threaded through the
lattice. Phase 1 completes when A has been decomposed. In phase 2
another embedded LTS solver takes the threaded output from phase 1 (by
reversing its flow) and solves Ux = y.
Phase 2 makes scant use of parallelism it runs 1n the same time as
phase 1 and the data arc already in the CHiP processor. And as noted, the
interphase reconfiguration I~as not essential. But, there are algorithms
to solvc the phase 2 problem that do make essential usc of configurabilit)'
to make el'f('ctive ll:;e of parallelism [14). A complete Jevelopment of thc
:Ippl'oach is I\ot p()~sible here, but the cssenti.al idea due to Chell, Kuck
and Samch [11) 1s straightforl,-ard: A trilJlsformation Oll U enahles LIS to
decompose the matrix into blocks Bl, ...• Bk whose product yields the result.
Because the product operation is associative, the whole produr:.t f.:i'.n be
formed by taking pain~ise products in parallel, then paindse products
of the results, etc. By reconfiguring the threaded portion of the lattice
Llsing one of several rather complicated interconnection patterns that
-27-
either implicitly or explicitly embed a tree, we can perform these pairwise
products in parallel. The result is a faster parallel algorithm made
possible by configurability.
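The pairwise product scheme can be sketched as a tree reduction. The
sketch below is a sequential simulation under stated assumptions: the
names tree_product and mat_mul are hypothetical, the blocks are arbitrary
small matrices rather than the output of the transformation of [11], and
each pass of the loop stands for one step in which all pairwise products
would be formed simultaneously. Adjacent blocks are paired in order, so
the result relies only on associativity, not commutativity.

```python
from functools import reduce


def mat_mul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def tree_product(blocks, multiply):
    """Combine blocks pairwise, level by level, preserving their order.

    Returns (product, rounds), where rounds is the number of levels,
    i.e. the number of steps if each level ran fully in parallel.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    level = list(blocks)
    rounds = 0
    while len(level) > 1:
        # Each product on this level is independent of the others.
        nxt = [multiply(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd block out passes to the next level
        level = nxt
        rounds += 1
    return level[0], rounds


def sequential_product(blocks, multiply):
    """Left-to-right product, for comparison with the tree order."""
    return reduce(multiply, blocks)
```

For k blocks the loop runs about log2(k) times, against k - 1 dependent
steps for the left-to-right product; this is the source of the speedup
once the levels are mapped onto a tree embedded in the lattice.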
Discussion
Several characteristics of the CHiP approach should be mentioned.
First, the algorithmically specialized processors translate mutatis
mutandis to programs for the CHiP computer. Thus, we have a ready
supply of algorithms that can effectively use the parallel processor.
Of course, all of these algorithms use one interconnection structure,
and it is possible that improved algorithms might be found that exploit
the availability of multiple interconnection structures.
Second, configurability provides both interphase and intraphase
flexibility. This distinction, though not very clear-cut, tends to
correlate with whether or not pipelining is being used. If a problem is
solved by a sequence of phases that each complete before the next one
begins, we tend to use regular configurations that change at the completion
of a phase (interphase). The whole lattice is in a mesh or tree pattern.
For a series of pipelined algorithms that can be coupled together, as in
the last section, we tend to form regions of the lattice dedicated to each
algorithm with data paths interconnecting the regions. We refer to this as
intraphase configurability because within one phase we interconnect
several regular structures. Clearly, we need not change configurations
to exploit the advantage of configurability.
Both kinds of configurability are useful in adapting to changes in
problem size. For example, two different small problems might operate
concurrently on different regions of the CHiP processor using entirely
different interconnection schemes. One pattern could change while the
other remained fixed by loading switches of the fixed region with two
copies of the same configuration setting. Pipelined processors, whose
size is usually a function of the input width, can be tailored to the
right size at loading time.
Another consequence of configurability is that it is quite fault
tolerant. Supposing that an error is detected in a processor, data path
or switch, we can simply route around the offending device. For convenience,
we might choose to leave other processors unused to "square up" the
lattice when matching dimensions are important.
Perhaps the most intriguing consequence of configurability's fault
tolerance is the possibility of "wafer level" fabrication. That is,
instead of dicing a wafer and discarding the faulty processor chips, we
can leave a VLSI wafer whole and simply route around the unusable
processors. (We could use the dicing corridors for data paths and
switches.) For example, if a wafer contains 100 processor chips and
yield characteristics indicate that roughly one third are faulty, then
a wafer is acceptable if we can find an 8 x 8 sublattice that is functional.
The mapping of the switches to host the 8 x 8 in the 100 could be done
on the wafer by special circuitry designed for that purpose. Although the
number of pins required for the wafer would be large, their number is only
proportional to the perimeter rather than the area. This actually reduces
the total number of wires bonded.
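The arithmetic behind the wafer example can be checked with a short
Python sketch. It rests on assumptions not made in the text: chips are
taken to fail independently with probability one third, and having at
least 64 good chips is treated only as a necessary condition, since
whether the switches can actually route an 8 x 8 sublattice around the
faults depends on where the faults fall. The names prob_at_least and
boundary_sites are hypothetical.

```python
from math import comb


def prob_at_least(n_chips, needed, p_good):
    """Binomial probability that at least `needed` of `n_chips` are good.

    Assumes independent faults (an illustrative assumption).
    """
    return sum(comb(n_chips, k) * p_good**k * (1.0 - p_good)**(n_chips - k)
               for k in range(needed, n_chips + 1))


def boundary_sites(n):
    """Number of chip sites on the perimeter of an n x n array."""
    return n * n if n < 2 else 4 * n - 4


if __name__ == "__main__":
    # 100 chips, 64 needed for an 8 x 8 sublattice, 2/3 expected good.
    print(prob_at_least(100, 64, 2.0 / 3.0))
    # Perimeter sites versus total sites for a 10 x 10 wafer.
    print(boundary_sites(10), 10 * 10)
```

The second figure illustrates the pin argument: external connections
scale with the perimeter sites of the array while the chips served
scale with its area.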
Summary
By integrating programmable switches with the processing elements,
the CHiP computer achieves a polymorphism of interconnection structure
that also preserves locality. This enables us to compose algorithms that
exploit different interconnection patterns. In addition to responding
to different problem sizes and characteristics, the flexibility of
integrated switches provides substantial fault tolerance and permits
wafer level fabrication.
Acknowledgements
It is a great pleasure to thank Dennis Gannon for his encouragement
and his assistance with the linear systems solving example. Janice Cuny's
critical reading has led to a simplification of the switch - the insight
is much appreciated. Thanks are due Paul McNabb who developed programs
to produce the embedding of Figure 9. Finally, Robert Grafton, Leonard




[1] Carver Mead and Lynn Conway
Introduction to VLSI systems
Addison Wesley, 1980
[2] H.T. Kung and C.E. Leiserson
Systolic arrays (for VLSI)
Tech. Report CS-79-103, Carnegie-Mellon University, April 1979
(Also in [1])
[3] D.B. Gannon
On pipelining a mesh connected multiprocessor for finite element
problems by nested dissection
Proc. Int'l Conf. on Parallel Processing, pp. 197-204, 1980
[4] Sally Browning
The tree machine: a highly concurrent programming environment
Ph.D. Thesis, California Institute of Technology, Jan. 1980
[5] Jon L. Bentley and H.T. Kung
A tree machine for searching problems




Technical Report, Yale University, March 1980
[7] L.J. Guibas, H.T. Kung and C.D. Thompson
Direct VLSI implementation of combinatorial algorithms
In Cal. Tech. Conf. on VLSI, California Institute of Technology
January 1979
[8] S.W. Song
A highly concurrent tree machine for data base applications
Proc. Int'l Conf. on Parallel Processing, pp. 259-268, 1980
[9] L.G. Valiant
Universality considerations in VLSI circuits
IEEE Trans. Computers, 1981
[10] C.D. Thompson
A complexity theory for VLSI
Ph.D. Thesis, Carnegie-Mellon University, 1980
[11] S.C. Chen, D.J. Kuck and A.H. Sameh
Practical Parallel Band Triangular System Solvers
ACM TOMS (Sept. 78) pp. 270-277.
[12] L. Snyder
Overview of the CHiP Computer
In VLSI 81, John Gray, ed., Academic Press, pp. 240-249, 1981
[13] Romas Aleliunas and A.L. Rosenberg
On embedding rectangular grids into square grids
IBM Tech. Report RC 8404, 1980
[14] D.B. Gannon and L. Snyder
Linear Recurrence Algorithms for VLSI: The
Configurable, Highly Parallel Approach
(in preparation)
[15] Lawrence Snyder
Programming Processor Interconnection Structures
Purdue University Department of Computer Sciences, TR-381, 1981
