Parallel computer architecture complicates the already difficult task of parallel programming in many ways, e.g., by a rigid interconnection structure, addressing complexity, and shape and size mismatches. The CHiP computer is a new architecture that reduces these complications by permitting the processor interconnection structure to be programmed. This new kind of programming is explained. Algorithms are presented for several interconnection patterns, including the torus and the complete binary tree, and general embedding strategies are identified. The research described herein is part of the Blue CHiP Project.
Introduction
Although it is a difficult task to design a sequential computer architecture that efficiently hosts sequential algorithms, it is perhaps even more challenging to design a parallel architecture that efficiently hosts parallel algorithms. The aspects of parallel computation that frustrate the harmonious match between algorithm and architecture are many.
We report on a new family of architectures, the Configurable, Highly Parallel (CHiP) computers, that respond to the demands of parallel algorithms, especially the need for locality and flexibility. The central concept is this:
The processing elements are embedded into a programmable switch lattice that permits not only the programming of the PE's but also the direct programming of their interconnection structure.
This second kind of programming not only ameliorates the difficulties mentioned above, it also permits the convenient composition of parallel algorithms. It has even led to the development of entirely new parallel algorithms [8]. In this paper we give a synopsis of the CHiP architecture and then explore the consequences of this new kind of programming, interconnection structure programming. The main results are algorithms for programming various interconnection structures.
... which is done by rewriting U as a lower triangular matrix and using ...
Synopsis of the CHiP Computer
Programming Interconnection Patterns
We will emphasize the specification of uniform rather than ad hoc interconnection patterns because they are of interest in their own right and they are often the building blocks that are used by the less regular patterns. First, we must consider the lattice that is to host the interconnection pattern. ... does not. In the interest of generality, we will assume the "simplest" lattice suitable for an interconnection pattern.
The two index coding scheme for a lattice.
As an example of this specification method, we observe that the mesh interconnection pattern (Figure 3(a)) can be defined by the two conditions ... and requires a lattice of degree d=6 or (for symmetry) d=8. Notice that this specification is somewhat more general than that used in Figure 5.
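The two defining conditions for the mesh did not survive in this copy of the text. As a rough illustration only, the following Python sketch enumerates the edges of an ordinary rectangular mesh, assuming PEs are named by a hypothetical pair of indices (i, j) and that a PE is joined to its right-hand and lower neighbors when they exist. The function name `mesh_edges` and the indexing are assumptions made for this sketch, not the paper's two-index lattice coding.

```python
def mesh_edges(rows, cols):
    """Return the undirected edges of a rows x cols mesh of PEs.

    Assumed conditions (hypothetical, standing in for the lost ones):
      1. PE (i, j) is joined to PE (i + 1, j) whenever i + 1 < rows.
      2. PE (i, j) is joined to PE (i, j + 1) whenever j + 1 < cols.
    """
    edges = set()
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:          # condition 1: vertical neighbor
                edges.add(((i, j), (i + 1, j)))
            if j + 1 < cols:          # condition 2: horizontal neighbor
                edges.add(((i, j), (i, j + 1)))
    return edges


if __name__ == "__main__":
    for edge in sorted(mesh_edges(2, 3)):
        print(edge)
```

Stating the pattern as a pair of index conditions, rather than as an explicit edge list, is what lets the same specification be applied to a mesh of any size.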
Torus Interconnection Patterns
Since Figure 9 illustrates the entire construction.
The difficulty with this interconnection pattern, of course, is that it has long data paths that are subject to propagation delay. Some algorithms can accept such a delay, but generally we would like to reduce it. Accordingly, we prefer the following more intricate pattern that interleaves the row and column processing elements so that there is a fixed bound on the distance a signal must travel. The entire construction is shown in Figure 10.
Clearly the maximum number of switches that any data item must pass through is three. We have increased the locality of the torus embedding. It is, therefore, more amenable to VLSI implementation and can be used in an arbitrarily large lattice with only a constant delay.
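The effect of interleaving can be illustrated on a single ring, which is what each row and each column of a torus amounts to. The sketch below uses the familiar "folded ring" placement as a stand-in; it is not claimed to reproduce the construction of Figure 10, and the function names are hypothetical. Logical elements 0, 1, 2, ... are placed on the even physical positions from left to right and the remaining elements on the odd positions from right to left, so that ring neighbors are never more than two positions apart, whatever the ring size.

```python
def folded_positions(n):
    """Physical position of each logical ring element under interleaving.

    Even physical slots hold logical 0, 1, 2, ... in order; odd slots hold
    logical n-1, n-2, ... in order. Returns pos where pos[k] is the slot
    assigned to logical element k.
    """
    pos = [0] * n
    for slot in range(n):
        logical = slot // 2 if slot % 2 == 0 else n - 1 - slot // 2
        pos[logical] = slot
    return pos


def max_ring_stretch(pos):
    """Largest physical distance between logically adjacent ring elements."""
    n = len(pos)
    return max(abs(pos[k] - pos[(k + 1) % n]) for k in range(n))


if __name__ == "__main__":
    for n in (8, 64, 1024):
        naive = list(range(n))            # element k sits in slot k
        print(n, max_ring_stretch(naive), max_ring_stretch(folded_positions(n)))
```

In the naive placement the wraparound connection spans the whole ring, so its length grows with n; in the interleaved placement the longest connection stays fixed, which is the kind of constant bound the text describes.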
Complete Binary Trees
Although an efficient embedding of complete binary trees into the plane is known [10], its direct application to interconnection patterns ... the spare of the new block. The goal is to place the spares so that they will be conveniently located for the composition.
Define three types of tree embeddings:
Type A blocks have their spare PE midway along one side adjacent to the exiting edge from the block's root.
Type B blocks have their spare PE in the corner on the same side as the exiting edge from the block's root.
Type C blocks have their spare PE in the corner on the opposite side of the exiting edge from the root.
Figure 12 illustrates the three types of blocks and demonstrates that they can be inductively produced using blocks of these types. This guarantees that the three data paths can always be assigned. The detailed program is omitted.
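The bookkeeping behind the spare PE can be checked with a few lines of arithmetic. The Python sketch below assumes, since the surviving text does not say so explicitly, that a block of 2^k PEs hosts a complete binary tree of 2^k - 1 nodes and therefore has exactly one spare, and that two such blocks are composed by promoting one of their spares to be the new root. The function names are hypothetical, and the sketch ignores the geometric placement that distinguishes Types A, B, and C.

```python
def block_stats(k):
    """(PEs, tree nodes, spares) for a block holding a k-level complete tree.

    Assumption: the block contains exactly 2**k PEs.
    """
    pes = 2 ** k
    nodes = 2 ** k - 1
    return pes, nodes, pes - nodes


def compose(k):
    """Join two k-level blocks under a new root taken from one spare PE."""
    pes, nodes, spares = block_stats(k)
    new_pes = 2 * pes
    new_nodes = 2 * nodes + 1       # two subtrees plus the new root
    return new_pes, new_nodes, new_pes - new_nodes


if __name__ == "__main__":
    for k in range(1, 6):
        pes, nodes, spares = compose(k)
        print(k + 1, pes, nodes, spares, nodes / pes)
```

Under these assumptions the composed block is again a block with a single spare, so the induction can continue, and the fraction of PEs in use approaches one as the tree grows.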
Clearly, we have achieved our goal of complete PE usage of this simple lattice. If the available lattice were more complex, e.g., had degree 8 or multiple corridors, then the same embedding would work and some minor optimizations would be possible.
Lacing a Corridor
Although we could present many more of our embeddings (a broadcast tree, a double tree, leaves on a line tree, shuffle exchange, etc.) ... Notice that if the switches had even higher crossover capability c=4, which is the maximum for degree 8 switches, then we could even route vertical wires across the laces if they were needed.
Conclusions
We have introduced the CHiP architecture and argued that its provision for interconnection pattern programming alleviates many of the difficulties encountered in parallel program development. This simplification is achieved in two ways. First, the rigidity of a fixed interconnection structure is no longer an obstacle when one wants to program an algorithm that uses a different interconnection pattern. And secondly, there is a clean separation between routing the data and programming the activity of the PE's.
Additionally we have demonstrated that interconnection programming is an interesting and challenging activity. We have shown that locality can be increased by careful study of the torus. We have shown that it is possible to embed the complete binary tree to achieve essentially complete PE utilization. The result involves an interesting assignment of spare PE's. And we have shown that there are general techniques (e.g., corridor lacing) to be found.
Acknowledgments
It is a pleasure to thank Ching C. Hsiao for his original use of lacing and Paul McNabb for developing the software to produce these embeddings and for stimulating discussions of the binary tree embedding.
Thanks are due to Paul Morrissett for programming the torus and lacing figures and to Julie Hanover for excellent manuscript preparation.
Compound octagon-square lattice Chengtu, Szechwan, 1825 A.D.
