Considerations for Multiprocessor Topologies by Delagi, Bruce A. & Byrd, Gregory T.
https://ntrs.nasa.gov/search.jsp?R=19990014293 2020-06-15T22:02:58+00:00Z
January 1987                Report No. STAN-CS-87-1144
Also numbered KSL-87-07
Considerations for Multiprocessor Topologies
by
Gregory T. Byrd and Bruce A. Delagi
Department of Computer Science
Stanford University
Stanford, CA 94305
Knowledge Systems Laboratory
Report No. KSL-87-07
January 1987
Considerations for Multiprocessor Topologies
Gregory T. Byrd†
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
Bruce A. Delagi
Worksystems Engineering Group
Digital Equipment Corporation
Maynard, MA 01754
This work was supported by DARPA Contract F30602-85-C-0012, NASA Ames
Contract NCC 2-220-S1, and Boeing Contract W266875.
† G. Byrd is supported by an NSF Graduate Fellowship, with additional
support provided by the EE Dept.
Abstract
Choosing a multiprocessor interconnection topology may depend on
high-level considerations, such as the intended application domain and
the expected number of processors. It certainly depends on low-level
implementation details, such as packaging and communications protocols.
We first use rough measures of cost and performance to characterize
several topologies. We then examine how implementation details can
affect the realizable performance of a topology.
1 Introduction--Design Constraints and Opportunities
The base for development of general purpose multiprocessor systems, as
for computer systems today generally, is given by the design constraints
and opportunities established by evolving semiconductor design and
manufacturing processes. The VLSI design medium brings a new perspective
on cost: switches are cheap; wires are expensive. In modern
microprocessors, communication costs dominate those associated with
logic. Power and cooling budgets are spent driving wires and,
overwhelmingly, chip area is dedicated to wiring rather than logic [17].
To an increasing degree, the dominant delays are associated with driving
lines rather than the accomplishment of logic functions per se. One
implication is that, all other things being equal, smaller, simpler
processors can be expected to have shorter operation cycles than larger,
more complex designs [18]. They are also likely to be available in a
more recent, higher performance base technology.
At the system level, the consequence of relatively expensive
communication is that performance is enhanced if the design establishes
that whenever a lot of information has to move in a short time, it does
not have to move far. Significant locality of high bandwidth links is a
goal. Among the highest bandwidth links in a computer system is that
connecting the processor and memory. Early computer systems separated
these pieces and put a bottleneck between them to accommodate the
packaging realities of the time: processors were implemented with
electronic means, memory with magnetic, and their power requirements and
EMI characteristics were best dealt with separately. There are new
realities now: close coupling of processors with local memory is
preferred.
With these design constraints in mind, we consider a multicomputer
implementation based on a set of processor/memory pairs connected by a
communications topology. Many topologies have been proposed [8] and have
been compared in terms of theoretical cost and performance measures
[16]. We argue, however, that the realizable performance of these
topologies is closely linked to details of system packaging.
2 Interprocessor Connection Topologies
Connection schemes between processing sites can be compared with respect
to their cost and performance as a function of the number of sites
connected. For a particular connection scheme, if the cost grows no
faster than the number of sites and the performance grows at least as
fast, that scheme can be described as scalable. A rough measure of cost
is the number of input-output ports required for connection. A rough
measure of performance is the number of links in the topology divided by
the largest number of links that must be traversed, and thus occupied to
accomplish a transmission, in order to get from one node in the network
to another. This indication of the bound on the number of independent,
concurrent transmissions we will call the concurrency of the network.
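The two rough measures just defined can be computed directly for small networks. The following Python sketch is illustrative only and is not part of the original report: the function names are hypothetical, and it assumes that every link end consumes exactly one port and that the "largest number of links that must be traversed" is the longest shortest route between any two nodes.

```python
from collections import deque

def hypercube(k):
    """Adjacency lists for a boolean k-cube with n = 2**k nodes."""
    return {v: [v ^ (1 << b) for b in range(k)] for v in range(2 ** k)}

def ring(n):
    """Adjacency lists for a bidirectional ring of n nodes."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def longest_path(adj):
    """Largest number of links on a shortest route between any two nodes."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # breadth-first search from src
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        worst = max(worst, max(dist.values()))
    return worst

def measures(adj):
    """Return (ports, links, longest path, concurrency) for a topology."""
    ports = sum(len(nbrs) for nbrs in adj.values())  # assumed: one port per link end
    links = ports // 2                               # each link has two ends
    path = longest_path(adj)
    return ports, links, path, links / path

for name, adj in [("6-cube", hypercube(6)), ("64-node ring", ring(64))]:
    print(name, measures(adj))
```

For the same 64 sites, the 6-cube has many more links and a much shorter longest path than the ring, which is what the concurrency measure is meant to capture.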
For some topologies, the concurrency of a network may understate
performance as actually experienced in a given application: to the
extent that there is locality of reference in transmissions, the number
of links actually traversed may be better approximated by a constant
than by some function of the number of connected sites. Network
concurrency may also overstate performance of one topology with respect
to another: to the extent that the time to traverse links is not the
same for all topologies, those that have non-uniform link costs (perhaps
due to physical distance considerations applied to the realized lengths
of links) will deliver less performance than the concurrency measure
suggests. This is because in these cases, logical adjacency due to high
dimensionality is merely apparent--embedding the topology in the
dimensionality of space available tends to incur just those expenses
related to physical distances that the topology was expected to
eliminate.
2.1 Topologies With Scalable Concurrency
Several topologies are shown in Table 1 which have scalable concurrency.
As the number of sites is increased, the network grows enough to support
the consequential additional traffic. In fact, by this measure of
performance, the last three of these four topologies scale performance
equally well. However, as will be described, there are other
considerations to weigh.
In the crossbar and completely connected topologies, the number of
ports, a first approximation to cost, grows quadratically with the
number of nodes in the network. Weighing cost and concurrency, then, we
might prefer the banyan and boolean k-cube (also known as "hypercube")
topologies.
By these measures, there does not seem to be a clear-cut choice between
the banyan and the hypercube. A more sophisticated measure of cost would
take into account the area required for laying out the topology in a
plane [11]. The banyan may have a slight edge in this category (footnote
1), but both layouts require relatively long wires, which is undesirable
if link transit time dominates switching time (footnote 2).
Footnote 1. The area required to lay out a hypercube in a plane is
O(n^2) [2], where n is the number of processors. Since "banyan" actually
denotes a class of interconnections, it is difficult to make a general
statement about its layout. However, let us consider a particular banyan
network, the omega network [10], which is log n stages of perfect
shuffle connections. The perfect shuffle has area O(n^2/log^2 n) [16],
so we would expect log n perfect shuffles to require area O(n^2/log n),
which is a slightly better bound than for the hypercube. Other types of
banyans, with different fan-in, fan-out, and connectivity
characteristics might have even smaller bounds.
Footnote 2. See Section 3.
A major difference between the two topologies is that switching and
routing are centralized at the processor in the hypercube, whereas the
switching in the banyan is distributed throughout the network. To the
extent that storage is required at the switch (as in [3]), it becomes
more economical to centralize the switch and utilize the local storage
of the processor. For this reason, we prefer the hypercube.
2.2 Topologies With Scalable Cost
There are alternative topologies not as richly connected as those just
considered. The topologies in Table 2 all have fixed degree
connectivity, so they all have scalable cost as measured by port count.
Unfortunately, none of them has scalable concurrency. So, at least among
the ten representative topologies discussed, there is no topology that
has cost-performance characteristics intrinsically superior to all the
others.
Concurrency for the ring and the bus topologies does not increase at all
as the number of processors increases. Given no guarantee of
transmission source to target locality, these seem unsuitable for
systems with a large number of processors (e.g., > 100).
The perfect shuffle and cube-connected cycles (CCC) topologies emulate
the O(log n) latency of the hypercube, but the number of links is linear
with the number of processors, so concurrency does not scale. Also, if
we measure cost in terms of layout area, the cost of the perfect shuffle
(O(n^2/log^2 n)) and CCC (O(n^2/log^2 n)) [15] do not scale and so will
not be considered further.
The tree, grid, and torus topologies all have fixed degree connectivity
and have the optimum O(n) area requirement. The tree has a slightly
better capacity measure and a lower latency bound. Note, however, that
the tree provides no alternate communication paths (useful in network
balancing and defect tolerance) and has a bottlenecking root (footnote
3). Connections might be added to provide alternate paths, but, as we
will see in the next section, physical link considerations may make the
grid or torus a better choice.
Footnote 3. We might be able to deal with this by increasing the
bandwidth of the links as we proceed toward the root, for example with
"fat trees" [12].
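To make the loss of scalable concurrency concrete, the sketch below evaluates links divided by longest path in closed form for three fixed-degree topologies at roughly 64 and roughly 1024 sites. It is an illustration under stated assumptions rather than the authors' calculation: the grid is taken to be a square four-neighbor grid without wraparound, the tree is a complete binary tree, and all function names are hypothetical.

```python
def ring(n):
    """(links, longest path) for a bidirectional ring of n sites."""
    return n, n // 2

def grid(side):
    """(links, longest path) for a side-by-side four-neighbor grid."""
    return 2 * side * (side - 1), 2 * (side - 1)

def binary_tree(depth):
    """(links, longest path) for a complete binary tree of the given depth."""
    nodes = 2 ** (depth + 1) - 1
    return nodes - 1, 2 * depth        # leaf to leaf through the root

def concurrency(links, longest_path):
    """Links divided by the longest path, as defined in Section 2."""
    return links / longest_path

# Concurrency per site at about 64 sites and at about 1024 sites.
for label, small, large, n_small, n_large in [
    ("ring", ring(64), ring(1024), 64, 1024),
    ("grid", grid(8), grid(32), 64, 1024),
    ("tree", binary_tree(5), binary_tree(9), 63, 1023),
]:
    print(label, concurrency(*small) / n_small, concurrency(*large) / n_large)
```

In each case the concurrency available per site falls as the system grows, so performance does not keep pace with the number of sites.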
3 Link Costs--Examining The Free Lunch
Most studies of topologies assume a constant cost for link traversals as
the number of links increases. This is a useful approximation if the
time to drive and receive link signals is constant with link length and
large compared to signal transit time on the link. However, this is
increasingly not a good assumption both as the underlying feature size
of the component technology decreases and as we consider larger numbers
of sites in a system. Given a fixed circuit feature size, topologies
with scalable concurrency, as discussed in Section 2.1, suffer increased
link lengths and thus longer signal transit times--with possibly
increasing drive times--as the number of processors increases.
Alternatively, given a fixed volume of circuits in these topologies and
decreasing circuit feature size, the number of processors in the system
increases but so does the ratio between link lengths and feature size.
Thus relative to the circuit delay times, which are dependent on (and
decrease with) circuit feature size, the link transit times become
increasingly a more important consideration (footnote 4).
Topology has to be viewed as a dependent variable determined principally
by the packaging technology of the system. As an example, consider the
recursive-H layout for the binary tree (Figure 1) under the assumption
that link transit time dominates switching time. Now consider the grid
in Figure 2, which can be laid out in the same area. If transit times
dominate, then shorter links and more switching sites will likely
shorten the point-to-point communications cycle time and improve the
realized capacity of the network (footnote 5). Furthermore, additional
data paths allow dynamic routing of messages, and additional computing
resources make the grid potentially more powerful than the tree.
Footnote 4. The dependence of communication delays on signalling lengths
as circuit feature size decreases depends on assumptions made on the
thickness and thus the resistivity of associated interconnects. Uniform
scaling leads to relative signalling times that increase quadratically
with distance [19]. Detailed analysis of the equations of voltage and
current in VLSI wire implementations (including consideration of the
non-linear characteristics of signal drivers) demonstrated linear
dependencies [1] but was done assuming that the interconnect (and field
oxide) thicknesses did not decrease at all while all other dimensions
scaled with the circuit feature size of the technology [17]. Another
approach imagines a hierarchy of interconnect of increasing thicknesses
with distance [13] to achieve signalling times that grow only with the
logarithm of the distance. Yet another approach accepts resistive links
but, given control over both minimum and maximum wire lengths and use of
high impedance receivers, notes that it is possible to counter
dispersive losses with reflective voltage doubling at the receiving end
of a point to point link [9].
Footnote 5. The assumption made here is that the message routing is
relatively independent of the computing activities at a processing site,
so there is no penalty associated with being routed at a processing site
rather than a switch.
Though the torus appears to suffer from extremely long wires which "wrap
around" the edges, a simple renumbering of the processors in a grid
brings each one within two hops of its logical neighbors (footnote 6;
see Figure 3). Thus, we can effectively create a torus by changing the
routing algorithm of a grid. Alternatively, we could keep the original
torus connections and lay out the processors as in Figure 3(b),
resulting in links which are at most twice as long as those for a grid.
In the remainder of the paper, we will speak of the grid bearing in mind
the construction of the torus in these terms.
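One way to picture such a renumbering is to interleave the two halves of each ring of the torus along a line of physical slots. The sketch below shows this for a single dimension; applying it independently to rows and columns gives a two-dimensional layout. The mapping is a common folding scheme assumed here for illustration and is not necessarily the exact numbering of Figure 3; the function names are hypothetical.

```python
def folded_position(i, n):
    """Physical slot of logical ring node i when n nodes are interleaved.

    The first half of the ring takes the even slots left to right; the
    second half takes the odd slots right to left.
    """
    return 2 * i if 2 * i < n else 2 * (n - 1 - i) + 1

def max_neighbor_distance(n):
    """Largest physical distance between logically adjacent ring nodes."""
    pos = [folded_position(i, n) for i in range(n)]
    return max(abs(pos[i] - pos[(i + 1) % n]) for i in range(n))

print([folded_position(i, 8) for i in range(8)])  # slot of each logical node
print(max_neighbor_distance(8))                   # no neighbor is more than 2 slots away
```

Every logical neighbor, including the pair joined by the wraparound connection, ends up at most two physical slots away, which matches the "within two hops" and "at most twice as long" statements above.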
4 A Packaging Example
We are now faced with two topologies: one with scalable
performance--the hypercube--and one with scalable cost--the grid. The
arguments presented above suggest that, all else being equal, the
communication cycle time for the hypercube would be greater than that of
the grid, due to its long links. Even so, the average message latency of
the hypercube may still be smaller, due to its high connectivity. To get
a better understanding of the relative performance of the two systems,
we should examine how they might actually be implemented in near-future
technology.
In the mid-1990's we would expect a 0.5-µm MOS fabrication process to be
available [7]. We will assume that the complexity of our processor is
comparable to today's typical 32-bit microprocessor. The MicroVAX 78032
chip [4], for example, is implemented in 3-µm technology; it measures
about 8.5 mm on a side. Using 0.5-µm technology, we could expect a
similar processor to require around 1.5 mm on a side. Let us allow 256K
bytes (2M bits) of local memory for our processor. Fujitsu's megabit RAM
using 1.4-µm technology takes 54.7 mm^2 [6]. If the dimensions of the
Fujitsu chip are about 10 mm by 5.5 mm, then a 0.5-µm version would be
3.6 mm by 2.0 mm. Two of these (since we want 2M bits) would be around
3.6 mm by 4 mm. As an approximation, then, each processing element,
including a processor, 256K bytes of local memory, and switching and
routing circuitry could be expected to fit onto a 5 mm x 5 mm piece of
silicon.
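The estimates above follow from scaling each linear dimension in proportion to the feature size. The short sketch below reproduces that arithmetic; the proportional-scaling rule is the simplifying assumption implied by the text, and the function and variable names are hypothetical.

```python
def shrink(length_mm, old_feature_um, new_feature_um):
    """Scale a linear dimension in proportion to the feature size."""
    return length_mm * new_feature_um / old_feature_um

cpu_side = shrink(8.5, 3.0, 0.5)    # MicroVAX-class processor, 3 um -> 0.5 um
ram_w = shrink(10.0, 1.4, 0.5)      # megabit RAM, 1.4 um -> 0.5 um
ram_h = shrink(5.5, 1.4, 0.5)

# One processor plus two megabit RAM arrays (2M bits of local memory).
area = cpu_side ** 2 + 2 * ram_w * ram_h

print(round(cpu_side, 2), round(ram_w, 2), round(ram_h, 2), round(area, 1))
```

The combined processor and memory area comes to roughly 16 mm^2, which leaves room within a 5 mm x 5 mm (25 mm^2) site for switching and routing circuitry.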
Even as devices shrink, die sizes continue to grow. By the mid-90's, the
state-of-the-art chips may be as large as 15 mm on a side. Each chip
would be expected to have 400-600 I/O pads [14]. Therefore, we could put
up to nine processing sites on a single die.
Footnote 6. This approach is attributed to R. Zippel.
The dice could be flip-mounted on a silicon [5] or ceramic [9] substrate
with thin-film transmission lines and integrated capacitors. In [9], the
maximum length for 5-µm-thick lines is around 20 cm, so we will assume a
10x10 cm module size, on which we can easily place up to 36 dice. We
will assume on the order of 1000 I/O pins per module [5].
Consider first packaging a (32x32) 1024-element octal grid, in which
each processor is connected to eight neighbors. With nine processors
(arranged as a 3x3 grid) on a die, 32 (bi-directional) communication
links must come off the chip through the I/O pads, so no more than 18
pads could be used per channel. A module can carry 324 processors,
arranged as an 18x18 grid. The entire system, then, could fit on four
modules (with room to spare). The communications links from two sides of
the 18x18 grid (105 bidirectional channels) must go off-module. Thus,
each channel could use 10 pins--one pin for clock and status information
and four for data, in each direction.
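The link counts quoted for the octal grid can be checked by enumeration. The sketch below counts links that leave a square tile of processors; it assumes the module sits at a corner of the system, so that only two of its sides have neighbors beyond them. The helper name and the `exists` predicate are hypothetical.

```python
# The eight neighbor offsets of an octal grid (including diagonals).
OCTAL = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def leaving_links(k, exists):
    """Count links from a k-by-k tile to octal-grid neighbors outside it.

    exists(x, y) says whether a site outside the tile is actually present.
    """
    count = 0
    for x in range(k):
        for y in range(k):
            for dx, dy in OCTAL:
                nx, ny = x + dx, y + dy
                inside = 0 <= nx < k and 0 <= ny < k
                if not inside and exists(nx, ny):
                    count += 1
    return count

# A 3x3 die surrounded by neighbors on all four sides.
off_chip = leaving_links(3, lambda x, y: True)
# An 18x18 module with neighbors beyond only two adjacent sides.
off_module = leaving_links(18, lambda x, y: x >= 0 and y >= 0)

print(off_chip, off_module, 1000 / off_module)
```

With about 1000 module pins shared among the off-module channels, this gives a little under 10 pins per channel, in line with the 10-pin channel described above.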
Now consider a 1024-element hypercube (a "10-cube"). To allow for more
complex wiring and easier packaging, we will assume that each die
contains eight processors, and each module will hold 32 dice, for a
total of 256 processors per module. (Extra space might be used to
provide redundant processors for fault tolerance.) Again, only four
modules are required to package all 1024 processors. Each processor has
ten bidirectional links to its logical neighbors. If the eight
processors on a die are wired as a 3-cube, then seven channels from each
processor must go off-chip. Five of these channels are connected to
other processors on the same module, but two must go off the module.
With only ~1000 I/O pins for 512 bidirectional channels, it appears that
a 1-bit combined control/data stream is all that can be supported for
the hypercube communications. If we decrease the number of processors
per die to four (and possibly add more memory), we can use separate
wires for control and data but the wires will be longer.
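The hypercube channel counts follow from how many cube dimensions are contained within each level of packaging. The sketch below restates that arithmetic, assuming each die holds a complete 3-cube and each module a complete 8-cube; the function name and its parameters are hypothetical.

```python
def hypercube_channels(total_dims, die_dims, module_dims):
    """Channel counts for a hypercube split into sub-cubes by packaging level.

    Returns (off-chip channels per processor,
             off-module channels per processor,
             off-module channels per module).
    """
    off_chip_per_proc = total_dims - die_dims
    off_module_per_proc = total_dims - module_dims
    procs_per_module = 2 ** module_dims
    return (off_chip_per_proc,
            off_module_per_proc,
            procs_per_module * off_module_per_proc)

off_chip, off_module, module_channels = hypercube_channels(10, 3, 8)
pins_per_channel = 1000 / module_channels   # about 1000 module pins assumed

print(off_chip, off_module, module_channels, round(pins_per_channel, 2))
```

Fewer than two pins are available per bidirectional channel, which is why only a 1-bit combined control/data stream appears feasible.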
Note that in both cases the module pin-out is the limiting factor for
channel width, rather than the chip pin-out. If more off-module I/O pins
are available, things will look better, but there will still be around a
5-to-1 ratio of the number of required off-module channels in the
hypercube as compared to the grid. As mentioned before, the average
interconnect length for the grid will be much shorter than that for the
hypercube. Therefore, the grid offers shorter (i.e., faster) and wider
communication paths than the hypercube when implemented in projected
near-future technology.
5 Beyond Topology
As the previous example indicates, the electrical and physical
characteristics of the circuit packaging in a system may dictate the
scheme used to wire the nodes together. In addition, the communications
protocol, that is, the actual signalling on the links, is an important
component of achievable performance. There are many relevant
details--for example:
Dynamic routing, selecting available links as needed, is useful in
balancing load and thus allows more of the communication resources of
the system to be well used throughout a computation.
Cut-through routing, making a routing decision on the fly as a packet is
received, reduces buffer requirements in the system and minimizes
latency experienced in network transit.
Local flow control, signalling transmission delays back to the source
based on local blockage information, together with single "word"
buffering and transmission validation at each network input and output
port, allows the source to complete a validated transmission in a time
that does not depend on the size of the network.
Point to point multicast, sending (approximately) the same packet to
multiple targets using common resources to the largest degree
possible--coupled with dynamic, cut-through routing, flow control, and
word level buffering and transmission validation--provides "virtual
busses" precisely as and when they are needed.
A point-to-point protocol utilizing these mechanisms is described in
[3].
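The latency benefit of cut-through routing can be illustrated with a common first-order model for an unloaded network. This model is an assumption introduced here for illustration and is not taken from the report or from the protocol in [3]; the packet and header sizes are made-up values, while the 4 bits per cycle and the 31-hop path correspond to the four data pins per direction and the 32x32 octal grid of Section 4.

```python
def store_and_forward(hops, packet_bits, bits_per_cycle):
    """Cycles when every node receives the whole packet before forwarding."""
    return hops * (packet_bits / bits_per_cycle)

def cut_through(hops, packet_bits, header_bits, bits_per_cycle):
    """Cycles when each node forwards as soon as the header has arrived."""
    return hops * (header_bits / bits_per_cycle) + packet_bits / bits_per_cycle

# Hypothetical example: 31 hops, 256-bit packet, 16-bit header, 4 bits/cycle.
print(store_and_forward(31, 256, 4))
print(cut_through(31, 256, 16, 4))
```

Under these assumptions the cut-through latency is dominated by the packet length rather than by the product of packet length and distance, and only the header needs to be held at each intermediate node.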
6 Conclusion
Communications performance of practical systems depends first of all on
available packaging technology and second on protocol considerations. No
topology considered here has both scalable cost and performance, so the
topology chosen must be in the context of the number of processors
targeted. For a thousand processors or so, given the assumptions on
mid-1990's technology discussed earlier, the grid (or torus) seems an
appropriate choice. The performance of the grid will depend on the
signalling protocol and will be best predicted through application
simulations detailed enough to reflect design decisions made at that
level.
References
[1] G. Bilardi, M. Pracchi, and F. P. Preparata. A critique and an
appraisal of VLSI models of computation. In H. T. Kung, B. Sproull, and
G. Steele, editors, VLSI Systems and Computations, pages 81-88, Computer
Science Press, Inc., Rockville, MD, 1981.
[2] G. Brebner. Relating routing graphs and two-dimensional grids. In
P. Bertolazzi and F. Luccio, editors, VLSI: Algorithms and
Architectures, pages 221-231, Elsevier Science Publishers B.V.,
Amsterdam, 1985.
[3] G. T. Byrd, R. Nakano, and B. A. Delagi. A Point-to-point Multicast
Communications Protocol. Technical Report KSL-87-02, Knowledge Systems
Laboratory, Stanford University, January 1987.
[4] D. W. Dobberpuhl, R. M. Supnik, and R. T. Witek. The MicroVAX 78032
chip, a 32-bit microprocessor. Digital Technical Journal, (2):12-23,
March 1986.
[5] Capt. B. J. Donlan, J. F. McDonald, R. H. Steinvorth, M. K. Dodhi,
G. F. Taylor, and A. S. Bergendahl. The wafer transmission module. VLSI
Systems Design, 7(1):54-58, 88-90, January 1986.
[6] Electronic News, July 1, 1985.
[7] C. K. Lau, et al. A high performance half-micron gate CMOS process
for VLSI. In Proceedings of the 1985 International Conference on
Computer Design: VLSI in Computers, IEEE, October 1985.
[8] T. Feng. A survey of interconnection networks. Computer, 12-27,
December 1981.
[9] C. W. Ho, D. A. Chance, C. H. Bajorek, and R. E. Acosta. The
thin-film module as a high-performance semiconductor package. IBM
Journal of Research and Development, 26(3):286-296, May 1982.
[10] D. H. Lawrie. Access and alignment of data in an array processor.
IEEE Transactions on Computers, C-24(12):1145-1155, December 1975.
[11] C. E. Leiserson. Area-Efficient Graph Layouts (for VLSI). Technical
Report CMU-CS-80-138, Carnegie-Mellon University, August 1980.
[12] C. E. Leiserson. Fat-trees: universal networks for
hardware-efficient supercomputing. In Proceedings of the 1985
International Conference on Parallel Processing, pages 393-402, IEEE,
1985.
[13] C. Mead and M. Rem. Minimum propagation delays in VLSI. In Caltech
Conference on VLSI, pages 433-439, January 1981.
[14] D. Nelsen. Personal Communication.
[15] F. P. Preparata and J. Vuillemin. The cube-connected cycles: a
versatile network for parallel computation. Communications of the ACM,
24(5):300-309, May 1981.
[16] D. A. Reed and H. D. Schwetman. Cost-performance bounds for
multimicrocomputer networks. IEEE Transactions on Computers,
C-32(1):83-95, January 1983.
[17] C. L. Seitz. Ensemble architectures for VLSI--a survey and
taxonomy. In 1982 Conference on Advanced Research in VLSI, MIT, January
1982.
[18] C. L. Seitz. Experiments with VLSI ensemble machines. Journal of
VLSI and Computer Science, 1(3), 1984.
[19] C. L. Seitz. Self-timed VLSI systems. In Caltech Conference on
VLSI, pages 345-355, January 1979.
Topology                   Number of Ports   Longest Path   Concurrency
Completely connected       O(n^2)            O(1)           O(n^2)
Crossbar                   O(n^2)*           O(1)           O(n)
Banyan                     O(n log n)        O(log n)       O(n)
Boolean k-cube (n = 2^k)   O(n log n)        O(log n)       O(n)
*The number of links is O(n).
Table 1: Scalable Concurrency Topologies. [n = # processors]
Topology                # of Ports   Longest Path   Concurrency   Area
Ring                    O(n)         O(n)           O(1)          O(n)
Global bus              O(n)         O(1)           O(1)          O(n)
Perfect shuffle         O(n)         O(log n)       O(n/log n)    O(n^2/log^2 n)
Cube-connected cycles   O(n)         O(log n)       O(n/log n)    O(n^2/log^2 n)
Binary tree             O(n)         O(log n)       O(n/log n)    O(n)
Grid/Torus              O(n)         O(sqrt n)      O(sqrt n)     O(n)
Table 2: Scalable Cost Topologies. [n = # processors]
Figure 1: Recursive-H binary tree.
Figure 2: Two-dimensional grid.
Figure 3: Torus (a) and renumbered grid (b).
