Performance and Analysis of Segmented Multiple Bus Systems. by Kim, Jungjoon
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1997
Performance and Analysis of Segmented Multiple
Bus Systems.
Jungjoon Kim
Louisiana State University and Agricultural & Mechanical College
Recommended Citation
Kim, Jungjoon, "Performance and Analysis of Segmented Multiple Bus Systems." (1997). LSU Historical Dissertations and Theses. 6497.
https://digitalcommons.lsu.edu/gradschool_disstheses/6497
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
PERFORMANCE AND ANALYSIS OF
SEGMENTED MULTIPLE BUS SYSTEMS
A Dissertation
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering
by
Jungjoon Kim
B.S., Kyungpook National University, 1981
M.S., Korea Advanced Institute of Science and Technology, 1983
August 1997
UMI Number: 9808754
UMI Microform 9808754
Copyright 1997, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized 
copying under Title 17, United States Code.
UMI
300 North Zeeb Road 
Ann Arbor, MI 48103
Acknowledgments
I would like to express my sincere appreciation to my major advisor, Dr. Ahmed
El-Amawy, who has contributed significantly to my professional development by
shaping my academic goals, motivating me and supporting my efforts. I am
thankful for his insights, suggestions and patient supervision throughout this work.
My sincere appreciation is extended to my committee members, Dr. Alexander
Skavantzos, Dr. Ramachandran Vaidyanathan, Dr. Gilsik Lee, Dr. Si-Qing
Zheng and Dr. Roger Seals for their encouragement, advice and valuable comments
throughout this research. I also thank all Electrical Engineering Department staff
for their assistance and support.
I would like to remember all pleasant memories at LSU. I am grateful to
wonderful friends who made my life here enjoyable and endurable. I am also grateful
for the financial support from Korea Telecom. Thanks to them all.
I am thankful to my parents for their love and sacrifices for me, my brothers and
sister. I must add thanks to my parents in law for their prayers and support. To all
my beloved ones in Korea who have prayed for me, I am in debt. This dissertation
is theirs too.
Finally, I would like to thank my wife, Yoonkyung, for her never-ending support,
patience, encouragement and love during this long journey. My beloved children,
Seokwon and Seokhyun, deserve sharing this accomplishment. They have always
been a great source of encouragement and love.
Table of Contents
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter
1 Introduction
1.1 Review of Relevant Literature
1.1.1 Variants of Multiple Bus Systems
1.1.2 Overview of Wormhole Routing
1.1.3 Review of Relevant Performance Analysis
1.2 Motivation and Outline of Dissertation
2 Segmented Multiple Bus Systems
2.1 Preliminaries
2.2 Arbitration Mechanisms
2.2.1 Parallel Arbitration in Multiple Bus Systems
2.2.2 Arbitration Mechanism for the SMBS
2.3 Scalability of the SMBS
3 Queueing Analysis for Systems with Single Flit Buffers
3.1 Assumptions and Notation
3.2 Performance Issues
3.3 Overview of the Model
3.4 Residence Time in the Network
3.4.1 Mean Flit Residence Time at a Segment Switch
3.4.2 Average Waiting Time for a Segment Bus
3.5 Residence Time in a Processor
3.6 Residence Time in a Memory Module
3.6.1 Memory Input Queue Delays
3.6.2 Memory Output Queue Delays
4 Queueing Analysis for Systems with Infinite Flit Buffers
4.1 Overview of the Model
4.2 Residence Time in a Processor
4.3 Residence Time in the Network
4.4 Residence Time in a Memory Module
5 Numerical Results and Discussion
5.1 Model Input Parameters
5.2 Validation of the Analytical Results
5.3 Numerical Results
6 Conclusion
Bibliography
Appendix: Terminology
Vita
List of Tables
3.1 Model input parameters
5.1 Comparison of performance with simulation results: SMBS with single
flit buffers, n = 10, m = 10, b = 10, g = 5, r_m = 4 and t = 3
5.2 Comparison of performance with simulation results: SMBS with single
flit buffers, n = 10, m = 10, b = 10, g = 7, r_m = 4 and t = 3
5.3 Comparison of performance with simulation results: SMBS with infinite
flit buffers, n = 10, m = 10, b = 10, g = 5, r_m = 4 and t = 3
5.4 Comparison of performance with simulation results: SMBS with infinite
flit buffers, n = 10, m = 10, b = 10, g = 7, r_m = 4 and t = 3
List of Figures
2.1 Segmented multiple bus system
2.2 Arbitration system structure of an MBS
3.1 Relative segment bus traffic in an SMBS with 9 segments
3.2 Relative segment bus traffic in an SMBS with 9 segments with varying
locality
5.1 Processing efficiency of single flit buffer model under uniform memory
reference: n = 10, m = 10, b = 10, r_m = 4 and t = 3
5.2 Processing efficiency of single flit buffer model under uniform memory
reference: n = 20, m = 20, b = 10, r_m = 4 and t = 3
5.3 Comparison of mean response time with different flit buffer sizes: n = 10,
m = 10, b = 10, g = 9, r_m = 4 and t = 3
5.4 Comparison of processing efficiency with different flit buffer sizes: n = 10,
m = 10, b = 10, g = 9, r_m = 4 and t = 3
5.5 Processing efficiency of infinite flit buffer model under uniform memory
reference: n = 10, m = 10, b = 10, r_m = 4 and t = 3
5.6 Processing efficiency of infinite flit buffer model under uniform memory
reference: n = 20, m = 20, b = 10, r_m = 4 and t = 3
5.7 Comparison of mean response time with the two different connection
topologies (linear and ring connections): n = 10, m = 10, b = 10, g = 9,
r_m = 4 and t = 3
5.8 Comparison of processing efficiency with the two different connection
topologies (linear and ring connections): n = 10, m = 10, b = 10, g = 9,
r_m = 4 and t = 3
5.9 Effect of memory reference locality for single flit buffer model: n = 10,
m = 10, b = 10, r_m = 4, 1/r_p = 0.02 and t = 3
5.10 Effect of memory reference locality for single flit buffer model: n = 20,
m = 20, b = 10, r_m = 4, 1/r_p = 0.01 and t = 3
5.11 Effect of memory reference locality for infinite flit buffer model: n = 10,
m = 10, b = 10, r_m = 4, 1/r_p = 0.025 and t = 3
5.12 Effect of memory reference locality for infinite flit buffer model: n = 20,
m = 20, b = 10, r_m = 4, 1/r_p = 0.025 and t = 3
5.13 Mean response time with varying flit numbers per packet: n = 10, m =
10, b = 10, g = 9 and r_m = 4
5.14 Processing efficiency with varying flit numbers per packet: n = 10, m =
10, b = 10, g = 9 and r_m = 4
5.15 Mean response time of single flit buffer case with varying segment buses
per segment: n = 10, m = 10, g = 5, r_m = 4 and t = 3
5.16 Processing efficiency of single flit buffer case with varying segment buses
per segment: n = 10, m = 10, g = 5, r_m = 4 and t = 3
Abstract
The dissertation introduces a new class of bus-based systems called the Segmented
Multiple Bus System (SMBS). One of the unique characteristics of the SMBS is that
it allows the exploitation of memory reference locality even though it is a bus-based
nondirect network. This comes from the architectural interconnection feature that
the SMBS can be viewed as a large-scale multiple bus system (MBS) that has been
partitioned into smaller partitions called segments. Each such segment is in effect
a small conventional MBS whose size is chosen so as to avoid bus loading problems.
The SMBS overcomes the architectural limitations of bus-based shared memory
systems while maintaining their advantages in terms of a high degree of fault tolerance,
ease of expansion and ease of programming. In addition, SMBSs are scalable, unlike
conventional MBSs. Another interesting feature of the SMBS is that it supports
wormhole routing, which is traditionally used in direct network topologies.
We develop performance models to study the SMBS with wormhole routing.
Ours is the first attempt to adapt wormhole routing to a bus-based nondirect
network. In our performance modeling, features of both direct and nondirect networks
are incorporated. We include the effect of blocking and pipelining properties of
wormhole routing in the analysis. The bus group of each segment is modeled as a
flow equivalent service center which represents a load dependent service center. Two
performance models, one assuming single flit buffers and the other assuming
infinite flit buffers at segment switches, are developed. Using approximate Mean Value
Analysis, we evaluate performance in terms of processing efficiency and request
response time. We also simulate the two models without applying any approximations.
We report comparisons of analytical results with simulation results to support the
accuracy and appropriateness of our new performance models. The results
demonstrate a good match between simulation and analytical results and show good
scalability for the SMBS.
The approach we adopt in developing the models is comprehensive in the sense
that the models incorporate features of both direct and nondirect networks. This
makes our models easily adaptable to several other network topologies.
Chapter 1
Introduction
Massively parallel computers are considered the most promising technology for
achieving significant computational power. A great deal of attention has been paid
to the design of multiprocessor systems to achieve higher computational power. In
such large-scale systems, each processor typically comprises cache memory, local
memory and other supporting devices. The processors can execute the same
instruction, broadcast from a control unit, but operate on different data sets from
distinct data streams (single instruction stream multiple data stream, or SIMD in
short) or they can execute several independent programs simultaneously (multiple
instruction stream multiple data stream, or MIMD in short).
The performance of a multiprocessor system depends significantly on its
interconnection network. A wide variety of interconnection networks has been proposed
for multiprocessor systems [1]. They can be classified into crossbar networks [2],
single bus and multiple bus systems [3, 4], multistage interconnection networks [1, 5]
and direct (static) networks [6].
In a direct network architecture, each node has a point-to-point connection to a
set of neighboring nodes. Because nodes do not physically share memory, they must
communicate by passing messages over the interconnection network. Systems used
in this manner have been referred to as message passing multicomputers. The most
commonly used direct networks are variants of the k-ary n-cube [6]. Examples of
k-ary n-cubes include the ring (n = 1), 2-D mesh or torus (n = 2), 3-D mesh or torus
(n = 3) and hypercube (k = 2). These direct networks have been popular
architectures for constructing massively parallel computers because they scale well; that
is, as the number of nodes in the system increases, the system gains corresponding
performance. Furthermore, these systems allow the exploitation of communication
locality.
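The k-ary n-cube family described above can be illustrated with a small sketch. It is not taken from the dissertation: the function names are hypothetical, nodes are represented as tuples of n base-k digits, and wraparound links are assumed (so n = 2 gives a torus rather than an open mesh).

```python
def kary_ncube_size(k, n):
    """Number of nodes in a k-ary n-cube: k nodes along each of n dimensions."""
    return k ** n


def kary_ncube_neighbors(node, k):
    """Neighbors of `node` (a tuple of n digits in base k), assuming wraparound links.

    Each dimension contributes the nodes one step up and one step down (mod k).
    For k = 2 both steps reach the same node, so a set removes the duplicate.
    """
    neighbors = set()
    for dim, digit in enumerate(node):
        for step in (-1, 1):
            other = list(node)
            other[dim] = (digit + step) % k
            if tuple(other) != node:
                neighbors.add(tuple(other))
    return sorted(neighbors)


# Ring (n = 1), 2-D torus (n = 2) and hypercube (k = 2) as special cases.
print(kary_ncube_neighbors((0,), k=8))        # [(1,), (7,)]
print(kary_ncube_neighbors((0, 0), k=4))      # [(0, 1), (0, 3), (1, 0), (3, 0)]
print(kary_ncube_neighbors((0, 0, 0), k=2))   # [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
print(kary_ncube_size(4, 2))                  # 16
```

The fixed, small number of neighbors per node is what makes these topologies static: the neighbor set depends only on k and n, not on the traffic.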
A major constraint with direct networks is their static topologies; once a
machine is built it cannot be changed. Another problem with direct networks is that
writing message passing programs has traditionally been difficult because various
system primitives should be invoked to send messages between processes executing
on different nodes. Shared memory programs are easier to develop than message
passing programs because the programmer does not need to be concerned with the
explicit movement of data. Some recent distributed shared memory designs use low
dimensional direct networks. Examples of such systems include HORIZON [7], the
Stanford DASH Multiprocessor [8], and the MIT Alewife machine [9].
However, whether a message passing direct network system or a distributed
shared memory system is used, network latency is critical to system performance.
Current second generation multicomputers are most distinguished by message
routing. One method, called wormhole routing [10], has become quite popular in recent
years. In the networks that support wormhole routing, network latency is almost
independent of path length when there is no contention and message length is
relatively large. Low dimensional meshes are popular topologies for such systems
because the negative effects of their large inter-node distance are minimized. In
fact, most direct network topologies that use wormhole routing are low-dimensional
meshes and hypercubes. The Intel Touchstone Delta [11], the Intel Paragon and
the Symult 2010 [12] use a 2D mesh, and the MIT J-machine [13] and Caltech's
Mosaic [14] use a 3D mesh. The nCube-2/3 uses a hypercube topology.
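The claim that wormhole latency is almost independent of path length can be checked with a rough, contention-free model. The sketch below is an assumption for illustration, not the dissertation's analysis: time is counted in flit cycles, each flit crosses one channel per cycle, and store-and-forward switching is used as the baseline that buffers the whole packet at every hop.

```python
def store_and_forward_latency(hops, flits):
    """Cycles when the whole packet is received at each node before moving on."""
    return hops * flits


def wormhole_latency(hops, flits):
    """Cycles when flits are pipelined behind the header.

    The header needs `hops` cycles to reach the destination; the remaining
    flits - 1 flits arrive one per cycle right behind it.
    """
    return hops + flits - 1


# A 64-flit message over a short and a long path (values chosen for illustration).
for hops in (2, 16):
    print(hops, store_and_forward_latency(hops, 64), wormhole_latency(hops, 64))
# 2 128 65
# 16 1024 79
```

Under these assumptions an eightfold increase in path length multiplies the store-and-forward latency by eight but adds only 14 cycles to the wormhole latency, which is why the distance term matters little once messages are long.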
In contrast, nondirect networks such as the crossbar switch and multistage
interconnection networks do not integrate processors and switches. Consequently,
processors cannot communicate directly with each other. Communication between
two processors in a nondirect network (as well as in multiple bus systems) is
achieved through a shared memory. The logical structure of a shared memory
multiprocessor allows multiple processors to access memory in a single global address
space. Shared memory systems are considered by many to be simpler and
more intuitive to program than message passing systems, where one has to deal with
low-level details.
Among various interconnection networks for shared memory multiprocessors,
the crossbar switch network [2] gives full connectivity such that all permutations
between processors and shared memory modules are possible. However, this
network is unsuitable for systems with a large number of processors. The cost of an
N x N crossbar switch network grows as O(N^2), where N is the number of processors
or memory modules. The crossbar is also vulnerable to failures in communication
paths because alternative paths do not exist, and so it is very often impractical.
Multistage interconnection networks [1, 5] allow a rich subset of one-to-one
simultaneous connections between processors and memory modules while reducing
the cost to O(N log N). The disadvantage of multistage interconnection networks
is that they are not easily scalable and are subject to high communication latency. A
major common disadvantage of the crossbar switch and multistage interconnection
networks is that they are not inherently fault tolerant.
A multiple bus interconnection network [3, 4] is a good alternative. It provides
a high degree of fault tolerance, cost effectiveness, ease of broadcasting and ease of
expansion. Depending on the connections of processors and memory modules to buses,
different types of multiple bus systems (MBSs in short) have been introduced
[18-23]. In the conventional MBS [4], each processor and each memory module is
connected to every bus. This is usually referred to as a full bus connection system.
This system becomes prohibitively costly when the size of the system increases. The
cost of the MBS is given by O((N + M)B), where N is the number of processors,
M is the number of memory modules, and B is the number of buses.
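The three cost expressions quoted above, O(N^2), O(N log N) and O((N + M)B), can be compared numerically. The following sketch is illustrative only: the function names are hypothetical, every constant factor is taken as 1, the logarithm is assumed to be base 2, and the choices M = N and B = 8 are arbitrary.

```python
import math


def crossbar_cost(n):
    """Order-of-growth cost of an N x N crossbar: N^2 crosspoints."""
    return n * n


def multistage_cost(n):
    """Order-of-growth cost N * log2(N) of a multistage network (constant taken as 1)."""
    return n * math.log2(n)


def full_mbs_cost(n, m, b):
    """Bus connections in a full bus connection MBS: every processor and
    every memory module is attached to every one of the b buses."""
    return (n + m) * b


# Assumed configuration for illustration: as many memory modules as processors, 8 buses.
for n in (16, 64, 256):
    print(n, crossbar_cost(n), multistage_cost(n), full_mbs_cost(n, n, 8))
# 16 256 64.0 256
# 64 4096 384.0 1024
# 256 65536 2048.0 4096
```

With a fixed number of buses the MBS cost grows linearly in N + M, but each bus then carries N + M connections, which is the bus loading issue discussed next.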
Usually a bus-based system suffers from severe problems unique to the bus itself,
although it has the natural advantages of a bus. Bus loading is a major problem.
Because the capacitive loads and drive requirements of a bus are proportional to the
number of connections to the bus, signal speed is limited by the physical capacitance
associated with the bus. Thus, the maximum number of connections per bus must
be chosen so as to avoid the bus loading problem. Therefore, the bus itself becomes
a major limitation to system scalability.
In some cases, taking advantage of the locality of communication and/or locality
of memory reference present in most applications can help in reducing the
inter-node distance [17]. While many applications exhibit locality of communication
or memory reference, a conventional MBS cannot take advantage of such locality
because a bus is a mutually exclusive resource to which all processors are connected.
In this dissertation, we introduce and investigate a new bus-based system called
the Segmented Multiple Bus System (SMBS in short), which is targeted at overcoming
the above mentioned drawbacks of bus-based shared memory systems. The SMBS
is targeted at exploiting reference locality, scalability and versatility, features that
are not found in any of the existing bus-based systems. In the following section,
we briefly review relevant literature. We describe the motivation of our research and
outline the dissertation in Section 1.2.
1.1 Review of Relevant Literature
1.1.1 Variants of Multiple Bus Systems
The major research effort in multiple bus systems has focused on reducing
connections between memory modules and buses or between processors and buses.
Several different configurations have been proposed, mostly focusing on reducing the
number of bus connections while keeping the size of the MBS still moderate.
Examples of such configurations include the partial multiple bus network [3],
processor-oriented partial multiple bus system [15], systems with certain connection schemes
like trapezoidal, rhombic, staircase, cyclic and balanced interconnection [18], partial
bus network with hierarchical request model [19], multi-level bus systems [20] and
probabilistically reduced connection multiple bus system [21].
Lang, Valero and Alegre [3] proposed a bus system called partial multiple bus
system. In the partial multiple bus network, memory modules are divided into equal
sized groups and each of these groups is connected to some equal numbered, but
different subset of buses. The processors are connected to all the buses. This partial
multiple bus structure was intended to reduce the connection cost while trading off
an acceptable degree of system bandwidth. They reported a decrease in system
bandwidth of at most 6% while reducing memory-bus connections by half compared
to the conventional full bus connection system.
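The halving of memory-bus connections can be reproduced with a simple count. The sketch below rests on assumptions not stated in the text: the parameter name `groups` is hypothetical, the bus subsets are taken to be disjoint and of size B / groups, and both M and B are assumed to divide evenly by the number of groups.

```python
def full_connections(n, m, b):
    """Connection counts when every processor and every memory module taps every bus."""
    return {"processor_bus": n * b, "memory_bus": m * b}


def partial_connections(n, m, b, groups):
    """Connection counts for a partial multiple bus network (illustrative model).

    Memory modules are split into `groups` equal groups; each group is wired
    to its own subset of b // groups buses. Processors still tap all b buses.
    """
    if m % groups or b % groups:
        raise ValueError("m and b must be divisible by groups")
    per_group = (m // groups) * (b // groups)
    return {"processor_bus": n * b, "memory_bus": groups * per_group}


# 16 processors, 16 memory modules, 8 buses, 2 memory groups (illustrative values).
print(full_connections(16, 16, 8))        # {'processor_bus': 128, 'memory_bus': 128}
print(partial_connections(16, 16, 8, 2))  # {'processor_bus': 128, 'memory_bus': 64}
```

With two groups the memory-bus count drops from M * B to M * B / 2, matching the reduction by half quoted above, while the processor-bus count is unchanged.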
In the processor-oriented partial multiple bus structure [15], bus connection
complexity is the same as that of the partial multiple bus network. In that system,
processors are partitioned into equal sized groups and are connected to different
subsets of buses, while memory modules are connected to all the buses. With
improved load balancing in the arbitration mechanism (a memory module that has
outstanding requests must be granted an available free bus), the system is claimed
to have an increase of up to 20% in system bandwidth over the partial multiple bus
system [3]. However, in the processor-oriented multiple bus system, implementation
of the load balancing arbiter is very complicated.
Lang, Valero and Fiol [18] proposed connection schemes such as rhombic,
balanced, staircase and cyclic. In these systems, not all the buses are connected to
all the memory modules; however, any B memory modules can be connected to B
buses by some connection scheme and all the buses have full processor connection.
Therefore, the throughput is the same as that of a full bus connection system.
However, the implementation of arbitration algorithms is complex and the connection
cost of a large system is still high.
The partial bus network with hierarchical request model [19] divides memory
modules into a few classes and connects more frequently addressed memory module
classes to more buses than those that are less frequently referenced. Processors are
connected to all the buses in this system as well. Even though system bandwidth is
comparable to that of partial connection multiple bus systems, the connection cost
is higher. Because of non-uniformity of connection, fault tolerance could be seriously
compromised for some memory module classes and implementation of arbiters would be complex.
In the multi-level bus system [20], processors and memory modules are divided
into a number of multi-level clusters. As the cluster level goes up, connectivity of a
bus increases and finally at the highest level, a bus has full connectivity. The total
number of connections is less than that of the partial multiple bus system [3] with
the same number of processors. The multi-level bus system becomes cost effective only
for hierarchical reference models. In other words, the multi-level bus system will
not perform well under a uniform request model.
Karim [21] proposed a probabilistically reduced connection multiple bus system.
His approach was motivated by the hypothesis that there are many connections in
conventional multiple bus systems to satisfy highly improbable request patterns; in
other words, probabilistically redundant (or under-utilized) connectivity exists. His
results show that the typical memory-bus connectivity cost reduction is in the range
of 20% to 37%, depending on the assumed request model. He used several request
models, such as the uniform request model, hot spot request model, locality-based
request model and locality-based request model with local hot spots [21]. However,
the proposed system also has full connectivity between processors and buses. The
author reported another approach to reduce the connectivity cost from both the
processor and memory sides, which did not show any significant cost improvement.
Basically, all of these partially connected systems solve the bus loading
problem to some degree. They achieve some reduction in connection cost while trading
off an acceptable and tolerable degree of performance degradation. However, these
architectures have full connectivity in the sense that either every processor is
connected to every bus or every memory module is connected to every bus. This full
connectivity between processors and buses or memory modules and buses gives rise
to serious engineering difficulties as system size increases. They also increase the
cost to an unacceptable level. Among these difficulties, the problem of bus loading
is a major one.
To solve the bus loading problem, a new class of multiple bus architectures,
called binomial multi-bus architecture, based on binary codes, was proposed [22].
This class is aimed at further reduction in the interconnection complexity. It was
shown in [22] that the reliability of this architecture class is as good as that of a full
bus connection system while the number of bus connections is almost half of that
required in the full bus connection case.
To interconnect a large number of processors using a minimum number of buses
when both the number of input and output ports on a processor and the bus fanout
are constrained, general construction techniques of systems based on Balanced
Incomplete Block Design were introduced [23]. While these multibus architectures
are aimed at further reduction in the number of connections to buses, for truly
large systems the fanout constraint cannot be overcome by resorting to these partial
configurations.
1.1.2 Overview of Wormhole Routing
The performance of a direct interconnection network depends greatly on the
performance of its communication network, so an efficient routing algorithm is
critical. Routing can be classified as deterministic or adaptive. Deterministic
routing determines the path completely from the source and destination addresses
alone. If routing is adaptive, the path for a given source and destination pair is
determined based on dynamic network conditions, such as failed or blocked channels.
In order to obtain good performance, it is desirable that a routing algorithm
use the shortest possible path for each packet. Such a routing algorithm is said to
be minimal. A non-minimal routing algorithm may use a longer path for each packet,
usually in response to dynamic network conditions. Non-minimal routing should be
used carefully so as to avoid livelock. Livelock can occur when a packet is routed
indefinitely without ever reaching its destination.
Routing algorithms can be further classified by the type of switching technique
they utilize. In store-and-forward (or packet) switching, whenever a packet arrives
at an intermediate node, the entire packet is stored in a packet buffer. The packet
then advances to the next node when the next channel is available and the next node
has an available packet buffer. In circuit switching, a physical circuit is
established exclusively for a message between the source and destination nodes.
After the message has been transmitted along the circuit to the destination, the
circuit is released. Therefore, there is no need for buffer space at each node.
In virtual cut-through [30], the packet header (which has routing information) is
examined upon arrival at an intermediate node. The packet is buffered at the
intermediate node only if the next channel is blocked. Otherwise, the packet advances
immediately without buffering. Both circuit switching and virtual cut-through are
based on the concept of cut-through, which can significantly reduce network
latency.
Although both virtual cut-through and circuit switching offer significantly
reduced network latencies when there is no contention, virtual cut-through requires
large buffers to store blocked packets, and in circuit switching the established
physical circuit between the source and destination nodes cannot be shared with other
messages.
Wormhole routing [10], which has become quite popular in recent years, resolves
these drawbacks while offering similarly low network latency. A packet is divided into
a number of flits (flow control digits) for transmission. The flit at the head of the
packet, the header flit, has routing information. As the header flit advances along
the path, if there are no blocked channels, the remaining flits follow in a pipelined
fashion. If the header flit is blocked at some intermediate node, the trailing flits
are blocked in place, with several channels being occupied simultaneously until the
header flit can move forward. A channel once occupied by a packet is released only
by the tail flit. The pipelining property of wormhole routing can make the network
latency largely insensitive to path length if there is no contention. Another
attraction of this routing technique is that each node requires only a small first-in
first-out (FIFO) flit buffer to store a few flits for each channel. Low-dimensional
meshes and hypercubes are the most popular direct network topologies used in
wormhole routed systems because the negative effects of their large inter-node
distance are minimized.
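
The contention-free latency argument above can be made concrete with a first-order sketch. It assumes, for illustration only, that moving one flit across one channel takes one flit time and that routing decision time is negligible. The function names are hypothetical, and the formulas are the usual textbook approximations rather than results of this dissertation.

```python
def store_and_forward_latency(hops, flits):
    """Flit times to deliver a packet when every node buffers the whole packet.

    Assumption: each hop costs `flits` flit times, because the packet must be
    fully received at a node before it is forwarded.
    """
    return hops * flits


def wormhole_latency(hops, flits):
    """Flit times to deliver a packet with pipelined flits and no blocking.

    Assumption: the header flit needs `hops` flit times to reach the
    destination, and the remaining flits arrive one per flit time behind it.
    """
    return hops + flits - 1
```

Under these assumptions, for an 8-flit packet, doubling the path from 4 to 8 hops raises the wormhole estimate from 11 to 15 flit times, but raises the store-and-forward estimate from 32 to 64.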
If a packet is blocked at every intermediate node, virtual cut-through behaves
exactly the same as packet switching. On the other hand, if all of the intermediate
channels are free, virtual cut-through becomes very similar to circuit switching.
Hence, cut-through switching is advantageous when the network load is low. In the
ideal case, when flit buffer capacity is unbounded, wormhole routing is equivalent
to an optimized form of virtual cut-through. In that case, wormhole routing allows
a partially buffered packet to be forwarded as soon as its outgoing channel becomes
available.
A situation that can postpone packet delivery forever is called deadlock.
Deadlock can occur if a circular wait condition happens; a packet may wait for a network
resource while holding other resources and preventing other packets from acquiring
the held resources. In packet switching and virtual cut-through switching, the
resources involved in deadlock are buffers. Therefore, they need extra buffers for
deadlock-free routing. In circuit switching and wormhole routing, the resources are
channels. Because blocked packets hold several channels and their corresponding flit
buffers simultaneously, wormhole routing is particularly susceptible to deadlock.
Starvation is similar to deadlock, but occurs when a packet waits for an event that
can happen but never does. For example, a packet may wait forever to acquire a network
resource for which other packets are always competing successfully. Starvation is only
possible when the allocation of network resources is unfair.
Methods for deadlock-free routing generally use one of two approaches: dimension
ordering [6, 11, 12] and virtual channels [31]. Dimension ordering [6, 11, 12]
routes packets along the different dimensions in a pre-specified order. One dimension
ordering approach, LR routing [6] (also called the e-cube routing algorithm), applied
to hypercubes, routes a packet by selecting intermediate nodes such that the address
bits gradually match from the most significant bit to the least significant bit (or
left to right). The xy routing algorithm [11, 12], applied to two-dimensional meshes,
routes a packet first along the x dimension and then along the y dimension. These
dimension ordering approaches guarantee deadlock-free routing but are non-adaptive.
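
The two dimension ordering schemes described above can be sketched as follows. This is a minimal illustration, assuming hypercube nodes are labeled with n-bit integers and mesh nodes with (x, y) integer pairs; the function names and the representation of a path as a list of nodes are hypothetical choices, not taken from [6, 11, 12].

```python
def ecube_route(src, dst, n_bits):
    """LR (e-cube) routing on a hypercube: correct differing address bits
    from the most significant bit down to the least significant bit."""
    path = [src]
    cur = src
    for bit in reversed(range(n_bits)):
        mask = 1 << bit
        if (cur ^ dst) & mask:      # this address bit still differs
            cur ^= mask             # move along that dimension
            path.append(cur)
    return path


def xy_route(src, dst):
    """xy routing on a 2-D mesh: travel along x first, then along y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

Because each routine always visits the dimensions in the same fixed order, the path depends only on the source and destination, which is why these schemes are non-adaptive.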
In the other approach [31], a single physical channel is divided into multiple
disjoint virtual channels and a cyclic dependency-free routing is employed. Generally
each virtual channel has reduced bandwidth because virtual channels must share
one physical channel. However, virtual channels make adaptive routing possible
and produce flexible flow control.
1.1.3 Review of Relevant Performance Analysis
In this subsection, we review the work of several researchers on performance analysis
of asynchronous MBS's and direct networks with wormhole routing which relate to
our work.
Jackson [44] showed that if a queueing network consists of exponential queues,
its solution is separable. The term 'separable' comes from the fact that each service
center can be separated from the rest of the network and its solution is evaluated
in isolation. Then, the solution of the entire network can be formed by combining
these separate solutions. This simplifies the evaluation while ensuring accuracy.
For circuit switched MBS's, the path between a requesting processor and a memory
module is held for the entire memory operation. This simultaneous possession of a
bus and a memory module is a non-separable aspect, so the bus-memory system can
only be modeled approximately.
Asynchronous operation has been studied using Markovian queueing network
models for MBS's [33, 34]. Marsan and Gerla [33] gave an approximate solution to
a Markovian model developed under the assumption that the memory service time
and request interval are exponentially distributed. In their solution, all memory
modules and processors in the system are assumed to be identical and a uniform
memory reference model is assumed in order to reduce the size of Markov chains.
Irani and Onyuksel [34] presented a closed-form solution for a Markovian queueing
model. Their analytical results are claimed to be easily applicable to a system
of any size.
Towsley [35] developed two classes of approximate models for asynchronous
MBS's. These are the combined bus-memory model based on the flow equivalence
technique and the surrogate delays model, in which the mean bus queueing delay is
added to the mean memory holding time.
Another simple approximate queueing model of asynchronous MBS's was presented
by Yang and Zaky [36]. In this model each processor has a local memory
and thus is able to continue processing while waiting for a response from the shared
memory. All these previous works on MBS's have focused only on circuit switched
systems [33-36].
For packet switched MBS's, there are typically three different types of service
centers: processor, bus and memory service centers. The solution is separable if
these three service centers have exponential queues with exponential distribution of
service times. The bus group in a packet switched MBS can be modeled as a single
service center called a flow equivalent service center (FESC) [16]. The bus group
is called the aggregate and the remaining service centers in the system are
collectively called the complement. From the point of view of the complement, an FESC
is a single service center whose behavior is identical to that of the aggregate itself.
Consequently, the purpose of the FESC is to represent the behavior of the aggregate
numerically.
FESCs are represented in queueing network models using load dependent service
centers. A load dependent service center can be considered as a service center whose
service rate is determined by a function of the number of customers in its queue.
As viewed by the complement, this behavior corresponds to the flow of customers
out of the aggregate and into the complement. An approximation for this flow can
be obtained under the assumption that the average service rate at the aggregate
depends only on the state of the aggregate, which is defined by the number of
customers in the aggregate [16].
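
A load dependent service rate of this kind can be illustrated with a short sketch. It assumes, purely for illustration, that the aggregate is a group of identical buses fed by a single queue, each bus completing transfers at the same rate, so that the aggregate rate grows with the number of customers until every bus is busy. The function and parameter names are hypothetical.

```python
def fesc_service_rate(n_customers, n_buses, bus_rate):
    """Service rate of a load dependent center standing in for a bus group.

    Assumption: with n customers present, min(n, n_buses) buses are busy,
    each serving at `bus_rate` customers per unit time.
    """
    if n_customers < 0 or n_buses <= 0 or bus_rate < 0:
        raise ValueError("invalid arguments")
    return min(n_customers, n_buses) * bus_rate
```

With 4 buses, the rate rises linearly for 1 to 4 customers and then stays flat, which is the saturation behavior a load dependent server is meant to capture.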
Yang and Bhuyan [37] presented approximate queueing network models for both
packet switched synchronous and asynchronous MBS's. Basically, all processors,
memory modules and buses are assumed to be identical and typically a uniform
memory reference model is assumed in [37]. For the synchronous system, the analysis
is based on the assumption that each of the shared resources in the system is
represented as a single server queue. For instance, each processor has a bus access
queue and every memory module has both an input and an output queue.
For the asynchronous system, the centrally controlled bus system is modeled as
an FESC. For decentralized control of the bus system, each bus is assumed to work
independently of others. For both bus control strategies (the centralized and the
decentralized), memory access time and packet transmission time are assumed to
be deterministic.
Dally [38] analyzed k-ary n-cube networks of varying dimensions under the
assumption of a constant wire bisection width constraint, which is motivated by wiring
density limitation in VLSI. He has shown that low-dimensional networks (e.g., tori)
have lower latency and higher hot spot throughput than high-dimensional networks
(e.g., binary n-cubes) with the same bisection width.
Agarwal [39] derived simple closed-form equations taking both switch and wire
delays into account. The model includes the effects of packet size and communication
locality. Under the constraint of constant wire density and constant bisection
width, a two-dimensional mesh yields the lowest latency. However, when node delays
are taken into account, it is shown that the best network would have three or
four dimensions. Longer messages make the relative effect of network distance less
important, while shorter message lengths increase the relative influence of network
distance and tend to favor networks with more dimensions. Communication locality
enhances the attractiveness of low-dimensional networks.
Scott and Goodman [40] examined the impact of pipelined channels in k-ary
n-cube networks on the choice of dimensionality and radix. Networks
are investigated under the constant link width, constant node size and constant
bisection width constraints. Pipelined channel networks have higher optimal
dimensionality than non-pipelined channel networks. Their radixes remain roughly
constant as network size grows, decreasing slightly for some unidirectional tori and
increasing slightly for some bidirectional meshes. In [38-40], studies of k-ary
n-cubes with wormhole routing have shown that under reasonable assumptions, the
optimal dimensionality is 2 to 4.
Adve and Vernon [41] developed approximate MVA models for 2-D mesh
interconnection networks with wormhole routing. The model explicitly represents the
virtual channels used in the non-adaptive deadlock-free routing scheme developed
by Dally and Seitz [31]. Results presented in [41] showed that when processors have
multiple outstanding requests, the system does not scale well with increasing system
size because of bandwidth limitation. However, for a very high level of communication
locality, the system scales well. For the case of four outstanding requests, for
example, at least 70% to 80% of each processor's requests must be directed to its
nearest neighbors. This may correspond to unrealistic workloads.
Kim and Das [42] modeled a deadlock-free routing scheme in asynchronous
hypercubes for finding the average delay of a message in the network. Their model
can capture the probability of blocking caused by the LR wormhole routing scheme
and the random wormhole routing scheme. They modeled a set of blocked channels
as an M/M/1 queue. They claimed that their analytical results agree more closely
with simulation results than the analysis of wormhole routing first reported
by Dally [38] and that the model can capture any message destination distribution.
Hybrid switching proposed by Shin and Daniel [43] dynamically combines features
of both virtual cut-through and wormhole switching by buffering a small fraction
of blocked packets and limiting the number of links that blocked packets can
hold. This significantly reduces the buffer requirement at intermediate nodes
compared to virtual cut-through and provides higher throughput than that of a
traditional wormhole routing network.
1.2 Motivation and Outline of Dissertation
A typical bus-based system suffers from severe problems unique to the bus itself.
First, the bus is a mutually exclusive resource. Second, in MIMD systems, all
processors must participate in an arbitration phase. Third, the bus clock cycle must
be long enough so that signals can propagate through the entire length of the bus.
Fourth, the capacitive loads and drive requirements of a bus are proportional to
the number of connections to the bus. Fifth, signal speed is limited by the physical
capacitance associated with the bus.
Because of these drawbacks, the number of connections to a bus has to be limited
to a few dozen. Hence, the first objective of our research is to overcome the
architectural limitations of bus-based shared memory systems while maintaining their
advantages in terms of high degree of fault tolerance, ease of expansion and ease of
programming.
To avoid bus loading problems in large systems it is essential to allow only a
small number of connections to a bus. Segmenting a large bus system into
a number of smaller ones can be an attractive option. A large multiple bus system
could be segmented into smaller partitions, which we call segments, if these segments
can be connected such that bus loading effects are avoided. This makes it
possible to overcome most problems associated with traditional bus-based systems.
Each segment is in effect a multiple bus system whose size is chosen so as to avoid
bus loading problems. The short bus removes bus loading problems and
permits maintaining a fast bus cycle.
In a segmented system, several processors may access different memory modules
simultaneously, if they do not require an overlapping set of memory modules. For
instance, this would be the case if processors access memory modules within their
own segments. This will tend to increase the effective bandwidth of the segmented
system over that of a non-segmented bus system at a lower cost. This is the concept
underlying this dissertation.
In applications with uniform memory access running on a typical multiple
bus system, latency for all communications increases as system size scales.
Hence, mechanisms used to implement uniform memory access networks are not
scalable. The performance of those networks is quite sensitive to memory access
latency, which is the delay from the time a processor requests a memory access to the
time when data is returned from the memory module (or is stored in the module).
Unless a mechanism that is less sensitive to the memory access latency is developed,
scalability of the system will be hampered.
The hierarchical request model proposed by Chen and Sheu [19] improves the
partial multiple bus network by connecting more frequently addressed memory modules
to more buses than those that are less frequently referenced. For the multi-level
bus system proposed by S. Mahmud [20], it has been shown that a processor accesses
its nearest memory modules more frequently than others. However, these two
networks have two major drawbacks:
1. The connection complexity of certain buses is the same as that of a conventional
MBS.
2. The connection patterns lack regularity.
Both studies described above are done in the context of applications with
non-uniform request patterns running on systems which can best support uniform memory
access.
In contrast to these networks, the segmented system we propose in this work
can be classified as a non-uniform memory access network although each segment
has regular connections. With such a non-uniform memory access network, applications
whose memory reference patterns tend to favor near memory modules can run
efficiently. Therefore, the segmented system can utilize memory reference locality,
which is usually not the case in conventional bus-based systems.
This dissertation introduces a new class of systems called Segmented Multiple
Bus System (SMBS) based on the fundamental ideas described above. The SMBS
targets multiprocessor systems with increased scalability while keeping the natural
advantages of a bus-based system. The SMBS provides a global address space. All
processors can transparently access all memory locations. This makes programming
intuitive and easy. The SMBS exploits the natural advantages of the bus by using
relatively small bus segments, here called segment buses.
One of the unique characteristics of the SMBS is that it allows the exploitation
of communication locality even though it is a bus-based nondirect network. By
exploiting the locality of memory reference inherent to many applications, scalability
of the SMBS could be drastically improved. Another interesting feature of
the SMBS is that it supports wormhole routing, which is mostly used in direct network
topologies. Together, segmentation and wormhole routing could make
the system largely scalable even though it is a bus-based system. This is a major
hypothesis underlying the work we report here.
Another important aspect of this research is that we develop novel performance
models for wormhole routed SMBS's. This, to our knowledge, is the first attempt
to adapt wormhole routing to a bus-based (non-direct) network. We will therefore
incorporate features of both direct and nondirect networks in our performance
modeling. The direct network features we include are the effects of blocking and
of the pipelining property of wormhole routing. The bus system in each segment is
modeled as a flow equivalent service center (FESC) [16] representing multiple servers
with a single central queue. We should note that this central server representing
the bus system is a load dependent server. This is a unique modeling feature for
multiple bus systems which we will encounter frequently.
In our work we conjecture that packets quickly delivered to their destination will
reduce the contention between other packets for network resources and that
simultaneous communication traffic can thus be reduced. For that reason, a transient
packet which passes between segments will be given priority over others in the bus
assignment process.
The models we derive are closed multi-class queueing network models. For analysis
of this type of closed queueing network, with features that violate separable
model assumptions, we use Approximate Mean Value Analysis [16]. Our models are
very complex and feature aspects of both direct and nondirect networks with
non-product form queueing behavior. Development of such models to reflect network
behavior accurately is a major contribution of this dissertation.
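
For readers unfamiliar with mean value analysis, the basic recursion can be sketched as below. Note the limits of this illustration: it is the textbook exact MVA for a single-class, separable closed network of fixed-rate queueing centers, not the approximate multi-class models with load dependent servers and blocking that are developed in this dissertation. The function name, argument names and returned tuple are hypothetical.

```python
def exact_mva(demands, population, think_time=0.0):
    """Exact single-class MVA for a closed network of queueing centers.

    demands[k] is the total service demand of one customer at center k.
    Returns (throughput, residence_times, queue_lengths) at `population`.
    """
    queue = [0.0] * len(demands)
    residence = [0.0] * len(demands)
    throughput = 0.0
    for n in range(1, population + 1):
        # An arriving customer sees the queue lengths of the network with
        # one customer fewer (the arrival theorem).
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(residence))
        # Little's law applied to each center.
        queue = [throughput * r for r in residence]
    return throughput, residence, queue
```

For two centers with unit demand and two customers, the recursion gives a residence time of 1.5 at each center and a throughput of 2/3.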
The rest of this dissertation is organized as follows. Chapter 2 introduces the
Segmented Multiple Bus System (SMBS) and discusses its distinguishing features.
Architectural features of the SMBS such as its interconnection structure and
arbitration mechanism are described. In Chapter 3, we develop a performance model
of the SMBS with single flit buffers for finding the mean response time of a
request issued to a shared memory module in an asynchronous environment. We
state the assumptions regarding system workload and operation, and describe key
performance issues. In the model, the packet pipelining property
of wormhole routing and the blocking caused by finite flit buffers are considered. In
Chapter 4, we develop a performance model for the SMBS with infinite flit buffers.
When buffer capacity is unlimited, wormhole routing is equivalent to an optimized
virtual cut-through scheme [30]. The infinite flit buffer model does not consider the
mean residual residence time of a tail flit or the blocking delay because buffer space
at a segment switch is unlimited.
Chapter 5 describes our event-driven simulator, presents simulation results to
validate our analytical models and presents the analytical results for our
performance analysis. By comparing system performance with the single flit buffer model
against that of the infinite flit buffer model, we shall investigate how much system
performance can be enhanced by increasing flit buffer sizes. We study the
performance under a uniform memory reference pattern and then examine the impact
of memory reference locality on the mean response time and scalability. We also
address how performance changes when varying the number of segment buses and the
number of flits per packet. Given a fixed size packet, the number of flits per packet
depends on the bus width, which is a system design parameter. Our results
will show that network latency is more influenced by bus contention than by network
distance. Thus, a small number of flits per packet and a wide bus should help
to reduce the probability of contention and consequently the expected waiting time
for a segment bus. Chapter 6 summarizes the research reported in this dissertation
and presents ideas for future research.
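
The dependence of the flit count on bus width can be illustrated with a one-line calculation. The sketch assumes that one flit carries exactly one bus width of data and that a partially filled last flit still occupies a full bus cycle; the function name and the bit-based units are hypothetical.

```python
def flits_per_packet(packet_bits, bus_width_bits):
    """Number of flits needed to carry a fixed size packet.

    Assumption: a flit is one bus width wide, so the count is the packet
    size divided by the bus width, rounded up.
    """
    if packet_bits <= 0 or bus_width_bits <= 0:
        raise ValueError("sizes must be positive")
    return -(-packet_bits // bus_width_bits)  # ceiling division
```

Under these assumptions, a 128-bit packet needs 4 flits on a 32-bit bus and 2 flits on a 64-bit bus, so widening the bus shortens the time a packet holds a segment bus.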
Chapter 2
Segmented Multiple Bus Systems
In this chapter, the architecture of the Segmented Multiple Bus System (SMBS)
is introduced along with its distinguishing features. Architectural features of the
SMBS such as its interconnection and arbitration mechanism are described.
2.1 Preliminaries
The SMBS may be classified as a general purpose asynchronous MIMD machine
with shared memory. The SMBS consists of N processors and M shared memory
modules which are evenly divided into g identical segments numbered from 0 to
g - 1. Each segment is thus a relatively small multiple bus system that is composed
of n = N/g processors, m = M/g memory modules, and b segment buses. Here we
shall refer to a bus in a segment as a segment bus. The segment bus is bidirectional.
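
The partitioning just described can be sketched in a few lines. The sketch assumes that N and M are exact multiples of g, as the even division implies, and additionally assumes (this is not stated in the text) that processors and memory modules are numbered contiguously so that consecutive indices fall in the same segment. The function names are hypothetical.

```python
def segment_sizes(N, M, g):
    """Return (n, m): processors and memory modules per segment."""
    if g <= 0 or N % g != 0 or M % g != 0:
        raise ValueError("N and M must divide evenly into g segments")
    return N // g, M // g


def segment_of(index, per_segment):
    """Segment number (0 to g - 1) of a processor or memory module.

    Assumption: indices are assigned contiguously, segment by segment.
    """
    return index // per_segment
```

For example, 64 processors and 64 memory modules in 8 segments give n = m = 8, and under the contiguous numbering assumption processor 17 belongs to segment 2.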
In the SMBS we will use wormhole switching [10] (better known as wormhole
routing). At both ends of a segment bus we utilize segment switches. The segment
switch is unidirectional and is a switching device with its own flit buffer. Although
the segment bus is bidirectional, the bus behaves as a unidirectional bus when it is
involved in an inter-segment access in which a memory request (or a reply) may be
transferred to a neighboring segment through a unidirectional segment switch. The
segment switch supports wormhole switching and has intelligence so that it can
handle memory requests across segments. The segment switch checks the destination
address of a memory request (or reply) and forwards the request (or reply) to the
appropriate memory module (or processor). Aside from routing functions, segment
switches also serve the function of isolating (or buffering) adjacent segments. Thus,
in SMBS's, bus loading is not an issue if segments are not too large. In each segment,
the switches are divided into two groups of equal size but opposite direction. We
call the flit buffers in the first group of segment switches upstream flit buffers and
the flit buffers in the second group downstream flit buffers, based on the direction
of flow they support.
Figure 2.1: Segmented multiple bus system. (Figure labels: upstream flit buffer, downstream flit buffer.)
Figure 2.1 illustrates the interconnection of the SMBS, where each segment is
assumed to have a full bus connection.
We define a local memory module as a memory module directly accessible by a
segment switch to which the memory module is connected by a segment bus. Other
memory modules are referred to as non-local with respect to the given segment
switch. From a processor's point of view, memory modules connected by the segment
bus that the processor is connected to are called local and the other memory
modules are referred to as non-local.
The bus cycle is also considered the flit time. It is equal to the time needed to take
a flit from one switch to the next over a segment bus or to a local memory module.
Passing flits between two adjacent segment switches uses a handshaking protocol
similar to the one described in [32]. A dedicated single-bit Request/Acknowledge
line is associated with two adjacent segment switches. A flit is moved to the
adjacent flit buffer only if that buffer is empty.
To conserve bus bandwidth, only those buses that have a non-empty flit buffer at
the sending side and an empty flit buffer at the receiving end may participate in
the bus assignment process in any given cycle. Each memory module has an input
buffer and an output buffer. This makes it possible for memory modules to serve
requests continuously without having to wait to transmit the response packet back
to the requesting processor before serving other requests.
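
The eligibility rule for the bus assignment process can be written as a simple filter. This is only a sketch of the stated condition: the per-bus boolean lists and the function name are hypothetical, and the actual arbitration among eligible buses is not modeled here.

```python
def eligible_buses(sender_nonempty, receiver_empty):
    """Indices of buses that may take part in bus assignment this cycle.

    sender_nonempty[i]: the flit buffer on the sending side of bus i holds a flit.
    receiver_empty[i]:  the flit buffer on the receiving end of bus i is empty.
    """
    return [
        i
        for i, (has_flit, has_room) in enumerate(zip(sender_nonempty, receiver_empty))
        if has_flit and has_room
    ]
```

A bus with nothing to send, or with a full buffer at its far end, is skipped, so no bus cycle is spent on a transfer that could not complete.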
When a memory request is issued by a processor, the request will traverse segment
switches if the requested memory module is non-local or will be sent directly
to the memory module if the module is local. To make such a determination, each
segment switch should "listen" to the segment bus. However, this may cause the
segment switch to suffer the unnecessary overhead of decoding every request. In
order to resolve this problem, we assume that the request has one extra bit which
indicates whether it is an intra-segment (local) request or an inter-segment
(non-local) request. This eliminates unnecessary buffering of the request at a segment
switch.
The connection topology of the system could be a linear connection of g segments
or a ring with end-around connections. In a ring connection, each processor may
reach any distributed shared memory module by traversing at most ⌊g/2⌋ intermediate
segment switches. The physical symmetry of the system can lead to a balanced
utilization of segment buses. When segments are connected in a ring topology,
traveling along one direction may be shorter than the other. However, because segments
are connected in a ring-like topology, deadlock may occur when wormhole routing
is employed. Deadlock comes from a circular wait for network resources, as we
discussed in Chapter 1. We can solve this deadlock problem simply by making the
flit buffers unbounded. We consider only the case of infinite flit buffers in a ring-like
connection topology. However, the SMBS connected in a linear fashion would be
deadlock-free even when the size of a flit buffer is finite.
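
The distance bound for the two topologies can be checked with a small sketch. It assumes segments are numbered 0 to g - 1 as above and counts segment-to-segment hops along the shorter direction; how many segment switches each hop involves is not modeled, and the function names are hypothetical.

```python
def linear_hops(src_segment, dst_segment):
    """Hops between two segments in a linear connection."""
    return abs(src_segment - dst_segment)


def ring_hops(src_segment, dst_segment, g):
    """Hops between two segments in a ring of g segments, taking the
    shorter of the two directions."""
    forward = (dst_segment - src_segment) % g
    return min(forward, g - forward)
```

Taking the maximum of `ring_hops` over all destinations gives 4 for g = 8 and 3 for g = 7, matching the ⌊g/2⌋ bound, whereas the linear connection can require up to g - 1 hops.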
The arbiter of the system is naturally asynchronous because the SMBS will
operate in an asynchronous MIMD mode. While an asynchronous arbiter is more
suitable for the system, it is not reliable because inputs may change while the
arbiter is not in a stable state. Designs for reliable asynchronous arbiters have been
developed and synchronization of asynchronous inputs has been studied. However,
the synchronization process is still not perfect [25-28]. Details on arbiter
re liab ility  is out o f the scope of this work. In  the next section, we w ill discuss the 
arbitration scheme o f the SMBS in detail.
The SMBS is of interest for several reasons. By distributing the shared memory
modules among the segments, memory reference locality can be exploited to reduce
network traffic and network latency. Further, the segmented bus structure resolves
the bus loading problem associated with traditional multiple bus systems. Because
segment buses can be made short, with a limited number of connections, data
transfer over them can be very fast. Moreover, the overall bandwidth of the SMBS
increases proportionally with the number of segments, although increasing the
number of segments also increases latency by a corresponding amount. The
architecture still keeps many of the advantages of a traditional multiple bus
system, such as (logically) shared memory and a high degree of fault tolerance. An
SMBS that supports a very large number of processors can utilize wormhole routing,
which is less sensitive to an increase in the number of segments. This results in
a system with increased scalability. We elaborate on this aspect in Chapter 5.
2.2 Arbitration Mechanisms
2.2.1 Parallel Arbitration in Multiple Bus Systems
One of the most important design aspects of a multiple bus multiprocessor system
is the arbitration mechanism, which provides the control function for the system.
We assume that all memory modules are single-port memories, so that at most one
access can be made to a module at a time. The conventional connection in a
traditional multiple bus system (MBS) calls for all N processors and M memory
modules to be connected to all B buses. For the MBS, circuit switching is usually
assumed. Thus, the path is held until a response message returns. For such a
system the arbitration system could carry out the following functions in
parallel [24]:
•  Processor selection when simultaneous multiple requests for a shared memory
   module occur.
   This is solved by using an N-user 1-server arbiter, denoted as $N \uparrow 1$.
   The number of such arbiters functioning in parallel is M.
•  Assignment of buses from existing free buses to shared memory module requests.
   It is necessary to assign b buses ($b \le B$) to i requests ($i \le M$). Each
   request corresponds to a request made by a processor to a given memory module.
The general structure of the arbitration system in a parallel arbiter
implementation is shown in Figure 2.2. Processor selection and bus assignment are
carried out in parallel in order to reduce arbitration and assignment delay.
The processor selection function is implemented using as many $N \uparrow 1$
arbiters as memory modules. The responses of the $N \uparrow 1$ arbiters, together
with the output of the $M \uparrow B$ arbiter, make it possible to obtain the final
grant signals in parallel. Processor k requesting memory module i ($r_{ki} = 1$)
obtains the final grant signal ($A_k = 1$) if the arbiter for memory module i
selects processor k ($a_{ki} = 1$) and memory module i is assigned to bus j
($b_{ij} = 1$ for some j) in the bus arbiter.
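The grant condition described above can be sketched in a few lines of Python. This is only an illustration of the combinational logic, not the hardware design: the function name `final_grants` and the list-of-lists encoding of the signals are assumptions made for the example, and the arbiter outputs are supplied as inputs rather than computed.

```python
def final_grants(r, a, b):
    """Combine memory-arbiter and bus-arbiter outputs into grant signals.

    r[k][i] = 1 if processor k requests memory module i
    a[k][i] = 1 if the arbiter of module i selected processor k
    b[i][j] = 1 if the bus arbiter assigned module i to bus j
    Returns a list A with A[k] = 1 when processor k is granted.
    (The encoding is hypothetical; it only mirrors the signal names in the text.)
    """
    num_modules = len(b)
    grants = []
    for k in range(len(r)):
        granted = any(
            r[k][i] and a[k][i] and any(b[i])  # selected AND module has some bus
            for i in range(num_modules)
        )
        grants.append(1 if granted else 0)
    return grants


# Three processors, two memory modules, one bus.
r = [[1, 0], [1, 0], [0, 1]]   # processors 0 and 1 collide on module 0
a = [[1, 0], [0, 0], [0, 1]]   # module 0 picks processor 0, module 1 picks 2
b = [[1], [0]]                 # the single bus goes to module 0
grants = final_grants(r, a, b)
```

In this example processor 0 is granted, processor 1 loses processor selection, and processor 2 wins processor selection but obtains no bus.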
[Figure 2.2 (block diagram). Legend: $N \uparrow 1$: N-user 1-server memory
arbiter; $M \uparrow B$: M-user B-server bus arbiter; $r_{ki}$: request from
processor k to memory module i; $R_i$: request for memory module i; $a_{ki}$:
memory module i assigned to the request of processor k; $b_{ij}$: memory module i
accessed via bus j; $A_k$: grant (memory module and bus) to processor k.]
Figure 2.2: Arbitration system structure of an MBS.
2.2.2 Arbitration Mechanism for the SMBS
The SMBS is a cascaded connection of g identical segments. Each segment is similar
to a conventional multiple bus system, except that the segments are connected via
a set of segment switches. A memory request from a processor may be directed
toward a local memory module or it may be transferred to a neighboring segment
through a segment switch. The former is called an intra-segment access and the
latter an inter-segment access. In the case of an inter-segment access, the packet
for a memory request traverses (possibly several) segment switches and is
eventually delivered to the target memory module. A response packet from the
memory module takes a return path which could be different from the forward path.
We assume that a transient packet at a segment switch has priority over other
packets in bus assignment. This is intended to reduce network contention.
Instead of using a central arbitration system, which would hamper the scalability
goal for the SMBS, we employ a distributed arbitration mechanism. Each segment has
an identical arbitration system, and arbitration in each segment is independent of
that in other segments. Hence, it suffices to consider arbitration within a single
segment.
Resources within a segment consist of n processors, m shared memory modules, b
segment buses, and a total of 2b segment switches located on the boundary of the
segment. Notice that there are b segment switches on each side of the segment (2b
in total), but only half of those can access the segment buses and memory modules
in the segment. For simplicity, we assume that the number of buses in a segment is
even. The segment buses are divided into two groups: an upper half bus group and a
lower half bus group. Each bus group is associated with a bus arbiter. The upper
bus group (lower bus group) consists of the first half (second half) of the
segment buses, which are bidirectional. The segment switches, which are
unidirectional, are also divided into two groups of equal size but opposite
direction.
Two sources of conflict exist, which are the same as those in a conventional MBS.
First, more than one request can be made to the same memory module. Second, the
available buses may not be enough to accommodate all the requests. As in a
conventional MBS, the arbitration scheme in a segment employs a processor
selection function and a bus assignment function. These two functions can be
implemented partially in parallel, like those in an MBS, as we explain next.
In each segment, a two-phase arbitration scheme is employed to resolve the
conflicts mentioned above. In the first phase, a segment bus is assigned to a
segment switch if a request exists at the switch. The request could be caused by a
transient packet or a terminating one. In the second phase, both processor
selection and bus assignment are needed. In the case of an intra-segment access,
these two functions can be implemented in parallel, as would be the case in an
MBS. Processor selection is needed when simultaneous multiple requests for a
shared memory module occur. In this phase at most b outstanding requests from
segment switches (selected in the first phase) and at most n requests from local
processors can coexist. Contention for a given memory module is resolved by using
an (n + b)-user 1-server arbiter. Thus, a total of m such arbiters is needed per
segment. If the arbiter selects a request from a segment switch that passed the
first phase, the request is immediately assigned to the memory module. If the
selected memory request is from a processor, that request must obtain the final
grant from a bus arbiter in the same way as in the parallel arbitration of the MBS
discussed earlier.
Together with memory module arbitration, bus assignment for free buses is employed
in the second phase. Each segment has two (2m + n)-user b/2-server arbiters for
bus assignment: one for the upper bus group and one for the lower bus group. Each
such arbiter assigns at most b/2 buses by selecting requests from (2m + n) request
inputs. Among the 2m requests, up to m requests could come from memory response
packets and up to m requests could come from outstanding memory requests selected
by the memory arbiters. Notice that in each segment there may be (2m + n)
different request inputs for bus assignment: n for inter-segment accesses by
processors, m for requests from memory output queues for reply packets, and m for
outstanding memory requests from intra-segment accesses. However, the maximum
number of requests that could actually participate in the bus assignment part of
the second phase is (m + n), because the n inter-segment accesses from processors
and the m outstanding memory requests from intra-segment accesses must originate
at the same set of n processors.
The bus assignment process is designed so as to immediately assign a bus (if
available) to a memory module with a response packet or to a processor requesting
an inter-segment access. If a bus is assigned to an outstanding memory request,
however, this bus assignment, together with the output of the corresponding
$(n + b) \uparrow 1$ memory arbiter, is used to produce the final grant. Thus,
arbitration for an intra-segment access is done in parallel, as in the traditional
MBS. The complete arbitration system that each segment of the SMBS needs is built
with m arbiters of the (n + b)-user 1-server type and two arbiters of the
(2m + n)-user b/2-server type.
Even though the arbitration system carries out processor selection and bus
assignment in parallel, it is critical that each arbiter be fast, efficient,
simple and of modular structure. For our system we need two types of arbiters,
which we can generically denote as $N \uparrow 1$ and $M \uparrow B$.
Pearce, Field and Little [25] presented an implementation of an asynchronous
$2 \uparrow 1$ arbiter module suitable for implementing N-input arbiter trees. An
$N \uparrow 1$ arbiter can be constructed from $2 \uparrow 1$ arbiter modules as an
arbiter tree of depth $\log_2 N$. This arbiter tree is fast because the
arbitration time grows as $O(\log_2 N)$. When a $2 \uparrow 1$ arbiter receives
simultaneous requests, it alternates priorities between the two inputs to maintain
fairness of arbitration. When a multi-input arbiter tree is constructed from the
$2 \uparrow 1$ arbiter modules presented in [25], the delay of each level would be
$3\Delta$, where $\Delta$ denotes the nominal gate delay. Thus, the total delay for
an $N \uparrow 1$ arbiter tree is $3\Delta \log_2 N$. Therefore, the corresponding
delay of an $(n + b) \uparrow 1$ arbiter tree for our arbitration system is
$3\Delta \log_2(n + b)$.
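As a rough numerical illustration of the tree delay, the sketch below evaluates three gate delays per level for a given number of inputs. The function name `tree_arbiter_delay` is hypothetical, and rounding the depth up to a whole number of levels for input counts that are not powers of two is an assumption of this sketch; the text itself only states the $3\Delta \log_2 N$ form.

```python
def tree_arbiter_delay(num_inputs, gate_delay=1.0):
    """Delay of an N-input arbiter tree built from 2-input arbiter modules.

    Each tree level is assumed to cost 3 gate delays, as stated in the text.
    The depth is rounded up to a whole number of levels (an assumption made
    here so that non-power-of-two input counts are handled).
    """
    if num_inputs < 1:
        raise ValueError("num_inputs must be at least 1")
    levels = (num_inputs - 1).bit_length()  # ceil(log2(N)) for N >= 1
    return 3 * gate_delay * levels


# Example: n = 6 processors plus b = 2 switch requests per memory arbiter.
delay_8_inputs = tree_arbiter_delay(6 + 2)
```

With unit gate delay, an 8-input tree has three levels and a delay of 9 gate delays.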
Lang and Valero [29] presented implementations of $M \uparrow B$ arbiters. The
$M \uparrow B$ arbiter consists of M combinational arbiter modules that produce the
B bus assignments. The arbiter is cyclic in order to obtain a fair policy, and the
delay is proportional to M, the number of arbiter modules. Hence, the arbiter
design is simple but relatively slow. If each combinational module can be
implemented by a programmable logic array (PLA) with a delay of $3\Delta$, the
total delay of the arbiter would be about $3\Delta M$. For our arbitration system,
we can approximate the corresponding delay of a $(2m + n) \uparrow b/2$ arbiter to
be $3\Delta(2m + n)$.
Turning to the performance of the arbitration system we just described for the
SMBS, we observe that the total delay of the arbitration system is the sum of the
delays for the two phases. The delay for the first phase of the arbitration can be
ignored because the time for a bus arbiter to assign a segment bus to a segment
switch (if a request exists at the switch) is relatively small. The second phase
of the arbitration process consists of two parallel functions: processor selection
and bus assignment. Based on the above, the delay for the processor selection
function, which is implemented by $(n + b) \uparrow 1$ arbiters, is
$3\Delta \log_2(n + b)$, and the arbitration delay of a $(2m + n) \uparrow b/2$
arbiter, used for bus assignment, is $3\Delta(2m + n)$.
There are two different cases for arbitration in the second phase:
1. Processor selection and bus assignment functions in the case of an
   intra-segment access, which can be carried out in parallel; and
2. A bus assignment function only, in the case of an inter-segment access.
Hence, the worst-case delay of the second-phase arbitration is given by
$$\max\bigl(3\Delta \log_2(n + b),\; 3\Delta(2m + n)\bigr).$$
If b is smaller than n or m, the delay would be $3\Delta(2m + n)$. Thus, the
arbitration delay in each segment becomes equal to the delay of the
$(2m + n) \uparrow b/2$ arbiter.
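The worst-case expression can be checked numerically with a small sketch. The function names and the sample segment configuration are hypothetical; the two delay formulas are the ones quoted in the text, expressed in units of the nominal gate delay.

```python
import math


def memory_arbiter_delay(n, b, gate_delay=1.0):
    """Delay of an (n + b)-input memory arbiter tree: 3 * delta * log2(n + b)."""
    return 3 * gate_delay * math.log2(n + b)


def bus_arbiter_delay(n, m, gate_delay=1.0):
    """Delay of a (2m + n)-input cyclic bus arbiter: 3 * delta * (2m + n)."""
    return 3 * gate_delay * (2 * m + n)


def second_phase_delay(n, m, b, gate_delay=1.0):
    """Worst-case second-phase delay: the slower of the two parallel functions."""
    return max(memory_arbiter_delay(n, b, gate_delay),
               bus_arbiter_delay(n, m, gate_delay))


# Hypothetical segment: 8 processors, 8 memory modules, 4 segment buses.
worst = second_phase_delay(n=8, m=8, b=4)
```

For this configuration the bus arbiter dominates (72 gate delays against roughly 10.8 for the memory arbiter tree). None of the functions takes the number of segments as an argument, which mirrors the observation that arbitration delay depends only on segment size.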
In conclusion, the arbitration delay in the SMBS depends not on system size but on
the size of a segment, because arbitration at each segment works independently.
Arbitration delay is not a function of the number of segments in the system. This
is a very attractive feature of the SMBS.
2.3 Scalability of the SMBS
Each segment can be viewed as a conventional MBS. Thus, constraints on bus loading
apply to a segment in exactly the same way as they apply to an MBS. However, once
each segment has been expanded to the maximum number of connections per bus at
which a given speed is reliably maintained, we can increase system size further by
taking advantage of the topological characteristics of the SMBS. The system is a
cascaded connection of segments. For further scale-up of an SMBS with a linear
connection, segments can be added at the ends. In a ring connection of segments,
the ring can be opened and additional segments can be inserted into the ring.
An SMBS which supports wormhole routing is likely to be less sensitive to an
increase in the number of segments. Unlike a conventional MBS, our proposed
segmented bus system can take advantage of the locality of memory reference. This
gives the designer more options when it comes to system scalability. Scalability
performance issues are discussed in Chapter 5.
Chapter 3
Queueing Analysis for Systems with Single Flit Buffers
This chapter develops a performance analysis model for studying SMBSs with single
flit buffers employed at each segment switch. We first state all modeling
assumptions and discuss the selection of input model parameters in detail. Using
approximate Mean Value Analysis, we then develop equations for the mean residence
time a packet is expected to incur at each service center of the queueing network.
Residence times at three distinct service centers are evaluated: a segment switch,
a processor and a memory module.
3.1 Assumptions and Notation
Two types of messages are generated in the shared memory SMBS: a request and a
reply. A request message represents a memory read or a memory write request. A
reply message is either an acknowledgment message or a data reply message. A data
reply message carries data sent back to a processor from a memory module in
response to an earlier read request. For simplicity, each message is assumed to be
a single packet of a constant size, although our analysis can be extended to the
case of different request and reply sizes. Read and write requests are assumed to
be treated equally, in the sense that a processor remains idle after submitting a
request until it receives either data for a read request or an acknowledgment for
a write request. Thus, the total number of packets in the system at any time is
constant and is equal to the number of processors in the system. Hence, the entire
system behaves like a closed queueing network.
The model we develop is general enough to allow analysis of systems with uniform
as well as non-uniform memory reference patterns. Memory requests within a segment
are assumed to be uniformly distributed. At the segment level, however, references
are assumed to be distributed arbitrarily. Each memory module is assumed to have
an input buffer and an output buffer, so that the memory module may serve requests
continuously without having to wait to transmit reply packets back to their
requesting processors. The sizes of the memory input and output buffers are
assumed unbounded. A reply packet is assumed to be consumed as soon as it arrives
at the requesting processor.
Requests from different segments issued to a certain memory module can experience
different average response times because of a non-uniform memory reference pattern
or because of the asymmetry of the network. However, requests generated by
processors within the same segment experience the same mean response time because
the requests are indistinguishable. Hence, our proposed modeling approach employs
a closed multi-class queueing network model to represent the SMBS. The number of
packet classes is equal to the number of segments in the system. The service
centers of the model represent the processors, memory modules and segment switches
(together with their associated segment buses) of the SMBS. We make the following
additional assumptions:
1. Each processor generates request packets with exponentially distributed
   inter-request intervals with a mean of $\tau_p$.
2. Read and write accesses in a memory module require the same service interval
   and take a constant time, $\tau_m$.
3. A transient packet has priority over other packets for bus assignment. This
   assumption is based on the expectation that quickly delivered packets will
   reduce contention for network resources.
4. Arbitration time is included in a bus cycle time and a fair selection policy is
   utilized in each arbitration phase.
5. Bus service time is deterministic and is equal to a bus cycle time, which is
   also equal to the flit time.
6. The queueing discipline at each memory queue is first-come first-served (FCFS).
   When requests arrive simultaneously, they are served in random order.
As we discussed earlier, the SMBS supports wormhole routing. A segment switch
begins forwarding a packet as soon as the header flit is received, provided that
the flit buffer in the next segment switch can accept that flit and the next
segment bus is available. Thus, the flits of a packet are transmitted from one
segment switch to the next in a pipelined fashion and may occupy several segment
switches and their associated segment buses along the path from source to
destination. Only the header flit contains routing information. If the header flit
is blocked because the flit buffer in the next segment switch along the path is
full or the associated segment bus is occupied by another packet, all the trailing
flits of the blocked packet are blocked in place. Therefore, the segment switches
and segment buses occupied by the flits are also blocked. If more than one flit
can be buffered at a flit buffer, flits behind the header flit can catch up at the
flit buffer until the available buffer space is filled. At this point, they become
blocked and can continue only after the header flit is unblocked. When a header
flit arrives at a target memory module, the remaining flits will always catch up
with it at the target memory module. We assume this method of routing throughout
the dissertation.
When flit buffer capacity is unlimited, the wormhole routing scheme is equivalent
to an optimized form of virtual cut-through switching. In virtual cut-through
switching, if the header flit is blocked at a switch, the entire packet has to be
buffered before the packet is forwarded. However, under the condition of unlimited
flit buffer capacity, wormhole routing allows a partially received packet to be
forwarded as soon as the header flit can advance.
Jackson [44] showed that when a queueing network consists of exponential queues,
its solution is separable. In separable (or product form) queueing network models,
each service center can be separated from the rest of the network and its solution
can be evaluated in isolation. The solution of the entire network can then be
formed by combining these separate solutions. Our closed queueing network model,
however, has features that violate separable model assumptions, such as
non-exponential queues and simultaneous possession of resources. Therefore, we use
Approximate Mean Value Analysis [16]. This technique has been shown to provide
efficient and accurate solutions for non-separable models [16, 41, 45].

Table 3.1: Model input parameters.
Parameter    Description
g            Number of segments in an SMBS
n            Number of processors in a segment
m            Number of memory modules in a segment
b            Number of segment buses in a segment
L            Packet length
$L_f$        Length of a flit
t            Number of flits per packet
$W_g$        Segment bus bandwidth
$\tau_p$     Mean time of request intervals
$\tau_m$     Memory service time
$P_{sd}$     Probability that a request of class s is directed to a target
             memory module in segment d
Model input parameters are summarized in Table 3.1. The SMBS consists of g
identical segments numbered from 0 to g - 1. Each segment is composed of n
processors, m memory modules and b segment buses. The length of a packet is L.
Each packet consists of t flits and each flit has length $L_f$. $W_g$ denotes the
segment bus bandwidth.
As we stated earlier, we assume that memory requests within a segment are
uniformly distributed. Processors in the same segment are indistinguishable
because they all experience the same mean response time. Therefore, all requests
from processors in the same segment are of the same class. A request generated by
a processor in segment s (i.e., a class s request packet) is directed to a memory
module in segment d with probability $P_{sd}$. The subscript sd in $P_{sd}$
signifies a request of class s directed to a target memory module in segment d (in
the forward path). Similarly, the subscript ds signifies a reply to a class s
request directed to the requesting processor in segment s from the target memory
module in segment d (in the return path). If s and d are the same, the request is
an intra-segment access. If s is different from d, the request corresponds to an
inter-segment access.
Sometimes, a segment switch in segment i will simply be called segment switch i.
This convention can be applied to any system component, such as a processor, a
segment bus or a memory module. An integer parameter which appears within square
brackets (for example, s in $R[s]$ or s in $R_{\mathrm{mem}}[s]$) denotes the class
of the request. The flit number will always appear within parentheses, as in
$r_i(k)$.
3.2 Performance Issues
We shall study the performance of SMBS systems in detail under the model described
in Section 3.1. We evaluate several measures, such as response time, processing
efficiency and scalability of the SMBS, under various configurations (various
system sizes, flit buffer sizes and connection topologies) and under different
workloads (varying memory request rates and uniform or non-uniform memory
reference patterns). We begin by examining a baseline SMBS: a linear connection of
segments. We shall study baseline systems for both the single flit buffer and the
infinite flit buffer cases with a uniform memory reference workload. We shall also
study baseline systems with memory reference locality. Further, we shall study the
performance of an SMBS with a ring connection of segments with infinite flit
buffers.
The flit buffer size is a design parameter that has cost and performance
implications. The effect of flit buffer size on system cost is steadily
diminishing because the price of memory is steadily decreasing. We shall compare
system performance with single flit buffers against system performance with
infinite flit buffers. These are the two extreme cases that establish bounds on
the performance of systems with finite buffer sizes. The comparison will also help
us see how much system performance can be gained by increasing flit buffer size.
For interconnection networks with pipelined routing, like wormhole routing,
modeling the performance with larger finite buffers is a difficult problem. This
difficulty comes from the pipelining property of wormhole switching and the fact
that blocking can take place due to the finite-sized flit buffers. In the SMBS
with finite-sized flit buffers, for example, the remaining flits of a packet whose
header flit is blocked simultaneously occupy several segment buses and switches.
In addition, while the remaining flits are catching up, the header flit may
advance as soon as the next segment bus and flit buffer are available. All of this
makes it very difficult to develop an analytical model for such cases. Therefore,
for the case of an SMBS with packet-sized flit buffers we study performance using
simulation results only.
Our assumed model allows us to handle many types of memory request distributions.
Figure 3.1 shows the relative segment bus traffic of an SMBS with 9 segments under
the assumption that bus contention does not exist. Notice that segment bus traffic
is very unbalanced; the traffic is more congested in the middle than near the
ends.

Figure 3.1: Relative segment bus traffic in an SMBS with 9 segments.
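The imbalance in a linear connection can be illustrated with a simple path count. The sketch below is a proxy only and not the traffic model behind Figure 3.1: it assumes uniform references, ignores contention, and counts every ordered (source, destination) segment pair whose linear path touches a given segment, normalized by the busiest segment. The function name `relative_segment_load` is hypothetical.

```python
def relative_segment_load(g):
    """Relative load per segment in a linear chain of g segments.

    Assumption (not taken from the text): a request from segment s to
    segment d uses a bus in every segment between s and d inclusive, and
    all (s, d) pairs are equally likely. Loads are scaled so the busiest
    segment has load 1.0.
    """
    load = []
    for i in range(g):
        count = sum(
            1
            for s in range(g)
            for d in range(g)
            if min(s, d) <= i <= max(s, d)  # path from s to d touches segment i
        )
        load.append(count)
    peak = max(load)
    return [c / peak for c in load]


load_9 = relative_segment_load(9)
```

Under these assumptions the middle segment of a 9-segment chain carries the peak load while each end segment carries 17/49 of it, which is qualitatively the same pattern of congestion in the middle.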
Applications exhibiting memory reference locality can take advantage of
architectural features of the SMBS to realize performance gains. Applications with
low locality place increasing bandwidth demands on the system. An increase in the
bandwidth requirement in turn causes contention effects to become more pronounced.
Hence, applications exhibiting reasonably high locality are likely to experience
reduced network latency and contention. Figure 3.2 shows the relative segment bus
traffic for an SMBS with 9 segments when memory reference locality is considered.
Here, 75% locality means that, probabilistically, 75% of requests are directed to
local memory modules within a segment and the remaining 25% are evenly distributed
to non-local memory modules (modules outside the segment). As can be clearly seen,

Figure 3.2: Relative segment bus traffic in an SMBS with 9 segments with varying
locality.
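The locality definition above translates directly into a segment-level reference probability matrix. The following sketch builds such a matrix; the function name `locality_matrix` is hypothetical, and spreading the non-local fraction equally over the other segments assumes every segment holds the same number of memory modules, as in the identical-segment model.

```python
def locality_matrix(g, locality):
    """Build P[s][d]: probability that a class-s request targets segment d.

    A fraction `locality` of requests stays in the source segment; the
    remainder is spread evenly over the other g - 1 segments (assumes
    identical segments, so "evenly over modules" means "evenly over segments").
    """
    if g < 2:
        raise ValueError("need at least two segments")
    if not 0.0 <= locality <= 1.0:
        raise ValueError("locality must lie in [0, 1]")
    remote = (1.0 - locality) / (g - 1)
    return [[locality if s == d else remote for d in range(g)]
            for s in range(g)]


P = locality_matrix(9, 0.75)
```

For 9 segments and 75% locality, each non-local segment receives 0.25 / 8 = 0.03125 of a processor's requests, and every row of the matrix sums to one.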
3.3 Overview of the Model
The mean response time of a request issued to a target memory module will be used
as a performance measure. Processors belonging to different classes may experience
distinct response times because of the asymmetry of the network as well as,
possibly, non-uniform memory reference patterns.
Consider a request of class s for a memory module located in segment d. A
processor generates a request packet of class s, which travels along its forward
path from s to d, and receives a reply packet of the same class from the targeted
memory module in segment d. The mean residence time experienced by the request on
its forward trip to the target memory and that experienced by the reply on the
return trip depend on the request's class. Hence, the mean response time of a
request from a processor in segment s (a class s request) is
$$R[s] = R_{\mathrm{proc}}[s] + R_{\mathrm{network}}[s] + R_{\mathrm{mem}}[s], \qquad s = 0, \ldots, g-1. \tag{3.1}$$
$R[s]$ is the sum of the mean residence times (queueing and service) for a class s
request packet in the processor, $R_{\mathrm{proc}}[s]$, in the network,
$R_{\mathrm{network}}[s]$, and in the memory module, $R_{\mathrm{mem}}[s]$. In
short, we call $R[s]$ the mean response time of a class s request. The overall
mean response time is given by
$$R = \frac{1}{g} \sum_{s=0}^{g-1} R[s]. \tag{3.2}$$
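The two response-time expressions amount to a per-class sum followed by an average over classes. A minimal sketch follows; the function names and the sample residence times are hypothetical and serve only to show the bookkeeping.

```python
def class_response_time(r_proc, r_network, r_mem):
    """Mean response time of one class: processor + network + memory residence."""
    return r_proc + r_network + r_mem


def overall_response_time(per_class):
    """Overall mean response time: unweighted average over the g classes."""
    return sum(per_class) / len(per_class)


# Hypothetical residence times (in bus cycles) for a two-segment system.
per_class = [
    class_response_time(r_proc=1.0, r_network=4.0, r_mem=3.0),
    class_response_time(r_proc=1.0, r_network=6.0, r_mem=3.0),
]
overall = overall_response_time(per_class)
```

The unweighted average is appropriate here because every segment holds the same number of processors, so every class contributes equally.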
$R_{\mathrm{proc}}[s]$ denotes the mean residence time that the header flit of a
class s packet experiences at the head of a processor output queue. This includes
the mean waiting time for a segment bus plus the time for transferring the header
flit to the first segment switch (in the case of an inter-segment access) or the
time for transferring the header flit on a segment bus to a local target memory
module (in the case of an intra-segment access).
The residence time in the network is the weighted sum of the mean residence times
in the forward path (for a request packet) and in the return path (for a reply):
$$R_{\mathrm{network}}[s] = \sum_{d=0}^{g-1} P_{sd}\,\bigl(R_{sd,\mathrm{network}} + R_{ds,\mathrm{network}}\bigr), \qquad s = 0, \ldots, g-1, \tag{3.3}$$
where $\sum_{d=0}^{g-1} P_{sd} = 1$. $R_{sd,\mathrm{network}}$ denotes the average
residence time in the network of a class s request packet from the first segment
switch on the forward path to the target memory in segment d.
$R_{ds,\mathrm{network}}$ denotes the average residence time in the network of a
class s reply packet from the first segment switch in the return path to the
processor that originated the request.
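The weighted sum over destination segments can be sketched as follows. The function name, the nested-list layout and the numbers are assumptions made for illustration; `r_forward[s][d]` stands for the forward-path residence time of a class s request bound for segment d and `r_return[d][s]` for the return-path residence time of its reply.

```python
def network_residence_time(s, p, r_forward, r_return):
    """Mean network residence time for class s.

    p[s][d]         : probability that a class-s request targets segment d
    r_forward[s][d] : forward-path network residence time (s to d)
    r_return[d][s]  : return-path network residence time (d back to s)
    All three tables are hypothetical inputs; the model derives them elsewhere.
    """
    g = len(p)
    return sum(p[s][d] * (r_forward[s][d] + r_return[d][s]) for d in range(g))


# Two segments, 75% locality; diagonal entries stand for intra-segment accesses.
p = [[0.75, 0.25], [0.25, 0.75]]
r_forward = [[2.0, 6.0], [6.0, 2.0]]
r_return = [[2.0, 6.0], [6.0, 2.0]]
r_net_class0 = network_residence_time(0, p, r_forward, r_return)
```

With these sample values the class 0 network residence time is 0.75 * 4 + 0.25 * 12 = 6 bus cycles.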
The residence time at a segment switch consists of the mean delay spent in the
flit buffer waiting for a segment bus and the time for transferring the header
flit to the next switch (for intermediate segment switches), to the target memory
module (for the final segment switch in the forward path) or to the original
requesting processor (for the final segment switch in the return path).
The residence time in the network includes the catch-up times which occur both in
the forward and return paths. The catch-up time in the forward path is defined as
the mean delay until all the remaining flits reach the destination after the
header flit arrives at the target memory module. The catch-up time in the return
path is defined as the mean delay until all the remaining flits reach the
processor after the header flit returns to the requesting processor. In the case
of an intra-segment access, the residence time in the network is the sum of the
catch-up times in the forward and return paths because a packet does not traverse
any segment switches.
The residence time in a memory module is given by
$$R_{\mathrm{mem}}[s] = \sum_{d=0}^{g-1} P_{sd}\,\bigl(R_{d|s,\mathrm{mem_{in}}} + R_{d|s,\mathrm{mem_{out}}}\bigr), \qquad s = 0, \ldots, g-1. \tag{3.4}$$
$R_{d|s,\mathrm{mem_{in}}}$ is the sum of the average delay that a class s request
packet experiences in the input queue of the target memory module (in segment d)
and the memory service time, $\tau_m$. $R_{d|s,\mathrm{mem_{out}}}$ consists of two
parts. For an inter-segment access, it includes the mean queueing delay of a class
s reply packet in the memory output queue, the mean waiting time for a segment bus
at the head of the memory output queue and the time for transferring the header
flit to the first segment switch in the return path. For an intra-segment access,
it consists of the mean queueing delay at the memory output queue, the segment bus
waiting time at the head of the memory output queue and the time for transferring
the reply header flit back to its local processor.
We will also utilize another performance metric known as processing efficiency to evaluate SMBSs. Processing efficiency is defined as the fraction of time a processor is busy. If a processor generates request messages with the mean interval $\tau_p$, then the mean processor utilization of class $s$, $\rho_{proc}[s]$, can be expressed as
$$\rho_{proc}[s] = \frac{\tau_p}{\tau_p + R[s]}, \quad s = 0, \ldots, g-1, \qquad (3.5)$$
where $R[s]$ is the mean response time of a class $s$ request. Notice that the above expression does not distinguish among processors in the same class (segment), which is consistent with our model.
3.4 Residence Time in the Network
In this section, we evaluate the network residence time for the SMBS. In the case of an inter-segment access, $R_{sd,network}$ and $R_{ds,network}$ correspond to the average residence times for a packet of class $s$ in the network in the forward and return paths, respectively. The mean residence time for a class $s$ packet in the forward path consists of the following two elements:
• The sum of the mean residence times of the header flit of a class $s$ packet at all segment switches in the forward path from $s$ to $d$, $\sum_i r_{i,sd}(1)$, where $r_{i,sd}(1)$ refers to the mean residence time of the header flit of a class $s$ packet at segment switch $i$ in the forward path from $s$ to $d$.
• The catch-up time, defined as the mean delay until all the remaining flits reach the destination after the header flit has arrived at the addressed target memory module in segment $d$ ($\frac{L - L_f}{W_B}$).
Therefore, the mean residence time in the forward path can be written as
$$R_{sd,network} = \sum_i r_{i,sd}(1) + \frac{L - L_f}{W_B}.$$
The limits of the summation index in the first term depend on whether a request takes a downstream path ($d - s < 0$) or an upstream path ($d - s > 0$).
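For a forward path, the mean network residence time is given by
$$R_{sd,network} = \begin{cases} \displaystyle\sum_{i=s+1}^{d} r_{i,sd}(1) + \frac{L - L_f}{W_B} & \text{if } d - s > 0, \\ \displaystyle\sum_{i=d+1}^{s} r_{i,sd}(1) + \frac{L - L_f}{W_B} & \text{if } d - s < 0. \end{cases} \qquad (3.6)$$
Similarly, for a return path,
$$R_{ds,network} = \begin{cases} \displaystyle\sum_{i=s+1}^{d} r_{i,ds}(1) + \frac{L - L_f}{W_B} & \text{if } d - s > 0, \\ \displaystyle\sum_{i=d+1}^{s} r_{i,ds}(1) + \frac{L - L_f}{W_B} & \text{if } d - s < 0. \end{cases} \qquad (3.7)$$
In the case of an intra-segment access ($s = d$), $r_{i,sd}(1)$ and $r_{i,ds}(1)$ are equal to zero.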
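The way the forward-path residence time is assembled can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `forward_network_residence`, the callable `header_residence(i, s, d)` (standing in for the header-flit residence time at segment switch i) and the `catch_up` argument are hypothetical names introduced here, and the switch ranges follow the reading that an upstream path visits switches s+1 through d while a downstream path visits switches d+1 through s.

```python
def forward_network_residence(s, d, header_residence, catch_up):
    """Sum header-flit residence times over the visited switches, then add
    the catch-up time (hypothetical helper, not the author's code)."""
    if d > s:
        switches = range(s + 1, d + 1)   # upstream path: switches s+1 .. d
    elif d < s:
        switches = range(d + 1, s + 1)   # downstream path: switches d+1 .. s
    else:
        switches = range(0)              # intra-segment: no switch is traversed
    return sum(header_residence(i, s, d) for i in switches) + catch_up


# Example with a constant (made-up) header residence time of 2.0 per switch.
constant = lambda i, s, d: 2.0
upstream = forward_network_residence(1, 4, constant, catch_up=3.0)
local = forward_network_residence(2, 2, constant, catch_up=3.0)
```

With a constant per-switch residence time, the result depends only on the number of switches between source and destination, which makes the intra-segment case (catch-up time only) easy to check by hand.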
3.4.1 Mean Flit Residence Time at a Segment Switch
Suppose that the $k$th flit, $k > 1$, resides at segment switch $i$. Recall that a segment switch in segment $i$ will simply be called segment switch $i$. A segment switch on the path to a given destination which is traversed last by the header flit will be called the last segment switch. If the last segment switch is $k-1$ or more hops away from the current segment switch $i$, the mean residence time of the $k$th flit at the latter segment switch can be estimated as the mean residence time of the header flit at segment switch $i+k-1$ in the case of an upstream forward path or that at segment switch $i-k+1$ in the case of a downstream forward path. This method for calculating the $k$th flit's residence time by estimating the header flit's residence
time is similar to that used in [41]. This is clearly due to the pipelining property of the wormhole routing technique. Here, the number of hops is defined as the number of segment buses that a header flit must traverse to reach the destination. If the header flit has already reached the addressed target memory module, the residence time of the $k$th flit at a segment switch is the same as a single flit transfer time on a segment bus.
Let $r_{i,sd}(k)$ denote the mean residence time of the $k$th flit of a class $s$ request packet at a segment switch in segment $i$ in the forward path from $s$ to $d$. Then $r_{i,sd}(k)$ is given by
$$r_{i,sd}(k) = \begin{cases} r_{i+k-1,sd}(1) & \text{if } d \text{ is } k-1 \text{ or more hops upstream away from } i, \\ r_{i-k+1,sd}(1) & \text{if } d \text{ is } k \text{ or more hops downstream away from } i, \\ \frac{L_f}{W_B} & \text{otherwise.} \end{cases} \qquad (3.8)$$
Similarly, let $r_{i,ds}(k)$ denote the mean residence time of the $k$th flit of a class $s$ reply packet at a segment switch in segment $i$ in the return path from $d$ to $s$. $r_{i,ds}(k)$ can be expressed as
$$r_{i,ds}(k) = \begin{cases} r_{i+k-1,ds}(1) & \text{if } s \text{ is } k-1 \text{ or more hops upstream away from } i, \\ r_{i-k+1,ds}(1) & \text{if } s \text{ is } k \text{ or more hops downstream away from } i, \\ \frac{L_f}{W_B} & \text{otherwise.} \end{cases} \qquad (3.9)$$
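The pipelining rule for the forward path can be illustrated with a short Python sketch. It assumes that the hop distance between switch i and destination segment d is the difference of their segment indices; the function name `flit_residence_forward`, the callable `header_residence(i, s, d)` and the parameter `flit_time` (the single-flit transfer time on a segment bus) are hypothetical names used only for this example.

```python
def flit_residence_forward(i, k, s, d, header_residence, flit_time):
    """Residence time of the k-th flit at switch i on the forward path s -> d,
    estimated from the header flit's residence time further along the path."""
    if d > s and d - i >= k - 1:
        # Upstream path: the header is still in the network, k-1 switches ahead.
        return header_residence(i + k - 1, s, d)
    if d < s and i - d >= k:
        # Downstream path: the header is k-1 switches ahead (lower index).
        return header_residence(i - k + 1, s, d)
    # The header has already reached the target memory module.
    return flit_time


# Made-up header residence times that grow with the switch index.
header = lambda i, s, d: 10.0 * i
second_flit = flit_residence_forward(1, 2, 0, 3, header, flit_time=1.0)
late_flit = flit_residence_forward(1, 4, 0, 3, header, flit_time=1.0)
```

In the example, the second flit at switch 1 inherits the header's residence time at switch 2, whereas the fourth flit finds that the header has already left the network and only pays one flit transfer time.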
The mean residence time of the header flit of a class $s$ packet at segment switch $i$ in the forward path from $s$ to $d$, $r_{i,sd}(1)$, consists of the following three components:
1. The average waiting time for a segment bus at segment $i$.
2. The mean residual residence time of the tail flit (of a preceding packet) at the next switch along the path.
3. The time for transferring the header flit to the next segment switch: a flit time.
Note that when the header flit arrives at the last segment switch (either segment $d$ on an upstream forward path or segment $d+1$ on a downstream forward path), $r_{i,sd}(1)$ will not include the mean residual residence time of a tail flit because the flit finds an unbounded memory input queue. Similarly, on a return path, $r_{i,ds}(1)$ will not include the mean residual residence time of a tail flit when $i$ equals $s$ for upstream flow or when $i$ equals $s+1$ for downstream flow.
The calculations of those terms depend on whether the header flit takes an upstream or a downstream path. Depending on $\mathcal{D}$ ($= d - s$), the path will consist of segment buses either from the upper bus group (if $\mathcal{D} > 0$) or from the lower bus group (if $\mathcal{D} < 0$). Hence, $r_{i,sd}(1)$ can have two different values, depending on whether the header flit takes an upstream path or a downstream path. We shall use $r_{i,sd}(1)$ without an explicit indication as to whether the path is an upstream or a downstream path. When an explicit distinction is necessary, the superscripts $+$ and $-$ will be used to denote upstream and downstream flow, respectively.
Let $r_i(k)$ denote the average residence time of the $k$th flit at a segment switch in segment $i$. $u_i(k)$ is defined as the $k$th flit's mean utilization of the segment switch. Upstream flit buffers are utilized when a packet takes an upstream forward path or
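an upstream return path. Downstream flit buffers are utilized when a packet moves downstream (whether it is on a forward or a return path). At upstream flit buffer $i$, $r_i(k)$ and $u_i(k)$ are expressed as
$$r_i(k) = \frac{\displaystyle\sum_{s=0}^{g-1}\sum_{d \in D^{+}_{i,sd}} P_{sd}\, r_{i,sd}(k) + \sum_{s=0}^{g-1}\sum_{d \in D^{-}_{i,sd}} P_{sd}\, r_{i,ds}(k)}{\displaystyle\sum_{s=0}^{g-1}\sum_{d \in D^{+}_{i,sd}} P_{sd} + \sum_{s=0}^{g-1}\sum_{d \in D^{-}_{i,sd}} P_{sd}}, \qquad (3.10)$$
$$u_i(k) = \sum_{s=0}^{g-1}\sum_{d \in D^{+}_{i,sd}} \frac{n P_{sd}\, r_{i,sd}(k)}{\tau_p + R[s]} + \sum_{s=0}^{g-1}\sum_{d \in D^{-}_{i,sd}} \frac{n P_{sd}\, r_{i,ds}(k)}{\tau_p + R[s]}. \qquad (3.11)$$
At downstream flit buffer $i$, $r_i(k)$ and $u_i(k)$ are expressed as
$$r_i(k) = \frac{\displaystyle\sum_{s=0}^{g-1}\sum_{d \in D^{-}_{i,sd}} P_{sd}\, r_{i,sd}(k) + \sum_{s=0}^{g-1}\sum_{d \in D^{+}_{i,sd}} P_{sd}\, r_{i,ds}(k)}{\displaystyle\sum_{s=0}^{g-1}\sum_{d \in D^{-}_{i,sd}} P_{sd} + \sum_{s=0}^{g-1}\sum_{d \in D^{+}_{i,sd}} P_{sd}}, \qquad (3.12)$$
$$u_i(k) = \sum_{s=0}^{g-1}\sum_{d \in D^{-}_{i,sd}} \frac{n P_{sd}\, r_{i,sd}(k)}{\tau_p + R[s]} + \sum_{s=0}^{g-1}\sum_{d \in D^{+}_{i,sd}} \frac{n P_{sd}\, r_{i,ds}(k)}{\tau_p + R[s]}, \qquad (3.13)$$
where
$D^{+}_{i,sd} = \{d \mid \text{packets traveling from } s \text{ to } d \text{ will visit upstream flit buffer } i\}$,
$D^{-}_{i,sd} = \{d \mid \text{packets traveling from } s \text{ to } d \text{ will visit downstream flit buffer } i\}$.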
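The traffic-weighted average at an upstream flit buffer can be sketched as follows. The sketch rests on an assumption about which source/destination pairs use upstream flit buffer i: forward packets with s < i <= d, and reply packets of requests with d < i <= s (whose forward path was downstream, so the reply returns upstream through the same switches). The names `mean_residence_upstream`, `r_fwd`, `r_ret` and the matrix `P` of access probabilities are hypothetical and chosen only for this illustration.

```python
def forward_uses_upstream(i, s, d):
    """Forward packet s -> d passes upstream flit buffer i (assumed: s < i <= d)."""
    return s < i <= d


def reply_uses_upstream(i, s, d):
    """Reply to a class-s request returns from d to s through upstream
    flit buffer i (assumed: d < i <= s)."""
    return d < i <= s


def mean_residence_upstream(i, k, g, P, r_fwd, r_ret):
    """Probability-weighted mean residence time of the k-th flit at
    upstream flit buffer i, averaged over all (s, d) pairs that use it."""
    num = den = 0.0
    for s in range(g):
        for d in range(g):
            if forward_uses_upstream(i, s, d):
                num += P[s][d] * r_fwd(i, s, d, k)
                den += P[s][d]
            if reply_uses_upstream(i, s, d):
                num += P[s][d] * r_ret(i, s, d, k)
                den += P[s][d]
    return num / den if den > 0 else 0.0


# Three segments with uniform access probabilities and made-up residence times.
g = 3
P = [[1.0 / g] * g for _ in range(g)]
avg = mean_residence_upstream(1, 1, g, P,
                              r_fwd=lambda i, s, d, k: 2.0,
                              r_ret=lambda i, s, d, k: 4.0)
```

In this toy case two forward pairs and two reply pairs use buffer 1 with equal weight, so the average falls midway between the two residence times.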
Next, we discuss the derivation of each term in the flit residence time. For this discussion we assume that a packet is on its upstream forward path. We call a
header flit waiting for a segment bus at segment switch $i$ the tagged header flit, in order to distinguish it from others. The tagged header flit must wait for a segment bus to be available if all buses are occupied by other packets. The waiting time for a segment bus, $w_i(1)$, will be discussed later.
Once the tagged header flit acquires a segment bus, it may observe the tail flit of another packet in service at the next flit buffer (which is located at segment switch $i+1$). The residual residence time of a tail flit is a random variable whose distribution depends on the distribution of the original residence time. To estimate the residual residence time, we assume that the tagged header flit is equally likely to arrive at a flit buffer at any point during the residence time interval of a tail flit in the next flit buffer along the path. Under this assumption, the residual residence time is given by
$$\frac{r_{i+1}(\ell)}{2} + \frac{\sigma^2}{2\,r_{i+1}(\ell)},$$
where $\sigma^2$ is the variance of the residence time of a tail flit at segment switch $i+1$.
A reasonable determination of the residual residence time is not easy because the first and second moments of the residence time distribution (which is unknown) must be calculated. Hence, we assume that the residence time of each flit at flit buffer $i$ is deterministic and that its value is given by $r_i(k)$. Under this assumption, the residual residence time would be equal to one half of the residence time, i.e., $r_{i+1}(\ell)/2$. The error introduced by this assumption would be small and could be
ignored because the mean residual residence time is a small component in the mean residence time of the tagged header flit.
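The effect of the deterministic assumption can be checked numerically with the standard mean-residual-life formula quoted above. The helper name `mean_residual_time` is hypothetical, and the numbers are made up for illustration.

```python
def mean_residual_time(mean, variance=0.0):
    """Mean residual time of an interval observed at a uniformly random
    instant: mean/2 + variance/(2*mean)."""
    if mean <= 0:
        raise ValueError("mean residence time must be positive")
    return mean / 2.0 + variance / (2.0 * mean)


deterministic = mean_residual_time(4.0)                   # zero variance
exponential_like = mean_residual_time(4.0, variance=16.0)  # variance = mean**2
```

With zero variance the residual is exactly half the residence time, which is the approximation adopted in the text. When the variance equals the squared mean (as for an exponential distribution) the residual equals the full mean, so the deterministic assumption gives a lower bound on the residual term.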
The term for the mean residual residence time relates directly to the pipelining property of wormhole routing. The tagged flit may only observe the tail flit of a preceding packet in the flit buffer of the next segment switch, if it observes any. Hence, the mean residual residence time can be calculated by multiplying the mean utilization of the flit buffer by a tail flit, $u_{i+1}(\ell)$, and the mean residual residence time of a tail flit, $r_{i+1}(\ell)/2$. Note that when a segment switch in segment $i$ becomes the last segment switch on the path to the destination ($i = d$ on an upstream path and $i = d+1$ on a downstream path), $r_{i,sd}(1)$ will not include the residual residence time of the tail flit because the flit finds an unbounded memory input queue. Therefore, the residence time $r_{i,sd}(1)$ on an upstream forward path is given by
$$r_{i,sd}(1) = \begin{cases} w_i(1) + u_{i+1}(\ell)\,\dfrac{r_{i+1}(\ell)}{2} + \dfrac{L_f}{W_B} & \text{for } s < i < d, \\ w_i(1) + \dfrac{L_f}{W_B} & \text{for } i = d. \end{cases} \qquad (3.14)$$
The residence time $r_{i,ds}(1)$ on a downstream return path is given by
$$r_{i,ds}(1) = \begin{cases} w_i(1) + u_{i-1}(\ell)\,\dfrac{r_{i-1}(\ell)}{2} + \dfrac{L_f}{W_B} & \text{for } s+1 < i \le d, \\ w_i(1) + \dfrac{L_f}{W_B} & \text{for } i = s+1. \end{cases} \qquad (3.15)$$
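Similarly, the residence time $r_{i,sd}(1)$ on a downstream forward path is given by
$$r_{i,sd}(1) = \begin{cases} w_i(1) + u_{i-1}(\ell)\,\dfrac{r_{i-1}(\ell)}{2} + \dfrac{L_f}{W_B} & \text{for } d+1 < i \le s, \\ w_i(1) + \dfrac{L_f}{W_B} & \text{for } i = d+1, \end{cases} \qquad (3.16)$$
and $r_{i,ds}(1)$ on an upstream return path is given by
$$r_{i,ds}(1) = \begin{cases} w_i(1) + u_{i+1}(\ell)\,\dfrac{r_{i+1}(\ell)}{2} + \dfrac{L_f}{W_B} & \text{for } d < i < s, \\ w_i(1) + \dfrac{L_f}{W_B} & \text{for } i = s. \end{cases} \qquad (3.17)$$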
3.4.2 Average Waiting Time for a Segment Bus
In computing the average waiting time for a segment bus, we must examine the arbitration scheme. We still maintain the assumption that the tagged header flit is on its upstream forward path; that is, the flit is in an upstream flit buffer of a segment switch connected to a bus in the upper bus group. This implies that the flit will be involved in the upper bus group arbitration. The tagged flit at a segment switch is always involved in the first phase of arbitration. According to our model, if a segment bus is not occupied by another packet, that segment bus will always be available to the tagged flit. Hence, the tagged flit waits for a segment bus only when the segment bus is occupied. In such a case, the tagged flit must wait until the segment bus is released.
Let $u_{i,p_{out}}(k)$ denote the mean utilization of a processor output queue by the $k$th flit of class $i$. Also let $u_{i,m_{out}}(k)$ denote the mean utilization of a memory output queue in segment $i$ by the $k$th flit.
Now, let us find the average waiting time. Note that the waiting time will be different for different segments. We begin by examining the last segment. Suppose that the tagged header flit is blocked at the segment switch of segment $g-1$. This would occur when a local processor or a local memory module occupies the segment bus. For this case, the average waiting time is given by
w,
The two terms in the first pair of parentheses represent the probability that when the tagged header flit arrives at a segment switch, it observes its associated segment bus occupied by the $k$th flit of a memory request packet from a processor in segment $g-1$ or the $k$th flit of a reply packet from a memory module in segment $g-1$, respectively. The summation index $k$ begins with $k = 2$ because a transient packet at a segment switch has priority over other packets (i.e., the packet can only observe an already occupied segment bus).
Let the flit observed by the header flit when it arrived at a segment switch in segment $g-1$ be the $k$th flit. Then the terms in the second pair of parentheses correspond to the mean residual segment bus access time observed by the tagged flit, including the $k$th flit it observed on arrival. At segment $g-1$, the mean residual time is weighted by one half the probability of an intra-segment access, $P_{g-1,g-1}/2$.
The average waiting time for a segment bus at segment $g-2$ is expressed as
( Ug~2.Po«t(2) (  Pg-24-2 (  L j  . .
V 2 +  2 ) \  2 V2Wh+ l  V B/
n  ( r 9~ l.Cg—2)(g—1) ( 1 )  , /1  n \  ^  1
P .- * . - .  ^ j  +  (* ~  2 ) /
Suppose that the tagged header flit observes the second flit of a request or a reply packet occupying the segment bus of segment $g-2$. In the case of an intra-segment access, the mean residual bus access time of the remaining flits, including the second flit, is $\frac{L_f}{2W_B} + (\ell-2)\frac{L_f}{W_B}$. In the case of an inter-segment access, the tagged header flit will, on average, wait for a period equal to the average residence time of the header flit at segment switch $g-1$ plus the residual bus access time of the remaining flits. The last term in the above expression represents the mean time elapsed until the segment bus is released by the remaining flits, assuming that the incoming tagged flit observed the $k$th flit ($k \ge 3$) occupying the segment bus.
Before we generalize, we go one step further and express the waiting time at segment $g-3$ as
“>,-3(1) =
( Ug-3.Po» t(2 ) U j-3 ,m 0<i t ( 2 ) ^  f - f j - 3 j - 3  /  . t ) \ lL L \
V 2  +  2  /  i  2  \2W b  { ] WB)
+  / W ,  + ( ‘ - 2 ) ^ )
+ w  + 0 - 2 ) ^ ) }
+ / “g-3,Po..(3) . V 3 ,m „ , ( 3 ) \  f  f  P,-3J-3 „  \  /  L f  L f  \{  2  +  2  J 1 1 - 2 ~ +  P’- ^ - V  \ m  + ( 3)W i )
+
Generalizing, the expression for the average waiting time for a segment bus of the upper bus group at segment $i$ is thus given by
V ''  ̂( « .> « .,(*) . Ul> « l ( ^ 0  \  f  Pii (  L f . f .  » \ L f  \




k=aj . ,  ( ^  ♦ ( , £ ,  '  ♦ t )  ( a t  ■*'* -  *>fe)
(3.18)
Similarly, the expression for the average waiting time for segment bus $i$ of the lower bus group is given as
W i(l) =
£  {  2 + -----2----- j l  +
(3.19)
3.5 Residence Time in a Processor
Suppose that a memory request at the head of a processor output queue is looking for a segment bus of an upper bus group in segment $i$. This request could be an intra-segment or an inter-segment access request and must participate in the second phase of the arbitration.
In the case of an intra-segment access, the processor must undergo memory arbitration as well as bus arbitration. A request from a processor in segment $i$ participates in memory arbitration with probability $p_{i,proc|mem} = u_{i,p_{out}}(1)P_{ii}/m$. Up to $n$ such requests could participate in memory arbitration. To simplify the notation, we will drop the subscript $i$ denoting segment $i$ in the expressions to follow. If a segment switch at segment $i$ contains a flit directed to a memory module in segment $i$, then this switch must also participate in memory arbitration. The probability that such a segment switch will have to participate in memory arbitration is given by
P ivmp\mem — * - l 1 (3.20)
£  J lP .d1=0 i=i
The above probability represents the fraction of segment switch utilization contributed by terminating packets (packets terminating at the segment under consideration). In a lower bus group, the probability that a segment switch will participate in memory arbitration is given by
« .-+ i(i) £ l  ^
Prwi0\mem — J • (3.21)
£  £ P .d*=i+l d=0
We will assume that the requests from processors and segment switches are independent Bernoulli trials. Therefore, we can consider separately the distributions of the number of requests issued to local memory modules by processors and by segment switches. Let $\mu_i$ denote the number of requests issued to local memory
module $i$ ($1 \le i \le m$) by processors. We will conveniently denote the state of memory requests by processors using an $m$-tuple vector $\vec{\mu}_p = (\mu_1, \ldots, \mu_m)$ in which $0 \le \mu_i \le n$ for $1 \le i \le m$ and $0 \le \mu = \sum_{1 \le i \le m} \mu_i \le n$. Each $\mu_i$ represents the number of processor requests issued to memory module $i$ in the segment under consideration. Thus, the set of all feasible states corresponding to the number of requests issued to each of $m$ local memory modules by $n$ processors in a given segment is defined as
$$S_{\vec{\mu}_p}(m, n) = \Big\{ (\mu_1, \ldots, \mu_m) \;\Big|\; 0 \le \mu_i \le n \text{ for } 1 \le i \le m \text{ and } 0 \le \mu = \sum_{1 \le i \le m} \mu_i \le n \Big\}. \qquad (3.22)$$
Notice that $\mu$ corresponds to the total number of requests issued by processors to the $m$ memory modules in the segment under consideration. As we mentioned earlier, processor requests are modeled as Bernoulli trials. Thus, the requests follow a multinomial distribution, denoted by $P_{\vec{\mu}_p}(n)$, which can be expressed as
$$P_{\vec{\mu}_p}(n) = \frac{n!\; p_{proc|mem}^{\,\mu}\,\left(1 - m\,p_{proc|mem}\right)^{n-\mu}}{(n-\mu)!\,\prod_{1 \le i \le m} \mu_i!}. \qquad (3.23)$$
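As a sanity check on the state space and the multinomial form, the following Python sketch enumerates the feasible request states for small m and n and verifies that their probabilities sum to one. The function names `feasible_states` and `state_probability` and the numerical values are hypothetical; `p` stands for the per-processor probability of requesting one particular local memory module.

```python
from itertools import product
from math import factorial, isclose, prod


def feasible_states(m, n):
    """All m-tuples of non-negative request counts whose total is at most n."""
    return [state for state in product(range(n + 1), repeat=m) if sum(state) <= n]


def state_probability(state, n, p):
    """Multinomial probability of a request state: each of the n processors
    requests a given module with probability p, or none with 1 - m*p."""
    m, mu = len(state), sum(state)
    coeff = factorial(n) / (factorial(n - mu) * prod(factorial(x) for x in state))
    return coeff * p ** mu * (1.0 - m * p) ** (n - mu)


states = feasible_states(m=2, n=3)
total = sum(state_probability(state, n=3, p=0.2) for state in states)
```

For m = 2 and n = 3 there are ten feasible states, and `total` is 1 up to rounding, which confirms that the term (1 - m p) accounts for processors that issue no local memory request.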
Let $\nu_i$ and $\xi_i$ denote the number of requests issued to local memory module $i$ ($1 \le i \le m$) by the segment switches of the upper and lower bus groups, respectively. The states representing the number of requests issued to each of the $m$ local memory modules by the $b/2$ upper segment switches and $b/2$ lower segment switches will be
represented as $m$-tuple vectors $\vec{\nu}_s = (\nu_1, \ldots, \nu_m)$ and $\vec{\xi}_s = (\xi_1, \ldots, \xi_m)$, respectively, where each $\nu_i$ ($\xi_i$) corresponds to the number of requests to memory module $i$ from the upper (lower) bus group. The set of all feasible states corresponding to the number of requests issued to each of $m$ local memory modules by the $b/2$ upper segment switches is thus
$$S_{\vec{\nu}_s}\!\left(m, \tfrac{b}{2}\right) = \Big\{ (\nu_1, \ldots, \nu_m) \;\Big|\; 0 \le \nu_i \le \tfrac{b}{2} \text{ for } 1 \le i \le m \text{ and } 0 \le \nu = \sum_{1 \le i \le m} \nu_i \le \tfrac{b}{2} \Big\}. \qquad (3.24)$$
Similarly, the set of all feasible states corresponding to the number of requests issued to each of $m$ local memory modules by the $b/2$ lower segment switches is given by
$$S_{\vec{\xi}_s}\!\left(m, \tfrac{b}{2}\right) = \Big\{ (\xi_1, \ldots, \xi_m) \;\Big|\; 0 \le \xi_i \le \tfrac{b}{2} \text{ for } 1 \le i \le m \text{ and } 0 \le \xi = \sum_{1 \le i \le m} \xi_i \le \tfrac{b}{2} \Big\}. \qquad (3.25)$$
As with processor requests, the requests by the $b/2$ upper segment switches and $b/2$ lower segment switches will also follow multinomial distributions, denoted by $P_{\vec{\nu}_s}(\tfrac{b}{2})$ and $P_{\vec{\xi}_s}(\tfrac{b}{2})$, respectively, which can be expressed as follows:
$$P_{\vec{\nu}_s}\!\left(\tfrac{b}{2}\right) = \frac{\tfrac{b}{2}!\; p_{sw_{up}|mem}^{\,\nu}\,\left(1 - m\,p_{sw_{up}|mem}\right)^{\frac{b}{2}-\nu}}{\left(\tfrac{b}{2}-\nu\right)!\,\prod_{1 \le i \le m} \nu_i!}, \qquad (3.26)$$
$$P_{\vec{\xi}_s}\!\left(\tfrac{b}{2}\right) = \frac{\tfrac{b}{2}!\; p_{sw_{lo}|mem}^{\,\xi}\,\left(1 - m\,p_{sw_{lo}|mem}\right)^{\frac{b}{2}-\xi}}{\left(\tfrac{b}{2}-\xi\right)!\,\prod_{1 \le i \le m} \xi_i!}. \qquad (3.27)$$
Without loss of generality, we can assume that the $m$th memory module is the tagged memory module because the $m$ memory modules in a segment are indistinguishable from one another for the purpose of our analysis. Let $\vec{K}_{tag}$ denote a vector representing the number of requests issued to the tagged memory module by processors and segment switches using the upper and lower bus groups. A feasible state for module $m$ can be described by a ternary vector $\vec{K}_{tag} = (\mu_m, \nu_m, \xi_m)$ in which $0 \le \mu_m \le n$, $0 \le \nu_m \le \tfrac{b}{2}$, $0 \le \xi_m \le \tfrac{b}{2}$ and $0 \le K_{tag} = \mu_m + \nu_m + \xi_m \le n + b$. Clearly $K_{tag}$ corresponds to the total number of requests from any source to module $m$. The set of all feasible states which represent the number of requests issued to the tagged memory module by $n$ processors, $b/2$ upper segment switches and $b/2$ lower segment switches in a segment can thus be defined as
$$S_{\vec{K}_{tag}}\!\left(n, \tfrac{b}{2}, \tfrac{b}{2}\right) = \Big\{ (\mu_m, \nu_m, \xi_m) \;\Big|\; 0 \le \mu_m \le n,\; 0 \le \nu_m, \xi_m \le \tfrac{b}{2} \text{ and } 0 \le K_{tag} \le n + b \Big\}, \qquad (3.28)$$
where $K_{tag} = \mu_m + \nu_m + \xi_m$.
To the tagged memory module, a total of $K_{tag}$ memory requests is issued. Therefore, the probability that a request from a processor succeeds in memory arbitration
given that up to $n$ processors and $b$ segment switches might compete is given by
Pp.«c|mem =  1 + 7 ^  ~  C3-29)
tag
Now let us find the cases which require participation in the upper bus group arbitration of segment $i$. Each inter-segment request by a processor will participate in bus arbitration with probability
$$p_{proc|bus} = u_{i,p_{out}}(1) \sum_{d=i+1}^{g-1} P_{id}. \qquad (3.30)$$
Similarly, a reply packet from a memory module takes part in bus arbitration with probability
Pmem\bu* — ui,mo«t(l) ^  * (3.31)
In the above equation, the probability $P_{ii}/2$ represents the proportion of cases in which a reply packet (of an intra-segment access) participates in either the upper bus group or the lower bus group arbitration.
As discussed before, in the case of an intra-segment access, an outstanding request from a processor which succeeds in memory arbitration must take part in bus arbitration to eventually get access to a segment bus. Therefore, the effective
probability that a processor successfully participates in bus arbitration is given by
$$p_{proc|bus_{eff}} = \frac{P_{ii}}{2}\, u_{i,p_{out}}(1)\, P_{psucc|mem} + p_{proc|bus}. \qquad (3.32)$$
The first term gives the probability that an intra-segment memory request will succeed in memory arbitration and will participate in bus arbitration. The second term represents the probability that an inter-segment request participates in bus arbitration.
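The two contributions to the effective participation probability can be illustrated with a small Python sketch for the upper bus group. It assumes one reading of the expressions above: inter-segment requests from segment i compete for the upper bus group when the destination lies upstream (d > i), and half of the intra-segment requests that win memory arbitration are directed to the upper group. The function names, the matrix `P` and all numbers are hypothetical.

```python
def p_proc_bus(i, g, u_pout_header, P):
    """Probability that an inter-segment request from segment i takes part
    in upper-bus-group arbitration (destinations d = i+1 .. g-1)."""
    return u_pout_header * sum(P[i][d] for d in range(i + 1, g))


def p_proc_bus_eff(i, g, u_pout_header, P, p_succ_mem):
    """Effective participation probability: intra-segment requests that win
    memory arbitration (half assigned to the upper group) plus inter-segment
    requests."""
    intra = P[i][i] / 2.0 * u_pout_header * p_succ_mem
    return intra + p_proc_bus(i, g, u_pout_header, P)


g = 3
P = [[1.0 / g] * g for _ in range(g)]   # uniform access pattern (made up)
inter_only = p_proc_bus(0, g, 0.3, P)
effective = p_proc_bus_eff(0, g, 0.3, P, p_succ_mem=0.5)
```

In this example the inter-segment part contributes 0.2 and the intra-segment part adds 0.025, showing that memory arbitration success scales only the local-access term.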
The delay experienced in a processor output queue by a class $s$ header flit consists of the following two components:
1. The mean waiting time of the header flit of a class $s$ packet directed to $d$ in order to compete for a segment bus at the processor output queue in segment $s$ ($w_{d|s,p_{out}}$).
2. The flit time for transferring the header flit to the first segment switch on the path to $d$, for an inter-segment access, or the time for transferring the header flit to its addressed local memory module, for an intra-segment access ($\frac{L_f}{W_B}$).
In the case of an inter-segment request, the residence time must include the mean residual residence time of a tail flit in service because the header flit may observe the tail flit of a preceding packet when it arrives at the first segment switch in the path from $s$ to $d$ ($\gamma_{d|s,p_{out}}$). Hence, the mean residence time for the header flit of
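class $s$ at the processor output queue is given by
$$R_{proc}[s] = \sum_{d=0}^{g-1} P_{sd}\, w_{d|s,p_{out}} + \sum_{\substack{d=0 \\ d \ne s}}^{g-1} P_{sd}\, \gamma_{d|s,p_{out}} + \frac{L_f}{W_B}, \quad s = 0, \ldots, g-1. \qquad (3.33)$$
The mean residual residence time, $\gamma_{d|s,p_{out}}$, is calculated in the same way as in the previous section and can be expressed as
$$\gamma_{d|s,p_{out}} = \begin{cases} u_{s+1}(\ell)\,\dfrac{r_{s+1}(\ell)}{2} & \text{if } d - s > 0, \\ u_{s}(\ell)\,\dfrac{r_{s}(\ell)}{2} & \text{if } d - s < 0. \end{cases} \qquad (3.34)$$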
Consider a class $s$ request. When such a request arrives at the head of a processor output queue, it will observe $\mathcal{K}_{s,p_{out}}$ other requests occupying the heads of their respective queues, where $\mathcal{K}_{s,p_{out}}$ is an integer random variable. The occupied queue heads are from a mix of processor output queues and memory output queues. Here, the term "occupied queue head" implies a state in which a request that has already arrived at the head of an output queue waits for segment bus assignment. If $\mathcal{K}_{s,p_{out}}$ is smaller than the number of available free segment buses in a bus group, the request does not need to wait for a segment bus assignment. However, if $\mathcal{K}_{s,p_{out}}$ is equal to or greater than the number of free segment buses, the request should wait until a free segment bus is available. Suppose that $i$ is the number of free buses observed by the header flit when it arrives at the head of a processor output queue. If $\mathcal{K}_{s,p_{out}}$ is greater than $i$, the request should wait for $(\mathcal{K}_{s,p_{out}} - i)$ service completions before it can get access to a segment bus. This is because when a bus becomes free, the
next request to be granted the bus is selected according to an FCFS policy. Cyclic selection is used for simultaneous arrivals.
The mean segment bus access time is the time for transferring a request or a reply packet on a segment bus. For an intra-segment request, the segment bus access time will be $L/W_B$. For an inter-segment request, the bus access time is the sum of the transfer time of a header flit on a segment bus and the mean residence time of the $k$th flit for $1 \le k < \ell$ at the first segment switch. This is equal to the time elapsed until a segment bus is released. The mean segment bus access time at segment $s$ will be denoted by $t_{s,bus_{up}}$ or $t_{s,bus_{lo}}$, where the subscript $bus_{up}$ implies a segment bus in the upper bus group and $bus_{lo}$ implies a segment bus in the lower bus group. These times are given by
= ------------—  p i  — --------------------------(3.35)
and
‘ . M . „ ---------------—  in  — --------- -•  (3.36)
o= 0
Note that the mean segment bus access time is independent of whether the packet is a request or a reply packet.
In the first phase of arbitration, a segment bus is assigned to a request if the header flit of the request is in a flit buffer of the corresponding segment switch. Therefore, at the beginning of the second phase, a segment bus from the upper bus
group in segment $s$ is free with probability $\left(1 - \sum_{k=1}^{\ell} u_s(k)/\tfrac{b}{2}\right)$ and a segment bus from the lower bus group is free with probability $\left(1 - \sum_{k=1}^{\ell} u_{s+1}(k)/\tfrac{b}{2}\right)$, on average. This probability is reflected in the inflated (larger than the nominal value) bus access time. Hence, the expected time interval between consecutive bus service completions in the upper bus group in segment $s$ is given by
$$t_{s,bus_{up}access} = \frac{t_{s,bus_{up}}}{\dfrac{b}{2}\left(1 - \dfrac{2}{b}\displaystyle\sum_{k=1}^{\ell} u_s(k)\right)}. \qquad (3.37)$$
Here, $u_s(k)$ is the utilization of an upstream flit buffer in segment $s$ by the $k$th flit. Similarly, the expected time interval between consecutive bus service completions in the lower bus group in segment $s$ is expressed as
$$t_{s,bus_{lo}access} = \frac{t_{s,bus_{lo}}}{\dfrac{b}{2}\left(1 - \dfrac{2}{b}\displaystyle\sum_{k=1}^{\ell} u_{s+1}(k)\right)}. \qquad (3.38)$$
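The inflation of the bus access time can be illustrated numerically. The sketch below follows one reading of the expressions in this passage, namely that the nominal access time is divided by the expected number of buses in the group that are not already claimed in the first arbitration phase; the function name `completion_interval` and all numbers are hypothetical.

```python
def completion_interval(bus_access_time, b, flit_utilizations):
    """Expected time between consecutive bus service completions in one bus
    group of b/2 buses, given the per-flit utilizations u(1..l) of the flit
    buffers that are served first (assumed reading, not the author's code)."""
    buses_in_group = b / 2.0
    free_fraction = 1.0 - sum(flit_utilizations) / buses_in_group
    if free_fraction <= 0.0:
        raise ValueError("bus group is saturated by first-phase traffic")
    return bus_access_time / (buses_in_group * free_fraction)


idle_switch = completion_interval(8.0, b=4, flit_utilizations=[0.0, 0.0, 0.0])
busy_switch = completion_interval(8.0, b=4, flit_utilizations=[0.2, 0.2, 0.1])
```

With no transit traffic the interval is the access time divided by the number of buses in the group; transit traffic at the segment switch lengthens it, which is the load-dependence the model relies on.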
Therefore, the mean time until the request of class $s$ can get access to a segment bus while $\mathcal{K}_{s,p_{out}}$ other queue heads are occupied is given by
T \K ^ J L  =
(£»,Po.» -  ^ )f.,h « .pacee.. +  at  the upper bus group,
at the lower bus group.
(3.39)
The second term on the RHS of each of the above expressions represents the mean residual bus access time. Equation (3.39) indicates that the bus server is load-dependent; that is, the mean bus service time depends on $\mathcal{K}_{s,p_{out}}$.
Now consider the random variable $\mathcal{K}_{s,p_{out}}$ and assume that a request at the head of a processor output queue participates in the upper bus group arbitration. Recall that in each segment there is a total of $(2m + n)$ different request inputs for bus assignment because up to $n$ inter-segment requests from processors, up to $m$ requests from memory output queues for reply packets, and up to $m$ outstanding memory requests chosen from memory arbiters can participate. However, the actual number of requests that can participate in bus assignment of the second phase is at most $(m + n)$ because up to $n$ inter-segment requests from processors and up to $m$ outstanding memory requests chosen by memory arbiters must originate at the same set of $n$ processors. Hence, the value of $\mathcal{K}_{s,p_{out}}$ is bounded above by $(m + n - 1)$.
Let $\eta_{s,p}$ denote the number of occupied processor output queue heads observed by a request that has just arrived at the head of a certain processor output queue in segment $s$ (not in the set of $\eta_{s,p}$ queues). Similarly, we define $\eta_{s,m}$ as the number of occupied memory output queue heads observed by a request that has just arrived at the head of a processor output queue in segment $s$.
The random variable $\mathcal{K}_{s,p_{out}}$ can be expressed as the sum of non-negative integers $\eta_{s,p}$ and $\eta_{s,m}$, and is bounded above by $(m + n - 1)$. The set of all feasible combinations of $\eta_{s,p}$ and $\eta_{s,m}$ is
$$\big\{ (\eta_{s,p} + \eta_{s,m}) \;\big|\; 0 \le \eta_{s,p} \le n - 1 \text{ and } 0 \le \eta_{s,m} \le m \big\}.$$
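Then, $\mathcal{K}_{s,p_{out}}$ has a multinomial distribution given as follows:
$$P[\mathcal{K}_{s,p_{out}} = \eta_{s,p} + \eta_{s,m}] = \frac{(m+n-1)!\; p_{proc|bus_{eff}}^{\,\eta_{s,p}}\; p_{mem|bus}^{\,\eta_{s,m}}\; \left(1 - p_{proc|bus_{eff}} - p_{mem|bus}\right)^{m+n-1-\eta_{s,p}-\eta_{s,m}}}{(m+n-1-\eta_{s,p}-\eta_{s,m})!\;\eta_{s,p}!\;\eta_{s,m}!}. \qquad (3.40)$$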
The delay encountered by the header flit of a class $s$ request at the head of a processor output queue until a segment bus is allocated to it is given by
$$w_{d|s,p_{out}} = \sum_{i=b/2}^{m+n-1} T[\mathcal{K}_{s,p_{out}} = i]\; P[\mathcal{K}_{s,p_{out}} = i]. \qquad (3.41)$$
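The averaging step, in which the conditional waiting time is weighted by the distribution of the number of occupied queue heads and only values of at least b/2 contribute, can be sketched as follows. The function name `mean_bus_wait`, the dictionary `pmf` and the linear waiting-time function are hypothetical stand-ins; in the model the conditional time would come from the load-dependent expression for the bus server.

```python
def mean_bus_wait(pmf, wait_given_k, buses_per_group):
    """Expected wait for a segment bus: sum of T[K = i] * P[K = i] over the
    values i >= b/2, where a newly arrived request finds no free bus."""
    return sum(prob * wait_given_k(i)
               for i, prob in pmf.items()
               if i >= buses_per_group)


# Made-up distribution of occupied queue heads and a simple waiting rule:
# (i - b/2 + 1) completion intervals of 4.0 time units each.
pmf = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.1}
wait = lambda i: (i - 2 + 1) * 4.0
expected_wait = mean_bus_wait(pmf, wait, buses_per_group=2)
```

States with fewer occupied queue heads than buses contribute nothing, so the expected wait is driven entirely by the upper tail of the distribution.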
Now, we return to the calculation of the unknown $u_{s,p_{out}}(1)$. $u_{s,p_{out}}(1)$ is simply the ratio of the residence time of a class $s$ header flit at the head of a processor output queue to the round trip time of a class $s$ request. Hence,
<3 «>
The residence time of the header flit of class $j$ at the head of a memory output queue in segment $s$ will be denoted by $r_{s|j,m_{out}}(1)$ and will be discussed in the next section. Then, the mean utilization of a memory output queue in segment $s$ by a header flit is
j - l  nPj. _ ,
(I )= £ ^ r r i g r -  (3-43>
Using the pipelining property of wormhole routing, the residence time of the $k$th flit ($k > 1$) at the head of each output queue is the same as the residence time of the $(k-1)$th flit at the flit buffer of the first segment switch in the supposed path. Thus, $r_{i,p_{out}}(k)$, $k > 1$, is the same as the residence time of the $(k-1)$th flit of class $i$ at the flit buffer of the first segment switch in the forward path. $r_{i,m_{out}}(k)$ is the same as the mean residence time of the $(k-1)$th flit at the flit buffer of the first segment switch in the return path. $r_{i,p_{out}}(k)$ and $r_{i,m_{out}}(k)$, for $k > 1$, are given as follows:
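$$r_{i,p_{out}}(k) = \frac{P_{ii}\frac{L_f}{W_B} + \sum_{d=0}^{i-1} P_{id}\, r_{i,id}(k-1) + \sum_{d=i+1}^{g-1} P_{id}\, r_{i+1,id}(k-1)}{P_{ii} + \sum_{d=0}^{i-1} P_{id} + \sum_{d=i+1}^{g-1} P_{id}}$$
$$= P_{ii}\frac{L_f}{W_B} + \sum_{d=0}^{i-1} P_{id}\, r_{i,id}(k-1) + \sum_{d=i+1}^{g-1} P_{id}\, r_{i+1,id}(k-1) \tag{3.44}$$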
and
$$r_{i,m_{out}}(k) = P_{ii}\frac{L_f}{W_B} + \sum_{s=0}^{i-1} P_{si}\, r_{i,is}(k-1) + \sum_{s=i+1}^{g-1} P_{si}\, r_{i+1,is}(k-1). \tag{3.45}$$
The utilizations $u_{i,p_{out}}(k)$ and $u_{i,m_{out}}(k)$ can be calculated easily as
(3.46)
3.6 Residence Time in a Memory Module
The residence time in a memory module is the sum of the mean memory input queue delay and the mean memory output queue delay and is given by
$$R_{mem}[s] = \sum_{d=0}^{g-1} P_{sd}\,(R_{d|s,mem_{in}} + R_{d|s,mem_{out}}), \qquad s = 0, \ldots, g-1. \tag{3.47}$$
3.6.1 Memory Input Queue Delays
Mean memory input queue delay, $R_{d|s,mem_{in}}$, includes the mean time a class $s$ request spends in the input queue of the addressed target memory module in segment $d$ and the memory service time ($\tau_m$). We assume that a request is queued for memory access only when its tail flit arrives at a memory input queue.
$$R_{d|s,mem_{in}} = w_{d|s,m_{in}} + \tau_m = q_{d|s,m_{in}}\tau_m + u_{d|s,m}\gamma_m + \tau_m \tag{3.48}$$
$w_{d|s,m_{in}}$ consists of the queueing delay in the memory input queue ($q_{d|s,m_{in}}\tau_m$) and the mean residual lifetime of a memory request in service ($u_{d|s,m}\gamma_m$). The mean memory input queue length, $q_{d|s,m_{in}}$, observed by a class $s$ request packet when it arrives at the memory input queue in segment $d$, is given by
(3.49)
In calculating the mean queue length, one should not count the contribution by the class $s$ request that has just arrived. However, this fact can be ignored because the contribution is expected to be very small considering typical values for $n$ and $g$. Notice that there exist $g$ classes in the model and for each class $n$ requests are generated on average. The probability that a memory module is busy serving a request when a class $s$ packet arrives at the memory module in segment $d$ is given by
* *■ “ = £  ’■ .+ « [” ']■ (3-50) 
We now turn our attention to the derivation of the mean residual lifetime of a memory request at a memory module in service. Recall that a memory request is queued up only when its tail flit arrives at the memory input queue. Hence, the residual lifetime of a memory request should be calculated as seen by the tail flit rather than as seen by the header flit. As before, we assume that the header flit is equally likely to arrive at a memory input queue at any point during the memory service time interval. Define $t_e$ to be the amount of time which has already elapsed in serving a request when the header flit observed a memory request in service. For simplicity, each component in the calculation of the residual lifetime will be expressed in terms of the flit time ($L_f/W_B$).
Suppose that $t_e$ is less than one flit time. In this case, the residual lifetime seen by the header flit would be $(\tau_m - \frac{1}{2}\frac{L_f}{W_B})$ on average. If a constant $T$ is set to $\tau_m W_B / L_f$, the residual lifetime can be expressed as $(T - \frac{1}{2})\frac{L_f}{W_B}$. However, the actual residual lifetime should be calculated as seen by the tail flit. The probability that the request in service observed by the header flit will still be in service when the tail flit arrives at the input queue is given by
T  — -  1  2
where $t$ is the number of flits per packet. Therefore, when $0 \le t_e < \frac{L_f}{W_B}$, the mean residual lifetime seen by the tail flit is
r
Similarly, for $\frac{L_f}{W_B} \le t_e < \frac{2L_f}{W_B}$, the mean residual lifetime is given by
If $\frac{(k-1)L_f}{W_B} \le t_e < \frac{kL_f}{W_B}$, the mean residual lifetime is expressed as
This procedure continues while the tail flit can observe the request in service. In other words, the tail flit can observe a request that is still in service only while
Conditioned on this, $k$ is given by $T - t$ and is assumed to be a positive integer. Since the header flit is equally likely to arrive at a memory input queue at any point during the memory service time interval ($\tau_m$), the mean residual lifetime of a memory request can be determined as
-  i £ 0 - 4 - ‘> © '
(r  _\Tm Wo )
2  r.
WB,
L  (3 .5 2 )
Thus, the mean residual lifetime of a memory request in service, as seen by the tail flit, is given by
(r  — L~Li I 2
7m =  S S - j J t l .  (3.53)
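The idea behind the residual lifetime seen by the tail flit can be checked numerically. The sketch below is a continuous-time approximation, not the flit-by-flit sum used in the derivation above: it assumes the header flit arrives uniformly within the service interval and that the tail flit follows it by (t - 1) flit times. Both the lag and the function names are assumptions introduced for this example. Under these assumptions the mean residual time has the closed form (tau_m - lag)^2 / (2 tau_m), which the Monte Carlo estimate should approach.

```python
import random


def residual_seen_by_tail(tau_m, flit_time, flits_per_packet, elapsed):
    """Service time still remaining when the tail flit arrives.

    `elapsed` is the time already spent serving the request when the header
    flit arrived; the tail is assumed to lag the header by (t - 1) flit times.
    """
    tail_lag = (flits_per_packet - 1) * flit_time
    return max(0.0, tau_m - elapsed - tail_lag)


def mean_residual_closed_form(tau_m, flit_time, flits_per_packet):
    """Mean of residual_seen_by_tail for `elapsed` uniform on [0, tau_m)."""
    tail_lag = (flits_per_packet - 1) * flit_time
    remaining = max(0.0, tau_m - tail_lag)
    return remaining**2 / (2.0 * tau_m)


def mean_residual_monte_carlo(tau_m, flit_time, flits_per_packet, samples=100_000, seed=0):
    """Estimate the same mean by sampling header arrival instants."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        elapsed = rng.uniform(0.0, tau_m)
        total += residual_seen_by_tail(tau_m, flit_time, flits_per_packet, elapsed)
    return total / samples
```

When the packet is long relative to the memory service time, the tail flit never finds the observed request still in service and both functions return zero.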
3.6.2 Memory Output Queue Delays
Mean memory output queue delay, $R_{d|s,mem_{out}}$, is the sum of the mean queueing delay, $w_{d|s,m_{out}}$, and the mean residence time of the header flit at the head of a memory output queue, $r_{d|s,m_{out}}(1)$. Thus,
$$R_{d|s,mem_{out}} = w_{d|s,m_{out}} + r_{d|s,m_{out}}(1). \tag{3.54}$$
$w_{d|s,m_{out}}$ is the time until the header flit of a reply packet of class $s$ arrives at the head of a memory output queue in segment $d$.
As we discussed earlier in the case of $r_{s,p_{out}}(1)$, $r_{d|s,m_{out}}(1)$ includes the mean waiting time for the header flit of a class $s$ packet at the head of the memory output queue to get access to a segment bus in segment $d$ ($s_{d|s,m_{out}}$), the mean residual residence time for the tail flit in service ($\gamma_{d|s,m_{out}}$) (as the header flit arrives at the first segment switch in the return path), and a flit time for transferring the header flit to its first segment switch for an inter-segment access or to its processor for an intra-segment access.
The mean queueing delay is given by
t
t e l
+  £  “« « » » .(* ) (  +  £  r w O ) )  . (3.55)
te l V J=fc+1 /
The mean memory output queue length observed by the header flit of a class $s$ reply packet when it arrives at a memory output queue in segment $d$ is given by
(3.56)
As before, since the amount contributed to the queue length by a class $s$ request that has just arrived is insignificant, it will be ignored. The mean residence time of the $k$th flit of a reply packet at a memory output queue in segment $d$, $r_{d,m_{out}}(k)$ for $k > 1$, was discussed in Section 3.5. Here, we rewrite it for convenience as
$$r_{d,m_{out}}(k) = P_{dd}\frac{L_f}{W_B} + \sum_{s=0}^{d-1} P_{sd}\, r_{d,ds}(k-1) + \sum_{s=d+1}^{g-1} P_{sd}\, r_{d+1,ds}(k-1). \tag{3.57}$$
The mean residence time experienced by the header flit of a reply packet (of any class) at a memory output queue in segment $d$ is expressed as
(■> \   nP id , nP ,i_  L f  . .
isO  ,= 0  B
td
In Equation (3.55), the second term shows the mean residual service time of a reply packet in service. The arriving packet at the memory output queue observes the $k$th flit utilizing the memory output queue with probability
« * ...* . ,( * )  =  £  l(t> - (3-59)
b'-=Q t P T  “ Is  J
In this case, the mean residual residence time of that packet is
$$\frac{r_{d,m_{out}}(k)}{2} + \sum_{j=k+1}^{t} r_{d,m_{out}}(j).$$
The residence time of the header flit of a class $s$ reply packet at the memory output queue in segment $d$, $r_{d|s,m_{out}}(1)$, is given by
$$r_{d|s,m_{out}}(1) = s_{d|s,m_{out}} + \gamma_{d|s,m_{out}} + \frac{L_f}{W_B}. \tag{3.60}$$
Calculation of $r_{d|s,m_{out}}(1)$ could be done in a very similar way as in Section 3.5, where the residence time of a header flit at a processor output queue was estimated. Similarly, we define $\theta_{d,p}$ as the number of occupied processor output queue heads observed by a reply packet when it arrives at the head of a memory output queue in segment $d$. We also define $\theta_{d,m}$ as the number of occupied memory output queue heads observed by a reply packet when it arrives at the head of a memory output queue in segment $d$.
The random variable $\mathcal{K}_{d,m_{out}}$ can be written as the sum of the non-negative integers $\theta_{d,p}$ and $\theta_{d,m}$, and is bounded above by $(n + m - 1)$. The set of all feasible values for $\mathcal{K}_{d,m_{out}}$ can be expressed as
$$\mathcal{K}_{d,m_{out}} = \{(\theta_{d,p} + \theta_{d,m}) \mid 0 \le \theta_{d,p} \le n \text{ and } 0 \le \theta_{d,m} \le m - 1\}.$$
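Then, $\mathcal{K}_{d,m_{out}}$ follows a multinomial distribution given by
$$P[\mathcal{K}_{d,m_{out}} = \theta_{d,p} + \theta_{d,m}] = \frac{(m+n-1)!\; p_{proc|bus_d}^{\theta_{d,p}}\; p_{mem|bus_d}^{\theta_{d,m}}\; (1 - p_{proc|bus_d} - p_{mem|bus_d})^{(m+n-1-\theta_{d,p}-\theta_{d,m})}}{(m+n-1-\theta_{d,p}-\theta_{d,m})!\; \theta_{d,p}!\; \theta_{d,m}!} \tag{3.61}$$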
The delay, $s_{d|s,m_{out}}$, is given as follows
$$s_{d|s,m_{out}} = \sum_{i=b/2}^{m+n-1} \tau[\mathcal{K}_{d,m_{out}} = i]\; P[\mathcal{K}_{d,m_{out}} = i] \tag{3.62}$$
where
(£.*,»»„.» — 7})tj,bu,upacctMM +  td̂ M-£ e?x" - at the Upper bus group,
(fCd,mout — ^)td,bustoaccet, +  t^ ^ ê ea‘ at the lower bus group.
The mean residual residence time, $\gamma_{d|s,m_{out}}$, is calculated in the same way as in the previous section and can be expressed as
intone
^  if  d — s >  0 ,
(3.63)
u,<+i (* )—" " — if  d — s <  0 .
Here, $\gamma_{d|s,m_{out}}$ becomes zero for $s = d$ as we discussed earlier.
Chapter 4
Queueing Analysis for Systems with
Infinite Flit Buffers
This chapter develops a performance analysis model for studying SMBS's with infinite flit buffers. We develop equations for mean residence times at three distinct infinite queueing centers: the segment switch, processor and memory module.
4.1 Overview of the Model
Model assumptions for this case are very similar to those stated in Chapter 3 for the single flit buffer model, with the exception of buffer size. The assumptions are as follows:
1. Each processor generates request messages with exponentially distributed inter-request intervals with a mean of $\tau_p$.
2. Read and write accesses in a memory module require the same service interval and take a constant time, $\tau_m$.
3. A transient packet has priority over other packets for bus assignment. This assumption is based on the expectation that quickly delivered packets will reduce contention for network resources.
4. Arbitration time is included in the bus cycle time and a fair selection policy is utilized in each arbitration phase.
5. Bus service time is deterministic and is equal to a bus cycle time, which is also equal to the flit time.
6. The queueing discipline at each memory queue is first-come first-served (FCFS). In the case when requests arrive simultaneously, these requests are served in random order.
In addition to the above mentioned assumptions, we assume that the capacity of a flit buffer is unlimited. The flit buffer's queueing discipline is FCFS as well. Therefore, the whole packet can be buffered at one flit buffer when a header flit is blocked, unlike the case of a finite flit buffer, where the trailing flits are blocked in place and simultaneously occupy several segment switches. Under the condition of unlimited flit buffer capacity, a header flit can go forward without the entire packet being buffered as soon as the next segment bus and segment switch are available. This guarantees that a network with infinite flit buffers will, in general, offer less network latency than a network with finite flit buffers. We assume wormhole routing with infinite flit buffers throughout this chapter.
We maintain the assumption that memory requests within a segment are uniformly distributed and segment referencing is distributed arbitrarily. Each memory module is assumed to have an input buffer and an output buffer so that the memory module may serve requests continuously without having to wait to transmit reply packets back to their requesting processors. The sizes of memory input and output buffers are unlimited. A reply packet is assumed to be consumed as soon as it arrives at its requesting processor.




Unlike the single flit buffer model, the infinite buffer model does not consider the mean residual residence time of a tail flit and the blocking delay because a packet arriving at a segment switch will not experience a shortage of buffer space.
Consider a class $s$ request for a memory module in segment $d$. A processor will generate a class $s$ request which will travel (in a forward path) from $s$ to $d$ and will receive a reply packet, of the same class, from the targeted memory module in segment $d$ (in a return path). The residence time experienced by the request will be the sum of the mean residence times (queueing and service) in the processor, in the network and in the memory module. Hence, the average response time of a class $s$ request becomes
$$R[s] = R_{proc}[s] + R_{network}[s] + R_{mem}[s], \qquad s = 0, \ldots, g-1. \tag{4.1}$$
$R_{proc}[s]$ denotes the mean time that the header flit of a class $s$ packet spends at the head of a processor output queue. This is the sum of the mean waiting time for a segment bus and the time for transferring the header flit to the first segment switch (in the case of an inter-segment access) or the time for transferring the header flit on a segment bus to a local target memory module (in the case of an intra-segment access).
$R_{network}[s]$ is the weighted sum of the mean residence times in the forward path (for a request packet) and in the return path (for a reply). $R_{network}[s]$ is given by
$$R_{network}[s] = \sum_{d=0}^{g-1} P_{sd}\,(R_{sd,network} + R_{ds,network}), \qquad s = 0, \ldots, g-1, \tag{4.2}$$
where $\sum_{d=0}^{g-1} P_{sd} = 1$. $R_{sd,network}$ denotes the average residence time in the network of a class $s$ request packet from the first segment switch in the forward path to the target memory in segment $d$. $R_{ds,network}$ denotes the average residence time in the network of a class $s$ reply packet from the first segment switch in the return path to the requesting processor.
The residence time at each segment switch consists of the mean queueing delay at a flit buffer queue and the time for transferring the header flit to the next switch or to the target memory module (for the final segment switch in the forward path) or to the original requesting processor (for the final segment switch in the return path). Once the header flit arrives at a target memory module in the forward path, the remaining flits will follow (catch up) and reach the target memory. Similarly, once the header flit arrives at the requesting processor in the return path, the remaining flits will follow (catch up) and reach the processor. The time this takes is known as the catch-up time.
$R_{mem}[s]$ is the residence time in a memory module and is given by
$$R_{mem}[s] = \sum_{d=0}^{g-1} P_{sd}\,(R_{d|s,mem_{in}} + R_{d|s,mem_{out}}), \qquad s = 0, \ldots, g-1. \tag{4.3}$$
$R_{mem}[s]$ is exactly the same as in the single flit buffer model case. $R_{d|s,mem_{in}}$ is the sum of the mean delay for a class $s$ request packet experienced in the memory input queue in segment $d$ and the memory service time, $\tau_m$. $R_{d|s,mem_{out}}$ is the average delay that the reply packet of class $s$ spends in the memory output queue. In the case of an inter-segment access, this consists of the mean queueing delay at the memory output queue, the mean waiting time for a segment bus at the head of the memory output queue and the time for transferring the header flit to the first segment switch. In the case of an intra-segment access, $R_{d|s,mem_{out}}$ consists of the mean queueing delay, the segment bus waiting time and the time for transferring the reply header flit back to its local processor.
We will also utilize the same performance metrics in this case as those utilized in the single flit buffer model: the overall mean response time ($R$) and processing efficiency ($\rho_{proc}$). These are given by
$$R = \frac{1}{g}\sum_{s=0}^{g-1} R[s] \tag{4.4}$$
and
$$\rho_{proc} = \frac{1}{g}\sum_{s=0}^{g-1} \rho_{proc}[s], \tag{4.5}$$
where
* ~ w =
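The two metrics can be computed from per-class response times as in the following sketch. The per-class efficiency is taken here to be the fraction of a request cycle spent computing, tau_p / (tau_p + R[s]); this form is an assumption made for the example, as are the function and parameter names.

```python
def response_time(r_proc, r_network, r_mem):
    """Per-class response time: processor + network + memory residence times."""
    return r_proc + r_network + r_mem


def overall_metrics(tau_p, response_times):
    """Return (mean response time, mean processing efficiency) over all classes.

    Assumes the per-class efficiency is tau_p / (tau_p + R[s]).
    """
    g = len(response_times)
    mean_response = sum(response_times) / g
    efficiencies = [tau_p / (tau_p + r) for r in response_times]
    return mean_response, sum(efficiencies) / g
```

With two segments, a mean think time of 10 and response times of 10 and 30, the sketch reports a mean response time of 20 and a processing efficiency of 0.375.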
4.2 Residence Time in a Processor
Residence time of a header flit in a processor output queue basically consists of the following two elements:
1. The mean waiting time for a header flit of a class $s$ request directed to $d$ in order to compete for a segment bus at a processor output queue in segment $s$.
2. The flit time for transferring the header flit to the first segment switch in the forward path to $d$, for an inter-segment access, or the time for transferring the header flit to its addressed local memory module, for an intra-segment access.
Unlike the case of the single flit buffer model, a header flit does not observe any residual residence time of a tail flit in service at the next segment switch because the size of a flit buffer is unbounded. Hence, the mean residence time of the header flit of class $s$ at the processor output queue is given by
$$R_{proc}[s] = \sum_{d=0}^{g-1} P_{sd}\, s_{d|s,p_{out}} + \frac{L_f}{W_B}, \qquad s = 0, \ldots, g-1. \tag{4.6}$$
Consider a class $s$ request. As we discussed in the single flit buffer model in Chapter 3, we assume that a request packet observes $\mathcal{K}_{s,p_{out}}$ other requests occupying the heads of their respective queues when the request packet arrives at the head of its processor output queue, where $\mathcal{K}_{s,p_{out}}$ is an integer random variable. We follow the same analysis as in the single flit buffer model of Chapter 3. If $\mathcal{K}_{s,p_{out}}$ is smaller than the number of available free segment buses in a bus group, the request does not need to wait for a segment bus assignment. However, if $\mathcal{K}_{s,p_{out}}$ is equal to or greater than the number of free segment buses, the request must wait until a free segment bus is available. Suppose that $i$ is the number of free buses observed by the header flit when it arrives at the head of a processor output queue. If $\mathcal{K}_{s,p_{out}}$ is greater than $i$, the request will wait $(\mathcal{K}_{s,p_{out}} - i)$ service completions before it can get access to a segment bus. This is because when a bus becomes free, the next request to be granted the bus is chosen according to a FCFS policy.
The mean segment bus access time is the time for transferring a request or a reply packet on a segment bus. The segment bus access time will be simply $L/W_B$ because here we do not have to consider the blocking effect caused by a finite flit buffer capacity.
In the first phase of arbitration, the segment bus is assigned to the header flit of a request in the corresponding segment switch if such a request exists. Therefore, at the beginning of the second phase, the probability that a segment bus (supposedly, of the upper bus group in segment $t$) is busy serving a transient packet, denoted by $u_{t,bus}$, is given by
-  -  S j  (4-7>i,ad ^ i,td
where
$D^{+}_{t,sd} = \{d \mid \text{packets traveling from } s \text{ to } d \text{ will visit upstream flit buffer } t\}$
$D^{-}_{t,sd} = \{d \mid \text{packets traveling from } s \text{ to } d \text{ will visit downstream flit buffer } t\}$.
This probability reflects a larger bus access time than the nominal value of $L/W_B$. Hence, the expected time interval between consecutive bus service completions in the upper bus group of segment $s$ is given by
$$t_{s,bus_{up}access} = \frac{L / W_B}{\frac{b}{2}\,(1 - u_{s,bus})}. \tag{4.8}$$
Similarly, the expected time interval between consecutive bus service completions in the lower bus group of segment $s$ is expressed as
$$t_{s,bus_{lo}access} = \frac{L / W_B}{\frac{b}{2}\,(1 - u_{s+1,bus})}. \tag{4.9}$$
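A small numerical sketch may help with the interval between bus service completions. It assumes the interval takes the form (L / W_B) / ((b/2)(1 - u)), where L / W_B is the nominal packet transfer time, b/2 is the number of buses in one bus group, and u is the probability that a bus is busy with a transient packet. The function and parameter names are hypothetical.

```python
def bus_completion_interval(packet_bits, bus_width_bits, buses_per_segment, transient_utilization):
    """Expected time between consecutive bus service completions in one bus group.

    Assumed form: (L / W_B) / ((b / 2) * (1 - u)), measured in bus cycles.
    """
    if not 0.0 <= transient_utilization < 1.0:
        raise ValueError("transient_utilization must lie in [0, 1)")
    packet_time = packet_bits / bus_width_bits    # nominal access time L / W_B
    buses_in_group = buses_per_segment / 2        # b / 2 buses per group
    return packet_time / (buses_in_group * (1.0 - transient_utilization))
```

For a 64-bit packet on 16-bit buses with four buses per segment, the interval is 2 cycles when no transient traffic is present and doubles to 4 cycles when each bus is busy with transient packets half of the time.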
Therefore, the mean time until the class $s$ request can get access to a segment bus while $\mathcal{K}_{s,p_{out}}$ other queue heads are occupied, is given by
r[AC,,p_ ]  =
(£».po. .  — +  **’***^ *^ “ * at the upper bus group,
(4.10)
The equation shows that the bus server is load-dependent; that is, the mean time depends on $\mathcal{K}_{s,p_{out}}$.
Now consider the integer random variable $\mathcal{K}_{s,p_{out}}$. We use the same variable definitions and notation (without any modifications) as those defined in the single flit buffer model. Thus, the distribution of $\mathcal{K}_{s,p_{out}}$ is given as
$$P[\mathcal{K}_{s,p_{out}} = \eta_{s,p} + \eta_{s,m}] = \frac{(m+n-1)!\; p_{proc|bus_s}^{\eta_{s,p}}\; p_{mem|bus_s}^{\eta_{s,m}}\; (1 - p_{proc|bus_s} - p_{mem|bus_s})^{(m+n-1-\eta_{s,p}-\eta_{s,m})}}{(m+n-1-\eta_{s,p}-\eta_{s,m})!\; \eta_{s,p}!\; \eta_{s,m}!} \tag{4.11}$$
89
The delay experienced by the header flit of a class s packet a t the head of the 
processor output queue, u n til a segment bus is allocated to it, is given by
m +n—1 ___
3<i\*,po»t — 5 3  T[fCt,Pomt =  *] J 3  =  *]- (4-12)
»'=4/2
4.3 Residence Time in the Network
The residence time in the network is the weighted sum of the mean residence times in the forward path for a request packet and in the return path for a reply packet, as given below:
$$R_{network}[s] = \sum_{d=0}^{g-1} P_{sd}\,(R_{sd,network} + R_{ds,network}), \qquad s = 0, \ldots, g-1, \tag{4.13}$$
where $\sum_{d=0}^{g-1} P_{sd} = 1$.
$R_{sd,network}$ corresponds to the average residence time in the network of a class $s$ request packet from the first segment switch in the forward path to an addressed target memory module in segment $d$. Similarly, $R_{ds,network}$ corresponds to the average residence time in the network of a class $s$ reply packet from the first segment switch in the return path to the original requesting processor.
The mean residence time for a class $s$ packet in the forward path (or in the return path) consists of the following two elements:
1. The sum of the mean residence times of the header flit of a class $s$ packet at all segment switches in the forward path from $s$ to $d$, $\sum_i r_{i,sd}(1)$ (or in the return path from $d$ to $s$, $\sum_i r_{i,ds}(1)$).
2. The catch-up time at the addressed target memory module in the case of the forward path, or at the requesting processor in the case of the return path.
Therefore, the mean residence time in the forward path can be written as
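$$R_{sd,network} = \begin{cases} \sum_{i=s+1}^{d} r_{i,sd}(1) + \frac{L - L_f}{W_B} & \text{if } d - s > 0, \\ \sum_{i=d+1}^{s} r_{i,sd}(1) + \frac{L - L_f}{W_B} & \text{if } d - s < 0. \end{cases} \tag{4.14}$$
In the return path, the mean residence time is given by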
R<u,







5 3  r*V*»(l) +  ** * J ~Sm i l  d -  s < 0.
In the case of an intra-segment access ($d = s$), $r_{i,sd}(1)$ and $r_{i,ds}(1)$ are equal to zero.
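The path residence time can be sketched as a sum over the segment switches visited plus a catch-up term. The sketch assumes a linear arrangement in which switch i sits between segments i - 1 and i, so that a packet between segments s and d crosses switches min(s, d) + 1 through max(s, d). That numbering, the function names, and passing the catch-up time as a plain argument are assumptions made for this example. The intra-segment case is left to the caller.

```python
def switches_between(s, d):
    """Indices of the segment switches crossed between segments s and d."""
    low, high = min(s, d), max(s, d)
    return list(range(low + 1, high + 1))


def path_residence_time(s, d, header_residence, catch_up_time):
    """Header-flit residence summed over the switches on the path, plus catch-up.

    `header_residence` maps a switch index to the mean residence time of the
    header flit at that switch.
    """
    if s == d:
        raise ValueError("intra-segment access: no segment switch is crossed")
    return sum(header_residence[i] for i in switches_between(s, d)) + catch_up_time
```

For example, a request from segment 0 to segment 2 with header residence times of 1.5 and 2.5 at switches 1 and 2 and a catch-up time of 3.0 spends 7.0 time units in the network.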
Suppose that a class $s$ packet waits for a segment bus at segment switch $i$ in the upstream forward path from $s$ to $d$. We call such a packet the tagged packet. The mean residence time of the header flit of the tagged packet at segment switch $i$ in the forward path, $r_{i,sd}(1)$, consists of the following three elements:
1. The mean waiting time for packets which were queued earlier at the flit buffer queue.
2. The mean residual service time for a packet if the tagged header flit finds the segment bus busy.
3. The time for transferring the tagged header flit to the next segment switch: a flit time.
Again, observe that in this case we do not include the mean residual residence time of the tail flit of a previous packet because the flit buffer is assumed unbounded. $r_{i,ds}(1)$ in the upstream return path has the same expression as that of $r_{i,sd}(1)$ in the upstream forward path. Thus, the residence time, $r_{i,sd}(1)$, in the upstream forward path is given by
4"  PbuMi ,busy
2 W B W b '
(4.16)
The mean queue length seen by the arriving tagged header flit at the flit buffer
queue in segment i, denoted by is given by
(4.17)
where
$D^{+}_{i,sd} = \{d \mid \text{packets from } s \text{ to } d \text{ will visit upstream flit buffer } i\}$
$D^{-}_{i,sd} = \{d \mid \text{packets from } s \text{ to } d \text{ will visit downstream flit buffer } i\}$.
Once the tagged header flit arrives at the head of the flit buffer queue, the header flit may wait for a segment bus. When the segment bus is occupied by a local processor or a memory module, the tagged header flit must wait until the segment bus is released. In Equation (4.16), $\frac{L}{2W_B}$ is the mean residual service time of a packet. $P_{bus_i,busy}$ denotes the probability that the tagged flit observes segment bus $i$ busy serving a request from a local processor or a memory module.
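The three elements of the header-flit residence time at a segment switch can be combined as in the sketch below. It assumes that each packet queued ahead contributes a full packet transfer time L / W_B, that the residual service time L / (2 W_B) is weighted by the probability of finding the bus busy, and that the final term is one flit time L_f / W_B. The weighting of the first term and all names are assumptions made for illustration.

```python
def header_residence_at_switch(queue_length, p_bus_busy, packet_bits, flit_bits, bus_width_bits):
    """Mean residence time of a tagged header flit at a segment switch.

    Assumed form: q * (L / W_B) + P_busy * (L / (2 * W_B)) + L_f / W_B.
    """
    packet_time = packet_bits / bus_width_bits
    flit_time = flit_bits / bus_width_bits
    waiting = queue_length * packet_time          # packets queued earlier
    residual = p_bus_busy * packet_time / 2.0     # packet already on the bus
    return waiting + residual + flit_time         # plus the header transfer
```

With a mean queue length of 2, a busy probability of 0.5, 64-bit packets, 16-bit flits and a 16-bit bus, the three terms are 8, 1 and 1 bus cycles.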
Let the integer random variable $\Phi_s$ denote the number of requests that will participate in bus assignment of the second arbitration phase at segment $s$. Recall that in each segment the actual number of requests that participate in bus assignment is at most $n + m$: up to $n$ inter-segment requests from processors and up to $m$ outstanding memory requests chosen from memory arbiters. Hence, $\Phi_s$ is bounded above by $n + m$.
Let $\phi_{s,p}$ and $\phi_{s,m}$ denote the number of processors and memory modules participating in bus arbitration, respectively. $\Phi_s$ can be expressed as the sum of the non-negative integers $\phi_{s,p}$ and $\phi_{s,m}$, whose sum is bounded above by $n + m$. The set of all feasible combinations of $\phi_{s,p}$ and $\phi_{s,m}$ is given as $\Phi_s = \{(\phi_{s,p} + \phi_{s,m}) \mid 0 \le \phi_{s,p} \le n \text{ and } 0 \le \phi_{s,m} \le m\}$. If we assume that the trials performed by processors and memory modules are independent Bernoulli trials, then $\Phi_s$ has a multinomial distribution given as follows:
$$P[\Phi_s = \phi_{s,p} + \phi_{s,m}] = \frac{(n+m)!\; p_{proc|bus_s}^{\phi_{s,p}}\; p_{mem|bus_s}^{\phi_{s,m}}\; (1 - p_{proc|bus_s} - p_{mem|bus_s})^{(n+m-\phi_{s,p}-\phi_{s,m})}}{(n+m-\phi_{s,p}-\phi_{s,m})!\; \phi_{s,p}!\; \phi_{s,m}!} \tag{4.18}$$
(
93
where the probabilities Pproc|6u*t̂  and pmem|faM are the same as those discussed in  
Chapter 3.
If  i requests, > <  | ,  participate in  bus assignment of the second arbitration phase, 
the tagged header flit w ill observe the corresponding segment bus being occupied 
w ith probability i / | ,  on average. I f  the number of requests is greater than the 
number o f segment buses per bus group ( 6 / 2 ), the tagged header flit w ill always find 
the corresponding segment bus busy. Therefore, the probability that the header flit 
observes the corresponding segment bus busy is estim ated as
I ” 1 j  n+m
=  £  *>[*. =  *1 t +  £  P [*>  =  *']• (4.19)
i= 1 2 ,= i
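The busy probability of Equation (4.19) can be evaluated directly once the distribution of the number of participating requests is known. The sketch below builds that distribution from the trinomial form with n + m trials and then applies the two sums. The function names and the example probabilities are hypothetical, and an even number of buses per segment is assumed.

```python
from math import factorial


def participants_pmf(n, m, p_proc, p_mem):
    """pmf[k] = P[Phi = k] for k participating requests (processors + memories)."""
    trials = n + m
    pmf = [0.0] * (trials + 1)
    for phi_p in range(n + 1):
        for phi_m in range(m + 1):
            idle = trials - phi_p - phi_m
            coeff = factorial(trials) // (factorial(phi_p) * factorial(phi_m) * factorial(idle))
            pmf[phi_p + phi_m] += coeff * p_proc**phi_p * p_mem**phi_m * (1.0 - p_proc - p_mem) ** idle
    return pmf


def bus_busy_probability(pmf, buses_per_segment):
    """Probability that a tagged header flit finds its segment bus busy."""
    half = buses_per_segment // 2                  # buses per bus group, b / 2
    partially_loaded = sum(pmf[i] * i / half for i in range(1, half))
    saturated = sum(pmf[i] for i in range(half, len(pmf)))
    return partially_loaded + saturated
```

With fewer requests than buses in the group, only a fraction i / (b/2) of the buses is occupied; once the number of requests reaches b/2, every bus in the group is taken and the term contributes its full probability.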
Similarly, the residence time, $r_{i,sd}(1)$, in the downstream forward path (or $r_{i,ds}(1)$ in the downstream return path) is given by
rt,*<i(l) 2Wb Wb ' (4.20)
As before, we state $r_{i,sd}(1)$ without using an explicit indication as to whether the path taken is an upstream or a downstream path.
4.4 Residence Time in a Memory Module
The residence time in a memory module is given by
$$R_{mem}[s] = \sum_{d=0}^{g-1} P_{sd}\,(R_{d|s,mem_{in}} + R_{d|s,mem_{out}}), \qquad s = 0, \ldots, g-1. \tag{4.21}$$
The residence time in a memory module is the same as that derived in the single flit buffer case, except that the residence time for the header flit of a reply packet at a memory output queue does not include the tail flit residual residence time. Mean memory input queue delay, $R_{d|s,mem_{in}}$, includes the mean time that a class $s$ request spends in the input queue of the addressed target memory module in segment $d$ and the memory service time ($\tau_m$).
As we did in the single flit buffer case, we assume here that a request is queued for memory access only when its tail flit arrives at the memory input queue. Mean memory input queue delay, $R_{d|s,mem_{in}}$, is given by
$$R_{d|s,mem_{in}} = w_{d|s,m_{in}} + \tau_m = q_{d|s,m_{in}}\tau_m + u_{d|s,m}\gamma_m + \tau_m \tag{4.22}$$
$w_{d|s,m_{in}}$ consists of the queueing delay in the memory input queue ($q_{d|s,m_{in}}\tau_m$) and the mean residual lifetime of a memory request in service ($u_{d|s,m}\gamma_m$). The mean memory input queue length, $q_{d|s,m_{in}}$, observed by a class $s$ request packet when it arrives at a memory input queue in segment $d$ is given by
(4.23)
As in Chapter 3, the mean queue length does not include the amount contributed by the class $s$ request that has just arrived. The probability that the memory is busy serving a request when a class $s$ packet arrives at a memory module in segment $d$,
The mean residual lifetime of a memory request in service, $\gamma_m$ (derived in Chapter 3), is given by
(T _
n .  =  -  ■ (4 25)
Mean memory output queue delay, $R_{d|s,mem_{out}}$, is the sum of the mean queueing delay ($w_{d|s,m_{out}}$) and the mean residence time of the header flit at the head of the memory output queue ($r_{d|s,m_{out}}(1)$). Thus,
$$R_{d|s,mem_{out}} = w_{d|s,m_{out}} + r_{d|s,m_{out}}(1). \tag{4.26}$$
The mean queueing delay is given by
W rfla .m o a t —
t e l
+  E “ i k — £  r* — C »)V  (4-27)
t e l  \  £  j = k + l  )
The mean memory output queue length which is observed by the header flit of a class $s$ reply packet when it arrives at a memory output queue in segment $d$ is given by
(4.28)
The mean residence time of the $k$th flit of a reply packet at a memory output queue in segment $d$, $r_{d,m_{out}}(k)$ for $k > 1$, is given by
$$r_{d,m_{out}}(k) = P_{dd}\frac{L_f}{W_B} + \sum_{s=0}^{d-1} P_{sd}\, r_{d,ds}(k-1) + \sum_{s=d+1}^{g-1} P_{sd}\, r_{d+1,ds}(k-1). \tag{4.29}$$
The mean residence time of the header flit of a reply packet at a memory output queue in segment $d$ is expressed as
r *m o.«(l) =  53  ~ ~ Sd\M,m(mt +  77T- (4.30)
s=0 m  W B
In Equation (4.27), the second term shows the mean residual service time of a reply packet in service. The arriving packet at the memory output queue in segment $d$ observes the $k$th flit utilizing the memory output queue with probability
u<f|«,mmit ( & ) =  5 3  /  'R rZ } • (4-31)
*'=0 rp +  " Is J
In this case, the mean residual residence time of the packet in service at a segment bus in segment $d$ is
$$\frac{r_{d,m_{out}}(k)}{2} + \sum_{j=k+1}^{t} r_{d,m_{out}}(j).$$
The residence time of the header flit of a class $s$ reply packet at a memory output queue in segment $d$, $r_{d|s,m_{out}}(1)$, is given by
$$r_{d|s,m_{out}}(1) = s_{d|s,m_{out}} + \frac{L_f}{W_B}. \tag{4.32}$$
$r_{d|s,m_{out}}(1)$ includes the mean waiting time for the header flit of a class $s$ packet to get access to a segment bus in segment $d$ ($s_{d|s,m_{out}}$) and the flit time for transferring the header flit to its first segment switch for an inter-segment access, or to the requesting processor for an intra-segment access.
Let $\theta_{d,p}$ denote the number of occupied processor output queue heads observed by a reply packet when it arrives at the head of a memory output queue in segment $d$. Let $\theta_{d,m}$ denote the number of occupied memory output queue heads observed by a reply packet when it arrives at the head of a memory output queue in segment $d$.
The number $\mathcal{K}_{d,m_{out}}$ can be written as the sum of the non-negative integers $\theta_{d,p}$ and $\theta_{d,m}$, whose sum is bounded above by $(n + m - 1)$. The set of all feasible values for $\mathcal{K}_{d,m_{out}}$ is given by
$$\mathcal{K}_{d,m_{out}} = \{(\theta_{d,p} + \theta_{d,m}) \mid 0 \le \theta_{d,p} \le n \text{ and } 0 \le \theta_{d,m} \le m - 1\}.$$
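Then, $\mathcal{K}_{d,m_{out}}$ follows a multinomial distribution given by
$$P[\mathcal{K}_{d,m_{out}} = \theta_{d,p} + \theta_{d,m}] = \frac{(m+n-1)!\; p_{proc|bus_d}^{\theta_{d,p}}\; p_{mem|bus_d}^{\theta_{d,m}}\; (1 - p_{proc|bus_d} - p_{mem|bus_d})^{(m+n-1-\theta_{d,p}-\theta_{d,m})}}{(m+n-1-\theta_{d,p}-\theta_{d,m})!\; \theta_{d,p}!\; \theta_{d,m}!} \tag{4.33}$$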
The delay, $s_{d|s,m_{out}}$, is given as follows
$$s_{d|s,m_{out}} = \sum_{i=b/2}^{m+n-1} \tau[\mathcal{K}_{d,m_{out}} = i]\; P[\mathcal{K}_{d,m_{out}} = i] \tag{4.34}$$
where
(ACrf,m<mt — —)td,bua%,aeeeaa +  JM»n^cceM» ^  ^  upper bus gTOUp,
{ K d ,m — 2 )^ ,6*M|0 acee«« +  ^  ^  JQWer bUS gTOUp.
Chapter 5
Numerical Results and Discussion
In this chapter, we verify the correctness of our analytical models and obtain numerical data to study SMBS performance. Model input parameter values used in our experiments are presented. We use event-driven simulations to verify the correctness of our analytical models. We will show that the results obtained from analytical models are very close to those obtained from simulations.
We evaluate the performance of the SMBS under various configurations (various 
system sizes, segment configurations, flit buffer sizes and connection topologies) and 
under different workloads (varying memory request rates, number of flits  and ref­
erence patterns). We study the performance under uniform  memory reference and 
then exam ine the im pact of memory reference locality on scalability for both the 
single flit buffer and in fin ite  flit buffer models.
We address how the performance changes w ith different flit buffer sizes: namely 
single flit buffers, packet-sized flit buffers and in fin ite  flit buffers, and also study 
the performance of two SMBS configurations: a ring connection of segments w ith 
in fin ite flit buffers and a linear connection of segments w ith single flit buffers and 
in fin ite flit buffers. Performance analysis w ill show th at network latency is more 
influenced by bus contention than network distance. Sets o f data obtained from  
both the packet-sized flit buffer model and the ring connection model are based on 
event-driven simulations.
Finally, we study the performance impact of the number of flits per packet and the number of segment buses per segment. Our results will show that the SMBS favors a small number of flits per packet and a wide bus width.
5.1 Model Input Parameters
Model parameters that have to be specified numerically can be classified into two basic categories: those that describe a particular system configuration and those that describe expected workloads. When we model a design that has not been built, parameter values have to be estimated. In particular, determining workload parameter values properly is a difficult part of the analysis process. System configuration and workload parameters used in our model are listed in Table 3.1 in Chapter 3.
For our purposes here, a baseline system is configured as an SMBS with a linear connection of segments. Each segment is composed of 10 processors (n = 10), 10 memory modules (m = 10) and 10 segment buses (b = 10). The baseline system is studied for both the single flit buffer and infinite flit buffer cases. To examine the impact of increasing system size on system scalability, we increase the number of segments in the SMBS for two different segment configurations: one with n = 10, m = 10 and b = 10, and the other with n = 20, m = 20 and b = 10. To study the effect of the number of segment buses per segment on performance, we vary the number of segment buses in a segment from 6 to 12.
The flit time of $L_f/W_b$ will be used as our basic timing unit. Thus, we normalize $L_f/W_b$ to one unit of time, where $L_f$ is the length of a flit and $W_b$ is the segment bus
bandwidth. The mean memory request interval ($r_p$) and memory service time ($r_m$) will be in units of flit time. Considering a 32-bit wide segment bus and the fact that message length (packet length) is expected to be about 100 bits on average in a shared memory multiprocessor [46], we assume the basic packet length to be three flits (t = 3). However, to study the effect of the number of flits per packet, the number of flits per packet will be varied from 3 to 5.
The mean request interval, $r_p$, is varied from 30 to 200 time units. Values of $r_p$ higher than 200 show very little performance improvement. For values of $r_p < 30$ time units, some segment buses, most likely those in the middle segments, are nearly saturated, particularly in experiments dealing with a baseline system with finite flit buffers. For such low values of the memory request interval, our model does not give a reasonable evaluation. However, extensive simulations show the given range for $r_p$ to be adequate for studying the performance of our model. The mean memory service time, $r_m$, is set to 4 units of time.
5.2 Validation of the Analytical Results
We used event-driven simulations to verify the correctness of our analytical models. The simulations modeled the actual behavior of the SMBS for both the single flit buffer and infinite flit buffer cases, without any simplifying assumptions. The simulator implemented wormhole routing and the bus assignment priority of a transient packet exactly. Our simulation program (about 3,700 lines long) was written in the C programming language. It was driven by a multiplicative congruential random number generator [47, 48].
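To make the idea of a multiplicative congruential generator concrete, here is a minimal Python sketch. The parameters actually used in the simulator are not stated in the text, so the multiplier 16807 and modulus 2^31 - 1 (the widely used "minimal standard" choice) are assumptions, as is the class name `Lehmer`. Drawing exponentially distributed request intervals with mean $r_p$ is likewise an assumption made for illustration.

```python
import math

MODULUS = 2**31 - 1   # assumed modulus (the prime 2147483647)
MULTIPLIER = 16807    # assumed multiplier ("minimal standard" choice)


class Lehmer:
    """Multiplicative congruential generator: x <- (a * x) mod m."""

    def __init__(self, seed=1):
        if not 0 < seed < MODULUS:
            raise ValueError("seed must satisfy 0 < seed < MODULUS")
        self.state = seed

    def next_int(self):
        self.state = (MULTIPLIER * self.state) % MODULUS
        return self.state

    def uniform(self):
        # The state is never 0, so the result lies strictly inside (0, 1).
        return self.next_int() / MODULUS

    def exponential(self, mean):
        # Inverse-transform sampling of an exponential variate.
        return -mean * math.log(self.uniform())


gen = Lehmer(seed=1)
first_two = [gen.next_int(), gen.next_int()]
intervals = [gen.exponential(mean=40.0) for _ in range(5)]
print(first_two, [round(x, 2) for x in intervals])
```

Python integers do not overflow, so the product can be taken directly; a C version would need 64-bit arithmetic or Schrage's decomposition to avoid overflow.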
We used 95% confidence intervals and required the half width of each interval to be within 10% of the mean; when this held, the desired accuracy was considered satisfactory. Simulation estimates were derived by using the method of batch means. Simulation results were finally obtained by averaging the results of ten batch simulations. Each processor generated 1,000 memory requests in each batch. Considering 10 indistinguishable processors in the basic segment configuration, each batch is equivalent to the case obtained by generating 10,000 memory requests of each class.
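The accuracy criterion above can be sketched in a few lines of Python. This is an illustration rather than the procedure coded in the simulator: it assumes the interval is built from a Student-t quantile over the batch means (2.262 for ten batches, i.e. 9 degrees of freedom), and the function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

T_975_DF9 = 2.262  # Student-t 97.5% quantile for 9 degrees of freedom


def batch_means_ci(batch_means, t_quantile=T_975_DF9):
    """Return (grand mean, confidence interval half width) from batch means."""
    k = len(batch_means)
    centre = mean(batch_means)
    half_width = t_quantile * stdev(batch_means) / sqrt(k)
    return centre, half_width


def accurate_enough(batch_means, tolerance=0.10):
    """True if the half width is within `tolerance` of the mean."""
    centre, half_width = batch_means_ci(batch_means)
    return half_width <= tolerance * abs(centre)


# Ten made-up batch averages of response time (in flit times).
sample = [14.0, 16.0, 14.0, 16.0, 14.0, 16.0, 14.0, 16.0, 14.0, 16.0]
centre, half_width = batch_means_ci(sample)
print(f"{centre:.3f} +/- {half_width:.3f}, ok = {accurate_enough(sample)}")
```

If the number of batches changes, the t quantile must be changed to match the new degrees of freedom.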
To deal with the start-up transient of the simulation, we truncated the first 10% of memory requests. This was observed to be long enough to make any start-up transient negligible. In terms of execution time, a simulation estimate of a baseline system requires about twenty hours of CPU time on a Sun SPARC-5 computer; however, numerical evaluations based on our analytical model took less than ten minutes each.
Simulation results are presented along with analytical results in Tables 5.1 to 5.4. Tables 5.1 and 5.2 present comparisons of analytical performance measures with simulation results for SMBS's of two different sizes with single flit buffers. For the case of infinite flit buffers, the results are presented in Tables 5.3 and 5.4 for two different sizes.
As can be seen, the differences between simulation results and analytical results are insignificant for both the single flit buffer and infinite flit buffer cases. Both are shown for $r_m = 4$ and t = 3, and two different system sizes: g = 5 and g = 7. In
Table 5.1: Comparison of performance with simulation results: SMBS with single flit buffers, n = 10, m = 10, b = 10, g = 5, $r_m$ = 4 and t = 3.

Request            Processing efficiency        Response time
interval, $r_p$    Analysis   Simulation        Analysis   Simulation
40                 72.226     72.472 ± .030     15.381     15.193 ± .023
50                 76.981     77.147 ± .021     14.950     14.812 ± .018
70                 82.885     82.923 ± .017     14.453     14.416 ± .017
100                87.658     87.604 ± .011     14.078     14.150 ± .015
150                91.582     91.498 ± .009     13.786     13.937 ± .017
200                93.615     93.524 ± .006     13.640     13.849 ± .014

Table 5.2: Comparison of performance with simulation results: SMBS with single flit buffers, n = 10, m = 10, b = 10, g = 7, $r_m$ = 4 and t = 3.

Request            Processing efficiency        Response time
interval, $r_p$    Analysis   Simulation        Analysis   Simulation
40                 67.668     67.612 ± .051     19.111     19.161 ± .045
50                 73.466     73.663 ± .035     18.058     17.876 ± .032
70                 80.511     80.587 ± .031     16.944     16.862 ± .033
100                86.079     86.027 ± .011     16.172     16.242 ± .014
150                90.575     90.450 ± .008     15.608     15.837 ± .015
200                92.877     92.744 ± .003     15.338     15.647 ± .008

Table 5.3: Comparison of performance with simulation results: SMBS with infinite flit buffers, n = 10, m = 10, b = 10, g = 5, $r_m$ = 4 and t = 3.

Request            Processing efficiency        Response time
interval, $r_p$    Analysis   Simulation        Analysis   Simulation
30                 66.877     67.011 ± .016     14.857     14.768 ± .011
40                 73.359     73.505 ± .012     14.525     14.417 ± .009
50                 77.754     77.878 ± .011     14.304     14.202 ± .009
70                 83.304     83.363 ± .008     14.028     13.969 ± .008
100                87.870     87.884 ± .009     13.803     13.785 ± .012
150                91.678     91.655 ± .008     13.616     13.655 ± .014
200                93.668     93.648 ± .003     13.518     13.563 ± .007

Table 5.4: Comparison of performance with simulation results: SMBS with infinite flit buffers, n = 10, m = 10, b = 10, g = 7, $r_m$ = 4 and t = 3.

Request            Processing efficiency        Response time
interval, $r_p$    Analysis   Simulation        Analysis   Simulation
30                 63.454     63.634 ± .023     17.277     17.144 ± .017
40                 70.472     70.813 ± .020     16.759     16.486 ± .016
50                 75.295     75.621 ± .018     16.404     16.118 ± .016
70                 81.440     81.682 ± .011     15.952     15.697 ± .011
100                86.521     86.649 ± .007     15.577     15.407 ± .010
150                90.763     90.805 ± .004     15.264     15.188 ± .007
200                92.980     92.989 ± .004     15.099     15.078 ± .011

all these cases, the analytical and simulation results match very well and are within 5% of one another.
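The agreement claimed above can be checked directly from the tabulated numbers. The Python sketch below uses the rows of Table 5.1 and reports the largest relative difference between analysis and simulation. It also tests one relationship that is an inference from the tables rather than a quotation of the formal definition given earlier in the dissertation: the tabulated processing efficiency appears to equal $100 \cdot r_p / (r_p + \text{response time})$ to within rounding. The function names are hypothetical.

```python
# Rows of Table 5.1:
# (r_p, efficiency analysis, efficiency simulation,
#  response analysis, response simulation)
TABLE_5_1 = [
    (40, 72.226, 72.472, 15.381, 15.193),
    (50, 76.981, 77.147, 14.950, 14.812),
    (70, 82.885, 82.923, 14.453, 14.416),
    (100, 87.658, 87.604, 14.078, 14.150),
    (150, 91.582, 91.498, 13.786, 13.937),
    (200, 93.615, 93.524, 13.640, 13.849),
]


def relative_difference_percent(analysis, simulation):
    """Relative difference of the analytical value from the simulated one."""
    return 100.0 * abs(analysis - simulation) / simulation


def efficiency_from_response(r_p, response_time):
    """Inferred relationship (an assumption, see the lead-in)."""
    return 100.0 * r_p / (r_p + response_time)


worst = max(
    max(
        relative_difference_percent(eff_a, eff_s),
        relative_difference_percent(resp_a, resp_s),
    )
    for _, eff_a, eff_s, resp_a, resp_s in TABLE_5_1
)
largest_gap = max(
    abs(efficiency_from_response(r_p, resp_a) - eff_a)
    for r_p, eff_a, _, resp_a, _ in TABLE_5_1
)
print(f"largest relative difference: {worst:.2f}%")
print(f"largest gap to inferred efficiency formula: {largest_gap:.4f}")
```

For this table the largest relative difference stays well below the 5% bound quoted in the text.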
5.3 Numerical Results
To examine how SMBS's scale with increasing system size, we evaluated the processing efficiency of systems with single flit buffers under uniform memory reference as a function of the number of segments (g) for different request rates ($1/r_p$). The results are shown in Figures 5.1 and 5.2 for two different segment configurations: n = 10, m = 10 and b = 10, and n = 20, m = 20 and b = 10, respectively. Both figures show that scalability is poor in both cases: processing efficiency decreases rapidly with increasing number of segments (system size) at moderate or high request rates ($1/r_p > 0.02$ in Figure 5.1 and $1/r_p > 0.01$ in Figure 5.2). The low processing efficiency is primarily due to the increase in bus contention as the request rate increases, which is to be expected since the buffer size is finite.






Figure 5.1: Processing efficiency of single flit buffer model under uniform memory reference: n = 10, m = 10, b = 10, $r_m$ = 4 and t = 3.
Figure 5.2: Processing efficiency of single flit buffer model under uniform memory reference: n = 20, m = 20, b = 10, $r_m$ = 4 and t = 3. [Horizontal axis: number of segments.]
As we mentioned earlier, the pipelining property of wormhole routing can make the network latency largely insensitive to path length if there is no contention. Adve and Vernon [41] showed that poor performance in the torus network with wormhole routing is mainly caused by the inherent latency of communication rather than by contention in the network. Thus, performance in such low dimensional direct networks is likely to be latency-limited rather than bandwidth-limited when processors block after each request.
In an SMBS with finite flit buffers, however, bus contention tends to contribute to blocking to a greater extent. When contention is an issue, the pipelining property, together with the finite size of the flit buffers, contributes greatly to low system performance. The effect of bus contention in the SMBS is expected to be more profound because processors must share segment buses with memory modules as well as with segment switches. The effect of bus contention on performance in bus-based systems could be as significant as distance related latency in direct networks. Hence, a plausible way of reducing the probability of contention is to increase the flit buffer size.
To study how performance changes with different flit buffer sizes, we examined the two extreme cases that set bounds on the performance of systems: namely systems with single flit buffers and systems with infinite flit buffers. Along with the two extreme cases, we also studied the behavior of systems with packet-sized flit buffers, based on event-driven simulations. Figures 5.3 and 5.4 show, respectively, a comparison of the behavior of mean response time and processing efficiency with different




Figure 5.3: Comparison of mean response time with different flit buffer sizes: n
[Horizontal axis: request rate ($1/r_p$).]
Figure 5.4: Comparison of processing efficiency with different flit buffer sizes: n = 10, m = 10, b = 10, g = 9, $r_m$ = 4 and t = 3.
flit buffer sizes. In both figures, there are no significant performance differences among the flit buffer size cases when request rates are low ($1/r_p < 0.015$). As the request rate becomes higher, the performance gap between the infinite flit buffer and finite flit buffer cases becomes progressively larger. However, the performance gap between the single flit buffer and packet-sized flit buffer cases remains relatively small across request rates. The performance degradation of the finite flit buffer case as the request rate increases is largely attributed to increased blocking caused by the finite buffering space. Thus, we conclude that the performance of the SMBS with finite flit buffers is degraded by contention rather than by network distance. In an SMBS, increasing the flit buffer size reduces the probability of contention and improves performance. Furthermore, this property allows the cut-through advantage of wormhole routing to be fully exploited.
We next examine how well the SMBS with infinite flit buffers scales. Figure 5.5 shows that the scalability of the infinite flit buffer model is good even under uniform memory reference, as processing efficiency decreases slowly with increasing number of segments. For the larger segment size in Figure 5.6, we observe that processing efficiency exhibits a slow linear down-slope, even though its overall absolute value is generally lower than in Figure 5.5 (with the smaller segment size). Although the performance decrease becomes larger at high request rates ($1/r_p > 0.02$), the linear down-slope in Figure 5.6 shows clearly that the decrease in processing efficiency is mainly due to the increased network distance that a packet must traverse rather than to bus contention.






Figure 5.5: Processing efficiency of infinite flit buffer model under uniform memory
Figure 5.6: Processing efficiency of infinite flit buffer model under uniform memory reference: n = 20, m = 20, b = 10, $r_m$ = 4 and t = 3. [Horizontal axis: number of segments.]
Next, we compare the performance of the two SMBS configurations, a ring connection of segments and a linear connection of segments, under a uniform memory reference model. The ring connection offers an average of about 33% reduced network delay over a linear connection. Figures 5.7 and 5.8 depict the change in mean response time and processing efficiency of a linear connection and a ring connection of 9 segments for different request rates.
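To give a feel for how the two topologies differ in network distance alone, the following Python sketch counts the mean number of segment boundaries crossed between two distinct segments chosen uniformly at random. It is a simplification under stated assumptions: it measures distance only (no contention, no delay model), it assumes a packet on the ring takes the shorter direction, and the function name is hypothetical. It is not claimed to be the calculation behind the delay figure quoted above.

```python
def mean_distance(segments, ring):
    """Mean segment-to-segment distance over all ordered pairs of
    distinct segments, for a linear or a ring connection."""
    total = 0
    pairs = 0
    for src in range(segments):
        for dst in range(segments):
            if src == dst:
                continue
            hops = abs(src - dst)
            if ring:
                # Assumed: the packet travels in the shorter direction.
                hops = min(hops, segments - hops)
            total += hops
            pairs += 1
    return total / pairs


g = 9
linear = mean_distance(g, ring=False)
ring = mean_distance(g, ring=True)
print(f"linear: {linear:.3f}, ring: {ring:.3f}, ratio: {ring / linear:.2f}")
```

For g = 9 this simple count gives a mean distance of 10/3 for the linear connection and 2.5 for the ring, so the linear mean distance is one third larger than the ring's.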
In Figure 5.7, the difference between the two lines in the infinite flit buffer case indicates the amount by which network delay is reduced through the shorter network distance that the symmetry of the ring connection provides. Of course, the reduced network delay also includes the effect of the balanced utilization of segment buses contributed by the symmetry of the ring connection. We can observe from the figures that the increase in contention due to higher request rates is not sufficiently compensated by the reduction in network distance. Figure 5.8 depicts the processing efficiency difference between the two connection topologies under uniform memory reference. Figures 5.7 and 5.8 confirm, once again, that (segment bus) contention in the SMBS is as significant as the effect of network distance latency in direct networks.
The results presented thus far assume a uniform memory reference model. We next examine the impact of memory reference locality on performance. Applications exhibiting memory reference locality can take advantage of architectural features of the SMBS to realize performance gains. An application with high locality is expected to reduce bandwidth demands on the system. This in turn could






Figure 5.7: Comparison of mean response time with the two different connection topologies (linear and ring connections): n = 10, m = 10, b = 10, g = 9, $r_m$ = 4 and t = 3. [Horizontal axis: request rate ($1/r_p$).]
Figure 5.8: Comparison of processing efficiency with the two different connection topologies (linear and ring connections): n = 10, m = 10, b = 10, g = 9, $r_m$ = 4 and t = 3.
cause the contention effect to be reduced significantly. As we assumed earlier, for a given locality, a fraction of requests is directed to local memory modules within the originating segment and the remaining requests are evenly distributed over non-local memory modules (outside the originating segment). Figure 5.9 shows the processing efficiency of a single flit buffer model as a function of the number of segments for varying memory reference localities. The figure is shown for n = 10, m = 10, b = 10, $1/r_p = 0.02$, $r_m = 4$ and t = 3. Figure 5.10 depicts the processing efficiency of a single flit buffer model for $1/r_p = 0.01$ and a larger segment size: n = 20, m = 20 and b = 10. In both figures, processing efficiency improves substantially with increasing locality because locality reduces bus contention as well as average network distance.
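The locality model described above can be written down as a destination distribution. The Python sketch below is illustrative only: it assumes every segment has the same number of memory modules, so that spreading the non-local requests evenly over non-local modules is the same as spreading them evenly over the other segments, and the function names are hypothetical. Under this assumption, uniform memory reference corresponds to a locality of 1/g.

```python
def destination_probabilities(src, segments, locality):
    """Probability that a request issued in segment `src` targets a
    memory module in each segment."""
    if segments == 1:
        return [1.0]
    remote = (1.0 - locality) / (segments - 1)
    return [locality if seg == src else remote for seg in range(segments)]


def mean_distance_linear(src, segments, locality):
    """Mean number of segment boundaries crossed in a linear connection."""
    probs = destination_probabilities(src, segments, locality)
    return sum(p * abs(seg - src) for seg, p in enumerate(probs))


g = 9
middle = g // 2
uniform = mean_distance_linear(middle, g, locality=1.0 / g)
local_04 = mean_distance_linear(middle, g, locality=0.4)
print(f"uniform: {uniform:.3f}, locality 0.4: {local_04:.3f}")
```

The example shows only the distance side of the argument; the reduction in bus contention reported in the figures is not modeled here.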
As can be seen in Figures 5.5 and 5.9, the performance of the single flit buffer case with a locality of 0.4 is comparable to that of the infinite flit buffer case under uniform memory reference. For the larger segment size, as seen in Figures 5.6 and 5.10, the performance of the single flit buffer case with a locality of 0.6 is as good as that of the infinite flit buffer case. From these observations, we conclude that although the scalability of single flit buffer models is generally poor because of high bus contention, their performance improves significantly with sufficient locality: with a locality of 0.4 to 0.6 (depending on segment size), single flit buffer models perform as well as infinite flit buffer models under a uniform memory reference assumption. As seen earlier, the scalability of the infinite flit buffer models is good.




Figure 5.9: Effect of memory reference locality for single flit buffer model: n = 10, m = 10, b = 10, $r_m$ = 4, $1/r_p$ = 0.02 and t = 3.
Figure 5.10: Effect of memory reference locality for single flit buffer model: n = 20, m = 20, b = 10, $r_m$ = 4, $1/r_p$ = 0.01 and t = 3. [Horizontal axis: number of segments; curves are labeled by locality.]
Figures 5.11 and 5.12 depict the processing efficiency of infinite flit buffer models of the two different segment sizes with varying memory reference localities. In both figures, processing efficiency improves with increasing locality. We notice that increasing memory reference locality improves processing efficiency more rapidly for a single flit buffer model than for an infinite flit buffer model. The positive effect of memory reference locality on performance is more profound in the single flit buffer case than in the infinite flit buffer case because network contention in the single flit buffer case is much higher.
The number of flits per packet, t, is another factor that can affect system performance. In direct networks, when there is no contention and packet length is relatively large, network latency is almost insensitive to network distance. In contrast to direct networks, the SMBS is more susceptible to network contention because it is a bus-based system. Hence, the SMBS favors a small number of flits per packet, because a small number of flits reduces the possibility of a blocked packet simultaneously occupying several network resources. Figures 5.13 and 5.14 show the performance degradation as the number of flits per packet increases. As request rates become high, the performance deterioration becomes severe, as we might expect.
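The distance insensitivity of contention-free wormhole routing can be seen from the usual textbook latency expressions, sketched below in Python. The sketch assumes one flit time per hop, no routing or arbitration delay and no contention; it is a generic illustration, not the residence time model of this dissertation, and the function names are hypothetical.

```python
def wormhole_latency(hops, flits):
    """Contention-free pipelined latency in flit times: the header needs
    `hops` flit times, and the remaining flits follow one per flit time."""
    return hops + (flits - 1)


def store_and_forward_latency(hops, flits):
    """Latency in flit times when every hop receives the whole packet
    before forwarding it."""
    return hops * flits


for flits in (3, 30):
    near = wormhole_latency(2, flits)
    far = wormhole_latency(8, flits)
    growth = 100.0 * (far - near) / near
    print(f"t = {flits}: 2 hops -> {near}, 8 hops -> {far} (+{growth:.0f}%)")
```

With long packets the extra hops add little to the pipelined latency; with short packets the distance term dominates. The cost of long packets under contention, which the paragraph above describes, is outside this simple formula.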
We assumed earlier that a basic segment consists of 10 processors, 10 memory modules and 10 segment buses. Although it is difficult to find a general relationship between performance and the system configuration parameters n, m, b and g, it is interesting to observe how performance changes with varying system parameters. Here, we observe the effect of the number of segment buses on performance.






Figure 5.11: Effect of memory reference locality for infinite flit buffer model: n = 10,
[Horizontal axis: number of segments.]
Figure 5.12: Effect of memory reference locality for infinite flit buffer model: n = 20, m = 20, b = 10, $r_m$ = 4, $1/r_p$ = 0.025 and t = 3.
Figure 5.13: Mean response time with varying flit numbers per packet: n = 10, m = 10, b = 10, g = 9 and $r_m$ = 4. [Horizontal axis: request rate ($1/r_p$).]
Figure 5.14: Processing efficiency with varying flit numbers per packet: n = 10, m = 10, b = 10, g = 9 and $r_m$ = 4.
Figure 5.15 shows a comparison of mean response time, for the single flit buffer case, with a varying number of segment buses per segment. This is shown for n = 10, m = 10, g = 5, $r_m = 4$ and t = 3 under a uniform memory reference assumption. In Figure 5.15, we observe that the mean response time increases rapidly when the number of segment buses in a segment is reduced below 8 for the given parameter setting. While it is difficult to predict quantitatively the threshold at which abrupt performance degradation occurs, we can explain it qualitatively: a shortage of segment buses increases bus contention, which in turn increases blocking, and increased blocking further increases contention. This cycle of negative effects produces the abrupt performance degradation. We can therefore at least point out that the performance of an SMBS is very sensitive to the effect of bus contention.
Figure 5.16 shows a comparison of processing efficiency for the single flit buffer case under a uniform memory reference assumption with a varying number of segment buses per segment.
Figure 5.15: Mean response time of single flit buffer case with varying segment buses per segment: n = 10, m = 10, g = 5, $r_m$ = 4 and t = 3.
Figure 5.16: Processing efficiency of single flit buffer case with varying segment buses per segment: n = 10, m = 10, g = 5, $r_m$ = 4 and t = 3. [Horizontal axis: request rate ($1/r_p$).]
Chapter 6
Conclusion
We have introduced and investigated a new class of bus-based multiprocessor systems called the Segmented Multiple Bus System (SMBS). The new architecture offers increased scalability while maintaining the bus-based shared memory system's advantages in terms of high degree of fault tolerance, ease of expansion and ease of programming.
One of the unique characteristics of the SMBS is that it is composed of smaller, bus-based, full connection partitions called segments. A set of segment switches positioned between adjacent segments serves the function of electrically isolating (or buffering) the adjacent segments. Thus, in the SMBS, the bus loading problem associated with traditional multiple bus systems is not an issue if segments are not too large, although each segment is a full connection multiple bus system. Another interesting feature of the SMBS is that it supports wormhole switching, which is traditionally used in direct network topologies.
Moreover, through segmentation, a very fast bus cycle is possible and the overall bandwidth of the SMBS increases proportionally with the number of segments. However, increasing the number of segments also increases latency by a corresponding amount. Together, segmentation and wormhole routing make the system scalable even though the SMBS is a bus-based system.
The segmented system can be classified as a non-uniform memory access network although each segment is a full bus connection subsystem. In such a non-uniform memory access network, applications whose memory reference patterns tend to favor near memory modules can run efficiently. Therefore, the segmented system can exploit memory reference locality, which is usually not the case in conventional bus-based systems. By exploiting the locality of memory reference inherent in many applications, the scalability of the SMBS could be drastically improved.
One of the most important design aspects of a multiple bus system is the arbitration mechanism. In the SMBS, we have employed a distributed parallel arbitration mechanism. In each segment, a two-phase arbitration scheme has been employed to resolve memory module and segment bus access conflicts. In the first phase, the segment bus is assigned to a segment switch if a request exists at the segment switch. In the second phase, both processor selection and bus assignment are carried out in parallel in order to reduce arbitration and assignment delay. Each segment has an arbitration system identical to that in all other segments, and arbitration in each segment is independent of that in other segments. Hence, the arbitration delay in the SMBS (although it is relatively large) depends not on system size but on the size of a segment; i.e., arbitration delay is not a function of the number of segments in the system. This is another attractive feature of the SMBS.
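The two-phase scheme summarized above can be sketched for a single segment and a single arbitration cycle. The Python code below is a loose illustration and not the hardware arbiter: the tie-breaking rule (lowest processor index wins a memory module), the order in which buses are handed out, the identifiers, and the function name are all assumptions. The two parts of phase two run one after the other here, whereas the text describes them as running in parallel.

```python
def arbitrate_segment(num_buses, switch_requests, processor_requests):
    """Return a dict mapping bus index -> (kind, requester) for one cycle.

    switch_requests: identifiers of segment switches holding a request.
    processor_requests: dict processor id -> requested memory module id.
    """
    grants = {}
    free_buses = list(range(num_buses))

    # Phase 1: segment switches with pending requests get buses first.
    for switch in switch_requests:
        if not free_buses:
            break
        grants[free_buses.pop(0)] = ("switch", switch)

    # Phase 2a: processor selection, one winner per requested module
    # (assumed rule: lowest processor id wins).
    winners = {}
    for proc in sorted(processor_requests):
        winners.setdefault(processor_requests[proc], proc)

    # Phase 2b: bus assignment for the selected processors.
    for module in sorted(winners):
        if not free_buses:
            break
        grants[free_buses.pop(0)] = ("processor", winners[module])

    return grants


example = arbitrate_segment(
    num_buses=2,
    switch_requests=["left"],
    processor_requests={0: 5, 1: 5, 2: 7},
)
print(example)
```

In the example, the switch request takes one bus, processor 0 wins memory module 5 and takes the other, and processors 1 and 2 wait for a later cycle. Nothing in the function depends on the number of segments, which mirrors the point that arbitration delay depends only on segment size.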
The SMBS is a cascaded connection of segments. To scale up an SMBS with a linear connection, additional segments can be added at the ends. In a ring connection of segments, the ring can be opened and additional segments inserted into
the ring. This simple and modular way of extending the system gives the designer more flexibility in terms of system scalability.
We have developed accurate queueing models for wormhole routed SMBS's with single and with infinite flit buffers employed at each segment switch. This, to our knowledge, has been the first attempt to adapt wormhole routing to a bus-based nondirect network. Using approximate MVA, we have analytically evaluated mean residence times at a segment switch, a processor and a memory module. These were used to evaluate the mean response time of a memory request and the processing efficiency, which we have used as performance metrics.
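For readers unfamiliar with approximate MVA, the following Python sketch shows the basic fixed-point iteration in its simplest form, Schweitzer's approximation for a single-class closed network of queueing centers with a think time. It is a generic stand-in named plainly as such: the models in this dissertation are multi-class and include blocking, pipelining and priority effects that this sketch does not attempt to capture. The function name and the sample demands are hypothetical.

```python
def approximate_mva(demands, customers, think_time, tol=1e-10, max_iter=10000):
    """Schweitzer's approximate MVA for one customer class.

    demands: service demand at each queueing center.
    Returns (throughput, response_time, queue_lengths).
    """
    k = len(demands)
    queues = [customers / k] * k          # initial guess: even spread
    scale = (customers - 1) / customers   # arrival-instant correction
    throughput = response = 0.0
    for _ in range(max_iter):
        residence = [d * (1.0 + scale * q) for d, q in zip(demands, queues)]
        response = sum(residence)
        throughput = customers / (think_time + response)
        new_queues = [throughput * r for r in residence]
        change = max(abs(a - b) for a, b in zip(new_queues, queues))
        queues = new_queues
        if change < tol:
            break
    return throughput, response, queues


# Hypothetical example: 10 customers, think time 40, two centers.
x, r, q = approximate_mva([4.0, 3.0], customers=10, think_time=40.0)
print(f"throughput = {x:.4f}, response time = {r:.3f}")
```

With a single customer the correction factor is zero, so the response time reduces to the sum of the demands, which gives a convenient check of the iteration.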
Our analytical queueing models are novel because they include features of both direct and nondirect networks. We have included the effect of blocking and that of pipelining associated with wormhole routing. The blocking effect has been captured in computing the mean waiting time for a header flit to compete for a segment bus at a segment switch. In the calculation of the flit residence time at a segment switch, we have taken into account the flit's movement in relation to the other flits in the same packet by using the pipelining property of wormhole routing. The bus group of each segment has been modeled as a flow equivalent service center representing multiple servers with a single central queue. The central server has been considered to be load dependent.
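The notion of a load dependent center that stands in for several servers sharing one queue can be illustrated with a short Python sketch. It assumes identical servers, so that the service rate with k customers present is min(k, servers) divided by the service time. The second function, which solves a truncated birth-death chain with a constant arrival rate, is added only to show the rates in use; it is an assumption for illustration and not how the center is embedded in the dissertation's closed model. All names are hypothetical.

```python
def load_dependent_rates(servers, service_time, max_customers):
    """Service rate of the center when k = 1..max_customers are present."""
    return [min(k, servers) / service_time for k in range(1, max_customers + 1)]


def stationary_distribution(arrival_rate, rates):
    """Stationary probabilities of 0..len(rates) customers for a
    birth-death chain with constant arrivals and the given service rates."""
    weights = [1.0]
    for mu in rates:
        weights.append(weights[-1] * arrival_rate / mu)
    total = sum(weights)
    return [w / total for w in weights]


rates = load_dependent_rates(servers=2, service_time=1.0, max_customers=4)
probs = stationary_distribution(arrival_rate=0.5, rates=rates)
print(rates, [round(p, 4) for p in probs])
```

The rate grows with the number of customers until all servers are busy and then stays flat, which is what makes the equivalent center load dependent.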
To reduce bus contention, we have assumed that a transient packet has priority over others in the context of bus assignment. The effect of this priority has been
reflected in a larger bus access time for non-transient packets than for transient packets.
To check the effect of the assumptions and approximations introduced in the analytical queueing models, we have executed extensive simulations without any approximations. The results obtained from the analytical models have been found to be very close to those obtained from simulations. We have investigated how system performance changes with different flit buffer sizes and with different connection topologies. We have addressed how system performance changes when the number of segment buses and the number of flits per packet are varied. We have also studied the performance under a uniform memory reference pattern and examined the impact of memory reference locality on the system's scalability. The positive effect of memory reference locality in the single flit buffer case has been found to be more profound than that in the infinite flit buffer case. This is because segment bus contention is inherently higher in the single flit buffer case. We can summarize our conclusions on system scalability as follows.
•  Under a uniform memory reference assumption, an SMBS with single flit buffers does not scale well; however, the system with infinite flit buffers shows good scalability.
•  Memory reference locality improves system performance for both the single flit buffer and infinite flit buffer cases because locality reduces bus contention as well as average network distance. The improvement in performance of the
single flit buffer case is more substantial than that of the in fin ite  flit buffer 
case because bus contention is inherently higher in  the single flit buffer case.
•  The performance o f the single flit buffer models w ith locality in  the 0.4 to 
0 . 6  range (depending on segment size) is comparable to th a t o f the infinite  
flit buffer models under a uniform  memory reference assumption (which show 
good scalability).
We have made several observations which compare our results w ith  the generally 
known results in the context o f wormhole routed low dimensional direct networks. 
Such observation can be summarized as follows:
•  While performance in low dimensional direct networks is generally network
distance limited, performance in an SMBS with finite flit buffers is likely to
be bus contention limited.
•  Increasing the flit buffer size in the SMBS reduces bus contention and results
in significant performance improvement.
•  If the number of flits per packet is relatively large, the negative effects of long
network distance in low dimensional direct networks are minimized. In contrast
to direct networks, the SMBS is likely to be adversely affected by an increased
number of flits per packet since it would lead to high levels of bus contention.
•  A small number of flits per packet and a wide bus width will help to reduce
the probability of contention and consequently improve the performance of the
SMBS.
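The contrast drawn above can be made concrete with the standard zero-load
(contention-free) wormhole latency estimate, in which the header flit needs one
cycle per hop and the remaining flits follow in a pipeline. This is a textbook
approximation rather than the analytical model of this work, and it deliberately
ignores the bus contention that limits SMBS performance. The function names, the
one-cycle-per-hop timing, and the assumption that one flit is exactly one bus
width are illustrative.

```python
import math


def flits_per_packet(packet_bits, bus_width_bits):
    """Number of flits needed, assuming one flit is one bus width (assumption)."""
    return math.ceil(packet_bits / bus_width_bits)


def zero_load_wormhole_latency(hops, flits):
    """Cycles until the tail flit arrives, with no contention anywhere.

    The header takes `hops` cycles; the other flits - 1 flits are pipelined
    behind it, one cycle apart.
    """
    return hops + flits - 1


def distance_share(hops, flits):
    """Fraction of the zero-load latency that is due to network distance."""
    return hops / zero_load_wormhole_latency(hops, flits)


if __name__ == "__main__":
    for flits in (1, 4, 16):
        print(flits, zero_load_wormhole_latency(6, flits),
              round(distance_share(6, flits), 3))
    print(flits_per_packet(128, 32), flits_per_packet(128, 64))
```

With 6 hops, distance accounts for the whole latency of a single-flit packet but
for under a third of the latency of a 16-flit packet, which is why long packets
hide distance in a contention-free direct network. Doubling the bus width from
32 to 64 bits halves the flit count of a 128-bit packet, which is the lever the
last observation points to for the SMBS.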
There are still several issues open for future research. We summarize these as follows.
•  We have developed models and evaluated system performance for two extreme
cases: the single flit buffer case and the infinite flit buffer case. It would be
worthwhile to develop an analytical performance model for the SMBS with any
finite-sized flit buffers. Although this is a difficult problem, solving it would
help in determining the flit buffer size required to match the performance of
the infinite flit buffer case to within a certain percentage.
•  The performance of an SMBS was shown to be very sensitive to the effect
of bus contention. Thus, it is important to study the impact and role of
caches in reducing bus traffic, and consequently bus contention. Research
on performance improvement in SMBSs caused by introducing caches would
complement our research.
•  The approach we adopted in developing the analytical models is rather
comprehensive in the sense that the models incorporate features of both direct and
nondirect networks. This approach can possibly be applied to several other
network topologies which are classified as direct and/or nondirect networks.
For instance, if the shared memory in an SMBS is distributed to the processors
(becomes local to processors), the model would be easily applicable to the
resulting message passing environment.
•  Developing a deadlock-free routing algorithm for a ring connection of segments
with finite flit buffers would be interesting work. In a direct network, deadlock
can generally be avoided using one of two approaches: dimension ordering [6,
11, 12] and virtual channels [31]. However, the fact that a segment bus is
a resource shared by many and that it is arbitrated among many requesters
might require a new way of thinking.
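For reference, dimension ordering in a direct network can be sketched as
follows. The example assumes a two-dimensional mesh without wraparound links and
is not a routing algorithm for the SMBS; the function name and the coordinate
convention are hypothetical. A packet corrects its X coordinate completely
before it moves in Y, so no packet ever turns from the Y dimension back into the
X dimension, which is what rules out cyclic channel dependencies in a mesh.

```python
def dimension_order_route(src, dst):
    """Return the nodes visited from src to dst in a 2D mesh (X first, then Y).

    src and dst are (x, y) tuples; the mesh is assumed to have no
    wraparound links, so each step moves one node toward the target.
    """
    x, y = src
    path = [(x, y)]
    # Phase 1: travel along the X dimension until the column matches.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: travel along the Y dimension; X is never revisited.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(dimension_order_route((0, 0), (2, 1)))
```

A segment bus shared by many requesters does not map directly onto the
point-to-point channels this scheme assumes, which is the difficulty noted above.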
Bibliography
[1] T. Y. Feng, “A Survey of Interconnection Networks,” IEEE Comput. Mag.,
Vol. 14, No. 12, December 1981, pp. 12-27.
[2] W. A. Wulf and C. G. Bell, “C.mmp - A Multi-Mini-Processor,” Proc. Fall
Joint Comput. Conf., AFIPS, December 1972, pp. 756-777.
[3] T. Lang, M. Valero, and I. Alegre, “Bandwidth of Crossbar and Multiple-Bus
Connections for Multiprocessors,” IEEE Trans. Comput., Vol. C-31, No. 12,
December 1982, pp. 1227-1234.
[4] T. N. Mudge, J. P. Hayes, and D. C. Winsor, “Multiple Bus Architectures,”
IEEE Comput., June 1987, pp. 42-48.
[5] H. J. Siegel, Interconnection Networks for Large Scale Parallel Processing:
Theory and Case Studies, Lexington Books, Lexington, MA, 1984.
[6] H. Sullivan and T. R. Bashkow, “A Large Scale, Homogeneous, Fully
Distributed Parallel Machine,” ACM Proc. 4th Annual Int. Symp. Comput.
Architecture, IEEE, March 1977, pp. 105-124.
[7] J. T. Kuehn and B. J. Smith, “The HORIZON Supercomputing System:
Architecture and Software,” Proc. Supercomputing ’88, November 1988.
[8] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, J. Hennessy, M. Horowitz,
and M. Lam, “Design of the Stanford DASH Multiprocessor,” Comput. Syst.
Lab. TR 89-403, Stanford Univ., December 1989.
[9] A. Agarwal, B. H. Lim, D. A. Kranz, and J. Kubiatowicz, “APRIL: A Processor
Architecture for Multiprocessing,” ACM Proc. 17th Annual Int. Symp. Comput.
Architecture, June 1990, pp. 104-114.
[10] W. J. Dally and C. L. Seitz, “The Torus Routing Chip,” Distributed Comput.,
Vol. 1, No. 3, 1986, pp. 187-196.
[11] Intel Corporation, A Touchstone DELTA System Description, 1991.
[12] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele,
and W. K. Su, “The Architecture and Programming of the Ametek Series 2010
Multicomputer,” Proc. 3rd Conf. Hypercube Concurrent Computers and
Applications, Pasadena, CA, January 1988, pp. 33-36.
[13] M. D. Noakes, D. A. Wallach, and W. J. Dally, “The J-Machine Multicomputer:
An Architectural Evaluation,” 20th Annual Int. Symp. Comput. Architecture,
May 1993, pp. 224-235.
[14] W. C. Athas and C. L. Seitz, “Multicomputers: Message-Passing Concurrent
Computers,” IEEE Comput. Mag., Vol. 21, August 1988, pp. 9-24.
[15] H. Jiang and K. C. Smith, “PPMB: A Partial Multiple Bus Multiprocessor
Architecture for Improved Cost Effectiveness,” IEEE Trans. Comput., Vol. 41,
No. 3, March 1992, pp. 361-366.
[16] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative
System Performance - Computer System Analysis using Queueing Network Models,
Englewood Cliffs, NJ: Prentice-Hall, 1984.
[17] K. L. Johnson, “The Impact of Communication Locality on Large-Scale
Multiprocessor Performance,” ACM Proc. 19th Annual Int. Symp. Comput.
Architecture, May 1992, pp. 392-402.
[18] T. Lang, M. Valero, and M. A. Fiol, “Reduction of Connections for Multibus
Organization,” IEEE Trans. Comput., Vol. C-32, No. 8, August 1983, pp. 707-715.
[19] W. T. Chen and J. P. Sheu, “Performance Analysis of Multiple Bus
Interconnection Networks with Hierarchical Requesting Model,” IEEE Trans. Comput.,
Vol. 40, No. 7, July 1991, pp. 834-842.
[20] S. M. Mahmud, “Performance Analysis of Multilevel Bus Networks for
Hierarchical Multiprocessors,” IEEE Trans. Comput., Vol. 43, No. 7, July 1994,
pp. 789-805.
[21] M. N. Karim, “Design and Analysis of Reduced Connection Multiple Bus
Systems: A Probabilistic Approach,” Ph.D. Dissertation, Electrical and
Computer Engineering Dept., Louisiana State University, December 1996.
[22] A. Ghafoor, A. L. Goel, J. K. Chan, T. M. Chen, and S. Sheikh, “Reliability
Analysis of a Fault Tolerant Multi-Bus Multiprocessor System,” Proc. 3rd IEEE
Symp. Parallel and Distributed Processing, Dallas, TX, December 1991,
pp. 436-443.
[23] A. Varma, “Combinatorial Design of Bus-Based Interconnection Structures,”
Research Report RC 12550 (#54752), IBM, Yorktown Heights, New York,
September 1986, pp. 309-312.
[24] E. Luque, D. Rexachs, J. Sorribes, and A. Ripoll, “A Modular Arbitration
System for Multiple Buses Multiprocessors,” Microcomputers, Usage and Design,
Euromicro 1985, pp. 579-585.
[25] R. C. Pearce, J. A. Field, and W. D. Little, “Asynchronous Arbiter Module,”
IEEE Trans. Comput., Vol. C-24, No. 9, September 1975, pp. 931-932.
[26] W. W. Plummer, “Asynchronous Arbiters,” IEEE Trans. Comput., Vol. C-21,
No. 1, January 1972, pp. 37-42.
[27] P. Corsini, “n-User Asynchronous Arbiter,” Electronics Letters, Vol. 11, No. 1,
January 1975, pp. 1-2.
[28] L. Morin and H. F. Li, “Design of Synchronisers: A Review,” IEE Proc., Vol. 136,
Pt. E, No. 6, November 1989, pp. 557-564.
[29] T. Lang and M. Valero, “M-Users B-Servers Arbiter for Multiple Buses
Multiprocessors,” Microprocessing and Microprogramming, Vol. 10, 1982, pp. 11-18.
[30] P. Kermani and L. Kleinrock, “Virtual Cut-Through: A New Computer
Communication Switching Technique,” Comput. Networks, Vol. 3, 1979,
pp. 267-286.
[31] W. J. Dally and C. L. Seitz, “Deadlock-Free Message Routing in Multiprocessor
Interconnection Networks,” IEEE Trans. Comput., Vol. C-36, No. 5, May 1987,
pp. 547-553.
[32] W. J. Dally and P. Song, “Design of a Self-Timed VLSI Multicomputer
Communication Controller,” Proc. Int’l Conf. Computer Design, IEEE, Los Alamitos,
CA, 1987, pp. 230-234.
[33] M. A. Marsan and M. Gerla, “Markov Models for Multiple Bus Multiprocessor
Systems,” IEEE Trans. Comput., Vol. C-31, No. 3, March 1982, pp. 239-248.
[34] K. B. Irani and I. Onyuksel, “A Closed-Form Solution for the Performance
Analysis of Multiple-Bus Multiprocessor Systems,” IEEE Trans. Comput.,
Vol. C-33, No. 11, November 1984, pp. 1004-1012.
[35] D. Towsley, “Approximate Models of Multiple Bus Multiprocessor Systems,”
IEEE Trans. Comput., Vol. C-35, No. 3, March 1986, pp. 220-227.
[36] Q. Yang and S. G. Zaky, “Communication Performance in Multiple-Bus
Systems,” IEEE Trans. Comput., Vol. 37, No. 7, July 1988, pp. 848-853.
[37] Q. Yang and L. N. Bhuyan, “Analysis of Packet-Switched Multiple-Bus
Multiprocessor Systems,” IEEE Trans. Comput., Vol. 40, No. 3, March 1991,
pp. 352-357.
[38] W. J. Dally, “Performance Analysis of k-ary n-cube Interconnection Networks,”
IEEE Trans. Comput., Vol. 39, No. 6, June 1990, pp. 775-785.
[39] A. Agarwal, “Limits on Interconnection Network Performance,” IEEE Trans.
Parallel and Distributed Systems, Vol. 2, No. 4, October 1991, pp. 398-412.
[40] S. L. Scott and J. R. Goodman, “The Impact of Pipelined Channels on k-ary
n-cube Networks,” IEEE Trans. Parallel and Distributed Systems, Vol. 5, No. 1,
January 1994, pp. 2-16.
[41] V. S. Adve and M. K. Vernon, “Performance Analysis of Mesh Interconnection
Networks with Deterministic Routing,” IEEE Trans. Parallel and Distributed
Systems, Vol. 5, No. 3, March 1994, pp. 225-246.
[42] J. Kim and C. R. Das, “Hypercube Communication Delay with Wormhole
Routing,” IEEE Trans. Comput., Vol. 43, No. 7, July 1994, pp. 806-814.
[43] K. G. Shin and S. W. Daniel, “Analysis and Implementation of Hybrid
Switching,” IEEE Trans. Comput., Vol. 45, No. 6, June 1996, pp. 684-692.
[44] J. R. Jackson, “Jobshop-Like Queueing Systems,” Management Science, Vol. 10,
No. 1, October 1963, pp. 131-142.
[45] D. L. Willick and D. L. Eager, “An Analytical Model of Multistage
Interconnection Networks,” Proc. ACM SIGMETRICS Conf. on Measurement and
Modeling of Comput. Syst., May 1990, pp. 192-202.
[46] D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal, “Directory-Based Cache
Coherence in Large-Scale Multiprocessors,” IEEE Comput. Mag., Vol. 23, June
1990, pp. 41-58.
[47] P. A. Lewis, A. S. Goodman, and J. M. Miller, “A Pseudo-Random Number
Generator for the System/360,” IBM Systems Journal, Vol. 8, No. 2, 1969,
pp. 136-146.
[48] H. Kobayashi, An Introduction to System Performance Evaluation Methodology,
Addison-Wesley, 1978.
Appendix: Terminology
$R[s]$  Mean response time of a class s request.
$R_{proc}[s]$  Mean residence time that the header flit of a class s packet experiences at
the head of a processor output queue.
$R_{network}[s]$  Weighted sum of mean residence times of a class s packet in the forward
path and in the return path.
$R_{sd,network}$  Mean residence time in the network of a class s request packet from
the first segment switch in the forward path to the target memory module in
segment d.
$R_{ds,network}$  Mean residence time in the network of a class s reply packet from the
first segment switch in the return path to the requesting processor.
$R_{mem}[s]$  Mean residence time of a class s packet in a memory module.
$R_{d|s,mem_{in}}$  Sum of the mean delay that a class s request packet experiences in the
input queue of a target memory module in segment d and the memory service
time, $r_m$.
$R_{d|s,mem_{out}}$  Mean delay that a class s reply packet experiences in the output queue
of a target memory module in segment d.
$P_{proc}[s]$  Mean processor utilization of class s.
$r_{i,sd}(1)$  Mean residence time of the header flit of a class s packet at a segment switch
in segment i in the forward path from s to d.
$r_{i,ds}(1)$  Mean residence time of the header flit of a class s packet at a segment switch
in segment i in the return path from d to s.
$r_{i,sd}(k)$  Mean residence time of the kth flit of a class s packet at a segment switch
in segment i in the forward path from s to d.
$r_{i,ds}(k)$  Mean residence time of the kth flit of a class s packet at a segment switch
in segment i in the return path from d to s.
$r_i(k)$  Mean residence time of the kth flit at a segment switch in segment i.
$u_i(k)$  Mean utilization of a segment switch in segment i by the kth flit.
$w_i(1)$  Mean waiting time for a segment bus in segment i by a header flit.
$u_{i,p_{out}}(k)$  Mean utilization of a processor output queue by the kth flit of a class i
packet.
$u_{i,m_{out}}(k)$  Mean utilization of a memory output queue in segment i by the kth flit.
$r_{i,p_{out}}(k)$  Residence time of the kth flit of a class i packet at a processor output
queue.
$P_{proc|mem}$  Probability that a processor participates in memory arbitration in the case
of an intra-segment access.
$P_{sw_{up}|mem}$  Probability that a segment switch in an upper bus group participates in
memory arbitration.
$P_{sw_{lo}|mem}$  Probability that a segment switch in a lower bus group participates in
memory arbitration.
$P_{proc|bus}$  Probability that a processor for an inter-segment access participates in bus
arbitration.
$P_{mem|bus}$  Probability that a memory module participates in bus arbitration.
$P_{p_{succ}|mem}$  Probability that a request from a processor succeeds in memory
arbitration in the case of an intra-segment access.
$P_{proc|bus_{eff}}$  Probability that a processor successfully participates in bus arbitration.
[symbol illegible]  Mean waiting time of the header flit of a class s packet directed to
segment d in order to compete for a segment bus at a processor output queue in
segment s.
$\gamma_{d|s,p_{out}}$  Mean residual residence time of a tail flit in service that the header flit of a
class s packet observes when it arrives at the first segment switch in the path
from s to d.
$t_{s,bus_{up}}$  Mean upper segment bus access time at segment s.
$t_{s,bus_{lo}}$  Mean lower segment bus access time at segment s.
$t_{s,bus_{up},access}$  Expected time interval between consecutive bus service completions in
the upper bus group of segment s.
$t_{s,bus_{lo},access}$  Expected time interval between consecutive bus service completions in
the lower bus group of segment s.
$w_{d|s,m_{in}}$  Mean time that a class s request packet experiences in the input queue of
a target memory module in segment d.
[symbol illegible]  Delay experienced by the header flit of a class s packet at the head of a
memory output queue in segment d until a segment bus is allocated to it.
$q_{d|s,m_{in}}$  Mean memory input queue length that is observed by a class s request packet
when it arrives at a memory input queue in segment d.
$q_{d|s,m_{out}}$  Mean memory output queue length that is observed by a class s reply packet
when it arrives at a memory output queue in segment d.
[symbol illegible]  Probability that a memory module is busy serving a request when a class s
packet arrives at the memory module in segment d.
$w_{d|s,m_{out}}$  Mean queueing delay until the header flit of a class s reply packet arrives
at the head of a memory output queue in segment d.
$r_{d|s,m_{out}}(1)$  Residence time of the header flit of a class s reply packet at the head of
a memory output queue in segment d.
$\gamma_{d|s,m_{out}}$  Mean residual residence time of the tail flit in service that a class s reply
packet observes when the header flit arrives at the first segment switch in the
return path from d to s.
$\kappa_{s,p_{out}}$  Integer random variable that represents the number of occupied queue heads
observed by a class s request packet when it arrives at the head of a processor
output queue.
$r[\kappa_{s,p_{out}}]$  Mean time until a class s packet can get access to a segment bus
while $\kappa_{s,p_{out}}$ other queue heads are occupied.
$\eta_{s,p}$  Number of occupied processor output queue heads observed by a request that
has just arrived at the head of a processor output queue in segment s.
$\eta_{s,m}$  Number of occupied memory output queue heads observed by a request that
has just arrived at the head of a processor output queue in segment s.
$\kappa_{d,m_{out}}$  Integer random variable that represents the number of occupied queue heads
observed by a reply packet when it arrives at the head of a memory output
queue in segment d.
$r[\kappa_{d,m_{out}}]$  Mean time until a reply packet can get access to a segment bus in segment
d while $\kappa_{d,m_{out}}$ other queue heads are occupied.
$\theta_{d,p}$  Number of occupied processor output queue heads observed by a reply packet
when it arrives at the head of the memory output queue in segment d.
$\theta_{d,m}$  Number of occupied memory output queue heads observed by a reply packet
when it arrives at the head of a memory output queue in segment d.
$\gamma_m$  Mean residual lifetime of a memory request in service at a memory module.
$\alpha_{i,bus}$  Probability that a segment bus in segment i is busy serving a transient packet
at the beginning of the second arbitration phase.
$q_{i,sw}$  Mean queue length seen by an arriving header flit at an infinite flit buffer queue
in segment i.
$P$ [subscript illegible]  Probability that a header flit observes segment bus i busy serving a
request from a local processor or a local memory module.
$\phi_s$  Number of requests that participate in bus assignment of the second arbitration
phase at segment s.
$\phi_{s,p}$  Number of processors participating in bus assignment of the second arbitration
phase at segment s.
$\phi_{s,m}$  Number of memory modules participating in bus assignment of the second
arbitration phase at segment s.
Vita
Jungjoon Kim was born in Taegu, Korea. He received the Bachelor of Science degree
in electrical engineering from Kyungpook National University, Korea, in 1981. After
receiving the Master of Science degree in electrical engineering from Korea Advanced
Institute of Science and Technology in 1983, he worked as an instructor at
Kyungpook National University from 1983 to 1984. In November 1984, he
joined Korea Telecom Research Laboratories, where he has worked in the Electronic
Switching System Division. He is currently on study leave to pursue the Doctor
of Philosophy degree at Louisiana State University.




DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: Jungjoon Kim
Major Field: Electrical Engineering
Title of Dissertation: Performance and Analysis of Segmented Multiple
Bus Systems
Approved:
EXAMINING COMMITTEE:
Date of Examination:
June 18, 1997
