1. Introduction
Packaging is a very important issue in the design of large multiprocessor systems, which may contain hundreds or even thousands of powerful processors and memories. Using today's VLSI technologies, such systems require innovative packaging approaches that may include multi-chip modules and multi-layer boards, and may possibly span over several chassis and cabinets. These packaging technologies impose an inherent hierarchical structure on the system; this hierarchy often influences the design of the machine organization (see, for example, [AsJM89] ).
For example, several processors may be laid out on a single chip, or placed on a multi-chip module to form a cluster; several of such clusters may reside on a multi-layer board, and several of such boards in a chassis. Signal latency across the physical boundaries defined by these packaging technologies can vary substantially. In general, signals running within the boundary of a package, either on a chip, on a multi-chip module, or on a board, will suffer less latency than signals that have to run across the boundary to another package. Under these circumstances, it is often necessary to build a memory hierarchy that reflects such latency variations. Each processor may have a fast local memory in the form of a cache or a register set.
Each cluster of processors may have a cluster memory shared by processors within the cluster, and so on. The number of pins provided in each packaging technology is quite different. Signal lines running between different packages tend to be affected more by pin limitations on package boundaries, than by wire placement between packages.
From a software perspective, large parallel application programs running on such systems require programmers and compilers to exploit several levels of granularity for more parallelism and better system utilization. These granularity levels, according to the terminology used by Cray, may include macro-tasks, micro-tasks, vector operations, etc. Each of these granularity levels has a different degree of parallelism and program locality. Hence, an inherent hierarchical structure is also imposed on the system architecture. It is well known that the communication and memory access patterns within a macro-task are very different from those between macro-tasks. In general, communication and memory accesses are more frequent within a macro-task than between macro-tasks. Hence, shorter latencies within a task are necessary for better performance. This requirement matches very well with the physical structure imposed by the packaging of the system. A macro-task which contains several micro-tasks can be assigned to a cluster of processors, where each micro-task can, in turn, be assigned to a single processor.
Communication within a cluster will stay within the boundary of a package, and latency can be minimized.
From these considerations, it is quite natural to conclude that we should try to exploit hierarchical structures imposed by both packaging technologies and parallel program characteristics. However, the experience with parallel applications has shown that reorganizing a parallel program to exploit just two levels of architectural hierarchy is a non-trivial problem.
Software technology will probably limit the number of levels of system hierarchy to a very small number, most likely two, in the foreseeable future. Several machines that have been built with a hierarchical structure to investigate parallel applications include Cedar [GJMW89], Dash [LLGW92], and Cenju [NTKM91].
Another important issue in designing such systems is scalability. To allow program compatibility and portability, the major characteristics of a system architecture should remain unchanged as we scale the system up, from a few processors to hundreds or thousands of processors. A hierarchical structure allows us to use different interconnect schemes at different levels of hierarchy; we can choose different schemes for specific levels, based on how cost-effective they are under those technological and system constraints. For example, there are relatively few processors within a cluster, and scalability is less of an issue than the provision of fast local communications. Hence, fast buses or even crossbar switches can be used within a cluster. On the other hand, scalability is more important than communication latency when we connect together a large number of clusters, spread over a greater physical distance, such as in different cabinets. Hence, more cost-effective interconnect schemes, such as hypercubes, meshes, or shuffle-exchange networks, have to be used.
A variety of hierarchical configurations have been studied. For example, clusters of processors using shared buses [WuLi81] , clusters of processors with crossbar switches [AgMa85] , local meshes of processors, with a global mesh connecting the clusters [Carl85] , two-level systems based on hypercubes and other network topologies [DaEa90] [DaEa91] [Aboe91], combinations of Omega networks and composite cube networks [Padm91a] , and systems with buses within a cluster and a global hypercube network [ChHo91] [HsYe91].
However, most of these studies are limited to discussions of traffic patterns with high locality. Very few of them address physical packaging constraints. On the other hand, most recent work on packaging issues in multiprocessor interconnects, such as [Dall90], [AbPa90], and [Agar91], addresses only non-hierarchical systems.
In this paper, we study and compare the design of interconnects for hierarchical multiprocessors with packaging constraints. We will limit our study to two-level hierarchical systems. In section 2, we give a short overview of previous work, and a description of the system configurations in our study. In section 3, we survey several options for the internal organization of a cluster. In section 4, we examine the clustering of meshes, hypercubes, and GSEs. We study their maximum bandwidth estimates in section 5, and their queueing behavior in section 6. Conclusions are drawn in section 7.
Preliminaries
Earlier work on packaging issues in multiprocessor interconnects has mainly used node pinout and bisection width as bases for cost comparisons. We will refer to channel width as the number of bits in a network channel, and hence the number of bits in a message nibble. Channel degree is the number of channels at a node.
The bisection of a graph is a partition of its nodes into two equal halves. Bisection width [Thom79] is the minimum number of wires crossing a bisection, over all possible bisections; it has often been used as an estimate of wiring on a two-dimensional area within the boundary of a package. In [Dall90] , k-ary n-cubes with constant bisection widths were compared. In [AbPa90] , both switch pinout and bisection width were used as cost factors in analytical studies of k-ary n-cubes. In both studies, different network topologies were constrained to have the same wiring costs or pinouts by varying the widths of their channels. The fundamental considerations there are to decide whether we should have a large number of narrow channels, or a small number of wide channels, and how these channels should be organized. The relationships between channel fanout and channel width are determined by topology and wiring constraints. All those studies assume a flat (i.e., non-hierarchical) structure.
Bisection width is a lower bound and is independent of processor placement. It is more suited for two-dimensional layouts at the next highest packaging level, for example, for wires between the nodes on the surface of a chip, or wires between chips on the surface of a multi-chip module. For large systems, where signals need to run across the boundary of packages, pinout is a more severe constraint. However, in this paper, we will use both pinout and bisection width as the determinants of the system cost. Other constraints have also been studied in the literature.
For example, [Agar91] examined fixed channel width, bisection width and node pinout, and took into account both switch delay and wire delay.
In our analytical models, we ignore the effect of wire length and time-of-flight delays. The technologies used in Thinking Machines' CM-5 and IBM's Vulcan, for example, allow several bits to be pipelined through a wire. Hence, the time-of-flight along the longest wire does not have to limit the network cycle time.
There have been quite a few interconnect topologies proposed in the past. Here, we only consider major topologies which have been used most frequently in existing parallel machines, namely k-ary n-cubes such as meshes and hypercubes, and shuffle-exchange networks.
However, shuffle-exchange networks such as Omega networks [Lawr75] are primarily multistage networks. A processor has to go through a series of dedicated intermediate switching elements, arranged in stages, to communicate with another processor or a memory module. The traffic patterns and switch designs of multistage networks are thus different from those of a mesh or a hypercube, where each processor is directly connected to its adjacent processors or memory modules. To allow more equitable comparisons among these topologies, we use a generalized shuffle-exchange network (GSE) [HsYZ87] instead. A GSE network is constructed by attaching a processor node to each switching element (as in a mesh or a hypercube). All switching elements are then connected with a shuffle connection.
To illustrate our discussions, assume we have a system with 1024 processors. We choose a layout plan that will place 64 processors on a board or a multi-chip module. Hence, there will be 16 of such boards, or modules, in our system. In a traditional implementation of hypercubes or meshes, each processor on a board will have its own channels and switch connections to processors on the same board, and to processors on other boards. Even though the packaging forces processors in the same package to form clusters, the underlying topology is still a uniform mesh or hypercube, disregarding physical packaging boundaries. We call such networks flat networks; they may be flat meshes (FMs), or flat hypercubes (FH), etc.
Alternatively, we can organize the system hierarchically. We allow processors on a board or a multi-chip module to form a cluster. The interconnecting scheme within a cluster can obviously be different from the scheme used to connect different clusters. The global network may be a mesh, hypercube, etc. We will call these clustered networks; they could be clustered meshes (CMs), or clustered hypercubes (CHs), etc. Similarly, a generalized shuffle-exchange network can also be arranged as a flat system or a clustered system.
Assume that the number of pins available for a cluster is constrained by a particular packaging technology and is fixed. The major difference between a flat system and a clustered system with the same global network topology thus lies in the way in which the pins are used by different organizations across packaging boundaries. The focus of this paper is to study which of these approaches can utilize the pins more effectively, across different network topologies.
Internal Organization of a Cluster
There are many choices for the internal organization of a cluster. They can roughly be divided into three approaches: (1) shared global channels with equitable access; (2) shared global channels with an interface processor; and (3) partitioned global channels.
(I) Shared global channels with equitable access.
As shown in Fig. 1 (A) , processors in each cluster share a global crossbar switch. There is only one single shared point within a cluster, from which all global channels are accessed. In Figure 1 (A), there are 8 processors in a cluster which are connected via k=2 buses to a global crossbar switch. The number of buses can be chosen by the system architect, depending on the desired performance and the topology of the global network. Notice that, instead of using k buses, we can also have each processor connect directly to the crossbar, or use a separate crossbar switch for local traffic, to increase network bandwidth. But it will come with an increased cost. Such extra bandwidth in the local buses, beyond what is provided by the k buses, will not yield substantial improvement in network delay, except for applications with heavy and sustained local traffic [HsYe91] . See Section 5.3 for further discussions on local bandwidth requirements. [ChHo91] assumes an approach that is similar to shared global channels with equitable access. A possible expensive variation is the extended hypercube of [KuPa92] , which gives each processor its own connection to the shared global switch, in addition to a local network.
(II) Shared global channels with an interface processor.
[Aboe91], [Carl85] , [DaEa90] , and [DaEa91] suggest a variant of the shared global channels approach, in which only a single interface processor can access the global switch in each cluster. All other processors are connected to the interface processor via a local network as shown in Figure 1 (B). This approach requires the interface processor to have a larger fanout than the other processors in the cluster. A major problem with this approach is observed in [DaEa90] for traffic that does not have an extremely high degree of locality. The local channels attached to the interface processor are carrying all of the global traffic to and from the cluster. If we assume that all channels have the same bandwidth, these channels become over-utilized and saturated at low loads, while the channels further away from the interface processor are underutilized. The equitable access approach can avoid this problem, and provides a more balanced organization.
(III) Partitioned global channels.
Because of the problems in approaches (II) and (III), we will focus on the shared global channels, equitable access approach. For example, consider an NCUBE-like system [HaMS86]: a 1024-processor FH has a 64-processor subcube and 512 pins per board. Each processor has a 10x10 switch. Four of its inputs and four of its outputs go off-board. Hence, there are 64*4*2=512 channels. Each channel can only be one bit wide. We can also organize the system hierarchically as a global hypercube with 16 supernodes, with 64 processors in each supernode.
Each board/supernode needs 4*2=8 inter-cluster channels. With 512 pins per board, each channel can be 512/8 = 64 bits wide. We implied constant board pinout, but note that an FH and a CH (and any flat k-ary n-cube and its clustered analogue) have the same board pinout and bisection width. This is because only the organization of the channels changes, and not the network topology.
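The arithmetic of this example can be checked with a short sketch. The function and constant names below are illustrative assumptions, not taken from the paper; the only inputs are the figures quoted above (512 pins per board, 64 processors per board, 4 off-board dimensions).

```python
def channel_width(pins_per_board: int, offboard_channels: int) -> int:
    """Bits per channel when the board pinout is split evenly over all off-board channels."""
    return pins_per_board // offboard_channels


PINS = 512            # board pinout W, from the example
PROCS_PER_BOARD = 64  # N0, processors per board
GLOBAL_DIMS = 4       # hypercube dimensions that cross the board boundary (10 - 6)

# Flat hypercube: every processor owns 4 outgoing and 4 incoming off-board channels.
flat_channels = PROCS_PER_BOARD * GLOBAL_DIMS * 2
# Clustered hypercube: the whole board shares 4 outgoing and 4 incoming channels.
clustered_channels = GLOBAL_DIMS * 2

flat_width = channel_width(PINS, flat_channels)
clustered_width = channel_width(PINS, clustered_channels)

print(flat_channels, flat_width)            # 512 1
print(clustered_channels, clustered_width)  # 8 64
```

The same pin budget therefore buys either 512 one-bit channels or 8 channels of 64 bits each.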
[HsYe92] applied this approach to meshes with torus-like wraparound channels. Consider 
System Configurations
Let $N_1$ be the system size, with $N_1 = 2^{n_1}$. Let $N_0$ be the number of processors in a cluster, with $N_0 = 2^{n_0}$. Let $N = N_1/N_0$ be the number of clusters in the system, with $N = 2^n$. To simplify our discussion, we assume from now on that a cluster is placed on a board unless specified otherwise. Let $W$ be the board pinout, $L$ the message length (in bits), $C$ the channel width (in bits), $t_m$ the total number of nibbles (or flits) in a message, and $t_h$ the number of nibbles in the message header, assuming that the header has $n_1$ bits. $p_{n_1-1} \ldots p_0$ is the binary expansion of the node address in a system with $N_1$ processors.
Hypercubes and Meshes
First consider the constant board pinout case. In a flat system, assume that $B_{flat}$ channels route outside the board (and $B_{flat}$ channels enter from outside). For example, for the FM,

Now consider the bisection widths of k-ary n-cubes. We can adapt the terms for bisection width from [Dall90], making appropriate adjustments for bidirectional channels. For a hypercube with $N_1$ processors and a channel width $C$, the bisection width is $C N_1$. For bidirectional meshes, it is $4C\sqrt{N_1}$. We set the network channel widths so that all networks have the same bisection width for the same system size.
In a clustered system, each cluster of $N_0$ processors is connected to the global crossbar switch by a system of $k$ ($k < N_0$) cluster buses. This model allows us to extend our results to the cases where all of the processors are directly connected to the global crossbar switch (i.e., when the number of cluster buses is $k = N_0$). Figure 2 shows several system configurations for flat and clustered hypercubes and meshes. Processors connected to the same cluster bus form a subcluster.
We refer to the channels between different boards as global channels. A bus that routes traffic from the cluster to the global switch is called a cluster-to-global bus, and a bus that routes traffic from the global switch to the cluster is called a global-to-cluster bus. For flat systems, there are local channels that connect processors on the same board. For ease of implementation and analysis, we assume all local buses and local channels are identical to global channels in capacity. Note that this implies buses in different clustered topologies may have different widths. It is possible to build switches with channels of different bandwidths. Fast local buses obviously will yield better performance for traffic with more locality within a cluster. The effect of the speed in local buses on overall system performance is beyond the scope of this paper.
Because of the symmetry within the cluster, it is not important exactly which processors are grouped into a subcluster. Each subcluster of $N_0/k$ processors has its own cluster-to-global and global-to-cluster buses, and each bus has its own port to the global switch. A cluster-to-global bus may be viewed as an $(N_0/k) \times 1$ multiplexor switch, and is modeled as a single queue with $N_0/k$ sources. Similarly, a global-to-cluster bus is viewed as a $1 \times (N_0/k)$ demultiplexor switch, and modeled as a single queue with $k+B$ sources (since it is fed by $k$ local buses and $B$ global channels) and $N_0/k$ destinations. We will use the terms ''bus'' and ''queue'' interchangeably for the cluster/global connections. In our analysis, we do not allow cut-through in the buses, and count all queueing delays resulting from bus contention, both from the subclusters to the global switch, and vice versa.
Generalized shuffle-exchange networks
In the traditional shuffle-exchange graph, node $p_{n_1-1} \ldots p_0$ is connected to nodes $p_{n_1-2} \ldots p_0 p_{n_1-1}$ and $p_{n_1-2} \ldots p_0 \overline{p_{n_1-1}}$. Instead of shifting the address bits by one position and connecting one node to two other nodes, we implement larger switches for each node, shift the address bits by $b$ positions, and connect each node to $B$ other nodes ($B = 2^b$). Hence, in the GSE, each node has a $B \times B$ crossbar switch, and node $p_{n_1-1} \ldots p_0$ is connected to node

A simple routing algorithm for the GSE was proposed in [HsYZ87]. Consider the case where $\log N_1$ is divisible by $\log B$. At each hop through the network, a message is forwarded to the node whose address corresponds to the appropriate $\log B$ bits of the destination address of the

Each hop through the network rotates the address bits and changes 2 bits of the intermediate node address. A slight modification suffices for the case where $\log N_1$ is not divisible by $\log B$ [HsYZ87]. In such cases, multiple paths exist. We will not consider fast-finishing routing algorithms in this paper, because they complicate analysis; for details see [YePL82].
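The following is a minimal sketch of the connection and routing rules described above, under stated assumptions. It assumes that shifting by $b$ positions drops the $b$ high-order address bits and that the $b$ low-order bits may take any of the $B = 2^b$ values, and that routing shifts in the destination address $b$ bits at a time, most significant group first. The function names `gse_neighbors` and `gse_route` are hypothetical; the exact formulation in [HsYZ87] may differ.

```python
def gse_neighbors(node: int, n1: int, b: int) -> list[int]:
    """Nodes reachable in one hop: shift the address left by b bits, then
    fill the b low-order bits with every possible value (assumed rule)."""
    mask = (1 << n1) - 1
    shifted = (node << b) & mask
    return [shifted | low for low in range(1 << b)]


def gse_route(src: int, dst: int, n1: int, b: int) -> list[int]:
    """Path from src to dst when n1 is divisible by b (assumed routing rule):
    each hop shifts in the next b bits of dst, most significant group first."""
    if n1 % b != 0:
        raise ValueError("this sketch only covers the case where b divides n1")
    mask = (1 << n1) - 1
    path = [src]
    node = src
    for hop in range(n1 // b):
        shift = n1 - b * (hop + 1)
        digit = (dst >> shift) & ((1 << b) - 1)
        node = ((node << b) & mask) | digit
        path.append(node)
    return path


# 16-node GSE (n1 = 4) with 4x4 switches (b = 2): two hops reach any node.
route = gse_route(0b1011, 0b0110, n1=4, b=2)
print([format(x, "04b") for x in route])  # ['1011', '1101', '0110']
```

With $b = 1$ the neighbor rule reduces to the two shuffle-exchange connections given above.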
A flat GSE (FGSE) with fanout $B$ is simply a GSE partitioned across $N$ boards, with $N_0$ processors per board. All channels go outside the board. Under the constant pinout constraint, where $W$ is the pinout of the board, the channel width is $C = W/(2N_0B)$, and $t_m = L/C$.
A clustered GSE (CGSE) with fanout $B$ would be a global network of $N$ supernodes, with $N_0$ processors in each supernode. We can keep the same internal cluster organization as the systems in Section 4.1. The number of channels going outside a board is $2B$. The channel width is $C = W/(2B)$. The bisection width of an $N_1$-node GSE with $B \times B$ switches and $C$-bit channels is $5CN_1B/16$ [Hsu92].
Figure 3 shows an example configuration for flat and clustered GSEs. We can make some simple observations about the effect of varying the fanout $B$. A high fanout would imply a short message distance. The channel width, however, would be smaller, resulting in more nibbles in a message. We will examine these ramifications in more detail in section 5.1.
For all clustered systems, the relative numbering of processors within a cluster is completely arbitrary because of symmetry. In Figures 2 and 3 , we have chosen processor numbers that correspond to analogous ''strips'' of processors in flat analogues of the respective clustered systems. Alternatively, we can assign to each processor a cluster number and an address for its identification within the cluster.
Saturation performance estimates
We can use either message switching or partial cut-through to route messages through networks. In message switching for multiple-nibble messages, an entire message has to arrive at the output port of a switch before it can be forwarded. Partial cut-through is similar to wormhole routing in [Dall90] . It allows a message with enough routing information to ''cut-through'' to the next switch once the outgoing channel is free, even if part of the message has not arrived [AbPa90] . We will focus on partial cut-through because of its apparent advantages over message switching.
In hypercubes and meshes, there are two possible routing strategies. In random routing, a message at a switch takes at random any one of the possible output channels that are on its show that the relative performance will not change much if only a small number of buffers are available per switch [Dall90]. In the region close to maximum network capacity, message delays become extremely unstable. Small increases in system load can cause huge increases in average delays. Hence, operating in this region should be avoided.
In the uniform traffic model, each processor generates a message per cycle with probability $m_g$, and the message is directed to each of the other processors with equal probability. To model locality of traffic, let $\alpha$ be the probability that a generated message has its destination within the same cluster. For example, assume we have $m_g = 0.1$ and $\alpha = 0.4$ in a 1024-processor system with 16-processor clusters. A processor will generate a local message with probability 0.04, and a global message with probability 0.06. If we set $\alpha = (N_0 - 1)/(N_1 - 1)$, we get uniform traffic.
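As a small illustration of this traffic model, the sketch below splits the generation rate into local and global components and computes the value of $\alpha$ that corresponds to uniform traffic. The function names are hypothetical and chosen only for this example.

```python
def message_rates(m_g: float, alpha: float) -> tuple[float, float]:
    """Per-cycle probabilities of generating a local and a global message."""
    return m_g * alpha, m_g * (1.0 - alpha)


def uniform_alpha(n0: int, n1: int) -> float:
    """Locality that reproduces uniform traffic: of the n1 - 1 possible
    destinations, n0 - 1 lie in the sender's own cluster."""
    return (n0 - 1) / (n1 - 1)


local_rate, global_rate = message_rates(m_g=0.1, alpha=0.4)
print(round(local_rate, 3), round(global_rate, 3))  # 0.04 0.06

# 1024 processors in 16-processor clusters: 15 of 1023 destinations are local.
alpha_uniform = uniform_alpha(n0=16, n1=1024)
```

For the 1024-processor example, uniform traffic corresponds to a locality of 15/1023, so almost all uniform traffic is global.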
This model of locality has been widely used. It is obviously biased towards clustered systems. However, as will be seen later, clustered systems have very good performance even in uniform traffic. Hence, an accurate modeling of the locality of traffic will not be considered here.
An important performance measure of a network is its maximum capacity or saturation load, $m_{g,sat}$. It is the network load at which the average message delay in the global channels becomes unbounded. The average queue length at a switch also becomes unbounded, and the utilization of some network channel reaches 100%.
Let $m$ be the traffic rate (in nibbles per cycle) on a global channel. For the flat systems, $m$ is also the traffic rate on a local channel. Let $P_t$ be the probability of termination, i.e., the probability that a message received on a global channel will terminate at that node. $P_t$ is generally approximated by the reciprocal of the average message distance.
Assume each processor in a flat system is attached to a $B \times B$ switch. Under the uniform traffic model, all channels have the same traffic rate. This is true for all the networks under consideration, regardless of routing strategy. Generalizing from the analysis in [AbPa90], we have $m = \frac{m_g t_m}{B P_t}$. This equation was originally derived for networks with random routing, but it also holds for LR routing (see [Hsu92]). Saturation occurs when $m = 1$, so $m_{g,sat} = \frac{B P_t}{t_m}$.
For a clustered system, let $B$ be the number of global channels going out of the cluster. The total traffic leaving the cluster is $N_0 m_g t_m \frac{N_1 - N_0}{N_1 - 1}$ for uniform traffic. Hence, $m = \frac{N_0 m_g t_m}{B P_t} \cdot \frac{N_1 - N_0}{N_1 - 1}$, and $m_{g,sat} = \frac{N_1 - 1}{N_1 - N_0} \cdot \frac{B P_t}{N_0 t_m}$. Here $m_{g,sat}$ is the rate at which the global channels of a clustered system saturate. It is possible that the cluster buses will saturate at lower traffic rates. We consider the saturation of the cluster buses in Section 5.3. The GSE is more complicated because of the fanout $B$. The average message distance in a GSE with $N_1$ nodes is $\lceil \log N_1/\log B \rceil$. The fanout can be determined by the system architect. A high fanout generally means that a message will need fewer hops to get to its destination. The saturation load of GSEs is also a function of fanout.
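The two saturation expressions can be evaluated side by side with a short sketch. It assumes the expressions $m_{g,sat} = B P_t / t_m$ for flat systems and $m_{g,sat} = \frac{N_1-1}{N_1-N_0} \cdot \frac{B P_t}{N_0 t_m}$ for clustered systems, and it further assumes that $P_t$ for a $d$-dimensional hypercube is approximated by $2/d$ (average distance of about $d/2$). The 128-bit message length, the function names, and the use of the 512-pin NCUBE-like configuration are illustrative choices.

```python
def sat_load_flat(fanout: int, p_term: float, t_m: int) -> float:
    """Saturation load of a flat network: m_g,sat = B * P_t / t_m."""
    return fanout * p_term / t_m


def sat_load_clustered(fanout: int, p_term: float, t_m: int, n0: int, n1: int) -> float:
    """Saturation load of the global channels of a clustered network under
    uniform traffic: ((N1 - 1) / (N1 - N0)) * B * P_t / (N0 * t_m)."""
    uniform_correction = (n1 - 1) / (n1 - n0)
    return uniform_correction * fanout * p_term / (n0 * t_m)


L_BITS = 128  # assumed message length, divisible by both channel widths

# Flat 1024-node hypercube, 512 pins per 64-processor board: 1-bit channels.
fh = sat_load_flat(fanout=10, p_term=2 / 10, t_m=L_BITS // 1)

# Clustered hypercube of 16 supernodes on the same boards: 64-bit channels.
ch = sat_load_clustered(fanout=4, p_term=2 / 4, t_m=L_BITS // 64, n0=64, n1=1024)

print(f"FH saturates near m_g = {fh:.4f}, CH near m_g = {ch:.4f}")
```

Under these assumptions the two loads differ only by the factor $(N_1-1)/(N_1-N_0)$, because the wider clustered channels compensate for the smaller number of global channels.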
Saturation of networks with constant pinouts
Consider the saturation load of the FGSE, $m_{g,sat} = \frac{W}{2 N_0 L \lceil \log N_1/\log B \rceil}$. To get a saturation performance similar to the FH, $\lceil \log N_1/\log B \rceil$ has to be small, and the fanout $B$ has to be large.
However, a large fanout implies that the channel width $C$ will be extremely restricted. For example, consider a system with $N_1 = 1024$ and $N_0 = 16$. For the FGSE to have a saturation load no worse than an FH or CH, the FGSE must have $\lceil \log N_1/\log B \rceil = 3$, implying $B = 16$. This requires a total of $2 N_0 B = 512$ channels, and the number of nibbles in a message, $t_m$, needs to be very large. For a smaller fanout on a CH, we can get the same saturation load with a smaller $t_m$ because of a wider channel. Hence, the FGSE is not suitable for systems under pinout constraints.
The saturation load of CGSEs also depends on the choice of the switch fanout $B$. Consider the $1/\lceil \log N/\log B \rceil$ term in the saturation load formula for the CGSE, from Table 1. Increasing $B$ would decrease the average message distance, and increase the saturation load. However, we cannot make $B$ too large, because it is difficult to build fast crossbars with high fanouts. If we want a message to take only two hops through the global network of the CGSE, we must make $B = \sqrt{N}$. This is still plausible for systems with hundreds of clusters. However, the next step up that will make a difference in average message distance is $B = N-1$ (and not $B = N$, since a cluster for a cluster size of 64), the hypercubes and meshes have the highest saturation loads, followed by the GSEs. For systems with more than 16 clusters, the CGSE has the best saturation performance, followed by the FGSE and hypercubes, and finally the meshes. The degradation in saturation load for meshes is fairly severe for large systems.
Recall that we removed the floor and ceiling functions for simplicity. This ignores a possibly significant artifact: the message length not being always divisible by the channel width.
For example, the FH in Figure 6 has 2-bit channels, and the CH 32-bit channels. Suppose we have 138-bit messages to be routed on 2-bit or 32-bit channels. For the 2-bit channels, $t_m = \lceil 138/2 \rceil = 69$; for the 32-bit channels, $t_m = \lceil 138/32 \rceil = 5$. Because of the wasted space in the 32-bit channels, the ratio of the $t_m$ values is not equal to the ratio of the channel widths. However, the terms for saturation load are functions of $t_m$. The FH saturates at approximately $m_g = 0.029$, the CH at $m_g = 0.025$; if we had chosen a message length that is divisible by both 2 and 32, the two systems would have identical saturation loads. We call this the channel efficiency effect. It tends to be insignificant when messages are very long, or when message lengths are always divisible by the channel widths. Hence, in a carefully designed system where all these parameters are taken into account, channel efficiency may not be a factor.
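The channel efficiency effect in this example can be reproduced with a brief sketch. It assumes a system with $N_1 = 1024$ and $N_0 = 16$ (the configuration of Figure 6 is not restated in the text, so these values are an assumption), and it assumes that $B P_t = 2$ for both the flat and the global hypercube, which follows from approximating $P_t$ by 2 divided by the hypercube dimension. The names are illustrative.

```python
def nibbles(message_bits: int, channel_bits: int) -> int:
    """Number of nibbles per message: ceiling of message length over channel width."""
    return -(-message_bits // channel_bits)


N1, N0 = 1024, 16    # assumed system and cluster sizes
MESSAGE_BITS = 138   # message length from the example
HYPERCUBE_B_PT = 2   # assumed B * P_t for a hypercube (P_t ~ 2 / dimension)

t_flat = nibbles(MESSAGE_BITS, 2)        # 2-bit channels in the FH
t_clustered = nibbles(MESSAGE_BITS, 32)  # 32-bit channels in the CH

width_ratio = 32 / 2
nibble_ratio = t_flat / t_clustered      # smaller than width_ratio: padding is wasted

sat_fh = HYPERCUBE_B_PT / t_flat
sat_ch = (N1 - 1) / (N1 - N0) * HYPERCUBE_B_PT / (N0 * t_clustered)

print(t_flat, t_clustered)                 # 69 5
print(round(sat_fh, 3), round(sat_ch, 3))  # 0.029 0.025
```

The last nibble on the 32-bit channel carries only 10 useful bits out of 32, which is where the gap between the two saturation loads comes from.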
Saturation of networks with constant bisection widths
As in [Dall90], we can use bisection width as a measure of wiring area to compare flat and clustered systems. To obtain systems with a constant bisection width, we normalize all the networks by varying their channel widths so that they have the same bisection width as an FH. For $B = N-1$, the CGSE will have better saturation performance than the FH. However, the fanout of the global crossbar becomes too large for a practical implementation.
Local bandwidth requirements
Consider a cluster of $N_0$ processors. The maximum traffic generated is $N_0 m_{g,sat} t_m$, which has to be distributed over $k$ buses in the cluster. If the buses are not to saturate before $m_{g,sat}$ is reached, we need $k \ge N_0 m_{g,sat} t_m$.
We will ignore channel efficiency effects for now. Notice that the number of buses required for each topology is independent of the wiring constraint used. For the CH, 2 buses are sufficient for uniform traffic, independent of the system configuration. For the CM, we have $k \ge 8/\sqrt{N}$. Hence, ignoring channel efficiency effects, 2 buses are sufficient for 16 or fewer clusters, and 1 bus is sufficient for systems with more clusters. For the CGSE, we have $k \ge B/\lceil \log N/\log B \rceil$. Hence, the CGSE requires more buses for the full utilization of its global channels. For example, for $N_1 = 1024$, $N_0 = 16$, $B = 8$, the CGSE needs $k \ge 4$ buses. Under uniform traffic, if we increase the number of buses beyond the minimum requirement, message delay will see only slight improvements, and only at very heavy traffic rates. This is because the major component of message delay is from global network delays, and not cluster bus delays. In the CGSE of Figure 6, four buses are sufficient to handle traffic rates sustainable by the global network. Increasing the number of buses to eight does not give significant improvement for most traffic rates.
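The bus-count guidelines above can be tabulated with a short sketch. It assumes the bounds $k \ge 8/\sqrt{N}$ for the CM and $k \ge B/\lceil \log N/\log B \rceil$ for the CGSE, rounds up to a whole number of buses, and, like the text, ignores channel efficiency effects and the uniform-traffic factor $(N_1-1)/(N_1-N_0)$. It also assumes that $N$ and $B$ are powers of two and that $N$ is a perfect square for the mesh. The function names are hypothetical.

```python
import math


def min_buses_cm(num_clusters: int) -> int:
    """Clustered mesh: k >= 8 / sqrt(N), rounded up to at least one bus."""
    side = math.isqrt(num_clusters)  # assumes N is a perfect square
    return max(1, -(-8 // side))


def min_buses_cgse(num_clusters: int, fanout: int) -> int:
    """Clustered GSE: k >= B / ceil(log N / log B), rounded up."""
    n = num_clusters.bit_length() - 1  # log2(N), assuming N is a power of two
    b = fanout.bit_length() - 1        # log2(B), assuming B is a power of two
    hops = -(-n // b)                  # ceil(log N / log B)
    return max(1, -(-fanout // hops))


# N1 = 1024 and N0 = 16 give N = 64 clusters.
print(min_buses_cm(16), min_buses_cm(64))  # 2 1
print(min_buses_cgse(64, 8))               # 4
```

Integer arithmetic is used throughout so that the rounding does not depend on floating-point behavior.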
Under the constant bisection width constraint, all systems have the same $m_{g,sat}$. Hence, while each system may require a different minimum number of buses, the total bus bandwidth in a cluster is constant. Under the constant pinout constraint, a system with a higher value of $m_{g,sat}$ requires a greater local bus bandwidth.
These guidelines ignore channel efficiency effects. If messages are short and transmissions are not efficient, the number of local buses prescribed above may be insufficient to handle saturation traffic.
As observed in [HsYe91], if local traffic is heavy, processors will be able to issue requests past the rate of $m_{g,sat}$ without saturating the global channels. In these situations, $k$ needs to be even higher depending on the intensity of the local traffic, since the system now saturates at approximately $m_{g,sat}/(1-\alpha)$. A crossbar switch may be needed to handle local traffic.
Queueing analysis and performance
Flat systems have been analyzed extensively
We will only summarize their results, and compare them with the analysis of the clustered systems in this section.
Analysis of clustered systems
Consider a cluster split into $k$ subclusters, each with its own bus to and from the global

through, we estimate the savings to be $t_m/P_t$ for global messages, since cluster-local messages do not route through the global network, and there is no cut-through mechanism on the local buses.
While this is a simpler approximation for cut-through than [AbPa90] , it has been widely used (see, for example, [Agar91]) with reasonable results.
A cluster-to-global bus is modeled as a single queue with N 0 /k inputs (from N 0 /k processors), a global-to-cluster bus as a single queue with B+k inputs (B from the global switch and k from the local buses). Let T cluster−to−global be the average delay encountered by a message on a cluster-to-global bus, and T global−to−cluster the average delay encountered on a global-to-cluster bus (derived in detail in [Hsu92] ). The total message delay is [Hsu92] T=(1+(1−α) mP t b )t m +(1+(1−α) P t 1 )t h +T cluster−to−global +T global−to−cluster −(1−α)t m /P t .
We simulated some of these networks to verify our analyses. The model used is fairly standard: the full system in each case is simulated, with each node issuing messages based on the traffic assumptions. Messages are forwarded through buffers in simulated switches. The delays predicted by our analysis match the numbers obtained through simulations (with small errors), up to channel utilizations of at least 80% [Hsu92] . The correction factor for partial cut-through is fairly accurate, and the inaccuracies in the total delay under cut-through mostly come from the message switching analysis, with the well-known problems of assuming independent arrivals etc.
Systems with constant pinout
We plotted the message delay characteristics of several example systems with different system sizes, cluster sizes and message lengths. Clustered systems have lower message delays than their flat counterparts. This is because of the wider channels in the clustered systems. Messages are divided into a few large nibbles, and can be routed in a few cycles. If the difference in saturation load is small (i.e., the channel efficiency effect is minor), clustered systems can give a major improvement over flat systems in routing performance.
Messages also take fewer hops in a clustered system than in its flat analogue, because of the reduced distance within a cluster, but this tends to be a relatively minor effect.
However, a flat k-ary n-cube has a slightly better saturation load than its clustered analogue. For 16 clusters, we also examined CGSEs with B=4 and B=8 (Figures 7 and 8). Going from a fanout of 4 to 8 does not improve the average message distance (it remains constant at 2), but extra paths are provided for message routing. Hence, saturation loads should be equal, except for channel efficiency effects. The CGSE for B=8 is less sensitive to channel efficiency effects, since it has narrower channels and more nibbles per message. This is clearly illustrated in Figures 7 and 8. For the relatively short message length in Figure 7, the CGSE
It is possible to offset the channel efficiency effect by sacrificing some queueing performance. For example, if we implement two CH clusters, of $N_0/2$ processors each, on the same board, the channels will be approximately half as wide as in the regular CH. However, local messages may have to take an extra hop, from one half-cluster to the other. Because of the narrower global channels, the message latency will be higher. However, the system will saturate at slightly higher values of $m_g$, because messages will be broken into twice as many nibbles as in the original clustered configuration.
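The trade-off between channel width and nibble count can be sketched numerically. The example below assumes that a message of a given length is split into nibbles of exactly one channel width, with the last nibble padded; the function name and the sample sizes are hypothetical.

```python
def nibbles_per_message(message_bits: int, channel_width_bits: int) -> int:
    """Number of nibbles needed to send one message over a channel.

    Assumes each nibble is one channel width wide and the final nibble
    is padded if the message length is not a multiple of the width.
    """
    if message_bits <= 0 or channel_width_bits <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-message_bits // channel_width_bits)


if __name__ == "__main__":
    message_bits = 128  # hypothetical message length
    full_width = 32     # hypothetical channel width of the regular cluster
    half_width = full_width // 2  # two half-clusters sharing the same board
    print(nibbles_per_message(message_bits, full_width))
    print(nibbles_per_message(message_bits, half_width))
```

Halving the channel width doubles the number of nibbles per message, which raises latency but makes the fixed per-message overhead a smaller share of each transfer.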
Systems with constant bisection width
Figures 10 and 11 show the performance of various systems with constant bisection width.
We normalize our systems to a FH. As observed in Section 5.2, the FH and CH have similar saturation loads. The FM has a slightly superior saturation load, and the CM is slightly inferior, because of channel efficiency effects. The CGSE is an inefficient network under this cost constraint.
Under the constant pinout constraint, the CGSE is the best network because of its superior saturation load. Under the constant bisection width constraint, the contenders are the CH and FM, with the CH winning for longer messages, and the FM for shorter messages. The CM has the lowest latency, but is extremely sensitive to channel efficiency effects. The relative performance of the systems examined is very similar under both local and uniform traffic.
Conclusions
We presented an analysis of clustered systems based on the hypercube, bidirectional mesh, and generalized shuffle-exchange. Meshes and hypercubes are widely used in existing multiprocessors. We also chose the GSE because this type of graph has optimal diameter for a given fanout. The relationship between fanout, diameter and performance under wiring constraints is more complicated. CGSEs with small fanouts (and large diameters) may have good message latencies at very low traffic, but they saturate at low loads. CGSEs with large fanouts (especially $N = B^2$) have good saturation loads that offset their longer latencies at low traffic. FGSEs turn out to be inefficient when wiring costs are considered.
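The link between fanout and diameter can be illustrated with a simple counting argument: with fanout B, at most B^h distinct nodes can be reached in exactly h hops, so at least the smallest h with B^h ≥ N hops are needed to cover N nodes. The sketch below assumes that the GSE attains this bound; the function name is hypothetical.

```python
def min_hops(num_nodes: int, fanout: int) -> int:
    """Smallest h such that fanout**h >= num_nodes.

    Used here as an assumed estimate of the GSE diameter for a given fanout.
    """
    if num_nodes < 1 or fanout < 2:
        raise ValueError("need num_nodes >= 1 and fanout >= 2")
    hops, reach = 0, 1
    # Each additional hop multiplies the number of reachable nodes by the fanout.
    while reach < num_nodes:
        reach *= fanout
        hops += 1
    return hops


if __name__ == "__main__":
    # 16 clusters with fanouts 4 and 8, as in the systems examined.
    print(min_hops(16, 4), min_hops(16, 8))
```

For 16 clusters, both B=4 and B=8 give two hops under this assumption, which agrees with the observation that raising the fanout from 4 to 8 leaves the message distance at 2 and only adds paths.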
We based our network comparisons on two cost constraints, bisection width and cluster pinout. Bisection width is an estimate of system wiring area, while cluster pinout is a measure of the number of system cables.
A flat network and its clustered analogue have the same bisection width, if pinouts are kept constant. The clustered network may have a slightly inferior saturation load because of channel efficiency effects. However, a clustered network generally has better message delays than its flat analogue. This is because processors use wide, shared global channels, instead of narrow, dedicated ones, and the channels are able to handle similar system loads. Messages are broken into a small number of large nibbles instead of many small nibbles, and routed in fewer cycles.
If we compare networks with constant bisection widths, we find that all k-ary n-cubes saturate at roughly the same load, if channel efficiency is ignored. The CGSE is less area-efficient than the k-ary n-cubes, and does not have satisfactory performance under the constant bisection width constraint. For long messages (and minor channel efficiency effects), the CH is the best network. For short messages, the FM is superior. If there are no channel efficiency effects at all, then the network with the smallest latency, i.e., the CM, is the best. The CM has very wide channels and tends to be sensitive to fairly minor channel efficiency effects.
If we compare networks with constant pinout, we have to consider the number of clusters. For k-ary n-cubes with 16 or fewer clusters, constant pinout also implies constant bisection width, and their performance curves resemble those of the systems with constant bisection width. CGSEs with $N = B^2$ also have similar saturation loads; their poor low-traffic performance makes them inferior to CHs. If there are more than 16 clusters, the CGSE with $N = B^2$ has the best saturation load of all the networks, followed by the FH and CH, and then the FM and CM. The superiority in saturation load of the CGSE makes up for poor low-traffic performance, but additional local buses may be necessary to support the extra bandwidth.
We also examined the effects of locality of traffic. The relative performance of the systems we studied does not seem to change with locality of traffic.
In large-scale system design, the packaging technology will determine the appropriate wiring cost constraint. Based on this cost constraint and information about system configuration and message granularity, we will be able to determine whether the flat mesh, clustered mesh, clustered hypercube or clustered GSE is the most suitable choice for a high-performance interconnection network.
