Abstract: Approximate analytical queuing network models for expected message packet delay in 2-level and 3-level hierarchical ring interconnection networks (INs) are developed. A major class of traffic carried by these INs consists of cache line transfers between processor caches and remote memory modules in shared-memory multiprocessors. Such traffic consists of short, fixed-length messages, which can be conveniently transported by the slotted-ring transmission technique studied here. The packet delay results derived from the models are shown to be quite accurate when checked against a simulation study. As well as facilitating analysis, the analytical models can be used to determine optimal sizes for the rings at different levels in the hierarchy, where optimality is in terms of minimizing average packet delay.
Hierarchical Ring Network Configuration and Performance Modeling
V. Carl Hamacher, Senior Member, IEEE, and Hong Jiang, Member, IEEE
Index Terms: Interconnection networks, hierarchical rings, slotted rings, shared-memory multiprocessors, queuing models, message-passing performance.
INTRODUCTION
A main hardware component in a multiprocessor system is the interconnection network (IN) that connects processors and memory modules. One such IN structure, hierarchical slotted rings, is an interesting base on which to build large-scale shared-memory multiprocessors. Hierarchical rings have received a great deal of attention recently, both in academia [14], [18], [15], [5], [9], [10], [12], [6] and in industry [17], [3], [4]. The salient features of this class of INs are: 1) the physical locality of hierarchical rings blends naturally with the computational locality of shared-memory multiprocessing [14], [9], 2) the hierarchical ring structure provides natural and efficient broadcasting and multicasting capabilities that are crucial for process coordination and cache coherence protocols [5], and 3) hierarchical rings have an inherent and unique capability of "diluting" the impact of hot-spot traffic [18], [9]. Nevertheless, a more popular choice for INs seems to be meshes. This, as noted in [14], may stem from the fact that mesh-connected systems are relatively easy to build using off-the-shelf routers and processors and have good scalability characteristics. While meshes have superior scaling characteristics relative to hierarchical rings, two comparative studies of hierarchical rings and meshes in the literature, one based on approximate modeling [6] and the other based on detailed execution-driven simulations [14], concluded that hierarchical rings can outperform meshes under some practical workloads. More specifically, [14] found that hierarchical rings perform significantly better than meshes for system sizes up to about 120 processor nodes if the workload exhibits moderate to high memory access locality. Even if there is no memory locality, [14] observed that hierarchical ring systems perform better than meshes for systems with large cache lines either if the system is small or if the global ring has double the normal bandwidth.
Exact analytical modeling of hierarchical slotted-ring networks is intractable because of the phenomenon of "clustering" of occupied slots in the ring, as observed in [13], [1]. As a result, analytical studies of such networks have been based on approximation techniques [13], [1], [18]. With the exception of [18], which analyzed 2-level structures, hierarchical ring structures have not been studied analytically so far despite the existence of analytical studies in the literature on single-level rings [13], [1]. In [13], buffering and queuing effects were not included at the input ports and contention for slot access was modeled only in the single-level ring case. Bhuyan et al. [1] extended the model in [13] to incorporate buffers at input ports and to consider a double-ring system where two unidirectional slotted rings were put in parallel. Zhang and Yan [18] analyzed a 2-level hierarchical ring system with emphasis on finding relative performances of a few cache coherence protocols and the impact of hot spot traffic. Thus, the models developed in [18] were geared toward specific coherence protocols under the hot spot traffic condition. Further, all models in [13], [1], [18] assumed a source removal packet transfer protocol. Two other recent performance studies on hierarchical ring networks were based entirely on simulations [9], [14].
In this paper, we use approximate analytical techniques to model the packet delay performance of 2-level and 3-level hierarchical ring networks that operate under a full range of applied load conditions and a destination removal protocol, as opposed to source removal. The destination removal protocol is more efficient in terms of network channel utilization and has been employed in recent research prototypes [16], [15]. The model is used to gain important insights into the optimal design of hierarchical ring systems. That is, for a given total node size and traffic environment, how should one determine the size of rings on different levels to minimize the expected packet delay? The effect of doubling the bandwidth of the global ring, after that ring is shown to be a traffic bottleneck, is also determined.
The paper is organized as follows: Section 2 presents a description of the hierarchical ring interconnection network model, including enough structural and operational detail for performance evaluation purposes. Section 3 develops packet delay models using queuing models to capture the effect of contention. The analytical models developed are validated through extensive simulations in Section 4. Section 5 addresses the issue of optimal configuration using the analytical models developed in Section 3. We also include the effect of doubling the bandwidth of the global ring. Finally, some concluding remarks and prospects for future work are made in Section 6. An earlier version of this paper, with a slightly different analytical model, appeared in [7].
HIERARCHICAL RING NETWORKS
The hierarchical slotted-ring IN studied here consists of unidirectional rings, as employed in [3], [4], [15], [16]. Processor node clusters are only connected to local rings, as shown in Fig. 1. Each segment, called a station, connects one cluster into the ring. The station switch, S, removes an incoming ring packet into its cluster interface if it is the destination or sends the packet on around the ring otherwise. This packet-handling protocol is the same as that used in destination-remove, slotted, Local Area Networks (LANs) [13]. The switch also introduces a pending transmit packet from its cluster interface into the downstream station as soon as it observes its own ring input side to be empty. Ring traffic is thus never blocked.
In the context of memory Read/Write messages in shared-memory multiprocessors, operations can be described briefly as follows: At the destination station, packets have priority on the cluster bus. If the target memory module is free to handle the request, it starts the operation (a Read or a Write) and immediately sends a positive acknowledgment message back to the source station, where the acknowledgment is removed by the source station switch. A negative acknowledgment is returned if the target memory module is busy and the Read/Write request message will need to be tried again later by the source. If the destination memory module is free, a Write operation requires a request and acknowledgment message. A Read operation requires three messages: one to send the Read request, an acknowledgment, and a later one from the destination station to return the requested data. These details are not actually needed for the network performance modeling done later, but they explain the use of the destination-remove protocol in the shared-memory application.
The bit width of a slot in the local ring is assumed to be enough to carry full information for a memory word Write message or a two-word reply message to a Read request. This wide-slot format is used in both [3] and [4], and we will refer to this slot quantity of information as a packet. Current cache line sizes in multiprocessor systems consist of multiple words [8], which will not fit into one slot. Therefore, cache line transfers would need to consist of multiple-packet messages. This presents no difficulty for the IN described here because a wide-slot packet is large enough to contain source and destination node addresses and can therefore move through the IN as an independently routed unit. Also, the order of packets from any one source is maintained as they reach their destination.
A local ring can be expanded to any desired number of segments because each station is a regenerative repeater in the electrical sense. However, from a performance standpoint, packet transfer delay will increase linearly with ring size, degrading performance. To alleviate the performance problem, a higher level ring can be added in the form of a global segmented ring that is used to interconnect local rings, as shown in Fig. 1. It operates much like a local ring with its source and destination stations being local ring interfaces instead of cluster interfaces. This structure can be extended to even higher levels. Packet blocking can occur at the crossover switch between two rings. For example, in a 2-level system, if a packet from a local ring needs to move up to the global ring at the same time that a continuing packet on the global ring arrives at the crossover switch, there is contention for the downstream link on the global ring, and only one packet can proceed. The other packet must be temporarily buffered in the crossover switch to ensure that packets are never lost in the network. Details will be given in Section 3.2.
CONTENTION (QUEUING) MODEL FOR PACKET DELAY
In [6], [12], we developed packet delay and throughput performance measures for hierarchical rings in the light traffic (no contention) situation. While contention-free models are easy to develop and useful for rough network comparison purposes, any detailed evaluation of a network must consider the contention that occurs. Further, only contention models can identify potential system performance bottlenecks. In this section, analytical models will be developed to capture the effect of contention under a full range of applied loads.
Packet Destination Distribution
Applications that run on shared-memory multiprocessors will have different patterns of message destination locality as the processor clusters (containing one or more processors) make memory Read/Write requests to remote memory modules. These patterns may range from situations where a cluster references mainly only a small number of other cluster memories (high locality) to situations where references are uniformly distributed over all other clusters (low/no locality). In the first case, clusters that reference each other often should be located on the same local ring.
Conversely, if such situations dominate, the size of the local ring in a hierarchical ring network can be chosen to best match the size of the typical locality sets. If applications tend to have uniform destination distributions, then, for a fixed total number of clusters, the various ring sizes can be chosen to minimize average packet delay. An example of this network design optimization is given in Section 5.
In the models to be developed, the following parameters reflect packet destination locality. In H2 (2-level systems), $\alpha$ is the probability that a packet is destined for a cluster on the same local ring, with $1 - \alpha$ being the probability that it will need to move over the global ring to a different local ring. In H3 (3-level systems), $\alpha_L$ is the probability of a "same local ring" destination, $\alpha_M$ is the probability that the packet is destined for another local ring attached to the same intermediate ring, and $\alpha_G = 1 - \alpha_L - \alpha_M$ is the probability that the packet must move all the way up through the global ring, eventually moving down through the hierarchy to a local ring on a different intermediate ring.
Queues in the Network
FIFO queues are associated with each local ring station interface and interring interface, as shown in Fig. 2 and Fig. 3, respectively. At a station interface, shown in Fig. 2, the packet at the head of the queue waits until an empty slot passes by or a full slot destined to the local station arrives and the packet is removed from the slot by the station, at which time the head packet is transmitted onto the slot. Thus, a slot is deemed empty if it 1) contains no valid packet or 2) contains a packet destined to the local station and will be removed by it. The transmitted packet will then travel to its destination station unblocked if the destination is on the local ring or to the interring interface otherwise. At the interring interface, shown in Fig. 3, the packet joins the FIFO queue for the higher level ring. Once at the head of the queue, the packet follows similar steps as in the case of a local station interface; that is, the packet accesses the first empty slot and moves around the ring to join the FIFO queue at another interring interface connecting down to the destination ring or up to a higher-level ring, depending on the destination. Ultimately, the packet is removed from the ring by the destination station. Thus, the packet delay, d (see Fig. 2), of a packet is the sum of
1. queuing delays at all FIFO queues on its entire path from source station to destination station,
2. slot access time at all interfaces on its path, that is, the time between when the packet reaches the head of a FIFO queue and when it gets an empty slot,
3. slot traverse time, the total time the packet spends moving through ring segment slots on its entire path, and
4. a final time step into the destination station bus buffer.
Part 3 of the packet delay is uniquely determined by the source and destination addresses and the network configuration, independent of traffic density and contention. Clearly, parts 1 and 2 of packet delay capture the effect of contention and, hence, are traffic density dependent. Unfortunately, it is extremely difficult to model the contention exactly, due to the dependence among full slots. This dependence, also known as "clustering of full slots," has been observed in [13], [1], where, as traffic intensifies, full slots tend to cluster together to form "trains" of slots, as opposed to full slots being uniformly and independently distributed on the rings. This dependence makes an exact analysis intractable [11]. A second factor that complicates the exact analysis is the issue of finite buffers. To make the analysis tractable and simple, we circumvent the problems by making two main simplifying assumptions. First, we assume that the event of a slot being full is independent of that of other slots. Second, we assume the FIFO buffers at all interfaces are infinite in size. Fortunately, these assumptions have been shown to be not problematic in [13], [1], [18] and by our own simulation validation studies.
With the above assumptions, we model the contention in parts 1 and 2 of packet delay using the M/G/1 queuing center model, similar to the approach in [1] and [18] where source-remove one-level and two-level rings, respectively, are analyzed. The key in this method lies in finding the expected service time of the M/G/1 service center which models a particular FIFO queue. This expected service time is effectively the expected time that a packet at the head of the queue waits before it gets an empty slot. In what follows, we first define the necessary parameters and list assumptions for the analysis and then give a detailed description of the analytical model.
It should also be noted that, from the modeling viewpoint, there is also a buffer, called a ring link buffer, associated with each ring link in the system (see Fig. 2 and Fig. 3). Two properties of this buffer matter for the model:
1. Packet arrivals occur only at discrete time points and the associated ring link "server" has a constant service time of 1 discrete time step; and
2. This ring link buffer has priority over station FIFO queues and interring crossover queues in competing for access to the ring link "server." This priority policy is consistent with the implementations of the NUMAchine [15], [14] and KSR [4].
We will not need a specific notation to identify these buffers because their total occupancies can be derived from ring utilization, which can be calculated directly from input packet traffic and packet travel patterns. This will become clear later.
Definitions and Assumptions
Time is discretized into clock ticks. One tick is the time needed for a packet to move between adjacent slot segments in any ring or from a ring link buffer to a FIFO queue in an interring interface (see Fig. 3) or from a ring link buffer to a station cluster bus buffer at a destination (see Fig. 2). The models to be developed are based on the following system parameters:
1. $\lambda$: identical traffic arrival rate at each local station, i.e., number of independent packets per clock tick arriving at a local ring station FIFO queue,
2. packet destination locality in H2 is determined by the same-local-ring probability $\alpha$ as defined in Section 3.1,
3. packet destination locality in H3 is determined by the same-local-ring and same-intermediate-ring probabilities $\alpha_L$ and $\alpha_M$ as defined in Section 3.1,
4. $N$: total number of local stations in the network,
5. $L$: number of stations on a local ring,
6. $M$: number of local rings on an intermediate ring in 3-level networks,
7. $G$: number of lower-level rings connected directly to the global ring. Note that $G = N/L$ in 2-level ring networks and $G = N/(LM)$ in 3-level ring networks.
Furthermore, we make the following assumptions:
1. The traffic arrival rate at each station and interring interface FIFO follows a Poisson process.
2. One packet can be completely carried by one slot.
3. A packet is removed from the network by the destination immediately after it reaches the destination station cluster bus buffer (see Fig. 2).
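The relationship among the structural parameters can be captured in a few lines. The following is a minimal sketch, not part of the model itself: the class name `RingConfig` is hypothetical, and the field names follow the roles described above (stations per local ring, local rings per intermediate ring, rings attached to the global ring). A 2-level network is treated as the special case with one "intermediate" level of size 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RingConfig:
    """Ring sizes of a hierarchical ring network (hypothetical helper)."""
    L: int      # stations on a local ring
    G: int      # lower-level rings attached to the global ring
    M: int = 1  # local rings per intermediate ring (1 for a 2-level network)

    @property
    def N(self) -> int:
        # Total stations: N = L*G in H2 and N = L*M*G in H3,
        # which is the same as G = N/L or G = N/(L*M).
        return self.L * self.M * self.G


h3 = RingConfig(L=7, M=6, G=12)
print(h3.N)  # 504
```

Deriving the total from the ring sizes, rather than storing it, keeps the three sizes and the station count consistent by construction.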
General Model
The basic idea of this analysis is to solve the M/G/1 queuing model for all FIFO queues (local stations and interring interfaces), which will give rise to expected queue lengths at all FIFO queues. In order to do this, we need ring utilizations. Using Little's result, these results can then be used to derive expected packet delays as follows. Let $Q_i$, $1 \le i \le N$, denote the queue length at local station $i$ and let $Q_{L-G,i}$ and $Q_{G-L,i}$ denote, respectively, the local-ring to global-ring FIFO queue length and the global-ring to local-ring FIFO queue length of the interring interface $i$, $1 \le i \le N/L$, for the 2-level ring. Similarly, for the 3-level ring, let $Q_{M-L,i}$, $Q_{L-M,i}$, $Q_{G-M,j}$, and $Q_{M-G,j}$ denote, respectively, the middle-ring to local-ring, local-ring to middle-ring, global-ring to middle-ring, and middle-ring to global-ring FIFO queue lengths. Here, $1 \le i \le N/L$ and $1 \le j \le N/(LM)$. The expected packet delays, $T_{H2}$ for 2-level networks and $T_{H3}$ for 3-level networks, are given by (3.1) and (3.2), respectively.
In each equation, $\bar{X}$ denotes the expected value of the variable $X$. The numerator in (3.1) represents the total population (of packets) in FIFO queues, interring interfaces, and rings. Aside from average queue lengths, $\bar{Q}$, interface and ring packet occupancies are accounted for as follows:
The term $2L\lambda(1 - \alpha)$, where $\alpha$ is the same-local-ring probability of Section 3.1, accounts for packets in the two links leading from ring buffers to FIFO queues, as shown in Fig. 3 (the step from $t_i$ to $t_{i+1}$). The terms $(N + N/L)\,U_L$ and $(N/L)\,U_G$, where $U_L$ and $U_G$ are the local and global ring utilizations, account for packets in all local rings and the global ring, respectively, and the term $N\lambda$ accounts for packets in all links leading into station cluster bus buffers, as shown in Fig. 2 (the step from $t_{d-1}$ to $t_d$). The denominator represents the system throughput. An implicit assumption here is that the system is nonsaturated and in steady state, making the system throughput equal to the total packet arrival rate. Similar comments apply to the terms in $T_{H3}$.
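The term-by-term description above suggests the following form for the 2-level expression (3.1), obtained by applying Little's result to the whole network (total packet population divided by throughput). This is a sketch under stated assumptions: the symbols $T_{H2}$ (expected packet delay), $\bar{Q}$ (average queue lengths), $U_L$ and $U_G$ (ring utilizations), and $\alpha$ (same-local-ring probability) are notational assumptions, and the interface term $2L\lambda(1-\alpha)$ is assumed to be counted once for each of the $N/L$ interring interfaces.

```latex
T_{H2} \;=\;
\frac{\displaystyle
      \sum_{i=1}^{N} \bar{Q}_i
      \;+\; \sum_{i=1}^{N/L} \Bigl( \bar{Q}_{L-G,i} + \bar{Q}_{G-L,i}
            + 2L\lambda(1-\alpha) \Bigr)
      \;+\; \Bigl( N + \frac{N}{L} \Bigr) U_L
      \;+\; \frac{N}{L}\, U_G
      \;+\; N\lambda}
     {N\lambda}
```

The factor $N + N/L$ is the total number of local ring links ($N/L$ rings with $L + 1$ links each), and $N/L = G$ is the number of global ring links, so each utilization term is a link count multiplied by the probability that a link holds a packet.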
Ring Utilizations
In H2 and H3, all local rings have $L + 1$ links, with the extra link being needed to incorporate the interring interface to the intermediate level ring. All global rings have $G$ links, while, in H3, intermediate rings have $M + 1$ links, with the extra link incorporating the interface to the global ring.
Because of the destination-remove protocol, it is easy to see that, on average, a packet traverses half of the links on any ring it moves over to reach its destination. This assumes that destinations are uniformly distributed inside the local, intermediate, and global sets of packets.
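The "half of the links" statement can be checked by direct enumeration. The sketch below assumes a unidirectional ring whose number of attachment points equals its number of links, with the destination equally likely to be any of the other attachment points; the function name `mean_links_traversed` is hypothetical.

```python
from fractions import Fraction


def mean_links_traversed(num_links: int) -> Fraction:
    """Average number of links a packet crosses on a unidirectional ring.

    With destination removal, a packet whose destination is k attachment
    points downstream crosses exactly k links, for k = 1 .. num_links - 1,
    and each k is assumed equally likely.
    """
    if num_links < 2:
        raise ValueError("a ring needs at least two links")
    hops = range(1, num_links)
    return Fraction(sum(hops), len(hops))


# A local ring with 7 stations plus one interring interface has 8 links.
print(mean_links_traversed(8))  # 4
```

The average works out to exactly half the number of links on the ring, which is the quantity used in the utilization derivations.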
H2: Assuming symmetry over all stations, there are two types of utilizations: $U_L$ for all local rings and $U_G$ for the global ring.
$U_L$: To derive $U_L$, consider a period of $\tau$ time steps. During this time, there are two sources of traffic onto each local ring: one from the local station queues $Q_i$ and the other from the global ring through $Q_{G-L}$. Traffic from $Q_i$ can be further divided into two parts, namely, those packets staying in the same local ring with probability $\alpha$ and those going up to the global ring with probability $1 - \alpha$. They all use $(L+1)/2$ links on average. Thus, traffic from $Q_i$ uses $L\lambda\tau\,(L+1)/2$ links over time $\tau$.
The traffic arriving from the global ring through $Q_{G-L}$ consists of packets from all $N/L - 1$ other local rings and, arguing as in 1, these packets also use $(L+1)/2$ links each on average. But there are $(L+1)\tau$ links available over $\tau$. Therefore, $U_L$ follows by combining the link usage from $Q_i$ traffic with (A1) and (B1) and dividing by the $(L+1)\tau$ available links.
H3: For the intermediate rings, the same argument applies, except that there are $(M+1)\tau$ links available over $\tau$; the intermediate ring utilization $U_M$ follows by combining (A2) and (B2).
Also note that $\lambda_{G-M} = \lambda_{M-G} = LM\lambda(1 - \alpha_M - \alpha_L)$. (3.9)
$U_G$: Over $\tau$ there are $N\lambda\tau(1 - \alpha_M - \alpha_L)$ packets that follow the $L \to M \to G \to M \to L$ path, each of which uses $G/2$ links; but $G\tau$ links are available, thus
$U_G = N\lambda(1 - \alpha_M - \alpha_L)/2$.
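The link-counting argument of this section can be turned into a small calculator. The sketch below is an illustration under assumptions, not the paper's equations verbatim: it assumes steady state and symmetry, so that the packet rate arriving at a local ring from the global ring equals the rate leaving it, and it uses the half-ring average traversal. The function names and the argument names (`lam` for the per-station arrival rate, `alpha`, `alpha_L`, `alpha_M` for the locality probabilities) are hypothetical.

```python
def h2_utilizations(N, L, lam, alpha):
    """Return (U_L, U_G) for a 2-level network (sketch, see assumptions)."""
    # Local ring: L*lam packets per tick enter from the stations and, by the
    # symmetry assumption, L*lam*(1 - alpha) per tick arrive from the global
    # ring; each crosses (L + 1)/2 of the L + 1 links on average.
    u_local = L * lam * (2.0 - alpha) / 2.0
    # Global ring: N*lam*(1 - alpha) packets per tick, each crossing G/2 of
    # the G links, so G cancels.
    u_global = N * lam * (1.0 - alpha) / 2.0
    return u_local, u_global


def h3_global_utilization(N, lam, alpha_L, alpha_M):
    """U_G for a 3-level network: N*lam*(1 - alpha_L - alpha_M)/2."""
    return N * lam * (1.0 - alpha_L - alpha_M) / 2.0


def saturation_rate_h3(N, alpha_L, alpha_M):
    """Arrival rate at which the global ring would reach U_G = 1."""
    return 2.0 / (N * (1.0 - alpha_L - alpha_M))
```

For example, with 400 stations, ring sizes of 10, 10, and 4, and uniform destinations (so that the two locality probabilities are 9/399 and 90/399), `h3_global_utilization(400, 0.005, 9/399, 90/399)` gives a global ring utilization of about 0.75.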
Derivation of Average Queue Lengths
Now, we need average queue lengths, $\bar{Q}$, everywhere, for both H2 and H3 systems.
H2: $\bar{Q}_i$: Slot access time at a local station will be 0 if the upstream link buffer is empty at the time the packet arrives at the head of the line (HOL) position. Service in the first link traversed is counted in the $U_L$ component of (3.1) because, technically, as soon as the HOL entry starts to get service in the first link, it can be considered that it has been dropped into the empty upstream link buffer.
If $p$ is the probability that a slot is full AND continuing past the current point, then, with slots treated as independent, the expected slot access time for the HOL message packet is:
$\bar{s} = \sum_{k \ge 0} k\,p^k (1 - p) = p/(1 - p)$.
Now, applying Little's Law, we get $\bar{Q}_i = \lambda \bar{W}$, where $\bar{W}$ is the average waiting time in the queue. When a new packet arrives, it must wait $\bar{s}$ time units for each item ahead of it and then wait $\bar{s}$ more units for its own service. Because of the memoryless property of the stochastic process, we have $\bar{W} = \bar{s} + \bar{s}\,\bar{Q}_i$. Therefore,
$\bar{Q}_i = \lambda\bar{s}/(1 - \lambda\bar{s})$. (3.11)
The probability that a slot is full is $U_L$, the local ring utilization. The probability that it is continuing past the current point can be shown to be $(L-1)/L$ by a detailed consideration of the possible source and destination of each packet that appears in the input side link buffer of a local station. Therefore,
$p = U_L\,(L-1)/L$. (3.12)
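The chain from ring utilization to station queue length can be sketched numerically. The code below is a minimal illustration under assumptions: slots are independent, so the slot access time is geometric with mean p/(1 - p); the continuing probability is taken as (L - 1)/L; and the queue length is the fixed point of Little's Law with a waiting time of one access time per queued packet plus one for the arriving packet. The function name `station_queue` and its argument names are hypothetical.

```python
def station_queue(lam, u_local, L):
    """Return (p, s, q) for one local station FIFO (sketch).

    lam     -- packets per tick arriving at the station
    u_local -- utilization of the local ring
    L       -- stations on the local ring
    """
    # Probability that a passing slot is full and continues downstream.
    p = u_local * (L - 1) / L
    if not 0.0 <= p < 1.0:
        raise ValueError("slot-busy probability must lie in [0, 1)")
    # Mean number of ticks the head-of-line packet waits for a usable slot.
    s = p / (1.0 - p)
    load = lam * s
    if load >= 1.0:
        raise ValueError("station queue is saturated")
    # Solves q = lam * (s + s * q) for the average queue length q.
    q = load / (1.0 - load)
    return p, s, q


p, s, q = station_queue(lam=0.004, u_local=0.5, L=10)
print(round(p, 2), round(s, 4))  # 0.45 0.8182
```

The explicit saturation checks make the nonsaturated, steady-state assumption of the model visible: outside that regime the closed-form queue length has no meaning.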
Expected Packet Delay
The expressions for ring utilizations, traffic rates, and average queue lengths, developed in Sections 3.5 and 3.6, can now be used in the general model, described in Section 3.4, to derive expressions for the expected message delay in both the 2-level and 3-level ring structures. The required sequence of substitutions in converting the global expression (3.1) for the expected 2-level delay $T_{H2}$ into an explicit expression involving only the structural parameters $N$, $L$, and $G = N/L$ and the traffic parameters $\lambda$ and $\alpha$ is as follows: First, substitute from (3.3) for $U_L$ into (3.12) for $p$ and then substitute this explicit expression for $p$ into (3.11) to obtain an explicit expression for $\bar{Q}_i$. Similarly, use (3.4), (3.14), and the remaining expressions for the interring interface queues. After performing a number of algebraic simplifications, we obtain the $T_{H2}$ expression (3.17). Similar substitutions, algebraic rearrangements, and simplifications can be used to derive the expression (3.18) for the expected packet delay $T_{H3}$ in 3-level ring structures.
As with the $T_{H2}$ expression (3.17), each of the terms in the $T_{H3}$ expression (3.18) corresponds to a component of the general model of Section 3.4.
VALIDATION OF THE ANALYTICAL MODELS VIA SIMULATIONS
In this section, we validate our analytical model through extensive simulations. In the simulation study reported in [12], an event-driven simulator was used to study 2-level and 3-level hierarchical ring systems. All the simulation results presented here have very small 95 percent confidence intervals and, so, these intervals are not shown. In Fig. 4, results for an H2 system are plotted to show expected packet delay as a function of the arrival rate $\lambda$ and locality. Since the global ring saturates faster than any other ring in the system, we also included its utilization. We were not able to compare the case of locality $\alpha = 0.2$ and $\lambda > 0.004$ because the system entered saturation soon after that point. Nevertheless, it is clear from the figure that our model is very accurate with the exception of two points where errors of 8.3 percent and 16.7 percent occur at global ring utilizations of 82 percent and 92 percent, respectively. This discrepancy can be explained as a result of our model's inability to capture the "train effects" (see Section 3.2) at the near-saturated global ring conditions. Fig. 5 shows a comparison between our model and the simulations for an H3 system. As with the case of H2, our model agrees very well with the simulation, with the worst error being 7.7 percent at a global ring utilization of 81 percent.
Our final comparison between model and simulation is shown in Fig. 6 for three H3 configurations, again revealing very good agreement except at high global ring utilization levels.
The more important point brought out by Fig. 6, however, relates to the relationship between average packet delay performance and network configuration at different traffic levels. Consider the following: Assume a distribution of message packet destinations that is characterized by the application, not related to network configuration. For example, in the uniform distribution, all processor cluster nodes are equally likely as the destination of a message packet. This presents the most demanding case for any multiprocessor network. There is no locality that can be exploited. Fig. 6 shows such a case. System size $N$ is close to 400 for all three configurations. As the configurations $(L, M, G)$ vary, the locality probabilities $\alpha_L$, $\alpha_M$, and $\alpha_G$ must also vary to properly reflect a uniform message packet destination distribution. The figure reveals that, for light traffic ($\lambda = 0.001$), the $(L, M, G) = (6, 6, 11)$ configuration provides a lower average packet delay than the $(10, 10, 4)$ configuration; while, for heavy traffic ($\lambda = 0.005$ and global ring utilizations upwards of 75 percent), the opposite is true. In general, we have shown earlier [6] that the configuration leading to the lowest maximum distance between any pair of nodes (the minimum diameter network) has $L$, $M$, and $G$ sizes in proportions 1 : 1 : 2. This is consistent with the $(6, 6, 11)$ configuration having the lowest average delay in the light traffic (and, thus, low contention) case. Correspondingly, in [14], an independent detailed simulation study of H3 systems showed that good configurations for the heavy uniform traffic case all had relatively small global rings. In particular, they derived $(L, M, G) = (6, 3, 3)$ for a particular $N = 54$ network and $(12, 3, 3)$ for an $N = 108$ network.
This tendency is qualitatively similar to our result that $(10, 10, 4)$ is better than $(6, 6, 11)$ for the heavy traffic case. We will expand on this use of the model in configuration design in the next section.
Fig. 5. Comparison between the model and simulation for an H3 system where $N = 504$, $L = 7$, $M = 6$, $G = 12$, and $\lambda = 0.005$.
OPTIMAL CONFIGURATIONS AND BOTTLENECKS
One very important issue in the design of hierarchical ring systems is that of configuration. Our analytical model can predict expected packet delay accurately. It can now be used to answer the logical question: What is the best configuration for the hierarchical ring network to minimize the average delay, given a particular application-based traffic pattern and system size? A quick answer to this question can be very helpful in enabling the system architect/designer to make sensible design decisions. The answer to the question may be found by deriving optimal values for the local ring size $L$ in H2 and for the local and intermediate ring sizes $L$ and $M$ in H3 that minimize the expected packet delays $T_{H2}$ and $T_{H3}$, respectively.
The expressions for the expected delays $T_{H2}$ and $T_{H3}$ are closed form functions of $N$, $L$, $M$, and traffic, which is uniquely defined by values of $\lambda$ and locality ($\alpha$ in H2; $\alpha_L$ and $\alpha_M$ in H3). Therefore, if one has some knowledge of the density ($\lambda$) and pattern (locality) of the traffic which the future system will likely be subject to, then, for a given system size ($N$), it is possible to find values of $L$ (for H2) and $L$ and $M$ (for H3) that minimize $T_{H2}$ and $T_{H3}$, respectively, for given values of $\lambda$ and application-based traffic locality. In this section, we show how (3.17) and (3.18) can be used to find optimal values of $L$ and $M$. All 3D plots in this section were generated using the Maple-V software [2].
The design optimization question, as we have posed it, only makes sense if we are able to show how the physical network locality parameters $\alpha_L$, $\alpha_M$, and $\alpha_G$ ($= 1 - \alpha_L - \alpha_M$) are functionally related to $N$, $L$, $M$, and $G = N/(LM)$ for a given application-based locality specification. As an example, we will deal with the uniform message packet destination case here. This is simply the case in which all other $N - 1$ nodes are equally likely as destinations from any particular source node. This traffic distribution is reflected in the following functional relationships: In H2, $\alpha = (L - 1)/(N - 1)$, and, in H3, $\alpha_L = (L - 1)/(N - 1)$, $\alpha_M = (M - 1)L/(N - 1)$, and $\alpha_G = 1 - \alpha_L - \alpha_M = (G - 1)LM/(N - 1)$. These substitutions are made in $T_{H2}$ and $T_{H3}$ before plotting the Maple-V surfaces. Fig. 7 shows a 3D plot of $T_{H2}$ as a function of $L$ and $\lambda$ while the traffic pattern is uniform and $N = 500$. In this figure, traffic density $\lambda$ ranges from 0.0005, representing light traffic, to 0.004, representing the heavier traffic. As can be seen in the figure, there is an optimum of $L$ for each $\lambda$ value. For light traffic ($\lambda = 0.0005$), $L$ is optimal near 16, shifting to larger values as $\lambda$ increases, with $L$ being optimal near 28 for $\lambda = 0.004$.
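The uniform-destination relationships can be verified with exact arithmetic. The following sketch counts, for a given source, how many of the other stations lie on the same local ring, on other local rings of the same intermediate ring, and elsewhere; the function names are hypothetical and the ring sizes are passed as plain integers.

```python
from fractions import Fraction


def uniform_locality_h2(L, G):
    """Same-local-ring probability for uniform destinations in H2."""
    N = L * G
    return Fraction(L - 1, N - 1)


def uniform_locality_h3(L, M, G):
    """Return the three H3 locality probabilities for uniform destinations."""
    N = L * M * G
    same_local = Fraction(L - 1, N - 1)              # other stations on my ring
    same_middle = Fraction((M - 1) * L, N - 1)       # other local rings, same intermediate ring
    via_global = Fraction((G - 1) * L * M, N - 1)    # everything else
    return same_local, same_middle, via_global


probs = uniform_locality_h3(10, 10, 4)
print(sum(probs))  # 1
```

The three counts add up to all other stations, so the probabilities sum to exactly 1 for any ring sizes, which is a quick consistency check on the stated relationships.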
In Figs. 8 and 9, we plot the expected H3 delay $T_{H3}$ as a function of $L$ and $M$ for $\lambda = 0.002$ and $\lambda = 0.004$, respectively, while keeping the traffic pattern uniform and $N = 500$. As expected, for each $\lambda$ value, there is a pair of optimal $L$ and $M$ values. In fact, for $\lambda = 0.002$, the optimal values for $L$ and $M$ are 6 and 7, respectively; whereas, for $\lambda = 0.004$, values of 9 and 10 for $L$ and $M$, respectively, minimize $T_{H3}$.
A general rule-of-thumb can be concluded from the results of Sections 4 and 5 for the uniform traffic case: As the traffic intensity rate ! moves from light to heavy, the proportional ring sizes for optimal network configurations shift from 1:2 in H2 and 1:1:2 in H3 to 2:1 in H2 and 2:2:1 in H3. 
Global Ring Bottleneck
It is clear from numerical examples derived from either the analytical models or simulations that the global ring saturates first when there is low locality in the message traffic. For a uniform message packet distribution, Fig. 6 shows that choosing a ring size configuration with the global ring relatively smaller than the local and intermediate rings leads to lower utilization of the global ring and lower overall average packet delay. Put another way, the optimal configuration allows a higher traffic rate (more throughput) before saturation occurs.
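The effect of global ring size on its utilization can be illustrated for the two roughly 400-station configurations discussed in Section 4. The sketch below assumes uniform destinations and the link-counting argument of Section 3.5 (packets crossing the global ring use half of its links on average); the function name `global_ring_utilization` is hypothetical.

```python
def global_ring_utilization(L, M, G, lam):
    """Global ring utilization in H3 under uniform destinations (sketch)."""
    N = L * M * G
    # Fraction of packets whose destination lies under a different
    # intermediate ring, so that they must cross the global ring.
    alpha_G = (G - 1) * L * M / (N - 1)
    # N*lam*alpha_G packets per tick, each using G/2 of the G links.
    return N * lam * alpha_G / 2.0


for cfg in [(6, 6, 11), (10, 10, 4)]:
    print(cfg, round(global_ring_utilization(*cfg, lam=0.005), 3))
# (6, 6, 11) 0.902
# (10, 10, 4) 0.752
```

At the same arrival rate, the configuration with the smaller global ring keeps that ring noticeably further from saturation, which matches the direction of the delay results for heavy traffic.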
Since a relatively small global ring represents a proportionally very small component of the hardware implementation cost of a full network, it is feasible to consider increasing its bandwidth. In [14], the authors propose doubling the global ring bandwidth. This can be achieved in one of two ways, doubling the physical width of the links or doubling the clock rate and adding a (pipeline) buffer in each link, as discussed in [14].
It is very easy to change the analytic queuing model to account for a double-bandwidth global ring. We will not give details here, but will state some numerical results.
For the uniform traffic case, an $N = 90$ H3 system with the configuration $(L, M, G) = (6, 5, 3)$ and a double-bandwidth global ring has a packet delay versus $\lambda$ performance that is very close to that of an $N = 75$ system with the configuration $(L, M, G) = (5, 5, 3)$ and a regular global ring. As another example, an $N = 108$ H3 system with an $(L, M, G) = (6, 6, 3)$ configuration and a double-bandwidth global ring has a performance comparable to a regular $N = 90$ system with an $(L, M, G) = (6, 5, 3)$ configuration.
One way to view these results is that doubling the global ring bandwidth in these two cases allows an increase of 20 percent in system size, $N$, and total packet throughput, $N\lambda$, for the same packet delay versus $\lambda$ performance, as $\lambda$ is varied over a wide operational range.
Fig. 8. 3D plot for H3 delay with $N = 500$, $\lambda = 0.002$, and uniform message packet destination distributions.
Fig. 9. 3D plot for H3 delay with $N = 500$, $\lambda = 0.004$, and uniform message packet destination distributions.
CONCLUDING REMARKS
Network configuration, that is, appropriate choices for the size of local, intermediate, and global rings, can be quickly and easily estimated by using the queuing models developed here, without resorting to time-consuming simulations, assuming that minimizing average message delay is the important criterion. We gave an example of such a design study in Section 5. As we noted, network optimization is only meaningful relative to a specified traffic intensity and message destination distribution that is determined by the application. In Section 5, we used a uniform distribution, which is easy to incorporate into the model. For more general application-based distributions, such as those described in [9] , we have shown in [10] how to incorporate them into a simple model that is, however, only valid for very light traffic (no significant contention at crossover switches). We are currently studying how to incorporate more general distribution specifications into the queuing models, enabling wider use of the models in network design and optimization.
