In this paper, we present a queueing model for performance analysis of finite-buffered multistage interconnection networks. The proposed model captures network behavior in an asynchronous communication mode and is based on realistic assumptions. A uniform traffic model is developed first and then extended to capture non-uniform traffic in the presence of a hot spot. Throughput and delay are computed using the proposed model and the results are validated via simulation. The analysis is extended to predict the performance of MIN-based multiprocessors, where the concept of the maximum number of outstanding memory requests is included. The effects of buffer length, switch size, and the maximum allowable outstanding requests on system performance are discussed. Various design decisions are drawn from this model with respect to delay, throughput, and system power.
Introduction
Multistage interconnection networks (MINs) have been proposed as an efficient interconnection medium for multiprocessors. They have been used in various commercial and experimental systems [1] [2] [3] [4]. The behavior of the interconnection network plays an important role in the performance of multiprocessors. For an optimal design, it is necessary to analyze various configurations and constraints of the interconnection network. In this paper, we present a queueing model for performance prediction of MINs and MIN-based multiprocessors.
Earlier research on MIN performance has focused on three types of network models: circuit switched [5]; packet switched with infinite buffers [6] [7] [8] [9]; and packet switched with finite buffers [10] [11] [12] [13] [14]. Study of circuit-switched MINs has gradually diminished as various packet switching techniques have become more prevalent. Infinite buffer analysis does not necessarily predict realistic behavior of MINs under various workloads. For example, it is argued that small buffer lengths (2 or more) behave as infinite buffers [6]. This is true only under light loads or when each processor is restricted to one outstanding request in the network. Multiple outstanding requests increase traffic in the system, and the buffer length needs to be large in order to mimic the infinite buffer performance. Furthermore, practical designs have finite-length buffers in the switches. Recent research effort is therefore directed towards the analysis of finite-buffered MINs.
A model for finite-buffered MINs should capture the following issues in order to predict realistic performance.
The processors in an MIMD mode operate independently of each other with occasional synchronization. Thus the network model should be based on asynchronous message transmission.
The packets are normally of fixed size. Therefore, the time required for transferring a packet from one stage to the next stage is deterministic.
Messages that cannot be transmitted from one stage to the next due to the unavailability of buffer space should be blocked rather than rejected. Systems like Cedar use blocking of packets to avoid an unnecessary regeneration process [1].
The model should be general enough to analyze uniform as well as non-uniform memory reference patterns. In addition, analysis of an isolated interconnection network does not reveal its behavior in a multiprocessor environment. An integrated study of the network and the system-level constraints can provide better insight into performance.
Prior work on finite-buffered MINs is mainly based on probabilistic models [10] [11] [12] [13] [21]. These analyses are valid for synchronous networks where all the input/output operations happen at discrete stage cycles. These models do not capture asynchronous behavior, especially when the service time of the SEs is more than one clock cycle. The queueing model for finite-buffered asynchronous MINs developed in [14] assumes nonblocking capability and exponential service time for the switching elements.
None of the above models has considered all the design issues mentioned earlier. In this paper, we present a queueing model for performance analysis of MINs that considers asynchronous packet switching transmission, finite buffers, deterministic switch service time, message blocking, and the constraints of a multiprocessor environment. The MIN is first modeled assuming uniform memory references. Next, the methodology for extending the model to analyze non-uniform traffic in the presence of a hot spot is described, demonstrating the versatility of the analytical model. The model has been validated via extensive simulation. Average message delay and throughput are used as performance measures to characterize a MIN. The variation of performance with input load and buffer length is discussed. The analysis is extended to predict the performance of MIN-based multiprocessors. Results are obtained for the effect of multiple outstanding requests on multiprocessor performance. A performance metric called system power, which gives a meaningful measure of the tradeoffs between delay and throughput, is also analyzed [19].
The rest of the paper is organized as follows. The network architecture and operations are described in Section 2. In Section 3, a queueing model for MINs is developed for uniform traffic, and the extension to a non-uniform traffic pattern is presented in Section 4. Performance analysis and discussion of various aspects of network behavior are presented in Section 5, followed by the concluding remarks in Section 6.
Network Operations
An N-node multiprocessor consists of N processing elements (PEs) and N memory modules (MMs) interconnected by an (NxN) MIN. An (NxN) MIN designed using (axa) switching elements (SEs) has n stages, where $n = \log_a N$. An (8x8) baseline MIN is shown in Figure 1. It consists of (2x2) SEs, each of which has buffers of size L at its input ports. Placing buffers at the input ports of SEs is advantageous and cheaper than placing them at the output ports [13]. The analysis, however, can also be used for MINs that use buffers at the output ports, as the effective arrival rate at each stage remains the same irrespective of the location of the buffer.
(i) Each processor generates fixed-size messages independently at a rate $\lambda$ and the intermessage times are exponentially distributed.
(ii) A memory request is uniformly distributed among all the MMs.
(iii) The SEs have deterministic service time (d cycles). During this period, the address is decoded, the destination address is checked, and the data is transferred depending upon the availability of buffer space.
(iv) A packet is blocked at a stage if the destination buffer at the next stage is full. Packets arriving at the first stage of the MIN are discarded if the buffer is full.
Almost all performance studies incorporate assumptions (i) and (ii) to ensure mathematical simplicity. Relaxation of the second assumption to non-uniform traffic is possible, and the analysis of a single hot-spot traffic model is presented in Section 4. Assumption (iii) is based on practical systems like Cedar and the BBN Butterfly. Cedar also uses blocking of packets in the MIN, and this concept is absorbed in assumption (iv).
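As a quick illustration of the network dimensions described above, the following minimal Python sketch computes the number of stages $n = \log_a N$ and the number of SEs per stage for an (NxN) MIN built from (axa) SEs. The function names are hypothetical, and the N/a switches-per-stage count assumes a regular delta-type topology such as the baseline network of Figure 1.

```python
def num_stages(N, a):
    """Return n such that a**n == N, using integer arithmetic only.

    Raises ValueError when N is not an exact power of the switch size a.
    """
    if N < 1 or a < 2:
        raise ValueError("need N >= 1 and a >= 2")
    n, size = 0, 1
    while size < N:
        size *= a
        n += 1
    if size != N:
        raise ValueError("N must be an exact power of a")
    return n


def switches_per_stage(N, a):
    """Number of (axa) SEs in one stage (assumes a regular delta-type MIN)."""
    num_stages(N, a)  # validates that N is a power of a
    return N // a


if __name__ == "__main__":
    # The (8x8) baseline MIN built from (2x2) SEs: 3 stages of 4 SEs each.
    print(num_stages(8, 2), switches_per_stage(8, 2))
```

An integer loop is used instead of a floating-point logarithm so that large network sizes are not affected by rounding.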
A request from a processor is routed to the destined MM through the interconnection network (IN). An acknowledgement/reply from the MM is returned through another layer of MIN in the reverse direction to the PE that originated the request [1]. The "forward network" and the "reverse network" are distinct but topologically identical. It is thus sufficient to analyze the performance of either network [10]. By using the effective input rate, the analysis presented here can be used for both forward and reverse networks.
Queueing Model
The buffers of the SEs of a MIN are of finite length and have deterministic service time. Hence, each of them can be modelled as an M/D/1/L queueing center. The study consists of two parts. First, we present the analysis of an M/D/1/L queue, and then extend the analysis to a network of n queues, where n is the number of stages in the MIN.
M/D/1/L Queue Analysis
Notations:
$\lambda$: packet generation rate of a source (processor).
$d$: switch service time.
$L$: length of a buffer in the SEs.
$p_k$: probability that there are k customers in an M/D/1 queueing center at steady state.
$p_k^{(L)}$: probability that there are k customers in an M/D/1/L queueing center at steady state.
$\rho$: traffic intensity at the server, $\rho = \lambda d$.
The state probabilities of an M/G/1/L queueing system are proportional to the corresponding state probabilities of the M/G/1 system in the interval $0 \le k \le L$ [20]. Using this concept, the steady state probabilities of an M/D/1/L queueing center can be derived from an M/D/1 queue in the range $0 \le k \le L$. The derivation is described in detail in [20]. The probability that there are k customers in an M/D/1/L queueing center is given as
where x denotes the probability that the buffer is full. The buffer becomes full when there are (L + 1) packets at the service center: L packets in the queue and one in the server. x can also be termed the blocking probability, as it represents the probability that a packet will be blocked at the preceding stage. From [20],
The values 
Let Q be a random variable that represents the number of jobs at a service center. The average value of Q, denoted as $E[Q]$, is given as
Using Little's law, the average time $E[T]$ spent at the center is
The denominator captures the effect of blocking by adjusting the arrival rate at a finite-buffered service center.
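The quantities in this subsection can be evaluated numerically. The sketch below is not the derivation of [20]; it assumes the standard embedded-Markov-chain treatment of an M/G/1 queue with finite capacity, specialised to deterministic service, with a capacity of L + 1 packets (L waiting plus one in service). The function names are hypothetical. The mean delay follows Little's law with the arrival rate reduced by the blocking probability, as described above.

```python
import math


def mdl_probabilities(lam, d, L):
    """General-time state probabilities p[0..L+1] of an M/D/1/L center.

    Assumption: standard M/G/1/K embedded-chain solution with K = L + 1.
    """
    capacity = L + 1                      # L buffer places plus the server
    rho = lam * d                         # traffic intensity
    # a[j]: probability of j Poisson arrivals during one service of length d
    a = [math.exp(-rho) * rho ** j / math.factorial(j) for j in range(capacity)]
    # Unnormalised probabilities seen just after a departure (states 0..K-1)
    pi = [1.0]
    for j in range(capacity - 1):
        carried = pi[0] * a[j] + sum(pi[i] * a[j - i + 1] for i in range(1, j + 1))
        pi.append((pi[j] - carried) / a[0])
    total = sum(pi)
    pi = [x / total for x in pi]
    # Convert departure-epoch probabilities to general-time probabilities
    norm = pi[0] + rho
    p = [x / norm for x in pi]
    p.append(1.0 - 1.0 / norm)            # probability that the center is full
    return p


def mean_jobs_and_delay(lam, d, L):
    """Return (E[Q], E[T]); E[T] uses Little's law on the accepted traffic."""
    p = mdl_probabilities(lam, d, L)
    mean_jobs = sum(k * pk for k, pk in enumerate(p))
    blocking = p[-1]
    return mean_jobs, mean_jobs / (lam * (1.0 - blocking))


if __name__ == "__main__":
    print(mean_jobs_and_delay(0.5, 1.0, 4))
```

With L = 0 the center degenerates to a single-place loss system, whose blocking probability $\rho/(1+\rho)$ gives a convenient sanity check for the routine.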
MIN Analysis
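The notations used in Section 3.1 are also used for the MIN analysis with a few modifications as follows.
$\lambda_i$: packet arrival rate at stage $i$.
$p_k^{(L)}(i)$: probability that there are k customers in the queueing center at stage $i$, $1 \le i \le n$.
$\rho_i$: traffic intensity at the server, $\rho_i = \lambda_i d$, $1 \le i \le n$.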
The basic model of a (4x4) MIN using (2x2) SEs is shown in Figure 2. The packet arrival and departure rates at each buffer are indicated in the figure. Note that the departure rate from a buffer may not be the same as the arrival rate at the buffer, as it is affected by the blocking probability as well as the service time distribution of the server. The uniform memory reference assumption makes all the servers of a particular stage statistically indistinguishable. This can be verified from Figure 2. The departure rate from a first stage buffer is $\lambda_2$, which is divided equally among the output ports of the SEs because of the uniform memory reference assumption. Each output port of a first stage SE receives packets at a rate $\lambda_2/2$ from the outputs of two buffers. They add up to make the effective arrival rate at the buffers of the second stage equal to $\lambda_2$. All the buffers at a particular stage have the same arrival rate. A packet is transmitted from stage to stage, passing through exactly one buffer per stage. It is therefore sufficient to analyze one buffer per stage of the MIN. A packet has to travel through a chain of n buffers in an n-stage MIN. Each buffer is modelled as an M/D/1/L queueing center, capturing the deterministic service time and the finite buffer consideration. A MIN is thus modelled as a chain of n M/D/1/L queueing centers as shown in Figure 3.
We therefore analyze the probability density function (pdf) of the interdeparture time of an M/D/1/L queue. Let $\tau_i$ be a random variable which represents the time between departures from an M/D/1/L queueing center of the ith stage. Let $\phi$ be the event that the queue is empty after a departure. $f_{\tau_i}(t)$ represents the probability density function of $\tau_i$, and $f_{\tau_i|\phi}(t)$ denotes the probability density function of $\tau_i$ given that the queue is empty. $f_{\tau_i|\bar{\phi}}(t)$ denotes the probability density of $\tau_i$ given that the queue is not empty. $p_0^{(L)}(i)$ is the probability that the queue is empty.
Instead of using the departure point probability of an empty queue, the asymptotic limit (as $L \to \infty$), i.e., the general time probability, is used. This makes the interdeparture time density approximate but allows bounds on the approximation region [14]. This approximation is validated by comparing the results with those obtained through simulation (the departure point probabilities are not approximated in the simulation). Without such an approximation, the analysis would become extremely complex. Thus, the interdeparture probability density function is given by
$$f_{\tau_i}(t) = p_0^{(L)}(i)\, f_{\tau_i|\phi}(t) + \left(1 - p_0^{(L)}(i)\right) f_{\tau_i|\bar{\phi}}(t).$$
The pdf $f_{\tau_i|\bar{\phi}}(t)$ is simply the density of the server when the queue is not empty. As the server has a deterministic service time of d cycles, there will be a departure every d cycles when the queue is not empty. Thus
$$f_{\tau_i|\bar{\phi}}(t) = \delta(t - d),$$
where $\delta(t)$ is an impulse function. When the queue is empty, the pdf is the density of the service time plus the arrival time. The service time and the arrival time are independent of each other. The Laplace transform of the density of the sum of two independent random variables is equal to the product of the Laplace transforms of their densities. Taking the Laplace transforms,
$$F_{\tau_i|\phi}(s) = \frac{\lambda_i}{s + \lambda_i}\, e^{-sd}.$$
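The inverse Laplace transform is
$$f_{\tau_i|\phi}(t) = \lambda_i e^{-\lambda_i (t-d)}\, U(t-d), \qquad (9)$$
where $U(t)$ is a unit step function. Thus,
$$f_{\tau_i}(t) = p_0^{(L)}(i)\, \lambda_i e^{-\lambda_i (t-d)}\, U(t-d) + \left(1 - p_0^{(L)}(i)\right) \delta(t-d). \qquad (10)$$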
The expected value of the interdeparture time can be approximated as the mean interarrival time at the next stage buffer. Let $E[\tau_i]$ represent the expected value of the interdeparture time of packets from the queueing center. $E[\tau_i]$ can be obtained from equation (10) as
$$E[\tau_i] = d + \frac{p_0^{(L)}(i)}{\lambda_i}. \qquad (11)$$
It is extremely difficult to accurately characterize the nature of interdeparture processes. In order to keep the model tractable, we approximate the interdeparture time distribution from one stage to the next as exponential with an average rate of $\lambda_{i+1} = 1/E[\tau_i]$ requests/cycle. It will be shown in Section 5 that this assumption does not induce a substantial difference between analytical and simulation results. We compute the departure rate, $\lambda_{i+1}$, from equation (11), for $2 \le i \le n$.
The above expression is used to compute $\lambda_i$ for $i = 1$ to $n$. Using equations (4) and (5), the average time spent at the ith stage is obtained for each stage $i$, $1 \le i \le n$ (equation (14)). The average delay for a packet is obtained by summing the delays of all the stages. The normalized throughput, X, is determined by the output of a buffer in the last stage of the MIN, and is equal to $\lambda_{n+1}$.
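The stage-by-stage procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's exact equations (12)-(14): each stage is solved as an isolated M/D/1/L center with a standard finite-capacity M/G/1 solution, and the rate offered to the next stage is taken as $1/E[\tau_i]$ with $E[\tau_i] = d + p_0/\lambda_i$, the mean of a density that is an impulse at d with probability $1 - p_0$ and a shifted exponential with probability $p_0$. All function names are hypothetical.

```python
import math


def stage_probabilities(lam, d, L):
    """General-time probabilities p[0..L+1] of one M/D/1/L stage.

    Assumption: standard M/G/1/K embedded-chain solution with K = L + 1.
    """
    capacity = L + 1
    rho = lam * d
    # a[j]: probability of j Poisson arrivals during one service time d
    a = [math.exp(-rho) * rho ** j / math.factorial(j) for j in range(capacity)]
    pi = [1.0]                            # unnormalised departure-epoch values
    for j in range(capacity - 1):
        carried = pi[0] * a[j] + sum(pi[i] * a[j - i + 1] for i in range(1, j + 1))
        pi.append((pi[j] - carried) / a[0])
    total = sum(pi)
    pi = [x / total for x in pi]
    norm = pi[0] + rho
    return [x / norm for x in pi] + [1.0 - 1.0 / norm]


def analyze_min(lam, d, L, n):
    """Return (average delay, output rate) for a chain of n stages."""
    rate = lam                            # arrival rate at the first stage
    total_delay = 0.0
    for _ in range(n):
        p = stage_probabilities(rate, d, L)
        mean_jobs = sum(k * pk for k, pk in enumerate(p))
        # Little's law with the arrival rate reduced by the blocking probability
        total_delay += mean_jobs / (rate * (1.0 - p[-1]))
        # Mean interdeparture time d + p0/rate gives the next stage's rate
        rate = 1.0 / (d + p[0] / rate)
    return total_delay, rate


if __name__ == "__main__":
    delay, throughput = analyze_min(lam=0.5, d=1.0, L=4, n=6)
    print(delay, throughput)
```

Under light load with ample buffering the sketch reduces to n independent M/D/1 queues, so the output rate stays close to the input rate, which is a useful check before exploring heavier loads.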
Non-Uniform Traffic Model
Traffic non-uniformity in parallel systems can occur due to concurrent requests by several processors to a shared memory module (hot MM). This creates tree saturation in the interconnection network and results in hot-spot contention [15] [16] [17] [18]. The performance of a MIN is degraded by the presence of hot spots.
In this section, we extend the proposed technique to analyze a single hot-spot traffic model. A certain fraction of the traffic (hot traffic) from each processor is assumed to be directed towards the hot MM, and the remaining traffic (cold traffic) is uniformly distributed over all the MMs. Let h be the fraction of requests directed to the hot MM from a processor. For a MIN that uses (axa) SEs, the traffic rate at the input port of an SE at stage i in the fan-in tree of a hot MM is $(1-h)\lambda_i + a^{i-1} h \lambda_i$, where $\lambda_i$ denotes the request departure rate of stage $i-1$. This is illustrated in Figure 4, where a = 2. The bold path represents the fan-in tree for the hot memory (MM3).
The following traffic patterns can be observed from the figure. Any processor accessing the hot memory module (MM3) has to take a route which can be modelled as a series of n queues, as shown by the path in Figure 5(a). The cold MMs can be formed into n groups depending upon their location with respect to the hot MM. The traffic interactions in the access path to the MMs of a group are the same for any processor. Note that the route taken by each processor to access the MMs of a group will be different, but because the traffic interaction is the same, the model for the path remains the same. Different groups have different access paths. The grouping for the example under consideration is illustrated in Figure 4. A PE accessing MMs of group 1 uses the path shown in Figure 5(b). The path shown in Figure 5(c) is used by a PE accessing MMs of group 2. Similarly, any PE accessing MMs of group 3 uses the path shown in Figure 5(d). The additional traffic at various stages of the MIN is due to the traffic from the other PEs. The cold traffic rate of $(1-h)\lambda_4$ is the output of the last stage at each of the networks excluding the hot traffic path. The output rate of the hot traffic path has an additional rate of $8h\lambda_4$ due to the hot traffic from all the processors in the system.
Thus, in order to access an MM, a PE needs to take a particular path depending upon the location of the hot MM. There are four different types of paths in this case. In general, there are (n + 1) different types of paths for a PE in an n-stage MIN. This is due to the network topology and can be verified by simple observation. The nth group will comprise the remaining MMs. It can be derived that the number of MMs in group i is equal to $(a-1)a^{i-1}$. The paths for the n groups can be modelled using n queueing networks. The first path for cold traffic represents the route to access MMs of group 1. It is the same as the hot traffic path except for the departure of a traffic rate of $a^{n-1} h \lambda_{n+1}$ instead of the additional arrival rate at the MM. The next queueing network represents the path for accessing MMs of group 2. Thus, the ith queueing network for the cold MMs represents the path for group i MMs. The additional arrival and departure rates follow a regular pattern and can be derived and formalized by observing the queueing networks shown in Figure 5.
The finite-buffered queueing networks shown in Figure 5 can be solved using the methodology described in Section 3. The delay of hot traffic can be obtained by solving the queueing network of Figure 5(a). In general, the delay calculation of hot traffic can be derived as follows. In the hot traffic path, the traffic at the input port of an SE at stage i, denoted as $\lambda_i^{h}$, is $(1-h)\lambda_i + a^{i-1} h \lambda_i$, where $\lambda_i$ denotes the request departure rate of stage $i-1$. $\lambda_i$ can be computed using (13).
The average time spent at the ith stage can be computed from (14). Thus the average hot traffic delay, $D_{hot}$, can be obtained from
Similarly, the cold traffic delay for the n groups can be computed using equations (13) and (14). The input rates at various stages for each of the groups will be different and can be determined as explained earlier.
Performance degradation due to the presence of hot spots can be reduced by combining requests [15] [16] [17] [18]. The model for non-uniform traffic can also be extended to capture the effect of combining. Depending upon the combining technique used, one can derive the probability of combining at various stages. The effective input rate at different stages depends upon the probability of combining and can be obtained using the methodologies described in [17]. These values can be used in equations (12)-(15) to compute the average delay and throughput.
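The two closed-form quantities used in this section, the input rate on the hot path and the size of each cold group, are easy to tabulate. The following Python sketch is only an illustration with hypothetical function names; it takes the per-stage rate $\lambda_i$ as an input rather than computing it from equation (13).

```python
def hot_path_input_rate(lam_i, h, a, i):
    """Traffic rate at an SE input port at stage i of the hot fan-in tree.

    Implements (1 - h) * lam_i + a**(i - 1) * h * lam_i, where lam_i is the
    departure rate of stage i - 1 and h is the hot-spot fraction.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must lie in [0, 1]")
    if i < 1:
        raise ValueError("stages are numbered from 1")
    return (1.0 - h) * lam_i + a ** (i - 1) * h * lam_i


def cold_group_sizes(a, n):
    """Number of cold MMs in groups 1..n, i.e. (a - 1) * a**(i - 1)."""
    return [(a - 1) * a ** (i - 1) for i in range(1, n + 1)]


if __name__ == "__main__":
    # (8x8) MIN of (2x2) SEs as in Figure 4: three groups covering 7 cold MMs.
    print(cold_group_sizes(2, 3))
    print([hot_path_input_rate(0.1, 0.04, 2, i) for i in (1, 2, 3)])
```

The group sizes sum to $a^n - 1 = N - 1$, which accounts for every MM except the hot one; at the first stage the hot-path rate equals $\lambda_1$ because each input port carries the traffic of a single processor.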
Performance Evaluation
In order to validate the proposed analytical model, an (NxN) delta network was simulated. The network uses (2x2) SEs. Each processor generated packets randomly, with exponentially distributed interarrival times and an average rate of $\lambda$ requests per cycle. A uniform random number generator was used to determine the destination memory. Throughput and delay were computed by counting the number of request completions and the average time taken to reach the output port, respectively. The simulation was run for a sufficiently long time to obtain results in steady state. The 95% confidence interval was observed to be within 3% of the mean. Comparisons between the analytical and simulation results for (64x64) and (1024x1024) systems using (2x2) SEs are shown in Figures 6 and 7. The difference between the analysis and the simulation results is within 7%. The curves indicate that the analytical results are fairly accurate.
The variation of delay for a (64x64) MIN with non-uniform traffic is shown in Figure 8. Results are plotted for hot-spot traffic rates of 2%, 4%, 8%, and 16%. Figure 8(a) shows the delay incurred by a request directed to the hot memory module. Average delay for cold traffic is plotted in Figure 8(b). The simulation results are also shown to validate the analysis. In the presence of a hot spot, the network saturates much earlier than in the uniform memory reference case.
It is mentioned in [6] that a small buffer length shows performance equivalent to an infinite buffer. It can be inferred from Figures 9(a) and 9(b) that this is true only when the input load is not high. Under light traffic, i.e. for $\lambda < 0.5$, finite-buffered MINs with $L \ge 4$ mimic the performance of MINs having infinite buffering capacity. For heavy traffic, delay and throughput continue to vary noticeably with buffer length until the buffer is considerably long. The model can be used to determine the minimum buffer length required to get a performance equivalent to the infinite buffer case.
For example, the minimum buffer length required to mimic the performance of an infinite buffer is approximately 15 at $\lambda = 0.9$. The effect of switch size, a, on delay and normalized throughput of a (4096x4096) MIN is shown in Figures 10(a) and 10(b), respectively. The results are plotted for five different switch sizes (a = 2, 4, 8, 16, 64) and for different buffer lengths. The delay is higher for smaller SEs, as expected. Depending upon the priority of the performance metrics, an optimal SE size and buffer length can be computed. If delay is the primary concern, large SEs with small buffers should be used to satisfy the performance requirement. Small SEs with large buffers would be more cost effective for throughput-oriented systems.
The analysis proposed here is for a generic MIN model and can be extended to incorporate several system-level constraints. In a multiprocessor environment, there is a limitation on the number of outstanding memory requests a processor can have before being blocked to wait for the completion of a request. Let m denote the maximum allowable number of outstanding requests. A processor keeps on generating requests at the rate of $\lambda$ packets/cycle until it has m outstanding requests. It then gets blocked until the completion of a request.
To model multiple outstanding memory requests, the same set of equations (12)-(14) can be used to compute the delay D for a given $\lambda$. Let z = ( This modified value of $\lambda'$ is used in the set of equations (12)-(15) to compute delay and throughput. Figure 11 depicts the comparison of analytical and simulation results for different values of m for a (64x64) system. As expected, better throughput is observed by allowing multiple outstanding requests. It is observed that the system saturates early for low values of m. This saturation is attributed to the fact that the effective input load to the system is limited by m. The throughput improves for higher values of m, but the packets incur more delay. For a given buffer size, we can determine the maximum input load at which the network saturates for different values of m. On the other hand, for given values of m and $\lambda$, the minimum buffer length can be computed for any desired performance level (delay and throughput).
Throughput and delay are not necessarily sufficient measures of system performance. It is observed that higher throughputs result in longer delays. This can also be inferred from Figures 6-11. A combined metric called system power is sometimes more meaningful than D or X alone. System power is defined as the ratio of throughput to delay [19]. A higher power means either a higher throughput or a lower delay. In either case, a higher power is better than a lower power. Denoting P as the system power, we get $P = NX/D$.
The variation of system power with respect to the input load is shown in Figure 12(a) for various buffer lengths. It is observed that the system power increases with the input load for small buffers. For large buffers, the system power increases with the input load and attains a peak value, after which it decreases with further increase in load. This can be explained as follows. For a large buffer size, the throughput first increases with the input load until it saturates. On the other hand, delay increases monotonically. This can also be observed from the graphs of Figures 6 and 7. Thus, after a certain input load, the power reduces. The model can be used to predict the optimum load that maximizes the power of a MIN. The variation of power with respect to buffer length is plotted in Figure 12(b). System power is insensitive to buffer length under light load. However, under heavy loads system power decreases as the buffer length increases.
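The system power metric and the search for the load that maximizes it can be expressed in a few lines. The Python sketch below is illustrative only: the function names are hypothetical, and the (load, throughput, delay) triples are assumed to come from the analytical model or from simulation rather than being computed here.

```python
def system_power(N, X, D):
    """System power P = N * X / D for an N-node system.

    X is the normalized throughput and D is the average delay in cycles.
    """
    if D <= 0:
        raise ValueError("delay must be positive")
    return N * X / D


def load_with_peak_power(N, samples):
    """Return the input load whose (throughput, delay) pair maximizes power.

    samples: iterable of (load, throughput, delay) tuples supplied by the caller.
    """
    best = max(samples, key=lambda s: system_power(N, s[1], s[2]))
    return best[0]


if __name__ == "__main__":
    print(system_power(64, 0.5, 8.0))
```

Evaluating the model on a grid of input loads and passing the results to `load_with_peak_power` is one simple way to locate the peak described above for large buffers.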
Concluding Remarks
A queueing model for evaluating the performance of finite-buffered, asynchronous MINs is presented in this paper. The uniqueness of this model compared to previous finite-buffered analyses is that it captures asynchronous operations, deterministic service time of switches, message blocking, and behavior in a multiprocessor environment. Both uniform and non-uniform traffic patterns are considered while analyzing MIN performance. Comparison with simulation results shows that the analytical results are within 7% of the simulated values. The MIN is then included in a multiprocessor environment to study its effect on the overall system performance. It is observed that there is a considerable gain in throughput by allowing multiple outstanding requests in a multiprocessor. Various design alternatives based on performance requirements are discussed. It is difficult to come up with an optimal set of design parameters to satisfy all performance measures. The model can be used to compute suitable values of MIN parameters based on the priorities of the performance metrics. Current investigation is focused on extending the model to analyze multi-path MINs and other routing protocols such as virtual cut-through and wormhole routing.
