Abstract | This paper provides a method for analyzing the queueing behavior of switching networks constructed from switches that employ shared buffering or parallel bypass input buffering. It extends the queueing models first introduced by Jenq and later generalized by Szymanski and Shaikh to handle these classes of networks. Our analysis explicitly models the state of an entire switch and infers information about the distribution of packets associated with particular inputs or outputs when needed. Earlier analyses of networks constructed from switches using input buffering attempt to infer the state of a switch from the states of individual buffers and cannot be directly applied to the networks of interest here.
I. Introduction
In a widely cited paper [4], Jenq describes a method for analyzing the queueing behavior of binary banyan networks with a single buffer at each switch input. The method, while not yielding closed form solutions, does permit the efficient computation of the delay and throughput characteristics of a switch. A key element of the analysis is the inference of the state of a single switch from the state of its two buffers, based on the assumption that the states of the two buffers are independent. This independence assumption is not valid but does not yield gross inaccuracies in the systems that Jenq studied.
Recently, Szymanski and Shaikh [6] have extended Jenq's method to switching systems constructed from switches with an arbitrary number of inputs and an arbitrary number of buffer slots. They have also applied it to systems with different buffering techniques. While these extensions are useful, it turns out that for many specific choices of system parameters, the independence assumption mentioned above leads to significant inaccuracies.
We extend the previous work to cover switching systems in which the buffer slots in a switch are shared among all the inputs and outputs, rather than being dedicated to either particular inputs or particular outputs. Such systems require an analysis which explicitly models the state of the entire switch rather than the states of individual input or output queues. We can also apply our method to systems using parallel bypass input buffering, a class of systems that cannot be analyzed directly using the previous methods. Our technique can also be applied to the systems studied previously and for some system configurations yields significantly more accurate results.

Figure 1: Recursive Definition of a Delta Network
In section 2, we review the previous results for switching systems with input buffering, in order to motivate the key issues involved in their analysis. In section 3, we show how to analyze a switching system with shared buffering and present a variety of performance curves characterizing such systems. In section 4, we show how our methods can be extended to switching systems with input buffering, including systems supporting bypass queueing. Finally, in section 5, we provide numerical comparisons of the different buffering techniques, describe our computational experience and suggest some possible extensions to our work.
II. Analysis of Networks with Input Buffering
Figure 1 shows the recursive construction of a delta network $D_{n,d}$ with $n$ inputs and outputs, constructed from $d$-port switches. Such networks provide a single path between any input and any output, and have $\log_d n$ stages of switching. (We use the term network here to describe the system as a whole and switch to describe the components from which the network is constructed.) The delta network is topologically equivalent to such networks as the banyan and omega networks. The results we describe here are equally applicable to any of these networks. Delta networks are often constructed from switches that contain buffering for a small number of packets, with flow control between successive switches to ensure that the buffers do not overflow. Figure 2 shows the structure of a typical switch in which each switch input has a buffer with a capacity of $\beta$ packets.
Typically these systems are operated in a time-slotted fashion, with fixed length packets progressing from stage to stage in a synchronous fashion. Consequently, we can think of the system as operating in two phases. In the first phase, flow control information passes through the network from right to left. In the second phase, packets flow from left to right, in accordance with the flow control information. A switch input will allow its predecessor to send it a packet if it has an empty buffer slot currently or if one of the packets in its buffer will leave during the second phase of the current cycle. This is called global flow control, since the flow control decision at a switch potentially depends on all of its successors in the network. Local flow control is also possible; in this form, a switch input allows its predecessor to send a packet only if its buffer has an empty slot. While local flow control doesn't make as effective use of a switch's buffers, it is more straightforward to implement, particularly in high speed systems where the propagation time required for global flow control can lead to unacceptable overheads. Note also that several packets in a switch may contend for the same output, but only one will be allowed to proceed during a given cycle. We assume (as is usual) that in such a case, one of the contending packets is selected at random.
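The difference between the two flow control rules can be made concrete with a short sketch. The function name and its parameters are illustrative only and do not come from the paper; the logic simply restates the acceptance conditions described above.

```python
def input_accepts(occupancy, capacity, head_departs, flow_control="local"):
    """Return True if a switch input lets its predecessor send a packet.

    occupancy     -- packets currently held in the input buffer
    capacity      -- number of buffer slots
    head_departs  -- True if a buffered packet will leave in the second
                     phase of the current cycle
    flow_control  -- "local" or "global"
    """
    if occupancy < capacity:
        return True          # an empty slot exists under either rule
    if flow_control == "global":
        return head_departs  # a full buffer accepts if a slot is about to free up
    return False             # local: a full buffer always refuses
```

Under global flow control, `head_departs` itself depends on the decision made by the next stage, which is why the information has to propagate through the whole network from right to left.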
One way to analyze the queueing behavior of a buffered delta network is to explicitly model the state of a single input buffer by a discrete time birth-death process and then model the state of an entire switch by assuming that the states of its various input buffers are independent. A Bernoulli arrival process is assumed and packets are independently assigned random destination addresses upon entry to the system. This analytical technique is described in [6]. We briefly review it here for completeness.
Let $\pi_i(j)$ be the steady state probability that an input buffer in stage $i$ of the network (stages are numbered from left to right starting with 1) contains exactly $j$ packets, where $0 \le j \le \beta$. Let $a_i$ be the probability that a packet is available to enter a stage $i$ buffer and let $q_i$ be the probability that the first packet (assuming there is a packet) in a stage $i$ buffer can leave during a given cycle. With these definitions, the transition probabilities for the stage $i$ buffer can be written down directly.
(Here, $\bar{a}_i = 1 - a_i$ and $\bar{q}_i = 1 - q_i$; we use the overline throughout to indicate the "complement" of the given probability.) The reasoning is straightforward. If the queue contains $j$ packets where $0 < j < \beta$, then the probability that during the next cycle the queue contains $j + 1$ packets is just the probability that a new packet is available to enter the queue and the packet at the head of the queue does not leave; this is $a_i \bar{q}_i$, assuming that arrivals and departures are independent of one another. Similarly, the probability that during the next cycle the queue contains $j - 1$ packets is just the probability that no new packet is available to enter the queue and the packet at the head of the queue does leave; that is, $\bar{a}_i q_i$.
If we knew $a_i$ and $q_i$, we could easily compute the state probabilities $\pi_i(j)$. The trouble of course is that $a_i$ and $q_i$ depend on the state probabilities of the buffers in the neighboring switches. This leads to an iterative computational method in which we assign arbitrary initial values to the state probabilities, then compute $a_i$ and $q_i$ for all $i$, use these values together with the balance equations for the Markov chain to compute new state probabilities, and so forth.
We calculate $a_i$ from the state of the stage $i-1$ switch that feeds the buffer. The reasoning is that a packet is available to enter a particular input buffer of a stage $i$ switch if at least one of the $d$ buffers in the predecessor is non-empty and has a first packet for the particular stage $i$ switch of interest. Note that the states of the predecessor's $d$ buffers are assumed to be independent.
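Under the independence assumption, the reasoning above corresponds to the following expression. It is a sketch reconstructed from the prose rather than a quotation of the original display. It takes $\bar{\pi}_{i-1}(0)$ as the probability that a stage $i-1$ buffer is non-empty and $1/d$ as the probability that its first packet is destined for the switch of interest.

```latex
a_i \;=\; 1 - \left(1 - \frac{\bar{\pi}_{i-1}(0)}{d}\right)^{d},
\qquad 2 \le i \le k, \quad k = \log_d n .
```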
Define $b_i$ to be the probability that a successor of a stage $i$ switch can accept a packet. The value of $b_i$ depends on whether local or global flow control is used. The first packet in a stage $i$ buffer can leave if the successor it is destined for can accept it and it wins any contention that may occur between it and the other input buffers in the same switch. There are $d - 1$ other input buffers that might contend with it, the probability that any one does contend is $\bar{\pi}_i(0)/d$, and the probability that the given input buffer wins, when it has to contend with $j$ others, is $1/(j + 1)$.

In realistic systems, each input to the network is supplied with a buffer that is typically much larger than those in the switches. We can model such a buffer using a Markov chain in which $\beta_0$ is the number of buffer slots, $b_0$ is the probability that a stage 1 switch can accept a packet offered to it (computed in the same way as $b_i$), and the offered load is the probability that a packet is available to enter the buffer.
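The contention argument for $q_i$ can be written out explicitly. The expressions below are a reconstruction from the prose, not the original displays. In particular, the local flow control form of $b_i$ assumes only that the successor buffer must have a free slot. The second equality for $q_i$ follows from the binomial identity $\sum_{j=0}^{d-1}\binom{d-1}{j}x^j(1-x)^{d-1-j}/(j+1) = (1-(1-x)^d)/(dx)$ with $x = \bar{\pi}_i(0)/d$.

```latex
\begin{align*}
b_i &= \bar{\pi}_{i+1}(\beta)
      && \text{(local flow control, assumed form)} \\
q_i &= b_i \sum_{j=0}^{d-1} \binom{d-1}{j}
        \left(\frac{\bar{\pi}_i(0)}{d}\right)^{j}
        \left(1 - \frac{\bar{\pi}_i(0)}{d}\right)^{d-1-j}
        \frac{1}{j+1}
     \;=\; b_i \, \frac{1 - \left(1 - \bar{\pi}_i(0)/d\right)^{d}}{\bar{\pi}_i(0)}
\end{align*}
```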
Finally, we note that $a_1$ is computed not according to the general equation given above but is equal to the probability that the input buffer is non-empty; also, we assume that the output of the network can always accept a packet, meaning that $b_k = 1$, where $k = \log_d n$.
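The iterative method can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the program used for the paper: it covers local flow control only, it feeds stage 1 directly with $a_1$ equal to the offered load instead of modelling the network input buffers, and it assumes that a full buffer accepts no arrival. The function names and the closed forms used for $a_i$, $b_i$ and $q_i$ are reconstructions from the prose.

```python
def birth_death_stationary(up, down):
    """Stationary distribution of a birth-death chain on states 0..n.

    up[j] is P(j -> j+1) and down[j] is P(j+1 -> j); detailed balance
    gives pi[j+1] = pi[j] * up[j] / down[j].
    """
    weights = [1.0]
    for u, dn in zip(up, down):
        weights.append(weights[-1] * u / dn)
    total = sum(weights)
    return [w / total for w in weights]


def carried_load(k, d, beta, load, iterations=500):
    """Fixed-point iteration for a k-stage network of d-port switches.

    Assumes local flow control and 0 < load < 1, and sets a_1 = load
    (hypothetical simplification: no separate network input buffer).
    """
    if not 0.0 < load < 1.0:
        raise ValueError("load must lie strictly between 0 and 1")
    # pi[i][j]: probability that a stage i+1 buffer holds j packets.
    pi = [[1.0 / (beta + 1)] * (beta + 1) for _ in range(k)]
    q = [1.0] * k
    for _ in range(iterations):
        busy = [1.0 - p[0] for p in pi]  # P(buffer is non-empty)
        # Probability that some predecessor buffer offers a packet.
        a = [load] + [1.0 - (1.0 - busy[i - 1] / d) ** d for i in range(1, k)]
        # Local flow control: the successor buffer needs a free slot.
        b = [1.0 - pi[i + 1][beta] for i in range(k - 1)] + [1.0]
        # Head packet leaves if accepted downstream and it wins contention.
        q = [b[i] * (1.0 - (1.0 - busy[i] / d) ** d) / busy[i]
             if busy[i] > 0.0 else b[i]
             for i in range(k)]
        new_pi = []
        for i in range(k):
            up = [a[i]] + [a[i] * (1.0 - q[i])] * (beta - 1)
            down = [(1.0 - a[i]) * q[i]] * (beta - 1) + [q[i]]
            new_pi.append(birth_death_stationary(up, down))
        pi = new_pi
    # Carried load: last-stage buffer is non-empty and its head packet leaves.
    return (1.0 - pi[-1][0]) * q[-1]
```

A call such as `carried_load(4, 2, 2, 0.1)` models a 16-input network of binary switches with two slots per input. A fixed iteration count is used for brevity; a practical version would stop when successive state probabilities agree to a chosen tolerance.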
Given the above quantities, we can easily obtain the common performance metrics of interest. The carried load, for example, is the probability that a buffer in the last stage is non-empty and, given that it is non-empty, that it is able to transmit a packet; that is, $\bar{\pi}_k(0) q_k$. The average delay through the network can be calculated by summing the average delays at each stage. The average delay at stage $i$ is a fraction whose denominator is the average arrival rate at stage $i$ and whose numerator is the average queue length, $\sum_j j\,\pi_i(j)$. For local flow control, the arrival rate in the denominator is $a_i \bar{\pi}_i(\beta)$. The delay in the input buffer can be calculated in a similar fashion.
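For local flow control, the stage delay described above can be written as follows. This is a reconstruction via Little's law from the prose description, and the symbols $D_i$ and $D$ are introduced here for illustration only.

```latex
D_i \;=\; \frac{1}{a_i \,\bar{\pi}_i(\beta)} \sum_{j=0}^{\beta} j \,\pi_i(j),
\qquad
D \;=\; \sum_{i=1}^{k} D_i .
```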
The performance curves shown in Figure 3 were computed with this method. The leftmost and center pairs of plots show the maximum obtainable throughput as a function of network size for networks comprising switches of different sizes and varying amounts of buffering. The rightmost plots show the effect of varying the amount of buffering for networks with 256 inputs. In each pair, the curves on the left show the throughput in the case of local flow control and those on the right are for global flow control. It's interesting to note that the networks constructed from larger switches have lower throughput when $n$ is large. This appears to be caused by two mechanisms. First, because the networks constructed from larger switches have fewer stages for a given value of $n$, they have less buffering overall. Secondly, the head-of-line blocking that occurs in these networks has a greater effect on the networks made up of large switches, since a blocked packet can affect packets with a wider range of destination addresses in this case.
III. Analysis of Networks with Shared Buffering
It's well known that switching networks in which buffers are shared among the inputs can yield better performance than those in which buffers are dedicated either to inputs or outputs. Figure 4 shows a switch in which packets arriving at any of $d$ inputs are placed in available buffer slots from a pool containing $B$ slots. Packets are routed from the shared buffer to the appropriate outputs. An implementation of such a switch would require a $d \times B$ crossbar to distribute arriving packets to buffers and a separate $B \times d$ crossbar to route packets from buffers to outputs.
As in the input buffered switch, one can use either local or global flow control, but we analyze only the case of local flow control. There are two additional possibilities for implementing local flow control, which we refer to as the grant and acknowledgement methods. In the grant method of flow control, a switch with $x$ empty buffer slots grants permission to send a packet to $\min\{x, d\}$ of its upstream neighbors at the start of an operation cycle of the switch. If $x < d$, we assume that $x$ predecessors are chosen at random. Notice that in this method, a switch supplies grants to upstream neighbors without knowing which of them has packets to send. This can result in sending a grant to a neighbor that doesn't have a packet, while a neighbor that does have a packet may not receive a grant. The acknowledgement method of flow control remedies this fault by allowing all predecessors with packets to send them. The receiving switch stores as many as it can in its buffer and acknowledges their receipt by means of a control signal. Unacknowledged packets are retransmitted during a subsequent cycle. The acknowledgement method requires that the predecessors hold a copy of a packet pending an acknowledgement, but allows better buffer utilization overall.
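The grant rule can be illustrated with a small sketch. The function name and the `rng` parameter are illustrative and not taken from the paper; the code only restates the selection of $\min\{x, d\}$ predecessors chosen at random.

```python
import random


def issue_grants(empty_slots, d, rng=random):
    """Return the indices of the upstream neighbors that receive a grant.

    A switch with x empty slots grants min(x, d) of its d predecessors,
    chosen uniformly at random when x < d. The choice ignores which
    predecessors actually hold a packet, as in the grant method.
    """
    count = min(empty_slots, d)
    return sorted(rng.sample(range(d), count))
```

Because the selection does not look at predecessor state, a grant may go to an idle neighbor while a busy one is left out. This is the inefficiency that the acknowledgement method removes.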
We first analyze a network using the grant method of flow control. We model each switch as a $(B + 1)$-state Markov chain. We let $\pi_i(s)$ be the steady state probability that a stage $i$ switch contains exactly $s$ packets and we let $\lambda_i(s_1, s_2)$ be the probability that a stage $i$ switch contains $s_2$ packets in the current cycle given that it contained $s_1$ packets during the previous cycle.
Let $p_i(j, s)$ be the probability that $j$ packets enter a stage $i$ switch that has $s$ packets in its buffer and let $q_i(j, s)$ be the probability that $j$ packets leave a stage $i$ switch that has $s$ packets in its buffer. The transition probabilities $\lambda_i(s_1, s_2)$ are then obtained by combining $p_i$ and $q_i$ over all arrival and departure counts that take the switch from $s_1$ to $s_2$ packets.

The departure probabilities depend on a quantity $Y_d(r, s)$ that describes how the $s$ buffered packets are spread over the $d$ outputs. $Y$ is easily calculated, assuming all distributions of $s$ packets to the $d$ outputs are equally likely. This is just a classical distribution problem, and for the purposes of calculation a simple recurrence is all we require. Note that $Y_d(r, s)$ is independent of the stage of the switch in the network. For computational purposes, it is most convenient to merely precompute a table with the values of $Y$ required; the recurrence is ideal for this purpose.

As in the earlier analysis, we compute performance parameters by assuming a set of initial values for $\pi_i(j)$, then use these and the equations given above to compute $\lambda_i(s_1, s_2)$. These, together with the balance equations for the Markov chain, are used to obtain new values of $\pi_i(j)$, and then we iterate until we obtain convergence. While convergence is not guaranteed, our experience has shown convergence to be fairly rapid except when the offered load is approximately equal to the network's maximum throughput. When the offered load is below this critical point, convergence is obtained in fewer than 100 iterations; above the critical point, convergence typically requires several hundred iterations; and in the vicinity of the critical point, it may require several thousand iterations.
Notice that the calculation of $Y_d(r, s)$ given above relies on the assumption that the addresses of the packets stored within a switch's buffer are independent. This is not in fact the case. While it is true that the addresses of packets arriving at a switch are independent (given the input traffic assumptions), buffered packets are correlated as a result of having contended for outputs. The correlations are strongest when $d$ is small and $B$ large.
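A table of this kind can be precomputed with the classical occupancy recurrence. The sketch below rests on an assumption about the definition: it takes $Y_d(r, s)$ to be the probability that $s$ packets with independent, uniformly random destinations are addressed to exactly $r$ distinct outputs. The function name is illustrative, and the recurrence is the textbook one, which may differ in form from the one used in the original analysis.

```python
def distinct_output_table(d, max_packets):
    """Y[s][r]: probability that s packets with independent, uniformly
    random destinations among d outputs address exactly r distinct outputs
    (assumed interpretation of Y_d(r, s))."""
    Y = [[0.0] * (d + 1) for _ in range(max_packets + 1)]
    Y[0][0] = 1.0
    for s in range(1, max_packets + 1):
        for r in range(1, min(s, d) + 1):
            # The s-th packet either repeats one of the r outputs already
            # in use, or opens a new one among the d - (r - 1) unused ones.
            Y[s][r] = (r / d) * Y[s - 1][r] + ((d - r + 1) / d) * Y[s - 1][r - 1]
    return Y
```

Since the table depends only on $d$ and the buffer size, it can be built once and reused for every stage and every iteration.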
We can now easily obtain the performance metrics of interest, starting with the carried load. Figure 5 shows curves of offered load vs. carried load for networks with 256 inputs and outputs and varying switch and buffer dimensions. In the plots, $\beta = B/d$ is the number of buffer slots per switch input; the solid lines are the analytical results, while the dashed lines are simulation results. We note that for shared buffer networks, large switches usually perform just slightly better than small ones with the same values of $\beta$. The advantage of the acknowledgement method of flow control is most pronounced when the number of buffer slots is limited, although one would expect a greater benefit in the presence of unbalanced traffic patterns.
The analysis is optimistic in the sense that it predicts higher carried loads than the simulation. This is typical of such analytical techniques. Notice that the analytical results are most accurate when the switch size is largest and the buffering is smallest. Haifeng Bi [1] has traced the source of the discrepancy to the independence assumption mentioned above. Because the analysis neglects the correlations among the destination addresses for packets buffered in a given switch, it overestimates the number of distinct outputs for which packets are present in a given state. This in turn leads to an overestimate of the number of packets leaving a switch in a given state. As an experiment, Bi ran modified simulations in which correlations among packets in a switch were systematically eliminated by randomly reassigning their addresses at the start of each simulation cycle. The simulation results obtained in this way were virtually identical with the analytical results, meaning that the crucial direction for further refinement of the analytical models lies in capturing the effects of correlations among packets.
The simulation results given above are taken from an extensive simulation study described in reference [1]. The simulation results consistently reveal the same characteristics mentioned above for all the queueing models we study; that is, the analysis overestimates the maximum carried load and its accuracy is best for networks comprising large switches with limited buffering. The simulation and analysis do rank the different buffering techniques consistently, making it possible to compare different buffering techniques qualitatively using the analytical methods. We include no further simulation results here. Interested readers can find further details in [1].

Figure 6 shows curves of average delay. The curves that become constant for large load give the delay through the network itself. The curves that rise steeply for large loads include the delay through the input buffer in addition to the network delay. We note that for offered loads below the maximum capacity of a given network, the total delay is generally between 1 and 2 times the number of stages in the network, yielding an advantage for networks with large switches. We also note that the maximum network delay for a given configuration is generally smallest for $\beta = 1.5$ or $2$. Figure 7 gives maximum throughput curves for networks with shared buffering of varying size and buffer capacities.
IV. Improved Analysis of Networks with Input Buffering
We now return to the study of networks comprising switches using input buffering. In addition to switches that use FIFO buffers, we are interested in switches that use bypass buffering to avoid the head-of-line blocking effects that limit the performance of systems with FIFO buffering. Two types of bypass buffering are possible. In serial bypass, the first packets in a switch's input buffers first contend for outputs, then the losing input buffers that contain a second packet are allowed to contend a second time, those that lose in the second round and have a third packet are allowed to contend a third time, and so forth. In parallel bypass, all packets in a switch contend in a single round with the winners proceeding to the outputs. This allows more than one packet from a given input to proceed during a single cycle, allowing potentially higher performance, although of course each output can carry at most one packet per cycle. In high speed systems, parallel bypass is actually somewhat easier to implement, as one does not have the overhead of multiple contention rounds. For this reason and because it is more straightforward to analyze, we concentrate here on parallel bypass.
The analysis of a network with parallel bypass input buffering is similar to that for a network with shared buffers using the grant method of flow control. In particular, we need only alter the equations for $b_i$ and $p_i(j, s)$. As previously, $B$ is the total number of buffer slots in a switch and $\beta = B/d$. Let $X_d(j, s)$ be the probability that a given input buffer has $j$ packets given that the switch as a whole contains $s$. $X$ and $W$ are easily computed, assuming that when the switch contains $s$ packets, all distributions of those packets among the input buffers are equally likely. This assumption is not really correct, as it neglects correlations among packets in a given switch, resulting from prior contention for outputs. Bi [1] presents simulation results quantifying the discrepancy caused by this assumption; the results are similar to those cited above.

Let $z_d(s)$ be the number of ways to distribute $s$ distinct objects (packets) among $d$ distinct containers (input buffers), under the restriction that each container may contain at most $\beta$ objects. Also, let $x_d(r, s)$ be the number of ways to distribute $s$ objects among $d$ containers of capacity $\beta$, so that a particular container receives exactly $r$ objects. Using these counts, it is straightforward to compute tables containing the requisite values of $X$ and $W$.

We now return to the case of an input buffered network with FIFO buffers. Most of the analysis for bypass input buffering carries over to this case. The two equations requiring modification are those for $a_i$ and $q_i(j, s)$. Let $Y_d(r, s)$ be the probability that exactly $r$ input buffers contain at least one packet, given that the switch contains $s$ packets.
The modified equations follow by assuming that, when a switch contains $s$ packets, all distributions of the packets among the outputs are equally likely, and that the destination addresses of all packets are independent.

Figure 8 gives curves of maximum throughput for networks comprising switches with both FIFO and parallel bypass input buffering, of varying size and buffer capacity. We note that bypass buffering gives a very substantial improvement over FIFO buffering and that larger buffers yield a greater improvement in the case of bypass buffering. It's also worthwhile to note the differences between the curves in the top row of Figure 8 and the corresponding curves in the top row of Figure 3 that were obtained using Szymanski and Shaikh's analysis. The prior analysis is consistently more optimistic than the method described here. However, in many cases the improvement obtained with the new method is slight.
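The counting quantities $z_d(s)$ and $x_d(r, s)$ defined earlier in this section lend themselves to a small table-building routine. The sketch below is one way to compute them under two assumptions that are ours: packets are treated as distinct and the order of packets within a buffer is ignored. The recurrences are derived from the definitions rather than copied from the original displays, and the function names are illustrative.

```python
from math import comb


def count_tables(d, beta, max_packets):
    """z[c][s]: ways to place s distinct packets into c distinct buffers,
    each holding at most beta packets (order inside a buffer is ignored)."""
    z = [[0] * (max_packets + 1) for _ in range(d + 1)]
    z[0][0] = 1
    for c in range(1, d + 1):
        for s in range(max_packets + 1):
            # Choose which r packets go into buffer c, then place the rest.
            z[c][s] = sum(comb(s, r) * z[c - 1][s - r]
                          for r in range(min(beta, s) + 1))
    return z


def buffer_occupancy(d, beta, j, s):
    """X_d(j, s): probability that a given buffer holds j of the s packets,
    taken as x_d(j, s) / z_d(s) with x_d(j, s) = C(s, j) * z_{d-1}(s - j)."""
    z = count_tables(d, beta, s)
    if j > min(beta, s) or z[d][s] == 0:
        return 0.0
    return comb(s, j) * z[d - 1][s - j] / z[d][s]
```

For example, with $d = 2$ and $\beta = 1$, two packets can only occupy one buffer each, so `buffer_occupancy(2, 1, 1, 2)` evaluates to 1.0.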
V. Conclusions
Figure 9 lists the numbers of the equations used to compute the key quantities for each of the four buffering methods. Figure 10 compares the maximum throughput obtained with the various buffering methods and networks of varying size, switch dimension and buffer capacity. We show curves for shared buffering using both the grant and acknowledgement methods of flow control. We show curves for input buffering using local flow control, with bypass queueing and FIFO queueing. We note that shared buffer switches offer clearly superior performance for a given amount of buffering, but bypass input buffering performs impressively as well. FIFO input buffering performs rather poorly in comparison to the other methods, but may be acceptable in certain applications. Interestingly, variation in switch size yields only small changes in maximum throughput for networks with the same values of $\beta$, but the reduction in the number of stages obtained with larger switches yields a significant economy in implementation, as well as lower delays. We note that the acknowledgement method of flow control yields only modest improvements over the grant method when we have uniform random traffic with Bernoulli arrivals. This appears surprising until one realizes that in the presence of heavy uniform random traffic, all the predecessors of a given switch are likely to have one or more packets to send it at any one time. We would expect a greater difference in the face of non-uniform bursty traffic, since in this case, there can be substantial differences in the instantaneous traffic from the upstream neighbors.
Our computational experience with the method described here is quite favorable. In collecting the data for all the curves shown in this paper, we computed approximately 900 data points and used a total of 40 hours of computation. The memory requirements for the largest of the programs are under 3 Mbytes when dimensioned for networks with up to 12 stages, switches with up to 32 inputs and a total of up to 100 buffer slots. The other programs are a little smaller in both code and memory usage. The switch dimension and buffer capacity both have a strong influence on the running time and memory requirements. We have computed results for switches with $d = 32$ and $\beta = 3$, but this is about as far as one can reasonably push the method with typical workstations. Fortunately, this covers the cases of greatest interest, as larger switches are difficult to implement. Also, the relative insensitivity of the results to switch dimension allows one to extrapolate to networks of larger switches with a good deal of confidence. Since the running time is relatively insensitive to the total size of the network, the method is far superior to simulation when modeling networks with hundreds or thousands of inputs. We note that the analysis also supports computation of delay distributions and packet loss rates, in addition to average delay and throughput.

As mentioned above, Haifeng Bi [1] has made a careful evaluation of the analytical techniques described here by comparing them with simulation and has both quantified the discrepancies and identified the crucial contributing factors. Bi also studied networks comprising switches with output buffering and made a systematic comparison of output buffering with input buffering. His work demonstrates that the difference commonly noted between input buffering and output buffering is less significant than commonly assumed.
What most authors overlook is that in a switch with output buffering, the internal crossbar or bus required to provide access to the outputs requires greater capacity than the crossbar required in a switch using FIFO input buffering. In conventional output buffering, each buffer can receive up to $d$ packets per cycle, whereas in FIFO input buffering, each buffer can transmit only one packet per cycle. If one compares conventional output buffering to bypass input buffering, where each input buffer can transmit multiple packets in a given cycle, the difference between the two disappears. Bi has compared generalized forms of input and output buffering. In his study, there is a parameter $r$ that limits the number of packets that can be transmitted from an input buffer in a given cycle or received at an output buffer. His results show that there is very little difference between input and output buffering when they operate under the same restrictions with respect to crossbar access; interestingly, in fact, input buffering enjoys a slight advantage due to "boundary effects" at the first and last stages of the network.

An interesting extension of this work would be to modify our methods to allow modeling of systems with global flow control. This appears difficult and may be of only academic interest, but nonetheless it would be interesting to compare with the earlier analyses. Extension of our models to networks with uneven traffic distribution would also be worthwhile. We expect this could be done following the pattern established in [5]. Networks that perform distribution and/or packet replication are of substantial practical interest currently [7]. The extension of our models to cover distribution networks appears straightforward; the case of replication is more challenging but may prove tractable.
Finally, these techniques are directly applicable to the study of several switching system architectures that are under development for ATM networks [2, 3, 7]. A detailed comparison of these systems using the tools we have developed could have an important practical impact on the development of emerging networks.
