We propose SXmin: a self-routing, group Knockout Principle [15] based ATM packet switch which provides comparable delay-throughput performance and packet loss probabilities at significantly reduced hardware requirements compared to earlier switches [6, 12, 13, 15, 17]. The $N \times N$ SXmin consists of an $N \times N$ Batcher sorter followed by $\log_2 N - 1$ stages of sort-expander (SX) modules arranged in the form of a complete binary tree. Each SX module consists of a column of $2 \times 2$ switches with a wraparound-unshuffle input-output interconnection. This enables the hierarchical utilization of the group Knockout Principle to expand the number of inputs by a small factor at each stage, resulting in a significant reduction in overall hardware complexity. Routing at each switch is controlled by a single bit. However, in case of contention, a dual-bit resolution algorithm is used locally which drops excess packets in a predetermined manner while ensuring global randomness of packet loss over the entire switching network. There are no internal buffers at the individual stages and therefore the internal delay is constant and proportional to the number of stages. The use of simple hardware components and regular interconnections in the SX modules makes the network suitable for optical implementation.
which implies that the switch is not self-routing. The number of crosspoints used in the Clos network is still $O(N^2)$. The SCOQ switch [17] combines the sort-banyan structure with internal buses to achieve better hardware performance than [6, 12, 13]. It consists of a sorting network followed by $N$ buses connected to $L$ routing modules. The buses are equipped with packet filters to concentrate the input packets into $L$ groups of up to $N/L$ inputs each based on destination addresses. Each routing module consists of $L$ banyan networks of size $\frac{N}{L} \times \frac{N}{L}$ for routing up to $L$ packets simultaneously to a given destination. The number of packet filters (or crosspoints) is $O(NL)$.
The paper is organized as follows. Section 2 gives a brief background on the Group Knockout Principle and defines an expansion function used in the design of the proposed switching network. Section 3 describes the overall structure of the switching network as well as the structure and operation of the SX modules and the routing algorithm. The performance of the switching network is analyzed in Section 4. Section 5 describes the hardware requirements of the switching network followed by a hardware comparison with other networks. Conclusions are given in Section 6.
Motivation and Background
In this paper we propose SXmin: a self-routing, space division, modular packet switch requiring no internal speedup and with packet loss probabilities below any predetermined design value for loading up to 100%. The proposed switch is not a multichannel switch like the one in [20]; rather, the switch architecture is based on output buffering by allowing simultaneous arrival of multiple packets to the same output. To avoid internal speedup, spatial expansion of the internal switching stages is used as in [12, 15, 17]. For example, the Sunshine switch consists of a sorting network followed by 8 parallel planes of $N \times N$ banyan routers, an expansion factor of 8.
The proposed network architecture falls under the category of non-blocking sort-banyan architectures. However, the sorting network [18] is not followed by a single banyan or by multiple banyans; rather, it is followed by stages of sort-expander (SX) modules. Each SX module at every stage consists of a single column of $2 \times 2$ switches and keeps its inputs cyclically sorted while expanding them by a small factor determined by the Group Knockout Principle [15]. The goal of the proposed network architecture is to minimize the overall expansion, thus reducing the cost of the network as compared to previously proposed architectures. It will be shown that the asymptotic cost of all the SX modules together is that of a single $N \times N$ banyan.
SX modules do not act as knockout concentrators; rather, a novel dual-bit controlled routing/contention resolution algorithm with a predetermined scheme of dropping excess packets is used at each switch. The algorithm allows the hierarchical exploitation of the Group Knockout Principle at each stage with minimum utilization of hardware. This paper discusses the switch design for independent Bernoulli traffic with uniform distribution of destinations. The design can be easily extended to handle non-uniform traffic patterns (such as Hot-spot, Single-Source Single-Destination etc.) by changing the expansion factor at each stage, the details of which are not part of this paper.
The proposed switching network requires fewer hardware elements than the switches in [6, 12, 13, 17] for comparable packet loss probabilities. We discuss the hardware requirements of a $1024 \times 1024$ switch and show that it requires less than half the number of $2 \times 2$ switches as compared to an SCOQ switch of the same size. Since the proposed $N \times N$ switch has no internal buses and uses only simple components such as $2 \times 2$ comparators and $2 \times 2$ switches with regular interconnections, it is suitable for optical implementation using $2 \times 2$ electro-optic couplers. We also analyze the switching network with respect to performance parameters such as packet loss rate, throughput, mean buffer lengths etc.
Definitions
We briefly describe the Group Knockout Principle in terms of an arbitrary set of outputs and define an expansion function to be used in the design of the proposed switching network.
Consider any $N \times N$ switch under a uniform traffic model where the packets have independent, uniform destination address distributions. Packet arrival processes to an input port are independent Bernoulli processes with $\rho_0 = \Pr[\text{a packet arrives at an input port}]$. Consider a group of outputs $G$, where $|G| = g$, $1 \le g \le N$. Then the probability that there are exactly $k$ packets destined to $G$ in one time slot is the binomial distribution

$$P_{G=k} = \binom{N}{k} \left(\frac{\rho_0 g}{N}\right)^{k} \left(1 - \frac{\rho_0 g}{N}\right)^{N-k}, \qquad (1)$$
where $N$ is the size of the network and $0 \le k \le N$. Furthermore assume that each group of outputs $G$ can accept up to a maximum of $m$ packets in a time slot, where $1 \le m \le N$. Then $E_{N,\rho_0}(m, g)$, the expected number of packets lost by $G$ due to the arrival of more than $m$ packets in a network of fixed size $N$ and fixed uniform traffic load $\rho_0$, can be written as:

$$E_{N,\rho_0}(m, g) = \sum_{k=m+1}^{N} (k - m)\, P_{G=k}. \qquad (2)$$
The proposed switching network can be designed for any predetermined value of packet loss. A packet loss probability of $10^{-6}$, computed using typical error rates in system transmissions, is widely accepted and shall be chosen to illustrate our network design. The network can be easily modified to handle much lower packet loss rates if required, by choosing a different threshold value for the expansion function defined below.
Definition 1 Let $\kappa_{N,\rho_0}(g)$ be the expansion function defined as follows:

$$\kappa_{N,\rho_0}(g) = \min\{m : E_{N,\rho_0}(m, g) < 10^{-6}\}, \quad g < N; \qquad \kappa_{N,\rho_0}(N) = N. \qquad (3)$$

Qualitatively, $\kappa_{N,\rho_0}(g)$ represents the minimum number of packets destined to each group of outputs $G$ that must be transmitted without any loss in order to keep the packet loss probability less than $10^{-6}$. In other words, packets destined to a group $G$ of outputs can be dropped by the network only if the total number of packets destined to $G$ exceeds $\kappa_{N,\rho_0}(g)$. Thus $\kappa_{N,\rho_0}(g)/g$ represents the expansion factor for any group of outputs with arbitrary cardinality $g$. is less than 2 and approaches 1 as $g$ increases. In particular note that $\kappa_{N,\rho_0}(g) \le 2\kappa_{N,\rho_0}(g/2)$. This fact forms the basis for the design of our switching network. Table 1 shows $\kappa_{N,\rho_0}(g)$ for $\rho_0 = 1$ and varying $N$ when $g$ is a power of 2.
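The expansion function of Definition 1 can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes the binomial arrival model of Equation 1 with per-input success probability $\rho_0 g / N$ and the expected-loss sum of Equation 2. Plain floating-point arithmetic is used, which is adequate for moderate $N$ (up to about 1024) but would overflow for much larger networks.

```python
from math import comb


def group_arrival_pmf(N, rho0, g):
    """Return [P(G = k) for k = 0..N] under the assumed binomial model.

    Each of the N inputs independently carries a packet for the group G
    with probability rho0 * g / N (uniform traffic assumption).
    """
    p = rho0 * g / N
    return [comb(N, k) * p**k * (1.0 - p) ** (N - k) for k in range(N + 1)]


def expected_loss(N, rho0, m, g):
    """Expected number of packets lost per slot if G accepts at most m."""
    pmf = group_arrival_pmf(N, rho0, g)
    return sum((k - m) * pmf[k] for k in range(m + 1, N + 1))


def expansion(N, rho0, g, threshold=1e-6):
    """Smallest m whose expected loss is below threshold; N when g == N."""
    if g == N:
        return N
    for m in range(1, N + 1):
        if expected_loss(N, rho0, m, g) < threshold:
            return m
    return N  # not reached: the loss is exactly 0 when m == N


if __name__ == "__main__":
    N = 32
    for g in (1, 2, 4, 8, 16, 32):
        print(g, expansion(N, 1.0, g))
```

Running the script prints the expansion value for each power-of-two group size in a 32-input network; lowering `threshold` models a stricter loss target.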
We note that the load factor $\rho_0$ in $\kappa_{N,\rho_0}$ is also a design parameter. When designed for a particular value of $\rho_0$, $0 < \rho_0 \le 1$, the performance parameters of the switching network such as packet loss rate, mean buffer lengths etc. will form an upper bound for actual values of loading $\rho \le \rho_0$. To maintain generality, we shall discuss and analyze the network as designed for $\rho_0 = 1$. For the rest of the paper we abbreviate $\kappa_{N,1}$ as $\kappa_N$.
Proposed Switching Network
The $N \times N$ SXmin starts with an $N \times N$ Batcher sorting network [18] which sorts destination addresses of input packets in increasing order. This is followed by $\log_2 N - 1$ stages of sort-expander (SX) modules, labeled from 0 to $n - 2$, where $n = \log_2 N$. Stage Table 1 and hence $2 \times 4$ switches are used at this stage (see footnote 4). The two least significant destination address bits are used for routing at each switch. There are $\kappa_N(4)$ input lines coming in to the module which are connected in a wraparound pattern to the switches. Outputs from each $2 \times 4$ switch are then linked to corresponding output buffers in each output port as shown in the figure. Each output buffer should be able to receive up to $\kappa_N(1)$ simultaneous packet arrivals. This can be done either by speeding up the operation of each individual output buffer by a factor of $\kappa_N(1)$ or by the use of $\kappa_N(1)$ parallel buffers for each SXmin output similar to [6]. In the latter case no speedup of the buffers is necessary.
Note that if the original pattern of interconnection through the rest of SXmin had been followed, the last two stages would have consisted of $N/4$ SX modules each labeled $\kappa_N(4) : 2\kappa_N(2)$ and $N/2$ SX modules each labeled $\kappa_N(2) : 2\kappa_N(1)$. Each SX module in the second-to-last stage would have $\kappa_N(2)$ $2 \times 2$ switches while every SX module in the last stage would contain $\kappa_N(1)$ $2 \times 2$ switches. The hardware cost of these two stages is reduced by collapsing them into one stage of $N/4$ SX modules each containing $\kappa_N(1)$ $2 \times 4$ switches.
Sort-Expander (SX) Modules
Consider a single column of switches with $K$ input lines, labeled 0 to $K - 1$ from top to bottom. The architecture of the SXmin switching network is based on a special relation between destination addresses of packets arriving at these $K$ adjoint inputs of each SX

Footnote 4: Note that the property $\kappa_N(4r) \le 2\kappa_N(r)$ is not true in general for $r = 2^k$, $k \ge 1$.

Cyclically sorted input sequences at SX modules in each stage of SXmin are important for simple routing with low packet loss probabilities in case of conflicts. The cyclic sorting property also helps maintain fairness in packet contention resolution. Thus in addition to routing packets to their destinations, the control algorithm for SXmin must also maintain the cyclic sorting property through successive stages. We will describe a dual-bit control algorithm at each $2 \times 2$ switch that routes packets to their proper destinations while maintaining the inputs to each SX module in cyclically sorted order with a very high probability. The control algorithm also serves as a contention resolution scheme when there is packet overflow at any SX module. It is different from the contention resolution algorithms in [6, 15, 17] in that excess packets in each stage are not dropped at random; rather, packets belonging to specific subsets of destination addresses are dropped with greater probability. Fairness is maintained by being biased against different subsets of destination addresses at different stages, such that the expected probability of packet loss over the entire network is the same for each subset of destination addresses.

We first note that because of the cyclic sorting constraint, the occurrence of packet collision (i.e., two input packets to a switch having the same $(n-i-1)$th bit; $b_{n-i-1} = b'_{n-i-1}$) implies the existence of at least $\kappa_N(N/2^{i+1}) + 1$ active packets with the same $(n-i-1)$th bit. If the $(n-i-1)$th bit is 0 (1), this implies that these packets have destination addresses in the range $G^{i+1}_{2m}$ ($G^{i+1}_{2m+1}$).
Since the total number of input lines $\kappa_N(N/2^i) \le 2\kappa_N(N/2^{i+1})$, there cannot be more than $\kappa_N(N/2^{i+1})$ packets to both $G^{i+1}_{2m}$ and $G^{i+1}_{2m+1}$. Next we note from lines 6–8 in the algorithm that if the excess packets belong to $G^{i+1}_{2m}$ (i.e., $b_{n-i-1} = b'_{n-i-1} = 0$), packets with the highest destination address in $G^{i+1}_{2m}$ (i.e., $b_{n-i-2} = 1$) are dropped in preference to packets with lower destination address (i.e., $b_{n-i-2} = 0$). Correspondingly, we note from lines 9–11 in the algorithm that the reverse is true if the excess packets belong to $G^{i+1}_{2m+1}$. Thus there is a bias against destination addresses at the boundary of $G^{i+1}_{2m}$ and $G^{i+1}_{2m+1}$. However, fairness is maintained overall since there is a bias against a different set of destination addresses at different stages. For example, in stage 0 of a $1024 \times 1024$ switch, there is a bias against packets with destination addresses around 511 and 512. In stage 1 there is a bias against packets with destination addresses around 255–256 and 767–768, and so on for the remaining stages (see footnote 5). We shall prove more formally in section 3 that the expected packet loss probability over the entire network is the same for all groups of destination addresses. Proof: See Appendix.

Footnote 5: Intuitively, this explains why we use $E(m, g)$ as opposed to the normalized value $E(m, g)/g$ in Equation 3. Using $E(m, g)/g$ to calculate the number of switches in an SX module would only bound the packet loss rate for the entire group, whereas we need to bound the packet loss rate for an individual destination address.
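The cyclic sorting property that the control algorithm must preserve can be checked with a few lines of code. The sketch below is illustrative only: it assumes the usual meaning of the term, namely that some rotation of the sequence of active destination addresses is in non-decreasing order, and it ignores idle input lines. The function name is hypothetical.

```python
def is_cyclically_sorted(addresses):
    """Return True if some rotation of `addresses` is non-decreasing.

    Assumed definition: walking once around the sequence as a ring,
    the value may drop at most one time (the wraparound point).
    """
    n = len(addresses)
    if n < 2:
        return True
    # Count positions where the next element (cyclically) is smaller.
    descents = sum(
        1 for i in range(n) if addresses[i] > addresses[(i + 1) % n]
    )
    return descents <= 1


if __name__ == "__main__":
    print(is_cyclically_sorted([5, 6, 7, 0, 1, 3]))  # one wraparound point
    print(is_cyclically_sorted([2, 1, 3]))           # two descents on the ring
```

A sequence with two or more descents on the ring cannot be turned into a sorted sequence by rotation, which is why a single pass suffices.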
Cyclic Sorting: Probabilistic Analysis
In the previous subsection we assumed that input packets to SX modules in different stages are in cyclically sorted order. The assumption was important in the definition of the routing/contention resolution algorithm for dropping excess packets in a predetermined nonrandom fashion, using just a single column of switches. Before proving that dual-bit controlled $2 \times 2$ switches in the SX modules are sufficient to maintain cyclic sorting, we prove the following weaker result, which offers a more intuitive understanding of the basic concept. Consider a modified SXmin architecture obtained by replacing the dual-bit controlled switches in each SX module by $n$-bit controlled switching elements with the following control algorithm at each switch.

Lemma 2 Let $G_t$ represent a group of $t$ contiguous output destinations where $1 \le t \le N$. Let $P^i_{G_t=k}$ represent the probability that there are exactly $k$ packets with destination addresses belonging to $G_t$, at the input to an SX module in stage $i$ of the switching network, $0 \le i \le n - 2$. ($P^0_{G_t=k} = P_{G_t=k}$ is defined in Equation 1.) Then,

$$P^i_{G_t=k} \le P_{G_t=t}, \quad 0 \le k \le t; \qquad P^i_{G_t=k} \le P_{G_t=k}, \quad t < k \le N. \qquad (4)$$

Proof: See Appendix.

Let $PCS_i$ represent the probability that input packets to SX modules in stage $i$ are in cyclically sorted order by destination address. $(N/2^{i+1}))$. From Theorem 2, packets in $G^i_u$ and $G^i_l$ will be in cyclically sorted order if the packets randomly selected by the dual-bit control algorithm at each of these $l$ switches correspond precisely to those packets selected in lines 5–12 of the $n$-bit control algorithm. The probability that at least one wrong packet is selected (thus destroying the cyclically sorted ordering) is $1 - (1/2)^l$. Let $\overline{PCS}_{i+1}$ represent the probability that the packets in $G^i_u$ ($G^i_l$) are not cyclically sorted. Depending on the values of $b_{n-i-1} b_{n-i-2}$, the conflicting packets must be destined to one of the following groups of outputs: $G^{i+2}_{4m}$, $G^{i+2}_{4m+1}$, $G^{i+2}$ Figure 4.

Let $L^i_{G_j}$ be a random variable representing the number of packets destined to $G_j$ which are lost in an SX module $m$, in stage $i$ of SXmin ($0 \le i < n - 2$). Let $PL^{i,q}_{G_j} = \Pr[L^i_{G_j} = q]$ and let $P^i_{G_j=k}$ represent the probability that there are exactly $k$ packets with destination addresses belonging to $G_j$ in stage $i$. $P^0_{G_j=k}$ can be obtained from Equation 1 by substituting $g = 4$.
For i > 0, there are two possible cases to consider:
Case 1: The input packets to module m are not cyclically sorted.
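In this case the packet loss is overestimated by assuming that all packets destined to $G_j$ are lost:

$$PL^{i,q}_{G_j} \le \overline{PCS}_i \, P^i_{G_j=q}, \qquad 1 \le q \le \kappa_N(N/2^i),$$

where $\overline{PCS}_i$ denotes the probability that the inputs to stage $i$ are not cyclically sorted.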
Case 2: The inputs are cyclically sorted with probability $PCS_i$.
The destination addresses of input packets to SX module $m$ belong to the group $G^i_m$. Packets on the upper (lower) output ports of the switches in module $m$ have $b_{n-i-1} = 0$ ($b_{n-i-1} = 1$) and are routed to SX modules $2m$ ($2m + 1$) in stage $i + 1$. These packets have destination addresses belonging to groups $G^{i+1}_{2m}$ ($G^{i+1}_{2m+1}$). Without loss of generality, assume $G_j \subset G^{i+1}_{2m}$ (the case when $G_j \subset G^{i+1}_{2m+1}$ is similar). Let $G_k, G_{k+1}, \ldots, G_{j-1}, G_{j+1}, \ldots, G_r$ be the other groups of four contiguous outputs belonging to $G^{i+1}_{2m}$, $k \le j \le r$, $G^{i+1}_{2m} = \bigcup_{t=k}^{r} G_t$. By the dual-bit control algorithm, packets destined to $G^{i+1}_{2m}$ are lost in module $m$ if their total number exceeds the number of switches $\kappa_N(N/2^{i+1})$. Consider the following two subcases which could lead to the loss of packets destined to $G_j$:

A. There are at least $q$ packets destined to $G_j$ such that the total number of packets destined to $\bigcup_{t=k}^{j} G_t$ is at least $\kappa_N(N/2^{i+1}) + q$. Since the input is cyclically sorted, the largest $q$ packets destined to $G_j$ are in contention with smaller packets destined to $\bigcup_{t=k}^{j} G_t$ and are therefore dropped as per the control algorithm. It is possible that some (or all) of these packets have the same $b_{n-i-1}$ and $b_{n-i-2}$ bits as their contending packets destined to $\bigcup_{t=k}^{j} G_t$, in which case one of the packets is dropped with probability $1/2$ at each switch. Recall that in the preceding analysis of the cyclic sorting probability, the possibility of such a conflict is subsumed in the term for $\overline{PCS}_{i+1}$. We will overestimate the packet loss by assuming that in such a case all packets destined to $G_j$ are lost.
Since $G_j \subset G^{i+1}_{2m}$, we have $|\bigcup_{t=k}^{j} G_t| \le N/2^{i+1} < \kappa_N(N/2^{i+1})$ and thus Lemma 2 can be used to bound the first two terms in Equation 11. Also note that $P^i_{G_j=q} \le P^0_{G_j=4}$ when $1 \le q \le 4$, and $P^i_{G_j=q} \le P^0_{G_j=q}$ for $q > 4$, to bound the last term in Equation 11. The probability $PL^{i,0}_{G_j}$ will be used to compute the delay-throughput characteristics of SXmin. A lower bound on $PL^{i,0}_{G_j}$ can be obtained as
when the probabilities are computed using Equation 11. The expected number of packets destined to $G_j$ and lost in stage $i$ can also be computed as

$$E(L^i_{G_j}) = \sum_{q=1}^{\kappa_N(N/2^i)} q \, PL^{i,q}_{G_j}. \qquad (13)$$

Using Equation 13, the expected number of packets destined to $G_j$ and lost over all SX modules in stages $i$, $0 \le i < n - 2$, in the SXmin, $E(L^{SX}_{G_j})$, can be calculated as

$$E(L^{SX}_{G_j}) = \sum_{i=0}^{n-3} E(L^i_{G_j}). \qquad (14)$$

Table 3 shows $E(L^i_{G_j})$ and $E(L^{SX}_{G_j})$ for each group $G_j$ at each stage $i$ in a $32 \times 32$ switch when the offered traffic load $\rho = 1$, $0 \le j \le 7$, $0 \le i \le 2$. The tables also serve to illustrate the fairness of the contention resolution algorithm. It can be seen that the expected packet loss for each group $G_j$ reaches a maximum in the stage where there is a bias against that group and falls off rapidly in other stages. Note that the maximum value of $E(L^{SX}_{G_j})$ at a traffic load of $\rho = 1$ is approximately $10^{-6}$. This is to be expected since the expansion function $\kappa_N$ was calculated using the design parameter $10^{-6}$. Figure 7 shows the variation of maximum $E(L^{SX}_{G_j})$ values for different network sizes $N$ with $\rho = 1$.
Packet Loss Probability in the Last Stage
The last stage of SXmin (stage $n - 2$) consists of $N/4$ SX modules each serving a group of four consecutive output destinations, as shown in Figure 4. There can be up to $\kappa_N(4)$ simultaneous packet arrivals to the SX module in a time slot, which are distributed over $\kappa_N(1)$ $2 \times 4$ switches such that consecutive inputs are connected to consecutive switches modulo $\kappa_N(1)$. A packet destined to a particular output can be lost due to blocking in the $2 \times 4$ switches only if there are more than $\kappa_N(1)$ simultaneous packet arrivals destined to that output. (The SX module remains non-blocking if there are no more than $\kappa_N(1)$ simultaneous arrivals to any output destination.)
Let $P^{n-2}_k$ represent the probability that there are exactly $k$ packet arrivals from the SX module in the previous stage (stage $n - 3$), which are destined to the same output port. The expected packet loss at each output port can be bounded by $PCS_{n-2}$
The packet loss probability (PLP) of a network is defined as [6, 15]

$$PLP = \frac{\text{expected packet loss}}{\text{offered load}}. \qquad (16)$$

Let $\rho_{n-2}$ be the offered traffic load at an input to an SX module in stage $n - 2$. $\rho_{n-2}$ is defined as

$$\rho_{n-2} = \frac{\text{Expected number of active input packets at the SX module}}{\text{Number of outputs served by the SX module}}.$$

Since each SX module in stage $n - 2$ serves a group of four outputs and the expected number of packets lost by that output group in stages 0 to $n - 3$ is given by $E(L^{SX}_{G_j})$, this works out
The packet loss probability for SXmin, PLP, is 
Output Queueing Analysis
Each output buffer concentrates up to $\kappa_N(1)$ outputs during a time slot. Under the uniform traffic assumption an approximate Markov chain model can be used to determine parameters such as packet loss due to buffer overflow, mean switch delay, normalized switch throughput etc. Such analysis has been widely published (for instance in [6, 17]). We provide an outline for the derivation of the steady state equations by showing a lower bound for packet arrival probabilities at the buffers.
For an arbitrary output let $r_k$ denote the probability that there are $k$ simultaneous packet arrivals to its buffer in one time slot.
Let the output under consideration belong to some group $G_j$ of four consecutive outputs, where $0 \le j \le N/4 - 1$. The probability that none of the packets destined to the given output are lost in the first $n - 2$ stages of SXmin, $PL_0$, is given by $PL_0 = \prod_{i=0}^{n-3} PL^{i,0}_{G_j}$. Equation 4 in Lemma 2 can be used to obtain the lower bound $P^{n-2}_k \ge P^0_k \, PL_0$. The buffer arrival equations can therefore be written as, . Let $B_n$ denote the steady state probability of having $n$ packets in an output buffer. The queue size can be modeled by a finite-state Markov chain as in [6, 17]. The delay-throughput characteristics of SXmin can be obtained by solving for the steady state probabilities using the values in Equation 20.
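To make the finite-state Markov chain concrete, the sketch below iterates the queue-length distribution of a single output buffer to its steady state. It is an illustration under stated assumptions rather than the model of [6, 17] verbatim: the per-slot update $q' = \min(B, \max(0, q + k - 1))$ (one departure per slot, overflow dropped), the buffer size `buffer_size`, and the arrival vector `arrivals` (standing in for the $r_k$ values) are all hypothetical choices, and a fixed number of power iterations replaces an exact solution of the balance equations.

```python
def steady_state_queue(arrivals, buffer_size, iterations=5000):
    """Approximate steady-state queue-length distribution of one buffer.

    arrivals[k] is the assumed probability of k simultaneous packet
    arrivals in a slot. Assumed update per slot:
        q' = min(buffer_size, max(0, q + k - 1))
    """
    B = buffer_size
    dist = [1.0] + [0.0] * B  # start with an empty buffer
    for _ in range(iterations):
        nxt = [0.0] * (B + 1)
        for q, pq in enumerate(dist):
            if pq == 0.0:
                continue
            for k, rk in enumerate(arrivals):
                q_next = min(B, max(0, q + k - 1))
                nxt[q_next] += pq * rk
        dist = nxt
    return dist


def mean_queue_length(dist):
    """Mean number of packets in the buffer for a given distribution."""
    return sum(n * p for n, p in enumerate(dist))


if __name__ == "__main__":
    # Hypothetical arrival probabilities for 0, 1, or 2 packets per slot.
    dist = steady_state_queue([0.5, 0.3, 0.2], buffer_size=8)
    print(mean_queue_length(dist))
```

Feeding in arrival probabilities derived from the lower bound on $P^{n-2}_k$ would give the corresponding delay-throughput estimates; the toy values in the usage example are placeholders only.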
At this point, it is interesting to compare the output queueing performance of SXmin with other Knockout Principle based switches such as the SCOQ switch [17] and the original Knockout switch [6]. The SCOQ switch has no packet loss in the sorter and the packet filter stages but suffers from internal packet loss in the routing network stage (RN in [17]). Using the same notation as in Equation 20, the buffer packet arrival equations for an SCOQ switch, $g$ is a decreasing function approaching 1 for larger values of $N$. As $N$ increases the number of $2 \times 2$ switches is $\frac{N}{2} \log N$, which is the same as that of a single $N \times N$ banyan network.
It was mentioned at the beginning of this paper that the design and analysis of SXmin was carried out for $\rho_0 = 1$. However, most Knockout Principle based switches work best for traffic loading around 0.9, since the number of output buffers required grows very rapidly as $\rho \to 1$. Hence it might be more practical to design SXmin for $\rho_0 = 0.9$. For $N = 1024$ and $\rho_0 = 1$, the sorter requires 28,160 comparators. The following 8 stages of SXs use 8,834 $2 \times 2$ switches in total. The last stage of SXs consists of 2,304 additional $2 \times 4$ switches.
As a comparison, the SCOQ switch has the following hardware requirements: $NL$ packet filters; $\frac{N}{4}(\log^2 N + \log N)$ comparator elements; and $\frac{NL}{2} \log \frac{N}{L}$ $2 \times 2$ switches, where $L$ is the number of packet arrivals to a single destination that can be simultaneously accepted. To maintain an acceptable packet loss probability ($< 10^{-6}$), the choice of $L$ is $L \ge 8$. Substituting the value of $L = 8$, the SCOQ switch thus uses $8N$ packet filters; $\frac{N}{4}(\log^2 N + \log N)$ comparator elements; and $4N \log \frac{N}{8}$ $2 \times 2$ switches. Note that the SCOQ architecture is bus-based and thus may not really be suitable for large values of $N$. This is especially true when considering an optical implementation where buses are implemented using passive splitters. The number of outputs of a passive splitter is limited by the available power budget. On the other hand, an optical version of SXmin can be implemented using $2 \times 2$ electro-optic directional couplers. Since the cost of couplers is very high, the lower hardware requirements of SXmin make this more attractive. Table 4 shows a comparison of the hardware requirements of SXmin (for $\rho_0 = 0.9$ and $\rho_0 = 1$) with the SCOQ switch [17], the Knockout switch [6], the Sunshine switch [12], and Lee's modular switch [13] for $N = 1024$. Note that Lee's switch is designed primarily with scalability in mind whereas the other networks are not. The table serves to illustrate the tradeoff involved when considering hardware cost versus scalability.
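The closed-form counts quoted above are easy to check numerically. The short sketch below evaluates them for a given $N$ and $L$; the function names are hypothetical, logarithms are taken base 2, and $N$ and $L$ are assumed to be powers of two.

```python
def log2_exact(x):
    """Base-2 logarithm of a power of two, as an exact integer."""
    if x < 1 or x & (x - 1) != 0:
        raise ValueError("expected a power of two")
    return x.bit_length() - 1


def batcher_comparators(N):
    """Comparator count (N/4)(log^2 N + log N) of the sorting network."""
    n = log2_exact(N)
    return N * (n * n + n) // 4


def scoq_hardware(N, L):
    """SCOQ element counts from the formulas in the text."""
    return {
        "packet_filters": N * L,
        "comparators": batcher_comparators(N),
        "switches_2x2": (N * L // 2) * log2_exact(N // L),
    }


def banyan_switches(N):
    """Number of 2x2 switches, (N/2) log N, in one N x N banyan."""
    return (N // 2) * log2_exact(N)


if __name__ == "__main__":
    print(batcher_comparators(1024))
    print(scoq_hardware(1024, 8))
    print(banyan_switches(1024))
```

For $N = 1024$ the comparator formula reproduces the 28,160 figure given for the sorter, and with $L = 8$ the SCOQ formula gives 28,672 $2 \times 2$ switches and 8,192 packet filters.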
The single-bit controlled $2 \times 2$ switches in the SCOQ network and the Sunshine switch (simplified without the feedback loop; $M = 0$ in [12]) must be capable of dropping packets in case of routing conflicts and can be implemented using 6–8 gates. A dual-bit controlled $2 \times 2$ switch implementing the SXmin routing algorithm in a pseudo-random fashion requires around 10–13 gates. Finally, the $2 \times 4$ switches in the last stage of SXmin can be implemented using 16–20 gates. The switch counts from Table 4 indicate that the overall hardware cost of SXmin is much lower than the other schemes.
Conclusions
We have proposed SXmin, a self-routing, modular, fast ATM packet switch. SXmin maintains low probability of packet loss and does not require internal speedup of the switch while requiring much less hardware than other switches (e.g., Knockout Principle based switches [6, 15], the SCOQ switch [17] and the Sunshine switch [12]) for the same degree of performance. SXmin consists of a Batcher bitonic sorter followed by SX modules consisting of dual-bit controlled $2 \times 2$ switches. The number of switches in the sort-expander stages is calculated using the group Knockout Principle, recursively applied to successively smaller groups of outputs. SX modules keep the switch internally nonblocking by maintaining packet destinations in a cyclically sorted order. A novel dual-bit algorithm is proposed to resolve packet contentions at SX modules.
The proposed architecture uses modules readily available not only for electronic but also for optical networks. The switch was designed with an optical implementation in mind; therefore it does not contain any elements which are either costly or difficult to implement in optics. The recursive nature of the SX stage allows easy modular growth and a VLSI implementation. The regular wraparound perfect-unshuffle interconnections between SX modules in adjacent stages make the network suitable for time-domain optical implementation [19]. The proposed switching network allows easy adaptation to different requirements (like lower packet loss probability, varying load, larger number of inputs, etc.) while keeping the cost of hardware at a reasonable level.

Let $PL_r$ represent the probability that $r$ packets destined to the group $G_t$ of outputs have been lost in SX modules in stages $0, 1, \ldots, i - 1$ of SXmin. Then $P^i_{G_t=k}$ can be written as

$$P^i_{G_t=k} = P_{G_t=k} \, PL_0 + P_{G_t=k+1} \, PL_1 + \cdots + P_{G_t=N} \, PL_{N-k}.$$
From Equation 1 we note that $P_{G_t=k}$ is a binomial distribution attaining its maximum when $k = t$. For $k > t$, $P_{G_t=k} > P_{G_t=k+1}$. Substituting these bounds in Equation 22 and noting that $\sum_{r=0}^{N-k} PL_r = 1$ yields the desired result.

3. $\kappa_N(N/2^{i+1}) < t \le 2\kappa_N(N/2^i)$. In this case the only conflicts happen among packets with $b_{n-i-1} = 1$. Using a similar argument as in case 2, it can be seen that both $G^i_u$ and $G^i_l$ are cyclically sorted.
$\Box$
