Abstract-We present new results and numerical studies of very fast schedulers for SMS (Switch-Memory-Switch) routers, which emulate output queuing by buffering packets in a partitioned shared memory located between input and output ports. The architecture of Juniper's core routers and Brocade's storage switches is based on SMS.
I. INTRODUCTION
Routers play a critical role in modern computing of all forms [12], [15], [7], [4], [6], [9], [19], [10, Chapters 7.12, 8.12]. A router used to be nothing more than a general purpose computer connected via a standard bus to hardware for transmitting and receiving packets over links. This was because the link bandwidth was low enough for a general purpose processor to implement the entire router functionality. With the advent of high-speed fiber optic technology [17], [18], the situation has reversed, and in many networks today routers are the bottleneck in moving data.
A router needs to be able to buffer packets because of contention for output links. This buffering can be at the input port, at the output port, in the switching fabric, or in shared memories. The first and second cases are referred to as input queuing and output queuing, respectively [12].
The work of the third author was supported in part by NSF Grant CCR-1988160.
Output-queued routers are appealing because they have better latency and throughput than input-queued routers. However, a direct implementation of an output-queued router needs to run the switching fabric and the buffer memory at N times the line speed for an N-input, N-output router (since at the start of a cycle, the packets at all input ports may be destined to the same output port). Thus input queuing is preferred for implementation reasons, and considerable effort has been devoted to overcoming its limitations, e.g., the development of virtual output queuing to overcome head-of-line blocking [13].
It is natural to ask if it is possible to build a router whose external behavior is identical to an idealized output-queued router using slower components. Chuang et al. [5] define a router S to emulate output queuing if, given identical input arrival patterns, the departure time of every packet from S is the same as that from the output-queued router. They showed that a router with queues at both the inputs and the outputs that can support 2 reads and 2 writes per cycle can in principle be scheduled to emulate output queuing. However, computing the schedule itself involves solving an instance of the stable marriage problem. The standard "proposal algorithm" for finding a stable marriage takes $O(N^2)$ worst-case and $O(N \log N)$ expected time. The best parallel algorithm presently known for computing stable marriages [8] has complexity $O(\sqrt{N} \log^3 N)$ and uses polynomially many processors. It is extremely complicated, being based on interior point methods for linear programming. Thus it is neither theoretically efficient (i.e., it does not have O(polylog(N)) complexity) nor of practical significance.
Prakash, Sharif, and Aziz [16] proposed the Switch-Memory-Switch (SMS) architecture as an abstraction of the M-series Internet core routers from Juniper Networks. The set of input ports is connected via an N × M interconnect to M packet memories; these M memories are connected to the set of output ports through another interconnect. In every cycle one packet can be read from and written to each memory. It is shown that when $M \ge 2N - 1$, the SMS architecture can emulate an output-queued switch. Specifically, a scheduler based on computing perfect matchings in bipartite graphs is presented, whose time complexity is $O(\log^2 N)$ on a parallel random access machine. Although the results in [16] represent the first implementation of output queuing that runs in polylog time using a polynomial number of processors, the scheduler requires building and manipulating complex data structures, and its practical utility remains unclear.
The SMS router architecture has benefits beyond just the ability to emulate output queuing. High-performance packet switches use SRAM rather than DRAM for packet buffers because the access time of an SRAM is less than that of DRAM by a factor of up to 40. SRAM is considerably more expensive than DRAM, and the cost (measured in terms of power, footprint, as well as price) of memory is a significant factor in the design of routers. The SMS architecture is therefore appealing because of its ability to "pool" available memory, and thereby achieve better memory utilization. Indeed, Juniper's core routers and Brocade's storage switches are based on the SMS architecture precisely because it reduces memory cost.
With the above motivation, we began work on developing practical schedulers for routers based on the SMS architecture.
We developed RiPSS [3], a very simple randomized parallel scheduler for SMS routers, which we review in detail in Section II. In essence we proved that RiPSS computes a complete assignment of packets to memories in O(log* N) basic matching rounds with high probability (w.h.p.), independent of the input traffic pattern. The intuitive idea is that when there are $(2 + \epsilon) \cdot N$ memories, the ratio of unmatched memories to unmatched inputs increases by an amount exponential in N (rather than a constant, which would yield an O(log N) bound on the number of iterations).
The formal proof of the O(log* N) result for RiPSS uses the machinery of martingale analysis and Azuma's inequality.
Further, although RiPSS is intuitively very simple and fast, the constants derived in the proof are quite large, and we suspected them to be loose.
In this paper we make two contributions toward the goal of developing practical schedulers for SMS routers: 
II. BACKGROUND
In this section we review germane results on SMS routers from our prior work [16], [3]. The SMS architecture is depicted in Figure 1. Input ports are connected via an N × M interconnect to M packet memories; these memories are connected to the output ports through another interconnect. In every cycle one packet can be read from and written to each memory. (The generalization of our results to faster or slower memories is straightforward.)
A. Emulating output queuing with the SMS architecture
Since output queuing is highly desirable, ideally, SMS scheduling should result in the SMS router emulating an output-queued router. By emulation, we mean that for any arrival sequence (1) a packet is dropped by the SMS router iff it will be dropped by the output-queued router, and (2) if a packet is not dropped then the cycle in which it departs the SMS router must be the same as the cycle in which it would have departed the output-queued router.
0-7803-8355-9/04/$20.00 ©2004 IEEE.
The cycle in which a packet would have departed an output-queued router is referred to as its time-stamp. In each cycle, packets at the inputs are written to a subset of memories through the first interconnect, and packets whose time-stamp is equal to the current time are read from the memories and transferred to the outputs through the second interconnect. Since the time-stamp of a packet is known when it is written to a memory, Task 3 is simple. Task 1 can also be performed efficiently using a parallel prefix sum computation, as described in [16].
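As a rough illustration of what a time-stamp is, the following Python sketch computes departure cycles for an idealized FIFO output-queued router. It is a sequential sketch, not the parallel prefix sum computation of [16], and it makes several assumptions that are not stated in the text: buffers are unbounded (so no packet is dropped), each output link sends one packet per cycle, and a packet may depart in the cycle in which it arrives. The function name `assign_time_stamps` and the input encoding are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_time_stamps(arrivals: List[List[int]],
                       num_outputs: int) -> Dict[Tuple[int, int], int]:
    """Return the departure cycle (time-stamp) of every packet.

    arrivals[t] lists the destination output of each packet arriving in
    cycle t. Packets are identified by (arrival cycle, position in list).
    Assumption: unbounded FIFO output queues, one departure per output
    per cycle, and same-cycle departure allowed when the link is idle.
    """
    # next_free[o] is the earliest cycle in which output o's link is idle.
    next_free = [0] * num_outputs
    stamps: Dict[Tuple[int, int], int] = {}
    for t, destinations in enumerate(arrivals):
        for k, out in enumerate(destinations):
            depart = max(t, next_free[out])
            stamps[(t, k)] = depart
            next_free[out] = depart + 1
    return stamps
```

Under these assumptions, two packets arriving in the same cycle for the same output receive consecutive time-stamps, which is the property the SMS scheduler has to respect when it later reads packets out of the memories.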
Task 2 is the most complex step and is the main focus of this paper. For routers that are relatively small and support slow links, the SMS architecture can emulate output queuing by using a straightforward greedy sequential algorithm to compute an assignment of incoming packets to compatible memories. However, for routers with many ports operating at high speeds, the sequential algorithm is not fast enough to compute the assignment. Prakash et al. [16] presented the first parallel algorithm for computing the assignment; however, the algorithm requires building and manipulating complex data structures. Subsequently, we developed RiPSS, a simple, highly parallel randomized scheduler [3], which we describe in the next section.
There are N input ports, each with a buffer that can hold I packets. At each input port, the current packet is the packet at the head of that input buffer. In the basic algorithm of [3], I is 1; in the pipelined algorithm we present in Section IV of this paper, I = O(log* N).
where c > 1 is a constant, then with high probability, RiPSS will not drop a packet that will not be dropped by that output-queued switch.
2) Number of Memory Banks: Even though the cumulative size of memories in an SMS architecture can be close to that of an output-queued router, having a large number of small memories is slightly more expensive than having a small number of large memories. In this context the following results were established in [3].
Lemma 2.5: If an adversary places packets in the memory then at least 2N - 1 memory banks are needed in order to satisfy arrival and departure constraints in SMS.
Since a well-designed algorithm can control the placement of packets in the memory, it is possible that such an algorithm can make do with a smaller number of memory banks than the bound in Lemma 2.5. However, the following result is shown in [3].
Theorem 2.6: There is no deterministic algorithm that can match any sequence of packet arrivals to memories while satisfying arrival and departure constraints if the number of memories is $M = N + \Delta$ and $\Delta < N/8$. Furthermore, for any randomized algorithm there exists an arrival sequence for which it will fail with probability at least 1/2 if $\Delta < N/8$.
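To make the arrival and departure constraints concrete, here is a minimal Python sketch of a greedy sequential assignment of one cycle's packets to memories. The compatibility rule used is an assumption based on the one-read, one-write-per-cycle memory model: a memory is taken to be compatible with a packet if no other packet has been written to it in the current cycle (arrival constraint) and it holds no packet with the same time-stamp (departure constraint). The function name `greedy_assign` and the data layout are hypothetical.

```python
from typing import List, Optional, Set


def greedy_assign(time_stamps: List[int],
                  stored: List[Set[int]]) -> List[Optional[int]]:
    """Greedily assign this cycle's packets to compatible memories.

    time_stamps[p] is the time-stamp of the packet at input p.
    stored[m] is the set of time-stamps already held by memory m; it is
    updated in place. Returns the chosen memory per packet, or None if
    no compatible memory exists for that packet.
    """
    written: Set[int] = set()  # memories already written in this cycle
    assignment: List[Optional[int]] = []
    for ts in time_stamps:
        choice: Optional[int] = None
        for m, stamps in enumerate(stored):
            # Arrival constraint: one write per memory per cycle.
            # Departure constraint: one read per memory per cycle, so a
            # memory cannot hold two packets with the same time-stamp.
            if m not in written and ts not in stamps:
                choice = m
                break
        if choice is not None:
            written.add(choice)
            stored[choice].add(ts)
        assignment.append(choice)
    return assignment
```

Under this rule a packet can be blocked by at most N - 1 other arriving packets and at most N - 1 stored packets with the same time-stamp, which is consistent with the 2N - 1 figure discussed above. The scan is inherently sequential, which is why it does not scale to many fast ports.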
III. RIPSS: A NUMERICAL STUDY

A. Worst case bounds on performance
The theorems presented in Section II tell us only the asymptotic behavior of RiPSS. In particular, they do not tell us what the hidden constants are and for what values of N (the number of input ports) and M (the number of memory banks in the SMS architecture) we obtain an acceptably small probability of failure. Since we do not have simple closed-form expressions for the exact probability of failure and for the number of rounds, in this section we present concrete upper bounds on the probability of failure and on the number of rounds needed to limit the probability of failure to a certain value.
In a given cycle, we know that each input can be incompatible with at most N - 1 memories. Thus each input must have at least M - N + 1 compatible memories. We consider the worst-case scenario, where each input has exactly M - N + 1 compatible memories and all inputs contend for the same set of M - N + 1 memories. Thus the compatibility graph would be a complete N × (M - N + 1) bipartite graph. Let $P_m(i, j, k)$ be the probability of the event that if k balls are thrown uniformly at random into j bins then there are exactly i non-empty bins.
Thus the probability of i packets being matched in such a scenario would be $P_m(i, N, M - N + 1)$. For k balls to leave exactly i non-empty bins, either the first k - 1 balls fall into exactly i bins and the last ball also falls in one of these i bins, or the first k - 1 balls fall into i - 1 bins and the last ball falls in a new bin. This gives us the following recurrence relation,
$$P_m(i, j, k) = \frac{i}{j}\, P_m(i, j, k-1) + \frac{j - i + 1}{j}\, P_m(i-1, j, k-1).$$
Now let $P_r(n, m, T)$ be the probability of matching n packets to a subset of m memories in T rounds when each of the n inputs is compatible with each of the m memories. Thus,
$$P_r(n, m, T) = \sum_{i} P_m(i, n, m) \cdot P_r(n-i, m-i, T-1).$$
A similar approach can be used to compute the expected number of rounds. We emphasize that the solution to the recurrence relations provides only an upper bound, since the relations assume a very pessimistic scenario. Table I shows the minimum number of rounds needed to ensure that the probability of all packets being matched is at least 0.999. The numbers in the table show that, if $\epsilon \ge 0.5$, we never need more than 3 rounds.
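The worst-case recurrences of this subsection can be evaluated numerically along the following lines. This Python sketch is an illustration under stated assumptions, not the authors' code: it assumes the occupancy recurrence has the standard balls-in-bins form P(i, j, k) = (i/j) P(i, j, k-1) + ((j-i+1)/j) P(i-1, j, k-1), with memories acting as balls and inputs as bins, and it assumes the base cases "zero inputs left means success" and "zero rounds or zero memories left means failure". The names `occupancy`, `all_matched`, and `min_rounds` are hypothetical.

```python
from functools import lru_cache
from typing import Optional, Tuple


@lru_cache(maxsize=None)
def occupancy(bins: int, balls: int) -> Tuple[float, ...]:
    """dist[i] = Pr[exactly i of `bins` are non-empty after `balls` throws]."""
    if bins <= 0:
        raise ValueError("bins must be positive")
    dist = [1.0] + [0.0] * bins  # no balls thrown: zero non-empty bins
    for _ in range(balls):
        new = [0.0] * (bins + 1)
        for i, p in enumerate(dist):
            if p == 0.0:
                continue
            new[i] += p * i / bins  # ball lands in an occupied bin
            if i < bins:
                new[i + 1] += p * (bins - i) / bins  # ball opens a new bin
        dist = new
    return tuple(dist)


@lru_cache(maxsize=None)
def all_matched(n: int, m: int, rounds: int) -> float:
    """Probability that n inputs are all matched to m memories in `rounds`
    rounds, when every input is compatible with every memory."""
    if n == 0:
        return 1.0
    if rounds == 0 or m <= 0:
        return 0.0
    # In one round each of the m memories requests a random input, so the
    # number i of inputs matched follows the occupancy distribution.
    return sum(p * all_matched(n - i, m - i, rounds - 1)
               for i, p in enumerate(occupancy(n, m)) if p > 0.0)


def min_rounds(N: int, M: int, target: float = 0.999,
               max_rounds: int = 50) -> Optional[int]:
    """Smallest T with worst-case success probability >= target, if any."""
    for T in range(1, max_rounds + 1):
        if all_matched(N, M - N + 1, T) >= target:
            return T
    return None
```

For example, `min_rounds(N, M)` with `M - N + 1` compatible memories per input corresponds to the pessimistic complete-bipartite scenario described above; the memoized recursion is only practical for moderate N in this unoptimized form.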
Figure 2 depicts the expected number of rounds for various values of N and M. It can be seen from the figure that even for the case where $M = \lfloor 2.1N \rfloor$ (i.e., $\epsilon = 0.1$) we need fewer than 4.01 rounds on average for a 4096-port switch. With M = 3N the expected number of rounds remains below 2.02.
In Figure 3 we plot the probability of matching all inputs at the end of a fixed number of rounds. With $M = \lfloor 2.1N \rfloor$ and 4 rounds, the probability of matching all inputs is very close to one for a switch with up to 4096 ports. In Figure 4 we look at the effect of increasing the number of memory banks on the expected number of rounds. We use $M = \lfloor (2 + \epsilon)N \rfloor$ and plot the expected number of rounds for different values of N and $\epsilon$. Obviously, as $\epsilon$ increases the expected number of rounds decreases. However, there does not seem to be much gain after $\epsilon = 0.5$.
B. Simulation studies on stochastic traffic
The failure probabilities and expected number of rounds computed in the previous section provide upper bounds on worst-case traffic. However, we do not know of any explicit arrival sequence that would achieve those bounds, and it may be the case that no such sequence exists. Similarly, we do not know whether the bound on M provided in Theorem 2.6 is tight.
Figure 5, Figure 6, and Table II show the average number of rounds needed to match all inputs with uniform Bernoulli traffic; as might be expected, the average number of rounds needed here is less than the worst-case upper bound in the previous section. Even when only 1.6N memories are used, the packets are matched in less than 3.1 rounds on average for N up to 4096. Figure 6 shows the number of rounds needed for different values of M/N.
buffers are almost full. Further, according to the analysis in [3], this is independent of the arrival pattern and works even with bursty traffic. To study this theoretical prediction, we simulated both SMS and output-queued (OQ) routers with 64 input and output ports and bursty traffic (geometrically distributed bursts for randomly chosen outputs). For each cycle we measured the ratio of the number of packets in a buffer to the average number of packets per buffer. Ideally, if all buffers are equally full, all measurements should be close to one. A spread in these numbers indicates that buffers are not evenly occupied. Figure 7(a) shows the distribution of this measurement for both SMS and OQ for bursty traffic. Here, the plot for SMS buffers is concentrated around one, indicating that packets are evenly balanced across all buffers, while for OQ, a buffer can have as much as 15 times the average number of packets. Figure 7(b) shows similar results for uniform Bernoulli arrivals.
Since traffic arrives uniformly at all outputs, in this case the output queues also remain balanced and small, but even here, the distribution of packets across buffers is more balanced in the SMS router than in the OQ router.
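The balance measurement used in these experiments, the ratio of each buffer's occupancy to the average occupancy, can be written down in a few lines. The Python sketch below is only an illustration; the function name `load_ratios` is hypothetical, and the convention of reporting a ratio of 1.0 for every buffer when all buffers are empty is an assumption made here to avoid division by zero.

```python
from typing import List


def load_ratios(occupancy: List[int]) -> List[float]:
    """Ratio of each buffer's packet count to the average packet count.

    Values close to 1.0 for every buffer indicate evenly balanced buffers;
    a wide spread indicates that some buffers are much fuller than others.
    """
    average = sum(occupancy) / len(occupancy)
    if average == 0:
        # Assumption: with no packets stored, treat all buffers as balanced.
        return [1.0] * len(occupancy)
    return [count / average for count in occupancy]
```

Collecting these ratios over all buffers and cycles and plotting their histogram would give a distribution of the kind described for Figure 7.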
IV. PIPELINING
When multiple rounds of the basic matching procedure are used, the memories and inputs that are matched in earlier rounds will remain idle till the end of the cycle. In this section we present a simple pipelined scheduling algorithm, PRiPSS (Pipelined RiPSS).
We start by analyzing a variant of the basic matching process in which only a random sample of the memory banks attempt to match themselves to the inputs. The rounds start with round i = 0 to facilitate relating this process to the rounds in the pipelined matching process. In the ith round of this 'sampled matching process' each memory bank attempts to match itself with probability $1/2^{i+1}$, for $i \ge 0$. We now describe this algorithm and establish that it computes a perfect matching in O(log* N) rounds. The base used for the logarithm in the log* N analysis is not 2, but a value b, which is less than 2 but greater than $e^{1/e}$. The more detailed analysis given in Appendix I (which works for any $\epsilon > 0$) establishes the result using the traditional base 2.
Sampled Matching Process:
for i = 0, 1, ...
1) in parallel each unmatched memory sends a message to a random compatible input port with probability $1/2^{i+1}$ and does nothing with probability $1 - 1/2^{i+1}$.
2) in parallel each input port i picks a memory bank j that sent it a message and assigns its current packet to that memory bank. It then broadcasts a bit to all memory banks to inform them that it is no longer available to be matched (the bit sent to memory bank j is a 1 and the bit sent to all other processors is 0).
3) in parallel each memory bank that receives a 1-bit from its matched input decrements a counter initially set to s. If the counter goes down to zero, the processor declares itself matched.
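The three steps above can be simulated sequentially to get a feel for how quickly the set of unmatched inputs shrinks. The Python sketch below is a simplified illustration, not the scheduler itself. It assumes that every memory is compatible with every input, that memories send requests only to inputs that are still unmatched, that each input holds a single packet, and that the counter s equals 1, so a memory is matched as soon as one input accepts it. The function name `sampled_matching` and its parameters are hypothetical.

```python
import random
from typing import Dict, List


def sampled_matching(n_inputs: int, n_memories: int, max_rounds: int,
                     rng: random.Random) -> List[int]:
    """Simulate the sampled matching process.

    Returns the number of unmatched inputs remaining after each round.
    """
    unmatched_inputs = set(range(n_inputs))
    unmatched_memories = set(range(n_memories))
    history: List[int] = []
    for i in range(max_rounds):
        if not unmatched_inputs:
            break
        send_prob = 1.0 / 2 ** (i + 1)
        targets = sorted(unmatched_inputs)
        # Step 1: each unmatched memory sends a request with probability
        # 1/2^(i+1) to a uniformly random unmatched input.
        requests: Dict[int, List[int]] = {}
        for mem in sorted(unmatched_memories):
            if rng.random() < send_prob:
                requests.setdefault(rng.choice(targets), []).append(mem)
        # Steps 2 and 3: each input that received a request accepts one
        # sender; that input and that memory become matched (s = 1).
        for port, senders in requests.items():
            unmatched_inputs.discard(port)
            unmatched_memories.discard(rng.choice(senders))
        history.append(len(unmatched_inputs))
    return history
```

A call such as `sampled_matching(64, 140, 6, random.Random(0))` returns a non-increasing list of unmatched-input counts; exact values depend on the seed. Because the sending probability halves every round, this isolated process is mainly useful for studying the per-round shrinkage that the analysis bounds.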
Define $x \uparrow\uparrow y$ to be a y-high tower of x's. Let $b = e^{(1/2 - \delta)}$, where $0 < \delta < 1/2 - 1/e$, and let $z_i = N/(b \uparrow\uparrow i)$. We observe that in iteration i, for any given unmatched input port p, the expected number of processors compatible with p that send a message to some compatible input is at least $(N + N\epsilon)/2^{i+1}$. Using a Chernoff bound [1] we can show that with high probability, for any constant c > 0, at least $(1 - c) \cdot N/2^{i+1}$ of the processors that are compatible with a given unmatched input port do actually send a message in that round.
For $i \ge 0$, let $x_i$ denote the number of unmatched inputs that remain after the ith iteration of the sampled matching process.
Proof: For the base case of the lemma we note that $E[x_0] \le$
Hence by applying Azuma's inequality [1] we have that $x_0 \le N/b$ w.h.p. in N.
Assume inductively that the result holds for $x_{i-1}$ for some i > 0, and consider $x_i$. We have
If $E[x_i] \ge \sqrt{N}$, by Azuma's inequality we have that $x_i \le$
Let us now return to the analysis of the pipelined matching process, and let $D = \log^*_b N$.
Theorem 4.2: If $A_0(T-1)$ is true then, w.h.p. in N, $A_0(T)$ is true.
Proof: Consider the start of cycle T. Note that for any input port with i unmatched packets, the number of packets that can be matched at that port during cycle T - 1 is 0, 1, or 2 (since we have assumed that w = 2). Let $r_i(T-1)$ be the number of inputs that had i or i - 1 unmatched packets at the start of cycle T - 1 and have at least i - 1 unmatched packets at the end of cycle T - 1. Since one new packet arrives at each input port at the start of cycle T, we have
The last equation above uses the inequality $r_i(T-1) \le 3z_{i+1}$. We can establish this as follows:
Let $n_1$ be the number of active inputs in $Q_i(T-1)$ that are unmatched after the first iteration of stage T - 1, let X be the set of inputs that have i - 1 unmatched packets after the first iteration of stage T - 1, and let $n_2$ be the number of inputs in X that are unmatched after the second iteration of stage T - 1.
Since $q_i(T-1) \le s_i(T-1) \le z_i$ (by the induction assumption), we have $n_1 \le z_{i+1}$ by Lemma 4.1.
For $n_2$ we note that $|X| = x_1 + x_2$, where $x_1$ is the number of inputs that had i unmatched packets at the start of cycle T - 1 and have i - 1 unmatched packets after the first iteration, and $x_2$ is the number of inputs that had i - 1 unmatched packets at the start of cycle T - 1 and continue to have i - 1 unmatched packets after the first iteration. Clearly, $x_1 \le q_i(T-1)$, and $x_2 \le z_i$ by the behavior of the sampled matching process on inputs that had i - 1 unmatched packets at the start of cycle T - 1.
Further, when the low-probability event of failure in matching all packets occurs, PRiPSS resumes its normal behaviour of matching all packets within $\log_q N$ cycles, for a suitable constant q > 1.
A. Simulation of PRiPSS
We ran simulations of PRiPSS for 100,000 cycles with
We simulated PRiPSS, PRiPSS-1, and PRiPSS-2 for 100,000 cycles with uniform Bernoulli arrivals and varied D. For PRiPSS-2 we found $\delta = 0.5$ gave the best results. Table III shows the minimum value of D needed such that all the packets are matched in the simulations. Note that the algorithm for PRiPSS-1 requires 3 rounds of communication between memories and inputs while PRiPSS-2 and PRiPSS use only 2 rounds in our simulations. It appears that PRiPSS is an attractive alternative to the basic non-pipelined RiPSS and performs better than PRiPSS-1 and PRiPSS-2. It placed every packet in memory using just 2 rounds per cycle while keeping the latency to only two cycles for $N \le 2048$, and using only M = 1.6N memory banks. While PRiPSS-2 also requires only
V. DISCUSSION
In this paper we have presented several results on practical routers for output-queued switches based on the SMS architecture. We have presented extensive numerical results for RiPSS, a randomized, parallel scheduler for SMS described in our earlier work [3]. We have presented a new and improved pipelined randomized parallel scheduler, PRiPSS, and analyzed its performance, and we have presented numerical results evaluating the performance of PRiPSS and two other pipelined heuristics.
Our results for RiPSS and PRiPSS are very encouraging. For switches with up to N = 4096 input ports, RiPSS placed all incoming packets in compatible memory banks using just 3 rounds in 99.9% of the cycles even when the number of memory banks M was only 1.6N. Earlier results in [3] have shown that under adversarial conditions, no placement is possible unless $M \ge 2N - 1$, and there exist arrival sequences for which no randomized scheduler can place packets more than half the time unless $M \ge 9N/8$. The fact that RiPSS places all packets under Bernoulli arrivals in just 3 rounds in 99.9% of the cycles when $M \ge 1.6N$ is encouraging.
The effective use of available memory by RiPSS relative to the output-queued switch it simulated is impressive. For both Bernoulli arrivals and bursty traffic, most of the memory banks in the SMS switch using RiPSS had load very close to the average load in most cycles. In practical terms, this means that if one uses RiPSS in an SMS architecture, the total amount of buffer space required needs to be only slightly larger than the total number of packets that need to be buffered. This had been proved analytically in [3] and our simulations support this result convincingly.
The pipelined scheduler PRiPSS that we presented and analyzed in this paper is superior to the pipelined scheduling
We compared PRiPSS with two other pipelined strategies, PRiPSS-1 and PRiPSS-2, that appeared to be natural heuristics, but for which we could prove only an O(log log N)-stage bound on the delay, rather than the O(log* N) bound that we proved for PRiPSS.
APPENDIX I
DETAILED ANALYSIS OF PRIPSS
We now give a detailed analysis of the pipelined randomized scheduler based on the pipelined matching procedure in Section IV, for the case when $\epsilon > 0$ is an arbitrarily small constant.
It is interesting to note that $\log^*_b N$ is not defined for all values of N if $b \le e^{1/e}$. In fact, if $b \le e^{1/e}$ then $(b \uparrow\uparrow i) \le e$ for any value of i. Thus we cannot simply repeat the analysis in Section 4.2 with a base b this small.
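The remark about small bases can be checked numerically. The following Python sketch computes power towers and an iterated logarithm to base b; it assumes that log* to base b of N means the smallest tower height i such that the i-high tower of b is at least N, which is one common convention and may differ from the paper's by an additive constant. The names `tower` and `log_star`, and the cap `max_height`, are hypothetical.

```python
import math
from typing import Optional


def tower(b: float, height: int) -> float:
    """Return the `height`-high power tower of b (height 0 gives 1)."""
    value = 1.0
    for _ in range(height):
        value = b ** value
    return value


def log_star(n: float, b: float, max_height: int = 1000) -> Optional[int]:
    """Smallest i such that tower(b, i) >= n, or None if the tower has
    not reached n within max_height levels."""
    value = 1.0
    for i in range(max_height + 1):
        if value >= n:
            return i
        try:
            value = b ** value
        except OverflowError:
            # The next level exceeds floating-point range, hence exceeds n.
            return i + 1
    return None
```

With base 2 the tower passes any practical N within a handful of levels, whereas with `b = math.exp(1 / math.e)` the tower stays below e, so `log_star(3, math.exp(1 / math.e))` returns `None` under this convention.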
Recall that $z_i = N/(b \uparrow\uparrow i)$.
We will set b = 2 for this analysis. Let D be the smallest integer such that $z_D \le a$. In order to prove the above theorem we will first need the following lemma.
Lemma 1.2: The number of unmatched inputs in the set $Q_i(T,t)$ after the execution of a single iteration of the pipelined matching procedure is no more than $q_i(T,t)\, e^{-\epsilon N/(2^{i+1} q_i(T,t))}/a$, w.h.p. in N, where a > 0 is a constant independent of N.
Proof: First we bound the expectation. Let j be an input in $Q_i(T,t)$. Let q(j) be the set of unmatched memories that can be matched to input j. Clearly $|q(j)| \ge \epsilon N + q_i(T,t)$.
Let $C_m$ be the index of the input to which memory m sends a request. Thus if $m \in q(j)$, then $\Pr[C_m = j] = 1/(2^{i+1} \cdot q_i(T,t))$. Let $C = (C_1, C_2, \ldots, C_M)$ and define the random variable $X_j(C)$ to be 1 if $\forall m\, (C_m \ne j)$ and 0 otherwise. Informally, $X_j(C)$ indicates that input j did not get a request from any of the memories. Since an input is matched if and only if it gets a request from at least one of the memories, $X_j(C) = 1$ implies input j did not get a match in that round. Let $X(C) = \sum_{j \in Q_i(T,t)} X_j(C)$ be the total number of unmatched inputs in $Q_i(T,t)$ at the end of the round, where $q_i(T,t) = |Q_i(T,t)|$. Let $s_i(T,t) = q_i(T,t)$. Then,
$$E[X(C)] \le q_i(T,t)\left(1 - \frac{1}{2^{i+1} q_i(T,t)}\right)^{(\epsilon N + q_i(T,t))} \le q_i(T,t)\, e^{-(q_i(T,t) + \epsilon N)/(2^{i+1} q_i(T,t))}.$$
Thus, defining a martingale and using Azuma's inequality, we can say that w.h.p. in N the number of unmatched inputs in the set $Q_i(T,t)$ would be no more than $q_i(T,t)\, e^{-\epsilon N/(2^{i+1} q_i(T,t))}/a$.
Thus by Lemma 2.2 with high probability all the inputs in $Q_a(T)$ will be matched in phase 3. Since each input other than $Q_L(T)$ had at least one packet matched now, all the packets that arrived at cycle T - D must be matched.
Similar to the proof of PRiPSS, we now argue that our algorithm has a self-stabilizing property in the sense that if it goes to a bad state, in a small number of cycles it will recover from the bad state to a state where $P_0$ is again true and no drops occur.
Since the bound in $P_j$ equals N if $j = \log_q \log N$, it follows that $P_{\log_q \log N}$ is trivially true for any cycle. Thus, by the second assertion in Theorem 2.1, any time $P_0$ is not true for a cycle, within $\log_q \log N$ cycles it must hold true with high probability. Thus $P_0$ will hold in almost all cycles during the execution of PRiPSS-1, and no packets will be dropped.
holds for cycle T + 1 with high probability.
APPENDIX III
ANALYSIS OF PRIPSS-2
We now sketch a proof that PRiPSS-2 places all packets with high probability in N with $D = c \cdot \log\log N$ stages of pipeline, for a suitable constant c. For this we observe that the constant 0.4 used in the statement of Lemma 2.3 could be replaced by a somewhat smaller constant, call it c', while maintaining the validity of the claim. Thus if we use a sufficiently small constant $\delta$ in PRiPSS-2, Lemma 2.3 can be proved for the case when the basic matching procedure is executed by each memory only with probability $(1 - \delta)$. Thus $P_0$ will hold w.h.p. in N for PRiPSS-2. Now, if $P_0$ is true, then the number of unmatched packets that arrived D cycles earlier must be less than $N/\log^c N$. Thus an input that has such a packet must receive a request from any compatible memory with probability at least $(\delta \log^c N)/N$. Since there are at least $\epsilon \cdot N$ compatible memories, the probability of such an input not receiving a grant would be at most $(1 - (\delta \log^c N)/N)^{\epsilon N} < e^{-\delta \epsilon \log^c N}$. Thus with high probability in N all such inputs must be matched in a single round.
