Abstract-This paper is about high capacity switches and routers that give throughput, rate, and delay guarantees. Many routers are built using input queueing or combined input and output queueing (CIOQ), using crossbar switching fabrics. But such routers require impractically complex scheduling algorithms to provide the desired guarantees. We explore how a buffered crossbar (a crossbar switch with a packet buffer at each crosspoint) can provide guaranteed performance (throughput, rate, and delay) with less complex, practical scheduling algorithms. We describe scheduling algorithms that operate in parallel on each input and output port, and hence are scalable. With these algorithms, buffered crossbars with a speedup of two can provide 100% throughput, rate, and delay guarantees.
I. BACKGROUND

Network operators want high capacity routers that give guaranteed performance. First, they prefer routers that guarantee throughput so they can maximize the utilization of their expensive long-haul links. Second, they want routers that can allocate each flow a rate. Third, they would like the capability to control the delay for packets of individual flows for real-time applications. Because they want high capacity, the trend has been towards input queued or combined input and output queued (CIOQ) routers. Most of these routers use a crossbar switching fabric with a centralized scheduler. While it is theoretically possible to build crossbar schedulers that give 100% throughput [1] or rate and delay guarantees [2], [3], they are considered too complex to be practical. No commercial backbone router today can give hard guarantees on throughput, rate or delay.

In practice, commercial systems use heuristics such as iSLIP [4] or a maximal matching algorithm such as WFA [5] with insufficient speedup to give guarantees. Perhaps the most promising way of obtaining guaranteed performance has been to use maximal matching with a speedup of two in the switch fabric [6]. But this only gives guaranteed throughput, with no guarantees on rates and delay. And it still requires a centralized scheduler, which doesn't scale with an increase in the number of ports due to the communication complexity. In this paper [...]

A. The Buffered Crossbar

Figure 1 shows a 3 x 3 buffered crossbar with line rate R. To prevent head-of-line blocking, the inputs maintain virtual output queues (VOQs). Fixed length packets¹ wait in the VOQs to be transferred across the switch. Each crosspoint contains a buffer that can hold one cell. The buffer between input i and output j is denoted as Bij; when the buffer holds a cell, Bij = 1, else Bij = 0.

Because the packets are all the same length, time is slotted, with a time slot equal to the time it takes for a cell to arrive on the external line. Internally, the switch runs faster than the external line, and the ratio between the two is the speedup. If the switch can remove S cells from each input and transfer S cells to each output in a time slot, then it has a speedup of S. Throughout most of this paper we will assume that S = 2, and so the switch has output queues.

¹ We assume that fixed length packets are handled as fixed length cells. This is common practice in high performance LAN switches and routers; variable length packets are segmented into cells as they arrive, carried across the switch as cells, and reassembled back into packets again.

B. Why use Buffered Crossbars?

Buffered crossbars are interesting because they have simpler algorithms than an unbuffered crossbar. In an unbuffered crossbar, the scheduler must find a matching between inputs and outputs that doesn't oversubscribe either. Overcoming both constraints at the same time leads to complex scheduling algorithms, such as maximal [5], maximum size [7], and maximum weight bipartite matchings [1], or iterative schedulers that are hard to pipeline [4], [8].

The appeal of a buffered crossbar switch is that its scheduler is much simpler. The scheduler operates in two stages. First, each input (independently and in parallel) picks a cell to place into a crosspoint. Then, in the second stage, each output (independently and in parallel) picks a crosspoint to take a cell from. The processing can be distributed to run on each input and output, and so no longer requires a single centralized scheduler. It can be pipelined to run at high speed, making buffered crossbars appealing for high performance switches and routers.

Researchers first noticed via simulation that buffered crossbars provide good throughput for admissible uniform traffic with simple algorithms [9], [10], [11], [12]. Simulations also indicated that, with modest speedup, a buffered crossbar can closely approximate fair queueing [13]. In [14], the authors described a mechanism to provide fair allocation and confirmed through simulations that a buffered crossbar can allocate service in a weighted max-min fair manner. Until recently, there were no analytical results on guaranteed throughput to explain or confirm the observations made by simulations.

In this paper, we describe a series of algorithms with a broad class of performance guarantees over and above FCFS and strict priority FCFS emulation. We prove that these algorithms can achieve 100% throughput, can mimic an OQ switch using a weighted round robin scheduler (which gives rate guarantees), and can also achieve delay guarantees. The main benefit of these algorithms is that each input and output makes simple scheduling decisions independently and in parallel, eliminating the need for a centralized scheduler. Our results show buffered crossbars can greatly simplify the scheduling process.
Of course, simplifying the scheduler comes at the expense of a more complicated crossbar: it now has to hold and maintain N^2 packet buffers. In the past this would have been prohibitively complex: the number of ports and capacity of a crossbar switch used to be limited by the N^2 crosspoints that dominated the chip area (hence the development of multi-stage switch fabrics, such as Clos, Banyan and Omega switches based on smaller crossbar elements). But nowadays, crossbar switches are limited by the number of pins required to get data on and off the chip [MI. Improvements in process technology, and reductions in geometries, mean that the logic required for N^2 crosspoints is small compared to the size of chip needed for N inputs and N outputs. The chips are pad-limited, with an underutilized die. A buffered crossbar can use the unused die for packet buffers. We believe that in current technology, the 128 x 128 unbuffered crossbar switch reported in [18] could hold 128^2 cell buffers.
C. Organization of this paper
The rest of the paper is organized as follows. In Section II, we show that a buffered crossbar with speedup two can give 100% throughput for non-uniform traffic. In Section III, we briefly review the counting method introduced in [2], which was used to show how a CIOQ switch can mimic an OQ switch. In Section IV, we show how the counting method can be applied to a buffered crossbar so as to mimic a class of OQ switches. In Section V, we describe how a buffered crossbar can give rate guarantees between an input/output pair for a weighted round robin scheduler. In Section VI, we show that the buffered crossbar with one cell per crosspoint can give delay guarantees when S = 3, and introduce a novel mechanism called header scheduling which supports delay guarantees when S = 2.

II. ACHIEVING 100% THROUGHPUT WITH AN ARBITRARY SCHEDULING ALGORITHM

Figure 2 shows the scheduling phases in a buffered crossbar with a speedup of two. Each of the two scheduling phases consists of two parts: input scheduling and output scheduling. In the input scheduling phase, each input (independently and in parallel) picks a cell to place into an empty crosspoint. In the output scheduling phase, each output (independently and in parallel) picks a non-empty crosspoint to take a cell from. The key to creating a scheduling algorithm is determining the input and output scheduling policy, which decides how inputs and outputs pick cells in the scheduling phases. We will see a number of different policies, each of which provides a different scheduling algorithm.

The first algorithm we'll consider is the most general. In each scheduling phase, the input picks any non-empty VOQ, and the output picks any non-empty crosspoint.
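As an illustration of the arbitrary scheduling algorithm, the following is a minimal sketch of one scheduling phase. The names `scheduling_phase`, `voq`, and `crosspoint` are hypothetical, and the uniformly random choice is only one way of realizing "any"; the sketch assumes an input may only place a cell into a crosspoint that is currently empty.

```python
import random


def scheduling_phase(voq, crosspoint, rng):
    """Run one scheduling phase of an N x N buffered crossbar (illustrative sketch).

    voq[i][j]        -- number of cells queued at input i for output j
    crosspoint[i][j] -- 1 if crosspoint buffer Bij holds a cell, else 0
    rng              -- random.Random instance used to make the "arbitrary" picks
    Returns the list of outputs that received a cell in this phase.
    """
    n = len(voq)

    # Input scheduling: each input independently picks any non-empty VOQ
    # whose crosspoint is empty and moves one cell into that crosspoint.
    for i in range(n):
        candidates = [j for j in range(n)
                      if voq[i][j] > 0 and crosspoint[i][j] == 0]
        if candidates:
            j = rng.choice(candidates)
            voq[i][j] -= 1
            crosspoint[i][j] = 1

    # Output scheduling: each output independently picks any non-empty
    # crosspoint in its column and takes the cell from it.
    delivered = []
    for j in range(n):
        full = [i for i in range(n) if crosspoint[i][j] == 1]
        if full:
            i = rng.choice(full)
            crosspoint[i][j] = 0
            delivered.append(j)
    return delivered
```

With a speedup of two, this function would be called twice per time slot. Note that neither loop needs any information from another input or output, which is what allows the two stages to be distributed.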
We will adopt the following notation and definitions. The switch has N ports, and VOQij holds cells at input i destined for output j.
the longer it might take to be transferred to the output. Many orderings of the cells are possible, each ordering leading to a different switch scheduling algorithm.
In addition, each output also maintains an output priority list: an ordered list of cells at the inputs waiting to be transferred to a particular output. The output priority list is constructed based on the order in which the cells would depart from the OQ switch. This priority list will depend on the queueing policy followed by the OQ switch (i.e., WFQ, strict priorities, etc.).
The following definitions, introduced in [2], are necessary for understanding the rest of the paper.
Definition 3: Output Cushion - At any time, the output cushion of a cell c, OC(c), is the number of cells at c's output port that have an earlier departure order than cell c.
[...], L(c) = OC(c) - IT(c), where IT(c) is the input thread of cell c.
The slackness reflects the urgency with which we must transfer the cell to its output. The key to identical behavior is to find scheduling algorithms for which the slackness is always non-negative. Although not strictly necessary, this will ensure that when a cell is transferred to the output its output cushion is non-negative (or reaches zero in the time slot in which it is transferred). The idea is that when the output cushion of a cell reaches zero, the input thread of that cell must also equal zero.
This means that either: (1) the cell is already at its output, and will depart the output on time, or (2) the cell is at the head of its priority list (because its input thread is zero), and will be transferred to the output immediately, which ensures that the cell will depart the output on time.
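The slackness bookkeeping can be sketched in a few lines. This is only an illustration: the function names are hypothetical, departure orders are modelled as integers, and the input thread is assumed to be the number of cells ahead of c in its input priority list (the definition used in [2]).

```python
def output_cushion(departure, output_departures):
    """OC(c): cells at c's output port with an earlier departure order than c."""
    return sum(1 for d in output_departures if d < departure)


def input_thread(cell, input_priority_list):
    """IT(c): cells ahead of c in its input priority list (assumed definition)."""
    return input_priority_list.index(cell)


def slackness(cell, departure, input_priority_list, output_departures):
    """L(c) = OC(c) - IT(c)."""
    return (output_cushion(departure, output_departures)
            - input_thread(cell, input_priority_list))
```

For example, a cell with departure order 5 that sits third in its input priority list, while its output holds cells with departure orders 1, 2 and 7, has OC = 2, IT = 2, and slackness zero: it has just enough cushion to be transferred in time.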
Based on the input and output priority lists, the counting method requires that in each scheduling phase, at least one of the following conditions is satisfied for each cell c: (1) cell c is transferred from the input side, (2) a cell that is ahead of cell c in its input priority list is transferred from the input side, or (3) a cell that is ahead of cell c in its output priority list is transferred to the output side.
In [2], it was proved that meeting the conditions of the counting method ensures that the slackness of a cell increases by at least one in each scheduling phase, which was essential in proving that the slackness of any cell is always non-negative. However, [2] required a stable marriage algorithm [21] to meet the conditions of the counting method. We will now show how the buffered crossbar can also meet the conditions of the counting method in a simple distributed manner where each input and output makes decisions independently and in parallel.
IV. COUNTING METHOD WITH A BUFFERED CROSSBAR

In order to ensure that the slackness of a cell increases by at least one in each scheduling phase for a buffered crossbar, the input and output scheduling policies must be carefully selected to guarantee that the conditions of the counting method are met. The input scheduling policy gives preference to cells based on the input priority list. Similarly, the output scheduling policy gives preference to cells based on the output priority list. Since the output priority list is ordered based on departure order, preference is given to cells with an earlier departure order.

However, the buffered crossbar has an additional requirement to meet the conditions of the counting method. The input priority list also must be arranged so that cells destined to the same output are ordered based on departure order. Specifically, cells to the same output with an earlier departure order must have a higher priority. Cells to different outputs can still be ordered in any way. This requirement is necessary to ensure that, in the output scheduling phase, the cell selected has the earliest departure order of the cells stored in the input queues corresponding to the non-empty crosspoints, as can be seen in the following example.

Let cells a, b, and c all be destined to output j. Cell a is stored in input queue i1, cell b is stored in input queue i2, cell c is stored in crosspoint Bi1,j, and no other cells are destined to output j at time t. The departure order is ta < tb < tc for cells a, b, and c respectively. In the input scheduling phase, input i1 does not select cell a since cell c is already in the crosspoint Bi1,j, and input i2 selects cell b. In the output scheduling phase, cell b is selected since it has an earlier departure order than cell c. As a result, the conditions of the counting method are not met for cell a, since cell b, which has a later departure order, does not have a higher priority than cell a in the output priority list. This occurred because at some point in time cell c was incorrectly given a higher priority than cell a in the input priority list. This motivates the following "Group By Virtual Output Queue" (GBVOQ) insertion policy previously described in [2].
GBVOQ Insertion Policy:
1) When a cell arrives to a non-empty VOQ, the cell is inserted in the input priority list just behind the last cell belonging to the same VOQ. This ensures that cells destined to the same output are ordered based on departure order.
2) When a cell arrives to an empty VOQ, the cell is inserted at the head of the input priority list.
At first glance, it seems unfair to insert a cell which arrives to an empty VOQ at the head of the input priority list. However, it is possible that there are no other cells in the system destined to that output. Therefore, the cell may immediately need to be transferred to the output in order to keep that output busy.
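The two GBVOQ rules translate directly into a list insertion. The sketch below is illustrative only: `gbvoq_insert` and the `(cell_id, output)` tuple representation are hypothetical choices, with one list kept per input and the head of the list at index 0.

```python
def gbvoq_insert(priority_list, cell):
    """Insert an arriving cell into one input's priority list using GBVOQ.

    priority_list -- list of (cell_id, output) tuples, head of the list first
    cell          -- (cell_id, output) tuple for the arriving cell
    """
    _, output = cell

    # Find the last cell already queued for the same output (the same VOQ).
    last_same_voq = None
    for idx, (_, queued_output) in enumerate(priority_list):
        if queued_output == output:
            last_same_voq = idx

    if last_same_voq is None:
        # Rule 2: the VOQ is empty, so the cell goes to the head of the list.
        priority_list.insert(0, cell)
    else:
        # Rule 1: the VOQ is non-empty, so the cell goes just behind its last cell.
        priority_list.insert(last_same_voq + 1, cell)
```

Inserting cells a (output 0), b (output 1), c (output 0), d (output 1) in that order yields the list b, d, a, c: cells to the same output stay in arrival order, while each first arrival to an empty VOQ jumps to the head.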
We will now prove in the following lemma that the buffered crossbar can satisfy the conditions of the counting method.

[...] then we no longer need to consider it. If a different cell is transferred to its crosspoint, that cell belongs to c's input thread, and IT(c) will decrease by one.

Case 2: If Bij = 1, then a cell will be transferred from one of the crosspoints B*j to output j in the output scheduling phase. By definition of the GBVOQ insertion policy, the cell in the crosspoint Bij has an earlier departure order than cell c. Since the output scheduling policy selects the non-empty crosspoint that contains the cell with the earliest departure order, OC(c) increases by one.

Therefore, L(c) increases by at least one per scheduling phase.
The counting method using the GBVOQ insertion policy can be applied trivially to show that a buffered crossbar can mimic a restricted PIFO-OQ switch, i.e., a PIFO-OQ switch with the restriction that cells from an input/output pair depart the switch in the order they arrive. This restricted policy includes output link schedulers which are fair across all inputs, i.e., provide rate guarantees between each input/output pair.

Theorem 2: (Sufficiency) A buffered crossbar with a speedup of two can mimic the restricted PIFO-OQ switch, regardless of the incoming traffic pattern.
Proof: See Appendix B.

[...] Similarly, in the WRR buffered crossbar, a virtual finishing time needs to be assigned to each cell, so as to determine the correct departure order. The problem is that cells are buffered at both inputs and outputs (and in the crossbar). Calculating the virtual finishing time when the cell arrives would require the input to have information about the cell's output, and all the cells at other inputs destined to it. This is impractical. Fortunately, with our restricted definition of flows, cells are held in the input priority list in their departure order and, as we'll show below, it is sufficient for the output to assign a virtual finish time only when cells reach the crosspoint.
The output needs to know, upon arrival of the kth cell to the switch, whether the (k-1)th cell (from this input to the given output) has departed. If it has departed, then the kth cell is transferred to the crosspoint immediately and is assigned a virtual finish time based only on the current virtual time. If it has not departed, then either the kth cell is not transferred immediately or the (k-1)th cell must be in the output queue. Therefore, the kth cell is assigned a virtual finish time based on the virtual finish time of the (k-1)th cell. We will formalize these cases in the proof.
VI. DELAY GUARANTEES IN A BUFFERED CROSSBAR
In practice, an input/output pair carries many flows, not just one. For example, it carries TCP flows between source/destination pairs, and we might want to give each flow a different rate or delay guarantee. In order to do this, we need to relax our constraint on the definition of the flow, and determine how to assign a different rate to each flow. This is what we will do next; and we'll see that it increases the complexity of the buffered crossbar and requires more speedup.
In a PIFO-OQ switch, an arriving cell can be pushed into any location in the queue. It could, for example, be scheduled to depart ahead of all currently queued cells between the same input/output pair. In order to meet the conditions of the counting method, the cell in the crosspoint must have the earliest departure order of all cells stored in the input queue belonging to its input/output pair. This causes problems for the buffered crossbar switch. Imagine the situation in which a crosspoint has a cell in it, and an arriving cell has an earlier departure order than the cell in the crosspoint. This causes what we call "crosspoint blocking", since the arriving cell cannot overtake the cell in the crosspoint.
If each crosspoint had a cell buffer for each flow, crosspoint blocking could be avoided. However, this doesn't scale for a large number of flows. We now show how a buffered crossbar can overcome crosspoint blocking in a manner which is independent of the number of flows between an input/output pair.
A. Delay guarantees with speedup three
When a cell arrives to an input with an earlier departure order than the cell in the crosspoint buffer, we will swap the cell in the crosspoint with the newly arriving cell. Logically, the cell that was previously in the crosspoint is recalled to the input, where it is treated like a newly arriving cell. By modifying the arrival phase to include swapping, crosspoint blocking can be avoided. This is at the expense of additional speedup to perform the swap. The modified arrival phase requires a new insertion policy. This policy needs to meet two requirements: (1) To prevent crosspoint blocking, cells from an input/output pair must be inserted based on their departure order. (2) The slackness of a cell must be non-negative when inserted.
The Insertion Policy: As a consequence of the first requirement, an arriving cell c destined to output port j is inserted behind all cells destined to output j with a departure order less than that of cell c. To satisfy the second requirement, cell c is inserted immediately behind the cell that departs before it (if it exists), destined to the same output. If no such cell exists, cell c is inserted at the head of the priority list. This ensures that the slackness of cell c is non-negative.
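The insertion policy for the modified arrival phase can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `pifo_insert` and the `(cell_id, output, departure)` tuple layout are hypothetical, departure orders are modelled as integers, and the swap with the crosspoint is assumed to have been handled before this function is called.

```python
def pifo_insert(priority_list, cell):
    """Insert a cell into one input's priority list by PIFO departure order.

    priority_list -- list of (cell_id, output, departure) tuples, head first;
                     cells to the same output are already in departure order
    cell          -- (cell_id, output, departure) tuple to insert
    """
    _, output, departure = cell

    # Locate the cell to the same output that departs immediately before
    # the new cell: the last same-output entry with a smaller departure order.
    predecessor = None
    for idx, (_, queued_output, queued_departure) in enumerate(priority_list):
        if queued_output == output and queued_departure < departure:
            predecessor = idx

    if predecessor is None:
        # No earlier cell to this output exists: insert at the head.
        priority_list.insert(0, cell)
    else:
        # Insert immediately behind the cell that departs before it.
        priority_list.insert(predecessor + 1, cell)
```

Because the new cell lands directly behind its predecessor, it also lands ahead of every queued cell to the same output with a later departure order, so the per-output ordering is preserved after every insertion.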
The priority list defined by this insertion policy has the property that cells from input i to output j are ordered based on their PIFO departure order. An example is shown in Figure 3.
Theorem 4: (Sufficiency) A buffered crossbar can mimic a PIFO-OQ switch (and hence give delay guarantees) with speedup three, regardless of the incoming traffic pattern.
Proof: See Appendix D.
B. Delay guarantees with speedup two
We overcame crosspoint blocking by swapping the cell in a crosspoint with a newly arriving cell. This was necessary because we allowed cells to be transferred to the buffered crossbar even before they were scheduled to depart. This early transfer was the cause of crosspoint blocking, and thus required swapping. But we could eliminate the need for swapping if we avoided transferring a cell to the crosspoint until it was really ready to be transferred to the output.
[...] OQ switch with a fixed delay of N/2 time slots.
Proof: See Appendix E.
The result comes at the expense of a more complicated buffering scheme in the crossbar and requires N cells of buffering per output. Since these N cells can arrive in the same scheduling phase, there is an additional implementation complexity.
This can be eliminated by modifying the buffered crossbar so it has N cells for each Bij, for a total of N^3 cells. In the latter case, no more than one cell can arrive to a crosspoint in each scheduling phase. While requiring more storage, it will also mimic a PIFO-OQ switch with a fixed delay of N/2 time slots with speedup two. This might be practical for small values of N. In both modified buffered crossbar architectures, the number of crosspoints is independent of the number of flows in the switch.
VII. CONCLUSIONS
It is hard to scale crossbar-based routers because the scheduler for a crossbar must resolve the input and output constraints simultaneously. Whereas centralized schedulers get very complicated, the scheduler for a buffered crossbar allows inputs and outputs to make decisions independently and in parallel. With speedup two, and scheduling algorithms which are distributed and easy to implement, buffered crossbars provide throughput, rate and delay guarantees.
Although the crossbar is more complex than before, the bandwidth and pin count are the same as before, the CIOQ architecture is maintained, and no memory needs to run faster than twice the line rate. This provides a simple path to scale crossbar-based routers.

[...]

Proof: This is a straightforward extension of Foster's [...]
We will use the above lemma in proving Theorem 1.
Theorem 1: Under an arbitrary scheduling algorithm, the buffered crossbar gives 100% throughput with speedup of two.

Proof: In the rest of the proof we will assume that all indices i, j, k vary over 1, 2, ..., N. Denote the occupancy of VOQij at time n by Xij(n). Also, let Zij(n) denote the combined occupancy of VOQij and the crosspoint Bij at time n. By definition, Zij(n) = Xij(n) + Bij(n).

[...]

Observe that from (3) [...]

Denote Dij(n) = 1 if a cell departs from VOQij at time n and zero otherwise. Also, let Aij(n) = 1 if a cell arrives to VOQij and zero otherwise. Then,

Xij(n + 1) = Xij(n) + Aij(n) - Dij(n).

Henceforth, we will drop the time n from the symbols for Dij(n) and Aij(n), and refer to them as Dij and Aij respectively, since in the rest of the proof we will only be concerned with the arrivals and departures of cells at time n.
Since |Aij - Dij| ≤ 1 and similarly |Aik - Dik| ≤ 1, we get¹⁰

[...] (6)

Denote Eij(n) = 1 if a cell departs from the combined queue of VOQij and the crosspoint Bij, and zero otherwise. Note that Eij(n) = 1 only when a cell departs from the crosspoint Bij to the output at time n, since all departures to the output must occur from the crosspoint. Also recall that the arrival rate to the combined queue, VOQij and Bij, is the same as the arrival rate to VOQij. So we can write

Zij(n + 1) = Zij(n) + Aij(n) - Eij(n).

Again we will drop the time n from the symbols for Eij(n) and Aij(n), and refer to them as Eij and Aij respectively. Then, similar to the derivation in (6), we can derive using [...]

[...] (7)

So from (6) and (7) [...]

[...] (8)

In Section II, it was shown that for a buffered crossbar with speedup of two, Rij is strictly negative when Xij(n) > 0 and the traffic is admissible. So the first product term inside the summation sign in equation (8) [...]

Similarly, if the traffic is admissible, then Σk E[Akj] [...]. Also, when Bij(n) = 1, then from (1) and case 1 of Theorem 1 in Section II, we know that the output j will receive at least one cell, and so at least one cell must have departed one of the crosspoints destined to output j at time n. And so when the traffic is admissible and Bij(n) = 1, then Sj < 0. This implies that the second product term inside the summation sign in equation (8), [...]

In both cases, Xij(n)Rij and Bij(n)Sj are equal to zero only if Xij = 0 and Bij = 0 respectively. Now we want to use Lemma 2 and show that the whole right hand side of equation (8) is strictly negative. All that needs to be done is to ensure that one of the VOQs Xij in the summation in equation (8) is large enough so that 3Xij(n)Rij can negate the positive [...]

In order to show this, let [...]. Choose any γ' > 0, and let [...]. As shown in Section II, when Xij > 0, [...]. Therefore, we have [...]. If we substitute this in equation (8), then for all n such that [...]. But we also have from equation (1), [...]. Let γ correspond to the variable in Lemma 2 and set γ = [...]. Also, it is easy to see that [...]

[...] scheduling algorithm gives 100% throughput.

¹⁰ This is in fact the conditional expectation given knowledge of the state of all queues and crosspoints at time n. For simplicity in the rest of the proof (since we only use the conditional expectation), we will drop the conditional expectation sign and simply use the symbol for expectation, as its meaning is clear.
APPENDIX B PROOF FOR THEOREM 2
Before we prove the theorem, we will need the following lemmas.
Lemma 3: [...]
Proof: Consider any cell x that is inserted with a slackness of L(x). Following the arrival phase, L(x) increases by at least one in each of the two scheduling phases. And in the departure phase, L(x) will decrease by one. Therefore, at the end of the time slot, L(x) has increased by at least one. For example, if arriving cell x is inserted with a slackness of zero, then at the end of the time slot, the slackness of cell x will be at least one.

[...] From Lemma 3 and the fact that the slackness of an arriving cell will increase by one at the end of the time slot relative to the slackness of the cell when it arrived, we know that if the slackness of a cell is less than one, then its slackness must have been negative when the cell was inserted. Let t be the first time that an arriving cell c is inserted with negative slackness.
Consider two cases:
Case 1: If cell c was inserted at the head of the priority list, IT(c) is zero. Since the output cushion is defined as a non-negative value, the slackness of the cell is non-negative when inserted, which contradicts our assumption.
Case 2: If cell c was not inserted at the head of the priority list, cell c must have been inserted immediately behind another cell, c', destined to the same output as cell c.
Since c' was inserted before time t, it must have been inserted with non-negative slackness. At the end of the time slot in which cell c' was inserted, its slackness had increased by one. From Lemma 3, the slackness of cell c' is still at least one at time t. But since IT(c) = IT(c') + 1, and OC(c) ≥ OC(c') (every cell at the output that departs before c' also departs before c), we have L(c) ≥ L(c') - 1 ≥ 0.
So the slackness of cell c must also be non-negative when inserted, which again contradicts our assumption.
Theorem 2: (Sufficiency) A buffered crossbar with a speedup of two can mimic the restricted PIFO-OQ switch, regardless of the incoming traffic pattern.
Proof: Suppose that the CIOQ switch has successfully mimicked the OQ switch up until time slot t - 1. Consider the beginning of time slot t. We must show that any cell reaching its departure time is either: (1) already at the output side of the switch, or (2) will be transferred to the output during time slot t.
From Lemma 3 and Lemma 4, we know that a cell always has a non-negative slackness. Therefore, when a cell reaches its departure time (i.e., its output cushion has reached zero), its input thread must also equal zero. This means either: (1) that the cell is already at its output, and will depart on time, (2) that the cell is in the crosspoint buffer, or (3) that the cell is simultaneously at the head of its input priority list (because its input thread is zero) and has the earliest departure time (because it has reached its departure time). In case (3), the input scheduling phase is guaranteed to transfer the cell to the crosspoint. Since the cell is in the crosspoint after the input scheduling phase in both cases (2) and (3), and has the earliest departure time, it will be selected in the output scheduling phase. The cell will then reach the output during the time slot, and therefore the cell departs on time.
APPENDIX C
PROOF FOR THEOREM 3
In what follows, consider the following virtual finish time assignment policy when a cell arrives to the crosspoint.
Assume a cell c arrives to the crosspoint Bij at time t. Without loss of generality, let this be the kth cell from input i to output j.
Case 1: If the (k-1)th cell is still present in output j of the buffered crossbar, then the output of the buffered crossbar will assign the virtual finish time F^(k-1) + [...].
Case 2: If the (k-1)th cell is not present in output j and the kth cell is not transferred to the crosspoint in the scheduling phase immediately after its arrival, then the output of the buffered crossbar will assign a virtual finish time of F^(k-1) + [...].
Case 3: If the (k-1)th cell is not present in output j of the buffered crossbar and the kth cell is transferred to the crosspoint immediately after its arrival, then the output of the buffered crossbar will assign the virtual finish time V(t) + [...].
We further assume that the buffered crossbar has speedup two and the output picks cells with the smallest virtual finish time from the non-empty crosspoints.
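The three-case assignment policy can be written as a single function. This is a hedged sketch: the function and parameter names are hypothetical, and because the per-cell increment added to the finish time is not legible in the text above, it is passed in as the parameter `increment` rather than computed (in weighted fair queueing schemes it is commonly the reciprocal of the flow's weight, but that is an assumption here).

```python
def assign_virtual_finish_time(prev_finish, prev_in_output,
                               transferred_immediately, virtual_time,
                               increment):
    """Virtual finish time for the kth cell of an input/output pair (sketch).

    prev_finish             -- virtual finish time of the (k-1)th cell
    prev_in_output          -- True if the (k-1)th cell is still in output j
    transferred_immediately -- True if the kth cell reached the crosspoint in
                               the scheduling phase right after its arrival
    virtual_time            -- current virtual time V(t)
    increment               -- per-cell increment (assumed parameter)
    """
    if prev_in_output:
        # Case 1: the (k-1)th cell is still queued at the output.
        return prev_finish + increment
    if not transferred_immediately:
        # Case 2: the (k-1)th cell has left the output, but the kth cell
        # had to wait at the input before reaching the crosspoint.
        return prev_finish + increment
    # Case 3: no backlog for this input/output pair; start from V(t).
    return virtual_time + increment
```

Cases 1 and 2 are kept as separate branches to mirror the text, even though they produce the same value.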
Lemma 5: The virtual finish time of every cell c is the same in the WRR-buffered crossbar switch and the WRR-OQ switch.
Proof: Assume that the buffered crossbar has correctly calculated the virtual finish time of all cells which have arrived to the crosspoints up until time t - 1, and the outputs have chosen the cells from their crosspoints which have the smallest finish time in every scheduling phase. From the results in Section IV, this means that the buffered crossbar, with a speedup of two, has mimicked the WRR-OQ switch up until time t - 1. Let t be the first time that the virtual finish time calculated for a cell is different from the virtual finish time calculated by the WRR-OQ switch. Suppose that the virtual finish time of cell c, which arrives to the crosspoint Bij at time t, was incorrectly calculated. Without loss of generality, let this be the kth cell from input i to output j. We consider three cases.
Case 1: If the (k-1)th cell is still present in output j of the buffered crossbar, then it was also present in the WRR-OQ switch when the kth cell arrived. So both the WRR-OQ switch and the output of the buffered crossbar will assign the same virtual finish time, F^(k-1) + [...], which contradicts our assumption.
Case 2: If the (k-1)th cell is not present in output j and the kth cell is not transferred to the crosspoint in the scheduling phase immediately after its arrival, then it must have been inserted behind the (k-1)th cell in the input priority list, or it was inserted at the head of the input priority list but the crosspoint contained the (k-1)th cell. Since the buffered crossbar switch has mimicked the WRR-OQ switch up until time t - 1, this means that the (k-1)th cell was also present in the WRR-OQ switch at time t. The output of the buffered crossbar assigns a virtual finish time of F^(k-1) + [...], which matches the virtual finish time assigned by the WRR-OQ switch. The assignment is the same, which contradicts our assumption.
Case 3: If the (k-1)th cell is not present in output j of the buffered crossbar and the kth cell is transferred to the crosspoint immediately after its arrival, then since the buffered crossbar switch has mimicked the WRR-OQ switch up until time t - 1, neither switch has cells in the system from input i destined to output j. So both the WRR-OQ switch and the output of the buffered crossbar will assign the same virtual finish time, V(t) + [...], which again contradicts our assumption.
So the virtual finish time of a cell at time t can also be correctly calculated.

The above lemma, and the fact that the WRR-OQ switch is a special case of the restricted PIFO-OQ policy, imply the following theorem.
Theorem 3: (Sufficiency) A buffered crossbar can mimic an OQ switch using a weighted round robin policy with speedup two, regardless of the incoming traffic pattern.

APPENDIX D
PROOF FOR THEOREM 4
Before we prove the theorem, we will need the following lemmas.
Lemma 6: After the modified arrival phase, all cells in the crosspoints Bij will have an earlier departure order than any cell queued at input i destined for output j.
Proof: Assume that the above property holds up until time t - 1. Let t be the first time that any cell c in the crosspoint does not have the earliest departure order as compared to any cell queued at input i destined for output j. At time t, there can be at most one newly arriving cell c to an input.
If the arriving cell has an earlier departure order than the cell in the corresponding crosspoint, then the modified arrival phase allows cell c to swap with the cell in the corresponding crosspoint, which contradicts our assumption.
Lemma 7: The slackness L(c) of a cell c increases by at least one in each scheduling phase.
Proof: Since Lemma 6 guarantees that the cell in the crosspoint has the earliest departure order compared with any cell queued in the corresponding input queue, Lemma 1 still holds.
Lemma 8: The slackness L(c) of a cell c waiting on the input side is non-decreasing from time slot to time slot.
Proof: Given Lemma 7, the only other difference as compared to Lemma 3 is in the modified arrival phase. Irrespective of whether a swap occurred or not, there is only one newly arriving cell to deal with: if a swap does not occur, it is the cell which just arrived at the input; if a swap occurs, it is the cell recalled from the swapped crosspoint. The rest of the proof is similar to Lemma 3.
Lemma 9: The slackness L(c) of a newly arriving cell c is non-negative.
Proof: As described in Lemma 8, we only need to be concerned about inserting one newly arriving cell to the priority list at the input, irrespective of whether a swap occurred or not. The rest of the proof is similar to Lemma 4.
Theorem 4: A buffered crossbar can mimic a PIFO-OQ switch (and hence give delay guarantees) with speedup three, regardless of the incoming traffic pattern.
Proof: Given Lemma 8 and Lemma 9, the proof is exactly the same as the proof for Theorem 2.
APPENDIX E
PROOF FOR THEOREM 5
Proof: An input can receive at most p + N - 1 grants over any p consecutive scheduling phases. If the input adds new grants to the tail of a grant FIFO, and reads one grant from the head of the grant FIFO in each scheduling phase, then the grant FIFO will never contain more than N - 1 grants. Each time the input takes a grant from the grant FIFO, it sends the corresponding cell to the set of N crosspoints for its output. Because the grant FIFO is served once per phase, a cell that is granted at scheduling phase p will reach the output crosspoint by phase p + N - 1.
We need to verify that the per-output buffers in the crossbar never overflow. If the crosspoint scheduler issues a grant at phase p, then the corresponding cell will reach the output crosspoint between phases p and p + N - 1. Therefore, during scheduling phase p, the only cells which can be in the output crosspoint are cells which were granted between phases p - N and p - 1. With N buffers per output, the buffers will never overflow, and each cell faces a delay of at most N scheduling phases, or N/2 time slots (because S = 2).
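The grant FIFO bound used in this proof can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, occupancy is measured at the end of each scheduling phase (after the input has added that phase's grants and read one from the head), and the example grant pattern is an assumed worst case, not one taken from the paper.

```python
from collections import deque


def satisfies_grant_bound(grants_per_phase, n_ports):
    """Check that every window of p consecutive phases has at most p + N - 1 grants."""
    total = len(grants_per_phase)
    for start in range(total):
        running = 0
        for end in range(start, total):
            running += grants_per_phase[end]
            window = end - start + 1
            if running > window + n_ports - 1:
                return False
    return True


def max_grant_fifo_occupancy(grants_per_phase):
    """Simulate the grant FIFO and return its largest end-of-phase occupancy."""
    fifo = deque()
    next_grant_id = 0
    max_occupancy = 0
    for new_grants in grants_per_phase:
        # New grants are appended to the tail of the FIFO.
        for _ in range(new_grants):
            fifo.append(next_grant_id)
            next_grant_id += 1
        # One grant is read from the head in each scheduling phase.
        if fifo:
            fifo.popleft()
        max_occupancy = max(max_occupancy, len(fifo))
    return max_occupancy


N = 4
# Assumed worst case: a burst of N grants, then one grant in every later phase.
pattern = [N] + [1] * 10
```

For this pattern, `satisfies_grant_bound(pattern, N)` holds and the FIFO occupancy stays at N - 1 after the initial burst, so the bound in the proof is met with equality.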
