Abstract: Crossbar-based switches are commonly used to implement routers with throughputs up to about 1 Tb/s. The advent of crossbar scheduling algorithms that provide strong performance guarantees now makes it possible to engineer systems that perform well, even under extreme traffic conditions. Until recently, such performance guarantees have only been developed for crossbars that switch cells rather than variable length packets. Cell-based crossbars incur a worst-case bandwidth penalty of up to a factor of two, since they must fragment variable length packets into fixed length cells. In addition, schedulers for cell-based crossbars may fail to deliver the expected performance guarantees when used in routers that forward packets. We show how to obtain performance guarantees for asynchronous crossbars that are directly comparable to those previously developed for synchronous, cell-based crossbars. In particular, we define derivatives of the Group by Virtual Output Queue (GVOQ) scheduler of Chuang et al. and the Least Occupied Output First Scheduler of Krishna et al. and show that both can provide strong performance guarantees in systems with speedup 2. Specifically, we show that these schedulers are work-conserving and that they can emulate an output-queued switch using any queueing discipline in the class of restricted Push-In, First-Out queueing disciplines. We also show that there are schedulers for segment-based crossbars (introduced recently by Katevenis and Passas) that can deliver strong performance guarantees with small buffer requirements and no bandwidth fragmentation.
I. INTRODUCTION
CROSSBAR switches have long been a popular choice for transferring data from inputs to outputs in mid-range performance switches and routers [1]. Unlike bus-based switches, crossbars can provide throughputs approaching 1 Tb/s, while allowing individual line cards to operate at speeds comparable to the external links.
However, the control of high performance crossbars is challenging, requiring crossbar schedulers that match inputs to outputs in the time it takes for a minimum length packet to be forwarded. The matching selected by the scheduler has a major influence on system performance, placing a premium on algorithms that can produce high quality matchings in a very short period of time.
Traditionally, crossbar schedulers have been evaluated largely on the basis of how they perform on random traffic arrival patterns that do not cause long term overloads at inputs or outputs. Most often, such evaluations have been carried out using simulation [14]. Recently, there has been a growing body of work providing rigorous performance guarantees for such systems [11], [15] in the context of well-behaved, random traffic. A separate thread of research concentrates on schedulers that can provide strong performance guarantees that apply to arbitrary traffic patterns [3], [8], [18], including adversarial traffic that may overload some outputs for extended periods of time. The work reported here belongs to this second category. Since the internet lacks comprehensive mechanisms to manage traffic, extreme traffic conditions can occur in the internet due to link failures, route changes or simply unusual traffic conditions. For these reasons, we argue that it is important to understand how systems perform when they are subjected to such extreme conditions. Moreover, we argue that strong performance guarantees are desirable in backbone routers, if they can be obtained at an acceptable cost.
There are two fundamental properties that are commonly used to evaluate crossbar schedulers in this worst-case sense. A scheduler is said to be work-conserving if an output link is kept busy so long as there are packets addressed to the output, anywhere in the system. A scheduler is said to be order-preserving if it is work-conserving and it always forwards packets in the order in which they arrived. A crossbar with an order-preserving scheduler faithfully emulates an ideal nonblocking switch with FIFO output queues. In their seminal paper, Chuang et al. provided the first example of an order-preserving scheduler [3] for a crossbar with small speedup, where the speedup of a crossbar switch is the ratio of its ideal throughput to the total capacity of its external links. So a crossbar with a speedup of $S$ has the potential to forward data $S$ times faster than the input links can supply it. In fact, Chuang et al. showed a stronger property: that certain schedulers can be specialized to emulate an output-queued switch that implements any one of a large class of scheduling algorithms at the outputs.
Until recently, strong performance guarantees have been available only for crossbars that forward fixed length cells. There is a sound practical justification for concentrating on such systems, since routers commonly use cell-based crossbars. Variable length packets are received at input line cards, segmented into fixed length cells for transmission through the crossbar and reassembled at the output line cards. This simplifies the implementation of the crossbar and allows for synchronous operation, which allows the scheduler to make better decisions than would be possible with asynchronous operation. Unfortunately, cell-based crossbar schedulers that deliver strong performance guarantees when viewed from the edge of the crossbar can fail to deliver those guarantees for the router as a whole. For example, a system using a work-conserving cell-based scheduler can fail to keep an outgoing link busy, even when there are complete packets for that output present in the system.
We show that strong performance guarantees can be provided for packets, using asynchronous crossbars that directly handle packets, rather than cells, if the crossbars are equipped with a moderate amount of internal buffer space. Specifically, we define packet-oriented derivatives of the Group by Virtual Output Queue algorithm (GVOQ) of [3] and the Least Occupied Output First Algorithm (LOOFA) of [8], [18] and show that they can deliver strong performance guarantees for systems with a speedup of 2. Because our crossbar schedulers operate asynchronously, we have had to develop new methods for analyzing their performance. These methods now make it possible to evaluate asynchronous crossbars in a way that is directly comparable to synchronous crossbars.
The use of buffered crossbars is not new. An early ATM switch from Fujitsu used buffered crossbars, for example [17]. However, most systems use unbuffered crossbars, because the addition of buffers to each of the crosspoints in an $N \times N$ crossbar has been viewed as prohibitively expensive. There has recently been renewed interest in buffered crossbars [4], [6], [9], [10], [12], [16], [19], [20]. A recent paper by Chuang et al. [4] advocates the use of buffers in cell-based crossbars in order to reduce the complexity of the scheduling algorithms. The authors argue that ongoing improvements in electronics now make it feasible to add buffering to a crossbar, without requiring an increase in the number of integrated circuit components. Hence, the cost impact of adding buffering is no longer a serious obstacle. Our results add further weight to the case for buffered crossbars, as the use of buffering allows inputs and outputs to operate independently and asynchronously, allowing variable length packets to be handled directly. Katevenis et al. [9], [10] have also advocated the use of buffered crossbars for variable length packets and have demonstrated their feasibility by implementing a 32 port buffered crossbar with 2 KB buffers at each crosspoint.
Section II discusses the differences between switching cells and switching packets, and explains how buffered crossbars are particularly advantageous for systems that directly switch packets. Section III defines the terminology and notation used in the analysis to follow. Section IV collects several key lemmas that are used repeatedly in the analysis. Section V presents strong performance guarantees for a packet variant of the Group by Virtual Output Queue crossbar scheduler. Section VI presents a similar set of guarantees for a packet variant of the Least Occupied Output First scheduler. Section VII explains how our asynchronous crossbar scheduling algorithms can be used in systems that switch variable length segments rather than cells, reducing the amount of memory required by crossbar buffers by more than an order of magnitude. Finally, Section VIII provides some closing remarks, including a discussion of several ways this work can be extended.
II. SWITCHING PACKETS VS. CELLS
As noted in the introduction, most crossbar-based routers segment packets into cells at input line cards before forwarding them through the crossbar to output line cards, where they are reassembled into packets. This enables synchronous operation, allowing the crossbar scheduler to make decisions involving all inputs and outputs at one time.
Unfortunately, cell-based crossbars have some drawbacks. One is simply the added complication of segmentation and reassembly. More seriously, the segmentation of packets into cells can lead to degraded performance if the incoming packets cannot be efficiently packed into fixed length cells. In the worst-case, arriving packets may be slightly too large to fit into a single cell, forcing the input line cards to forward them in two cells. This effectively doubles the bandwidth that the crossbar requires in order to handle worst-case traffic. While one can reduce the impact of this problem by allowing parts of more than one packet to occupy the same cell, this adds complexity and does nothing to improve performance in the worst-case.
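The worst-case fragmentation penalty described above can be illustrated with a short calculation. The following is a minimal sketch, not taken from the paper: the function names and the 64-byte cell size are assumptions chosen only for illustration. It counts the cells needed to carry one packet and the resulting ratio of crossbar bytes to packet bytes.

```python
import math


def cells_needed(packet_bytes: int, cell_bytes: int) -> int:
    """Number of fixed-length cells needed to carry one packet."""
    return math.ceil(packet_bytes / cell_bytes)


def bandwidth_expansion(packet_bytes: int, cell_bytes: int) -> float:
    """Crossbar bytes consumed per byte of packet data."""
    return cells_needed(packet_bytes, cell_bytes) * cell_bytes / packet_bytes


if __name__ == "__main__":
    CELL = 64  # hypothetical cell size in bytes (an assumption)
    for size in (64, 65, 128, 129):
        n = cells_needed(size, CELL)
        ratio = bandwidth_expansion(size, CELL)
        print(f"packet={size}B cells={n} expansion={ratio:.3f}")
```

A packet that exactly fills a cell has an expansion of 1, while a packet one byte longer than a cell needs two cells, so its expansion approaches 2. This is the factor-of-two penalty the text refers to.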
In addition, crossbar schedulers that operate on cells, without regard to packet boundaries, can fail to deliver the expected guarantees from the perspective of the system as a whole. In a system that uses a cell-based crossbar scheduler, an output line card can typically begin transmission of a packet on its outgoing link only after all cells of the packet have been received. Consider a scenario in which $n$ input line cards receive packets of length $L$ at time $t$, all addressed to the same output. If the length of the cell used by the crossbar is $C$, each packet must be segmented into $\lceil L/C \rceil$ cells for transmission through the fabric. A crossbar scheduler that operates on cells has no reason to prefer one input over another. Assuming that it forwards cells from each input in a fair fashion, at least $n(\lceil L/C \rceil - 1) + 1$ cells must pass through the crossbar before the output line card has a complete packet that it can forward on the output link. While some delay between the arrival of a packet and its transmission on the output link is unavoidable, delays that are substantially longer than the time it takes to receive a packet on the link are clearly undesirable. In this situation, the delay is about $n$ times larger than the time taken for the packet to be received. Interestingly, one can obtain strong performance guarantees for packets using cell-based schedulers that are packet-aware. We discuss this in Section VII.
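The scenario above can be checked with a small simulation. This is a hedged sketch under stated assumptions: "fair" is modeled as strict round-robin over the inputs, one cell per turn, and the names `n_inputs`, `packet_len` and `cell_len` are illustrative rather than the paper's notation. It counts how many cells cross the fabric before the output holds its first complete packet.

```python
import math


def cells_until_first_packet(n_inputs: int, packet_len: int, cell_len: int) -> int:
    """Cells forwarded before any one packet is completely at the output.

    Assumes every input holds one packet of the same length for the same
    output and that the scheduler serves inputs in strict round-robin order.
    """
    cells_per_packet = math.ceil(packet_len / cell_len)
    sent = [0] * n_inputs  # cells forwarded so far from each input
    total = 0
    while True:
        for i in range(n_inputs):
            sent[i] += 1
            total += 1
            if sent[i] == cells_per_packet:
                return total


if __name__ == "__main__":
    # Hypothetical numbers: 8 inputs, 1500-byte packets, 64-byte cells.
    print(cells_until_first_packet(8, 1500, 64))
```

Under these assumptions every input has sent all but one of its cells before the first packet completes, so the count grows in proportion to the number of competing inputs.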
There are a few previous studies of the performance of bufferless crossbars that switch packets, rather than cells. References [5], [13] focus on performance for well-behaved random traffic, so are not directly comparable to the results presented here. On the other hand, [2] studies packet-mode emulation of unbuffered crossbars and shows that strong performance guarantees can be obtained for such systems. However, the frame-based scheduling methods used in [2] impose a delay that can be several orders of magnitude larger than the very modest delays imposed by the schedulers studied here.
Asynchronous crossbars offer an alternative to cell-based crossbars. They eliminate the need for segmentation and reassembly and are not subject to bandwidth fragmentation, allowing one to halve the worst-case bandwidth required by the crossbar. Unfortunately, there is no obvious way to obtain strong performance guarantees for unbuffered asynchronous crossbars, since the ability of the scheduler to coordinate the movement of traffic through the system seems to depend on its ability to make decisions involving all inputs and outputs at one time. A scheduler that operates on packets must deal with the asynchronous nature of packet arrivals, and must schedule packets as they arrive and as the inputs and outputs of the crossbar become available. In particular, if a given input line card finishes sending a packet to the crossbar at time $t$, it must then select a new packet to send to the crossbar. It may have packets that it can send to several different outputs, but its choice of output is necessarily limited to those outputs that are not currently receiving packets from other inputs. This can prevent it from choosing the output that it would prefer, were its choices not so constrained. One can conceivably ameliorate this situation by allowing an input to select an output that will become available in the near future, but this adds complication and sacrifices some of the crossbar bandwidth. Moreover, it is not clear that such a strategy can lead to a scheduling algorithm with good worst-case performance and small speedup.
The use of buffered crossbars offers a way out of this dilemma. The addition of buffers to each crosspoint of an $N \times N$ crossbar effectively decouples inputs from outputs, enabling the asynchronous operation that variable length packets seem to require. A diagram of a system using a buffered crossbar is shown in Fig. 1. In addition to the now conventional Virtual Output Queues (VOQ) at each input, a buffered crossbar has a small buffer at each of its crosspoints. As pointed out in [4], the buffers allow inputs and outputs to operate independently, enabling the use of simpler crossbar scheduling mechanisms, but the buffers have an even greater import for asynchronous crossbars. With buffers, whenever an input finishes sending a packet to the crossbar, it can select a packet from one of its VOQs, so long as the corresponding crosspoint buffer has room for the packet. We show that crosspoint buffers of modest size are sufficient to allow strong performance guarantees with the same speedup required by cell-based schedulers.
III. PRELIMINARIES
To start, we introduce common notations that will be used in the analysis to follow. We say a packet $x$ is an $(i,j)$-packet if it arrived at input $i$ and is to be forwarded on output $j$. We let $s(x)$ denote the time at which the first bit of $x$ is received on an input link and we let $f(x)$ be the time at which the last bit is received. We let $L(x)$ denote the number of bits in $x$ and $L_M$ denote the maximum packet length (in bits). The time unit is the time it takes for a single bit to be transferred on an external link, so $f(x) = s(x) + L(x)$.
The time at which a new packet is selected by an input and sent to the crossbar is referred to as an input scheduling event. We also define the time at which an active period ends to be an input event. The time at which an output selects a packet from one of its crosspoint buffers is referred to as an output scheduling event. We use event to refer to either type, when the type is clear from the context.
We let $V_{i,j}$ denote the VOQ at input $i$ that contains packets for output $j$ and we let $V_{i,j}(t)$ denote the number of bits in $V_{i,j}$ at time $t$. Similarly, we let $B_{i,j}$ denote the crosspoint buffer for packets from input $i$ to output $j$, $B_{i,j}(t)$ denote the number of bits in $B_{i,j}$ at time $t$, and $B$ denote the capacity of the crosspoint buffers. For all quantities that include a time parameter, we sometimes omit the time parameter.
We focus on schedulers for systems in which packets are fully buffered at the input line cards where they arrive before they are sent to the crossbar. A packet is deemed to have arrived only when the last bit has arrived. Consequently, an $(i,j)$-packet that is in the process of arriving at time $t$ is not included in $V_{i,j}(t)$. We say that a VOQ is active whenever the last bit of its first packet has been received. For an active VOQ $V_{i,j}$, we refer to the time period since it last became active as the current active period. For a particular active period of $V_{i,j}$, we define notations for several quantities. In particular, if $x$ was the first packet to arrive in the active period, we let $\sigma = s(x)$ and $\phi = f(x)$ be the times at which its first and last bits were received. The time of the first input event in the active period is denoted by $\tau$. We say an input event is a backlog event for $V_{i,j}$ if, when the event occurs, the crosspoint buffer $B_{i,j}$ is too full to accept the first packet in $V_{i,j}$, and we let $\beta$ denote the time of the first backlog event of an active period. We say that $V_{i,j}$ is backlogged if it is active and its most recent input event was a backlog event. These definitions are illustrated in Fig. 2. Note that and that if , then . While we require that packets be fully buffered at inputs, we assume that packets can be streamed directly through crossbar buffers, and through output buffers to outgoing links. The first assumption is the natural design choice. The second was made to simplify the analysis slightly, but is not essential. Extending our analyses to the case where outputs fully buffer packets is straightforward.
To define a specific crossbar scheduler, we must specify an input scheduling policy and an output scheduling policy. The input scheduling policy selects an active VOQ from which to transfer a packet to the crossbar. We assume that the input scheduler is defined by an ordering of the active VOQs. At each input scheduling event, the scheduler selects the first active VOQ in this ordering that is not backlogged, and transfers the first packet in this VOQ to the crossbar. We also assume that the output scheduling policy is defined by an ordering imposed on the packets to be forwarded from each output. At each output scheduling event, the scheduler selects the crosspoint buffer whose first packet comes first in this packet ordering.
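The input scheduling rule just described can be sketched in a few lines. This is an illustrative sketch only: the `VOQ` class, the `crosspoint_free_bits` mapping, and the simplification that a VOQ counts as backlogged exactly when its crosspoint buffer lacks room for the head packet at the moment of the event are all assumptions made for the example, not definitions from the paper.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class VOQ:
    output: int
    # Lengths (in bits) of fully received packets, head of queue first.
    packets: Deque[int] = field(default_factory=deque)


def select_voq(ordering: List[VOQ],
               crosspoint_free_bits: Dict[int, int]) -> Optional[VOQ]:
    """Return the first active, non-backlogged VOQ in the given ordering."""
    for voq in ordering:
        if not voq.packets:
            continue  # inactive: no fully received packet is waiting
        if voq.packets[0] <= crosspoint_free_bits[voq.output]:
            return voq  # head packet fits in the crosspoint buffer
        # Otherwise this is a backlog event for the VOQ; try the next one.
    return None


if __name__ == "__main__":
    order = [VOQ(0, deque([1000])), VOQ(1, deque([500]))]
    chosen = select_voq(order, {0: 400, 1: 600})
    print(chosen.output if chosen else None)
```

Different schedulers in this family then differ only in how the `ordering` list is maintained, which is the design choice the later sections explore.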
Given a VOQ ordering for an input, we say that one VOQ precedes another if it comes before the other in this VOQ ordering. We extend the precedes relation to the packets in the VOQs and the bits in those packets by ordering the packets (bits) in different VOQs according to the VOQ ordering, and packets (bits) in the same VOQ according to their position in the VOQ. To simplify the language used in the analysis to follow, we include the bits in a VOQ $V_{i,j}$ in the set of bits that are said to precede $V_{i,j}$. For packets (bits) at different inputs going to the same output, we say that one precedes the other if it comes first in the ordering that defines the output scheduling policy.
For an active VOQ $V_{i,j}$, we let $p_{i,j}(t)$ equal the number of bits in VOQs at input $i$ that precede $V_{i,j}$ at time $t$ (note, this includes the bits in $V_{i,j}$), plus the number of bits in the current incoming packet that have been received so far (if there is such a packet). We define $q_j(t)$ to be the number of bits at output $j$ at time $t$ and $q_{i,j}(t)$ to be the number of bits at output $j$ that precede the last bit in $V_{i,j}$. With these preliminaries, we can now define two key quantities, slack and margin. Specifically, we define $\mathrm{slack}_{i,j}(t) = q_j(t) - p_{i,j}(t)$ and $\mathrm{margin}_{i,j}(t) = q_{i,j}(t) - p_{i,j}(t)$. In the analysis to follow, we will show that shortly after the start of an active period for $V_{i,j}$, $\mathrm{slack}_{i,j}$ becomes non-negative and stays non-negative. This is useful, because when an output becomes idle, $q_j$ is necessarily zero. If $\mathrm{slack}_{i,j}$ is not negative, then $p_{i,j}$ must be zero also. Since this implies that $V_{i,j}$ is empty, there can be no packet at input $i$ that should be going out on output $j$. Consequently, we can show that a scheduler is work-conserving by showing that the slack is non-negative. We can use margin in a very similar way when showing that a crossbar-based system emulates an output-queued switch with a specific scheduling policy.
Our worst-case performance guarantees are defined relative to a reference system consisting of an ideal output-queued switch followed by a fixed delay of length $T$. An output-queued switch is one in which packets are transferred directly to output-side queues as soon as they have been completely received. An output-queued switch is fully specified by the queueing discipline used at the outputs.
In [3], the class of Push-In, First-Out (PIFO) queueing disciplines is defined to include all queueing disciplines that can be implemented by inserting arriving packets into a list, and selecting packets for transmission from the front of the list. That is, a PIFO discipline is one in which the relative transmission order of two packets is fixed when the later arriving packet arrives. Most queueing disciplines of practical interest belong to this class. In [4], the restricted PIFO queueing disciplines are defined as those PIFO disciplines in which any two $(i,j)$-packets are transmitted in the same order they were received. Note that this does not restrict the relative transmission order of packets received at different inputs. Our emulation results for buffered crossbars apply to restricted PIFO queueing disciplines.
We say that a crossbar $T$-emulates an output-queued switch using a specific queueing discipline if, when presented with an input packet sequence, it forwards each packet in the sequence at the same time that it would be forwarded by the reference system, with an output delay of $T$. We say that a switch is work-conserving if, whenever there is a packet in the system for output $j$, output $j$ is sending data. A crossbar-based system is $T$-work-conserving if it $T$-emulates some work-conserving output-queued switch. Alternatively, we can say that a system is $T$-work-conserving if output $j$ is busy whenever there is a packet in the system for output $j$ that arrived at least $T$ time units before the current time.
A crossbar that $T$-emulates an output-queued switch is defined by a specific crossbar scheduling algorithm and by the output queueing discipline of the emulated switch. To achieve the emulation property, the output line cards of the crossbar must hold each packet until $T$ time units have passed since its arrival. While it is being held, other packets that reach the output after it may be inserted in front of it in the PIFO list. Whenever the output becomes idle, the line card selects for transmission the first packet in the list which arrived at least $T$ time units in the past. This may not be the first packet in the list, since the PIFO ordering need not be consistent with the arrival order. In the next few sections, we will prove work-conservation and emulation results for two crossbar scheduling algorithms. These results all require that the speedup $S$, crossbar buffer size $B$, and time delay $T$ be at least as large as some minimum threshold. Fig. 3 summarizes these thresholds. Note that the values for $B$ and $T$ are stated relative to the maximum packet length $L_M$.
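The output-side behavior described above (push-in insertion, a hold time, and selection of the first eligible packet) can be sketched as follows. This is a minimal illustration under assumptions: the `Packet` fields, the numeric `rank` used to decide the push-in position, and the class name `DelayedPIFO` are hypothetical and stand in for whatever PIFO discipline is being emulated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    name: str
    arrival: float  # time the packet finished arriving at its input
    rank: float     # hypothetical PIFO key; smaller rank is sent earlier


class DelayedPIFO:
    """PIFO list whose packets become eligible only after a hold time."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.items: List[Packet] = []  # kept in PIFO transmission order

    def push_in(self, pkt: Packet) -> None:
        # Insert behind every packet whose rank is not larger, so the
        # relative order of packets already in the list never changes.
        pos = 0
        while pos < len(self.items) and self.items[pos].rank <= pkt.rank:
            pos += 1
        self.items.insert(pos, pkt)

    def pop_eligible(self, now: float) -> Optional[Packet]:
        # First packet in list order that has been held for the full delay.
        for idx, pkt in enumerate(self.items):
            if now - pkt.arrival >= self.delay:
                return self.items.pop(idx)
        return None


if __name__ == "__main__":
    q = DelayedPIFO(delay=10)
    q.push_in(Packet("a", arrival=0, rank=5))
    q.push_in(Packet("b", arrival=8, rank=1))  # inserted ahead of "a"
    first = q.pop_eligible(now=12)
    print(first.name if first else None)
```

In the usage example, packet "b" sits at the front of the list but has not yet been held long enough, so the older packet "a" is selected, which is the case the text points out when it says the chosen packet may not be the first in the list.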
IV. COMMON PROPERTIES
In this section, we prove a number of properties that apply to certain large classes of crossbar schedulers. Readers may want to skip this section on first reading, referring back to the individual lemmas as they are used in later sections.
A. Prompt Schedulers
All the schedulers we consider have the property that they keep the inputs and outputs busy whenever possible. In particular, if an input line card has any packet at the head of one of its VOQs and the VOQ is not backlogged, then the input must be transferring bits to some crosspoint buffer at rate $S$. Similarly, if any crosspoint buffer for output $j$ is not empty, then output $j$ must be transferring bits from some crosspoint buffer at rate $S$. A scheduler that satisfies these properties is called a prompt scheduler.
The first two lemmas provide lower bounds on $q_j$ that apply to all prompt schedulers. These are useful when attempting to establish lower bounds on slack.
Lemma
output is non-empty, the crossbar transfers bits to the output at rate $S$. Since an output sends bits from the output queue to the link at rate 1, an output queue grows at rate $S - 1$ during any period during which one or more of its crosspoint buffers is non-empty. It follows that . Combining the two inequalities yields the desired result.
B. Invariant Schedulers
We say that a scheduling algorithm is invariant if it does not change the relative order of any two VOQs during a period when they are both continuously active. This property is shared by a number of different crossbar schedulers, including one we consider in detail in the next section.
The next lemma, which applies to any prompt and invariant scheduler, can be used to show that slack does not decrease following the first scheduling event of an active period.
C. $ij$-FIFO Schedulers
We say that a system is $ij$-FIFO if for all inputs $i$ and outputs $j$, all $(i,j)$-packets are forwarded in the same order they were received. Note that systems that implement restricted PIFO queueing disciplines are $ij$-FIFO.
In this subsection, we prove several lemmas that are useful in proving emulation results. The first two lemmas provide lower bounds on $q_{i,j}$ for prompt and $ij$-FIFO schedulers. These are useful for proving lower bounds on margin.
V. PACKET GROUP BY VOQ
Group by Virtual Output Queue (GVOQ) is a cell switch scheduling algorithm first described in [3] and extended to buffered crossbars in [4]. We define the Packet GVOQ (PGV) scheduler by defining an ordering that it imposes on the VOQs. In this ordering, the relative order of two VOQs does not change so long as they both remain active. Hence, PGV is invariant. When an inactive VOQ becomes active, it is placed first in the VOQ ordering. When a VOQ becomes inactive, it is removed from the VOQ ordering. Different variants of PGV can be defined by specifying different output scheduling strategies.
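The PGV ordering rule can be sketched as a small list-maintenance routine. This is a hedged illustration: the class name `PGVOrdering` and its method names are invented for the example, and each VOQ at one input is identified simply by its output index.

```python
from typing import List


class PGVOrdering:
    """VOQ ordering for one input under the PGV rule described above."""

    def __init__(self) -> None:
        # Output indices of active VOQs; index 0 is the most preferred.
        self.order: List[int] = []

    def on_active(self, output: int) -> None:
        # A VOQ that becomes active is placed first in the ordering.
        if output not in self.order:
            self.order.insert(0, output)

    def on_inactive(self, output: int) -> None:
        # A VOQ that becomes inactive is removed from the ordering.
        if output in self.order:
            self.order.remove(output)


if __name__ == "__main__":
    pgv = PGVOrdering()
    for out in (3, 1, 2):
        pgv.on_active(out)
    pgv.on_inactive(1)
    pgv.on_active(1)
    print(pgv.order)
```

Because insertion always happens at the front and removal never reorders the remaining entries, two VOQs that stay active keep their relative order, which is the invariance property the text relies on.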
A. $T$-Work-Conservation
In this section, we show that regardless of the specific output scheduling policy used, PGV is $T$-work-conserving. We prove two versions of the work-conservation result. The first is a bit weaker than the second, but is included because the analysis is more straightforward and hence it provides a useful stepping stone to the more difficult results to follow.
Theorem 1: Any PGV scheduler is $T$-work-conserving if and .
The proof of this theorem involves four steps. The first step is to show that slack does not decrease after the first scheduling event of an active period. This was shown in Lemma 3 in the previous section. The second step is to show that a backlog event must occur near the start of an active period, and the third step is to show that when the first backlog event occurs, slack is non-negative. These two steps are shown in the proofs of the next two lemmas. The final step, which appears as the proof of the theorem, is to show that when an output is idle, no input can have a packet that has been present for more than time $T$.
Lemma 8: Consider an active period for in a crossbar using a PGV scheduler with speedup . If the duration of the active period is at least , then it includes at least one backlog event for and . Proof: Suppose there is no backlog event in the interval for . Then, at each event in this interval, the input scheduler selects either or some other VOQ that precedes . Since the scheduling algorithm is invariant, any contribution to increasing during this interval can only result from the arrival of new bits from the input link. Consequently, decreases at a rate throughout this period. Since ,
The first line in the above inequality follows from the fact that and that can increase at rate at most 1 during the interval and must decrease at rate after . The second and third lines follow directly from the definitions, and the last line from the fact that . The above result contradicts the premise that the duration of the active period is at least .
Our next lemma shows that within a short time following the start of an active period, slack .
Proof of Theorem 2: Suppose some output is idle at time and no input is currently sending it a packet, but some input has a packet for output with . By Lemma 10, slack . Since, , this implies that , which contradicts the definition of .
B. $T$-Emulation Results for PGV
We refer to a PGV scheduler defined by a restricted PIFO queueing discipline as a PGV-RP scheduler. We show that for any restricted PIFO queueing discipline, the corresponding PGV-RP scheduler $T$-emulates an ideal output-queued switch using the same discipline. Our result for PGV generalizes the corresponding result for cell-based crossbars given in [3].
Theorem 3: Let be an output-queued switch using a restricted PIFO scheduler. A crossbar using the corresponding PGV-RP scheduler -emulates if
, and .
The analysis leading to this result is similar to the analysis used to establish work-conservation. The first step is to show that margin does not decrease following the first input event of an active period (Lemma 7). The second step is to establish a lower bound on margin at the time of the first backlog event. We then use this lower bound to prove the emulation result. We can now proceed to the proof of the theorem.
Proof of Theorem 3: Suppose that up until time , the PGV-RP crossbar faithfully emulates the output-queued switch, but that at time , the output-queued switch begins to forward an -packet , while the crossbar does not. Now suppose that in the crossbar, one or more bits of have reached by time . Note that the interval must contain at least one scheduling event at output and all such events must select packets that precede . However, this implies that during some non-zero time interval , output is continuously receiving bits that precede at a faster rate than it can forward them to the output. This contradicts the fact that by time , the crossbar has forwarded all the bits that precede (since it faithfully emulates the output-queued switch up until time ).
Assume then that at time , no bits of have reached . Since the output queued switch has an output delay of , so . Since the crossbar has sent everything sent by the output-queued switch up until , it follows that . By Lemma 11,  and hence , which is not possible. The analysis of Lemma 11 requires a crossbar buffer of size at least when . We conjecture that this can be reduced using a more sophisticated analysis.
VI. PACKET LOOFA
The Least Occupied Output First Algorithm (LOOFA) is a cell scheduling algorithm described in [8] . We define an asynchronous crossbar scheduling algorithm based on LOOFA, called Packet LOOFA (PLF). Like PGV, PLF is defined by the ordering it imposes on the VOQs at each input. The ordering of the VOQs is determined by the number of bits in the output queues. In particular, when a VOQ becomes active, it is inserted immediately after the last VOQ , for which . If there is no such VOQ, it is placed first in the ordering. At any time, active VOQs may be re-ordered, based on the output occupancy. We allow one VOQ to move ahead of another during this re-ordering, only if its output has strictly fewer bits. The work-conservation result for PLF is comparable to that for PGV, but the required analysis is technically more difficult because in PLF, the relative orders of VOQs can change.
Because the order of VOQs can change, PLF is also more responsive to changing traffic conditions than PGV. While this has no effect on work-conservation when , it does provide better fairness when used with smaller speedups. As one example of this, consider the following traffic pattern. From time 0 to time , a switch with speedup of , receives packets on inputs and for output at the link rate of 1. After time , input receives packets for output (at rate 1) while input receives packets for output . Due to the symmetry of the traffic pattern, a scheduler has no reason to favor one input over the other, so we assume that the inputs are treated fairly by the output scheduling policy. Up until time , the two inputs each send packets to at rate and forwards packets at rate 1, while building a backlog. If a PGV scheduler is used, then after time , input gives preference to output , while input gives preference to output . Consequently, output receives packets only at rate . As a result, the output side backlog at is fully consumed by time , after which starts forwarding packets at rate , while both outputs and continue to forward packets at rate 1. So for is limited to an output rate of 0.4. On the other hand, a PLF scheduler attempts to keep the output queue lengths equal, so after time , outputs and will all receive packets at rate . So for , all three outputs will forward packets at rate 0.8. This doubles the rate at which is able to send, dramatically improving the fairness with respect to the other outputs.
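The rates quoted in this example can be reproduced with a short steady-state calculation. This is a sketch under stated assumptions: the speedup value 1.2 is inferred from the 0.4 and 0.8 rates quoted above rather than read from the text, the scenario is taken to have two inputs and three outputs, and the function names are illustrative.

```python
def pgv_old_output_rate(speedup: float) -> float:
    """Long-run output rate of the originally overloaded output under PGV.

    Each of the two inputs gives preference to its newly active VOQ, whose
    traffic arrives at rate 1, leaving only (speedup - 1) of the input's
    crossbar bandwidth for the older VOQ.
    """
    return min(1.0, 2 * (speedup - 1.0))


def plf_output_rate(speedup: float) -> float:
    """Long-run rate of each of the three outputs under PLF.

    PLF tends to equalize output queue lengths, so the total crossbar
    bandwidth of the two inputs is shared evenly by the three outputs.
    """
    return min(1.0, 2 * speedup / 3)


if __name__ == "__main__":
    S = 1.2  # assumed speedup, inferred from the quoted rates
    print(round(pgv_old_output_rate(S), 3), round(plf_output_rate(S), 3))
```

With the assumed speedup, the first function gives the limited rate of the starved output under PGV and the second gives the common rate of all three outputs under PLF, a factor of two apart as the text observes.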
A. More Definitions
To facilitate the analysis of PLF, it's helpful to separate the analysis of "old bits" from "new bits". When considering an active period for , the old bits at input are those bits that arrived before . All other bits at input are considered new. Also, we say that a VOQ is older than a VOQ at time if both are active, and last became active before did. We say that a VOQ passes a VOQ during a given time interval, if precedes at the start of the interval and precedes at the end of the interval.
For an active VOQ, , we let new be the number of bits present at input at time that arrived in the interval . We let equal new plus the number of bits that precede at time that arrived before . Note that and consequently, slack slack and margin
B. Additional General Lemmas
Here we give several more lemmas that apply to a broad class of scheduling algorithms and are useful for establishing both -work-conservation and -emulation results for PLF. The reader may want to skip this section on first reading, and refer back to the lemmas presented here, as they are used. The proofs of the first two lemmas are omitted, as they are very similar to proofs of earlier lemmas.
Lemma 12: Let be the time of an input scheduling event in an active period of and let be no later than the next event at input . For any prompt scheduler, slack slack if no older VOQ passes in and .
Lemma 13: Let be the time of an input scheduling event in an active period of and let be no later than the next event at input . For any prompt, -FIFO scheduler, margin margin if no older VOQ passes in and .
Our next lemma applies to any scheduling algorithm.
Lemma 14: If there is some VOQ that is older than and that precedes at time , then there is some such VOQ for which .
Proof: Let be a VOQ that is older than and that precedes at time . More specifically, let be that VOQ that comes latest in the VOQ ordering, among all VOQs that satisfy the condition. Let be the set of bits that precede at time but not . Note that and that all bits in must have arrived since (otherwise, there would be some VOQ older than that precedes and comes later in the VOQ ordering than ). Since is older than , these bits also arrived after . Let be the set of bits that arrived after and are still present at time and do not precede . Note that and that and have no bits in common. Now, let be the set of bits that arrived since and do not precede . Both and are subsets of and so . Consequently, which implies that .
C. -Work-Conservation

D. -Emulation
In this section we show that a variant of the PLF algorithm is capable of emulating an output queued switch using any restricted PIFO queueing discipline. This variant differs from the standard PLF algorithm in that it orders VOQs based on the values of , rather than . That is, when becomes nonempty, it is inserted into the VOQ ordering after the last VOQ for which . If there is no such VOQ, is placed first in the ordering. Strictly speaking, this variant is different from PLF, so to avoid confusion we refer to it as Refined PLF or RPLF.
Theorem 5: Let be an output-queued switch using a restricted PIFO scheduler. A crossbar using the corresponding RPLF scheduler -emulates if and and .
To prove the theorem, we need the following lemma.
Proof of Theorem 5: Suppose that up until time , the RPLF crossbar faithfully emulates the output-queued switch with added delay , but that at time , the output-queued switch begins to forward an -packet , while the crossbar does not. Now suppose that in the crossbar, one or more bits of have reached by time . Note that the interval must contain at least one scheduling event at output and all such events must select packets that precede . However, this implies that during some non-zero time interval , output is continuously receiving bits that precede at a faster rate than it can forward them to the output. This contradicts the fact that by time the crossbar forwards all bits that precede (since it faithfully emulates the output-queued switch up until time ). Assume then that at time , no bits of have reached . Since the output-queued switch has a delay of and so . Since the crossbar has sent everything sent by the output-queued switch up until time , it follows that . By Corollary 2, and hence , which is not possible.
VII. SEGMENT-BASED SWITCHING
Chuang et al. [3] showed that cell-based crossbars can emulate an output-queued switch using any push-in, first-out (PIFO) queueing discipline. It is straightforward to define PIFO scheduling policies that keep the cells of a packet together (simply insert later arriving cells of a given packet right after their immediate predecessors). This makes it possible to provide strong performance guarantees for packets, not just cells, using variants of standard crossbar schedulers that are packet-aware. (Thanks to the anonymous referee who made this observation in his insightful review of an earlier version of this paper.) Note that this method may require that the output line card forward cells that form the initial part of a packet before all cells in the packet are received, but this is feasible in this context, since the crossbar scheduler can guarantee that the remaining cells are received by the time they are needed. While packet-aware schedulers can provide packet-level performance guarantees in systems that use cell-based crossbars, such systems still suffer from bandwidth fragmentation, since packet lengths are generally not exact multiples of the cell length.
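The parenthetical rule for keeping a packet's cells together can be sketched as follows. This is an illustrative model only, not the scheduler of [3]: the queue is a plain list in departure order, cells are hypothetical `(packet_id, seq)` tuples, and the callable `first_cell_position` stands in for whatever PIFO discipline chooses where a packet's first cell is pushed in. The sketch assumes that the discipline only chooses positions at packet boundaries.

```python
def pifo_insert(queue, cell, first_cell_position):
    """Insert `cell` = (packet_id, seq) into `queue` (a list in departure order).

    The first cell of a packet goes wherever the queueing discipline says.
    Every later cell goes right after its immediate predecessor, so the
    cells of a packet stay contiguous.
    """
    packet_id, seq = cell
    if seq == 0:
        queue.insert(first_cell_position(queue), cell)
        return
    predecessor = (packet_id, seq - 1)
    if predecessor in queue:
        queue.insert(queue.index(predecessor) + 1, cell)
    else:
        # The predecessor has already departed, so this cell is next.
        queue.insert(0, cell)


def at_tail(queue):
    """Hypothetical discipline choice: lowest priority, join the end."""
    return len(queue)


def at_head(queue):
    """Hypothetical discipline choice: highest priority, push in at the front."""
    return 0


queue = []
pifo_insert(queue, ("A", 0), at_tail)
pifo_insert(queue, ("A", 1), at_tail)
pifo_insert(queue, ("B", 0), at_head)   # packet B is pushed in ahead of A
pifo_insert(queue, ("A", 2), at_tail)   # lands right after ("A", 1)
pifo_insert(queue, ("B", 1), at_tail)   # lands right after ("B", 0)
print(queue)
```

The position callable is consulted only for the first cell of each packet; for later cells it is ignored, which is what keeps the packet's cells adjacent regardless of arrival interleaving.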
One possible objection to the use of crosspoint buffers that are large enough to hold packets is that they might be too expensive, even for modern integrated circuit components. A 32-port crossbar equipped with buffers large enough to hold two 1500-byte packets at each crosspoint would require a total of more than 3 MB of SRAM. In [10], the authors propose switching variable length segments rather than cells, as a way of addressing the fragmentation problem with fixed-size cells. If this is coupled with a packet-aware crossbar scheduler that provides performance guarantees for variable length packets, we can reduce the crossbar buffer size to a multiple of the maximum segment length. For IP routers, a maximum segment length of 80 bytes is sufficient to eliminate bandwidth loss due to fragmentation effects. Even after adding 20 bytes for header information, this reduces the required buffer size by a factor of 15, making it small enough to be easily accommodated within the constraints of current circuit technologies.
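The buffer figures quoted here follow from simple arithmetic, sketched below. The function name `crossbar_buffer_bytes` is hypothetical, and the sketch assumes that the segment-based design keeps the same multiple (two units per crosspoint) as the packet-based one, which the text does not state explicitly.

```python
def crossbar_buffer_bytes(ports, units_per_crosspoint, unit_bytes):
    """Total crosspoint buffer memory for a ports x ports crossbar."""
    crosspoints = ports * ports
    return crosspoints * units_per_crosspoint * unit_bytes


# Packet-based: two maximum length (1500-byte) packets per crosspoint.
packet_based = crossbar_buffer_bytes(32, 2, 1500)

# Segment-based: two 80-byte segments plus 20 bytes of header each
# (the multiple of two is an assumption carried over from above).
segment_based = crossbar_buffer_bytes(32, 2, 80 + 20)

print(packet_based)                   # total bytes, packet-based
print(segment_based)                  # total bytes, segment-based
print(packet_based // segment_based)  # reduction factor
```

The reduction factor depends only on the ratio of unit sizes (1500 bytes versus 100 bytes), so it is the same for any port count and any common multiple per crosspoint.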
Also observe that in a segment-based system, an input line card can forward segments to an output before all segments of the packet have been received. The performance guarantee for the crossbar ensures that the remaining segments are transferred through the crossbar in time to be forwarded on the outgoing link, if the system is operated with a speedup of 2. Thus, we reduce both the amount of buffering required and the delay.
VIII. CONCLUDING REMARKS
The results of Sections V and VI can be extended to systems that place different constraints on where and when packets are buffered. In particular, most routers buffer packets at both input and output line cards, not just at the inputs. Modifying the analysis to handle this case is straightforward and requires only that the value of be increased by , to accommodate the added delay for a maximum length packet to be fully buffered at the outputs.
With an asynchronous crossbar, it is possible to build a system in which packets pass from inputs to outputs without ever being fully buffered. This is known as cut-through switching [7] and can provide superior delay performance when load is light. While our results cannot be directly applied to such systems, it seems likely that similar results could be developed for this model. Indeed, the segment-based switches already approach the behavior of a cut-through switch.
There are several ways the work described here can be extended. First, there are opportunities for tightening the results, particularly with respect to the crossbar buffer size. There seems to be no intrinsic reason that PLF should require a larger crossbar buffer size than PGV. An analysis that directly compares the behavior of a PLF scheduler to a PGV scheduler may be able to reduce the buffer size requirement for PLF.
It would also be interesting to see if the analysis techniques can be extended to provide stronger performance guarantees. In particular, it would be useful to show that an asynchronous buffered crossbar can emulate an output-queued switch using any PIFO queueing discipline, not just any restricted PIFO discipline. The difficulty in making the transition from restricted PIFO queueing disciplines to unrestricted PIFO disciplines is that once a packet is in a crossbar buffer, there is no way for a later arriving packet from the same input to reach the output line card before it does, even if the queueing discipline gives it higher priority. Reference [4] describes several techniques that can be used to allow cell switches using buffered crossbars to overcome this crosspoint blocking phenomenon. It seems likely that these methods can be generalized to accommodate asynchronous crossbars.
Still another direction to explore is how scheduling algorithms that deliver strong performance guarantees when operated with a speedup of 2 perform when operated with a smaller speedup. Since the crossbar cost increases in direct proportion to the speedup, there are practical reasons to be interested in the performance of systems with smaller speedup, even if they are not able to deliver strong performance guarantees. A comprehensive simulation study exploring how such systems perform under a wide range of conditions would have considerable practical value.
