Abstract -- High performance packet switches frequently use a centralized scheduler (also known as an arbiter) to determine the configuration of a non-blocking crossbar. The scheduler often limits the scalability of the system because of the frequency and complexity of its decisions. A recent paper by C.-S. Chang et al. introduces an interesting two-stage switch, in which each stage uses a trivial deterministic sequence of configurations. The switch is simple to implement at high speed and has been proved to provide 100% throughput for a broad class of traffic. Furthermore, there is a bound between the average delay of the two-stage switch and that of an ideal output-queued switch. However, in its simplest form, the switch mis-sequences packets by an arbitrary amount. In this paper, building on the two-stage switch, we present an algorithm called Full Frames First (FFF) that prevents mis-sequencing while maintaining the performance benefits (in terms of throughput and delay) of the basic two-stage switch.
I. INTRODUCTION
Most high performance packet switches today use input queueing, and a non-blocking (usually crossbar) switch fabric [1] [2] . To overcome head-of-line blocking and enable high throughput, the input buffers are arranged as virtual output queues (VOQs) [3] . To simplify the tasks of memory management and scheduling, a fixed sized time slot is used, and hence arriving variable length packets are segmented into fixed size packets, or "cells". Each time slot a centralized scheduler examines the contents of the VOQs to determine the configuration of the switch fabric for the next time slot. Numerous papers have studied this approach, and have proposed new scheduling algorithms that are simple to implement [4] [5] , provide throughput guarantees [6] [7] [8] or provide delay guarantees [9] [10] .
In 1993, Anderson et al. [4] observed that the job of the scheduler is equivalent to finding a matching in a bipartite graph. McKeown et al. [6] showed that 100% throughput could be guaranteed if a maximum weight matching is found, which has a complexity of , where N is the number of switch ports [11] . This has proved too complex for use in existing high performance packet switches. With the number of ports increasing (hence increasing the complexity) and line rates increasing (hence reducing the time in which the algorithm must complete), maximum weight matching algorithms will continue to be impractical.
Some maximal size matching algorithms and heuristics have been proposed that have a complexity of or lower [4] [5] [12] . While these algorithms have been widely used, the need for switches with more ports and faster line rates makes these algorithms harder and harder to implement. In fact, it appears that the scalability of most input-queued switches today is limited by the scheduling algorithm. At least four different approaches have been proposed in the literature to improve scalability. The first approach is to use a simple randomized scheduling algorithm that exploits the correlation between successive matchings [13] [14] . Tassiulas et al. [13] showed that a simple randomized scheduling algorithm could guarantee 100% throughput for Bernoulli i.i.d. arrivals, although packet delay is large. Shah et al. [14] recently introduced an alternative algorithm that leads to lower delays. The second approach is to increase the length of a cell, which in turn increases the time slot and gives the scheduler more time to complete [15] . The third approach attempts to pipeline the scheduler, allowing it to use out-of-date information [7] . Although this approach does not reduce the throughput, it increases packet delay.
The fourth approach, that motivated this paper, adopts a novel structure proposed by C.-S. Chang et al. [16] . Their switch consists of two stages, but has no scheduler. Both stages of the switch follow a deterministic sequence of different configurations. All that is required is that each input is connected to each output exactly once in the sequence. For example, the sequence that we will assume throughout this paper is one in which input is connected to output ( modulo ) at time in the first stage, and input is connected to output ( modulo ) at time in the second stage. A cell arriving to the first stage is immediately transferred without buffering to an input of the second stage switch. The cell is placed in a VOQ according to its output. The VOQs in the second stage are all serviced at the same rate by a second deterministic sequence.
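The requirement that each input be connected to each output exactly once per sequence can be checked with a small sketch. The cyclic shift used below, in which input i is connected to output (i + t) mod N at time slot t, is only an assumed example of such a sequence; the function names and the 0-based numbering are illustrative and are not taken from [16].

```python
def configuration(t, n):
    """Return the crossbar configuration at time slot t as a list:
    entry i is the output connected to input i (assumed cyclic shift)."""
    return [(i + t) % n for i in range(n)]


def covers_every_pair_once(n):
    """Check that over n consecutive slots every (input, output) pair is
    connected exactly once, and that each slot is a conflict-free permutation."""
    seen = set()
    for t in range(n):
        config = configuration(t, n)
        if sorted(config) != list(range(n)):  # two inputs would share an output
            return False
        for i, out in enumerate(config):
            if (i, out) in seen:  # this pair was already connected
                return False
            seen.add((i, out))
    return len(seen) == n * n


print(covers_every_pair_once(4))
```

Because the sequence is fixed in advance, no per-slot decision is needed: each linecard can derive the current configuration from the time slot number alone.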
The intuition behind the two-stage approach is as follows. It is known that a single-stage crossbar switch with VOQs that are served by such a deterministic sequence will provide 100% throughput for uniform Bernoulli i.i.d. traffic; but no guarantees are possible when the traffic is non-uniform. In the two-stage switch, the first stage effectively makes non-uniform traffic uniform by spreading it evenly over the second stage. Hence the two stages might be expected to provide 100% throughput. In [16] this is proved rigorously, for a particular definition of throughput and for a broad class of arrival processes.
A disadvantage of the two-stage switch is that cells can be mis-sequenced by an arbitrary amount. Although not strictly disallowed in an Internet router [17], mis-sequencing can cause problems for current versions of TCP [18] [19], and so common rules of practice dictate that routers should not mis-sequence packets.
In a second paper, Chang et al. [20] propose two different solutions that bound the amount of mis-sequencing, enabling the addition of a finite resequencing buffer after the second stage. This is similar to the parallel packet switch (PPS) [21] . Nevertheless, the first scheme proposed in [20] either requires up to memory accesses per time slot in any second-stage input (for packets that arrived to the switch at the same time), or needs to use a complex buffering mechanism. The second scheme, EDF, needs to retrieve the packet with the smallest timestamp from a queue, making it hard (but not impossible) to implement in a high performance switch.
In this work, our goal is to design a two-stage switch with the same throughput advantages. Instead of bounding the amount of mis-sequencing, our approach prevents mis-sequencing from taking place, eliminating the need for a resequencing buffer.
In the remainder of this paper, we present an algorithm, called "full frames first" (FFF), that leads to an average packet delay within a constant of that of ideal output queueing (OQ), and therefore reaches the same throughput as OQ. It uses three-dimensional queues (3DQs) (which are an extension of VOQs) to avoid packet mis-sequencing. FFF comes at a cost: the 3DQ queueing structure is more complex than simple VOQs; and although simple, FFF is not as trivial as the deterministic sequence of configurations.
This paper is organized as follows. We first describe Chang's switch architecture and the EDF algorithm. Then we introduce 3DQ and show how it helps prevent mis-sequencing by giving some choice to the external outputs. Finally, we present the FFF algorithm, showing that it has no mis-sequencing and proving some theorems on its delay and throughput.
II. SWITCH ARCHITECTURE

A. Definitions
Throughout this paper, we'll use the terms "packets" and "cells" interchangeably to designate fixed-size cells. We'll denote the number of switch ports by , and assume . The switch architecture that we will use as the basis for this paper is taken from [20], and shown in Figure 1. Although it is more complex than the basic structure in [16], the additional queues in the first stage help to limit the amount of mis-sequencing. The switch architecture consists of two stages of switching. The inputs of the first stage are called external inputs (EIs), and numbered . The outputs of the first stage, called internal outputs (IOs), are collocated with the inputs of the second stage, called internal inputs (IIs). IOs and IIs will be used interchangeably in this paper, and are numbered . Finally, the outputs of the second stage, called external outputs (EOs), are numbered .
Let's follow the path of packets through the switch.
1. First, a flow splitter labels each packet in EI as belonging to a given flow , where is the EO to which this packet is destined. There are therefore possible flows per EI representing the different EOs to which the packets may be destined.
2. A load balancer sends all the packets from to the VOQ 1 s (corresponding to the IOs), in a round-robin manner, i.e. the first packet from a given flow is sent to the VOQ 1 for IO , the second one is sent to the VOQ 1 for IO , and so on, independently of the packet arrival times. Because the load balancers are not necessarily synchronized with the sequence of configurations of the first-stage switch, arriving packets are buffered and do not necessarily immediately leave the VOQ 1 s. Note that the inputs of the VOQ 1 s are the EIs, their outputs are the IOs (collocated with the IIs), and there is a different load balancer for each flow.
3. The VOQ 1 s are served in deterministic order by the first-stage switch, and when their turn comes the packets leave their VOQ 1 and pass through the first-stage switch.
4. After leaving the first-stage switch the packets are queued in the VOQ 2 s. The inputs of the VOQ 2 s are the IIs, and their outputs are the EOs.
5. The VOQ 2 s are served in deterministic order by the second-stage switch, and when their turn comes the packets leave their VOQ 2 and pass through the second-stage switch.
6. Finally, the packets leave the second-stage switch and exit through the EO.
The following property of the switch will prove useful in this paper (proved in [20]).
Property 1 If a packet arrives to the switch at time , it will arrive to the VOQ 2 s no sooner than , and no later than .
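The per-flow load balancer of step 2 can be sketched as a counter that advances by one IO for every packet of the flow, regardless of when the packet arrives. This is a minimal illustration only: the class name `FlowLoadBalancer`, the 0-based IO numbering and the choice of IO 0 as the starting point are assumptions, not details given in [20].

```python
class FlowLoadBalancer:
    """Round-robin spreader for one flow (one EI, one EO).

    Hypothetical helper: successive packets of the flow are assigned to
    successive IOs, independently of their arrival times.
    """

    def __init__(self, n, start=0):
        self.n = n            # number of IOs (switch ports)
        self.next_io = start  # IO that the next packet of this flow will use

    def assign(self):
        """Return the IO for the next packet and advance the pointer."""
        io = self.next_io
        self.next_io = (io + 1) % self.n
        return io


# One independent balancer per flow, keyed by (EI, EO).
n = 3
balancers = {(ei, eo): FlowLoadBalancer(n) for ei in range(n) for eo in range(n)}
ios_for_flow = [balancers[(0, 2)].assign() for _ in range(5)]
print(ios_for_flow)
```

Since consecutive packets of a flow always land on consecutive IOs, an external output that knows where the last in-order cell came from also knows which II holds the next one.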
B. EDF: Example of Algorithm Using This Switch Architecture
Suppose that two packets belonging to flow arrive back-to-back at EI . Because they may be placed in different VOQ 2 s at the second stage, they may both experience very different delays through the switch, and may become mis-sequenced by an arbitrary amount.
The earliest deadline first (EDF) algorithm limits mis-sequencing by serving cells in the VOQ 2 s in the order that they arrived to the switch, rather than strictly from the head of line. EDF has the following properties, proved in [20].
Property 2 Packet mis-sequencing is bounded by . Note that it is therefore possible to add a finite resequencing buffer after the switch for each external output.
Property 3 The packet delay in EDF is bounded by the sum of the packet delay in a first-come-first-served (FCFS) OQ switch, and a constant equal to . This implies that the EDF algorithm has good delay and throughput properties, since it follows an FCFS OQ switch closely.
However, EDF requires up to timestamps to be compared at every time-slot in order to determine which cells to service, where is the maximum length of a VOQ 2 . This makes the EDF algorithm difficult to implement in practice. In what follows, we will first show how to simplify the EDF algorithm, and then eliminate the need for a resequencing buffer at the external outputs.
III. 3DQ, AN EXTENSION OF VOQ
A. The Return of HOL Blocking
Consider a packet, , that sits in the VOQ 2 . We'll assume that was the earliest arriving packet to the switch among all packets in its VOQ 2 , but that is not currently sitting at head-of-line (HOL) in its VOQ 2 . Packet is obviously the earliest arriving packet of its flow in VOQ 2 , and therefore sits in front of the other packets of its flow in VOQ 2 . However, it is blocked by packets ahead of it that arrived later to different external inputs and are also scheduled to depart from EO . This is classical HOL blocking, and the solution is to subdivide each VOQ 2 into a separate queue for each external input.
B. Three-Dimensional Queueing
VOQ 2 s transform one-dimensional queues into two-dimensional queues, one per (input, output) pair. There are therefore VOQ 2 s. In this switch, we will use three-dimensional queues (3DQs), with a different queue per ; hence, there are now a total of 3DQs. From hereon, we'll assume that we replace the VOQ 2 s by 3DQs.
C. An Application of 3DQ: EDF-3DQ
With 3DQs, the earliest cell for is always the HOL cell in its queue. Therefore, if we want to use the EDF algorithm with a 3DQ structure (we'll call it the EDF-3DQ algorithm), we only need a comparison among timestamps, instead of a comparison among timestamps. This simplification comes at the cost of using 3DQs instead of VOQ 2 s. Figure 2 compares a VOQ 2 structure with a 3DQ structure for a given II. The numbers on the packets represent their flow and their arrival time to the switch, and the packets with a bold border are the earliest ones in their VOQ 2 . The figure illustrates how HOL blocking with VOQ 2 s is solved using a 3DQ structure. For instance, in the VOQ 2 , packet from is blocked by the HOL packet from , arrived later ( ). However, in the 3DQ structure, packet is the HOL of its 3DQ and is not blocked anymore.
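The difference between EDF on a VOQ 2 and EDF-3DQ can be sketched for a single (II, EO) pair. The cell representation `(external input, arrival time)`, the function names and the sample arrival times below are all hypothetical; the sketch only illustrates that, with one FIFO per external input, the earliest cell is always found among the head-of-line cells.

```python
from collections import deque


def edf_voq2(voq2):
    """EDF over a single VOQ2: the earliest cell may sit anywhere in the
    queue, so every queued timestamp may have to be examined."""
    return min(voq2, key=lambda cell: cell[1])


def edf_3dq(three_dq):
    """EDF-3DQ for the same (II, EO) pair: three_dq maps each EI to a FIFO,
    and only the head-of-line cell of each non-empty FIFO is compared."""
    heads = [q[0] for q in three_dq.values() if q]
    return min(heads, key=lambda cell: cell[1])


# Cells in the order they reached the II: (external input, arrival time to the switch).
arrivals = [(1, 7), (0, 3), (1, 9), (0, 5)]

voq2 = deque(arrivals)               # one queue shared by all external inputs
three_dq = {0: deque(), 1: deque()}  # one queue per external input
for ei, t in arrivals:
    three_dq[ei].append((ei, t))

print(voq2[0], edf_voq2(voq2), edf_3dq(three_dq))
```

In the shared queue the earliest cell, from EI 0, sits behind a later cell from EI 1; in the per-input queues it is at the head of its own FIFO.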
Our next step is to eliminate the resequencing buffer by preventing mis-sequencing from occurring in the first place.
IV. FULL FRAMES FIRST
A. Background
FFF (Full Frames First) is an algorithm that maintains packet order.
To understand how FFF works it helps to understand how the round-robin version of OQ works (called OQ-RR). Consider the illustration of OQ-RR for one output in Figure 3 , where all packets are assumed to have the same output destination. The numbers on the packets correspond to the order in which they will be serviced, assuming no future arrivals. Therefore, OQ-RR will service packet 5 before packet 6, even if packet 5 arrived later. Note that because the average delay is independent of the order in which the packets are serviced, OQ-RR will have the same average delay as OQ-FCFS (the FCFS version of OQ). Also, note that OQ-RR is work-conserving (i.e., if there is at least one packet in the queue, then OQ-RR is not idle). Now, assume that the algorithm doesn't deal with packets, but with frames, where one frame consists of packets. The new algorithm, called Frames-RR, first services all full frames in round-robin order (where a frame is considered to be full if its slot contains a packet). When there are no full frames requiring service, it services non-full frames in round-robin order. For instance, in Figure 4 , the frames are serviced in the order indicated. First, the frames 1 through 5 are serviced because they are full, including frame 3 which is considered full because its last slot is occupied by a packet. Afterwards, the non-full frames 6 and 7 are serviced in a round-robin order.
Frames-RR is clearly not work-conserving for packets. However, it is work-conserving for full frames, in the sense that if there is at least one full frame left, then there is at least one full frame being serviced.
B. FFF: A Combination of Frames-RR and 3DQs
FFF applies Frames-RR to the 3DQs in the second stage of the switch. To understand its operation, consider external output . We'll define a cycle to be the set of consecutive time slots during which EO receives cells successively from IIs through , and we'll define the candidate set of 3DQs for as . Assume that the last serviced cell in the candidate set came from II . Then, because of the properties of the load-balancer, we know that the next in-order cell for the flow will come from II (modulo ). Let be the pointer to the II of the next in-order cell: . For instance, if the last cell was read from II , then the next in-order cell will necessarily be read from the II numbered: . Further, we know that if cell is in front of cell in any 3DQ, then necessarily arrived to the switch earlier than . Therefore, if there is any cell from flow that is head-of-line of its 3DQ in II , then this cell must be the next in-order one.
We define the frame for as , and we will say that frame is full if every 3DQ for is non-empty. We can see that if the frame is full then its next in-order cell is in 3DQ , the one after is in , and so on, up until . In other words, a frame is said to be full if, and only if, it is possible to transfer in-order cells from up until . This is the key to preventing the cells within a frame from becoming mis-sequenced.
In FFF, an external output reads all the cells in a full frame from one external input, before moving on to read a full frame from the next external input. External output uses the round-robin pointer to remember which EI the last full frame came from. In this manner, each external output gives an opportunity to each external input in turn to send a full frame to it. When there are no more full frames, EO serves the non-full frames in round-robin order, using the pointer .
More precisely, the following three computations are performed by external output at the beginning of every cycle:
1. Determine which of the frames is full, where .
2. Starting at , find the first full frame. If the first full frame arrived from EI , then , modulo . If there is no full frame, doesn't change.
3. If there is no full frame, starting at , find the first non-full frame. Update , modulo .
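The three computations can be sketched for one external output as follows. This is an illustration under stated assumptions rather than the exact procedure: indices are 0-based, the frame of an EI is taken to span the 3DQs from its frame pointer up to the last II, a served full frame resets that pointer to the first II, and the names `make_queues`, `frame_is_full`, `fff_cycle`, `frame_ptr`, `rr_full` and `rr_nonfull` are hypothetical. The sketch also serves a whole frame in one call, whereas the switch reads one cell per time slot of the cycle.

```python
from collections import deque


def make_queues(n, k):
    """One FIFO per (EI i, II j) for a fixed external output k."""
    return {(i, j, k): deque() for i in range(n) for j in range(n)}


def frame_is_full(queues, i, k, ptr, n):
    """The frame of EI i spans IIs ptr..n-1 (assumption); it is full if
    none of those 3DQs is empty."""
    return all(queues[(i, j, k)] for j in range(ptr, n))


def fff_cycle(queues, k, frame_ptr, rr_full, rr_nonfull, n):
    """Choose and serve one frame for EO k.

    Returns (served_cells, new_rr_full, new_rr_nonfull); frame_ptr is
    updated in place.
    """
    # Steps 1 and 2: scan the EIs round-robin from rr_full for a full frame.
    for offset in range(n):
        i = (rr_full + offset) % n
        ptr = frame_ptr[i]
        if frame_is_full(queues, i, k, ptr, n):
            cells = [queues[(i, j, k)].popleft() for j in range(ptr, n)]
            frame_ptr[i] = 0  # the next in-order cell is back at the first II
            return cells, (i + 1) % n, rr_nonfull
    # Step 3: no full frame, so every frame is non-full; serve the one at rr_nonfull.
    i = rr_nonfull
    j = frame_ptr[i]
    cells = []
    while queues[(i, j, k)]:  # stops at the first empty 3DQ, which must exist
        cells.append(queues[(i, j, k)].popleft())
        j += 1
    frame_ptr[i] = j  # the next in-order cell is expected at this II
    return cells, rr_full, (i + 1) % n
```

In this sketch a non-full frame contributes only its in-order prefix: cells queued beyond the first empty 3DQ stay where they are until the missing cell arrives, which is what keeps each flow in order.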
C. Illustration of the FFF Algorithm
Assume that at the beginning of a cycle for a given EO , the 3DQs are in the states shown in Figure 5 . In the figure, the 3DQs have been rearranged so that all of the queues containing cells from a given external input are adjacent to each other. In practice, of course, the queues are not arranged like this, but they have been redrawn to help explain the algorithm. The number in each packet represents its sequence number within its flow. The numbers above the frames (in bold) indicate the order in which they will be served. Assume that there are no further arrivals.
• Initially in the example, , and frame pointers are , and .
• At the first time-slot, FFF serves the first full frame that arrived from external input . The first full frame is , and so FFF serves it over three consecutive cell times, delivering the three cells in order to EO k. Pointers are updated: , .
• FFF then serves the three cells from external input 1 in frame , then updates , . According to our definition, is a full frame from external input 2, even though it only contains one packet. FFF serves it and updates the pointers. Since there is no full frame from external input 3, the next served full frame is , and then . The pointers are now: , and .
• There are no full frames left. FFF serves the non-full frames in round-robin order: , and . Pointers are updated to , , and . Note that the cell numbered is not serviced, because there is no ordered cell in its frame at II (the expected cell numbered is still queued in its VOQ 1 ). Similarly, and will not be serviced as long as there is no cell at II .
0-7803-7476-2/02/$17.00 (c) 2002 IEEE.
D. Pros and Cons of FFF
The main advantage of FFF is that packets are not mis-sequenced, and so we can eliminate the resequencing buffer at the external outputs.
It is also interesting to compare FFF with iSLIP [5], which is a widely used practical heuristic for single-stage crossbar switches:
1. FFF has 100% throughput whenever OQ has 100% throughput (proved in Section V.C uses an N-bit programmable priority encoder to identify the first full frame. This is the same complexity as just one iteration of iSLIP. It is also possible to use an version of FFF by exploiting slightly out-of-date information, with similar delay and throughput properties. For brevity, this property is not developed in this paper.
4. Because of the predetermined and non-conflicting schedule used by both stages of switching, FFF does not need a centralized scheduler. It is sufficient for each external output to schedule the frames (and hence cells) that it will receive. This is not practical in iterative algorithms such as iSLIP, which need to be centralized because of the large amount of communication between inputs and outputs.
5. FFF does not require much information to be sent between each internal input and each external output. First, let's consider the communication from an internal input to the scheduler at an external output. Each II receives at most one new packet per time-slot, and it is known in advance from which EI it comes, because of the predetermined sequence of configurations. The II can tell the EO the packet's destination (and that a packet arrived) using bits. Now let's consider the communications from the external output to an II. Every time-slot , tells each internal input which frame (if any) it will be reading in this cycle, requiring bits.
6. FFF seems simple enough to be implemented in hardware.
7. FFF seems well suited to optical switch fabrics based on technologies such as MEMS [25] [26], VCSELs [27], tunable lasers [28], electro-holography [29], etc. This is for two reasons. First, FFF allows the switch fabric to rotate through a simple deterministic sequence of configurations that are known in advance. It seems reasonable to expect that for most optical technologies, a fixed rotational pattern of configurations is easier to implement and can be reconfigured faster than if the pattern were unpredictable. For example, with MEMS mirrors one could imagine a mirror with N facets that rotates by a fixed amount each time-slot. Second, since both stages are configured according to a fixed sequence, it may be possible to replace them with a single switch that is configured once per time-slot, with two cells transferred per configuration. In the first half of a time-slot, the switch transfers cells for the first stage (from EIs to IOs), and in the second half, it transfers cells for the second stage (from IIs to EOs).
However, FFF has some drawbacks.
1. FFF uses two switching stages instead of one. On the face of it, this is similar to using a crossbar switch with a speedup of two. In this case, there is a spatial speedup rather than a speedup in time. However, notice that the two components that normally limit the speed of the system (the bandwidth of the memories at each stage, and the scheduler) run at the same speed as the external line.
2. FFF needs buffers (first- and second-stage) instead of buffers. While it is possible to combine the buffers into shared buffers (if EIs and IIs share the same linecard), this would double the memory bandwidth.
3. FFF uses 3DQs in the internal buffer, while single-stage switches usually use only VOQs (thus requiring more pointers, and a more complicated buffer management algorithm).
4. FFF requires a load balancer at the first stage.
V. FFF PERFORMANCE
In this section we show that the average delay for the FFF algorithm is less than the average delay for OQ plus a constant, and that FFF has the same throughput as OQ. The proofs rely on the observation that FFF is work-conserving for full frames.
A. Definitions
For simplicity, we will only consider the cells destined to a given EO . We'll define the following values, as illustrated in Figure 6 .
1. is the cumulative number of cells destined to EO that have arrived to EI up to and including time-slot . It is therefore the index of the last cell from that has arrived to EI .
2. is the total number of cells destined to that have arrived to the switch up until . . In Figure 5 , .
5. is the number of ordered cells queued in the 3DQs that are destined to (e.g., ).
6. is the number of cells in full frames already arrived to the 3DQs and destined to .
11. , where is an integer, is any time-slot when the cycle for EO begins.
B. FFF Average Delay Within a Constant from OQ
In this section we will show that the average delay for FFF is within a constant delay of the average delay for an OQ switch for the same arriving traffic.
We will first compare FFF with the delayed OQ, which is an OQ having the same ordered arrivals as the second stage. We will show that FFF is work-conserving for full frames, and therefore services nearly as many full frames and as many cells as the delayed OQ model, with a queue size almost as small. This results in a bounded average delay difference with the delayed OQ model (Theorem 1). Then we compare the delayed OQ model with a regular OQ switch having the same packet arrivals as the first stage. Using a delay bound, we finally show that there exists a bounded average delay difference between FFF and an OQ switch (Theorem 2).
We start by establishing that whenever there is at least one full frame, the number of serviced full frames increases by one in the next cycle. 
[Figure labels: A(t), arrived cells; B OQ D (t), serviced cells.]
frame will be serviced.
Since Lemma 1 shows that FFF is work-conserving for full frames, Lemma 2 shows that the number of serviced packets is close to the number of packets serviced by an OQ switch.
Lemma 2
Proof: By induction on:
.
If , and , so the inequality holds. Now, let the inequality be true for .
Case 1:
. Then,
Case 2:
. Using Lemma 1,
Since Lemma 2 shows that the number of packets serviced by FFF is close to the delayed OQ in some sense, Lemma 3 concludes that the queue size for FFF is bounded by the sum of the queue size for the delayed OQ and a constant.
Lemma 3
and . Proof: Using the definitions and Lemma 2:
The next two lemmas show that FFF efficiently uses bandwidth in order to remove the packets from the queues. Since FFF is work-conserving for full frames, Lemma 4 shows that if there are full frames at time , then exactly full frames will be served in the next cycles. Lemma 5 generalizes this idea and considers what FFF does with the remaining packets when there are no full frames. Hence .
Lemma 5
Proof:
We already know from Lemma 4 that . Now we need to show that during the next cycles, at least cells will be serviced. Let's distinguish between two cases. Case 1: during these cycles, the EIs are each serviced at least once in a round-robin fashion as non-full frames (i.e., is incremented at least times).
Note that there is no full frame to service any more if non-full ones are serviced (since full frames have priority over non-full ones). This implies that every cell that is in a non-full frame at time is either serviced in the round-robin among non-full frames, or has been already serviced as part of a full frame that has been formed since.
Case 2: during these cycles, there are at least full frames serviced (note that , so there is no other case by Dirichlet's pigeon-hole principle).
Therefore, using Lemma 1 and Lemma 4:
By definition for all , and the result follows.
We have shown that was tracking with a delay dependent on . Since we have linked with , we can find a first bound on the average delay for FFF as a function of the average delay in the delayed OQ model. This bound will be useful in order to compare FFF with the regular (non-delayed) OQ model.
Theorem 1
The average delay for FFF is less than the average delay for the delayed OQ plus a constant .
Proof:
We know that and (Lemma 3 and Lemma 5).
(because is a non-decreasing function), i.e.
. But we also know that:
Hence, .
This implies that the average delay for a cell coming at time is the delay that it would have under the delayed OQ algorithm, plus at most . To see this, note that the average delay does not depend on the order in which the cells are picked. Thus, FFF has the same average delay as the FCFS algorithm which has the same cumulative number of arrivals and departures as FFF (hereafter called FFF-FCFS). FFF-FCFS would obviously satisfy the last formula. Therefore, the time spent by a new packet in the internal outputs is with the delayed OQ, and at most with FFF-FCFS. Hence the difference is bounded by .
Finally, note that all computations up until now were for a that begins a cycle for output . If we choose any nonnegative integer , then let be the beginning of the cycle to which belongs, and let . We get:
Hence, and the result is thus applicable to any time-slot .
Theorem 2
The average delay for FFF is less than the average delay for OQ plus a constant . Proof: We compare the delays for OQ and for the delayed version of OQ. Let . We'll first show that the delay for any cell in the delayed OQ is less than its delay for OQ plus .
For any time-slot s, let be the time-slot that marks the beginning of the cycle to which belongs:
, and .
Then, according to the properties of the delayed OQ we have:
Let . Since packets don't arrive before time-slot 0:
Hence, since the delayed OQ and OQ are both FCFS, the difference of delay for each cell between those two systems will be at most D, and the difference of average delay between FFF and OQ will be at most .
It is worth asking if the delay difference (approximately ) is significant. For a high-speed router with 32 ports, OC768 (40 Gb/s line-rates) and a cell size of 64 bytes, (the time taken for light to travel approximately 10 miles).
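To get a feel for the orders of magnitude in this example, the sketch below computes the duration of one time slot from the stated line rate and cell size, and expresses the light-travel time over 10 miles as a number of time slots. The speed of light in vacuum and the statute mile are assumptions chosen for the illustration; the sketch does not reproduce the delay constant itself.

```python
CELL_BITS = 64 * 8                # 64-byte cell
LINE_RATE_BPS = 40e9              # OC768, 40 Gb/s
SPEED_OF_LIGHT_MPS = 299_792_458  # in vacuum (assumption)
TEN_MILES_M = 10 * 1609.344       # statute miles (assumption)

cell_time = CELL_BITS / LINE_RATE_BPS           # seconds per time slot
light_time = TEN_MILES_M / SPEED_OF_LIGHT_MPS   # seconds for light to cover 10 miles
slots = light_time / cell_time                  # the same duration in time slots

print(f"cell time: {cell_time * 1e9:.1f} ns")
print(f"light over 10 miles: {light_time * 1e6:.1f} us, about {slots:.0f} time slots")
```

Under these assumptions one time slot lasts 12.8 ns, so a delay equal to the time light takes to travel 10 miles corresponds to a few thousand time slots.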
It is possible to improve this bound using a different algorithm that would take into account the number of cells present in the non-full frames, which FFF does not do. However, this would increase the complexity and the communication in the switch, and we believe that the trade-off is not worth it.
C. FFF Has the Same Throughput As OQ
Let's first provide a few definitions. Consider a switch with traffic arrival rates (from EI to EO ), and total queueing size , where is the current time-slot.
1. The load of the arrival traffic is: . The arrival traffic is said to be admissible if .
2. The switch is said to be strongly stable if [6] [22].
3. The switch is said to have 100% throughput if it is strongly stable whenever the arrival traffic is admissible. Similarly, it is said to have a throughput of if it is strongly stable whenever .
We have seen that there exists a bounded average delay difference between FFF and an OQ switch. As a consequence, FFF has the same throughput as OQ, as Theorem 3 illustrates. Taking into account both the buffering in the 3DQs for the external outputs and the buffering in the VOQ 1 s from the external inputs, we get (using for EO ):
Thus, , since OQ is work-conserving. Hence the result.
Note that this theorem is quite strong, because OQ is an ideal switch from a throughput point of view. In addition, note that the proof shows that at any time, the buffering needed with FFF is within a constant from the ideal buffering needed with OQ.
Property 4 FFF has 100% throughput with admissible Bernoulli i.i.d. arrival traffic.
Proof: OQ is known to have 100% throughput with admissible Bernoulli i.i.d. arrival traffic (this can be either proved directly, or by using the fact that OQ is work-conserving, thus , with Maximum Weight Matching (MWM) having 100% throughput [7] ). The result follows using Theorem 3. Since , the maximum queue length size in OQ is [24] . The result follows using Theorem 3.
Property 5
VI. CONCLUSION
Over the last few years, there have been many results that show the conditions under which a single-stage crossbar switch with input queues and no speedup can achieve 100% throughput. To our knowledge, there have been no results that bound the difference in average delay between an ideal output queued switch and an input queued switch without speedup. Such bounds have only been possible when the switch runs with a speedup of at least two, has two stages of buffering (input and output queues) and uses a complicated (impractical) scheduling algorithm.
The two-stage switch introduced by Chang achieves 100% throughput as well as a bound on the delay between it and an output queued switch. This is achieved without speedup and without a complicated scheduling algorithm, and therefore represents an important step towards efficient, high capacity switches with delay guarantees.
In its simplest form, the two-stage switch mis-sequences packets, hence motivating the work presented in this paper. The Full Frames First algorithm prevents mis-sequencing while maintaining the throughput and delay properties of the basic switch. While it clearly introduces more complexity, the algorithm appears practical at high speed.
We believe that the most interesting application of the two-stage switch is as the optical switching fabric in an otherwise electronic Internet router. The switch fabric in a router is generally limited by its power consumption, its size and the need for a complex scheduler. While optics can reduce both size and power, a single-stage optical crossbar switch still requires an electronic scheduler. The two-stage switch can be incorporated without the need for a separate scheduler: because the switch moves through a deterministic sequence of configurations, scheduling packets consists only of distributing a timing reference to the linecards. Furthermore, since the two stages of the switch are configured according to a fixed sequence, it may be possible to replace them by a single switch that is configured once per time slot, with two cells transferred per configuration.
ACKNOWLEDGMENTS
The authors would like to thank Balaji Prabhakar, Rui Zhang, Shang-Tse Chuang, Devavrat Shah, and the anonymous referees for their valuable comments.
