Providing Guaranteed Rate Services in the Load Balanced Birkhoff-von Neumann Switches
Cheng-Shang Chang, Fellow, IEEE, Duan-Shin Lee, Senior Member, IEEE, and Chi-Yao Yue
Abstract-In this paper, we propose two schemes for the load balanced Birkhoff-von Neumann switches to provide guaranteed rate services. The first scheme is based on an earliest eligible time first (EETF) policy. In such a scheme, we assign every packet of a guaranteed rate flow a targeted departure time that is the departure time from the corresponding work conserving link with capacity equal to the guaranteed rate. By implementing the EETF policy with jitter control mechanisms and first come first serve (FCFS) queues, we show that the end-to-end delay for every packet of a guaranteed rate flow is bounded by the sum of its targeted departure time and a constant that only depends on the number of flows and the size of the switch.
Our second scheme is a frame based scheme as in Keslassy and McKeown, 2002. There, time slots are grouped into fixed size frames. Packets are placed in appropriate bins (buffers) according to their arrival times and their flows. We show that if the incoming traffic satisfies certain rate assumptions, then the end-to-end delay for every packet and the size of the central buffers are both bounded by constants that only depend on the size of the switches and the frame size. The second scheme is much simpler than the first one in many aspects: 1) the on-line complexity is $O(1)$ as there is no need for complicated scheduling; 2) central buffers are finite and thus can be built into a single chip; 3) connection patterns of the two switch fabrics are changed less frequently; 4) there is no need for a resequencing-and-output buffer after the second stage; and 5) variable length packets may be handled without segmentation and reassembly.
Index Terms-Birkhoff-von Neumann switches, guaranteed rate services, multicasting flows, multi-stage switches, variable length packets.
I. INTRODUCTION

In order to provide the needed speedup to match the speed of fiber optics, parallel buffered switches, capable of performing parallel read/write, have received a lot of attention recently (see, e.g., [16], [17] and references therein). Traditionally, the study of parallel buffered switches is limited to the (single-stage) input-buffered crossbar switch (see, e.g., [1], [9], [15], [19], [21]-[26], [29], [30]), where each input has a segregated buffer. In such a switch, time is slotted and synchronized so that packets in different input buffers can be read out simultaneously within a time slot. There are two well-known problems in an input-buffered switch: low throughput due to head-of-line (HOL) blocking and the difficulty in controlling packet delay. The HOL problem can be solved by using the virtual output queueing (VOQ) technique. Instead of having a single first come first serve (FCFS) queue at each input port, the VOQ technique maintains a separate (logical) queue for each output port at each input port. To control packet delay, one easy solution is to provide bandwidth guarantees in an input-buffered switch. In [15], Hung, Kesidis, and McKeown used the idling weighted round robin (WRR) algorithm of [2] to achieve a rate guarantee for each input-output pair without internal speedup. Similar approaches are also addressed in [21] and [22]. As with the usual WRR algorithm, all of these are frame based schemes and might have the granularity problem for bandwidth guarantees.
To cope with the granularity problem due to framing, the Birkhoff-von Neumann input-buffered switch is proposed in [5] and [6] for guaranteed rate service between each input-output pair (see Fig. 1). As in most input-buffered switches, the Birkhoff-von Neumann switch uses the VOQ technique to solve the HOL blocking problem. The main idea of scheduling the connection patterns in the Birkhoff-von Neumann switch is to use the capacity decomposition approach by Birkhoff [3] and von Neumann [34] (for the details of the decomposition algorithm, we refer to [5] and [6]). The computational complexity of the decomposition is $O(N^{4.5})$ for an $N \times N$ switch. The on-line scheduling algorithm used there is a simplified version of the Packetized Generalized Processor Sharing (PGPS) algorithm in Parekh and Gallager [28] (or the Weighted Fair Queueing (WFQ) in Demers, Keshav, and Shenker [12]). The complexity of the on-line scheduling algorithm is $O(\log N)$. There are several drawbacks to the Birkhoff-von Neumann switches: 1) Computational complexity: the Birkhoff-von Neumann decomposition itself is non-trivial (with the order of complexity $O(N^{4.5})$), even though such a decomposition only needs to be computed when the rates change. 2) Memory complexity: the number of permutation matrices generated from the Birkhoff-von Neumann decomposition is $O(N^2)$. These matrices have to be stored in the switch. 3) Multicast: the Birkhoff-von Neumann switch does not support multicast. Multicasting flows can only be supported through point-to-point flows. 4) Variable length packets: in the Birkhoff-von Neumann switch, time is slotted and packets are assumed to fit in a time slot. Variable length packets have to be segmented at the inputs and then re-assembled at the outputs.
To cope with the first three drawbacks in the Birkhoff-von Neumann switch, the load balanced Birkhoff-von Neumann switch with one-stage buffering is proposed in [7].
The main idea is to add a load balancing stage in front of the Birkhoff-von Neumann input-buffered switch (see Fig. 2 ). In a time slot, the crossbar switch at the first stage sets up connection patterns corresponding to permutation matrices that are periodically generated from a one-cycle permutation matrix. By so doing, the first stage performs load balancing for the incoming traffic. As the traffic coming into the second stage is load balanced, it suffices to use the same simple periodic connection patterns as in the first stage to perform switching at the second stage. Thus, there is no need to carry out the Birkhoff-von Neumann decomposition. To support multicast, fan-out splitting is done at the central buffer (the buffer between two crossbars). It is shown in [7] that the load balanced Birkhoff-von Neumann switch indeed achieves 100% throughput (under a mild technical condition) for both point-to-point and multicasting flows. However, the main drawback of the load balanced Birkhoff-von Neumann switch with one-stage buffering in [7] is that packets might be out of sequence.
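As a minimal sketch of such periodic connection patterns, the following Python snippet assumes that the one-cycle permutation is the circular shift by one, so that at time slot t input i is connected to output (i + t) mod N (ports are 0-indexed here). The shift rule and the function names are illustrative assumptions and are not taken from [7]; any one-cycle permutation would give the same "every output once per period" property.

```python
def stage_connection(i, t, n):
    """Output port (0-indexed) connected to input i at time slot t.

    Assumption: the one-cycle permutation is the circular shift by one,
    and the pattern used at slot t is its t-th power.
    """
    return (i + t) % n


def visits_in_one_period(i, n, start=0):
    """Outputs reached by input i over n consecutive slots from `start`."""
    return [stage_connection(i, t, n) for t in range(start, start + n)]


if __name__ == "__main__":
    n = 4
    for i in range(n):
        # Each row lists the outputs seen by input i during one period.
        print("input", i, "->", visits_in_one_period(i, n))
```

Under this assumption, each input reaches every output exactly once in any window of N consecutive slots, and the pattern at each slot is a permutation, so no two inputs contend for the same output.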
In [8] , the load balanced Birkhoff-von Neumann switch with multi-stage buffering is proposed to solve the out-of-sequence problem. There, load-balancing buffers are added in front of the first switch and resequencing-and-output buffers are added after the second switch. As in [17] , packets are distributed in the round-robin fashion according to their flows in the load balanced Birkhoff-von Neumann switch with multi-stage buffering. By so doing, it is shown in [8] that the delay through the first stage can be bounded by a constant that only depends on the size of the switch and the number of flows supported by the switch.
Two scheduling policies in the central buffers are presented in [8] : the first come first serve (FCFS) policy (see Fig. 3 ) and the earliest deadline first (EDF) policy (see Fig. 4 ). For the FCFS policy, a jitter control mechanism is added in the VOQ in front of the second stage. It delays every packet to its maximum delay at the first stage so that the flows entering the second stage are simply time-shifted flows of the original ones. For the EDF policy, every packet is assigned a deadline that is the departure time from the corresponding output-buffered switch. The central buffers then schedule packets according to their deadlines.
After the second stage, packets are stored in the resequencing-and-output buffer. The resequencing-and-output buffer conceptually consists of two virtual buffers: 1) the resequencing buffer and 2) the output buffer. The objective of the resequencing buffer is to reorder the packets so that packets of the same flow depart in the same order as they arrive. After resequencing, packets are stored in the output buffer waiting for transmission from the output link. It is shown in [8] that for both the FCFS and EDF schemes the end-to-end delay is bounded above by the sum of the delay through the corresponding FCFS output-buffered switch and a constant that depends on the size of the switch and the maximum number of flows supported by the switch. Moreover, the size of the resequencing-and-output buffer for the FCFS (resp. EDF) policy is also bounded above by a constant that depends on the size of the switch and the maximum number of flows supported by the switch. In short, the load balanced Birkhoff-von Neumann switch with multi-stage buffering is able to emulate the ideal FCFS output-buffered switch up to a constant delay, and this is done without speedup and conflict resolution. We also note that the idea of using load balancing was previously explored in the literature via randomization (see, e.g., [27] , [33] ). However, load balancing via randomization does not yield deterministic bounds.
The drawback of the load balanced Birkhoff-von Neumann switch with multi-stage buffering is its hardware implementation complexity for the resequencing-and-output buffer and the jitter control mechanism. In [20], Keslassy and McKeown developed a clever scheme that uses the full frame first (FFF) scheduling policy in the central buffers. In such a scheme, packets of the same flow at the central buffers are grouped into frames with frame size equal to the number of inputs. By so doing, packets of the same flow depart in the FCFS order. As such, there is no need for the resequencing-and-output buffer.
The load balanced Birkhoff-von Neumann switches in [7] , [8] , and [20] only provide the best effort service. The main objective of this paper is to investigate schemes for providing guaranteed rate services in the load balanced Birkhoff-von Neumann switches. We develop two schemes for doing this. As in [8] , the first scheme is based on an earliest deadline first (EDF) scheduling policy. Instead of using the departure time from the corresponding output-buffered switch, in the first scheme we assign every packet of a guaranteed rate flow a targeted departure time that is the departure time from the corresponding work conserving link with capacity equal to the guaranteed rate. The jitter control mechanism in front of the central buffer then uses the targeted departure time to regulate the traffic. By running the earliest eligible time first (EETF) policy, we show that the end-to-end delay for every packet of a guaranteed rate flow is bounded by the sum of its targeted departure time and a constant that only depends on the number of flows and the size of the switch. The detailed architecture and its analysis for this scheme will be presented in Section II.
The second scheme is a much simpler one and has a framed structure as in Keslassy and McKeown [20]. There, time slots are grouped into fixed size frames. Packets are placed in appropriate bins (buffers) according to their arrival times and their flows. We show that if the incoming traffic satisfies certain (rate) assumptions, then the end-to-end delay for every packet and the size of the central buffers are both bounded by constants that only depend on the size of the switches and the frame size. The second scheme is much simpler than the first one in many aspects: 1) the on-line complexity is $O(1)$ as there is no need for complicated scheduling; 2) central buffers are finite and thus can be built into a single chip; 3) connection patterns of the two switch fabrics are changed less frequently; 4) there is no need for a resequencing-and-output buffer after the second stage; and 5) variable length packets may be handled without segmentation and reassembly. The detailed architecture and its analysis will be shown in Section III.
For the ease of our presentation, in this paper we assume that packets are of the same size (unless otherwise specified). Moreover, time is slotted and synchronized so that a packet can be transmitted within a time slot.
II. LOAD BALANCED BIRKHOFF-VON NEUMANN SWITCH ARCHITECTURE WITH THE EARLIEST ELIGIBLE TIME FIRST POLICY
In this section, we propose a scheme for providing guaranteed rate services in an $N \times N$ switch with multicasting flows. This scheme is based on the load balanced Birkhoff-von Neumann switch architecture in [8]. As shown in Fig. 5, the switch architecture consists of two crossbar switch fabrics and three stages of buffers. These three stages of buffers are the load-balancing buffers, the central buffers, and the output buffers. As in [8], the connection patterns of both crossbar switch fabrics are generated from a one-cycle permutation matrix. As such, these connection patterns are periodic with period $N$ and every input is connected to every output exactly once in every $N$ time slots. The objective of the first stage is to perform load balancing. The load-balancing buffer at an input consists of $N$ virtual output queues (VOQs). Packets in the $j$th VOQ of the load-balancing buffer of the $i$th input will be sent to the $j$th central buffer. Suppose that there are $L_i$ multicasting flows at the $i$th input port, $i = 1, 2, \ldots, N$. Packets that belong to the same multicasting flow are routed to the $N$ VOQs in a round-robin fashion. Without loss of generality, one may assume that the first packet in a flow is always routed to the first VOQ. To be precise, let $A_{i,k}(t)$ be the cumulative number of arrivals of the $k$th multicasting flow at the $i$th input by time $t$, and $A_{i,k}^{j}(t)$ be the cumulative number of packets from that flow that are routed to the $j$th VOQ at the $i$th input by time $t$. Then

$$A_{i,k}^{j}(t) = \left\lceil \frac{A_{i,k}(t) - j + 1}{N} \right\rceil. \tag{1}$$

One key result for such a load-balancing mechanism (shown in [8]) is the following lemma.
Lemma 1:
The maximum delay for a packet to depart from the first crossbar switch fabric is bounded above by a constant , where is the maximum number of flows supported at an input.
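The round-robin routing rule that precedes Lemma 1 can be checked with a short sketch. It assumes the closed form ceil((A - j + 1) / N) for the number of packets of one flow sent to the j-th VOQ (1-indexed) after A arrivals, which follows from the convention that the first packet of a flow goes to the first VOQ. The function names are hypothetical.

```python
def voq_counts_by_simulation(num_arrivals, n):
    """Route packets of one flow round-robin over n VOQs, starting at the
    first VOQ, and return the list of per-VOQ packet counts."""
    counts = [0] * n
    for packet in range(num_arrivals):
        counts[packet % n] += 1
    return counts


def voq_count_closed_form(num_arrivals, j, n):
    """Assumed closed form for VOQ j (1-indexed): ceil((A - j + 1) / n).

    Integer arithmetic is used for the ceiling to avoid float rounding.
    """
    x = num_arrivals - j + 1
    return -((-x) // n)


if __name__ == "__main__":
    n = 4
    for arrivals in (0, 1, 5, 8):
        print(arrivals, voq_counts_by_simulation(arrivals, n))
```

The per-VOQ counts of a flow never differ by more than one packet, which is the sense in which the first stage balances the load across the central buffers.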
To provide guaranteed rate services, every packet of a (guaranteed rate) flow is assigned a targeted departure time that is the departure time from the corresponding FCFS work conserving link with capacity equal to the guaranteed rate of the flow. After leaving the first stage, a packet enters the jitter control stage in front of the central buffer. The time for a packet to leave the jitter control stage, called the eligible time of that packet, is set to be the sum of the targeted departure time and the maximum delay of the first stage (i.e., ). In the central buffer, packets are scheduled under the FCFS policy. We note that in implementation one may combine both the jitter control mechanism and the central buffer by using a single memory block. By time stamping every packet with its eligible time, the scheduling policy there is to schedule the first eligible packet. Such a policy is called the earliest eligible time first policy in this paper, and it can also be easily implemented by the well-known time wheel implementation for resequencing [32] . Another point is that best effort service can be provided as background traffic. Flows from best effort service can be assigned to a low priority queue and they are only served when there are no packets from guaranteed rate services in the central buffer.
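The combined jitter control and FCFS buffer can be sketched as a single queue of time-stamped packets. The sketch below is an illustration only: it uses a binary heap (Python's `heapq`) in place of the time wheel of [32], and the class and method names are hypothetical. A packet is released only once the current time has reached its eligible time, and ties are broken in arrival order.

```python
import heapq
import itertools


class EarliestEligibleTimeFirstQueue:
    """Holds time-stamped packets and releases the earliest eligible one."""

    def __init__(self):
        self._heap = []
        # Monotone counter: breaks ties between equal eligible times in
        # insertion (FCFS) order and avoids comparing packet objects.
        self._order = itertools.count()

    def push(self, eligible_time, packet):
        heapq.heappush(self._heap, (eligible_time, next(self._order), packet))

    def pop_eligible(self, now):
        """Return the packet with the smallest eligible time <= now,
        or None if no packet is eligible yet."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None

    def __len__(self):
        return len(self._heap)
```

A heap costs logarithmic time per operation, whereas a time wheel indexed by eligible time can reach constant time per slot at the price of a wheel whose size grows with the buffer; the heap is used here only because it keeps the example short.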
To be precise, let be the guaranteed rate of the flow. Now consider feeding the flow to a fluid work conserving link with capacity (see Fig. 6 ). Assume that the buffer in the fluid work conserving link is infinite and empty at time 0. Every packet brings in one unit of fluid to the fluid work conserving link. Let be the cumulative number of fluid departures at the output by time . From [4] , Lemma 1.3.1, one has the following well-known representation:
(2) The eligible time of the th packet of the flow at the central buffer is then set to be . For the multicasting flows, fanout splitting is also performed at the central buffer. Thus, a packet departing from the jitter control mechanism is duplicated and distributed to the VOQs corresponding to its destined outputs. By scheduling the first eligible packet in every VOQ, we can show that the maximum delay for a packet to depart the second crossbar switch fabric is bounded. The proof of Lemma 2 is shown in Appendix A.
Lemma 2: Let be the set of flows through the th output, and be the number of multicasting flows through the th output port. Define as the maximum number of multicasting flows through an output port. Suppose that all the buffers are empty at time 0 and (5) Then the maximum delay for a packet to depart the second crossbar switch fabric is bounded by the sum of its target departure time and , where and .
After the second crossbar switch fabric, a packet is placed in another jitter control mechanism. As there is a maximum delay for every packet to depart the second crossbar switch fabric, the eligible time for a packet at this jitter control mechanism is set to be the sum of its target departure time and . Once a packet becomes eligible, it is placed in the output buffer. The scheduling policy for the output buffer is also FCFS. As addressed before, one may combine the jitter control mechanism with the FCFS buffer by using the earliest eligible time first policy. The following is the main theorem of this scheme. The proof of Theorem 3 is given in Appendix B.
Theorem 3: Suppose that all the buffers are empty at time 0 and that the rate condition in (5) holds. Then the following results hold.
1) Every packet of a guaranteed rate flow departs from the switch not later than the sum of its targeted departure time and , where , and , and 2) The output buffer at an output port of the second stage is bounded by . As pointed out by one of the reviewers, the constant delay bound in Theorem 3 is due to the granularity of packet size. If one uses the fluid model for the flows of packets, then there is no granularity. As such, flows are distributed uniformly among the central buffers and the central buffers behave as a single shared memory.
To implement this scheme, we note that the target departure time needs to be time stamped in front of every packet. The computational complexity of the target departure time in (4) is $O(1)$ as it can be implemented recursively by leaky buckets [31]. Once the target departure time is available to a central buffer, there is no need for communication between central buffers. The complexity in implementing this scheme is the same as that for the implementation of the earliest eligible time first policy. For such a policy, there is a well-known time wheel implementation [32]. The size of the wheel depends on the size of the central buffer that one would like to implement.
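The recursive computation of target departure times can be sketched as follows. The sketch assumes a continuous-time FCFS fluid link of constant rate r, to which every packet adds one unit of fluid, so that the time F_n at which the fluid of the n-th packet has fully drained satisfies F_n = max(a_n, F_{n-1}) + 1/r, where a_n is the arrival time. The exact definition in (4), including any rounding to slot boundaries, is not reproduced here, and the function name and arguments are hypothetical.

```python
from fractions import Fraction


def target_departure_times(arrival_times, rate):
    """Finish times on a fluid FCFS link of capacity `rate` (packets/slot).

    Each packet contributes one unit of fluid. Only the previous finish
    time is kept as state, so the work per packet is constant.
    `arrival_times` must be non-decreasing.
    """
    rate = Fraction(rate)
    finish = Fraction(0)  # the link is empty at time 0
    result = []
    for a in arrival_times:
        # Service of this packet starts when it arrives or when the
        # previous packet has drained, whichever is later.
        finish = max(Fraction(a), finish) + 1 / rate
        result.append(finish)
    return result


if __name__ == "__main__":
    # Two back-to-back packets at time 0 and a later one, rate 1/2.
    print(target_departure_times([0, 0, 5], Fraction(1, 2)))
```

Exact rational arithmetic is used so that rates such as 1/3 do not accumulate floating-point error over a long packet sequence.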
III. FRAME-BASED SCHEME FOR GUARANTEED RATE SERVICES
The drawback of the previous scheme is its hardware implementation complexity of the earliest eligible time first policy (even though it can be implemented with time wheels). Moreover, only fixed size packets are considered. In order to provide guaranteed rate services for variable length packets, variable length packets have to be segmented into fixed size packets, transmitted through the switch, and reassembled at the output. The objective of this section is to propose a simpler scheme that does not require implementing complicated scheduling. Furthermore, variable length packets may not need to be segmented.
The idea of the second scheme, as in Keslassy and McKeown [20] , is to use a framed structure so that resequencing is not needed. The architecture of the scheme is shown in Fig. 7 . For ease of presentation, we shall describe the scheme for fixed size packets and point-to-point flows. Extensions to variable length packets and multicasting flows will be addressed at the end of this section. As in the load balanced Birkhoff-von Neumann switch in the previous section, there are two crossbar switch fabrics and buffers between these two crossbar switch fabrics. In this scheme, time slots are grouped into fixed size frames. Each frame has time slots. Thus, the th time frame is from time slot to time slot (see Fig. 8 ). Let be the cumulative number of (fixed size) packet arrivals by time at the th input port to the th output port, , . Let be the guaranteed rate of the flow. Assume that is chosen so that is an integer for , . We will show that the switch architecture in Fig. 7 provides guaranteed rate services under the following assumptions. Note that (A2) implies that , for . These inequalities and those in (A3) are known as the "no-overbooking" conditions in [15] , as they simply state that neither the total rate coming out from an input port nor the total rate to an output port can be larger than 1.
Before we give the detailed description, we first provide an intuitive argument how the second scheme works. Let be the traffic matrix with being the maximum number of packets of the flow in a time frame. Then from (A1), (A2), and (A3), we know that both the row sums and the column sums of the matrix are not greater than . During any frame, for each , the th input writes the th row of the matrix into one of the central buffers. Similarly, for each , the th output reads the th column of the matrix from one of the central buffers. As both the row sums and the column sums are not greater than , it is clear that a central buffer with buffer size (for each output) is enough to make sure that the number of reads and writes does not exceed the allowed amount. The only question left is how to coordinate the reads and writes in a time interleaved fashion for the inputs and outputs. For this, we need to specify the connection patterns of the two crossbar switch fabrics.
Unlike the last section, both switch fabrics now change their connection patterns according to time frames. In a time frame, both crossbar switches in Fig. 7 set up connection patterns corresponding to a "circular-shift" matrix. Specifically, if the th output port is connected to the th input port during the th time frame, then the th output port will be connected to the th input port during the th time frame, for . If the th output port is connected to the th input port during the th time frame, then the th output port will be connected to the first input port during the th time frame. Initially, we set the connection patterns so that the th output port is connected to the first input port during the th time frame. To be precise, we define the function (6) During the th time frame, the th input port is connected to the th output port of these two crossbar switch fabrics. As such, all the packets that arrive at the th input port during the th frame are all routed to the th output port. Note that the inverse function of is itself. Thus, the connection pattern is symmetric. As such, during the th frame, the th output port is also connected to the th input port for these two switch fabrics. To illustrate this, we show in Fig. 9 the periodic connection patterns for a 4 4 switch fabric. Specifically, the connection pattern in Fig. 9(a) is for , the connection pattern in Fig. 9(b) is for , the connection pattern in Fig. 9(c) is for , and the connection pattern in Fig. 9(d) is for . There are central buffers between these two switch fabrics, indexed from 1 to . Each central buffer consists of two alternating memory blocks as the double-queue structure in [13] (or the ping-pong buffer in [18] ). The buffer size of each memory block is , which is divided into bins, each with buffer size of . To ease the presentation for the operation of these central buffers, we introduce the concept of superframes. 
The th superframe of the th input port of both stages is defined to be the set of time slots in the time frames, starting from the th frame to the th frame. Note that the th time frame in the th superframe of the th input port (of both stages) is the th frame. Since it follows that during the th time frame in the th superframe of the th input port, the th input port is always connected to the th output port. Moreover, the th time frame in the th superframe of the th input port is also the th frame in the th superframe of the th input port. Consider a particular packet that arrives at the th input port of the first stage during the th time frame in the th superframe of the th input port. As just described, the th input is connected to the th output during that frame and the packet is thus sent to the th central buffer without delay. As there are two alternating memory blocks in the th central buffer, the packet is sent to the second (resp. first) memory block if is odd (resp. even). If, furthermore, the packet is destined for the th output port, it will be placed in the th bin of that memory block. As each bin only has a buffer of size , one might wonder whether there is enough buffer space for such an assignment. We will show in Theorem 5 that under the assumptions in (A1)-(A4) there are no packet overflows for such an assignment.
Without loss of generality, let us assume that is odd and the packet is placed in the th bin of the second memory block of the th central buffer. During the th time frame in the th superframe of the th input port of the second stage, the th input port of the second stage is connected to the th output of the second stage. As each frame has time slots and each bin can hold at most packets, during that frame all the packets in the th bin of the second memory block of the th central buffer are transmitted to the th output of the second stage.
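One closed form that is consistent with the circular-shift description above is h(i, m) = ((m - i) mod N) + 1, with ports and frames indexed from 1. It is inferred from the prose (the j-th output is connected to the first input in the j-th frame, the input index advances by one in each frame, and the map is its own inverse) rather than copied from (6), so it should be read as an assumption. The function name is hypothetical.

```python
def connected_port(i, m, n):
    """Port (1-indexed) that port i is connected to during frame m >= 1.

    Assumed circular-shift rule: h(i, m) = ((m - i) mod n) + 1.
    Python's % always returns a non-negative result for positive n.
    """
    return (m - i) % n + 1


def frame_in_superframe(i, superframe, k, n):
    """Global index of the k-th frame in the given superframe of port i,
    assuming the first superframe of port i starts at frame i."""
    return (superframe - 1) * n + i + k - 1


if __name__ == "__main__":
    n = 4
    for m in range(1, n + 1):
        pattern = [connected_port(i, m, n) for i in range(1, n + 1)]
        print("frame", m, "inputs 1..4 ->", pattern)
```

With this rule, the k-th frame of every superframe of port i connects port i to port k, which is the property the bin assignment relies on.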
Example 4: We illustrate this scheme by a $4 \times 4$ switch with a frame size of 10 time slots. Consider the following $4 \times 4$ traffic matrix:
For ease of presentation, assume that is also the number of packets of the flow in every frame. Note that every row sum and column sum of in (7) is 10, which is the same as the frame size .
First, we illustrate the operation at the first input port. The operations for the other input ports are carried out in a similar manner. During each frame, there are 10 packets arriving at the first input port. Among these 10 packets, one of them is destined for the first output port, four of them are destined for the second output port, three of them are destined for the third output port, and two of them are destined for the fourth output port. During the first frame, the first input port is connected to the first central buffer, so the 10 packets arriving in the first frame at the first input port are written into the second memory block of the first central buffer. According to the placement rule, the packet destined for the first output is placed in the first bin, the four packets destined for the second output are placed in the second bin, the three packets destined for the third output are placed in the third bin, and the two packets destined for the fourth output are placed in the fourth bin. Similarly, the 10 packets arriving at the first input port during the second (resp. third, fourth) frame are written into the second (resp. third, fourth) central buffer by the same placement rule. The pattern is repeated every four frames, i.e., every four frames constitute a superframe.
In Fig. 10 , we show the whole operation at the first stage. We denote by the set of packets that arrive at the th input port of the first stage during the th time frame, and the set of packets that arrive at the th input port of the first stage during the th superframe of the th input port. Note that , , , and are all routed to the second memory block of the first central buffer. Each of the four frames is the first frame in the superframe of its input. Upon the arrival of each packet in these four frames, it is placed immediately in the bin that corresponds to its destined output. For each , contains the 10 packets described by the th row of the traffic matrix in (7) . As the column sum of in (7) is also 10, the number of packets destined for the th output port among the packets in , , , and is also 10, for . Thus, every bin in the second memory block of the first central buffer is well packed with the 10 packets destined for its corresponding output at the end of the first superframe of the first input (i.e., the end of the fourth frame). Similarly, , , , and are all routed to the second memory block of the second central buffer, and at the end of first superframe of the second input (i.e., the end of the fifth frame) every bin in the second memory block of the second central buffer is well packed with the 10 packets destined for its corresponding output. For packets in , ,
, and , they are all routed to the second memory block of the third central buffer, and at the end of first superframe of the third input (i.e., the end of the sixth frame) every bin in the second memory block of the third central buffer is well packed with the 10 packets destined for its corresponding output. Finally, for the packets in , , , and , they are all routed to the second memory block of the fourth central buffer, and at the end of first superframe of the fourth input (i.e., the end of the seventh frame) every bin in the second memory block of the fourth central buffer is well packed with the 10 packets destined for its corresponding output.
In Fig. 11 , we illustrate the whole operation for the second stage. We denote by the set of packets that depart from the th input port of the second stage during the th time frame, and the set of packets that depart from the th input port of the second stage during the th superframe of the th input port. Now consider the four bins at the second memory block of the first central buffer. Since they are ready at the end of the first superframe of the first input, packets in the first bin are routed to the first output during the first frame of the second superframe of the first input, i.e., the fifth frame. Similarly, packets in the second bin are routed to the second output during the second frame of the second superframe of the first input, i.e., the sixth frame, packets in the third bin are routed to the third output during the third frame of the second superframe of the first input, i.e., the seventh frame, and packets in the fourth bin are routed to the fourth output during the fourth frame of the second superframe of the first input, i.e., the eighth frame. In other words, contains the packets in the first bin, contains the packets in the second bin, contains the packets in the third bin, and contains the packets in the fourth bin. The four bins in the second memory block of the second central buffer are ready at the end of the fifth frame. These four bins are routed to the first output during the sixth frame, the second output during the seventh frame, the third output during the eighth frame, and the fourth output during the ninth frame. The operations for the other two central buffers are done in a similar manner as shown in Fig. 11 .
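The operation walked through in Example 4 can be replayed with a small simulation. The sketch below rests on several assumptions that should be read as such: the circular-shift rule h(i, m) = ((m - i) mod N) + 1 inferred from the text, memory blocks labeled 0 and 1 instead of "first" and "second", and a hypothetical circulant traffic matrix of which only the first row (1, 4, 3, 2) and the row and column sums of 10 come from the example.

```python
from collections import defaultdict


def connected_port(i, m, n):
    """Assumed circular-shift pattern: port that port i reaches in frame m."""
    return (m - i) % n + 1


def simulate(matrix, num_frames):
    """Replay frames 1..num_frames with matrix[i-1][j-1] packets per frame
    from input i to output j. Returns (max_bin_occupancy, departures), where
    departures maps (i, j, arrival_frame) to the departure frame."""
    n = len(matrix)
    bins = defaultdict(list)  # (buffer, block, bin) -> [(input, frame, count)]
    departures = {}
    max_occupancy = 0
    for m in range(1, num_frames + 1):
        # Second stage: buffer c drains the bin of the output it reaches,
        # taken from the block filled during its previous superframe.
        for c in range(1, n + 1):
            s = (m - c) // n  # superframe index of buffer c (floor division)
            j = connected_port(c, m, n)
            for i, arrived, _count in bins.pop((c, (s - 1) % 2, j), []):
                departures[(i, j, arrived)] = m
        # First stage: input i writes its whole row into the buffer it
        # reaches, one bin per destination output.
        for i in range(1, n + 1):
            c = connected_port(i, m, n)
            s = (m - c) // n
            for j in range(1, n + 1):
                key = (c, s % 2, j)
                bins[key].append((i, m, matrix[i - 1][j - 1]))
                occupancy = sum(count for _, _, count in bins[key])
                max_occupancy = max(max_occupancy, occupancy)
    return max_occupancy, departures


# Hypothetical circulant matrix: only its first row is taken from the text.
EXAMPLE_MATRIX = [[1, 4, 3, 2], [2, 1, 4, 3], [3, 2, 1, 4], [4, 3, 2, 1]]

if __name__ == "__main__":
    peak, deps = simulate(EXAMPLE_MATRIX, num_frames=24)
    print("peak bin occupancy:", peak)
    print("input 1, output 2, arrival frame 1 departs in frame", deps[(1, 2, 1)])
```

In this replay no bin ever holds more than 10 packets, and a packet from input i to output j leaves N + j - i frames after the frame in which it arrived, independent of all other traffic.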
Theorem 5: Assume that (A1)-(A4) hold. A packet that arrives at the th input and is destined for the th output during the th time frame in the th superframe of the th input of the first stage (i.e., the th time frame) will depart during the th time frame in the th superframe of the th input of the second stage (i.e., the th time frame), for and .
There are several consequences of Theorem 5.
1) The frame at which a packet departs depends only on the frame at which it arrives. This is independent of all other traffic (assuming (A1)-(A4) hold).
2) Even though the central buffer is finite, no packets are lost inside the switch.
3) Packets of the same flow (the same and ) depart in the FCFS order. This is trivial for packets of the same flow that arrive within the same frame. For packets of the same flow that arrive in different frames, one can see from Theorem 5 that the departure time of a packet is increasing in both and .
4) From Theorem 5, the maximum delay for all arrivals from the th input port to the th output port through the switch fabric is bounded by (8) Thus, the maximum delay for all arrivals from the th input port through the switch fabric is bounded by , which in turn is bounded above by .
Proof: (Theorem 5) From (A2), the number of packets of the flow that arrive during the th time frame in the th superframe of the th input port of the first stage (i.e., the th time frame) is bounded by . Without loss of generality, assume that is odd. The total number of packets that are placed in the th bin of the second memory block of the th central buffer during the th superframe of the th input port of the second stage is not greater than From (A3), it follows that Thus, if the th bin of the second memory block of the th buffer is empty at the beginning of the th superframe of the th input port of the second stage, then all the packets that arrive during this superframe can be placed in that bin without causing buffer overflow.
During the th time frame in the th superframe of the th input port of the second stage (i.e., the th time frame), all of the packets in that bin are routed to the th output port of the second stage. As a result, the th bin of the second memory block of the th buffer is empty again at the beginning of the th superframe of the th input port of the second stage. By induction, all packets of the flow in the th time frame of the th superframe of the th input port of the first stage (i.e., the th time frame) will depart during the th time frame in the th superframe of the th input of the second stage (i.e., the th time frame), for and . The argument for the case that is even is similar.
Now we describe how we extend the scheme for variable length packets. As there is a limit on the number of packets that can be transmitted within a time frame for a flow, buffers have to be provided at the input ports. Thus, one can use the VOQ technique for input buffers as shown in Fig. 1. Specifically, packets from the flow are queued at the th VOQ of the th input. In every time frame, one can now assign consecutive time slots for the flow at the th input. As such, variable length packets may be handled without segmentation and reassembly.
It is also possible to support the multicasting flows considered in Section II. Now the no-overbooking conditions are and Moreover, fan-out splitting needs to be carried out at the central buffers. This implies that a packet needs to be placed in multiple bins at the same time. As such, an implementation that uses pointers to the memory addresses of packets might be better than duplicating packets directly.
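The pointer-based alternative for fan-out splitting can be illustrated with a small Python sketch. It is not the implementation used in the switch: the class `CentralBufferBins`, its methods, and the reference-counting policy are hypothetical names and choices introduced only to show how one stored copy of a packet can appear in several bins at once.

```python
class CentralBufferBins:
    """Hypothetical sketch: bins hold packet ids (pointers), not packet copies."""

    def __init__(self, num_outputs):
        self.bins = [[] for _ in range(num_outputs)]  # one bin per output
        self.store = {}   # packet id -> [payload, remaining reference count]
        self._next_id = 0

    def place(self, payload, outputs):
        """Store the payload once and put a pointer in the bin of each output."""
        targets = sorted(set(outputs))
        if not targets:
            raise ValueError("a packet needs at least one destination")
        pid = self._next_id
        self._next_id += 1
        self.store[pid] = [payload, len(targets)]
        for j in targets:
            self.bins[j].append(pid)
        return pid

    def drain(self, j):
        """Route every packet in bin j to output j; free memory on last use."""
        delivered = []
        for pid in self.bins[j]:
            entry = self.store[pid]
            delivered.append(entry[0])
            entry[1] -= 1
            if entry[1] == 0:        # no other bin still points at the packet
                del self.store[pid]
        self.bins[j] = []
        return delivered


buf = CentralBufferBins(num_outputs=4)
buf.place("multicast packet", outputs=[0, 2])
first = buf.drain(0)             # the packet is still referenced by bin 2
stored_after_first = len(buf.store)
second = buf.drain(2)            # last reference, so the memory is released
```

With this layout a multicast packet occupies one memory location regardless of its fan-out, at the cost of keeping a reference count per packet.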
IV. CONCLUSION
In this paper, we proposed two schemes for the load balanced Birkhoff-von Neumann switches to provide guaranteed rate services. In the first scheme, we assign every packet a targeted departure time that is the departure time from the corresponding work conserving link with capacity equal to the guaranteed rate. By adding a jitter control mechanism in front of the buffer at the second stage and running the EETF policy, we showed that the end-to-end delay for every packet of a flow is bounded by the sum of its targeted departure time and a constant that only depends on the number of flows and the size of the switch. In comparison with the scheme for guaranteed rate services in [5] and [6], this new scheme has the following advantages:
1) There is no need to perform the Birkhoff-von Neumann decomposition in [5] and [6].
2) One only needs to implement connection patterns for each crossbar switch and these connection patterns are independent of the incoming traffic.
3) This scheme can support multicasting flows.
The main drawback of this scheme is the hardware complexity of implementing the earliest eligible time first policy.
Our second scheme is much simpler than the first one. There, time slots are grouped into fixed size frames. We showed that if the incoming traffic satisfies assumptions in (A1)-(A4), then the end-to-end delay for every packet and the size of central buffers are both bounded by constants that only depend on the size of the switch and the frame size. The second scheme has the following advantages.
1) The on-line complexity is O(1).
2) We still only need connection patterns for each crossbar switch.
3) Central buffers are finite and thus can be built into a single chip.
4) Since each crossbar switch changes its connection pattern according to time frames, the connection patterns of each switch change much less frequently in the second scheme than in the first scheme. This is a good aspect for an optical switch, since the frequency of changing connection patterns in an optical switch is constrained by its slow mechanical characteristic.
5) Since all the packets from the same flow leave the switch fabric in the FCFS order, there is no need for the resequencing-and-output buffer after the second stage.
6) This scheme may be able to handle variable length packets without segmentation and reassembly.
To summarize, in Table I we compare various switch architectures, including the ideal output-buffered switch (OQ), the input-buffered switch with maximal matching (IQ(MM)) [11], the input-buffered switch with maximum weighted matching (IQ(MWM)) [25], the combined input-output queueing switch (CIOQ) [9], [30], the Birkhoff-von Neumann switch (BvN) [5], [6], the load balanced Birkhoff-von Neumann switch with one-stage buffering (LBvN(I)) [7], the load balanced Birkhoff-von Neumann switch with multi-stage buffering (LBvN(II)) [8], the earliest eligible time first (EETF) scheme in this paper, and the frame based (Frame) scheme in this paper. In this table, the OQ switch is used as the ideal switch (in spite of the non-scalable speedup). Except for the IQ(MM) switch, all the other switch architectures in the table achieve 100% throughput. However, only the family of the load balanced Birkhoff-von Neumann switches does this with complexity in deciding the connection patterns of the crossbars. To achieve 100% throughput, the BvN, EETF and Frame schemes need to know the rate information for each input-output pair. As such, they can provide guaranteed rate services. The only switch architecture in this table that has the out-of-sequence problem is the LBvN(I) switch. To support multicasting, the family of the load balanced Birkhoff-von Neumann switches needs to perform fanout splitting at the central buffers. Once fanout splitting is done, they can achieve 100% throughput for multicasting traffic. Only the Frame scheme proposed in this paper is capable of supporting variable length packets without segmentation and reassembly. This is done by using a frame size larger than the maximum packet size. The CIOQ switch is capable of achieving exact emulation of the ideal OQ switch, while the LBvN(II), EETF and Frame schemes provide bounded packet delay when compared to the ideal OQ switch.
APPENDIX A
In this section, we prove Lemma 2. For the proof of Lemma 2, we will need to use the following well-known properties for the ceiling and floor functions.
Proposition 6:
for any positive integer . Recall that is the cumulative number of the flow packets that are split into the th VOQ at the th input port of the first stage by time . Let be the number of the flow packets that have targeted departure times not greater than . Note that the first packet of a flow is always assigned to the first VOQ at the first stage. Thus, we have (9) and (10) where is defined in (3). Let be the cumulative number of the flow packets at the th input port of the second stage by time . From Lemma 1, we know that the maximum delay at the first stage is bounded by (11) As discussed in Section II, a jitter control stage is added in front of the VOQs in the second stage (see Fig. 12 ) and the eligible time of a packet is set to be the sum of its targeted departure time and the maximum delay . Thus, we have from (9) that (12) Now consider the th VOQ at the th central buffer of the second stage (see Fig. 12 ). Denote by (resp. ) the cumulative number of arrivals (resp. departures) at that VOQ by time . Then (13) Now we show that the traffic coming into this VOQ is rate controlled. Note from (13) and Proposition 6(i) that (14) From (3), Proposition 6(iv), and (2), it then follows that (15) Replacing this in (14) and using Proposition 6(v) yields (16) Let be the cumulative number of time slots assigned to this VOQ by time . As the link at the second stage is a FCFS work conserving link, it is well-known (cf. [4] , Lemma 1.3.1(b)) that (17) Moreover, as the connection patterns at the second stage are periodic with period for some one-cycle permutation matrix (18) In order to prove that the maximum delay for a packet to depart the second crossbar switch fabric is bounded by the sum of its target departure time and , it suffices to show that the maximum delay incurred at the VOQ is bounded above by , i.e., Let . Note from (17) that (19) All the terms in the second minimum are clearly nonnegative as both and are non-decreasing in . 
On the other hand, for , we have from (18) and (5) that Using (16) and Proposition 6(ii), (iii) yields
APPENDIX B
In this section, we prove Theorem 3.
(i) Let be the cumulative departures by time to the jitter control mechanism at the th output port (see Fig. 13 ). Note that is also the cumulative arrivals by time to the 
Let be the cumulative departures by time from the th output buffer. It is well-known (see, e.g., [4] , Lemma 1.3.1) that the input-output relation for a work conserving link with capacity 1 can be represented as follows: (21) As shown in Lemma 2, the maximum delay for a packet to depart from the second crossbar switch fabric is bounded by the sum of its target departure time and . The eligible time for an flow packet is set to be the sum of its target departure time and . Thus, we have (22) where and . Using (22) and (20) in (21) yields (23) To show that every packet of a guaranteed rate flow departs from the switch not later than the sum of its targeted departure time and
, it suffices to show that there is a bounded delay at the output buffer, where . As in the proof for Lemma 2, we only need to verify that (24) Note from (21) that (25) As in the proof of Lemma 2, all the terms in the second minimum are clearly nonnegative. Thus, we only need to verify the case for . Using the inequality in (15) , one can show from (22) and (20) that (26) Applying Proposition 6(i) and the assumption in (5) yields (27) Thus, all the terms in the first minimum are also nonnegative.
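The input-output relation of a work conserving link that this argument relies on can be checked numerically. The sketch below assumes the standard discrete-time min-plus form, B(t) = min over 0 <= s <= t of [A(s) + c(t - s)], with A the cumulative arrivals, B the cumulative departures, and c the capacity; the function names and the arrival sequence are hypothetical and chosen only for illustration. It compares that formula against a direct slot-by-slot simulation of the queue.

```python
def departures_by_recursion(arrivals, capacity=1):
    """Simulate the queue slot by slot: q(t) = max(q(t-1) + a(t) - c, 0)."""
    backlog = 0
    cumulative_arrivals = 0
    departures = [0]  # B(0) = 0
    for a in arrivals:
        cumulative_arrivals += a
        backlog = max(backlog + a - capacity, 0)
        departures.append(cumulative_arrivals - backlog)  # B(t) = A(t) - q(t)
    return departures


def departures_by_min_plus(arrivals, capacity=1):
    """Evaluate B(t) = min_{0 <= s <= t} [A(s) + c * (t - s)] directly."""
    cumulative = [0]  # A(0) = 0
    for a in arrivals:
        cumulative.append(cumulative[-1] + a)
    return [
        min(cumulative[s] + capacity * (t - s) for s in range(t + 1))
        for t in range(len(cumulative))
    ]


example_arrivals = [3, 0, 0, 2, 0]  # packets arriving in slots 1..5
by_recursion = departures_by_recursion(example_arrivals)
by_min_plus = departures_by_min_plus(example_arrivals)
```

Both functions return the same cumulative departure sequence for this input, which is the property that lets the delay at the output buffer be bounded by comparing cumulative arrival and departure curves.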
(ii) Let be the cumulative arrivals by time to the jitter control mechanism at the th output port. Note that there is already a jitter control mechanism in front of the central buffer. Thus, a packet cannot arrive at the jitter control mechanism at an output port before its eligible time set by the jitter control mechanism in front of the central buffer. Since the eligible time at the first jitter control mechanism is the sum of its targeted departure time and , we then have (28) where . Using (28), (23), (26) and (27), the number of packets stored at the th output buffer at time is then bounded by
ACKNOWLEDGMENT
The authors would like to thank an anonymous reviewer of their INFOCOM 2003 paper for pointing out a granularity problem of the fluid work conserving link in the EETF scheme. The bound for the EETF scheme is now corrected in Theorem 3 of this paper. The authors would also like to thank the anonymous reviewers of this paper for various insightful comments that have greatly improved the presentation of the paper.
