Abstract: Load-balanced switches have received a great deal of attention recently as they are much more scalable than other existing switch architectures in the literature. However, as there exist multiple paths for flows of packets to traverse through load-balanced switches, packets in such switches may be delivered out of order. In this paper, we propose a new switch architecture, called the contention and reservation (CR) switch, that not only delivers packets in order but also guarantees 100% throughput. The key idea, as in a multiple-access channel, is to operate the CR switch in two modes: 1) the contention mode in light traffic and 2) the reservation mode in heavy traffic. To do this, we invent a new buffer management scheme, called virtual output queue with insertion (I-VOQ). With the I-VOQ scheme, we give rigorous mathematical proofs for 100% throughput and in-order packet delivery of the CR switch. By computer simulations, we also demonstrate that the average packet delay of the CR switch is considerably lower than that of other schemes in the literature, including the uniform frame spreading scheme [10], the padded frame scheme [8], and the mailbox switch [5].
I. INTRODUCTION
Load-balanced switches (see, e.g., [3], [5], [6], [8], [10], [11]) have received a great deal of attention recently as they are much more scalable than other existing switch architectures in the literature. A typical load-balanced switch (see Fig. 1) consists of two stages: the first stage performs load balancing, converting incoming traffic into uniform traffic, and the second stage switches the uniform traffic. The connection patterns in the switches of both stages are deterministic and periodic. As such, there is no need to find matchings as required in most input-buffered switches.
The problem with load-balanced switches is that there are multiple paths between each input/output pair. As such, packets of the same flow may be delivered out of sequence. Several tentative solutions to this problem have been proposed in the literature. Among them, the uniform frame spreading (UFS) scheme [10] is the simplest. The idea of the UFS scheme is to add virtual output queues (VOQs) at the inputs of the whole switch and operate the system in frames. Packets destined for the same output are stored in the same VOQ. Once a VOQ holds at least as many packets as the number of input/output ports, that VOQ is called a full-framed VOQ. At the beginning of a frame, a full-framed VOQ is selected and transmitted to the second stage. If there is no full-framed VOQ, then nothing is transmitted. By so doing, a full-framed VOQ "reserves" a frame (of time slots) and transmits its packets consecutively in that frame. Though the UFS scheme is shown to achieve 100% throughput [10], the packet delay is large (even in light traffic). This is known as the starvation problem, as it takes time to accumulate enough packets for a full-framed VOQ.
On the other hand, the mailbox switch (with in [5] ) has only one buffer (for storing a packet) between two stages (see Fig. 2 ). Packets have to contend for that buffer and packets might be rejected in the central buffer due to contention. To obtain the information about whether a transmission is successful or not, the mailbox switch utilizes the symmetric TDM (S-TDM) switch to provide a feedback path. As there is only one buffer, packets from the same flow are delivered in order. However, as packets have to contend for that buffer, 100% throughput cannot be achieved. In fact, it was shown in [5] that the throughput for such a switch is only 58%. The advantage of the mailbox switch is its low packet delay in light traffic. In light traffic, collisions seldom occur and packets can be transmitted immediately after their arrivals.
The main contribution of our work is to propose a switch architecture, called the contention and reservation (CR) switch (see Fig. 3), that combines the advantages of the UFS scheme in [10] and the mailbox switch in [5]. We show that the CR switch achieves 100% throughput and delivers packets in order (as in the UFS scheme) while maintaining low packet delay in light traffic (as in the mailbox switch). The main idea, as pointed out in the pioneering work by Tobagi and Kleinrock [16] for a multiple-access channel, is to have the CR switch operate in two modes: the contention mode (in light traffic) and the reservation mode (in heavy traffic). As in the UFS scheme, when there is a full-framed VOQ, the CR switch operates in the reservation mode and transmits a full frame of packets. However, when there is no full-framed VOQ, it operates in the contention mode like the mailbox switch. The difference between our scheme and [16] is that our system has multiple parallel channels while there is only one in [16]. The challenge with multiple CR channels is to keep packets in sequence.
The key innovation that enables us to do this is a new buffer management scheme, called virtual output queue with insertion (I-VOQ). There are three types of packets in an I-VOQ: fake packets, contention packets, and reservation packets. A fake packet is generated by the I-VOQ itself every time an I-VOQ becomes empty. A reservation packet (a packet transmitted in the reservation mode) is always stored at the end of an I-VOQ. A contention packet (a packet transmitted in the contention mode) can only be stored at the head-of-line position of an I-VOQ if the head-of-line packet is a fake packet. Otherwise, a contention packet is blocked and has to be retransmitted later.
With the I-VOQ scheme, we give rigorous mathematical proofs for 100% throughput and in-order packet delivery of the CR switch. By computer simulations, we also demonstrate that the average packet delay of the CR switch in light traffic is almost the same as that in the mailbox switch, and it is considerably smaller than that in the UFS scheme. Moreover, when compared with the padded frame (PF) scheme [8] , an improved scheme for the starvation problem in the UFS scheme, our delay performance is also much better in light traffic and comparable in heavy traffic.
In summary, the CR switch has the following advantages:
1) The CR switch achieves 100% throughput.
2) The CR switch maintains packets in order.
3) The communication overhead of the CR switch is .
4) The online computation overhead of the CR switch can be in the order of .
5) In light traffic, the average delay of the CR switch is about , as in the mailbox switch.
6) In heavy traffic, the average delay of the CR switch is still finite, as in the UFS scheme.
7) The CR switch transits between the contention mode and the reservation mode based on local queue lengths at each input. Hence, the control of the CR switch is distributed.
8) The size of each input buffer is bounded by .
From simulation, we will show that the CR switch performs much better in average delay than the PF scheme [8], the UFS scheme and the mailbox switch under all traffic loadings with uniform and nonuniform destination distributions. Compared with an input-buffered switch executing the iSLIP matching algorithm [13], the CR switch performs distinctly better under heavy traffic conditions. When the traffic has nonuniform destination distributions, the iSLIP algorithm cannot achieve 100% throughput, while the CR switch can. However, the iSLIP switch has a better delay performance under light to medium traffic conditions.
By simulation, we study two fairness problems of the CR switch. The first fairness problem arises because of the deterministic and periodic TDM connection pattern that the CR switch uses. This connection pattern produces a fixed priority order among inputs for any given output port. We propose a port remapping method to solve this fairness problem. In the second fairness problem, we observe that packets transmitted in the contention mode are likely to have longer delays than packets transmitted in the reservation mode. We note that a similar fairness problem exists in input-buffered switches with the maximum weighted matching with longest queue first (MWM-LQF) algorithm [14].
This paper is organized as follows. In Section II, we propose the CR switch architecture and its operation. We then show that the CR switch delivers packets in order in Section III and achieves 100% throughput in Section IV. In Section V, by computer simulation, we study the delay of the CR switch and compare it with the padded frame scheme. The paper is concluded in Section VI, where we address further research problems of the CR switch.
II. THE SWITCH ARCHITECTURE
In Fig. 3 , we show the switch architecture for an CR switch. In the CR switch, there are input ports (resp. output ports), indexed by (resp. ). As in the generic load-balanced switches [3] , [4] , the CR switch also consists of two crossbar switches. The buffers between the two crossbar switches are called central buffers, indexed by , and the buffers in front of the first crossbar switch are called input buffers, indexed by . In the CR switch, we assume that packets are of the same size. Also, time is slotted and synchronized so that a packet can be transmitted within a time slot. We index time slots by . Unless otherwise specified, by input/output ports, we mean those of the whole CR switch instead of a single crossbar switch.
In each input buffer, there are VOQs. Each VOQ stores packets of the same output destination. We index the VOQ in input buffer with output destination by VOQ . Packets arriving at an input port are stored in one of the VOQs according to their output destinations. Then, packets in the input buffers are sent to the central buffers by the first symmetric TDM (S-TDM) switch. There are two modes to send packets from the input buffers to the central buffers. One is the contention mode; the other is the reservation mode. A packet transmitted under the contention (resp. reservation) mode is called a contention (resp. reservation) packet. In the central buffers, there are I-VOQs (VOQs with insertion). Similar to a VOQ, each I-VOQ stores packets of the same output destination. We index the I-VOQ in central buffer with destination by I-VOQ . Finally, packets stored in the central buffers are transmitted to the output ports through the second symmetric TDM switch. In the following subsections, we will illustrate the function of the S-TDM switch, the I-VOQ, and the contention and reservation modes. Finally, we present an example to visualize the operation of the whole CR switch.
A. Symmetric TDM Switches
As shown in Fig. 3, there are two symmetric TDM switches in the CR switch. The connection patterns of these two switch fabrics are identical at the same time slot. Each symmetric TDM switch consists of input ports (resp. output ports) generically indexed by (resp. ). As in the mailbox switch [5], an symmetric TDM switch is merely an crossbar switch that implements the following periodic connection patterns: input is connected to output at time if and only if (1) In other words, for any positive integer , input is connected to output 1 at time , output 2 at time , and output at time . Also, it is clear from (1) that every connection pattern in a symmetric TDM switch is symmetric (as input is connected to output if and only if output is connected to input ). As such, output is connected to input 1 at time , input 2 at time , and input at time . If each input/output pair of the whole CR switch is built in the same line card, the symmetric connection patterns provide each central buffer a feedback path to its connected input buffer through its connected output port.
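The periodic and symmetric connection pattern can be illustrated with a minimal sketch. It assumes that rule (1) takes the same form as in the mailbox switch [5], namely that input i is connected to output j at time t if and only if (i + j) mod N = (t + 1) mod N, with ports indexed from 1 to N. The function name `connected_output` and the parameter names are hypothetical and serve only this illustration.

```python
def connected_output(i: int, t: int, n: int) -> int:
    """Output (1..n) connected to input i (1..n) at time slot t.

    Assumed rule: (i + j) mod n == (t + 1) mod n, solved for j in 1..n.
    """
    return (t - i) % n + 1


def is_frame_start(i: int, t: int, n: int) -> bool:
    """True when input i is connected to the first central buffer at time t."""
    return connected_output(i, t, n) == 1


if __name__ == "__main__":
    n = 3
    for t in range(1, 4):
        pattern = {i: connected_output(i, t, n) for i in range(1, n + 1)}
        print(f"t={t}: {pattern}")
```

Under this assumed rule with N = 3 and t = 1, inputs 1, 2, and 3 are connected to central buffers 1, 3, and 2, respectively, which is the pairing used in the 3 × 3 example at the end of this section. Applying the function twice returns the original port, which reflects the symmetry that provides the feedback path.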
B. I-VOQs
To maintain packets (both contention packets and reservation packets) in order, we invent a new buffer management scheme, called I-VOQ, for the central buffers. Similar to a standard VOQ, an I-VOQ stores packets of the same destination. The difference is that in an I-VOQ, an arriving packet is allowed to replace its head-of-line (HOL) packet. There are three kinds of packets in an I-VOQ: fake packets, contention packets, and reservation packets. A fake packet is generated by the I-VOQ itself every time an I-VOQ becomes empty. By so doing, a fake packet is always stored as a HOL packet, and this guarantees that there exists at least one packet in an I-VOQ. When a contention packet arrives and the HOL packet of an I-VOQ is a fake packet, then the fake packet is replaced by the contention packet and the contention packet becomes the HOL packet. Otherwise, the arriving contention packet is rejected. On the other hand, when a reservation packet arrives, it is attached to the tail of an I-VOQ (we assume that the size of every I-VOQ is infinite so that no reservation packet is lost due to buffer overflow). As there is at least one packet in an I-VOQ, we note that a reservation packet cannot be stored as a HOL packet upon its arrival at an I-VOQ. When an I-VOQ is connected to its destination output, its HOL packet (fake or not) is transmitted to the output and removed from the I-VOQ. Packets behind the HOL packet are then moved up one position; i.e., the th packet becomes the th packet. We note that the CR switch needs 1 bit of feedback information from the central buffer to the connected input buffer to indicate whether the transmission of a contention packet is successful. (In practice, one also needs this for a reservation packet as it might also be rejected due to buffer overflow.) As in the mailbox switch [5] , this 1-bit information can be sent via the feedback path provided by the two symmetric TDM switches.
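The insertion rules above can be sketched as a small data structure. This is an illustrative sketch rather than the implementation used in the paper: the class name `IVOQ`, the tuple representation of packets, and the method names are all assumptions, and the queue is unbounded, as assumed in the text.

```python
from collections import deque

FAKE = "fake"  # marker for the placeholder packet generated by the I-VOQ itself


class IVOQ:
    """Virtual output queue with insertion (illustrative, unbounded)."""

    def __init__(self):
        # An I-VOQ is never empty: it starts with a fake HOL packet.
        self.q = deque([(FAKE, None)])

    def offer_contention(self, payload) -> bool:
        """Replace a fake HOL packet; return the 1-bit success feedback."""
        if self.q[0][0] == FAKE:
            self.q[0] = ("contention", payload)
            return True
        return False  # HOL is occupied, so the contention packet is rejected

    def append_reservation(self, payload) -> None:
        """Reservation packets always join at the tail."""
        self.q.append(("reservation", payload))

    def depart(self):
        """Send the HOL packet (fake or not) to the connected output."""
        head = self.q.popleft()
        if not self.q:
            self.q.append((FAKE, None))  # regenerate a fake packet when empty
        return head
```

In this sketch, a rejected contention packet simply stays at the input, which corresponds to the failed-transmission feedback bit described above.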
C. Contention Mode and Reservation Mode
As pointed out in the pioneering work by Tobagi and Kleinrock [16], for a multiple-access channel, one should have the CR switch operating in the contention mode under light traffic to have low delay and in the reservation mode under heavy traffic to maintain system stability. The question is then how the CR switch knows whether the traffic is light or heavy without measuring it.
To answer this question, we operate the CR switch in a frame-based manner as in the UFS scheme [10]. Every frame consists of consecutive time slots. However, the beginning time slots of frames are different for different inputs/outputs. Specifically, frame of input (resp. output ) begins at the th time when input (resp. output ) is connected to the first central buffer. As such, we have from (1) that frame of input (resp. output ) consists of time slots (resp. ). If the number of packets in a VOQ at an input port is not less than , that VOQ is called a full-framed VOQ. At the beginning of a frame, if an input has a full-framed VOQ, then it is considered in heavy traffic and is operated in the reservation mode. That frame is then called a reservation frame. Otherwise, it is considered in light traffic and is operated in the contention mode. Accordingly, that frame is called a contention frame. Now, we describe the detailed operations for these two modes.
The reservation mode: Each input keeps a reservation pointer for selecting a full-framed VOQ as in iSLIP [13]. At the beginning of a reservation frame, the full-framed VOQ that is the closest clockwise to the pointer is selected. The pointer is then incremented clockwise to one location beyond the selected VOQ. Suppose that VOQ is selected. In each time slot of that frame, the HOL packet from VOQ is sent to the connected central buffer . One bit of information is also transmitted to indicate that this packet is a reservation packet. The packet is then stored at the tail of I-VOQ .
The contention mode: Each input keeps a contention pointer for selecting a nonempty VOQ as in iSLIP [13]. In each time slot of a contention frame, the nonempty VOQ that is the closest clockwise to the pointer is selected. The pointer is then incremented clockwise to one location beyond the selected VOQ. Suppose that VOQ is selected in a time slot of that frame. The HOL packet of VOQ is copied and sent to the connected central buffer in that time slot. One bit of information is also transmitted to indicate that this packet is a contention packet. If the HOL packet of I-VOQ is a fake packet, we replace the HOL packet of I-VOQ by this contention packet and feed back 1 bit of information to indicate a successful transmission. Otherwise, we reject the contention packet and feed back 1 bit of information to indicate a failed transmission. If the transmission is successful, the HOL packet of VOQ is removed, and packets behind it are moved up one position. Otherwise, the HOL packet remains the HOL packet of VOQ . Note that there are various ways to select VOQs in the contention mode. This could result in different delay performance. We will discuss this issue in Section V-B.
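The mode decision and the pointer-based VOQ selection at a single input can be sketched as follows. The sketch makes several assumptions: VOQs are indexed from 0, the number of VOQs per input equals the frame length, and "closest clockwise to the pointer" is read as a cyclic search that starts at the pointer position itself. The function names are hypothetical.

```python
def choose_frame_mode(voq_lengths):
    """Decide the mode at the beginning of a frame from local queue lengths."""
    n = len(voq_lengths)  # assumed: one VOQ per output, frame length n
    return "reservation" if any(length >= n for length in voq_lengths) else "contention"


def select_voq(voq_lengths, mode, pointer):
    """Round-robin selection; returns (chosen index or None, updated pointer)."""
    n = len(voq_lengths)
    # Reservation mode needs a full-framed VOQ, contention mode a nonempty one.
    threshold = n if mode == "reservation" else 1
    for step in range(n):
        idx = (pointer + step) % n
        if voq_lengths[idx] >= threshold:
            # Move the pointer one location beyond the selected VOQ.
            return idx, (idx + 1) % n
    return None, pointer  # nothing eligible; pointer is left unchanged


if __name__ == "__main__":
    lengths = [3, 0, 5, 1]
    mode = choose_frame_mode(lengths)
    print(mode, select_voq(lengths, mode, pointer=0))
```

In the reservation mode the selection would be made once per frame, whereas in the contention mode it would be repeated in every time slot of the frame. Because the decision uses only the local VOQ lengths, no coordination among inputs is needed in this sketch.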
Before we leave this section, we present an example to illustrate the operation of the CR switch. In this example, we consider a 3 × 3 switch and demonstrate the operation of the switch for three time slots. Assume that time for some integer . The connection pattern and the buffer contents right before the packets are moved from input buffers to central buffers and from central buffers to outputs are shown in Fig. 4(a). The buffer contents after the packets are moved are shown in Fig. 4(b). Note that from the second paragraph of Section II, packets in the input buffers are moved to the central buffers first, and then packets in the HOL of I-VOQs are moved to the connected outputs. The numbers in the buffers are the destination ports of the packets. Encircled numbers in the central buffers correspond to the packets that are moved in the shown time slot. In this example, we focus on the operation of the CR switch and ignore newly arriving packets for simplicity. Note that for input 1, frames begin at time slots , , , etc. For input 2 (resp. input 3), frames begin at , (resp. ) etc. Since input 1 has full-framed VOQs [VOQ (1,1) and VOQ (1,3)], input 1 chooses to operate in the reservation mode and transmit packets from VOQ (1,1). In this example, we assume that at time , input 2 chooses to operate in the reservation mode. Thus, at time , input 2 sends a packet from VOQ (2,2) to I-VOQ (3,2). Assume that at time , input 3 chooses the contention mode. Thus, at time , input 3 selects VOQ (3,2) and transmits its HOL packet to central buffer 2. Since the HOL of I-VOQ (2,2) is occupied, this transmission fails, and the transmitted packet remains in VOQ (3,2) for retransmission in the future. Then, the HOL packets in the I-VOQs are transmitted to their connected outputs. Specifically, since the connection patterns are symmetrical, central buffer 1 transmits a fake packet to output 1 and moves the newly arrived packet to the HOL position of I-VOQ (1,1). I-VOQ (2,3) transmits a packet to output 3 and inserts a fake packet to its HOL position. Similarly, central buffer 3 is connected to output 2. Thus, I-VOQ (3,2) transmits the fake HOL packet to output 2 and moves the newly arrived packet to its HOL position. The resulting buffer contents are shown in Fig. 4(b).
III. IN-ORDER DELIVERY
In this section, we show how the CR switch delivers packets in order. Let flow be the sequence of packets from input to output . Let packet be the th packet of flow . The CR switch delivers packets of the same flow in order if packet departs the switch earlier than packet .
A. General Properties of I-VOQs
Now, we show some general properties of I-VOQs that are needed for proving in-order delivery. Unless otherwise specified, we consider flow and central buffer . For clarity, indices , , and are sometimes omitted. Now, suppose that input is connected to central buffer at time . Let be the offset from time that central buffer is connected to output for the th time. Then, central buffer is connected to output at time . Clearly, we have if as the connection is symmetric (and the central buffers receive packets first and send packets later). As in (1), the connection is sequential and periodic. Thus, we have that if and if . For all these three cases, we have (2) As the connection patterns in symmetric TDM switches are periodic with period , it then follows that (3) Note that the waiting time only depends on and , and it does not depend on .
As every time central buffer is connected to output , the HOL packet (fake or not) of I-VOQ is sent to output , and every packet behind the HOL packet is moved up one position. As such, can be viewed as the (virtual) waiting time for the th packet in I-VOQ at time . This leads to the following properties.
Proposition 1: Suppose that input is connected to central buffer at time . If packet is the th packet of I-VOQ at time , then (i) packet becomes the th packet at time , and (ii) packet departs I-VOQ at time .
In the operation of the CR switch, contention packets can only be stored as HOL packets of I-VOQs. On the other hand, reservation packets, transmitted in a frame of consecutive time slots, can only be stored at the tails of I-VOQs. As in the UFS scheme [10], one might expect that any reservation packets transmitted in the same frame from an input are also stored in the same position of I-VOQs. Moreover, as in (3) does not depend on , it follows from Proposition 1(ii) that any reservation packets transmitted in a frame of an input are also sent to their output consecutively in a frame of their output. This is stated in the following property. Its formal proof is given in Appendix A.
Proposition 2: Suppose that frame of input is a reservation frame that contains packets for output . (i) For , the packet transmitted in the th slot of frame is stored as the th packet in I-VOQ for some fixed . (ii) For , the packet transmitted in the th slot of frame is sent to its output in the th time slot of frame of output for some .
In view of Proposition 2(ii), a frame of an output can also be classified as a reservation frame if it contains all reservation packets, and as a contention frame, otherwise.
B. The Proof for In-Order Delivery
Now, we show that packets of the same flow are always delivered in order. Recall in the beginning of Section III that packet represents the th packet of flow . To prove in-order delivery, we will prove that packet departs earlier than packet for any integer . There are three cases that need to be considered: 1) packet is a contention packet, 2) both packet and packet are reservation packets, and 3) packet is a reservation packet and packet is a contention packet. First, if packet is a contention packet, then packet departs earlier than packet , no matter whether packet is a contention packet or a reservation packet. This is because packet is a contention packet, and it is stored as the HOL packet of an I-VOQ. From Proposition 1(ii), we know that if packet is transmitted to an I-VOQ at time , it will depart the switch at time . Also, if packet is transmitted at time , it will depart the switch at time , for some . Since and , packet departs earlier than packet . Then, if both packet and packet are reservation packets and they are in the same reservation frame, this is the case addressed in Proposition 2(ii). On the other hand, if packet belongs to a later frame, it is clear that packet departs in a later frame.
In the third case, packet must be the last packet in a reservation frame, and packet belongs to a later contention frame. Suppose that packet is transmitted to I-VOQ as the th packet at time for some . Then, it follows from Proposition 2(i) that packets , , are also transmitted to I-VOQ as the th packet at time . As reservation packets are attached to the tails of I-VOQs (and only reservation packets can be stored behind the HOL packet), we know that the th packet, , of I-VOQ are all reservation packets at time for . From Proposition 1(ii), the th packet of I-VOQ departs the switch at time . As and , time slots of output from to are reserved before packet is transmitted to central buffers. Therefore, packet cannot depart the switch between and . As packet is transmitted after time , from Proposition 1(ii), packet departs the switch on or after . As time slots from to are reserved by reservation packets, we conclude that packet departs the switch after which is, from Proposition 1(ii), the departure time of packet . Thus, packet must depart later than packet . From these three cases, we have the following theorem.
Theorem 3 (In-Order Delivery):
The CR switch delivers packets of the same flow in order.
IV. 100% THROUGHPUT
In this section, we show that the CR switch indeed achieves 100% throughput. This is done by showing two stronger results: 1) the total number of packets in every input buffer is bounded above by in Corollary 8, and 2) the total number of packets in the central buffers (I-VOQs) destined for a particular output is bounded above by the sum of the total number of packets in the corresponding output buffer of the output-buffered switch and in Corollary 12. To study the number of packets in the input buffers and the I-VOQs, we need to introduce the concepts of work-conserving modes for queues that have at most one packet departure in a time slot.
Definition 4 (WC Mode):
A queue is in the work-conserving mode if there is one departure in each time slot whenever the queue is nonempty.
Clearly, each output buffer of an output-buffered switch is in the work-conserving mode in every time slot. However, neither the input buffers nor the I-VOQs of the CR switch are in the work-conserving mode in every time slot. They satisfy a weaker notion of work conservation, defined below.
Definition 5 ( Queue):
A queue is work-conserving with response workload and response delay [denoted by ] if it satisfies the following: When the queue length is smaller than at time and becomes longer than or equal to at time , this queue begins to be in the mode no later than time . Moreover, this mode must continue until the queue length becomes smaller than again.
In the following lemma, we derive a bound between the queue length of a queue and that of a queue.
Lemma 6: Let (resp. ) be the number of packets in a (resp. ) queue at time . Suppose that both queues are subject to the same arrival process and they both are empty at time 0. Then (4)
Proof: Let a busy period of a queue be the period of time in which there are more than or equal to packets in the queue. All we need to prove is that (4) holds for every time slot in a busy period of the queue. Let the busy period of the queue start from time . Also, let be the cumulative number of packets arriving at the queue by time . We first show that if , then (4) holds. By definition, we have (5) As there is at most one departure in each time slot, we have (6) As a queue might have no departure, we have (7) From (5)-(7), we have (8) Thus, (4) holds at where . On the other hand, if , then the queue is in the work-conserving mode between and . Then, we have (9) From (8) with substituted by , (6) with substituted by , and (9), we have . In this case, (4) holds, too. This completes the proof.
We have the following work-conserving property for input buffers.
Proposition 7: Each input buffer is work-conserving with response workload and response delay .
Proof: Note that if there are more than packets in an input buffer, then there is a full-framed VOQ in that input buffer. As such, the input buffer will be in the reservation mode at the beginning slot of the next frame, and it will continue to be in the reservation mode until there is no full-framed VOQ. Note that there is exactly one packet sent out from that input buffer in every time slot when the input buffer is in the reservation mode. Thus, the response workload is . As the time it takes to reach the beginning time slot of the next frame is bounded above by , the response delay is . This completes the proof.
Note that there is at most one packet arrival at an input buffer in a time slot. If we put the same arrival process to a work-conserving queue, the number of packets in that work-conserving queue is at most 1. Thus, along with Lemma 6 and Proposition 7, we have the following corollary.
Corollary 8 (Packets in Input Buffers): The number of packets in an input buffer is bounded above by .
From Corollary 8, large memory space is only needed in the central buffers. To show the work-conserving property for the I-VOQs, we need to introduce the following definition.
Definition 9: We define as a conceptual queue that contains the union of non-HOL packets in I-VOQ for . As a fake packet or a contention packet can be stored only as a HOL packet in an I-VOQ, a non-HOL packet must be a reservation packet. Thus, contains all non-HOL reservation packets with destination stored in the I-VOQs.
Proposition 10: For each , is work-conserving with response workload 1 and response delay .
Proof: Suppose is empty at time and becomes nonempty at time . Since the first packet of a reservation frame of any input is always transmitted to the first central buffer, there is exactly one packet, called packet , transmitted at time to I-VOQ and stored as the second packet of I-VOQ . Without loss of generality, assume that packet is transmitted from input . From Proposition 1(i), we know that at time packet becomes the HOL packet of I-VOQ and thus leaves . Since , the response time of is at most time slots, and the response workload is 1. It remains to show that there is exactly one departure in each time slot from after until becomes empty again. From Proposition 2(i) and Proposition 1(i), there are packets departing from time to time . Also, we know that is the beginning time slot of a frame of output . At the beginning time slot of the next frame of output (i.e., ), if is empty, then we complete our argument. On the other hand, if is still nonempty at , then there is a reservation packet stored as the second packet of I-VOQ [since the first packet of a reservation frame of any input is always transmitted to I-VOQ ]. Using Proposition 2(i) and Proposition 1(i) again, there are packets departing from time to time . Repeating the same argument, we conclude that there is a departure from until is empty.
Using Proposition 10 and Lemma 6, we derive in the following lemma a bound for the difference between the queue length of and that of the corresponding output-buffered switch. The proof is given in Appendix B.
Lemma 11: Suppose that the CR switch and the output-buffered switch are subject to the same arrival process. Let be the number of packets in at time and be the number of packets in the th output buffer of the corresponding output-buffered switch at time . Then (10)
Observe that there are at most HOL packets destined for output in the central buffers at any time . This leads to the following corollary.
Corollary 12 (Packets in Central Buffers):
Suppose that the CR switch and the output-buffered switch are subject to the same arrival process. Let be the number of packets destined for output in the central buffers at time and be the number of packets in the th output buffer of the corresponding output-buffered switch at time . Then (11)
V. SIMULATIONS
In this section, we study the delay of the CR switch. In the experiments, we set the switch size to be 32. The number of time slots for each experiment is . Let be the average arrival rate to an output of the switch. We assume that arrival processes to the input ports are independent, and consider the following four traffic models: 1) uniform i.i.d. traffic, 2) uniform Pareto traffic, 3) hotspot i.i.d. traffic, and 4) hotspot Pareto traffic. For the i.i.d. traffic models in 1) and 3), a packet is generated independently in a time slot in an input with probability . On the other hand, for the Pareto traffic models (see [3]) in 2) and 4), packets are generated in bursts. With probability , there are packets in a burst (and with probability , there are no packets in a burst). Packets in the same burst are sent to the same destination. The length of each burst is generated independently according to the following (truncated) Pareto distribution: (12) where , and is the normalization constant.
For the uniform traffic models in 1) and 2), the destination of a packet (or packets in the same burst) is selected according to the uniform distribution in , i.e., each output port is selected as the destination of a packet (or packets in the same burst) with the same probability . On the other hand, for the hotspot traffic models (see [6] ) in 3) and 4), packets from input are destined to output with probability 0.5 and to each of the other outputs with probability .
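As a rough illustration of the two destination rules above, the following Python sketch draws the destination of a packet (or of a burst). The function names are hypothetical, ports are indexed from 0 for convenience, and the sketch assumes that the hotspot output of input i is output i. The probability of each of the other outputs, 0.5/(N-1), follows from normalization.

```python
import random

N = 32  # switch size used in the experiments


def uniform_destination(rng):
    """Uniform model: every output is chosen with the same probability 1/N."""
    return rng.randrange(N)


def hotspot_destination(i, rng):
    """Hotspot model (assumption: the hotspot output of input i is output i).

    Output i is chosen with probability 0.5; each of the other N-1 outputs
    is chosen with probability 0.5/(N-1).
    """
    if rng.random() < 0.5:
        return i
    j = rng.randrange(N - 1)        # uniform over the N-1 other outputs
    return j if j < i else j + 1    # skip over index i


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [hotspot_destination(5, rng) for _ in range(10000)]
    print("fraction sent to the hotspot output:", draws.count(5) / len(draws))
```

For the Pareto models, one destination would be drawn per burst rather than per packet, since packets in the same burst share a destination.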
A. Average Delay
In the first experiment, we study the average delay of the CR switch under the uniform i.i.d. traffic. In Fig. 5, we plot the average delay of three two-stage switches: the CR switch, the contention scheme, and the UFS scheme in [10]. Among them, the contention scheme is the CR switch without the reservation mode. On the other hand, the UFS scheme is the CR switch without the contention mode and with I-VOQs replaced by VOQs. In Fig. 5, we observe that the advantage of the contention scheme is its very low delay under light traffic, while the advantage of the UFS scheme is that it remains stable under heavy traffic. The CR switch has both advantages.
For the contention scheme, the maximum throughput seems to be around , and the average delay seems to be around before reaching the maximum throughput. The intuition behind this is that there are few collisions under light traffic. A packet, upon its arrival, is transmitted immediately to the central buffer as a HOL packet. Thus, the delay of a packet is almost the same as the time that a HOL packet needs to wait for the connection to its destined output. Therefore, the average delay is around under light traffic. The quantity, , is known as the maximum throughput of an input-buffered switch with collision dropping [9]. As argued in [5] for the mailbox switch with , one can argue that the contention scheme has the same maximum throughput as that of an input-buffered switch with collision dropping.
For the CR switch, the average delay is low under light traffic, as in the contention scheme. Under medium traffic, its behavior transitions toward that of the UFS scheme. As in the UFS scheme, the CR switch still has finite average delay under heavy traffic. In Fig. 5, we observe that there are three regions in the delay curve of the CR switch. In the first region, , the delay curve coincides with that of the contention scheme. This is because there is almost no full-framed VOQ when the load is under 0.63. In the transition region, , the delay curve is below those of the other two schemes since, in the CR switch, packets can still be transmitted to I-VOQs before some full-framed VOQs are formed. In the heavy load region, , the delay curve is close to that of the UFS scheme. This is because it is very likely to have some full-framed VOQs in input buffers under very heavy traffic.
For the UFS scheme, even though the average delay is finite under heavy traffic, the average delay is large under light traffic. Moreover, as shown in Fig. 5, there are two regions in the delay curve for the UFS scheme. In the first region, , the delay curve is monotonically decreasing, while in the second region, , the delay curve becomes monotonically increasing. This is because the delay of a packet consists of two parts: 1) the delay incurred in an input buffer, and 2) the delay incurred in a central buffer. In light traffic, the major portion of the delay of a packet is the delay in an input buffer, as the packet needs to wait until a full-framed VOQ is formed in the input buffer. Clearly, the lighter the traffic is, the longer it takes to accumulate a full-framed VOQ. As such, the delay curve is decreasing in the first region. On the other hand, in heavy traffic, the delay of a packet is dominated by the queueing delay in a central buffer. As the queueing delay is increasing in the average arrival rate, the delay curve is increasing in the second region.
B. Advancing the Contention Pointers
For the CR switch, the delay in the transition region can be affected by how the VOQs are selected when their inputs are in the contention mode. We use a pointer, called the contention pointer, to designate the selected VOQ from which a packet will be transmitted in the contention mode. In the transition region, the arrival rate exceeds the maximum throughput of the contention scheme, and some full-framed VOQs start to form. As described in Proposition 2(ii), a full-framed VOQ, when selected, reserves a frame of consecutive output time slots and, hence, consecutive HOL packets of the I-VOQs during that frame of output time slots. As such, when a HOL packet transmitted in the contention mode to an I-VOQ is rejected, it is very likely that it will be rejected again if it is retransmitted immediately in the next time slot. Thus, when the previous transmission fails, it might be better to select another input VOQ by advancing the contention pointer. On the other hand, if the previous transmission is successful, it might be better to keep selecting the same input VOQ until it is empty. Before we present our study on the mechanisms to update the contention pointer, we present the following acronyms for easy referencing.
In the generic algorithm presented in Section II-C, the contention pointer is advanced using the SAFA scheme as the contention pointer is always advanced. To verify the intuition described in the last paragraph, we simulate these four methods for both the uniform i.i.d. traffic and the hotspot Pareto traffic in Figs. 6 and 7. As shown in these figures, the SPFA scheme has the least average delay for the entire region of the arrival rates. As such, we suggest the SPFA scheme be used in the CR switch for advancing the contention pointers.
In the SPFA scheme described in the last paragraph, we simply advance the pointer to the next nonempty VOQ when the previous transmission fails. The question is whether there is a better choice. Intuitively, the longer the VOQ is, the more consecutive packets can be transmitted successfully, which reduces the average delay. In this experiment, three methods of selecting VOQs are investigated: 1) the next nonempty VOQ, 2) the next VOQ whose queue length is Longer than or equal to the Median Queue length (LMQ) of the nonempty VOQs in the input, and 3) the longest VOQ among the VOQs in the input. The first method is simply the SPFA scheme. We denote the second and the third methods by SPFA-LMQ and SPFA-Longest, respectively. In Figs. 6 and 7, we plot the average delay for these three methods of selecting VOQs under the uniform i.i.d. traffic and the hotspot Pareto traffic, respectively. As expected (from the intuition of selecting a longer queue to reduce the average delay), the curve of the next-nonempty-VOQ method is higher than that of the LMQ method. However, to our surprise, the curve of the longest-queue method is also higher than that of the LMQ method in most traffic conditions. This might be explained as follows: If the longest queue is selected and it results in a failed transmission, then, with high probability, the longest queue will be selected again. As such, it behaves like the SPFP scheme, which yields large delay. The right intuition, therefore, is to select a VOQ long enough to allow consecutive successful transmissions, but not so long that the freedom to advance to other VOQs after a failed transmission is lost. The LMQ method seems to fit this intuition well, as there are often several VOQs with queue length no shorter than the median queue length, so the contention pointer can readily advance to other VOQs. To summarize, we suggest the contention pointer be advanced using the SPFA-LMQ scheme.
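The three selection methods compared above can be sketched as a single pointer-update function. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, the pointer persists on a successful transmission and moves on a failed one, the scan over VOQs is cyclic, and the lower median (Python's `statistics.median_low`) stands in for the median queue length.

```python
from statistics import median_low


def next_pointer(lengths, ptr, success, rule="lmq"):
    """Return the VOQ index the contention pointer should designate next.

    lengths : queue length of every VOQ at this input
    ptr     : index of the VOQ used in the previous contention attempt
    success : True if that attempt was accepted by the central buffer
    rule    : "next", "lmq", or "longest" (how to pick a VOQ after a failure)
    """
    n = len(lengths)
    nonempty = [q for q in lengths if q > 0]
    if not nonempty:
        return ptr                      # nothing to send; leave the pointer alone
    if success and lengths[ptr] > 0:
        return ptr                      # persist on success while the VOQ is nonempty
    if rule == "longest":
        return max(range(n), key=lambda k: lengths[k])
    # "next": any nonempty VOQ qualifies; "lmq": length must reach the median
    threshold = median_low(nonempty) if rule == "lmq" else 1
    for step in range(1, n + 1):        # cyclic scan starting just after ptr
        k = (ptr + step) % n
        if lengths[k] >= threshold:
            return k


if __name__ == "__main__":
    lengths = [0, 3, 1, 5, 2]
    for rule in ("next", "lmq", "longest"):
        print(rule, next_pointer(lengths, ptr=1, success=False, rule=rule))
```

Note that the "longest" rule ignores the current pointer position, which is why a failed attempt on the longest VOQ tends to select the same VOQ again.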
Before we close this subsection, we discuss the computational complexity of the LMQ method. The online computation overhead of the LMQ method involves searching for the median among the queue lengths of nonempty VOQs in an input buffer. We note that this can be done in the order of time complexity by maintaining heap structures. To do this, we maintain two heaps, and . Let be the number of nonempty VOQs in the input buffer. Heap keeps the queue length information of the lower VOQs, while heap keeps the queue length information of the remaining VOQs. Heap (resp. ) is maintained as a max-heap (resp. min-heap) in which each parent is not smaller (resp. not larger) than any of its children. Then, the root of can be considered as the median. The change of the value in one node requires steps to percolate or sift [1]. As there is at most one arrival and one departure in each time slot, the complexity of such an approach is then .
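A minimal sketch of the two-heap idea is given below. It is not the in-place percolate/sift update described above: Python's `heapq` offers no operation for changing the value of an interior node, so this sketch substitutes lazy deletion, where a changed queue length is handled as a removal of the old value followed by an insertion of the new one. The class name `MedianTracker` is hypothetical, and the sketch assumes that the max-heap holds the lower half, rounded up, so that its root is the lower median.

```python
import heapq
from collections import defaultdict


class MedianTracker:
    """Lower median of a multiset of queue lengths, kept with two heaps."""

    def __init__(self):
        self.low = []                  # max-heap (values negated): smaller half
        self.high = []                 # min-heap: larger half
        self.dead = defaultdict(int)   # value -> removals not yet applied
        self.n_low = 0                 # live entries in low
        self.n_high = 0                # live entries in high

    def _prune_low(self):
        while self.low and self.dead[-self.low[0]] > 0:
            self.dead[-self.low[0]] -= 1
            heapq.heappop(self.low)

    def _prune_high(self):
        while self.high and self.dead[self.high[0]] > 0:
            self.dead[self.high[0]] -= 1
            heapq.heappop(self.high)

    def _rebalance(self):
        # Keep n_low == n_high or n_low == n_high + 1.
        if self.n_low > self.n_high + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
            self.n_low -= 1
            self.n_high += 1
            self._prune_low()
        elif self.n_low < self.n_high:
            heapq.heappush(self.low, -heapq.heappop(self.high))
            self.n_low += 1
            self.n_high -= 1
            self._prune_high()

    def add(self, x):
        if not self.low or x <= -self.low[0]:
            heapq.heappush(self.low, -x)
            self.n_low += 1
        else:
            heapq.heappush(self.high, x)
            self.n_high += 1
        self._rebalance()

    def remove(self, x):
        """Remove one copy of x; x must currently be in the multiset."""
        self.dead[x] += 1
        if x <= -self.low[0]:
            self.n_low -= 1
            if x == -self.low[0]:
                self._prune_low()
        else:
            self.n_high -= 1
            if x == self.high[0]:
                self._prune_high()
        self._rebalance()

    def median(self):
        return -self.low[0]


if __name__ == "__main__":
    t = MedianTracker()
    for q in (5, 1, 3):        # three nonempty VOQs
        t.add(q)
    t.remove(3)                # an arrival turns the length-3 VOQ ...
    t.add(4)                   # ... into a length-4 VOQ
    print(t.median())
```

Each update costs a constant number of heap operations in the amortized sense, each logarithmic in the heap size.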
C. Comparison With the Padded Frame Scheme
In this section, we compare the average delay of the PF scheme in [8] with that of the CR switch with SPFA-LMQ. The PF scheme is an improved version of the UFS scheme. Like the UFS scheme, it operates in frames. If there is a full-framed VOQ, the longest VOQ is selected, and packets from that VOQ are sent to the central buffers. Otherwise, the longest VOQ is selected, and the partial frame of that VOQ is padded with fictitious packets to form a padded frame with packets. The padded frame is sent only if the total number of padded frames in the central buffers does not exceed a threshold . By so doing, the average packet delay can be reduced in light traffic. Clearly, when is 0, it reduces to the UFS scheme. The reason that we choose the PF scheme for comparison is that both the PF scheme and the CR switch are based on the load-balanced architecture. They both achieve 100% throughput without speedup and deliver packets in sequence without resequencing buffers.
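The frame-selection rule of the PF scheme summarized above can be sketched as follows. The function name, its arguments, and the returned tuple are hypothetical. The sketch also assumes one particular reading of the threshold test: a padded frame is sent only if the count of padded frames, including the new one, stays within the threshold, so that a threshold of 0 never pads and the rule reduces to the UFS scheme.

```python
def pf_select(lengths, frame_size, padded_in_buffers, threshold):
    """Pick the frame that an input sends at the start of a frame time.

    lengths           : queue length of every VOQ at this input
    frame_size        : packets per frame (the number of ports)
    padded_in_buffers : padded frames currently in the central buffers
    threshold         : limit on padded frames (0 behaves like UFS)

    Returns (voq_index, real_packets, fictitious_packets), or None if
    nothing is sent.
    """
    if not any(lengths):
        return None                                   # all VOQs are empty
    k = max(range(len(lengths)), key=lambda j: lengths[j])  # longest VOQ
    if lengths[k] >= frame_size:
        return (k, frame_size, 0)                     # full frame, no padding
    if padded_in_buffers + 1 <= threshold:            # assumed threshold test
        return (k, lengths[k], frame_size - lengths[k])
    return None                                       # wait for a full frame


if __name__ == "__main__":
    print(pf_select([1, 40, 3], 32, 0, 0))   # a full-framed VOQ exists
    print(pf_select([1, 4, 3], 32, 0, 0))    # threshold 0: no padded frame
    print(pf_select([1, 4, 3], 32, 0, 1))    # padding allowed
```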
In our experiments, we choose as it is the suggested threshold in [8] . As shown in Figs. 8 -11 , the average delay of the CR switch (CR_avg) is much lower than that of the PF scheme (PF_avg) under light traffic. Moreover, these two curves are very close to each other under heavy traffic.
D. Comparison With the iSLIP Algorithm and the Ideal Output-Buffered Switch
In this section, we first compare the delay performance of the CR switch with that of a well-known practical input-buffered switch algorithm, iSLIP [13], in Figs. 8-11. From those figures, we observe the following.
1) Under the uniform traffic in Figs. 8 and 9, the delays of both the CR switch and the iSLIP algorithm are finite.
2) Under the hotspot traffic in Figs. 10 and 11, the iSLIP algorithm cannot achieve 100% throughput when the arrival rate is greater than 0.8. Nevertheless, the delay of the CR switch remains finite.
3) Under the uniform i.i.d. traffic in Fig. 8, the delay of the iSLIP algorithm is much lower than that of the CR switch.
4) Under the Pareto traffic, the delay difference between the iSLIP algorithm and the CR switch is much smaller than under the i.i.d. traffic.
Fig. 8. The average delay of the PF scheme, the CR switch, the iSLIP algorithm, and the ideal output-buffered switch under the uniform i.i.d. traffic.
Fig. 9. The average delay of the PF scheme, the CR switch, the iSLIP algorithm, and the ideal output-buffered switch under the uniform Pareto traffic.
Fig. 10. The average delay of the PF scheme, the CR switch, the iSLIP algorithm, and the ideal output-buffered switch under the hotspot i.i.d. traffic.
Fig. 11. The average delay of the PF scheme, the CR switch, the iSLIP algorithm, and the ideal output-buffered switch under the hotspot Pareto traffic.
The last observation is due to the burst reduction property (as previously reported in [3] ) that the CR switch inherits from a generic load-balanced switch. As pointed out in [12] , the Internet traffic could be very bursty. Thus, we expect that the average delay of the CR switch might be comparable to that of the iSLIP algorithm when the Internet traffic is lightly loaded. However, the delay performance of the CR switch is much better when the Internet traffic is heavily loaded. From Figs. 8-11 , we see that the average delay of the CR switch converges to that of an ideal output-buffered switch under heavy traffic condition. This observation is consistent with the theoretical result in [3] that the average delay of a generic load-balanced switch converges to that of the ideal output-buffered switch for a certain uniform bursty traffic model in heavy load. As discussed in [3] , the first stage of a load-balanced switch effectively reduces burst lengths and can thus approach the performance of an ideal output-buffered switch under heavy traffic. From Figs. 9 and 11, we see that the average delay of the CR switch is very close to that of the ideal output-buffered switch under the heavily loaded Pareto traffic. As the average queue length can be derived from the average delay by using Little's formula, we also expect that the average memory requirement for the CR switch should be comparable to that for the ideal output-buffered switch when the traffic is heavy and bursty [even though the worst-case memory bound in Corollary 12 is ]. Finally, we note that there exist switches in the literature that guarantee delay bounds (see, e.g., [7] , [15] ). However, these delay bounds are at the cost of speedup of 2.
E. Fairness Issues
In this section, we discuss some fairness problems associated with the CR switch. It is well known that switches using deterministic and periodic TDM connection patterns can have a fixed priority order among inputs for the same output port. One example is the mailbox switch [5]. The CR switch inherits a fairness problem from its predecessor, the mailbox switch. We now briefly describe this problem. Consider packets destined for output 32. Suppose input 1 is connected to central buffer at time . Then, output was connected to central buffer at time and retrieved the HOL packet of I-VOQ . Therefore, the HOL packet in VOQ 32 of input 1 can contend as the HOL packet of I-VOQ successfully at time if there were no reservation packets in I-VOQ at time . In general, suppose output retrieves the HOL packet of I-VOQ at time . Then, input can contend as the HOL packet of I-VOQ at time . Therefore, the contention priority among input VOQs of destination should decrease with the input indices in a cyclic (modular) fashion after ; i.e., input has higher priority than input for . To demonstrate the priority, we show the average delay of packets destined for output 32 from input 32 and that from input 1 [CR_(32,32) and CR_(1,32)] in Figs. 8 and 9. In these figures, we observe that the contention priority appears in the region between because the curve of CR_(32,32) is higher than that of CR_(1,32) in this region. As such, the fixed contention priority among flows might be a concern.
We can solve this fairness problem by remapping port indices. This technique was proposed to solve the fairness problem for the mailbox switch [5]. There are one-to-one and onto mappings from the set to itself. We uniformly select a mapping from those mappings. Then, we use that mapping for time frames in our simulation experiments. To uniformly select a mapping, we toss fair dice with values for the th die. Then, we select one value from the remaining unused values as the th value of the permutation mapping. If we utilize enough mappings, by the law of large numbers, we can eventually equalize the priority orders of all output ports. In order to keep packets in sequence, we need to pause for two frames during the transition from one mapping to another. During the pause, the output ports clear the possible HOL contention packets in the central buffers. This pause of sending and receiving packets would result in approximately throughput loss. In Figs. 8 and 9, we simulate mappings at each data point. In these figures, the maximum average delay among all flows in the CR switch (CR_remap_max) and the minimum average delay among all flows in the CR switch (CR_remap_min) are very close to each other. Therefore, this fairness problem due to contention priorities can be successfully solved by remapping port indices. One can also observe that CR_remap_min is even higher than CR_avg at some data points. This is because of the throughput reduction due to the pause. This throughput reduction, however, can be made arbitrarily small by setting large enough.
There is another fairness problem associated with the CR switch. In a CR switch, a flow of packets delivered mostly in the reservation mode may experience less expected delay than a flow of packets delivered mostly in the contention mode, even if the flow delivered in the contention mode has less arrival intensity. This phenomenon is more likely to happen if the arrival traffic is extremely unbalanced.
Suppose that an input port sends most of its traffic to a particular output port. We call such a pair of input and output ports a hotspot flow. Packets generated by a hotspot flow with medium to heavy traffic loads are most likely delivered in the reservation mode. As a result, contention packets from other inputs to the same destination are more likely blocked because reservation packets from the hotspot flow are likely to occupy the HOL positions. The contention packets from other input ports can only use the remaining bandwidth left by reservation packets from the hotspot flow. Therefore, a fairness issue can arise under extremely unbalanced traffic.
To study the fairness issue for extremely unbalanced traffic, we simulate the CR switch equipped with port remapping and loaded with the hotspot traffic. We simulate for the total average delay of hotspot flows (CR_remap_hot) and the total average delay of all other flows (CR_remap_cold) in Figs. 10 and 11. As shown in those figures, there is a fairly wide gap between the curve CR_remap_cold and the curve CR_remap_hot. In comparison, we also simulate the PF scheme loaded with the hotspot traffic. The results are shown as curves PF_hot and PF_cold in Figs. 10 and 11 . We can observe that the gap between CR_remap_cold and CR_remap_hot is much smaller than the gap between PF_cold and PF_hot. Thus, the CR switch has a less serious fairness problem than the PF scheme under such traffic. However, the port remapping method cannot effectively equalize the average delay of packets delivered in the reservation mode and that of the packets delivered in the contention mode due to blocking of service in the reservation mode. Similar fairness problems among flows could also exist in the input-buffered switch under the MWM-LQF algorithm [14] . This is because an input VOQ is more likely to build up when its arrival traffic is bursty and heavy. As the MWM-LQF algorithm assigns the weight of an input VOQ proportional to its queue length, the input VOQ with bursty and heavy traffic will be matched most of the time and result in blocking of service for other input VOQs.
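The die-tossing procedure used earlier in this subsection to draw a port mapping uniformly at random can be sketched as below. The function name is hypothetical, and the sketch assumes that the k-th toss uses a fair die with as many faces as there are unused port indices, and that the face value picks one of the unused indices. Under that assumption every permutation is equally likely.

```python
import random


def random_port_mapping(n, rng):
    """Draw a uniformly random one-to-one mapping of ports 1..n onto 1..n."""
    unused = list(range(1, n + 1))
    mapping = []
    for k in range(n):
        roll = rng.randrange(n - k)        # fair die with n - k faces
        mapping.append(unused.pop(roll))   # take that unused port index
    return mapping                         # mapping[i - 1] is the image of port i


if __name__ == "__main__":
    rng = random.Random(2024)
    print(random_port_mapping(8, rng))
```

In practice the same effect can be obtained with `random.shuffle`; the explicit loop is kept here only to mirror the description in the text.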
VI. CONCLUSION
In this paper, we proposed a new switch architecture called the CR switch. Packets are transferred within the CR switch in two modes: 1) the contention mode in light traffic and 2) the reservation mode in heavy traffic. One of the main contributions of this paper is that we proved that packets are delivered in sequence by the CR switch. In addition, the CR switch achieves 100% throughput.
For the performance of the CR switch, we simulated the average packet delay and compared it with that of existing switches in the literature. We showed that the average packet delay of the CR switch is considerably lower than that of the existing load-balanced switches in the literature, including the uniform frame spreading scheme [10], the padded frame scheme [8], and the mailbox switch [5]. We have also compared the average delay performance of the CR switch with that of the input-buffered switches with matching algorithms, specifically the iSLIP matching algorithm [13]. From simulation, we found that the CR switch performs distinctly better under heavy traffic conditions. However, the iSLIP switch has better delay performance under light to medium traffic conditions.
Finally, we note that the CR switch has several technical issues, listed below. Some of them may require further study.
1) Fairness: The CR switch has two fairness problems. The first fairness problem arises because of the deterministic and periodic TDM connection pattern that the CR switch uses. We proposed a port remapping method to solve this fairness problem. In the second fairness problem, we observe that packets transmitted in the contention mode are likely to have longer delays than packets transmitted in the reservation mode. This fairness problem requires further study.
2) Large propagation delay: In the CR switch, we need one bit of feedback information from the connected central buffer to indicate whether a transmission is successful or not. There might be a problem if the propagation delay from the connected central buffer to an input is large.
3) Heterogeneous line speeds: We assume that the input line speeds are identical. This is a very common assumption in the literature for input-buffered switches and load-balanced switches that require synchronous transmissions. To deal with the case with heterogeneous input line speeds, one common practice is to implement line-grouping, i.e., multiplexing the low speed lines into high speed lines before they go into the CR switch and then demultiplexing the traffic after leaving the CR switch. The drawback of line-grouping is that some bandwidth could be wasted, as there could be residual bandwidth left unpacked.
4) Priority services: In order to provide quality of service in the CR switch, one might need to consider the problem of providing priority services in the CR switch. A simple and straightforward method is to provide priority services directly in the input VOQs. However, we might not be able to retain the 100% throughput property by doing that. The problem arises when there does not exist a full-framed VOQ of high-priority packets while there are still full-framed VOQs of low-priority packets. If we choose to serve the high-priority packets in the contention mode, then we will waste some bandwidth and cannot maintain 100% throughput for low-priority packets. One tentative solution for this is to set a threshold like the PF scheme in [8] and serve the high-priority packets in the contention mode only when the total queue length of full-framed VOQs is below the threshold. However, how to set the threshold to achieve the right tradeoff between high-priority packets and low-priority packets requires further study.
APPENDIX A PROOF OF PROPOSITION 2
We prove (i) and (ii) simultaneously by induction on time. Suppose that Proposition 2(i) and (ii) are true up to time as the induction hypothesis. Without loss of generality, assume that is the th slot of frame of input . Moreover, frame is a reservation frame that contains packets for output . As such, a packet-say, packet -is transmitted to I-VOQ at time and stored as the th packet for some . As a reservation packet is always attached to the tail of an I-VOQ, to prove Proposition 2(i), it suffices to argue that before transmitting another packet from input at , the queue length of I-VOQ is exactly . We first show that the queue length is at least . If , nothing needs to be proved as an I-VOQ contains at least one packet. For , it suffices to show that the th packet of I-VOQ exists at . Since packet is stored as the th packet in I-VOQ at time , the th packet of I-VOQ -say, packet -exists at time . From Proposition 1(ii), packet will depart the switch at time . As , packet must be a reservation packet and it is transmitted to I-VOQ at some time from some input . As reservation packets are transmitted consecutively in each reservation frame, there is another reservation packet-say, packet -transmitted from input to I-VOQ at time . Thus, from (ii) in the induction hypothesis, packet will depart the switch at time . From Proposition 1(ii), the th packet in I-VOQ at time will depart the switch at time , which is the same as packet . Thus, packet exists as the th packet of I-VOQ at time , and there are at least packets. Now, we show that the queue length is at most . Suppose that the th packet-say, packet -exists in I-VOQ at . Then, packet is a reservation packet transmitted to I-VOQ before . Also, from Proposition 1(i), packet will depart the switch at time . As reservation packets are transmitted consecutively in a reservation frame, there is another packet-say, packet -transmitted from the input of packet to I-VOQ before . 
From (ii) in the induction hypothesis, packet will depart the switch at time . As argued in the previous paragraph, packet is the th packet in I-VOQ at time . This contradicts the assumption that packet is stored as the th packet of I-VOQ at time . The induction step for Proposition 2(ii) at time follows directly from the result for Proposition 2(i) at time and Proposition 1(ii).
APPENDIX B PROOF OF LEMMA 11

Let [resp. , ] be the cumulative number of packets arriving at the CR switch (resp. the central buffers, ) for output by time . Let be the number of packets in at time . Also, let be the number of packets in output buffer of the output-buffered switch at time when the arrival process is , for , 2, 3. Note that an output-buffered switch is work-conserving. Thus, we have the following Lindley's equation: (13) for , 2, 3. From [2, Sec. 1.3], these Lindley's equations can be expanded recursively to the following forms: (14) for , 2, 3. As we have shown in Proposition 10 that is work-conserving with response workload 1 and response delay , it follows from Lemma 6 that (15) Since the packets arriving at are all reservation packets, they are only a subset of the packets arriving at the central buffers. This implies that , for all . Along with (14), we have (16) As the packets arriving at the central buffers are the packets departing from the input buffers, we have (17) Also, let be the number of packets destined to output stored in the input buffers of the CR switch at time . Then, we have Using (17) and (19) in (14), we have This completes the proof.
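The step from Lindley's equation to its expanded form, used in the proof above, can be checked numerically in its generic textbook form. The sketch below is an illustration with assumed notation rather than the paper's exact equations: a(t) is the number of arrivals in slot t, A(t) is the cumulative number of arrivals by time t, the buffer is empty at time 0, and one packet is served per slot whenever the buffer is nonempty. Under these assumptions the recursion q(t) = max(q(t-1) + a(t) - 1, 0) should agree with q(t) = max over 0 <= s <= t of [A(t) - A(s) - (t - s)].

```python
def lindley(arrivals):
    """Queue length after each slot via the recursion q = max(q + a - 1, 0)."""
    q, out = 0, []
    for a in arrivals:
        q = max(q + a - 1, 0)
        out.append(q)
    return out


def expanded(arrivals):
    """Same queue length via the recursively expanded (max-plus) form."""
    cum = [0]                                  # cum[t] = A(t), with A(0) = 0
    for a in arrivals:
        cum.append(cum[-1] + a)
    return [
        max(cum[t] - cum[s] - (t - s) for s in range(t + 1))
        for t in range(1, len(cum))
    ]


if __name__ == "__main__":
    arrivals = [2, 0, 3, 0, 0, 1, 4, 0]
    print(lindley(arrivals))
    print(expanded(arrivals))                  # expected to match the line above
```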
