Abstract-In this paper, we address the problem of fair scheduling of packets in Internet routers with input-queued switches. The goal is to ensure that packets leave the router in proportion to their reservations under heavy traffic. First, we examine the problem when fair queuing is applied only at the output link of a router, and verify that this approach is ineffective. Second, we propose a flow-based iterative deficit-round-robin (iDRR) fair scheduling algorithm for the crossbar switch that supports fair bandwidth distribution among flows and achieves asymptotically 100% throughput under uniform traffic. Since the flow-based algorithm is hard to implement in hardware, we finally propose a port-based version of iDRR (called iPDRR) and describe its hardware implementation.
I. INTRODUCTION

MOST commercial routers employ input-queued switches (Cisco [1], BBN [2]) because output-queued switches are difficult to design. Fairness in resource allocation is important to support quality of service (QoS). In this paper, we address the problem of fair scheduling of packets in routers with input-queued switches. The goal is to make sure that packets of different flows leave a router in proportion to their reservations when the bandwidth of the router cannot accommodate all incoming traffic.
Generally, a router consists of line cards which contain input and output ports, a switch fabric and forwarding engines. When a packet arrives at an input port, its header is sent to the forwarding engine, which returns the output port and the new header. The packet is buffered in the input queues waiting for the switch to transfer it across the crossbar switch fabric. When the packet is received at the output port, it is pushed into the output queues. The packet leaves the router when the output link is ready.
Many fair queuing algorithms [3]-[7] have been developed to schedule packets on shared resources. Some of these have been implemented at the output links of commercial routers [1], [2] to support QoS. However, with today's technology, the link speed is increasing almost exponentially. The backbone crossbar with its scheduler becomes the real bottleneck, and most of the packets are waiting at the input buffer instead of the output buffer. Therefore, applying fair scheduling at the crossbar switch should be more effective than that at the output queue. Our simulation results in this paper confirm this hypothesis.
Manuscript received xx xx, xxxx; revised xx xx, xxxx. The research has been supported by NSF grant CCR 0105676. The authors are with Computer Science and Engineering, University of California, Riverside, CA 92521 USA (email: xzhang@cs.ucr.edu; bhuyan@cs.ucr.edu).
We propose a flow-based iterative deficit-round-robin fair scheduling algorithm (iDRR) for the crossbar switch. Our experiments show that it provides high throughput, low latency and fair bandwidth allocation among competing flows. The algorithm is based on the deficit round robin (DRR) algorithm [6]. Another flow-based crossbar scheduling algorithm, called iFS, was proposed in [8]. It is based on virtual time [4], [5]. We believe that the DRR algorithm is easier to implement in hardware than a virtual-time-based algorithm. We also show that iDRR delivers packets at a rate close to that of an output-queued switch.
In practice, if an algorithm has to be fast, it is important that it be simple and easy to implement in hardware. A flow-based fair scheduling algorithm is desirable, but is difficult to implement in hardware because of the large and variable number of flows. To solve this problem, we also develop a port-based algorithm (called iPDRR) that can work with virtual output queues (VOQs) [9]. We propose to divide the flow-based scheduling into two stages. First, a fair queuing algorithm is applied at the input buffer to resolve contentions among flows from the same input to the same output. Then a port-based fair scheduling algorithm is adopted to resolve contentions among ports.
Stiliadis and Varma also proposed a port-based fair scheduling algorithm, weighted probabilistic iterative matching (WPIM) [10]. It was shown that iFS performs better than WPIM in terms of granularity of fairness [8].
The rest of the paper is organized as follows. In section II, we describe the architecture of the input-queued switch used in this paper. In section III, we examine the problem when fair queuing is applied only at the output link. Then we propose a flow-based iterative deficit-round-robin fair scheduling algorithm (iDRR) for the switch in section IV. In section V, we present a port-based iterative deficit-round-robin fair scheduling algorithm (iPDRR). Its hardware implementation is described in section VI. Finally, section VII concludes this paper.
II. OPERATION OF THE INPUT-QUEUED SWITCH
Fig. 1 shows the internal structure of the input-queued backbone switch of a router. We explain the operations of the various components of this figure in this section. Incoming packets are stored in the input queue, which has a separate queue per output port, called a virtual output queue (VOQ), if a port-based switch scheduler is used. If the switch scheduler is flow-based, a separate queue per flow needs to be maintained. We call it a virtual flow queue (VFQ). The size of the input buffer is finite. When the input buffer is full or congestion is anticipated, a de-congestion mechanism is needed to determine when to drop packets and which packets to drop. One function of the de-congestion mechanism is to isolate ill-behaved flows from well-behaved ones. Such flow isolation is critical to fair scheduling [8]. In the rest of this paper, we assume that such a mechanism is already provided.
The center of the router is the switch fabric, usually an $N \times N$ crossbar connecting $N$ inputs and $N$ outputs. It operates on small fixed-size units, or cells (in ATM terminology). A slot is the time to transfer one cell across the switch fabric. Variable-length packets are usually segmented into cells before being transferred across the crossbar. The switch scheduler selects a configuration of the crossbar to ensure that at any instant, each input is connected to at most one output, and each output is connected to at most one input. For each pair of matched input and output, a cell is selected to be transferred across the switch. In practice, the switch scheduling functionality is distributed to each port, and an iterative scheduling algorithm is implemented to work independently for each input and output [8], [10]-[16]. We modify this part of the hardware to implement fair scheduling.
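The two constraints just stated (packets are cut into fixed-size cells, and a crossbar configuration connects each input and each output at most once) can be illustrated with a minimal sketch. The 64-byte cell size and the function names are assumptions made for illustration only; they are not taken from the paper.

```python
import math

CELL_BYTES = 64  # assumed cell size, chosen only for illustration


def cells_for_packet(packet_bytes, cell_bytes=CELL_BYTES):
    """Number of fixed-size cells needed to carry one variable-length packet."""
    return math.ceil(packet_bytes / cell_bytes)


def is_valid_configuration(matches, num_ports):
    """Check that every input and every output appears in at most one match.

    matches is an iterable of (input_port, output_port) pairs.
    """
    matches = list(matches)
    inputs = [i for i, _ in matches]
    outputs = [o for _, o in matches]
    in_range = all(0 <= p < num_ports for p in inputs + outputs)
    return (in_range
            and len(set(inputs)) == len(inputs)
            and len(set(outputs)) == len(outputs))
```

With these assumptions, a 1500-byte packet occupies 24 slots on the crossbar, and a configuration such as [(0, 1), (2, 1)] is rejected because output 1 would be connected to two inputs.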
The scheduling can be done at the cell level, i.e. all the input-output connections are torn down after each slot and the scheduler considers all inputs and outputs to find a new matching. Cells from different packets can be interleaved to one output. This is the scheme used in many current routers. The advantage of this scheme is its simplicity. The switch scheduler doesn't need to know the packet length.
Alternatively, the scheduling can be done in such a way that when an input and an output are matched, the connection is kept until a complete packet is received at the output. This approach is called packet-mode scheduling, whereas the previous one is called cell-mode scheduling. It was shown that packet-mode scheduling has better packet delay performance than cell-mode scheduling when the packet length distribution has a small variance (the coefficient of variation of the service time is less than 1) [17].
The re-assembly module holds cells until the last cell of a packet is received at the output port, and then reassembles these cells back into a complete packet. In cell-mode scheduling, $M$ re-assembly engines are required, where $M$ is the number of inputs if port-based scheduling is used or the number of flows if flow-based scheduling is adopted. In packet-mode scheduling, only one re-assembly engine is needed.
When a complete packet is received by the re-assembly module, it is stored in the output queue. The output queue can be a simple first-in-first-out (FIFO) queue if the basic first-come-first-served (FCFS) queuing algorithm is used, or a multiple-queue buffer if some fair queuing algorithm, such as DRR, is employed. When an output buffer fills up, the switch is notified to stop transferring packets to it until some buffer space frees up.
The output link is responsible for sending packets out of the router. When the output link is ready, it signals the output queuing engine to pick a packet from the output queue according to a certain criterion. Commercial routers, such as Cisco 12000 series [1] , implement DRR-based algorithms at this point.
Our simulation is based on the model shown in Fig. 1. Packet arrival is modeled as a 2-state ON-OFF process. The number of ON-state slots is defined as the packet length, which is generated from a profile of the NLANR trace at the AIX site [18]. We have varied these parameters to do a sensitivity analysis. However, due to the page limit, we report only a few results in this paper based on the above parameters.
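A 2-state ON-OFF arrival process of the kind described above can be sketched as follows. This is only an illustration: ON periods here have geometrically distributed lengths as a stand-in for the NLANR packet-length profile, which is not reproduced, and the names `on_off_slots`, `load` and `mean_on` are hypothetical.

```python
import random


def on_off_slots(num_slots, load, mean_on, rng):
    """Return a list with 1 for ON slots (a cell arrives) and 0 for OFF slots.

    Two-state Markov chain: ON periods have a mean length of mean_on slots,
    and OFF periods are sized so that the long-run ON fraction equals load.
    Geometric period lengths are an assumption, not the trace-driven profile.
    """
    if not 0.0 < load < 1.0:
        raise ValueError("load must be strictly between 0 and 1")
    mean_off = mean_on * (1.0 - load) / load
    p_leave_on = 1.0 / mean_on
    # Clamped to 1; if mean_off < 1 the achieved load falls below the target.
    p_leave_off = min(1.0, 1.0 / mean_off)
    on = False
    slots = []
    for _ in range(num_slots):
        if on:
            on = rng.random() >= p_leave_on
        else:
            on = rng.random() < p_leave_off
        slots.append(1 if on else 0)
    return slots
```

For example, `on_off_slots(200000, 0.5, 10, random.Random(7))` produces a trace in which roughly half of the slots carry a cell, in bursts averaging ten slots.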
III. PROBLEM WITH FAIR QUEUING AT THE OUTPUT LINK
Fair queuing is usually applied at the output ports of a router. In this section, we verify that this approach is ineffective in a router with an input-queued switch. Consider a $4 \times 4$ IP router configured as in Fig. 1. We choose iSLIP [12] as the switch scheduler and DRR [6] as the fair queuing algorithm. Each input has one flow destined to the same output, and each flow has its own reservation. When the link speed is less than 1 cell/slot, DRR also fails to distinguish among flows. The underlying problem is that the output queue lacks a de-congestion mechanism to isolate different flows. The switch pushes packets of different flows into the output queue at the same rate. On the other hand, DRR pops more packets of flows with high reservations than of flows with low reservations. Eventually, most of the output buffer space is filled up by flows with the lowest reservations.
However, applying a de-congestion mechanism at the output queue has its own problems. Suppose that we apply a de-congestion scheme called equal size per flow (ESPF)$^1$ [8] right before the output queue. One solution could be the following: when a flow uses up its quota, the switch is informed to stop transferring packets belonging to that flow. However, this approach won't work since iSLIP doesn't distinguish among flows. Another way is to drop the packet when a flow uses up its quota. This approach works (as shown in Fig. 3), but is still not very attractive. First, accepting packets at the input queue and dropping them at the output queue wastes switch bandwidth and input buffer space. Second, applying a de-congestion mechanism at the output side is redundant since it is already done at the input side. Finally, this approach still cannot solve the problem when the output-link speed is the same as the switch speed (see Fig. 4).
The conclusion is generally true. If the switch scheduler treats flows with different reservations equally, it is useless to apply fair queuing only at the output side. In other words, the fairness issue should be addressed at the switch and/or input side. In the following section, we develop such a flow-based scheduler for the switch and use simple FIFO output queues.
IV. A FLOW-BASED FAIR SCHEDULING ALGORITHM
Based on the observation in section III, a straightforward idea is to apply fair queuing at the switch to support fair bandwidth allocation. In this section, we first introduce the definition of fairness in input-queued switch scheduling, and then present an iterative deficit-round-robin fair scheduling scheme. We call it iDRR.
1 ESPF: Each flow is assigned a quota equal to the total buffer size divided by the number of flows. When the buffer occupancy is below a certain threshold, all incoming packets are accepted. After that, a packet is accepted only if the flow's quota has not been used up.
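The ESPF rule described in footnote 1 can be sketched as a small admission check. This is a minimal illustration, assuming unit-size packets and a hard cap at the buffer capacity; the class and method names are hypothetical and not taken from [8].

```python
class EspfBuffer:
    """Equal-size-per-flow admission, counting packets as unit-size items."""

    def __init__(self, capacity, num_flows, threshold):
        self.capacity = capacity
        self.threshold = threshold
        self.quota = capacity / num_flows  # equal share of the buffer per flow
        self.occupancy = 0
        self.used = [0] * num_flows

    def admit(self, flow):
        """Store one packet of `flow` and return True if ESPF accepts it."""
        if self.occupancy >= self.capacity:
            return False  # assumed: a full buffer never accepts
        above_threshold = self.occupancy >= self.threshold
        if above_threshold and self.used[flow] >= self.quota:
            return False  # the flow's quota has been used up
        self.occupancy += 1
        self.used[flow] += 1
        return True

    def release(self, flow):
        """Remove one packet of `flow` when it leaves the buffer."""
        self.occupancy -= 1
        self.used[flow] -= 1
```

With a capacity of 8 packets, 4 flows and a threshold of 4, a single flow can fill the buffer up to the threshold, after which it is refused while other flows can still claim their share of 2 packets each.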
A. Definition of fair scheduling
We follow the same definition of fairness as that in [8]. An input-queued switch is work-conserving. Let $r_i$ be the reservation of flow $f_i$, and let $S_i(t_1, t_2)$ be the amount of traffic of $f_i$ served in the interval $[t_1, t_2]$. For any two backlogged flows $f_i$ and $f_j$ that are in contention, a scheduling scheme is fair in $[t_1, t_2]$ if

$$\frac{S_i(t_1, t_2)}{r_i} = \frac{S_j(t_1, t_2)}{r_j}.$$

Intuitively, this definition means that when the bandwidth of a router cannot accommodate all incoming packets, packets of different flows leave the router in proportion to their reservations.
In a router with input-queued switch, the input/output line cards and the switch are usually of the same speed. In this scenario, the switch input port cannot be overloaded because the incoming traffic has been shaped by the input line card which receives packets at the rate of 1 cell/slot. The switch output port, however, can be overloaded when packets from different input ports go to the same output port, or the instantaneous output line speed is reduced because of the slowdown of the down link. These are some of the "heavy traffic" situations we consider in this paper.
B. Description of iDRR algorithm
The basic idea of iDRR is to assign each flow a quota which is in proportion to its reservation. When a flow's corresponding input and output are matched, we continue transferring packets of the flow until its quota is used up.
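The basic idea can be illustrated on a single contended output. The sketch below is not the iDRR algorithm itself: it assumes that all flows are permanently backlogged, that one cell is transferred per slot, and that a flow gives up the output only when its quota is used up. The function names and the choice of scaling quotas from the smallest reservation are assumptions for illustration.

```python
from collections import deque


def quotas_from_reservations(reservations, min_quota):
    """Scale quotas so the smallest reservation receives min_quota cells."""
    smallest = min(reservations.values())
    return {f: round(min_quota * r / smallest) for f, r in reservations.items()}


def serve_backlogged(quotas, num_slots):
    """Serve one cell per slot; a flow keeps the output until its quota is used up."""
    active = deque(quotas)          # round-robin order of flows
    served = {f: 0 for f in quotas}
    counter = 0                     # cells sent by the current flow in this turn
    for _ in range(num_slots):
        flow = active[0]
        served[flow] += 1
        counter += 1
        if counter >= quotas[flow]:
            counter = 0
            active.rotate(-1)       # move the flow to the tail of the list
    return served
```

With reservations of 10, 20, 30 and 40 and a minimum quota of 10 cells, one round takes 100 slots, and after 1000 slots the flows have received 100, 200, 300 and 400 cells respectively, that is, service in proportion to their reservations.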
In an 
. In packet-mode scheduling, we assume that the maximum transfer unit (MTU) is known in advance, and the minimum quota is no smaller than MTU (This assumption is not necessary in cell-mode scheduling).
" is maintained to record the active flows from
is maintained to record active flows to ¤ ¦& ¦ , and initialize
becomes empty for a period of time after a packet of
. Initially, all inputs and outputs are unmatched. Then in each iteration:
sends a request to every output for which it has a queued cell. the first flow, say
and send a grant to
receives any grants, choose from 
C. Remarks
1) Reservation and quota:
For any flows wishing to receive guaranteed bandwidth, reservation is necessary. When a flow makes a reservation at a router, its identification and reservation information is entered into the flow list of the router. All packets of this flow will carry the identification information.
Flow reservations can be made as the percentage of the total bandwidth. The quota can be set statically or dynamically. In
. In the dynamic approach, the minimum quota is assigned to the flows with minimum reservation and quotas of other flows can be adjusted accordingly.
2) Flow deactivation: After a flow makes a reservation, it can be active or inactive depending on its queue status until the reservation is canceled. Initially When the above flow deactivation scheme is used, it may so happen that a flow is in the active flow list while its queue is empty. Therefore, in the grant step of iDRR, when selecting a flow, the condition "
is not empty" is necessary.
3) Complexity:
In the context of fair queuing, assuming that the MTU is known in advance and requiring all quotas to be no smaller than the MTU ensures that DRR has a time complexity of $O(1)$, i.e., it is guaranteed that the flow at the top of the active flow list can be selected. In [19], Kanhere et al. proposed the Elastic Round Robin (ERR) algorithm, which removes this assumption by partitioning time into rounds and using the maximum packet length during the previous round as the minimum quota in the current round. In the context of switch scheduling, however, $O(1)$ still cannot be guaranteed even with the above modifications, because the input port of the top flow in the outflowList may already have been matched in the previous slots. In the worst case, the algorithm has to check all flows in the list before it selects one.
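The worst case mentioned above can be made concrete with a small sketch of the selection step at one output. Only the name outflowList comes from the text; the data structures (`queue_len`, `input_of`, `input_matched`) and the function itself are hypothetical.

```python
def select_flow(outflow_list, queue_len, input_of, input_matched):
    """Return (flow, flows_examined) for the first eligible flow in list order.

    A flow is eligible if its queue is not empty and its input port has not
    already been matched. Returns (None, len(outflow_list)) if none qualifies.
    """
    for examined, flow in enumerate(outflow_list, start=1):
        if queue_len[flow] > 0 and not input_matched[input_of[flow]]:
            return flow, examined
    return None, len(outflow_list)
```

If the flows at the head of the list all belong to an input that is already matched, the scan has to walk past every one of them, so the cost grows with the length of the list rather than staying constant.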
4) Cell-mode vs. packet-mode:
Scheduling can be cell-mode or packet-mode. In the case of flow-based scheduling, cell-mode may not be a good choice. Although cell-mode scheduling simplifies the switch, it requires that the re-assembly module be able to hold $M$ packets at one time, where $M$ is the number of flows and can be very large. Therefore, we choose packet-mode in our simulation.
D. Simulation Results
(same as that in Tiny Tera [20]). Under this circumstance, we observe that the average packet delay of iDRR is almost identical to that of iFS. Both are close to the delay when output queuing is used. Hence, iDRR is capable of achieving asymptotically 100% throughput for uniform traffic. For the non-uniform traffic, we consider the server-client model used in [10] and [8]. In a $16 \times 16$ switch, 4 ports are connected to servers and 12 to clients. Each client sends 10% of its generated traffic to each server, and the remainder is evenly distributed among the other clients. For each server, 95% of its traffic goes to clients, and 5% to other servers. In this setting, the traffic from clients to servers is almost twice that from clients to clients. Fig. 6 shows the average packet delay of traffic from clients to servers as a function of the workload per input. As we can see, iDRR and iFS are almost indistinguishable and can reach a throughput of about 78%.
To evaluate the fairness of iDRR, we run the simulation on a $4 \times 4$ switch, where each input has two flows destined to the same output with different reservations. Each flow maintains the same arrival rate. The output link speed is 1 cell/slot. As shown in Fig. 7, the bandwidth is distributed in proportion to each flow's reservation when the link is over-subscribed.
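The server-client traffic model used for the non-uniform experiments in this section can be written out as a traffic matrix. The sketch below assumes that no port sends traffic to itself and that each server's 95% and 5% shares are split evenly among the clients and the other servers respectively; these splits, and the function names, are assumptions for illustration.

```python
def server_client_matrix(num_servers=4, num_clients=12):
    """Fraction of each input's traffic sent to each output (rows: inputs).

    Ports 0..num_servers-1 are servers, the remaining ports are clients.
    """
    n = num_servers + num_clients
    to_each_server = 0.10
    client_to_clients = 1.0 - to_each_server * num_servers
    matrix = [[0.0] * n for _ in range(n)]
    for src in range(n):
        src_is_server = src < num_servers
        for dst in range(n):
            if dst == src:
                continue  # assumed: no self-traffic
            dst_is_server = dst < num_servers
            if src_is_server and dst_is_server:
                matrix[src][dst] = 0.05 / (num_servers - 1)
            elif src_is_server:
                matrix[src][dst] = 0.95 / num_clients
            elif dst_is_server:
                matrix[src][dst] = to_each_server
            else:
                matrix[src][dst] = client_to_clients / (num_clients - 1)
    return matrix


def column_load(matrix, dst):
    """Total load offered to output dst when every input offers load 1."""
    return sum(row[dst] for row in matrix)
```

Under these assumptions, a client sends 0.10 of its traffic to a given server but only about 0.055 to a given client, a ratio of 11/6. Each server output is offered 1.25 times the per-input load, so it would saturate once the per-input load reaches 0.8, which is in line with the throughput of about 78% reported above.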
V. A PORT-BASED FAIR SCHEDULING ALGORITHM
Flow-based fair scheduling algorithms are desirable in terms of fairness among flows. However, in terms of hardware implementation, they are more complex than port-based algorithms. In iDRR and iFS, each port needs to maintain an active flow list, whose length varies from time to time and can be very large.
Fig. 7. Throughput per flow using iDRR
We propose to divide the flow-based scheduling problem into two stages. Fig. 8 shows two additional stages, VFQ and DRR, introduced at the input side of Fig. 1 . There is no fair queuing engine at the output side. First, we apply a fair queuing algorithm at the input buffer to resolve contentions among flows from same inputs to same outputs. The VFQ is implemented in DRAM and the DRR fair queuing is implemented in software. A separate input queue (VOQ) is maintained for each output, as required for iterative scheduling algorithms. Then we develop a port-based fair scheduling algorithm to resolve contentions among the input ports.
For a hardware scheduling algorithm to be useful, it is important that it be simple. That is why iSLIP is the choice in Tiny Tera [20] and Cisco GSR [1], although it doesn't offer the best performance compared to other schemes or provide 100% throughput under non-uniform traffic [21]. It is readily implemented in hardware and can operate at high speed. It was shown that iSLIP can find a matching using 3 iterations within 8 switch cycles (45 ns) for a $32 \times 32$ switch [22]. Our iDRR can be readily modified to its port-based version. We call it iPDRR. Because of the fixed number of ports, it is easy to implement in hardware. In the rest of this section, we describe iPDRR, compare it with other schemes, and show our simulation results. Its hardware implementation will be discussed in detail in the next section.
In packet-mode scheduling, we assume that the MTU is known in advance, and all quotas are no smaller than the MTU (this assumption is not necessary in cell-mode scheduling). Notice that iPDRR differs from iDRR in that inactive ports are not removed from the linked list, so the size of the linked list is fixed. This modification makes hardware implementation much easier (see section VI).
Like iDRR, when 
B. iPDRR vs. iSLIP
In [12], McKeown proposed the round-robin algorithm iSLIP. Instead of using a linked list, iSLIP uses a pointer at each output (input) to record the input (output) with the highest priority. Grants (accepts) are given in round-robin order starting from the highest-priority port.
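The pointer-based selection described above can be sketched as a round-robin arbiter. The pointer-update rule shown here (advance to one position past the granted port only when the grant is accepted) is the rule commonly described for iSLIP and is not spelled out in this section, so it should be read as an assumption; the function names are hypothetical.

```python
def round_robin_pick(requests, pointer):
    """Return the first requesting port at or after `pointer`, wrapping around.

    requests is a list of booleans indexed by port; returns None if empty.
    """
    n = len(requests)
    for offset in range(n):
        port = (pointer + offset) % n
        if requests[port]:
            return port
    return None


def next_pointer(pointer, granted, accepted, num_ports):
    """Advance the pointer past the granted port only if the grant was accepted."""
    if granted is not None and accepted:
        return (granted + 1) % num_ports
    return pointer
```

For example, with requests from ports 1 and 3 and the pointer at 2, port 3 is granted; if that grant is accepted in a 4-port switch, the pointer wraps around to 0.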
A weighted iSLIP algorithm is also proposed in [12] . 
#
will be re-calculated. The hardware implementation of iPDRR, on the other hand, is very straightforward and easy.
C. iPDRR vs. iPFS
In [8] , Ni and Bhuyan proposed a switch scheduling algorithm, called iFS, which is based on virtual time. Each incoming packet is assigned a virtual time according to its flow's reservation. The iFS schedules packets in the increasing order of the virtual time.
The original iFS is flow-based. The time complexity of a schedule arbiter can be reduced by using parallel comparison. Still, we need to compare many values and select the smallest in a very short time. For example, the scheduler of Tiny Tera [20] runs at a clock speed of 175 MHz, and a slot is composed of 9 cycles, of which 8 cycles are for 3 iSLIP iterations and 1 cycle is for crossbar configuration. If iFS uses such timing, the comparison of values has to be done within about 11 ns. Fast comparison prefers a small register width. However, calculating virtual time may involve floating point, which increases the register width. In addition, virtual time is monotonically increasing. The registers must be big enough to hold the flow with the longest life. A 16-bit register means that a flow can only live as long as 65535 slots (about 3.4 ms in a switch like Tiny Tera).
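The timing figures quoted above follow from the 175 MHz clock and the 9-cycle slot. A short calculation, assuming a counter that advances once per slot as the text implies:

```python
CLOCK_HZ = 175e6       # scheduler clock quoted for Tiny Tera
CYCLES_PER_SLOT = 9    # 8 cycles of iSLIP iterations + 1 cycle of crossbar configuration

cycle_ns = 1e9 / CLOCK_HZ
slot_ns = CYCLES_PER_SLOT * cycle_ns


def register_lifetime_ms(bits):
    """Time until a register of the given width, advanced once per slot, wraps."""
    return (2 ** bits - 1) * slot_ns / 1e6
```

One cycle is about 5.7 ns, so the 8 cycles of scheduling take about 45.7 ns and a slot lasts about 51.4 ns; `register_lifetime_ms(16)` evaluates to roughly 3.37 ms.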
iPDRR, as we will see in section VI, is easily implemented in hardware and can operate at high speed. Selecting a port from a list can be done by using a circular linked list and a simple combinational circuit with multiplexers and demultiplexers. The connection tear-down logic can be carried out with 2 registers and an adder. All modules in iPDRR can be accomplished in 1 or 2 cycles. In addition, a 16-bit register is big enough to hold a flow's quota.
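A software analogue of these two pieces of logic might look as follows. This is a sketch under assumptions, not the hardware design: it assumes that a newly matched port is moved to the tail of a fixed-size list, and that the tear-down decision compares a per-connection cell counter against the quota (the two registers), incrementing the counter once per transferred cell (the adder). The names `PortList` and `transfer_cell` are hypothetical.

```python
class PortList:
    """Fixed-size ordered list of ports; inactive ports are never removed."""

    def __init__(self, num_ports):
        self.order = list(range(num_ports))

    def select(self, eligible):
        """Return the first eligible port in list order, or None."""
        for port in self.order:
            if eligible[port]:
                return port
        return None

    def move_to_tail(self, port):
        """Move a newly matched port to the tail; a no-op if already last."""
        if self.order[-1] != port:
            self.order.remove(port)
            self.order.append(port)


def transfer_cell(counter, quota):
    """Count one transferred cell; return (new_counter, keep_connection)."""
    counter += 1
    if counter >= quota:
        return 0, False  # quota used up: tear the connection down
    return counter, True
```

Because the list never shrinks or grows, its length always equals the number of ports, which is what keeps the selection logic a fixed-size circuit.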
D. iPDRR vs. WPIM
In [10], Stiliadis and Varma also proposed a port-based fair scheduling algorithm, weighted probabilistic iterative matching (WPIM), which is based on the original parallel iterative matching (PIM) algorithm [11]. In WPIM, the time axis is divided into frames with a fixed number of slots per frame. The reservation unit is slot/frame. In the first iteration, for each output, an additional mask stage is introduced to block those inputs whose credits are used up, thus allowing other inputs to receive their shares of the bandwidth. Clearly, the bandwidth guarantee is provided at the coarse granularity of a frame [8].
The WPIM scheduler uses random selection which is an expensive operation, particularly for a variable number of input requests. It is hard to implement in very short time. In [11] , a slot time is slots/frame respectively. The iPDRR, on the other hand, can always provide fair sharing regardless of the change of the output link speed.
When the actual reservation to an output port is less than its capacity, WPIM equally allocates the rest of the bandwidth among all competing flows. In iPDRR, the rest of the bandwidth is still allocated in proportion to the competing flows' reservations.
E. Simulation Results
Unlike the original definition of cell-mode scheduling [17], one input-output match can last for more than one slot even in cell-mode iPDRR/iDRR, so under heavy workload cell-mode iPDRR/iDRR in fact behaves similarly to its packet-mode counterpart.
For the non-uniform traffic, we run the simulation under the same setting as that in section IV-D. The results are shown in Fig. 11 and Fig. 12.
VI. HARDWARE IMPLEMENTATION OF IPDRR
We now consider the hardware implementation of iPDRR. Fig. 15 shows the block diagram of how request modules, grant arbiters, accept arbiters and decision modules are connected to construct an iPDRR scheduler (for convenience, we only show the implementation for a $4 \times 4$ switch; it is straightforward to extend it to larger switches).
A. Request module
The request module at each input, as illustrated in Fig. 16, is responsible for sending requests to the grant arbiters of those outputs for which the corresponding input queue is not empty. Like the iSLIP scheduler [22], the request-grant-accept iteration in iPDRR is also pipelined, i.e., requests in the next iteration can be sent at the same time as the accept decisions in the current iteration are made. Whenever the accept arbiter of an input (see section VI-B) receives at least one grant, the anyGrant signal is set and the request module of that input is disabled in the next iteration. The request module also takes as inputs the input-output connection information made in the previous slot by the decision module (see section VI-C). Together with the state of the corresponding input queue, this connection information determines whether the corresponding input-output connection will be kept in the current slot. If so, the corresponding grant and accept arbiters are disabled in the current slot.
B. Grant / Accept arbiter
The accept arbiter (see Fig. 18) is almost the same as the grant arbiter, except that it doesn't need a register to hold the accept decision to update its port list, because an input that receives at least one grant will definitely be matched.
When there is a new matching, the register holding the matched port and all registers after it will be enabled, so that the matched port can be moved to the end of the list and the ports behind it shifted forward in one shot. Note that if the last port in the list is matched, we don't have to update the list. Also, when there is a new matching, the last two ports will always be enabled.
C. Decision module
After the accept arbiter makes a decision, the result goes to the decision module (see Fig. 20 ) and is stored in the decision register. The main functionality of the decision module is to perform bandwidth fair sharing among ports. Each decision module has registers for quotas and registers for counters. Since updating quotas is not timecritical, for each quota, we keep a copy in memory. When a flow comes and goes, the corresponding quota is updated in memory and then copied to its register. In packet-mode scheduling,
The output signals of the decision module go to the request module, which further decides whether a connection needs to be kept in the next slot.
VII. CONCLUSION
In this paper, we first demonstrated that applying fair queuing only at the output link is not very effective, because the number of packets competing for the output link is limited in input-queued switches. Therefore, we proposed iDRR, a flow-based fair scheduling algorithm which can allocate the switch bandwidth in proportion to each flow's reservation. iDRR is implemented in an iterative switch scheduler so that packets are properly selected from the input queues for transmission. We showed that such a scheme achieves fair scheduling while providing high throughput and low latency. Since flow-based fair scheduling schemes are difficult to implement in hardware, we also proposed a port-based fair scheduling algorithm, iPDRR, and described its hardware implementation in detail. We also compared the performance of iPDRR with other schemes to demonstrate the advantages of our technique.
