ABSTRACT: This paper describes the architecture for a work-conserving server using a combined input output buffered crossbar switch. The switch employs a novel algorithm based on output occupancy, the Lowest Occupancy Output First Algorithm (LOOFA), and a speedup of only 2. A work-conserving switch provides the same throughput performance as an output buffered switch. The work-conserving property of the switch is independent of the switch size and input traffic pattern. We also present a suite of algorithms that can be used in combination with LOOFA. These algorithms determine the fairness and delay properties of the switch. We also describe a mechanism to provide delay bounds for real-time traffic using LOOFA. These delay bounds are achievable without requiring output buffered switch emulation.
Introduction
The rapid growth in the popularity of the internet has caused the traffic on the internet to double every year for the last several years. It has also spurred the emergence of many Internet Service Providers (ISPs) whose revenues are primarily from providing internet access to individuals and corporate customers. The ISPs typically lease wide-area links that cost them, for some very high capacity links (e.g., OC-48), up to a dollar per second. In order to remain profitable in such an environment, the ISPs need to keep these links fully utilized.
There is a dire need to reduce or eliminate link idling. Also, to sustain growth, ISPs need to provide new differentiated services, e.g., tiered service, support for multimedia applications, etc.
The switches/routers in the ISPs' networks play a critical role in providing these features. Mechanisms have to be built into these switches so that they can be work conserving (to eliminate link idling), support prioritization (for tiered service), and provide rate and delay guarantees (to support multimedia applications). In addition, the switch has to be of a very high capacity. Optimal throughput and delay performance is obtained using output buffered switches. As long as no input port or output port is oversubscribed (i.e., the loads are feasible), 100% throughput is achieved. Moreover, since packets are placed in the output buffers immediately upon arrival, it is possible to better control the latency of each packet. This helps in providing QOS guarantees. To achieve this, the switch fabric must operate at a rate at least equal to the aggregate of all the input links connected to the switch. However, increasing line rates (B) and increasing switch size (N) make it extremely difficult to significantly speed up the switch fabric, and also to build memories with a bandwidth of O(NB). This has renewed interest in switches with lower complexity (and cost), such as input buffered switches, despite their deficiencies.
One of the most popular interconnection networks used for building input buffered switches is the crossbar because of its (i) low cost, (ii) good scalability and (iii) non-blocking properties. An input buffered crossbar switch has the crossbar fabric running at the link rate. If each input maintains a single FIFO queue, packets suffer from head of line (HOL) blocking. This limits the maximum throughput achievable. Karol et al. [1] showed that the maximum throughput of an input buffered crossbar switch operating under uniform traffic is limited to about 58%. Moreover, Li [2] has shown that the maximum throughput of the switch decreases monotonically with increasing burst size. A considerable amount of work has been done in recent years to build input buffered switches that match the performance of an output buffered switch. To eliminate HOL blocking, virtual output queues (VOQs) were proposed at the inputs. However, since there could be contention at the inputs and outputs, there is a necessity for an arbitration algorithm to schedule packets between various inputs and outputs (equivalent to the matching problem for bipartite graphs).
Currently with Sycamore.
With Digital Network Products Group, a subsidiary of Cabletron Systems. Currently with NexaBit.
0-7803-4482-0/98/$10.00 © 1998 IEEE
Definition 1: A maximal match on a bipartite graph is one to which no more matches can be added trivially, i.e., without changing the matches already made. A maximum match, on the other hand, is one that matches the maximum number of inputs and outputs, i.e., there is no other match that matches more inputs and outputs. Every maximum match is maximal; however, the reverse is not true.
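The difference between the two kinds of match can be seen on a 2 x 2 example. The sketch below is illustrative only: the function names, the edge-list representation, and the choice of a simple augmenting-path routine for the maximum match are assumptions made for this example, not part of the paper.

```python
def greedy_maximal_match(edges):
    """Scan edges in order; keep an edge if both endpoints are still free.

    The result is maximal (no edge can be added without changing it),
    but it is not necessarily maximum.
    """
    in_used, out_used, match = set(), set(), []
    for i, j in edges:
        if i not in in_used and j not in out_used:
            match.append((i, j))
            in_used.add(i)
            out_used.add(j)
    return match


def maximum_match(edges, n):
    """Augmenting-path search over n inputs; returns sorted (input, output) pairs."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
    owner = {}  # output -> input currently matched to it

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free output, or re-route the input that holds it.
            if j not in owner or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(n):
        try_assign(i, set())
    return sorted((i, j) for j, i in owner.items())


# Input 0 has cells for outputs 0 and 1; input 1 only for output 0.
edges = [(0, 0), (0, 1), (1, 0)]
maximal = greedy_maximal_match(edges)
maximum = maximum_match(edges, 2)
```

Here the greedy scan matches input 0 to output 0 and then cannot serve input 1, so it yields a maximal match of size 1, while the maximum match of size 2 pairs input 0 with output 1 and input 1 with output 0.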
It has been shown that an input buffered switch with VOQs can provide asymptotic 100% throughput using a maximum matching algorithm [3]. However, the complexity of the best known maximum match algorithm, $O(N^{2.5})$ [4], is too high for high speed implementations. Moreover, under certain traffic conditions, maximum matching can lead to starvation. Over the years, a number of maximal matching algorithms (e.g., PIM [5], WPIM [6], SLIP [7], FAAR [8], LPF [12]) have been proposed.
However, none of these algorithms match the performance of an output buffered switch.
Increasing the speedup of the switch fabric has also been proposed as one of the ways to improve the performance of an input buffered switch. Speedup is defined as the ratio of the switch fabric bandwidth to the bandwidth of the input links. (Unless otherwise mentioned, we assume that all input links and output links of the switch have the same capacity.) However, when the switch fabric has a higher bandwidth than the links, buffering is required at the outputs too. Thus, a combination of input buffered and output buffered switch is required: CIOB (Combined Input and Output Buffered). The goal, then, is to find the minimum speedup required to match the performance of an output buffered switch using a CIOB switch and VOQs.
Identical behavior to an output buffered switch means that, under identical input traffic, (a) the CIOB switch is busy at the same time as the emulated switch, and (b) the packet departure order is the same. If only (a) is satisfied, then the throughput performance is matched, and if both (a) and (b) are satisfied, then delay performance is also matched. A work-conserving switch will satisfy condition (a).
McKeown et al. [9] showed that a CIOB switch with VOQs is always work conserving if speedup is greater than N/2. In a recent work, Prabhakar et al. [10] showed that a speedup of 4 is sufficient to emulate an output buffered switch (with an output FIFO) using a CIOB switch with VOQs. In [11], a simulation study of input buffered switches suggested that a speedup of 2 is sufficient to provide both the throughput and the delay performance of an output buffered switch. However, there were no proofs to back these claims.
In this work we present a novel scheduling algorithm, LOOFA (Lowest Occupancy Output First Algorithm). We prove that a CIOB switch with VOQs operating under LOOFA and a speedup of 2 is work conserving at all times. The work conserving property of a switch operating under LOOFA is independent of the switch size and input traffic pattern. We also present a suite of algorithms that can be used in combination with LOOFA. These algorithms determine the delay and fairness properties of the switch. We also present a mechanism that, when added to a switch operating under LOOFA with a speedup of 2, provides delay guarantees in addition to work conservation. The arbitration delay is the key component in the end to end switch delay in a crossbar-based switch, partly because it is the one that is the most difficult to control. We derive expressions for the arbitration and output queuing delay bounds for a LOOFA switch. To our knowledge, LOOFA is the only arbitration algorithm that is not only simple to implement, but also provides both work conservation and delay guarantees in a crossbar for a speedup of only 2.
High Level Description of the Architecture
The interconnection architecture is an N × N crossbar with a speedup of S, where N is the number of crossbar ports. The crossbar connections between the input and output ports are called channels. The variable length packets are broken into fixed sized cells before being transmitted across the crossbar. The cells are reassembled at the output of the switch. In practice, the cell size is chosen such that the arbitration and scheduling functions can be performed within a single cell time. A cell time is the time to transmit a cell across the input and output link, and is equal to C/B, where C and B are the cell size and the bandwidth of the link, respectively. To simplify the discussions, we are going to assume that all packets are of the same size.
The queuing architecture described in this section suffices to explain the LOOFA algorithm. However, it should be noted that the architecture needs to be augmented in order to provide features like per-flow fairness and delay guarantees. In this paper we will only describe the queuing architecture to provide delay guarantees (Section 4.1).
Each input line card has N virtual output queues (VOQs), each corresponding to a crossbar output port. Packets, upon arrival from the physical input link, get stored in the memory, and a pointer to the packet is appended to the VOQ corresponding to the output channel of the packet. The switch has a central arbiter that executes the iterative maximal-matching algorithm described in the next section. In each cell slot, the arbiter schedules and transfers at most S cells from an input and S cells to an output, where S is the speedup. The arbiter operates on the contents of all the VOQs across the switch. During each phase, upon completion of the matching algorithm, the arbiter sends to each input channel i the identifier of the output channel to which input channel i can send a cell.
Once a cell is transmitted across the crossbar, the cell is stored in the memory at the output side of the switch, waiting to be transmitted onto the outgoing link.
Description of the Algorithm
Time is divided into cell slots. Each cell slot is divided into phases. For a speedup of S, there are S phases per cell slot. In each phase, an input can transfer at most one cell, and an output can receive at most one cell. During each phase, an execution of the matching algorithm takes place. It is assumed that a sufficient number of iterations is completed during the execution such that no more trivial matches can be added. For a switch of size N, N iterations are sufficient. New cell arrivals take place at the beginning of a cell slot. An input can receive at most one new cell per cell slot from its incoming link. Departures from the output take place at the end of a cell slot. An output can transmit at most one cell per cell slot to its outgoing link. As mentioned earlier, the input channels maintain a VOQ per output channel. Output channels maintain a queue to receive cells from input channels. An integer variable, named occupancy, is associated with each output queue. As the name implies, the occupancy of an output j at any time is simply the number of cells currently residing in output j's queue. The natural place to maintain occupancy values is at the arbiter. The arbiter keeps track of the number of times (ncells) an output got matched in the current cell slot. At the end of the cell slot, the arbiter increments the occupancy of an output by (ncells - 1), simulating a cell departure from the output. However, if ncells is zero and the occupancy is zero, the occupancy is not updated. Zero occupancy is permissible, implying that a single cell could have entered a zero occupancy output and exited it during the cell slot.
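The end-of-slot occupancy bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the convention of passing the occupancy as it stood at the start of the slot are choices made for this example.

```python
def end_of_slot_occupancy(occupancy: int, ncells: int) -> int:
    """Return the occupancy of one output after the end-of-slot update.

    occupancy: occupancy of the output at the start of the cell slot.
    ncells: number of times the output was matched during the cell slot
            (at most S, the speedup).
    """
    if occupancy == 0 and ncells == 0:
        # Nothing was queued and nothing arrived: no departure to simulate.
        return 0
    # ncells arrivals from the crossbar, one departure onto the outgoing link.
    return occupancy + ncells - 1
```

With a speedup of 2, ncells is 0, 1, or 2, so the occupancy of an output changes by -1, 0, or +1 per cell slot. An output that starts empty and receives one cell ends the slot empty again, which is the zero-occupancy case mentioned in the text.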
Definition 2: A switch is work conserving if and only if an output of such a switch is not idle at the end of a cell slot T, if at the beginning of the cell slot T, (a) there was at least one cell at any input of the switch destined for this output, and/or (b) the output FIFO had at least one cell queued in it.
Under a speedup of 2, each cell slot has 2 phases. During each phase:
1. Initially, all inputs and outputs are unmatched.
2. Each unmatched input selects the active VOQ (i.e., a VOQ that has at least one cell queued) going to an unmatched output with the lowest occupancy, and sends a request to that output (OUTPUT SELECTION).
3. The output, upon receiving requests from multiple inputs, selects one and sends a grant to that input (INPUT SELECTION).
4. Return to step 2 until no more connections can be made.
The algorithm essentially gives priority to output channels with low occupancy, thereby attempting to simultaneously maintain work conservation across all output channels. It is interesting that a crossbar speedup of only 2 is necessary and sufficient to ensure the goal. The proof is presented in the appendix.
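One phase of the request/grant procedure in steps 1 to 4 can be sketched as below. This is an illustrative sketch, not the authors' implementation: the function name, the `voq[i][j]` cell-count representation, breaking occupancy ties by lowest output index, and the default input selection (lowest input index, via `pick_input`) are all assumptions made for the example.

```python
def loofa_greedy_phase(voq, occupancy, pick_input=min):
    """Run one phase of the request/grant matching; return {input: output}.

    voq[i][j]: number of cells queued at input i for output j.
    occupancy[j]: current occupancy of output j.
    pick_input: input selection rule applied by an output to its requesters
                (hypothetical default: lowest input index).
    """
    n = len(voq)
    match = {}
    free_in = set(range(n))   # step 1: all inputs unmatched
    free_out = set(range(n))  # step 1: all outputs unmatched
    while True:
        # Step 2 (output selection): each unmatched input requests the
        # lowest-occupancy unmatched output for which it has an active VOQ.
        requests = {}
        for i in sorted(free_in):
            candidates = [j for j in free_out if voq[i][j] > 0]
            if candidates:
                j = min(candidates, key=lambda k: (occupancy[k], k))
                requests.setdefault(j, []).append(i)
        if not requests:
            break  # step 4: no more connections can be made
        # Step 3 (input selection): each requested output grants one input.
        for j, inputs in requests.items():
            i = pick_input(inputs)
            match[i] = j
            free_in.discard(i)
            free_out.discard(j)
    return match
```

The function leaves `voq` and `occupancy` untouched; a caller simulating a full cell slot would run it once per phase, remove the transferred cells from the VOQs, and count the matches per output for the end-of-slot occupancy update.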
The above mentioned version of the algorithm is the "greedy" one. If, in step (2), we do not restrict the selection to active VOQs and also consider inactive VOQs, then the algorithm acts in a "best-first" fashion, i.e., in each iteration, the inputs across the switch pick the lowest occupancy unmatched output across the switch and perform step (3) for that output. If an unmatched input does not have any cell for the lowest occupancy output, the input sends a null request to the arbiter. Thus, the "best-first" version of the algorithm has to execute N iterations to perform correctly. In the "greedy" version of the algorithm, however, multiple outputs can get matched in an iteration, thus requiring on average fewer than N iterations. It should be noted that both the greedy and the best-first versions of the LOOFA algorithm are work-conserving. The work-conserving feature of the switch is also independent of the selection algorithm used at the outputs. Table 1 lists some examples of the selection algorithm that can be employed at the outputs.
The choice of the selection algorithm will determine the fairness properties in an overloaded switch, and delay properties under feasible rates. In this paper we will only present the delay properties of the switch.
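For comparison with the greedy version, the best-first version can be sketched as a loop that visits the outputs in increasing order of occupancy, one output per iteration. As before, this is only a sketch under assumptions: the function name, the `voq[i][j]` cell-count representation, tie-breaking by output index, and the default input selection rule are hypothetical choices for illustration.

```python
def loofa_best_first_phase(voq, occupancy, pick_input=min):
    """Run one best-first phase; return {input: output}.

    voq[i][j]: number of cells queued at input i for output j.
    occupancy[j]: current occupancy of output j.
    pick_input: input selection rule used by the output being served
                (hypothetical default: lowest input index).
    """
    n = len(voq)
    match = {}
    free_in = set(range(n))
    # N iterations: one per output, lowest occupancy first.
    for j in sorted(range(n), key=lambda k: (occupancy[k], k)):
        # Inputs without a cell for output j send a null request,
        # so only inputs with an active VOQ for j compete.
        requesters = [i for i in sorted(free_in) if voq[i][j] > 0]
        if requesters:
            i = pick_input(requesters)
            match[i] = j
            free_in.discard(i)
    return match
```

Passing a different `pick_input` (for example, one that prefers the input holding the oldest cell) changes the fairness and delay behavior without changing the order in which outputs are served.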
Providing Rate and Delay Guarantees using LOOFA
In order to provide rate or delay guarantees for quality of service (QOS) traffic, admission control is essential. In the absence of any admission control, the guarantees achievable for the QOS traffic effectively converge to those of the best effort traffic. The QOS traffic also becomes less useful if the admitted flows are not policed somewhere, be it at the network edge or within the switch itself. In the absence of policing, flows that do not conform to their negotiated rates can cause problems to switches attempting to provide rate and delay guarantees to all admitted flows. In this paper, we will assume that the switch has to perform the policing itself. It is also implicitly assumed that the switch control processor performs the admission control and that the relevant per-flow parameters are communicated to the scheduling and arbitration modules in the switch.
In the absence of policing, and under the assumption that flows do in fact conform to their requested rate in long term average, the task of providing long term rate guarantees is relatively simple. By admitting flows whose aggregate rate is less than the switch capacity (in particular, the aggregate rate of all flows to each output port must not exceed the port's capacity), one may simply use LOOFA and a speedup of 2 (which provides 100% throughput for feasible loads). No per-flow state needs to be maintained in the switch.
On the other hand, if the flows cannot be trusted to conform to their negotiated rates, some policing mechanism is needed to ensure that guarantees for each flow are met. An effective way to police traffic is to incorporate a rate controller that limits the maximum rate of a flow to its requested rate. An example implementation of a rate controller is the rate-controlled version of worst case fair weighted fair queuing (WF2Q) [13]. The use of rate controllers for shaping QOS traffic can also yield delay bounds for the QOS traffic. We will now describe a queuing architecture to provide delay bounds in a crossbar switch.
A Queueing Architecture to Support QOS Traffic
It is implicitly assumed that there is a rate associated with each flow. The service parameters of each flow are communicated to the switch during connection establishment. It is also assumed that the flows are feasible. Please refer to Figure 1 for the following discussion. Each input channel maintains a per flow queue. Packets, upon arrival from the physical input link, get mapped to a flow through a flow mapper. The flow can be based on various classifiers: source address, destination address, protocol type, etc. The packet gets stored in the memory, and a pointer to that packet gets added to the queue corresponding to the flow. There is a (policer) rate controller $S_f$, one per VOQ, which schedules flows into their respective VOQs. When a flow gets scheduled, the pointer to the HOL cell in the selected flow's queue is removed and added to the tail of the VOQ corresponding to the flow. However, since we have ...
It should be noted that only for the cells in the scheduled VOQs does the matching algorithm maintain the work conserving property. The matching algorithm does not provide the work-conserving property for the cell arrivals into unscheduled VOQs or the flow queues. Intuitively, both the rate controllers $S_f$ and $S_v$ present cells to the arbiter and hide from the arbiter the existence of actual cell arrivals at the input of the switch.
Thus, for the arbiter, the actual cell arrivals into the switch occur only when the cells arrive into the scheduled VOQs. As will be shown in the next section, the combination of the use of rate controllers and the work conserving property of LOOFA provides delay guarantees for the scheduled cells.
Arbitration Delay Bounds
We assume that both $S_f$ and $S_v$ are rate-controlled versions of WF2Q. The end to end switch delay of a cell, D, is the sum of the delay incurred at the input, the arbitration delay ($D_a$), and the delay incurred at the output ($D_o$). The input delays are due to the rate controllers and are not dependent on the matching algorithm used. In this work we will derive the arbitration and the output delays due to LOOFA. The key delay component in crossbar switches is the arbitration delay, which critically depends on the matching algorithm used. The arbitration delay of a cell is the time taken to reach the output channel after it gets scheduled. We prove in the appendix that using the best-first LOOFA-OCF algorithm (the best-first version of LOOFA, employing OCF as the input selection), for $S \ge 2$, $D_a \le 4N/(S-1)$ cell slots. It is also shown that if the output is arranged in a FIFO manner, $D_o \le 2N$ cell slots. Thus, for S = 2, the bound on the sum of the arbitration and output delay is 6N cell slots. This delay bound does not require the switch to emulate an output buffered switch.
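The arithmetic behind the 6N figure can be checked with a short calculation. The sketch below simply evaluates the two bounds stated above (4N/(S-1) cell slots for arbitration and 2N cell slots for the output FIFO); the function name and the return convention are assumptions made for this example.

```python
def loofa_delay_bounds(n_ports: int, speedup: int):
    """Return (arbitration, output, total) delay bounds in cell slots.

    Evaluates the bounds quoted in the text for best-first LOOFA-OCF
    with rate-controlled inputs, which are stated for a speedup of at least 2.
    """
    if speedup < 2:
        raise ValueError("the bounds are stated for a speedup of at least 2")
    arbitration = 4 * n_ports / (speedup - 1)  # bound on D_a
    output = 2 * n_ports                       # bound on D_o
    return arbitration, output, arbitration + output


# A 32-port switch with a speedup of 2: 4N + 2N = 6N = 192 cell slots.
d_a, d_o, total = loofa_delay_bounds(32, 2)
```

For N = 32 and S = 2 this gives 128 cell slots for arbitration, 64 for the output queue, and 192 in total, i.e., 6N.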
Limiting to one cell arrival per cell slot is essential for the correct operation of LOOFA.
Of course, we need not have two separate queues to implement it. We could have just maintained a bit with each cell, and set the bit when it gets scheduled. We explain it this way purely for the sake of clarity.
The delay bounds depend on the discrepancy bounds of the rate controller. WF2Q was chosen because it is known to have a very small discrepancy bound, and therefore can ensure small delays [13].
In [14], it has been proved that best-first LOOFA-OCF emulates an output buffered switch (with the output arranged in a FIFO) for a speedup of only 3. It is shown in the appendix that using the queuing architecture presented in the previous section at the input, we can obtain a bound of only 2N cell slots for ($D_a + D_o$) for a switch which emulates an output buffered switch (with the output arranged in a FIFO). Thus, the bound for ($D_a + D_o$) using best-first LOOFA-OCF for S = 2 is slightly worse than for S = 3 (6N versus 2N cell slots). On the other hand, for a delay of 6N to be a significant factor in the total delay of a particular flow, the flow must have a rate of more than C/(6NT) bits/s, where C is the cell size in bits and T is the cell time in seconds. For N = 32, T = 1 µs, C = 256 bytes, we obtain a maximum supportable rate of roughly 10 Mb/s, which is much more than the typical rates of real-time traffic.
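The rate threshold C/(6NT) can be evaluated directly for the example parameters. The sketch below assumes the cell time in the example is 1 microsecond and uses a hypothetical function name; it only reproduces the arithmetic of the paragraph above.

```python
def significant_rate_bps(cell_bytes: int, n_ports: int, cell_time_s: float,
                         bound_in_n: int = 6) -> float:
    """Rate above which a delay of bound_in_n * N cell slots becomes significant.

    Computes C / (bound_in_n * N * T), with C converted from bytes to bits.
    """
    cell_bits = cell_bytes * 8
    return cell_bits / (bound_in_n * n_ports * cell_time_s)


# N = 32 ports, C = 256 bytes, T assumed to be 1 microsecond.
rate = significant_rate_bps(256, 32, 1e-6)
print(f"{rate / 1e6:.2f} Mb/s")
```

With these values the threshold comes out at about 10.7 Mb/s, in line with the figure of roughly 10 Mb/s quoted in the text.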
Conclusion
There is an immediate demand for high capacity switches and routers that can provide both work-conserving properties and also rate and delay guarantees. Presented in this paper was a novel crossbar arbitration algorithm, LOOFA, which is work-conserving for all traffic patterns and switch sizes for a speedup of only 2. Also presented were a number of schemes that can be used in combination with LOOFA to provide a wide range of delay and fairness properties. For a rate-controlled input, and using best-first LOOFA-OCF with a speedup of 2, bounds were derived for the arbitration and output queuing delays. We also show that these delay bounds are achievable without the need to emulate an output buffered switch. To the best of our knowledge, LOOFA is the only arbitration algorithm that is easy to implement, and provides both work-conservation and delay guarantees in a crossbar for a speedup of only 2.
APPENDIX: Proof for the work-conserving property
The following terminology is used in the claims and proofs:
$c_{ij}$ = A cell going from input $i$ to output $j$.
$S$ = Speedup of the switch fabric.
$X_j(t)$ = Occupancy of output $j$ at time $t$.
$IT(c_{ij}(t))$ = The input thread: for a cell $c_{ij}$ going from input $i$ to output $j$, it is the set of cells $c_{ij'}$ (where $1 \le j' \le N$) that have occupancy $X_{j'}(t) \le X_j(t)$ at $t$, except itself.
... and/or gone to its output. Unlike the best-first version, it is not guaranteed that if a cell did not leave its input thread, a cell from its output thread would have left. However, for work conservation, it suffices that Lemma 1 is true, which happens to be the case for both the greedy and the best-first versions of LOOFA.
Corollary 1: Consider a switch running the lowest occupancy first algorithm with a speedup of S. Consider a tagged cell ...
The occupancy of the output j, $X_j$, is decremented by 1 (simulating a cell departure to its outgoing link) only if there was a cell in the output FIFO at the beginning of the cell slot T ($T_b$), or if there was at least one cell at any input of the switch destined for this output at $T_b$. Thus, in the presence of such an updating procedure for $X_j$, the instant $X_j$ for any output j of the switch becomes less than zero, the switch will cease to be work conserving. In the following theorem, we will prove a stronger invariant: for each and every cell (say, $c_{ij}$) in a switch operating under LOOFA with a speedup of 2, during its tenure at the input of the switch, $X_j(t_e) \ge |IT(c_{ij}(t_e))|$.
Therefore, the occupancy of all outputs in a switch operating under LOOFA and a speedup of 2 is always non-negative.
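The invariant can be made concrete with a small checker. The sketch below takes the input thread of a cell at input i destined for output j to be the set of other cells at input i whose outputs have occupancy no greater than that of output j, and tests whether the occupancy of output j is at least the size of that set. The function names and the `voq[i][j]` cell-count representation are assumptions made for this example.

```python
def input_thread_size(voq_row, occupancy, j):
    """Size of the input thread of one tagged cell queued for output j.

    voq_row[k]: number of cells queued at this input for output k
                (the tagged cell itself is counted in voq_row[j]).
    occupancy[k]: occupancy of output k.
    """
    if voq_row[j] < 1:
        raise ValueError("no cell queued for output j at this input")
    total = sum(count for k, count in enumerate(voq_row)
                if occupancy[k] <= occupancy[j])
    return total - 1  # exclude the tagged cell itself


def invariant_holds(voq, occupancy):
    """Check that occupancy[j] >= input thread size for every queued cell."""
    for row in voq:
        for j, count in enumerate(row):
            if count > 0 and occupancy[j] < input_thread_size(row, occupancy, j):
                return False
    return True
```

For instance, an input holding two cells for output 0 and one for output 1, with occupancies 1, 3, and 0 for outputs 0, 1, and 2, satisfies the check; lowering the occupancy of output 0 to zero violates it, because the second cell for output 0 is still in the tagged cell's input thread.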
Theorem 1: Consider an N × N switch operating under LOOFA with a speedup of S. Suppose that the switch has been operating from cell slot 1, having been empty before that time. For each cell $c_{ij}$ in the switch, as long as the cell remains at the input, $X_j(t_e) \ge |IT(c_{ij}(t_e))|$ holds, so long as $S \ge 2$ holds, for $T_{arr,e} \le t_e \le T_{xfer,e}$, where $T_{arr,e}$ is the end of the cell slot $T_{arr}$ (the cell $c_{ij}$ arrives at $T_{arr,b}$), and $T_{xfer,e}$ is the end of the cell slot $T_{xfer}$ (the cell $c_{ij}$ gets transferred to its output during $T_{xfer}$).
Proof: New arrivals for the t-th cell slot take place at $t_b$. Given that there can at most be one new arrival into an input ...
We will now show that there is no violation at $t_e$, i.e., $\forall c_{ij} \in C(t_e),\ X_j(t_e) \ge |IT(c_{ij}(t_e))|$ holds.
Case (1):
As per assumption (A1), $\forall c_{ij} \in C_1(t_e),\ X_j((t-1)_e) \ge |IT(c_{ij}((t-1)_e))|$ ...
$\forall c_{ij} \in C'(t_e),\ \forall j \in J'(t_e),$ ...
Case (3):
As per assumption (A1), $\forall c_{ij} \in C_3(t_e),\ X_j((t-1)_e) \ge |IT(c_{ij}((t-1)_e))|$. Let ...
$N_j(t) \le (L_j t + N) - (L_j t - N) \le 2N$.
Corollary 2: $\forall j, t$, $X_j(t) \le 2N$ and $\forall c_{ij} \in C(t), t$, $|OT(c_{ij}(t))| \le 2N$ hold true in an N × N switch running the best-first LOOFA-OCF algorithm with WF2Q based rate controlled inputs, as long as $S \ge 2$.
Proof: Since all cells scheduled for an output reside either at the input side of the switch or in the output FIFO, it follows from Lemma 3. It also follows that the delay any cell encounters at the output, $D_o$, is bounded by 2N.
Corollary 3: $\forall c_{ij} \in C(t), t$, $|IT(c_{ij}(t))| \le 2N$ holds true in an N × N switch running the best-first LOOFA-OCF algorithm with WF2Q based rate controlled inputs, as long as $S \ge 2$. It also follows that $\forall i, t$, $N_i(t) \le 2N + 1$ holds.
Proof: Follows from Corollary 2 and Theorem 1.
... $N_i(t) + |OT(c_{ij}(t))|$ decreases at least by 1. $N_i(t)$ can increase by 1 during each cell slot due to a new cell arrival. Thus, under a speedup of S, the sum $N_i(t) + |OT(c_{ij}(t))|$ decreases at least by $S - 1$.
Thus at the end of $4N/(S-1)$ cell slots, if the tagged cell is still at the input, there will be no cells competing with it; in other words, there will be no cells in its output thread nor in its input (apart from itself). Thus, in the next cell slot, only the new cell that comes into its input competes with it, and hence the tagged cell will go to its output in one of the S phases.
