Abstract: There is considerable interest in the provision of guaranteed-rate services for IP and ATM networks. Simultaneously, bandwidth demands make input-buffered architectures attractive and, in some cases, necessary. In this paper, we consider the problem of how to support guaranteed-rate services in a single-stage, input-buffered switch suitable for a LAN switch, an ATM switch or an IP router. Such a switch must be feasible at high transmission speeds, offering both guaranteed-rate performance for CBR channels (e.g., for real-time connections) and best-effort services for traditional data traffic. We consider a switch scheduling mechanism that employs idling hierarchical round-robin (HRR) scheduling and fabric arbitration at the connection level for guaranteed-rate service using the Slepian-Duguid algorithm. The switch uses cell-level arbitration for best-effort service. This overall switch scheduling mechanism is a variation of DEC's AN2 design [2].
I. Introduction
There is a strong desire to support a guaranteed-rate service for both IP and ATM networks, in particular for real-time traffic. For the Internet, the IETF is pursuing a guaranteed-rate service [23] based on generalized processor sharing [21]. For ATM networks, the ATM Forum has proposed a CBR service described in [4].
In this paper, we consider the provision of a guaranteed-rate service on a network switch or router. In particular, we consider the support of this service over a switching fabric that employs input-buffering. Whereas many researchers have studied provisioning of guaranteed-rate services over output-buffered switches, there has been little work reported for input-buffered switches. Interest is growing in input-buffering: today's memory bandwidths cannot keep up with the demand for faster line rates and greater switching capacity. And the problem is getting worse: memory access times are barely improving, whereas the demand for switching capacity continues to grow exponentially.
As a result, many of the fastest commercial [2], [3] and research [19], [22] switches and routers today are based on an input-buffered crossbar switch. (The main alternative architectures to single-stage input-queued switches are multi-stage switches, e.g., [7].) Each of these systems internally uses a small fixed-size packet, similar or equal in length to a 53-byte ATM cell. Each system contains line cards that accept variable-length packets from the outside world. When all of the cells have been switched, they are reassembled into variable-length packets before being sent on their way. Because of their widespread use, we focus our attention on switches that use fixed-size packets. For obvious reasons, we refer to them as "cells" but make no assumptions about their fixed length.

G. Kesidis is supported by the NSERC of Canada.
In this paper, we address the problem of scheduling bandwidth for input-buffered switches. We assume that the network provides (perhaps renegotiable) CBR virtual channels for real-time services. We focus on how an input-buffered switch can provide a guaranteed-rate service. In addition, the switch must support a "best-effort" service for traditional data traffic. Other services can be supported over these two "basic" services.
A 2 × 2, single-stage, input-buffered switch is illustrated in Figure 1. By "N × N" we mean that the switch has N input ports and N output ports. The switch operates on a "cell-time" clock where one cell-time is the common transmission time of a cell on the links connected to the switch, e.g., at 155 Mbps, one cell-time is approximately 2.8 µs. At most one cell arrives at each input port every cell-time and, similarly, at most one cell departs from each output port every cell-time.

In general for single-stage switches, there is an input-side switch fabric between the input ports and the single queueing stage, and there is an output-side switch fabric between the queueing stage and the output ports. The queueing stage is simply a bank of logically separate queues. The queues are distributed among "blocks" of memory where each memory block has a separate input/output bus and, therefore, can operate independently from the other blocks of memory. We also assume each memory block has a single address decoder allowing only one read or write operation at a time. Each logical queue is served in a first-in-first-out (FIFO) fashion. Both the number of queues and the number of blocks of memory can be different from N. Each memory block of a single-stage, input-buffered switch has a single associated input port. So, each memory block will experience at most one cell write operation per cell-time. In the scope of this paper, the input-side fabric merely determines where in buffer memory each arriving cell is written and is not illustrated in Figure 1. The output-side fabric has an associated "arbiter". At each cell-time, the arbiter decides which cells (at most N in total) from the memory blocks traverse the fabric and are transmitted onto the output links.
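As a quick check of the cell-time figure quoted above, the following sketch computes the transmission time of one cell. It assumes a 53-byte ATM cell; the two link rates (155.52 Mbps for the nominal OC-3 line rate and 149.76 Mbps for the OC-3c payload rate) are assumptions of this illustration and are not specified in the text, which only says "155 Mbps".

```python
# Cell-time = cell length in bits / link rate in bits per second.
CELL_BYTES = 53  # assumption: ATM-sized cells


def cell_time_us(link_rate_bps: float, cell_bytes: int = CELL_BYTES) -> float:
    """Transmission time of one cell, in microseconds."""
    return cell_bytes * 8 / link_rate_bps * 1e6


# Assumed rates: nominal OC-3 line rate and OC-3c payload rate.
at_line_rate = cell_time_us(155.52e6)
at_payload_rate = cell_time_us(149.76e6)
print(f"{at_line_rate:.2f} us at line rate, {at_payload_rate:.2f} us at payload rate")
```

Under these assumptions, the quoted value of roughly 2.8 µs is closest to the payload-rate figure.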
Every cell-time, the input-side switch fabric simultaneously removes at most N cells from the input ports and places them in the queueing stage. Similarly, every cell-time the output-side switch fabric simultaneously removes at most N cells from the queueing stage and places them in output ports for transmission onto the output links. We assume that the switch fabrics are nonblocking, i.e., cells are never dropped while "passing through" a fabric. On the other hand, a cell may be dropped by the queueing stage if, for example, it arrives to a full queue. The particular queue visited by a cell is determined by its input port and the address field in its header. For example, an ATM switch has a look-up table mapping (input port, VPI/VCI) to the appropriate output port for each cell. This look-up table is modified at call set-up and termination. An IP router supporting RSVP [24] has a routing table mapping destination IP address to the appropriate output port, and a table mapping an IP flow to the appropriate queue. A 2 × 2 input-buffered switch that uses a separate queue for all traffic with the same (input link, output link) combination is depicted in Figure 2.
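The queue selection described above can be sketched as follows for a single input port of an ATM-style switch. This is a minimal illustration, not the design of any particular switch: the class name `InputPort`, its methods, and the per-queue `capacity` limit are hypothetical, and the look-up key is simply the (VPI, VCI) pair because the input port is implied by the object.

```python
from collections import deque


class InputPort:
    """One input port with a separate FIFO queue per output port (hypothetical sketch)."""

    def __init__(self, num_outputs, capacity):
        self.queues = [deque() for _ in range(num_outputs)]
        self.capacity = capacity  # assumed per-queue limit, in cells
        self.lookup = {}          # (vpi, vci) -> output port

    def add_connection(self, vpi, vci, output_port):
        """Called at call set-up."""
        self.lookup[(vpi, vci)] = output_port

    def remove_connection(self, vpi, vci):
        """Called at call termination."""
        self.lookup.pop((vpi, vci), None)

    def enqueue(self, vpi, vci, cell):
        """Queue an arriving cell; return False if it is dropped."""
        out = self.lookup.get((vpi, vci))
        if out is None or len(self.queues[out]) >= self.capacity:
            return False  # unknown connection or full queue
        self.queues[out].append(cell)
        return True


port = InputPort(num_outputs=2, capacity=4)
port.add_connection(vpi=0, vci=32, output_port=1)
port.enqueue(0, 32, "cell-A")
```

The fabric itself never drops a cell in this sketch; loss happens only at the queueing stage, as in the text.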
In this paper, we will observe the following general design goals: O(N) cell memory blocks, i.e., scalability in the number of memory blocks. Practical prioritized input-buffered switches are considered in, e.g., [16], [18], [15]. DEC's AN2 switch [2] goes further and employs "connection-level arbitration" for CBR flows. The connection-level arbitration uses an idling weighted round-robin (WRR) scheduling mechanism with a common frame size. The precomputed schedule is based on the Slepian-Duguid algorithm, see Chapter 3 of [10]. In DEC's AN2 switch, best-effort traffic is supported by a separate "cell-level arbiter".
In this paper, we revisit DEC's switch design. We describe a per-connection version of this switch that uses idling hierarchical round-robin (HRR) schedulers [13], [12]. We believe that idling round-robin schedulers are an efficient way to support guaranteed-rate service with minimal per-cell scheduling computation. Moreover, they allow for a controllable distribution of excess (unused or unreserved) capacity to achieve given "fairness" criteria.
In Section 2, the traffic and queueing variables are defined for the input-buffered switch. In Section 3, connection-level arbitration for CBR flows is described. The handling of best-effort traffic is described in Section 4. A summary is given in Section 5.
Memory operations may limit the speed of operation of a switch. So, in input-buffered switches, all memory blocks are restricted to one cell read operation per cell-time. Consistent with the switch design goals stated above, we assume that there is just one memory block per input port processor (IPP). In this case, we may have contention at each input port among flows that wish to connect to different output ports. This contention is resolved by the bandwidth schedulers situated at the IPPs, as we will see below.
Consider an N × N, single-stage, input-buffered switch handling traffic that is classified into several priorities. In the first priority (priority-1) are connections that require bandwidth guarantees from the switch. Connections that have best-effort varieties of service belong to subsequent priorities. In the following, we will focus on the handling of priority-1 connections. Best-effort flows are considered in Section 4. In an input-buffered switch, aggregating the flows in this manner is called "virtual output queueing" (VOQ) or "per-output-port queueing" [1]. Under VOQ, the $S_i$'s (the bandwidth schedulers at the IPPs) are idling weighted round-robin (WRR) schedulers, i.e., single-frame HRR schedulers, with a common frame size. The slot assignments (output-port indexes) of $S_i$ are based on the aggregate bandwidth allotments $\{\hat{\rho}_{i,j} : 1 \le j \le N\}$, cf. Section 3. VOQ is the "CBR" scheme used in DEC's AN2 switch [2].
With VOQ, cell "head-of-line blocking" can be eliminated entirely [17]. We will see that under the proposed architecture, the term "VOQ" is especially appropriate because the switch behaves like an output-buffered switch.
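In per-connection (a.k.a. per-virtual-channel or "per-VC") memory management, there is a separate FIFO queue for each priority-1 connection; the connections from input port $i$ to output port $j$ are indexed by $k$, $1 \le k \le K_{i,j}$. An example frame structure is given in Figure 3 below.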
Best-effort flows can be separated into FIFO queues according to cell (input port, output port) pair under per-VC Queueing or VOQ. So, under VOQ, each input port has 2N associated FIFO queues: N for priority-1 flows and N for best-effort flows. Under per-VC Queueing, each input port $i$ has N associated FIFO queues for best-effort flows (cf. Section 4) and a potentially large number, $\sum_{j=1}^{N} K_{i,j}$, of logical FIFO queues for priority-1 flows.
III. Fabric Arbitration for Priority-1 Service
Recall that the idling WRR schedulers of VOQ and the level-one frames of the idling HRR schedulers of per-VC Queueing partition bandwidth according to output-port indexes. The slot assignments of the $S_i$'s must be coordinated so that no two of them choose the same output port in any given cell-time. This coordination is called "contention resolution" or "fabric arbitration".
For simplicity, consider VOQ. At any given time, all the $S_i$ have a common frame size of $f$ slots (cells).
So, there will be $r_{i,j} := \lceil \hat{\rho}_{i,j} f \rceil$ slots reserved for the priority-1 flows to the $j$th output port in each frame of $S_i$.
Thus, $S_i$ has $f - \sum_{j=1}^{N} r_{i,j}$ slots that are unreserved and intended for best-effort flows, cf. Section 4. So, the stronger "no overbooking" conditions are

$$\sum_{j=1}^{N} r_{i,j} \le f \ \text{ for all } i \qquad \text{and} \qquad \sum_{i=1}^{N} r_{i,j} \le f \ \text{ for all } j. \qquad (3)$$

An N × f "slot assignment matrix" for the level-one frames of all of the $S_i$ schedulers can now be defined. No column of this matrix contains the same numeral more than once; as these numerals correspond to output ports, cell "collisions" at the output ports will consequently not occur. Also, the number of slots assigned to output port $j$ in row $i$ (i.e., in the level-one frame of $S_i$) is $r_{i,j}$. Let $R$ be the N × N matrix whose $(i,j)$th entry is $r_{i,j}$. Under the "no overbooking" conditions (3), determining such an N × f slot assignment matrix given $R$ and $f$ is the priority-1 fabric arbitration problem.
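The reservation step can be sketched in a few lines. The code below computes the reserved-slot matrix from a matrix of aggregate bandwidth allotments (each entry rounded up after multiplying by the frame size) and checks that neither any input row nor any output column reserves more than one frame's worth of slots. The function names and the example allotments are hypothetical and are not taken from the paper; exact fractions are used so that rounding up is not disturbed by floating-point error.

```python
import math
from fractions import Fraction


def reserved_slots(allotments, f):
    """Round each allotment (a fraction of link rate) up to whole slots per frame."""
    return [[math.ceil(rho * f) for rho in row] for row in allotments]


def no_overbooking(R, f):
    """True if every row sum and every column sum of R is at most f."""
    rows_ok = all(sum(row) <= f for row in R)
    cols_ok = all(sum(col) <= f for col in zip(*R))
    return rows_ok and cols_ok


def unreserved_slots(R, f):
    """Slots per frame left for best-effort traffic at each input port."""
    return [f - sum(row) for row in R]


# Hypothetical 2 x 2 example with frame size f = 5.
allot = [[Fraction(1, 5), Fraction(2, 5)],
         [Fraction(1, 2), Fraction(1, 5)]]
R = reserved_slots(allot, 5)
print(R, no_overbooking(R, 5), unreserved_slots(R, 5))
```

Because of the rounding, the slot-level check is stricter than a check on the allotments themselves: an allotment of 1/2 of a 5-slot frame reserves 3 slots, not 2.5.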
For example, consider the case of a 3 × 3 switch which, at some given time, has $f = 6$ and a given matrix $R$. Note that the "blanks" in the slot assignment matrix represent unreserved slots that may potentially be used for best-effort cells, cf. Section 4.
The priority-1 fabric arbitration problem can generally be solved by applying the Slepian-Duguid approach for a circuit-switched Clos network, see Chapter 3 of [10]. Using this approach, the entire slot assignment matrix takes $O(N^2 f)$ time to calculate, which can be a significant computational expense at high ATM speeds. Thus, this computation cannot occur at the cell level. Modifications of bandwidth allotments, priority-1 slot assignments or frame structures would occur at the connection level. In response to changing priority-1 traffic demands, the slot assignment matrix would be only periodically modified, where the period between modifications clearly depends on the speed of the implemented Slepian-Duguid algorithm.
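One way to see how such a slot assignment can be built is to treat each reserved slot as an edge between an input port and an output port, and each of the f slot positions as a colour: a valid assignment is then an edge colouring in which no port sees the same colour twice. The sketch below adds reservations one at a time and, when no slot is free at both ports, swaps two slots along an alternating path, which is the rearrangement idea behind the Slepian-Duguid approach. It is an illustrative formulation under that assumption, not the implementation referred to in the text; the function names are hypothetical and no claim is made about matching the running time quoted above.

```python
def _flip_path(in_row, out_col, j, a, b):
    """Swap slots a and b along the alternating path that starts at output j."""
    path, node, at_output, slot = [], j, True, a
    while True:
        nxt = out_col[node][slot] if at_output else in_row[node][slot]
        if nxt is None:
            break
        path.append((nxt, node, slot) if at_output else (node, nxt, slot))
        node, at_output = nxt, not at_output
        slot = b if slot == a else a
    for i, jj, s in path:   # clear the old slots first ...
        in_row[i][s] = out_col[jj][s] = None
    for i, jj, s in path:   # ... then write the swapped ones
        t = b if s == a else a
        in_row[i][t], out_col[jj][t] = jj, i


def slot_assignment(R, f):
    """Return an N x f matrix; entry [i][t] is the output port served by
    input i in slot t, or None for an unreserved slot."""
    n = len(R)
    if any(sum(row) > f for row in R) or any(sum(col) > f for col in zip(*R)):
        raise ValueError("R is overbooked for frame size f")
    in_row = [[None] * f for _ in range(n)]   # per input: slot -> output
    out_col = [[None] * f for _ in range(n)]  # per output: slot -> input
    for i in range(n):
        for j in range(n):
            for _ in range(R[i][j]):
                a = in_row[i].index(None)          # slot free at input i
                if out_col[j][a] is not None:      # ... but busy at output j
                    b = out_col[j].index(None)     # slot free at output j
                    if in_row[i][b] is None:
                        a = b                      # free at both ports
                    else:
                        _flip_path(in_row, out_col, j, a, b)
                in_row[i][a], out_col[j][a] = j, i
    return in_row


# Hypothetical 3 x 3 reservation matrix with f = 6 (not the paper's example).
for row in slot_assignment([[3, 1, 2], [2, 3, 1], [1, 2, 3]], 6):
    print(row)
```

The `None` entries in the result correspond to the unreserved slots that remain available for best-effort cells.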
A. An Example Per-VC Frame Structure

Consider a 2 × 2 switch handling two priority-1 connections for each (input port, output port) pair. The level-one frame size is $f = 5$ and the bandwidth allotments in this example are:

The guaranteed-rate performance of multilevel-assignment HRR is given in [12], [11], [14]. Buffer sizing rules are given in Section 4.2 of [14] (see also [9]). Results for queues in either idling or nonidling mode are available. For a VOQ switch, end-to-end buffer sizing can be obtained from the results in Section 4.5 of [14].
and output ports for best-effort cells. In [2], a randomized parallel iterative match (PIM) is suggested; however, iSLIP [16] may also be used and has certain performance advantages. See Section 3.4 of [5] or [18] for discussions of cell-level arbiters for input-buffered switches. The best-effort cell-level arbitration may be governed by a "flow control" entity and related "fairness" considerations, as mandated in Section 5.2 of [4]. A concern of cell-level arbitration is that the required signaling each cell-time among OPPs and IPPs is costly. An alternative would be to divide the unreserved slots among best-effort flows just prior to the connection-level priority-1 arbitration process without violating the "no overbooking" conditions. Clearly, this would result in smaller aggregate throughput of best-effort traffic compared to the "fully-shared" approach based on cell-level arbitration.
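To make the cell-level arbitration concrete, the following sketch runs the request, grant and accept phases of a randomized parallel iterative match for one cell-time. It is a simplified illustration rather than the arbiter of [2]: the function name `pim_match` and its parameters are hypothetical, and the ports already claimed by priority-1 reservations in the current slot are passed in as `busy_inputs` and `busy_outputs` so that only the remaining ports are matched for best-effort cells.

```python
import random


def pim_match(requests, busy_inputs=frozenset(), busy_outputs=frozenset(),
              iterations=4, rng=None):
    """requests[i] is the set of outputs for which input i holds a best-effort
    cell. Returns a dict mapping matched inputs to distinct outputs."""
    rng = rng or random.Random()
    match = {}
    taken = set(busy_outputs)
    for _ in range(iterations):
        # Request: every unmatched input asks all still-free outputs it needs.
        asking = {}
        for i, outs in enumerate(requests):
            if i in busy_inputs or i in match:
                continue
            for j in outs:
                if j not in taken:
                    asking.setdefault(j, []).append(i)
        if not asking:
            break  # nothing left to match
        # Grant: each requested output picks one requesting input at random.
        grants = {}
        for j, inputs in asking.items():
            grants.setdefault(rng.choice(inputs), []).append(j)
        # Accept: each input that received grants picks one at random.
        for i, outs in grants.items():
            j = rng.choice(outs)
            match[i] = j
            taken.add(j)
    return match


# Input 0 and output 1 are assumed to be used by a priority-1 cell in this slot.
print(pim_match([{0, 1}, {0}, {1, 2}], busy_inputs={0}, busy_outputs={1}))
```

Each iteration that has any request matches at least one new input, so a number of iterations equal to the number of ports is enough to reach a match that cannot be extended; fewer iterations trade match size for less signaling per cell-time.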
In general, best-effort service can be accommodated by adding a priority indication to the queues involved, see Section 2.2.2 of [6]. That is, best-effort FIFO queues would be assigned priorities from the set {2, 3, 4, ...}, with priority-1 indicating queues with bandwidth guarantees.
For example, we can consider a two-priority switch [8] handling (i) connections requiring bandwidth guarantees, with priority 1, and (ii) IP data traffic, with priority 2. For both VOQ and per-VC Queueing, we can arrange each IPP to have N priority-2 (best-effort) FIFO queues: one for each switch output port. Note how the use of idling round-robin scheduling allows the switch to control how excess capacity (unused or unreserved slots) is distributed among the queues of an IPP to achieve given fairness criteria.
V. Summary
We have presented a method for supporting guaranteed-rate service in a single-stage, input-buffered switch. Idling multiple-branch HRR schedulers were employed for per-connection, guaranteed-rate management, or idling WRR schedulers were employed for per-output-port queueing (VOQ) guaranteed-rate management. The problem of connection-level guaranteed-rate fabric arbitration can be readily solved using the Slepian-Duguid method. Best-effort traffic is "squeezed into" unreserved or unused time slots, under the control of a centralized cell-arbiter. VOQ is basically DEC's AN2 switch design [2]. The guaranteed-rate performance and buffer sizing rules are available in [12], [11], [14]. The described switch is the only input-buffered, single-stage switch with a guaranteed-rate property. The guaranteed-rate property enables CBR service for real-time connections in particular.
