ABSTRACT - This paper presents a VLSI implementation of a resource allocation scheme based on the concept of weighted fair queueing. The design can be used in Asynchronous Transfer Mode (ATM) networks to ensure fairness and robustness. Weighted fair queueing is a scheduling and buffer management scheme that provides both a resource allocation policy and enforcement of that policy. It can be used in networks to provide defined allocation policies (fairness) and to improve network robustness. The presented design illustrates how the theoretical weighted fair queueing model can be approximated with a model feasible for practical implementation. This approximated model has been implemented as a VLSI component.
I. INTRODUCTION
One of the major advantages of B-ISDN/ATM is the integration of different services. Multiple traffic management policies can be integrated and high network utilization can be achieved. The network can use sophisticated cell scheduling and queue management methods in order to support complex traffic management policies.
The ITU-T [1] and the ATM Forum [2] are currently working on a new service class. The standardization term is Available Bit Rate (ABR). The ABR service class is intended for applications with elastic bandwidth requirements. Resources are allocated according to defined fairness policies. In order to support the ABR service class, the network can implement sophisticated cell scheduling and buffer management methods in the network switch elements. A network switch element can employ cell queueing in order to prevent cell loss when congestion occurs in the switch element. Congestion takes place when the traffic load on an output port exceeds the output port's link rate. Fig. 1 illustrates a switch element with queueing units. The queueing units can be located at the input, the output or internal to the switch unit ports.
The service classes that can be supported in a network will depend on the architecture of the queueing units in the switch elements. The architecture of the queueing units can be divided into three categories, which are illustrated in Fig. 2. The cell scheduling and queue management method most often found in ATM networks of today is first-in first-out (FIFO) scheduling with a single queue shared by all connections (see Fig. 2a). In order to support integration of traffic with different service requirements, the shared queue can be segmented into multiple shared queues that are scheduled according to defined priorities (see Fig. 2b). A cell scheduling and buffer management method that offers several advantages compared to shared queue methods is weighted fair queueing [5] [6] [7] (see Fig. 2c). The queueing is per-connection with FIFO queues, and scheduling is performed according to a weighted allocation policy. The weighted fair queueing method can provide:
- Weighted bandwidth allocation policies with enforcement of that policy. Bandwidth is allocated on a per-connection basis according to relative weights (fairness).
- Isolation of users. Well-behaving users can be protected from misbehaving users exceeding their allowed transmission rates, thereby improving network robustness.
- Queue length can be allocated along a network connection path on a per-connection basis.
- Scalability. Queue capacity can be scaled to match a connection's distance, speed, and number of traversed nodes.
- Smooth flow of cells.
From the features listed above, it can be understood that weighted fair queueing has strong theoretical support. Fairness and robustness are some of the major advantages that weighted fair queueing can provide. These advantages are important criteria for networks supporting the ABR service class. This paper will illustrate that weighted fair queueing not only has strong theoretical support, but is also well suited for practical realization at relatively low implementation cost. The theoretical model has been approximated in order to make practical realization feasible. This approximated weighted fair queueing model has been implemented as a VLSI design.
II. WEIGHTED FAIR QUEUEING

A. Theory
Fair queueing originated as a congestion control device preventing misbehaving users from affecting the service offered to others p].
-Transmit the cell with the lowest timestamp value.
-Set Last-Timestamp equal to the timestamp of the transmitted cell.
The algorithm can be explained in a few words: When cells arrive at the switch element they are queued in per-connection FIFO queues. A cell will reach the front of its queue when the preceding cell in the queue is transmitted, or at the instant of arrival if the queue is empty. When a cell reaches the front of its queue, the cell receives a timestamp. The timestamp is the timestamp of the last transmitted cell plus a per-connection spacing constant. All cells that have received a timestamp are sorted in increasing timestamp order. Asynchronously with the queueing of cells, the cells are transmitted in increasing timestamp order. The algorithm is illustrated in Fig. 3. Three connections share the bandwidth on a link. The cells from a connection are spaced according to the connections' relative weights. A connection $i$ with relative weight $w_i$ receives a bandwidth share $BW_i$ of the available bandwidth $BW_{link}$ on the outgoing link given by
$$BW_i = \frac{w_i}{\sum_{j \in A} w_j} \, BW_{link}$$
where $A$ is the set of connections with non-empty queues. Note that the timestamp domain is not real time, but only a relative time domain, which we define as the virtual time domain. The virtual time spacing interval is proportional to the inverse of the connection's relative weight.
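The timestamp rule described above can be tried out in a few lines. The following is a minimal sketch, not the implemented design: it assumes that all connections are permanently backlogged, that the spacing constant is k divided by the weight, and that ties between equal timestamps are broken in insertion order. The names `wfq_order`, `k` and `seq` are hypothetical.

```python
import heapq

def wfq_order(weights, k, slots):
    """Return the transmission order for always-backlogged connections."""
    # Assumption: spacing constant is inversely proportional to the weight.
    spacing = {conn: k // w for conn, w in weights.items()}
    last_timestamp = 0          # timestamp of the last transmitted cell
    seq = 0                     # insertion counter, used only to break ties
    heap = []                   # head-of-line cells as (timestamp, seq, conn)
    for conn in sorted(weights):
        heapq.heappush(heap, (last_timestamp + spacing[conn], seq, conn))
        seq += 1
    order = []
    for _ in range(slots):
        # Transmit the cell with the lowest timestamp value.
        timestamp, _seq, conn = heapq.heappop(heap)
        order.append(conn)
        last_timestamp = timestamp
        # The next cell of this connection reaches the front of its queue
        # and is stamped relative to the last transmitted cell.
        heapq.heappush(heap, (last_timestamp + spacing[conn], seq, conn))
        seq += 1
    return order

order = wfq_order({"a": 1, "b": 2, "c": 4}, k=4, slots=14)
counts = {conn: order.count(conn) for conn in "abc"}
```

With weights 1, 2 and 4, the 14 slots are divided 2, 4 and 8, that is, in proportion to the relative weights of the backlogged connections.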
B. Realization
The weighted fair queueing algorithm transmits cells in increasing timestamp order, and the timestamps must therefore be sorted in increasing order. A very important implementation issue is the realization of the sorting mechanism. The sorting mechanism must be capable of sorting N timestamps per cell slot period, where N is the maximum number of connections sharing the outgoing link. The implementation cost can be high if the sorting function is to be capable of sorting the timestamps correctly (see e.g. IS]). The sorting function can be approximated with a relatively simple bucket sort. The concept of the bucket sort mechanism is illustrated in Fig. 4.
The virtual time scale is divided into intervals. Each interval is represented by a 'bucket'. A timestamp is inserted in the bucket representing the virtual time interval that contains the timestamp. Together with the timestamp, a connection identifier is inserted in the bucket. When a cell slot is available for transmission, a timestamp and connection identifier pair is removed from the current bucket, or from the first succeeding non-empty bucket if the current bucket is empty. The first cell from the identified connection's queue is then transmitted. A variable active-bucket maintains the current bucket number. Since timestamp and connection identifier pairs are removed from a bucket in the same order as they were inserted, cells with timestamps included in the same virtual time bucket interval are not guaranteed to be transmitted in increasing timestamp order. If the bucket time interval is equal to the resolution of the time scale, the bucket sort will sort correctly. If the bucket time interval is greater than the resolution of the time scale, the bucket sort will approximate a correct sorting function.
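The behaviour of the bucket sort approximation can be sketched as follows. This is an illustrative model only, assuming a cyclic array of FIFO buckets; the class name `BucketSorter` and its parameters are hypothetical, and the linear scan for a non-empty bucket is the simplest possible choice.

```python
from collections import deque

class BucketSorter:
    """Approximate timestamp sorting with a cyclic array of FIFO buckets."""

    def __init__(self, num_buckets, bucket_interval):
        self.num_buckets = num_buckets
        self.bucket_interval = bucket_interval   # virtual time per bucket
        self.buckets = [deque() for _ in range(num_buckets)]
        self.active_bucket = 0                   # current bucket number

    def insert(self, timestamp, conn_id):
        # The bucket is the virtual time interval that contains the timestamp.
        index = (timestamp // self.bucket_interval) % self.num_buckets
        self.buckets[index].append((timestamp, conn_id))

    def remove(self):
        # Take from the current bucket, or the first succeeding non-empty one.
        for step in range(self.num_buckets):
            index = (self.active_bucket + step) % self.num_buckets
            if self.buckets[index]:
                self.active_bucket = index
                return self.buckets[index].popleft()
        return None

def drain(sorter, stamps):
    for conn_id, timestamp in enumerate(stamps):
        sorter.insert(timestamp, conn_id)
    return [sorter.remove()[0] for _ in stamps]

exact = drain(BucketSorter(num_buckets=8, bucket_interval=1), [7, 5, 1, 2])
coarse = drain(BucketSorter(num_buckets=4, bucket_interval=4), [7, 5, 1, 2])
```

With a bucket interval of 1 the timestamps leave in sorted order (1, 2, 5, 7). With an interval of 4, timestamps 7 and 5 share a bucket and leave in insertion order (1, 2, 7, 5), which is the approximation error discussed above.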
The infinite virtual time scale can be implemented with a virtual timewindow of finite length that is cyclically traversed. The length of the timewindow defines the number and range of per-connection weights that can be supported. A spacing constant $\tau_{w_i}$ is defined for every per-connection weight $w_i$,
$$\tau_{w_i} = \frac{k}{w_i}$$
where $k$ is an implementation dependent constant. The length of the timewindow $T_{window}$ must be large enough to support the largest spacing constant, which is defined for the lowest per-connection weight $w_{min}$.
where $B$ is the number of buckets, and $\tau_{w_{max}}$ is the spacing constant defined for the largest relative weight $w_{max}$. To simplify the calculation of the appropriate bucket for a given timestamp, the length of the timewindow and the number of buckets should follow the criteria below, so that divisions can be performed with shift right bit operations.
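As an illustration of how a bucket number can be obtained with a shift right operation, the sketch below assumes that both the timewindow length and the number of buckets are powers of two. This power-of-two choice, and the names `bucket_index`, `window_bits` and `bucket_bits`, are assumptions made for the example.

```python
def bucket_index(timestamp, window_bits, bucket_bits):
    """Map a timestamp to a bucket number using only mask and shift.

    Assumes a timewindow of 2**window_bits virtual time units divided
    into 2**bucket_bits buckets (both hypothetical parameters).
    """
    window_mask = (1 << window_bits) - 1     # wrap into the cyclic timewindow
    shift = window_bits - bucket_bits        # log2 of the bucket interval
    return (timestamp & window_mask) >> shift

# Cross-check against the equivalent modulo and division.
WINDOW, BUCKETS = 1 << 10, 1 << 7
agrees = all(
    bucket_index(t, 10, 7) == (t % WINDOW) // (WINDOW // BUCKETS)
    for t in range(5000)
)
```

The mask replaces the modulo over the timewindow and the shift replaces the division by the bucket interval, so no divider is needed.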
The spacing constants and weights are implemented as integers, which limits the number of weights that can be defined. If the lowest relative weight level is 1, and the highest relative weight is $w_{max}$, a spacing constant can only be defined for those relative weights $w$ for which the resulting spacing constant is an integer. If an error margin is allowed for the definition of the relative weights, the number of weights that can be defined in the range of 1 to $w_{max}$ can be increased. Table I illustrates the number of weights that can be defined for a given relative error margin for different weight level ranges and $\tau_{w_{max}}$ values.
An outgoing link is shared by N connections with sources transmitting at maximum link rate, and non-empty per-connection queues. The variance of the output process depends on the accuracy of the sorting mechanism. Minimum variance is achieved with a correct sorting mechanism.
The first simulation was made with a relatively low number of connections. A total of 50 connections are sharing the same link, and their relative weights are defined in the range of 1 to 100. The simulation results are presented in Table II. From the results it is seen that a bucket sort implemented with 128 or more buckets can provide a relatively smooth cell flow. With 512 buckets, the performance is nearly optimal.
The second simulation was made for a relatively large number of connections. A total of 1500 connections are sharing the same link, and their relative weights are defined in the range of 1 to 100. The simulation results are presented in Table III.
Sim. = 100 x max(output process (mean)) = 266000 cell p.
It can be seen that a smooth cell flow can be provided for a large number of connections. One of the major advantages of this bucket sort mechanism is that the number of supported connections has minimum impact on the implementation complexity. The mechanism is capable of supporting a large number of simultaneously transmitting sources. Furthermore, a large range of relative weights can be supported.
III. HARDWARE MODULE ARCHITECTURE
The proposed approximation to weighted fair queueing has been implemented in a queueing unit. Fig. 5 illustrates the architecture of the queueing unit, and how the queueing unit can be used together with a switch unit. The connections can be mapped to one of 16 weighted fair queueing modules that are implemented per input queueing unit. A module consists of a weighted fair queueing approximation, and a traffic shaping mechanism for dynamic control of the maximum output rate from the module. Backpressure from the switch unit to the input queueing units adjusts the transmission rate from the weighted fair queueing modules to the total load per switch unit output port. The queueing unit is based on a VLSI design with a block of external SRAM memory. The architecture of the VLSI design is illustrated in Fig. 6.
The design implements the following: input/output line i/f, data/address ctrl for external SRAM i/f, CPU i/f, and modules for bucket sorting, cell queueing, port scheduling and a global control unit.
A. Bucket sort
The bucket sort controller inserts and removes timestamp and connection identifier pairs from buckets. The buckets are searched for non-empty buckets, and spacing time is updated.
B. Cell queueing
The queue controller maintains per-connection cell queues. Cells are discarded when the queues are full. The allocation of buffer length is controlled via the external CPU interface. Queue length is allocated from a pool of cell buffers shared by all connections.
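The queue controller's behaviour can be modelled in software as follows. This is a minimal sketch under stated assumptions: the class name `QueueController` and its methods are hypothetical, and discarding a cell when the shared pool is exhausted is an assumption added for the example (the text only states that cells are discarded when the queues are full).

```python
from collections import deque

class QueueController:
    """Per-connection cell queues drawing buffers from one shared pool."""

    def __init__(self, pool_size, max_queue_len):
        self.idle = deque(range(pool_size))      # indices of idle cell buffers
        self.buffers = [None] * pool_size        # cell storage
        self.max_len = dict(max_queue_len)       # per-connection length limit
        self.queues = {conn: deque() for conn in self.max_len}

    def enqueue(self, conn, cell):
        queue = self.queues[conn]
        # Discard when the connection's queue is full.
        # Assumption: also discard when no idle buffer is left in the pool.
        if len(queue) >= self.max_len[conn] or not self.idle:
            return False
        buf = self.idle.popleft()
        self.buffers[buf] = cell
        queue.append(buf)
        return True

    def dequeue(self, conn):
        queue = self.queues[conn]
        if not queue:
            return None
        buf = queue.popleft()
        cell = self.buffers[buf]
        self.buffers[buf] = None
        self.idle.append(buf)                    # buffer returns to the pool
        return cell

ctrl = QueueController(pool_size=3, max_queue_len={"a": 2, "b": 2})
accepted = [ctrl.enqueue("a", "a0"), ctrl.enqueue("a", "a1"),
            ctrl.enqueue("a", "a2"), ctrl.enqueue("b", "b0"),
            ctrl.enqueue("b", "b1")]
```

In this run the third cell of connection a is discarded by its own queue limit, and the second cell of connection b is discarded because the shared pool of three buffers is exhausted.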
C. Port scheduler
The outputs from the weighted fair queueing modules are scheduled with a round robin discipline. If a module has any non-empty per-connection cell queues, the weighted fair queueing approximation will decide from which queue the next cell is transmitted. Incoming cells that are not mapped to any of the modules are passed directly from input to output of the queueing unit.
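A round robin discipline over the modules can be sketched as below. The sketch assumes that modules without queued cells are skipped, which is not stated explicitly in the text; the names `RoundRobinPortScheduler` and `next_module` are hypothetical.

```python
class RoundRobinPortScheduler:
    """Pick the next module with queued cells, in cyclic order."""

    def __init__(self, num_modules):
        self.num_modules = num_modules
        self.last = num_modules - 1      # so that module 0 is tried first

    def next_module(self, has_cells):
        # Assumption: modules with only empty queues are skipped.
        for step in range(1, self.num_modules + 1):
            candidate = (self.last + step) % self.num_modules
            if has_cells[candidate]:
                self.last = candidate
                return candidate
        return None                      # no module has a cell to send

sched = RoundRobinPortScheduler(num_modules=4)
backlogged = [True, False, True, True]
picks = [sched.next_module(backlogged) for _ in range(5)]
```

Once a module is selected, its own weighted fair queueing approximation decides which connection queue supplies the cell.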
D. Global ctrl
A central control unit (not shown in Fig. 6) implements a state machine that controls the other modules.
E. Line i/f
The incoming line interface implements an 8 bit parallel synchronous interface. Synchronization to the incoming cell flow is performed and mis-matched cells are discarded. The cell labels are decoded. The outgoing line interface implements an 8 bit parallel synchronous interface with control signals for an external FIFO module.
F. External memory i/f
The external SRAM interface consists of an address bus controller and a data bus controller. Data for write operations are multiplexed and memory addresses are calculated. The data bus is 48 bits wide, and the address bus is 21 bits wide. The data bus controller implements a Hamming parity coding/decoding, capable of detecting and correcting single bit errors.
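The principle of single bit error correction with a Hamming code can be illustrated in software. The sketch below is a generic textbook construction, not the coding used in the design: how the 48 bit data bus is divided between data and parity bits is not specified here, so the example simply encodes 48 data bits into a 54 bit codeword. The function names are hypothetical.

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits; parity sits at power-of-two positions."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:          # number of parity bits needed
        r += 1
    code = [0] * (m + r + 1)             # positions 1..m+r, index 0 unused
    bits = iter(data_bits)
    for pos in range(1, m + r + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(bits)
    syndrome = 0
    for pos in range(1, m + r + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                   # choose parity so the syndrome is 0
        code[1 << i] = (syndrome >> i) & 1
    return code[1:]

def hamming_decode(codeword):
    """Return (data_bits, syndrome); a single flipped bit is corrected."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    if 0 < syndrome < len(code):
        code[syndrome] ^= 1              # syndrome is the error position
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return data, syndrome

word = [(i * 7 + 3) % 2 for i in range(48)]
stored = hamming_encode(word)
stored[10] ^= 1                          # inject an error at position 11
recovered, position = hamming_decode(stored)
```

A non-zero syndrome gives the position of the flipped bit directly, which is why the correction can be done on every memory read.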
G. CPU i/f
The VLSI design implements approximately 50 registers that can be accessed by an external processor unit in the MC68000 series. Read/write accesses are with SI4 CPU cycles. A single bit interrupt is included. The interface implements registers for CPU access to the external SRAM.
IV. IMPLEMENTATION DETAILS
The management of the bucket sort mechanism is illustrated in Fig. 7. A bucket consists of head and tail pointers to a linked list of bucket entries. The bucket is empty if both these pointers are nil. A bucket entry consists of a connection identifier, a timestamp offset and a pointer to another bucket entry. The bucket number b already identifies the virtual time interval in which the timestamp lies. It is therefore sufficient to store a timestamp offset in the bucket entry. If N connections are supported, there must be a total of N bucket entries. A bucket entry is idle for every connection queue that is empty. Two global head and tail pointers, IdleEntryHead and IdleEntryTail, maintain a linked list of idle bucket entries.
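The bucket entry management can be modelled with index-based linked lists, in the way a memory-based design would hold them. The following is a sketch under assumptions: the class `BucketMemory`, the `NIL` value of -1 and the array layout are hypothetical, and at least one connection is assumed.

```python
NIL = -1   # assumed encoding of the nil pointer

class BucketMemory:
    """Buckets as head/tail pointers into a shared array of N entries."""

    def __init__(self, num_buckets, num_connections):
        self.head = [NIL] * num_buckets
        self.tail = [NIL] * num_buckets
        self.conn = [NIL] * num_connections      # connection identifier
        self.offset = [0] * num_connections      # timestamp offset in bucket
        # Initially every entry is idle and chained into the idle list.
        self.next = list(range(1, num_connections)) + [NIL]
        self.idle_head = 0
        self.idle_tail = num_connections - 1

    def insert(self, bucket, conn_id, offset):
        entry = self.idle_head                   # take an idle entry
        if entry == NIL:
            raise RuntimeError("no idle bucket entry")
        self.idle_head = self.next[entry]
        if self.idle_head == NIL:
            self.idle_tail = NIL
        self.conn[entry], self.offset[entry] = conn_id, offset
        self.next[entry] = NIL
        if self.head[bucket] == NIL:             # bucket was empty
            self.head[bucket] = entry
        else:
            self.next[self.tail[bucket]] = entry
        self.tail[bucket] = entry

    def remove(self, bucket):
        entry = self.head[bucket]
        if entry == NIL:
            return None
        self.head[bucket] = self.next[entry]
        if self.head[bucket] == NIL:             # bucket is now empty
            self.tail[bucket] = NIL
        self.next[entry] = NIL                   # append to the idle list
        if self.idle_tail == NIL:
            self.idle_head = entry
        else:
            self.next[self.idle_tail] = entry
        self.idle_tail = entry
        return self.conn[entry], self.offset[entry]

mem = BucketMemory(num_buckets=4, num_connections=2)
mem.insert(1, conn_id=7, offset=3)
mem.insert(1, conn_id=9, offset=0)
first, second = mem.remove(1), mem.remove(1)
```

Both insertion and removal touch a fixed number of memory words, independent of how many entries a bucket holds.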
The bottleneck in the bucket sort mechanism's scalability for speed is the search for the next non-empty bucket, which must be performed per cell slot period. In the worst case, all buckets must be scanned, which will require a relatively large bucket memory access bandwidth. In order to reduce the memory bandwidth demands, a single bit status flag indicating whether a bucket is empty or non-empty is maintained. When searching for a non-empty bucket, the single bit status flags are scanned.
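The scan of the status flags can be illustrated with an integer used as a bit vector. This is a software sketch only, assuming that bit b of `flags` is set when bucket b is non-empty; the function name `next_nonempty` is hypothetical, and a hardware design would typically use a priority encoder for the same purpose.

```python
def next_nonempty(flags, active, num_buckets):
    """Return the first non-empty bucket at or after `active`, cyclically."""
    if flags == 0:
        return None                      # every bucket is empty
    mask = (1 << num_buckets) - 1
    # Rotate so that the active bucket becomes bit 0.
    rotated = ((flags >> active) | (flags << (num_buckets - active))) & mask
    lowest = rotated & -rotated          # isolate the lowest set bit
    return (active + lowest.bit_length() - 1) % num_buckets

wrapped = next_nonempty(0b00000100, active=5, num_buckets=8)
ahead = next_nonempty(0b00100100, active=3, num_buckets=8)
```

Here `wrapped` is bucket 2, found after wrapping around the end of the timewindow, and `ahead` is bucket 5. Only the flags are read during the search, so the bucket memory itself is accessed once per cell slot.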
The organization of the per-connection queues is illustrated in Fig. 8. Every connection maintains a queue length counter and a maximum queue length value. Incoming cells are discarded when the queue length is equal to the maximum queue length. The cell queues are maintained as linked lists. Every connection maintains head and tail pointers to a linked list of cell buffers. Two global pointers, IdleCellHead and IdleCellTail, maintain a linked list of idle cell buffers.
The total memory requirement is a function of the maximum number of connections N, the number of buckets B, the length of the timewindow $T_{window}$ and the maximum per-connection cell queue length. Table IV illustrates memory size and bandwidth access for the different memory blocks used to implement weighted fair queueing. The memory size is for a single port. The bucket empty status flags and bucket memory require a relatively small amount of memory, and consume a relatively large part of the total memory access bandwidth. This memory is therefore well suited to be realized as on-chip memory in a VLSI design. The memory for the bucket entries, queue ctrl and cell buffers is well suited for realization as stand-alone memory external to the VLSI design, because of the relatively large memory size.
The problem with implementing linked lists is that the entire list must be re-initialized when bit errors occur. If extra parity bits for error detection and correction are added to the memory interface, e.g. a Hamming parity coding, the stability of the design can be increased significantly.
V. VLSI IMPLEMENTATION
The VLSI design was implemented in a 1 µm CMOS process, using a standard cell layout approach. The design process was fully automated, with gate level synthesis from a VHDL description. The VLSI characteristics are presented in Table V.
