Today, ATM networks are being used to carry bursty data traffic with large and highly variable rates, and burst sizes ranging from kilobytes to megabytes. Obtaining good statistical multiplexing performance for this kind of traffic requires much larger buffers than are needed for more predictable applications or for bursty data applications with more limited burst transmission rates. Large buffers lead to large queueing delays, making it necessary for switches to implement more sophisticated queueing mechanisms in order to deliver acceptable Quality of Service (QoS). This paper describes a 2.4 Gb/s ATM queue management chip that has practically unlimited buffer scaling and which supports dynamic per VC queueing, an efficiently implementable form of weighted fair queueing, a novel packet-level discarding algorithm and the ability to support multiple output links. We give a detailed description of our weighted fair queueing scheduling method, which we call the Binary Scheduling Wheels (BSW) algorithm. The BSW algorithm smooths bursty traffic and guarantees minimum transmission rates during overload. The BSW algorithm uses a binary counter based scheduling mechanism and is well-suited to hardware implementation.
Introduction
When ATM network technology was first developed in the 1980s, its developers envisioned a comprehensive traffic management methodology, with explicit reservation of resources, end-to-end pacing of user data streams to conform to resource reservations and network-level enforcement mechanisms to protect against inadvertent or intentional violation of resource reservations. In the context of such a methodology, efficient statistical multiplexing performance could be achieved without large amounts of buffering in the network and with very simple queueing mechanisms.
Washington University
St. Louis, MO 63130 Tel: (314) Fax: (314)935-7302 E-mail: jst@cs.wustl.edu
As ATM was deployed in the 1990s, the original expectations for traffic management were found to be unrealistic. ATM is now being used largely to support internet data traffic which is highly unpredictable and for which the traffic management philosophy of ATM is difficult to apply. In the current application context, resources are generally not explicitly reserved, end systems do not pace their transmissions and most network equipment cannot enforce resource usage limits. In this environment, to obtain good statistical multiplexing performance and high link utilization, one needs large buffers.
In particular, one needs buffers that are at least comparable to, and preferably an order of magnitude larger than, user data bursts, which range in size from kilobytes to megabytes. Unfortunately, the use of large buffers with simple FIFO queueing disciplines leads to poor performance for real-time traffic and allows "greedy" applications to appropriate an unfair portion of network resources. Providing good quality of service (QoS) to real-time applications and fair treatment to bursty data applications requires more sophisticated queueing and cell scheduling mechanisms.
This paper describes a design for an ATM queue manager that supports separate queues for each application data stream and buffer sizes that are limited only by the cost of memory. The design can be implemented with a single application-specific integrated circuit (ASIC) in 0.35 micron CMOS technology together with SRAM components. The design will support a total output rate of 2.4 Gb/s and can support either a single OC-48 link, or a combination of lower speed links.
Section 2 provides an overview of the ATM queue management chip, detailing its principal features and its overall architecture. Section 3 contains a detailed description of our novel implementation of weighted fair queueing called the Binary Scheduling Wheels (BSW) algorithm. Section 4 gives the hardware implementation and cost estimation. The BSW algorithm is well-suited to hardware implementation. It schedules and forwards cells in essentially constant time, and can accommodate a large range of weights.
Overview of Dynamic Queue Manager
The Dynamic Queue Manager (DQM) is designed to connect to the output side of a high performance ATM switch, such as the Washington University Gigabit Switch, described in [3]. The major features of the DQM chip are listed below:
Dynamic Queue Assignment -- The DQM implements per VC queueing using dynamic assignment, which allows the chip to support virtual path and virtual circuit connections with arbitrary choices of VPIs and VCIs and no explicit configuration of VCI ranges to particular VPIs. This greatly simplifies the use of the chip and enables optimal use of the chip's per channel data structures. Details can be found in [5].
Unlimited Buffer Scaling -- The DQM chip is designed so that the cell buffer can be scaled up to very large sizes without increasing the chip complexity significantly. Both the cell buffer and all information to maintain the cell buffer (that is, all the links for the linked list queues and the free slot list) are stored in external memory. The only constraint that the DQM chip places on the buffer capacity is through the choice of pointers. With 20 bit pointers, the chip can support buffer sizes over 50 Mbytes; 24 bit pointers would allow for up to 800 Mbytes. For all practical purposes, the buffer capacity is not constrained by the DQM chip.
Efficient Implementation of Weighted Fair Queueing -- The DQM chip implements weighted fair queueing using a novel approach called the Binary Scheduling Wheels (BSW) algorithm. The BSW algorithm allows cells to be scheduled and forwarded in essentially constant time. Minimum transmission rates of individual virtual circuits can be explicitly specified as a multiple of the lowest rate supported. These weights determine the relative frequency with which cells are forwarded, allowing link bandwidth to be allocated appropriately during congestion periods. With 32 bits to specify the smallest rate, the BSW algorithm can assign bandwidth in amounts ranging from 2.4 Gb/s to less than one bit per second.
Packet

Figure 1: Block Diagram of Dynamic Queue Manager

[...] manage the free slot list that is stored in the external memory, along with the waiting cells. The DQM chip incorporates an on-chip cache that allows the free slot list to be maintained using only memory cycles that would otherwise go unused. This cache stores the location of a number of available cell storage slots. Storage slots can usually be assigned to arriving cells from the cache and departing cells can usually return their cell slots to the cache, rather than accessing the off-chip free slot list. The off-chip list is only accessed to refresh or free up space in the cache, but these operations can be performed during periods when there are guaranteed to be unused memory cycles available.
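The free slot cache described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `FreeSlotCache`, the watermark parameters `low` and `high`, and the rule of moving one slot per idle memory cycle are all hypothetical and are not taken from the DQM design.

```python
class FreeSlotCache:
    """Toy model of an on-chip cache in front of an off-chip free slot list.

    Assumptions (not from the paper): the cache is refilled below a low
    watermark and spilled above a high watermark, one slot per idle cycle.
    """

    def __init__(self, off_chip_free, capacity=8, low=2, high=6):
        self.off_chip = list(off_chip_free)  # free slot list in external memory
        self.cache = []                      # slot addresses held on chip
        self.capacity, self.low, self.high = capacity, low, high

    def allocate(self):
        """Give an arriving cell a storage slot, normally from the cache."""
        if self.cache:
            return self.cache.pop()
        # Rare fallback: the cache is empty, so go to the off-chip list.
        return self.off_chip.pop() if self.off_chip else None

    def release(self, slot):
        """Return the slot of a departing cell, normally to the cache."""
        if len(self.cache) < self.capacity:
            self.cache.append(slot)
        else:
            self.off_chip.append(slot)       # rare fallback: cache is full

    def idle_cycle(self):
        """Use an otherwise unused memory cycle to rebalance the cache."""
        if len(self.cache) < self.low and self.off_chip:
            self.cache.append(self.off_chip.pop())   # refresh the cache
        elif len(self.cache) > self.high:
            self.off_chip.append(self.cache.pop())   # free up cache space
```

As long as idle cycles occur often enough to keep the cache between the two watermarks, `allocate` and `release` in this model never touch the off-chip list.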
Scheduling Wheels Algorithm
The Weighted Fair Queueing cell scheduler is implemented in the Output Scheduler using the Binary Scheduling Wheels Algorithm (BSW). The BSW provides a wide range of rate options at minimal cost and guarantees minimum transmission rates during overload periods. In the current design, all virtual circuits share the extra link bandwidth in proportion to their weights. The extra bandwidth can also be equally shared by all virtual circuits with a slight modification of the algorithm and some additional hardware. Because bursty virtual circuits with high peak-to-average ratios are more likely to cause congestion in the downstream switches, the BSW interleaves cells from different channels to reduce the burstiness of the output streams.
Scheduling Wheels
To describe the algorithm, we specify rates as a fraction of link bandwidth. In order to simplify the computation logic, the smallest rate is restricted to be a power of 2 rate, $1/2^n$, where $n = 0, 1, \ldots$. We also restrict supported rates to be multiples of the smallest rate. In general, if $m$ distinct rates are supported, rate $i$ can be calculated as $w_i/2^n$, where the weight $w_i$ is an integer.
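As a quick arithmetic check of the rate definition above, the following sketch converts a weight into a fraction of link bandwidth and into bits per second. The function names and the constant `LINK_RATE_BPS` are illustrative; the 2.4 Gb/s link rate and the 32-bit smallest rate are the figures quoted earlier in the paper.

```python
from fractions import Fraction

LINK_RATE_BPS = 2.4e9  # total output rate quoted in the paper


def rate_fraction(weight, n):
    """Rate of a VC as a fraction of link bandwidth: w_i / 2^n."""
    return Fraction(weight, 2 ** n)


def rate_bps(weight, n, link_rate_bps=LINK_RATE_BPS):
    """The same rate expressed in bits per second."""
    return float(rate_fraction(weight, n)) * link_rate_bps


if __name__ == "__main__":
    print(rate_fraction(3, 4))   # weight 3 with n = 4 is 3/16 of the link
    print(rate_bps(1, 32))       # smallest rate with n = 32, below 1 bit/s
    print(rate_bps(2 ** 32, 32)) # largest rate, the full link
```

With $n = 32$ the smallest assignable rate is $2.4 \times 10^9 / 2^{32} \approx 0.56$ bits per second, consistent with the "less than one bit per second" figure.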
The Binary Scheduling Wheels algorithm works as follows. For each output, we first construct $m$ scheduling wheels, one for each weight. VCs are placed on scheduling wheels according to their rates and destinations. VCs on the same wheel are organized as a circular list. Weights of the wheels are stored in memory and can be modified by control cells. Therefore, supported rates can be changed on a dynamic basis, as long as the total number of different rates does not exceed $m$. Figure 2 shows an example with $m$ scheduling wheels. Each little box in the figure represents a nonempty virtual circuit queue.
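A scheduling wheel can be modeled in software as a circular list of nonempty VC queues. The sketch below is an illustration only: the class `SchedulingWheel` and its methods are hypothetical names, and a `collections.deque` stands in for the linked lists a hardware implementation would use.

```python
from collections import deque


class SchedulingWheel:
    """Circular list of nonempty VC queues that share one weight."""

    def __init__(self, weight):
        self.weight = weight
        self.queues = deque()  # entries are (vc_id, deque of waiting cells)

    def add_vc(self, vc_id, cells):
        """Place a nonempty VC queue on the wheel."""
        self.queues.append((vc_id, deque(cells)))

    def visit(self):
        """Send one cell from the current queue, then advance the pointer."""
        if not self.queues:
            return None
        vc_id, cells = self.queues[0]
        cell = cells.popleft()
        if cells:
            self.queues.rotate(-1)   # move on to the next queue on the wheel
        else:
            self.queues.popleft()    # an empty queue leaves the wheel
        return vc_id, cell


if __name__ == "__main__":
    wheel = SchedulingWheel(weight=3)
    wheel.add_vc("a", ["a1", "a2"])
    wheel.add_vc("b", ["b1"])
    while (sent := wheel.visit()) is not None:
        print(sent)
```

Successive visits alternate between the queues on the wheel, which is the interleaving behavior the algorithm relies on to smooth bursts.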
We define the following variables:

$m$ = number of distinct rates
$\text{BinaryRate}_i = \text{BinaryProduct}_i / 2^n$
$\text{NextQ}_i$ = pointer to the next queue to be served on wheel $i$

For each wheel, we calculate the weight-count product and the binary product. We also calculate the difference between these two. As long as the allocated bandwidth does not exceed 1, we have at most $n$ distinct binary rates. Because the calculation of binary products is equivalent to rounding up the weight-count product to the closest power of 2 value, scheduling wheels with different weight-count products can have the same binary product and binary rate. The rounding up of rates can lead to some VCs getting more than their allocated share of the link bandwidth. In Section 3.3, we show how to correct for this effect.
We construct a Binary Rate List with $n+1$ slots, each of which is associated with a particular binary rate. Scheduling wheels with the same binary rate are placed on one slot. Figure 3 shows the construction of the binary rate list.
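The rounding step and the placement of a wheel on the binary rate list can be sketched as follows. Two points are assumptions rather than statements from the paper: "count" is taken to be the number of queues on the wheel, and slots are numbered from 0 for binary rate 1 down to $n$ for binary rate $1/2^n$. The function names are illustrative.

```python
from fractions import Fraction


def binary_product(product):
    """Round a weight-count product up to the closest power of 2."""
    if product < 1:
        raise ValueError("product must be a positive integer")
    return 1 << (product - 1).bit_length()


def place_wheel(weight, count, n):
    """Compute the quantities that decide where a wheel goes on the list."""
    product = weight * count              # weight-count product
    bprod = binary_product(product)       # binary product
    if bprod > (1 << n):
        raise ValueError("binary rate would exceed the link bandwidth")
    return {
        "product": product,
        "binary_product": bprod,
        "diff": bprod - product,           # extra share caused by rounding up
        "binary_rate": Fraction(bprod, 1 << n),
        "slot": n - (bprod.bit_length() - 1),  # assumed numbering, see above
    }


if __name__ == "__main__":
    # Weight 3 with 2 queues and n = 4: product 6 rounds up to 8, rate 1/2.
    print(place_wheel(3, 2, 4))
```

The nonzero `diff` in the example is the over-allocation that the skipping scheme of Section 3.3 is meant to remove.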
Generally, once a slot in the binary rate list is selected, all scheduling wheels on that slot are visited once. When a scheduling wheel is visited, the current queue on that wheel forwards one cell to the output link. The current pointer is then advanced to the next queue on the wheel. This selection process allows cells from different VCs to interleave with each other. Suppose scheduling wheels in slot 1 are to be visited twice as frequently as the scheduling wheels in slot 2, four times as frequently as wheels in slot 3, and so forth. Then an $(n+1)$-bit binary counter can be used to select slots on the binary rate list. In a binary counter, the least significant bit changes twice as often as the next lowest order bit, four times as often as the next bit and so forth. This property matches nicely with our slot selection requirement. As the counter advances, a change in bit $j$ triggers servicing of the scheduling wheels in slot $j$.
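The counter-based slot selection can be checked with a few lines of code. In this sketch the slots are numbered from 0 (least significant bit) rather than from 1 as in the text, and the helper names are illustrative.

```python
def advance(counter, width):
    """Increment a width-bit counter and return it with the changed bits."""
    new_counter = (counter + 1) % (1 << width)
    changed = counter ^ new_counter        # a 1 marks every bit that flipped
    slots = [b for b in range(width) if (changed >> b) & 1]
    return new_counter, slots


def visit_counts(width, steps):
    """Count how often each slot is selected over a number of increments."""
    counts = [0] * width
    counter = 0
    for _ in range(steps):
        counter, slots = advance(counter, width)
        for slot in slots:
            counts[slot] += 1
    return counts


if __name__ == "__main__":
    # Over one full cycle of a 4-bit counter each slot is selected half as
    # often as the slot before it.
    print(visit_counts(4, 16))
```

Over one full cycle of a 4-bit counter the four slots are selected 16, 8, 4 and 2 times, which is the halving pattern the binary rate list requires.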
Fast Forward Counter
The binary counter increments by 1 each time it advances. Slots corresponding to changing bits are called eligible slots. However, it is possible that when the counter advances, none of the eligible slots have scheduling wheels attached to them. In this case, we must increment the counter again to find a non-empty slot. In the worst case it may take $2^n$ increment steps to find a non-empty slot and during these steps, link bandwidth may be wasted. To avoid this, we introduce a fast forward counter that skips past empty slots. The idea of the fast forward algorithm is to increment the counter with a carry-in at the position of the right-most non-empty slot. We keep a mask register to indicate non-empty slots that have not been served. We also keep a carry-in register with only one bit set at the position corresponding to the least significant '1' bit of the mask register. After all eligible slots have been served, the value in the carry-in register is added to the counter. The resulting right-most changing bit always corresponds to a non-empty slot.
With the fast forward counter, the selection time becomes essentially independent of the total number of slots. While the time to select the least significant '1' bit does require more than constant time, hardware implementation can easily be made fast enough that this does not become an issue for realistic values of $n$. Consequently, cells can be selected and forwarded in essentially constant time.
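A minimal sketch of one fast forward step, based on one reading of the description above: the mask is simplified to a bit vector of non-empty slots, and the carry-in is its least significant set bit. The function name and the simplification of the mask register are assumptions.

```python
def fast_forward_step(counter, nonempty_mask, width):
    """Advance the counter past empty slots and return the slots to serve.

    nonempty_mask has bit b set when slot b holds at least one wheel.
    """
    if nonempty_mask == 0:
        return counter, []                       # nothing to schedule
    carry_in = nonempty_mask & -nonempty_mask    # least significant '1' bit
    new_counter = (counter + carry_in) & ((1 << width) - 1)
    changed = counter ^ new_counter
    eligible = changed & nonempty_mask           # changed bits with wheels
    slots = [b for b in range(width) if (eligible >> b) & 1]
    return new_counter, slots


if __name__ == "__main__":
    # Only slots 1 and 3 are non-empty in a 4-bit counter.
    counter, served = 0, []
    for _ in range(8):
        counter, slots = fast_forward_step(counter, 0b1010, 4)
        served.append(slots)
    print(served)
```

Because the lowest bit that changes is always the carry-in position, every step serves at least one non-empty slot, and the relative visiting frequencies of the non-empty slots are preserved (4 to 1 between slots 1 and 3 in the example).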
Accurate Rate Control
We place scheduling wheels on the binary rate list according to their binary rates, or equivalently, their binary products. However, because of the rounding up effect, binary products are usually larger than the actual weight-count products. As a result, scheduling wheels whose weight-count products round up to larger binary products are visited more frequently than they should be. If the total binary rate of all scheduling wheels does not exceed 1, all queues can receive their guaranteed rates, but queues on the wheels that round up may receive more excess bandwidth than those that do not round up. However, if the total binary rate exceeds 1, queues on the wheels that do not round up may not be able to obtain their guaranteed rates.
To avoid this problem while keeping the simple counter-based selection mechanism, we introduce a scheduling wheel skipping scheme. With scheduling wheel skipping, the selection rate of each scheduling wheel can be accurately controlled. The skipping algorithm is described as follows.
For each scheduling wheel, we use a credit counter to decide whether the wheel needs to be served or be skipped. When a wheel is served, we subtract the extra bandwidth the wheel receives from the credit counter. When a wheel is skipped, we add the bandwidth the wheel needs to receive to the credit counter. A scheduling wheel is served only if the value in the credit counter is non-negative.
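The credit rule can be simulated directly. In this sketch the "extra bandwidth" is taken to be the difference between the binary product and the weight-count product, and the "bandwidth the wheel needs to receive" is taken to be the weight-count product itself; both interpretations, and the function name, are assumptions.

```python
def simulate_wheel(product, binary_product, opportunities):
    """Return a list of booleans: True where the wheel is served."""
    inc = product                    # credit gained when skipped
    dec = binary_product - product   # credit spent when served
    credit = 0
    pattern = []
    for _ in range(opportunities):
        if credit >= 0:              # served only with non-negative credit
            credit -= dec
            pattern.append(True)
        else:
            credit += inc
            pattern.append(False)
    return pattern


if __name__ == "__main__":
    # Product 3 rounds up to binary product 4, so the wheel should be
    # served on three out of every four opportunities.
    pattern = simulate_wheel(3, 4, 400)
    print(sum(pattern), "services in", len(pattern), "opportunities")
```

Under these assumptions the wheel is served on a fraction product/binary_product of its opportunities, and because the difference is always smaller than the product it is never skipped twice in a row.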
Instead of using the single list shown in Figure 3 for each slot, we place a scheduling wheel in slot j on one of three different lists: active list, inactive list and temporary list based on the following conditions.
The active list contains scheduling wheels to be served the next time a slot is selected. The inactive list contains wheels to be skipped the next time a slot is selected. When a bit in the binary rate list changes, all wheels on the active list are allowed to send one cell from their next queue. When we serve a wheel on the active list, we also decide whether it should be skipped the next time around. If so, we remove it from the active list and place it on a temporary list. When all the wheels on the active list have been served, the wheels on the inactive list are moved to the active list, while the wheels that have been put on the temporary list are moved to the inactive list. The key behind this list manipulation is that because the scheduling is based on a set of binary rates, we never need to skip any single scheduling wheel twice in a row. Using this property of binary rates, we only need to update the credit value when a wheel is served, using credit pre-calculation. In other words, after a scheduling wheel is served, we determine whether the wheel is to be skipped the next time the slot is selected. If so, in addition to putting the wheel on the inactive list, the credit value after the wheel is skipped is calculated in advance. As a result, we only need to update state when a cell is actually sent. This allows all the states to be stored in memory, rather than requiring the use of hardware registers. The wheel skipping algorithm is shown below.
Initially, for every scheduling wheel i
    Credit_i = 0;
    Dec_i = Diff_i;
While (a slot in binary rate list is selected)
    for (every scheduling wheel i in active list)
        Read out entry i of status [...]

The fast forward counter needs to be modified to take into account the active and inactive lists. Instead of a simple mask, we need a 2-bit mask, with two bits for every distinct binary rate. A value of $(00)_2$ for the 2-bit mask means that there are no scheduling wheels associated with that position of the binary rate list. A value of $(01)_2$ means that there are scheduling wheels associated with that position of the binary rate list but they are to be skipped when their next scheduling opportunity comes up. A value of [...] means that there are scheduling wheels associated with that position of the binary rate list and at least one of them is to be served next time. We increment the fast forward counter with a carry-in at the lowest order position for which the 2-bit mask value is [...]
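The active, inactive and temporary lists with credit pre-calculation, as described in the prose of this section, might be modeled as in the sketch below. This is not a reconstruction of the authors' listing: the class names, field names, and the assumption that the increment equals the weight-count product are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Wheel:
    name: str
    inc: int          # credit added for a skipped turn (assumed: the product)
    dec: int          # credit spent per service (binary product - product)
    credit: int = 0
    served: int = 0


@dataclass
class Slot:
    active: list = field(default_factory=list)
    inactive: list = field(default_factory=list)


def select_slot(slot):
    """Serve every wheel on the active list once, then rotate the lists."""
    still_active, temporary = [], []
    for wheel in slot.active:
        wheel.served += 1                 # the wheel sends one cell
        wheel.credit -= wheel.dec
        if wheel.credit < 0:
            # The wheel will be skipped next time, so its credit after the
            # skip is computed now and no state changes during the skip.
            wheel.credit += wheel.inc
            temporary.append(wheel)
        else:
            still_active.append(wheel)
    slot.active = still_active + slot.inactive   # skipped wheels come back
    slot.inactive = temporary                    # newly skipped wheels wait


if __name__ == "__main__":
    a = Wheel("A", inc=3, dec=1)   # product 3, binary product 4
    b = Wheel("B", inc=4, dec=0)   # product 4, no rounding
    slot = Slot(active=[a, b])
    for _ in range(400):
        select_slot(slot)
    print(a.served, b.served)
```

In this model a wheel's state is only written while it is being served, which is the property that lets the state live in memory rather than in registers.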
Hardware Implementation
This section describes the data structures used in the hardware implementation. Let $N$ be the number of queues supported by the DQM. Let $m$ be the total number of weights. Let $n$ be the number of bits used to specify the smallest rate. Let $r$ be the number of outputs. Figure 4 shows the cell scheduler structures in the chip. Since only one output is enabled at a time, the rate control engine can be shared among all outputs. The only parallelism required is the fast forward counter. Each output needs a counter to record the current selection state.
In the cell scheduler, all data structures are stored in on-chip SRAM. The states of the scheduling wheels are also kept in memory. Figure 5 shows the data structures to implement the Binary Scheduling Wheels algorithm. The Status Table stores state information of scheduling wheels. Each entry contains $\text{NextQ}_i$, $w_i$, $\text{count}_i$, $\text{Inc}_i$, $\text{Dec}_i$, and $\text{Credit}_i$, as previously defined. A Queue List is used to construct scheduling wheels. Since a queue can only be on one of the wheels at any time, the queue list can be shared by all outputs. The Wheel List is used to form linked lists in the active and inactive lists. The Slot List Table keeps track of the first and last wheels on the active and inactive lists.
We need to understand how the number of weights and the number of virtual circuits in the system affect the hardware complexity. Figure 6 shows the required memory size for a single output cell scheduler. The total memory requirement for the scheduler supporting 1024 weights is 22 KBytes, with a total of 1024 virtual circuits, which means arbitrary rates can be assigned to individual connections. Even with 8192 virtual circuits, the memory required to implement 1024 weights is less than 35 KBytes.
In the calculation above, all values are stored as integers. However, we can use floating point values of 12 bits each (7 bits of mantissa and 5 bits of exponent) to store the increment, decrement and credit values. This basically cuts the memory requirement in half. In addition, if the arithmetic operations are fast enough, the increment and decrement values can be calculated on the fly, which reduces the required memory even further.

Figure 5: Data Structures in Cell Scheduler
In the structures shown in Figure 5, memory grows as the number of supported outputs increases. Therefore, if on-chip memory is limited, the maximum number of weights that can be supported per output goes down as the number of outputs increases. However, if the number of outputs is large, we can modify the data structure so that states such as the weight and credit are stored with every queue instead of rate groups. The list nodes can then be shared across all the outputs. After this modification, the data structure stays essentially the same size, no matter how many outputs we have. Each queue needs about 5 bytes of memory to keep the status information. This implementation is more cost efficient for large numbers of outputs.
Summary
In this paper, we have described a dynamic queue manager for gigabit ATM networks and presented a detailed description of the Binary Scheduling Wheels (BSW) algorithm used in the design. The BSW algorithm implements weighted fair queueing in a cost efficient way. VC queues are placed on different scheduling wheels based on their weights. A fast forward mechanism allows cells to be scheduled in essentially constant time. With scheduling wheel skipping, the cell scheduler can guarantee minimum transmission rates during overload. It also spreads out cells from bursty sources. The BSW algorithm uses a simple binary-counter based selection scheme, and is suitable for hardware implementation.
