Abstract - Quality-of-Service (QoS) guarantees in networks are increasingly based on per-flow queueing and sophisticated scheduling. Most advanced scheduling algorithms rely on a common computational primitive: priority queues. Large priority queues are built using calendar queue or heap data structures. To support advanced scheduling at OC-192 (10 Gbps) rates and above, pipelined management of the priority queue is needed. We present a pipelined heap manager that we have designed as a core integratable into ASIC's, in synthesizable Verilog form. We discuss how to use it in switches and routers and its advantages over calendar queues, and we present cost-performance tradeoffs. Our design can be configured to any heap size. We have verified and synthesized our design and present cost and performance analysis information.
I INTRODUCTION
The speed of networks is increasing at a dramatic pace. Significant advances also occur in network architecture, and in particular in the provision of quality of service (QoS) guarantees. Switches and routers increasingly rely on specialized hardware to provide the desired high throughput and advanced QoS. Such supporting hardware becomes feasible and economical owing to the advances in semiconductor technology. To be able to provide top-level QoS guarantees, network switches and routers increasingly rely on per-flow queueing and advanced scheduling [14]. The topic of this paper is hardware support for advanced scheduling.
Per-flow queueing refers to the architecture where the packets contending for and awaiting transmission on a given output link are kept in multiple queues, thus providing isolation between flows. A scheduler must then serve these queues in an order that fairly allocates the available throughput to the active flows. Commercial switches and routers currently have multiple queues per output, but their number is limited (a few tens), so their schedulers are relatively simple. When more queues are desired, the hardware architecture has to be adapted accordingly. Managing many thousands of queues at high speed is feasible, today, using modern VLSI technology.
† Also with the Department of Computer Science, University of Crete, Heraklion, Crete, Greece.
This paper deals with the next problem: implementing sophisticated scheduling algorithms at high speed, when there are many thousands of contending flows. Section II presents an overview of various advanced scheduling algorithms. They all rely on a common computational primitive for their most time-consuming operation: finding the minimum (or maximum) among a large number of values. Previous work on implementing this primitive at high speed is reviewed in section II.C. However, for OC-192 (10 Gbps) and higher rates, and for packets as short as about 40 bytes, an even higher operation rate is needed. To achieve such higher rates, pipelining must be used.
This paper presents a pipelined heap manager that we have designed in the form of a core, integratable into ASIC's. Pipelining the heap operations requires some modifications to the normal (software) heap algorithms, as described in section III. Section IV presents cost-performance tradeoffs. Section V describes our implementation, which is in synthesizable Verilog form. The ASIC core that we have designed is configurable to any size of priority queue.
A new operation can be issued in every clock cycle, except that an insert operation or an idle cycle is needed between two successive delete operations.
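To make the rate requirement for short packets concrete, the following back-of-the-envelope sketch computes the time available per minimum-size packet. It assumes a nominal 10 Gbps line rate with no framing overhead; the function names and the `ops_per_packet` parameter are hypothetical and serve only as illustration.

```python
def packet_time_ns(packet_bytes, link_gbps):
    """Transmission time of one packet in nanoseconds.

    bits / (Gbit/s) yields nanoseconds directly. Framing overhead is
    ignored (an assumption of this sketch).
    """
    return packet_bytes * 8 / link_gbps


def ops_per_second(packet_bytes, link_gbps, ops_per_packet=1):
    """Priority queue operations per second needed to keep up with the link.

    ops_per_packet is a hypothetical knob: how many priority queue
    operations the scheduler spends on each packet.
    """
    return ops_per_packet * 1e9 / packet_time_ns(packet_bytes, link_gbps)


slot = packet_time_ns(40, 10)    # 32.0 ns per 40-byte packet at 10 Gbps
rate = ops_per_second(40, 10)    # 31.25 million operations per second
```

Under these assumptions, a 40-byte packet occupies the link for 32 ns, so even one priority queue operation per packet must complete, on average, within that time.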
II PRIORITY QUEUES FOR ADVANCED SCHEDULING
Many advanced scheduling algorithms have been proposed; good overviews appear in [17] and [12, chapter 9]. Priorities are a first, important mechanism; usually a few levels of priority suffice, so this mechanism is easy to implement. Aggregation (hierarchical scheduling) is a second mechanism: first choose among a number of flow aggregates, then choose a flow within the given aggregate [1]. Some levels of the hierarchy contain few aggregates, while others may contain thousands of flows; this paper concerns the latter levels. The hardest scheduling disciplines are those belonging to the weighted round robin family; we review these next.
A The Weighted Round Robin Family
With weighted round robin scheduling, a scheduler must serve the active flows in an order such that the service received by each active flow in any long enough time interval is in proportion to a weight factor associated with the flow. It is not acceptable to visit the flows in plain round robin order, serving each in proportion to its weight, because service times for heavy-weight flows would become clustered together, leading to burstiness and large service time jitter. Instead, the scheduler must keep track of a "next service time" number for each active flow. In each step, we must find the minimum of these numbers, and then increment it if the flow remains active, or delete it if the flow becomes inactive. When a new packet of an inactive flow arrives, that flow has to be reinserted into the schedule.
Many scheduling algorithms belong to this family. This includes both work-conserving and non-work-conserving disciplines. Other important constituents of a scheduling algorithm, such as the mechanism for updating the service time of a served flow, or that of a newly-active one, account for algorithm variants such as the virtual clock algorithm, and the earliest-due-date and rate-controlled disciplines [12, ch. 9].
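The scheduling step described above (find the minimum "next service time", then increment or delete it) can be sketched in software with Python's standard `heapq` module. This is only an illustrative model: the update rule "advance by 1/weight" is an assumption standing in for whichever variant is used, and the names `serve_all`, `weights` and `backlog` are hypothetical.

```python
import heapq
from fractions import Fraction


def serve_all(weights, backlog):
    """Serve every queued packet; return the order in which flows are served.

    weights: flow id -> integer weight.
    backlog: flow id -> number of queued packets.
    Assumed update rule: a served flow's next service time advances by
    1/weight (exact arithmetic via Fraction, to avoid rounding ties).
    """
    # One heap entry per active flow: (next_service_time, flow_id).
    heap = [(Fraction(1, weights[f]), f) for f in backlog if backlog[f] > 0]
    heapq.heapify(heap)
    remaining = dict(backlog)
    order = []
    while heap:
        t, flow = heap[0]                      # find the minimum
        order.append(flow)
        remaining[flow] -= 1
        if remaining[flow] > 0:
            # Flow stays active: replace the minimum by an incremented value.
            heapq.heapreplace(heap, (t + Fraction(1, weights[flow]), flow))
        else:
            # Flow becomes inactive: delete the minimum.
            heapq.heappop(heap)
    return order


order = serve_all({"a": 2, "b": 1, "c": 1}, {"a": 4, "b": 2, "c": 2})
```

In this model, flow "a" (weight 2) receives twice as many service opportunities as "b" or "c" over the same interval. A newly active flow would be added with `heapq.heappush`, corresponding to the reinsertion mentioned in the text.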
B Priority Queue Implementations
All of the above scheduling algorithms rely on a common computational primitive for their most time-consuming operation: a priority queue, i.e. finding the minimum (or maximum) of a given set of numbers. Priority queues with only a few tens of entries, or with priority numbers drawn from a small menu of allowable values, are easy to implement, e.g. using combinational priority encoder circuits. However, for priority queues with many thousands of entries and with values drawn from a large set of allowable numbers, heap or calendar queue data structures must be used. Other heap-like structures [7] are interesting in software but are not adaptable to high-speed hardware implementation.
A heap is a binary tree whose non-empty entries are pushed all the way up and left. The entry in each node is smaller than the entries in its two children (the heap property).
A new entry is inserted at the leftmost empty position, and is then possibly interchanged with its ancestors to re-establish the heap property. The minimum entry is always at the root; to delete it, move the last filled entry to the root, and possibly interchange it with descendants of it that may be smaller. In the worst case, a heap operation takes a number of interchanges equal to the tree height.
A calendar queue [3] is an array of buckets. Entries are placed in the bucket indicated by a linear hash function. The next minimum entry is found by searching in the current bucket, then searching for the next non-empty bucket. Calendar queues have good average performance, but in the short term some operations may be quite expensive.
C Related Work
For small priority queues (a few tens of entries) or for special cases such as plain round robin or round robin with only a small set of weight factors, simple implementations work effectively [16] [11]. Priority queues with up to hundreds of entries using specialized hardware were reported in [10] [4]; however, they do not outperform our pipelined heap manager, while their cost is higher.
For priority queues with many thousands of entries, calendar queues are a viable alternative. In high-speed switches and routers the delay of resizing the calendar queue (as in [3]) is usually unacceptable, so a large size is chosen from the beginning. However, the large size creates long sequences of empty buckets, thus requiring a mechanism to quickly search for the next non-empty bucket [8] [5]. No specific implementations of calendar queues at the performance range considered in this paper have been reported in the literature.
However, it is hard to give calendar queues a deterministic response time like that of the pipelined heap, and their cost is higher, because rehashing or linked lists are needed to handle collisions. Also, in order to be efficient, they use significantly more memory, which is the dominant cost at large sizes.
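The variable response time of a calendar queue can be seen in a minimal software model. The sketch below is not an implementation from the literature: it uses a fixed number of buckets (no resizing), keeps each bucket as an unsorted list, and assumes that no inserted value is smaller than the last deleted minimum. The class and method names are hypothetical.

```python
class CalendarQueue:
    """Toy calendar queue: fixed bucket count, linear hash, no resizing."""

    def __init__(self, num_buckets, bucket_width):
        self.buckets = [[] for _ in range(num_buckets)]
        self.width = bucket_width
        self.current = 0                  # bucket currently being searched
        self.bucket_top = bucket_width    # upper bound of the current bucket
        self.size = 0

    def insert(self, t):
        # Linear hash: consecutive value ranges map to consecutive buckets.
        self.buckets[(t // self.width) % len(self.buckets)].append(t)
        self.size += 1

    def delete_min(self):
        """Return (minimum, number of buckets visited to find it)."""
        if self.size == 0:
            raise IndexError("delete_min from empty calendar queue")
        visited = 0
        while True:
            visited += 1
            bucket = self.buckets[self.current]
            # Only entries belonging to the current "year" are due now.
            due = [t for t in bucket if t < self.bucket_top]
            if due:
                smallest = min(due)
                bucket.remove(smallest)
                self.size -= 1
                return smallest, visited
            # Empty (or not yet due): advance to the next bucket.
            self.current = (self.current + 1) % len(self.buckets)
            self.bucket_top += self.width


cq = CalendarQueue(num_buckets=8, bucket_width=10)
for t in (3, 5, 72, 14):
    cq.insert(t)
results = [cq.delete_min() for _ in range(4)]
```

In this run the first deletions touch one or two buckets, while the last one walks through seven, which illustrates why a fast search for the next non-empty bucket is needed and why the response time is hard to bound.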
Finally, concerning heap management at high speed, we have previously studied how fast it can be performed using a hardware FSM manager with the heap stored in one or two off-chip SRAM's [15]. In the present paper, we look at higher-speed heap management, using pipelining. As far as we know, nobody else had looked at pipelined heap management before this work, while a parallel and independent study appeared in [2]. In that paper, Bhagwan and Lin introduce a variant of the conventional heap, which they call P-heap. However, the P-heap has two disadvantages relative to our architecture.
First, the issue rate of insert operations cannot exceed that of consecutive deletes, while we achieve twice this speed. Second, the forest-of-heaps optimization (section III.B) is not [...]
[...] replace the minimum with a new entry that has a higher value (on packet departure, when the flow remains non-idle). When a stage is requested to perform an operation, it performs the operation on the appropriate node at its level, and then it recursively asks the level below to also perform an induced operation. For levels 2 and below, the node index, i, must also be specified. Each stage is thus able to process a new operation as soon as it has completed the previous operation at its own level only.
The replace operation is the easiest to understand. In Fig. 2, the given arg1 must replace the root at level 1. Stage 1 reads its two children from L2, to determine which of the three values is the new minimum, to be written into L1; if one of the ex-children was the minimum, the given arg1 must [...]
The delete operation is similar to replace. The arg1 is now either the rightmost non-empty entry of the bottom-most non-empty level (which is then deleted), or, when multiple operations are in progress in various pipeline stages, it comes from the youngest in-progress insert (which is then aborted). The lastEntry bus is now used to provide arg1.
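The per-level behavior of replace and delete can be modeled sequentially in software. The sketch below is an illustration under stated assumptions, not the Verilog design: levels are processed one after another rather than concurrently, the heap is a 1-indexed Python list with `heap[0]` unused, and the case where the argument comes from an in-progress insert is not modeled. The function names are hypothetical.

```python
def stage_replace(heap, i, arg1):
    """Work of one level: settle node i, given the candidate value arg1.

    Writes the minimum of arg1 and the two children of i into node i.
    Returns the index of the child that supplied the minimum (where the
    induced replace must continue), or None if arg1 itself was written.
    """
    n = len(heap) - 1                     # number of valid entries
    smallest, source = arg1, None
    for child in (2 * i, 2 * i + 1):
        if child <= n and heap[child] < smallest:
            smallest, source = heap[child], child
    heap[i] = smallest
    return source


def replace_min(heap, arg1):
    """Replace the root by arg1; return the old minimum."""
    old_min = heap[1]
    i = 1
    while i is not None:                  # ripple down, one level per step
        i = stage_replace(heap, i, arg1)
    return old_min


def delete_min(heap):
    """Delete the root, using the last filled entry as the argument."""
    old_min = heap[1]
    last = heap.pop()                     # rightmost entry of the bottom level
    if len(heap) > 1:
        i = 1
        while i is not None:
            i = stage_replace(heap, i, last)
    return old_min


heap = [None, 1, 3, 2, 7, 4, 5]
first = replace_min(heap, 6)    # heap becomes [None, 2, 3, 5, 7, 4, 6]
second = delete_min(heap)       # heap becomes [None, 3, 4, 5, 7, 6]
```

Each call to `stage_replace` touches only one node and its two children, which is what allows a hardware stage to accept the next operation once its own level is done.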
The traditional insert algorithm needs to be modified [9] [15]. Instead of inserting the new entry at the bottom, it is inserted at the root, so that all operations proceed top-to-bottom. The entry is then recursively repositioned into the proper one of the two sub-heaps. By properly steering this chain of insertions at each level (to the left or to the right sub-heap), we can ensure that the last insertion will occur at precisely the heap node next to the previously-last entry.
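A software sketch of such a top-to-bottom insert follows. It assumes a 1-indexed array layout in which the children of node i are 2i and 2i+1, so that the binary digits of the target index (after the leading 1) give the left/right steering decisions; this encoding and the name `insert_top_down` are assumptions of the sketch, and the levels are again processed sequentially rather than in a pipeline.

```python
def insert_top_down(heap, value):
    """Insert value starting at the root and moving toward the first empty slot.

    heap is a 1-indexed list (heap[0] unused). At each node on the path,
    the smaller of the carried value and the stored value stays in the
    node, and the larger one is carried down to the next level.
    """
    target = len(heap)            # index of the slot next to the last entry
    heap.append(None)             # reserve the target slot
    carried = value
    # bin(target) looks like '0b1...': drop '0b' and the leading 1; the
    # remaining bits steer the path (0 = left child, 1 = right child).
    path_bits = bin(target)[3:]
    i = 1
    for bit in path_bits:
        if carried < heap[i]:
            heap[i], carried = carried, heap[i]
        i = 2 * i + int(bit)
    heap[i] = carried             # final write into the previously empty slot


heap = [None]
for v in (5, 3, 8, 1, 4):
    insert_top_down(heap, v)
# heap is now [None, 1, 3, 8, 5, 4]
```

Because values along any root-to-leaf path are non-decreasing, carrying the displaced (larger) value downward preserves the heap property at every level that has already been visited.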
Each operation on a node i, in each stage of Fig. 2, takes 3 clock cycles: i) read from memory; ii) compare two or three values to find the minimum; iii) write this minimum into the memory of this stage. Using such an execution pattern, operations ripple down the pipeline at the rate of one stage every 3 clocks, allowing an operation initiation rate no higher than 1 every 3 cycles. We can improve on this rate by overlapping the operation of stages. In this way an operation can start working on consecutive levels before the work to be done on previous levels has completed. We can thus end up with a ripple-down rate of one stage every 
