Abstract-The implementation of packet-fair queuing (PFQ) schedulers, which aim at approximating the generalized processor sharing (GPS) policy, is a central issue for providing multimedia services with various quality-of-service (QoS) requirements in packet-switching networks. In the PFQ scheduler, packets are usually time stamped with a value based on some algorithm and are transmitted with an increasing order of the time-stamp values. One of the most challenging issues is to search for the smallest time-stamp value among hundreds of thousands of sessions. In this paper, we propose a novel RAM-based searching engine (RSE) to speed up the searching process by using the concept of hierarchical searching with a tree data structure. The time for searching the smallest time stamp is independent of the number of sessions in the system and is only bounded by the memory accesses needed. The RSE can be implemented with commercial memory and field programmable gate array (FPGA) chips in a cost-effective manner. With the extension of the RSE, we propose a two-dimensional (2-D) RSE architecture to implement a general shaper-scheduler. Other challenging issues, such as time-stamp overflow and aging, are also addressed in the paper.
I. INTRODUCTION
THE implementation of packet-fair queuing (PFQ) schedulers, which aim at approximating the generalized processor sharing (GPS) policy [5], [8], is a central issue for providing multimedia services with various quality-of-service (QoS) requirements in ATM switches and next-generation IP routers [1], [2]. The objective is to design an efficient and scalable architecture that can support hundreds of thousands of sessions (virtual channels in ATM or flows in IP) in a cost-effective manner.
The GPS is an ideal weighted fair queuing service policy based on a fluid-flow model. It can provide network delay bound for leaky bucket constrained traffic and has been intensively studied [3]-[6], [9]. However, because the fluid GPS is not practical, a class of PFQ algorithms has been proposed to emulate the fluid GPS to achieve the desired performance [7], [11]-[19], [22]. All of them are based on maintaining a global function, referred to as either system virtual time or system potential, which tracks the progress of the GPS. This global function is used to compute a virtual finish time (or time stamp) for each packet or the head-of-line (HOL) packet of each session in the system. The time stamp of a packet is the sum of its virtual start time and the time needed to transmit this packet at its reserved bandwidth. Packets are served by an increasing order of their time stamps.
Manuscript received July 14, 1998; revised February 15, 1999.
H. J. Chao, X. Guo, and C. H. Lam are with the Department of Electrical Engineering, Polytechnic University, Brooklyn, NY 11201 USA (e-mail: chao@antioch.poly.edu; xlguo@stimpson.poly.edu; chlam@kings.poly.edu).
Y.-R. Jenq is with Fujitsu Network Communications, Inc., Pearl River, NY 10965 USA (e-mail: yjenq@tddny.fujitsu.com).
Publisher Item Identifier S 0733-8716(99)04489-3.
The implementation cost of a PFQ algorithm is determined by two components: 1) computing the system virtual time function and 2) maintaining the relative ordering of the packets via their time stamps in a priority queue mechanism. Fig. 1 illustrates a packet scheduler that, for instance, can be located at the output of a switch/router. The CPU is responsible for computing time stamps and other system controls. The packet search engine is responsible for selecting the next packet for transmission according to the time-stamp values.
An efficient hardware-based priority queuing architecture [21], [24]-[28], where packet transmissions are arranged based on their time-stamp values, is required for high-speed networks. A binary tree of comparators [26]-[28] is the most straightforward way to implement the priority queue with $\log_2 N$ levels, where $N$ is the number of sessions in the system. But its time complexity is high for large $N$, and it is expensive to implement. An application-specific integrated circuit (ASIC) called a sequencer chip [35] was used to facilitate the priority queue with a time complexity of $O(1)$, independent of the number of sessions in the system. However, each sequencer chip can only handle up to 256 sessions. For a practical application where there are hundreds of thousands of sessions, the number of required sequencer chips would be too large to be cost-effective.
A searching-based approach, rather than a sorting-based one, has been presented in [29], where a number of timing queues are maintained for distinct time-stamp values, resulting in a calendar queue. The HOL packets from different sessions that have the same time-stamp value are linked together to form a timing queue. The priority queue selects a packet with the smallest time stamp to send out. This can create a system bottleneck when the number of distinct time-stamp values is large. A new ASIC called a priority content addressable memory (PCAM) chip [37] can search for the minimum time-stamp value at a very high speed. However, due to a sizable on-chip memory requirement, the PCAM is still too expensive to implement. This motivates us to use off-chip memory and implement hierarchical searching to further reduce the hardware cost. We generalized the two-level searching in the PCAM chip to hierarchical searching with a tree data structure [25], [10, p. 189] and proposed a novel RAM-based searching engine (RSE) [38] for efficient implementation, as compared to the brute-force approach using a tree of priority encoders/decoders [15, Sec. V]. The time to search the smallest time-stamp value in the RSE is independent of the number of sessions in the system and is only bounded by the memory accesses needed. It can be implemented with commercial memory and field programmable gate array (FPGA) chips.
0733-8716/99$10.00 © 1999 IEEE
Recently, the worst-case fairness index (WFI) [17] has been introduced to measure how closely a packet-by-packet scheduler approximates the GPS system in a hierarchical scheduling environment. Another class of scheduling algorithms called shaper-schedulers [13], [15], [17], [18], [22] has been proposed to achieve minimum WFI and have better worst-case fairness properties. With these algorithms, when the server is picking the next packet to transmit, it chooses, among all the eligible packets, the one with the smallest time stamp. A packet is eligible if its virtual start time is no greater than the current system virtual time. This is called the eligibility test or smallest eligible virtual finish time first (SEFF) policy [13], [17].
However, these shaper-schedulers are difficult to implement mainly because of the eligibility test. In particular, whenever the server selects the next packet for service, it needs first to move all the eligible packets from the priority queue based on eligibility times (called the shaper queue) to the priority queue based on time stamps (called the scheduler queue). In the worst case, a maximum of $N$ packets (one per session) must be moved from the shaper queue to the scheduler queue before selecting the next session for service. In [13], by taking advantage of the fixed-length ATM cells, a simple shaper-scheduler implementation architecture was proposed for ATM switches. Since the system virtual time function can be increased by one after sending a cell, only two eligible cells must be moved in the worst case. However, this is generally not true in packet networks due to the variable packet size.
In the second part of this paper, we propose a general shaper-scheduler for both ATM switches and IP routers. A slotted mechanism is introduced to update the system virtual time. With extension of the RSE, we propose a two-dimensional (2-D) RSE architecture [38] to facilitate the operations. We show that only two eligible packets at most need to be transferred to the scheduler queue in each time slot. The inaccuracy caused by the slotted scheme can be reduced by choosing a smaller update interval. The problems of time-stamp overflow and aging due to finite bits are also addressed. Note that throughout this paper, we use packet schedulers as a general term to refer to both stand-alone schedulers and shaper-schedulers.
The rest of this paper is organized as follows. Section II highlights the conceptual framework and design issues of a packet scheduler. Section III presents the RSE. Section IV describes our general shaper-scheduler and the 2-D RSE, where the problem of time-stamp overflow is also addressed. Section V discusses the time-stamp aging problem. Section VI presents conclusions.
II. CONCEPTUAL FRAMEWORK AND DESIGN ISSUES

A. Background
A GPS server has $N$ queues, each for one session (flow). Each queue $i$ has a minimum bandwidth allocation, denoted by $r_i$, where $i = 1, 2, \ldots, N$. During any time interval when there are exactly $M$ nonempty queues, the server serves the HOL packets for these $M$ queues simultaneously, and the service share for each of them is proportional to their minimum bandwidth allocations.
To approximate the GPS, a PFQ algorithm maintains a system virtual time function $V(t)$, a virtual start time $S_i$, and a virtual finish time (or time stamp) $F_i$ for each queue $i$. $S_i$ and $F_i$ are updated on arrival of the HOL packet for each queue. A packet departure occurs when its last bit is sent out, while an HOL packet arrival occurs in either of two cases: 1) a previously empty queue has an incoming packet that immediately becomes the HOL or 2) the packet next to the previous HOL packet in a nonempty queue immediately becomes the HOL after its predecessor departs. Obviously, a packet departure and a packet arrival in Case 2 could happen at the same time. Therefore

$$S_i = \begin{cases} \max\big(V(t), \bar{F}_i\big), & \text{for packet arrival in Case 1} \\ \bar{F}_i, & \text{for packet arrival in Case 2} \end{cases} \tag{1}$$

$$F_i = S_i + \frac{L_i}{r_i} \tag{2}$$

where $\bar{F}_i$ is the finish time of queue $i$ before the update, $L_i$ is the length of the HOL packet for queue $i$, and $r_i$ is the minimum bandwidth allocation of queue $i$. The way of determining $V(t)$ is the major distinction among proposed PFQ algorithms [10], [20]. The role of $V(t)$ is to reset the value of $S_i$ when queue $i$ becomes active (arrival in Case 1) to account for the service it missed [10], [20]. Therefore, the start times of the backlogged queues can stay close to each other, as required for a PFQ algorithm.
WFQ, or packet GPS (PGPS), is probably the first PFQ algorithm [5] in which the state of the GPS is tracked precisely. Although the WFQ has been proven not to fall behind the GPS by more than one maximum-size packet (in terms of the number of bits served for a session), it can be far ahead of the GPS. In other words, it is not worst-case fair, as indicated by a large WFI [17]. Motivated by this, an eligibility test was introduced in the WF$^2$Q [17] (also SPFQ [10]): when the next packet is chosen for service, it is selected from those "eligible" packets whose start times are not greater than the system virtual time. It has been proven that the WF$^2$Q can provide almost identical service to that of the GPS, differing by no more than one maximum-size packet. However, a serious limitation of the WF$^2$Q (and WFQ) is its computational complexity, which arises from the simulation of the GPS. A maximum of $N$ events may be triggered in the simulation during the transmission of a packet. Thus, the time for completing a scheduling decision is $O(N)$. WF$^2$Q+ [18] and SPFQ [10] have been shown to have the same worst-case fairness properties as WF$^2$Q but are simpler to implement by introducing the following system virtual time function:

$$V(t+\tau) = \max\Big(V(t) + W(t, t+\tau),\ \min_{i \in B(t+\tau)} S_i\Big) \tag{3}$$

where $V(t)$ is the system virtual time, $S_i$ is the virtual start time of queue $i$, $W(t, t+\tau)$ is the total amount of service provided by the server (the bits that have been transmitted) during the time interval $(t, t+\tau]$, and $B(t+\tau)$ is the set of queues that are backlogged in the system at time $t+\tau$. In the special case of a fixed-rate server, $W(t, t+\tau) = r\tau$, where $r$ is the link capacity. The time complexity is reduced to $O(\log N)$, attributed to the operations of searching for the minimum start-time value among all sessions. Next we discuss the issues of implementing a PFQ scheduler in high-speed packet networks.
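As a minimal sketch of the bookkeeping described by (1)-(3), the following Python fragment stamps a new HOL packet and advances the system virtual time. It assumes the usual forms of these equations (start time reset to the larger of the virtual time and the previous finish time for a newly active queue, finish time equal to start time plus packet length over reserved rate). The names `Session`, `stamp_hol_packet`, and `advance_virtual_time` are illustrative only and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Session:
    rate: float          # reserved bandwidth r_i (hypothetical units: bits per time unit)
    start: float = 0.0   # virtual start time S_i
    finish: float = 0.0  # virtual finish time F_i (the time stamp)


def stamp_hol_packet(s, length, v_now, was_empty):
    """Update S_i and F_i for a new HOL packet of `length` bits, as in (1) and (2)."""
    if was_empty:
        # Case 1: queue becomes active, so account for missed service.
        s.start = max(v_now, s.finish)
    else:
        # Case 2: successor of the previous HOL packet in a backlogged queue.
        s.start = s.finish
    s.finish = s.start + length / s.rate
    return s.finish


def advance_virtual_time(v_prev, work, backlogged_starts):
    """Advance V by the work done, then recalibrate to the minimum start time, as in (3)."""
    v = v_prev + work
    if backlogged_starts:
        v = max(v, min(backlogged_starts))
    return v
```

For example, a session with rate 0.5 that becomes active at virtual time 10 with a 4-bit packet would receive start time 10 and finish time 18 under these assumptions.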
B. Conceptual Framework
Generally, a PFQ scheduler maintains a logical queue for each flow or session in the data memory, as shown in Fig. 2(a) . Each queue can be implemented in a linked list with head and tail pointers pointing to its HOL and tail-of-line (TOL) packets. An idle queue may also be needed to maintain the idle space in the data memory. There is a head-pointer memory, which stores each queue's head pointer, and a tail-pointer memory, which stores each queue's tail pointer, as shown in Fig. 2(a) .
When a packet arrives at the system, it is first stored in the corresponding queue, as shown in Fig. 2(b) . The scheduler queue prioritizes all HOL packets, or all eligible HOL packets if a shaper-scheduler is implemented, based on their finish times, as shown in Fig. 2(b) . It then chooses the packet with the smallest finish time to transmit first. This requires fast sorting or searching operations and is one of the challenges in designing a packet scheduler, referred to as design issue I (see Section II-B1).
If an incoming packet is an HOL packet, it (or its session index, actually) is placed in the scheduler queue, or the shaper queue first for the shaper-scheduler case, as shown in Fig. 2(b). In the latter case, if this packet is eligible, it is moved to the scheduler queue. In general, all the HOL packets are first stored in the shaper queue. Only those that are currently eligible can be moved to the scheduler queue. Some efficient mechanism is needed to compare the system virtual time with the start times of the packets in the shaper queue (i.e., performing the eligibility test) and then move eligible packets to the scheduler queue. In the worst case, there may be a maximum of $N$ packets (one per session) that become eligible. This is another challenge, referred to as design issue II (see Section II-B2).
Suppose the scheduler queue selects the HOL packet of queue $i$; it uses $i$ to fetch the head pointer associated with queue $i$ and then reads out the packet using the head pointer, as illustrated in Fig. 2(a). The pointers, if necessary, are updated after being used. There are more design issues, such as handling time-stamp overflow and time-stamp aging problems. The former is discussed in Sections III-B and IV-B3, while the latter is detailed in Section V. Next we only focus on design issues I and II.
1) Design Issue I: Instead of using the sorting approach to find the smallest time stamp, where the time complexity can be $O(\log N)$ for binary sorting or $O(1)$ for parallel sorting [35], we use the search-based approach to reduce implementation complexity. In the search-based approach, time stamps are quantized into integers and are used as the address for the priority queue. Each memory entry may contain a validity-bit ($v$-bit) and two pointers pointing to the head and tail of an associated linked list called the timing queue, as shown in Fig. 3(a). The data structure is called a calendar queue [29], [30]. The $v$-bit indicates whether or not the timing queue is empty. For instance, we use "1" to indicate the nonempty status and "0" otherwise. The timing queue links the indexes, such as $i$, of the sessions whose HOL packets have the same time stamp. Therefore, all the HOL packets are presorted when their corresponding session indexes are stored in the calendar queue. Finding the next packet with the minimum time stamp is equivalent to finding the nonempty timing queue with the smallest address (see following).
In the search-based approach, the time complexity of sorting time stamps is traded with space complexity, which is determined by the maximum value of the time stamp, say $T$ for instance. The value of $T$ is decided by the minimum bandwidth allocation that can be supported in the system. Brute-force linear searching has a time complexity of $O(T)$. It is attributed to reading each entry from address zero to $T-1$ and checking whether it is nonempty. In the worst case, $T$ memory accesses are needed to find the first (only) nonempty entry. Obviously, a tree data structure can be used to reduce the complexity to $O(\log_g T)$, where $g$ is the group size [10, p. 189]. A tree of priority encoders and decoders can be used to implement this data structure [15, Sec. V]. However, the cost would be prohibitively large. The PCAM chip [37] can search for the minimum time-stamp value at very high speed. However, due to a sizable on-chip memory requirement, the PCAM is still too expensive to implement. Motivated by the need for efficient hardware implementation, we propose an RSE [38] (see Section III).
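The calendar queue and its brute-force search can be illustrated with a short software sketch. The class below is a hypothetical model, not the hardware design: it keeps one validity bit and one timing queue (a FIFO of session indexes) per quantized time-stamp value, and `pop_min` scans the validity bits from address zero upward, which is the linear search whose cost motivates the hierarchical approach.

```python
from collections import deque


class CalendarQueue:
    """Software model of a calendar queue indexed by quantized time stamp."""

    def __init__(self, max_stamp):
        self.vbit = [0] * max_stamp                      # 1 = timing queue nonempty
        self.timing = [deque() for _ in range(max_stamp)]

    def insert(self, stamp, session):
        """Link a session index at the tail of the timing queue for `stamp`."""
        self.timing[stamp].append(session)
        self.vbit[stamp] = 1

    def pop_min(self):
        """Brute-force search: scan validity bits from the smallest address."""
        for stamp, v in enumerate(self.vbit):
            if v:
                session = self.timing[stamp].popleft()
                if not self.timing[stamp]:
                    self.vbit[stamp] = 0                 # queue emptied, clear the bit
                return stamp, session
        return None                                      # no backlogged session
```

Sessions with equal time stamps leave in arrival order here; that tie-breaking rule is a choice of this sketch.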
The RSE reorganizes and stores all the $v$-bits in the calendar queue. As shown in Fig. 3, its main function is to find a nonempty timing queue that has the smallest finish time (address) and to output its address. Suppose $F$ is the output of the RSE (read operations). It is used to fetch the queue index, say $i$, as shown in Fig. 3(b), which in turn locates the pointer pointing to queue $i$'s HOL packet, as shown in Fig. 2(a). Suppose $F'$ is the virtual finish time of a new HOL packet. This packet's session index will be stored at the tail of the timing queue addressed by $F'$. In the meantime, the corresponding $v$-bit may be set to 1 (write operations) if the timing queue was empty previously. The selector is used to choose the appropriate memory address during write/read operations. A similar calendar queue based on start times of the HOL packets can be used to find the minimum value of start times (see Section IV-B1), as required for updating the system virtual time in (3).
2) Design Issue II: A shaper-scheduler, such as the WF$^2$Q+ and SPFQ, also needs to maintain another priority queue called the shaper queue for performing the eligibility test, as shown in Fig. 2(b). A shaper-scheduler was proposed in [13] for ATM switches, in which the shaper queue is implemented as a multitude of priority lists. Each list is associated with a distinct value of start time common to all queued packets in this list. Using the search-based approach, we can construct a 2-D calendar queue based on the start times of the queued packets, as shown in Fig. 4, where the start time $S$ and time stamp $F$ are used as the column and row addresses, respectively, and the number of columns is set by the maximum value of the start times. All packets with the same start time are placed in the same column addressed by $S$ and are also sorted according to their time stamps $F$. Hence, each column represents a priority list. Each $v$-bit in a column can be located by its unique address $(S, F)$. Performing the eligibility test is equivalent to using the system virtual time as the column address to find the nonempty column(s), with their addresses ranging from the previous value until the current value of the system virtual time, and then moving packets in these column(s) to the scheduler queue.
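The eligibility test on a 2-D calendar queue can be sketched in software as follows. This is only an illustration under simplifying assumptions: start times are integers, address wrap-around is ignored, and each column is modeled with Python's `heapq` instead of the validity-bit arrays used in hardware. The names `ShaperQueue` and `release_eligible` are hypothetical.

```python
import heapq
from collections import defaultdict


class ShaperQueue:
    """Columns keyed by start time; each column is a heap ordered by finish time."""

    def __init__(self):
        self.columns = defaultdict(list)

    def insert(self, start, finish, session):
        heapq.heappush(self.columns[start], (finish, session))

    def release_eligible(self, v_prev, v_now, scheduler):
        """Move every packet whose start time lies in (v_prev, v_now] to the scheduler heap.

        Returns the number of packets moved, which is unbounded in this naive form.
        """
        moved = 0
        for start in range(v_prev + 1, v_now + 1):
            for item in self.columns.pop(start, []):
                heapq.heappush(scheduler, item)
                moved += 1
        return moved
```

The return value makes the worst case visible: when the virtual time jumps across many nonempty columns, the number of transfers grows with the number of backlogged sessions.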
According to (3), WF$^2$Q+ and SPFQ advance the system virtual time by the amount of work the server performs. It has been shown in [13] that for ATM switches, the system virtual time is advanced by one after the transmission of each cell because a cell is a constant unit of work. Only two cells at most need to be transferred to the scheduler queue. In packet networks, the system virtual time may be advanced by more than one due to the variable size of packets. Refer to Fig. 2(b). In the worst case, a maximum of $N$ eligible packets (one per session) need to be moved from the shaper queue to the scheduler queue. No solutions on hardware implementation have been observed so far. In Section IV, we propose a general shaper-scheduler for both ATM and packet networks, in which we show that only two packets at most need to be transferred to the scheduler queue.
Furthermore, it still remains unclear whether implementing a large number of priority lists as increases would be feasible. Simply extending the concept of the RSE allows us to construct a 2-D RSE [38] to implement the 2-D calendar queue, as described in Section IV-B2.
III. RSE
A. Hierarchical Searching
In a calendar queue, a validity-bit ($v$-bit) is associated with each timing queue, indicating whether this queue is empty ($v = 0$) or not ($v = 1$). Since packets are automatically sorted based on their corresponding locations in the calendar queue, finding the next packet to be transmitted is equivalent to searching for the first nonzero $v$-bit in the calendar queue. The key concept of hierarchical searching [38] is extended and generalized from the one in the PCAM chip by dividing the total $T$ validity-bits in the basic searching into multiple groups, which forms a tree data structure, as shown in Fig. 5, where $T$ is the maximum value of the time stamp $F$. Each group consists of a number of $v$-bits, so another bit string can be constructed at the upper level with its length equal to the number of groups at the bottom level. Each bit at the upper level represents a group at the bottom level with its value equal to the logical OR of all the bits in the group. Further grouping can be performed recursively until the new string can be placed in a register.
Suppose $L$ levels are formed from the original $T$-bit string. There are $M_l$ bits at level $l$, and each of its groups has $g_l$ bits, where $l = 0, 1, \ldots, L-1$. Another $M_{l-1}$-bit string can be constructed at upper level $l-1$. So

$$M_{l-1} = \frac{M_l}{g_l}, \quad \text{for } l = 1, \ldots, L-1 \tag{4}$$

with $M_{L-1} = T$ and $M_0 = g_0$.

Let us denote the $M_l$-bit string at level $l$ as $b^l_0 b^l_1 \cdots b^l_{M_l - 1}$, where $b^l_j \in \{0, 1\}$ and $0 \le j < M_l$, and the $g_l$-bit string of the $i$th group as $b^l_{i g_l} \cdots b^l_{(i+1) g_l - 1}$. We have $b^{l-1}_i = b^l_{i g_l} + b^l_{i g_l + 1} + \cdots + b^l_{(i+1) g_l - 1}$, where $0 \le i < M_{l-1}$ and "+" represents the logical OR operator. Fig. 5 illustrates such a data structure. The string at level $l$ can be stored in a RAM with the size of $M_{l-1} \times g_l$, while the string at the top level is stored in an $M_0$-bit register. Denote the $\log_2 T$-bit address as $F$, where we assume $T$ is a power of two. The address used to locate any one of the $M_l$ bits at level $l$ is $\log_2 M_l$ bits wide, where $l = 0, 1, \ldots, L-1$. Hence, it follows from (4) that

$$\log_2 M_l = \log_2 M_{l-1} + \log_2 g_l \tag{5}$$

Equation (5) illustrates the principle of addressing in the hierarchical searching. That is, the $\log_2 g_0$ most-significant bits (MSB's) of $F$ should be used at level 0. Then at level $l$, the complete address used at upper level $l-1$ (which is $\log_2 M_{l-1}$ bits wide) will be used to locate the proper $g_l$-bit word in its memory. Another $\log_2 g_l$ MSB's following the previous $\log_2 M_{l-1}$ MSB's are extracted from $F$ and are used to locate the proper bit in the $g_l$-bit word that has just been identified.

Priority encoder and decoder are two basic modules in the RSE. Each level requires a priority decoder with $\log_2 g_l$-bit input and $g_l$-bit output, which can be used to write/reset any bit of a $g_l$-bit word stored in its RAM. For that we can simply OR the $g_l$-bit outputs from both the decoder and the RAM (write operation) or AND the inverted output of the decoder with the RAM output (reset operation) and then write it back to the memory. Each level also requires a priority encoder with $g_l$-bit input and $\log_2 g_l$-bit output, which can be used to search the first MSB (equal to one) of any $g_l$-bit word stored in its RAM and provide $\log_2 g_l$ bits of the $\log_2 T$-bit time stamp for the first nonzero $v$-bit in the original string (search operation). Since the search works top-down, according to (5), the $\log_2 T$-bit time stamp should be the concatenated result of outputs from all encoders, as illustrated in Fig. 6.
The searches at all levels need to be carried out sequentially based on (5). The time to search for the first nonzero $v$-bit in the original string is decided by finding the first nonzero bit at each level. It needs one register reading at level zero and $L-1$ memory accesses at the other levels, where $L$ is the number of levels, which is independent of the number of flows in the system (PCAM is a special case of the RSE with $L = 2$), as is the time to update (i.e., write/reset) each of the $v$-bits. The total memory requirement is $\sum_{l=1}^{L-1} M_l$ or $\sum_{l=1}^{L-1} \prod_{k=0}^{l} g_k$ bits, according to (4), where $M_l$ is the number of bits at level $l$ and $g_k$ is the group size at level $k$. As an example, if $T = 32$ K and the group size is 32 at every level ($L = 3$), the required memory is $32 \times 32$ bits at level 1 and $32^2 \times 32$ or 1 K $\times$ 32 bits at level 2 (total of 33 792 bits).
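The memory figures in this example can be checked with a small calculation. The helper below is a sketch that assumes a uniform group size at every level and that each level holds the lower level's bit count divided by the group size; the function names are illustrative.

```python
def rse_level_sizes(total_bits, group_size):
    """Return the number of bits per level, top level (register) first."""
    sizes = [total_bits]
    while sizes[0] > group_size:
        if sizes[0] % group_size:
            raise ValueError("group_size must divide every level evenly")
        # The upper level holds one OR-summary bit per group of the level below.
        sizes.insert(0, sizes[0] // group_size)
    return sizes


def rse_ram_bits(total_bits, group_size):
    """RAM needed for all levels below the top-level register."""
    return sum(rse_level_sizes(total_bits, group_size)[1:])


levels = rse_level_sizes(32 * 1024, 32)
print(levels)                       # bits at levels 0, 1, 2
print(len(levels) - 1)              # memory accesses per search
print(rse_ram_bits(32 * 1024, 32))  # total RAM bits
```

With 32 K time-stamp values and a group size of 32 this gives three levels of 32, 1024, and 32 768 bits, two memory accesses per search, and 33 792 bits of RAM, in agreement with the figures quoted in the text.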
B. Time-Stamp Overflow
According to (2), the maximum value of $F$ (i.e., $T$) is decided by the maximum packet length over the minimum allocated bandwidth supported in a real system. $V(t)$ increases monotonically with time, and on some occasions the calculated time stamp $F$ may overflow, i.e., exceed its maximum value due to the finite number of bits of $F$ (recall that we assumed $T$ is a power of two). To overcome this problem, we can use two memory banks in the RSE to store the $v$-bits of the nonoverflow and overflow time stamps, respectively, as shown in Fig. 7, and use $\log_2 T + 1$ bits to record $F$. A separate bit called a zone indication bit is used to indicate the zone where the new arrival's time stamp is to be stored, which is actually the MSB of the time-stamp value, and used to indicate overflow. The definition of overflow here is different from the traditional one. We use a current zone bit (CZ) to indicate the zone of the packets that are currently being served. Whenever the MSB of a calculated $F$ (i.e., its overflow bit) has the same value as the CZ, the $F$ is defined as nonoverflow; otherwise, it is defined as overflow. Thus, when searching for the $v$-bits in the RSE, the CZ facilitates the RSE to choose the first nonzero $v$-bit from an appropriate zone. When all the $v$-bits in the current zone are zero and there is at least one nonzero $v$-bit in the other zone, the CZ will be toggled after sending a packet from the other zone, indicating the service zone is flipped. The time stamp is nondecreasing within each zone; the system virtual time with recalibration is at least equal to the minimum start time of the HOL packet among all currently backlogged sessions, and thus is also nondecreasing. So, new HOL packets with their time stamps derived from (2) will be placed either in the current zone or in the other zone that is now regarded as the overflow zone, due to time-stamp overflow, as indicated by CZ, ensuring the correct sequence of packet transmission. As shown in Fig. 7, when all packets "a," "b," "c," and "d" are transmitted and the first nonzero $v$-bit is found in the other zone, the CZ bit is toggled from zero to one. From then on, packets in Zone 1 will be scheduled before those in Zone 0. The searching in the RSE chip alternates between the two zones with CZ $= 0$ and CZ $= 1$. As long as the MSB of the calculated $F$ does not change more than once when serving in the current zone, no packet out-of-sequence problem will occur.
C. Design of the RSE
Fig. 8 shows a block diagram of the RSE that can handle 32 K ($2^{15}$) time-stamp values. The RSE consists of a controller and a RAM that is divided into two banks. The input data (15 bits) are written to the RSE with the WRITE signal through the bus IN[14:0] to set the validity-bit. The SEARCH signal is asserted when searching for a nonzero $v$-bit in the RSE. If the nonzero $v$-bit that is closest to the top of the list is found, the HIT signal is asserted, and its corresponding time-stamp value appears at the output bus, OUT[14:0]. The MODE signal is used to determine whether the RSE is configured to one 32 K-bit zone or two 16 K-bit zones (to deal with the time-stamp overflow problem). The CZ signal is used to indicate the zone of the $v$-bits that is currently being searched. At the initialization, the INIT signal is asserted and all of the data (i.e., $v$-bits) in the REG and RAM are set to zero. These I/O signals of the RSE are summarized in Table I. To support a larger time-stamp value, we can increase the group size, which in turn increases the width of the register and the size of the memory.
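The two-zone handling of time-stamp overflow from Section III-B, which the MODE and CZ signals expose, can be sketched in software as follows. This is a simplified model under stated assumptions: each zone is a Python set standing in for a bank of validity bits, `search` removes the stamp it returns (in the real design the bit is reset only when the timing queue empties), and CZ is toggled at selection time. The class name `TwoZoneSearch` and its methods are hypothetical.

```python
class TwoZoneSearch:
    """Model of a search engine split into two zones selected by the stamp's MSB."""

    def __init__(self, zone_bits=14):
        self.zone_bits = zone_bits      # bits below the zone indication bit
        self.zones = [set(), set()]     # stand-ins for the two banks of validity bits
        self.cz = 0                     # current zone bit

    def is_overflow(self, stamp):
        """A stamp is 'overflow' when its MSB differs from the current zone bit."""
        return ((stamp >> self.zone_bits) & 1) != self.cz

    def write(self, stamp):
        zone = (stamp >> self.zone_bits) & 1
        self.zones[zone].add(stamp & ((1 << self.zone_bits) - 1))

    def search(self):
        """Return and remove the next stamp in service order, or None if empty."""
        if not self.zones[self.cz]:
            if not self.zones[self.cz ^ 1]:
                return None
            self.cz ^= 1                # current zone drained: flip the service zone
        offset = min(self.zones[self.cz])
        self.zones[self.cz].remove(offset)
        return (self.cz << self.zone_bits) | offset
```

With `zone_bits=4`, stamps 14 and 15 are served from Zone 0 before stamp 17 in Zone 1; once Zone 1 becomes the current zone, a later stamp of 2 is treated as overflow and is served only after Zone 1 drains.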
D. RSE Operations
Consider that the RSE in Fig. 8 is configured to accommodate up to 32 K different time-stamp values. The group size is set to 32 bits. Thus, the number of total searching levels is $\log_{32} 32\,\mathrm{K}$, or 3. The $v$-bits of level 0 are stored in REG I, while those of levels 1 and 2 are stored in two banks of the RAM, Bank I ($32 \times 32$) and Bank II (1 K $\times$ 32), respectively.
1) Write-in Operation:
When a new packet arrives, its 15-bit time stamp is divided into three parts, 5 bits each. The write-in operation sets the validity-bits properly in the register and memory banks. This operation consists of two phases (see Fig. 9 ).
Phase 1) This operation sets the validity-bits at levels 0 and 1, which can be done at the same time because the validity-bits of levels 0 and 1 are stored at different places.

The RSE write operation can be reduced to one phase by adding one extra OR gate and one extra decoder. Phase 2 can then be performed in parallel with Phase 1, since the 5 bits from IN[4:0] can be decoded using the extra decoder. The total memory accesses can be reduced to two at the cost of one more OR gate and one more decoder.
2) Reset Operation: Whenever a session queue becomes empty, its validity-bit in the RSE should be reset to zero. The reset operation is similar to the write-in operation, except that the reset operation starts from the bottom level and asserts the RESET signal to HIGH. This operation consists of three phases (see Fig. 10 ).
Phase 1) This operation resets the validity-bit at level 2, as shown by the dashed line in Fig. 10(a). The 10 bits of IN[14:5] are extracted and used as the address to read out the validity-bit information (32 bits) from RAM Bank II. Meanwhile, 5 bits from IN[4:0] are decoded to a 32-bit word, which is then inverted and AND'ed with the 32-bit output from RAM Bank II. The newly updated validity-bit information is written back to RAM Bank II at the same location. In addition, the new 32 validity-bits are OR'ed. If the result is zero, meaning that all validity-bits in this group are zero, we proceed to Phase 2. Two memory accesses are needed in this phase.

Phase 2) This operation resets the validity-bit at level 1 of the corresponding group at level 2, as shown by the dashed line in Fig. 10(b). The 5 bits of IN[14:10] are used as the address to read the old validity-bit information from RAM Bank I. At the same time, the 5 bits of IN[9:5] are decoded and then inverted. These two results are AND'ed to obtain the new validity-bit information that is written back to RAM Bank I at the same location. In addition, the new validity-bits are OR'ed. If the result is zero, meaning that all validity-bits in this group are zero, we proceed to Phase 3. Two memory accesses are needed in this phase.

Phase 3) This operation resets the validity-bit at level 0 of the corresponding group at level 1. In Fig. 10(c), the heavier line indicates the data path for resetting the validity-bit at level 0. The first 5 bits of the time stamp (IN[14:10]) are extracted, decoded, inverted, and AND'ed with the old value in REG I. The result is written back to REG I.
3) Search Operation:
The search operation consists of three phases (see Fig. 11 ).
Phase 1) If there is at least one bit in REG I that is set to one, the HIT signal is asserted indicating a match is found, and the OUT[14:0] signal is valid. The 32-bit data from REG I is then encoded by the priority encoder into a 5-bit output (OUT[14:10]), which is then written to the upper part of a register. The data path is shown by a solid line in Fig. 11.

Phase 2) Following Phase 1, the output of OUT[14:10] is sent to RAM Bank I as the address to read out the corresponding validity-bit information (32 bits). The 32-bit word accessed from RAM Bank I is then encoded by the priority encoder. Its 5-bit output (OUT[9:5]) is written to the middle part of the register. Its data path is shown by a dashed line in Fig. 11. One memory access is needed in this phase.

Phase 3) The outputs from Phases 1 and 2 are combined to form OUT[14:5], which is used as the address to read out the validity-bit information from RAM Bank II. The 32-bit word accessed from RAM Bank II is then encoded by the priority encoder. Its 5-bit output (OUT[4:0]) is written to the lower part of the register. The data path is shown by a bold solid line in Fig. 11. One memory access is needed in this phase. The register content, OUT[14:0], is the final result, indicating that the location of the first validity-bit is found.
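The write-in, reset, and search operations for the 32 K configuration can be modeled in a few lines of software. The sketch below is not the hardware design: it uses Python integers as 32-bit words, assumes that bit index 0 of each word corresponds to the smallest time stamp (so the priority encoder returns the lowest set bit), and performs the write in a single step. The class name `RSE32K` is hypothetical.

```python
class RSE32K:
    """Three-level search over 2**15 validity bits with a group size of 32."""

    def __init__(self):
        self.reg = 0               # level 0: REG I (32 bits)
        self.bank1 = [0] * 32      # level 1: RAM Bank I (32 words x 32 bits)
        self.bank2 = [0] * 1024    # level 2: RAM Bank II (1 K words x 32 bits)

    @staticmethod
    def _encode(word):
        """Priority encoder: index of the lowest set bit (assumed highest priority)."""
        return (word & -word).bit_length() - 1

    def write(self, stamp):
        """Set the validity bit of `stamp` and its summary bits at levels 1 and 0."""
        hi, mid, lo = stamp >> 10, (stamp >> 5) & 31, stamp & 31
        self.reg |= 1 << hi
        self.bank1[hi] |= 1 << mid
        self.bank2[stamp >> 5] |= 1 << lo

    def reset(self, stamp):
        """Clear bottom-up, moving to the next level only if the group became all zero."""
        hi, mid, lo = stamp >> 10, (stamp >> 5) & 31, stamp & 31
        self.bank2[stamp >> 5] &= ~(1 << lo)       # Phase 1: level 2
        if self.bank2[stamp >> 5]:
            return
        self.bank1[hi] &= ~(1 << mid)              # Phase 2: level 1
        if self.bank1[hi]:
            return
        self.reg &= ~(1 << hi)                     # Phase 3: level 0

    def search(self):
        """Return the smallest stamp whose validity bit is set, or None (no HIT)."""
        if not self.reg:
            return None
        hi = self._encode(self.reg)                       # Phase 1: OUT[14:10]
        mid = self._encode(self.bank1[hi])                # Phase 2: OUT[9:5]
        lo = self._encode(self.bank2[(hi << 5) | mid])    # Phase 3: OUT[4:0]
        return (hi << 10) | (mid << 5) | lo
```

A search touches the register once and each bank once, regardless of how many validity bits are set, which mirrors the fixed number of memory accesses described above.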
IV. GENERAL SHAPER-SCHEDULER
A. Slotted Updates on System Virtual Time
According to (3), the system virtual time $V(t)$ in shaper-schedulers, such as WF$^2$Q+ and SPFQ, is advanced by the amount of work the server performs during the time interval $(t, t+\tau]$. Conceptually, the length of each packet serves as a measure of work in a packet system, so $V(t)$ could be advanced quite differently from time to time due to variable packet length. However, for efficient hardware handling, packets are usually divided into fixed-length segments in IP routers and packet switches before they are stored in the buffer and forwarded to the next node(s). As a result, a packet scheduler can be viewed as a slotted (synchronous) system. A time slot corresponds to the time (or number of system clock cycles) needed to transmit a segment at the link capacity $r$. Assuming a fixed-rate server that always has data to transmit and taking $\tau$ in (3) to be one time slot, the work that the packet server performs in a time slot is a constant, which we can normalize to one, similar to that for ATM switches [13].
Without introducing much inaccuracy (explained next), we assume that packet arrival and departure events occur at discrete time instants. Imagine that a packet is transmitted segment-by-segment (bit-by-bit, actually) on the link. We can define segment departure events accordingly. This is critical to understand the following mechanism. We may update in every time slot, according to (3) . The size of time stamps determines both the range of supportable rates and the accuracy with which those rates may be specified. The slotted mechanism allows time stamps to be represented as integers, instead of the floating-point numbers required in the general implementation. Bandwidth reservation of a session can be expressed in units of segment per slot.
If the updated value of is (modulo ), all packets in the column of the shaper queue addressed by , if any, become eligible, as shown in Fig. 4 . We say this column is eligible for brevity of description. Here, by using the slotted mechanism we can achieve the same goal as that in [13] . That is, only the following two packets at most need to be transferred to the scheduler queue in each time slot: 1) the packet with the smallest time stamp in column ; 2) the packet with the smallest time stamp in column , if any, if the packet being transmitted or just sent out has a start time value of (modulo ). Both packets are referred to as the first packet in the corresponding column. Based on the previous operations, the first packet of each nonempty and eligible column is moved to the scheduler queue, prioritized by its finish time. Since each eligible column has at least one packet in the scheduler queue based on the previous operations, the remaining packet(s) in the same column, if any, will all be moved to the scheduler queue eventually.
According to (3), the system virtual time may be advanced by more than one when several columns in the shaper queue are empty. These empty columns fall between the column pointed to by the previous system virtual time and that pointed to by the minimum value of start times among all currently nonempty sessions, which is the current system virtual time based on (3). This can help to immediately find the next nonempty and eligible column so that the work-conserving property of the packet scheduler is preserved [10]. Still, at most two packets need to be transferred to the scheduler queue. Since this mechanism can be used in both ATM and packet switches, we call it the general shaper-scheduler.
Fig. 12 illustrates the basic operations of our slotted mechanism when packets enter and leave the scheduler queue. The length of a time slot is regarded as one. From time to , there are four arriving HOL packets, B, C, X, Y, and one departing packet, A. Since the server is busy all the time, the system virtual time is updated at each of these time instants (step 1), as indicated by the circled "1" in Fig. 12. Since packet B arrives at time , its virtual start time and virtual finish time are also computed (steps 2 and 3), as indicated by the circled "2" and "3" in Fig. 12, respectively. Although packet C arrives in the time interval , the server assumes it receives this packet at time , as illustrated in Fig. 12. This also applies to packet X. At time , no HOL packet arrives, but a segment departs, and only the system virtual time is updated. Packet A leaves the system at time ; packet Y could be the successor to packet A in the same column or a packet from another eligible column with the smallest finish time, so it is loaded (as an arrival) to the scheduler queue.
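The per-slot update of the system virtual time described above can be illustrated with a short sketch. It assumes that (3) takes the form "advance by the normalized work of one slot, but never fall behind the minimum start time among backlogged sessions", which is how the surrounding text describes it. The function name `advance_virtual_time` and its parameters are hypothetical, and modulo wrap-around of the integer time stamps is deliberately ignored.

```python
def advance_virtual_time(v, backlogged_start_times, work=1):
    """Return the system virtual time after one time slot.

    Assumed form of (3): V <- max(V + work, min start time of backlogged
    sessions). `work` is the normalized work per slot (one in the text).
    Wrap-around of finite-width time stamps is not modeled here.
    """
    v_next = v + work
    if backlogged_start_times:
        # Jump over empty columns so that the next nonempty column
        # becomes eligible immediately.
        v_next = max(v_next, min(backlogged_start_times))
    return v_next


# Columns 4, 5, and 6 are empty, so the virtual time jumps from 3 to 7.
print(advance_virtual_time(3, [7, 9]))
# A backlogged session already has an eligible packet, so advance by one.
print(advance_virtual_time(3, [2, 5]))
```

In the architecture of the paper the minimum start time is supplied by a dedicated RSE rather than by a linear scan; `min()` is used here only to keep the sketch short.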
The slotted mechanism basically is an approximation. It can simplify hardware implementation. However, it also introduces some inaccuracy by assuming packet departures/arrivals at discrete times. Since not every packet can be divided into an exact number of segments due to variable packet lengths, the last segment of a packet can have fewer bits of data than a complete segment. We call it an incomplete segment. The server is not work conserving since some bit times within a time slot for transmitting an incomplete segment are wasted, regardless of other HOL packets in the system. As a result, a physical link could be underutilized. Underutilization also occurs when an HOL packet arrives in a time slot but has to wait for the end of this slot before it is processed. However, the maximum time that the server is idle while the system is not empty is bounded by one slot time. For instance, suppose the system is empty at time , there is no packet B, and packet C arrives in , as shown in Fig. 12. The server will not handle packet C until time . Its idle time is at most one slot time.
The implementation architecture basically consists of a shaper queue and a scheduler queue, as shown in Fig. 13(a). The shaper queue uses a 2-D RSE (see Section IV-B2) to find a valid -bit, i.e., a nonempty timing queue with the smallest time stamp in an eligible column and sends out this time stamp (such as ) and the first session index (such as ) in the corresponding timing queue to the scheduler queue. The scheduler queue uses an RSE to find the smallest finish time among all the backlogged sessions in itself and sends out the first session index (such as ) in the corresponding timing queue. This index will be used for getting the address of the corresponding packet.
The implementation architecture also includes a CPU, another priority queue based on start times, called start time queue, as shown in Fig. 13(b) , and a regular queue, called finish-time queue, that stores finish times of HOL packets of every session. The start time queue uses another RSE to find the smallest start time among all backlogged sessions in the system, i.e., for updating according to (3) . This value, denoted by in Fig. 13(b) , is provided as the output of the RSE. The finish time queue uses session index as the address for direct access to the finish time of each HOL packet. Since there are a total of sessions in the system, the queue has entries. The stored time stamp (such as ) is used by the CPU to compute the virtual start time of a new HOL packet (such as for session ) according to (1) . The basic operations can be briefly described as follows. At every time slot the start time queue provides the minimum start time to the CPU, and the CPU keeps updating the system virtual time, denoted by ; the shaper queue performs the eligibility test and sends those eligible packet(s), if any, and their time stamps (such as and ) to the scheduler queue. When a new HOL packet (such as ) comes, its predecessor's time stamp (such as ) is first fetched from the finish-time queue and sent to the CPU, and then the CPU computes the virtual start time and virtual finish time for this packet. Afterward this packet (its session index , actually) is placed in the shaper queue. If it is eligible, this packet may be moved to the scheduler queue immediately.
On the other hand, when an HOL packet is chosen to be transmitted (its session index, say , is chosen), its time stamp (such as , while denotes its start time) is stored in the finish time queue. When it leaves the system, the scheduler queue removes its session index and selects another HOL packet, if any, to serve. In the meantime, the queue is checked. If this queue is empty, nothing needs to be done on this queue. Otherwise, this session can have another packet (regarded as a new HOL packet arrival) to join the shaper queue (or the scheduler queue), as mentioned earlier.
For example, in the shaper queue and are used as a column address to the 2-D RSE for the eligibility test, while and are used as row and column addresses to the 2-D RSE for setting the corresponding -bit ( is also used as the address to store the session index into the corresponding timing queue), as illustrated in Fig. 13(a) . Next, we discuss the 2-D RSE and its time-stamp overflow control.
2) The 2-D RSE: Using the hierarchical searching concept, we can construct a simple architecture, the 2-D RSE [38] , to accommodate the calendar queue, as shown in Fig. 4. There are groups at level 0. For simplicity, we assume each of these groups has bits. Each group at level 0, equivalent to the register at level 0 of the RSE, is associated with a column in the calendar queue shown in Fig. 4 and represents all -bits in the column. The RSE grouping operation is then applied recursively to the -bit string of each column and the total level . The total bits at level 0 can be placed in a memory, as shown in Fig. 14 . Except for this extra memory, the 2-D RSE architecture is similar to that of the RSE.
Fig. 14 also shows how to find out the first -bit in a column of the calendar queue given its . Since the system virtual time maintained as an integer in the integrated shaper-scheduler is advanced by one after the transmission of each packet, only two packets at most are needed to be transferred to the scheduler afterward [13] : the packet at the head of a column (I), with its associated start time equal to the updated system virtual time, and the packet at the head of the column (II), from which the transmitted packet departed. The 2-D RSE will be searched twice at most in each time slot. In each searching cycle, only -bits in columns I and II will be selected to participate in the search. Since packets in the same column are arranged in the monotonic order of their time stamps, the 2-D RSE can easily find any of the two eligible packets by identifying the first -bit in the proper column.
Therefore, the time to search the first -bit in a column given its is decided by finding the first -bit at each level and needs memory accesses, which are independent of the number of sessions in the system, as is the time to update each of the -bits. As an example, if K and , then .
3) Time-Stamp Overflow:
Both the time stamp and the start time in the 2-D RSE can have an overflow problem. To solve it, we can follow the method described in Section III-B. Since according to (2) , where we neglect the super- and subscripts without confusion, and corresponds to , the range of is bounded by the maximum of that corresponds to maximum packet length over minimum allocated bandwidth supported in a real system. We can choose such that the maximum of that is rounded to an integer in implementation is equal to . So ranges from zero to . should also be at least one because zero packet length or an infinite reserved bandwidth that makes zero is impossible, conceptually and practically.
Next, we discuss two methods to control the time-stamp overflow. In theory, should always be greater than due to with ; in implementation, each of them is represented with finite bits, so even is not overflow, and could be overflow after adding to . As a result, whether is overflow or not should be defined with respect to . Henceforth, unless otherwise stated, we always discuss , and within a real system. Consider . Two banks of memory are needed to solve the overflow of , as shown in Fig. 15 ; within each of which there are two zones for : overflow and nonoverflow zones. With respect to is defined as being nonoverflow if or being overflow if . Since will never be equal to . The bits with their row/column addresses are dummy in the calendar queue (see Fig. 4 ) and are never used, as illustrated by shaded blocks in Fig. 15 . They simply serve as a zone boundary dividing each bank of memory into two zones for . Thus, there are a total of four zones , and each zone has a triangle form. It follows that to provide equally sized zones in each bank of memory, as shown in Fig. 15 , with each zone containing -bits. A 2-bit zone indication and a 2-bit current zone indication (CZ) are needed. Recall that we assume . We use an -bit word to represent as well as with the MSB defined as its overflow bit. The leftmost bit of or CZ corresponds to the 's overflow bit, while the rightmost bit corresponds to the 's overflow bit. Suppose CZ initially, indicating the current zone to be serviced, as shown in Table II . Whenever the 's MSB is equal to one, we define it to be overflow, indicating zones 10 and 11 ; otherwise, we define it to be nonoverflow, indicating zones 00 and 01 . Whenever the 's MSB is different from the 's MSB, we define it to be overflow (with respect to ), indicating zones 01 and 10 ; otherwise, we define it to be nonoverflow, indicating zones 00 and 11 . The four zones are served in order of . 
The current zone indication CZ will be flipped according to the arrowed sequence in Table II only after sending a packet from another zone, indicating the service zone is changed. As long as the overflow bit of does not change more than once when the current zone is being served, as described in Section III-B, no packet out-of-sequence problem will occur. This is also true for since . All -bits in zones 00 and 01 satisfy the conditions that and , respectively. This is also the case with zones 11 and 10 . Fig. 15 illustrates that in each bank of memory, we need to mask off the -bits of each column that do not belong to the current zone to correctly perform the memory Read/Write/Reset operations. Consider CZ . When we search the column of top-down for a -bit with the smallest , only those bits with are possible candidates. The others with should be masked off. If CZ , then those bits with should be masked off. This is also the case for zones 11 and 10 . We can XOR the two bits of CZ and use the result to decide the proper masking operation when we search a column of generally. A result of zero (zones 00 and 11 ) indicates that the -bits in the column with will be masked off; otherwise (zones 01 and 10 ) the -bits with will be masked off. To perform the masking operations, we need to find the boundary between the region that is to be masked and the region not to be masked, using dividers or a table where the precomputed results are stored. Extra priority decoders and gates are also needed.
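The four-zone classification can be sketched as follows. This is an illustration under our reading of the text: the leftmost zone bit is taken to be the MSB of the start time and the rightmost bit the MSB of the finish time, so the two MSBs differ exactly in zones 01 and 10. The names `zone_of`, `finish_overflows_wrt_start`, `start`, `finish`, and `n_bits` are hypothetical.

```python
def zone_of(start, finish, n_bits):
    """Return the 2-bit zone (0b00..0b11) of a (start, finish) pair.

    Both values are n_bits wide, and the MSB of each is its overflow bit.
    Assumption: left zone bit = MSB of start, right zone bit = MSB of finish.
    """
    s_msb = (start >> (n_bits - 1)) & 1
    f_msb = (finish >> (n_bits - 1)) & 1
    return (s_msb << 1) | f_msb


def finish_overflows_wrt_start(zone):
    """XOR the two zone bits: 1 for zones 01 and 10, 0 for zones 00 and 11."""
    return (((zone >> 1) ^ zone) & 1) == 1


# 4-bit example: start = 14, finish = (14 + 4) mod 16 = 2 lands in zone 10.
z = zone_of(14, 2, n_bits=4)
print(format(z, "02b"), finish_overflows_wrt_start(z))
```

The same XOR applied to the two bits of CZ gives the single signal that the text uses to select the masking operation when a column is searched.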
A more desirable alternative to the previous approach is to use instead of , as shown in Fig. 16. There are still zones 0 and 1 for . The row is dummy, as explained earlier. The attractive point of this method is that we need only consider the overflow of since has no overflow problem. Besides, no masking operations are needed because each -bit in a column is ranked unambiguously by . Regardless of that, each bank of memory can also be divided into two areas for : overflow with and nonoverflow with . One bit is needed for both and CZ. The tradeoff is that one extra adder or adding operation is needed to recover the value of based on (2). Besides, the scheduler should take care of those eligible packet(s) that are read out and have their 's overflow. In other words, the handling of 's overflow is actually moved from the 2-D RSE in the shaper queue to the RSE in the scheduler queue, as shown in Fig. 13(a).
V. TIME-STAMP AGING PROBLEM
Recall that, as shown in Fig. 13(b) , when a packet of session departs, its finish time is stored in a lookup table (i.e., finish time queue) for later use according to (1) . Information other than for session , such as and , can also be stored in the same location (addressed by ). Later, when a new packet of this session arrives at the head of its queue and immediately becomes the HOL packet, the needs to be read out and compared with the current system virtual time according to (1) to decide the new . However, since the system virtual time is also represented with finite bits in implementation, it can overflow in the same way as the time stamp, as explained earlier. It is impossible to decide which is greater without any previous history or certain constraints, especially when the queue has been empty for some time. We call this the aging problem of the time stamp.
The becomes obsolete whenever the system virtual time exceeds it. Recall that in Sections III-B and IV-B3, an extra bit is used to indicate two different zones of the time stamp for its overflow control. By the same token, we can introduce more than one bit to record a number of overflow events of the system virtual time, as well as the time zone, to where the system virtual time and the stored belong. Besides, we also need a purging mechanism [39] that should run fast enough to check each entry and purge all the obsolete ones before we lose track of the history of the system virtual time due to finite bits. The system virtual time is updated per time slot based on (3), causing a number of entries to be checked. In a time slot, the could be increased by a maximum of according to (3) , which is the maximum value of the start time as shown in Fig. 4 . As an example, the could change from zero to for some backlogged session , while all the other sessions are empty, generating a maximum of entries that would need to be checked. Each purging operation has at most two memory accesses: one is to read the , which is always needed, and the other is to mark it with a write operation, if the is obsolete. Due to the limited memory speed, it may not be possible to perform many purging operations in a time slot when is large. However, since the can overflow at most once in every time slot, we use a multibit counter to keep track of its overflow in a period of multiple time slots so it is possible to purge all obsolete entries by carefully designing a purging scheme.
Next, we introduce a periodic purging mechanism that is required to check entries in the lookup table in consecutive time slots with (6) should be no less than , rather than , because when all session queues are empty (the system is empty), all entries in the lookup table become obsolete. As a result, the system virtual time is simply reset to zero. Since in the worst case there could be purging operations, plus regular memory accesses (write after current HOL packet of session departs and read when a new packet from session becomes HOL) in the time slots, the value of must thus satisfy slot time memory cycle (7) which guarantees that all entries can be purged, if obsolete, within time slots. is the maximum number of memory accesses during this time. As an example, let a slot time be equal to that needed to transmit a 64-byte packet segment at a speed of 10 Gbit/s, i.e., 51.2 ns, and the memory cycle is 10 ns, K. According to (6), we choose . Then follows based on (7). There will be purging operations (3.12 memory accesses) per time slot.
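The per-slot memory budget in this example can be checked with a few lines of arithmetic. The sketch assumes two regular memory accesses per time slot (one write when an HOL packet departs and one read when a new packet becomes HOL), which is our reading of the text and reproduces the quoted 3.12 memory accesses; the constant names are hypothetical.

```python
import math

SEGMENT_BYTES = 64             # packet segment size from the example
LINK_RATE_BPS = 10e9           # 10 Gbit/s link
MEMORY_CYCLE_NS = 10.0         # memory cycle time from the example
REGULAR_ACCESSES_PER_SLOT = 2  # assumption: one write plus one read per slot
ACCESSES_PER_PURGE = 2         # at most one read and one write per purge

# Time needed to transmit one segment at the link rate, in nanoseconds.
slot_time_ns = SEGMENT_BYTES * 8 / LINK_RATE_BPS * 1e9

# Memory accesses that fit into one time slot.
accesses_per_slot = slot_time_ns / MEMORY_CYCLE_NS

# Accesses left for purging after the regular accesses are served.
purge_accesses_per_slot = accesses_per_slot - REGULAR_ACCESSES_PER_SLOT

# Worst-case number of purging operations that fit into one slot.
purges_per_slot = purge_accesses_per_slot / ACCESSES_PER_PURGE

print(f"slot time: {slot_time_ns:.1f} ns")
print(f"purging accesses per slot: {purge_accesses_per_slot:.2f}")
print(f"purging operations per slot: {purges_per_slot:.2f}")
```

Under these assumptions roughly 1.56 purging operations fit into each slot, so the number of slots needed to sweep the whole table grows linearly with the number of entries.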
To ensure unambiguous comparisons between the system virtual time and each stored time stamp in any of the time slots, the multibit counter must be able to record at least times of overflow for the system virtual time, as will be explained later. We therefore introduce a -bit counter variable for the system virtual time , which is increased by one once is overflow. To facilitate the purging operations, we also define for each entry an obsolete bit and a similar -bit counter variable . The obsolete bit is similar to the timeout bit in [20] . Fig. 17 shows the format of an entry in the table. Similar to the overflow bit introduced for the time stamp in Sections III-B and IV-B3, both and can be regarded as the time-zone indicator for and , respectively. So the and can be compared directly if both are in the same time zone. Otherwise, simply comparing their time-zone values indicates which is larger. This is the basic idea of the purging scheme.
Whenever we store in the table at time slot , as shown in Fig. 17 , we set its -bit to zero since it is not obsolete at that time, and we store its time zone in the field of the entry.
could be because even if and are in the same time zone, may be overflow and is one zone ahead of , according to (2) , as indicated by in contrast to . Thus if otherwise.
As mentioned earlier, since cannot be overflow more than once when the system is serving the current zone, there will be no packet out-of-sequence problem. Suppose all sessions are visited in the order of their indexes as shown in Fig. 17 , and is examined at time slot . Fig. 18 is a flow chart of the periodic purging algorithm, which is explained as follows. If the -bit is one, indicating that is obsolete at this time, we are done for this entry and proceed to for the next; otherwise, we compare the with the current value of . Note that . There are three cases, described as follows.
Case 1) If , indicating both and are in the same time zone, then they are compared directly. If is obsolete from this time on and its -bit is set to one; otherwise, without further operations we proceed to .
Case 2) If or with is one zone ahead of from (8) and is not obsolete. We proceed to for the next step. The latter subcase occurs when from (8).
Because sessions can be checked within time slots [according to (6) ], it guarantees that can be checked (and purged, if eligible) at least once since time , when it is stored until time inclusive with . Remember is not obsolete at time when it was stored. The can be overflow at most times during the period , while the can be overflow (wrapped around) at most once. The critical question is whether under the previous assumptions it can be guaranteed that the exceeding the , if wrapped around, will never be above . If it is true, this case can never be misinterpreted as Case 1 or 2. Otherwise, unsuccessful purging operations will result. Suppose , which is worse than because it would be closer to the wrapped . Since mod
if wrapped around, it will be when , which ensures correct purging operations. This is why we require the counter variables and to be -bit wide. 
VI. CONCLUSION
In this paper, we have proposed a novel RSE for PFQ implementation, which is aimed at designing an efficient and scalable architecture that can support hundreds of thousands of sessions in a cost-effective manner. The basic idea is to use the concept of hierarchical searching with a tree data structure to speed up the searching process. The RSE uses commercial memory chips; its search time is independent of the number of sessions in the system and is bounded only by the number of memory accesses needed. This is achieved by trading space complexity (e.g., memory size) for time complexity (e.g., sorting). With an extension of the RSE, we have proposed a 2-D RSE architecture to implement a general shaper-scheduler. By introducing a slotted mechanism for updating the system virtual time, we have shown that at most two eligible packets are transferred to the scheduler queue in each time slot.
We have demonstrated the feasibility of our approach by presenting an implementation architecture for the RSE. Finite-bit overflow, such as time-stamp overflow, is a generic problem in implementing GPS-related scheduling algorithms. We suggest dividing the memory into two zones in the RSE to handle the overflow of the time stamp. As for the 2-D RSE, conceptually four zones are needed since both the time stamp and the start time could overflow. As an alternative, we suggest using instead of to organize the memory. Since an extra adder or adding operation is needed to restore from , the handling of time-stamp overflow is moved from the 2-D RSE to the scheduler queue in the corresponding shaper-scheduler. Another kind of finite-bit overflow problem is the aging problem of the time stamp, which makes its comparison with the system virtual time difficult in a real system. To solve this problem, we suggest using a counter variable to record the evolution of the system virtual time as well as the time stamp within a finite period, and we introduce a periodic purging mechanism that is required to clear all the obsolete time stamps (those below the current system virtual time) within this period, so there will be no ambiguity in comparison.
