Abstract-Due to the natural randomness of broadband traffic, queues are required at various places in the asynchronous transfer mode (ATM) network to absorb instantaneous traffic bursts that may temporarily exceed the network bandwidth. A queue management algorithm will manage the queued cells in such a way that higher priority cells will always be served first, low priority cells will be discarded when the queue is full, and for same-priority cells any interference between them will be prevented. This paper presents four architecture designs for such queue management and compares their implementation feasibility and hardware complexity. This paper introduces the concept of assigning a departure sequence number to every cell in the queue so that the effect of long-burst traffic to other cells is avoided. A novel architecture to implement the queue management is proposed. The architecture applies the concepts of fully distributed and highly parallel processing to schedule cells' sending or discarding sequence. To support the architecture, a VLSI chip (called Sequencer), which contains about 150K CMOS transistors, has been designed in a regular structure such that the queue size and the number of priority levels can grow flexibly.
I. INTRODUCTION
Broadband integrated service digital networks (B-ISDN) provide end-to-end transport for a broad spectrum of services flexibly and efficiently via the asynchronous transfer mode (ATM) technique [1]. In ATM, information is packetized and carried in fixed length "cells." Each cell consists of 53 octets comprising a five-octet header and a 48-octet information field. Various architectures for ATM switches, which are also called fast packet switches, have been proposed and can be found in two survey papers [2], [3]. Among those architectures, switches with output queues have been proven to give the best delay/throughput performance [4]. Fig. 1 shows a large-scale ATM switch fabric with output queues proposed by Chao [5]. It is an improved version of the Knockout switch [6] and is capable of accommodating more than 8000 input ports to achieve a Terabit/second throughput. In this switch, the cell filtering and contention resolution functions are performed in parallel in small switch elements (SWE's), which are located at the intersections of the crossbar lines. The SWE examines incoming cells from horizontal lines and routes them properly to one of the L links of each output port. The value of L should be at least 12 for a load of 0.9 in order to keep the cell loss probability caused by contention among the L routing links below $10^{-10}$. The ATM switch fabric routes higher priority cells to output ports when congestion occurs among the L routing links. At any given cell time slot (2.83 μs), if more than one cell arrives at an output port, all except one will be stored in the output queue waiting to be sent to the transmission link.
Under normal traffic conditions, most of the queued cells will be transmitted on the link, and a simple first-in first-out (FIFO) queue discipline will provide acceptable cell loss/delay performance. However, under congestion, a queue management algorithm is essential to discipline the queued cells so that higher priority cells will always be sent to the outputs before the lower priority ones, low priority cells will be discarded when the queue is full, and for same-priority cells any interference between them will be prevented by setting up some policies and disciplines (or firewalls). Since different traffic has different service requirements, real-time traffic, such as video and voice, should be assigned higher priorities to satisfy its stringent delay requirements, while data traffic may be handled at lower priorities tolerating longer delay. A simple FIFO queue will not be able to handle prioritized cells nor prevent the interference of one connection to another. For instance, consider a situation in which a conventional FIFO is used for each output queue of the switch, as shown in Fig. 1 . When long cell bursts from a misbehaving user are queued up in the FIFO, the other regular arrival cells will be delayed or even discarded when the FIFO is full.
The round-robin discipline is believed to be capable of providing fairer service than the FIFO discipline when the network is congested [7], [8]. Their throughput/delay performance comparison can be found in [9]. The round-robin discipline usually operates by maintaining a separate FIFO queue for each connection, which could be identified by a two-octet virtual channel identifier (VCI) in each cell header. The queues are visited in a cyclical order and thus, when congestion occurs, light-traffic and short-burst users are protected by evenly cutting back all users' throughput to approximately the same level. Three hardware implementations for the round-robin discipline are presented in [10]. However, their hardware complexity increases as the number of virtual connections (VC's) or priority levels increases. As the number of VCI's approaches 64K, the hardware circuitry grows considerably, to the point where the per-VCI processing speed limitation may become the system's bottleneck. Furthermore, each connection may request a different transmission rate at the initial call setup. Thus, the round-robin discipline may not serve all users "fairly" because every user shares the remaining resource equally, rather than sharing the resource in proportion to the transmission rate it has requested.
In this paper, we present a queue management algorithm using the mechanism called VirtualClock [11] by assigning a departure sequence number to every cell in the queue to provide "fairness" among all virtual circuits and to set up resource firewalls to prevent interference among them. The concepts and design of the queue management discussed in this paper can be generally applied to any other ATM switches or statistical multiplexers with either input or output queues. This paper proposes a novel architecture to implement the queue management. The architecture applies the concepts of fully distributed and highly parallel processing to schedule cells' sending or discarding sequence. In contrast to the round-robin implementations, the hardware complexity of the proposed architecture is independent of the number of VCI's. A VLSI chip (called Sequencer), which contains about 150K CMOS transistors, has been designed using a regular structure so that the queue size and the number of priority levels can grow flexibly.
Section II first discusses the effects of long-burst traffic on other connections in terms of cell losses and delays, and then sketches a mechanism to prevent such interference. Section III presents four possible architectures to implement the queue management, and shows that the last architecture performs the best in terms of implementation feasibility and hardware complexity. Section IV describes the key VLSI chip used to implement the proposed architecture, and Section V gives a final conclusion.
II. EFFECTS OF LONG-BURST TRAFFIC
One of the basic functions of network flow control is to monitor users' traffic and prevent interference between users. During the call setup, the user provides the network controller with traffic information such as the peak bit rate, the average bit rate, and the maximum burst length. Based on the current status, the network will either accept the new call, or reject it if it would affect the negotiated grade of service of existing connections. Once the new call has been accepted, it will be monitored to see whether it follows the traffic characteristics it has claimed. Usually, misbehaving users of higher bit rate traffic (e.g., those increasing their burst length) have more impact on others than users with a lower bit rate. The discussion in Section II-A details this effect and shows a mechanism to set up firewalls between the users so that the less bursty traffic will not be disturbed by the more bursty sources.
A. Simulation Models and Results
The simulated model shown in Fig. 2 represents an output queue of the ATM switch shown in Fig. 1, where cells from 12 inputs are stored in a single queue of 256 cells. A fixed-rate server sends out a cell at every time slot if the queue is not empty. Although the traffic models used in this paper are relatively simple and do not represent any specific services, the simulation results could still be qualitatively applied to other, more practical source models. Let us assume that inputs 1-6 have an average arrival rate (AR) of 0.025 cell/time slot, while inputs 7-12 have an average arrival rate of five times that, or 0.125 cell/time slot. The aggregated traffic produces an offered load of 0.9 to the server. Let us also fix the average burst length of inputs 1-6 to be 1 cell, and vary the burst length of inputs 7-12 from 1 to 20 cells. The traffic source model alternates between active and silent modes as shown in Fig. 2. During a silent mode, no cells are generated. During an active mode, cells are produced back-to-back, although they can be separated by regular intervals [12]. The lengths of the active and silent periods are geometrically distributed with average lengths of B and I, respectively. Let us define p as the probability that a cell is the last cell in a burst (in active mode), and q as the probability of starting a new burst per time slot. The probability that the burst has i cells is then

$$\Pr[\text{burst} = i \text{ cells}] = (1-p)^{i-1}\,p, \qquad i \ge 1.$$

The probability that an idle period lasts for j time slots is

$$\Pr[\text{idle period} = j \text{ slots}] = (1-q)^{j}\,q, \qquad j \ge 0.$$

Thus, the mean number of cells in a burst and the mean idle length are

$$B = \frac{1}{p}, \qquad I = \frac{1-q}{q}.$$

The offered load ($\rho$) from each input is equal to $B/(B+I)$. Therefore, for a given $\rho$ and a mean burst length $B$, $q$ can be found to be

$$q = \frac{\rho}{\rho + B(1-\rho)}.$$
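As an illustration, the on/off source described above can be sketched in Python. This is a minimal sketch, not the simulator used for the reported results: it assumes that an idle period may last zero slots (so a new burst can start in the slot right after the previous one ends), and the names `burst_params` and `on_off_source` are hypothetical.

```python
import random

def burst_params(load, mean_burst):
    """Return (p, q) for a given per-input offered load and mean burst length B.

    Assumes geometric burst lengths (i >= 1) and geometric idle periods (j >= 0).
    """
    p = 1.0 / mean_burst                                # B = 1/p
    q = load / (load + mean_burst * (1.0 - load))       # from load = B / (B + I), I = (1 - q)/q
    return p, q

def on_off_source(load, mean_burst, slots, rng):
    """Generate a 0/1 list: 1 means a cell is emitted in that time slot."""
    p, q = burst_params(load, mean_burst)
    active = False
    cells = []
    for _ in range(slots):
        if not active and rng.random() < q:
            active = True                               # a new burst starts in this slot
        if active:
            cells.append(1)
            if rng.random() < p:                        # this cell was the last of the burst
                active = False
        else:
            cells.append(0)
    return cells

if __name__ == "__main__":
    trace = on_off_source(0.125, 20, 200_000, random.Random(1))
    print("measured load:", sum(trace) / len(trace))
```

With a load of 0.125 and B = 20, the mean idle length works out to 140 slots, so each burst/idle cycle averages 160 slots.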
To evaluate the effect of the long-burst traffic, we simulated the model in Fig. 2 the simulation time. Fig. 3 shows that the cell loss probability of inputs 1-6 increases from less than lo-' to $10^{-3}$ as the burst length of inputs 7-12 increases from 1 to 20.
B. Departure Sequence Number
A mechanism called VirtualClock was developed by Zhang to monitor multiple connections' behavior and to set up firewalls to prevent interference among them [11]. In this paper, we simplify the VirtualClock concept by assigning a departure sequence number (DS) to every cell based on the average arrival rate (AR) that could be claimed at the call setup. A queue management algorithm serves the cell that has the smallest value of the DS. The algorithm is depicted below:
1. Upon the arrival of the first cell of connection i, its $DS_i$ = real time, where the real time can be the value of a counter incremented per cell time (2.83 μs).
2. Upon receiving every cell from connection i, its $DS_i$ = maximum {real time, $DS_i + 1/AR_i$}.

For traffic that has different service classes, the priority field of each cell can be constructed from the service class combined with the departure sequence number. The priority field in the cells that are routed in the internal ATM switch network can be arranged as the one shown in Fig. 6. Both the output port address and the priority field in the cells could be assigned by the input port controller of an ATM switch (or multiplexer). For Q bits of the priority field, the number of priority levels (P) is $2^Q$.

III. ARCHITECTURES FOR QUEUE MANAGEMENT
This section describes four possible architectures used to implement queue management for ATM switches or statistical multiplexers. The first three architectures are intermediate architectures that suffer some implementation constraints. They are described in an order that leads up to the procedure for generating the fourth, and final, architecture. The example used to illustrate the architecture in this section has 12 inputs and 1 output, but it can be generally applied to more inputs and outputs (for instance, up to 100). Their hardware complexity in terms of memory requirements and implementation constraints are compared and discussed.

A. Store Cells in Multiple FIFO's
Fig. 7 shows the most straightforward way to implement queue management. Cells carrying valid information from all 12 inputs are time division multiplexed into a higher-speed channel and distributed by a cell distributor to the proper FIFO's based on their priority levels. On the output side, an arbiter chooses a cell with the highest priority and sends it out. The lower priority cells are not assessed until all the higher priority cells have been transmitted. Let us assume that the size of the output queue is 256 cells so that the maximum delay/jitter through the switch is less than 1 ms. Therefore, the size of each FIFO is also 256 cells to accommodate the worst case in which all 256 cells have the same priority and are stored in the same FIFO. The total FIFO capacity will be 256 × P cells, where P is the number of priority levels. Let us assume that the cell's priority field consists of two service-class bits and 12 bits for the departure sequence number, so the number of priority levels, P, will be $2^{14}$, or 16 384.

Then, the total FIFO capacity will be about 4.2 million cells, or 222 Mbytes for the cell size of 53 bytes. In each cell time (2.83 μs), 12 cells are time division multiplexed and distributed to FIFO's. Accordingly, each cell's process time in the distributor, plus the writing time for the FIFO's, is less than 2.83 μs/12, or 236 ns. This can be easily achieved if the cell's bit stream is converted to multiple parallel bytes before being written into FIFO's. The arbiter's speed is, however, a bottleneck, because in every 2.83 μs it has to scan all 16 384 FIFO's (the worst case) to select a single cell to transmit. This is very difficult to implement with the existing hardware technology.
B. Store Cells in a Pool
A large amount of memory is required in the preceding architecture because the FIFOs' capacity is not shared among all priorities. One way to allow them to share the memory is to store all cells in a single physical memory, i.e., a cell pool, and then retrieve the cells in a sequence according to their priority levels. The cells' corresponding addresses are stored in P FIFO's according to their priority levels, as shown in Fig. 8. In addition to the P FIFO's used to store cells' addresses, there is an idle-address FIFO used to store the addresses of all empty cells in the pool. When a cell arrives, its priority field, accompanied by an idle address, is sent to an address distributor. The distributor then stores the address in the proper FIFO based on its priority level. In the meantime, the cell is written to the pool with the idle address. The output side is similar to the architecture described in Section III-A, where an arbiter chooses the address of the highest priority cell and then reads out the cell.
If we use the parameters assumed above, the cell pool capacity is 256 cells, or 13 568 bytes. Each FIFO's capacity is 256 × ($\log_2 256$) bits, or 256 bytes. Thus, the (P + 1) FIFOs' capacity will be (16 384 + 1) × 256 bytes, and the total amount of required memory will be about 4.2 Mbytes, much less than the first architecture's 222 Mbytes. However, the arbiter's speed for scanning all 16 384 FIFOs' status is still a bottleneck.
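The memory figures quoted for the first two architectures can be checked with a few lines of Python. This is only a worked calculation under the parameters stated in the text (256-cell queue, 53-byte cells, 14-bit priority field); the function names are hypothetical.

```python
CELL_BYTES = 53        # ATM cell size
QUEUE_CELLS = 256      # output queue size assumed in the text
PRIORITY_BITS = 14     # 2 service-class bits + 12 DS bits

def multiple_fifo_bytes(queue_cells=QUEUE_CELLS, priority_bits=PRIORITY_BITS):
    """Architecture A: one full-size cell FIFO per priority level."""
    levels = 2 ** priority_bits
    return queue_cells * levels * CELL_BYTES

def cell_pool_bytes(queue_cells=QUEUE_CELLS, priority_bits=PRIORITY_BITS):
    """Architecture B: shared cell pool plus (P + 1) address FIFO's."""
    levels = 2 ** priority_bits
    address_bits = (queue_cells - 1).bit_length()     # log2(256) = 8
    fifo_bytes = queue_cells * address_bits // 8      # 256 bytes per address FIFO
    pool_bytes = queue_cells * CELL_BYTES             # 13 568 bytes
    return pool_bytes + (levels + 1) * fifo_bytes

if __name__ == "__main__":
    print(multiple_fifo_bytes() / 1e6, "Mbytes for multiple FIFO's")
    print(cell_pool_bytes() / 1e6, "Mbytes for the cell pool scheme")
```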
C. Store Cells in Logical Queues
In the preceding architecture, the memory used to store cells' addresses is much more than the memory actually used to store the cells themselves. Fig. 9 replaces the P FIFO's in Fig. 8 with two small memories having P entries each. Each entry holds a head or tail pointer (HP or TP) indicating the starting or ending address of a logical queue. The logical queue is stored in the cell pool and is associated with its priority level. Here, instead of storing each cell's address in the FIFO's, two addresses are stored for every logical queue, which results in a big memory savings. Every cell in the pool is attached with a pointer to point to the next cell that is linked in the same queue, as shown in Fig. 10. A similar approach for implementing a priority queue that can handle only a small number of priority levels has been presented in [13].

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 9
When a cell arrives, it is added to the proper logical queue based on its priority level. As shown in Fig. 9, the cell's priority field is extracted first and used as an address to read out the tail pointer from the TP memory, e.g., $A_l$. The tail pointer is then used as a writing address for the cell to be written into the pool. In the meantime, an idle address attached to the cell is written into the cell pool; this address also points to the queue's tail. The TP is then updated with the value of this idle address, as shown in Fig. 10(b). The arbiter records the length of each logical queue in the cell pool and selects one cell to send out in every cell time slot. The transition from Fig. 10(b) to 10(c) shows the operation of deleting a cell from a logical queue. The arbiter reads out the head pointer (e.g., $A_j$) from the HP memory that corresponds to the highest priority. This head pointer is used as a reading address to read out the corresponding cell from the pool. Once the cell is read out, its pointer (e.g., $A_k$) is written into the HP memory to update the old head pointer. This architecture obviously saves considerable memory. But it also adds complexity to the arbiter because it has to record the occupancy status of all logical queues with counters and, in the worst case, has to examine all counters (up to 16 384 in this case) to choose a single cell in one cell time (2.83 μs). This is very difficult to achieve with state-of-the-art hardware. Since all of these functions are performed centrally by the arbiter, its processing speed limits the number of priority levels. Furthermore, if any pointer in the TP/HP memory or in the cell pool is somehow corrupted, the linkage between cells in the logical queues will be wrong, and cells will be accessed mistakenly. Although this can be detected by adding an extra parity bit to the pointers, it is still not easy to recover from faults once errors occur in a pointer, unless the entire cell pool memory is reset.
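The logical-queue idea can be illustrated with a short Python sketch. It is a simplified, conventional linked-list variant rather than a transcription of Figs. 9 and 10: the tail pointer here names the last occupied cell instead of a pre-reserved empty slot, and the class name `LogicalQueues` and its methods are hypothetical. The linear scan in `dequeue` stands in for the centralized arbiter whose cost grows with the number of priority levels.

```python
from collections import deque

class LogicalQueues:
    """Cell pool shared by P logical queues, linked through per-cell next pointers."""

    def __init__(self, pool_size, levels):
        self.cells = [None] * pool_size       # the cell pool
        self.next_ptr = [None] * pool_size    # pointer attached to every cell
        self.idle = deque(range(pool_size))   # idle-address FIFO
        self.head = [None] * levels           # HP memory
        self.tail = [None] * levels           # TP memory
        self.count = [0] * levels             # arbiter's occupancy counters

    def enqueue(self, priority, cell):
        if not self.idle:
            return False                      # pool full; discard policy not modeled here
        addr = self.idle.popleft()
        self.cells[addr] = cell
        self.next_ptr[addr] = None
        if self.count[priority] == 0:
            self.head[priority] = addr
        else:
            self.next_ptr[self.tail[priority]] = addr   # link behind the old tail
        self.tail[priority] = addr
        self.count[priority] += 1
        return True

    def dequeue(self):
        # Centralized arbiter: scans the counters from the highest priority (0) down.
        for priority, occupancy in enumerate(self.count):
            if occupancy:
                addr = self.head[priority]
                self.head[priority] = self.next_ptr[addr]
                self.count[priority] -= 1
                self.idle.append(addr)
                return self.cells[addr]
        return None
```

A single corrupted entry in `next_ptr`, `head`, or `tail` would break the chain for a whole queue, which is the fault-recovery weakness noted above.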
D. Sort Priorities Using a Sequencer
The three preceding architectures all limit the number of priorities because of their centralized processing characteristics. In addition, the architectures described in Section III-A and III-B require a large amount of memory. A novel architecture proposed in this section, as shown in Fig. 11 , requires less memory and is not limited by the number of priority levels because it uses the concepts of fully distributed and highly parallel processing to manage cells' sending and discarding sequence.
Comparing Figs. 8 and 11, we notice that the P FIFO's in Fig. 8 are replaced by a sequencer in Fig. 11. A pair composed of a cell's priority field and its corresponding address, denoted as PA, is stored in the sequencer in such a way that higher priority pairs are always to the right of lower priority ones, so they will be accessed sooner by the read controller. Once the pair has been accessed, the address is used to read out the corresponding cell in the cell pool. The concept of implementing the sequencer is very simple, as illustrated in Fig. 12. Assume that the value of $P_k$ is less than that of $P_{k+1}$, so that $P_k$ has the higher priority. When a new cell with priority $P_n$ arrives, all pairs to the right of $A_k$, including $A_k$ itself, remain at their positions while the others are shifted to the left; the vacant position is then filled with the pair composed of the new cell's priority field ($P_n$) and address ($A_n$).

When the cell pool is full (i.e., the idle-address FIFO is empty), the priority field at the left-most position of the sequencer (e.g., $P_z$) will be compared to that of the newly arrived cell ($P_n$). If $P_n$ is smaller than $P_z$, the pair of $P_z$ and $A_z$ will be pushed out of the sequencer as the new pair $PA_n$ is inserted. Meanwhile, the cell with address $A_z$ in the pool will be overwritten with the new cell. However, if $P_n \ge P_z$, the new cell will be discarded instead.
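The insert, push-out, and discard behavior described above can be summarized in a behavioral Python sketch. It models only what the sequencer and cell pool do, not how the hardware does it (a sorted list stands in for the parallel shift registers); the class name `SequencerQueue` and the assumption of a pool with at least one cell are ours.

```python
import bisect
from collections import deque

class SequencerQueue:
    """Behavioral model: index 0 of the lists is the right-most sequencer position."""

    def __init__(self, capacity):
        self.pool = [None] * capacity
        self.idle = deque(range(capacity))    # idle-address FIFO
        self.priorities = []                  # ascending; smaller value = higher priority
        self.addresses = []                   # address paired with each priority

    def arrive(self, priority, cell):
        if self.idle:
            addr = self.idle.popleft()
        elif priority < self.priorities[-1]:
            # Pool full: push out the left-most (lowest priority) pair, reuse its address.
            self.priorities.pop()
            addr = self.addresses.pop()
        else:
            return False                      # new cell is discarded
        # bisect_right puts the new pair to the left of equal-priority pairs,
        # so same-priority cells keep their arrival order.
        pos = bisect.bisect_right(self.priorities, priority)
        self.priorities.insert(pos, priority)
        self.addresses.insert(pos, addr)
        self.pool[addr] = cell
        return True

    def depart(self):
        if not self.priorities:
            return None
        self.priorities.pop(0)                # shift everything one position to the right
        addr = self.addresses.pop(0)
        self.idle.append(addr)
        return self.pool[addr]
```

In this sketch a full pool with three cells of priorities 2, 5, 5 rejects a new cell of priority 5 or 7, but accepts one of priority 1 by overwriting the most recently arrived priority-5 cell.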
IV. VLSI SEQUENCER CHIP
A. General Operations
The VLSI sequencer chip is essentially a 256-word sorting-memory chip. Due to its generic architecture, it could be used for other scheduling algorithms and priority assignment procedures. Fig. 13 shows the building block of the chip, where the circuit in the shaded box is a module and is repeated 256 times in the chip. The new priority and address pair, $PA_n$, is broadcast to every module. Based on the priority values of $P_{i-1}$, $P_i$, and $P_n$, the decision circuit will generate proper signals, sn and sl, to shift the new pair ($PA_n$) into the register in the shaded box, shift the pair $PA_{i-1}$ from the right to the register, or retain the original value, $PA_i$. Table I shows the truth table of generating the sn and sl signals, where X = $P_{i-1}$, Y = $P_i$, and Z = $P_n$. For case (a) in Table I, where the new pair $PA_n$ is to be latched in the register, both the sn and sl signals are asserted to select $PA_n$ for the register's input (D) and pass the shift-left-clock (slck) signal to the register's clock input. For case (b) in Table I, only the sl signal is asserted, which results in $PA_{i-1}$ being selected and latched into the register with the clock signal slck. For the last case (c) in Table I, the sl signal is deasserted while the sn is "do not care"; thus, the register retains its original value, $PA_i$.
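The per-module decision can be illustrated by simulating one slck step across all modules at once. The conditions used below (sl asserted when Z < Y, sn asserted when Z is not less than X) are our inference from cases (a) to (c) and are not copied from Table I. The function name, the use of an all-1's pair for empty registers, and the all-0's value on the right-most input follow the cascading description in Section IV-E but are otherwise assumptions of this sketch.

```python
EMPTY = (2**14 - 1, None)    # all-1's priority marks an empty register (assumed encoding)
RIGHT_EDGE = (0, None)       # right-most input tied to all 0's

def shift_left_step(regs, new_pair):
    """One slck pulse: every module decides in parallel from X, Y, and Z.

    regs[0] is the right-most module; each entry is a (priority, address) pair.
    """
    z = new_pair[0]
    updated = []
    for i, pair in enumerate(regs):
        right = regs[i - 1] if i > 0 else RIGHT_EDGE
        x, y = right[0], pair[0]
        sl = z < y               # inferred: shift left only if the new cell outranks P_i
        sn = not (z < x)         # inferred: take the new pair unless it also outranks P_(i-1)
        if sl and sn:
            updated.append(new_pair)   # case (a): latch the new pair
        elif sl:
            updated.append(right)      # case (b): latch PA_(i-1) from the right
        else:
            updated.append(pair)       # case (c): retain PA_i
    return updated

if __name__ == "__main__":
    regs = [EMPTY] * 4
    for pair in [(5, "a"), (2, "b"), (5, "c"), (3, "d"), (4, "e")]:
        regs = shift_left_step(regs, pair)
    print(regs)
```

Because every module reads only the old register values, the loop reproduces a synchronous parallel update; the pair in the left-most module simply falls off when the chip is full.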
When a cell is to be read from the cell pool to the output channel, its corresponding address first has to be shifted out from the sequencer. Since the address associated with the highest priority in the pool is always at the right-most position of the queue, it can be easily accessed by shifting all PA pairs one position to the right. As shown in Fig. 13, the pair $PA_{i+1}$ will be shifted to the register in the shaded box when the shift-right (sr) and shift-right-clock (srck) signals are both asserted. For the example shown in Fig. 2, where there are 12 inputs and 1 output, in every cell time there will be, at most, 12 slck pulses to write 12 new PA pairs to the sequencer, and one srck pulse to read out the highest priority pair. If a newly arrived cell is empty, it will not be written into the pool, and the slck pulse will not be generated. Similarly, if the cell pool is empty (or the idle-address FIFO is full), no cells will be read out, and the srck pulse will be masked. Consequently, the shifting clock's period will be equal to 2.83 μs/(12 + 1), or about 218 ns. When the number of inputs plus outputs increases up to 100, for example, the clock period is reduced to about 28 ns. There is still sufficient time for the decision circuit to generate proper signals based on the values of X, Y, and Z, and then latch the proper priority-address pair into the register. Usually, it is the speed of accessing cells in the pool (i.e., random access memory [RAM]), not the sequencer's speed, that limits the number of inputs and outputs.
B. Underflow and Overflow
We assume that each sequencer chip can support $2^{14}$ priority levels, including 2 bits for the service class and 12 bits for the departure sequence number (DS), for a total of 14 bits.

[TABLE I: truth table for generating the sn and sl signals]

Any services with bit rates lower than 33 kb/s will require a larger DS range because the DS is updated with the inverse of the average cell arrival rate. We can either duplicate the sequencer chips to accommodate a larger DS (see Section IV-E) or always assign the cells of such services the real-time counter's value. Although this may cause some unfairness to the higher-bit-rate connections, the effect is insignificant and can be ignored because of the low arrival rates of the low-bit-rate services. Besides the DS underflow situation for low-bit-rate services, the "time rollover" situation, which occurs when the real-time counter or the DS exceeds its maximum value (DS overflow), has to be resolved. For the case of 12 bits for the DS and assuming that the real-time counter is incremented per cell time, the "time rollover" will occur every 2.83 μs × $2^{12}$, or 11.6 ms. One way to handle the "rollover" situation is to use two parallel sequencers (sequencer 0 and 1) for each queue manager to store two different banks of departure sequence numbers. Whenever the DS value exceeds its maximum value, a bank indication bit sent along with the priority field will be toggled by, for instance, the input port controller of an ATM switch (or multiplexer). Based on this bit, the cell's priority-address pair (PA) will be stored in either sequencer 0 or sequencer 1. Note that the bank indication bit will not be stored in the sequencers. The queue manager at the output port will empty the sequencer in one bank before serving another.
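The departure sequence rule of Section II-B, combined with the bank indication bit described above, can be sketched as follows. This is a minimal sketch under several assumptions that are ours: the real-time counter is passed in unwrapped (in cell times), the per-connection DS is kept as an unwrapped, possibly fractional value, and the bank bit is derived as the parity of the number of rollovers. The class name `DepartureSequencer` is hypothetical.

```python
DS_BITS = 12
DS_RANGE = 1 << DS_BITS      # 4096 cell times per rollover period

class DepartureSequencer:
    """Assigns (bank bit, 12-bit DS) to arriving cells, one state entry per connection."""

    def __init__(self):
        self.last_ds = {}    # connection id -> last assigned DS (unwrapped)

    def assign(self, conn, now, arrival_rate):
        """now: real-time counter in cell times; arrival_rate: claimed AR in cells/slot."""
        if conn not in self.last_ds:
            ds = now                                              # step 1: first cell
        else:
            ds = max(now, self.last_ds[conn] + 1.0 / arrival_rate)  # step 2
        self.last_ds[conn] = ds
        whole = int(ds)
        bank = (whole // DS_RANGE) & 1       # toggles each time the DS range is exceeded
        return bank, whole % DS_RANGE

if __name__ == "__main__":
    ds = DepartureSequencer()
    # A back-to-back burst from a 0.125 cell/slot connection is spread 8 slots apart.
    print([ds.assign("vc7", t, 0.125) for t in range(3)])
```

A connection that sends faster than its claimed rate thus accumulates ever larger DS values and is served after well-behaved connections, which is the firewall effect sought.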
Consider an example in which, at time = 0, the indication bit is reset to 0, and the PA pairs of arrived cells are stored in sequencer 0. Concurrently, this sequencer is being served by the queue manager. When the first "rollover" occurs at time = 11.6 ms, the PA pairs of cells arriving afterwards will be written into sequencer 1. Meanwhile, sequencer 0 is still being served. Sequencer 1 can start to be served only when sequencer 0 becomes empty. It is essential that sequencer 0 be exhausted before the next "rollover" occurs at time = 23.2 ms. Otherwise, PA pairs with larger DS's will be loaded into sequencer 0 and served before those that are stored in sequencer 1 and have smaller DS's. Therefore, to operate correctly, the "rollover" period in cell time units (e.g., 4096 in our case) must be greater than the queue size (256 cells in our example) to guarantee that the sequencer currently being served becomes empty before the next "rollover" occurs.

C. Decision Circuit
Fig. 14 shows the block diagram of the decision circuit, which includes two 14-bit subtractors (accommodating $2^{14}$ priority levels) and one inverter gate. Although the subtractor could be shared by selecting X or Y as its subtrahend, it may not save many transistors (if any) because the subtractor's circuit can be optimized (as described later) and, if shared, an extra selector and other control circuits will be needed. The borrow-out signal, $b_x$ or $b_y$, will be asserted if Z is less than X or Y, respectively. According to Table I, the sl signal is then given by $b_y$ and the sn signal by the inverse of $b_x$. The 14-bit subtractor consists of 14 full subtractors cascaded in series, as shown in Fig. 15, which also lists the logic equations for the full subtractor. Since only the borrow-out is needed in our application, the circuit for generating the difference is omitted and not shown in the figure. Notice that the exclusive-OR function is implemented with six transistors. To avoid a long chain of pass transistors, the borrow-out at every stage is inverted. If we name the full subtractor of the least-significant bit as an even one and the next one as an odd one, the 14-bit subtractor then consists of even and odd subtractors alternately and repeatedly [14].
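The borrow-out comparison used by the decision circuit can be illustrated with a bit-level Python sketch. It uses the textbook full-subtractor borrow equation, which may differ in form from the equations listed in Fig. 15, and it does not model the alternating inverted borrow of the even and odd stages. The mapping of the borrow-outs to sn and sl (one inverter on the X comparison) is our inference from the truth-table cases, and the function names are hypothetical.

```python
def borrow_out(z, x, bits=14):
    """Ripple the borrow of (z - x) through `bits` full subtractors; 1 iff z < x."""
    borrow = 0
    for k in range(bits):                     # least-significant bit first
        zk = (z >> k) & 1
        xk = (x >> k) & 1
        # Standard full-subtractor borrow: generated when zk=0, xk=1,
        # propagated from the previous stage when zk == xk.
        borrow = ((1 - zk) & xk) | ((1 - (zk ^ xk)) & borrow)
    return borrow

def decision(x, y, z, bits=14):
    """Return (sn, sl) for X = P_(i-1), Y = P_i, Z = P_n (inferred mapping)."""
    b_x = borrow_out(z, x, bits)
    b_y = borrow_out(z, y, bits)
    sl = b_y            # shift left when the new priority value is smaller than P_i
    sn = 1 - b_x        # the single inverter: select the new pair unless Z < X
    return sn, sl

if __name__ == "__main__":
    print(decision(3, 5, 4))   # new pair belongs in this module
    print(decision(5, 7, 2))   # module takes its right neighbor's pair
    print(decision(3, 5, 6))   # module retains its pair (sn is "do not care")
```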
D. Chip Complexity
Notice that the operation of (Z-X) in module i (the shaded box in Fig. 13), or $P_n - P_{i-1}$, is equivalent to the operation of (Z-Y) in module (i-1) on the right. Therefore, in a real implementation we need only one subtractor instead of two in each module. The transistor count for each full subtractor (without the difference circuit) is 10, and the 14-bit subtractor needs 140 transistors. The register has been implemented with static D-type flip-flops (DFF's), each of which has 12 transistors. The selector uses one pass transistor for each bit. Since we assume that each sequencer chip can accommodate a cell's priority field and address of up to 14 and 10 bits, respectively, the transistor count including the register and selector will be (14 + 10) × (12 + 1), or 312. Together with some other control gates and drivers, the transistor count for each module in the sequencer is less than 600. A total of 256 such modules, equivalent to about 150K transistors, have been integrated into a single VLSI chip with a commercial 1.2 μm CMOS technology. The chip has 180 pins, and the die size is 7.5 × 8.3 mm. The chip has been simulated and functions correctly at 40 MHz, which provides ample operating margin for the example shown in Fig. 2, where the required operation speed is 4.6 MHz (1/218 ns). It is important to notice that the queue management operation speed is not affected by the number of priority levels due to the broadcast mechanism and distributed processing.
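The timing and transistor figures quoted above follow from simple arithmetic, reproduced here as a small Python check. The constants come from the text; the function names are hypothetical, and the per-module total excludes the unspecified control gates and drivers.

```python
CELL_TIME_NS = 2830.0     # one cell time slot, 2.83 microseconds

def shift_clock_period_ns(inputs, outputs=1):
    """Each cell time must fit one slck pulse per input and one srck pulse per output."""
    return CELL_TIME_NS / (inputs + outputs)

def module_transistors(priority_bits=14, address_bits=10):
    """Subtractor plus register and selector; control gates and drivers are not counted."""
    subtractor = 10 * priority_bits                                  # 10 per full subtractor
    register_and_selector = (priority_bits + address_bits) * (12 + 1)  # DFF + pass transistor
    return subtractor + register_and_selector

if __name__ == "__main__":
    period = shift_clock_period_ns(12)
    print(round(period), "ns period,", round(1000.0 / period, 1), "MHz")
    print(module_transistors(), "transistors per module before control logic")
    print(256 * 600, "transistors as the upper bound for 256 modules")
```

The 40 MHz simulated speed corresponds to a 25 ns period, which also covers the 28 ns period needed when inputs plus outputs reach 100.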
E. Cascading the Sequencer Chips
A single sequencer chip can accommodate a cell pool capacity of up to 256 cells (due to the 256 modules in the chip) and priority levels up to 16K ($2^{14}$). If the pool size exceeds 256 cells (e.g., 2K), the sequencer chips can be cascaded as shown in Fig. 16. Address bits, a9 to a0, and priority bits, p13 to p0, are applied to the lower sequencer chips, while a10 and all the priority bits are applied to the upper ones. Notice that the right-most input PA pairs are connected with all 0's while the left-most ones are connected with all 1's. At initialization, all the PA registers in the sequencer chips will be loaded with all 1's so that any valid cell carrying a priority value less than ($2^{14}$ - 1) will be inserted into the register of the sequencer chip. If the required priority levels are more than 16K (e.g., 64K), the sequencer chips can be connected as shown in Fig. 17. The priority and address pair (p13 to p0 and a7 to a0, assuming a 256-cell queue) are connected to every sequencer chip. The two most significant bits of the priority field (p15 and p14) are decoded to generate four enable signals, en3 to en0, which are used to select one of the sequencer chips to load the PA pair. At the sequencer chip's output, the address of the highest priority will be chosen for the cell pool. The right-most PA pairs of the four sequencer chips will be checked to determine whether the sequencer chip is empty.
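The priority-expansion arrangement of Fig. 17 can be sketched behaviorally as follows. Each chip is modeled as a sorted list rather than as hardware, the chip with the smallest index is assumed to hold the highest priorities (because its two decoded bits are the smallest), and overflow handling across chips is not modeled. The class name `CascadedSequencers` is hypothetical.

```python
import bisect

class CascadedSequencers:
    """Four 14-bit sequencers selected by the two most significant priority bits."""

    def __init__(self, chips=4, low_bits=14):
        self.low_bits = low_bits
        self.chips = [[] for _ in range(chips)]   # each holds sorted (low priority, address)

    def insert(self, priority, address):
        select = priority >> self.low_bits                  # decode p15, p14 into en3..en0
        low = priority & ((1 << self.low_bits) - 1)         # p13..p0 go to every chip
        chip = self.chips[select]
        keys = [p for p, _ in chip]
        chip.insert(bisect.bisect_right(keys, low), (low, address))

    def pop_highest(self):
        # Check the right-most pair of each chip; the first non-empty chip wins.
        for select, chip in enumerate(self.chips):
            if chip:
                low, address = chip.pop(0)
                return (select << self.low_bits) | low, address
        return None
```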
V. CONCLUSIONS
Since an ATM network node has to deal with traffic with different requirements, the use of multiple priorities and deliberate queue control functions offers a way to distinguish among different traffic types. A queue management algorithm manages the queued cells in such a way that higher-priority cells are always sent to the links before the lower-priority ones, low-priority cells are discarded when the queue is full, and same-priority cells are served fairly. The concept of assigning a departure sequence number to every cell in the queue is introduced so that the effects of long-burst traffic on other regularly arriving cells are avoided. This paper presents four architecture designs for queue management and compares their implementation feasibility and hardware complexity. The architectures discussed in this paper can be generally applied to queue management for ATM switches or multiplexers with either input or output queues. A novel architecture to implement the queue management is proposed. It applies the concepts of fully distributed and highly parallel processing to arrange the cells' sending or discarding sequence. To support the architecture, a VLSI chip (called Sequencer, containing about 150K transistors) has been designed using a commercial 1.2 μm CMOS technology. The chip has a regular structure so that the queue size and the number of priority levels can grow flexibly.
