Abstract: In this paper we address the design of a packet buffer for future high-speed routers that support line rates as high as OC-3072 (160 Gb/s), and a high number of ports and service classes.
I. INTRODUCTION

The fastest available high-speed routers today support up to 16 interfaces at OC-192 (10 Gb/s) or OC-768 (40 Gb/s) line rates. It is envisaged, however, that next-generation high-end systems will support a much larger number of interfaces (e.g. 624 or even more) at OC-192, OC-768 or even OC-3072 (160 Gb/s) line rates [1].
Packet buffers for next-generation routers will require a storage capacity of several Gb (gigabits) of data and a bandwidth of several hundred Gb/s. These packet buffers will usually support Virtual Output Queueing (VOQ), which means that they must manage internal data structures for almost one thousand queues. Moreover, the design must be able to handle any input pattern, and not only the traffic patterns that occur on average. This restrictive condition is usual in networking equipment, and leads to design choices that optimize the worst case rather than the most common case. Currently proposed packet buffer architectures do not meet these strict requirements.
To our knowledge, the fastest packet buffers with worst-case bandwidth guarantees that can be found in the literature are the hybrid SRAM/DRAM designs. This memory organization for packet buffers was first proposed by the research group of N. McKeown (see [2]). The scheme proposed in [...] full buffer. For a large number of interfaces, however, the required SRAM size becomes too large. In [3] we described a scheme that aims at reducing the SRAM size of [2] while supporting a larger number of interfaces. The scheme we proposed in [3] is based on the observation that the effective DRAM access time can be reduced by overlapping multiple accesses to different banks, which allows us to reduce the granularity of the accesses and thereby the SRAM size. However, the memory scheme presented in [3] had the drawback of DRAM memory fragmentation, i.e. certain traffic patterns would lead to a situation where only a fraction of the DRAM memory could be used. In [4] we introduced a queue renaming scheme that reduces the probability of DRAM memory fragmentation. However, since traffic patterns are unpredictable, it is not possible to assess the probability of DRAM memory fragmentation.
In the proposal presented in this paper we maintain the hybrid SRAM/DRAM design of [2] and [3]; however: (i) We redesign the functional blocks that govern SRAM/DRAM memory transfers. Our proposal is a general hybrid SRAM/DRAM design such that the schemes of [2] and [3] are particular cases of our general scheme. (ii) We propose a new algorithm for the general scheme that can reduce the SRAM size of the scheme proposed in [2] by almost an order of magnitude. Furthermore, this new algorithm does not have the memory fragmentation problem of the scheme proposed in [4].
II. GENERAL HYBRID DRAM/SRAM ARCHITECTURE
Figure 1 shows a general hybrid DRAM/SRAM architecture. The system consists of (i) two fast but costly SRAM memory modules (t-SRAM and h-SRAM), (ii) a slow but low-cost DRAM memory, and (iii) the functional blocks that govern the transfers between the DRAM and SRAM memory modules.
The t-SRAM and h-SRAM respectively cache the tail and head of each VOQ logical queue. The rest is stored in DRAM. The SRAM memory bandwidth must match the line rate, which means that the SRAM access time must be less than or equal to the transmission time of a cell (we shall refer to this time as a time slot).
The DRAM memory is organized in M banks. The tail Memory Management Algorithm (t-MMA) module consists of (see Figure 1): a Queue Transfer Requester module (t-QTR), a Request Register (t-RR), and a DRAM Scheduler Algorithm module (t-DSA). Two additional modules, a Bank Allocation Unit (BAU) and the Ongoing Requests Register (ORR), are shared by both the t-MMA and the h-MMA described later in this section.
The functional blocks of the t-MMA work as follows. When a cell for queue i arrives from the transmission line, the t-QTR decides whether a transfer from t-SRAM to DRAM has to be scheduled for this queue. Since the t-SRAM has to be emptied as soon as possible, the t-QTR schedules a transfer whenever possible, i.e. when b cells of queue i are waiting in t-SRAM. Equivalently, let $C_i^t$ be a counter of the number of cells arriving at queue i ($C_i^t$ is initialized to 0). Each time a cell arrives for queue i, $C_i^t$ is increased by one and the t-QTR issues a transfer request for queue i if $(C_i^t \bmod b) = 0$.
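The counter rule above can be sketched in a few lines of Python. This is only an illustration under our own naming: the class `TailQTR` and the method `on_arrival` are hypothetical and not part of the design description.

```python
from collections import defaultdict


class TailQTR:
    """Illustrative t-QTR: one arrival counter C_i^t per queue (hypothetical names)."""

    def __init__(self, b):
        self.b = b                      # transfer granularity, in cells
        self.count = defaultdict(int)   # C_i^t, initialised to 0 for every queue

    def on_arrival(self, queue):
        """Register one arriving cell; return True if a transfer request is issued."""
        self.count[queue] += 1
        return self.count[queue] % self.b == 0


qtr = TailQTR(b=4)
issued = [qtr.on_arrival(7) for _ in range(8)]
# A request is issued on the 4th and on the 8th cell of queue 7.
```

With b = 4, a request is issued exactly once per group of four cells of the same queue, regardless of how arrivals to other queues are interleaved.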
The request issued by the t-QTR is processed by the Bank Allocation Unit (BAU), which in turn chooses the bank where the cells should be allocated (the algorithm to do so will be discussed in later sections). The request issued by the BAU contains the queue from which b cells must be transferred, and the bank where these cells will be placed. The request is stored in the tail Request Register (t-RR). Finally, the tail DRAM Scheduler Algorithm (t-DSA) selects one of the transfer requests pending in the t-RR every b slots, issuing the transfer from t-SRAM to DRAM. In order to choose one of the pending transfers in the t-RR, the t-DSA may take into account the banks that are being accessed (in order to avoid bank conflicts). The identifiers of these banks are stored in the Ongoing Requests Register (ORR).
In the explanation above we have described the transfers between t-SRAM and DRAM. In the following we shall focus on the transfers between h-SRAM and DRAM. These transfers are managed by the head Memory Management Algorithm (h-MMA). The h-MMA has to guarantee that the cells transferred between DRAM and h-SRAM can accommodate the sequence of cells requested by the switch fabric scheduler (we shall refer to it simply as the scheduler). Otherwise, the cell requested by the scheduler may not be present in the h-SRAM, as it may not yet have been transferred from the DRAM. We shall refer to this condition as a miss.
Again, the h-QTR algorithm is simple: schedule a transfer for queue i whenever the number of requests for cells from queue i issued by the scheduler exceeds the number of cells from this queue present in h-SRAM. Equivalently, let $C_i^h$ be a counter of the number of cells requested by the scheduler from queue i ($C_i^h$ is initialized to 0). Each time the scheduler requests a cell from queue i, $C_i^h$ is increased by one and the h-QTR issues a transfer request for queue i if $(C_i^h \bmod b) = 1$. We shall refer to the delay from the moment the h-QTR schedules a transfer until the corresponding download of b cells from DRAM to h-SRAM is finished as the h-MMA response time. Analogously, we define the t-MMA response time as the delay from the moment the t-QTR schedules a transfer until the corresponding upload of b cells from t-SRAM to DRAM is finished.
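The head-side rule differs from the tail-side one only in the residue that triggers a request. A minimal Python sketch follows, with hypothetical names (`HeadQTR`, `on_request`). The test is written as `(c - 1) % b == 0`, which equals the condition (C mod b) = 1 for b > 1 and, as an assumption of this sketch, also behaves sensibly for b = 1.

```python
class HeadQTR:
    """Illustrative h-QTR: one scheduler-request counter C_i^h per queue."""

    def __init__(self, b):
        self.b = b
        self.count = {}  # C_i^h, implicitly 0 for queues not seen yet

    def on_request(self, queue):
        """Register one scheduler request; return True if a DRAM download is requested."""
        c = self.count.get(queue, 0) + 1
        self.count[queue] = c
        # Same as (c mod b) == 1 when b > 1: the first request of each
        # group of b cells triggers the download of that group.
        return (c - 1) % self.b == 0


hqtr = HeadQTR(b=4)
downloads = [hqtr.on_request(3) for _ in range(8)]
# Downloads are requested on the 1st and on the 5th request for queue 3.
```

The download is requested at the first request of each group of b cells, so the remaining b - 1 requests of the group can be served from h-SRAM once the transfer has completed.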
The rest of the functional blocks of the h-MMA work analogously to those of the t-MMA. Now, however, we need an additional latency register (see Figure 1). This register introduces a delay from the moment the scheduler issues a request until the h-SRAM is accessed to grant the corresponding cell. This delay is necessary to cope with the response time of the h-MMA. Furthermore, note that in order to have zero miss probability, the delay introduced by the latency register should be equal to the maximum response time of the h-MMA.
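The latency register can be pictured as a fixed-length delay line. The following Python sketch is one possible model, assuming one scheduler request (or none) per time slot; the names `LatencyRegister` and `tick` are ours.

```python
from collections import deque


class LatencyRegister:
    """Illustrative delay line: a request leaves `delay` time slots after it enters."""

    def __init__(self, delay):
        self.slots = deque([None] * delay)

    def tick(self, request=None):
        """Advance one time slot.

        Pushes the request issued in this slot (or None) and returns the
        request issued `delay` slots earlier (or None).
        """
        self.slots.append(request)
        return self.slots.popleft()


reg = LatencyRegister(delay=3)
served = [reg.tick(r) for r in ["a", "b", None, None, None]]
# Request "a" enters in the first slot and is served in the fourth.
```

If `delay` is set to the maximum h-MMA response time, every request leaves the register only after its cells are guaranteed to be in h-SRAM.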
III. EXTENSION OF THE GENERAL MODEL TO PREVIOUS DRAM/SRAM SCHEMES
In this section we show that previously proposed DRAM/SRAM schemes are particular cases of the general model introduced in section II.
The simplest BAU scheme appears when we choose b = B. In this case there are never bank conflicts, and no specific BAU is needed (i.e. consecutive accesses to DRAM can be made to any bank). The DSA can then be seen as a FIFO scheduler, which alternately chooses the oldest write and the oldest read stored in the RR. This scheme is equivalent to the ECQF scheme proposed in [2]. [...]
A drawback of the scheme previously described is that the assignment of the queues to the memory groups may prevent the full usage of the DRAM. This problem is referred to as memory fragmentation. In [4] we alleviated this problem by means of a renaming scheme of the queues. However, even with the renaming scheme, memory fragmentation could still arise for certain traffic patterns.
IV. RANDOM BAU SCHEME
In this section we describe a scheme that exploits the DRAM bank organization as in [4] (see section III). Therefore, this scheme allows a data granularity b < B and thus reduces the SRAM size. However, the scheme described in this section does not have the memory fragmentation problem of [4].
The BAU we propose randomly chooses a DRAM memory bank for every queue transfer request issued by the QTR. This random selection is done as follows. Let $r_n^i$ be the n-th request for the i-th queue issued by the t-QTR.
Then, the DRAM memory bank allocated to $r_n^i$ is randomly chosen among all the banks, provided that requests $r_{n-B/b+1}^i, \ldots, r_{n-1}^i, r_n^i$ are always addressed to different banks (i.e. different banks are chosen for any B/b consecutive requests for the same queue). Since the queues are FIFO, consecutive h-QTR requests for the same queue will also correspond to different bank accesses. In this way we avoid bank conflicts when transferring cells from the same queue (remember from section II that we can only access the same bank every B time slots, and we access DRAM every b time slots).
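A minimal Python sketch of this bank selection rule is given below. It rests on assumptions that are ours and not stated above: B is an integer multiple of b, the number of banks is at least B/b, and the choice is uniform over the banks not used by the previous B/b - 1 requests of the same queue. The names `RandomBAU` and `allocate` are illustrative.

```python
import random
from collections import defaultdict, deque


class RandomBAU:
    """Illustrative random bank allocation with a per-queue exclusion window."""

    def __init__(self, num_banks, B, b, rng=None):
        assert B % b == 0, "sketch assumes B is a multiple of b"
        self.window = B // b
        assert num_banks >= self.window, "need at least B/b banks"
        self.num_banks = num_banks
        self.rng = rng or random.Random()
        # For each queue, remember the banks of its last (B/b - 1) requests.
        self.recent = defaultdict(lambda: deque(maxlen=self.window - 1))

    def allocate(self, queue):
        """Pick a random bank that differs from the queue's last B/b - 1 banks."""
        recent = self.recent[queue]
        candidates = [k for k in range(self.num_banks) if k not in recent]
        bank = self.rng.choice(candidates)
        recent.append(bank)
        return bank


bau = RandomBAU(num_banks=8, B=8, b=2, rng=random.Random(0))
banks = [bau.allocate(0) for _ in range(200)]
# Any B/b = 4 consecutive requests of queue 0 go to 4 different banks.
```

Because only the last B/b - 1 banks of each queue are excluded, no queue is tied to a fixed group of banks.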
The associated DSA chooses the oldest eligible request in the RR, i.e. the oldest request that can be issued to DRAM without suffering a bank conflict.
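The selection rule of the DSA can be sketched as follows, assuming the RR is kept as a list ordered from oldest to newest and the ORR is available as a set of busy bank identifiers. Both representations, and the function name `pick_oldest_eligible`, are assumptions of this sketch.

```python
def pick_oldest_eligible(request_register, busy_banks):
    """Remove and return the oldest (queue, bank) request whose bank is free.

    request_register: list of (queue, bank) tuples, oldest first (the RR).
    busy_banks: set of bank identifiers currently being accessed (the ORR).
    Returns None if every pending request would cause a bank conflict.
    """
    for idx, (_queue, bank) in enumerate(request_register):
        if bank not in busy_banks:
            return request_register.pop(idx)
    return None


rr = [(0, 3), (1, 3), (2, 5)]
first = pick_oldest_eligible(rr, busy_banks={3})
# Bank 3 is busy, so the request for queue 2 (bank 5) is served first.
second = pick_oldest_eligible(rr, busy_banks={3})
```

When no pending request is eligible the function returns None, which in this sketch corresponds to the DSA issuing no transfer in that access opportunity.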
In order to obtain some dimensioning guidelines for the Random BAU Scheme, let us first assume that we choose b = B. As we explained in section III, in this case there are never bank conflicts and the scheme is equivalent to the ECQF scheme proposed in [2]. We now derive the required SRAM size and the size of the latency register. We shall first focus on the t-SRAM dimensioning. Assume that the t-SRAM is empty and that cells arrive at t-SRAM for each queue in round robin fashion. The t-QTR would issue the first transfer request at time slot Q(B-1) + 1, and the following ones at time slots Q(B-1) + 2, ..., QB. In fact, a round robin pattern of arrivals (respectively scheduler requests) would produce the same pattern of transfer requests issued by the t-QTR (respectively h-QTR). This pattern consists of bursts of Q consecutive transfer requests, one for each queue, occurring with a period of QB time slots. The maximum response time would be reached by the last transfer request of a burst, and the response time distributions of both t-MMA and h-MMA would be the same. Therefore, from now on, we shall not distinguish between the t-MMA and h-MMA response times, and we shall refer to them simply as the MMA response time. Furthermore, we shall refer to the round robin pattern of arrivals (respectively scheduler requests) as a worst case scenario, since it leads to the maximum MMA response time.
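The burst structure of the round robin pattern can be checked with a short Python sketch. It assumes one cell arrival per time slot, slots numbered from 1, and the t-QTR counter rule of section II; the function name `request_slots` is illustrative.

```python
def request_slots(Q, b, num_slots):
    """Time slots at which the t-QTR issues requests under round robin arrivals."""
    counts = [0] * Q
    slots = []
    for t in range(1, num_slots + 1):
        q = (t - 1) % Q          # queue receiving the cell in slot t
        counts[q] += 1
        if counts[q] % b == 0:   # t-QTR rule: request every b cells
            slots.append(t)
    return slots


# Small hypothetical example with Q = 4 queues and b = B = 3.
bursts = request_slots(Q=4, b=3, num_slots=24)
# Two bursts of Q = 4 requests, at slots 9..12 and 21..24.
```

Under these assumptions the first burst occupies slots Q(B-1) + 1 to QB and the bursts repeat with a period of QB slots.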
Let us now consider a scenario using a granularity b < B.
Because bank conflicts may occur, the maximum response time could be as high as Q(B-1) + 1. This response time would occur if the scheduler issued Q consecutive requests addressed to different queues and all the requested cells were stored in the same DRAM memory bank. Therefore, if we want to guarantee zero miss probability, we would need an h-SRAM, t-SRAM and latency register of size Q(B-1) + 1. However, given the random bank assignment policy used by the Random BAU Scheme, the probability of the former event can be extremely low for the worst case traffic pattern.
In other words, it is plausible to assume that the event leading to the maximum response time (Q(B-1) + 1) using the Random BAU Scheme is very unlikely to happen. In fact, in the next section we show that, for practical purposes, the system can be dimensioned as if no bank conflicts would occur, i.e. assuming an MMA maximum response time of Q(b-1) + 1.
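To give a feeling for the difference between the two dimensioning values, Q(B-1) + 1 for zero miss probability and Q(b-1) + 1 when bank conflicts are neglected, the following Python sketch evaluates both. The parameter values Q = 512, B = 32 and b = 4 are hypothetical and chosen only for illustration.

```python
def worst_case_size(Q, B):
    """Size required to guarantee zero miss probability: Q(B-1) + 1."""
    return Q * (B - 1) + 1


def conflict_free_size(Q, b):
    """Size obtained when bank conflicts are neglected: Q(b-1) + 1."""
    return Q * (b - 1) + 1


Q, B, b = 512, 32, 4  # hypothetical values, for illustration only
big = worst_case_size(Q, B)
small = conflict_free_size(Q, b)
ratio = big / small   # roughly 10 for these values
```

For these illustrative values the conflict-free size is about ten times smaller than the worst case one.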
V. NUMERICAL RESULTS
In this section we analyze the Random BAU Scheme described in section IV. For dimensioning purposes, the key parameter to study is the maximum MMA response time (see section IV).
For the results shown in this section, we use the worst case scenario described in section IV: the t-MMA and h-MMA respectively receive a sequence of cell arrivals and scheduler requests in round robin fashion for queues 1, 2, ..., Q.
In response to these patterns, the t-QTR and the h-QTR will generate periodic bursts of transfer requests for queues 1, 2, ..., Q.
Although the Random BAU Scheme proposed in this paper does not have zero miss probability, the randomization process among memory banks makes it possible to guarantee an extremely low miss probability for any traffic pattern. We think that our design can be useful for building very large and fast future packet switches.
Further Work: We are now working on (i) technological aspects of the implementation of the scheme proposed in this paper, and (ii) an analytical model for system dimensioning.
The previous numerical results show that Q(b-1) + 1 is a plausible dimensioning rule for the t-SRAM, h-SRAM and latency register of the Random BAU Scheme. Provided that we can build a fast enough MMA unit, this size can be almost an order of magnitude lower than the one that would be required using the design given in [2]. Furthermore, the Random BAU Scheme does not have the DRAM fragmentation problem of the design we proposed in [4].
[...] in [5] (under the name of "dynamically-allocated multi-queue buffers"). The amount of buffering and the line rates considered in this seminal paper were far lower than those required for our target application: high-speed backbone routers. For OC-192 (10 Gb/s) line rates, a time slot is shorter than the random access time of DRAM. [6] proposes a DRAM-only design for a VOQ buffer architecture working at this line rate. The proposed design uses out-of-order memory accesses in order to reduce the number of bank conflicts, although it does not guarantee zero misses. [7] proposes techniques that exploit row locality whenever possible in order to enhance average-case DRAM bandwidth. However, this scheme may have a significant miss probability for special traffic patterns.
For faster line rates, a hybrid SRAM/DRAM implementation of a VOQ buffer using ECQF for the h-MMA is discussed in [2]. This is the scheme we used as the starting point for our work.
There are many proposals exploiting the bank organization of DRAM memory [8], [9], [10]. This is especially true in the vector processor domain. The novelty of our technique resides in the application of this idea to the context of fast packet buffering.
