Abstract-We address the design of high-speed packet buffers for Internet routers. We use a general DRAM/SRAM architecture for which previous proposals can be seen as particular cases. For this architecture, large SRAMs are needed to sustain high line rates and a large number of interfaces. A novel algorithm for DRAM bank allocation is presented that reduces the SRAM size requirements of previously proposed schemes by almost an order of magnitude, without having memory fragmentation problems. A technological evaluation shows that our design can support thousands of queues for line rates up to 160 Gbps.
INTRODUCTION
A router is a network node connecting several transmission lines, whose basic function is to forward Internet Protocol (IP) packets across the lines depending on the packet's destination IP address and the information stored in its routing table. The main functional units of a router are:
1. line interfaces, which connect the router to each transmission line,
2. packet processors, which process the packet headers, look up routing tables, classify packets, and perform related tasks,
3. packet buffers, which store the packets waiting to be forwarded,
4. switch fabric, which interconnects the router's packet processing units, and
5. system processor, which performs the control functions, such as routing table maintenance and configuration tasks.

A basic measure of the performance of a router is its switching capacity, measured as the product of the line rates of the transmission lines times the number of line interfaces. We can give this measurement in bits per second or in packets per second (for example, assuming a packet size of 40 bytes). Currently, the evolution of high-speed routers is determined by the advances in optical transmission technologies such as DWDM (Dense Wavelength-Division Multiplexing), which makes it possible to exploit the huge potential bandwidth of optical fibers [36]. Data carried by optical fibers continues to double every 8-12 months, with a single fiber capacity exceeding 10 Tbps in the near future [39], [26], [3].
Moreover, this optical transmission technology can already achieve 80-120 wavelengths per fiber in commercial systems, while, in experimental settings, thousands of wavelengths have been multiplexed into a single fiber [44]. Consequently, the required switching capacity of a router increases due to the increase in line rates and the increase in the number of line interfaces (if we are using DWDM, each wavelength is equivalent to a line interface).
A switch built with purely electronic technology cannot deal with the rapid increases in the required switching capacity, so it seems reasonable to consider introducing other types of devices, such as optical devices in units with all-optical technology or hybrid electro-optical switches. However, optical technology presents a major limitation, since nothing analogous to electronic RAM exists in optics [12]. Although there are some alternatives that use optical fiber delay lines with other components such as optical gate switches, optical couplers, optical amplifiers, and wavelength converters (see [15], [21], [32], [42], [51], [25], [24]), they are not commercially feasible. Since the recent introduction of the load-balanced switch described in [5], some hybrid electro-optical switches have been proposed, but they also have several problems that need to be solved to make them practical (see [35], [33], [34]). Therefore, we consider it interesting to explore the extent to which one can scale the speed of high-speed electronic packet buffers.
Packet buffers are essential components in the design of a packet switch. Their main function is to absorb surplus traffic directed toward a given interface of the switch. The size and other characteristics of a packet buffer have a direct impact on the performance of the switch and on the dynamics of the congestion control mechanisms used in the network.
For a packet buffer, the required bandwidth is at least twice the line interface transmission rate. Furthermore, high-speed routers not only need high-speed buffers, but they may also need large buffer storage. Usually, to calculate the size of the packet buffer required in a packet switch, manufacturers use the well-known rule of thumb buffer size = RTT × C, in which RTT is the round-trip time of the Internet and C is the line rate of the line interface, so a packet switch that uses line interfaces with C = 40 Gbps in a network with RTT = 200 ms requires packet buffers able to read/write at 80 Gbps and with a size of 1 GB.
The validity of the previous dimensioning rule for very high capacity links has recently been questioned. In [2], [48], it is argued that, in backbone routers that switch thousands of TCP flows, the phenomenon of flow synchronization does not occur when there is congestion ([19], [9]), so we can drastically reduce the packet buffer size, using, as a dimensioning rule, buffer size = RTT × C / √N, where N is the number of active TCP flows in the link. On the other hand, in [11], it is argued that this dimensioning rule, which focuses on maintaining high link utilization and low delay, can lead to high loss rates and poor performance for many applications. Another dimensioning rule, which, in fact, can give buffer size values even larger than the ones obtained with the bandwidth-RTT rule, is derived there. Given the lack of accurate models for Internet traffic and the closed-loop nature of TCP, analytical or simulation models cannot give the final answer to the buffer dimensioning question and, thus, real measurements in congested Internet links are needed. None of the alternative dimensioning rules has been extensively tested to date in real scenarios, and buffer sizing of Internet routers remains an open question. Today, router manufacturers seem to favor the use of large buffers. For instance, the Cisco CRS-1 modular service card with a 40 Gbps line rate incorporates 2 GB of packet buffer memory per line card side (ingress and egress) (see [7]).
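To make the competing dimensioning rules concrete, the short calculation below evaluates both the bandwidth-RTT rule and the RTT × C / √N rule for a 40 Gbps interface. The 200 ms RTT matches the example above, while the flow count of 10,000 is an assumed illustrative value, not a measurement.

    # Worked comparison of the two buffer-sizing rules discussed above.
    # The RTT matches the example in the text; the flow count is an assumed value.
    rtt = 0.2            # round-trip time in seconds (200 ms)
    c = 40e9             # line rate in bits per second (40 Gbps)
    n_flows = 10_000     # assumed number of active TCP flows

    rule_of_thumb = rtt * c                      # buffer size = RTT x C
    small_buffer = rtt * c / n_flows ** 0.5      # buffer size = RTT x C / sqrt(N)

    print(rule_of_thumb / 8 / 1e9, "GB")         # 1.0 GB, as in the text
    print(small_buffer / 8 / 1e6, "MB")          # 10.0 MB with the assumed N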
The problem of packet storage is not only related to the absorption of traffic peaks. As an interesting example, we might mention the case of the packet buffers used in Optical Burst Switching (OBS) edge routers. OBS is assumed to be the most practical solution in the near future [6], [45], [38], [50], [31] for efficiently using the huge potential bandwidth of optical fibers that the advances in DWDM technology have made possible. In OBS, packets from various sources are aggregated into bursts at the ingress edge of an OBS network and disassembled at the egress edge router. A control packet is sent first to set up a connection (by reserving the appropriate amount of bandwidth and configuring the switch fabric along a path), followed by the burst of data. This signaling process implies that packets from a burst should be stored at the edge routers during timescales of milliseconds, which means that OBS edge routers may require buffer sizes on the order of gigabytes.
Traditionally, fast packet buffers were built using low-latency SRAM. However, with the increasing capacity requirements, high-density DRAMs have become the preferred choice. DRAM-based packet buffers can easily provide a bandwidth of up to around 1 Gbps, but, if we increase the required bandwidth to several Gbps, the design becomes difficult. For instance, [20] addresses the design of a packet buffer using a single-chip 16 Mb SDRAM with a 16-bit data interface and a 100 MHz clock. Even though the peak bandwidth is 1.6 Gbps, the guaranteed bandwidth drops to 1.2 Gbps due to the activate and precharge overhead. A multiple-chip design would increase the buffer bandwidth, but the increase would not be proportional to the total number of chips. Using, for instance, the same SDRAM parameters, an 8-chip configuration with an 8x wider bus would provide a guaranteed bandwidth of only 5.12 Gbps. Increasing the number of chips and widening the data bus therefore yields diminishing returns, while creating problems [13] such as higher memory granularity, more memory components on the line card, and wider data paths.
The low efficiency of multichip DRAM buffers can be improved by using special techniques aimed at reducing bank conflicts in a DRAM buffer, such as pipelining and out-of-order access techniques [47], [46], [37], or by exploiting row locality whenever possible in order to enhance average-case DRAM bandwidth [8], [22]. Using faster DRAM components (e.g., RLDRAM [28], FCDRAM [14], etc.) would also lead to faster buffers. However, from the previous discussion, it is clear that, to support a line rate as high as OC-3072, DRAM-only buffers are not sufficient and alternatives should be considered.
Taking these facts into account, the fastest packet buffers with worst-case bandwidth guarantees that can be found in the literature are hybrid SRAM/DRAM designs, first described in [29]. In these designs, Virtual Output Queuing and a combined SRAM/DRAM packet buffer architecture are used: SRAM stores only the tails and heads of the queues in order to sustain the line rate, while DRAM stores the rest of the cells in order to provide the large storage that is needed. To our knowledge, the hybrid design proposals made in this field are [29], [17], and [18].
Our novel proposal presented in this paper maintains the hybrid SRAM/DRAM design of [29], but introduces the following changes: 1) We redesign the functional blocks that govern SRAM/DRAM memory transfers to obtain a general hybrid SRAM/DRAM design, for which the schemes of [29] and [17] can be seen as particular cases. 2) We propose a new algorithm that reduces the SRAM size of the scheme proposed in [29] by almost an order of magnitude. Furthermore, this new algorithm avoids the memory fragmentation problem of the scheme proposed in [17]. To the best of our knowledge, the design proposed in this paper is the fastest that has been published to date for large packet buffers.
A technological evaluation presented in this paper shows that our design can support thousands of queues for line rates of 160 Gbps using commodity DRAM.
The rest of the paper is organized as follows: In Section 2, we explain the system assumptions for the paper. In Sections 3 and 4, we describe our proposal and we argue that the previous ones are particular cases of our design. Section 5 presents the Random BAU scheme, Sections 6 and 7 address SRAM dimensioning and numerical results, and Section 8 discusses implementation issues and a technological evaluation. Section 9 is devoted to previous work on VOQ buffer design and Section 10 gives some concluding remarks.
SYSTEM ASSUMPTIONS
During the next few years, aggregate router throughput will probably grow by increasing the number of interfaces rather than increasing line rates [36]. Although line rates have increased rapidly over the past years (up to OC-192 or OC-768 [30], [7]), it seems that this increase is close to its electronic limits: around OC-3072 [36].
The use of Dense Wavelength-Division Multiplexing (DWDM), however, increases the number of channels available on a single fiber (without increasing the individual line rates), leading to a number of interfaces on the order of several hundred.
Our target design is to support line rates as high as OC-3072 and a number of line interfaces on the order of thousands, with the capacity of grouping cells into a number of internal logical queues (e.g., using Virtual Output Queuing, as we explain below). We can therefore set several parameters that are of utmost importance in the packet buffer design: required bandwidth, buffer size, basic time slot, and number of data structures internal to the buffer.
Required bandwidth. For an input-queuing architecture, the required packet buffer bandwidth is twice the line rate, as every packet must be both written to and read from memory before being forwarded. In a shared-memory buffer, packets from all the inputs are stored in a common buffer pool, with each output removing one packet from the pool each time slot, so the required total buffer bandwidth is twice the line rate times the number of input interfaces. In the numerical examples that we use in this paper, we do not consider any further speed-up. Note, however, that, in practice, a speed-up of 1.5-2 is commonly used to compensate for allocation and scheduling conflicts.
Buffer size. As we discussed in the previous section, router manufacturers usually employ packet buffers of a size equal to an estimate of a typical packet round-trip time over the Internet times the line rate [13]. Taking a typical round-trip time of 0.2 sec, the required buffer size for a line rate of 160 Gbps is 4 GB.
Basic time slot. We assume that packets in the router are internally fragmented into units that we will call cells. We will take a fixed cell length of 64 bytes [4]. Cells are handled as independent units, although they are reassembled at the output port before packet transmission. The system operates synchronously in fixed time slots, which correspond to the transmission time of a cell at the line rate. For example, for a line rate of 160 Gbps, the basic time slot is 3.2 ns.
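The basic time slot follows directly from the cell size and the line rate; the sketch below simply reproduces the 3.2 ns figure, together with the corresponding values for the other line rates considered later in the paper.

    # Basic time slot = transmission time of one 64-byte cell at the line rate.
    CELL_BITS = 64 * 8
    for name, rate in (("OC-768", 40e9), ("OC-3072", 160e9), ("OC-12288", 640e9)):
        slot_ns = CELL_BITS / rate * 1e9
        print(f"{name}: {slot_ns:.1f} ns per cell")   # 12.8, 3.2, and 0.8 ns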
Number of internal data structures. As is well known, in order to achieve full link utilization, input-buffered routers require the use of Virtual Output Queuing (VOQ) [43]. In VOQ (see Fig. 1), the input buffer maintains Q separate logical FIFO queues. Each logical queue corresponds to an output line interface and a class of service. When a cell reaches the input line interface, it is placed at the tail of the queue corresponding to its outgoing interface. When an input port receives a request for a cell addressed to a given output, the cell is taken from the head of the corresponding queue in the VOQ buffer. We will assume that our packet buffer incorporates this mechanism. We also assume that the number of Virtual Output Queues to be supported is around 1,000.
DRAM bank interleaving. In response to the growing gap between processor and memory speed, DRAM manufacturers have created several new architectures that address the problems of latency, bandwidth, and cycle time (e.g., DDR-SDRAM [23] or RAMBUS DRDRAM [10]). All of these commercial DRAM solutions implement a number of memory banks (as many as 512) that can be addressed independently. Banking in DRAM allows an access to one bank to begin while another is still busy. Thus, by performing several on-the-fly requests to different banks, we can reduce the "random" access time of the DRAM memory system, which implies a reduction in the SRAM size needed by a hybrid DRAM/SRAM architecture, as is shown in [16]. Fig. 2 illustrates the concept of a memory bank and an interleaved memory system. A memory bank is a set of memory modules that are always accessed in parallel with the same memory address. The number of memory modules grouped in parallel is dictated by the size of the data element we want to address. This size in cells is the data granularity. In the following, we shall refer to the data granularity used as b. Furthermore, in a conventional DRAM memory system, the data are interleaved across all memory banks using a specific policy and the memory controller is simply in charge of broadcasting the addresses to them. Each memory bank has special logic that determines whether or not the address identifies a data item that the bank contains. We will assume that our packet buffer incorporates DRAM bank interleaving.
GENERAL HYBRID SRAM/DRAM ARCHITECTURE (GHDS)
Our proposal is a general hybrid SRAM/DRAM design of which the schemes of [29] and [17] are particular cases, as we show in the following section. Fig. 3 shows the GHDS architecture. The system consists of 1) two fast but costly SRAM memory modules (t-SRAM and h-SRAM), 2) a slow but low-cost DRAM memory, and 3) the functional blocks that govern the transfers between the DRAM/SRAM memory modules (indicated as Memory Management in Fig. 3).
In the scheme shown in Fig. 3 , we maintain the hybrid SRAM/DRAM memory organization of [29] with the addition of units for DRAM bank allocation and DRAM scheduling. These units are key to obtaining a system which can fully exploit the capabilities of interleaved memory access.
The DRAM memory is organized in M banks and data are interleaved among them. However, note that there are two fundamental limits to using bank interleaving. The first is the address bus speed, that is, the cycle time required to rebroadcast an address to all memory banks. The second is the problem of bank collisions. In order to fully exploit the potential bandwidth of an interleaved memory system, we need to guarantee that the same bank is not accessed twice within its random access time (T). A bank conflict occurs when this constraint cannot be fulfilled. The implementation of conflict-free mechanisms is especially relevant in the context of fast packet buffering because a collision would result in the loss of a packet. In proposals [29] and [17], no bank collision ever takes place. We present in this paper an alternative system for which collisions only occur with a very small probability, independently of the traffic pattern.
The t-SRAM and h-SRAM cache the tails and heads, respectively, of each VOQ logical queue. The rest is stored in DRAM. Cells that come into the buffer are placed in the t-SRAM, whereas cells that will leave the system in the near future are placed in the h-SRAM. Since the SRAM memory bandwidth must match the line rate, the SRAM access time must be less than or equal to the transmission time of a cell (that is, the time slot). The availability of room in the t-SRAM and the availability of cells to be served in the h-SRAM are controlled using two Memory Management Algorithms: the tail Memory Management Algorithm (t-MMA) and the head Memory Management Algorithm (h-MMA), respectively, which must guarantee that there is always room in the t-SRAM for an incoming packet and that any packet to be output is always present in the h-SRAM before it needs to be served (i.e., the cache never misses).
Therefore, the accesses to DRAM are managed by the t-MMA and the h-MMA. When the occupancy of the t-SRAM reaches a given threshold, a transfer from t-SRAM to DRAM of a group of cells addressed to the same output interface is ordered by the t-MMA. Conversely, when the h-SRAM needs to serve a cell that currently resides in DRAM, the h-MMA orders a group transfer from DRAM to h-SRAM. In order to match the DRAM and SRAM access times, transfers between DRAM and SRAM occur in batches of b cells, whose size should be set to the ratio of the DRAM random access time to the transmission time of a cell. In the following two subsections, we describe these algorithms in depth.
Tail Memory Management Algorithm
Every b time-slots, the tail Memory Management Algorithm (t-MMA) selects a queue and a memory bank from which b cells must be transferred from t-SRAM to DRAM. These transfers should guarantee that the t-SRAM does not fill up before DRAM. Otherwise, losses would occur before the DRAM is full.
The t-MMA module consists of (see Fig. 4): a Queue Transfer Requester module (t-QTR), a Request Register (t-RR), and a DRAM Scheduler Algorithm module (t-DSA). Two additional modules, a Bank Allocation Unit (BAU) and the Ongoing Request Register (ORR), described later in this section, are shared by both the t-MMA and the h-MMA. The functional blocks of the t-MMA work as follows: When a cell for queue i arrives to be stored in the t-SRAM, the t-QTR decides whether a transfer from the t-SRAM to DRAM has to be scheduled for this queue. Since the t-SRAM has to be emptied as soon as possible, the t-QTR schedules a transfer whenever it is able to do so, i.e., when b cells of queue i are standing in the t-SRAM. Equivalently, let C_i^t be a counter of the number of cells arriving at queue i (C_i^t is initialized to 0). Each time a cell arrives for queue i, C_i^t is increased and the t-QTR issues a transfer request for queue i if (C_i^t mod b) = 0. The request issued by the t-QTR is processed by the Bank Allocation Unit (BAU), which in turn chooses the bank in which the cells should be allocated (the algorithm will be discussed in Section 5). The request issued by the BAU contains the queue from which b cells must be transferred and the bank in which these cells will be placed.
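As an illustration of the t-QTR rule just described, the following minimal sketch keeps one arrival counter per queue and emits a transfer request every b arrivals; the class and method names are ours and do not correspond to any actual implementation.

    # Minimal sketch of the tail Queue Transfer Requester (t-QTR): a request to
    # move b cells of queue i from t-SRAM to DRAM is issued every b arrivals.
    class TailQTR:
        def __init__(self, num_queues: int, b: int):
            self.b = b
            self.count = [0] * num_queues          # the counters C_i^t, initialized to 0

        def on_cell_arrival(self, queue: int):
            """Called each time a cell for `queue` is stored in the t-SRAM."""
            self.count[queue] += 1
            if self.count[queue] % self.b == 0:    # (C_i^t mod b) = 0
                return ("transfer_request", queue) # handed to the BAU for bank selection
            return None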
In order to avoid DRAM bank conflicts, a tail DRAM Scheduler Algorithm (t-DSA) is used. It takes into account two registers: the Tail Requests Register (t-RR) and the Ongoing Requests Register (ORR). The t-RR is a shift register that stores the t-SRAM requests processed by the BAU that have not yet been fulfilled. Every b slots, the t-DSA selects one of the transfer requests pending in the t-RR, which can be located at any position of the register, and issues a write transfer to DRAM. To choose it, the t-DSA may take into account the information stored in the Ongoing Requests Register (ORR). The ORR is a shift register that stores the identifiers of the banks that are currently being accessed. If a new request were issued to any of these banks, a bank conflict would arise. Hence, the banks with identifiers stored in the ORR are locked. Therefore, the t-DSA chooses the oldest request in the t-RR addressed to a bank that is not locked, starting a new transfer of b cells and placing the memory bank identifier at the tail of the ORR.
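The scheduler can be pictured as in the simplified functional sketch below, where the ORR is modeled as a bounded queue holding the banks of the last B/b accesses issued; it is a behavioral model under these assumptions, not a hardware description.

    # Simplified sketch of the tail DRAM Scheduler Algorithm (t-DSA): every b
    # slots, pick the oldest pending request whose target bank is not locked.
    from collections import deque

    class TailDSA:
        def __init__(self, B: int, b: int):
            self.t_rr = []                           # pending (queue, bank) requests, oldest first
            self.orr = deque(maxlen=B // b)          # banks of the ongoing DRAM accesses (locked)

        def add_request(self, queue: int, bank: int):
            self.t_rr.append((queue, bank))          # request as produced by the BAU

        def schedule(self):
            """Run once every b time slots; returns the request issued to DRAM, if any."""
            locked = set(self.orr)
            for i, (queue, bank) in enumerate(self.t_rr):
                if bank not in locked:               # oldest request to a non-locked bank
                    self.t_rr.pop(i)
                    self.orr.append(bank)            # keep the bank locked while the access lasts
                    return (queue, bank)
            return None                              # every pending request collides: a miss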
Head Memory Management Algorithm
The transfers between the h-SRAM and DRAM are managed by the head Memory Management Algorithm (h-MMA). The h-MMA has to guarantee that the cells transferred from DRAM to h-SRAM arrive in time to serve the sequence of cells requested, for example, by the switch fabric scheduler in an input-queuing switch. Otherwise, a requested cell may not be present in the h-SRAM because it may not have been transferred from the DRAM yet. We shall refer to this condition as a miss.
Again, the h-MMA algorithm is simple: Schedule a transfer for queue i whenever the number of requests for cells belonging to queue i exceeds the number of cells from this queue present in the h-SRAM. Equivalently, let C_i^h be a counter of the number of cells requested from queue i (C_i^h is initialized to 0). Each time a cell from queue i is requested, C_i^h is increased and the head Queue Transfer Requester module (h-QTR) issues a transfer request for queue i if (C_i^h mod b) = 1. We shall refer to the delay from when the h-QTR schedules a transfer until the corresponding download of b cells from DRAM to h-SRAM is finished as the h-MMA response time. Analogously, we define the t-MMA response time as the delay from when the t-QTR schedules a transfer until the corresponding upload of b cells from t-SRAM to DRAM is finished.
The rest of the functional blocks of the h-MMA work analogously to those of the t-MMA. Nevertheless, we need an additional latency register (see Fig. 4). This register introduces a delay from when a request is issued until the h-SRAM is accessed to grant the corresponding cell. This delay is necessary to cope with the response time of the h-MMA. Furthermore, note that, in order to have a zero miss probability, the delay introduced by the latency register should be equal to the maximum response time of the h-MMA.
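The latency register can be pictured as a fixed-depth shift register whose depth equals the delay to be absorbed; the fragment below is a functional model only, with names of our own choosing.

    # Functional sketch of the latency register: each scheduler request is
    # delayed `depth` slots before the h-SRAM is read, giving the h-MMA time
    # to complete the corresponding DRAM-to-h-SRAM transfer.
    from collections import deque

    class LatencyRegister:
        def __init__(self, depth: int):
            self.pipe = deque([None] * depth, maxlen=depth)

        def tick(self, new_request):
            """Insert this slot's request and return the request that is now due."""
            due = self.pipe[0]
            self.pipe.append(new_request)   # maxlen drops the oldest entry automatically
            return due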
EXTENSION OF THE GENERAL MODEL TO PREVIOUS DRAM/SRAM SCHEMES
In this section, we show that previously proposed DRAM/SRAM schemes, such as [29] and [17], are particular cases of the general model introduced in Section 3.
Random Access DRAM System (RADS)
In this paper, we shall refer to the hybrid DRAM/SRAM scheme proposed in [29] as the Random Access DRAM System (RADS). This scheme is shown in Fig. 5. Since DRAM bank interleaving is not exploited, this memory system cannot take advantage of banks. Let us define B as the minimum granularity that can be used if we require a random transfer between any SRAM and any DRAM memory bank. In this case, B is limited by the random access time of the DRAM (T): if the link rate is R bps and the cell size is C bits, we have B ≥ 2TR/C (we use 2T since, every B time slots, we have to do both a read and a write transfer to DRAM). Therefore, to analyze this proposal, we are forced to rely on the worst-case scenario, so the data granularity is given by the DRAM random access time of a single bank (b = B). In this case, no specific BAU is needed (i.e., consecutive accesses to DRAM can be done to any bank). The DSA can be seen as a FIFO scheduler, which alternately chooses the oldest write and the oldest read stored in the t-RR and the h-RR, respectively. This scheme is equivalent to the so-called Early Critical Queue First (ECQF) proposed in [29]. It can be shown, using a worst-case pattern argument, that the required h-SRAM, t-SRAM, and latency register size in cells is Q(B − 1) + 1. Smaller sizes would lead to miss probabilities that could be large for some specific traffic patterns. Note that, in this hybrid architecture, the transfers between SRAM and DRAM have a size in cells that is set to the ratio of the DRAM random access time to the transmission time of a cell. As this factor directly influences the SRAM size, large SRAMs are needed to sustain high line rates and a large number of interfaces.
This, in turn, limits what access times are attainable. This buffer design would support line rates of up to OC-3072, but only for a reduced number of interfaces. Thus, although the scheme proposed in [29] ensures zero loss probability for cells coming to a nonfull buffer, the required SRAM size for a large number of interfaces becomes too large.
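Plugging in representative figures (a 160 Gbps line, 64-byte cells, Q = 1,024 queues, and an assumed DRAM random access time of 51.2 ns; the DRAM timing is an illustrative assumption, not a datasheet value) gives the following worst-case granularity and RADS SRAM size.

    # Worked example of the RADS dimensioning formulas (assumed DRAM timing).
    import math

    R = 160e9          # line rate in bps (OC-3072)
    C = 64 * 8         # cell size in bits
    T = 51.2e-9        # assumed DRAM random access time
    Q = 1024           # number of VOQs

    B = math.ceil(2 * T * R / C)        # B >= 2TR/C  ->  32 cells per transfer
    sram_cells = Q * (B - 1) + 1        # worst-case SRAM size: Q(B - 1) + 1
    print(B, sram_cells)                            # 32 and 31,745 cells
    print(sram_cells * 64 / 2**20, "MB per SRAM")   # about 1.9 MB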
Conflict-Free DRAM System (CFDS)
In [17], we described a scheme which aims at reducing the SRAM size of [29] while supporting a larger number of interfaces. The scheme we proposed in [17] is based on the observation that the effective DRAM access time can be reduced by overlapping multiple accesses to different banks, that is, by exploiting the potential DRAM bank interleaving. This allows us to reduce the granularity of the accesses, thereby also reducing the SRAM size. Fig. 6 summarizes the Conflict-Free DRAM System (CFDS) memory architecture. In this proposal, we maintain the same hybrid SRAM/DRAM structure and MMA subsystem as [29], but we completely redesign the DRAM system. We propose a DRAM storage scheme and its associated access method that achieves a conflict-free access memory organization with a reduction in the granularity of DRAM accesses.
Note that, if we use a DRAM with M memory banks and a random cycle time of T seconds per bank, it is theoretically possible to initiate a new memory access every T/M seconds. Therefore, the data granularity can potentially be reduced by a factor of M (as we can perform sequential accesses at an M times faster rate). However, remember from Section 2 that we need to guarantee that the same bank is not accessed twice within its random access time. Otherwise, a bank conflict occurs.
The DRAM memory organization is as follows: Let M be the number of DRAM banks. We organize these banks into G = M/(B/b) groups of B/b banks per group (see Fig. 7). Each group stores cells of Q/G queues. Banks are accessed by transferring b cells from the same queue. In order to avoid bank conflicts, the cells in each queue are stored in blocks of b cells following a round-robin order among all the banks belonging to the group to which the queue was assigned (block-cyclic interleaving).
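Under these assumptions, the bank that receives the nth b-cell block of a queue can be computed as in the sketch below; the static queue-to-group assignment (queue mod G) is our simplification of the assignment policy, introduced only for illustration.

    # Sketch of the CFDS block-cyclic bank mapping: M banks form G = M/(B/b)
    # groups of B/b banks; a queue is pinned to one group and its successive
    # b-cell blocks are written round-robin across the banks of that group.
    def cfds_bank(queue: int, block_index: int, M: int, B: int, b: int) -> int:
        banks_per_group = B // b
        G = M // banks_per_group
        group = queue % G                        # simplified static queue-to-group assignment
        offset = block_index % banks_per_group   # block-cyclic interleaving inside the group
        return group * banks_per_group + offset

    # Example: M = 32 banks, B = 32, b = 8 -> 8 groups of 4 banks each.
    # Successive blocks of queue 5 go to banks 20, 21, 22, 23, 20, 21, ...
    print([cfds_bank(5, n, M=32, B=32, b=8) for n in range(6)])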
The transfers between the DRAM and SRAM are managed by the DRAM Scheduler Subsystem (DSS) shown in Fig. 6. It hides the DRAM bank organization from the MMA Subsystem described above, which operates under the illusion that the DRAM access time is b time slots, even though the actual DRAM access time is B time slots. This is the main difference between [29] and [17]: in CFDS, only b < B cells are transferred between DRAM and SRAM per access, and a new access can be started every b time slots, even though, in reality, the access time of each bank remains B time slots. It is this illusion that reduces the SRAM size.
The DSS uses a DRAM Scheduler Algorithm (DSA) to avoid bank conflicts, using two registers: the Requests Register (RR) and the Ongoing Requests Register (ORR). The behavior of the DSA is analogous to that explained in Section 3 for the GHDS model. The DSA must thereby choose the oldest eligible request in the RR and, then, issue the write or read transfer to or from DRAM.
This implies that the DRAM subsystem may deliver cells out of order. Reordering these cells implies an additional cost in terms of latency and SRAM size. The additional delay, equal to the maximum delay that a replenish request can suffer due to the DSA reordering, is introduced by the latency shift register shown in Fig. 6. However, analysis in [16] shows that this reordering is bounded and that a zero miss condition can be guaranteed. Moreover, the benefits of decreasing the granularity outweigh the additional cost introduced by the reordering process.
This memory scheme has the drawback of DRAM memory fragmentation, i.e., certain traffic patterns would lead to a situation in which only a fraction of the DRAM memory can be used, depending on the assignment of queues to memory groups.
In [17], we alleviated this by using a queue renaming mechanism that reduces the probability of DRAM memory fragmentation. It consists of associating each logical queue name Q^l, used internally for identifying queues assigned to a certain group, with more than one physical queue name Q^p. Initially, when no cells from Q_i^l are stored in DRAM, a free identifier Q_j^p from the group with the fewest cells is assigned to it. If cells arriving at this queue find that the DRAM assigned to the group is full, a new physical identifier Q_k^p will be chosen from another group that can offer free DRAM space. By doing this, cells from a given logical queue can reside in more than one memory group and can potentially occupy the whole DRAM system. Queue renaming makes it much more difficult for memory fragmentation to arise. However, for some traffic patterns, fragmentation cannot be avoided.
RANDOM BAU SCHEME (RBAU)
In this section, we describe a scheme, first presented in [18], that exploits DRAM bank organization as in [17] (see Section 4). It allows a data granularity of b < B and, thus, reduces the SRAM size. Moreover, the scheme described in this section does not have the memory fragmentation problem of [17].
The BAU we propose randomly chooses a DRAM memory bank for every transfer request issued by the QTR. This random selection is done as follows: Let r_n^i be the nth request for the ith queue issued by the t-QTR. Then, the DRAM memory bank allocated to r_n^i is randomly chosen in such a way that the requests r_{n-B/b+1}^i, ..., r_{n-1}^i, r_n^i are always addressed to different banks (i.e., different banks are chosen for any B/b consecutive requests for the same queue). As the queues are FIFO, consecutive h-QTR requests for the same queue will also correspond to different bank accesses. In this way, we avoid bank conflicts in transfers of cells from the same queue (as we saw in Section 3, we can only access the same bank every B time slots and we access DRAM every b time slots).
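A minimal sketch of this random bank allocation, assuming a per-queue memory of the last B/b − 1 banks used, is given below; it uses Python's random module as a stand-in for the hardware pseudo-random generator discussed next.

    # Sketch of the Random BAU: for each new request of queue i, pick a bank
    # uniformly at random among those NOT used by that queue's previous
    # B/b - 1 requests, so any B/b consecutive requests hit distinct banks.
    import random
    from collections import deque

    class RandomBAU:
        def __init__(self, num_queues: int, M: int, B: int, b: int):
            self.M = M
            window = B // b - 1                              # banks to exclude per queue
            self.recent = [deque(maxlen=window) for _ in range(num_queues)]

        def allocate(self, queue: int) -> int:
            excluded = set(self.recent[queue])
            bank = random.choice([k for k in range(self.M) if k not in excluded])
            self.recent[queue].append(bank)                  # remember it for the next requests
            return bank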
The associated DSA chooses the oldest eligible request in the RR, i.e., the oldest request that can be issued to DRAM without suffering a bank conflict. The random selection of the RBAU scheme could be easily implemented with a Linear Feedback Shift Register (LFSR) or a Tausworthe generator (e.g., [49], [1]).
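As a rough illustration of such a hardware-friendly generator, the fragment below implements a generic maximal-length 16-bit Galois LFSR; it is only an example of the kind of generator mentioned above, not the particular one used in [18], [49], or [1].

    # Generic 16-bit Galois LFSR (taps x^16 + x^14 + x^13 + x^11 + 1, mask 0xB400),
    # one common way to produce a pseudo-random sequence with simple hardware.
    def lfsr16(state: int):
        """Yields an endless stream of 16-bit pseudo-random states (state != 0)."""
        while True:
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= 0xB400
            yield state

    gen = lfsr16(0xACE1)
    print([next(gen) % 256 for _ in range(5)])   # e.g., mapped onto M = 256 banks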
SRAM DIMENSIONING
In this section, we describe some dimensioning guidelines for the SRAM modules used in our GHDS Architecture for the RBAU mechanism explained in Section 5.
Let us first assume b = B. As we explained in Section 4, in this case there are never bank conflicts and the scheme is equivalent to the ECQF mechanism proposed in [29], thus requiring sizes for both SRAMs and the latency register of Q(B − 1) + 1 cells. Now, let us consider a scenario with a granularity b < B, in which bank conflicts may occur. A miss occurs when all the transfer requests stored in the t-RR or in the h-RR are addressed to banks that are busy at that moment, so the t-DSA or the h-DSA, respectively, loses the chance of issuing a write or read request to DRAM. In the following, we outline the dimensioning of both the t-SRAM and the h-SRAM.
For the t-SRAM, a miss in the t-DSA implies that no write request will be served for the next B slots. Therefore, the maximum t-SRAM occupancy will increase by b cells. Furthermore, as the t-RR can receive requests every slot and the t-DSA only runs every b slots, the system cannot recover from this loss. Fig. 8 shows this behavior.
In order to overcome this problem, we introduce a small speed-up in the t-DSA. This speed-up consists of adding extra read cycles in the t-DSA. During these extra cycles, the t-DSA can serve the requests accumulated when misses occur, thus making it feasible for the system to recover from this situation. In the next section, we show that, for practical purposes, the t-SRAM size can be given by Q(b − 1) + 1.
Let us now consider the h-SRAM case. If the h-DSA cannot issue any read request to DRAM because of a miss, there will not be any read transfer between DRAM and h-SRAM after B slots. For this reason, when its associated cell reaches the head of the latency register, the h-SRAM will not be able to serve it. Hence, the maximum response time could be as high as Q(B − 1) + 1. This response time would occur if the scheduler issued Q consecutive requests addressed to different queues and all the requested cells were stored in the same DRAM memory bank. If we want to guarantee a zero miss probability, we would need an h-SRAM, t-SRAM, and latency register of size Q(B − 1) + 1. This maximum value would be needed in the event that all requests are addressed to cells stored in the same bank. However, given the random bank assignment policy used by the RBAU scheme, we expect the probability of this event to be extremely low (on the order of 1/M^Q), independently of the traffic pattern. In other words, it is plausible to assume that the event leading to the maximum response time Q(B − 1) + 1 under the RBAU scheme is very unlikely to happen. In fact, in the next section, we show that, for practical purposes and realistic values of M, the system can be dimensioned as if no bank conflicts occurred, i.e., assuming an MMA maximum response time of Q(b − 1) + 1.
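The practical impact of replacing B with b in the dimensioning rule is easy to quantify for the parameters used in the next section (Q = 1,024, B = 32, 64-byte cells); the figures below follow directly from the Q(b − 1) + 1 rule.

    # SRAM size under the Q(b - 1) + 1 rule for different granularities
    # (b = B = 32 reproduces the RADS/ECQF worst case).
    Q, CELL_BYTES = 1024, 64
    for b in (32, 8, 4, 2):
        cells = Q * (b - 1) + 1
        print(f"b = {b:2d}: {cells:6d} cells = {cells * CELL_BYTES / 2**20:.2f} MB")
    # 32 -> 1.94 MB, 8 -> 0.44 MB, 4 -> 0.19 MB, 2 -> 0.06 MB per SRAM:
    # roughly an order of magnitude below the RADS requirement.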
NUMERICAL RESULTS
In this section, we analyze the RBAU scheme described in Section 5. For the results shown here, we use the following scenario: The t-MMA and h-MMA, respectively, receive a sequence of cell arrivals and scheduler requests in round-robin order for queues 1, 2, ..., Q. In response to this pattern, the t-QTR and the h-QTR generate periodic bursts of transfer requests for queues 1, 2, ..., Q. Note, however, that the random bank assignment performed by the BAU makes the exact sequence of cell arrivals and scheduler requests irrelevant.
For h-SRAM dimensioning purposes, the key parameter to study is the h-MMA maximum response time (see Section 6). Remember from Section 6 that it would be the same for the t-MMA. The influence of M, the number of memory banks, is easily assessed in Fig. 11. The curves were obtained for Q = 1,024, B = 32, and b = 8. As we can see, the lines associated with values of M between 8 and 256 are almost coincident, indicating that the delay is essentially independent of the number of banks used when we have more than four banks. Similar conclusions are drawn from Fig. 11b for b = 2. In this case, the number of banks required for achieving this insensitivity is M = 16.
The previous numerical results show that Q(b − 1) + 1 is a plausible dimensioning rule for the t-SRAM, h-SRAM, and the latency register in the Random BAU scheme for realistic values of M and b. As a consequence, provided that we can build a fast enough MMA unit, the SRAM size can be almost an order of magnitude lower than the one that would be required using the design given in [29]. Furthermore, the RBAU scheme does not have the DRAM fragmentation problem of the design we proposed in [17]. It is clear that increasing the value of M will lead to a lower collision probability when the previous dimensioning rule is used. Although we do not have analytical expressions or bounds available for this collision probability, we can expect that large values of M (e.g., M = 128) will lead to tiny collision probabilities, even for low values of b.
Additional numerical results can be found in [18].
EVALUATION OF THE GHDS MEMORY ARCHITECTURE
In this section, we extend the scope of the study of the RADS, CFDS, and RBAU systems to design and implementation issues, taking into account technological constraints. We use RADS to refer to the hybrid DRAM/SRAM scheme proposed in [29], based on a deterministic worst case and an access granularity matching the DRAM random access time (see Section 4.1). We use CFDS to refer to the variation proposed in [17], in which we combined a specific bank interleaving scheme with a memory reordering system that allowed us to guarantee zero packet loss with a granularity lower than the DRAM random access time (see Section 4.2). Finally, as described in Section 5, we use RBAU to refer to our general GHDS hybrid SRAM/DRAM architecture with the Random BAU scheme. We pose the problem of implementing an SRAM structure able to handle several queues and we propose two design alternatives: one targeted at low cost (area and power) and one targeted at high performance. We finally estimate the area and access time and determine the viability of the proposed SRAM design. We show how RBAU helps to alleviate the limits of technology by requiring smaller SRAMs (to an even greater extent than previous systems such as CFDS), thus allowing simpler designs to meet the area and time restrictions.
Throughout this section, we shall assume OC768 and OC3072 links, with 128 and 512 queues, respectively.
Design of SRAM Buffers
The main focus of this section is to establish the possible technological barriers in the near future for the RADS/CFDS/RBAU SRAM buffer schemes. We have used CACTI 3.0 [41] to estimate the access time (in ns) and the area (in cm²) of different implementations of the t-SRAM and h-SRAM buffers using a 130 nm technological process as a baseline. We have also evaluated the capability limits for near-future technological processes: 65 and 45 nm. CACTI is an integrated cache access time, cycle time, area, aspect ratio, and power analytical model. The main advantage of CACTI is that, as all these models are integrated together, trade-offs between the different parameters are all based on the same assumptions and, hence, are mutually consistent.
We have assumed that the t-SRAM and h-SRAM are shared by all the queues. The design of a unified (shared) SRAM buffer is not as trivial as the design of a distributed (isolated) SRAM buffer, in which each queue has its own partition of the available memory. The second kind of SRAM buffer could be easily implemented as a set of circular queues built with simple direct-mapped SRAM structures. On the other hand, in the shared SRAM buffer, we need a mechanism to identify exactly where the nth element of a given queue q_i is placed. Intuitively, this is similar to the design of Q linked lists in which the next cell to be accessed by the arbiter is located at the head of the corresponding list and the next cell to be stored coming from the DRAM is placed at the tail of the corresponding list. Fig. 12 shows two different alternatives for implementing a unified SRAM buffer, one of them targeted at achieving a very short access time (and, hence, suited to high-performance implementations) and the other targeted at achieving the lowest impact on area and power consumption (and, hence, suited to cost-effective implementations).
The global CAM design consists of a full content-addressable memory. In such a memory, cells are stored out of order and can be indexed using a tag. Each cell's tag identifies the queue to which that cell belongs and its relative order inside the list of cells of that same queue. When the address (queue identifier and order) of the cell is set, the CAM searches across all entries for the related cell. Note that we assume that the refreshes from the DRAM are serialized along B time slots at a rate of one cell per slot. This implementation requires one CAM port (to look for a given cell entry) and one write port (to allocate new entries). Additionally, the system requires a method to allocate new cells in available entries. A very simple way of handling this is to implement a free-list as a direct-mapped circular buffer that holds indexes to unused entries of the global CAM. Such a structure would require one read port and one write port, and a head and a tail pointer. The functional timing behavior of the global CAM is as follows: Reading a new item implies searching the CAM array for the specified item. This gives us the index of the entry just read so that we can write this index into the free-list as a new available entry. Writing a new item involves accessing the free-list to obtain a valid index to write to. Once this index is obtained, we can do a direct-mapped write onto the CAM array.
Note that we could conceivably reduce the critical path of the access by overlapping the accesses to the CAM array and the free-list. In order to enable this, we would need a lookahead of one when obtaining a valid entry in the free-list and latch it. The drawback of this technique is that we may produce bubbles when the SRAM is full, which is a very unwanted artifact. Therefore, we will consider that the accesses to both structures need to be serialized.
Finally, the unified linked-list proposal is a straightforward implementation of Q linked lists onto a direct-mapped memory structure. Each entry of that direct-mapped SRAM contains one cell and a pointer to the next cell (another entry of the same structure). In order to be able to identify the head and the tail of each linked list, we have another direct-mapped structure that stores the head and tail pointers for each of the queues. Of course, we also need a free-list to determine the available entries in the SRAM array whenever a new item is written. The advantage of this implementation is that it does not require a complex CAM. However, it requires one additional write operation to store the position of the new tail into the pointer field of the old tail. As a result, the structure needs one read port and two write ports per structure (or has to perform three accesses with one single read/write port in a time-multiplexed configuration).
Since this design is targeted at low power and area impact, the access is performed time-multiplexed. That is, instead of having several read-write ports that can act in parallel, we have a single read/write port for each structure so that the set of accesses can be serialized. The lower the number of ports, the lower the area and power consumption of a SRAM array. The obvious downside of this alternative is that we considerably increase the overall time required to perform all the actions.
The functional timing behavior of the unified linked-list is as follows: Reading a new item involves first reading the head-tail array to determine the location of the head of a given queue. Then, we read this item from the SRAM array, which, at the same time, gives us the pointer to the new head, which is required to update the head-tail entry in the first array. While the main SRAM array is accessed, since we know that the old head will become a new available entry, we can update the free-list (hence, this access is out of the critical path). Writing a new item involves, first, in parallel, reading the current tail of the queue from the head-tail array while reading a new available entry from the free-list. After that, two consecutive write accesses to the main SRAM array are needed: one to write the index of the new tail (given by the free-list) into the pointer field of what was the current tail, and one to write the actual new cell into what has now become the new tail of the queue, i.e., the available entry obtained from the free-list.
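The following functional sketch captures the unified linked-list organization (cell array with next pointers, per-queue head/tail table, and free-list); it models behavior only, ignoring ports, timing, and the time-multiplexing discussed above, and all names are ours.

    # Functional sketch of the unified linked-list SRAM buffer.
    class UnifiedLinkedListSRAM:
        def __init__(self, num_entries: int, num_queues: int):
            self.cell = [None] * num_entries            # direct-mapped cell array
            self.next = [None] * num_entries            # per-entry pointer to the next cell
            self.head = [None] * num_queues             # head-tail array, one pair per queue
            self.tail = [None] * num_queues
            self.free = list(range(num_entries))        # free-list of available entries

        def write(self, queue: int, cell):
            idx = self.free.pop()                       # obtain an available entry
            self.cell[idx], self.next[idx] = cell, None
            if self.tail[queue] is None:
                self.head[queue] = idx                  # queue was empty
            else:
                self.next[self.tail[queue]] = idx       # extra write: link from the old tail
            self.tail[queue] = idx

        def read(self, queue: int):
            idx = self.head[queue]                      # locate the head of the queue
            cell = self.cell[idx]
            self.head[queue] = self.next[idx]           # pointer read along with the cell
            if self.head[queue] is None:
                self.tail[queue] = None
            self.free.append(idx)                       # the old head becomes available again
            return cell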
SRAM Buffer Implications in GHDS Systems
For any GHDS configuration, we must address two main issues. First, the cells of a given queue coming from the DRAM memory system may arrive out of order. Second, for CFDS, the SRAM must contain additional entries to be able to hold elements before they are scheduled by the MMA. The first problem can be easily overcome by implementing some basic changes in our proposed SRAM structures to allow them to insert cells from a queue out of their natural order:
- global CAM: Implementing write operations out-of-order is trivial in this configuration, as only more bits are required in the ordinal tag field.
- unified linked-list: Out-of-order write operations are complex inside a linked list. However, an easy solution is to implement Q × M linked lists instead of Q, where M is the number of banks, since two operations on the same bank are always performed in strict order. The selection of the subqueue can easily be performed with a per-queue round-robin mechanism.
SRAM Buffer Performance

RADS SRAM Buffer Limitations
Fig. 13 shows the access time and the area of the different SRAM implementations for OC768 and OC3072 as a function of the number of slots of the lookahead. The required SRAM size depends on the size of the lookahead: It is maximum for lookahead = 0 and minimum for lookahead = Q(B − 1) + 1 (see [29]). Note that, in the numbers of Fig. 13, we account for the effect of both t-SRAM and h-SRAM (the area is the combination of both, while the time is the most restrictive one).
For the area graphs, we also show the size of the SRAM structures in MB. The OC768 system SRAM size ranges between 600 kB (for the minimum lookahead) and 128 kB (for the maximum lookahead). The OC3072 system SRAM size ranges between 10.5 MB (for the minimum lookahead) and 4.0 MB (for the maximum).
For an OC768 system, we need to access a new cell every 12.8 ns (assuming 64-byte cells). We can observe from Fig. 13 that the access times of all SRAM alternatives are far below that number, even for the shortest lookahead. Therefore, as access time is not a concern, we shall focus on the implementations with minimum area. For instance, the global CAM implementation has an overall area of more than 0.4 cm² for lookaheads shorter than 100-200 slots, which could represent a significant fraction of the overall transistor budget of a medium-cost system. On the other hand, the time-mux unified linked list exhibits an area smaller than 0.2 cm² even for very small lookaheads, and as low as 0.1 cm². For this reason, in the rest of the paper, we will target the time-mux unified linked list SRAM for all OC768 systems.
For an OC3072 system, we need to access a new cell every 3.2 ns, which is a significantly harder constraint to meet, taking into account that the SRAM buffers are now larger. Indeed, if we look at Fig. 13, we can clearly appreciate that no SRAM implementation is able to comply with the 3.2 ns target (even for the longest lookaheads). The fastest implementation is the global CAM, which has an access time slightly higher than 7 ns, and that only for extremely long lookahead delays. Furthermore, if we look at the area results for the different alternatives, we can observe that all the configurations exhibit areas larger than 2 cm², which could be another limiting design factor even for the budget of high-end systems. As the access time is clearly the constraint for OC3072, for the rest of the paper we will use the global CAM implementation (the fastest) for evaluation purposes.
RBAU Performance Improvements
Fig. 14 demonstrates the performance benefits of using RBAU instead of RADS and CFDS. It shows the area (of both h-SRAM and t-SRAM) and the most restrictive access time for OC768 and OC3072, for the maximum lookahead, as a function of the data granularity. Again, the number of queues Q is 128 for the OC768 system and 512 for the OC3072 system. We assume the number of banks M to be 256.
We should remember that RBAU always yields smaller areas and shorter latencies than CFDS for a given configuration, since RBAU does not need the extra SRAM entries required by CFDS to accommodate the bounded (yet still significant) level of reordering of memory accesses. As shown previously, this improvement comes at the cost of a very small percentage of packet losses (whereas CFDS was targeted at guaranteeing zero packet loss).
There are two main conclusions that can be inferred from the results in Fig. 14. First, one can see the evident advantages of both CFDS and RBAU relative to RADS. For OC768 (in which area is the main figure of merit), an RBAU system with b = 2 achieves an area 2.3x smaller than RADS and 10 percent smaller than CFDS. For b = 1, the differences are even greater (4.1x compared to RADS, 1.4x compared to CFDS). Note that it is not always possible to keep decreasing the main memory data granularity, as we may ultimately be limited by the main bus cycle time. However, for OC768, a DRAM cycle time of 12.8 ns would be perfectly affordable.
However, the especially significant gains are for the OC3072 system. An RBAU system with b = 8 is compliant with the requirements of buffering packets at 160 Gbps, as the access time is lower than 3.2 ns. By comparison, the CFDS system requires a lower granularity (b = 4) to meet the same requirements, putting more pressure on the DRAM memory system (as it requires a shorter main bus cycle time). Moreover, this is accomplished with an affordable area (less than 0.75 cm² overall and as low as 0.37 cm² if we keep decreasing the granularity to the minimum value). This contrasts strongly with the RADS counterpart, which is not able to access data in less than 7 ns (hence, not being compliant for an OC3072 design), even with a lookahead delay of more than 50 μs and a non-negligible area of 2 cm² (almost the size of the Pentium IV or of a 90 nm version of IBM's Cell chip, which has a die size of 214 mm²). The second important conclusion is that there is an optimal value of b for a given CFDS implementation. The reason for this is the trade-off between the SRAM size required to tolerate the unpredictability of arrivals from the arbiter (which is proportional to b) and the SRAM size required to absorb the level of reordering of the accesses from the DRAM (which is proportional to 1/b) (see [17]).
RBAU does not suffer from this shortcoming as it presents a trade-off between packet loss percentage and reordering size. At the cost of a very small probability of packet losses, we can still keep reducing the data granularity to match the timing and bandwidth specifications.
Scalability for Future Implementations
A very interesting exercise is to determine the scalability of our proposed system in a near-future scenario. There are two parameters that determine whether a given GHDS implementation is compliant with the timing targets of a packet buffer. The first one is the link rate (and, hence, the required buffer bandwidth), while the other is the technological process used to implement the SRAM buffers.
In order to extrapolate the future scalability, we measured the maximum number of queues tolerated by GHDS, varying three different parameters: the granularity b, the link rate, and the technological process. For the link rates, we selected, in addition to OC768 and OC3072, a near-future value of OC12288 (640 Gbps). For the technological processes, in addition to 130 nm, we selected the close-to-current state-of-the-art 65 nm process (to be featured next year) and the following one (45 nm). Note that we did not consider 90 nm, which is the current state-of-the-art process, as we are interested in future projections. Fig. 15 shows the maximum number of queues that the different SRAM buffer approaches can afford, taking into account the access time constraints (12.8 ns of maximum delay for OC768, 3.2 ns for OC3072, and 0.8 ns for OC12288). The graphs also show the effect of using different technological processes (130, 65, and 45 nm). As expected, the higher the link rate, the lower the number of queues. At the same time, the more advanced the technological process, the higher the number of queues. On the other hand, for CFDS, we should expect a trade-off in the granularity b, as too low a granularity may produce a level of disorder between transactions that overrides the benefits of the granularity reduction. Such a trade-off does not exist for the more general GHDS since we are no longer aiming for a zero packet loss rate. However, we should take into account that some values of b may not be feasible for certain link rates, as the limiting factor ends up being the DRAM main memory system cycle time. For instance, assuming a DDR3 memory system at 1 GHz, we would not be able to implement an OC12288 system with b = 1, as the required cycle time would be 0.8 ns.
As shown in the figure, RBAU (compared to RADS) allows around four times more queues for OC768, around 16 times more queues for OC3072, and around 50 times more queues for OC12288. The differences remain fairly constant across the different technology processes. Additionally, RBAU, compared to CFDS, allows around two times more queues for OC768, around three times more queues for OC3072 and, finally, around six times more queues for OC12288 (with a realistic granularity of b = 2).
Another very interesting consequence that we may observe from the graphs is that, if we have a target of around 2,000 queues, 130 nm is good enough for OC768, while 65 nm is good enough for OC3072. On the other hand, we may observe that there is a dramatic change in behavior for OC12288. Even for the best technological process (45 nm), the maximum number of queues we can handle is low (less than 32 for CFDS and less than 200 for RBAU with a reasonable granularity). This last observation poses the interesting challenge of designing a new system that is able to handle hundreds of queues for OC12288, as it is clear that the RBAU benefits reach their limits due to the advance of link rate technology and the CMOS scalability problems. Ultimately, it becomes evident that the performance of high-performance routers may become limited by the packet buffer performance within a span of five years.
RELATED WORK
Virtual Output Queuing was proposed for the first time in [43] (under the name "dynamically-allocated multi-queue buffers"). The amount of buffering and the line rates considered in this seminal paper were far lower than those required for our target application: high-speed backbone routers. For OC192 (10 Gbps) line rates, a time slot is lower than the random access time of DRAM. Nikologiannis and Katevenis [37] propose a design using only DRAM for a VOQ buffer architecture working at this line rate. The proposed design uses out-of-order memory access in order to reduce the number of bank conflicts, although it does not guarantee zero losses. Hasan et al. [22] propose techniques that exploit row locality whenever possible in order to enhance the average-case DRAM bandwidth. However, this scheme may have a significant miss probability for particular traffic patterns.
For faster line rates, a hybrid SRAM-DRAM implementation of a VOQ buffer using ECQF for the h-MMA is discussed in [29]. This is the starting point of our work.
There are many proposals that exploit the bank organization of DRAM memory [40], [46], [27]. This is especially true in the vector processor domain. The novelty of our proposal resides in the application of this technique to the context of fast packet buffering. For our RBAU system, we assumed a theoretically ideal random hashing mechanism for distributing cells across memory banks. In a practical implementation, we could use many of the proposed hashing mechanisms found in the literature.
CONCLUSIONS
In this paper, we have proposed a general design for hybrid SRAM/DRAM packet buffers. We have shown that two previously proposed hybrid SRAM/DRAM packet buffer designs ([29] and [17]) can be seen as particular cases of our general scheme.
Based on this general scheme, we have proposed a Random BAU (RBAU) Scheme that randomly chooses a DRAM memory bank for every transfer between SRAM and DRAM. The numerical results show that this scheme would require an SRAM size almost an order of magnitude lower than the scheme given in [29] , without the memory fragmentation problem of the scheme proposed in [17] .
Although the RBAU Scheme proposed in this paper does not have a zero miss probability, the results support the conclusion that the randomization process among memory banks allows an extremely low miss probability to be guaranteed for any traffic pattern. We think that our design may be useful for building very large and fast future packet switches.
We are planning to use CACTI 4.0 (whose release has recently been announced) to add even better technology process projections and to include power consumption numbers as well.
