I. INTRODUCTION
High-performance routers need buffers to store data during congestion. These buffers can either be shared by multiple line cards (e.g., in a shared memory router) or belong to a single line card (e.g., in a distributed memory router). These buffers are typically required to store a large amount of data at high rates, which translates to using fast, high density memories.
Though there can be multiple criteria for deciding the amount of storage required, a rule of thumb indicates that, in order for TCP to work well, the buffer should be able to store an amount of (per-port) data equal to the product of the line rate and the average round-trip time (RTT) [15]. Though this rule of thumb has been challenged recently [2], it is still widely used.
In addition, the arriving/departing packets may cause memory accesses in an arbitrary and unpredictable (i.e., random) order, thus requiring random access guarantees (in packets/s) in addition to raw bandwidth guarantees (in bits/s).
To get a feel for the size and speed requirements, consider a packet buffer on a 40Gb/s (OC768c) linecard of a distributed memory router. Assuming an average TCP RTT of 0.25s [3], this buffer needs to store 10Gb of data. In addition, assuming a constant stream of 40-byte packets, which corresponds to minimum size IP packets containing TCP ACKs, the packet buffer must read and write a packet every 8ns. This translates to one memory operation every 4ns, or a random access speed of 250Mpackets/s.
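The arithmetic behind these requirements can be checked with a short sketch. The variable names are illustrative, and the only inputs are the figures quoted above (40Gb/s line rate, 0.25s RTT, 40-byte packets).

```python
# Sizing arithmetic for the 40Gb/s linecard example (illustrative names).
LINE_RATE_BPS = 40_000_000_000   # 40Gb/s line rate
RTT_S = 0.25                     # assumed average TCP round-trip time
MIN_PACKET_BITS = 40 * 8         # 40-byte minimum size packet

# Rule of thumb: buffer size = line rate x RTT.
buffer_bits = LINE_RATE_BPS * RTT_S

# Time on the wire for one minimum size packet, in nanoseconds.
packet_time_ns = MIN_PACKET_BITS * 10**9 / LINE_RATE_BPS

# One write and one read per packet time means two memory operations.
access_time_ns = packet_time_ns / 2
accesses_per_s = 10**9 / access_time_ns

print(f"buffer size        : {buffer_bits / 1e9:.1f} Gb")
print(f"packet time        : {packet_time_ns:.1f} ns")
print(f"time per operation : {access_time_ns:.1f} ns")
print(f"random access rate : {accesses_per_s / 1e6:.0f} M/s")
```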
Note that a packet buffer supporting sixteen 2.5Gb/s ports in a shared memory router would have the same size and speed requirements. We now look at two popular memory devices, SRAM and DRAM, to see if they satisfy the above requirements.
Static random access memories (SRAMs) are relatively fast but small and power-hungry. At the time of writing, a state-of-the-art commercial SRAM [15] holds 32Mbits, has a random access time of 4ns, and consumes 1.6W. This means a 40Gb/s linecard would require over 300 SRAM devices and consume approximately 500W. So, even though SRAMs meet the speed requirement, implementing a packet buffer using only SRAM memories would be impractical in terms of area as well as power.
The rest of the paper is organized as follows. Section II describes the packet buffer architecture used to provide the statistical guarantees. Section III provides analytically derived statistical guarantees. Section IV discusses implementation details and provides a design example in the context of the 40Gb/s linecard introduced earlier in this section. Section V concludes the paper.
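The SRAM-only estimate given earlier in this section can be reproduced as follows. This is a rough sketch: it assumes 1Mbit = 10^6 bits and that power scales linearly with the device count, neither of which is stated explicitly in the text.

```python
import math

BUFFER_BITS = 10e9   # 10Gb buffer for the 40Gb/s linecard
SRAM_BITS = 32e6     # 32Mbit per device (assuming 1Mbit = 10^6 bits)
SRAM_WATTS = 1.6     # power per device

# Number of devices needed to hold the whole buffer in SRAM.
devices = math.ceil(BUFFER_BITS / SRAM_BITS)

# Total power, assuming it scales linearly with the device count.
total_watts = devices * SRAM_WATTS

print(f"{devices} devices, about {total_watts:.0f} W")
```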
II. INTERLEAVED MEMORY ARCHITECTURE

A. Basic Idea
Memory interleaving has been traditionally used in computer systems to increase the performance of memory and disk-array sub-systems [8] [13]. The idea is simple (Figure 1). For the sake of simplicity we assume that data is accessed in fixed-size cells. A cell can be a single byte or more, depending on the implementation. In the case of variable size packets, a packet can be split into cells and written into multiple DRAMs.
We then define R as the rate at which these cells need to be accessed from the packet buffer. 
As an example, consider our 40Gb/s linecard, which requires a random access every 4ns. With DRAMs that have a 40ns random access time, $b$ turns out to be 10 (= 40/4). Now, if we require a speedup of 1.1, $k$ would have to be 11 (= 1.1 × 10).
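The example can be written out as a small calculation. This sketch reads $b$ as the ratio of the DRAM random access time to the cell time and $k$ as the number of DRAMs after applying the speedup, which is how the example uses them; the function names are hypothetical. Exact rational arithmetic is used because `1.1 * 10` in floating point is slightly above 11 and would round up to 12.

```python
import math
from fractions import Fraction

def min_drams(dram_access_ns, cell_time_ns):
    """b: DRAMs needed so that each one is accessed at most once per access time."""
    return math.ceil(Fraction(dram_access_ns) / Fraction(cell_time_ns))

def drams_with_speedup(b, speedup):
    """k: number of DRAMs for a given speedup (pass the speedup as a string or Fraction)."""
    return math.ceil(b * Fraction(speedup))

b = min_drams(40, 4)
k = drams_with_speedup(b, "1.1")
print(b, k)
```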
B. SRAM FIFOs
If the packet buffer maintained just one flow, the operation would be simple: the arriving cells could be written immediately into the DRAM memories in a round-robin manner, without the need for any intermediate storage. The cells corresponding to the incoming requests would be read from the DRAM memories in the same manner. This scheme works since there is a cell written to (and read from) each memory every $k \ge b$ cell-times, and no DRAM memory is ever oversubscribed.
Things get more complicated when there are multiple (say $Q$) flows in the system. Let us assume for now that the memory management algorithm is writing to DRAM memories in a round-robin manner on a per-flow basis. It could happen that two consecutive arriving cells (at rate $R$) corresponding to two different flows get mapped to the same DRAM memory. Since a DRAM memory can accept only one of them per $b$ cell-times, the other one has to be queued in a write FIFO (Figure 1). Multiple consecutive cells could be mapped to the same write FIFO, with most of them waiting for the DRAM memory.
Given the unpredictable nature of cell arrivals, and the fact that most realistic packet buffers deal with multiple flows, it would be impossible for any memory management algorithm to avoid short-term overloads on the DRAM memories. On-chip SRAM FIFOs are needed to buffer up these short-term overloads.
In our discussion above, we focused on the write FIFOs. It is interesting to note that the write and read FIFOs are similar except that write FIFOs store actual cells whereas the read FIFOs store requests for cells. We expect their behavior to be identical otherwise.
C. Memory Management
A DRAM memory interacts only with its own write and read FIFOs, independently of the other DRAM memories. As cells arrive to the packet buffer they are written by the memory management algorithm (MMA) to the tail of a write FIFO, where they wait to be written to the corresponding DRAM memory. Similarly, as requests arrive to the packet buffer, they are written to the tail of a read FIFO, where they wait for the corresponding cell to be read from the appropriate DRAM. The MMA comes into the picture only while writing a cell into the DRAM memories. For every flow, the order in which the DRAM memories are accessed is completely determined during the write operations. Thus, the per-flow requests must be queued into the read FIFOs in the same order the corresponding cells were written into the write FIFOs.
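The ordering constraint described above can be sketched as a small bookkeeping structure. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, the DRAM index is chosen by the caller (standing in for the MMA), and the DRAM transfers themselves are not modelled.

```python
from collections import deque

class InterleavedBuffer:
    """Per-DRAM write/read FIFOs plus a per-flow record of the write order."""

    def __init__(self, num_drams):
        self.write_fifos = [deque() for _ in range(num_drams)]
        self.read_fifos = [deque() for _ in range(num_drams)]
        self.flow_order = {}  # flow id -> DRAM indices, oldest write first

    def write(self, flow, cell, dram):
        """Queue a cell at the tail of the write FIFO picked by the MMA."""
        self.write_fifos[dram].append((flow, cell))
        self.flow_order.setdefault(flow, deque()).append(dram)

    def request(self, flow):
        """Queue a read request at the DRAM that holds the flow's oldest cell."""
        dram = self.flow_order[flow].popleft()
        self.read_fifos[dram].append(flow)
        return dram

buf = InterleavedBuffer(num_drams=4)
for dram in (2, 0, 3):
    buf.write("flow-a", b"cell", dram)
read_order = [buf.request("flow-a") for _ in range(3)]
print(read_order)
```

Requests for a flow visit the DRAMs in exactly the order chosen at write time, so the read side needs no decisions of its own.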
D. Problem Statement
This scheme is not without its limitations. If the SRAM containing the FIFOs is not big enough, then cells may get dropped during overload situations. In addition, due to FIFO queueing, cells may experience variable latency in traversing the system.
In Section III we provide a memory management algorithm that ensures that the drop probability is minimized given a fixed size SRAM. Using minimal assumptions on arrivals (and departures), we also provide statistical guarantees on drop probabilities and maximum latency. We show that reasonable performance guarantees, i.e., low drop probabilities and low maximum latency, can be provided using small values of speed-up.
III. PROVIDING STATISTICAL GUARANTEES
In this section, we focus on the analysis of the SRAM containing the write FIFOs. As mentioned earlier, the behavior of the read FIFOs is identical to that of the write FIFOs, and so is the analysis.
A. Preliminaries
We assume that $k = b$. This ensures that the SRAM buffer is rate stable since the cumulative service rate (= 1) is greater than or equal to the maximum incoming rate. We then analyze the effects of speedup by looking at incoming rates that are less than unity. For simplicity, we assume time to be a continuous variable; a similar analysis could be carried out in the discrete time domain as well. We denote by $A(t)$ the cumulative number of cells arriving at the SRAM in $[0, t]$. $A(t)$ is assumed to be the sum of $Q$ stationary and ergodic arrival processes $A^i(t)$ corresponding to $Q$ flows, where $i \in [1, Q]$. The $A^i(t)$ are assumed to have rates $\lambda_i$ such that the sum of rates is less than 1 (to ensure rate stability).
We also assume the $A^i(t)$ to be independent of each other. We believe the independence assumption to be a reasonable one.
B. Memory Management Algorithm
The memory management algorithm (MMA) assigns every incoming cell to one of the $b$ write FIFOs uniformly and at random (u.a.r.). We refer to this MMA as the Write At Random MMA (WAR-MMA).
Since each incoming cell is assigned to the write FIFOs u.a.r., the same holds for cells belonging to a flow. The pattern of writes, i.e., the order in which the DRAM memories are accessed, uniquely determines the pattern of reads on a per-flow basis. Thus, on the read side, the per-flow requests would also be distributed u.a.r. across the DRAM memories. The u.a.r. assignment means that each $A^i(t)$ can be written down as the following
$A^i(t) = \sum_{j=1}^{b} A^{i,j}(t)$ (2)
where $A^{i,j}(t)$ denotes the cells of flow $i$ assigned to FIFO $j$.
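A minimal sketch of the WAR-MMA assignment, with an empirical check that each write FIFO receives roughly a $1/b$ share of the cells. The function names, the seeded pseudo-random generator and the cell count are illustrative assumptions; a hardware implementation would use its own random source.

```python
import random
from collections import Counter

def war_mma(num_fifos, rng):
    """Pick a write FIFO uniformly at random; no per-flow state is kept."""
    return rng.randrange(num_fifos)

def fifo_shares(num_cells, num_fifos, seed=0):
    """Fraction of cells landing in each FIFO under WAR-MMA."""
    rng = random.Random(seed)
    counts = Counter(war_mma(num_fifos, rng) for _ in range(num_cells))
    return [counts[j] / num_cells for j in range(num_fifos)]

shares = fifo_shares(num_cells=100_000, num_fifos=10)
print([round(s, 3) for s in shares])
```

Because the choice ignores the flow identity, the same $1/b$ split applies to the cells of every individual flow.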
Thus, we can envision FIFO $j$ as shown in Figure 3. It has $Q$ sources $A^{i,j}(t)$, each with rate $\lambda_i / b$, multiplexing into it, and is work conserving with rate $1/b$, with total load $\lambda$. Rate stability is preserved per FIFO, i.e., $\sum_{i=1}^{Q} \lambda_i / b = \lambda / b < 1/b$.
C. Drop Probabilities
Since the event that the SRAM overflows is equivalent to any of the FIFOs overflowing, the overflow probability in the statically partitioned SRAM of size $S$ is the probability that at least one of the FIFOs, each of size $S/b$, overflows.
To find the drop probability from a finite SRAM of size $S$, we start by assuming that the SRAM is infinite, i.e., that each of the statically partitioned FIFOs is infinite. We then obtain the steady-state probability of the occupancy of an infinite FIFO exceeding $S/b$ (i.e., $P(L^j > S/b)$) as a surrogate for the overflow probability in a finite FIFO of size $S/b$. This surrogate, also referred to as the buffer exceedence probability, suffices since it is generally an upper bound to the actual drop probability, i.e.,
$P(\text{drop from a finite FIFO of size } S/b) \le P(L^j > S/b)$
Here the left hand term corresponds to the drop probability from a finite FIFO of size $S/b$, whereas the right hand term corresponds to the steady-state buffer exceedence probability.
Finally, since the $L^j(t)$ are identically distributed (Lemma 1), we can rewrite Equation (7) as Equation (8). Equation (8) indicates that it suffices to find the steady-state distribution of the buffer exceedence probability for any FIFO.
D. Buffer Exceedence Probability
We start by noting that, using Lindley's recursion, the FIFO occupancy can be written in terms of the arrival processes.
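For reference, one standard continuous-time form of Lindley's recursion for a FIFO drained at constant rate $1/b$ is sketched here. It assumes the FIFO is empty at $t = 0$ and writes $A^j(t)$ for the cumulative arrivals to FIFO $j$ in $[0, t]$; this notation and the exact form are assumptions and may differ from the form the authors use.

```latex
L^j(t) \;=\; \sup_{0 \le s \le t} \left[ A^j(t) - A^j(s) - \frac{t - s}{b} \right],
\qquad L^j(0) = 0 .
```

The occupancy at time $t$ is the largest excess of arrivals over service capacity accumulated over any interval ending at $t$.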
This indicates that, given the steady-state distribution of the arrival processes $A^i(t)$, it is theoretically possible to derive the steady-state distribution of the FIFO occupancy. However, for general arrival patterns it is often not easy to find a solution. The good news is that for a broad range of arrival traffic patterns $A^i(t)$, characterized by the following assumptions, it is possible to do so.
We assume that each $A^i(t)$ is a simple point process satisfying the following properties [4]. First, the expected value of $A^i(t)$ in any interval $[0, t]$ [...] sources increases, the steady-state buffer exceedence probability approaches the corresponding probability assuming Poisson sources, which is known explicitly through the analysis of the resulting M/D/1 system [12].
We then look at the case where the $A^i(t)$ are independent but not necessarily identically distributed. Using a large deviation argument [17], we can show that the buffer exceedence probability in this case is upper bounded by the corresponding probability for the i.i.d. case. Thus, in order to get the drop probability from an SRAM, it suffices to analyze the i.i.d. case.
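The buffer exceedence probability of a single FIFO can also be estimated empirically. The sketch below is a discrete-time approximation and not the analysis used in this section: it assumes Bernoulli arrivals with probability `load / b` per cell-time (a stand-in for the aggregated traffic reaching one FIFO under u.a.r. assignment) and a DRAM that drains one cell every `b` cell-times. The function name and parameter values are illustrative.

```python
import random

def exceedance(load, b, x, cell_times, seed=0):
    """Fraction of cell-times at which one FIFO holds more than x cells."""
    rng = random.Random(seed)
    p = load / b          # arrival probability per cell-time at this FIFO
    occupancy = 0
    above = 0
    for t in range(cell_times):
        if rng.random() < p:
            occupancy += 1            # a cell is assigned to this FIFO
        if t % b == b - 1 and occupancy > 0:
            occupancy -= 1            # the DRAM accepts one cell every b cell-times
        if occupancy > x:
            above += 1
    return above / cell_times

# Same seed, so both estimates are taken along the same sample path.
e1 = exceedance(load=0.9, b=10, x=1, cell_times=200_000)
e5 = exceedance(load=0.9, b=10, x=5, cell_times=200_000)
print(e1, e5)
```

Lowering `load` below 1 plays the role of speedup here, and very small probabilities need far longer runs than this sketch uses.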
IV. IMPLEMENTATION CONSIDERATIONS
WAR-MMA is simple to implement since it only requires randomly spreading the incoming traffic, which requires no state to be kept. In addition, statically allocated SRAM (using circular buffers) and dynamically allocated DRAM (using linked lists) are commonplace. Thus, the interleaved memory architecture is fairly easy to implement. We believe this to be its biggest strength.
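As an illustration of the statically allocated SRAM mentioned above, the following sketch carves one array of cells into equal circular-buffer FIFOs. The class name, the equal split and the drop-on-full policy are assumptions made for the example; it is a software model, not a hardware design.

```python
class PartitionedSram:
    """Equal-sized circular-buffer FIFOs carved statically out of one array."""

    def __init__(self, total_cells, num_fifos):
        self.depth = total_cells // num_fifos        # cells per FIFO
        self.cells = [None] * (self.depth * num_fifos)
        self.head = [0] * num_fifos                  # index of the oldest cell
        self.count = [0] * num_fifos                 # cells currently stored

    def push(self, fifo, cell):
        """Append a cell; return False (drop) if this FIFO is full."""
        if self.count[fifo] == self.depth:
            return False
        slot = (self.head[fifo] + self.count[fifo]) % self.depth
        self.cells[fifo * self.depth + slot] = cell
        self.count[fifo] += 1
        return True

    def pop(self, fifo):
        """Remove and return the oldest cell, or None if the FIFO is empty."""
        if self.count[fifo] == 0:
            return None
        cell = self.cells[fifo * self.depth + self.head[fifo]]
        self.head[fifo] = (self.head[fifo] + 1) % self.depth
        self.count[fifo] -= 1
        return cell

sram = PartitionedSram(total_cells=8, num_fifos=2)
accepted = [sram.push(0, n) for n in range(5)]   # the fifth cell overflows FIFO 0
print(accepted, sram.pop(0))
```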
We now provide a realistic design example. In Table 1 we list values of FIFO sizes (i.e., $x$) needed to guarantee a steady-state [...]
This example validates our claim that a small SRAM (~40K bytes) can support extremely low drop probabilities using a small amount of speedup.
V. CONCLUSIONS
Packet switches, regardless of their architecture, require packet buffers. The general architecture presented and analyzed here can be used to build high bandwidth packet buffers. The scheme uses a number of DRAMs in parallel, all controlled independently in an interleaved manner, as well as SRAM FIFOs to absorb short-term overloads on the DRAM memories.
In this paper, we establish exact bounds relating the SRAM size to the drop probability. In particular, we show that reasonable performance guarantees, i.e., low drop probability, can be provided using small SRAMs as well as a small amount of speed-up.
The costs of the technique are: (1) a (presumably on-chip) SRAM cache that grows in size linearly with line rate, and (2) a memory management algorithm that must be implemented in hardware.
While there are systems for which this technique is inapplicable [...]
*. In practice, we start observing the convergence indicated in Theorem 1 when the number of flows exceeds 100.
0-7803-8924-7/05/$20.00 © 2005 IEEE.
