Routers must do a best matching prefix lookup for every packet; solutions for Gigabit speeds are well known. As Internet link speeds grow higher, we seek a scalable solution whose speed scales with memory speeds while allowing large prefix databases. In this paper we show that providing such a solution requires careful attention to memory allocation and pipelining. This is because fast lookups require on-chip or off-chip SRAM, which is limited by either expense or manufacturing process. We show that doing so while providing guarantees on the number of prefixes supported requires new algorithms and the breaking down of traditional abstraction boundaries between hardware and software. We introduce new problem-specific memory allocators that have provable memory utilization guarantees that can reach 100%; this is in contrast to all standard allocators, which can only guarantee 20% utilization when the requests can come in the range [1 ... 32]. An optimal version of our algorithm requires a new (but feasible) SRAM memory design that allows shifted access in addition to normal word access. Our techniques generalize to other IP lookup schemes and to other state lookups besides prefix lookup.
INTRODUCTION
Internet usage has been expanding both because of a growing number of users and an increasing demand for bandwidth-intensive data. To keep up with increased traffic, the speed of links in the Internet core has been increased to at least 622 Mbps, and vendors are working to build faster routers that can handle Gigabit and now even Terabit links. Thus there is a major need for high performance routers at speeds that keep increasing. OC-192 and possibly OC-768 (40 Gigabit) links will soon reach the marketplace, and router vendors (e.g., Cisco, Juniper, Extreme Networks) are already considering routers that target these higher speed links.
A router that forwards a message has two major tasks: first, looking up the message's destination address and second, internally transferring the message to one of many possible output links. The second task is well understood, with most vendors using fast busses or crossbar switches, and some exciting new switching technologies [MIM97, TCFF97]. In the last three years, several new solutions have appeared to the address lookup problem as well for Gigabit speeds [DBCP97, SV98, WVTP97]. However, our paper will argue that there are some new problems in the areas of memory allocation and pipelining that need to be solved when scaling such lookup schemes to OC-192 and higher speeds.
The design of such a high-end router will typically have the goal of forwarding minimum sized TCP packets (say 40 bytes; fifty percent of Internet packets are this size [TMW97]) at 10 Gbps rates or higher. Given 32 nsec to forward a minimum size packet and the fact that ordinary memory (called DRAM) takes 60-100 nsec to do a single READ, a common approach is to design a custom chip to do IP lookups at each router port (footnote 1). The prefix database can be stored in external SRAM. External SRAM is similar to CPU cache memory, with 10-20 nsec access times. However, speed, pin count limitations, and the need for a smaller number of chips tend to favor the use of on-chip memory for prefix storage. Given that even aggressive manufacturing processes today can guarantee only 16 Mbits of on-chip memory on a custom chip, it appears necessary to use a compressed IP lookup data structure such as a Lulea trie [DBCP97] to store prefixes.
While IP lookup schemes with large update times are feasible, they require the use of two copies of the database: one for the copy being updated and the second for the copy used to forward packets. Since this cuts the utilization of the limited on-chip SRAM by half, it is preferable to have a scheme that allows incremental updates and requires only a small constant increase in memory to handle updates. Incremental updates may also be useful to quickly handle multicast routes, and the most extreme instabilities caused by backbone routing protocols [LMJ97]. Using say a compressed trie, an incremental update can result in adding or deleting a prefix from a trie node, which in turn requires deallocating memory for the existing compressed trie node and allocating memory for another trie node of a slightly different size. If the trie nodes can be any size from (say) 1 to 32 words, such a scheme requires a memory allocator capable of allocating and deallocating chunks of memory in the range [1 ... 32].
Footnote 1: We will argue later that Content Addressable Memories (CAMs) are not adequate for backbone routers.
Unfortunately, standard memory allocators (footnote 2) do not guarantee good worst-case utilization. It is possible for allocates and deallocates to break up memory into a patchwork of holes and allocated blocks. Specifically, the blocks are minimum sized, and the holes are just small enough not to be usable for a maximum size request. Thus only $1/32$ of chip memory can be guaranteed to be used.
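The patchwork described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the adversarial request sequence (fill memory with one-word blocks, then free all but every W-th block) and the names `worst_case_layout` and `largest_hole` are assumptions made for the example, not something taken from the paper.

```python
def worst_case_layout(M, W):
    """Return memory as a list of booleans (True = allocated word).

    Assumed adversarial sequence: allocate M blocks of size 1, then
    deallocate every block except each W-th one.
    """
    memory = [True] * M                 # step 1: fill memory with size-1 blocks
    for addr in range(M):
        if addr % W != W - 1:           # step 2: keep only every W-th block
            memory[addr] = False
    return memory


def largest_hole(memory):
    """Length of the longest run of free words."""
    best = run = 0
    for used in memory:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


M, W = 1024, 32
mem = worst_case_layout(M, W)
utilization = sum(mem) / M
# Every hole has W - 1 words, so a request of size W fails even though
# only 1/W of memory is in use.
print(largest_hole(mem), utilization)
```

Without compaction the allocator has no way to combine the holes, which is why the utilization that can be guaranteed collapses to 1/W.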
If one ignores the allocator, it is easy to show that 16 Mbits of on-chip memory can be used to support (say) 250,000 prefixes even in the worst case. If one takes the allocator into account, the chip can only advertise 7500 prefixes. But Content Addressable Memory (CAM) vendors today are advertising worst-case numbers of 8000-128,000 prefixes (though some of the high end CAMs have not yet appeared in the market) with 15 nsec search times and single cycle updates. CAMs may still be too slow for the fastest backbone links, may have too small a worst-case number of prefixes for a backbone router, and cannot be used to integrate all of IP forwarding on a single chip. But it is important for a chip solution that competes with CAM solutions to have a large worst-case prefix database size it can advertise.
Besides the poor worst-case performance of standard allocators, there is a standard result [Rob71] which states that no allocator (that does not compact memory) can have a utilization ratio better than $1/\log_2 W$ (i.e., 20 percent if $W = 32$), where W is the largest possible allocation request. Since this is still unacceptable, in this paper we consider allocators that do compaction. Compaction refers to the moving of allocated blocks to increase the size of holes.
Compaction has two immediate problems. First, moving a piece of memory M requires correcting all pointers that point to M. Second, the existing literature on compaction is in the context of garbage collection (e.g., [Wil92, Bak78]), and tends to use global compactors that scan through all of memory in one compaction cycle. A plausible solution comes from real-time garbage collectors [Bak78] that interleave compaction with computation. But such collectors often break up memory into two "semi-spaces", and compact one space while allocating from the other. The use of semi-spaces forces derivative schemes to utilizations of 50 percent or less (footnote 3). Thus one of the questions we ask in this paper is whether there exists a local compactor that compacts only a small amount of memory around the region affected by an update.
The answer to this question is yes. Our paper introduces two such local compaction schemes for the first time (to the best of our knowledge) in the literature. Our algorithms do $2kW$ units of compaction work in return for a utilization ratio of $k/(k+1)$, where k is a tunable parameter. By choosing say $k = 9$, a lookup chip needs to do only $18W$ compaction work to obtain 90% utilization. This in turn would allow the chip vendor to guarantee a worst case of 250,000 prefixes for the lookup chip. Since an update must write and read W units in the worst case, this additional compaction work per update is a reasonable cost.
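The trade-off between compaction work and utilization can be tabulated directly from the two expressions quoted in the text, alongside the bound for non-compacting allocators. The sketch below is only arithmetic; the function names `robson_bound` and `local_compaction` are hypothetical labels chosen for this example.

```python
import math


def robson_bound(W):
    """Best utilization any non-compacting allocator can guarantee: 1 / log2(W)."""
    return 1 / math.log2(W)


def local_compaction(k, W):
    """Utilization k/(k+1) and compaction work 2kW per update, as stated in the text."""
    return k / (k + 1), 2 * k * W


W = 32
print("no compaction:", robson_bound(W))
for k in (1, 3, 9, 19):
    utilization, work = local_compaction(k, W)
    print(f"k={k:2d}  utilization={utilization:.2f}  work={work} words")
```

With W = 32, the choice k = 9 gives 90% utilization for 18W = 576 words of compaction work, against 20% for the best non-compacting allocator.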
Footnote 2: [WJNB95] contains an excellent survey of 30 years of research on memory allocators.
Footnote 3: Semi-space and generational garbage collector schemes are discussed in more detail in the Related Work section.
Together with the use of pipelining, our IP lookup scheme scales with memory speeds, allowing the possibility of IP lookup schemes even for fast links without the use of CAM technology. CAMs have historically been slower and less dense than SRAM, and this trend appears likely to continue. Besides IP lookups, we also argue that our approach is beneficial for other state lookup tasks in networking such as bridge lookups, accounting lookups, flow lookups, and filter lookups. To test our ideas we implemented them in the context of a compressed trie IP lookup algorithm (a scheme very similar to one described by [Per99] and different from the Lulea [DBCP97] scheme) and report on our results for prefix utilization using real BGP traces that allow us to compare our problem-specific allocators with the best known general purpose allocators. We note that we can solve the first compaction problem of adjusting affected pointers only in the context of the tree-like data structures used in many networking lookup tasks.
Unfortunately, we show that the standard pattern of access to memory will force any scheme (that reads complete trie nodes) to have a utilization of at most 50% or to take twice the lookup time. This is because if a trie node straddles two memory words, the node lookup will require two memory accesses. By looking "under the covers" of an SRAM memory design, we show that a limited number of shifts can easily be designed (in a custom memory design for a custom chip) with a very small percentage increase in the number of column multiplexors. We show that adding two shifts with an increased memory width can yield 100% memory utilization. We also briefly comment on the interaction of pipelining with memory allocation.
The rest of this paper is organized as follows. In Section 2 we describe some background on IP lookups, a model of a lookup chip and a sample IP lookup algorithm. In Section 3 we describe previous work in IP lookups and memory allocation. In Section 4 we describe some common infrastructure. In Section 5 we describe our simplest memory allocation schemes, based on Lazy Frame Compaction (LFC). In Section 6 we describe our second (and preferred) allocation scheme, based on Segment Hole Compaction (SHC). We describe experiments to compare our IP lookup-specific allocation schemes to a benchmark best-fit allocator in Section 7. We return to IP lookups in Section 8, where we describe the performance of a lookup chip using our allocators, describe the need for a new form of shifted memory, and outline some of the problems with pipelining. We conclude in Section 9.
IP LOOKUPS
To motivate our discussion of issues that arise with doing IP lookups we describe key requirements and constraints (Section 2.1), describe a model of a lookup chip (Section 2.2), and then describe a sample IP lookup scheme (Section 2.3).
Requirements and Constraints
There are three key requirements for lookup schemes: size, speed and dynamism.
Size: Internet backbone routers have large databases (e.g., around 50,000 prefixes [Mer] today and increasing rapidly). After incorporating multicast addresses, multiple hops, host routes and growth, it is not unreasonable for a backbone router [J] to aim to support 150,000 to 250,000 prefixes.
Speed: Vendors are designing products for OC-48 rates (2.5 Gbps) and are looking ahead to OC-768 rates (close to 40 Gbps). Studies [TMW97] show that 50 percent of backbone traffic consists of 40 byte TCP acknowledgements. At OC-192 rates, a 40 byte packet must be processed in 32 nsec to provide wire-speed forwarding. While memory is considered "cheap", this applies only to off-chip DRAM (dynamic RAM) memory and its variants that have large densities but access times of around 60-100 nsec. While a 4 memory access lookup can be pipelined across 4 RAMBUS [IBM97] banks to provide a lookup every 60 nsec, external DRAM technologies cannot provide lookup times under 60 nsec for a single lookup.
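The 32 nsec budget follows from dividing the packet size by the link rate. A minimal sketch of the arithmetic, assuming the nominal rates of 2.5, 10 and 40 Gbps used in the text (actual SONET line rates are slightly lower); `packet_time_ns` is a hypothetical helper name:

```python
def packet_time_ns(packet_bytes, link_gbps):
    """Time on the wire for one packet; bits divided by Gbit/s gives nanoseconds."""
    return packet_bytes * 8 / link_gbps


DRAM_ACCESS_NS = 60   # lower end of the 60-100 nsec range quoted in the text

for name, gbps in (("OC-48", 2.5), ("OC-192", 10), ("OC-768", 40)):
    budget = packet_time_ns(40, gbps)
    fits = DRAM_ACCESS_NS <= budget
    print(f"{name}: {budget} nsec per 40-byte packet, one DRAM read fits: {fits}")
```

At 10 Gbps the budget is shorter than even a single DRAM access, which is the reason the text turns to SRAM.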
To handle 30 nsec lookups, implementors must use either off-chip SRAM (Static RAM, 10-20 nsec), on-chip SRAM (1-5 nsec), or on-chip DRAM (10 nsec). Off-chip SRAM requires extra address and data pins, especially for a pipelined lookup: if the lookup is pipelined using M stages, the lookup chip will require M sets of address pins (18-32 bits each) and M sets of data pins (typically greater than 32 bits each); this quickly becomes infeasible for large M. External SRAM is also around 10 times the price of external DRAM; thus reducing the amount of off-chip SRAM can result in overall cost savings. Embedded DRAM may not be fast enough for the highest speed applications. On the other hand, vendors like Texas Instruments and IBM offer 16 Mbits of on-chip SRAM today using half a (fairly large) die on a custom chip. ASICs offer even less on-chip SRAM. Thus we believe high-speed memory, whether on or off-chip, is worth managing efficiently. For our design center in the rest of the paper, we will use on-chip SRAM of 16 Mbits, though our techniques apply to any situation where memory must be used efficiently.
Dynamism: IP prefixes are updated by the Border Gateway Protocol (BGP) [RL95]. While update times can be two orders of magnitude slower than lookup times, there is still evidence that update times should be reasonably fast. For example, [LMJ97] reports unstable BGP implementations that require updates in the order of milliseconds. Multicast protocols like DVMRP and PIM may also add multicast routes at high rates. Given the use of damping timers to ameliorate BGP instabilities, a more compelling reason to have fast incremental updates is to better utilize high speed memory: a lookup scheme that requires the lookup structure to be completely rebuilt on an update will typically require two copies of the database. One copy is used by Search and one is used by Update; thus the memory utilization drops to no more than 50%.
While this paper concentrates on IP lookups, we note that there are many other forms of lookup that have similar problems and for which the techniques in this paper could be useful. These include exact matching (e.g., for bridges, ARP caches, flow ID lookups) and packet classification using filters. All these applications can potentially have large, dynamically changing databases that need to be looked up at high speeds.
Lookup Chip Model
Based on the arguments above, Figure 1 describes a model of a lookup chip that does search and update. The chip has a Search and an Update process, both of which access a common SRAM memory that is either on or off-chip (or both). The Update Process allows incremental updates and (potentially) does a memory allocation/deallocation and a small amount of local compaction for every update. The actual updates can be done either completely on chip or partially in software (in which case the Update Process on chip is simpler or non-existent). We assume that each access to SRAM is fairly wide, say 1000 bits, as is feasible today using a wide bus. We also assume that the search and update logic can process a large number of bits (say 500) in parallel in one memory cycle time. We assume that Search and Update share access to the common SRAM (that stores the lookup database) using time multiplexing. Thus the Search process is allowed S consecutive accesses to memory and then the Update Process is allowed K accesses to memory. If S is say 20 and K is say 1, this allows Update to periodically steal a cycle from Search while slowing down Search throughput by only a small fraction, and yet allows atomic updates.
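The time-multiplexing rule can be pictured as a repeating schedule of memory cycles. The following is a rough sketch under the assumption that the pattern is strictly periodic (S search accesses, then K update accesses); `schedule` and `search_share` are hypothetical names used only here.

```python
from itertools import cycle, islice


def schedule(S, K):
    """Endless stream of memory-cycle owners: S search slots, then K update slots."""
    return cycle(["search"] * S + ["update"] * K)


def search_share(S, K):
    """Fraction of memory cycles left to Search."""
    return S / (S + K)


print(list(islice(schedule(2, 1), 6)))
print(f"S=20, K=1 leaves {search_share(20, 1):.1%} of cycles to Search")
```

With S = 20 and K = 1, Search gives up one cycle in every 21, which is the "small fraction" of throughput the text refers to.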
The chip has pins to receive inputs for Search (e.g., keys) and Update (e.g., update type, key, result), and can return search outputs (e.g., result). The model can be instantiated for various types of lookups including IP lookups (e.g., 32 bit IP addresses as keys and Next Hops as results), bridge lookups (48 bit MAC addresses as keys and output ports as results), or filter matching [LS98] (e.g., packet headers as keys and matching filters as results).
We assume that each addition or deletion of a key can result in a call to deallocate a block, and to allocate a different size block. A memory allocator is a program that manages a fixed area of physical memory (footnote 4), and handles a stream of allocates and deallocates. We assume that each allocate request can be in any range from 1 to W memory words, and that there is a total of M words that can be allocated. The goal of an allocator is to satisfy as many allocate requests as possible. We will evaluate allocators from a worst-case and average-case viewpoint. Let L denote the sum of the allocated blocks in memory. We wish to know how high L can be with respect to M (ideally they should be the same) in the worst case, and in typical cases.
Footnote 4: In software systems, the allocator often manages virtual memory; adding a level of indirection through a page table is too expensive for our purposes as it can slow down search by a factor of two. Thus our allocators manage physical memory within the SRAM.
As a design center, we will assume that M is 16 Mbits. We will also assume that W is quite small (no more than 512 but more typically around 32 words) in order to allow the hardware to retrieve a complete block in one memory access.
Sample IP Lookup Algorithm
Backbone routers use something akin to telephone area codes to reduce forwarding table size; these "area codes" are known as prefixes. Prefixes are sequences of up to 32 bits. As an example, consider the IP lookup database consisting of the following 9 prefixes: P1 = 101*, P2 = 111*, P3 = 11001*, P5 = 0*, P6 = 1000*, P7 = 100000*, P8 = 100*, and P9 = 110*. A prefix like 100* matches all IP addresses that start with 100. An address that starts with 100000 matches P7, P6, P8, and P0, but P7 is the longest match. The longest matching prefix problem is to return the longest matching prefix of a 32 bit IP address.
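To make the problem statement concrete, the sketch below runs a longest-match query over the eight prefixes as they are listed in the text. It uses a plain linear scan rather than any of the trie schemes discussed in the paper; the name `longest_match` and the dictionary layout are assumptions made for this illustration.

```python
# Prefixes exactly as listed in the text, written as bit strings.
PREFIXES = {
    "P1": "101", "P2": "111", "P3": "11001", "P5": "0",
    "P6": "1000", "P7": "100000", "P8": "100", "P9": "110",
}


def longest_match(address_bits, prefixes=PREFIXES):
    """Return the name of the longest prefix that the address starts with."""
    best_name, best_len = None, -1
    for name, bits in prefixes.items():
        if address_bits.startswith(bits) and len(bits) > best_len:
            best_name, best_len = name, len(bits)
    return best_name


address = "100000" + "0" * 26      # a 32-bit address that starts with 100000
print(longest_match(address))      # P8, P6 and P7 all match; P7 is the longest
```

A linear scan costs one comparison per prefix, which is what the trie structures in the rest of this section are designed to avoid.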
Any IP lookup scheme that allows incremental updates and has good worst-case guarantees on memory suffices for our purpose. Thus our ideas can be instantiated using Lulea compressed tries [DBCP97] or variable stride tries [SV98]. However, the Lulea scheme does not have a fast incremental update scheme and variable stride tries do not have deterministic bounds on memory utilization. While it is possible to modify these existing schemes to make them usable for our purposes, we prefer to illustrate our ideas using a scheme closely related to a scheme due to Perlman and described in her book [Per99]. Perlman's scheme does have worst-case memory guarantees and fast incremental updates.
To understand our version of Perlman's scheme, we start with an expanded multibit trie [Per99, NK98, SV98]. Figure 2 shows a multibit trie for our example database using a trie that examines an address 3 bits at a time. Each array has 8 locations, each of which can contain a pointer to another trie node and a stored prefix. Thus the 100 entry in the root node points to all prefixes that start with 100, and also stores the prefix P8 = 100*. All prefixes are expanded [SV98] to lengths that are multiples of 3. Thus P5 = 0* expands to four 3-bit prefixes 000*, 001*, 010*, and 011*, and is stored in the first four locations of the root array.
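Prefix expansion to a multiple of the stride is mechanical, and a short sketch may help. The helper name `expand` is hypothetical, and the sketch ignores the rule that an expanded entry must not overwrite a longer original prefix that already occupies the same slot.

```python
from itertools import product


def expand(prefix_bits, stride=3):
    """Pad a prefix with every bit combination up to the next multiple of stride."""
    pad = (-len(prefix_bits)) % stride      # bits missing to reach the boundary
    return [prefix_bits + "".join(tail) for tail in product("01", repeat=pad)]


print(expand("0"))       # P5 = 0* becomes four 3-bit entries in the root array
print(expand("100"))     # P8 = 100* is already on a stride boundary
print(expand("1000"))    # P6 = 1000* becomes four 6-bit entries one level down
```

Each expansion can multiply the number of stored entries by up to 2^(stride - 1), which is the memory cost that the compressed trie described next avoids.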
We use a compressed trie data structure that is similar to Perlman's [Per99] but which is very different from the bitmap compressed structure used in the Lulea scheme [DBCP97].
The compressed trie version of the same database (Figure 3) replaces any expanded prefixes by a single prefix, and separates out pointer entries in the bottom portion of the node. The top portion of each node stores prefixes (e.g., P5 in the root) together with the bits (e.g., 0*) that extend the bit path to this node to form the prefix. Prefixes like P8 = 100* that also have an associated pointer are not stored in the root node, but are pushed down to the child node. Thus P8 is stored in the rightmost child node with only the bits *. There is no need for further bits to identify P8 = 100* because the path to the rightmost node uses the bits 100. The bottom portion of each node (e.g., the root has two pointers) contains pointers to other trie nodes, together with the bits that would have identified the location of the pointer in the original expanded node.
[Figure 3 caption, partially recovered: ... Figure 2 by replacing all expanded prefixes within a node with a single prefix entry, and treating pointers separately.]
Figure 3 uses less memory (10 words) than Figure 2 (27 words), but appears hard to search. When indexing into a compressed trie node N with a chunk C of bits, we first find the best matching prefix of C in the set of stored prefixes within N. We also look for any pointer associated with the bits of C in the bottom portion of the compressed trie node. These correspond to accessing a stored prefix and following a pointer in the corresponding expanded node.
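The two per-node operations can be sketched in software as follows. The node contents below are reconstructed from the prose description (Figure 3 itself is not reproduced here), so they are an assumption and may differ in detail from the figure; the names `Node`, `search_node` and `lookup` are likewise hypothetical. A hardware implementation would evaluate all entries of a node in parallel rather than in a loop.

```python
class Node:
    def __init__(self, prefixes, pointers):
        self.prefixes = prefixes   # top portion: list of (remaining bits, prefix name)
        self.pointers = pointers   # bottom portion: full 3-bit chunk -> child Node


def search_node(node, chunk):
    """Best matching stored prefix for the chunk, plus the child pointer (if any)."""
    best = None
    for bits, name in node.prefixes:
        if chunk.startswith(bits) and (best is None or len(bits) > len(best[0])):
            best = (bits, name)
    return (best[1] if best else None), node.pointers.get(chunk)


def lookup(root, address_bits, stride=3):
    """Walk the trie one chunk at a time, remembering the last prefix matched."""
    node, best, pos = root, None, 0
    while node is not None:
        match, child = search_node(node, address_bits[pos:pos + stride])
        if match is not None:
            best = match
        node, pos = child, pos + stride
    return best


# Assumed reconstruction of the example: 5 + 3 + 2 = 10 words in total.
child_100 = Node([("", "P8"), ("0", "P6"), ("000", "P7")], {})
child_110 = Node([("", "P9"), ("01", "P3")], {})
root = Node([("0", "P5"), ("101", "P1"), ("111", "P2")],
            {"100": child_100, "110": child_110})

print(lookup(root, "100000" + "0" * 26))
```

In this reconstruction, the node reached through 100 holds three entries, consistent with the later remark that adding P0 = 10011* grows that node from size 3 to size 4.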
However, these extra operations are not a problem in a hardware implementation that can read the entire compressed node within a single (wide) memory access. The needed operations can be performed using simple combinational logic. This clearly limits the size of each trie node access to be less than 8 bits. Doing trie search 4 bits at a time appears to be slower than the Lulea [DBCP97] scheme that searches in strides of 16, 8 and 8. However, our scheme requires only one memory access per node as opposed to two or three per node in Lulea. Thus the speeds are competitive. Both schemes can be pipelined for more speed.
More importantly, unlike the Lulea scheme, it is easy to do incremental updates on compressed tries. For example, if we add the prefix P0 = 10011* to the example database of Figure 2, this will require modifying the last two locations of the rightmost trie node. Correspondingly, in Figure 3, we only add a single entry (11*, P0) to the rightmost compressed trie node. This requires incremental allocation to deallocate the old trie node (of size 3) and allocate a new node (of size 4).
Equally importantly, unlike the Lulea scheme, the worst-case storage for N prefixes in a compressed trie (after one-way branches have been replaced by text strings) can be shown to be: 2N pointers of say 20 bits each (16 bit pointer plus 4 bits to identify which pointer), 2N text strings of 32 bits each, and N next hop pointers (20 bits each). This works out to a worst-case total of 124 bits per prefix. A careful look at the standard IP databases [Mer], however, shows that 90% of the text strings are 4 bits or less; any (rare) text strings longer than this can be handled using extra nodes. Similarly, the next hop pointer can often be relegated (footnote 5) to a single off-chip SRAM access whose address can be computed from the trie node in which the search terminates. This reduces the total to around 50 bits per prefix in on-chip SRAM.
Thus, with perfect memory allocation, a compressed trie can store 250,000 prefixes using $250{,}000 \times 50$ bits, which is 12.5 Mbits. Storing 12.5 Mbits at 85% storage efficiency requires around 16 Mbits of on-chip memory, which is just feasible today. However, if storage were much more fragmented (say 20%), then the chip would be able to handle only 25,000 prefixes. The bottom line is that since on-chip/off-chip SRAM is limited, it is crucial to use a fast and efficient allocator to minimize total worst-case storage.
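The per-prefix storage figures can be checked with a few lines of arithmetic. The sketch takes the field widths from the text; the function names are hypothetical, and treating the on-chip cost as two pointers plus two 4-bit strings (with the next hop moved off-chip) is an assumption about how the "around 50 bits" figure is reached.

```python
def worst_case_bits_per_prefix(pointer_bits=20, text_bits=32, next_hop_bits=20):
    """2N pointers + 2N text strings + N next hops, expressed per prefix."""
    return 2 * pointer_bits + 2 * text_bits + next_hop_bits


def typical_on_chip_bits_per_prefix(pointer_bits=20, text_bits=4):
    """Assumed typical case: short text strings, next hop kept in off-chip SRAM."""
    return 2 * pointer_bits + 2 * text_bits


print(worst_case_bits_per_prefix())        # 124
print(typical_on_chip_bits_per_prefix())   # 48, rounded to "around 50" in the text
print(250_000 * 50 / 1e6, "Mbits")         # database size with perfect allocation
```

The gap between 12.5 Mbits of data and 16 Mbits of available SRAM is the margin the allocator is allowed to lose to fragmentation.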
PREVIOUS WORK
IP Lookups: Existing fast IP lookup schemes are either based on multibit tries [DBCP97, NK98, GKM98, SV98] or on binary search of hash tables [WVTP97]. Hashing schemes are used by some vendors at Gigabit rates but provide non-deterministic search times and require larger storage [WVTP97] than multibit trie solutions. This makes multibit trie solutions more attractive for schemes that require limited SRAM. CAM solutions are flourishing, especially for edge routers, with some vendors even announcing CAMs with 128,000 prefixes. However, most major backbone vendors (e.g., Cisco, Bay, Ascend, Juniper) we know of use special purpose hardware that implements some algorithmic solution (mostly based on tries), perhaps because they plan for even larger databases and because they prefer to integrate the entire forwarding algorithm into the chip that does lookups.
Caching 32 bit IP addresses has been traditionally considered to have poor hit rates [Par96], but recent interesting work [CP99] has shown the possibility of better hit rates. Despite this, the lack of clear tests on a large number of backbone traffic traces, the lack of determinism, vendor and ISP perception, and the need to receive and process packets at wire speeds while guaranteeing QoS [LS98], have made caching solutions unpopular so far in the backbone. We now discuss previous work in memory allocation.
Schemes that do not Use Compaction: The simplest allocator [SG97] maintains a linked list of allocated blocks and holes. An allocate request is satisfied by scanning the hole list for either the first hole that fits (first fit), the smallest hole that fits (best fit), or the first hole after the last allocated block that fits (next fit). On the other hand, the buddy system [Knu73] maintains holes using separate lists for hole sizes that are powers of two. For our purposes, the major difficulty with such schemes is their poor worst-case fragmentation properties. Figure 4 shows an example. First we allocate all of memory using size 1 requests. [...] Thus we are forced to examine compaction. In the literature [Wil92], memory compaction is mostly associated with garbage collection. There are two major problems with standard real-time copying collectors [Bak78] for our application. First, garbage collectors solve the pointer adjustment problem by leaving a forwarding pointer at a relocated block B that points to the new location of B. The use of forwarding pointers can slow down search by a factor of two (each memory access may go through a forwarding pointer) in the worst case.
Footnote 5: While most previous work in IP lookups assumes a small 20 bit value for next hop, the actual next hop information used in real routers is much larger because of the need to handle load splitting and adjacencies on LANs. Thus it seems best to leave the final next hop lookup to external SRAM.
Even if we adapted copying collectors using our solutions to pointer adjustment (Section 4), a second problem is that the utilization of copying collectors can only approach fifty percent. This is because memory is divided into two halves (semi-spaces) where each space must be large enough to contain the entire data structure [Bak78]. There are also generational garbage collectors [LH83, Wil92] that use multiple spaces, but these only help reduce the average compaction work without improving worst-case fragmentation.
COMMON INFRASTRUCTURE
Since compressed lookup structures require allocation and deallocation of variable amounts of storage, we describe fast memory allocation algorithms that provide guarantees on worst-case memory usage. To do so, we first describe some common infrastructure used by all our allocators.
Locating Holes: First, to identify holes and block boundaries, we use two tag bits per memory word. In our applications, a memory word is > 24 bits; thus tag overhead is < 8%. 00 denotes a free word, 01 denotes a word that is allocated and is the start of a block, and 10 denotes a word that is allocated and is the end of a block.
To handle an allocate request of size i, our allocator will search for the smallest hole of size at least i. (For W
Pointer Adjustment: To improve memory utilization, our schemes compact memory by moving an allocated block B to a new location that starts at say P. Thus all memory locations in the application data structure that point to B (which we will call the parents of B) must be adjusted to point to P before search can proceed. The main problem is to efficiently locate parents without searching all of memory.
Fortunately, many networking data structures (e.g., all forms of tries, binary search trees, hash tables) have the simple property that each block has at most one parent. The simplest way to locate parents is to add a parent field to the start of each block that points to the (single) parent pointer location. This can be updated when the parent changes. If B is moved to B', the parent location is adjusted to point to B'. All children of B (at most W of them) also must change their parent fields to B'. This adds no storage overhead if the parent pointers are kept in a software copy of the data structure and not in on-chip memory. The parent field can be entirely avoided (even in the software copy) if block B stores a unique key that can be searched for to find the parent of B. Since search takes only a few memory accesses, this will not slow down compaction appreciably.
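The bookkeeping for a single-parent structure can be sketched as below. The representation (a dictionary from block address to a record holding a `parent` address and a list of `children` addresses) and the name `move_block` are assumptions made for this illustration; the paper's allocators operate on tagged SRAM words rather than Python objects.

```python
def move_block(memory, old_addr, new_addr):
    """Relocate a block and repair the pointers that refer to it.

    memory maps a block's address to {"parent": addr or None, "children": [addrs]}.
    """
    block = memory.pop(old_addr)
    memory[new_addr] = block

    # Fix the single parent pointer that pointed at the old location.
    if block["parent"] is not None:
        siblings = memory[block["parent"]]["children"]
        siblings[siblings.index(old_addr)] = new_addr

    # Fix the parent field of every child (at most W of them).
    for child_addr in block["children"]:
        memory[child_addr]["parent"] = new_addr


memory = {
    0:  {"parent": None, "children": [10]},
    10: {"parent": 0,    "children": [20]},
    20: {"parent": 10,   "children": []},
}
move_block(memory, 10, 50)
print(memory[0]["children"], memory[20]["parent"])
```

The work per move is one parent update plus one update per child, so it is bounded by the maximum block size W and never requires a scan of all of memory.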
But there are also networking applications such as filter search (e.g., [SVSW98]) where each node can have multiple parents. To allow this, we can allow a block to contain multiple (say up to P) parent pointers. To allow each block to contain exactly as many parent pointers as it needs, the unused tag bit combinations are used to denote a "parent pointer". Thus the first words of a block can have at most P parent pointers. Next, we increase the maximum allocation request from W to W + P. Increasing the maximum allocation size only increases the amount of compaction additively. The overhead for any size P is at most 50%. This is because if the original memory required for the data structure was say L, there can be at most L parents. Thus there can be a total of L parent pointers, leading to a total of at most 2L memory, a 50% overhead.
FRAME BASED ALLOCATORS
In this section, we describe the first of our two new memory allocator designs. We emphasize the following differences from standard memory allocators.
i) Compaction possible: To avoid the Robson bound, our allocators do local compaction. We can do this, unlike say Unix's Malloc, because we have specific applications that use only a finite number of parent pointers.
ii) No garbage collection: Our networking applications explicitly do deallocates. Our allocators compact only to gain worst-case utilization, not to identify garbage.
iii) Finite time: Since our major application is search, we cannot lock out search while compacting all of memory. Similarly, we cannot use forwarding pointers because they can slow search by a factor of two. We also prefer not to waste half the memory as a temporary area to build a compacted database while search works on the other half. Thus our allocators can only afford a small amount of compaction after each update.
iv) Local compaction: Our allocators only compact in the neighborhood of the last update as opposed to using a global compaction sweep. More precisely, a simple measure of locality is the size of the neighborhood (which must include the allocated or deallocated node) which a compaction algorithm reads or writes. If this size is only a constant factor larger than the maximum node size, then we say the algorithm does local compaction. By contrast, real-time garbage collectors that use global sweeps are not local and only approach 50% utilization. Generational collectors are less global but not as local as ours, and do not help worst-case utilization.
We now describe a simple frame based allocator in this section and describe a more sophisticated allocator in the next section.
Frame Based Schemes and Lazy Frame Compaction
To show how simple local compaction schemes can be, we first describe an extremely simple scheme that does minimal compaction and yet achieves 50% worst-case memory utilization. We then extend this to what we call Lazy Frame Compaction (LFC), whose utilization can be tuned to approach 100%.
In Frame Merging, we divide all M words of memory into $M/W$ frames (footnote 6) of size W. Frame Merging seeks to keep the memory utilization at least 50%. Thus all utilized frames should be at least 50% full. If $M/W$ is large, we can relax this a little (and still achieve close to 50% utilization) if we allow one flawed frame that is non-empty but less than 50% full. In summary, Frame Merging maintains the following simple invariant: all but one unfilled frame is at least 50% full. If so, and $M/W$ is much larger than 1, this will yield a guaranteed utilization of almost 50%. The simple trick to maintain the invariant is as follows. Assume there is already a flawed frame F and a new flawed frame F' appears on the scene. Then we merge the contents of F and F' into F, leaving F' empty, maintaining the invariant. This is clearly possible because both frames F and F' were less than half full, and thus can be merged into a single frame by compacting the allocated blocks within F and moving the allocated blocks within F' to the end of the remaining free space in F. An example of merging is depicted in Figure 5. Frame Merging is reminiscent of the merging of B-tree nodes used in B-trees [CLR90], except that Frame Merging abstracts the merging technique for use in a general memory allocation context.
The worst-case utilization of Frame Merging can be improved as follows. We increase the frame size to $kW$, and require that at most one frame has utilization less than $k/(k+1)$, where k is a compaction parameter that can be used to tune the algorithm. Increasing k improves utilization but linearly increases the compaction work.
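The merging step can be sketched as follows for the basic case of frames of size W. This is an illustrative model only: each frame is represented as a list of block sizes (so the blocks inside a frame are implicitly kept compacted), and `is_flawed` and `merge_flawed` are hypothetical names. The sketch also omits the pointer adjustment that moving a block requires.

```python
def is_flawed(frame, W):
    """A flawed frame is non-empty but less than half full."""
    return 0 < sum(frame) < W / 2


def merge_flawed(frames, W):
    """Merge flawed frames pairwise until at most one flawed frame remains."""
    flawed = [f for f in frames if is_flawed(f, W)]
    while len(flawed) >= 2:
        src = flawed.pop()        # F': emptied by the merge
        dst = flawed[0]           # F: receives the blocks of F'
        dst.extend(src)           # fits, since both held fewer than W/2 words
        src.clear()
        if not is_flawed(dst, W):
            flawed.pop(0)         # F is now at least half full
    return frames


W = 32
frames = [[10], [5], [20], [3, 4]]    # block sizes held in each frame
print(merge_flawed(frames, W))
```

In the scheme described in the text the merge is triggered as soon as a second flawed frame appears, so each update moves at most the contents of one frame; the loop above simply applies the same step repeatedly to an arbitrary starting state.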
However, an even simpler scheme is as follows: only compact within frames. If, after compaction, a frame still cannot satisfy further allocate requests, then its hole must be of size at most W - 1. Thus if an allocate request cannot be handled, every frame has a hole of size at most W - 1. Since each frame is of size kW, this leads to a worst-case utilization ratio of at least (k - 1)/k. We can also be lazy and defer compaction until an allocate request cannot be satisfied, by placing any frame that has more than W total free memory words (but less than W contiguous free space) in a collapsible frame list. We call this scheme Lazy Frame Compaction (LFC).
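As a rough sketch of the collapsibility test and the resulting bound, the Python fragment below models a frame as a list of tag bits (True for an allocated word). The function names are hypothetical, and the test uses "at least W free words" rather than "more than W", which is our reading and should be treated as an assumption.

```python
def longest_hole(tags):
    """Length of the longest run of free (False) words in a frame's tag bits."""
    best = run = 0
    for allocated in tags:
        run = 0 if allocated else run + 1
        best = max(best, run)
    return best


def is_collapsible(tags, W):
    """A frame is worth compacting if it has enough free words in total
    (assumed here: at least W) but no contiguous hole of size W."""
    return tags.count(False) >= W and longest_hole(tags) < W


def compact(tags):
    """Pack all allocated words to the front of the frame."""
    used = tags.count(True)
    return [True] * used + [False] * (len(tags) - used)


def lfc_worst_case_utilization(k, W):
    """Utilization when every frame of k*W words is stuck with a hole of W - 1 words."""
    return (k * W - (W - 1)) / (k * W)


T, F = True, False
frame = [T, F, F, T, F, F, T, T]   # k = 2, W = 4: four free words, longest hole is 2
```

In this example the frame is collapsible, and compacting it produces a single hole of size 4. The bound evaluates to 0.625 for k = 2 and W = 4, which is above (k - 1)/k = 0.5.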
SEGMENT-HOLE COMPACTION
Lazy Frame Compaction (LFC) does not compact across frames. If there are two adjacent frames with W - 1 free space in each frame, LFC will give up on both frames as neither is collapsible. However, a hole of size W could be produced if a scheme were to compact across frames. Thus, while LFC guarantees a good worst-case utilization, we might suspect that its average memory utilization may be only slightly better than its worst-case utilization. Do we care about the average memory utilization? If we recall the motivation and model of Section 2.2, we would like our lookup chip to guarantee a large number of keys in the worst case. However, it would be nice to also be able to handle an even larger number of keys in the "typical case".
Let us call each maximal sequence of contiguously allocated blocks a segment. In other words, a segment consists of a number of blocks packed together without any intervening holes, and which cannot be extended to form a larger segment. Intuitively, for any value of the parameter k, we would like any segment to be at least kW in size (this would rule out examples like Figure 4). We can relax this condition a little and require the following Either-Or invariant: either each segment is at least kW in size, or the hole immediately following the segment is at least W in size. The only exception we make is for the last segment, which may not have a following hole.
The Either-Or invariant still guarantees a good worst-case utilization of k/(k+1). Suppose an allocate request cannot be satisfied. Thus all holes must be of size W - 1 or less. But in that case, all segments preceding these holes (with the pesky exception of a hole at the start of memory) must be of size at least kW, where k is once again a compaction parameter. Thus memory is a patchwork of segments of size at least kW interleaved with holes of size at most W - 1, and the utilization is very nearly k/(k+1). Figure 6 shows a simple example of a state of memory satisfying this invariant. The first hole is of arbitrary size; the first segment is of size > kW and the second hole is of size > W (although this is not needed, it is not prohibited). The second segment is of size > kW, which allows the third hole to be of size < W. The third segment is of size < kW, but the fourth hole is fortunately > W (W would have sufficed).
The figure also shows that we do not place any restriction on the last segment. If M is much larger than kW, the arbitrary size of the first hole and the last segment contribute edge effects that only marginally affect worst-case utilization.
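The Either-Or invariant can be checked directly from the tag bits. The following Python sketch is only an illustration: `runs` and `flawed_segments` are hypothetical helper names, memory is modeled as a list of booleans (True for an allocated word), and the example layout is made up in the spirit of Figure 6 rather than copied from it.

```python
from itertools import groupby


def runs(tags):
    """Collapse tag bits into (allocated, length) runs: segments and holes."""
    return [(alloc, len(list(group))) for alloc, group in groupby(tags)]


def flawed_segments(tags, k, W):
    """Indices (into runs(tags)) of segments that violate the Either-Or
    invariant: shorter than k*W AND followed by a hole shorter than W.
    The last segment is exempt."""
    r = runs(tags)
    seg_idx = [i for i, (alloc, _) in enumerate(r) if alloc]
    bad = []
    for i in seg_idx[:-1]:
        seg_len = r[i][1]
        hole_len = r[i + 1][1]   # a non-last segment is always followed by a hole
        if seg_len < k * W and hole_len < W:
            bad.append(i)
    return bad


T, F = True, False
k, W = 2, 4
# hole 3, segment 9, hole 5, segment 8, hole 2, segment 3, hole 4, segment 2
memory = [F]*3 + [T]*9 + [F]*5 + [T]*8 + [F]*2 + [T]*3 + [F]*4 + [T]*2

# Deallocate two words in the middle of the second segment (words 20 and 21).
after_free = list(memory)
after_free[20:22] = [F, F]
```

The original layout satisfies the invariant. After the deallocate, the second segment splits into two pieces of three words each, both followed by holes of two words, so both pieces are reported as flawed.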
[Figure 6 caption, beginning lost in extraction: Note that a segment can contain several blocks all packed together without any holes in between. The invariant ensures that either a segment is "large" or is followed by a large enough hole.]
Allocates can affect the invariant by allocating within a hole that was at least W and making the hole < W while its preceding segment is < kW. Deallocates can affect the invariant by deallocating within a segment that was at least kW, splitting the segment into pieces that do not satisfy the invariant. While the exact algorithm to maintain the invariant has a number of cases, the essential intuition can be had by looking at the example in Figure 7. The top of Figure 7 represents the same example as in Figure 6 after a deallocate has been done in the middle of Segment 2. This splits Segment 2 into two small segments, say Segment 2a and 2b, that are both of size < kW. Segment 2a could now violate the invariant if the hole between 2a and 2b is < W. But Segment 2b can also violate the invariant because the hole following the old Segment 2 (see Figure 6) was < W. The easiest way to restore the invariant is to slide the concatenation of the two "flawed" segments 2a and 2b to the end of the hole before the old Segment 3. This results in the picture shown at the bottom of the figure. Since there are now only three segments, we have renumbered the segments.
Thus the basic intuition is as follows: when a segment becomes "flawed", we simply slide the segment down to merge with the next segment to the right. It should now be clear why we do not require the last segment to be at least kW.
Finding whether a segment is flawed requires an examination of only kW tag bits (to check if the segment is of size < kW) and a further W bits (to check if the following hole is of size < W). The actual sliding process costs at most kW. It is also easy to see that deallocates can cause at most two segments to become flawed, and allocates can cause at most one segment to become flawed. The worst-case work is (2k + 1)W.
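The sliding idea can be sketched in Python as a naive repair loop over the tag bits. This is not the exact case analysis referred to above: `runs` and `repair` are hypothetical names, the loop rescans all of memory instead of looking only near the last update, and it slides one flawed segment at a time, so it may move more words than the bound quoted in the text. Pointer updates for moved blocks are omitted.

```python
from itertools import groupby


def runs(tags):
    """Collapse tag bits into [allocated, length] runs: segments and holes."""
    return [[alloc, len(list(group))] for alloc, group in groupby(tags)]


def repair(tags, k, W):
    """Slide flawed segments right until the Either-Or invariant holds.
    Returns (new_tags, words_moved)."""
    tags = list(tags)
    moved = 0
    while True:
        r = runs(tags)
        segs = [i for i, (alloc, _) in enumerate(r) if alloc]
        flawed = [i for i in segs[:-1] if r[i][1] < k * W and r[i + 1][1] < W]
        if not flawed:
            return tags, moved
        i = flawed[0]                                  # leftmost flawed segment
        start = sum(length for _, length in r[:i])     # address of its first word
        seg_len, hole_len = r[i][1], r[i + 1][1]
        # Swap the segment with the hole that follows it, so that the segment
        # merges with the next segment to the right.
        tags[start:start + seg_len + hole_len] = [False] * hole_len + [True] * seg_len
        moved += seg_len


T, F = True, False
k, W = 2, 4
# Segment "2a" (3 words) and "2b" (3 words) are both flawed in this layout.
memory = ([F]*3 + [T]*9 + [F]*5 + [T]*3 + [F]*2 + [T]*3 + [F]*2
          + [T]*3 + [F]*4 + [T]*2)
fixed, words_moved = repair(memory, k, W)
```

Each slide merges two segments, so the loop terminates. In this example the two flawed pieces end up merged with the following segment into one segment of nine words, and the hole to their left grows to nine words.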
EXPERIMENTAL RESULTS
We already know the worst-case performance of our allocators. However, the theoretical analysis does not provide answers to the following questions.
Q1. What is the average-case performance of the two allocators? How much compaction do they do on average, and (more importantly) what is their average memory utilization? If the average memory utilization is 100% and worst-case utilization is (say) 85%, that implies a 15% increase in the number of prefixes (keys) that can be handled by the lookup chip. Also, what advantages does the slightly more complex Segment-Hole compaction have over Frame Based compaction? What value of the compaction parameter k should we choose?
Q2. How much better do our allocators do than the best standard allocators on an actual application? A recent paper [J97] shows that good allocators (e.g., address-ordered best-fit) do quite well on actual memory traces. However, these studies apply only to standard benchmark processor applications and not to router lookup schemes. Although our schemes do offer a large worst-case improvement, it would be nice to also know whether they improve average performance compared to standard allocators.
To answer these questions, we implemented the Lazy Frame Compactor (LFC) and the Segment-Hole Compactor (SHC) and tested their average performance using our sample IP lookup algorithm and BGP traces (which cause prefix deletions and additions, which in turn lead to allocates and deallocates). We also implemented a benchmark best-fit allocator that never compacts memory. To go beyond IP lookups, we also implemented a simple hashing lookup application that uses our allocators and could be used in (say) an ARP cache. We now provide more details.
Benchmark Allocator: The benchmark allocator we implemented uses exactly the same infrastructure described in Section 4 to keep track of holes (a list for each size < W and a single list for holes of size at least W) and uses tag bits to keep track of allocated words in memory. Given an allocate request of size n, as in Segment-Hole compaction, the benchmark allocator finds the smallest size hole that satisfies the request. It deletes the first hole on this list, allocates n memory words from this hole, and returns the leftover hole to the appropriate list. Given a deallocate request, the allocator resets the tag bits for these memory words. It then coalesces the newly created hole with any adjacent holes, and inserts the resulting hole into the appropriate list. Unlike our other allocators, the benchmark allocator never compacts memory. Thus it represents the class of best-fit allocators.
Best-fit allocators are known to offer good memory usage and cause the least fragmentation among conventional allocators [J97], and thus provide a good point of comparison. Address-ordered best-fit actually does even better [J97], but address ordering would require sorting the hole lists, which would slow down allocates and deallocates (and hence inserts and deletes) considerably. Recall that the worst case for all conventional allocators, including this benchmark allocator, does not meet our performance requirements. However, it is useful to compare average performance.
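A benchmark allocator of this kind might look roughly like the Python sketch below. It is a minimal model, not our implementation: the class name `BestFit` and its methods are hypothetical, requests are assumed to be between 1 and W words, and coalescing scans tag bits outward from the freed block without any bound on the scan length.

```python
class BestFit:
    """Non-compacting best-fit allocator over M words, for requests of 1..W words."""

    def __init__(self, M, W):
        self.M, self.W = M, W
        self.tags = [False] * M                          # True = allocated word
        # One list per hole size below W; holes[W] holds every hole of size >= W.
        self.holes = {s: [] for s in range(1, W + 1)}
        self._add_hole(0, M)

    def _bucket(self, size):
        return min(size, self.W)

    def _add_hole(self, start, size):
        if size > 0:
            self.holes[self._bucket(size)].append((start, size))

    def _remove_hole(self, start, size):
        self.holes[self._bucket(size)].remove((start, size))

    def allocate(self, n):
        for s in range(n, self.W + 1):                   # smallest size class that fits
            if self.holes[s]:
                start, size = self.holes[s].pop(0)       # first hole on that list
                self.tags[start:start + n] = [True] * n
                self._add_hole(start + n, size - n)      # return the leftover hole
                return start
        return None                                      # request cannot be satisfied

    def deallocate(self, start, n):
        self.tags[start:start + n] = [False] * n
        lo, hi = start, start + n
        left = lo
        while left > 0 and not self.tags[left - 1]:      # free words to the left
            left -= 1
        right = hi
        while right < self.M and not self.tags[right]:   # free words to the right
            right += 1
        if left < lo:
            self._remove_hole(left, lo - left)           # absorb the left neighbour
        if right > hi:
            self._remove_hole(hi, right - hi)            # absorb the right neighbour
        self._add_hole(left, right - left)


a = BestFit(16, 4)
p0, p1, p2 = a.allocate(3), a.allocate(2), a.allocate(4)
a.deallocate(p1, 2)
p3 = a.allocate(1)        # best fit: reuses the 2-word hole, not the large one
```

The last call shows the best-fit behaviour: the one-word request is served from the two-word hole left by the deallocate rather than from the large hole at the end of memory.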
Lookup Data: For both the IP lookup and hashing applications, we used data from the Mae-East database [Mer] on 1/20/99 at 10:19:10 hrs. The database had 42732 IP prefixes. For the IP lookup application, we used a BGP trace obtained using the Route Tracker tool [Mer2]. Since the tool does not let the user save updates to a file, we used the system call trace facility ktrace on a NetBSD machine. The resulting trace had duplicate insertions and deletions, presumably due to repeated BGP updates or to system call behavior. We eliminated these duplicate prefix updates by not allowing a prefix already present to be inserted again, and not allowing an absent prefix to be deleted from the data structure. We extracted this data for 05/20/99, 00:00:00 hrs to 08:30:00 hrs.
Experimental Metrics: Average performance includes two metrics, memory utilization and compaction. First, consider average memory utilization. Computing the average memory needed is not trivial because (especially for the trace experiments) the size grows and falls as prefixes are deleted and added. We wish to find the minimum amount of memory M needed to satisfy a run of an allocator on the trace. We know that M must be at least the peak size of the actual data structure (say M_l). It could be larger because the allocators are not perfect, but it cannot be larger than the memory implied by worst-case memory utilization (say M_h). We then do binary search between M_l and M_h by doing repeated runs of the same trace until we find the smallest value M of memory between M_l and M_h such that all allocates are satisfied. We compute the average memory utilization as M_l/M (because the allocator required M words of memory when a perfect allocator with 100% utilization would require only M_l words).
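The measurement procedure amounts to a standard binary search, sketched below in Python. The names `min_memory`, `average_utilization`, and `run_trace` are hypothetical; `run_trace(M)` stands for replaying the whole trace against an allocator with M words and reporting whether every allocate succeeded. The sketch assumes that success is monotone in M and that the run with M_h words succeeds.

```python
def min_memory(run_trace, m_low, m_high):
    """Smallest M in [m_low, m_high] for which run_trace(M) returns True."""
    lo, hi = m_low, m_high
    while lo < hi:
        mid = (lo + hi) // 2
        if run_trace(mid):
            hi = mid          # mid words are enough; try less
        else:
            lo = mid + 1      # some allocate failed; need more
    return lo


def average_utilization(run_trace, m_low, m_high):
    """M_l / M, where M is the smallest memory size that satisfies the trace."""
    return m_low / min_memory(run_trace, m_low, m_high)


# Stand-in for a real trace replay: pretend the allocator needs 1200 words
# for a data structure whose peak size is 1000 words.
fake_trace = lambda M: M >= 1200
M = min_memory(fake_trace, 1000, 1500)
util = average_utilization(fake_trace, 1000, 1500)
```

With the stand-in trace the search returns 1200 words, giving a utilization of 1000/1200, or about 83%.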
Although it is of secondary importance, we also measure compaction. If compaction is done by a software process other than the lookup chip, decreasing compaction work allows the route processor to have more time for more directly useful functions such as fast route updates. We measure compaction work by the number of words of memory written or read during any update operation. We measure the average compaction as the total compaction work divided by the number of updates processed.
Hash Application: Besides IP lookup, we used a hash lookup application that provides deterministic hashing. If the maximum number of keys that collide in a hash bucket is W, a W size block containing all the keys that hash into a bucket can be retrieved in a single memory access. For example, if there are at most 4 collisions (Figure 8), we can allocate the four keys in a single size 4 block. The naive method would require allocating a size 4 block for each non-empty bucket. Clearly, this can be avoided using dynamic memory allocation. If only a few buckets have 4 collisions and the other buckets have only 1 collision, the naive scheme will require close to 4 times the amount of storage.
[Figure 8: To improve speed in terms of memory access, multiple entries that hash together in a bucket can be placed in a contiguous block of memory (middle). Dynamic allocation can be used to avoid allocating every block to be the size required to handle the worst case number of collisions (right).]
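The storage argument for the hash application can be made concrete with a small Python calculation. The function `storage_words` and the bucket counts are made up for illustration, and each key is assumed to occupy one word.

```python
def storage_words(bucket_counts, worst_case):
    """Words needed by the naive scheme (a worst_case-sized block per
    non-empty bucket) and by dynamic allocation (a block sized to each bucket)."""
    naive = worst_case * sum(1 for c in bucket_counts if c > 0)
    dynamic = sum(bucket_counts)
    return naive, dynamic


# One bucket with 4 colliding keys, nine buckets with a single key, two empty.
counts = [4] + [1] * 9 + [0, 0]
naive, dynamic = storage_words(counts, worst_case=4)
```

Here the naive scheme needs 40 words against 13 for dynamic allocation; the ratio approaches 4 as single-key buckets come to dominate.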
Results
For lack of space we only provide a few sample curves and the highlights of our experimental results. To test compressed tries, we built two tries. Figure 9 and Figure 10 show the memory utilization (see footnote 7) for Mae-East with the two tries. To plot multiple points, we calculate the memory utilization after inserting the first 2000 prefixes, the next 2000 prefixes, etc., until all 42732 prefixes are inserted. We plot the memory utilization for all three allocators. Notice from Figure 9 and Figure 10 that all allocators do significantly better than the worst case of 66%. The benchmark allocator has a low of just below 95% for the 8/8/8/8 trie but falls to below 80% for the 12/4/4/4/4/4 trie. LFC is almost consistently worse than both other allocators (although it is significantly better than the worst case predicted by the compaction parameter k), falling to below 80% after the 8/8/8/8 trie is completely built. On the other hand, SHC gets consistently over 95%. It is easy to increase the utilization of LFC by increasing the compaction parameter k to, say, 8. For k = 8, using the same database, the utilization of the benchmark remains constant (we have omitted the graphs for lack of space) but LFC does better than the benchmark (over 90% for both tries). For k = 8, SHC does even better (over 98%).
Footnote 7: Recall this is the ratio of the minimum possible memory required to hold the data structure (ignoring the allocator) at any point to the actual minimum memory (taking into account the allocator). The latter memory is calculated by binary search to find the smallest memory size at which no allocate request fails.
In general, in all the experiments we performed, SHC does consistently better than both the benchmark and LFC allocators, achieving consistently over 95% even for low values of k. It is also interesting to compare the compaction work done by LFC and SHC. For the 8/8/8/8 trie described above and for k = 2, the worst-case compaction per update is 2kW, which was equal to 988 words. However, the average number of words compacted per update by LFC was only 0.18, while the average for SHC was 25.10. Moving to k = 8, the worst-case compaction goes up to 2964 words, but the average number of words compacted per update by LFC was only 0.5 while the average for SHC goes up to 94. Thus the average number of words compacted by both SHC and LFC is significantly less than the worst case, but LFC's compaction work is much less than SHC's.
The Mae-East table building experiment is clearly a special case that can happen when, say, a router first boots up. Table building provides very stylized allocates and deallocates, with the deallocates closely following the allocates in a regular pattern as prefix insertions cause nodes to expand. Thus we should also examine more arbitrary inserts and deletes that occur as routes go up and down in a backbone router.
Thus for the second set of experiments, we used a BGP trace (see description earlier) to construct each of the two tries. We only show results for the 12/4/4/4/4/4 trie. The memory utilization results are shown in Figure 11. In this figure, we plot the utilization at 5 points, one for every 1/5th of the trace. When the x-axis says number of prefixes, it means the number of prefixes processed so far (added or deleted) in the trace. The graph is similar to Figure 10 except that Figure 10 uses a static database while Figure 11 uses a dynamic database. Note also that in this dynamic graph the total memory used (not shown) grows and shrinks as prefixes are added and deleted. The most interesting observation from Figure 11 is that it is fairly consistent with the other graphs, with SHC doing close to 100% while the benchmark allocator falls at some point to nearly 75%. Once again, LFC is sometimes better and sometimes worse than the benchmark, but all allocators are significantly better than the 66% worst case predicted by the compaction parameter. The results for compaction were also similar, with LFC compacting 0.6 words per update on average, while SHC compacted 21 words per update on average for k = 2. Other experiments provided similar results.
So far we have only considered IP lookup applications; it is natural to ask whether our results would change for other applications. Thus we also tested the hash function application using IP addresses. For simplicity, we obtained these IP addresses from the Mae-East snapshot by padding prefixes with 0's. The hash function chosen uses an array size of 8192. The memory utilization is shown in Figure 12 for compaction parameter k = 2. The results are qualitatively similar, but the smaller variability in node sizes seems to make all the allocators behave very well.
Based on these results we offer preliminary answers to the two questions we raised earlier.
A1. The average-case memory utilization of SHC was always over 95% for every experiment we conducted, even for very low values of the compaction parameter. Thus for an IP lookup application, one can use a low value of k (say 4) that guarantees 80% utilization but for which we can expect an SHC allocator to allow 15% more prefixes (than the worst case would predict). LFC, on the other hand, does worse than the benchmark for low values of k. LFC does better for higher values of k but is always inferior to SHC. The average compaction work performed by SHC is an order of magnitude less than the worst-case compaction work but is still worse than LFC. However, since the average compaction work is small compared to the work required anyway (by the route update process) to write the fields in the trie nodes, this seems unimportant. Thus, despite its slight increase in complexity, we believe SHC is a better allocator. This is possibly because it can compact across frame boundaries and so produces fewer but larger holes. Only if compaction work were a major factor would LFC (with a high value of k) be preferred to SHC.
A2. The benchmark allocator does very well on our experiments, achieving over 75%. We believe our schemes are better than the benchmark allocator for three reasons. First, our allocators provide a guaranteed level of performance, which is important when comparisons are made to technologies like CAMs that provide tight guarantees. We also only used a single BGP trace; we leave to future work the task of checking whether the benchmark can do worse on other traces. Second, our SHC allocator does better on average (over 95% even for low values of k) than the benchmark (as low as 75% in our trace study). Third, our allocators can be tuned by adjusting k if the observed utilization is found to be inadequate, unlike the benchmark.
BACK TO IP LOOKUPS
Since our original focus was IP lookups, we revisit the implications of the results of the last few sections for a lookup chip. Suppose the lookup chip uses 16 Mbits of on-chip SRAM with a 4 nsec access time (easily feasible today for a custom chip). Then the 6-level 12/4/4/4/4/4 trie will require 6 memory accesses (of width less than 500 bits), which yields a 24 nsec lookup. This allows wire speed forwarding at OC-192 rates. Using the number we computed earlier of 50 bits per prefix for our sample IP lookup scheme, 250,000 prefixes require at least 12.5 Mbits just to store the data structure (see footnote 8).
Footnote 8: We assume the parent pointers used for compaction are stored off-chip. For example, if update is done entirely in software, then the route processor could keep parent pointers in its copy of the trie.
Using a compaction parameter of k = 5 yields a worst-case memory utilization of k/(k + 1) = 83.3%. Assuming a 5% overhead for tag bits (recall we needed these in Section 4 for finding holes), we get an overall utilization of 83.333% × 0.95, which is 79%. Since 79% of 16 Mbits (about 12.6 Mbits) is larger than 12.5 Mbits, we can expect to fit 250,000 prefixes in this memory. Our average-case experiments for the same value of k = 5 using SHC show that the actual memory utilization is closer to 98%. Thus we can expect to store 15% more prefixes in practice, which would allow 287,500 prefixes. This provides an extra cushion for unexpected growth, or allows the use of on-chip SRAM for other purposes such as filters or accounting.
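The budget arithmetic of this section can be replayed in a few lines of Python. The variable and function names are ours, and 1 Mbit is taken as 10^6 bits, which is an assumption.

```python
def guaranteed_fraction(k, tag_overhead):
    """Worst-case fraction of SRAM holding useful data: k/(k+1), less tag bits."""
    return k / (k + 1) * (1 - tag_overhead)


PREFIXES = 250_000
BITS_PER_PREFIX = 50
SRAM_BITS = 16_000_000        # 16 Mbits, assuming 1 Mbit = 10**6 bits

needed_bits = PREFIXES * BITS_PER_PREFIX                  # 12.5 Mbits
usable_bits = guaranteed_fraction(5, 0.05) * SRAM_BITS    # about 12.67 Mbits
fits = usable_bits >= needed_bits                         # guaranteed to fit

lookup_ns = 6 * 4                           # 6 trie levels, 4 nsec per access
typical_prefixes = PREFIXES * 115 // 100    # 15% more prefixes in the typical case
```

The guaranteed usable memory (about 12.67 Mbits) exceeds the 12.5 Mbits needed, so the 250,000 prefix target holds even in the worst case.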
For update times, assume the update process gets a memory access every 20 memory accesses, and each memory access is W bytes wide. Counting the worst-case sequence, including the time to locate parent nodes, we estimate 4520 memory accesses for an update. Incorporating the slow access for update and using a memory access time of 4 nsec, we get a worst-case update time of 3.164 msec. This seems more than adequate at present, and the average time is much better. However, two problems remain, which we now discuss.
The Memory Access Problem
The reader may have noticed the following problem. Our sample IP lookup scheme requires a W-bit word memory access, but our allocators do not guarantee to lay out trie nodes within W-bit word boundaries. For example, in Figure 13, we see a node that straddles two W-bit boundaries. It is easy to show that any allocator that does not allow nodes to straddle W-bit boundaries can guarantee only a little over 50% utilization (consider a series of allocate requests of size W/2 + 1). But for conventional memories, if a node straddles two W-bit boundaries, the only way to access the entire node (needed for compressed tries) is to make two memory accesses. This would slow down lookups by a factor of two! The simplest way to avoid this problem is to use a scheme like [SV98] or a version of [DBCP97] (suitably modified), which also use variable size trie nodes but only require a 1-word memory access for lookup. However, there is also a simple trick that avoids this dilemma using a new memory design that allows shifted access. In Figure 13, for example, we show a shifted access that allows a W-bit READ to start at any position kW/2 for any k. In other words, we can read from bit position 0 to W as usual, but we can also read from W/2 to 3W/2. Since on-chip SRAM in a custom chip is generally custom designed, it is possible to design a new form of SRAM memory that allows shifted access. First, most memory is internally designed in terms of large rows which are first decoded (to select the row) and then the appropriate bits within the row are chosen. Allowing nodes to straddle rows seems to be very hard, and is fortunately not needed because typical row sizes are larger than the kW needed for high utilization (see footnote 9). However, allowing a simple W/2 shift within a row is easy. This is because each bit in a row can only go to two output positions.
For example, bit W/2 in Figure 13 can either go to output bit position W/2 (bit W/2 in normal access) or to output position 0 (first bit in shifted access). This only requires a small change in the column multiplexors; memory designers we have consulted estimate that such extra column multiplexor wires will add only about 5% extra logic.
Finally, it is easy to see that using a memory access of size 2W (twice the size of a node) plus one shifted access allows any node to be read out in one memory access. Using shifted access reduces memory density by, say, 5%; using twice the access width required only means that we must examine one less bit at each node. For example, if a 1000-bit access allowed 5-bit access at each node, we may have to settle for 4-bit access instead. This will increase overall lookup time by only 1 memory access for typical cases, and is a good tradeoff.
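The claim that a 2W-wide access with a half-width shift covers any node of size at most W can be checked exhaustively for a small W. The Python sketch below uses hypothetical function names, measures positions in bits from the start of a row, and ignores row boundaries.

```python
def access_start(node_start, access_width):
    """Start of the read used for a node, when reads may begin at any
    multiple of access_width // 2 (normal or shifted access)."""
    half = access_width // 2
    return (node_start // half) * half


def covered(node_start, node_len, access_width):
    """True if a single (possibly shifted) read returns the entire node."""
    s = access_start(node_start, access_width)
    return s <= node_start and node_start + node_len <= s + access_width


W = 8
# Access width 2W with a shift of W: every node of size 1..W, at any position.
all_ok = all(covered(start, size, 2 * W)
             for start in range(64) for size in range(1, W + 1))

# Access width W with a shift of W/2 is not enough for a full-size node.
straddler_missed = not covered(7, 8, W)
```

The reason is that the chosen read starts at most W - 1 bits before the node, so a node of at most W bits ends within 2W bits of the read start.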
The Pipelining Problem
A 12/4/4/4/4/4 compressed trie can do an IP lookup in 6 memory accesses, which is around 24 nsec using 4 nsec SRAM. To obtain a faster lookup that scales with memory speeds, we need to pipeline the trie to obtain a lookup every memory access (4 nsec). There are several new problems created by pipelining. We briefly describe one problem: the interaction of pipelining with memory allocation.
The simplest scheme is pipelining by height. The pipeline would have S stages, where S is the tree height; each stage has some logic and some SRAM memory to store trie nodes. The root of the trie is assigned to stage 1 and all children of the root are in stage 2; in general, nodes of height i are assigned to stage i. An address to be looked up first flows through stage 1, which passes a pointer to stage 2 along with the address. While the address is in stage 2, a second address can enter stage 1, creating a pipeline.
Unfortunately, each of the stage memories must be strictly partitioned so that each stage only accesses its own memory. This is because of the difficulty in building large multiport memories (see footnote 10). It appears that we must statically divide the available on-chip memory (say 16 Mbits) among the S stages.
Footnote 9: One would have to modify the SHC scheme to only maintain the invariant within each row of memory.
Footnote 10: Register files in CPUs are multiport memories but are much smaller in size than the amount of SRAM memory (16 Mbits) we need.
Unfortunately, with height pipelining, the amount of memory allocated to a stage can vary drastically depending on the keys entered into the trie. This is because the trie is not a balanced tree whose memory needs at each height are predictable. With the exception of Stage 1, which contains the root, for any i one can find a set of prefixes where most of the trie nodes are at height i. In other words, suppose the non-pipelined trie requires M memory in the worst case; then for every i ≠ 1, there is some set of prefixes where Stage i of the height pipelined trie requires M memory. Since we are forced to do static memory partitioning, this condemns us to allocate M memory for almost every stage i. However, this is clearly wasteful because the total amount of memory required across all stages for any database is only M. With S stages, this would be a factor S waste of memory. We leave solutions for this problem to future work.
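The skew across stages can be illustrated by counting trie nodes per height for two made-up prefix sets. The Python sketch below is a simplification: `nodes_per_stage` is a hypothetical helper, node counts stand in for memory (the variable node sizes of a compressed trie are ignored), and the prefix sets are invented for illustration.

```python
STRIDES = [12, 4, 4, 4, 4, 4]   # the 12/4/4/4/4/4 trie used in the text


def nodes_per_stage(prefixes, strides=STRIDES):
    """prefixes: iterable of (value, length), value being a 32-bit integer.
    Returns the number of distinct trie nodes at each height (stage)."""
    stages = [set() for _ in strides]
    for value, length in prefixes:
        consumed = 0
        for stage, stride in enumerate(strides):
            # A node is identified by the address bits consumed before reaching it.
            stages[stage].add(value >> (32 - consumed))
            consumed += stride
            if length <= consumed:   # the prefix is stored in this node
                break
    return [len(s) for s in stages]


# 256 prefixes of length 16 whose first 12 bits all differ.
wide = [(i << 20, 16) for i in range(256)]
# 256 prefixes of length 32 sharing their first 24 bits.
deep = [((0x0A0000 << 8) | i, 32) for i in range(256)]

wide_counts = nodes_per_stage(wide)
deep_counts = nodes_per_stage(deep)
```

The first set puts all but the root node in stage 2, while the second set puts most of its nodes in stage 6, so a static split that suits one database is a poor fit for the other.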
CONCLUSIONS
This paper describes an IP lookup scheme that can scale with memory access speeds while allowing tight guarantees on the number of prefixes that can be supported and providing fast update times. To do so, we introduced a set of locally compacting allocators that can be tuned to obtain close to 100% utilization of the limited on-chip SRAM needed for high speed lookups. Such compacting allocators are, we believe, a crucial component of any IP forwarding chip that supports close to 250,000 prefixes and has fast update times. Such performance figures are beyond the reach of today's CAM technology. Without using compacting allocators, an IP forwarding chip must either use two copies of the database (in which case it can only guarantee half the number of prefixes and has update times on the order of seconds) or use an incremental scheme with a conventional allocator (in which case the guaranteed number of prefixes goes down by an order of magnitude).
Our locally compacting allocators only compact in the vicinity of the last update. We have shown that one of our allocators, the Segment-Hole allocator, does much better on average than a benchmark Best-Fit allocator, is tunable, and provides good worst-case memory utilization guarantees.
Our compacting allocators are problem-specific, unlike general purpose allocators like Best-Fit. However, our allocators work for any application in which each allocated node is only pointed to by a small number of "parents". While we have only emphasized applications like lookup and hashing that have a single parent pointer, the allocators can also be used for applications with a small number of parents. We conjecture that other applications that use limited fast memory (cache memory for software, or SRAM for hardware) can also benefit from local compaction. Perhaps local compaction could become part of the "bag of tricks" available to systems implementors.
While our sample IP lookup scheme requires a shifted word access to access entire trie nodes, this need for an unconventional memory design can be avoided by using more standard IP lookup schemes like Lulea tries [DBCP97]. However, such compressed trie schemes need to be modified to allow incremental updates in order not to limit memory utilization to 50%. Finally, we note that pipelining is often cited in earlier work as a simple means for making lookup speeds scale with memory speeds (e.g., [SV98]). However, we have shown that pipelining adds new memory allocation problems of its own. While there are good solutions to this important problem, we leave details for a future paper.
ACKNOWLEDGEMENTS
The compaction problem was originally suggested to us by Will Eatherton and Zubin Dittia of Growth Networks. Will Eatherton provides an alternate (and elegant) solution for compaction in his thesis [Eat99] that works well for small size granularities. We are also grateful to the anonymous referees, and to our shepherd Nick McKeown, who helped us find an appropriate title. We also thank John Holst for helping assure us that a shifted memory design was feasible.
