IP address lookup is becoming critical because of increasing routing table size, speed, and traffic in the Internet. Our paper shows how binary search can be adapted for best matching prefix using two entries per prefix and by doing precomputation. Next we show how to improve the performance of any best matching prefix scheme using an initial array indexed by the first X bits of the address. We then describe how to take advantage of cache line size to do a multiway search with 6-way branching. Finally, we show how to extend the binary search solution and the multiway search solution for IPv6. For a database of N prefixes with address length W, a naive binary search scheme would take $O(W \log N)$; we show how to reduce this to $O(W + \log N)$ using multiple column binary search. Measurements using a practical (Mae-East) database of 30000 entries yield a worst case lookup time of 490 nanoseconds, five times faster than the Patricia trie scheme used in BSD UNIX. Our scheme is attractive for IPv6 because of small storage requirement (2N nodes) and speed (estimated worst case of 7 cache line reads).
I. INTRODUCTION
Statistics show that the number of hosts on the Internet is tripling approximately every two years [oT]. Traffic on the Internet is also increasing exponentially. Traffic increase can be traced not only to increased hosts, but also to new applications (e.g., the Web, video conferencing, remote imaging) which have higher bandwidth needs than traditional applications. One can only expect further increases in users, hosts, domains, and traffic. The possibility of a global Internet with multiple addresses per user (e.g., for appliances) has necessitated a transition from the older Internet routing protocol (IPv4 with 32 bit addresses) to the proposed next generation protocol (IPv6 with 128 bit addresses).
The difficulty of high speed packet forwarding is compounded by increasing routing database sizes (due to increased number of hosts) and the increased size of each address in the database (due to the transition to IPv6). Our paper deals with the problem of increasing IP packet forwarding rates in routers.
In particular, we deal with a component of high speed forwarding, address lookup, that is considered to be a major bottleneck.
When an Internet router gets a packet P from an input link interface, it uses the destination address in packet P to look up a routing database. The result of the lookup provides an output link interface, to which packet P is forwarded. There is some additional bookkeeping such as updating packet headers, but the major tasks in packet forwarding are address lookup and switching packets between link interfaces.

George Varghese was supported by an ONR Young Investigator Award and NSF Research Award NCR-9628145.
For Gigabit routing, many solutions exist which do fast switching within the router box [NMH97]. Despite this, the problem of doing lookups at Gigabit speeds remains. For example, Ascend's product [Asc] has hardware assistance for lookups and can take up to 3 µs for a single lookup in the worst case and 1 µs on average. However, to support say 5 Gbps with an average packet size of 512 bytes, lookups need to be performed in 800 nsec per packet. By contrast, our scheme can be implemented in software on an ordinary PC in a worst case time of 490 nsec.
The Best Matching Prefix Problem: Address lookup can be done at high speeds if we are looking for an exact match of the packet destination address to a corresponding address in the routing database. Exact matching can be done using standard techniques such as hashing or binary search. Unfortunately, most routing protocols (including OSI and IP) use hierarchical addressing to avoid scaling problems. Rather than have each router store a database entry for all possible destination IP addresses, the router stores address prefixes that represent a group of addresses reachable through the same interface. The use of prefixes allows scaling to worldwide networks.

The use of prefixes introduces a new dimension to the lookup problem: multiple prefixes may match a given address. If a packet matches multiple prefixes, it is intuitive that the packet should be forwarded corresponding to the most specific prefix or longest prefix match. IPv4 prefixes are arbitrary bit strings up to 32 bits in length as shown in Table I. To see the difference between the exact matching and best matching prefix, consider a 32 bit address A whose first 8 bits are 10001111. If we searched for A in the above table, exact match would not give us a match. However prefix matches are 100* and 1000*, of which the best matching prefix is 1000*, whose next hop is L5.

The rest of this paper is organized as follows. Section II describes related work and briefly describes our contribution. Section III contains our basic binary search scheme.
Section IV describes a basic idea of using an array as a front end to reduce the number of keys required for binary search. Section V describes how we exploit the locality inherent in cache lines to do multiway binary search; we also describe measurements for a sample large IPv4 database. Section VII describes how to do multicolumn and multiway binary search for IPv6. We also describe some measurements and projected performance estimates. Section VIII states our conclusions.

II. PREVIOUS WORK

[...] in the routing table, this takes 1.5 to 2.5 µs on the average. These numbers will worsen with larger databases.
[Skl] mentions that the expected number of bit tests for the Patricia tree is 1.44 log N, where N is the number of entries in the table. For N=32000, this is over 21 bit tests. With memory accesses being very slow for modern CPUs, 21 memory accesses is excessive. Patricia tries also use skip counts to compress one way branches, which necessitates backtracking. Such backtracking slows down the algorithm and makes pipelining difficult.
Many authors have proposed tries of high radix [PZ92] but only for exact matching of addresses. OSI address lookups are done naturally using trie search 4 bits at a time [Per92] but that is because OSI prefix lengths are always multiples of 4. Our methods can be used to do OSI address lookups 8 bits at a time.
[NMH97] claims that it is possible to do a lookup in 200 nsec using SRAMs (with 10 nsec cycle times) to store the entire routing database. We note that large SRAMs are extremely expensive and are typically limited to caches in ordinary processors.
Caching is a standard solution for improving average performance. However, experimental studies have shown poor cache hit ratios for backbone routers [NMH97]. This is partly due to the fact that caches typically store whole addresses. Finally, schemes like Tag and Flow Switching suggest protocol changes to avoid the lookup problem altogether. These proposals depend on widespread acceptance, and do not completely eliminate the need for lookups at network boundaries.
In the last year, two new techniques [BCDP97], [WVTP97] for doing best matching prefix have been announced. The approach in [BCDP97] is based on compressing trie nodes so that they will fit into the cache. The approach in [WVTP97] is based on doing binary search on the possible prefix lengths. Another approach, invented and patented by us and based on prefix expansion [SV98], seems to be the simplest and fastest of the schemes we know for IPv4. Detailed comparisons with other schemes are presented in [SV98].
Our Contributions:
In this paper, we start by showing how to modify binary search to do best matching prefix. Modified binary search requires two ideas: first, we treat each prefix as a range and encode it using the start and end of range; second, we arrange range entries in a binary search table and precompute a mapping between consecutive regions in the binary search table and the corresponding prefix.
Our approach is completely different from that of [WVTP97], as we do binary search on the number of possible prefixes as opposed to the number of possible prefix lengths. For example, the naive complexity of our scheme is $\log_2 N + 1$ memory accesses, where N is the number of prefixes; by contrast, the complexity of the [WVTP97] scheme is $\log_2 W$ hash computations plus memory accesses, where W is the length of the address in bits.
At first glance, it would appear that the scheme in [WVTP97] would be faster (except potentially for hash computation, which is not required in our scheme) than our scheme, especially for large prefix databases. However, we show that we can exploit the locality inherent in processor caches and fast cache line reads using SDRAM or RDRAM to do multiway search in $\log_{k+1} N + 1$ steps, where $k > 1$. We have found good results using k = 5. By contrast, it appears to be impossible to modify the scheme in [WVTP97] to do multiway search on prefix lengths because each search in a hash table only gives two possible outcomes.
Further, for long addresses (e.g., 128 bit IPv6 addresses), the true complexity of the scheme in [WVTP97] is closer to $O((W/M) \log_2 W)$, where M is the word size of the machine.² This is because computing a hash on a W bit address takes $O(W/M)$ time. By contrast, we introduce a multicolumn binary search scheme for IPv6 and OSI addresses that takes $\log_2 N + W/M + 1$ steps. Notice that the W/M factor is additive and not multiplicative. Using a machine word size of M = 32 and an address width W of 128, this is a potential multiplicative factor of 4 that is avoided in our scheme.
The approach in [BCDP97] is based on compressing k-bit trie nodes and takes O(W/k) time. It differs entirely from our approach. While the approach in [BCDP97] has search times comparable to ours for IPv4 (around 500 nsec), our approach should scale better for IPv6 when W becomes 128.
We also describe a simple scheme of using an initial array as a front end to reduce the number of keys required to be searched in binary search. Essentially, we partition the original database according to every possible combination of the first X bits. Our measurements use X = 16. Since the number of possible prefixes that begin with a particular combination of the first X bits is much smaller than the total number of prefixes, this is a big factor in practice.

¹ Details of this method are not in the public domain yet due to the patenting process.

² The scheme in [WVTP97] starts by doing a hash of W/2 bits; it can then do a hash on 3W/4 bits, followed by 7W/8 bits etc. Thus in the worst case, each hash may operate on roughly 3W/4 bits.
Our paper describes the results of several other measurements of speed and memory usage for our implementations of these two schemes. The measurements allow us to isolate the effects of individual optimizations and architectural features of the CPUs we used. We describe results using a publicly available routing database (Mae-East NAP) for IPv4, and by using randomly chosen 128 bit addresses for IPv6.
Our measurements show good results. Measurements using the (Mae-East) database of 30000 entries yield a worst case lookup time of 490 nanoseconds, five times faster than the Patricia trie scheme used in BSD UNIX on the same database. We also estimate the performance of our scheme for IPv6 using a special SDRAM or RDRAM memory (which is now commercially available though we could not obtain one in time to do actual experiments). This memory allows fast access to data within a page of memory, which enables us to speed up multiway search. Thus we estimate a worst case figure of 7 cache line reads for a large database of IPv6 entries.
Please note that in the paper, by memory reference we mean accesses to the main memory. Cache hits are not counted as memory references. So, if a cache line of 32 bytes is read, then accessing two different bytes in the 32 byte line is counted as one memory reference. This is justifiable, as a main memory read has an access time of 60 nsec while the on-chip L1 cache can be read at the clock speed of 5 nsec on an Intel Pentium Pro. With SDRAM or RDRAM, a cache line fill is counted as one memory access. With SDRAM a cache line fill is a burst read with burst length 4. While the first read has an access time of 60 nsec, the remaining 3 reads have access times of only 10 nsec each [Mic] . With RDRAM, an entire 32 byte cache line can be filled in 101 nsec [Ram] .
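The timing argument above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the paragraph (60 nsec for the first access, 10 nsec for each further access in a burst of length 4); the function names are ours and purely illustrative.

```python
def burst_fill_ns(first_ns=60, next_ns=10, burst_len=4):
    """Time to fill one cache line with an SDRAM burst read."""
    return first_ns + (burst_len - 1) * next_ns


def random_reads_ns(access_ns=60, reads=4):
    """Time for the same number of reads at unrelated DRAM locations."""
    return access_ns * reads


sdram_fill = burst_fill_ns()        # one 32 byte cache line fill
four_random = random_reads_ns()     # four independent main memory reads
```

Under these numbers a full cache line arrives in well under half the time of four independent reads, which is why the paper counts a cache line fill as a single memory access.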
III. ADAPTING BINARY SEARCH FOR BEST MATCHING PREFIX
Binary search can be used to solve the best matching prefix problem, but only after several subtle modifications. Assume for simplicity in the examples that we have 6 bit addresses and three prefixes 1*, 101*, and 10101*. First, binary search does not work with variable length strings. Thus the simplest approach is to pad each prefix to be a 6 bit string by adding zeroes. This is shown in Figure 1.

Now consider a search for the three 6 bit addresses 101011, 101110, and 111110. Since none of these addresses are in the table, binary search will fail. Unfortunately, on a failure all three of these addresses will end up at the end of the table because all of them are greater than 101010, which is the last element in the binary search table. Notice however that each of these three addresses (see Figure 1) has a different best matching prefix.
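The failure of the naive scheme is easy to reproduce. The following is a minimal sketch, assuming the 6 bit example above; the helper names `pad` and `naive_search` are ours and are not taken from the paper.

```python
from bisect import bisect_left

W = 6  # address width used in the running example


def pad(prefix, width=W):
    """Pad a prefix such as '101' with zeroes to a full-width integer."""
    return int(prefix.ljust(width, "0"), 2)


# Naive table: one zero-padded entry per prefix, sorted.
table = sorted(pad(p) for p in ["1", "101", "10101"])


def naive_search(addr):
    """Return (position where binary search ends, whether it matched exactly)."""
    a = int(addr, 2)
    i = bisect_left(table, a)
    return i, (i < len(table) and table[i] == a)


results = {a: naive_search(a) for a in ["101011", "101110", "111110"]}
```

All three addresses fail and land at the same position past the last entry, even though their best matching prefixes are 10101*, 101*, and 1* respectively.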
Thus we have two problems with naive binary search: first, when we search for an address we end up far away from the matching prefix (potentially requiring a linear search); second, multiple addresses that match to different prefixes end up in the same region in the binary table (Figure 1).

Encoding Prefixes as Ranges: To solve the second problem, we recognize that a prefix like 1* is really a range of addresses from 100000 to 111111. Thus instead of encoding 1* by just 100000 (the start of the range), we encode it using both the start and end of the range. Thus each prefix is encoded by two full length bit strings. These bit strings are then sorted. The result for the same three prefixes is shown in Figure 2. We connect the start and end of a range (corresponding to a prefix) by a line in Figure 2. Notice how the ranges are nested. If we now try to search for the same set of addresses, they each end in a different region in the table. To be more precise, the search for address 101011 ends in an exact match. The search for address 101110 ends in a failure in the region between 101011 and 101111 (Figure 2), and the search for address 111110 ends in a failure in the region between 101111 and 111111. Thus it appears that the second problem (multiple addresses that match different prefixes ending in the same region of the table) has disappeared. Compare Figure 1 and Figure 2.

To see that this is a general phenomenon, consider Figure 3. The figure shows an arbitrary binary search table after every prefix has been encoded by the low point (marked L in Figure 3) and the high point (marked H) of the corresponding range. Consider an arbitrary position indicated by the solid arrow. If binary search for address A ends up at this point, which prefix should we map A to? It is easy to see the answer visually from Figure 3.
If we start from the point shown by the solid arrow and we go back up the table, the prefix corresponding to A is the first L that is not followed by a corresponding H (see dotted arrow in Figure 3).

Why does this work? Since we did not encounter an H corresponding to this L, it clearly means that A is contained in the range corresponding to this prefix. Since this is the first such L, this is the smallest such range. Essentially, this works because the best matching prefix has been translated to the problem of finding the narrowest enclosing range.
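The walk back up the table can be sketched directly. This is an illustrative version under our own assumptions: entries are tuples `(value, kind, tie_break, prefix)` with kind 0 for a low point (L) and 1 for a high point (H), and an H whose value equals the address is treated as still enclosing it. None of these names come from the paper.

```python
from bisect import bisect_right

W = 6  # address width used in the running example


def to_range(prefix, width=W):
    """Encode a prefix as the (low, high) end points of its address range."""
    return int(prefix.ljust(width, "0"), 2), int(prefix.ljust(width, "1"), 2)


def build_table(prefixes):
    """Two entries per prefix, sorted so that outer ranges open first and close last."""
    entries = []
    for p in prefixes:
        low, high = to_range(p)
        entries.append((low, 0, len(p), p))     # L entry
        entries.append((high, 1, -len(p), p))   # H entry
    return sorted(entries)


def walk_back(entries, addr):
    """Linear walk upward: first L whose matching H has not been passed."""
    a = int(addr, 2)
    start = bisect_right(entries, (a, 2))  # index of first entry with value > a
    closed = set()
    for value, kind, _, p in reversed(entries[:start]):
        if kind == 1:
            if value < a:          # this range ended before the address
                closed.add(p)
        elif p not in closed:
            return p               # narrowest range still enclosing the address
    return None


entries = build_table(["1", "101", "10101"])
```

The walk is linear in the worst case, which is exactly the first problem that the precomputation of the next subsection removes.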
A. Using Precomputation to Avoid Search
Unfortunately, the solution depicted in Figure 2 and Figure 3 does not solve the first problem: notice that binary search ends in a position that is far away (potentially) from the actual prefix. If we were to search for the prefix (as described earlier), we could have a linear time search.
However, the modified binary search table shown in Figure 3 has a nice property we can exploit. Any region in the binary search between two consecutive numbers corresponds to a unique prefix. As described earlier, the prefix corresponds to the first L before this region that is not matched by a corresponding H that also occurs before this region. Similarly, every exact match corresponds to a unique prefix.
But if this is the case, we can precompute the prefix corresponding to each region and to each exact match. This can potentially slow down insertion. However, the insertion or deletion of a new prefix should be a rare event (the next hop to reach a prefix may change rapidly, but the addition of a new prefix should be rare) compared to packet forwarding times. Thus slowing down insertion costs for the sake of faster forwarding is a good idea. Essentially, the idea is to add the dotted line pointer shown in Figure 3 to every region.
Our scheme is somewhat different from the description in [Per92]. We use two pointers per entry instead of just one pointer. The description in [Per92] suggests padding every address by an extra bit; this avoids the need for an extra pointer but it makes the implementation grossly inefficient because it works on 33 bit (i.e., for IPv4) or 129 bit (i.e., for IPv6) quantities. If there are fewer than $2^{16}$ different choices of next hop, then the two pointers can be packed into a 32 bit quantity, which is probably the minimum storage needed.
B. Insertion into a Modified Binary Search Table
The simplest way to build a modified binary search table from scratch is to first sort all the entries, after marking each entry as a high or a low point of a range. Next, we process the entries, using a stack, from the lowest down to the highest to precompute the corresponding best matching prefixes. Whenever we encounter a low point (L in the figures), we stack the corresponding prefix; whenever we see the corresponding high point, we unstack the prefix. Intuitively, as we move down the table, we are keeping track of the currently active ranges; the top of the stack keeps track of the innermost active range. The prefix on top of the stack can be used to set the > pointers for each entry, and the = pointers can be computed trivially. Once the entries are sorted, this pass is an O(N) algorithm if there are N prefixes in the table.
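The stack pass can be sketched as follows. We read the = pointer as the best matching prefix for an address equal to the entry and the > pointer as the best matching prefix for addresses strictly between this entry and the next; that reading, the 6 bit width, and all names below are our assumptions for illustration.

```python
from bisect import bisect_right
from itertools import groupby


def build(prefixes, width=6):
    """Return parallel lists (keys, eq, gt) for the modified binary search table."""
    entries = []
    for p in prefixes:
        low = int(p.ljust(width, "0"), 2)
        high = int(p.ljust(width, "1"), 2)
        entries.append((low, 0, len(p), p))     # L: outer ranges sort first
        entries.append((high, 1, -len(p), p))   # H: inner ranges sort first
    entries.sort()

    keys, eq, gt, stack = [], [], [], []
    for value, group in groupby(entries, key=lambda e: e[0]):
        group = list(group)
        for _, kind, _, p in group:
            if kind == 0:
                stack.append(p)                 # range becomes active
        eq.append(stack[-1])                    # innermost range containing value
        for _, kind, _, _ in group:
            if kind == 1:
                stack.pop()                     # range ends at this value
        gt.append(stack[-1] if stack else None)
        keys.append(value)
    return keys, eq, gt


def lookup(table, addr):
    """Binary search followed by one precomputed pointer, no walk back."""
    keys, eq, gt = table
    a = int(addr, 2)
    i = bisect_right(keys, a) - 1               # last entry with key <= a
    if i < 0:
        return None
    return eq[i] if keys[i] == a else gt[i]


table = build(["1", "101", "10101"])
```

With the pointers precomputed, a lookup costs one binary search over the 2N keys plus one pointer read, matching the $\log_2 N + 1$ figure quoted earlier.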
One might hope for a faster insertion algorithm if we had to only add (or delete) a prefix. First, we could represent the binary search table as a binary tree in the usual way. This avoids the need to shift entries to make room for a new entry. Unfortunately, the addition of a new prefix can affect the precomputed information in O(N) prefixes. This is illustrated in Figure 5. The figure shows an outermost range corresponding to prefix P; inside this range are N - 1 smaller ranges (prefixes) that do not intersect. In the regions not covered by these smaller prefixes, we map to P. Unfortunately, if we now add Q (Figure 5), we cause all these regions to map to Q, an O(N) update process.
Thus there does not appear to be any update technique that is faster than just building a table from scratch. Of course, many insertions can be batched: if the update process falls behind, the batching will lead to more efficient updates.
IV. PRECOMPUTED 16 BIT PREFIX TABLE
We can improve the worst case number of memory accesses of the basic binary search scheme with a precomputed table of best matching prefixes for the first Y bits. The main idea is to effectively partition the single binary search table into multiple binary search tables for each value of the first Y bits. This is illustrated in Figure 6. We choose Y = 16 for what follows as the table size is about as large as we can afford, while providing maximum partitioning.
Without the initial table, the worst case possible number of memory accesses is log2N + 1, which for large databases could be 16 or more memory accesses. For a sample database, this simple trick of using an array as a front end reduces the maximum number of prefixes in each partitioned table to 336 from the maximum value of over 30,000.
The best matching prefixes for the first 16 bit prefixes can be precomputed and stored in a table. This table would then have Max = 65536 elements. For each index X of the array, the corresponding array element stores the best matching prefix of X. Additionally, if there are prefixes of longer length with that prefix X, the array element stores a pointer to a binary search table/tree that contains all such prefixes. Insertion, deletion, and search in the individual binary search tables is identical to the technique described earlier in Section III.

Figure 7 shows the distribution of the number of keys that would occur in the individual binary search trees for a publicly available IP backbone router database [Mer] after going through an initial 16 bit array. The largest number of keys in any binary table is found to be 336, which leads to a worst case of 10 memory accesses.

Today's processors have wide cache lines. The Intel Pentium Pro has a cache line size of 32 bytes. Main memory is usually arranged in a matrix form, with rows and columns. Accessing data given a random row address and column address has an access time of 50 to 60 nsec. However, using SDRAM or RDRAM, filling a cache line of 32 bytes, which is a burst access to 4 contiguous 64 bit DRAM locations, is much faster than accessing 4 random DRAM locations. When accessing a burst of contiguous columns in the same row, while the first piece of data would be available only after 60 nsec, further columns would be available much faster. SDRAMs (Synchronous DRAMs) are available (at $205 for 8MB [Sim]) that have a column access time of 10 nsec. Timing diagrams of Micron SDRAMs are available through [Mic]. RDRAMs [Ram] are available that can fill a cache line in 101 nsec. The Intel Pentium Pro has a 64 bit data bus and a 256 bit cache line [Inta]. Detailed descriptions of main memory organization can be found in [HPSG].
The significance of this observation is that it pays to restructure data structures to improve locality of access. To make use of the cache line fill and the burst mode, keys and pointers in search tables can be laid out to allow multiway search instead of binary search. This effectively allows us to reduce the search time of binary search from $\log_2 N$ to $\log_{k+1} N$, where k is the number of keys in a search node. The main idea is to make k as large as possible so that a single search node (containing k keys and 2k + 1 pointers) fits into a single cache line. If this can be arranged, an access to the first word in the search node will result in the entire node being prefetched into cache. Thus the accesses to the remaining keys in the search node are much cheaper than a memory access.
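The sizing argument can be checked numerically. The sketch below assumes 16 bit keys, 16 bit pointers, and a 32 byte (256 bit) cache line, which are the values the paper reports for its Pentium Pro experiments, and the 336-key worst case table reported for the Mae-East database; the function names are ours.

```python
def keys_per_node(cache_line_bits, key_bits, ptr_bits):
    """Largest k such that k keys and 2k + 1 pointers fit in one cache line."""
    k = 0
    while (k + 1) * key_bits + (2 * (k + 1) + 1) * ptr_bits <= cache_line_bits:
        k += 1
    return k


def worst_case_levels(n_keys, k):
    """Smallest number of (k+1)-way levels whose fan-out covers n_keys."""
    levels, reach = 0, 1
    while reach < n_keys:
        reach *= k + 1
        levels += 1
    return levels


k = keys_per_node(cache_line_bits=256, key_bits=16, ptr_bits=16)
levels = worst_case_levels(n_keys=336, k=k)
```

With these inputs a node holds 5 keys and 11 pointers, filling the 256 bit line exactly, and four 6-way levels suffice for 336 keys.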
We did our experiments using a Pentium Pro; the parameters of the Pentium Pro resulted in us choosing k = 5 (i.e., doing a six way search). For our case, if we use k keys per node, then we need 2k + 1 pointers, each of which is a 16 bit quantity. So in 32 bytes we can place 5 keys and hence can do a 6-way search. The initial full array of 16 bits followed by the 6-way search is depicted in Figure 8. This shows that the worst case (for the Mae East database after using a 16 bit initial array) has 336 entries leading to a worst case of 4 memory accesses (since $6^4 = 1296$ takes only 4 memory accesses when doing a 6-way search).

Each node in the 6-way search table has 5 keys $k_1$ to $k_5$, each of which is 16 bits. There are equal to pointers $p_1$ to $p_5$ corresponding to each of these keys. Pointers $p_{01}$ to $p_{56}$ correspond to ranges demarcated by the keys. This is shown in Figure 9. Among the keys we have the relation $k_i \le k_{i+1}$. Each pointer has a bit which says it is an information pointer or a next node pointer.

A. Search

The following search procedure can be used for both IPv4 and IPv6. For IPv6, 32 bit keys can be used instead of 16 bits.

1. Index into the first 16 bit array using the first 16 bits of the address.
2. If the pointer at the location is an information pointer, return it. Otherwise enter the 6-way search with the initial node given by the pointer, and the key being the next 16 bits of the address.
3. In the current 6-way node locate the position of the key among the keys in the 6-way node. We use binary search among the keys within a node. If the key equals any of the keys $key_i$ in the node, use the corresponding pointer $ptr_i$. If the key falls in any range formed by the keys, use the pointer $ptr_{i,i+1}$. If this pointer is an information pointer, return it; otherwise repeat this step with the new 6-way node given by the pointer.

In addition, we allow multicolumn search for IPv6 (see Section VII) as follows. If we encounter an equal to pointer, the search shifts to the next 16 bits of the input address. This feature can be ignored for now and will be understood after reading Section VII.

As the data structure itself is designed with a node size equal to a cache line size, good caching behavior is a consequence. All the frequently accessed nodes will stay in the cache. To reduce the worst case access time, the first few levels in a worst case depth tree can be cached.
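The three-step search procedure of this section can be sketched for the IPv4 case as follows. This is a simplified illustration, not the authors' implementation: the `Info` and `Node` classes, the string payloads, and the hand-built two-level tree are hypothetical, Python objects stand in for the tagged 16 bit pointers, and the multicolumn shift on equal to pointers is omitted.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Info:
    next_hop: str  # hypothetical payload: the precomputed best matching prefix


@dataclass
class Node:
    keys: List[int]                    # up to 5 sorted 16 bit keys
    eq: List[Union[Info, "Node"]]      # eq[i]: used when key == keys[i]
    rng: List[Union[Info, "Node"]]     # rng[i]: used when key falls before keys[i]


def search(node, key):
    """Step 3: descend through 6-way nodes until an information pointer is hit."""
    while True:
        i = bisect_left(node.keys, key)         # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            ptr = node.eq[i]
        else:
            ptr = node.rng[i]
        if isinstance(ptr, Info):
            return ptr.next_hop
        node = ptr


def lookup(first16, addr):
    """Steps 1 and 2: index the initial array, then search on the low 16 bits."""
    ptr = first16[addr >> 16]
    if isinstance(ptr, Info):
        return ptr.next_hop
    return search(ptr, addr & 0xFFFF)


# Hypothetical data: one populated slot in the initial array, two tree levels.
child = Node(keys=[0x3000], eq=[Info("c_e0")], rng=[Info("c_r0"), Info("c_r1")])
root = Node(keys=[0x1000, 0x2000],
            eq=[Info("e0"), Info("e1")],
            rng=[Info("r0"), Info("r1"), child])
first16 = [Info("default")] * 65536
first16[0x0A00] = root
```

Each iteration of the loop in `search` touches one node, so in the paper's layout it corresponds to one cache line fill.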
VI. MEASUREMENTS AND COMPARISON FOR IPv4
We used a Pentium Pro [Intb] based machine, with a 200 MHz clock (cost under 5000 dollars). It has an 8 KByte set associative primary instruction cache and an 8 KByte dual ported two-way set associative primary data cache. The L2 cache is 256 KBytes of SRAM that is coupled to the core processor through a full clock-speed, 64-bit, cache bus.
We used a practical routing table with over 32000 entries that we obtained from [Mer] for our experiments. Our tables list results for the BSD Radix Trie implementation (extracted from the BSD kernel into user space), binary search (Bsearch) and 6-way search.
Repeated lookup of a single address: After adding the routes in the route database, random IP addresses were generated and a lookup performed 10 million times for each such address. We picked 7 of these results to display in Table II.

Average search time: 10 million IP addresses were generated and looked up, assuming that all IP addresses were equally probable. It was found that the average lookup time was 130 nanoseconds.
Memory Requirements and Worst case time
The memory requirement for the 6-way search is less than that for basic binary search! Though at first this looks counter-intuitive, this is again due to the initial 16 bit array.

From Table III we can see that the initial array improves the performance of the binary search from a worst case of 1310 nsec to 730 nsec; multiway search further improves the search time to 490 nsec.
Instruction count:
The static instruction count for the search using a full 16 bit initial table followed by a 6-way search table is less than 100 instructions on the Pentium Pro. We also note that the gcc compiler uses only 386 instructions and does not use the special instructions available on the Pentium Pro, with which it might be possible to further reduce the number of instructions.
VII. USING MULTIWAY AND MULTICOLUMN SEARCH FOR IPv6
In this section we describe the problems of searching for identifiers of large width (e.g., 128 bit IPv6 addresses or 20 byte OSI addresses). We first describe the basic ideas behind multicolumn search and then proceed to describe an implementation for IPv6 that uses both multicolumn and multiway search. We then describe sample measurements using randomly generated IPv6 addresses.
A. Multicolumn Binary Search of Large Identifiers
The scheme we have just described can be implemented efficiently for searching 32 bit IPv4 addresses. Unfortunately, a naive implementation for IPv6 can lead to inefficiency. Assume that the word size M of the machine implementing this algorithm is 32 bits. Since IPv6 addresses are 128 bits (4 machine words), a naive implementation would take $4 \cdot \log_2(2N)$ comparisons. It is important to note that the optimization we describe can be useful for any use of binary search on long identifiers, not just the best matching prefix problem.
The strategy is to work in columns, starting with the most significant word and doing binary search in that column until we get equality in that column. At that point, we move to the next column to the right and continue the binary search where we left off. Unfortunately, this does not quite work.
In Figure 10, which has W/M = 3, suppose we are searching for the three word identifier BMW (pretend each character is a word). We start by comparing in the leftmost column in the middle element (shown by the arrow labeled 1).
Fig. 10. Binary search by columns does not work when searching for BMW.
The problem is caused by the fact that when we moved to the quarter position in column 2, we assumed that all elements in the second quarter begin with B. This assumption is false in general. The trick is to add state to each element in each column which can constrain the binary search to stay within a guard range.
In the figure, for each word like B in the leftmost (most significant) column, we add a pointer to the range of all other words that also contain B in this position. Thus the first probe of the binary search for BMW starts with the B in BNX. On equality, we move to the second column as before. However, we also keep track of the guard range corresponding to the B's in the first column. The guard range (rows 4 through 6) is stored with the first B we compared. Thus when we move to column 2 and we find that M in BMW is less than the N in BNX, we attempt to halve the range as before and try a second probe at the third entry (the M in AMT). However the third entry is lower than the low point of the current guard range (4 through 6). So without doing a compare, we try to halve the binary search range again. This time we try entry 4 which is in the guard range. We get equality and move to the right, and find BMW as desired.
In general, every multiword entry $W_1, W_2, \ldots, W_n$ will store a guard range with every word. The range for $W_i$ points to the range of entries that have $W_1, W_2, \ldots, W_i$ in the first $i$ words. This ensures that when we get a match with $W_i$ in the $i$-th column, the binary search in column $i + 1$ will only search in this guard range. For example, the N entry in BNY (second column) has a guard range of 5-7, because these entries all have BN in the first two words.
The naive way to implement guard ranges is to change the guard range when we move between columns. However, the guard ranges may not be powers of 2, which will result in expensive divide operations. A simpler way is to follow the usual binary search probing. If the table size is a power of 2, this can easily be implemented. If the probe is not within the guard range, we simply keep halving the range until the probe is within the guard. Only then do we do a compare.
The resulting search strategy takes log2 N + W/M probes if there are N identifiers. The cost is the addition of two 16 bit pointers to each word. Since most word sizes are at least 32 bits, this results in adding 32 bits of pointer space for each word, which can at most double memory usage.
Once again, the dominant idea is to use precomputation to trade a slower insertion time for a faster search.
We note that the whole scheme can be elegantly represented by a binary search tree with each node having the usual > and < pointers, but also an = pointer which corresponds to moving to the next column to the right as shown above. The subtree corresponding to the = pointer naturally represents the guard range.
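The tree representation just described can be sketched as follows, for exact match on identifiers made of several words. This is an illustration under our own assumptions: each character of a string stands for one machine word, the eight-row table is hypothetical (only AMT, BMW, BNX, and BNY appear in the text), and the names `Node`, `build`, and `search` are ours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    word: str
    lt: Optional["Node"]          # smaller words, same column
    gt: Optional["Node"]          # larger words, same column
    eq: Optional["Node"]          # next column, restricted to the guard range
    row: Optional[int] = None     # table row, set only in the last column


def build(rows, lo, hi, col):
    """Build the search tree for sorted rows[lo:hi], comparing column col."""
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    w = rows[mid][col]
    g_lo, g_hi = mid, mid + 1     # guard range: rows in [lo, hi) sharing word w
    while g_lo > lo and rows[g_lo - 1][col] == w:
        g_lo -= 1
    while g_hi < hi and rows[g_hi][col] == w:
        g_hi += 1
    last = col == len(rows[mid]) - 1
    return Node(w,
                build(rows, lo, g_lo, col),
                build(rows, g_hi, hi, col),
                None if last else build(rows, g_lo, g_hi, col + 1),
                g_lo if last else None)


def search(root, ident):
    """Return (row index or None, number of word comparisons made)."""
    node, col, probes = root, 0, 0
    while node is not None:
        probes += 1
        w = ident[col]
        if w < node.word:
            node = node.lt
        elif w > node.word:
            node = node.gt
        elif node.row is not None:
            return node.row, probes
        else:
            node, col = node.eq, col + 1   # equality: move one column right
    return None, probes


rows = sorted(["AAB", "AMT", "BMW", "BNX", "BNY", "BNZ", "CCC", "CNB"])
root = build(rows, 0, len(rows), 0)
```

Every < or > step at least halves the candidate rows and every = step consumes one column, so on this 8-row, 3-word table no lookup should need more than $\log_2 8 + 3 = 6$ word comparisons.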
B. Using Multicolumn and Multiway Search for IPv6
In this section we explore several possible ways of using the k-way search scheme for IPv6. With the 128 bit address, if we used columns of 16 bits each, then we would need 8 columns. With 16 bit keys we can do a 6-way search. So the number of memory accesses in the worst case would be $\log_6(2N) + 8$. For N around 50,000 this is 15 memory accesses. In general, if we used columns of M bits, the worst case time would be $\log_{k+1} N + W/M$, where W = 128 for IPv6. The value of k depends on the cache line size C.
Since k keys require 2k + 1 pointers, the following inequality must hold if we use pointers that are p bits long:
$$kM + (2k + 1)p \le C$$
For the Intel Pentium Pro, C is 32 bytes, i.e. 32 * 8 = 256 bits. If we use p = 16, the inequality becomes $k(M + 32) \le 240$, with the worst case time being $\log_{k+1} N + 128/M$.
In general, the worst case number of memory accesses is $T = \log_{k+1} N + W/M$. Here W is the number of bits in the address, M is the number of bits per column in the multiple column binary search, k is the number of keys in one node, C is the cache line size in bits, and p is the number of bits to represent the pointers within the structure. Fig. 11 shows that the W bits in an IP address are divided into M bits per column. Each of these M bits make up an M bit key, k of which are to be fitted in the search node of length C bits along with 2k + 1 pointers of length p bits.
For typical values of N, the number of prefixes, the following [...]. However, by using the initial array, the number of prefixes in a single tree can be reduced. For IPv4 the maximum number in a single tree was 336 for a practical database with N more than 30000 (i.e., the number of prefixes that have the same first 16 bits is 168, leading to 336 keys). For IPv6, with p = 16, even if there is an increase of 10 times in the number of prefixes that share the same first 16 bits, for 2048 prefixes in a tree we get a worst case of 9 cache line fills with a 32 byte cache line. For a 64 byte cache line machine, we get a worst case of 7 cache line fills. This would lead to worst case lookup times of less than 800 nsec, which is competitive with the scheme presented in [WVTP97].
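The arithmetic in this subsection can be checked with a short Python sketch. The function names are hypothetical. Rounding the logarithm and W/M up to whole accesses, taking M = 64 for both cache line sizes, and using 2048 as the tree size are assumptions of this sketch; under them it reproduces the figures of 15, 9 and 7 quoted in the text.

```python
def keys_per_node(C, M, p):
    """Largest k satisfying k*M + (2*k + 1)*p <= C (all sizes in bits)."""
    return (C - p) // (M + 2 * p)


def tree_depth(n, fanout):
    """Smallest d with fanout**d >= n, i.e. the log rounded up, without floats."""
    depth, reach = 0, 1
    while reach < n:
        reach *= fanout
        depth += 1
    return depth


def worst_case_accesses(n, W, M, C, p):
    """Worst case accesses: depth of a (k+1)-way search over n entries
    plus one access per column of M bits (rounded up)."""
    k = keys_per_node(C, M, p)
    if k < 1:
        raise ValueError("a node must hold at least one key")
    columns = -(-W // M)              # ceiling of W / M
    return tree_depth(n, k + 1) + columns


# 16 bit columns, 32 byte cache line, 2N = 100,000 entries for N = 50,000.
print(keys_per_node(256, 16, 16))                      # 5, i.e. 6-way search
print(worst_case_accesses(100_000, 128, 16, 256, 16))  # 15
# 2048 entries in a tree, M = 64, p = 16.
print(worst_case_accesses(2048, 128, 64, 256, 16))     # 9  (32 byte line)
print(worst_case_accesses(2048, 128, 64, 512, 16))     # 7  (64 byte line)
```

With a 32 byte line and M = 64 only two keys fit in a node, so the search is 3-way; doubling the line size to 64 bytes allows five keys and a 6-way search, which accounts for the drop from 9 to 7 accesses.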
C. Measurements
We generated random IPv6 prefixes and inserted them into a k-way search structure with an initial 16 bit array. From the practical IPv4 database, it was seen that with N about 30000, the maximum number of prefixes that shared the first 16 bits was about 300, which is about 1% of the total number of prefixes. To capture this, when generating IPv6 prefixes, we generated the last 112 bits randomly and distributed them among the slots in the first 16 bit table such that the maximum number that falls in any slot is around 1000. This is necessary because if the whole IPv6 prefix is generated randomly, even with N about 60000, only about 1 prefix is expected to fall in any given first 16 bit slot. On a Pentium Pro, which has a cache line of 32 bytes, the worst case search time was found to be 970 nsec, using M = 64 and p = 16.
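The text does not say how the prefixes were distributed among the slots. The following Python sketch shows one hypothetical way to obtain such a skew: a few "hot" slots each receive 1000 prefixes and the remainder are fully random. The function names, the number of hot slots, and the seed are assumptions made only for illustration.

```python
import random
from collections import Counter


def generate_prefixes(n, hot_slots=8, per_hot_slot=1000, seed=1):
    """Return n random 128-bit values. Each of hot_slots randomly chosen
    16-bit leading values receives per_hot_slot of them (hypothetical skew);
    the rest are uniformly random. Assumes n >= hot_slots * per_hot_slot."""
    rng = random.Random(seed)
    prefixes = []
    for slot in rng.sample(range(1 << 16), hot_slots):
        for _ in range(per_hot_slot):
            # fixed first 16 bits, random last 112 bits
            prefixes.append((slot << 112) | rng.getrandbits(112))
    while len(prefixes) < n:
        prefixes.append(rng.getrandbits(128))
    return prefixes


def max_slot_load(prefixes):
    """Largest number of prefixes sharing the same first 16 bits."""
    return max(Counter(p >> 112 for p in prefixes).values())


skewed = generate_prefixes(60000)
uniform = generate_prefixes(60000, hot_slots=0)
print(max_slot_load(skewed), max_slot_load(uniform))
```

With 60000 uniformly random prefixes spread over 65536 slots, the expected load per slot is 60000 / 65536, or about 0.92, so no slot comes close to the roughly 1% concentration seen in the IPv4 database unless the skew is imposed explicitly.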
VIII. CONCLUSION
We have described a basic binary search scheme for the best matching prefix problem. Basic binary search requires two new ideas: encoding a prefix as the start and end of a range, and precomputing the best matching prefix associated with a range. Then we have presented three crucial enhancements: use of an initial array as a front end, multiway search, and multicolumn search of identifiers with large lengths.
We have shown how using an initial precomputed 16 bit array can reduce the number of required memory accesses from 16 to 9 in a typical database; we expect similar improvements in other databases. We then presented the multiway search technique, which exploits the fact that most processors prefetch an entire cache line when doing a memory access. A 6-way branching search leads to a worst case of 5 cache line fills in a Pentium Pro, which has a 32 byte cache line. We presented measurements for IPv4. Using a typical database of over 30,000 prefixes we obtain a worst case time of 490 nsec and an average time of 130 nsec using storage of 0.7 Mbytes. We believe these are very competitive numbers, especially considering the small storage needs. For IPv6 and other long addresses, we introduced multicolumn search, which avoids the multiplicative factor of W/M inherent in basic binary search by doing binary search in columns of M bits and moving between columns using precomputed information. We have estimated that this scheme potentially has a worst case of 7 cache line fills for a database with over 50000 IPv6 prefixes.
For future work, we are considering the problem of using a different number of bits in each column of the multicolumn search. We are also considering the possibility of laying out the search structure to make use of the page mode load to the L2 cache by prefetching. We are also trying to retrofit our Pentium Pro with an SDRAM or RDRAM to improve cache loading performance; this should allow us to obtain better measured performance.
