Abstract-We propose a data structure, the dynamic tree bitmap, for the representation of dynamic IP router tables that must support very high lookup and update rates. In fact, the dynamic tree bitmap is able to support updates at the same rate as lookups and is very competitive with other structures (tree bitmap and BaRT) proposed earlier for dynamic tables. Although the dynamic tree bitmap requires more memory than is required by the tree bitmap and BaRT structures, the required memory remains reasonable. The real value of our structure is its ability to support a very high update rate.
I. INTRODUCTION
An Internet router classifies incoming packets into flows utilizing information contained in packet headers and a table of (classification) rules. This table is called the rule table (equivalently, router table). In this paper, we assume that packet classification is done using only the destination address of a packet. Each rule-table rule is a pair of the form (F, NH), where F is a filter and NH is a next hop. In this paper, we assume that each filter is a destination-address prefix and that a filter matches all destination addresses for which it is a prefix. For example, the filter 10* matches all destination addresses that begin with the bit sequence 10; the length of this prefix is 2. Since an Internet rule-table may contain several rules that match a given destination address d, a tie breaker is used to select a rule from the set of rules that match d. Traditionally, ties are broken by selecting the next hop associated with the longest prefix that matches the packet's destination address. In this paper, we focus on longest-prefix matching. We refer to the associated rule tables as LMPTs. We use W to denote the maximum possible length of a prefix. In IPv4, W = 32 and in IPv6, W = 128.
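As a point of reference, the longest-prefix tie breaker described above can be sketched as a linear scan over the rule table. This is only an illustration of the matching rule, not one of the data structures studied in this paper; the function names and the representation of filters and addresses as bit strings are assumptions made for the example.

```python
def matches(filt: str, address: str) -> bool:
    """True if the filter (e.g. '10*') is a prefix of the address bit string."""
    return address.startswith(filt.rstrip("*"))


def longest_prefix_match(rules, address):
    """Return the next hop of the longest filter matching `address`, or None.

    `rules` is a list of (filter, next_hop) pairs; this is an O(n) scan.
    """
    best = None
    for filt, next_hop in rules:
        if matches(filt, address):
            # Keep the rule whose filter has the most bits.
            if best is None or len(filt) > len(best[0]):
                best = (filt, next_hop)
    return None if best is None else best[1]


# Hypothetical rule table used only for illustration.
rules = [("*", "A"), ("10*", "B"), ("101*", "C")]
```

Here the address 10110000 matches all three filters, and the rule with filter 101* is selected because it is the longest.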
Data structures for longest-prefix matching have been intensely researched in recent years. Ternary content-addressable memories, TCAMs, use parallelism to achieve O(1) lookup [22]. Each memory cell of a TCAM may be set to one of three states: 0, 1, and don't care. The prefixes of a router table are stored in a TCAM in descending order of prefix length. Assume that each word of the TCAM has 32 cells. The prefix 10* is stored in a TCAM word as 10???...?, where ? denotes a don't care and there are 30 ?s in the given sequence. To do a longest-prefix match, the destination address is matched, in parallel, against every TCAM entry and the first (i.e., longest) matching entry is reported by the TCAM arbitration logic. So, using a TCAM and a sorted-by-length linear list, the longest matching prefix can be determined in O(1) time. A prefix may be inserted or deleted in O(q) time, where q is the number of different prefix lengths in the table [32]. Although TCAMs provide a simple and efficient solution for static and dynamic router tables, this solution requires special hardware, costs more, and uses more power and board space than solutions that employ SDRAMs. EZchip Technologies, for example, claim that classifiers can forgo TCAMs in favor of commodity memory solutions [2], [21]. Algorithmic approaches that have lower power consumption and are conservative on board space at the price of slightly increased search latency are sought. "System vendors are willing to accept some latency in their searches if it means lowering the power of a line card" [21].
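The TCAM scheme described above can be simulated in software as follows. This is a minimal sketch under stated assumptions: each TCAM word is modeled as a (value, mask) pair in which mask bits set to 0 play the role of don't-care cells, and the parallel match performed by the hardware is replaced by a sequential scan. The names `tcam_word`, `build_tcam`, and `tcam_lookup` are hypothetical.

```python
W = 32  # cells per TCAM word (IPv4)


def tcam_word(prefix: str):
    """Encode a prefix such as '10*' as (value, mask); mask 0-bits are don't cares."""
    bits = prefix.rstrip("*")
    n = len(bits)
    value = (int(bits, 2) << (W - n)) if n else 0
    mask = ((1 << n) - 1) << (W - n)
    return value, mask


def build_tcam(rules):
    """Store the rules in descending order of prefix length."""
    ordered = sorted(rules, key=lambda r: len(r[0].rstrip("*")), reverse=True)
    return [tcam_word(p) + (nh,) for p, nh in ordered]


def tcam_lookup(tcam, address: int):
    """Report the first matching entry; a real TCAM compares all entries in parallel."""
    for value, mask, next_hop in tcam:
        if (address & mask) == value:
            return next_hop
    return None


# Hypothetical table used only for illustration.
tcam = build_tcam([("*", "A"), ("10*", "B"), ("101*", "C")])
```

Because the entries are sorted by length, the first match is also the longest match, which is the property the arbitration logic relies on.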
Ruiz-Sanchez, Biersack, and Dabbous [24] review data structures for static LMPTs and Sahni, Kim, and Lu [30] review data structures for both static and dynamic LMPTs. Although several data structures have been proposed for dynamic LMPTs [35], [27], [28], [14], [15], [16], [17], these structures improve insert/delete complexity at the expense of lookup complexity. Eatherton et al. [1] have proposed the tree bitmap (TBM) data structure for dynamic LMPTs. Although this data structure results in fast lookups, inserts and deletes make an excessive number of memory accesses. The poor performance of TBM on insert and delete operations stems from the fact that, in a TBM, the children of a node are stored in contiguous memory using variable-size memory blocks. This has two detrimental consequences for insert and delete operations:
1) A complex memory management system is needed to allocate and deallocate variable-size memory blocks. In the specific design detailed in [1], 17 different block sizes are employed. The memory management system described by them requires over 3000 memory accesses (see footnote 2), in the worst case, to allocate a node.
2) The insertion/deletion of a prefix may change the number of children that a node has. This requires the deallocation of an old memory block and allocation of a new memory block.
Consequently, insertion and deletion require more than 3000 memory accesses each, in the worst case.
Footnote 2: Although [1] asserts that an allocation can be done with 1892 memory accesses, their analysis is flawed. In TBM, the memory is divided into 17 parts: 2-node blocks, 3-node blocks, 4-node blocks, ..., and 18-node blocks. The worst case for memory allocation occurs when an 18-node block is to be allocated and the only free space available is between the 2-node memory and the top of memory. To free space for an 18-node block, one needs to move 2 17-node blocks up; to free space for 2 17-node blocks, one needs to move 3 16-node blocks up; ...; and to free space for 36 3-node blocks, one needs to move 54 2-node blocks up. Hence, the amount of memory that needs to be passed, in the worst case, from end to end is not 34 nodes as claimed in [1].
In this paper, we propose an alternative tree bitmap scheme, which we call dynamic tree bitmap (DTBM). In DTBM, each node has an array of child pointers. This change greatly simplifies memory management and results in much faster inserts and deletes; lookup speed is not sacrificed. Although the DTBM uses more memory than does a TBM, the required memory is well within the capacity of commercially available SRAMs. We propose two alternatives, packed and lazy, to the basic (called free-mode) DTBM data structure. The lazy-mode alternative is expected to have very good performance in real-world applications. The DTBM structure is described in Section II and experimental results are presented in Section III.
II. DYNAMIC TREE BITMAP (DTBM)
A. The Data Structure
Figure 1(a) shows the one-bit trie representation of the prefix set R = {*, 01*, 1*}. For any node x in the one-bit trie, let prefix(x) be an encoding of the path from the root to x. Left branches add a 0 to the encoding and right branches add a 1. The encoding always ends in a *. So, prefix(root) = *, prefix(root.left) = 0*, prefix(root.right) = 1*, prefix(root.left.left) = 00*, and so on. In our figure, a node x is shaded iff prefix(x) ∈ R.
Suppose we extend the one-bit trie of Figure 1(a) to a full trie whose height is 2 (Figure 1(b)). This full trie may be compactly represented using a prefix bitmap [1] (also known as the internal bitmap (IBM)) that is obtained by placing a 1 in every shaded node and a 0 in every non-shaded node and then listing the node bits in level order. For the full trie of Figure 1(b), this level order traversal results in the IBM 1010100. Notice that the IBM of a full one-bit trie of height $h$ has exactly $2^{h+1} - 1$ bits, one bit for each node in the trie. Note also that if we number the bits of the IBM in level order beginning at 1, then we can simulate a walk through the corresponding full trie by using the formulae: left child of $i$ is $2i$, right child is $2i + 1$, and parent is $\lfloor i/2 \rfloor$ [29]. To obtain the next hop for the longest matching prefix, we supplement the IBM with a next hop array, nextHop, that stores the next hops stored in the nodes of the full trie. For a full trie whose height is $h$, the size of the next hop array is $2^{h+1} - 1$. For our example trie, the next-hop array has 7 entries, 4 of which are null.
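The construction of the IBM and the next-hop array can be sketched as follows. This is an illustrative sketch, assuming prefixes are given as bit strings ending in * and using hypothetical next-hop labels A, B, and C; the function name `build_ibm` is likewise an assumption. The level-order position of a prefix is obtained by starting at 1 and applying the child formulae above for each bit.

```python
def build_ibm(prefixes, h):
    """Build the IBM and next-hop array of the full one-bit trie of height h.

    `prefixes` maps a prefix such as '01*' (at most h bits) to its next hop.
    Both returned lists are indexed from 1; index 0 is unused.
    """
    size = 2 ** (h + 1) - 1
    ibm = [0] * (size + 1)
    next_hop = [None] * (size + 1)
    for prefix, hop in prefixes.items():
        i = 1                              # level-order index of the root
        for b in prefix.rstrip("*"):
            i = 2 * i + int(b)             # left child 2i, right child 2i + 1
        ibm[i] = 1
        next_hop[i] = hop
    return ibm, next_hop


# The prefix set R of Figure 1 with hypothetical next hops.
R = {"*": "A", "01*": "B", "1*": "C"}
ibm, next_hop = build_ibm(R, 2)
ibm_string = "".join(str(bit) for bit in ibm[1:])
```

For the prefix set R of Figure 1 and height 2, `ibm_string` is 1010100, in agreement with the IBM given in the text.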
One way to find the next hop associated with the longest matching prefix is to simulate a top-to-bottom walk beginning at the trie root and using the bits in the destination address. The simulated walk moves through the bits of the IBM keeping track of the last 1 that is seen. If the last 1 was in position i of the IBM, we retrieve the next hop from position i of the next-hop array.
An alternative strategy is to start from the appropriate leaf of the trie and move up toward the root, stopping at the first 1 that is encountered in the IBM. In the alternative strategy, we start at bit $2^h + d.bits(h)$ of the IBM and simulate a walk up the trie by dividing the current bit index by 2 for each move to a parent. The simulated walk terminates when we either reach a bit whose value is 1 or when the bit index becomes 0. Figure 2 gives the algorithm that uses this alternative strategy to find the length of the longest matching prefix for d. IBM(i) returns the ith bit of the IBM (note that bits are numbered beginning at 1).
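The leaf-to-root strategy can be sketched in a few lines. This is a minimal sketch written from the description above rather than a transcription of Figure 2; the function name `lmp_length`, the list representation of the IBM, and the use of a bit string for d are assumptions.

```python
def lmp_length(ibm, d_bits: str, h: int) -> int:
    """Length of the longest matching prefix for d, or -1 if none matches.

    `ibm[i]` is the i-th IBM bit (index 0 unused); `d_bits` holds at least
    the first h bits of the destination address d.
    """
    i = 2 ** h + (int(d_bits[:h], 2) if h else 0)   # leaf reached by d
    length = h
    while i > 0 and ibm[i] == 0:
        i //= 2            # move to the parent
        length -= 1
    return length


# IBM 1010100 of Figure 1(b), stored from index 1.
example_ibm = [None, 1, 0, 1, 0, 1, 0, 0]
```

When the walk falls off the root (the index becomes 0), the returned length is -1, which signals that no prefix in the node matches.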
[Figure 2: algorithm to find the length of the longest matching prefix for d. Surviving fragment of the listing: {i /= 2; length--;} return length; }]

Since the height of the one-bit trie for an IPv4 (IPv6) prefix set may be as large as 32 (128), it isn't practical to represent the prefix set by the IBM and next-hop array for the corresponding full one-bit trie. Instead, we pick a stride $s$ and construct the stride-$s$ full extension (SFE) of the one-bit trie. The SFE is obtained by partitioning the one-bit trie into subtries whose height is $s - 1$, beginning at the root; partitions that have no descendants may have a height smaller than $s - 1$. Each partition is then expanded to a full trie whose height is $s - 1$. Figure 3 shows an example stride-3 SFE. In partition $x_4$, the root and its left child were also in the original one-bit trie for the prefix set. The remaining 4 nodes were added to complete the partition into a full trie whose height is 2.

Each partition of a stride-$s$ SFE is represented by a $(2^s - 1)$-bit IBM, a next-hop array whose size is $2^s - 1$, an array of child pointers, and a count of the number of non-null children partitions that this partition has. The array of child pointers can accommodate $2^s$ pointers to children partitions. Figure 4 shows the representation for the stride-3 SFE of Figure 3. Each node of this figure represents a stride-3 partition. The first field of each node is its IBM, the second is the count of children, and the next $2^s$ fields are the children pointers. The next-hop array isn't shown in the figure. Figure 4 defines our dynamic tree bitmap (DTBM) structure. Figure 5 shows the general format for a DTBM node.

B. Lookup
Figure 6 gives the lookup algorithm for a DTBM. As an example, consider searching the DTBM of Figure 4 using $d = 1110111_2$. The partition represented by node $x_1$ is searched using the first 2 bits (11) [...] 4 requires us to first append a 0 to d to obtain $10_2$. Next we search the IBM of $x_4$ beginning at position $2^2 + 2 = 6$. This search returns -1 and node is updated to null. We exit the do loop and return the nextHop[5] value of $x_3$.

C. Inserting a Prefix
Figure 7 gives the algorithm to insert a prefix p into a DTBM whose stride is $s$. In the while loop of lines 3 through 6, we walk down the DTBM one node at a time. Each such move consumes $s$ bits of p. This walk terminates when either p has fewer than $s$ bits left or the next DTBM node to move to doesn't exist. In the latter case, we continue to sample p $s$ bits at a time, adding nodes to the DTBM until we are left with fewer than $s$ bits in p. The remaining bits (fewer than $s$) of p are used to insert p into the last node Z encountered. If p has no remaining bits, the leftmost IBM bit of Z is set to 1 and Z.nextHop[1] is set to the next hop for p. Otherwise, bit $2^{p.length} + p.bits(p.length)$ of the IBM of Z is set to 1 and the corresponding nextHop entry is set to the next hop for p.

D. Deleting a Prefix
Figure 8 gives the algorithm to delete a prefix p from a DTBM. The algorithm first searches for the DTBM node that contains p. During this search, the last sequence of nodes with count field equal to 1 and IBM equal to 0, together with the index of the child pointer used to move from each node in this sequence, is stored in the array nodes. Whenever the search reaches a DTBM node with count larger than 1 or with a nonzero IBM field, this sequence is reset to null. In case p is found in the DTBM, it is removed from its DTBM node Z. If now the IBM of Z is 0 and Z has no children, then Z along with its contiguous ancestors with count equal to 1 and IBM equal to 0 are deleted (using deleteCleanup).
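The lookup, insert, and delete procedures described in Sections II-B through II-D can be sketched together as follows. This is a simplified sketch and not a transcription of Figures 6 through 8: the class and method names are hypothetical, prefixes and addresses are bit strings, the search inside a node uses the top-to-bottom walk, and the cleanup after a delete keeps an explicit stack of (parent, child index) pairs instead of the nodes array used by deleteCleanup.

```python
class Node:
    def __init__(self, s):
        self.ibm = [0] * (2 ** s)          # bits 1 .. 2^s - 1; index 0 unused
        self.next_hop = [None] * (2 ** s)  # parallel to ibm
        self.child = [None] * (2 ** s)     # 2^s child pointers
        self.count = 0                     # number of non-null children


class DTBM:
    def __init__(self, s):
        self.s, self.root = s, Node(s)

    def insert(self, prefix, hop):
        bits, node, s = prefix.rstrip("*"), self.root, self.s
        while len(bits) >= s:              # each move consumes s bits
            idx = int(bits[:s], 2)
            if node.child[idx] is None:
                node.child[idx] = Node(s)
                node.count += 1
            node, bits = node.child[idx], bits[s:]
        pos = 2 ** len(bits) + (int(bits, 2) if bits else 0)
        node.ibm[pos], node.next_hop[pos] = 1, hop

    def lookup(self, d):
        node, s, best = self.root, self.s, None
        while node is not None:
            i = 1
            if node.ibm[1]:
                best = node.next_hop[1]
            for b in d[:s - 1]:            # walk down inside this node
                i = 2 * i + int(b)
                if node.ibm[i]:
                    best = node.next_hop[i]
            if len(d) < s:
                break
            node, d = node.child[int(d[:s], 2)], d[s:]
        return best

    def delete(self, prefix):
        bits, node, s, path = prefix.rstrip("*"), self.root, self.s, []
        while len(bits) >= s:
            idx = int(bits[:s], 2)
            if node.child[idx] is None:
                return False               # prefix is not in the table
            path.append((node, idx))
            node, bits = node.child[idx], bits[s:]
        pos = 2 ** len(bits) + (int(bits, 2) if bits else 0)
        if not node.ibm[pos]:
            return False
        node.ibm[pos], node.next_hop[pos] = 0, None
        # Cleanup: remove empty, childless nodes bottom-up (never the root).
        while path and node.count == 0 and not any(node.ibm):
            parent, idx = path.pop()
            parent.child[idx] = None
            parent.count -= 1
            node = parent
        return True
```

Because every node carries a full array of child pointers, neither insert nor delete ever resizes or relocates a node, which is the property that distinguishes the DTBM from the TBM.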
E. Modes of Operation and Analysis
We propose three possible modes of operation for a DTBM: free, packed, and lazy. To analyze the performance of the DTBM data structure in each of these three modes, we assume the data structure will be implemented on a computer with a Cypress FullFlex Dual-Port SRAM. The memory capacity of this SRAM is 36 Mb (512K 72-bit words). The SRAM has two independent ports, both of which support read and write. A 72-bit word may be read/written via either port in 2 cycles (i.e., 8ns using a 250MHz clock). Multiple read/writes may, however, be pipelined and each additional word may be read/written in a single cycle (4ns). We assume that the CPU is sufficiently fast that the time required for a lookup, insert, and delete is dominated by the time to read/write data from/to the SRAM. Hence, we analyze the complexity of lookup, insert, and delete by counting the (worst-case) number of memory accesses. Note that in our SRAM, two words may be read/written in a single memory access. The reader should note that while the analysis and specific DTBM implementations of this section are very specific to the chosen SRAM, the concepts may be applied easily to any other SRAM.
F. Free Mode
In this mode, memory management is done using a single chain of free nodes; each node on this chain is of the same size. A node is allocated by removing the first node on this chain and a node is deallocated by adding the deallocated node to the front of this chain. All of memory is initialized to 0 and, because of the way our algorithms work, whenever a node is deallocated, all its children fields also are 0. Consequently, the algorithms work without explicitly setting child pointers to 0 each time a node is allocated. Because of the simple memory management scheme in use, node allocation and deallocation each require a single memory access, that is, 2 cycles.
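The free-chain scheme can be sketched as follows. This is an illustrative model only: nodes are identified by indices 1 through n, index 0 stands for the null pointer, and the class name `FreeChain` and its `link` array are assumptions; in the SRAM, the link would be kept in a field of the free node itself.

```python
class FreeChain:
    """Pool of equal-size nodes managed as a single chain of free nodes."""

    def __init__(self, n_nodes):
        # link[i] is the free node that follows node i; 0 ends the chain.
        self.link = [0] * (n_nodes + 1)
        for i in range(1, n_nodes):
            self.link[i] = i + 1
        self.head = 1 if n_nodes else 0

    def allocate(self):
        """Remove and return the first node on the chain."""
        node = self.head
        if node == 0:
            raise MemoryError("no free nodes")
        self.head = self.link[node]
        self.link[node] = 0
        return node

    def deallocate(self, node):
        """Add the node to the front of the chain."""
        self.link[node] = self.head
        self.head = node
```

Both operations touch only the head of the chain and one node, which is consistent with the single memory access stated above.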
The worst-case lookup time is 12 cycles (or 48ns). So, under our assumptions, we can perform almost 21 million lookups per second. The worst case for an insert is when the insert requires us to add a new node at each of the levels 2 through 5. So, an insert requires 18 cycles in the worst case. The worst case for a delete occurs when we delete a prefix that is in a level-5 node and the cleanup step for this deletion deallocates a node at each of the levels 2 through 5 (note that we reserve nodes 1 through 8 for level 1 and these nodes cannot be deallocated). So, a delete requires at most 9 memory accesses or 18 cycles. Please refer to [31] for details.
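The conversion from cycle counts to time and throughput used above is simple arithmetic and can be checked as follows. The sketch assumes the 250MHz clock of the SRAM described in Section II-E; the function names are hypothetical.

```python
CLOCK_HZ = 250_000_000  # 250MHz clock, i.e., 4ns per cycle


def cycles_to_ns(cycles: int) -> float:
    """Duration of `cycles` clock cycles in nanoseconds."""
    return cycles * 1e9 / CLOCK_HZ


def operations_per_second(cycles: int) -> float:
    """Number of back-to-back operations per second at `cycles` cycles each."""
    return CLOCK_HZ / cycles


lookup_ns = cycles_to_ns(12)              # worst-case free-mode lookup
lookup_rate = operations_per_second(12)
```

With 12 cycles per lookup, `lookup_ns` is 48 and `lookup_rate` is a little under 21 million, matching the figures quoted in the text.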
G. Packed Mode
Although lookups, inserts, and deletes are rather fast in the free mode, the free mode is wasteful of memory. This is because most of the nodes in a DTBM are leaves and a leaf may be detected by examining its count field (which is 0). Hence a leaf does not require children fields. These fields account for 16 of the 28 words that comprise a node. In an effort to improve the memory efficiency of the DTBM structure, we propose the packed mode in which several leaves may be packed into a single 28-word node. We refer to a 28-word node as a type A node.
For the packed mode, we consider a 466656 implementation. This implementation is motivated by the observation that, in practice, IPv4 rule tables have very few prefixes whose length is more than 24. Consequently, a 366666 DTBM is expected to have very few nodes at level 5. So, virtually all of the level-4 nodes are leaves. Further, very few leaves are expected at levels 2 and 3. In the 466656 design, we use type-B nodes at level 5 and type-C nodes at level 4.
In a 466656 DTBM, the root stride is 4, the stride at level 4 is 5, and the stride at every other level is 6. Now, the root of the DTBM requires a 15-bit IBM, 16 child pointers, and 15 next-hop fields. As was the case for our 366666 design, the DTBM root is represented by storing its IBM in a register and its next-hop fields in predetermined memory locations. The 16 child pointers are eliminated and instead, we designate type-A nodes numbered 1 through 16 as the (up to) 16 children of the root of the DTBM. The size of a type-C node is 8 words. A level-4 leaf is represented using a single type-C node. The first word of this node is used to store a 31-bit IBM, a 5-bit count field, an 18-bit pointer (which is null), and a 12-bit next-hop field. The next 6 words are used to store the remaining 30 next-hop fields. The last word is unused. A level-4 non-leaf uses 2 type-C nodes. The first is used as for a level-4 leaf except that the pointer field now points to the second type-C node. The second type-C node stores 32 18-bit child pointers. A type-B node requires 12 words and has all the fields of a type-A node other than the count field and children pointers. Notice that with 18-bit pointers, we can index only one-fourth as much memory as indexed in free mode. This is because a pointer in a type-A node needs to point to type-C nodes embedded within a type-A node.
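The layout of the first word of a type-C node can be illustrated by packing its four fields into a 72-bit word. The field widths are those given above; the order of the fields within the word, the left justification, and the function names are assumptions made for this sketch.

```python
WORD_BITS = 72
# (name, width in bits) for the first word of a type-C node; the order is assumed.
FIELDS = [("ibm", 31), ("count", 5), ("pointer", 18), ("next_hop", 12)]


def pack(values):
    """Pack the fields into one word, leaving the low-order spare bits at 0."""
    word, used = 0, 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | v
        used += width
    return word << (WORD_BITS - used)


def unpack(word):
    """Recover the fields from a packed word."""
    out, shift = {}, WORD_BITS
    for name, width in FIELDS:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


bits_used = sum(width for _, width in FIELDS)
```

The four fields occupy 66 of the 72 bits, so they fit in a single word with 6 bits to spare.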
Although 3 node types, each of a different size, are used, memory management is relatively simple (please refer to [31] for details). The available memory is partitioned into type A nodes. A type A node may be used as such (i.e., as a type A node) or may be used to house 3 type C nodes (CCC), leaving 4 words unused, or to house 2 type C nodes and 1 type B node (CCB). One memory access is needed to allocate a node of type A and up to 3 to allocate a node of type B or C. Deallocating a node of type B or C may make up to 5 accesses; a node of type A may be deallocated with a single memory access. The worst-case lookup time is 14 cycles. An insert requires 30 cycles (15 memory accesses) in the worst case. A deletion takes 24 memory accesses (48 cycles) in the worst case.
H. Lazy Mode
In this mode, prefix deletion is done by simply setting the appropriate bit of the IBM of the node that contains the deleted prefix to 0. The cleanup action performed by deleteCleanup is not done. Consequently, nodes are never deallocated. While it is easy to see that lazy-mode operation, through its elimination of the cleanup action, supports faster deletes, lazy-mode operation also speeds inserts as memory management is simplified. Besides a reduction in the worst-case time for an insert, average-case time also may be reduced as future inserts may reuse nodes that would, otherwise, have been deallocated during the cleanup phase of a delete. The lazy mode of operation, however, runs the risk of failing to make an insert that would have succeeded had we freed nodes as is done by deleteCleanup. To avoid this outcome, we propose rebuilding the data structure whenever we get "close" to running out of memory. The rebuild time, even for large router tables, is of the order of tens of milliseconds.
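A lazy-mode delete and a possible rebuild trigger can be sketched as follows. The text does not specify what "close" to running out of memory means, so the 5% reserve used in `needs_rebuild` is purely a hypothetical policy; the class and function names are also assumptions.

```python
class LazyNode:
    def __init__(self, s):
        self.ibm = [0] * (2 ** s)   # bits 1 .. 2^s - 1; index 0 unused


def lazy_delete(node, pos):
    """Clear IBM bit `pos`; no cleanup is done and the node is never freed.

    Returns True if the prefix was present in the node.
    """
    was_set = bool(node.ibm[pos])
    node.ibm[pos] = 0
    return was_set


def needs_rebuild(nodes_in_use, capacity, reserve=0.05):
    """Hypothetical policy: rebuild once fewer than `reserve` of the nodes are free."""
    return capacity - nodes_in_use < reserve * capacity
```

A delete under this scheme touches only the node that holds the prefix, which is why no memory management work is done on the delete path.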
We evaluate lazy-mode implementations of both 366666 and 466656 DTBMs. For 366666, we use nodes of type A for levels 1 through 4 and of type B for level 5. The root of the DTBM is stored as for the non-lazy implementation; type A nodes numbered 1 through 8 are used for the (up to) 8 children of the root of the DTBM. The root of a lazy-mode 466656 DTBM is represented as for its non-lazy counterpart and we designate type A nodes numbered 1 through 16 as the (up to) 16 children of the root of the DTBM. Type A nodes are used at levels 1, 2, and 3 and type C nodes at levels 4 and 5. Please refer to [31] for detail.
III. EXPERIMENTAL RESULTS
We compare the DTBM structure with two other structures, BaRT [18], [19] and TBM [1], that have been proposed for dynamic router tables. For BaRT, we use the versions 8888, 86558, and 844448 reported on in [18] for stand-alone mode and for TBM, we use the 13-4-4-4-4-3 software reference design of [1]. For our analysis, we assume that each memory access requires two cycles. Although the Cypress FullFlex Dual-Port SRAM is able to pipeline memory accesses so that all but the first access take one cycle each, none of the data structures we consider is able to use this feature and the number of cycles becomes twice the number of memory accesses. Table I gives the worst-case number of cycles needed for a lookup and update for the considered data structures. As can be seen, the DTBM variants have considerably better update characteristics; all are comparable on lookup cycles. Table I also gives the memory required by the different data structures for 6 publicly available router tables. These databases were obtained from [11]. The databases Paix1, Pb1, MaeWest, and Aads were obtained on Nov. 22, 2001, while Pb2 and Paix2 were obtained on Sep. 13, 2000. These 6 databases ranged in size from a low of 16,172 (Paix1) rules to a high of 85,987 (Paix2) rules. Although our DTBM scheme
