Introduction
The wide spread of the World Wide Web and of multimedia applications has made the Internet grow tremendously in both size and traffic volume. A survey [1] shows that the number of hosts on the Internet has tripled within the last two years, and traffic is doubling every few months [10]. This growth has placed excessive strain on the Internet infrastructure, especially on routers. To relieve this performance bottleneck, faster links and high-performance routers are required. Advances in fiber-optic transmission technology can supply the bandwidth required of the transmission medium. Routers are therefore the key to Internet performance.
The fundamental tasks of a router are the routing process and the packet forwarding process. The routing process deals with network management, such as running the routing protocols and creating and maintaining the forwarding table. The packet forwarding process determines the next hop (the output interface) of a packet based on information contained in the forwarding table.
The IP address lookup is the single most time-consuming operation and typically defines the upper bound on the router's forwarding rate. Assuming an average packet size of 1000 bits, a 1 Gb/s router must route a million packets per second. Hence, one lookup must be completed within 1 µs. Moreover, the IP address lookup is complicated by the fact that entries in the forwarding table have variable lengths and that many entries may represent valid routes to the same destination. Thus, the IP address lookup has grown even more challenging in recent years. In this paper, we propose a fast and easily updatable IP address lookup scheme.
The rest of this paper is organized as follows. Section 2 describes previous hardware-implementable lookup schemes. Section 3 describes our lookup scheme. Section 4 presents the performance analysis, taking the logic delay into account. Section 5 states our conclusion.
This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).
Previous Work
We describe previous hardware-implementable schemes for IP lookup with respect to the size of the forwarding [...] 40,000 prefix entries is about 150 KB and small enough to fit in L2 cache (SRAM in a hardware implementation). The number of memory accesses is between 3 and 9. Incremental update is not easy.
Gupta [5] developed a novel lookup scheme that relies on large memories (DRAM). This scheme exploits the distribution of prefix lengths, in which very few prefixes are longer than 24 bits. The size of the forwarding table is fixed at 33 MB, and the number of memory accesses is between 1 and 2. Since the large memories make updates expensive, several update schemes were also proposed.
Huang [6] reduced the size of the forwarding table by exploiting the distribution of prefixes within a segment, using an indirect lookup mechanism with variable offset length. By employing a modified form of Degermark's compression scheme, the table size can be reduced further. The size of the forwarding table is about 450 KB, and the number of memory accesses is between 1 and 3. Incremental update is not easy, because each segment requires variable-sized memory and internal fragmentation occurs in memory on prefix insertion and deletion. Since prefix insertions and deletions may occur every few seconds, performance might be degraded severely by memory bandwidth contention.
Wang further reduced the forwarding table by employing the compression scheme and considering both the distribution of prefixes and the common prefix of the prefixes belonging to the same segment.
The Proposed Scheme
The proposed scheme is based on the indirect lookup mechanism, as shown in Fig. 1. The forwarding table is divided into the segment table (SEGT), the next hop / pointer table (NHPT), and the next hop table (NHT). To reduce the size of the forwarding table, the NHPT is replaced with the code word table (CWT) and the compressed NHPT (C-NHPT). Our scheme has two significant features. First, the CWT, the C-NHPT, and the NHT are allocated in fixed-size units. Fixed-size allocation reduces the update overhead and makes incremental update feasible, at some cost in memory requirement and number of memory accesses; we analyze these costs in Section 4. Second, the logic delay of the lookup procedure in a hardware implementation is reduced by using simple logic of smaller bit width. For example, the logic delay of an 8-bit adder is smaller than that of a 16-bit adder. Since most prefixes are stored in the NHPT, the key to reducing the table size is to compress the NHPT. By employing a modified form of Degermark's compression scheme (called the bitmap scheme), the NHPT is replaced with the CWT and the C-NHPT. The CWT contains 8-bit bitmap entries and 8-bit base entries, which are used to compute the index into the C-NHPT. The C-NHPT contains the compressed next hops / pointers. The original bitmap scheme reduces the table size significantly but increases the number of memory accesses by 2, whereas our bitmap scheme reduces the table size less than the original but increases the number of memory accesses by only 1.
Our bitmap scheme is similar to Huang's scheme, but we use an 8-bit bitmap, which yields some gain in logic delay.
We build the CWT and the C-NHPT using the sibling tree of the prefix ranges. A prefix can easily be converted to a prefix range, as in [7]. The sibling tree is based on the fact that, for any two distinct prefixes, either one is completely contained in the other or the two prefixes have no entries in common. The CWT and C-NHPT construction algorithm is shown in Fig. 2.
The sibling tree represents all prefixes in a segment. An example of the sibling tree is shown in Fig. 3. Each prefix is converted to a prefix range and stored in a node of the sibling tree. The node also contains a pointer to its child and a pointer to its sibling. For example, p1 = 143.248.0/17 is converted to the prefix range (0,127), and node N1 holds p1's information, the pointer to N3, and the pointer to N2. The root node covers the default routes. The sibling tree is useful for updating the CWT and the C-NHPT.
While traversing the sibling tree breadth-first, we obtain the range information that is used to construct the CWT and the C-NHPT. In the example of Fig. 3, after visiting the root node, R is empty because the ranges of N2 and N3 are excluded. After visiting N2, R = {(0,23,p0), (32,79,p0), (96,127,p0)}, obtained by excluding the ranges of N3 and N4. The rest of the nodes in T are processed breadth-first, and the procedure for constructing R is shown in Table 1.
Input: The set P = {p0, p1, ..., pn-1} of prefixes of a segment, sorted by prefix length.
Output: The CWT and the C-NHPT of the segment.
Step 1. Construct the sibling tree T of the prefix ranges for P.
For any pair of prefixes pi and pj, if pi contains pj, then pi is a parent node of pj. Otherwise, pi and pj are siblings. The node ni contains the range information (Si, Ei, pi).
Step 2. Traverse the sibling tree T in breadth-first order.
Let R be the range information of the segment. Mark the range of the visited node in R, excluding the ranges of its children.
Step 3. Obtain the bitmap information from R. Let the CWT be the bitmap information of R, represented in 256 bits. For each ri in R, the Si-th bit of the CWT is set to 1 and the C-NHPT stores the next hop of the pi corresponding to ri.
Step 4. Stop.
Figure 2. The CWT and the C-NHPT construction algorithm
The CWT and the C-NHPT of a segment are obtained from R. In the example of Fig. 3, the CWT has ten ones, at bits 0, 24, 32, 80, 91, 92, 96, 128, 208, and 224, and the rest of the CWT is set to 0. The C-NHPT is [...] The time complexity of the proposed algorithm is O(n^2), where n denotes the number of prefixes in a segment, since constructing the sibling tree takes O(n^2) time. An advantage of the sibling tree is that it allows the CWT and the C-NHPT to be updated partially. In the example of Fig. 3, when p7 = 143.[...] the range of N2 is changed. We update the bits of the CWT corresponding to N2 and insert p7 into the C-NHPT.
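As a rough illustration of how range information turns into a 256-bit CWT and a C-NHPT, the following Python sketch may help. It is not the sibling-tree algorithm of Fig. 2: it swaps in a simpler technique that paints prefix ranges from widest to narrowest over a 256-slot array, which relies on the same property (ranges are either nested or disjoint) and should give the same marked positions. The function name build_cwt, the prefix labels, and the example ranges are hypothetical and are not taken from Fig. 3.

```python
def build_cwt(prefix_ranges, default):
    """Return (bits, c_nhpt) for one segment.

    prefix_ranges: list of (start, end, label) with 0 <= start <= end <= 255;
                   ranges are assumed to be nested or disjoint.
    default:       label used where no prefix of the segment applies
                   (the role played by the root node in the text).
    """
    owner = [default] * 256
    # Paint wide ranges first so that narrower (longer) prefixes overwrite them.
    for start, end, label in sorted(prefix_ranges,
                                    key=lambda r: r[1] - r[0], reverse=True):
        for i in range(start, end + 1):
            owner[i] = label

    bits, c_nhpt = [0] * 256, []
    for i in range(256):
        # A new run starts wherever the owning prefix changes.
        if i == 0 or owner[i] != owner[i - 1]:
            bits[i] = 1
            c_nhpt.append(owner[i])
    return bits, c_nhpt


# Hypothetical segment: one /17-like range with two more specific ranges inside.
ranges = [(0, 127, "p1"), (24, 31, "p2"), (80, 95, "p3")]
bits, c_nhpt = build_cwt(ranges, default="p0")
print([i for i, b in enumerate(bits) if b])  # [0, 24, 32, 80, 96, 128]
print(c_nhpt)                                # ['p1', 'p2', 'p1', 'p3', 'p1', 'p0']
```

The sketch rebuilds the whole segment on every call, so it does not show the partial update that the sibling tree makes possible.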
The CWT is encoded as a sequence of bitmaps and bases, each 1 byte wide. The 256-bit stream of the CWT is partitioned into a sequence of 8-bit streams, which are placed into the bitmaps sequentially. The base of an entry equals the number of ones accumulated in the bitmaps of the previous entries. Figure 4 shows part of the CWT and the C-NHPT for the prefixes in Fig. 3. For example, the first two entries of the CWT are (10000000, 0) and (00000000, 1), respectively. The base of the second entry is 1 since one bit is set in the bitmaps of the previous entries.
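The encoding just described can be sketched in a few lines of Python. This is a minimal sketch under the assumption that bit 0 of each 8-bit stream is the leftmost (most significant) bit of the bitmap, as the entry (10000000, 0) in the text suggests; the function name encode_cwt is hypothetical. The set-bit positions are the ten positions quoted for the example of Fig. 3.

```python
def encode_cwt(bits):
    """Pack a 256-element 0/1 list into 32 (bitmap, base) byte pairs."""
    assert len(bits) == 256
    entries, base = [], 0
    for k in range(0, 256, 8):
        chunk = bits[k:k + 8]
        bitmap = 0
        for b in chunk:              # first bit of the chunk becomes the MSB
            bitmap = (bitmap << 1) | b
        entries.append((bitmap, base))
        base += sum(chunk)           # ones accumulated in previous entries
    return entries


ones = [0, 24, 32, 80, 91, 92, 96, 128, 208, 224]
bits = [1 if i in ones else 0 for i in range(256)]
entries = encode_cwt(bits)
print(format(entries[0][0], "08b"), entries[0][1])  # 10000000 0
print(format(entries[1][0], "08b"), entries[1][1])  # 00000000 1
```

The base never exceeds 248 for the last entry, so it fits in the single byte the text allots to it.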
Examples
Consider the following examples of how lookups are performed on the table in Fig. 5. The second and the third prefixes are stored in the CWT and the C-NHPT. The routing information of the fourth prefix is stored in the NHT, and the pointer to the NHT is contained in the CWT and the C-NHPT. The decoding result in the CWT is 4, and the fourth entry of the C-NHPT indicates that the NHT must be referenced. DA[24:31] is used as an index into the NHT and returns the next hop (p4).
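Table 2. Number of memory accesses and average delay of the table search (recoverable entries: Huang's scheme, 1-3 accesses, 19.5 ns; DIR-24-BASIC, 1-2 accesses)
Forwarding Table Updates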
Since each segment requires 64 bytes (2 bytes * 32 entries) in the CWT, the memory can be partitioned into a sequence of 64-byte blocks. If a prefix is inserted or deleted, only 64 bytes of the CWT are updated. The difficulty is that the CWT and the C-NHPT share the memory address stored in the SEGT, while the memory requirement of a segment in the C-NHPT is proportional to the number of prefixes. For example, if a segment requires 70 bytes in the C-NHPT, then two blocks (128 bytes) are allocated in both the CWT and the C-NHPT. That is, 64 bytes in the CWT are wasted. According to the simulation, most segments require one block, so the memory waste is not large.
The memory waste within a block is necessary for the memory management.
When a prefix is inserted, the memory to be rewritten in the worst case is eight blocks of the C-NHPT and one block of the CWT, that is, 576 bytes. If the row size of the SRAM is 2 bytes and the access time of the SRAM is 10 ns, it takes about 3 µs to write this part of the memory. The average update time is shorter: writing one block of the CWT and one block of the C-NHPT takes merely 640 ns.
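The block counts and update times quoted above follow from simple arithmetic, which the Python sketch below reproduces. The constants come from the text (64-byte blocks, 2-byte SRAM rows, 10 ns access time); the function names are hypothetical, and the model assumes one SRAM access per row written with no other overhead.

```python
import math

BLOCK_BYTES = 64   # 2 bytes * 32 CWT entries per segment
ROW_BYTES = 2      # SRAM row size quoted in the text
ACCESS_NS = 10     # SRAM access time quoted in the text


def blocks_needed(c_nhpt_bytes):
    """Fixed-size blocks allocated for a segment's C-NHPT (at least one)."""
    return max(1, math.ceil(c_nhpt_bytes / BLOCK_BYTES))


def write_time_ns(num_blocks):
    """Time to rewrite num_blocks blocks, one row per SRAM access."""
    return num_blocks * BLOCK_BYTES // ROW_BYTES * ACCESS_NS


print(blocks_needed(70))     # 2 blocks for a 70-byte segment
print(write_time_ns(8 + 1))  # 2880 ns, about 3 microseconds (worst case)
print(write_time_ns(1 + 1))  # 640 ns (one CWT block and one C-NHPT block)
```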
Hardware Implementation
The high-level hardware architecture of the proposed scheme is shown in Fig. 6. An entry of the SEGT consists of two parts: the width flag (1 bit) and the next hop / pointer (15 bits). The width flag indicates whether the width of an entry in the C-NHPT is 1 byte or 2 bytes. If the longest prefix in a segment is not longer than 24 bits, the C-NHPT stores next hops, which can be represented in 1 byte. Since there are few prefixes longer than 24 bits, the memory requirement of the C-NHPT is reduced. DA[0:15] is used as an index into the SEGT. If the corresponding entry of the SEGT holds a pointer, the memory address of the CWT entry is the value of bits 1 to 15 of the SEGT entry * 32 + DA[16:23]. The base address of a segment in the CWT and the C-NHPT is the value of bits 1 to 15 of the SEGT entry * 32. Since multiplication by 2^n is a shift, no multiplier is needed. After the CWT entry is passed through the map filter and the (8,4) counter, the offset address is obtained. If the width flag is 1, the offset address is multiplied by 2.
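A software analogue of the decoding path may make the roles of the map filter and the (8,4) counter easier to see. The Python sketch below rests on assumptions that the text does not state explicitly: the upper five bits of the 8-bit index select one of the 32 CWT entries, the lower three bits select a position in its bitmap, the map filter keeps the bitmap bits up to and including that position, and the counter counts the ones that remain. The result is returned as a 1-based ordinal, matching the worked example in which a decoding result of 4 selects the fourth C-NHPT entry. All names and the example data are hypothetical.

```python
def encode_cwt(one_positions):
    """Pack set-bit positions (0..255) into 32 (bitmap, base) entries."""
    entries, base = [], 0
    for k in range(32):
        bitmap = 0
        for pos in range(8):
            if 8 * k + pos in one_positions:
                bitmap |= 0x80 >> pos        # position 0 is the MSB
        entries.append((bitmap, base))
        base += bin(bitmap).count("1")
    return entries


def decode_ordinal(entries, index8):
    """Return the 1-based C-NHPT entry number for an 8-bit index."""
    entry, pos = index8 >> 3, index8 & 0x7   # assumed split of the index
    bitmap, base = entries[entry]
    mask = (0xFF << (7 - pos)) & 0xFF        # map filter: keep bits 0..pos
    return base + bin(bitmap & mask).count("1")   # counter plus base


# Hypothetical segment with runs starting at these positions.
entries = encode_cwt({0, 24, 32, 80, 96, 128})
c_nhpt = ["p1", "p2", "p1", "p3", "p1", "p0"]
print(c_nhpt[decode_ordinal(entries, 30) - 1])   # p2
print(c_nhpt[decode_ordinal(entries, 100) - 1])  # p1
```

Because each bitmap is only 8 bits wide, the count fits in 4 bits and the addition with the base is an 8-bit operation, which is the source of the logic-delay advantage the text claims over wider bitmaps.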
Performance Analysis
The performance of the lookup scheme is evaluated by the delay of the table search. Table 2 shows the number of memory accesses and the average delay of the table search for three schemes. To obtain the logic delay, we investigate these schemes at the logic level and simulate the total delay of the table search. In our simulation, we use HD 0.65 µm two-metal 3.3 V CMOS technology, for which the gate delay is 0.21 ns. 1,000,000 IP addresses are randomly selected from the IP address space that the routing table covers, and the lookups are simulated. Since the proposed scheme uses logic of smaller bit width than Huang's and its total number of gates is smaller, our scheme is slightly faster than Huang's. Whereas the delay of the whole logic in Huang's scheme is 14.7 ns, that of the proposed scheme is 7.8 ns. In the worst case, the total delays of the table search in Huang's scheme, the proposed scheme, and DIR-24-BASIC are 44.7 ns, 47.8 ns, and 100.2 ns, respectively. Table 3 shows the memory requirement and the support for incremental update of the three schemes. The forwarding table of Huang's scheme requires the smallest memory, but for table updates it needs dual-port memory or dual memory banks. This may degrade performance because of memory bandwidth contention. Our scheme requires more memory than Huang's scheme but supports incremental table update.
Conclusion
In this paper, we have proposed a fast and updatable lookup scheme that can be easily implemented in hardware. We have also presented a memory allocation policy that supports incremental update of the forwarding table. Memory is allocated in blocks of fixed size, and the management of the memory is straightforward. The proposed CWT and C-NHPT construction algorithm is optimized for updating the table: on prefix insertion and deletion, partial update of the CWT and the C-NHPT is possible. Notably, the forwarding table of our scheme is small enough to fit into SRAM while still supporting incremental update. We have investigated recent hardware-implementable schemes at the logic level and simulated the total lookup delay, taking the logic delay into account. Since our lookup scheme can be implemented with small-bit logic and 10 ns SRAM, the average delay per lookup is about 18 ns. That is, our scheme can achieve 55.56 × 10^6 routing lookups/s on average.
