Abstract-With a rapid increase in data transmission link rates and an immense, continuous growth in Internet traffic, the demand for routers that perform Internet protocol packet forwarding at high speed and throughput is ever increasing. The key issue in router performance is the IP address lookup mechanism based on the longest prefix matching scheme. Earlier work on fast Internet Protocol Version 4 (IPv4) routing table lookup includes software mechanisms based on tree traversal or binary search methods, and hardware schemes based on content addressable memory (CAM), memory lookups, and CPU caching. These schemes depend on the memory access technology, which limits their performance. The paper presents an optimized combinational logic, based on binary decision diagrams (BDDs), for an efficient implementation of a fast address lookup scheme in reconfigurable hardware. The results show that the BDD hardware engine gives a throughput of up to 175.7 million lookups per second (Ml/s) for a large AADS routing 
1 This paper is based on earlier work presented at IEEE International Conference on Computer Communications and Networks, ICCCN2001, as referred in [32] .
prefix. With the advent of the optical medium for data transmission, link rates have rapidly increased from 10 Mb/s Ethernet to 40 Gb/s OC768c, and there is every possibility of line rates increasing well beyond that. The primary concern in the design of next-generation routers is to obtain the maximum possible packet throughput to meet the demand from high-speed transmission links. The continuous increase in the number of users on the Internet causes the creation of explicit routes for certain users and constrains the router from aggregating the routing table effectively. This results in the expansion of routing tables and, thus, of the search space of prefixes against which the destination address of each packet needs to be matched. Further, when the Internet Protocol Version 6 (IPv6) routing protocol is introduced, where the address length is 128 bits, the problem of routing millions of communication packets every second, based on longest prefix matching, becomes far more intricate. In these circumstances, where IP routing tables are expanding in both dimensions, i.e., address length and number of prefixes, the routing mechanisms developed should be capable of providing the throughput demanded by high-speed transmission links.
In this paper, we propose a reconfigurable hardware solution, using the well-received concept of binary decision diagrams (BDDs), that provides high-speed IP address lookup and enables a data throughput of 200 Gb/s (average packet size of 1000 bits) in present-day large routers [1]. BDDs are one of the biggest breakthroughs in CAD in the last decade. BDDs are a canonical and efficient way to represent and manipulate Boolean functions and have been successfully used in numerous CAD applications. Although the basic idea has been around for more than 30 years [2], it was Bryant who described a canonical BDD representation [3] and efficient implementation algorithms [4].
The rest of the paper is organized as follows. Section II presents an overview of the longest prefix matching problem. In Section III, we review the related work, followed by a brief overview of BDDs. Section IV gives the motivation for the proposed scheme, the details of the scheme, and the implementation issues. In Section V, we present the results obtained from the implementation and analyze the performance of the scheme. We also discuss in detail the routing table update scheme and the scalability of the scheme to IPv6. Finally, Section VI concludes the discussion.
II. LONGEST PREFIX MATCHING
The routing of communication packets in the IP domain is done on a next-hop basis, i.e., the router takes responsibility for sending an incoming packet only as far as the next hop. A sample routing table is given in Table I.
The length of the prefixes can vary from 0 to 32 bits. For an incoming packet, its destination address is compared with all the current prefixes in the routing table, and the next-hop port (NHP) associated with the longest matching prefix is determined to be the output port for the packet. In the example shown in Fig. 1, the longest matching prefix is considered to be the best match and the packet is routed to the output port associated with that particular prefix. In other words, routing based on longest prefix matching is equivalent to routing the packet to the nearest possible IP address. If none of the prefixes matches the destination IP address, the packet is sent to a default port, which is associated with a prefix of length zero.
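The longest prefix matching rule described above can be illustrated with a minimal sketch. The routing table below is hypothetical (it is not the table of Table I), prefixes and addresses are written as bit strings, and the linear scan is chosen for clarity rather than speed.

```python
def longest_prefix_match(table, address, default_port=0):
    """Return the NHP of the longest prefix in `table` that matches `address`.

    `table` maps prefix bit strings (e.g. "101") to next-hop ports;
    `address` is the destination address as a bit string.
    """
    best_len, best_port = -1, default_port
    for prefix, port in table.items():
        # A prefix matches when the address starts with its bits.
        if address.startswith(prefix) and len(prefix) > best_len:
            best_len, best_port = len(prefix), port
    return best_port


# Hypothetical 5-bit routing table used only for illustration.
ROUTES = {"1": 1, "10": 2, "1011": 3, "01": 2}
```

With this table, the address 10110 matches the prefixes 1, 10, and 1011, and the port of the longest one (1011) is selected; an address that matches no prefix falls back to the default port.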
The metrics generally taken into consideration while designing IP lookup algorithms are preprocessing time, storage requirements, lookup rate, and update time. Lookup rate is the most significant parameter that needs to be addressed. With the latest advancements in network technology, the communication speed is leaping from Ethernet at 10 Mb/s to fiber distributed-data interface (FDDI) at 100 Mb/s to gigabit Ethernet. With the OC192c line (line rate 10 Gb/s), 31.25 million packets (average size of 40 bytes) have to be processed each second, while for the OC768c line (line rate 40 Gb/s), the processing rate required is 125 million packets per second. The data throughput rates of various transmission links and the corresponding time budgets for packet processing in a network processor are shown in Table II. It is significant to note that, apart from the lookup and forwarding operation, the packet processing in a network processor includes various other functions such as protocol recognition and classification, segmentation and reassembly (SAR), queueing and access control, and quality-of-service (QoS). The time budget shown for packet processing includes the execution of all these functions, and the lookup operation is required to consume only a portion of that budget. Hence, the significance of designing a mechanism for high-speed IP address lookups cannot be overemphasized.
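The packet rates quoted above follow directly from the line rate and the average packet size. The short sketch below reproduces that arithmetic; the function names are illustrative, and the per-packet time budget it prints is simply the reciprocal of the packet rate under the 40-byte assumption, which may not match the entries of Table II exactly.

```python
def packets_per_second(line_rate_bps, packet_bytes=40):
    """Packets that must be processed each second at a given line rate."""
    return line_rate_bps / (packet_bytes * 8)


def time_budget_ns(line_rate_bps, packet_bytes=40):
    """Time available per packet, in nanoseconds, at a given line rate."""
    return 1e9 / packets_per_second(line_rate_bps, packet_bytes)


for name, rate in [("OC192c", 10e9), ("OC768c", 40e9)]:
    print(name, packets_per_second(rate) / 1e6, "Mpackets/s,",
          time_budget_ns(rate), "ns per packet")
```

For 40-byte packets this gives 31.25 million packets per second (32 ns per packet) at 10 Gb/s and 125 million packets per second (8 ns per packet) at 40 Gb/s.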
The large IPv4 routing tables known today typically contain around 50 000 prefixes, and a large IPv6 routing table is expected to contain around 500 000 prefixes. Consequently, the need for an enormous amount of data processing at very high speeds, based on longest prefix matching in a large database of prefixes, makes the problem more complicated. Further, the IP address, which is 32 bits long in Internet Protocol Version 4 (IPv4), will be 128 bits long when IPv6 is introduced, making the problem of IP forwarding even more complex.
III. RELATED RESEARCH
The IP address lookup schemes introduced so far can be broadly classified into two categories, viz., software and hardware approaches [5]. Several software solutions have been proposed as improvements on the classical binary trie traversal approach. One of the first was the prefix matching algorithm using path-compressed tries [6], based on the practical algorithm to retrieve information coded in alphanumeric (PATRICIA) trie introduced in 1968 [7]. The schemes subsequently proposed include various trie-based approaches [8]-[10] and binary search methods such as binary search on prefix lengths [11], [12] and multiway and multicolumn search [13]. Besides, other schemes based on prefix expansion [14], string matching [15], and level compression tries [16] have been proposed. The software address lookup schemes are mostly based on the tree traversal approach and, hence, perform according to the computing environment used for the implementation of the algorithm. The key elements that play a pivotal role in the performance of a software mechanism are the processor speed and the memory characteristics (capacity and access time) of the computing environment in which the algorithm is implemented. The best known algorithm for IP address lookups is the binary search on prefix lengths, with the complexity of the lookup operation being logarithmic in the prefix length (W), i.e., $O(\log W)$, independent of the number of prefixes (N). Even in this case, the lookup operation involves five memory accesses in the worst case for IPv4 and, thus, is limited by the capacity of the computing environment. The scheme with $O(\log W)$ lookup complexity gives a throughput of up to 10 million lookups per second (Ml/s) when implemented in a computing environment consisting of a Pentium-Pro-based computer with a clock speed of 200 MHz and a 512 kB L2 cache memory [5]. The throughput, even when scaled for faster processors and larger memory capacity, will not be able to meet present-day requirements in packet processing rates.
Apart from the above software schemes, attempts have been made to design hardware mechanisms for prefix matching to enable high-speed routing. Various hardware schemes such as content addressable memories (CAMs) [17], memory lookup-based schemes [18]-[20], CPU caching [21], and circuit logic implementation in field programmable gate arrays (FPGAs) [22] have been proposed. Recently, Pao et al. proposed a hardware architecture [23] implemented by partitioning the binary trie into multiple levels, and Taylor et al. proposed a reconfigurable-device-based fast Internet protocol lookup (FIPL) engine [24] for high-speed routing.
A CAM can search all of its entries in parallel given a destination IP address. The scheme based on CAMs uses a separate CAM for each possible prefix length and, hence, requires 32 CAMs for IPv4 and 128 CAMs for IPv6, resulting in an expensive solution. Besides, CAM might not be able to keep pace with fast-developing high-speed networks, as it depends on and is limited by the IC process technology. Memory lookup schemes are based on SRAM indirect indexing and, hence, require an additional ASIC to cooperate with the memory. The basic scheme in [18] uses a two-level multibit trie with fixed strides of 24 bits and 8 bits for the first and second levels, respectively. This scheme is based on the important observation that, in a typical backbone router, most of the prefixes are of length 24 bits or less. Thus, a prefix expansion methodology is used wherein all the prefixes with length less than 24 bits are expanded accordingly. The memory access speed might not be able to cope with the advent of new optical link rates and, hence, limits the performance of such schemes. The architecture with hardware indexing implementation [23] of the binary trie promises a high throughput, but requires a large memory for the implementation. Consequently, the practicality of the scheme for next-generation Internet routing with the IPv6 protocol remains to be seen. The FIPL engine [24] gives an average throughput of 10 Ml/s using eight FIPL engines for routing in the MAE-West router with 16 564 prefixes. Besides, the scheme does not discuss its scalability to future trends in Internet routing. In this paper, we show a higher throughput of up to 168.6 Ml/s for the MAE-West router with a larger number of prefixes (29 487), with the added advantage that our scheme is easily scalable for rapidly expanding routing tables. In the past, caching has not worked well in backbone routers because of the need to cache full addresses. This potentially dilutes the cache with hundreds of addresses that map to the same prefix. Besides, typical backbone routers may expect to have hundreds of thousands of flows to different addresses. The Wilder study [25] reports up to 240 000 concurrent flows with fewer than 20 packets per flow. Short web transfers are a likely reason for this behavior. Some studies have shown cache hit ratios of around 50%-70% [26]. Thus, caching can help, but does not avoid the need for fast lookups. Most importantly, the above schemes become impractical with the advent of IPv6 due to the requirement of larger storage capacity.
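The two-level, fixed-stride memory lookup with prefix expansion described above can be sketched in software as follows. This is a simplified illustration and not the hardware design of [18]: the class name, the use of dictionaries in place of SRAM banks, and the fallback from an empty second-level entry to the first-level entry are all assumptions made for brevity. Prefixes and addresses are bit strings, and both strides are assumed to be positive.

```python
class TwoLevelTable:
    """Two-level multibit trie with fixed strides (24 and 8 bits by default)."""

    def __init__(self, first_bits=24, second_bits=8, default_port=0):
        self.first_bits = first_bits
        self.second_bits = second_bits
        self.default_port = default_port
        self.level1 = {}  # first-level index -> (port, prefix length)
        self.level2 = {}  # first-level index -> list of (port, prefix length)

    def insert(self, prefix, port):
        n = len(prefix)
        if n <= self.first_bits:
            # Prefix expansion: fill every first-level entry covered by the prefix,
            # without overwriting entries that came from a longer prefix.
            free = self.first_bits - n
            base = int(prefix, 2) << free if prefix else 0
            for idx in range(base, base + (1 << free)):
                if self.level1.get(idx, (None, -1))[1] <= n:
                    self.level1[idx] = (port, n)
        else:
            # Longer prefixes go to a second-level block selected by the first stride.
            idx = int(prefix[:self.first_bits], 2)
            free = self.first_bits + self.second_bits - n
            empty = [(None, -1)] * (1 << self.second_bits)
            block = self.level2.setdefault(idx, empty)
            base = int(prefix[self.first_bits:], 2) << free
            for j in range(base, base + (1 << free)):
                if block[j][1] <= n:
                    block[j] = (port, n)

    def lookup(self, address):
        idx = int(address[:self.first_bits], 2)
        block = self.level2.get(idx)
        if block is not None:
            port, _ = block[int(address[self.first_bits:], 2)]
            if port is not None:
                return port
        port, _ = self.level1.get(idx, (self.default_port, 0))
        return port
```

For experimentation, smaller strides keep the expanded tables tiny, e.g. `TwoLevelTable(4, 4)` for 8-bit addresses. The sketch also makes the storage cost of prefix expansion visible: a short prefix occupies one first-level entry for every index it covers.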
A. BDDs
As is well known, a Boolean function B can be represented by a BDD, a directed acyclic graph obtained by applying an ordering constraint over the input variables and reduction operations on a binary decision tree, as proposed by Bryant [3]. For example, the binary decision tree and the diagram for an example function are as shown in Fig. 2. Furthermore, the complexity of the BDD depends on its size, measured in the number of nodes. Hence, one of the main research focuses has long been to reduce the number of nodes in the BDD representation. Reduction operations consist of eliminating redundant nodes from the binary tree. A node can be eliminated if: 1) both of its child nodes are equivalent, which means that the binary logic extracted from both nodes leads to the same output, or 2) there exists another node at the same level in the decision tree with equivalent high and low child nodes, respectively.
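The two reduction rules can be seen at work in a small sketch that builds a reduced ordered BDD by exhaustive Shannon expansion. It is meant only as an illustration under stated assumptions: the function names and the example function are hypothetical (the function of Fig. 2 is not reproduced here), and the construction enumerates all input assignments, so it is practical only for a handful of variables, unlike the algorithms of [4].

```python
def build_robdd(func, num_vars):
    """Build a reduced ordered BDD for `func` with variable order x0 < x1 < ...

    Node ids 0 and 1 are the terminals. Returns (root, nodes), where
    nodes[node_id] = (variable index, low child, high child).
    """
    unique = {}  # (var, low, high) -> node id, so equal nodes are shared (rule 2)
    nodes = {}

    def make(var, low, high):
        if low == high:  # rule 1: both children equivalent, so skip the test
            return low
        key = (var, low, high)
        if key not in unique:
            unique[key] = len(unique) + 2
            nodes[unique[key]] = key
        return unique[key]

    def expand(var, assignment):
        if var == num_vars:
            return int(bool(func(*assignment)))
        low = expand(var + 1, assignment + (0,))
        high = expand(var + 1, assignment + (1,))
        return make(var, low, high)

    return expand(0, ()), nodes


def evaluate(root, nodes, assignment):
    """Follow low/high edges from the root until a terminal is reached."""
    node = root
    while node > 1:
        var, low, high = nodes[node]
        node = high if assignment[var] else low
    return node


# Hypothetical example function of three variables.
def f(x0, x1, x2):
    return (x0 and x1) or x2


root, nodes = build_robdd(f, 3)
print(len(nodes), "decision nodes instead of 7 in the full tree")
```

For this example the full binary decision tree has seven decision nodes, whereas the reduced diagram needs only three, one per variable.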
IV. BDD BASED IP ADDRESS LOOKUPS

A. Motivation
The proposed scheme is motivated by two observations. The first is that, even at the largest Network Access Point, the number of NHPs is generally not greater than 256. Hence, an NHP associated with any prefix in the routing table can be encoded using an 8-bit binary code. For example, any NHP in the MAE-East [1] routing table can be safely represented by a 6-bit binary code. Every bit of the output port can be computed by a combinational logic circuit whose optimal minimization is obtained with the help of BDDs. The second observation is that the number of effective nodes, defined as the minimal number of nodes required to construct a binary decision tree that covers all the prefixes in the routing table, is significantly smaller than the upper bound on the theoretically required number of nodes. It is shown in the following section that, for the 32-bit IP address with the biggest available routing table of MAE-East [1], more than 99.99% of the nodes are redundant. Thus, constructing the binary decision tree with fewer nodes and without any redundant nodes makes the application of BDDs to optimize the logic very attractive. Besides, it is shown in the next sections that, while the upper bound on nodes increases exponentially with the IP address size, the number of effective nodes does not, which is an advantage in view of the future implementation of IPv6 with its 128-bit IP address.
B. Details of the Scheme
For further understanding, consider the routing table given in Table I. The binary decision tree representation for the routing table is shown in Fig. 3. A node is assigned the associated NHP if the path taken from the root node to that node forms a valid prefix. Note that the root node is assigned a default output port (in this case zero), as the length of the prefix at that node is zero. However, as mentioned earlier, a partial construction of the binary decision tree is sufficient to cover all the prefixes in the routing table, resulting in only eight effective nodes. The redundant nodes, which can be conveniently ignored in the binary decision tree representation, are shown in dotted lines. Now, the number of distinct next-hop ports in this case being four, each NHP is encoded with a 2-bit binary code consisting of a most significant bit (MSB) and a least significant bit (LSB). When the ports are identified with the binary code, the binary decision tree representations for the MSB and LSB are as shown in Fig. 4. It can be observed that a further reduction in the number of effective nodes is obtained, the process of which is explained in detail in the subsequent subsection. Note that any effective node without an output bit assigned to it inherits the output of its parent node.
For the sake of convenience, let us assume that the n-bit IP address is represented by n binary variables, the first of which represents the MSB of the IP address. Now, applying the BDD algorithm to the binary decision trees of each bit of the output port, the BDDs for the output-bit functions are obtained as shown in Fig. 5.
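To make concrete the idea that each bit of the encoded NHP is a Boolean function of the address bits, the following sketch tabulates those functions for a small table. The 3-bit routing table and the helper names are hypothetical and are not taken from Table I; the exhaustive enumeration is for illustration only.

```python
from itertools import product


def next_hop(table, address, default_port=0):
    """Longest-prefix-match NHP for an address given as a bit string."""
    matches = [p for p in table if address.startswith(p)]
    return table[max(matches, key=len)] if matches else default_port


def output_bit_truth_table(table, address_bits, bit):
    """Truth table of one bit of the encoded NHP over all possible addresses."""
    rows = {}
    for bits in product("01", repeat=address_bits):
        address = "".join(bits)
        rows[address] = (next_hop(table, address) >> bit) & 1
    return rows


# Hypothetical 3-bit routing table with four ports (2-bit NHP code).
ROUTES = {"1": 1, "10": 2, "101": 3}

msb = output_bit_truth_table(ROUTES, 3, bit=1)
lsb = output_bit_truth_table(ROUTES, 3, bit=0)
print("MSB is 1 for:", sorted(a for a, v in msb.items() if v))
print("LSB is 1 for:", sorted(a for a, v in lsb.items() if v))
```

Each truth table defines one single-output Boolean function, which is the form of input that a BDD package can then minimize independently for every output bit.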
C. Reducing Effective Nodes
When an output port is assigned to each node of the binary decision tree, further analysis shows that the binary encoding of the output ports gives additional scope for reducing the number of nodes. For example, consider a situation where two leaf nodes with a common parent are assigned output ports 3 and 11, respectively. Suppose the parent node is assigned output port 2. When the next-hop port is encoded with a 4-bit binary code, it can be observed, as shown in Fig. 6, that a child node with the same output bit as its parent becomes redundant. The redundant nodes are shown in dotted lines in the figure.
With the above procedure, it is shown that in each output-bit representation, for the biggest available routing table of MAE-East [1] with 32-bit IP addresses, an additional 36% reduction is obtained in the number of effective nodes. This significant reduction in the number of effective nodes makes the application of the BDD approach for obtaining the optimized logic even more effective.
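The example with ports 3 and 11 under a parent with port 2 can be checked bit by bit with a few lines of code. The helper below is an illustrative sketch (its name and return format are assumptions); it reports, for each output bit, which children carry the same bit value as the parent and are therefore redundant in that bit's tree.

```python
def redundant_children(parent_port, child_ports, code_bits):
    """Map each output bit (MSB first) to the children whose bit equals the parent's."""
    result = {}
    for bit in range(code_bits - 1, -1, -1):
        parent_bit = (parent_port >> bit) & 1
        result[bit] = [c for c in child_ports if ((c >> bit) & 1) == parent_bit]
    return result


# Parent port 2 (0010), leaf children 3 (0011) and 11 (1011), 4-bit code.
print(redundant_children(2, [3, 11], 4))
```

In this example, both children are redundant in the trees for bits 2 and 1, only the child with port 3 is redundant in the tree for bit 3, and neither is redundant for bit 0, so five of the eight per-bit child nodes can be dropped.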
The number of effective nodes is obtained during the construction of the binary decision tree for a few sample routing tables with IP address lengths of 3, 5, 8, and 16 bits and for the real-time 32-bit IP MAE-East routing table. The prefix distribution of a real-time routing table available at [1] was used to generate prefixes along similar lines for the sample routing tables with IP address lengths of 3, 5, 8, and 16 bits. It is observed that the construction of the binary decision tree for the MAE-East routing table required only 91 925 effective nodes, which is negligible compared with the theoretical upper bound of more than eight billion nodes. Further, when the output port is encoded with a 6-bit binary code and the reduction procedure is applied to each of the trees for the individual binary output bits, the number of effective nodes obtained was only around 64% of the original number of effective nodes. The summary of results is shown in Table III.
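The comparison between effective nodes and the theoretical upper bound can be reproduced with a short sketch. It assumes that the effective nodes are the root plus every node lying on a path from the root to some prefix, and that the upper bound is the size of the complete binary tree of depth n, i.e. 2^(n+1) - 1, which is consistent with the "more than eight billion" figure quoted for 32-bit addresses. The function names and the small prefix list are hypothetical.

```python
def effective_nodes(prefixes):
    """Count the nodes of the partial binary tree that covers all prefixes."""
    nodes = {""}  # the root corresponds to the zero-length prefix
    for prefix in prefixes:
        for i in range(1, len(prefix) + 1):
            nodes.add(prefix[:i])
    return len(nodes)


def upper_bound(address_bits):
    """Number of nodes in the complete binary tree for n-bit addresses."""
    return 2 ** (address_bits + 1) - 1


print(effective_nodes(["1", "10", "1011", "01"]))  # hypothetical prefixes
print(upper_bound(32))                             # 8589934591
print(f"{1 - 91925 / upper_bound(32):.6%} of the nodes are redundant")
```

With the 91 925 effective nodes reported for MAE-East, the redundant share of the complete 32-bit tree exceeds 99.99%, in line with the figure given in the motivation.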
D. Implementation Issues
As mentioned earlier, the output interface ports at any router can be identified by at most an 8-bit binary code. Hence, for the proposed scheme, the combinational logic design has to be done for eight output bits, which gives eight BDDs to be processed. Each synthesized logic function can be mapped into one or more configurable logic blocks (CLBs) in an FPGA, as shown in Fig. 7. The processing of the BDDs is performed with the SIS package [27]. The combinational logic subsequently obtained is implemented in Verilog, and the logic synthesis is performed using the design analyzer tools from Synopsys [28].
1) Timing Optimization: Since the logic design obtained for the IP routing table is a combinational circuit, timing optimization can be achieved using pipelining and retiming techniques. Pipelining involves the insertion of delay elements at specific points of a circuit, and retiming is the process of moving delays around a circuit such that the overall computation is unaltered. Retiming aims to move a computation in an attempt to reduce the critical path, i.e., the path with the longest computation time without delays. By pipelining the computational data path, the throughput in terms of the number of address lookups per unit time can be increased with little or no additional cost in overall area and latency.
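The effect of pipelining on throughput and latency can be illustrated with a simple timing model. This is only a sketch: the stage delays are hypothetical numbers, the register overhead parameter is an assumption, and the model ignores routing and clock-skew effects that a real FPGA implementation would have to consider.

```python
def pipeline_metrics(stage_delays_ns, register_overhead_ns=0.0):
    """Clock period, latency, and throughput of a pipelined combinational path."""
    # The clock must accommodate the slowest stage (the critical path per stage).
    period = max(stage_delays_ns) + register_overhead_ns
    return {
        "clock_period_ns": period,
        "latency_ns": period * len(stage_delays_ns),
        "throughput_mlps": 1e3 / period,  # million lookups per second
    }


# Hypothetical 12 ns combinational path, unpipelined and split into three stages.
print(pipeline_metrics([12.0]))
print(pipeline_metrics([4.0, 4.0, 4.0]))
```

In this idealized model, splitting a 12 ns path into three balanced 4 ns stages triples the lookup rate while the latency stays at 12 ns; unbalanced stages or non-zero register overhead reduce the gain.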
V. RESULTS AND ANALYSIS
Recent research in logic optimization using BDDs [29], [30] has shown that logic implementation for a binary decision tree of more than 100 000 nodes can be completed in less than a second. It is therefore encouraging that the routing table, with only around 50 000 effective nodes on average, is implemented as a combinational logic optimized using BDDs. While the complexity metrics are important for assessing the feasibility of the implementation, it is equally important to measure the performance of the scheme for real-time routing tables. We measured the performance of our scheme with prefix databases from real-time snapshots of various routing tables [1]. The implementation of the routing mechanism is performed as discussed earlier. The lookup time is measured as the propagation delay between the input and output ports of the combinational logic. This is the same as the time taken for signal propagation along the critical path between the input and output ports of the logic. The critical path exists between one of the 32 input signals and one of the eight outputs. Thus, this measurement of propagation delay gives the worst case lookup time. The worst case lookup time and the corresponding packet throughput for the proposed scheme, for different routers, are shown in Table IV. A main advantage of the proposed scheme is that, with an n-bit binary encoding, $2^n$ output ports can be represented and, hence, increasing the binary code by one bit doubles the number of output ports that can be represented. Thus, the proposed scheme becomes more beneficial as the number of physical ports in a router continues to increase. Besides, when the routing table is implemented in an FPGA, we can conclude that the IP address lookup rate is bounded only by the CLB delay. The maximum clock period bound for processing the IP address lookup would be the sum of one CLB delay and the maximum net delay. Previous hardware schemes have the lookup rate bounded by the RAM access speed. Further, in this scheme the resources required are at most one FPGA, while the other schemes require an ASIC and a three- or four-bank RAM.
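The relations between worst case lookup time, lookup rate, data throughput, and port-code width used in this section amount to simple arithmetic, sketched below. The function names are illustrative, one lookup per worst case propagation delay is assumed, and the 5 ns delay is a hypothetical value rather than an entry of Table IV.

```python
def lookups_per_second(worst_case_delay_ns):
    """Lookup rate when one lookup completes per worst case propagation delay."""
    return 1e9 / worst_case_delay_ns


def data_rate_gbps(lookup_rate, avg_packet_bits=1000):
    """Data throughput sustained at a given lookup rate and average packet size."""
    return lookup_rate * avg_packet_bits / 1e9


def max_output_ports(code_bits):
    """Number of output ports representable with an n-bit binary code."""
    return 2 ** code_bits


rate = lookups_per_second(5.0)  # hypothetical 5 ns worst case delay
print(rate / 1e6, "Ml/s ->", data_rate_gbps(rate), "Gb/s")
print(max_output_ports(8), max_output_ports(9))
```

Under these assumptions, a 5 ns worst case delay corresponds to 200 Ml/s, or 200 Gb/s at an average packet size of 1000 bits, and each additional code bit doubles the number of addressable output ports.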
A. Routing Table Update
[Table III: Effective nodes for sample routing tables and the real-time 32-bit IP MAE-East routing table]
As discussed in the earlier sections, the routing table update time is one of the important metrics to be considered for a scheme addressing the IP address lookup problem. In earlier hardware schemes for IP address lookups, the update scheme is based on the assumption that redundant hardware resources are available, which can be a duplicate memory bank [18] or CAMs. While one unit is actively involved in the routing of packets, the redundant unit is used by the backbone router to update the routing table offline. The two units are switched alternately for routing in a periodic fashion. In our scheme too, we assume a similar mechanism. In this scheme, we show that when there is an update in the routing table, then in most cases not all of the logic blocks have to be recomputed, thus reducing the computational complexity. For example, if a new prefix with an associated next-hop port of 1 is inserted into the routing table shown in Table I, then the new routing table would be as shown in Table V.
The binary decision tree for the modified routing table would be as shown in Fig. 8(a). There is no change in the BDD representation for one of the output bits, while the slightly modified BDD representation for the other output bit is as shown in Fig. 8(b).
The update of the logic may not be as simple for the 32-bit IP MAE-East routing table, but it is also not as complex as generally assumed. To demonstrate this simplicity in updating the routing table, we have considered two consecutive snapshots of the MAE-East routing tables from [1], with 19 477 and 19 525 prefixes, respectively. The analysis for the updating of the table is done in terms of the number Table VI .
It is interesting to note that there is no variation, or only a very small one, in the output between the consecutive routing tables at the higher levels (levels 0 to 15) and the lower levels (levels 25 to 31). As is commonly known, the change in the logic is minimal when the changes at the higher levels of the binary decision tree are minimal, and we can observe that this is the case in the current scenario. Furthermore, it can be observed that the number of nodes in levels 16 to 24 that differ in their outputs is small. Based on these observations, it can be safely concluded that, when the routing table is updated, there would be only a partial change in the combinational logic for each output bit. Thus, only those logic segments that need the updated logic have to be reconfigured. The commercial availability of partially reconfigurable FPGAs makes this update scheme even more attractive, since only those CLBs that have a modified design need to be reconfigured, leaving the remaining CLBs unaltered.
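A per-level comparison of two routing table snapshots, of the kind discussed above, can be sketched as follows. The sketch assumes that the level of a node equals the length of its prefix and counts prefixes that were added, removed, or assigned a different NHP; the two small tables are hypothetical and the exact metric tabulated in Table VI may differ.

```python
from collections import Counter


def changed_nodes_per_level(old_table, new_table):
    """Count, per tree level, the prefixes whose NHP was added, removed, or changed."""
    changes = Counter()
    for prefix in old_table.keys() | new_table.keys():
        if old_table.get(prefix) != new_table.get(prefix):
            changes[len(prefix)] += 1
    return dict(changes)


# Two hypothetical consecutive snapshots (prefix bit string -> NHP).
OLD = {"1": 1, "10": 2, "1011": 3}
NEW = {"1": 1, "10": 3, "1011": 3, "0110": 1}

print(changed_nodes_per_level(OLD, NEW))
```

Levels that do not appear in the result are untouched by the update, which is the property that allows the corresponding logic segments to be left as they are during partial reconfiguration.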
B. Scalability to IPv6
To demonstrate that an optimized combinational logic can be obtained for a mapping between the 128-bit IP address of IPv6 and the binary encoded next-hop port, it is appropriate to show that the number of effective nodes that need to be processed by the BDD reduction techniques is significantly smaller than the theoretical upper bound. The performance analysis of our scheme in IPv4 was feasible since real-time 32-bit routing tables were readily available [1]. However, a similar analysis was not possible in IPv6 due to the nonavailability of a routing table and, hence, to start with, we had to construct a routing table in line with the specifications of the IPv6 protocol.
The IPv6 routing table is built by incorporating the best-effort unicast address format called aggregatable global [31]. This address format was designed to facilitate scalable Internet routing by providing an address hierarchy flow aggregation. The address format has a fixed structure, as shown in Fig. 9, and is organized into a three-level hierarchy: public topology, site topology, and interface identifier. The public topology consists of a two-level hierarchy of service providers with a top-level aggregation identifier (TLA ID) and a next-level aggregation identifier (NLA ID). The TLA ID is initially restricted to 13 bits, which translates to 8192 routers in the core IPv6 network. This was done to constrain core routing table sizes. The NLA ID is 24 bits long and allows for a flat or hierarchical allocation of the NLA address space. The site-level aggregation identifier (SLA ID) is 16 bits long. It is used by an individual organization to define its local address hierarchy and subnets. The routing table constructed using the described IPv6 address format constituted a large database of 400 000 prefixes with prefix lengths ranging from 0 to 128 bits. The number of output ports was 512 and, hence, a 9-bit binary code is used to encode the NHP. The database is built such that the prefix length distribution in the IPv6 routing table reflects the hierarchical topology of the aggregatable global unicast address. The number of effective nodes is obtained during the construction of the binary decision tree for the IPv6 routing table with 128-bit addresses. It is observed that the construction of the binary decision tree for the IPv6 routing table required only 13.5 10 effective nodes, which is negligible compared with the theoretical upper bound of $6.8 \times 10^{38}$. 
Further, when the output port is encoded with the binary code and the reduction procedure is applied on each of the trees for individual binary output bits, the number of effective nodes obtained were only 7 10 , around 51% of the actual effective nodes. Thus, the significantly smaller number of nodes that need to be processed in the BDD solving of the logic shows that the scheme is scalable for the forthcoming IPv6. The implementation of the routing table and the measurement of the lookup time in IPv6 routing is reserved for our future study.
VI. CONCLUSION
With the advancements in communication link technologies, the IP address lookup is becoming a major bottleneck in router technologies. We propose a reconfigurable hardware solution, using the well-received concept of BDDs, that provides an efficient IP address lookup as well as a better scheme for updating the routing table. The argument for adopting BDD techniques to obtain an optimized combinational logic is supported by the fact that the number of effective nodes required to represent a 32-bit IP address routing table is significantly smaller than the theoretically required number of nodes. Besides, it has been shown that this number of effective nodes can be further reduced when the next-hop port is represented with a binary code and a tree representation is obtained for each of the output bits. The implementation of the routing scheme shows that the BDD hardware engine gives a throughput of up to 172.1 Ml/s for a large MAE-East routing table with 24 792 prefixes, a throughput of up to 168.6 Ml/s for an MAE-West routing table with 29 487 prefixes, and a throughput of up to 229.3 Ml/s for the Pacbell routing table with 6 822 prefixes. Thus, a data throughput of 200 Gb/s (with an average packet size of 1000 bits) can be obtained in a router implemented with the BDD-based hardware address lookup engine.
Having implemented the scheme and analyzed its performance in terms of lookup time and packet throughput in IPv4 routing, and having shown that the processing time for the logic optimization of IPv6 routing tables is well within limits owing to the much smaller number of effective nodes, the next step is to implement the scheme for IPv6 and measure the lookup time.
