Abstract-Internet protocol (IP) address lookup is one of the major performance bottlenecks in high-end routers. This paper presents an architecture for an IP address lookup engine based on programmable finite-state machines (FSMs). The IP address lookup problem can be translated into the implementation of a large FSM. Our hardware engine is then used to implement this FSM using a structured approach, in which the large FSM is broken down into a set of smaller FSMs which are then mapped into reconfigurable hardware blocks. The design of our hardware engine is based on a regular and well structured architecture, which is easy to scale. Our simulation results demonstrate that the FSM based architecture can easily scale to wire speed performance at OC-192 rates. Unlike previous approaches, the performance of our architecture is not constrained by memory bandwidth and is, therefore, in principle scalable with very large scale integration technology.
I. INTRODUCTION
Over the past several years, the Internet has witnessed remarkable growth both in terms of the number and the bandwidth requirements of applications. The high bandwidth need requires faster communication links and faster packet processing capability in routers. Internet protocol (IP) address lookup remains one of the major performance bottlenecks for faster packet processing in routers. The IP lookup engine must have the ability to process every packet at "wire speed," i.e., it must be able to forward minimum size IP packets at line rates. This would mean that the lookup engine must have the capability to process millions of packets per second at line rates of the order of gigabits per second.
The primary reason for the complexity of IP address lookup is that the lookup requires longest prefix match computation which arises due to classless interdomain routing (CIDR). CIDR is usually employed in the Internet to address the problem of IP address space inefficiency. Instead of classful address aggregation (based on Class A, B, and C type addresses), CIDR allows arbitrary aggregation of addresses at various points within the Internet and the address prefix bits common to all the IP addresses at an aggregation point is used to denote the aggregate. The address prefix used to represent an aggregate of networks can vary in length from 1 to 32 bits depending upon the aggregation. Each routing table entry is represented by a (address prefix, prefix length) pair. When an IP packet is received, a lookup is performed in the routing table to determine which of the prefixes match the destination IP address. If multiple prefixes match, then the output interface corresponding to the longest match is selected and the packet is forwarded on this interface.
In this paper, we denote an address prefix by a bit string of zeros and ones followed by *, up to a maximum length of 32 bits. For example, prefixes could be of the form 110* and 11*. If the destination IP address is 11011, it matches both entries. The one with the longest match, i.e., 110*, is selected.
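The longest prefix match rule can be illustrated with a minimal sketch. The table contents and interface names below are hypothetical, and the linear scan is only meant to show the matching rule, not an efficient lookup scheme.

```python
def longest_prefix_match(table, address_bits):
    """Return the (prefix, interface) entry with the longest matching prefix.

    table maps prefix bit strings (written without the trailing '*') to
    interface identifiers. Returns None when no prefix matches.
    """
    best = None
    for prefix, interface in table.items():
        if address_bits.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, interface)
    return best


# Hypothetical two-entry table mirroring the example in the text.
routes = {"110": "if_A", "11": "if_B"}
print(longest_prefix_match(routes, "11011"))
```

Both entries match the address 11011, and the sketch selects the entry for the longer prefix.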
IP address lookup has been an active area of research in the recent past. Several innovative solutions [1] - [10] have been proposed in the literature for faster IP lookup algorithms. An exhaustive survey of these techniques can be found in [11] . Most of these approaches organize the routing database of address prefixes in a clever manner so as to reduce the required memory accesses for faster search times. The performance is, however, still limited by memory bandwidth and the worst case performance cannot always be guaranteed. We, therefore, believe that despite several interesting solutions proposed by earlier researchers, we do not have a "truly" scalable architecture, i.e., an architecture that would scale with very large scale integration (VLSI) technology.
In this paper, we take a fundamentally different approach and present a new paradigm for the problem of address lookup. The primary contribution of the paper is an architecture for a lookup engine in the form of a finite-state machine (FSM) that can be implemented using reconfigurable hardware. The FSM is generated using the routing database. Backbone routers may have routing database with more than 50 000 prefixes. For such a large number of prefixes, the number of states in the corresponding FSM would be very large. In the design of such very large FSMs (VLFSMs), we adopt a structured approach wherein the size and performance of the implementation is predictable given the knowledge of the graph of the FSM. We achieve this using decomposition of FSM graphs. The large FSM is decomposed into several small FSMs to realize an efficient architecture. One of the advantages of such an architecture is that it is not constrained by memory bandwidth.
The rest of the paper is organized as follows. Section II briefly discusses various solutions to IP address lookup schemes that have been proposed in the past. Section III proposes our lookup engine architecture. We also illustrate the generation of the route lookup FSM. Section IV discusses how a VLFSM can be decomposed into small FSMs. We describe the results of decomposition of a VLFSM generated from the Mae-East routing database [12] and the database of the Finnish University and Research Network (FUNET). Most research works in this area have primarily tested their algorithms on these databases. Section V illustrates the simulation results of IP address lookup performance. Section VI briefly discusses the VLSI architecture. In Section VII, we highlight some issues related to scalability and area optimizations of our approach. We finally conclude the paper in Section VIII.
II. IP ADDRESS LOOKUP SCHEMES
The performance of an IP address lookup algorithm is characterized by two parameters. One is the lookup time, i.e., the time required to determine the output interface corresponding to a destination IP address. Since routing table entries may change due to route updates, the time required by an IP lookup algorithm to respond to the changes in the routing database is another parameter used to characterize the IP address lookup. This is termed as update time.
IP address lookup engines can be broadly classified into two categories: one based on content addressable memories (CAM) and the other processor-memory combination. Our scheme actually creates a third category, which is based on programmable FSMs.
A. CAM-Based Solutions
In this model, the address lookup can be performed using ternary CAM (TCAM) [13]. In a TCAM, a mask of bits can be specified per word. The routing table entries are stored in the order of decreasing prefix lengths. The longest prefix match, thus, corresponds to the first entry among all the entries that match the destination IP address. A TCAM is an attractive solution for high-speed IP address lookup; however, TCAMs with large sizes are typically very expensive. Historically, the CAM technology has also not kept pace with the dynamic random access memory (DRAM) technology in terms of storage density. TCAMs are also very poor in terms of update time, though some progress [14] has recently been made in this direction.
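A software emulation of this first-match behavior might look as follows. This is only a sketch: a real TCAM compares all words in parallel, whereas the loop below scans them in order, and the helper names `make_entry`, `tcam_lookup`, and `prefix_length` as well as the 8-bit address width are assumptions made for illustration.

```python
def make_entry(prefix_bits, interface, width=32):
    """Encode a prefix as a (value, mask, interface) ternary word."""
    length = len(prefix_bits)
    mask = ((1 << length) - 1) << (width - length)
    value = (int(prefix_bits, 2) << (width - length)) if length else 0
    return value, mask, interface


def tcam_lookup(entries, address):
    """Return the interface of the first matching entry, or None.

    entries must be ordered by decreasing prefix length so that the first
    match is also the longest match.
    """
    for value, mask, interface in entries:
        if (address & mask) == value:
            return interface
    return None


def prefix_length(entry):
    """Number of care bits in the entry's mask."""
    return bin(entry[1]).count("1")


# Hypothetical 8-bit example with two prefixes, 11* and 110*.
entries = sorted(
    [make_entry("11", "if_B", 8), make_entry("110", "if_A", 8)],
    key=prefix_length,
    reverse=True,
)
print(tcam_lookup(entries, 0b11011000))
```

Sorting by decreasing prefix length is what lets the first match stand in for the longest match.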
B. Processor-Memory Based Solutions
In this model, the routing table entries are present in memory and the lookup algorithm runs on a processor. The objective of an IP lookup algorithm is to organize the routing database in an intelligent manner such that during actual lookup operation as few memory accesses are required as possible.
For backbone routers with a large routing database, architectures that use off-chip DRAMs are usually employed. One measure of the lookup speed of an algorithm is the number of DRAM accesses that are required to be made. New memory technologies such as synchronous DRAM (SDRAM), RAMBUS, double data rate DRAM (DDR-DRAM) employ some form of parallel banks of memory and interleaving can be performed to hide memory access latency. As pointed out in [6] , each memory technology introduces some tradeoffs and IP lookup algorithms need to be carefully tuned across memory architectures to extract the best performance.
One of the simplest ways to store the routing database of address prefixes in memory is in the form of a 1-bit trie. A trie is a tree like data structure where the prefix bits are used to create tree branches. Several modifications to the basic 1-bit trie have been proposed in the literature. Path compression techniques [15] can be used to remove those nodes from the tree that have only one child. The missing nodes are denoted by a skip value that indicates how many nodes have been skipped on the one way path.
Instead of 1-bit tries, multibit tries [3] can also be used. Unlike in a 1-bit trie, where each node branches to its children depending upon the value of a binary bit, in multibit tries, the branching occurs depending upon the value of several bits taken together. The search also proceeds by inspecting several bits simultaneously. The number of bits to be examined is called the stride length. The strides can be of fixed length or of variable lengths at different levels of the tree. The address prefixes need to be converted into prefixes with lengths equal to the stride. The length of the strides offers a tradeoff between memory and search speed. The optimal strides can be computed using the prefix length distribution [3] .
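The conversion of prefixes to stride-aligned lengths can be sketched as below, assuming a single fixed stride. The function name `expand_prefixes` and the sample table are hypothetical; the collision rule (the expansion derived from the longer original prefix wins) is the usual way to preserve longest-match semantics and is stated here as an assumption rather than as the exact procedure of [3].

```python
from itertools import product


def expand_prefixes(table, stride):
    """Expand each prefix to the next multiple of `stride` bits.

    table maps prefix bit strings to interface ids. When two expansions
    collide, the one derived from the longer original prefix is kept.
    """
    expanded = {}  # expanded prefix -> (original length, interface)
    for prefix, interface in table.items():
        pad = (-len(prefix)) % stride
        for tail in product("01", repeat=pad):
            key = prefix + "".join(tail)
            if key not in expanded or expanded[key][0] < len(prefix):
                expanded[key] = (len(prefix), interface)
    return {key: iface for key, (_, iface) in expanded.items()}


# Hypothetical table expanded for a stride of 2 bits.
print(expand_prefixes({"1": "A", "10": "B", "110": "C"}, 2))
```

A larger stride shortens the search but multiplies the number of expanded entries, which is the memory versus speed tradeoff mentioned above.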
In LC tries [16], each complete subtree of height $h$ is converted into a subtree of height 1 with $2^h$ children. Thus, a 1-bit trie gets converted into a multibit trie. In [5], a multibit trie with fixed stride lengths is implemented using memory banks. By appropriate pipelining, the authors claim that lookup can be performed in one memory access. This is, however, achieved at the expense of large memory size.
Though the above algorithms have provided very novel techniques to arrange the prefixes in an intelligent manner, we believe that the scalability of processor-memory solutions is limited by the fact that the lookup operation requires DRAM accesses. Despite the considerable progress, the DRAM technology has not kept pace with the processor technology. We attempt to address this issue by proposing an architecture for lookup engine that is not constrained by memory bandwidth. The solution presented in [6] also belongs to processor-memory model but uses a very novel technique based on multibit tries and bit maps to compress the data structure so that it can be fit into a processor's cache. Since the accesses are now performed on cache, the lookup performance improves considerably. This is an attractive solution and we compare our results with this scheme later in the paper.
III. PROPOSED APPROACH

A. Lookup Engine Architecture
Our basic architecture for the lookup engine is shown in Fig. 1. The reconfigurable hardware shown in the figure performs the address lookup. Reconfigurable hardware is essentially a circuit whose behavior can be modified on the fly. The hardware implementation is in the form of a programmable FSM. The state transition table can be loaded onto it by the processor. The processor computes the FSM for a given routing database of address prefixes and then compiles it in a format appropriate for programming the reconfigurable hardware. If the routing database changes due to a routing update, the state machine is recomputed. In case of changes, either the entire FSM may have to be reprogrammed or changes to some part of the FSM graph may have to be made.
All the approaches discussed in Section II (except CAM-based solutions) require several memory accesses and, thus, the memory bandwidth is one of the major performance bottlenecks. The FSM-based architecture can be efficiently implemented using flip-flops (FF) and all the memory accesses can be reduced to accessing the high-speed registers. The implementation can, thus, scale with VLSI technology.
We now present ways to generate an efficient FSM for the routing database and evaluate the lookup speed of such an approach.
B. FSM for Lookup Engine
To illustrate our basic approach, we first consider generating an FSM from the 1-bit trie structure. Consider the 1-bit trie for the prefixes of Table I. The procedure for generating the 1-bit trie begins at the root node for each prefix. The bits in the prefix are examined one by one. If the bit is zero, then the left node is formed (if not already present); otherwise, if the bit is one, then the right node is formed. To generate an FSM, each node in the resulting 1-bit trie can be associated with a state in an FSM. The 1-bit trie and the corresponding FSM for the prefix database of Table I are illustrated in Fig. 2. We call this FSM a 1-bit FSM. The state transition table for this FSM is given in Table II. The state corresponding to an address prefix stores the corresponding output interface.
To perform a lookup, the destination IP address bits are applied in serial order and the state machine makes a transition from one state to another depending upon the bit. If a state representing a valid interface is encountered, the state number is stored. The IP address bits are applied until a node whose next state is FINAL is encountered. The search is terminated and the output interface number corresponding to the last stored state is retrieved. In the given example, if the destination IP address is 1100, the states that would be traversed are S1, S3, S5, and S8 and the output would be 2. In the worst case, 32 states might have to be traversed for IP lookup, but note that these are not memory accesses and, hence, can be quite fast.
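The construction and the serial lookup can be sketched together as follows. This is a software illustration under assumptions: the prefix table is hypothetical (it is not Table I), states are numbered in creation order rather than as in Fig. 2, and `FINAL` is modeled as a missing transition.

```python
FINAL = None  # sentinel: no transition exists on this input bit


def build_one_bit_fsm(prefixes):
    """Build a 1-bit FSM from {prefix_bits: interface}.

    Returns (next_state, output). next_state[s] is a two-element list indexed
    by the input bit, and output[s] is the interface stored at state s, or
    None when s does not correspond to a prefix. State 0 is the root.
    """
    next_state = [[FINAL, FINAL]]
    output = [None]
    for prefix, interface in prefixes.items():
        state = 0
        for bit in prefix:
            b = int(bit)
            if next_state[state][b] is FINAL:
                next_state.append([FINAL, FINAL])
                output.append(None)
                next_state[state][b] = len(next_state) - 1
            state = next_state[state][b]
        output[state] = interface
    return next_state, output


def lookup(next_state, output, address_bits):
    """Apply address bits serially and return the last valid interface seen."""
    state, best = 0, output[0]
    for bit in address_bits:
        state = next_state[state][int(bit)]
        if state is FINAL:
            break
        if output[state] is not None:
            best = output[state]
    return best


# Hypothetical prefix table: 0* -> 1, 11* -> 2, 110* -> 3.
fsm = build_one_bit_fsm({"0": 1, "11": 2, "110": 3})
print(lookup(*fsm, "1100"))
```

Each loop iteration corresponds to one state transition, so a lookup takes at most as many steps as there are address bits.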
For practical routing databases, the number of states present in the state machine would be large. We have calculated the number of states in the FSM generated for Mae-East, FUNET, and Ripe routing databases. The results are summarized in Table III . The large number of states may result in inefficient hardware implementation and higher delays. We, therefore, follow a structured approach where the FSM graph is partitioned into smaller machines each containing some maximum number of states, say 1024. These smaller machines are then connected together as shown in Fig. 3 to obtain the overall functionality of the FSM. The partitioning of FSM graph is done with a view to minimize the area of the chip and make the performance of the chip predictable. Each machine is made reconfigurable by introducing memory cells. When one machine completes the processing of a packet, the packet is handed over to an appropriate machine by the central block. We now investigate methods for decomposition of state machine into smaller state machines by exploiting the structure of FSM graphs.
IV. DECOMPOSITION OF THE FSM FOR IP LOOKUP ENGINE
We first illustrate through an example how a state machine can be decomposed into two interdependent state machines. Fig. 4 illustrates a simple FSM. The state transition table is given in Table IV. The state machine can be decomposed into two state machines A and B. The state transition tables of the decomposed machines are given in Tables VI and VII. This example has illustrated the concept of decomposition of a state machine. However, we could not achieve any reduction in the number of states for this particular example. We now investigate ways to efficiently decompose a large FSM generated from the 1-bit trie of address prefixes.
There are several standard structures like completely balanced trees, paths, etc., that can be efficiently decomposed into smaller state machines at the cost of slightly more complex transitions [18]. We have explored the decomposition of an FSM that exploits these structures and their applicability in the state machine of a routing database. Specifically, we have investigated the presence of these structures in the Mae-East and FUNET routing databases in order to ascertain the decomposition of the FSMs generated for these databases. We have observed from the results that the number of such structures is very small and, hence, an efficient decomposition that minimizes the number of states in the FSM cannot be obtained in this way. These results and observations have been discussed in [20]. Since the applicability of these results in an IP lookup engine is limited, we do not pursue them here. Instead, we attempt a two level hierarchical decomposition of the FSM. The first level is a topographical breakdown of the FSM, which is followed by an orthogonal partitioning of each resulting machine.
TABLE II STATE TRANSITION TABLE
TABLE III NUMBER OF STATES IN FSM
TABLE IV STATE TRANSITION TABLE FOR ORIGINAL FSM
TABLE VI STATE TRANSITION TABLE FOR FSM A
TABLE VII STATE TRANSITION TABLE FOR
A. Topographical Breakdown
In this section, we illustrate how the FSM can be decomposed into smaller FSMs using topographical breakdown. In topographical breakdown, the large FSM is decomposed into smaller FSMs such that the number of states in each smaller FSM does not exceed some upper limit. These smaller FSMs are interconnected to function as a 1-bit FSM. For example, Fig. 5 illustrates how a 1-bit FSM can be decomposed into smaller FSMs. In this case, the number of states in each small FSM is not more than three. Note also that each small FSM has only one incoming edge. The topographical breakdown may also be performed to optimize the number of smaller machines. The algorithm for topographical breakdown employed by us is explained in Algorithm 1. 
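A minimal sketch of one possible topographical breakdown is given below. It is not a reproduction of Algorithm 1: it simply grows each machine breadth-first from its root until the state limit is reached and starts a new machine at every state that did not fit. The function name and the sample tree are hypothetical, and no attempt is made to equalize machine depths or to minimize the number of machines.

```python
from collections import deque


def topographical_breakdown(children, root, max_states):
    """Split a tree-shaped FSM into machines of at most max_states states.

    children maps each state to a list of its child states. Each machine is a
    connected subtree grown breadth-first from its own root, so it has a
    single entry point. Returns a list of machines, each a list of states.
    """
    machines = []
    pending_roots = deque([root])
    while pending_roots:
        frontier = deque([pending_roots.popleft()])
        machine = []
        while frontier and len(machine) < max_states:
            state = frontier.popleft()
            machine.append(state)
            frontier.extend(children.get(state, []))
        # States left on the frontier did not fit; each starts a new machine.
        pending_roots.extend(frontier)
        machines.append(machine)
    return machines


# Hypothetical 7-state tree, at most three states per machine.
tree = {0: [1, 2], 1: [3, 4], 2: [5], 3: [6]}
print(topographical_breakdown(tree, 0, 3))
```

Because every leftover state is a child of a state already placed, each resulting machine is entered through exactly one edge, as in the example of Fig. 5.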
Note that the flow of control is always directed downwards in a 1-bit FSM, i.e., after visiting a node the control is transferred to its child node. Thus, the smaller FSMs are not interactive. After having traversed through one FSM, the control is transferred to its child FSM and the current FSM can start processing the next packet. These FSMs can work in a pipelined fashion and increase the throughput. The time spent in one FSM is dependent on its depth. To operate these smaller FSMs efficiently, all of them should ideally have the same depth so that the time spent in all FSMs is same.
In the rest of the paper, we refer to each of the smaller FSMs obtained by topographical breakdown as a machine. Each machine is now partitioned into two sub-FSMs using orthogonal decomposition. These sub-FSMs are referred to as partitions in the rest of this paper. Our approach is a factoring of the original machine (with $n$ states) into two partitions (each with $\sqrt{n}$ or more states), and this factoring can be viewed as the meet of the two orthogonal partitions of the set of states of the original machine. For example, a machine with 100 states may be decomposed into two orthogonal partitions each with ten states. Each state $s$ of the original machine then corresponds to a pair of blocks $(a, b)$, one from each partition, where $s$ is the unique state in $a \cap b$. It can be easily proved that the terminal behavior of the machine is identical to that of the partitions along with the combinational logic.
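The idea that a pair of blocks identifies a single state can be sketched with a simple row and column assignment. This is only an illustration of orthogonality under assumed names (`grid_partitions`, `is_orthogonal`); it ignores the transition structure entirely and is not the partitioning procedure used in the paper.

```python
import math


def grid_partitions(num_states):
    """Assign each state a (block in partition 1, block in partition 2) pair.

    States are laid out on a square grid whose side is the ceiling of the
    square root of num_states (num_states must be at least 1).
    """
    side = math.isqrt(num_states - 1) + 1
    return {s: (s // side, s % side) for s in range(num_states)}


def is_orthogonal(assignment):
    """True if no two states share both blocks.

    Equivalently, every intersection of a block from one partition with a
    block from the other contains at most one state.
    """
    pairs = list(assignment.values())
    return len(pairs) == len(set(pairs))


# A 100-state machine factored into two partitions of ten blocks each.
assignment = grid_partitions(100)
print(is_orthogonal(assignment), assignment[37])
```

With 100 states the sketch yields ten blocks in each partition, matching the example in the text.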
B. Orthogonal Partitioning
The orthogonally decomposed machine described above belongs to the General Decomposition category discussed in the classical literature [17]. The FSM graph of a decomposed machine may contain parallel edges. If all the parallel edges emanating from a state $u$ and terminating on a state $v$ are replaced by a single directed edge (called a multiedge) from $u$ to $v$, then we get a digraph $G = (V, E)$, where $V$ is the set of states and $E$ is the set of edges. This digraph may, however, contain self-loops.
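Collapsing parallel edges into multiedges can be sketched as follows for one partition, whose states are the blocks of that partition. The function name `multiedges`, the transition format, and the sample machine are assumptions made for illustration.

```python
def multiedges(transitions, block_of):
    """Collapse parallel edges of a partition's graph into multiedges.

    transitions is an iterable of (source_state, input_symbol, target_state)
    triples of the original machine, and block_of maps each state to its
    block in the partition. Returns the set of (block, block) multiedges;
    pairs with equal endpoints are self-loops.
    """
    return {(block_of[src], block_of[dst]) for src, _, dst in transitions}


# Hypothetical 4-state machine and a partition into blocks A and B.
transitions = [(0, "0", 1), (0, "1", 1), (1, "0", 2), (2, "1", 3), (3, "0", 0)]
block_of = {0: "A", 1: "A", 2: "B", 3: "B"}
edges = multiedges(transitions, block_of)
print(len(edges), sorted(edges))
```

In this example the two parallel transitions from state 0 to state 1 fall inside block A and collapse into a single self-loop.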
We would like to decompose each machine into two orthogonal partitions such that the number of multiedges in the digraphs corresponding to the partitioned machines is minimized. It has been shown in the previous work of one of the authors of this paper (Desai) [18] that this reduces the area of the chip, as the number of multiedges has a direct correlation with the area of the chip. It is also shown in [18] that this decomposition helps in reducing the delays as well. The Greedy algorithm of [18] has been found to give 4%-8% less area and about 80%-100% improvement in delays over conventional state assignment approaches considered in the literature [19]. This provides us the motivation to apply the Greedy algorithm for decomposition of the FSM of a routing database. The Greedy algorithm builds the first partition by forcing tightly connected states into the same block so that the edges between them are replaced by a self-loop. While building the second partition, the states are added one at a time by doing a local search (on assignments of states to vertices in the partition) to determine which assignment creates the minimum number of additional edges in the partition.
The pseudocode is given in Algorithm 2. The algorithm generates the two partitions, each consisting of a bounded number of blocks with a bound on the number of states per block. One helper function returns the state with the maximum number of fan-in edges from the states already in a given block of the first partition.
The other helper function chooses one block among all the available blocks of the second partition such that, when a state is put in that block, the number of additional edges created in the digraph corresponding to that partition is minimized.
C. Decomposition of FSM for IP Lookup Engine
In our implementation, the FSM is generated from the 1-bit trie obtained after preprocessing the routing database. This 1-bit FSM is topographically decomposed into machines. The maximum possible number of states in each machine is set to some upper limit. We have performed our simulations with different maximum limits of 256, 512, 1024, and 2048 states and evaluated the performance of our approach in each case. The machines are decomposed into interdependent orthogonal partitions using the above mentioned Greedy algorithm.
The numbers of states in the 1-bit FSMs for the Mae-East, FUNET, and Ripe databases have already been given in Table III. The results of the topographical breakdown are given in Table VIII for the cases when the maximum number of states in each machine is restricted to 256, 512, 1024, and 2048. These are called Case 1, 2, 3, and 4, respectively, in the table. The results after each machine is decomposed into orthogonal partitions are given in Table IX. Note that we can achieve a substantial reduction in the total number of states. The results in the second and third columns of Table IX indicate that we can achieve an effective partitioning of a very large FSM generated from a routing database. The simulations performed with a C model of the IP lookup engine based on such a partitioned FSM architecture are discussed in the next section.
V. SOFTWARE MODEL AND PERFORMANCE RESULTS
We have developed a C model of an FSM-based IP address lookup engine to analyze the performance of such a lookup chip in the network. In the software model, a VLFSM is generated using the routing database. The FSMs are controlled by an FSM controller. For the simulation, the packet trace is generated from the database. For this, prefixes are chosen at random from the database and these become the addresses (prefix with trailing zeros) that need to be looked up.
TABLE X LOOKUP TIME FOR MAE-EAST AND FUNET DATABASES
The function of the FSM controller in our software model is to input the IP address bits one by one to an appropriate FSM. It also passes control from one FSM to another when necessary. We also assume pipelining of the packets. When one FSM completes its processing of an IP address, it can start processing another packet. In the model, we have assumed that two clock cycles are required for the transfer of control from one FSM to another FSM. This assumption is reasonable. Even if we assume that five clock cycles are required to transfer control, our simulations indicate that the average number of cycles for lookup without pipelining increases by about ten cycles while with pipelining, it increases only by two to three cycles.
The results of our simulations for Mae-East and FUNET routing database are given in Table X . From the results, we observe that the average number of cycles required for lookup without pipelining is of the order of 30 cycles for Mae-East and 20 cycles for FUNET. The pipelining reduces the number of cycles to about five to eight cycles for lookup. Note that the average number of cycles without pipelining in case of Mae-East is more than that of FUNET, however, with pipelining Mae-East requires actually smaller number of cycles. This is possible as the depth of the FSM graph is not same for every machine and the effect of pipelining is dependent on the FSM graph of a routing database and its partitioning.
If we assume that each machine runs at about 100 MHz (as explained in the next section, it is possible to achieve this clock speed in the actual realization of the chip), then the average lookup time is of the order of 200-300 ns without pipelining and 50-80 ns with pipelining. The pipelined figures roughly translate to a lookup capability of the order of 10 million lookups per second. We have compared these results with those of [6]. The authors of [6] have carefully tuned their scheme to give performance that can scale to OC-192 line rates. Our scheme gives performance comparable to, and with pipelining even superior to, that of the tree bit map scheme. Moreover, as pointed out earlier, our approach is not constrained by memory bandwidth.
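The conversion from cycle counts to lookup times and rates can be checked with a short calculation. The helper names below are hypothetical; the cycle counts and the 100 MHz clock are the figures quoted in the text.

```python
def lookup_time_ns(cycles, clock_mhz):
    """Time for one lookup in nanoseconds at the given clock frequency."""
    return cycles * 1000.0 / clock_mhz


def lookups_per_second(cycles, clock_mhz):
    """Sustained lookup rate if one lookup completes every `cycles` cycles."""
    return clock_mhz * 1e6 / cycles


# Without pipelining: about 20 to 30 cycles per lookup at 100 MHz.
print(lookup_time_ns(20, 100), lookup_time_ns(30, 100))
# With pipelining: about 5 to 8 cycles per lookup at 100 MHz.
print(lookup_time_ns(5, 100), lookup_time_ns(8, 100))
print(lookups_per_second(8, 100), lookups_per_second(5, 100))
```

At eight cycles per lookup the rate is 12.5 million lookups per second, and at five cycles it is 20 million, consistent with the order of magnitude stated above.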
In the next section, we briefly discuss the VLSI architecture of the lookup chip. For lack of space, we discuss only the salient features. The details of the VLSI implementation including the layout of the lookup engine have been discussed at length in [21] .
VI. VLSI ARCHITECTURE
Our basic lookup chip has been illustrated in Fig. 3 . In an actual implementation, the number of machines and the maximum number of states in the machine will be fixed. Each machine shown in Fig. 3 is drawn in Fig. 6 with input and output ports. These machines are controlled by the central control logic. The central block is also responsible for transferring control from one machine to another.
One input port is where the IP address bits are applied. A mode signal is low when the machine is operated in lookup mode; for update purposes, this signal is high. The other input signals shown in the figure are also required for programming the state machine. An output signal goes high when the lookup operation terminates.
The FSM generated from the routing database is topographically decomposed into smaller FSMs that are mapped onto these machines. Each of these machines is partitioned into two orthogonal partitions as explained above. The block diagram of the machine with orthogonal partitions is shown in Fig. 7. Each combinational logic block computes the next state or the output function using the present states of the two partitions and the external input. The complete VLSI architecture is depicted in Fig. 8. Currently, we have not performed any Boolean optimization while implementing each machine. Such optimization can be performed to fit more states per machine within the same area.
Each partition has been implemented with a double PLA (DPLA) as shown in Fig. 9. Any FSM graph can be realized by programming the DPLA structure. The programmability of the architecture is obtained by introducing memory elements. When the FSM is to be updated, the update-mode input signal is set high. The values to be stored in the memory elements of a particular row of the PLA structure are scanned into the flip-flop scan chain (FF scan chain shown in Fig. 8) using the scan input port of each partition. For lack of space, we do not discuss here the clock distribution scheme and the design of the central block (see Fig. 3). The central block is responsible for controlling the machines and transferring control from one machine to another.
The performance of the chip has been predicted through simulations. MAGIC version 7.1 is used to do the layout for a 0.25-µm n-well process. The circuit layouts are extracted and simulated using SPICE3f5 [22] and IRSIM, Version 9.5. The transistor model BSIM3, Version 3.1 [23], level 8 is used. The power supply used is 2.5 V. A plot of the clock period of operation of the machine against machine size (expressed in terms of the total number of states in the machine) is shown in Fig. 10.
These simulation results indicate that each machine can easily work at 100-150 MHz. Thus, we can achieve the lookup performance predicted by the C model of the chip.
VII. DISCUSSION
Apart from the lookup speed, other key issues that need to be considered in judging a route lookup solution are: area of the silicon, the problem of applying updates, and the scalability of the solution.
A. Area Considerations and Optimizations
The estimated area for a 1024-state programmable block is 1 mm² in the reference 0.25-µm technology [21]. Thus, assuming a 50% overhead for the central block, we estimate that a VLSI chip in 0.25-µm technology can accommodate 50 000 states, that is, a table with up to 10 000 prefix entries. This packing will improve as technology scales, and in 0.13-µm technology, which is currently available, a chip should accommodate a routing database with 40 000 entries.
The area efficiency can be improved further if we take advantage of the following observation: in the pipelined architecture, very few machines are active simultaneously (at most five machines are active in the cases we have studied). Thus, there is no need to have 50 machines to accommodate a 50 000-state FSM. Instead, we can make do with a smaller number of machines, and use dynamic reconfiguration to map several sub-FSMs onto the same hardware block. One possible architecture is indicated in Fig. 11. In this figure, the individual machines have attached configuration memories which store the possible sub-FSMs that can be mapped to the corresponding machine. At run time, a machine is programmed with a sub-FSM depending on the state of the lookup. Essentially, this architecture trades off machine area for memory area. We are presently exploring this theme further, and initial results indicate that a considerable saving in area can result. The key technical difficulties that need to be addressed here are improvement of the reprogramming time of each machine, and the effective scheduling of the individual machines.
Fig. 11. Dynamically reconfigurable architecture.
B. The Update Problem
Whenever the routing database changes, this change needs to be applied to the hardware. This update problem has two parts: first, the lookup trie needs to be modified; second, the modification needs to be applied to the hardware. We will concentrate on the second aspect of the update problem. Addition or deletion of an entry typically leads to a small change in the trie. When the trie changes, the change is typically localized within a single sub-FSM. Hence, the number of machines to be updated is small. Thus, for such local changes the update is not a serious issue.
If the entire database changes, then all the sub-FSMs need to be updated. In the worst case, assuming a sub-FSM size of 1024 states, 15 kbits of configuration memory need to be transferred for each machine [21] . Thus, for a circuit consisting of 50 machines, 750 Kbits need to be transferred into the system. Assuming a 32-bit access bus operating at 100 MHz, this much data can enter the chip in s. Thus, we can safely claim that the physical update for a chip with 50 K states can be performed in less than 100 s.
C. Scaling Issues
FSM based lookup engine is scalable with respect to the following properties.
1) As process technology improves, and feature sizes shrink, there is a direct benefit in packing and speed. This benefit will track the technology scaling exactly (as opposed to memory speeds which do not track as well).
2) As the prefix lengths scale (towards IPv6), the lookup FSM size is determined mainly by the size of the routing table, and not as much by the length of the tag being looked up. On the other hand, CAM-based approaches will have a direct penalty here.
3) As the key size increases, the key stored with the tag can be moved to memory outside the FSM. Thus, a lookup in the FSM can be followed by a (guaranteed) single lookup in memory.
VIII. CONCLUSION
In this paper, we have presented a new approach for an IP address lookup engine based on a programmable FSM. Apart from CAM-based solutions, the previous research has concentrated mainly on processor-memory solutions. The database is arranged in various forms of trie data structures. The researchers have carefully tuned their algorithms across various memory architectures and exploited some new DRAM architectures to extract the best performance. In contrast to these approaches, we have presented an approach that is not constrained by memory bandwidth. The partitioning approach followed by us results in a regular and well structured FSM. Our simulation results demonstrate that we can achieve a wire speed performance at OC-192 rates. We believe that our approach has the potential to scale with VLSI technology.
We have thus far considered an FSM generated from a 1-bit trie. Multibit tries and their variants can also be considered within our framework. As indicated earlier, the VLSI architecture can be optimized for area by using dynamic reconfiguration. This can improve the packing properties of the architecture. Our current partitioning algorithm is not optimal. Indeed, we have developed some heuristics (see [21]) that have shown some promise for better performance.
