We propose the multiple LUT cascade as a way to configure an n-input longest prefix match (LPM) address generator, which is commonly used in routers to determine the output port for a given address. The LPM address generator accepts n-bit addresses, which it matches against k stored prefixes. We implement our design on a Xilinx Spartan-3 FPGA for n = 32 and k = 504 to 511. We also compare our design to a Xilinx proprietary ternary content-addressable memory (TCAM) design and to another design we propose as a likely solution to this problem. Our best multiple LUT cascade implementation has 5.17 times more throughput, 40.71 times more throughput/area, and is 2.97 times more efficient in terms of area-delay product than Xilinx's proprietary design, while its area is only 15% of that of Xilinx's design. Furthermore, we derive a method to determine the optimum configuration of the multiple LUT cascade on an FPGA.
Introduction
The need for higher internet speeds is likely to be the subject of intense interest for many years to come. A network's speed is directly related to the speed with which a node can switch a packet from an input port to an output port. This, in turn, depends on how fast a packet's address can be accessed in memory. The longest prefix match (LPM) problem is one of determining the output port address from a list of prefix vectors stored in memory. For example, if the prefix vector 01001**** is stored in memory, then the packet address 010011111 matches this entry. That is, each bit in the packet address either matches exactly the corresponding bit in the prefix vector, or there is a * (don't care) in that position. If other stored prefixes match the packet address, then the prefix with the fewest don't care values determines the output port address. That is, the memory entry corresponding to the longest prefix match determines the output port.
An ideal device for this application is a ternary content-addressable memory (TCAM) (Song and Lockwood 2005, Pagiamtzis and Sheikholeslami 2006). The descriptor "ternary" refers to the three values stored: 0, 1, and *. In Kasnavi et al. (2005), the authors propose pipelined TCAMs for the longest prefix match to increase TCAM efficiency. In Wang et al. (2005), the authors used a TCAM and a small DRAM for the longest prefix match to reduce the required size of TCAM. Unfortunately, TCAM still dissipates more power than standard RAM (Renesas 2005).
Several authors have proposed the use of standard RAM in LPM design. Gupta et al. (1998) showed a mechanism to perform an LPM with every memory access. Dharmapurikar et al. (2003) have proposed the use of Bloom filters to solve the LPM problem. It has also been shown that a fast, power-efficient TCAM realization is possible using a look-up table (LUT) cascade.
In this paper, we propose an extension to the LUT cascade realization: a multiple LUT cascade realization that consists of multiple LUT cascades connected to a special encoder. This offers even more efficient realizations in an architecture that is more easily reconfigured when additional prefix vectors are placed in the prefix table.
We have implemented six LPM address generators on the Xilinx Spartan-3 FPGA (XC3S4000-5): four using multiple LUT cascades, one using Xilinx's TCAM realization based on the Xilinx IP core, and one using registers and gates. In addition, we compare the six LPM address generators on the basis of delay, delay-area product, throughput, throughput/area, and FPGA resources used.
A preliminary version of this paper was presented at ARC2006 (Qin et al. 2006). We extend these results by introducing the optimum configuration of the multiple LUT cascade, and by showing how to realize the optimum multiple LUT cascade on an FPGA.
The rest of the paper is organized as follows: §2 describes the multiple LUT cascade; §3 shows other realizations for the LPM address generators; §4 presents the implementations of the LPM address generator using an FPGA; §5 shows the experimental results; §6 discusses the optimum configuration of the multiple LUT cascade implemented on an FPGA; and finally, §7 concludes the paper.
Multiple LUT cascades

LPM address generators
A content-addressable memory (CAM) (Shafai et al. 1998) stores 0s and 1s and produces the address of the given data. A TCAM, unlike a CAM, stores 0s, 1s, and *s, where * is a don't care value that matches both 0 and 1.
TCAMs are extensively used in routing tables for the internet. A routing table specifies an interface identifier corresponding to the longest prefix that matches an incoming packet, in a process called longest prefix match (LPM). In the LPM table, the ternary vectors have restricted patterns: the prefix consists of only 0s and 1s, and the postfix consists of only *s (don't cares). In this paper, this type of vector is called a prefix vector.
Definition 1: An n-input m-output k-entry LPM table stores k n-element prefix vectors. To assure that the longest prefix address is produced, TCAM entries are stored in descending prefix length, and the first match starting from the top of the table determines the LPM table's output. An address is an m-element binary vector for $m = \lceil \log_2(k+1) \rceil$, where $\lceil a \rceil$ denotes the smallest integer greater than or equal to $a$. The corresponding LPM function is a logic function $f: B^n \to B^m$, where $f(\vec{x})$ is the smallest address of an entry that is identical to $\vec{x}$ except possibly for don't care values. If no such entry exists, $f(\vec{x}) = 0^m$. The LPM address generator is a circuit that realizes the LPM function.
Example 1: Table 1 shows an LPM table with 5 4-element prefix vectors. Table 2 shows the corresponding LPM function. It has 16 entries, one for each 4-bit input. The output address is stored for each input corresponding to the address of the longest prefix vector that matches it.
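As a behavioral reference for Definition 1, the following minimal Python sketch evaluates an LPM function by scanning the table from the top. The four-entry table is a hypothetical example (it is not Table 1), and the function names are chosen here for illustration only.

```python
def matches(prefix: str, address: str) -> bool:
    """True if every non-'*' digit of the prefix vector equals the address bit."""
    return len(prefix) == len(address) and all(
        p == "*" or p == a for p, a in zip(prefix, address)
    )


def lpm_address(table: list[str], address: str) -> int:
    """Return the 1-based address of the first matching entry, or 0 if none.

    The table is assumed to be sorted by descending prefix length, so the
    first match from the top is the longest prefix match.
    """
    for index, prefix in enumerate(table, start=1):
        if matches(prefix, address):
            return index
    return 0


def output_width(k: int) -> int:
    """m = ceil(log2(k + 1)), computed exactly with integer arithmetic."""
    return k.bit_length()


# Hypothetical 4-input table, longest prefixes first (not Table 1 of the paper).
table = ["1000", "010*", "01**", "1***"]
```

With this table, the address 0101 matches both 010* and 01**, and the sketch returns the address of the longer prefix, 010*.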
An LUT cascade realization of LPM address generators
An LPM function, such as that shown in table 2, can be realized by a single memory which operates as a programmable combinational logic circuit. However, this often requires prohibitively large memory size.
Theorem 1: (Sesao 2006) An n-input LPM address generator with k prefix vectors can be realized by an LUT cascade, where each cell realizes a p-input, r-output combinational logic function. Let s be the necessary number of levels or cells. Then,
$$ s \le \left\lceil \frac{n - r}{p - r} \right\rceil, \qquad (1) $$
where $p > r$ and $r = \lceil \log_2(k+1) \rceil$. Since the delay is proportional to the number of levels in a cascade, we wish to reduce the number of levels. According to (1), if we increase p, the number of inputs to each cell, then the number of levels s is reduced. For each increase by 1 of p, the memory needed to realize the cell is doubled. However, as shown in figure 1, we can use a multiple LUT cascade to reduce the number of levels s while keeping p fixed.
For an n-input LPM function with k prefix vectors, let the number of rails of each LUT cascade be r. First, starting at the top of the LPM table, partition the set of prefix vectors into g groups of $2^r - 1$ vectors each, except the last group, which has $2^r - 1$ or fewer vectors, where $g = \lceil k/(2^r - 1) \rceil$. For each group of prefix vectors, form an independent LPM function. Next, partition the set of n inputs into s groups. The inputs within a group will apply to a single cell within each cascade. Then, realize each LPM function by an LUT cascade. Thus, we need a total of g LUT cascades, where each LUT cascade consists of s cells. Finally, use a special encoder to produce the LPM address.
Let $v_i$ $(i = 1, 2, \ldots, g)$ be the ith input of the special encoder from the ith LUT cascade, and let $v_{out}$ be the output value of the special encoder. That is, $v_i$ is the output value of the ith LUT cascade, where its binary output values are viewed as a standard binary number. Similarly, $v_{out}$ is the output of the special encoder, where its binary output values are viewed as a standard binary number. Then, we have the relation
$$ v_{out} = \begin{cases} v_i + (i-1)(2^r - 1) & \text{if } v_i \ne 0 \text{ and } v_j = 0 \text{ for all } 1 \le j < i, \\ 0 & \text{if } v_i = 0 \text{ for all } 1 \le i \le g. \end{cases} $$
Note that $v_{out}$ is the position of a prefix vector v in the complete LPM table, while i is the index to the LUT cascade storing v. The term $(i-1)(2^r - 1)$ is the position in the LPM table of the last entry of the previous $(i-1)$-th LUT cascade, or is 0 in the case of the first LUT cascade. Adding $v_i$ to this yields the position of v in the complete LPM table.
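The address calculation performed by the special encoder can be sketched in Python as follows. The sketch assumes that the topmost cascade with a non-zero output takes priority, which follows from storing the prefix vectors in descending prefix length. The function name is illustrative, not taken from the design.

```python
def special_encoder(cascade_outputs: list[int], r: int) -> int:
    """Combine the outputs v_1..v_g of g LUT cascades into one LPM address.

    Each cascade stores at most 2**r - 1 prefix vectors and outputs 0 when
    none of its vectors match. Assumption of this sketch: the first
    (topmost) cascade with a non-zero output wins.
    """
    group_size = (1 << r) - 1
    for i, v in enumerate(cascade_outputs, start=1):
        if v != 0:
            # Offset by the number of entries held in cascades 1..i-1.
            return v + (i - 1) * group_size
    return 0
```

For example, with r = 8 and two cascades, an output of 3 from the second cascade alone corresponds to position 3 + 255 = 258 in the complete table.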
Example 2: Consider an n-input LPM function with k prefix vectors. When k = 1000 and n = 32, by Theorem 1, we have r = 10. Let p = r + 1 = 11. When we use a single LUT cascade to realize the function, by Theorem 1, we need $\lceil (n-r)/(p-r) \rceil = 22$ cells, and the number of levels of the LUT cascade is also 22. Since each cell has 11 address lines and 10 outputs, the total memory size needed to realize the cascade is $2^{11} \times 10 \times 22 = 450{,}560$ bits. Note that the memory size of each cell, $2^{11} \times 10 = 20{,}480$ bits, is too large to be realized by a single block RAM (BRAM) of our FPGA, which stores 18,432 bits.
However, if we use a multiple LUT cascade to realize the function, we can reduce the number of levels and the total memory. Also, the cells will fit into the BRAMs in the FPGAs. Partition the set of vectors into two groups, and realize each group independently; this requires two LUT cascades. For each LUT cascade, the number of vectors is 500, so we have r = 9. Also, let p = r + 2 = 11. Then, we need $\lceil (n-r)/(p-r) \rceil = 12$ cells in each cascade. Note that the number of levels of the LUT cascades is 12, which is smaller than the 22 needed in the single LUT cascade realization. Since each cell consists of a memory with 9 outputs and at most 11 address lines, the total memory size is at most $2^{11} \times 9 \times 12 \times 2 = 442{,}368$ bits. Also, note that the size of the memory for a single cell is $2^{11} \times 9 = 18{,}432$ bits. This fits exactly in the BRAMs of the FPGAs.
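The arithmetic of Example 2 can be reproduced with a short Python sketch. The helper names are hypothetical, and the memory figure is the upper bound used in the example (every cell is counted with the full p address lines).

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def rails(k: int) -> int:
    """r = ceil(log2(k + 1)), computed exactly with integer arithmetic."""
    return k.bit_length()


def cascade_memory(n: int, k_per_cascade: int, p: int, cascades: int = 1):
    """Return (r, s, bits) for `cascades` identical LUT cascades.

    s is the number of cells per cascade given by Theorem 1, and bits is
    an upper bound on the total memory: 2**p words of r bits per cell.
    """
    r = rails(k_per_cascade)
    s = ceil_div(n - r, p - r)
    bits = (2 ** p) * r * s * cascades
    return r, s, bits


single = cascade_memory(n=32, k_per_cascade=1000, p=11)
double = cascade_memory(n=32, k_per_cascade=500, p=11, cascades=2)
```

Here `single` evaluates to (10, 22, 450560) and `double` to (9, 12, 442368), matching the figures in the example.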
Thus, the multiple LUT cascade not only reduces the number of levels and the total memory, but also reduces the size of cells to fit into the available memory in the FPGAs. Figure 1 shows the multiple LUT cascade realization. It consists of multiple LUT cascades and a special encoder. The inputs of each LUT cascade are common with other LUT cascades, while the outputs of each LUT cascade are connected to the special encoder. Each LUT cascade realizes an LPM function, while the special encoder generates the LPM address from the outputs of cascades.
The detailed design of each LUT cascade is shown in figure 2. Here $\vec{x}_i$ $(i = 1, 2, \ldots, s)$ denotes the primary inputs to the ith cell, $\vec{d}_i$ $(i = 1, 2, \ldots, s)$ denotes the data inputs to the ith cell and provides the data value to be written in the RAM of the ith cell, r denotes the number of rails, where $r \le \lceil \log_2((k/g) + 1) \rceil$, and $\vec{c}_j$ $(j = 2, 3, \ldots, s)$ denotes the additional inputs to the jth cell and is used to select the RAM location along with $\vec{x}_j$ for write access. Note that $\vec{c}_j$ and $\vec{d}_i$ are represented by r bits. All RAMs except perhaps the last one have p address lines; the last RAM has at most p address lines. When WE is high, $\vec{c}_j$ is connected to the RAM through a MUX, allowing data to be written into the RAMs. When WE is low, the outputs of the RAMs are connected to the inputs of the succeeding RAMs through a MUX, and the circuit is a cascade that realizes the LPM function. Note that the RAMs are synchronous RAMs. Therefore, the LUT cascade resembles a shift register.
Example 3: Table 3 shows a 6-input 3-output 6-entry LPM table, and the truth table of the corresponding LPM function is shown in table 4. Note that the entries in the two tables are similar. Table 4 is a compact truth table, showing only non-zero outputs. Its input combinations must be disjoint. Thus, the two tables are the same except for three entries.
Single memory realization. The number of address lines is 6, and the number of outputs is 3. Thus, the total amount of memory is $2^6 \times 3 = 192$ bits.
Single LUT cascade realization. Since there are k = 6 prefix vectors, by Theorem 1, the number of rails is $r = \lceil \log_2(6+1) \rceil = 3$. Let the number of address lines for the memory in a cell be p = 4. By partitioning the inputs into three disjoint sets $\{x_1, x_2, x_3, x_4\}$, $\{x_5\}$, and $\{x_6\}$, we have the cascade in figure 3(a). For simplicity, only the signal lines for the cascade realization are shown. Other lines, such as those for storing data, are omitted. The total amount of memory is $2^4 \times 3 \times 3 = 144$ bits, and the number of levels is s = 3. Note that the single LUT cascade requires 75% of the memory needed in the single memory realization.
Multiple LUT cascade realization. Partition table 3 into two parts, each with three prefix vectors. The number of rails in the LUT cascades associated with each separate LPM table is $\lceil \log_2(3+1) \rceil = 2$. Let the number of address lines for the memory in a cell be p = 4. By partitioning the inputs into two disjoint sets $\{x_1, x_2, x_3, x_4\}$ and $\{x_5, x_6\}$, we obtain the realization in figure 3(b). The upper LUT cascade realizes the 
Note that $(out_2, out_1, out_0)$, viewed as a standard binary number, has value $v_{out}$ corresponding to the address in table 3. The total memory size is $2^4 \times 2 \times 4 = 128$ bits, and the number of levels is 2. Note that the multiple LUT cascade realization requires 89% of the memory and one fewer level than the single LUT cascade realization.
Registers and gates
We also compare our proposed multiple LUT cascade realization with a direct realization using registers and gates, as shown in figure 4. We use a register pair (Reg. 1 and Reg. 0) to store each digit of a ternary vector. For example, if the digit is * (don't care), the register pair stores (1,1). Thus, for n-bit data, we need a 2n-bit register. The comparison circuit consists of an n-input AND gate and n 1-bit comparison circuits, each of which produces a 1 if and only if the input bit matches the stored bit or the stored bit is don't care (* or 11). For each prefix vector of an n-input LPM address generator, we need a 2n-bit register, n 1-bit comparison circuits, and an n-input AND gate. For an n-input address generator with k registered prefix vectors, we need k 2n-bit registers, nk 1-bit comparison circuits, and k n-input AND gates. In addition, we need a priority encoder with k inputs and $\lceil \log_2(k+1) \rceil$ outputs to generate the LPM address. If the n-input AND gate is realized as a cascade of 2-input AND gates, this circuit can be considered as a special case of the multiple LUT cascade architecture, where r = 1, p = 2, and g = k. Note that the output encoder circuit is a standard priority encoder.
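A behavioral Python sketch of the registers-and-gates realization is given below. Only the code (1,1) for * is stated in the text; the codes used here for the digits 0 and 1, as well as all names, are assumptions made for illustration.

```python
# (Reg. 1, Reg. 0) pair for each ternary digit. Only '*' -> (1, 1) comes
# from the text; the codes for '0' and '1' are assumptions of this sketch.
ENCODE = {"0": (0, 1), "1": (1, 0), "*": (1, 1)}


def store(prefix: str) -> list[tuple[int, int]]:
    """Contents of the 2n-bit register holding one prefix vector."""
    return [ENCODE[digit] for digit in prefix]


def compare_bit(x: int, reg1: int, reg0: int) -> int:
    """1-bit comparison: 1 iff the input bit matches or the digit is '*'."""
    return reg1 if x == 1 else reg0


def match_line(address_bits: list[int], registers: list[tuple[int, int]]) -> int:
    """n-input AND of the n 1-bit comparison outputs."""
    return int(all(compare_bit(x, r1, r0)
                   for x, (r1, r0) in zip(address_bits, registers)))


def priority_encoder(lines: list[int]) -> int:
    """1-based index of the first asserted line, or 0 if none is asserted."""
    for index, line in enumerate(lines, start=1):
        if line:
            return index
    return 0


def reg_gates_lpm(table: list[str], address: str) -> int:
    bits = [int(b) for b in address]
    return priority_encoder([match_line(bits, store(p)) for p in table])


# Hypothetical 4-input table, longest prefixes first.
table = ["1000", "010*", "01**", "1***"]
```

Each table row corresponds to one 2n-bit register with its comparison circuits and AND gate, so the hardware cost grows linearly with k, as stated above.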
FPGA implementations
We implemented the LPM address generators for 32 inputs and 504 to 511 registered prefix vectors on Xilinx Spartan-3 FPGAs (XC3S4000-5) in three ways: the multiple LUT cascade, Xilinx CORE Generator 7.1i, and registers and gates. The XC3S4000-5 (Xilinx Inc. 2005) has 96 BRAMs and 27,648 slices. Each BRAM contains 18K bits, and each slice consists of two 4-input LUTs, two D-type flip-flops, and multiplexers. For each implementation, we described the circuit in Verilog HDL, and then used Xilinx ISE 7.1i to synthesize and to perform place and route.
First, we used the multiple LUT cascade to realize the LPM address generators. To use the BRAMs in the FPGA efficiently, we chose the memory size of a cell in the LUT cascade not to exceed the size of a BRAM unit. Let p be the number of address lines of the memory in the cell. Since each BRAM contains $2^{11} \times 9$ bits, we have the relation $2^p \cdot r \le 2^{11} \times 9$, where r is the number of rails. Thus, we have $p = \lfloor \log_2(9/r) \rfloor + 11$, where $\lfloor a \rfloor$ denotes the largest integer less than or equal to $a$.
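The choice of p for a given number of rails can be checked with a small Python sketch. It computes the largest p satisfying $2^p \cdot r \le m$ using integer arithmetic only; the function name and the default BRAM size parameter are illustrative.

```python
def address_lines(r: int, bram_bits: int = 2 ** 11 * 9) -> int:
    """Largest p such that 2**p * r <= bram_bits.

    Equivalent to floor(log2(bram_bits / r)); for an 18,432-bit BRAM this
    is floor(log2(9 / r)) + 11.
    """
    return (bram_bits // r).bit_length() - 1


p_values = {r: address_lines(r) for r in range(4, 11)}
```

For the 18K-bit BRAM, the sketch gives p = 11 for every r from 5 to 9, p = 12 for r = 4, and p = 10 for r = 10.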
We designed four LPM address generators r6p11, r7p11, r8p11, and r9p11, as shown in table 6, where the column number of prefix vectors denotes the number of registered prefix vectors, the column r denotes the number of rails, the column p denotes the number of address lines of the RAM in a cell, the column Group denotes the number of LUT cascades, and the column Level denotes the number of levels or cells in the LUT cascade.
To explain table 6, consider r8p11, which is shown in figure 5. For r8p11, since the number of rails is r = 8, the number of groups is $\lceil 510/(2^8 - 1) \rceil = 2$. Thus, we need two LUT cascades. Since each LUT cascade consists of 8 cells, the number of levels of r8p11 is 8. To efficiently use BRAMs in the FPGA, the number of address lines of the RAM in the cell is set to $p = \lfloor \log_2(9/8) \rfloor + 11 = 11$. Let $v_1$ be the value of the outputs of the upper LUT cascade, let $v_2$ be the value of the outputs of the lower LUT cascade, and let $v_{out}$ be the value of the outputs of the special encoder. Then, we have the relation
$$ v_{out} = \begin{cases} v_1 & \text{if } v_1 \ne 0, \\ v_2 + (2^8 - 1) & \text{if } v_1 = 0 \text{ and } v_2 \ne 0, \\ 0 & \text{otherwise.} \end{cases} $$
The circuit realizing this expression requires 11 slices on the FPGA. For the whole circuit, r8p11 requires 16 BRAMs and 69 slices. From this table, we can see that decreasing r increases the number of groups, but decreases the number of levels.
Finally, we designed the LPM address generator with n = 32 inputs and k = 511 registered prefix vectors using registers and gates, as shown in figure 4. This design is denoted Reg-Gates. Note that the number of inputs is 32 and the number of outputs is 9. This design required 27,646 slices.
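The trade-off between the number of groups and the number of levels can be tabulated with the following Python sketch. The value k = 510 for r8p11 is taken from the text; the values of k used for r6p11, r7p11, and r9p11 are assumptions, chosen as the largest k in the range 504 to 511 that exactly fills the cascades.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def configuration(n: int, k: int, r: int, p: int) -> tuple[int, int, int]:
    """Return (groups g, levels s, BRAM count g * s), one BRAM per cell."""
    g = ceil_div(k, 2 ** r - 1)
    s = ceil_div(n - r, p - r)
    return g, s, g * s


# design name -> (k, r). Only k = 510 for r8p11 is stated in the text;
# the other k values are assumptions of this sketch.
designs = {"r6p11": (504, 6), "r7p11": (508, 7), "r8p11": (510, 8), "r9p11": (511, 9)}

summary = {name: configuration(32, k, r, 11) for name, (k, r) in designs.items()}
```

Under these assumptions, r8p11 comes out as 2 groups of 8 levels using 16 BRAMs, in agreement with the figures quoted above, and r9p11 as a single cascade of 12 levels.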
Performance and comparisons
In table 7, we show the performance of multiple LUT cascade realizations (i.e., r6p11, r7p11, r8p11, and r9p11), and compare them with Xilinx's TCAM and Reg-Gates. In table 7, the column Level denotes the number of levels or cells in the LUT cascade, the column Slice denotes the number of occupied slices, the column Memory denotes the amount of memory required, and the column F_clk denotes the maximum clock frequency. The column tco denotes the maximum clock-to-output propagation delay. (It is the maximum time required to obtain a valid output at the output pin that is fed by a register after a clock signal transition on an input pin that clocks the register.) The column tpd denotes the maximum propagation time from the inputs to the outputs. The column Th. denotes the maximum throughput. Since the LPM address generator has 9 outputs, it is calculated as Th. = 9 · F_clk. For Reg-Gates, Delay denotes the maximum delay from the input to the output and is equal to tpd. For multiple LUT cascade realizations and Xilinx's TCAM, Delay denotes the total delay, and is calculated by
where 1000 is a unit conversion factor.
Consider the area occupied by the various realizations. From the Spartan-3 family architecture (Xilinx Inc. 2005), we can see that the area of one BRAM is at least the area of 16 slices (a slice consists of two "4-input LUTs", two flip-flops, and miscellaneous multiplexers).
An alternative estimate shows that the area of one BRAM is equivalent to that of 96 slices, as follows. In the Xilinx Virtex-II FPGA, one "4-input LUT" occupies approximately the same area as 96 bits of BRAM (also containing 18K bits) (Sproull et al. 2005). Note that both "4-input LUTs" and BRAMs of the Virtex-II FPGA are similar to those of the Spartan-3 FPGA. Thus, we can deduce that one BRAM of the Spartan-3 FPGA occupies about the same area as 192 (= 18 × 1024/96) "4-input LUTs". If we view one "4-input LUT" as approximately one-half of a slice according to our discussion in the previous paragraph, we conclude that one BRAM has about the same area as 96 (= 192/2) slices. Thus, the two estimates of the area for one BRAM, 16 and 96 slices, are quite different. For this analysis, the worst case of 96 slices/BRAM was used.
In table 7, the column Area denotes the equivalent utilized area, where the area for one BRAM is equivalent to the area for 96 slices. The column Th./Area denotes the efficiency of throughput per area for one slice. The column Area-delay denotes the area-delay product. The value denoted by best shows the best result.
Xilinx's TCAM has the smallest delay, but requires many slices. Reg-Gates has almost the same delay as Xilinx's TCAM, but requires about three times as many slices as Xilinx's TCAM. Note that Reg-Gates requires no clock pulses in the LPM address generation operation, while the others are sequential circuits that require clock pulses. Since the delay of Reg-Gates is 58.67 ns, the equivalent throughput is (1000/58.67) × 9 = 153 (Mbps), which is lower than all others.
All multiple LUT cascade realizations have higher throughput, smaller area, higher throughput/area, and are more efficient in terms of area-delay than Xilinx's TCAM. r8p11 has the highest throughput and the smallest delay among all multiple LUT cascade realizations, but is slightly less efficient in terms of area-delay than r9p11. r9p11 has the smallest area, the highest throughput/area, and the highest efficiency in terms of area-delay among all realizations. Although r8p11 has almost the same area-delay as r9p11, its area is 28% larger than that of r9p11. Hence, r9p11 is the best multiple LUT cascade realization, since it has 5.17 times more throughput, 40.71 times more throughput/area, and is 2.97 times more efficient in terms of area-delay product than Xilinx's TCAM, while its area is only 15% of that of Xilinx's TCAM.
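The equivalent area and the equivalent throughput of Reg-Gates can be reproduced with the Python sketch below. It uses only figures quoted in the text (96 slices per BRAM, 16 BRAMs and 69 slices for r8p11, and a 58.67 ns delay for Reg-Gates); the function names are illustrative.

```python
SLICES_PER_BRAM = 96  # worst-case area estimate adopted in the text


def equivalent_area(brams: int, slices: int) -> int:
    """Equivalent utilized area in slices."""
    return brams * SLICES_PER_BRAM + slices


def throughput_mbps(outputs: int, f_clk_mhz: float) -> float:
    """Th. = (number of outputs) * F_clk, in Mbps when F_clk is in MHz."""
    return outputs * f_clk_mhz


def equivalent_f_clk_mhz(delay_ns: float) -> float:
    """Equivalent clock rate of a combinational circuit; 1000 converts ns to MHz."""
    return 1000.0 / delay_ns


r8p11_area = equivalent_area(brams=16, slices=69)
reg_gates_th = throughput_mbps(9, equivalent_f_clk_mhz(58.67))
```

Here `r8p11_area` is 1605 slices, and `reg_gates_th` rounds to 153 Mbps.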
The optimum configuration of the multiple LUT cascade
Firstly, consider the relation between the required memory size and the total area. As can be seen from table 7, for r6p11, which has the most complicated encoder, the memory required occupies 96.3% (= (48 × 96)/4786) of the total area. For r9p11, the memory required occupies 92.1% (= (12 × 96)/1251) of the total area. Note that r9p11 has the smallest proportion of the area for memory to the total area among all the multiple LUT cascade realizations. Thus, the memory consumes no less than 92% of the total area. In addition, as shown in table 7, the size of Memory is approximately proportional to the Area. Hence, we can assume that the multiple LUT cascade realization with the smallest memory size corresponds to that with the smallest total area.
Secondly, consider the relation between Area-delay and Memory-delay product. As shown in table 8, r9p11 has both the smallest Area-delay and the smallest Memory-delay among all multiple LUT cascade realizations. Note that the value of Memory-delay is approximately proportional to the Area-delay. Thus, we can assume that the realization with the smallest Memory-delay corresponds to that with the smallest Area-delay.
Therefore, we can use the total size of memory required instead of the total area, and Memory-delay instead of Area-delay, to find the optimum multiple LUT cascade realization. Doing this allows a formal analysis, as shown in the next section.
Total size of memory
Consider the multiple LUT cascade implementation of an n-input LPM address generator that stores k prefix vectors. Let each cell be realized as a reconfigurable memory with m bits. For the implementations discussed previously in this paper, this memory is a BRAM of the Spartan-3 FPGA, where m = 18,432 bits. Each cell in the LUT cascade has r outputs, where $r \le \lceil \log_2(k+1) \rceil$. With m bits stored in each memory and r bits per word, $m/r$ words are stored in each LUT cell. Therefore, the number of address inputs for each LUT cell is $p(r) = \lfloor \log_2(m/r) \rfloor$. Note that $r \le p(r) - 1$. Let M(r) be the total memory needed to implement the given LPM address generator. That is,
$$ M(r) = m \cdot s \cdot g, $$
where $s = \lceil (n - r)/(p(r) - r) \rceil$ is the number of cells in each of the $g = \lceil k/(2^r - 1) \rceil$ cascades that make up the multiple LUT cascade realization of the LPM address generator.
Theorem 2: M(r) is a monotone decreasing function of r for $r \le p(r) - 2$.
Since M(r) is monotone decreasing for $r \le p(r) - 2$, to find the minimum M(r), it is only necessary to find M(r) for $r = p(r) - 2$ and $r = p(r) - 1$, an upper bound on r.
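The behavior described by Theorem 2 can be illustrated numerically. The Python sketch below assumes that the total memory is m bits per cell times s cells times g cascades, and evaluates it for n = 32, k = 511, and an 18,432-bit BRAM. It is a numerical check for one parameter set, not a proof, and the names are illustrative.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def address_lines(m: int, r: int) -> int:
    """p(r) = floor(log2(m / r)), using integer arithmetic."""
    return (m // r).bit_length() - 1


def total_memory_bits(n: int, k: int, r: int, m: int) -> int:
    """M(r) = m * s * g, assuming one m-bit memory per cell."""
    p = address_lines(m, r)
    if r > p - 1:
        raise ValueError("r must satisfy r <= p(r) - 1")
    s = ceil_div(n - r, p - r)       # cells per cascade
    g = ceil_div(k, 2 ** r - 1)      # number of cascades
    return m * s * g


M_BITS = 18432
brams = [total_memory_bits(32, 511, r, M_BITS) // M_BITS for r in range(5, 10)]
```

For r = 5, 6, 7, 8, 9 the sketch gives 85, 54, 35, 24, and 12 BRAMs, so the memory decreases as r grows up to r = p(r) − 2 = 9 in this case.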
Memory-delay product
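From table 7, we observed that the delay in an n-input LPM address generator is given approximately as
$$ Delay \approx (s + 2) \cdot \frac{1000}{F_{clk}}, $$
where $F_{clk}$ is the frequency of the clock in MHz. Let MD(r) be the memory-delay product of the multiple LUT cascade realization of the address generator. Therefore,
$$ MD(r) = M(r) \cdot Delay \approx m \cdot s \cdot g \cdot (s + 2) \cdot \frac{1000}{F_{clk}}. $$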
Theorem 3: MD(r) is a monotone decreasing function of r for $r \le p(r) - 5$. In particular, when $p(r-1) = p(r)$, MD(r) is a monotone decreasing function of r for $r \le p(r) - 3$.
Since MD(r) is monotone decreasing for $r \le p(r) - 5$, to find the minimum MD(r), it is only necessary to find MD(r) for $r = p(r) - i$, where i = 1, 2, ..., 5, that is, for five values of r. In particular, when $p(r-1) = p(r)$, to find the minimum MD(r), we only need to consider the three cases $r = p(r) - i$, where i = 1, 2, and 3.
Optimum multiple LUT cascade for the BRAM containing 18K bits, 16K bits or 4K bits
In popular FPGAs, such as Xilinx's FPGAs or Altera's FPGAs, the sizes of BRAM are 18K bits or 4K bits. For other FPGAs, the size of BRAM can be 16K bits. We consider these three types of BRAMs in the following discussion. For 16K-bit BRAM, from $p(r) = \lfloor \log_2(16{,}384/r) \rfloor$, we have $p(r) = 10$ for r = 9 and $p(r) = 11$ for $5 \le r \le 8$. Since $p(r-1) = p(r) = 11$ for $5 \le r \le 8$, from Theorem 3, MD(r) decreases with r when $1 \le r \le (11 - 3) = 8$. Let $\psi(r) = s \cdot g \cdot (s + 2)$, where $s = \lceil (n - r)/(p(r) - r) \rceil$ and $g = \lceil k/(2^r - 1) \rceil$. We can verify $\psi(8) < \psi(9)$ when n > 15. In most applications, we can assume that n > 16. Thus, we can conclude that MD(r) is minimum when r is maximum, where $r \le 8$.
When the size of a BRAM is m = 18K bits, from $p(r) = \lfloor \log_2(9/r) \rfloor + 11$, we have $p(r) = 11$ for $5 \le r \le 9$. When m = 4K bits, from $p(r) = \lfloor \log_2(8/r) \rfloor + 9$, we have $p(r) = 9$ for $5 \le r \le 8$ and $p(r) = 10$ for r = 4. Thus, for both m = 18K bits and m = 4K bits, we have $p(r-1) = p(r)$ when $r \ge p(r) - 5$. From Theorem 3, MD(r) is minimum when r = 11 − 2 = 9 or r = 11 − 3 = 8 for 18K-bit BRAM, and r = 9 − 1 = 8, r = 9 − 2 = 7, or r = 9 − 3 = 6 for 4K-bit BRAM. We can verify $\psi(6) < \psi(8)$ for 4K-bit BRAM when n > 14. In most applications, we can assume that n > 16. Thus, we only need to consider the case of $r = p(r) - 2$. Depending on the values of n and k, MD(r) is minimum when $r = p(r) - 2$ or $r = p(r) - 3$. However, for an LPM address generator with fixed n and k, we can easily obtain an r that minimizes $\psi(r) = s \cdot g \cdot (s + 2)$ by calculating the values for $r = p(r) - 2$ and $r = p(r) - 3$. Table 9 shows the values of r that minimize MD(r) for three types of BRAMs. In table 7, Area-delay for r9p11 and r8p11 are nearly the same. Note that when p = 11 and n = 32, the corresponding values for r = p − 2 and r = p − 3, 0.30 and 0.31, are almost the same.
For BRAM containing 18K bits, Theorem 2 shows that the area for r = 9 is smaller than for r = 8. If MD(r) for r = 9 is almost the same as for r = 8, then the multiple LUT cascade is optimum when r = 9. Similarly, for 4K-bit BRAM, if MD(r) for r = 7 and r = 6 are almost the same, then the multiple LUT cascade is optimum when r = 7. The following example shows the design of an optimum multiple LUT cascade.
Example 4: Consider an LPM address generator with n = 32 and k = 2040 implemented on a Spartan-3 FPGA. Note that the size of a BRAM is m = 18K bits. First, from $p(r) = \lfloor \log_2(9/r) \rfloor + 11$, we have $p(r) = 11$ when $5 \le r \le 9$. To obtain the optimal Area-delay realization, from table 9, r can be 8 or 9 when n = 32 and k = 2040. Let $\psi(r) = s \cdot g \cdot (s + 2)$. We have $\psi(9) = 672$ and $\psi(8) = 640$. Note that $\psi(9)$ is nearly the same as $\psi(8)$. Since the area for r = 9 is minimum from Theorem 2, the multiple LUT cascade is optimum when p = 11 and r = 9. In this case, a realization with $g = \lceil k/(2^r - 1) \rceil = \lceil 2040/(2^9 - 1) \rceil = 4$ LUT cascades is optimum. Also, the number of levels is $\lceil (n - r)/(p - r) \rceil = \lceil (32 - 9)/(11 - 9) \rceil = 12$, which shows that each LUT cascade consists of 12 cells. Finally, we need a special encoder. Let $v_1$ be the value of the outputs of the top LUT cascade, let $v_2$ be the value of the outputs of the second LUT cascade, let $v_3$ be the value of the outputs of the third LUT cascade, let $v_4$ be the value of the outputs of the fourth LUT cascade, and let $v_{out}$ be the value of the outputs of the special encoder. Then, we have the relation
$$ v_{out} = \begin{cases} v_1 & \text{if } v_1 \ne 0, \\ v_2 + (2^9 - 1) & \text{if } v_1 = 0 \text{ and } v_2 \ne 0, \\ v_3 + 2(2^9 - 1) & \text{if } v_1 = v_2 = 0 \text{ and } v_3 \ne 0, \\ v_4 + 3(2^9 - 1) & \text{if } v_1 = v_2 = v_3 = 0 \text{ and } v_4 \ne 0, \\ 0 & \text{otherwise.} \end{cases} $$
The optimum realization of r9p11g4 is shown in figure 6.
Table 9. The value of r that makes MD(r) minimum.
| Block RAM size | The minimum Memory-delay |
|---|---|
| 18K bits | $r = r_{max}$ when $r_{max} \le 8$; $r = r_{optimal}$ when $r_{max} = 9$ |
| 4K bits | $r = r_{max}$ when $r_{max} \le 6$; $r = r_{optimal}$ when $r_{max} = 7$ |
| 16K bits | $r = r_{optimal}$ for $r \le 8$ |

$r_{max}$ is the maximum integer r that satisfies both $r \le p(r) - 1$ and $r \le \lceil \log_2(k+1) \rceil$, where $p(r) = \lfloor \log_2(m/r) \rfloor$ and m denotes the size of a BRAM. $r_{optimal}$ is the r that makes $s \cdot g \cdot (s + 2)$ minimum, where $s = \lceil (n - r)/(p(r) - r) \rceil$ and $g = \lceil k/(2^r - 1) \rceil$. For m = 18K-bit BRAM, $r_{optimal}$ can be obtained by calculating values only for $r = p(r) - 2 = 9$ and $r = p(r) - 3 = 8$. For m = 4K-bit BRAM, $r_{optimal}$ can be obtained by calculating values only for $r = p(r) - 2 = 7$ and $r = p(r) - 3 = 6$.
We also implemented the LPM address generator with n = 32 and k = 2040 on Xilinx Spartan-3 FPGAs (XC3S4000-5). Table 10 shows that r9p11g4 has almost the same throughput and area-delay as r8p11g8, but its area is only 75% of r8p11g8. In addition, r9p11g4 has higher throughput/area than that of r8p11g8. Thus, r9p11g4 is the optimum realization for the LPM address generator with n = 32 and k = 2040.
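The selection procedure of Example 4 can be sketched in Python as follows. The sketch evaluates s · g · (s + 2) for the candidate values of r and, among candidates whose values are nearly the same, prefers the larger r, which has the smaller memory by Theorem 2. The 10% tolerance used to decide "nearly the same" is an assumption of this sketch, as are all function names.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def address_lines(m: int, r: int) -> int:
    """p(r) = floor(log2(m / r)), using integer arithmetic."""
    return (m // r).bit_length() - 1


def shape(n: int, k: int, r: int, m: int) -> tuple[int, int]:
    """Return (s, g): cells per cascade and number of cascades."""
    p = address_lines(m, r)
    return ceil_div(n - r, p - r), ceil_div(k, 2 ** r - 1)


def memory_delay_factor(n: int, k: int, r: int, m: int) -> int:
    """s * g * (s + 2), the quantity minimized in table 9."""
    s, g = shape(n, k, r, m)
    return s * g * (s + 2)


def choose_r(n: int, k: int, m: int, candidates: list[int],
             tolerance: float = 0.10) -> int:
    """Largest candidate r whose factor is within `tolerance` of the minimum.

    The tolerance is a hypothetical threshold for 'nearly the same'.
    """
    factors = {r: memory_delay_factor(n, k, r, m) for r in candidates}
    best = min(factors.values())
    return max(r for r, f in factors.items() if f <= best * (1 + tolerance))


M_BITS = 18432
factor_9 = memory_delay_factor(32, 2040, 9, M_BITS)
factor_8 = memory_delay_factor(32, 2040, 8, M_BITS)
best_r = choose_r(32, 2040, M_BITS, candidates=[8, 9])
```

For n = 32 and k = 2040, the factors are 672 for r = 9 and 640 for r = 8, and the sketch selects r = 9, which uses 4 × 12 = 48 BRAMs compared with 8 × 8 = 64 BRAMs for r = 8.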
Conclusions
In this paper, we presented the multiple LUT cascade to realize LPM address generators. In addition, we discussed an approach to obtain the optimum configuration of the multiple LUT cascade on FPGAs. Although we illustrated the design method for n = 32 and k = 504 to 511, it can be extended to other values of n and k.
We implemented four LPM address generators (i.e., r6p11, r7p11, r8p11, and r9p11) on the Xilinx Spartan-3 FPGA (XC3S4000-5) by using the multiple LUT cascade. For comparison, on the same type of FPGA, we also implemented Xilinx's proprietary TCAM and Reg-Gates, an approach proposed by us as a likely solution to the LPM problem. Xilinx's TCAM has the smallest delay, but requires many slices. Reg-Gates has almost the same delay as Xilinx's TCAM, but requires the largest area, with about three times as many slices as Xilinx's TCAM. All multiple LUT cascade realizations have higher throughput, smaller area, higher throughput/area, and are more efficient in terms of area-delay product than Xilinx's TCAM.
