ABSTRACT Ternary content-addressable memory (TCAM)-based search engines play an important role in networking routers. The search space demands of TCAM applications are constantly rising. However, existing realizations of TCAM on field-programmable gate arrays (FPGAs) suffer from storage inefficiency. This paper presents a multipumping-enabled multiported SRAM-based TCAM design on FPGA, to achieve an efficient utilization of SRAM memory. Existing SRAM-based solutions for TCAM reduce the impact of the increase in the traditional TCAM pattern width from an exponential growth in memory usage to a linear one using cascaded block RAMs (BRAMs) on FPGA. However, BRAMs on state-of-the-art FPGAs have a minimum depth limitation, which limits the storage efficiency for TCAM bits. Our proposed solution avoids this limitation by mapping the traditional TCAM table divisions to shallow sub-blocks of the configured BRAMs, thus achieving a memory-efficient TCAM memory design. The proposed solution operates the configured simple dual-port BRAMs of the design as multiported SRAM using the multipumping technique, by clocking them with a higher internal clock frequency to access the sub-blocks of the BRAM in one system cycle. We implemented our proposed design on a Virtex-6 xc6vlx760 FPGA device. Compared with existing FPGA-based TCAM designs, our proposed method achieves up to 2.85 times better performance per memory.
I. INTRODUCTION
Ternary content-addressable memory (TCAM) compares an input word with its entire stored data in parallel, and outputs the matched word's address. TCAM stores data in three states: 0, 1, and X (don't care). Traditional TCAMs are built as application-specific integrated circuits (ASICs), and offer high-speed search operations in a deterministic time.
TCAM is widely employed to design high-speed search engines and has applications in networking, artificial intelligence, data compression, radar signal tracking, pattern matching in virus detection, gene pattern searching in bioinformatics, and image processing, and is used to accelerate various database search primitives [1]-[3]. The Internet-of-things and big-data processing devices employ TCAM as a filter when storing signature patterns, and achieve a substantial reduction in energy consumption by reducing wireless data transmissions of invalid data to cloud servers [4], [5].
Field-programmable gate arrays (FPGAs) emulate TCAM using static random-access memory (SRAM), by addressing SRAM with TCAM contents. Each SRAM word corresponds to a specific TCAM pattern, and stores information on its existence for all possible data of the TCAM table. The increase in the number of TCAM pattern bits results in an exponential growth in memory usage. This exponential growth in memory usage has been reduced to linear growth by cascading multiple SRAM blocks in the design of TCAM on FPGA in previous work [6] , [7] .
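The addressing scheme described above can be illustrated with a minimal Python sketch. It models a single SRAM block addressed directly by the search word, so the table needs 2^W words for a W-bit pattern, which is the exponential growth mentioned in the text. The function names (`expand`, `build_sram`, `search`) and the three example rules are hypothetical and serve only as an illustration.

```python
from itertools import product

def expand(pattern):
    """Yield every binary word matched by a ternary pattern such as '1*0'."""
    choices = ["01" if c == "*" else c for c in pattern]
    for bits in product(*choices):
        yield "".join(bits)

def build_sram(patterns):
    """Return 2**W SRAM words; bit i of word a is 1 if rule i matches address a."""
    width = len(patterns[0])
    sram = [0] * (2 ** width)          # one SRAM word per possible search word
    for i, pattern in enumerate(patterns):
        for word in expand(pattern):
            sram[int(word, 2)] |= 1 << i
    return sram

def search(sram, word):
    """A single SRAM read returns the match vector for all stored rules."""
    return sram[int(word, 2)]

rules = ["1*0", "011", "***"]          # hypothetical 3-bit TCAM table
sram = build_sram(rules)
print(len(sram), bin(search(sram, "110")))
```

Each added pattern bit doubles `len(sram)`; cascading several narrower blocks and ANDing their match vectors is what turns this growth into a linear one.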
Contemporary FPGAs implement block-RAM (BRAM) in the silicon substrate, and offer a high speed. For example, the Xilinx Virtex-6 xc6vlx760 FPGA contains 720 BRAMs of size 36 Kb [8], and provides operating frequencies of greater than 500 MHz [9]. Designers utilize these high-speed SRAM blocks to design SRAM-based TCAMs on FPGA.
In existing SRAM-based solutions, the storage capacity of a BRAM for TCAM bits is limited by its high SRAM/TCAM ratio of $2^9/9$, because of its minimum depth limitation of 512 × 72 when configured in simple dual-port mode on FPGA [8]. For example, the design methodologies proposed in [10], [11], and [12] require a total of 56, 40, and 40 BRAMs of size 36 Kb, respectively, to implement an 18 Kb TCAM.
Excessive usage of BRAMs in the design of TCAM can result in a lack of BRAMs for other parts of the system on FPGA. Furthermore, the limited amount of BRAM resources on FPGA can compel designers to implement TCAMs in distributed RAM using SLICEM, resulting in the consumption of many slices, and a limitation on the maximum clock frequency of the design. This problem becomes more severe for the design of large storage capacity TCAMs. The efficient utilization of SRAM memory is imperative for the design of TCAMs on FPGAs.
The design of memory-efficient TCAMs requires shallow SRAM blocks on FPGAs. Multipumping-based multiported SRAM emulates the sub-blocks of a dual-port SRAM block as multiple shallow SRAM blocks, by operating the SRAM with a higher-frequency clock, allowing access to all of its sub-blocks in one system cycle. Researchers have designed efficient multiported memories using BRAMs on FPGA [13]-[16].
Existing FPGA-based TCAM design methodologies offer lower operational frequencies. This is mainly because of the complex routing of wide signals between BRAMs and logic, which results from the excessive usage of BRAMs, and because of the complex priority encoding units synthesized in logic slices for deeper traditional TCAMs. For example, the FPGA realizations of TCAM using BRAMs in [7] and [17] achieve operational frequencies of 139 MHz and 133 MHz to emulate 150 Kb and 89 Kb TCAMs, respectively. The highest operational frequency achieved in the previous studies [6], [10]-[12] is 202 MHz for the implementation of an 18 Kb TCAM on FPGA.
The demand for efficient utilization of SRAM memory in the design of TCAM and the speed provided by existing FPGA-based TCAM solutions make the use of multipumping-based multiported SRAM more practical for designing TCAM memory on FPGA. Our proposed TCAM design aims to achieve efficient memory utilization with a high throughput.
The contributions of this work are as follows:
• A novel multipumping-enabled multiported SRAM-based TCAM architecture, which achieves efficient memory utilization, is proposed.
• Our proposed approach presents a scalable and modular TCAM design on FPGA.
• The proposed design is more practical for large storage capacities, owing to the reduced routing complexity achieved by the use of fewer BRAMs and the reduced AND operation complexity. The novel optimization technique of AND-accumulating SRAM words in the proposed TCAM memory units divides the overall AND operation complexity of the design.
• The proposed design is implemented on a state-of-the-art FPGA. A detailed comparison of our proposed design with existing methods is performed with respect to the performance per memory. Our proposed design achieves a performance that is up to 2.85× higher per memory.
The remainder of this paper is organized as follows. Section II surveys related work. The proposed design is described in Section III. Section IV details the implementation setup and results of this work. The performance evaluation of the proposed design is detailed in Section V. Section VI concludes this work. Table 1 describes the basic notations used in this paper.
II. RELATED WORK
The CAM design methodologies presented in [18] and [19] are based on the hashing technique, which has the inherent drawback of bucket overflow. Moreover, when implemented in hardware this has an expensive overhead from re-hashing.
The CAM designs presented in [20] and [21] suffer from inefficient memory usage. The increase in pattern width results in an exponential growth in memory usage, thus making them infeasible for implementation in hardware. Our proposed solution reduces this growth to linear, as the wide pattern TCAMs are implemented by cascading BRAMs on FPGA.
The SRAM-based TCAMs presented in [10]-[12] store the TCAM presence and address information in separate BRAMs on FPGAs, resulting in an excessive usage of BRAMs. Our proposed design stores the TCAM presence and address information in the same BRAM, thus achieving efficient memory utilization.
Xilinx presented two types of FPGA applications in [22]: a CAM design using BRAM resources and a TCAM design using the 16-bit shift register look-up table (SRL16E). The first application emulates CAM rather than TCAM, and suffers from higher SRAM memory usage. The second application consumes one SRL16E of the SLICEM resources on FPGA to emulate every two bits of a TCAM table. Its implementation for large storage capacity designs suffers from routing and timing problems. Our proposed design has a reduced routing complexity for TCAM designs with large storage capacities, because of its lower usage of BRAMs and reduced AND operation complexity.
Recently, binary CAM and TCAM designs built using logic resources (SLICEL) on FPGA were presented in [23] and [24], respectively. In practice, a TCAM implemented using logic resources on FPGA would be of limited storage capacity, owing to routing congestion and timing challenges. Moreover, the update of data in a TCAM design built using look-up tables (LUTs) is slow compared with SRAM-based TCAMs, and requires the hardware overhead of a dynamic partial reconfiguration controller [25].
A hierarchical search scheme on FPGA is presented for SRAM-based CAM in [17] , which reduces its average power consumption by stopping subsequent search operations if a match is found in the previous SRAM block. However, in the worst-case scenario all SRAM blocks are searched. Thus, the worst-case power consumption remains high. The FPGA realization of TCAM presented in [26] stores the TCAM word presence and address information separately in Xilinx distributed RAM and BRAM, respectively. This reduces the average power consumption of the design, as the look-up in BRAMs is avoided if a match is not found in the distributed RAM. However, the worst-case power consumption remains high, with a lower overall system throughput.
The FPGA realizations of TCAM presented in [6] , [7] , and [17] store the presence and address information of TCAM words in the same SRAM block. However, this approach suffers from higher SRAM memory utilization due to the limited TCAM bits storage capacity of BRAMs resulting from the minimum depth limitation on its configuration in FPGAs.
Our proposed TCAM design achieves efficient utilization of SRAM memory by mapping TCAM divisions to shallow sub-blocks of BRAMs on FPGA. Furthermore, it operates the high-speed BRAMs in the design as multipumping-enabled multiported SRAM, maintaining a high system throughput.
III. PROPOSED DESIGN
A. MULTIPUMPING-ENABLED MULTIPORTED SRAM
The multipumping technique multiplies the ports of a dual-ported SRAM block by internally clocking it at an integral multiple of the external system clock [13], [15], [16], [27]. The addresses and data are registered and given access to the SRAM block in circular order, selected by the bits of a mod-$P$ counter, as shown in Figure 1. Several designs utilize multipumping for the implementation of efficient multiported memory [28]-[30].
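As a rough behavioral illustration of multipumping, the following Python sketch serves P registered read addresses through one physical port, one per internal clock tick, in the order given by a mod-P counter. It is a toy model under simplifying assumptions (read-only, a single physical port rather than two); the class and method names are hypothetical.

```python
class MultipumpedRAM:
    """Toy model: one physical read port served P times per system cycle."""

    def __init__(self, depth, factor):
        self.mem = [0] * depth
        self.factor = factor   # multipumping factor P
        self.counter = 0       # mod-P counter, advanced on the internal clock

    def system_cycle(self, addresses):
        """Serve P registered read addresses in P internal clock ticks."""
        assert len(addresses) == self.factor
        results = [None] * self.factor
        for _ in range(self.factor):               # one iteration = one internal tick
            port = self.counter                    # counter selects the logical port
            results[port] = self.mem[addresses[port]]
            self.counter = (self.counter + 1) % self.factor
        return results

ram = MultipumpedRAM(depth=8, factor=2)
ram.mem = list(range(10, 18))
print(ram.system_cycle([3, 5]))
```

From the outside, the block behaves as if it had P read ports, at the cost of a system clock that is P times slower than the internal clock.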
B. BASIC IDEA
In the SRAM-based implementation of TCAM, the depth of the traditional TCAM determines the width of the SRAM memory, and the width of the traditional TCAM is encoded as the address of the SRAM memory. The basic concept of the proposed multipumped SRAM-based TCAM implementation achieving increased memory efficiency is shown in Figure 2. Figure 2(c) shows the implementation of six TCAM bits (100*10) using a 16 × 1 SRAM block that has been multipumped two times, with each SRAM sub-block of size 8 × 1 emulating three TCAM bits. Figure 2(d) shows the implementation of eight TCAM bits (0*1000*10) using a 16 × 1 SRAM block that has been multipumped four times, with each SRAM sub-block of size 4 × 1 emulating two TCAM bits. Thus, designing TCAM using multipumping-enabled multiported SRAM as in Figures 2(c) and 2(d) requires fewer SRAM bits per TCAM bit when compared with the multipumping-less SRAM-based TCAM design in Figure 2(b). The TCAM bits storage capacity of the SRAM block increases with multipumping.
A multiported SRAM block of size $R_D \times R_W$ with a multipumping factor of $P$ implements a traditional TCAM table of size $P\log_2(R_D/P) \times R_W$, as shown in Figures 3 and 4. Our proposed design achieves increased TCAM bits storage capacity with an increase in the multipumping factor $P$.
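The capacity relation can be checked numerically. An SRAM block of depth R_D split into P sub-blocks contributes P·log2(R_D/P) pattern bits, one log2(R_D/P)-bit chunk per sub-block. The short Python sketch below evaluates this for the 16 × 1 examples above and for a 512-deep block; the function name `tcam_width_per_block` is hypothetical.

```python
def tcam_width_per_block(depth, factor):
    """TCAM pattern bits emulated by one SRAM block: P * log2(R_D / P)."""
    sub_depth = depth // factor
    # The sub-block depth must be a power of two with at least two words.
    if depth % factor or sub_depth < 2 or sub_depth & (sub_depth - 1):
        raise ValueError("sub-block depth must be a power of two >= 2")
    return factor * (sub_depth.bit_length() - 1)

for depth in (16, 512):
    for factor in (1, 2, 4):
        print(depth, factor, tcam_width_per_block(depth, factor))
```

For a depth of 16 this gives 4, 6, and 8 pattern bits for P = 1, 2, and 4, in line with the six-bit and eight-bit examples; for a depth of 512 it gives 9, 16, and 28 pattern bits.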
D. BASIC ARCHITECTURE OF THE PROPOSED TCAM MEMORY
The basic architecture of our proposed TCAM memory design is shown in Figure 4. It is operated by two fully synchronized clocks, a system clock $clk_S$ and an internal clock $clk_P$, such that $clk_P$ is $P$ times faster than $clk_S$. An incoming TCAM word is registered in a $W$-bit shift register using the system clock $clk_S$. The $\log_2 P$-bit counter generates a sequence of $\log_2 P$-bit numbers in $P$ internal clock cycles. It is initialized to zero upon reset, and it rolls over after every $P$ internal clock cycles. The $\log_2 P$ bits from the counter are concatenated with $\log_2(R_D/P)$ bits from the shift register to form the $\log_2 R_D$-bit address of the SRAM. At the positive edge of the internal clock $clk_P$, the SRAM address is applied such that the $\log_2 P$ bits from the counter constitute its most significant bits and point to the start of the corresponding sub-block in the SRAM, while the lower $\log_2(R_D/P)$ bits from the shift register select an SRAM word in that sub-block.
The SRAM word read in each internal cycle is AND-accumulated in an $R_W$-bit register using $clk_P$. The lookup for a $W$-bit input word is thus completed by reading and AND-accumulating one SRAM word from each sub-block of the SRAM in $P$ internal clock cycles, i.e., one system cycle. Consequently, the AND of the $P$ SRAM words is produced as the match word on $clk_S$. The timing diagram in Figure 6 elaborates the search operation of the proposed TCAM memory architecture shown in Figure 4 with a multipumping factor of $P = 2$.
VOLUME 6, 2018
FIGURE 6. Timing diagram for the search operation in our proposed TCAM with a multipumping factor $P = 2$ ($IW$: input word, $R_W$: SRAM word read, $MW$: match word).
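The search operation of one TCAM memory unit can be sketched behaviorally in Python. The model below is an illustration under stated assumptions, not the authors' implementation: the class `TcamUnit` and its methods are hypothetical, the most significant chunk of the word is assumed to map to sub-block 0, and rules are written only into empty slots. Each loop iteration in `search` stands for one internal clock tick, with the loop index playing the role of the counter that forms the upper address bits.

```python
from itertools import product

def chunk_values(chunk):
    """Integer values matched by a ternary chunk such as '*10'."""
    choices = ["01" if c == "*" else c for c in chunk]
    return {int("".join(bits), 2) for bits in product(*choices)}

class TcamUnit:
    def __init__(self, depth, width, factor):
        self.factor = factor                                # P
        self.sub_depth = depth // factor                    # R_D / P
        self.chunk_bits = self.sub_depth.bit_length() - 1   # log2(R_D / P)
        self.width = width                                  # R_W rules per unit
        self.sram = [0] * depth

    def write_rule(self, index, pattern):
        """Store a ternary pattern of P * log2(R_D / P) symbols in an empty slot."""
        assert len(pattern) == self.factor * self.chunk_bits
        for j in range(self.factor):
            chunk = pattern[j * self.chunk_bits:(j + 1) * self.chunk_bits]
            for value in chunk_values(chunk):
                self.sram[j * self.sub_depth + value] |= 1 << index

    def search(self, word):
        """AND-accumulate one SRAM word per sub-block; returns the match word."""
        acc = (1 << self.width) - 1                         # start from all ones
        for counter in range(self.factor):                  # one internal tick each
            chunk = word[counter * self.chunk_bits:(counter + 1) * self.chunk_bits]
            address = counter * self.sub_depth + int(chunk, 2)
            acc &= self.sram[address]
        return acc

unit = TcamUnit(depth=16, width=1, factor=2)   # 16 x 1 SRAM, multipumped twice
unit.write_rule(0, "100*10")
print(unit.search("100010"), unit.search("100110"), unit.search("100011"))
```

With the six-bit pattern 100*10, the words 100010 and 100110 match and 100011 does not, since the second sub-block has no entry at chunk value 011.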
E. MODULAR ARCHITECTURE
A TCAM design of large storage capacity is implemented as a cascade of $M \times N$ of the proposed TCAM memory units, as shown in Figure 5. An incoming $W$-bit TCAM word is divided into $N$ sub-words of $P\log_2(R_D/P)$ bits, with the bit ranges shown in Figure 5. The resultant sub-words are stored in $N$ shift registers of size $P\log_2(R_D/P)$ bits on $clk_S$. The $\log_2 R_D$-bit indexes from the $N$ shift registers are provided to the corresponding $M$ TCAM memory units of the $N$ columns of the proposed design in parallel using $clk_P$, as shown in Figure 5. All TCAM memory units of the design operate in parallel using $clk_P$. The $R_W$-bit match words from each row of the TCAM memory units are bit-wise ANDed on $clk_S$, and the results are provided to the associated priority encoder (PE) units. The $\log_2 D$-bit match address and the match information from each PE unit are provided to the overall priority encoder unit, which eventually forwards a match address based on the priority. The proposed TCAM design registers an input word and produces a match word as output on $clk_S$.
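The combination of the M × N match words can be sketched as follows. This Python fragment is a simplified model: it assumes that the lowest address has the highest priority (the priority order is not stated in the text), and the function names are hypothetical.

```python
from functools import reduce

def row_match(column_words):
    """Bit-wise AND of the R_W-bit match words from the N units of one row."""
    return reduce(lambda a, b: a & b, column_words)

def priority_encode(match_word):
    """Index of the lowest set bit, or None when no rule matches."""
    if match_word == 0:
        return None
    return (match_word & -match_word).bit_length() - 1

def overall_match(unit_words, word_width):
    """unit_words[m][n] is the match word of the unit in row m, column n."""
    for m, columns in enumerate(unit_words):
        local = priority_encode(row_match(columns))
        if local is not None:
            return m * word_width + local   # global TCAM address
    return None

# Hypothetical 2 x 2 arrangement with 4-bit match words.
print(overall_match([[0b0110, 0b0101], [0b1100, 0b1010]], word_width=4))
```

A rule matches only if every column of its row reports a match for it, which is why the per-row words are ANDed before priority encoding.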
The update of a TCAM word is performed in each TCAM memory unit of the design in parallel. The worst-case update latency of the proposed design is $R_D/P$ system cycles.
F. EFFECT OF MULTIPUMPING SRAM ON THE MEMORY USAGE AND THROUGHPUT
Multipumping results in a useful reduction in SRAM memory usage for the design of TCAM on FPGA. The configured SRAM memory blocks in our proposed design with a multipumping factor of $P$ implement traditional TCAM divisions of size $P\log_2(R_D/P) \times R_W$, as shown in Figure 4. The TCAM bits storage capacity of the SRAM blocks in the proposed design increases with an increase in $P$. The upper bound on the multipumping factor $P$ is $R_D/2$, i.e., the SRAM contains $R_D/2$ sub-blocks, each consisting of two SRAM words.
Multipumping divides the achievable internal clock frequency of the design by the multipumping factor, to obtain the operating frequency of the overall system [13]-[16]. Although an increase in the multipumping factor $P$ results in a higher memory efficiency for the design of TCAM, only the use of small multipumping factors is practical in order to avoid a significant drop in the operating frequency of the overall system. Overall, the multipumping factor $P$ controls a tradeoff between the SRAM memory efficiency and the speed of the proposed design.
IV. IMPLEMENTATION SETUP AND RESULTS
To verify our proposed design, we implemented it on a Xilinx Virtex-6 FPGA device (xc6vlx760). The proposed design was implemented using the Xilinx ISE 14.7 design tool, and verified through behavioral and post-route simulations using the ISim simulator.
We implemented our proposed design cases I and II on the Xilinx Virtex-6 FPGA device for 512 × 28 (14 Kb) and 512 × 32 (16 Kb) TCAM tables, with multipumping factors of $P = 4$ and $P = 2$, respectively. Our proposed design CASE-III implements a large TCAM table of size 1024 × 140 (140 Kb), with a multipumping factor of $P = 4$. We have selected small multipumping factors of $P = 4$, 2, and 4 in our proposed design cases I, II, and III, to avoid lower operating frequencies of the overall system. Table 2 lists the FPGA resource utilization (slice registers (SRs), look-up tables (LUTs), and BRAMs) for the implementation of our proposed design cases I, II, and III. The post place & route results show that the proposed design cases I, II, and III achieve internal clock frequencies of 475 MHz, 475 MHz, and 349 MHz with multipumping factors of $P = 4$, 2, and 4, giving system clock frequencies of 119 MHz, 237 MHz, and 87 MHz, respectively.
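The system clock frequencies follow from dividing each internal clock frequency by the corresponding multipumping factor. As a worked check of the figures quoted above (values in MHz):

\[
f_{clk_S} = \frac{f_{clk_P}}{P}: \qquad \frac{475}{4} = 118.75, \qquad \frac{475}{2} = 237.5, \qquad \frac{349}{4} = 87.25
\]

These agree with the reported 119 MHz, 237 MHz, and 87 MHz up to rounding.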
V. PERFORMANCE EVALUATION
The performance of our proposed design is evaluated based on its comparison with the existing SRAM-based TCAM solutions on FPGAs.
A. SRAM MEMORY UTILIZATION
SRAM-based TCAM solutions implement a traditional TCAM of depth $D$ and width $W$ by cascading SRAM blocks of size $R_D \times R_W$ on FPGAs. The minimum overall SRAM memory requirement of the existing SRAM-based TCAM solutions on FPGAs can be formulated as (1) shown below:
The overall memory requirement of the proposed design for the implementation of a $D \times W$ size traditional TCAM using $R_D \times R_W$ size SRAM blocks is devised as (2) shown below:
Equation (2) describes the SRAM memory usage of our proposed design. Our proposed design achieves a considerable reduction in the SRAM memory usage when compared with that of the existing approaches, as described using (3) as follows:
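The expressions referred to as (1) to (3) are not reproduced in this text. One form that is consistent with the block sizes stated in Section III is sketched below; it is a reconstruction under that assumption and may differ from the authors' exact notation. The first line counts the SRAM bits when each $R_D \times R_W$ block emulates $\log_2 R_D$ pattern bits, the second when each block emulates $P\log_2(R_D/P)$ pattern bits, and the third is their ratio with the ceiling effects ignored.

\begin{align*}
\text{Memory}_{\text{existing}} &= \left\lceil \frac{D}{R_W} \right\rceil \times \left\lceil \frac{W}{\log_2 R_D} \right\rceil \times R_D \times R_W \\
\text{Memory}_{\text{proposed}} &= \left\lceil \frac{D}{R_W} \right\rceil \times \left\lceil \frac{W}{P\log_2(R_D/P)} \right\rceil \times R_D \times R_W \\
\frac{\text{Memory}_{\text{existing}}}{\text{Memory}_{\text{proposed}}} &\approx \frac{P\log_2(R_D/P)}{\log_2 R_D}
\end{align*}

For example, with $R_D = 512$ and $P = 4$, the ratio evaluates to $28/9 \approx 3.1$.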
The usage of BRAMs in our proposed design is compared with those of previous approaches in Column 5 of Table 3. Our proposed TCAM design CASE-I emulates a 14 Kb traditional TCAM using 8 BRAMs, compared with 56, 40, 40, 32, and 64 BRAMs for the previous approaches in [10]-[12], [22], and [26], respectively, for an 18 Kb traditional TCAM emulation. The proposed design CASE-III emulates a large TCAM of size 1024 × 140 using 80 BRAMs. It achieves a lower BRAM utilization compared with the large TCAM implementations of size 1024 × 150 and 504 × 180 in the previous approaches [7] and [17], which use 272 and 140 BRAMs, respectively.
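The BRAM counts can be approximated with a short Python sketch that multiplies the number of rows, ceil(D / R_W), by the number of columns, ceil(W / (P·log2(R_D/P))). The function name `bram_count` is hypothetical, and the usable word width of 64 bits is an assumption made here for illustration: with it, the computed counts coincide with the 8 and 80 BRAMs quoted above, whereas a width of 72 would yield 75 for the 1024 × 140 case.

```python
from math import ceil

def bram_count(tcam_depth, tcam_width, ram_depth, word_width, factor):
    """Rows * columns of TCAM memory units needed for a D x W TCAM."""
    sub_depth = ram_depth // factor
    bits_per_block = factor * (sub_depth.bit_length() - 1)  # P * log2(R_D / P)
    columns = ceil(tcam_width / bits_per_block)             # N
    rows = ceil(tcam_depth / word_width)                    # M
    return rows * columns

# word_width=64 is an assumed value, not stated in the text.
print(bram_count(512, 28, 512, 64, 4))     # 512 x 28 table, P = 4
print(bram_count(1024, 140, 512, 64, 4))   # 1024 x 140 table, P = 4
```

The same function with `factor=1` gives the block count of a design without multipumping, which makes the effect of P on the number of columns easy to compare.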
B. THROUGHPUT
The operational speed of our proposed design is compared with those of previous approaches in Column 4 of Table 3. Our proposed design cases I and II emulate traditional TCAMs of size 14 Kb and 16 Kb, achieving operating frequencies of 119 MHz and 237 MHz with multipumping factors of $P = 4$ and 2, respectively. The operating frequency of our proposed design CASE-II is higher than those of the previous works [10]-[12], [22], and [26] for an 18 Kb traditional TCAM emulation. Our proposed design methodology is more useful for the design of large storage capacity TCAMs. The TCAM memory units of our proposed design AND-accumulate SRAM words from the sub-blocks of the SRAM blocks in each system cycle, reducing the complexity of the AND operation units of the overall architecture, as shown in Figures 4 and 5. This further prevents the AND operation units from limiting the operating frequency of wide-pattern TCAM designs on FPGA. Our proposed design uses fewer BRAMs, thus alleviating the overall routing complexity of the design on FPGA. The divided AND operation complexity and the reduced routing complexity make our proposed design more practical for large storage capacity TCAMs.
The system frequency of our proposed design CASE-III emulating a large capacity TCAM of 140 Kb is 87 MHz, which is comparable with the maximum achievable frequency of 97 MHz in the previous work [7] implementing a large TCAM of 150 Kb, while the SRAM memory usage of our proposed design CASE-III is 70% lower than that of [7].
Our proposed design provides increased design flexibility in terms of the speed vs memory usage tradeoff. The designer must consider the important design factors such as the required storage capacity, relative availability of BRAMs on the target FPGA, and required throughput for the selection of the multipumping factor in our proposed design.
C. PERFORMANCE PER MEMORY
Considering the time-space tradeoff, we used the performance evaluation metric performance per memory from [31], given by (4). Table 3 shows that the performance per memory of the proposed design cases I and II is 1.83 times higher than that of UE-TCAM [6], which was the highest among the existing methods. Our proposed design CASE-III emulates a large TCAM of size 1024 × 140, achieving a performance per memory of 4.25 ((Gb/s × TCAMDepth)/Kb), which is 2.85 times higher than that of the large TCAM of size 1024 × 150 in the existing study [7].
Our proposed design scales well in terms of the performance when evaluated for the design of a large storage capacity. Table 3 shows that the performance per memory of our proposed design CASE-III is slightly lower than the proposed design CASE-I (with the same multipumping factor of P = 4) while the implemented TCAM size of CASE-III is ten times greater than that of CASE-I.
VI. CONCLUSIONS AND FUTURE WORK
Reconfigurable hardware such as FPGAs emulates TCAM functionality using SRAM memory. Existing SRAM-based solutions of TCAM on FPGAs suffer from inefficient memory usage and offer lower operational frequencies. We have presented a memory-efficient design of TCAM, based on multipumping-enabled multiported SRAM, by operating the SRAM blocks in the design at a frequency that is multiple times higher than that of the overall system. This allows reading from its sub-blocks to take place within one system cycle. The FPGA implementation results show that the performance per memory of our proposed design is up to 2.85 times higher than for existing SRAM-based TCAM solutions on FPGA.
Our proposed solution is general, and can be applied to many applications. Our future work will include the application of the proposed design to various applications.
