I. INTRODUCTION
Packet forwarding in network search engines is the process by which the next-hop address of every incoming packet is determined from the packet's final destination address. Packet forwarding in a Classless Inter-Domain Routing (CIDR) environment requires finding the longest prefix that matches the destination address, because CIDR allows variable prefix lengths.
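As a software illustration of the longest-prefix requirement described above, the following minimal sketch scans a small table and keeps the most specific matching prefix. The table contents, next-hop labels and the function name `longest_prefix_match` are hypothetical and serve only as an example; a hardware search engine does not work by linear scan.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return the next hop of the longest prefix covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    best = None  # (prefix length, next hop) of the best match seen so far
    for prefix, next_hop in table.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            if best is None or network.prefixlen > best[0]:
                best = (network.prefixlen, next_hop)
    return None if best is None else best[1]


# Hypothetical forwarding table: CIDR prefix -> next-hop label.
table = {
    "10.0.0.0/8": "A",
    "10.1.0.0/16": "B",
    "10.1.2.0/24": "C",
}

print(longest_prefix_match(table, "10.1.2.3"))
```

Here the address 10.1.2.3 is covered by all three prefixes, and the /24 entry is selected because it is the most specific one.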
A. Prior work
Several software and hardware schemes have been proposed for longest matching prefix (LMP) search, such as trie-based lookup, Patricia trees, prefix expansion and various Content Addressable Memory (CAM) based packet forwarding schemes [1] [2] [3]. All of these schemes mainly aim to improve the overall search time, which is often achieved at the cost of a longer forwarding table update time.
Nowadays, more hardware architects than ever look into CAM for high-performance table lookup tasks [4] [5] [6]. A particularly interesting type of CAM, called ternary CAM (TCAM), can store don't-care values in addition to 0's and 1's. Using this capability, TCAM entries can include wildcards. Because of the wildcards, a search key may match multiple entries. In this case, a TCAM with properly structured content can produce the highest-priority (or most specific) result. TCAM completes each lookup in just one clock cycle. While simplicity and high performance are the main reasons designers choose TCAM for hardware-based search applications, high power dissipation, low storage density and the sorting requirement remain major concerns with this technology, making it a hot topic of ongoing research in both industry and academia. A few recent works are presented in [7], [8], [9] and [10].
In a TCAM-based implementation, replacing any prefix in the forwarding table takes O(n) shifts (for a table of n entries) to place the new entry in its correct location. This is because the forwarding table entries must be kept sorted so that the longest prefix has the highest priority. This may lead to a very high update time and stall the packet forwarding process. One technique to alleviate this problem is to leave predetermined empty spaces between the table entries so that any new entry can be placed directly in those empty spaces. The drawback of such a technique is wasted memory space. Moreover, the main problem does not go away. It is simply reduced from global sorting to local sorting, which makes the performance of the system dependent on the traffic. To have a consistent throughput, regardless of traffic, the update time of the forwarding table should be guaranteed to be as small as possible [11].
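The O(n) update cost mentioned above can be illustrated with a small software model. This is only a sketch under the assumption that the table is a plain list kept in descending prefix-length order; the function name `insert_sorted` and the entry labels are hypothetical.

```python
def insert_sorted(table, prefix_len, entry):
    """Insert (prefix_len, entry) into a list kept in descending prefix-length order.

    Returns the number of existing entries that had to move down by one slot.
    """
    position = 0
    # Skip every entry whose prefix is at least as long as the new one.
    while position < len(table) and table[position][0] >= prefix_len:
        position += 1
    table.insert(position, (prefix_len, entry))
    # Everything after the insertion point was shifted by one slot.
    return len(table) - 1 - position


# Hypothetical sorted table: longest prefixes first.
table = [(24, "c"), (16, "b"), (8, "a")]
shifts_for_long_prefix = insert_sorted(table, 32, "d")
shifts_for_short_prefix = insert_sorted(table, 8, "e")
print(shifts_for_long_prefix, shifts_for_short_prefix)
```

In this model, inserting a prefix longer than all stored ones moves every existing entry, which is the worst case that stalls forwarding in a sorted TCAM.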
Kobiyashi et al. proposed a Vertical Logical operation with Mask-encoded Prefix (VLMP) based search engine for wire-speed packet processing of an OC-192 link [12]. This architecture expands upon a CAM architecture but is not made up of actual CAM cells. It uses registers and combinational logic to achieve CAM-like functionality. This architecture also allows forwarding table entries to be stored in random order, thus reducing the update time. Its main disadvantages are that it cannot achieve the same prefix search time as the conventional TCAM-based architecture and that its area overhead compared to the conventional TCAM architecture is very high. Another technique that targets update time reduction is the PLO OPT algorithm [13]. This algorithm places all unused entries in the center of the TCAM such that the first half of the longest prefixes are always above the free space and the other half are always below it. Adding or deleting a prefix requires swapping at most n/2 memory entries, where n is the number of entries stored. The drawbacks of such an implementation are the time taken to swap the entries and the unused TCAM space.
B. Main Contribution
We propose a reconfigurable CAM (RCAM) cell that can selectively switch its functionality between a binary CAM (BCAM or simply CAM) and a ternary CAM (TCAM) for any bit position. This configurability makes the contribution of our architecture twofold. First, it allows a user to configure the module as a full CAM (e.g. of size 2n × w bits), a full TCAM (of size n × w bits) or a hybrid module (of size k × w bits, where n ≤ k ≤ 2n). In the hybrid mode, the behavior (CAM or TCAM) can be defined for each bit or word position, which is an attractive option for power minimization of TCAM-based search engines. Second, the masking operation in conventional TCAM is replaced by a novel wired-AND operation. Effectively, our wired-AND technique replaces a longest matching prefix operation with an exact matching mask operation. Applying the wired-AND technique completely eliminates the need for a priority encoder and the sorting requirement during updates. Eliminating these two means a significant reduction in power and cost and an increase in overall throughput. All of this is achieved with negligible overhead, i.e. 5.6% more area than the conventional 9 transistor (9T) TCAM unit of the same size and performance.
1-4244-9707-X/06/$20.00 ©2006 IEEE
II. BACKGROUND
A. Binary CAM Cell
Figure 1(a) shows the basic cell structure of a conventional binary CAM. It consists of three parts: a 6T SRAM cell, 2T XOR logic and 1T evaluation logic. The data (D and DB) is read from or written into the SRAM bit of the CAM cell using the bit lines (BL and BLB). The search key is supplied through the comparand lines (CMP and CMPB) and fed to two NMOS transistors that behave like an XOR gate. The word line (WL) is activated whenever data is read from or written into the SRAM bit. The match line (ML) is precharged for every search operation. If there is a match, i.e. if the incoming comparand bit is the same as the stored data bit, the charge of the match line (ML) is retained. Otherwise, ML discharges through the path created by the mismatch. Overall, the boolean function of ML can be expressed as:
$$ML = \overline{D \oplus CMP} \qquad (1)$$
where D and CMP are the data and comparand bits, respectively.
B. Ternary CAM Cell
Figure 1(b) shows the basic TCAM cell structure. The main difference between binary CAM and ternary CAM is the use of additional "don't-care" bits. This additional state enables TCAM to perform a partial match of the word, as opposed to binary CAM, which performs only an exact match between the comparand and data bits. Typically, a ternary CAM is a 16T structure and consists of three parts: a 9T CAM bit, a 6T SRAM bit and 1T evaluation logic. The bit lines (BL and BLB), comparand lines (CMP and CMPB), word line (WL) and match line (ML) have the same functionality as in binary CAM. The mask word line (MWL) is activated whenever the mask bit is read or written. Throughout this paper we assume that the mask bit M = 1 indicates a don't-care bit. Thus, the corresponding bit is masked and does not affect the result of the comparison. The boolean equation for the match line in ternary CAM can be written as:
$$ML = \overline{(D \oplus CMP) \cdot MB} = M + \overline{D \oplus CMP} \qquad (2)$$
where D is the data bit, CMP is the comparand bit, and M and MB are the mask bit and its complement, respectively. Equation 2 also shows that the two transistors in the evaluation block (shown also as E in some figures in this paper) form a NAND gate whose two inputs are D ⊕ CMP and MB.
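The match-line behavior of the two cells can be sketched in software as follows. This is a minimal bit-level model, assuming that a retained (high) match line is represented by 1 and that a word matches only when every cell on the shared match line reports a match; the function names are hypothetical and no circuit-level effects are modeled.

```python
def cam_match(data_bit, cmp_bit):
    """Binary CAM cell: ML stays high (1) only if stored and comparand bits agree."""
    return 1 - (data_bit ^ cmp_bit)


def tcam_match(data_bit, mask_bit, cmp_bit):
    """Ternary CAM cell: NAND of (D xor CMP) and MB, with M = 1 meaning don't-care."""
    mask_bar = 1 - mask_bit
    return 1 - ((data_bit ^ cmp_bit) & mask_bar)


def word_match(data, mask, key):
    """A word matches when no cell discharges the shared match line."""
    return all(tcam_match(d, m, k) == 1 for d, m, k in zip(data, mask, key))


# Hypothetical 4-bit word whose two low-order bits are masked (don't-care).
print(word_match([1, 0, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]))
```

With the two low-order bits masked, the key matches even though it differs from the stored data in those positions; changing one of the unmasked bits would discharge the match line.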
III. RECONFIGURABLE CAM CELL
Reconfigurable Content Addressable Memory (RCAM) is an 18T TCAM cell targeting reconfiguration and reusability. Figure 2 shows the basic structure of an RCAM cell. It is made up of two 9T CAM cells along with one extra reconfiguration transistor whose gate is controlled by R. In general, the reconfiguration and precharge transistors are shared among bits and thus are drawn outside the RCAM cell area. The main difference between RCAM and the conventional TCAM is that both the data and mask SRAM bits of the RCAM cell are embedded with the 2T XOR logic and 1T evaluation logic. Consequently, there are two individual CAM bits, as opposed to the conventional TCAM cell, which has a CAM data bit and an SRAM mask bit.
A. Operation
The RCAM has two separate pairs of comparand lines, i.e. CMP0/CMPB0 in CAM0 for the even-numbered words and CMP1/CMPB1 in CAM1 for the odd-numbered words. The sources of the evaluation transistors of both CAM bits are tied to the drain of the reconfiguration transistor. The drains of the evaluation transistors are connected to the match lines ML0 and ML1, respectively. The controlling input R of the reconfiguration transistor reconfigures the RCAM cell into either two independent CAM cells (R = 1) or one TCAM cell (R = 0). Depending on the flexibility desired, we can have one R for the entire RCAM module or one R for each row or column. For simplicity, we assume there is only one control R per memory unit. Assume that the data and mask bit values are already written into the SRAM bits using the word lines (WL0 and WL1) and the bit lines (BL and BLB). The operation of the RCAM cell can be explained as follows.
1) CAM Mode (R = 1): Both match lines ML0 and ML1 in RCAM should be precharged before evaluation in Figure 1 (b). Therefore, CAM0 and CAM1 together form a TCAM cell. The boolean equation for the main match line (ML0) in RCAM can be written as:
where all signals are shown in Figure 2. When R = 1, we have $ML0 = \overline{D0 \oplus CMP0}$ and $ML1 = \overline{D1 \oplus CMP1}$, which are identical to Equation 1. When R = 0 and CMP1 = 1, we have $ML0 = ML1 = D1 + \overline{D0 \oplus CMP0}$, which is identical to Equation 2 when D1 is considered as the mask bit. For consistency, we consider ML0 as the final output of the cell when it operates in TCAM mode. For clarity, the operation of the RCAM cell in both CAM and TCAM modes is summarized in Table I.
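The two special cases just described can be captured in a small behavioral sketch. It assumes that a match is represented by a high (1) match line, as in Section II, and it models only the two cases stated in the text (R = 1, and R = 0 with CMP1 = 1); the function name `rcam_cell` is hypothetical and the general match-line equation is not reproduced here.

```python
def rcam_cell(r, d0, cmp0, d1, cmp1=1):
    """Return (ML0, ML1) for one RCAM cell in the two cases described in the text."""
    match0 = 1 - (d0 ^ cmp0)  # CAM0 compares its stored bit with CMP0
    if r == 1:
        # CAM mode: two independent binary CAM cells.
        match1 = 1 - (d1 ^ cmp1)
        return match0, match1
    if cmp1 != 1:
        raise ValueError("this sketch only models TCAM mode with CMP1 = 1")
    # TCAM mode: D1 acts as the mask bit (1 = don't-care).
    ml = d1 | match0
    return ml, ml


print(rcam_cell(1, 1, 1, 0, 1))  # CAM mode: CAM0 matches, CAM1 does not
print(rcam_cell(0, 1, 0, 1))     # TCAM mode: mismatch hidden by the mask bit
```

In TCAM mode the mismatch between D0 and CMP0 is ignored whenever D1 = 1, which is the masking behavior of a conventional TCAM cell.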
B. Applications
In what follows we briefly discuss three applications that benefit from hybrid binary and ternary CAM cells in terms of cost, power and update time.
Classifier Engines:
Packet classification in general refers to finding, for a given search key, the best matching rule containing multiple fields among the rule set [16]. Applying such a scheme requires a mix of CAM (e.g. for the first 8 bits) and TCAM (for the rest), for which RCAM is a preferred choice.
Engines with no Priority Logic:
The RCAM architecture has the ability to switch between TCAM and CAM functionality. This feature is exploited here by generating the maximal match through exact matching within the CAM bits that hold the masking information. Our architecture, therefore, completely eliminates the prioritizer block, the priority encoder and the need to sort entries. This is made possible by cells that can be configured as a CAM or TCAM in each cycle. The detailed operation of the RCAM architecture is explained in Section IV.
IV. RCAM ARCHITECTURE

A. Architectural Details
The RCAM architecture uses the RCAM cell as its basic building block.
1) CAM Mode (C/T = 0): R = 1 (see Figure 2). Each RCAM cell is equivalent to two independent CAM cells. The resulting outputs are $ML_{even}$ and $ML_{odd}$, which in turn are fed to the corresponding n-input encoders. Together, the two encoders (for odd and even cells) choose one out of 2n words in the RCAM block.
2) TCAM Mode (C/T = 1): There are two possible schemes to make the RCAM cell behave as a TCAM.
(i) R = 0, with a conventional n-input priority encoder. This scheme has already been mentioned in Table I and, while possible, is not of interest to us as it offers no advantage over conventional TCAM.
(ii) R is 0 and 1 in the first and second cycles, respectively. The RCAM architecture in this mode takes two search cycles to compute the longest prefix match. In this case, ML1 and thus $ML_{odd}$ are tied to ground. In this scheme, we use the wired-AND strategy and an n-input regular encoder. This scheme offers a significant reduction of update time (to be discussed in Section IV-C).
The behavior of the RCAM architecture is summarized in Table II. Note that in this paper we focus on TCAM(ii), which employs the wired-AND technique.
More details about the control signals (shown in Figure 3) in TCAM mode are shown in Figure 4. In this figure, Vdd and Gnd denote precharging toward Vdd and tying to ground, respectively. The shaded boolean function at the end of the second cycle indicates how the final signal is obtained. The details of the blocks and control signals can be found in [18].
B. Wired-AND Technique
The principal difference between RCAM and the conventional TCAM is the way in which the longest prefix is determined. Traditionally, TCAM-based packet forwarding engines determine the longest prefix entry by sorting the forwarding table and selecting the longest prefix from multiple matches using a priority encoder (PE). The wired-AND technique used in our architecture completely eliminates the need for any sorting or priority encoding. In the wired-AND technique, selected bits in the same column of different rows are read simultaneously on the same BL/BLB wire by activating their corresponding word lines.
In a conventional TCAM, if we read data bits simultaneously using the BL/BLB lines, the output will be either a zero or a one depending on the strength of the equivalent pull-up or pull-down logic formed by the bits being read. To obtain wired-AND logic when multiple rows are read on the same line, we size the related transistors in the RCAM cells so that a single pull-down can withstand the strength of r parallel pull-up transistors. Therefore, the pull-down transistor in the SRAM of the mask bit is sized r times its original value to counter this rare worst case. From the implementation point of view, resizing cells for 2 ≤ r ≤ 16 is quite straightforward. From a practical point of view, researchers have found that in networking applications the maximum number of prefixes that can match simultaneously in a forwarding table is quite small. Based on empirical data, the authors of [19] reported this number to be 6. Similarly, the authors of [20] found the highest number of multi-match prefixes to be eight across 112 ACLs in a router database with a total of 215K rules. The simulation results show that the distribution of multi-match prefixes per search was mainly concentrated at 3 or 4 matches. To be on the safe side we chose r = 8. In applications that may require a larger r, the pull-down transistors can be easily tuned. Figure 5 shows the bit-slice of the RCAM architecture, which incorporates this wired-AND technique.
C. Update Time
The behavior of the RCAM architecture is summarized in Table II. The longest prefix match operation in RCAM, when operating as a TCAM, takes two search cycles. The main advantage of this scheme compared to the conventional TCAM architecture is the drastic reduction of total update time. The low update time of the RCAM forwarding table is a direct consequence of allowing the prefixes to be stored in any order, which eliminates the time consumed by sorting. To improve the table update time, some TCAM-based forwarding techniques partition the forwarding table according to prefix length, leaving empty spaces at the end of each partition for the insertion of new entries. This technique does not require frequent re-sorting, but if the empty spaces for any one prefix length fill up, the TCAM forwarding table has to be sorted. In general, any forwarding table architecture that uses a priority encoder to find the longest prefix match has to sort its table one way or another. Our RCAM does not face this problem since it allows prefixes to be stored in any order and does not need any priority logic.
To perform longest prefix matching, each prefix entry, which is a combination of a data word and a mask word, is stored in the even and odd words of the RCAM block, respectively. During the first search operation, RCAM is configured to work in TCAM mode and finds all possible matches for the key presented. In the second search, RCAM identifies the longest prefix match indirectly through the corresponding mask word. In other words, we first AND all mask words corresponding to the matches obtained in the previous search and then search the odd cells (home of the masks) for an exact match (CAM mode).
To make this clear, let us assume that in the first search operation there were three matches corresponding to the mask words $D1_1 = 001111$, $D1_2 = 000111$ and $D1_3 = 000001$. In this example, obviously, $D1_3$ corresponds to the longest prefix. The logical AND of these three mask words generates $\prod_{i=1}^{3} D1_i = 000001$. In the second round, we search for the mask words that exactly match 000001, which indirectly gives us the longest prefix match. In our implementation, the wired-AND operation generates $\prod_{i=1}^{3} D1_i$. The bit-slice of the RCAM architecture incorporating the wired-AND scheme is shown in Figure 5. The multiplexers in the bit-slice architecture are used to multiplex the comparand line inputs (CMP1 and CMPB1) according to the mode of operation. In CAM mode (C/T = 0) a normal comparand (key) is presented, and in TCAM mode (C/T = 1) the wired-AND of the selected mask words is fed into the comparand lines.
Note that in TCAM mode, the odd CAM cells (i.e. $D1_i$) hold the mask data. Also note that, since the wired-AND of the DB1 lines produces $\prod_i DB1_i$ (which is different from the complement of $\prod_i D1_i$), we use an inverter to generate CMPB1 directly.
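The two-cycle search described above can be sketched in software as follows. This is a behavioral sketch only, assuming that each entry is a (data, mask) pair of integers with mask bit 1 meaning don't-care, and that the exact-match step in the second cycle is restricted to the rows that matched in the first cycle; the function name, the stored data words and the key are hypothetical, while the three mask words are those of the example.

```python
def rcam_longest_prefix(entries, key):
    """Return the index of the longest matching prefix, or None if nothing matches.

    entries: list of (data, mask) integers stored in any order.
    """
    # Cycle 1 (TCAM mode): collect every row whose unmasked bits equal the key.
    matches = [i for i, (data, mask) in enumerate(entries)
               if (data ^ key) & ~mask == 0]
    if not matches:
        return None
    # Wired-AND of the mask words of all matching rows.
    combined = entries[matches[0]][1]
    for i in matches[1:]:
        combined &= entries[i][1]
    # Cycle 2 (CAM mode): exact match of the combined mask against stored masks.
    for i in matches:
        if entries[i][1] == combined:
            return i
    return None


# Unsorted 6-bit table; rows 0 to 2 use the mask words of the example above.
entries = [
    (0b100000, 0b001111),
    (0b101100, 0b000001),
    (0b101000, 0b000111),
    (0b111100, 0b000001),  # same mask as row 1 but different data
]
print(rcam_longest_prefix(entries, 0b101101))
```

Because the masks of prefixes that match the same key are nested (each is a run of trailing ones), their AND equals the mask with the fewest don't-care bits, so the exact match selects the longest prefix without any ordering of the rows.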
V. TIME ANALYSIS
A. Search Time
Although the RCAM architecture takes two cycles to find the longest prefix, the overall search time is not doubled compared to the conventional TCAM architecture. The reason is the removal of the prioritizer circuit from the critical path of the RCAM architecture. Analytically, the overall search time of the RCAM architecture is given by:
$$t_{\mathrm{rcam\_search}} \approx 2\,t_{\mathrm{rcam\_block}} + t_{\mathrm{encoder}} \qquad (5)$$
where $t_{\mathrm{rcam\_block}}$ is the delay of the RCAM block (n × w cells) and $t_{\mathrm{encoder}}$ is the delay of a regular n-input encoder logic. The analytical expression for the overall search time of a conventional TCAM architecture is given by:
$$t_{\mathrm{tcam\_search}} \approx t_{\mathrm{tcam\_block}} + t_{\mathrm{priority\_encoder}} \qquad (6)$$
where $t_{\mathrm{tcam\_block}}$ is the delay of the TCAM block (n × w cells) and $t_{\mathrm{priority\_encoder}}$ is the delay of an n-input priority encoder logic.
A straightforward VLSI implementation of these components indicates that [17]:
$$\frac{t_{\mathrm{rcam\_search}}}{t_{\mathrm{tcam\_search}}} \approx 1.25$$
which means the search time of RCAM will be approximately 25% longer than that of conventional TCAM. In spite of the longer search time, in the next subsection we show that the overall performance of the system improves significantly.
B. Update Time
The overall speedup of RCAM architecture compared to TCAM depends on both search and update time. In [18] we have analytically shown that the overall speedup of RCAM with respect to TCAM is given by:
where $T_{TCAM}$ ($T_{RCAM}$) is the total time taken for $N_S$ searches and $N_U$ updates for TCAM (RCAM), $S_{avg}$ is the average number of shifts during the update time and
. Note that no sorting (shifting) is needed in RCAM. For large tables (e.g. $10^3 \le N_U \le 10^4$ and $100 \le \alpha \le 1000$) the RCAM structure will outperform TCAM by one to two orders of magnitude (a speedup of 10 to 100).
VI. EXPERIMENTAL RESULTS
The RCAM cell, along with the conventional CAM and TCAM cells, was implemented in 0.18 µm digital CMOS technology using Cadence tools [21]. The RCAM cell was simulated using Spice [14] for all possible cases in both the CAM and TCAM configurations, and the results were extensively reported in [18].
We have summarized some power, performance and area metrics in Table III. According to Spice simulation [14], the delay of an RCAM cell is slightly better than that of CAM and TCAM because the addition of the reconfiguration transistor forms new parallel paths that decrease parasitic resistances. However, this negligible gain in the search speed of one cell alone will not have a significant effect on the overall performance. This is because in the RCAM/TCAM word architecture the overall performance is mainly dictated by the time taken to charge/discharge the highly capacitive match line of the entire word and by the update time. The area per data bit, reported in Table III, indicates that the RCAM cell consumes 5.6% and 11.8% more silicon than TCAM and CAM, respectively. The power (per data bit) consumed by RCAM working as CAM and as TCAM is also higher than the power of a 1-bit CAM/TCAM due to the extra transistors. However, similar to our argument on performance, the increase in power consumption for an individual RCAM cell will be offset by the power saved from the elimination of the priority encoder and the update cycles in the RCAM architecture. Table IV shows the delay, area and power numbers of a 4Kb (i.e. 128 × 32) RCAM architecture in both modes compared to two implementations reported in the literature. Due to differences in technology, size, cell library and design objectives (e.g. area, power), a direct comparison of these implementations is not possible. Our main goal in showing Table IV is to illustrate that our implementation has comparable size, performance and power while providing the reconfiguration capability. In particular, with the priority encoder eliminated, the RCAM unit runs at 162.7 MHz in TCAM mode, i.e. as expected, almost half of its operating frequency in CAM mode.
This is consistent with the analytical discussion in Section V where we argued that after removing the priority encoder, the TCAM mode operation is done in two cycles as opposed to one cycle in CAM mode.
In terms of area, in spite of the area increase in each individual cell, the overall area actually improves as the priority encoder is replaced with a regular binary encoder. The power-performance metric was also evaluated for both modes of RCAM and is found to be comparable with other existing architectures [22] [23] [24]. Unfortunately, we cannot directly compare RCAM in TCAM mode with the TCAM units reported in these references because of the difference in implementation technology (0.18 µm versus 0.13 and 0.35 µm) and size (2Kb versus 4 to 36Mb). In general, RCAM in TCAM mode, with 42.08 fJ/bit/search, shows a 6.5% higher power-performance value but achieves a maximum speed of 162.7 MHz, which is 15.1% higher performance than conventional TCAMs.
