Content Addressable Memory (CAM) is widely used in applications where searching for a specific data pattern is a major operation. Conventional CAMs suffer from area, power, and speed limitations. We propose Spin-Torque Transfer RAM (STTRAM)-based Ternary CAM (TCAM) cells. The proposed NOR-type TCAM cell uses 62.5% (33%) fewer transistors than conventional CMOS TCAMs (spintronic TCAMs). We analyzed the sense margin of the proposed TCAM for 16-, 32-, 64-, 128-, and 256-bit word sizes in a 22nm predictive technology. Simulations indicate a reliable sense margin of 50mV even at a 0.7V supply voltage for a 256-bit word. We also explored selective threshold voltage modulation of transistors to improve the sense margin and tolerate process and voltage variations. The worst-case search latency and sense margin of the 256-bit TCAM are found to be 263ps and 220mV, respectively, at a 1V supply voltage. The average search power is 13mW, and the search energy is 4.7fJ per bit searched. The write time is 4ns, and the write energy is 0.69pJ/bit. We leverage the NOR-type TCAM design to realize a 9-transistor, 2-Magnetic Tunnel Junction (9T-2MTJ) NAND-type TCAM cell that has 43.75% fewer transistors than the conventional CMOS TCAM cell. A NAND-type cell can support words of up to 64 bits with a maximum sense margin of 33mV. We compare the performance metrics of the NOR- and NAND-type TCAM cells with other TCAMs in the literature.
INTRODUCTION
Content Addressable Memory (CAM) finds numerous applications in pattern matching, Internet data processing, packet forwarding, storage of tag bits in processor caches, and associative memory. The content-search functionality of a CAM requires comparison circuitry integrated with the memory cell [Pagiamtiz et al. 2006]. The comparator, in addition to the memory element, adds area and power overhead in CAMs.

This article is based on work supported by Semiconductor Research Corporation (No. 2442.001) and National Science Foundation (CNS-1441757). We sincerely thank Dr. Srinivas Katkoori, Dept. of Computer Science and Engineering at University of South Florida, for his continued support in this research. This submission is an extended version of [Govindaraj et al. 2015] and contains more than 30% new material. Authors' addresses: R. Govindaraj, Computer Science and Engineering Department, University of South Florida, Tampa, FL 33620; email: rekhag@mail.usf.edu; S. Ghosh, School of Electrical Engineering and Computer Science, Pennsylvania State University, State College, PA 16801; email: szg212@psu.edu.

Fig. 1. (a) MTJ-based TCAM [Matsunaga et al. 2012]; (b) schematic of DWM cells; (c) DWM CAM realization [Zhang et al. 2012].
The need to store and match "don't care" requires two storage bits, which further worsens the area overhead. CMOS CAM is power hungry because of the power consumed in the Match Line (ML) and search lines and the leakage of the bit cells. In nanometer technologies, leakage power constitutes a major fraction of the total power consumed in a CAM [Karam et al. 2015]. Non-volatile technologies that are more area efficient than SRAM and can also provide zero leakage are attractive in such a scenario [Pagiamtiz et al. 2006; Shen et al. 2006]. The area efficiency and non-volatility of Spin-Torque Transfer RAM (STTRAM)-based ternary CAM make it well suited to on-chip CAM applications. Numerous works have demonstrated CAMs built from non-volatile memory technologies such as memristors [Junsangsri et al. 2012], Resistive RAM (ReRAM) [Li et al. 2014], and spintronic elements such as Domain Wall Memory (DWM), Magnetic Tunnel Junctions (MTJ), and STTRAM [Xu et al. 2008; Nebashi et al. 2011; Zhang et al. 2012; Chen et al. 2013]. The memristor-based NOR Ternary Content Addressable Memory (TCAM) [Junsangsri et al. 2012] uses a voltage divider network formed by the memristors to enable the path that discharges the match line depending on match/miss. It incurs a high write delay of up to 800ns, which hurts table-update performance in network router applications. The ReRAM-based TCAM [Li et al. 2014] employs a special clocked self-referenced sensing scheme that complicates the memory system design because an additional reference ML is required. The efficient integration of spintronic devices with CMOS technology has motivated us to employ them in TCAM cell design. Moreover, the principles of bit storage and the writing methodologies of memristors and ReRAM differ from those of spintronic devices such as MTJ and DWM, which prevents memristive and ReRAM TCAM designs from being extended directly to spintronic TCAMs.
Spintronic CAMs using MTJ and DWM (Figure 1) suffer from issues such as larger area, unreliable write operation, high search delay, and high power consumption compared to CMOS CAMs [Karam et al. 2015]. The MTJ-based 4T-2MTJ TCAM design (Figure 1(a)) [Matsunaga et al. 2012] is area and power efficient; however, its sensing technique is based on the proportional total current drawn from the ML by the bit cells in a word. Further, the circuit uses both NMOS and PMOS transistors. With this technique, the sense margin decreases significantly as the number of bits increases in the worst-case search of a single-bit miss, so the design is not scalable beyond a certain word length (144 bits). Figure 1(b) shows DWM cells in a memory architecture that consists of read heads along a nanowire storing data bits. The DWM-based TCAM shown in Figure 1(c) uses 12 transistors. Therefore, deploying non-volatile memory in CAM requires effort to achieve a smaller footprint and better performance in terms of search delay, write delay, write power, and search power. A NOR-type MTJ-based TCAM was proposed in [Govindaraj et al. 2015] that can support wider CAM words while being tolerant to voltage and temperature variations. However, it is susceptible to a poor sense margin under process variations. In this article, we propose Search Enable (SE) modulation to improve the sense margin under inter-die process variation, a multi-$V_{TH}$ design to improve the sense margin further, and a NAND-type TCAM that builds on the NOR TCAM design. The proposed NOR-type TCAM cell employs only 6 transistors and 2 MTJs instead of the 16 transistors of a CMOS TCAM cell, and the NAND TCAM cell uses 9 transistors and 2 MTJs compared to a NAND CMOS TCAM cell, which uses 16 transistors. In the following paragraphs, we provide an overview of the CAM and STTRAM technologies.
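The transistor-count savings quoted for the proposed cells follow directly from the per-cell counts given above. The short check below is only an arithmetic illustration; the helper name `reduction_pct` is ours and is not part of the original work.

```python
# Transistor counts per TCAM cell, as stated in the text.
CMOS_TCAM = 16   # conventional CMOS TCAM cell
NOR_STT = 6      # proposed NOR-type cell (plus 2 MTJs)
NAND_STT = 9     # proposed NAND-type cell (plus 2 MTJs)


def reduction_pct(baseline: int, proposed: int) -> float:
    """Percentage reduction in transistor count relative to a baseline."""
    return 100.0 * (baseline - proposed) / baseline


print(reduction_pct(CMOS_TCAM, NOR_STT))   # 62.5
print(reduction_pct(CMOS_TCAM, NAND_STT))  # 43.75
```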
Content Addressable Memory
CAMs can be divided into two categories depending on the number of states that can be stored in the memory cell, namely Binary CAM (BCAM) and TCAM. A BCAM cell stores a binary bit, that is, "0" or "1," whereas a TCAM cell can store three possible values, namely "don't care" (X), "1," and "0." CAMs can be further categorized into two topologies, namely NOR and NAND (Figures 2(a) and (b)). The stored bits are compared with the data on the Search Line (SL) and its complement ($\overline{SL}$) by an XOR operation performed by the transistor network M1, M2, M3, and M4. To store data in a TCAM cell of a NOR-type architecture, the data bit and its complement are stored in two SRAM cells. The don't care bit can be realized by storing "1" in both SRAM cells, that is, $D = \overline{D} = 1$. In case of a match, both pull-down paths (each formed by a search-line transistor in series with a stored-bit transistor) are disconnected, and the match line remains precharged. In case of a miss, one of the two paths connects the ML to ground, which discharges the precharged ML.

In a NAND-type architecture, TCAM cells are connected in series (Figure 2(b)). Data bits $D$ and $\overline{D}$ are derived from a single SRAM cell, unlike the two SRAM cells in a NOR-type TCAM. The stored bit is masked by a mask bit (M) held in a parallel SRAM cell. In the case of a match, the precharged ML is connected to ground through the series TCAM cells of the word by turning the NMOS transistor M1 ON. Storing the mask bit as "1" enables transistor M2 regardless of match or miss, which implements the don't care functionality. A CMOS TCAM cell uses two SRAM cells, which doubles the area overhead compared to a conventional SRAM cell.

Fig. 3. (a) MTJ in anti-parallel ("1") and parallel ("0") states and (b) the directions of charge current to write "1" and "0."
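The ternary search behavior described above can be summarized with a small behavioral model. This is a minimal sketch that ignores all circuit detail: stored words are strings over "0", "1", and "X", search keys are binary strings, and the function names are illustrative rather than taken from the article.

```python
def cell_matches(stored: str, search: str) -> bool:
    """One ternary cell: a stored 'X' matches either search bit."""
    return stored == "X" or stored == search


def nor_match_line(stored_word: str, search_word: str) -> bool:
    """NOR-type word: the ML stays precharged (True) only if no cell misses."""
    if len(stored_word) != len(search_word):
        raise ValueError("word lengths differ")
    return all(cell_matches(s, k) for s, k in zip(stored_word, search_word))


def search(table: list[str], key: str) -> list[int]:
    """Return the indices of all stored words that match the key."""
    return [i for i, word in enumerate(table) if nor_match_line(word, key)]


table = ["10X1", "1X00", "0000"]
print(search(table, "1001"))  # [0]
print(search(table, "1100"))  # [1]
```

A single missing cell is enough to discharge a NOR match line, which is why the single-bit miss is treated as the worst case for sensing throughout the article.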
STTRAM
STTRAM is a spintronic memory cell that stores data in the form of electron spin, unlike a static CMOS memory cell, which stores data in the form of electric potential. The basic storage element is an MTJ that consists of three layers: a layer of oxide sandwiched between two layers of magnetic material. Data are stored in the relative magnetization of the two magnetic layers. Bit "0" ("1") is stored when the two magnetic layers are magnetized in the same (opposite) direction. Figure 3(a) shows an MTJ device in the parallel and antiparallel states. The Pinned Layer (PL) has fixed magnetization, while the Free Layer (FL) magnetization can be polarized parallel or antiparallel with respect to the PL. The resistance of the MTJ is high when the PL and FL are in the antiparallel configuration and low when they are parallel to each other. The value written to the STTRAM bit depends on the direction and the strength of the charge current. The minimum current required to flip the state of the MTJ in an STTRAM bit is called the critical current. Figure 3(b) shows the direction of current to write "1" and "0" to the MTJ [Fong et al. 2013]. Tunnel Magneto Resistance (TMR) is the ratio that relates the electrical resistances of the MTJ structure in the parallel and antiparallel polarization states of the FL relative to the PL magnetization. If $R_H$ ($R_L$) is the MTJ resistance in the antiparallel (parallel) state, then the TMR is defined as

$$\mathrm{TMR} = \frac{R_H - R_L}{R_L}.$$

In this article, the MTJ together with the transistors in series is referred to as STTRAM, while MTJ refers to the MTJ device in the memory cell. Table I summarizes the parameters of the MTJ device used in this work.
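As a quick numerical illustration of the TMR definition, the sketch below converts between $R_L$, $R_H$, and TMR using the relation $R_H = R_L(1 + \mathrm{TMR})$ that the article relies on later. The values 8kΩ and 125% are the ones the article selects in its design analysis; the function names are ours.

```python
def r_high(r_low_ohm: float, tmr: float) -> float:
    """Antiparallel resistance from R_L and TMR (TMR given as a fraction)."""
    return r_low_ohm * (1.0 + tmr)


def tmr_from_resistances(r_high_ohm: float, r_low_ohm: float) -> float:
    """TMR = (R_H - R_L) / R_L."""
    return (r_high_ohm - r_low_ohm) / r_low_ohm


R_L = 8_000.0   # ohms
TMR = 1.25      # 125%
R_H = r_high(R_L, TMR)
print(R_H)                              # 18000.0
print(tmr_from_resistances(R_H, R_L))   # 1.25
```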
The rest of the article is organized as follows. Section 2 discusses the design concept and circuit operation of the proposed TCAM cells. The design analysis of the NOR-type TCAM is presented in Section 3, and its simulation results are discussed in Section 4. Simulation results of the NAND-type TCAM are discussed in Section 5, and the comparative analysis is presented in Section 6. Conclusions are drawn in Section 7.
PROPOSED TCAM CELLS
In this section, first we discuss the structure of the proposed TCAM. Next, we present qualitative analysis and describe read, write, and search operations.
NOR TCAM Cell Circuit
The circuit diagram of the proposed TCAM is shown in Figure 4. Two MTJs store $D$ and $\overline{D}$, respectively. Transistors M1 and M2 form the ML discharge network, which is controlled by the result of comparing the stored data with the search lines SL and $\overline{SL}$. During a search, transistors M3/M5 and M4/M5, together with the TMR-dependent MTJ resistance, form a voltage divider network in which the drain voltages of M3/M4 drive the gates of the discharge transistors M1/M2. The cell is designed such that during a match the voltages of nodes X1 and X2 stay below the threshold voltage of M1 and M2, and the ML stays precharged. During a mismatch, however, the voltages of X1 and X2 rise above the threshold voltage of M1 and M2, respectively, discharging the ML.
Transistors M3/M4 are the wordline (WL) selection transistors, and M6 is the write access transistor that turns ON only during the write (WR) operation. Transistor M6 is sized larger to allow sufficient write current. Transistor M5 is driven by the SE signal and is sized to limit the current through the STTRAM bit for a read-disturb-free search operation. The don't care bit can be stored in the cell by storing "1" in both the $D$ and $\overline{D}$ bits. The search bit can be masked by driving $SL = \overline{SL} = 0$ on the search lines. The Source Line (SrL) is used for two purposes, namely (a) a write operation, when SrL is connected to 0 or $V_{dd}$ depending on the data to be written to the MTJs, and (b) a search operation, when SrL is driven to 0 to allow voltage division.
Qualitative Analysis of the Cell Design
There are two match cases, namely (a) $(D, \overline{D}) = (SL, \overline{SL}) = (1, 0)$ and (b) $(D, \overline{D}) = (SL, \overline{SL}) = (0, 1)$. Since both cases are identical, we explain only the first. For $(D, \overline{D}) = (1, 0)$, the left-side MTJ is in the high-resistance ($R_H$) state, whereas the right-side MTJ is in the low-resistance ($R_L$) state. Since $(SL, \overline{SL}) = (1, 0)$, the voltage at node X1 is

$$V_{X1} = V_{SL}\,\frac{r}{R_H + r} = V_M,$$

and the voltage at node X2 is a voltage drop due to current flow from node X1 to $\overline{SL}$ (detailed analysis is given in Section 2.4). In this expression, $r$ is the lumped ON resistance of transistors M3, M4, and M5 (Figure 5), and $V_{SL}$ is the SL voltage. To keep transistor M1 OFF during a match, $V_{X1}$ should be lower than $V_{TH0}$ (i.e., the threshold voltage of M1 and M2). In a mismatch, the MTJ on the driven search line is in the low-resistance state, so the same divider gives $V_{MM} = V_{SL}\,r/(R_L + r)$. From these expressions, $V_{MM} > V_M$ in both cases because $R_H > R_L$. For the design to function properly (i.e., to discharge the ML during a mismatch at a higher speed than during a match), $R_H$, $R_L$, and $r$ should be selected such that

$$V_M < V_{TH0} < V_{MM}.$$
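The following analytical equations can be used to quantify the design parameters:

$$V_{SL} - I_{MM} R_L = V_{TH0} + \Delta_1 \qquad (1)$$

$$V_{SL} - I_M R_H = V_{TH0} - \Delta_2 \qquad (2)$$

where $I_{MM}$ and $I_M$ are the currents drawn from SL and $\overline{SL}$ in the cases of mismatch and match, respectively, and $\Delta_1$, $\Delta_2$ are the offset voltages with respect to $V_{TH0}$. Subtracting (2) from (1) and using $R_H = R_L (1 + \mathrm{TMR})$, we obtain

$$\Delta_1 + \Delta_2 = I_M R_H - I_{MM} R_L \qquad (3)$$

$$\Delta_1 + \Delta_2 = R_L \left[ I_M (1 + \mathrm{TMR}) - I_{MM} \right] \qquad (4)$$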
The optimization of the proposed TCAM revolves around three key requirements: (a) maximizing the difference between the mismatch and match voltages, that is, $(\Delta_1 + \Delta_2)$; (b) maximizing the absolute values of the offsets from $V_{TH0}$, that is, $|\Delta_1|$ and $|\Delta_2|$, to keep M1/M2 strongly ON or OFF as needed during mismatch and match, respectively; and (c) keeping the search current below the critical write current of the MTJ. From Equation (4), it can be concluded that higher TMR, higher $R_H$, and higher $r$ can be employed to enhance $(\Delta_1 + \Delta_2)$. Although higher $r$ and $R_L$ are also good for maximizing $\Delta_1$, they minimize $\Delta_2$. A lower $\Delta_2$ can turn ON M1/M2 during a match, degrading the sense margin. Figure 6 shows a pictorial representation of this situation with three operating points. The voltages $V_{MM1}$, $V_{MM3}$, $V_{M1}$, and $V_{M3}$ provide a poor sense margin compared to $V_{MM2}$ and $V_{M2}$, even with the same magnitude of $\Delta_1 + \Delta_2$. The ideal margin is obtained when $R_H = \infty$ and $R_L = 0$, which gives $V_{MM} = V_{dd}$ and $V_M = 0$. However, a lower $R_L$ could be detrimental for read disturb because of the high search current. High values of $R_H$ and $R_L$ ensure low search line currents. This, in combination with high TMR, can provide a better $V_{gs}$ margin, that is, $(\Delta_1 + \Delta_2)$, with low search power consumption. The design optimization conducted in this article accounts for the above factors.
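The trade-off between the two offsets can be illustrated with the lumped voltage-divider model of this section. The sketch below is a simplification under stated assumptions: it takes $\Delta_1$ as the mismatch node voltage minus $V_{TH0}$ and $\Delta_2$ as $V_{TH0}$ minus the match node voltage, uses the $R_L$ and TMR values the article selects later, and uses hypothetical values for the threshold voltage and for the lumped ON resistance `r_on`. It is not the authors' simulation flow.

```python
def node_voltage(v_sl: float, r_mtj: float, r_on: float) -> float:
    """Gate-node voltage of the divider: V_SL * r / (R_MTJ + r)."""
    return v_sl * r_on / (r_mtj + r_on)


def offsets(v_sl: float, r_low: float, tmr: float, r_on: float, v_th0: float):
    """Return (delta1, delta2): mismatch and match offsets from V_TH0."""
    r_high = r_low * (1.0 + tmr)
    v_mm = node_voltage(v_sl, r_low, r_on)    # mismatch: MTJ in R_L
    v_m = node_voltage(v_sl, r_high, r_on)    # match: MTJ in R_H
    return v_mm - v_th0, v_th0 - v_m


V_SL = 1.0     # volts
R_L = 8e3      # ohms, value selected in the design analysis
TMR = 1.25     # 125%, value selected in the design analysis
V_TH0 = 0.4    # volts, hypothetical threshold for illustration only

for r_on in (2e3, 8e3, 32e3):   # hypothetical lumped ON resistances
    d1, d2 = offsets(V_SL, R_L, TMR, r_on, V_TH0)
    print(f"r = {r_on / 1e3:4.0f} kOhm: delta1 = {d1:+.3f} V, delta2 = {d2:+.3f} V")
```

With these illustrative numbers, a small `r_on` leaves $\Delta_1$ negative (the mismatch node never reaches the threshold), a large `r_on` leaves $\Delta_2$ negative (the match node exceeds the threshold), and only the middle value keeps both offsets positive, mirroring the three operating points sketched in Figure 6.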
Write Operation
In the proposed TCAM, the search lines SL and $\overline{SL}$ are used to write data to the STTRAM bits. Table II summarizes the states of the control signals in the write operation. Writing "1" or "0" takes two cycles, one for each of the two STTRAMs, while "X" can be written in a single cycle. During the write, the ML precharge is disabled to avoid power consumption from the ML; this is achieved by pulling the "precharge" signal high. NMOS transistor M6 is turned ON during the write by the WR signal. Note that M6 is sized to provide a drain current greater than the critical write current of the STTRAM. The state of the search enable signal SE is a don't care because M5 is connected in parallel with M6; for the analysis, we assume that SE is pulled low. The WL is turned ON only for the selected word, so the unselected cells are unaffected. The source line SrL is controlled appropriately to write a "1" or "0." Figure 5(a) shows the equivalent circuit of the TCAM cell during a write, with the transistors replaced by their equivalent ON resistances: r3, r4, and r6 are the equivalent resistances of M3, M4, and M6, respectively.

The write operation proceeds as follows. In the first cycle, writing to the $D$ bit is enabled by pulling WL1 to $V_{dd}$, and the $\overline{D}$ bit path is disabled by pulling WL2 to ground. In the second cycle, writing to the $\overline{D}$ bit is enabled by pulling WL2 to $V_{dd}$, and the $D$ bit path is disabled by pulling WL1 to ground. The direction of the charge current through the MTJ for writing the parallel and antiparallel states is shown in Figure 3(b).
1. Writing "1": In the first cycle, SrL is pulled high, and the SL line is pulled to ground. The write current flows toward the grounded SL, writing an antiparallel state to the STTRAM storing bit $D$. There is no current through the other STTRAM bit because the WL2 control signal is grounded. In the second cycle, SrL is pulled low, $\overline{SL}$ is pulled to $V_{dd}$, and WL2 is pulled high, which programs the STTRAM storing $\overline{D}$ to the parallel state. There is no current through the other STTRAM bit because WL1 is grounded.
2. Writing "0": In the first cycle, SL is pulled high, and the SrL line is pulled low. This cycle writes the parallel magnetization state to the STTRAM storing the $D$ bit. In the second cycle, SrL is pulled high while $\overline{SL}$ is at 0, which programs the $\overline{D}$ bit to the antiparallel state.
3. Writing "X": The "X" state is stored by writing logic 1 to both $D$ and $\overline{D}$. SrL is pulled to $V_{dd}$, and the search lines SL and $\overline{SL}$ are pulled low. The current flows through both STTRAMs, storing antiparallel states in $D$ and $\overline{D}$.
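The write sequences above can be captured as a small table of per-cycle signal levels and checked against the intended MTJ states. The sketch below is a behavioral illustration only: the `WriteCycle` type and the function names are ours, `None` marks a search-line level that the text does not state for that cycle, and the WL1 = WL2 = 1 setting for "X" is an assumption implied by current flowing through both STTRAMs.

```python
from typing import NamedTuple, Optional


class WriteCycle(NamedTuple):
    wl1: int              # word line of the D branch
    wl2: int              # word line of the D-bar branch
    srl: int              # source line
    sl: Optional[int]     # search line (None: not stated for this cycle)
    slb: Optional[int]    # complementary search line


NOR_WRITE = {
    "1": (WriteCycle(1, 0, 1, 0, None), WriteCycle(0, 1, 0, None, 1)),
    "0": (WriteCycle(1, 0, 0, 1, None), WriteCycle(0, 1, 1, None, 0)),
    "X": (WriteCycle(1, 1, 1, 0, 0),),   # WL1 = WL2 = 1 is assumed
}


def programmed_state(srl: int, line: Optional[int]) -> Optional[str]:
    """'AP' when SrL is high and the search line low, 'P' for the reverse."""
    if line is None or srl == line:
        return None
    return "AP" if srl == 1 else "P"


def apply_write(value: str):
    """Replay the write cycles and return the (D, D-bar) MTJ states."""
    d = d_bar = None
    for cycle in NOR_WRITE[value]:
        if cycle.wl1:
            d = programmed_state(cycle.srl, cycle.sl) or d
        if cycle.wl2:
            d_bar = programmed_state(cycle.srl, cycle.slb) or d_bar
    return d, d_bar


for value in ("1", "0", "X"):
    print(value, apply_write(value))
```

Replaying the table reproduces the NOR-cell encoding used in the text: "1" is (antiparallel, parallel), "0" is (parallel, antiparallel), and "X" is antiparallel in both MTJs.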
Search Operation
Search is a single-cycle operation in CAM. The ML is precharged to $V_{dd}$, and WR is pulled to ground. SrL is pulled to ground throughout the search operation. Next, SE and WL are pulled high to enable the conducting path through M5 and M3/M4 (Table II). Either $V_M$ or $V_{MM}$ is developed at the gate of M1/M2, depending on whether the bit matches or mismatches, respectively. To search for bit "1," SL is pulled to $V_{dd}$ and $\overline{SL}$ is pulled low. Similarly, to search for bit "0," SL is pulled low and $\overline{SL}$ is pulled to $V_{dd}$. Both SL and $\overline{SL}$ are pulled low to search for "X." Circuit operation in the match and mismatch cases is discussed below. In the mismatch case (Figure 5(d)), both M1 and M2 are turned ON to discharge the precharged ML, which provides a better sense margin.

Figure 7 illustrates the ML voltages during the search operation for TCAMs of various word sizes, namely 1, 16, 128, and 256 bits, for match and mismatch. A predictive 22nm model is used for the simulations [Cao et al. 2002]. The waveforms correspond to the worst-case sense margin, that is, a single miss in the whole word. The rate of discharge of the ML in the match case increases with the word size because more cells leak ML current through the weakly driven M1/M2. This in turn limits the sense margin for larger word sizes.

The equations for $V_{X1}$ and $V_{X2}$ can also be derived in terms of the transistor device parameters by replacing the voltage across each transistor with its $V_{ds}$ and the drain current of the transistors with the current through the MTJ. The objective is to optimize the sense margin and search power while limiting the drain current to below the critical current of the MTJ. Here, $I_{d3}$, $I_{d4}$, and $I_{d5}$ are the drain currents and $V_{ds3}$, $V_{ds4}$, and $V_{ds5}$ are the drain-to-source voltages of transistors M3, M4, and M5, respectively, and $I_{c0}$ is the STTRAM critical current.
The drain currents and $V_{ds}$ of transistors M3, M4, and M5 differ between case 1 and case 2. Deriving analytical expressions for the transistor parameters from the above expressions is straightforward but tedious with short-channel transistor equations, so we have adopted a simulation-based approach to minimize such effort.
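The search behavior of a single NOR cell can be approximated with the lumped divider model of Section 2.2. The sketch below is a deliberately simplified illustration: it treats the two branches independently (ignoring the coupling through the shared transistor M5), uses the $R_L$ and TMR values chosen in the design analysis, and assumes hypothetical values for the lumped ON resistance `R_ON` and the threshold `V_TH0`. All names are ours.

```python
V_DD = 1.0                 # volts
R_L = 8e3                  # ohms, parallel-state resistance
R_H = R_L * (1.0 + 1.25)   # ohms, antiparallel state at TMR = 125%
R_ON = 8e3                 # ohms, hypothetical lumped ON resistance
V_TH0 = 0.4                # volts, hypothetical threshold of M1/M2

# (D, D-bar) MTJ resistances for each stored value in the NOR cell.
STORE = {"1": (R_H, R_L), "0": (R_L, R_H), "X": (R_H, R_H)}
# (SL, SL-bar) drive levels for each search value.
DRIVE = {"1": (V_DD, 0.0), "0": (0.0, V_DD), "X": (0.0, 0.0)}


def cell_discharges(stored: str, key: str) -> bool:
    """True if either gate node rises above V_TH0, i.e. the cell misses."""
    r_d, r_db = STORE[stored]
    v_sl, v_slb = DRIVE[key]
    v_x1 = v_sl * R_ON / (r_d + R_ON)     # gate of M1
    v_x2 = v_slb * R_ON / (r_db + R_ON)   # gate of M2
    return v_x1 > V_TH0 or v_x2 > V_TH0


def ml_discharges(stored_word: str, key_word: str) -> bool:
    """NOR word: the ML discharges if any cell in the word misses."""
    return any(cell_discharges(s, k) for s, k in zip(stored_word, key_word))


for stored in "10X":
    for key in "10X":
        print(stored, key, "miss" if cell_discharges(stored, key) else "match")
```

Under these assumptions a stored "X" (both MTJs in $R_H$) and a searched "X" (both search lines low) never raise a gate node above the threshold, so both forms of masking behave as described in the text.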
NAND-type TCAM Cell
Two TCAM topologies are traditionally investigated in the literature, namely NAND and NOR. Typically, a NAND-topology TCAM is faster than a NOR-topology TCAM in a full CMOS realization. We investigate both NOR- and NAND-topology TCAM realizations using STTRAM, which completes the study of STTRAM-based TCAM design. In this section, we propose a NAND-type TCAM cell using STTRAM. Figure 8 shows the circuit diagram of the NAND-type TCAM cell along with the match line structure. The cell consists of 2 PMOS transistors, 7 NMOS transistors, and 2 MTJs. We use the complementary encoding to realize the NAND-type TCAM in this work; that is, bit "1" is encoded by the parallel state and bit "0" by the antiparallel state of the relative magnetic spins of the MTJ (Figure 8). This encoding is required for a successful search operation in the NAND TCAM. Six NMOS transistors, M1-M6, are sized for reliable search and write operation in the same way as in the NOR-type TCAM explained earlier. In other words, the design analysis of the NOR TCAM cell embedded in the NAND-type TCAM cell remains similar to that of a NOR-type cell. Three additional transistors (M7, M8, M9) are added to the basic NOR TCAM cell to realize the NAND-type TCAM cell. These additional transistors are of minimum size.

For the search operation, the match line is initially predischarged, and the chain of PMOS transistors (M9 in Figure 8) of the individual TCAM cells connects the match line to $V_{dd}$ only in the case of a complete match. The gates of the PMOS transistors in the chain are precharged initially such that the chain is completely disconnected from $V_{dd}$. The search data and search enable signals are asserted after the match line has been predischarged. In the case of a match, the respective gate voltage is pulled low by the NMOS transistors M1/M2 in the search circuit. Thus, the match line is pulled high to the full $V_{dd}$ in the case of a complete match of a word.

Table III summarizes the control signals in the write and search operations of the NAND TCAM. Writing is performed by injecting a current greater than the critical current from the search lines and the source line (SrL). The SE signal is "X," and WR is pulled high throughout the write operation. Transistor M6 is sized to carry a current higher than the critical current of the MTJs for programmability. WL1 and WL2 are pulled to $V_{dd}$ to enable writing the $D$ bit and the $\overline{D}$ bit in the first and second cycles of the write, respectively.

Write -1: Bit "1" is stored in two write cycles, that is, by writing the parallel state to the MTJ storing the $D$ bit and the antiparallel state to the MTJ storing the $\overline{D}$ bit. In the first cycle, WL1 and SL are driven to $V_{dd}$, and SrL and WL2 are pulled low, which results in current flow from the free layer to the fixed layer, writing the parallel state to the MTJ storing the $D$ bit. In the second cycle, WL2 and SrL are precharged to $V_{dd}$, while WL1 and $\overline{SL}$ are pulled low. This results in current flow from the PL to the FL, storing the antiparallel state in the $\overline{D}$ bit.
Write -0: Bit "0" is stored by writing the antiparallel state to the $D$ bit and the parallel state to the $\overline{D}$ bit. In the first cycle, SrL and WL1 are precharged to $V_{dd}$, while SL is pulled to ground to write the antiparallel state to $D$. In the second cycle, WL2 and $\overline{SL}$ are pulled to $V_{dd}$, while SrL is pulled low to write the parallel state to $\overline{D}$.
Write -X: The don't care bit is written by storing the parallel state in both MTJs, $D$ and $\overline{D}$, which results in a match for both search bits "0" and "1." Consistent with the parallel-state writes in Write -1 and Write -0, writing "X" can be performed in a single cycle by pulling WL1, WL2, SL, and $\overline{SL}$ to $V_{dd}$ and pulling SrL low to ground, which writes the parallel state simultaneously to both MTJs.
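The complementary encoding of the NAND cell relative to the NOR cell, and the all-cells-must-match behavior of the series PMOS chain, can be summarized in a short behavioral sketch. The dictionaries restate the MTJ states given in the text ("P" for parallel, "AP" for antiparallel, ordered as $D$ then $\overline{D}$); the function names are illustrative, and search keys are assumed to be binary.

```python
# (D, D-bar) MTJ states for each stored value, as described in the text.
NOR_ENCODING = {"1": ("AP", "P"), "0": ("P", "AP"), "X": ("AP", "AP")}
NAND_ENCODING = {"1": ("P", "AP"), "0": ("AP", "P"), "X": ("P", "P")}


def flipped(state: str) -> str:
    """Swap parallel and antiparallel."""
    return "P" if state == "AP" else "AP"


def is_complementary(enc_a: dict, enc_b: dict) -> bool:
    """True if enc_b is enc_a with every MTJ state flipped."""
    return all(
        tuple(flipped(s) for s in enc_a[value]) == enc_b[value] for value in enc_a
    )


def nand_match_line(stored_word: str, key_word: str) -> bool:
    """Series chain: the ML is pulled to V_dd only if every cell matches."""
    return all(s == "X" or s == k for s, k in zip(stored_word, key_word))


print(is_complementary(NOR_ENCODING, NAND_ENCODING))  # True
print(nand_match_line("1X0", "110"))                  # True
print(nand_match_line("1X0", "111"))                  # False
```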
The design analysis of the NOR-type cell is presented in the next section. We analyze the selection of design parameters (MTJ resistance, TMR, and search transistor sizing) for successful search operation in the TCAM. The design analysis for the write operation remains the same as in a conventional STTRAM cell and is therefore excluded from this work.
NOR TCAM CELL DESIGN ANALYSIS
In this section, we first present the methodologies used to determine the transistor sizing and MTJ resistance for reliable operation of the proposed NOR TCAM. We consider a broad range of word sizes in the analysis. In the proposed design, the parameters are optimized for sense margin and search power. The parameters of the write transistor M6 are chosen to drive sufficient write current. All the other parameters in the cell design are optimized for the search operation.
Selection of R L and NMOS Device
The low MTJ resistance and the sizing of transistor M5 are chosen to keep the search current below the critical current while providing a sufficient $V_{gs}$ to drive M1/M2 and differentiate the miss and match cases. Besides keeping the search current below the critical current, limiting the search current is crucial to keep the search power as low as possible while achieving a reliable sense margin. Moreover, designing for the highest search current (through the lowest resistance) yields reliable design parameters; that is, the total search current in the TCAM cell stays below the critical current under all Process Variation (PV) conditions. The high MTJ resistance is determined by the TMR. Transistor M6 is sized to provide a write current greater than the critical current through the STTRAM during the write operation. We simulated a range of $R_L$ values (5kΩ to 9kΩ) with a fixed TMR of 100%. The trend is shown in Figure 9(a) for a 16-bit word. It can be observed from the plot that high resistance values with smaller NMOS widths provide a good sense margin (close to $V_{dd}/2$) with lower STTRAM current from the search line. Based on this, $R_L$ = 8kΩ is selected for the proposed design. The STTRAM current during mismatch is also plotted. Note that the mismatch current is always greater than the match current, and therefore we use it to estimate the worst-case read disturb during the search operation.
The widths of the NMOS devices M3/M4 and M5 are important parameters for ensuring a low search current and reducing the power dissipated from the search lines. Figure 9(a) shows the STTRAM current for various widths of the NMOS device M5 with different $R_L$ values. Smaller NMOS widths offer higher resistance, reduce the search current (good for lower read disturb and power), and improve the sense margin (following the discussion in Section 2.2). However, a minimum-sized transistor can be susceptible to manufacturing process variations. We selected a 50nm M5 width for low search current. Further, two transistors of 100nm width in series can be used to reduce the sensitivity to process variation. It can be observed from the plot that the miss-case current is highly dependent on the width of the M5 NMOS device and remains almost the same for different $R_L$ values. A high $R_L$ is selected to keep the TMR within practical limits (100-150%) [Chun et al. 2013]. To determine the optimal size of transistors M3/M4, we swept their width and observed the sense margin and sense current for the 50nm M5 width (Figure 9(b)). It is evident from the plot that the sense margin increases sharply from 50nm to 200nm, after which the improvement saturates. Also, the search current increases by approximately 10× with an increase in the width by 25nm. Therefore, we selected the width of M3/M4 to be 200nm.

The Impact of TMR on Sense Margin

Figure 10 shows the trend of match current and sense margin versus the width of NMOS M5 for different TMR values. The $R_L$ of the MTJ is fixed at 8kΩ (as decided in Section 3.1) for this analysis, and TMR and $R_H$ are selected for low match-case search current and a higher sense margin. It can be seen that a higher TMR ensures a better sense margin and a lower STTRAM match current for a fixed $R_L$. It can also be seen from the plot that the NMOS width affects the STTRAM current in the match case far less than in the miss case, because the high MTJ resistance $R_H$ dominates the effective NMOS resistance of M3/M4-M5. This also results in a lower drain voltage at M3/M4 than in the mismatch case. The width of the NMOS is therefore selected based on the mismatch current drawn from the SL, while the TMR is chosen to satisfy the match-case conditions. It can be noted that the sense margin benefit saturates for TMR greater than 125%. Hence, we have used TMR = 125%, which provides less than 45μA of match current with a sense margin close to 500mV.
Realization of $R_L$ and TMR:
The resistance of an MTJ has been shown to depend on the oxide thickness and the surface area of the free layer [Shen et al. 2006]. Therefore, by tuning these parameters, it is possible to obtain an MTJ resistance of $R_L$ = 8kΩ. Similarly, it has been shown experimentally that TMR can be improved up to 236% [Shen et al. 2006]. This can be used at design time to ensure TMR = 125% for proper functioning of the proposed TCAM.
SIMULATION RESULTS OF NOR TCAM CELL
In this section, we present an analysis of the proposed TCAM with respect to temperature, voltage, and process variations. We also propose modulating the search enable signal and the threshold voltage to improve robustness.
Setup
We used TMR = 125% with $R_L$ = 8kΩ, a 50nm M5 transistor, and 200nm M3/M4 transistors (as discussed in Section 3). MTJ models from Fong et al. [2013] are used with a 60nm×60nm×3nm free layer dimension and a 0.876nm oxide (MgO) thickness for the design simulations. Word sizes of 16, 32, 64, 128, and 256 bits are simulated to analyze the design with respect to process, temperature, and voltage variations.
Temperature Variation Analysis
Thermal fluctuations cause variations in the critical current and switching time of the MTJ, which are modeled through the effective magnetic field in the LLG equations [Wang et al. 2014]. The worst-case sense margin, search delay (for 50mV sense margin development), and Power Delay Product (PDP) per bit search from 10°C to 90°C are shown in Figures 11(a)-(c) for different word sizes. A single-bit mismatch is considered for the sense margin and search delay, as it is the worst-case condition. The search delay increases in proportion to the word size because of the increase in the ML interconnect capacitance. As the temperature increases, the rate of ML discharge increases due to the lowering of the threshold voltage of the discharge transistors M1/M2. The sense margin decreases with temperature because the ML discharges through the subthreshold leakage current of the discharge transistors in the match case. Therefore, the search delay (for the 50mV sense margin) increases with temperature. The PDP follows the change in search delay, since the operating voltage and the search line current are similar across different temperatures. From Figure 11, it is evident that we obtain a reliable sense margin of greater than 50mV across the temperature range for word sizes of up to 256 bits.
Voltage Scaling
For this study, the operating voltage is varied from 0.7V to 1.2V to observe the sensitivity of the sense margin, search delay, and PDP per bit search (Figure 12). A 50mV sense margin development time is used to measure the search delay. Below 0.7V, the sense margin of a 256-bit CAM word is less than 50mV. The sense margin and search delay are sensitive to $V_{dd}$ because the gate voltage of M1/M2 is lowered while their threshold voltages remain fixed. At lower voltages, the M1/M2 transistors fail to turn ON or conduct only weakly even during a mismatch, degrading the sense margin (especially for wider words). The search delay for a 256-bit TCAM word varies from 124ps at 1.2V to 2.098ns at 0.7V (the search delay is plotted on a log10 scale). The increase in the search delay results in a sharp increase in the PDP at 0.7V.
Process Variation Analysis
For the process variation analysis, we have considered the Fast Fast (FF), Slow Slow (SS), and Typical Typical (TT) corners. We have modeled the process variation in the transistors by the widely accepted technique of lumping the variation in channel length, oxide thickness, flat band conditions, and so on, into the threshold voltage of the transistor [Ghosh et al. 2006]. The SS (FF) corner is simulated by adding (subtracting) 150mV to (from) the nominal threshold voltage. Process variation in the MTJ device is modeled by considering the effects of variation in the MTJ surface area and oxide thickness [Wang et al. 2014]. Specifically, the MTJ set resistance $R_L$ is varied as a normal distribution with a mean of 8kΩ and a standard deviation of 500Ω, together with a TMR variation of 0.1%. The worst-case sense margin is plotted for different supply voltages at the TT, SS, and FF corners (Figure 13). It can be observed that the design provides a reliable sense margin of above 50mV at all corners down to 0.75V for words of 128 bits or fewer. The poor sense margin at lower voltages is linked to the poor $V_{gs}$ across M1/M2, which keeps the ML precharged even in mismatch conditions.
The 256-bit word fails to provide an adequate sense margin in the FF corner at 1V. This is primarily due to poor 2 (as shown in Figure 6) when V_TH0 moves down, coupled with leakage from the match bits. The match bits therefore leak in the case of a mismatch, degrading the sense margin. We propose threshold voltage modulation and SE voltage boosting or underdrive to improve the sense margin for a 256-bit word. Furthermore, these techniques will not worsen the reliability of the NMOS device, since a thicker oxide (associated with a high V_TH) and a lower gate voltage are expected to improve reliability with respect to hot-carrier degradation, Negative Bias Temperature Instability (NBTI), and Time-Dependent Dielectric Breakdown.
V_TH and SE Modulation for Sense Margin Improvement
To address the poor sense margin, we propose to modulate V_TH0, 1, and 2 through threshold voltage modulation of transistors M1/M2 (to tune V_TH0) and SE voltage modulation (to tune 1 and 2). Figure 14(a) shows the results at 1V for the three PV corners for a 256-bit word at different SE signal voltages and with V_TH raised by 0mV, 50mV, and 100mV. A change in the gate drive of M3/M4 changes their ON resistance and results in corresponding changes in 1 and 2. It can be noted that an optimum choice of SE improves the sense margin, and repositioning V_TH0 improves it further. Figure 14(b) illustrates the sense margin across the three PV corners with V_TH implants at an 850mV supply voltage. V_TH modulation significantly improves the worst-case sense margin (FF and SS in this case), even though the sense margin in the TT corner is degraded. The improvement results from the decreased match-case current through M1/M2 at SS and the reverse effect in the miss case at FF. At the same time, a lower SE increases the resistance of M3/M4, which in turn increases 2. As expected, the sense margin in FF with the V_TH implant is comparable to that of the TT corner without an implant. With a 100mV V_TH implant, the design provides a reliable sense margin above 40mV in all the PV corners even without SE modulation. A 150mV SE underdrive improves the sense margin at TT to more than 120mV, and a 250mV SE underdrive improves the sense margin at FF to more than 50mV. We therefore employ a positive V_TH implant of 100mV and underdrive the gate control signal SE by 150mV to improve performance across all the PV corners. A positive V_TH implant is realized by a thicker oxide, and a gate underdrive below the highest V_g of the technology node improves device reliability while improving performance.
NAND TYPE TCAM CELL SIMULATION RESULTS
We simulated single-bit, 8-bit, 16-bit, 32-bit, and 64-bit NAND-type TCAM words; the waveforms illustrating the match and miss states of the match line are shown in Figure 15. The single-bit simulation shows that the NAND-type TCAM can provide up to 500mV of sense margin. The sense margin decreases as the number of bits in the word increases, due to charge sharing between the match line and the intermediate nodes between the bits of a word. To capture the worst-case scenario, we measured the miss-case match line voltage with the miss on the bit farthest from the sense amplifier end. The sense margins measured for 16-, 32-, and 64-bit TCAM words are 147mV, 90.5mV, and 33.4mV, respectively, under nominal conditions. The search delays measured for a minimum SM of 50mV for 16- and 32-bit TCAM words are 1.83ns and 2.72ns, respectively. Sixty-four-bit words have a search delay of 5.45ns for a 30mV SM. Search delay is measured as the time required to develop the required sense margin on the match line, starting from the time WL crosses 0.5 * V_dd. The sense margin can be improved by adding a capacitor to the match line, which makes it harder for stray charge on the intermediate nodes to charge the match line in case of a miss. This technique increases the match line power due to the increase in match line capacitance.
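The search delay definition above (from WL crossing 0.5 * V_dd to the point where the required sense margin has developed) can be expressed as a small post-processing routine over sampled waveforms. The sketch below is one possible reading of that definition, assuming the margin is the absolute difference between the match-case and miss-case match line voltages; the function names and the toy waveform are hypothetical and are not simulation data.

```python
def crossing_time(times, values, level):
    """First upward crossing of `level`, linearly interpolated; None if absent."""
    for i in range(1, len(times)):
        v0, v1 = values[i - 1], values[i]
        if v0 < level <= v1:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


def search_delay(times, wl, ml_match, ml_miss, vdd, required_sm):
    """Time from WL crossing 0.5*vdd until |ML_match - ML_miss| reaches required_sm."""
    t_start = crossing_time(times, wl, 0.5 * vdd)
    margin = [abs(a - b) for a, b in zip(ml_match, ml_miss)]
    t_margin = crossing_time(times, margin, required_sm)
    if t_start is None or t_margin is None:
        return None  # the margin never develops (or WL never rises)
    return t_margin - t_start


# Synthetic toy waveform, in ns and volts, used only to exercise the code.
t_ns = [0.0, 1.0, 2.0, 3.0, 4.0]
wl_v = [0.0, 1.0, 1.0, 1.0, 1.0]
ml_match_v = [0.0, 0.0, 0.0, 0.0, 0.0]
ml_miss_v = [0.0, 0.0, 0.05, 0.10, 0.15]

toy_delay_ns = search_delay(t_ns, wl_v, ml_match_v, ml_miss_v, vdd=1.0, required_sm=0.05)
print("toy search delay (ns):", toy_delay_ns)
```

Returning None when the margin is never reached mirrors the case of wide words whose maximum sense margin stays below the target.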
The search power of the NAND-type TCAM cell at 0.8V and 1V supply voltages is tabulated in Table IV. The power consumption of the NAND TCAM is higher than that of the proposed NOR TCAM, due to the additional logic around the NOR-type TCAM cell in the design realization. The search delay and sense margin for different word lengths of the NAND-type TCAM are plotted in Figure 16. The plot shows that the search delay increases by up to roughly twofold each time the word length doubles. The maximum sense margin decreases greatly for word lengths beyond 64 bits. The maximum sense margin for a 64-bit word is 33.4mV with a search delay of 5.8ns, which is due to the larger resistance offered by the PMOS transistor chain in the match line. We kept the size of the PMOS in the match line the same for all word lengths for simplicity of analysis and to limit the area overhead in larger words.
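The scaling of search delay with word length can be checked against the NAND-type delays reported earlier in this section (1.83ns, 2.72ns, and 5.45ns for 16-, 32-, and 64-bit words). The sketch below computes the ratio per doubling of the word length. Note that the 64-bit delay was measured for a 30mV sense margin and the others for 50mV, so the ratios are indicative only.

```python
# NAND-type search delays reported in the text, in ns, keyed by word length.
delay_ns = {16: 1.83, 32: 2.72, 64: 5.45}

lengths = sorted(delay_ns)

# Delay ratio for each doubling of the word length.
ratios = {
    (short, long): delay_ns[long] / delay_ns[short]
    for short, long in zip(lengths, lengths[1:])
}

for (short, long), ratio in ratios.items():
    print(f"{short}-bit -> {long}-bit: x{ratio:.2f}")
```

The step from 32 to 64 bits is close to twofold, while the step from 16 to 32 bits is smaller, at about 1.5 times.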
COMPARATIVE ANALYSIS
In this section, we present the comparative analysis of the proposed NOR and NAND TCAM with respect to CMOS CAM and other spintronic CAMs from the literature.
Comparison with CMOS TCAM
A conventional TCAM cell consists of 16 transistors, while the proposed NOR-type TCAM consists of only 6 NMOS transistors and 2 MTJs, which is a 62.5% reduction in the number of transistors. For a power comparison, we implemented the CMOS TCAM and simulated it using a 22nm predictive model. The standby leakage power of the proposed TCAM is zero, as the power supply can be completely shut off during sleep, while an SRAM-based TCAM consumes a considerable amount of standby power. In mostly-OFF applications such as the Internet-of-Things and smartphones, the proposed TCAM could be very attractive compared to CMOS CAM. The search power consumption of the proposed TCAM is higher than that of conventional CMOS because of the search line current (∼51μA in the case of a mismatch at 1V) drawn to generate a secondary voltage at the drain terminals of M3/M4 that enables the discharge transistors of the ML. The search line current can be reduced by selecting an MTJ with high R_L and high TMR. The power consumption during the search of "1" and "0" bits at 0.8V in the STTRAM-based TCAM is observed to be up to 80% higher in the worst case (successful search of "1") compared to the NOR-type CMOS TCAM. The power consumption of the NOR-type CMOS TCAM and the proposed spintronic TCAM is tabulated in Table IV. The NAND-type TCAM consumes 2-3% more power and uses 50% more transistors (9T vs. 6T) than the proposed NOR-type TCAM.
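The transistor-count figures follow directly from the cell sizes (16T for conventional CMOS, 6T for the proposed NOR-type cell, 9T for the proposed NAND-type cell). The short sketch below reproduces the arithmetic; the helper name is illustrative.

```python
def percent_fewer(baseline, proposed):
    """Reduction of `proposed` relative to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


CMOS_TCAM_T = 16   # conventional CMOS TCAM cell
NOR_TCAM_T = 6     # proposed NOR-type cell (plus 2 MTJs)
NAND_TCAM_T = 9    # proposed NAND-type cell (plus 2 MTJs)

nor_saving = percent_fewer(CMOS_TCAM_T, NOR_TCAM_T)
nand_saving = percent_fewer(CMOS_TCAM_T, NAND_TCAM_T)
nand_vs_nor = 100.0 * (NAND_TCAM_T - NOR_TCAM_T) / NOR_TCAM_T

print(f"NOR vs CMOS:  {nor_saving}% fewer transistors")
print(f"NAND vs CMOS: {nand_saving}% fewer transistors")
print(f"NAND vs NOR:  {nand_vs_nor}% more transistors")
```

The results are 62.5% and 43.75% fewer transistors than the CMOS cell for the NOR and NAND cells, respectively, and 50% more transistors for NAND relative to NOR.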
Comparison with Spintronic CAMs
We compared the performance of the proposed TCAM cell with the other spintronic TCAM structures proposed so far (Table V). The proposed NOR-type TCAM draws 51μA (39μA) from the search line during mismatch (match), which is significantly more energy efficient than DWM CAM [Zhang et al. 2012]. The NOR-type TCAM has 33.3% fewer transistors than MTJ TCAM [Xu et al. 2008] and 50% (25%) fewer transistors than DWM TCAM [Nebashi et al. 2011; Zhang et al. 2012]. The proposed NAND-type TCAM has 12.5% more transistors than the MTJ TCAM [Xu et al. 2008] and 33.3% (44.4%) fewer transistors than DWM TCAM [Nebashi et al. 2011; Zhang et al. 2012]. The BCAM proposed in Xu et al. [2008] requires additional circuitry (an NMOS transistor and an MTJ) to be configured as a TCAM. In the proposed TCAM, data can be written to the bit cell by a conventional current-induced magnetization technique [Fong et al. 2013] and by controlling the source line. Therefore, it eliminates the need for external writing circuitry. The NMOS transistor M6 (driven by a "WR" signal) provides the additional current required for write. This is unlike Xu et al. [2008], which does not provide a method for memory write. In the TCAM cells of Xu et al. [2008] and Chen et al. [2013], the MTJs are integrated in series into the search circuit, which makes the write operation more complex and error-prone. DWM CAMs [Nebashi et al. 2011; Zhang et al. 2012] use domain wall motion-based write and MTJ sense circuit-based search, which adds area overhead and complexity to the memory design. The MTJ-based CAM proposed in Matsunaga et al. [2012] also uses four transistors (NMOS and PMOS) and two MTJs, and the sensing is based on the amount of current drawn from the match line by the different low- and high-state resistances offered by the MTJ. The technique fails as the number of bits in a word increases (up to 144-bit words).
The memory cell has low tolerance to temperature variations and to low-V_TH process corners due to leakage in the diode-connected NMOS transistor. With larger words, the ML capacitance increases while the differential current remains the same, which reduces the available ML sense margin. The situation becomes worse with process variation of the diode-connected NMOS transistor. Also, 2T-2MTJ [Matsunaga et al. 2012], which also exploits the resistance differential and a segmented search, has a complex memory architecture incorporating control circuitry, an accumulator to store segmented search results, and segment activation. The additional circuits incur delay and power overhead. Overall, the technique is not efficient in terms of delay and power compared to other spintronic CAMs. Although the search delay is listed in Table V for different word lengths, the search delay reported for the proposed TCAM is for larger words (256 bits). Also, in Chen et al. [2013], it is only 7.5 times the search delay in a single-bit search. The proposed TCAM search delay is distinguished by its smaller value for a larger word. A network IP address is 128 bits in the IPv6 protocol [Govindaraj et al. 2012]. For 128 bits, the search delay is less than 250ps, which can theoretically support 3GHz to 4GHz search speeds for router applications. We used a 60nm×60nm IMA MTJ model that shows a write latency of 4ns with a write energy of 0.69pJ/bit.
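The search-rate estimate above follows from inverting the search delay. The sketch below reproduces that arithmetic and also scales the per-bit write energy to a full 128-bit word. Both results rest on simplifying assumptions that are not stated in the text: one search is issued per search delay with no precharge or sensing overhead, and every bit of the word is written once.

```python
def max_search_rate_hz(search_delay_s):
    """Upper bound on search rate, assuming back-to-back searches (no overhead)."""
    return 1.0 / search_delay_s


def word_write_energy_j(bits, energy_per_bit_j):
    """Energy to write a whole word, assuming every bit is written once."""
    return bits * energy_per_bit_j


search_delay_s = 250e-12        # upper bound for 128-bit words, from the text
write_energy_per_bit_j = 0.69e-12  # from the text

rate_hz = max_search_rate_hz(search_delay_s)
word_energy_j = word_write_energy_j(128, write_energy_per_bit_j)

print(f"max search rate: {rate_hz / 1e9:.1f} GHz")
print(f"128-bit write energy: {word_energy_j * 1e12:.2f} pJ")
```

A 250ps delay corresponds to an upper bound of 4GHz, and a 3GHz search clock would leave a period of about 333ps, which is why the quoted range is a theoretical one.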
CONCLUSIONS
We proposed a spintronic TCAM that promises zero standby leakage and uses fewer transistors. We conducted a detailed analysis in the presence of process, voltage, and temperature variations for a wide range of word sizes. The proposed design operates with a reliable sense margin for word sizes up to 128 bits at supply voltages down to 0.7V. We also propose threshold voltage modulation and search-enable underdrive to improve the sense margin for 256-bit words. The proposed TCAM has 62.5% fewer transistors than conventional CMOS TCAM and 33-50% fewer transistors than other spintronic CAMs. The worst-case active leakage power of the NOR TCAM cell is measured to be 0.38nW. We also propose a 9T-2MTJ NAND-type TCAM cell, which has 43.75% fewer transistors than conventional TCAM cells. Our study revealed that the NOR TCAM using the proposed approach is better than the NAND TCAM in area, delay, and power. Therefore, it makes practical sense to employ the NOR TCAM in search applications.
