ABSTRACT Spin transfer torque-random access memory (STT-RAM) has recently been regarded as one of the most promising non-volatile memory candidates for the next-generation computer architectures. However, the readability issue has become a new obstacle for STT-RAM in deeply-scaled technology nodes, owing to (a) the increasing process-voltage-temperature variations but reduced supply voltage, resulting in low sensing margin (SM); (b) reduced current margin between the read current and the critical write current of magnetic tunnel junction, leading to the high read disturbance (RD). Here, to deal with the readability issue of deeply-scaled STT-RAM, we propose a full-sensing-margin dual-reference sensing (FSM-DRS) scheme via exploiting analog signal processing within the sensing circuit design. The proposed FSM-DRS scheme improves SM significantly but with no increase of RD through: (a) including two reference cells which utilize the same structure of the data cells to provide two reference signals, thus reducing the reference mismatch or regularity problem; and (b) adding an analog signal pre-processing operation between the data and reference signals before decision, doubling SM without increasing RD. In comparison with the typical sensing schemes, our simulation results (under the 40 nm technology node) show that our FSM-DRS scheme has an ∼70% enhancement in average SM as well as a ∼10× decrease in bit error rate.
I. INTRODUCTION
As the process technology node scales down, conventional CMOS technology faces enormous problems, of which high leakage current is the most serious [1] , [2] . To tackle this problem, many new nonvolatile memory technologies have been proposed. Among them, spin transfer torque-random access memory (STT-RAM) has been widely regarded as one of the most promising candidates for the next-generation nonvolatile memory technologies, owing to its low power consumption, nonvolatility, fast speed, good compatibility with CMOS technology and high endurance [3] - [5] . Nevertheless, there are still some roadblocks on the way to its practical application and commercialization. Among them, read reliability is becoming one of the most critical issues, especially as technology downscales. The read reliability of STT-RAM is mainly affected by three aspects: (a) the tunnel unintentional switching of the MTJ state. Therefore, there is an intrinsic conflict, and it is rather difficult to optimize RD and SM of deeply-scaled STT-RAM simultaneously [14] - [16] . To make matters worse, as technology continuously scales down, the margin between the read current and the critical write current of the MTJ becomes even smaller, making the readability issue a key bottleneck of STT-RAM in deeply-scaled technology nodes.
An MTJ consists of two ferromagnetic layers (e.g., CoFeB) separated by an oxide barrier (e.g., MgO). Fig. 1(a) and (b) show the two states of a 1T1MTJ (one transistor plus one MTJ in series) STT-RAM bit-cell structure. The magnetization direction of one ferromagnetic layer (the pinned layer) is fixed, while that of the other one (the free layer) can be switched by passing a sufficient write current (I write ) through the MTJ device. An MTJ has two different resistances depending on the relative magnetization directions of the pinned layer and the free layer. The parallel state ''0'' has a low resistance (R L ) and can be written by applying a write current flowing from the bit-line (BL) to the source-line (SL). On the other hand, the antiparallel state ''1'' has a high resistance (R H ) and can be written by applying a write current from the SL to the BL. The resistance difference between R L and R H is characterized by the tunnel magnetoresistance (TMR) ratio, TMR = (R H − R L )/R L . It should be noted that, since the polarity of the read current is the same as that of I write for writing resistance state ''0'' in the configuration shown in Fig. 1 , an RD (i.e., an unintentional MTJ switching event during the read operation) may occur when the initial state of the MTJ is ''1''.
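As a quick worked instance of the TMR definition above, the relation can be rearranged to give R H in terms of R L . This is a rearrangement only; the numerical case uses the TMR = 100% value quoted later for the simulations.

```latex
R_H = R_L\,(1 + \mathrm{TMR}), \qquad \mathrm{TMR} = 100\% \;\Rightarrow\; R_H = 2\,R_L .
```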
In the past several years, many designers have attempted to overcome the read-yield degradation of STT-RAM as technology downscales. From the device point of view, a larger TMR is desirable for an MTJ device, as it significantly increases SM [17] . Besides, a cell structure that separates the write and read paths helps to avoid RD [18] , [19] . Meanwhile, the complementary cell (or twin-cell) structure [20] , [21] , in which complementary values are written to two cells, is an effective way to double SM, but at the cost of storage capacity, because two cells are required to store only one data bit. On the other hand, designing novel reference schemes and sensing circuits at the circuit level has gained special interest as a way to tolerate process-voltage-temperature (PVT) variations, and there have been many attempts in this area. For instance, the self-reference sensing schemes, which try to remove the PVT variations of the reference cell by utilizing the data cell itself as the reference cell [22] - [24] or by changing the reference cell structures [25] - [31] , are rather attractive. Meanwhile, the offset-tolerant sensing circuits, which are proposed to decrease the PVT variations of the sensing branches [32] - [34] , are very popular in circuit design. These sensing schemes can indeed increase SM, but they generally need multiple steps to perform the sensing operation, leading to a considerable delay. In addition, many other sensing schemes, such as adding redundancy to the sensing amplifiers (SAs), current-sampling SAs, dynamic-reference SAs and feedback SAs [36] - [44] , have been proposed to address the readability issue of scaled STT-RAM. However, most of the previous studies cannot address the process variations, the SM and the RD simultaneously. In particular, as process technology scales, the conflict between SM and RD becomes more serious. Therefore, establishing a novel sensing scheme to address this conflict in deeply-scaled technology nodes is of pressing importance.
In this work, we propose a novel full-sensing-margin dual-reference sensing (FSM-DRS) scheme to solve the readability issue of deeply-scaled STT-RAM by exploiting analog signal processing within the sensing circuit design. The proposed FSM-DRS scheme aims to double SM with no increase of RD. To realize this goal, two novel features are included in the FSM-DRS scheme: (a) two complementary reference cells with exactly the same bit-cell structure as the data cells are added to generate two complementary reference signals for the sensing circuit, avoiding reference mismatch and regularity problems between the data and reference cells; (b) an analog signal pre-processing (ASP) module is employed to process the data and reference signals before they are sent to the SA. This module can double SM by directly comparing the data signal (for a resistance of either R H or R L ) with its complementary reference signal through an analog signal processing operation (discussed in detail in Section III). In this way, the ASP module enables the proposed sensing circuit to achieve full SM without increasing RD, compared with the typical sensing schemes (with half SM). Furthermore, the analog signal pre-processing operation is achieved through capacitance division, inducing no extra process variations. Unlike the complementary cell structure, our approach requires only one cell to store one data bit, and the two reference cells are shared by the whole array. Nevertheless, our sensing scheme obtains a doubled SM similar to that of the complementary cell structure.
The remainder of this paper is organized as follows: Section II presents a short introduction of typical sensing schemes, analyzes the readability issues of deeply-scaled STT-RAM, and presents our motivation. Section III then presents the concept and circuit implementation of our FSM-DRS scheme. Afterwards, Section IV presents the simulation results and discussions. Finally, Section V concludes the paper.
VOLUME 6, 2018
II. READABILITY ISSUES OF STT-RAM

A. TYPICAL SENSING CIRCUIT
A typical sensing circuit of STT-RAM is shown in Fig. 2 [37] . The sensing operation can be divided into three steps. First, a bias voltage (V BL ) is applied to the target BL, generating a data current signal (I data ) from the data cell. Second, I data is converted into a data voltage (V data ) by the load transistors of the sensing circuit. At the same time, a similar procedure is performed for the reference cell to obtain a reference voltage (V ref ). Finally, V data and V ref are compared by an SA, which outputs ''0'' or ''1'' depending on which of the two voltages is larger.
Based on the reference cell structure, there are two kinds of reference cells: resistance-mean (RM) ( Fig. 2(a) ) and current-mean (CM) reference cells ( Fig. 2(b) ) [24] . The concept of the RM reference cell is to average R L and R H to get an RM reference resistance (i.e., 
where R data and R ref are the resistance values of the data cell and the reference cell, respectively. R data is either R L or R H , depending on the data stored in the MTJ. Ideally, we can get 
It should be noted that the amplitude of SM must be larger than the input offset of the SA in order to achieve a correct sensing operation. In addition, a larger SM is preferred in practice because it enables a faster sensing operation. Therefore, we pursue an SM as large as possible in sensing circuit design.
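The half-margin limit of a single mid-point reference can be made explicit with a short worked equation. It is a sketch under an idealized assumption (a variation-free reference sitting exactly midway between the two data levels, written here as V H and V L for the R H and R L states) and is not a reproduction of Eqs. (1)-(3).

```latex
V_{ref} = \frac{V_H + V_L}{2}
\;\Rightarrow\;
\mathrm{SM} = \left| V_{data} - V_{ref} \right| = \frac{\left| V_H - V_L \right|}{2}
\quad \text{for } V_{data} \in \{ V_H, V_L \}.
```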
B. READABILITY ISSUES OF SCALED STT-RAM
In general, SM is proportional to the TMR ratio and V BL (see Eqs. (1)- (3)). Nevertheless, the TMR ratio is limited by the intrinsic MTJ device properties and the process technology, while V BL must also be limited to avoid RD of the data and reference cells during the sensing operations, because the RD probability (see (4) ) also increases with V BL (or the sensing current flowing through the MTJ). Therefore, there is a fundamental conflict between RD and SM. In (4), Pr dis denotes the RD probability, I read and t read are the amplitude and duration of the sensing current pulse, respectively, Δ is the thermal stability factor of the MTJ, I C0 is the critical write current and τ 0 is the attempt period for MTJ switching. To make matters worse, as the process technology downscales, on one hand, the PVT variations increase, resulting in further degradation of SM; on the other hand, I C0 decreases, making the margin between I read and I C0 even smaller (i.e., Pr dis increases). The conflict between RD and SM therefore worsens with technology downscaling, making the readability issue a key bottleneck of STT-RAM in deeply-scaled technology nodes.
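To illustrate how strongly the RD probability depends on the ratio between I read and I C0 , the following minimal sketch evaluates a commonly used thermal-activation model. The functional form is an assumption here (the text's Eq. (4) is not reproduced), as are the attempt period of 1 ns and the example currents and pulse width; only the thermal stability factor of 60 matches the simulation parameters quoted later.

```python
import math


def read_disturb_probability(i_read, i_c0, t_read, delta=60.0, tau0=1e-9):
    """Thermal-activation estimate of the read-disturb probability.

    Assumed model (not taken from the text):
        Pr_dis = 1 - exp(-(t_read / tau0) * exp(-delta * (1 - i_read / i_c0)))
    The model is only meaningful for 0 <= i_read < i_c0.
    """
    if not 0.0 <= i_read < i_c0:
        raise ValueError("model assumes 0 <= i_read < i_c0")
    rate = (t_read / tau0) * math.exp(-delta * (1.0 - i_read / i_c0))
    # -expm1(-rate) equals 1 - exp(-rate) and stays accurate for tiny rates.
    return -math.expm1(-rate)


if __name__ == "__main__":
    i_c0 = 50e-6  # hypothetical critical current (A)
    for frac in (0.2, 0.4, 0.6):
        p = read_disturb_probability(frac * i_c0, i_c0, t_read=5e-9)
        print(f"I_read = {frac:.1f} * I_C0 -> Pr_dis ~ {p:.3e}")
```

Under this assumed model, either raising I read or lowering I C0 shrinks the exponent's margin, which is the scaling trend the paragraph above describes.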
C. MOTIVATION
As shown in (4), improving SM by enlarging the sensing current is infeasible in practice because of the RD issue, especially in deeply-scaled technology nodes. However, it should be noted that the maximum achievable SM of the typical sensing circuit is only half of |V H − V L | or |I H − I L | (see (3)), owing to the reference signal. If we can exploit the full margin of |V H − V L | or |I H − I L |, then we can double SM without increasing the sensing current (or RD). Based on this motivation, we propose a novel full-sensing-margin dual-reference sensing (FSM-DRS) scheme by exploiting analog signal pre-processing within the sensing circuit design for deeply-scaled STT-RAM.
III. PROPOSED FSM-DRS SCHEME
Fig. 3 illustrates the schematic of the proposed FSM-DRS scheme, which consists of three parts: a sensing circuit, an analog signal pre-processing (ASP) module and an SA. Here two complementary reference cells (see the shadow region in Fig. 3 value of V dd /2, as it lies in the middle of the supply voltage and enables an equal margin either to increase to V dd or to decrease to GND. Thus, we can get 
2 ) owing to the capacitance division. After S3, we can get updated V 1 and V 2 (see Eq. (6)). Finally, (S4) during the decision stage, V 1 and V 2 are sent to the SA to output a digital result. As can be seen, the voltage difference between V 1 and V 2 after S3 can achieve a full SM = |V H −V L | between V H and V L (see Eq. (7)).
Specifically, if the data cell is in ''0 (i.e., R L )'' state, then V data = V L , we can get,
Therefore, the SA outputs a digital ''0''. Alternatively, if the data cell is in ''1 (i.e., R H )'' state, then V data = V H , we can get,
Therefore, the SA outputs a digital ''1''. This indicates the correct operation of the proposed FSM-DRS scheme. More details will be presented in the following section.
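The decision behavior described above can be sketched with a purely behavioral model. The arithmetic below, which sums the differences between the data signal and each of the two reference signals, is an assumed stand-in for the capacitive pre-processing and is not a reproduction of Eqs. (5)-(9); the function names and the voltage levels are hypothetical.

```python
def typical_margin(v_data, v_h, v_l):
    """Signed SA input for a single mid-point reference (idealized)."""
    return v_data - 0.5 * (v_h + v_l)


def dual_reference_margin(v_data, v_ref_h, v_ref_l):
    """Signed SA input when the data signal is compared with both references.

    Assumed pre-processing: (v_data - v_ref_l) + (v_data - v_ref_h).
    For an ideal cell one term cancels, leaving +(v_h - v_l) or -(v_h - v_l).
    """
    return (v_data - v_ref_l) + (v_data - v_ref_h)


def decide(margin):
    """Return the digital output: 1 for a positive margin, otherwise 0."""
    return 1 if margin > 0 else 0


if __name__ == "__main__":
    v_h, v_l = 0.6, 0.4  # hypothetical data levels (V)
    for name, v_data in (("0", v_l), ("1", v_h)):
        single = typical_margin(v_data, v_h, v_l)
        dual = dual_reference_margin(v_data, v_h, v_l)
        print(f"stored {name}: single-ref {single:+.3f} V, "
              f"dual-ref {dual:+.3f} V, output {decide(dual)}")
```

In this idealized picture the dual-reference margin has twice the magnitude of the single-reference one while the cell is still read with the same bias, which is the motivation stated in Section II-C.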
IV. SIMULATIONS AND DISCUSSIONS
Hybrid MTJ/CMOS circuit simulations were carried out using a commercial 40 nm CMOS design kit and a compact MTJ model [45] . The key parameters are as follows: V dd = 1.0 V, TMR = 100%, Δ = 60, C 1 = C 2 = 3.0 fF, the temperature is 300 K, the resistance-area product (R·A) of the MTJ is 5 Ω·µm 2 , and the process variations of the MTJ (mainly resistance) and the CMOS transistors (including V t , width and length) are 5% and 3σ , respectively. Here, the 5% variation of the MTJ resistance is a typical value based on reported experimental data and production test results [46] , [47] . The 3σ variation of the CMOS transistors, in which σ is the standard deviation of a Gaussian distribution, comes from the commercial 40 nm CMOS design kit. Monte Carlo simulations were performed with 10000 runs.
Table 1 lists the performance comparison between the typical sensing schemes, DDRS [41] and the proposed FSM-DRS scheme with TMR = 100%. First, compared with the typical schemes, our FSM-DRS achieves the best SM and BER. In addition, please note that both the SM and the sensing current differ between Typical+RM and Typical+CM, due to the different connection structures of the load PMOS transistors. In Typical+RM, the load PMOS transistors are connected to act as a load resistance, while in Typical+CM they are connected in a current mirror structure, so the sensing current (and accordingly the SM) is dramatically reduced. On the other hand, thanks to the current mirror structure, Typical+CM has a lower standard deviation (SD). As a result, the final BERs of Typical+RM and Typical+CM are similar. In comparison with the DDRS [41] based on the same parameters and simulation environment, our proposed FSM-DRS has an advantage in terms of SM and BER with a similar area overhead. 
However, our FSM-DRS requires a relatively longer sensing time, due to the analog signal pre-processing step. Besides, the sensing current (and hence the related read disturbance and power consumption) is relatively larger, because we utilize a typical voltage-mode sensing method, whereas in DDRS the sensing current is clamped by the positive feedback. Fortunately, since we have a high SM, the sensing current can be significantly reduced by clamping the BL voltage or by increasing the load resistance. Besides, in comparison with the covalent-bonded cross-coupled (CBCC) scheme proposed in [43] based on the same parameters and simulation environment, the CBCC scheme has a shorter sensing time and a smaller sensing current, but our proposed FSM-DRS shows a big advantage in BER. In the following, the performance and parameter impacts of the proposed FSM-DRS are analyzed in detail.
A. IMPACT OF CLAMP VOLTAGE
The gate voltage of the NMOS clamp transistor (denoted as V clamp ), which controls the amplitude of the sensing current, has an important impact on the sensing operation of the FSM-DRS scheme. A larger V clamp results in a larger sensing current, thus improving SM and sensing speed. On the other hand, it increases the RD probability. When V clamp = 0.9 V, σ V L and σ V H differ noticeably (see Fig. 6(a) ), which will lead to an asymmetric read reliability (i.e., the sensing yield is different for data bits ''0'' and ''1''). However, when V clamp = 1.0 V, σ V L and σ V H are comparable (see Fig. 6(b) ), so little or no sensing yield asymmetry will occur. In addition, the sensing currents at V clamp = 1.0 V are only slightly larger than those at V clamp = 0.9 V, indicating a similar RD probability. For these reasons, V clamp = 1.0 V is selected in our FSM-DRS design for a higher sensing yield.
B. SIGNAL DISTRIBUTIONS
Fig. 7 shows the ΔV C0,1 (ΔV C in the ''0'' or ''1'' state) distribution of the proposed FSM-DRS scheme. Ideally, ΔV C0 < 0 and ΔV C1 > 0 should hold for a correct sensing operation. As can be seen from Fig. 7 , almost all values of ΔV C0 and ΔV C1 fall within the correct region, indicating a rather high read yield. The mean values µ ΔV−C0 and µ ΔV−C1 are ∼258.9 mV and ∼258.2 mV, respectively, nearly twice as much as that of the typical one (shown in Table 1 ). In addition, the deviations σ ΔV−C0 and σ ΔV−C1 are ∼78.4 mV and ∼83.3 mV, respectively, which are very close, consistent with the conclusion drawn from Fig. 6 that only a slight sensing yield asymmetry occurs. Furthermore, Fig. 8 shows the normal probability plots of ΔV C0 and ΔV C1 of the proposed FSM-DRS scheme, validating the results in Fig. 7 . As can be seen, both ΔV C0 and ΔV C1 follow the reference lines except for a small deviation. Besides, there is no sharp slope in Fig. 8 , indicating that the proposed FSM-DRS scheme is rather robust under PVT variations.
C. SM AND BER
Fig. 9 shows the SM with respect to the TMR ratio. As can be seen, the average SM of the FSM-DRS scheme has a ∼70% enhancement in comparison with the typical sensing schemes. It should be noted that the practical SM improvement (∼70%) is less than the ideal assumption (i.e., doubling) owing to the parasitic capacitances. The detailed reason is explained in the subsection on the impact of the capacitance. Fig. 10 presents the BER obtained with Monte Carlo simulations (10000 runs). The BER of the FSM-DRS scheme is far below that of the typical sensing scheme. When TMR = 100%, the BER of the proposed FSM-DRS scheme is ∼0.315%, about a 10× reduction compared to that of the typical sensing schemes. More importantly, when TMR > 150%, no sensing errors were found in our Monte Carlo simulations.
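The link between the sensing-margin distribution and the bit error rate can be illustrated with a simple analytical sketch. It assumes the margin is Gaussian and that a read fails whenever the margin falls below a fixed SA input offset; both are simplifying assumptions, and the function name and parameters are hypothetical. It does not attempt to reproduce the Monte Carlo BER values reported in the text, which come from full circuit simulation.

```python
from statistics import NormalDist


def gaussian_ber(mean_margin, sigma_margin, sa_offset=0.0):
    """Probability that a Gaussian sensing margin falls below the SA offset.

    mean_margin and sigma_margin describe the margin magnitude, with the sign
    chosen so that a positive value means a correct read. All three arguments
    must share the same unit (for example mV).
    """
    if sigma_margin <= 0:
        raise ValueError("sigma_margin must be positive")
    return NormalDist(mean_margin, sigma_margin).cdf(sa_offset)


if __name__ == "__main__":
    # Hypothetical inputs in mV: same spread, half versus full mean margin.
    for mean in (130.0, 260.0):
        print(f"mean = {mean:.0f} mV -> BER ~ {gaussian_ber(mean, 80.0, 20.0):.2e}")
```

Under these assumptions, doubling the mean margin at a fixed spread pushes the failure threshold further into the tail of the distribution, which is why a larger SM translates into a disproportionately lower error rate.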
D. IMPACT OF THE CAPACITANCE
The capacitances, contributed by the capacitors C 1 and C 2 (which are two important components) as well as the parasitic capacitances DC 1 and DC 2 (see Fig. 11 ), have a great impact on the performance of our proposed FSM-DRS scheme. During the subtraction stage (S3), when C 1 and C 2 are charged or discharged, DC 1 and DC 2 are also charged or discharged at the same time, so that V 1 follows V * 1 (and V 2 follows V * 2 ) only with deviations that mainly depend on the relative capacitance amplitude between C 1 and DC 1 (between C 2 and DC 2 ). Specifically, assume the data cell is in the ''1'' state; then the voltage on node V * 1 at stages S2 and S3 can be denoted as V * 1 (S2) = V L and V * 1 (S3) = V H , respectively. The relationship between V * 1 and V 1 after stage S3 can then be determined as,
By solving (10), we can get, 
In addition, because the ASP module in our circuit has a symmetric structure, the BL capacitance has a symmetric impact on the coupled voltages of V 1 and V 2 through C 1 and C 2 . Therefore, in Eqs. (10)- (12), we omit the BL capacitance for simplicity. Further, Fig. 12(a)-(b) shows the impact of the capacitances of C 1 and C 2 on the σ V distribution and SM, respectively. As can be seen, both σ V and SM increase with the capacitances of C 1 and C 2 when their values are relatively small, but saturate after their values exceed a threshold. This phenomenon can be explained by Eq. (12), through the relative ratio between C 1 and C 1 + DC 1 (C 2 and C 2 + DC 2 ). In practice, relatively larger C 1 and C 2 are preferable for achieving a higher SM to overcome the input offset of the SA. In addition, Fig. 12(c) shows the impact of the capacitances of C 1 and C 2 on the sensing time of the proposed FSM-DRS scheme. A larger capacitance takes longer to charge or discharge, thus leading to a longer sensing time. Therefore, there is a trade-off between SM and speed in the choice of C 1 and C 2 . Meanwhile, it should be noted that the capacitances of C 1 and C 2 should be far larger than the parasitic capacitances DC 1 and DC 2 . Taking all these concerns into consideration, we choose C 1 = C 2 = 3.0 fF in this paper. In addition, the capacitance of the BL also has an impact on the sensing time owing to the charging delay, as shown in Fig. 13 .
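The saturation of SM with increasing C 1 and C 2 can be illustrated with a first-order capacitive-divider sketch. It assumes that a voltage step on one side of a coupling capacitor appears on the other side scaled by C /(C + DC), which is the ratio the text attributes to Eq. (12); Eqs. (10)-(12) themselves are not reproduced, and the parasitic value and voltage levels used below are hypothetical.

```python
def coupling_ratio(c_couple, c_parasitic):
    """Fraction of a voltage step passed through a coupling capacitor onto a
    node loaded by a parasitic capacitance (first-order charge sharing)."""
    if c_couple <= 0 or c_parasitic < 0:
        raise ValueError("capacitances must be positive (parasitic may be 0)")
    return c_couple / (c_couple + c_parasitic)


def effective_margin(v_h, v_l, c_couple, c_parasitic):
    """Full margin |v_h - v_l| scaled by the assumed coupling ratio."""
    return coupling_ratio(c_couple, c_parasitic) * abs(v_h - v_l)


if __name__ == "__main__":
    c_par = 0.5  # hypothetical parasitic capacitance (fF)
    for c in (0.5, 1.0, 3.0, 10.0):  # coupling capacitance sweep (fF)
        ratio = coupling_ratio(c, c_par)
        print(f"C = {c:4.1f} fF -> ratio {ratio:.3f}, "
              f"margin {effective_margin(0.6, 0.4, c, c_par) * 1e3:.1f} mV")
```

In this simplified picture the ratio approaches 1 as C grows, so the margin gain flattens out while the charging time keeps increasing, which matches the trade-off described above.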
E. AREA OVERHEAD
Table 1 shows that the proposed FSM-DRS scheme has a relatively larger area overhead (see also the layout in Fig. 14) compared with the typical schemes, due to the extra reference cells and the ASP module. However, area overhead is not the most critical performance bottleneck for deeply-scaled STT-RAM in comparison with the reliability requirement. Although the area overhead of the FSM-DRS circuit is larger than that of the other two schemes, the total area overhead is negligible for the whole memory chip, because the sensing circuit in every WL or BL serves many memory cells. For instance, for a sub-array with 8 BLs and 256 WLs, the FSM-DRS scheme occupies only ∼1.3% of the chip, while the area overheads of the Typical+CM and Typical+RM are ∼1.2% and ∼1.0%, respectively. Furthermore, as the sub-array size grows, the area overhead is further reduced.
Meanwhile, in order to show the merit of our proposed sensing scheme, we also compared the performance at identical area by enlarging the layout of the load PMOS and clamp NMOS transistors of the Typical+CM and Typical+RM schemes. As shown in Table 2 , our work still has an advantage in BER. In addition, please note that the key insight of our proposed sensing approach is the ASP part, so previously reported variation reduction techniques can also be integrated into our sensing scheme.
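The way the sensing-circuit overhead shrinks with the sub-array size, as in the 8-BL by 256-WL example above, can be sketched as a simple area ratio. The model assumes one sensing circuit per sub-array and uses hypothetical areas in arbitrary units; it is not calibrated to the layout in Fig. 14 or to the percentages quoted in the text.

```python
def sensing_area_fraction(n_bl, n_wl, cell_area, sensing_area):
    """Share of the total sub-array area taken by one sensing circuit.

    Assumes n_bl * n_wl identical cells plus a single sensing circuit;
    cell_area and sensing_area must use the same unit.
    """
    if n_bl <= 0 or n_wl <= 0 or cell_area <= 0 or sensing_area < 0:
        raise ValueError("array dimensions and areas must be positive")
    array_area = n_bl * n_wl * cell_area
    return sensing_area / (array_area + sensing_area)


if __name__ == "__main__":
    cell, sense = 1.0, 30.0  # hypothetical areas (arbitrary units)
    for n_wl in (256, 512, 1024):
        frac = sensing_area_fraction(8, n_wl, cell, sense)
        print(f"8 BLs x {n_wl} WLs -> sensing share {frac * 100:.2f}%")
```

Because the array area grows with the number of cells while the sensing circuit area stays fixed in this model, the overhead share falls steadily as the sub-array is enlarged.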
V. CONCLUSIONS
The read reliability has become a key bottleneck for STT-RAM as process technology continuously downscales. In this paper, a novel FSM-DRS scheme is proposed, achieving full SM without increasing RD by exploiting an analog signal pre-processing operation on the data and reference signals before they are sent to the SA. The proposed FSM-DRS scheme was implemented and evaluated under the 40 nm technology node. According to our simulation results, the FSM-DRS scheme can realize a ∼70% improvement in average SM and a ∼10× decrease in BER for TMR = 100% with insignificant performance and area overheads, in comparison with typical sensing schemes. More interestingly, the proposed FSM-DRS scheme can achieve 100% read yield when TMR > 150%, which is a popular parameter in mainstream MTJ technology. The proposed FSM-DRS scheme is promising for advanced deeply-scaled STT-RAM.
