In this work, we investigate the sensing challenges of spin-transfer torque MRAMs built with perpendicular magnetic tunnel junctions that combine a high tunneling magnetoresistance ratio with a low resistance-area product. To overcome the problems of reading this type of memory, we propose a voltage sensing amplifier topology and compare it with a current sensing amplifier in terms of power, speed, and bit error rate. We verify that the proposed sensing scheme offers a substantial improvement in bit-error-rate performance. In summary, read operations with the proposed sensing scheme, which applies the proposed cross-coupled capacitive feedback technique to the clamped circuitry, achieve a 2.5X reduction in average power and a 13X increase in average reading speed compared with previous works, owing to the device structure and the proposed circuit technique.
Introduction
Static random access memories (SRAMs) are vital blocks of high-speed digital systems, as they are used to implement on-chip cache memories. However, the static power consumption of SRAM increases as CMOS technology nodes scale down. Emerging non-volatile memories seem to be viable candidates to replace SRAMs in order to accommodate new scaled-down technologies. Among these memory candidates, spin-transfer torque magnetic RAM (STT-MRAM) is particularly attractive, as it provides some unique advantages lacking in other candidates: CMOS compatibility, suitability for low-power and high-speed operation, and high endurance [1].
The storage device used in STT-MRAM is the magnetic tunnel junction (MTJ), which stores information via the magnetoresistive effect: binary information is encoded in the MTJ as a high (R_AP) or low (R_P) resistance, depending on the relative magnetization directions of its ferromagnetic layers. Perpendicularly magnetized (PMA) MTJs offer the additional attractive feature of low-power switching [1, 2, 3, 4, 5].
There are several types of PMA devices: bulk PMA-MTJs in thin films, exchange-coupled PMA-MTJs in superlattices, and interfacial PMA-MTJs (pMTJs) at ferromagnet (CoFe)/oxide (MgO) interfaces [6]. Among these, the pMTJ has attracted great attention for its potential to realize MTJs with high tunneling magnetoresistance (TMR) at a low resistance-area product (RA). A high TMR makes the R_AP and R_P states easier to distinguish from each other, and a low RA reduces the average switching current density [7]. In addition, TMR values depend on bias voltage [4] and temperature [3]. However, pMTJs with high TMR at low RA are challenging because their thinner tunnel barrier layer (MgO) leads to high resistance variations [3, 4]. In write operations, the switching threshold current (I_C) of a pMTJ is an important parameter that determines the current required to switch between the R_AP and R_P states. Low values of I_C are obtained as the pMTJ dimensions are scaled down. However, scaling down the pMTJ has its own drawbacks, which often cause read disturbance (RD) and incorrect decisions because of a low sensing margin (SM) [1].

Corresponding author: M. Atasoyu, Department of Electronics and Communication Engineering, Istanbul Technical University, Istanbul, 34469. E-mail address: matasoyu@itu.edu.tr
The purpose of this paper is to seek better sensing schemes for recently proposed devices [3, 4], which have lower switching power than conventional devices. However, these MTJ devices have larger resistance variations than conventional MTJs (12.5% in this work versus a typical 5%) because of their thinner oxide and free layers. Our proposed sensing scheme has a balanced current mechanism that uses cross-coupled capacitive feedback in the reference and data cells of the MRAM arrays. We verify the effectiveness of our technique by comparing it, within exactly the same simulation framework, with the seminal sensing scheme design of [8]. That design is a current-mode sensing scheme, unlike ours (a voltage-mode sensing scheme), and uses low-cost reference cells whose MTJs are only in the low resistance state; we have adopted this reference approach from [8]. The Monte Carlo BER results of the proposed sensing scheme and of [8] show our contributions. First, reducing the MTJ dimensions requires read operations of MRAM arrays at low sensing current levels, which are needed to prevent read disturbance given the low switching current threshold of the MTJ between its high and low resistance states. Our sensing scheme improves the BER performance compared with [8] because the currents of the data cells and reference cells are well defined. Second, we compare the voltage-mode and current-mode sensing schemes in terms of power, speed, and BER, which are the main specifications for sensing operations. This paper thus gives a perspective on ongoing sensing scheme designs for embedded MRAM in terms of current-mode and voltage-mode approaches.
In this work, we investigate the challenges of designing sensing schemes for STT-MRAM built with high-TMR, low-RA pMTJs. To this end, we consider voltage and current sense amplifier topologies and evaluate them from different perspectives, i.e., power dissipation, speed, and BER. The low-RA pMTJ exhibits large resistance variations due to its thinner barrier oxide layer, which may substantially degrade the BER performance [3, 4] of STT-MRAM arrays. As a remedy, we propose a circuit in which the sensitivity of the latch circuitry is improved by taking the sense output from a high-impedance node and by keeping the supply voltage headroom low, which in turn enables low-current sensing. Furthermore, we designed a clamped reference scheme for the proposed circuit in order to further improve the BER performance. At the core of the clamped reference circuitry is a cross-coupled capacitive feedback (CCCF) mechanism, which reduces hysteresis and kickback noise and improves the current balancing between the data cells and the reference cells [8, 9, 10]. However, we have observed that compensating the parasitic capacitances of the clamped reference circuitry increases the power consumption.
The rest of the paper is organized as follows: Section 2 presents the compact modeling of the pMTJ, Section 3 describes the development of the proposed sensing scheme, Section 4 presents the simulation results and their effects on the key design metrics, and Section 5 concludes this work.

An MTJ has three layers: a free layer (FL), a pinned layer (PL), and an oxide barrier layer (such as MgO). The device has two stable, switchable magnetic resistance states, a low resistance state (P) and a high resistance state (AP), stored in the FL. The magnetization reversal of the MTJ can be either precessional or thermally activated.
Table 1. Design parameters of the MTJ.
Parameter | Value
The spin polarization factor (P) | 0.5
The memory cell area | 40 nm x 40 nm
The tunneling magnetoresistance ratio (TMR) | 165%
The oxide thickness (t_ox) | 1 nm
The saturation magnetization (M S ) in Fig.1 .
The Simulation Framework
In this work, we use a macrospin compact model of the MTJ written in Verilog-A [11], in which the magnetic switching dynamics of the MTJ are described by the LLGS equations in the mono-domain approximation [12] and the conductance of the MTJ is modeled as described in [13]. In addition, 65 nm CMOS model parameters and physical parameters of MTJs taken from [14, 15] are used in the models. The design parameters of the MTJ are given in Table 1. The Monte Carlo simulations are configured with 1K samples, the 65 nm CMOS model, and a t_ox variation of 2% (3σ) for the MTJ. It is important to note that the resulting resistance variations of the MTJs are more than 12%.
The Comparisons of Sensing Schemes for STT-MRAM
A sensing scheme for STT-MRAM takes a current or a voltage signal at its input and resolves the stored resistive information with a sense amplifier (SenseAmp), which can be a voltage-mode sense amplifier (VSenseAmp) or a current-mode sense amplifier (CSenseAmp). A CSenseAmp amplifies the current difference between the activated bit line (BL) and the reference line (REFL), while a VSenseAmp amplifies the voltage difference between the activated BL and the REFL. In terms of speed, a VSenseAmp generally takes longer to read than a CSenseAmp because the large BL capacitance (C_BL) or REFL capacitance (C_REFL) lengthens the charging or discharging time. However, a VSenseAmp can be faster than a CSenseAmp when the variations of the threshold voltages (V_TH) of the CMOS devices exceed 12 mV [16]. In fact, the variations of V_TH are 30 mV or more in deep sub-micron technologies [17], such as the 65 nm CMOS technology node used in this work [8, 18, 19, 17, 20]. It should also be noted that these higher variations of V_TH substantially degrade the sensing margin (SM).
In the literature, several works aim to improve the SM in order to reduce the BER. Some of these efforts are at the device level and rely on increasing the TMR of MTJs [21, 22, 6, 4, 3]. Others are at the circuit level and are based on equalizing the differences in parasitic resistance between BLs [8]. This latter approach improves the read access time [18] of well-defined reference resistance cells, which include self-reference cells [23], self-reference cells with two transistors and two MTJs [17], dynamic data-dependent reference cells [18, 24, 19], reference cells in the R_P state only [8], and reference cells located close to data cells [18]. These works also address the vulnerabilities of the sensing schemes of STT-MRAM. However, the use of simple current mirror circuitry in the sensing scheme may increase current mismatches between the BLs and REFLs [8, 24, 19]. The dynamic data-dependent sensing schemes [18, 24, 19] are a good solution for increasing the SM, but their power consumption increases due to the use of two cross-coupled latch SenseAmps [18, 24] and a differential amplifier [19]. In addition, these two latch circuits are highly sensitive to mismatches [19]. STT-MRAM cell arrays realized with two transistors and two MTJs [17] achieve a high SM but suffer from high cost and large area. Although offset cancellation techniques can be used to improve the SM further, all these techniques lead to power-hungry circuits [25], such as the one proposed in [26].
The Proposed Sensing Schemes
The proposed VSenseAmp is composed of a latch, pre-charge and equalizing transistors, column (read enable) and write driver (write enable) switches, and clamped reference circuitry, as shown in Fig. 1.
The Power Efficient Timing of the proposed VSenseAmp
The read operations of the proposed VSenseAmp are carried out in three phases: pre-charge, evaluation, and decision. The data and reference cells are activated through word lines (WLs) and reference WLs (REF-WLs). The activation order of these phases is shown in Fig. 2; the clock delays are similar to those in [8]. The output signals of the proposed SenseAmp for the R_AP and R_P states are also shown in Fig. 2. In the pre-charge phase, transistors M7 and M8 are activated by the SAE clock signal, and the latch is disconnected by means of the SAE1 clock signal in order to reduce power dissipation [8]. Both BLs and REFLs are pre-charged to V_DD = 1 V; the proposed SenseAmp can also operate with a pre-charge voltage below 1 V. In the evaluation and decision phases, the output of the proposed VSenseAmp goes to V_DD (ground) when the resistive data sensed from the activated STT-MRAM cell is high (low).

[10, 8, 9], helps low-resistance-state sensing by creating an imbalance between BLs and REFLs, because the reference cells are only in the low resistance state [8]. These functions operate as follows. The MC and MR transistors are biased at V_C = 0.8 V and V_R = 0.7 V, values obtained through parametric analysis for optimal BER performance. These voltage sources drive the parasitic capacitances of MC and MR, mainly C_GD, and the coupling of these capacitances to the BL and REFL capacitances disturbs the cell current, which can cause a read failure. Adding the cross-coupled capacitors C_1 and C_2, each implemented as a transistor with the same width and length as MC or MR, as shown in Fig. 1, helps to reduce these voltage fluctuations by acting as a capacitive voltage divider between BLs and REFLs, as formulated in Eq. 1-2. However, this solution increases the area occupied by the sensing scheme.
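The three read phases (pre-charge, evaluation, decision) can be illustrated with a minimal behavioral sketch in Python. It is not the proposed circuit: it assumes an ideal RC discharge of the pre-charged lines, ignores the access and clamp transistors, and uses a hypothetical effective reference resistance `R_REF` (the geometric mean of R_P and R_AP) in place of the R_P-only reference cells with clamp biasing used in this work. The evaluation time `t_eval` is likewise an assumed value. The resistances, V_DD, and C_BL are the values quoted in this paper.

```python
import math

V_DD = 1.0       # pre-charge voltage (V), as stated in the text
C_BL = 50e-15    # line capacitance (F), value used in the paper's simulations
R_P = 742.0      # low-resistance state (ohm), from the paper
R_AP = 1970.0    # high-resistance state (ohm), from the paper
R_REF = math.sqrt(R_P * R_AP)  # hypothetical effective reference (assumption)


def read_cell(r_cell, r_ref=R_REF, t_eval=50e-12):
    """Return the latch output (V_DD or 0.0) for one read operation."""
    # Phase 1, pre-charge: BL and REFL both start at V_DD.
    v_bl = V_DD
    v_refl = V_DD
    # Phase 2, evaluation: each line discharges through its cell (ideal RC).
    v_bl *= math.exp(-t_eval / (r_cell * C_BL))
    v_refl *= math.exp(-t_eval / (r_ref * C_BL))
    # Phase 3, decision: the latch resolves the sign of the difference.
    # A high-resistance cell discharges more slowly, so BL stays above REFL.
    return V_DD if v_bl > v_refl else 0.0


out_ap = read_cell(R_AP)  # high resistance -> output at V_DD
out_p = read_cell(R_P)    # low resistance  -> output at ground
```

In this simplified picture the output follows the polarity described in the text: V_DD for a high-resistance cell and ground for a low-resistance cell.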
The reference cell in this work is only in the low resistance state; in this paper, we propose to alleviate this penalty by using a current balance mechanism based on the cross-coupled feedback technique.
The Resistance Determination of Reference Cells
A current or voltage difference between a data cell and a reference cell is amplified by the SenseAmp; maximizing this signal difference improves the SM, as does a reference cell resistance with low variation. Researchers have sought well-defined reference cell resistances, such as R_P only [8], current-mean [24, 9], resistance-mean [24], dynamic data-dependent [18, 24], and absolute resistance [27] references. In this work, we additionally consider a multiple-cell R_P reference built from serially connected R_P cells to protect against read disturbance.
Performance Comparisons
We compared the read performance of the CSenseAmp [8], our proposed VSenseAmp, and several SenseAmp designs from the literature in terms of BER, power, speed, and area. We compare the speed and power of our work with the reference works [18, 19, 23], taking into account the bit lines and the reference lines. Our results also benefit from device-level MTJ innovations: the MTJ devices used here are perpendicularly magnetized with a low resistance-area product and have lower switching power, as proposed in [6, 3, 4]. As mentioned above, this type of device offers better speed and power performance.
The Read Reliability: BER Performances
The BER performance is presented in this section to show the effectiveness of the proposed clamped reference neutralization technique on the voltage-mode and current-mode sensing schemes. MTJ devices with high TMR at low RA provide resistance matching between the MTJs and the access transistors [21], as well as good BER performance. However, TMR values are highly sensitive to variations of t_ox [3], the free layer thickness (CoFe/CoFeB) [3], and the RA product [4]. In addition, a pMTJ with a thinner t_ox has a low RA product [3, 21] but higher resistance variations. In our simulations, the resistance variations of the pMTJs are set to 13% for both R_P (the actual value is 742 Ω) and R_AP (the actual value is 1.97 kΩ), corresponding to a t_ox of 1 nm with a variation of 2% (3σ). It is important to note that the resistance variations of MTJ devices in simulations are generally defined as 5% at 3σ. In this work, they are much higher because of the thinner t_ox, as shown by the mean variations of R_AP and R_P in Fig. 6.
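As a quick consistency check of the quoted resistances, the following sketch uses the common definition TMR = (R_AP − R_P)/R_P, which is assumed here rather than stated in the text, and treats the 40 nm × 40 nm memory cell area of Table 1 as the junction area (also an assumption). The function names are illustrative.

```python
R_P_OHM = 742.0   # low-resistance state (ohm), from the text
TMR = 1.65        # 165 %, from Table 1
SIDE_UM = 0.040   # 40 nm expressed in micrometres


def r_ap_from_tmr(r_p, tmr):
    """R_AP implied by TMR = (R_AP - R_P) / R_P (assumed definition)."""
    return r_p * (1.0 + tmr)


def ra_product(r_p, side_um):
    """Resistance-area product in ohm*um^2 for a square junction (assumption)."""
    return r_p * side_um ** 2


r_ap = r_ap_from_tmr(R_P_OHM, TMR)  # about 1.97 kOhm, matching the quoted R_AP
ra = ra_product(R_P_OHM, SIDE_UM)   # about 1.19 ohm*um^2
```

Under these assumptions, the quoted R_P and R_AP are consistent with the 165% TMR of Table 1, and the implied RA product is close to 1.2 Ω·µm².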
To improve BER performance, data and reference cells can be placed closer together to minimize resistance variations [18], although this may consume extra area, and secondary noise effects on the latch circuitry, such as capacitive coupling, kickback, and hysteresis noise [13], can be reduced. On the other hand, offset-aware design of a sensing scheme involves a trade-off between low-power design and very precise design [25]. In this work, we compared two reference resistance schemes, three serially stacked pMTJs and a single pMTJ, in terms of BER performance. In the serially stacked structure [28, 29], the reference consists of three serially connected pMTJs but has the same total resistance as a single pMTJ, which makes the area of each of these pMTJs three times larger than that of a single pMTJ. This also provides protection against unintentional writes, owing to the threefold increase in J_C during read operations. We also compared the VSenseAmp and the CSenseAmp in terms of BER performance.
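The resistance of the stacked reference can be illustrated with a short calculation. The sketch below assumes a constant RA product, so that a junction with n times the area has 1/n of the resistance, and it assumes that the resistance variations of the stacked junctions are independent with the same relative spread as a single junction. Neither assumption is taken from the simulations in this work, and the 13% figure is used as a relative spread for illustration only.

```python
import math


def stacked_reference(r_single, rel_sigma, n):
    """Total resistance and relative spread of n series junctions,
    each with n times the area of the single reference junction."""
    r_each = r_single / n                    # constant RA: n x area -> R / n
    r_total = n * r_each                     # series connection restores R
    sigma_each = rel_sigma * r_each          # absolute spread of one junction
    sigma_total = math.sqrt(n) * sigma_each  # independent spreads add in quadrature
    return r_total, sigma_total / r_total


r_total, rel_spread = stacked_reference(742.0, 0.13, 3)
# r_total matches the single-junction value; rel_spread is 0.13 / sqrt(3)
```

If the independence assumption holds, stacking keeps the nominal reference resistance unchanged while narrowing its relative spread by a factor of sqrt(n).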
Furthermore, we examined the effects of parasitic capacitive coupling on the BER performance of the proposed VSenseAmp, such as the hysteresis effect, in which previously stored data remains on the parasitic capacitances (C_GD) of M3 and M4 when the recovery time of the latch is insufficient. To this end, we assumed capacitances of 2 fF, 4 fF, and 7 fF at the SAOUT and SAOUTB nodes, initially set these nodes to 0 V (V_DD), and then sensed the stored R_AP (R_P) data. The parasitic capacitive couplings of the proposed VSenseAmp and the CSenseAmp [8] are shown in Fig. 4. The BER performance of the proposed VSenseAmp and the CSenseAmp was evaluated by Monte Carlo simulations with 1K samples that include CMOS and MTJ process variations. The proposed VSenseAmp is more robust to secondary noise effects than the CSenseAmp, as shown in Fig. 5 and Fig. 6, which correspond to AP and P state detection. As a result, it is difficult to find an optimal solution in terms of the REC performance between the R_P and R_AP states, as can be seen in Table 2.

The VSenseAmp with multiple references with CCCF | 8 | 6
The VSenseAmp with a single reference with CCCF | 2 | 8
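A Monte Carlo BER estimate of the kind used in this section can be sketched as follows. This is a deliberately simplified stand-in for the circuit-level simulations: the cell resistance is drawn from a Gaussian distribution, the decision is an ideal comparison against a hypothetical reference resistance `r_ref`, and CMOS variations, offsets, and noise are ignored. All function and parameter names are illustrative.

```python
import random


def monte_carlo_ber(r_mean, rel_sigma, r_ref, stored_high, n_samples=1000, seed=0):
    """Fraction of samples in which an ideal comparator misreads the cell."""
    rng = random.Random(seed)           # fixed seed for repeatability
    errors = 0
    for _ in range(n_samples):
        r_cell = rng.gauss(r_mean, rel_sigma * r_mean)
        read_high = r_cell > r_ref      # ideal decision against the reference
        if read_high != stored_high:
            errors += 1
    return errors / n_samples


# Illustrative call: R_AP from the text, 13 % taken as a 3-sigma spread
# (assumption), and a hypothetical reference of 1.2 kOhm.
ber_ap = monte_carlo_ber(1970.0, 0.13 / 3, 1200.0, stored_high=True)
```

With 1K samples, the smallest nonzero BER that such an estimate can resolve is 1/1000, so lower error rates require more samples or a different estimation method.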
The CSenseAmp [8] with a single reference | 32 | 29
The VSenseAmp with a single reference | 45 | 38
The VSenseAmp with multiple references | 45 | 38
The VSenseAmp with CCCF with multiple references | 48 | 41
The VSenseAmp with CCCF with a single reference | 48 | 41
Power consumption comparisons
First, the latch is a power-hungry unit and must be effectively disconnected via the SAE1 clock after the sensing decision [8] for low-power operation. We analyzed different timing strategies for the SAE1 clock, such as making it identical to the SAE clock or delaying it by one or two inverter delays after the SAE clock [10]; the most effective strategy is similar to that of [8], and we applied it to our proposed VSenseAmp. However, the proposed CCCF technique increases the power consumption, as given in Table 3, and the CSenseAmp consumes less power than the proposed VSenseAmp. Reference resistance cells constructed with a single MTJ and with multiple MTJs have almost the same power dissipation, as indicated in Table 3. We also compared the power dissipation (at 66.7 MHz) of some previous works with that of the proposed VSenseAmp. Our proposed design has lower power dissipation than these works due to its device structure as well as the reduced capacitive coupling effects, as shown in Fig. 4. In addition, our power dissipation results are obtained through Monte Carlo simulations (with 1K samples), separately for the R_AP and R_P states.
Read speed comparisons
The speed comparisons are based on HSPICE simulations. They do not include the parasitic word-line capacitances; only the bit-line parasitic capacitance is included, with a value of 50 fF, which is higher than a typical value such as 30 fF for 256 cells. This choice matters because the read access delay of the sensing scheme is a function of the bit-line capacitance, t_access = (C_BL × V_offset)/I_read [30]. The readout time of the proposed VSenseAmp is sensitive to the discharging time of C_BL, which was 50 fF in our simulations, as well as to the voltage swings of the BLs and REFLs.
However, reading a word also requires access to the word line, with the write access delay specified by the process technology.
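The access-delay relation t_access = (C_BL × V_offset)/I_read can be evaluated directly. In the sketch below, the 50 fF and 30 fF capacitances come from the text, while the offset voltage (50 mV) and read current (20 µA) are hypothetical values chosen only to illustrate the scaling.

```python
def access_delay(c_bl, v_offset, i_read):
    """Read access delay t_access = C_BL * V_offset / I_read, in seconds."""
    return c_bl * v_offset / i_read


V_OFFSET = 50e-3   # hypothetical offset voltage (V)
I_READ = 20e-6     # hypothetical read current (A)

t_50f = access_delay(50e-15, V_OFFSET, I_READ)  # C_BL used in this work
t_30f = access_delay(30e-15, V_OFFSET, I_READ)  # C_BL quoted for 256 cells
ratio = t_50f / t_30f                           # delay scales linearly with C_BL
```

Because the delay is linear in C_BL, simulating with 50 fF instead of 30 fF lengthens the estimated access delay by a factor of 5/3 regardless of the assumed offset voltage and read current.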
A VSenseAmp can be faster than a CSenseAmp when it is designed in deep submicron technology nodes [16] .
The reading speed of a latched SenseAmp is limited by the overdrive recovery time of the latch circuitry, which is needed for reliable sensing, and also depends on the transconductance of the M3 and M4 transistors and on the parasitic capacitances. To improve the read speed of the VSenseAmp, we adopted a clamped reference scheme [10, 8] and proposed the CCCF technique to improve the readout time. The proposed VSenseAmp is faster than the CSenseAmp and less sensitive to parasitic coupling capacitances, but the read speed neither improves nor deteriorates when the reference cell is built with multiple MTJs instead of a single MTJ, according to the speed comparison results in Table 4. In addition, the readout delays are not the same for R_AP and R_P because of the asymmetric resistance distributions of R_AP and R_P. More importantly, the readout time of the proposed VSenseAmp with CCCF is less sensitive to reference resistance variations, as the results in Table 4 show for the multiple- and single-reference schemes with and without CCCF; however, the proposed VSenseAmp with CCCF is slower, although it operates reliably. Even so, the proposed VSenseAmp with CCCF reads faster than the compared CSenseAmp [8]. As a reminder, high-speed operation consumes more power. As a result, our proposed VSenseAmp has better readout speed
The outlook of chip area efficiency
The types of data and reference cell arrays used in the implementation mainly determine the chip area of the STT-MRAM. A common source-line array is an area-friendly approach compared with approaches in which the source line is kept separate [18] or in which the contact holes from the BLs to the access transistors are shared by the nearest cells [8]. The proposed VSenseAmp is thus an area-efficient design compared with the circuits presented in [19, 18, 24]. It can be located at the edge of a sub-array and shared among multiple BLs, which may lead to a further reduction in chip area [18]. Note that no layout of the proposed sensing scheme has been generated, so this section does not present a layout-area comparison.
Conclusion And Discussions
In this work, we investigated the main limitations of sensing schemes designed for STT-MRAM built with pMTJs that provide high TMR at low RA. We proposed a new sensing circuit, modified to address the main issues. It is important to note that the proposed sensing scheme, i.e., the VSenseAmp, is less sensitive to the resistance variations of the data and reference cells than the scheme of [8]. Moreover, the proposed VSenseAmp has a substantial advantage in sensing speed and BER over its counterparts in the literature and over [8]. Consequently, the proposed VSenseAmp with a single reference pMTJ cell is a good solution for high-speed and low-power read operations. In summary, read operations of the proposed VSenseAmp with CCCF achieve a 2.5X reduction in average power and a 13X increase in average speed compared with previous works, owing to the device structure and the proposed circuit technique. Our future work will examine specific circuit techniques to improve the accuracy of the proposed VSenseAmp.
