Abstract-In this paper, we present two multilevel spin-orbit torque magnetic random access memories (SOT-MRAMs). A single-level SOT-MRAM employs a three-terminal SOT device as a storage element with enhanced endurance, close-to-zero read disturbance, and low write energy. However, the three-terminal device requires two access transistors per cell. To improve the integration density, we propose two multilevel cells (MLCs): 1) series SOT MLC (S-MLC) and 2) parallel SOT MLC (P-MLC), both of which store two bits per memory cell. A detailed analysis of the bit-cells suggests that S-MLC is promising for applications requiring both high density and low write-error rate, whereas P-MLC is particularly suitable for high-density and low-write-energy applications. We also perform an iso-bit-cell-area comparison of our MLC designs with previously proposed MLCs based on spin-transfer torque MRAM and show a 3-16× improvement in write energy.
I. INTRODUCTION
TODAY'S dominant CMOS memories, such as SRAM and DRAM, require a constant supply of power to retain their data. This leads to static power consumption due to leakage currents, which is expected to increase with scaling [1]. One straightforward way to tackle the issue is to replace the leaky CMOS memories with nonvolatile memory technologies that consume zero standby power without losing data. Spin-transfer torque magnetic random access memory (STT-MRAM) is considered to be the leading candidate due to its potential for low power, high density, compatibility with the CMOS process, scalability, and nonvolatility [2]. The storage element of STT-MRAM is a two-terminal magnetic tunnel junction (MTJ), which consists of two ferromagnetic (FM) layers separated by a thin tunneling barrier (TB) [Fig. 1(a), top]. The magnetization of one FM layer, the free layer (FL), is reversible, whereas the magnetization of the other FM layer, the pinned layer (PL), is fixed in one direction. The MTJ exhibits two different resistances (R_P or R_AP) depending on the magnetization direction of the FL relative to that of the PL. As a result, a read operation can be performed by passing a read current through the MTJ and sensing the corresponding voltage to determine the MTJ resistance, and hence the stored information. In order to write to the FL of the MTJ, a current of larger magnitude is passed through the MTJ. The injected electrons become spin polarized in the direction of the first FM layer they enter, and when these electrons pass through the second FM layer after tunneling through the TB, they exert a torque on its magnetic moment through an effect known as STT [3]. When the exerted torque is large enough, the FL can be reversed, and the corresponding threshold current is referred to as the critical current, I_C.
Much of the research effort was focused on reducing I C for its direct impact on the key memory attributes, such as write energy, integration density, performance, and endurance. Different optimization and design techniques have been developed to reduce I C of the two-terminal MTJ device. One approach is to reduce incubation delay and increase STT efficiency by inserting additional assist layers, such as perpendicular polarizer [4] and double tunnel junctions [5] to in-plane magnetic anisotropy (IMA) MTJs. Yet another promising approach is to introduce perpendicular magnetic anisotropy (PMA) [6] in the FL of the MTJ so that the stable magnetization points out-of-plane instead of in-plane. This cancels the out-of-plane demagnetization term that increases I C without contributing to the thermal stability of the MTJ [2] , [6] , [7] . As a result of continued research effort and material optimization, 20-nm-diameter PMA MTJ with I C < 30 μA has been demonstrated recently [7] , and underscores the promising potential of STT-MRAM.
Nevertheless, the use of two-terminal MTJ as a storage element poses two significant design issues. First, the read and write (R/W) current paths are identical, which imposes a stringent tradeoff between read and write performance. Second, during write operation, especially in high-speed applications, high write voltage (current) is applied across the thin tunnel barrier of the MTJ, which may lead to reliability issues, such as time-dependent dielectric breakdown [8] .
The aforementioned limitations can be addressed with the recently proposed three-terminal spin-orbit torque (SOT) device [9], [10], where the FL of the MTJ is in contact with a nonmagnetic heavy metal (HM) with strong spin-orbit interaction (SOI) [Fig. 1(b), top]. In such FM/HM systems, when a current is injected through the HM, the strong SOI in the HM results in an antidamping torque on the FM due to the injection of a pure spin current into the FM as a result of the spin-Hall effect (SHE) [10], [11] and/or a field-like torque on the FM due to the Rashba effect [12]. Recently, the switching of both IMA [10] and PMA FMs, with [12] and without [13], [14] an external magnetic field, was demonstrated. Although the strength of each effect and its contribution to the magnetization switching are still being studied, the potential benefit of the three-terminal SOT device is substantial.
A SOT device has three main advantages over two-terminal MTJs: 1) the read and write current paths are decoupled, which allows separate optimization for read and for write; 2) the reliability issues associated with the TB of the MTJ are eliminated, since the TB is not stressed during the write operation; and 3) the write operation is fast and energy efficient due to the lower critical current and voltage [15], which we discuss in more detail in Section III. For these reasons, SOT magnetic random access memory (SOT-MRAM) has attracted a significant amount of attention.
However, the key disadvantage of SOT-MRAM is that each bit-cell requires two access transistors, which results in a larger bit-cell footprint [a possible bit-cell is shown in Fig. 1(b)]. Hence, SOT-MRAM may not be an attractive option in high-density memory applications despite all its advantages. A well-known technique to improve integration density and reduce the cost per bit is to implement a multilevel cell (MLC), which is a mature technology for other types of memory, such as FLASH. In addition, MLC design concepts for STT-MRAM have been studied and demonstrated in [16]-[18]. However, MLC design for SOT-MRAM remains to be explored. To that end, we propose two new MLC designs for SOT-MRAM, namely: 1) series MTJ MLC (S-MLC) and 2) parallel MTJ MLC (P-MLC), and study their merits and demerits.
This paper is organized as follows. Section II discusses the two designs of MLC storage elements and shows the operation of their bit-cells. The simulation framework used to investigate the MLCs is described in Section III. Using the simulation framework, we present the optimization of the MLCs for both read and write in Section IV. In Section V, we compare our proposed bit-cells with previously proposed MLCs. Finally, the conclusion is drawn in Section VI.
Before discussing the MLCs, we define the notation used in this paper. We denote the low (high) resistance state as 0 (1) and assign the MTJ_1 bit to the most significant bit (MSB) and the MTJ_2 bit to the least significant bit (LSB). Hence, the MLCs, which we describe later in this paper, have four resistance states: 1) R00; 2) R01; 3) R10; and 4) R11.
II. MULTILEVEL SOT-MRAMs

A. Multilevel Cell Based on Series MTJs (S-MLC)
The storage element of S-MLC consists of a series stack of two MTJs with nonmagnetic HM, as shown in Fig. 2(a) . The FL of MTJ 2 is in contact with HM, and MTJ 1 is stacked on top of MTJ 2 separated by a spacer. Similar structure (two MTJs without HM) has been fabricated in [16] and [17] , and four distinct resistance states were demonstrated by designing the two MTJs with different cross-sectional area ( A MTJ ) and constant tunnel barrier thickness (t TB ).
The memory state is read out in an identical manner as in the SOT-SLC. The sense-amp clamps the bit-line (BL) to a read voltage and grounds the source line (SL). It then senses the current flowing through the MLC, which depends on the series resistance of the MTJs. In order to program MTJ_1 and MTJ_2 independently, the write operation of S-MLC requires two steps (or two write cycles). To write into MTJ_1 (MSB), conventional STT is used, where the spin-polarized tunneling current through MTJ_1 is passed perpendicular to the plane of the MTJ stack [Fig. 2(b)]. To write into MTJ_2 (LSB), a charge current is passed through the HM to induce SOT to switch FL_2. The bit-cell schematic and its operation are shown in Fig. 2(c) and (d).
In order to achieve four distinguishable resistance states, MTJ 1 and MTJ 2 can be designed with: 1) different A MTJ and/or 2) different resistance-area (RA) product or t TB values. When two MTJs are designed with different A MTJ , one MTJ needs to be larger than the minimum feature size allowed for a given technology. As a result, a larger I WR may be required to program the larger MTJ leading to higher write power dissipation. On the other hand, when different t TB 's for the MTJs are used while keeping the same A MTJ , one of the MTJs needs to be designed with thicker t TB or higher RA. This leads to a higher resistance of the MTJ stack (R STACK = R MTJ 1 + R MTJ 2 ) as compared with the case when different A MTJ 's are used. The high resistance of the MTJ stack may limit the current drivability of the access transistor due to source-degeneration effect [19] and degrade the write performance. Moreover, the control of A MTJ is expected to be easier than that of t TB from the implementation aspect.
During the write operation when MTJ 1 is being written, the I WR flows not only through MTJ 1 but also through MTJ 2 . As a result, the I WR for MTJ 1 can cause MTJ 2 to be written as well. This unwanted accidental write is a write-disturb failure, and is a major failure in MLC based on multilevel STT-MRAM (STT-MLC) [18] . In S-MLC design, this write-disturb failure can be completely eliminated by simply writing to MTJ 1 first and MTJ 2 next. In this case, even if the MTJ 2 is disturbed during the programming of MTJ 1 , the correct MTJ 2 data can be written in the following cycle. Thus, S-MLC design can be particularly attractive for applications requiring not only high density but also low write-error rate (WER). However, unlike SOT-SLC, separate optimization for read and for write is not possible because the write current path for writing MTJ 1 is shared with read current path.
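The write ordering described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the cell is a dictionary of two bits, the disturb of MTJ_2 during the STT cycle is modeled as a random flip with a hypothetical probability `disturb_prob`, and the SOT cycle is assumed to leave MTJ_1 untouched because the HM current does not pass through it. None of these names come from the paper.

```python
import random

def write_s_mlc(cell, msb, lsb, disturb_prob=0.5, rng=random):
    """Behavioral sketch of the two-cycle S-MLC write (hypothetical model).

    cell: dict with keys "mtj1" (MSB) and "mtj2" (LSB), each 0 or 1.
    """
    # Cycle 1: the STT current crosses the whole stack and programs MTJ1.
    cell["mtj1"] = msb
    # The same current also crosses MTJ2 and may flip it (write disturb).
    if rng.random() < disturb_prob:
        cell["mtj2"] = 1 - cell["mtj2"]
    # Cycle 2: the SOT current flows through the HM only and programs MTJ2,
    # overwriting whatever the disturb left behind.
    cell["mtj2"] = lsb
    return cell
```

Because MTJ_2 is always rewritten last, the final state in this model does not depend on whether the disturb occurred, which is the point of the MTJ_1-first ordering.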
B. Multilevel Cell Based on Parallel MTJs (P-MLC)
Since bit-cell area is dominated by the access transistors and metal lines [20] , the two MTJs may be connected in parallel without incurring an area penalty [see the layout in Fig. 4(e) ]. In this P-MLC design, the MTJs are placed next to each other on HM separated by distance d, and the FLs of the MTJs are in contact with the HM [ Fig. 3(a) ].
The read operation is same as in SOT-SLC operation, sensing the total resistance of the parallel MTJs to determine the stored data. The write operation is also performed in the same manner as done in SOT-SLC, where the charge current is supplied through the HM only. However, since both MTJs are in the same write current path, it is necessary to engineer them to have different I C 's. The switching direction of both MTJs is determined by the direction of charge current flow through the HM.
As in S-MLC, to achieve four distinguishable resistance states, the MTJs can be designed with: 1) different A_MTJ values and/or 2) different RA or t_TB values. Designing two MTJs with different RA values may involve additional processing steps, and hence designing with different A_MTJ is preferable. In addition, in order to meet the requirement for different I_C's for MTJ_1 and MTJ_2, the following designs can be employed: 1) make the FL thicknesses of MTJ_1 and MTJ_2 different and/or 2) use different HM cross-sectional areas underneath MTJ_1 and MTJ_2 (A_HM1 and A_HM2) [Fig. 3(a)]. Although designing FLs with different thicknesses may be practically challenging due to additional processing steps, patterning the HM as shown in Fig. 3(a) does not incur any additional cost. Hence, the second approach is more desirable. When A_HM1 and A_HM2 are different, for a given write current supplied through the HM, the charge current densities (and hence the spin current densities injected into the FMs) are different. For example, I_C1 ∼ 2I_C2 when W_HM1 = 2W_HM2, because the same write current yields half the current density in the wider HM section. In addition, it should be noted that the magnetic coupling between the MTJs depends on their dimensions and the distance between them [21]. Based on the results shown in [21], when d ≥ F, the magnetostatic coupling can be neglected in the P-MLC design.
The write operation occurs in two steps. The MTJ with a larger I C is written first, followed by the MTJ with a lower I C . In case when I C2 > I C1 , MTJ 2 is written first by passing I WR greater than I C2 . Note that both MTJ 1 and MTJ 2 are written with the same data in this step. During the second write step, MTJ 1 is written with the correct data by passing write current I C1 < I WR < I C2 to avoid accidental write into MTJ 2 (write-disturb failure). The described write scheme is identical to that of STT-MLC [18] , [22] .
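The two-step scheme can be illustrated with a deterministic threshold model, sketched here for the case I_C2 > I_C1. The model is an assumption made for illustration: an MTJ is taken to switch if and only if the write current exceeds its critical current, whereas real switching is stochastic. The current values and the helper names `apply_write` and `write_p_mlc` are hypothetical.

```python
def apply_write(state, i_wr, data, i_c):
    """An MTJ takes `data` only if i_wr exceeds its critical current."""
    return tuple(data if i_wr > ic else bit for bit, ic in zip(state, i_c))

def write_p_mlc(state, msb, lsb, i_c1=50e-6, i_c2=100e-6):
    """Two-step P-MLC write, assuming I_C2 > I_C1 (values are hypothetical).

    state: (mtj1_bit, mtj2_bit); returns the state after both steps.
    """
    i_c = (i_c1, i_c2)
    # Step 1: I_WR > I_C2, so both MTJs are written with the MTJ2 (LSB) data.
    state = apply_write(state, 1.2 * i_c2, lsb, i_c)
    # Step 2: I_C1 < I_WR < I_C2, so only MTJ1 is rewritten with the MSB.
    state = apply_write(state, 0.5 * (i_c1 + i_c2), msb, i_c)
    return state
```

In this idealized model the window I_C1 < I_WR < I_C2 in the second step is what prevents the write disturb of MTJ_2; in practice the width of that window relative to the switching-current distributions sets the achievable error rate.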
III. MODELING AND SIMULATION FRAMEWORK
In order to analyze the MLCs, we use the simulation framework proposed in [23] . The RA of the MTJ is calibrated to data reported in [24] at t TB = 1.15 nm, using the nonequilibrium Green's function (NEGF) approach. The NEGF formalism uses a spin-dependent effective mass Hamiltonian for electron transport simulations and captures the resistance dependence on the applied voltage, the magnetization angle between FL and PL, and t TB . The calibrated MTJ model is then used together with a commercially available 45-nm transistor model for bit-cell simulations using SPICE, for different RA values of the MTJ.
Although our proposed design concept is valid regardless of the physical origin of the SOT, we assume SHE as the origin of the SOT in this paper [Fig. 5]. Hence, when a charge current, I_e, is applied through the HM, the amount of spin current injected into the FL is ħI_S/2q, and I_S can be calculated as

$$I_S = \theta_{SH}\,\frac{A_{MTJ}}{A_{HM}}\left[1 - \operatorname{sech}\!\left(\frac{t_{HM}}{\lambda_{sf}}\right)\right] I_e \tag{1}$$

where λ_sf is the spin-flip length, A_MTJ and A_HM are the cross-sectional areas of the MTJ and HM, respectively, t_HM is the HM thickness, and θ_SH is the spin-Hall angle of the HM. Note from (1) that when the factor multiplying I_e, θ_SH (A_MTJ/A_HM)[1 − sech(t_HM/λ_sf)], exceeds unity, the spin current I_S injected into the FL can be even greater than the charge current (I_e) supplied. This occurs because one electron that travels through the HM can repeatedly scatter at the FM/HM interface and transfer many units of angular momentum, as shown in Fig. 5 [10], [11]. As a result, the write operation in SOT-MRAMs can be very energy efficient. In addition, the dependence of the spin current injection on t_HM is captured by the 1 − sech(t_HM/λ_sf) term in (1), which accounts for the backflow of spin current from the free surface of the HM [bottom surface in Fig. 5(a)] when t_HM is comparable with λ_sf [26]. The calculated I_S is then used with the generalized Landau-Lifshitz-Gilbert equation to compute the switching times of the MTJs. Under the macrospin approximation, the magnetization dynamics follow the form given in [27], in which γ = 2.21 × 10^5 s^−1 (A/m)^−1, α is the Gilbert damping parameter, and H_eff is the effective field, which includes the anisotropy field arising from the demagnetizing field [28] of the magnet and a Langevin random thermal field. Extensive stochastic simulations were performed to obtain the critical write currents and the associated WER. The energy barrier is calculated from

$$E_B = \tfrac{1}{2}\,\mu_0 M_S^2\,(N_{yy} - N_{xx})\,V_{FL}$$

where M_S is the saturation magnetization, V_FL is the FL volume, and N_yy and N_xx are the demagnetizing factors in the in-plane hard- and easy-axis directions, respectively. Furthermore, we assume IMA MTJs in all bit-cell simulations for simplicity. However, the design concepts as well as the analysis we perform in this paper remain qualitatively valid for bit-cells designed with PMA MTJs. The simulation parameters are listed in Table I.
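As a rough numerical illustration of the spin-Hall injection relation, the sketch below evaluates the ratio I_S/I_e assuming the form I_S = θ_SH (A_MTJ/A_HM)[1 − sech(t_HM/λ_sf)] I_e, with the backflow factor set to 1 for an ideal spin sink. The function name, the `perfect_ssl` flag, and all numerical values are hypothetical and are not taken from Table I.

```python
import math

def spin_current_ratio(theta_sh, a_mtj, a_hm, t_hm, lambda_sf, perfect_ssl=False):
    """Return I_S / I_e for spin-Hall injection (illustrative sketch).

    theta_sh  : spin-Hall angle of the HM (dimensionless)
    a_mtj     : MTJ cross-sectional area (m^2)
    a_hm      : HM cross-sectional area (m^2)
    t_hm      : HM thickness (m)
    lambda_sf : spin-flip length (m)
    perfect_ssl : if True, neglect spin backflow from the free HM surface
    """
    if perfect_ssl:
        backflow = 1.0
    else:
        # sech(x) = 1 / cosh(x); this factor shrinks as t_hm becomes
        # small compared with lambda_sf.
        backflow = 1.0 - 1.0 / math.cosh(t_hm / lambda_sf)
    return theta_sh * (a_mtj / a_hm) * backflow

# Hypothetical geometry: area ratio of 20, t_HM = 2 nm, lambda_sf = 1.4 nm.
ratio = spin_current_ratio(0.3, 20.0e-16, 1.0e-16, 2e-9, 1.4e-9)
ratio_ssl = spin_current_ratio(0.3, 20.0e-16, 1.0e-16, 2e-9, 1.4e-9, perfect_ssl=True)
```

With these assumed numbers both ratios exceed 1, consistent with the statement that the injected spin current can be larger than the supplied charge current when the area ratio is large enough.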
Finally, since the actual physical layout dimensions and area of the bit-cells can vary depending on the technology in use, λ-based design rule [29] [λ is half the minimum feature size (F)] is employed for the estimation of the layout area in this paper. Moreover, as the design rules for memory are typically more relaxed than that for the logic due to their regularity, we assumed the relaxed constraints described in [20] . The bit-cell layouts for single-level STT-MRAM (STT-SLC) and STT-MLC, SOT-SLC, and the proposed MLCs are shown in Fig. 4 .
IV. DESIGN AND OPTIMIZATION OF MLC SOT-MRAMs
In order to achieve the full potential of the proposed MLC designs, design space exploration and optimization are critical. In this section, we study the impact of three key design parameters, namely, the ratio of cross-sectional area of MTJ 2 to that of MTJ 1 (m = A MTJ 2 /A MTJ 1 ), RA of the MTJs, and HM geometry, on the bit-cell characteristics, such as read margin (RM) and write margin (WM). The impact of process variations on the design of STT-MRAM MLC was studied in [18] , [22] , and [30] and we expect qualitatively identical trend for the proposed MLC designs. As a result, we do not consider process variations in this paper. It is, however, important to include sufficient design margins to account for process variations in the bit-cell design process. In this section, we also discuss possible approaches to improve the efficiency of the spin-current injection. Here, we define RM as the minimum resistance separation between the states [i.e., RM = min(R11-R10, R10-R01, R01-R00)], and WM as the difference between the I WR and the required I C .
A. S-MLC Optimization
S-MLC has separate current paths for writing the MSB (MTJ_1) and the LSB (MTJ_2), and as discussed in Section II, this allows the design of an MLC with low WER. Although low WER for both bits can be achieved by supplying the required I_WR for each of the current paths, it should be noted that the read and write-MSB current paths are now shared [path from T1 to T3 in Fig. 2(b)]. Consequently, the choice of m and RA affects not only the read performance but also the write performance, as in standard STT-MRAM cells. Fig. 6(a) shows the RM of S-MLC for different m and RA. It is evident that higher RA yields higher RM, and hence is desirable for robust read operation. Moreover, the optimum m occurs when the resistance separations between adjacent states are equal [Fig. 6(b)], which happens at m ∼ 2, consistent with previous studies in [16] and [18]. Also shown in Fig. 6(a) are the WM = 0 points for different m and RA, which are superimposed on the same RM plot as a white dotted line. Since the white line indicates where WM = 0, the region to the left of the line represents the feasible design space. In this feasible region, the resistance of the MTJ stack is small enough for the access transistor to provide the required I_WR.
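The location of the optimum near m = 2 can be checked with a simplified resistance model. The sketch below assumes both MTJs share the same RA, a bias-independent TMR, and no transistor or HM resistance, so it only illustrates the trend and does not reproduce the NEGF-based results of Fig. 6. The function names and numerical values are hypothetical.

```python
def series_states(r1_p, m, tmr):
    """Four S-MLC stack resistances for area ratio m = A_MTJ2 / A_MTJ1.

    r1_p : parallel-state resistance of MTJ1 (ohms)
    tmr  : (R_AP - R_P) / R_P, assumed bias independent
    """
    r2_p = r1_p / m  # same RA, so resistance scales inversely with area
    states = {}
    for msb in (0, 1):
        for lsb in (0, 1):
            r1 = r1_p * (1 + tmr * msb)  # MTJ1 holds the MSB
            r2 = r2_p * (1 + tmr * lsb)  # MTJ2 holds the LSB
            states[f"R{msb}{lsb}"] = r1 + r2
    return states

def read_margin(states):
    """Minimum separation between adjacent resistance levels."""
    levels = sorted(states.values())
    return min(b - a for a, b in zip(levels, levels[1:]))

# Sweep m with hypothetical values R1_P = 1 kOhm and TMR = 100%.
margins = {m: read_margin(series_states(1000.0, m, 1.0)) for m in (1.0, 1.5, 2.0, 3.0)}
```

Under these assumptions the three separations are equal at m = 2, the margin falls off on either side, and it collapses to zero at m = 1, where R01 and R10 coincide.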
During the second write step to program the LSB (MTJ_2), the write current flows through the HM only, as in SOT-SLC. The amount of spin current injected into the FM is described by (1), from which we see that a higher θ_SH and a shorter λ_sf improve the spin current injection efficiency. Yet another approach to enhance the spin injection efficiency is to add a spin-sink layer (SSL) at the free HM surface [31], as shown in Fig. 6(c). The SSL reduces the spin accumulation at the free HM surface and hence reduces the backflow of spin current. Thus, when a perfect SSL is assumed, (1) reduces to I_S = (W_MTJ/t_HM) θ_SH I_e [Fig. 5]. In this case, the efficiency of spin current injection into the FM can be improved by reducing the thickness of the HM.
B. P-MLC Optimization
The biggest merit of P-MLC is that it has all the benefits of SOT-SLC. The read and write current paths are decoupled, so separate optimization for read and write can be performed. As shown in Fig. 7(a), m and RA affect only the RM, and the design space is unrestricted, unlike in S-MLC. Note that, compared with S-MLC, P-MLC requires MTJs with higher RA to achieve the same RM. This necessitates a thicker t_TB for higher RM. The use of a thicker t_TB is beneficial in two ways: 1) a thicker TB is less sensitive to variations in t_TB and 2) the write current that leaks through the MTJs [Fig. 7(d)] during the write operation can be reduced. The optimum m also varies with RA because of the resistance of the HM, R_HM. When R_HM is neglected, the optimum m is 1.67 for all RA. As shown in Fig. 7(a) and (c), for RA < 6 Ω·μm², the HM resistance (R_HM1/2 + R_HM2/2) in Fig. 7(c) is large enough to offset the resistance of MTJ_1 from that of MTJ_2. As a result, the two MTJs can have the same A_MTJ and still provide four distinct resistance states. When RA > 6 Ω·μm², the HM resistance is small compared with the MTJ resistance, and hence the optimum m gradually increases, reaching 1.5 at RA = 30 Ω·μm².
As discussed in Section II-B, P-MLC requires different HM widths underneath each MTJ to achieve different I_C's. As a result, one of the HM sections is designed to be wider than the other, and the difference between the two widths directly impacts the write-disturb error rate. However, increasing one of the widths to ensure a certain write-disturb error rate may lead to an increase in the bit-cell area [Fig. 4(e)]. This may be addressed by utilizing the SSL discussed in Section IV-A. When an SSL is applied to only one of the MTJs, the difference between I_C1 and I_C2 increases, and hence a lower write-disturb failure rate can be achieved without incurring area overhead.
From the equivalent circuit and read current path during a read operation in Fig. 7(c), it is seen that the total resistance being sensed is R_T = R_HM2/2 + [R_MTJ2 || (R_MTJ1 + R_HM1/2 + R_HM2/2)]. Note that the first term (R_HM2/2) degrades the distinguishability of the resistance changes caused by the MTJ_1 and MTJ_2 states. Therefore, designing HM_2 wider than HM_1 makes R_HM2 smaller than R_HM1 and minimizes the overall bit-cell TMR degradation.
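The read-path expression R_T = R_HM2/2 + [R_MTJ2 || (R_MTJ1 + R_HM1/2 + R_HM2/2)] can be evaluated for the four stored states as follows. This is a sketch with a bias-independent TMR and hypothetical resistance values; it is meant to show how a nonzero HM resistance separates R01 from R10 even when the two MTJs have equal area, not to reproduce Fig. 7.

```python
def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)

def p_mlc_read_resistance(r_mtj1, r_mtj2, r_hm1, r_hm2):
    """Sensed resistance R_T of the P-MLC read path."""
    return r_hm2 / 2 + parallel(r_mtj2, r_mtj1 + r_hm1 / 2 + r_hm2 / 2)

def p_mlc_states(r1_p, m, tmr, r_hm1, r_hm2):
    """Four P-MLC read resistances; m = A_MTJ2 / A_MTJ1, same RA assumed."""
    r2_p = r1_p / m
    states = {}
    for msb in (0, 1):
        for lsb in (0, 1):
            r1 = r1_p * (1 + tmr * msb)  # MTJ1 holds the MSB
            r2 = r2_p * (1 + tmr * lsb)  # MTJ2 holds the LSB
            states[f"R{msb}{lsb}"] = p_mlc_read_resistance(r1, r2, r_hm1, r_hm2)
    return states

# Equal-area MTJs (m = 1), hypothetical R_P = 2 kOhm and TMR = 100%.
no_hm = p_mlc_states(2000.0, 1.0, 1.0, 0.0, 0.0)
with_hm = p_mlc_states(2000.0, 1.0, 1.0, 200.0, 200.0)
```

With zero HM resistance and m = 1 the states R01 and R10 are identical, whereas a finite HM resistance in the MTJ_1 branch makes all four levels distinct, in line with the discussion of the low-RA regime.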
V. COMPARISON WITH STT-MLC
In this section, we compare our proposed S-MLC and P-MLC with STT-MLC. The comparison is carried out at iso-bit-cell-area, using the minimum bit-cell area of the S-MLC as a baseline and increasing the STT-MLC bit-cell area to match it. We observed that, for STT-MLC, a two-fingered layout allows for a larger access transistor width. Hence, we used two-fingered bit-cell layouts [Fig. 4(b)] in our comparison. Moreover, the RAs for S-MLC and P-MLC are adjusted so that their RMs approximately match that of STT-MLC. The simulation parameters are listed in Tables I and II.
Based on the simulation results, S-MLC offers two distinct advantages over STT-MLC apart from the design feasibility of an MLC with extremely low WER. First, S-MLC shows a 67% reduction in average write energy (E_W) for two-bit programming. In STT-MLC, a significant portion of the write power is dissipated in writing the larger LSB magnet (MTJ_2). In S-MLC, however, energy-efficient SOT is used to program MTJ_2, which therefore dissipates much lower write power than in STT-MLC. Second, in S-MLC, the write voltage is applied across the MTJ stack only during the first write cycle, whereas in STT-MLC, the MTJ is stressed during both write cycles when performing 01 and 10 operations. Consequently, the TB of the MTJ in S-MLC is stressed less frequently, which alleviates endurance issues in the MTJs. Moreover, as can be seen from Fig. 8, S-MLC is able to achieve low WER, whereas in STT-MLC, the lowest WER that can be achieved is 1.84 × 10^−4 when an I_WR of 175 μA is applied.
Fig. 8. Failure probability of (a) STT-MLC, (b) S-MLC, and (c) P-MLC during write step-1 and step-2. Probability of switching indicates write-disturb failure (accidental write) during write step-2.
The biggest merit of P-MLC is that it has all the advantages of SOT-MRAM, such as decoupled R/W current paths and energy-efficient write operation. As a result, higher RM can be achieved by simply increasing the RA of the MTJ. Moreover, P-MLC dissipates <10% the average E W of STT-MLC. Fig. 9 shows the E W for writing 00 and 01 for the bit-cells under comparison. Note that E W for P-MLC does not depend on the initial MTJ state (0.4 pJ regardless of MTJ initial state), in contrast to STT-MLC and S-MLC. This is because the load of the access transistor during write is different in each case. The load is R HM in P-MLC, R MTJ in STT-MLC, and R MTJ in S-MLC only during the first write cycle. Consequently, since the R MTJ can vary depending on the stored state, the write voltages in S-MLC and STT-MLC need to be chosen such that sufficient I WR is supplied in the worst case MTJ states. This leads to excessive I WR being supplied when the MTJs are not in the worst case MTJ states. Hence, excessive write power is dissipated and leads to a large difference in E W as Fig. 9 shows.
The main drawback of P-MLC arises from the shared current path for MTJ 1 and MTJ 2 during write operation. In order to reduce the WER of the P-MLC to the desired level, the difference between the HM widths (W HM1 and W HM2 ) needs to be large and the addition of SSL may be necessary. Even with those designs and optimizations, however, P-MLC may be unable to achieve the extremely low WER achieved in S-MLC.
VI. CONCLUSION
We presented two MLC designs to improve the integration density of SOT-MRAM. Our analysis reveals that our proposed S-MLC not only improves the integration density by 2× but is also write-energy efficient, dissipating 67% lower E W than STT-MLC. Moreover, S-MLC employs two different write mechanisms, namely STT and SOT, to write each of the two bits. Hence, S-MLC is able to achieve extremely low WER unlike STT-MLC. In addition, the proposed P-MLC uses two parallel MTJs per cell without incurring bit-cell area overhead, thereby increasing integration density by 100%. P-MLC consumes <10% the average E W of STT-MLC by utilizing energy-efficient SOT to perform write operations. Moreover, separate optimization for read and write can be done in P-MLC due to decoupled read and write current paths. Since both of our proposed MLC designs alleviate reliability concerns associated with the TB of the MTJ, we believe they are promising candidates for future nonvolatile MLC memories.
