Spin-Transfer Torque MRAMs are attractive due to their non-volatility, high density, and zero leakage. However, STT-MRAMs suffer from poor reliability due to shared read and write paths. Additionally, conflicting requirements for data retention and writeability (both related to the energy barrier height of the storage device) make design more challenging. Furthermore, the energy barrier height depends on the geometry of the storage device, so any variations in that geometry lead to variations in the energy barrier height. In order to address the poor reliability of STT-MRAMs, the use of Error Correcting Codes (ECC) has been proposed. Unlike in traditional CMOS memory technologies, ECC is expected to correct both soft and hard errors in STT-MRAMs. To achieve acceptable yield with low write power, stronger ECC is required, resulting in an increased number of encoded bits and degraded memory capacity. In this article, we propose Failure-aware ECC (FaECC), which masks permanent faults while maintaining the same correction capability for soft errors without an increased number of encoded bits. Furthermore, we investigate the impact of process variations on the run-time reliability of STT-MRAMs. In order to analyze the effectiveness of our methodology, we developed a cross-layer simulation framework that consists of device, circuit, and array level analysis of STT-MRAM memory arrays. Our results show that using FaECC relaxes the requirements on the energy barrier height, which reduces the write energy and results in smaller access transistor size and memory array area. Several research efforts have been devoted to addressing the poor reliability of STT-MRAMs at the device, circuit, and architecture levels [Pajouhi et al. 2015; Kwon et al. 2015; Kang et al. 2015; Wang et al. 2008; Apalkov et al. 2006].
The single-ended current sensing scheme used in STT-MRAM results in poor read reliability because process variations in the electrical characteristics of the storage device (i.e., Tunneling Magneto-Resistance (TMR) and Resistance-Area (RA) product) make it difficult to distinguish between "1"s and "0"s reliably. On the other hand, since STT switching is a stochastic process [Kim et al. 2012; Fong et al. 2012], increased write currents are necessary to ensure reliable write operation. Reducing the write current improves energy efficiency but increases the probability of write errors and results in degraded yield. In order to reduce write errors, the Energy Barrier (E B ) height of the Magnetic Tunnel Junction (MTJ) can be reduced [Li et al. 2008; Augustine et al. 2010]. However, the retention time of the MTJ depends on the energy barrier height [Naemi et al. 2013], which needs to be high enough to ensure a sufficiently long retention time.
INTRODUCTION
Spin-Transfer Torque (STT) memories are considered to be promising for future on-chip memory technology due to their favorable characteristics such as high density, non-volatility and near-zero leakage [Slonczewski 1996; Berger 1996; Katine et al. 2000; Li et al. 2008]. Nevertheless, they suffer from poor reliability that manifests in the form of low manufacturing yield, as well as run-time errors. Furthermore, ensuring high reliability through design leads to increased read and write energy and reduced
This research was funded in part by the Center for Spintronics: Materials, Interfaces and Architecture, a StarNet Center funded by DARPA and MARCO, by Semiconductor Research Corporation, and by National Science Foundation. Authors' addresses: Z. Pajouhi, X. Fong, A. Raghunathan, and K. Roy, 465 Northwestern Ave, West Lafayette, IN, USA, 47906; emails: {zpajouhi, xfong, raghunathan, kaushik}@purdue.edu.
-We develop a cross-layer simulation framework to evaluate the impact of process variations on the yield and run-time reliability of an STT-MRAM array. We utilize the simulation framework to analyze the impact of ECC on yield and run-time reliability. We show that using ECC to improve yield has a negative impact on the ability of ECC to improve run-time reliability.
-We propose a Failure-aware ECC (FaECC) to mask permanent faults without compromising the correction capability for transient faults. With such an approach, Single Error Correction and Double Error Detection (SECDED) is employed to correct transient faults and mask permanent faults simultaneously. Furthermore, this approach lowers the energy barrier height that the memory array requires to meet the target run-time reliability. This decrease results in reduced access transistor size and reduced read/write power.
The rest of the article is organized as follows: In Section 2, we explain STT-MRAM preliminaries and describe different bit-cell failure mechanisms. In Section 3, we discuss the run-time reliability of STT-MRAM array. In Section 4, we propose FaECC, a Failure-aware ECC scheme for STT-MRAMs that exploits an understanding of failure mechanisms to improve yield without sacrificing run-time reliability. In Section 5, we present the cross-layer simulation framework used to evaluate the proposed FaECC scheme. In Section 6, we discuss the results obtained from our simulation framework. Section 7 provides the concluding remarks.
STT-MRAM PRELIMINARIES
An STT-MRAM bit-cell consists of a storage device and an access transistor, as shown in Figure 1. The Magnetic Tunnel Junction (MTJ) is the storage device of STT-MRAM, and the access transistor is used to access the MTJ. The MTJ consists of two ferromagnetic layers (a pinned layer and a free layer) sandwiching a tunneling oxide (typically MgO). The pinned layer has a fixed magnetization orientation, while the magnetization of the free layer can be changed. The relative magnetic orientation of the free and the pinned layers determines the data stored in the MTJ. If the magnetization orientation of the free layer is the same as that of the pinned layer, they are said to be parallel; if they are in opposite directions, they are said to be anti-parallel (we assume logic "0" is represented by the parallel orientation and "1" by anti-parallel). The magnetization orientation can be aligned with the surface of the MTJ, which is called in-plane magnetic anisotropy (IMA), or it can be perpendicular to the surface, which is called perpendicular magnetic anisotropy (PMA).
In order to write into the bit-cell, the word line is activated and a bias voltage is applied between the bit line and the source line to pass current through the MTJ. The direction of current flow defines the data value that is written into the bit-cell. The amount of write current needed, called the critical switching current, depends on the desired write time. Achieving acceptable write latencies typically requires high switching current, negatively impacting energy efficiency and reliability [Fong et al. 2012] .
In order to read the bit-cell, the word-line is enabled and a bias voltage is applied between the bit-line and the source line, causing a current to pass through the MTJ. The current is then sensed to evaluate the resistance of the MTJ and to distinguish between a logic "1" and a logic "0". The read current should be substantially lower than the critical switching current of the MTJ to avoid accidentally writing into the bit-cell during read operations.
There are four major failure mechanisms in STT-MRAMs: read decision failures, read disturb failures, write failures, and retention failures [Fong et al. 2012] .
Read decision failures occur due to an inability to correctly detect the value stored in the MTJ. As explained earlier, a voltage (V read ) is applied between the source line and the bit line and the data is determined by comparing the bit-cell current with a reference current (I ref ) . Ideally, the bit-cells with different stored values have different currents passing through them (e.g., I P for the parallel configuration and I AP for antiparallel) and the sensing margin is maximized by setting the reference current to the average of the two currents. However, due to process variations (e.g., variation in the RA product), the current passing through each bit-cell may differ from its nominal value. Therefore, I ref should be chosen carefully to minimize decision failures. Once I ref is defined, the sense amplifier is adjusted accordingly. Note that read-decision failures may be considered to be stuck-at-fault failures [Kang et al. 2015; Su and Huang 2004] .
Disturb failures occur when the data stored in the bit-cell is unintentionally overwritten during a read operation. This is due to increased current flowing through the MTJ during the read operation. Note that, since the direction of the read current matches that of only one of the write currents, this type of failure occurs only for one value of data (either "0" erroneously changing to "1" or vice-versa). Disturb failures occur due to increased current drivability of the access transistor or decreased critical current of the MTJ. This decrease may be the result of process variations or thermal effects.
Write failures occur due to unsuccessful MTJ state change during write operation. They occur due to decreased current drivability of the access transistor or increased MTJ critical current due to process variations or thermal effects. Standard connection [Nebashi et al. 2009; Lin et al. 2009 ] of the bit-cell is considered in this work to mitigate write failures.
Finally, retention failures occur due to thermal effects. If thermal effects are large enough to flip the magnetization of the free layer, the MTJ changes its state. Retention failures are characterized by the retention lifetime of the nanomagnet. The probability of retention failure in a single memory bit-cell at time t is given by Naemi et al. [2013] :
where f f (t) is the probability density function of failure, P FAIL_THERMAL is the probability of failure at time t, and F f (t) is the cumulative distribution function. As observed in Equation (1), the probability of retention failures depends on the time elapsed after the write event. Furthermore, the lifetime of the bit-cell depends on the physical characteristics of the free layer. Note that, due to process variations (leading to variations in the barrier height), some bit-cells are more vulnerable to retention failures than others.
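As a minimal sketch of how the retention failure probability depends on elapsed time and barrier height, the snippet below assumes the standard exponential forms, P FAIL_THERMAL(t) = 1 - exp(-t / t life) with t life = tau0 * exp(E B / K B T). These forms, the attempt period tau0 = 1 ns, and all function names are illustrative assumptions and are not taken from this article.

```python
import math

TAU0_NS = 1.0  # assumed attempt period (1 ns); not specified in the text


def lifetime_ns(e_bn):
    """Assumed Arrhenius form: t_life = tau0 * exp(E_B / (K_B * T))."""
    return TAU0_NS * math.exp(e_bn)


def retention_failure_probability(t_ns, e_bn):
    """Assumed exponential form: P(t) = 1 - exp(-t / t_life)."""
    return -math.expm1(-t_ns / lifetime_ns(e_bn))


TEN_YEARS_NS = 10 * 365 * 24 * 3600 * 1e9

# A bit-cell whose barrier is 2 K_B*T below nominal fails roughly e^2 times as often.
p_nominal = retention_failure_probability(TEN_YEARS_NS, 50.0)
p_weak = retention_failure_probability(TEN_YEARS_NS, 48.0)
```

Under these assumptions, a small process-induced drop in barrier height translates into an exponentially larger failure probability, which is why the weakest bit-cells dominate retention behavior.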
In the next section, we will investigate the impact of process variations on retention failures and its effect on run-time reliability.
RUN-TIME RELIABILITY ANALYSIS

Thermal Stability and Retention Time
Although STT-MRAMs are referred to as non-volatile memories, their ability to retain the stored data is limited in practice. As explained in the previous section, the retention failure probability can be expressed in terms of the time elapsed from the time data was stored in the memory and the lifetime of the free layer. The lifetime of the free layer can in turn be expressed as [Augustine et al. 2010] :
where E B is the Energy Barrier height, K B is the Boltzmann constant, and T is the temperature in Kelvin. E B depends on the geometric dimensions of the free layer. Figure 2 illustrates the physical dimensions of the free layer for different magnetic anisotropy configurations. E B for an In-plane Magnetic Anisotropy (IMA) free layer can be expressed as [Apalkov et al. 2010 ]:
where M S is the saturation magnetization, H K is the effective field anisotropy, and V is the volume of the free layer. Furthermore, w, AR, and t are the width, aspect ratio and the thickness of the free layer, respectively. As expressed in Equation (3), E B depends on the geometry of the free layer and is therefore sensitive to process variations. For a free layer with Perpendicular Magnetic Anisotropy, E B can be expressed as [Augustine et al. 2010 ]:
where K u2 is the uniaxial anisotropy, V is the volume of the free layer, and H C k is the effective field anisotropy. As observed, E B also depends on the geometry of the free layer.
In order to ensure reliable operation of STT-MRAM, E B should be adjusted such that the requirements of run-time reliability are met. A typical memory reliability specification can be expressed in terms of FIT or failures in time, where 1 FIT is one failure per billion (devices × hours):
where λ is the failure rate per hour, which can be expressed in terms of the Mean Time To Failure (MTTF):
where f f is the probability density function of time to failure, if and only if this integral exists (as an improper integral).
Therefore, for an MTJ device, we have:
If only a single device is considered, 1 FIT translates into 0.00876% failure over 10 years, and the required E B to meet the requirement of 1 FIT is about 50 K B T. However, for larger memory arrays, 1 FIT should be considered for the entire memory array and not just a single MTJ device. For this purpose, let us consider a memory array with n bit-cells. Under such conditions, the probability of correctness for the array can be defined as:
in which F farray is the cumulative distribution function of the time to failure of the array. Therefore, the probability density function can be written as:
The MTTF for the memory array can be defined as follows:
In order to obtain the required E B , the required MTTF should be obtained from Equation (6). Next, the lifetime should be defined to meet the required MTTF array by solving Equation (10) for the desired array size. Once the E B is derived, the free-layer physical characteristics can be derived using Equation (3) if the free layer is IMA, or Equation (4) if it is PMA. In order to analyze run-time reliability without being confined to a particular set of MTJ parameters, the thermal stability factor is defined as follows:
In the following sections, we will derive the free-layer physical characteristics based on the operating temperature and characteristics of the MTJ. Figure 3 shows the required E BN for larger memory arrays (ignoring parameter variations) for 1 FIT. As observed, the required E BN increases with increasing memory size. Since the reliability metric of 1 FIT is kept constant, as the number of bit-cells in the memory array increases, the tolerable probability of failure for each bit-cell decreases. In order to meet this decreased probability, the E B should be increased.
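The scaling of the required E BN with array size can be sketched numerically. The snippet below is only an illustration under stated assumptions: each bit-cell fails independently with an exponential lifetime t life = tau0 * exp(E BN), tau0 is taken as 1 ns, E BN is read as E B / (K B T), and 1 FIT is converted to an MTTF of 1e9 hours. The function names are hypothetical.

```python
import math

TAU0_S = 1e-9  # assumed attempt period of 1 ns


def required_e_bn(n_bits, fit=1.0):
    """Thermal stability factor needed for an n-bit array without ECC."""
    mttf_array_s = 3600.0 * 1e9 / fit   # 1 FIT = 1 failure per 1e9 device-hours
    t_life_s = n_bits * mttf_array_s    # first failure among n independent exponential cells
    return math.log(t_life_s / TAU0_S)


def failure_fraction(hours, fit=1.0):
    """Probability that a device at the given FIT level fails within `hours`."""
    return -math.expm1(-hours * fit * 1e-9)


single = required_e_bn(1)                  # about 49.6, i.e. roughly 50 K_B*T
array_4mb = required_e_bn(4 * 2**20 * 8)   # about 67 for 2^25 bit-cells
ten_year = failure_fraction(10 * 8760)     # about 8.76e-5, i.e. 0.00876%
```

With these assumptions the single-device numbers quoted above are reproduced, and each doubling of the array size adds ln 2 (about 0.69) to the required E BN.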
The Effect of ECC on Run-Time Reliability
ECC is one of the most effective methods to improve the reliability of memory arrays. Among different ECC codes, Bose-Chaudhuri-Hocquenghem (BCH) codes are commonly used in memory arrays [Michelson and Levesque 1985]. A BCH code changes a k-bit data word into an n-bit codeword by adding (n-k) bits to the word. The choice of n depends on the desired correction capability of ECC. The choice of the word length at which ECC should be applied (k) and the extra bits (n-k) impacts the correction capability as well as the overheads incurred. The probability of correctness for an n-bit word with m-bit correction capability and a bit error probability of P b can be expressed as:
Furthermore, if the memory array has s words, every word has to be encoded using the ECC scheme selected for the memory array. Then, the probability of correctness of the entire memory array would be:
In order to obtain the required E BN for an array (with ECC), it is required to substitute the probability of failure of a single MTJ obtained in Equation (1) into Equation (12) as follows:
Next, the resultant P word is inserted into Equation (13). At the next step, the probability density function is derived from the cumulative density function as follows:
Finally, the probability density function of time to failure is derived and inserted into Equation (6) to obtain the MTTF array :
For example, let us assume that the desired size of the memory is 4MB and we apply ECC to every 128 bits in the array. Therefore, k = 128 and s = 4MB/128 bits. For an ECC with SECDED capability, the encoding should be performed over GF(2^8), where GF(2^deg) is the Galois field with degree deg. The number of additional bits is 8 for the Hamming code (which is considered the simplest BCH code), and a single parity bit is added to detect an additional error, resulting in a total of 9 bits. Therefore, m = 1 and n = 9 + 128 = 137. These values should be inserted in Equations (15) and (16) to obtain the MTTF array . However, for a given MTTF array , this process should be inverted to obtain the required t life . Once the t life is derived, it can be inserted in Equations (2) and (11) to obtain E BN and E B . Figure 4 illustrates the required E BN to meet the reliability level of 1 FIT for a 4MB memory array with the word size of 128 bits. As observed, the required E BN decreases with an increase in the correction capability of ECC.
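The procedure in this example can be sketched numerically. The snippet below is a hedged illustration, not the article's framework: it assumes the exponential bit failure model P b (t) = 1 - exp(-t / t life), the binomial form of P word for up to m correctable errors, P array = P word ^ s, the identity MTTF = integral of the survival probability over time, tau0 = 1 ns, and 1 FIT taken as an MTTF of 1e9 hours. All names are hypothetical, and the resulting numbers are not taken from Figure 4.

```python
import math

TAU0_S = 1e-9                 # assumed attempt period (1 ns)
MTTF_1FIT_S = 3600.0 * 1e9    # 1 FIT taken as an MTTF of 1e9 hours


def word_ok(x, n, m):
    """P_word when each bit has failed with probability 1 - exp(-x), x = t / t_life."""
    p = -math.expm1(-x)
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(m + 1))


def array_ok(x, n, m, s):
    """P_array = P_word ** s, evaluated in the log domain."""
    pw = word_ok(x, n, m)
    return math.exp(s * math.log(pw)) if pw > 0.0 else 0.0


def normalized_mttf(n, m, s, steps=20000):
    """MTTF_array / t_life as the integral of P_array(x) dx (Simpson's rule)."""
    x_max = 1e-12
    while array_ok(x_max, n, m, s) > 1e-12:   # find where survival is negligible
        x_max *= 2.0
    h = x_max / steps
    total = array_ok(0.0, n, m, s) + array_ok(x_max, n, m, s)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * array_ok(i * h, n, m, s)
    return total * h / 3.0


def required_e_bn(n, m, s):
    """Invert MTTF_array for t_life, then convert to E_BN = ln(t_life / tau0)."""
    t_life_s = MTTF_1FIT_S / normalized_mttf(n, m, s)
    return math.log(t_life_s / TAU0_S)


s = 4 * 2**20 * 8 // 128                  # 262144 words of k = 128 data bits
e_bn_no_ecc = required_e_bn(128, 0, s)    # m = 0: no correction, about 67
e_bn_secded = required_e_bn(137, 1, s)    # m = 1, n = 128 + 9, about 60.6
```

Because the integral is expressed in units of t life, the inversion for a target MTTF array reduces to a single division. Under these assumptions, SECDED lowers the required E BN by several K B T relative to the unprotected array, which is qualitatively the trend described above.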
Further, let us consider that ECC is employed to correct read and write failures as well as retention failures. Under such conditions, the words that happen to contain such failures (read or write) and retention failures cannot be corrected with ECC. Moreover, if there are a large number of such words, the E B cannot be reduced as suggested above.
In the next subsection, we will analyze the impact of such conditions on the efficacy of ECC for retention failures.
The Impact of Read and Write Failures on the Efficacy of ECC
Let us consider a scenario where ECC is used for enhancing yield as well as run-time reliability. The presence of hard failures in a data word degrades its capability to correct retention, read, and write failures. In order to determine the impact of degraded ECC capability on run-time reliability, let us assume that the number of words with j nonretention failures is n j . Then, the probability of correctness for the whole memory can be defined as:
where m is the maximum number of correctable errors and P wordj can be obtained from the following equation:
where n is the total number of bits in a word. For example, let us consider a 4MB array in which ECC with SECDED capability is used. Furthermore, let us assume that ECC is applied to a word length of 128 bits, which implies that the number of words in the array is s = 4MB/16B. Additionally, let us assume that there are b words with a single read or write failure and we have used ECC to correct these failures as well. If a retention failure also occurs in one of the b words mentioned above, in one of the bit-cells without read or write failures, ECC with SECDED capability will be unable to correct it because there is already a read or write failure in the same word. Note that in this example, m = 1, n 0 = s - b, and n 1 = b. In order to calculate the MTTF array , the probability of failure obtained in Equation (1) should be substituted into Equation (18). Following these steps, the probability of correctness for the memory array can be expressed as follows:
Next, the probability density function is derived similar to Equation (15). 
Finally, the required E BN and E B can be derived by substituting f corr into Equation (6).
As an example, let us consider the same 4MB memory array with 128-bit ECC word length. Furthermore, let us assume that the probability of having a read or write failure in each bit-cell is 1e-5 and that these failures are uniformly distributed among all the bit-cells. Based on this uniform distribution, we obtain the mean number of words with j read or write failures (n j ) and insert the obtained n j into Equation (19). The probability density function is then derived similar to Equation (20). Figure 5 illustrates the percentage increase in the required E BN . As observed, the increase in the required E BN decreases with an increase in the correction capability of ECC. In order to avoid increasing E BN , either an ECC scheme with higher correction capability is required or, if possible, other yield enhancement techniques, such as redundant rows or columns, can be introduced. However, due to the poor reliability of STT-MRAM bit-cells, redundancy is not an effective method to enhance yield; it incurs high overheads [Kwon et al. 2015]. Therefore, a recent trend towards ensuring reliability is to utilize ECC for yield enhancement. However, in order to maintain high yield and run-time reliability simultaneously, there is a need to use ECC with higher correction capability, requiring more storage area. We propose a Failure-aware ECC (FaECC) scheme, which enhances the correction capability without adding a large number of encoded bits to the array (Figure 6). In FaECC, we identify the read-decision failures and use our proposed technique (described in the following section) to correct these failures. However, due to the stochastic nature of the write in STT-MRAMs, this method is not used to mitigate write failures.
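The bookkeeping in this example can be sketched as follows. This is a hedged illustration: it assumes that hard failures are independent across bit-cells (so the mean n j is binomial), that a word with j hard failures can tolerate at most m - j retention failures among its remaining n - j bits, and that words with more than m hard failures are excluded as yield loss. These forms and all names are assumptions, since the corresponding equations are not reproduced here.

```python
import math


def expected_words_with_hard_failures(s, n, p_hard, j):
    """Mean number of n-bit words that contain exactly j hard (read/write) failures."""
    return s * math.comb(n, j) * p_hard**j * (1.0 - p_hard)**(n - j)


def word_ok_with_hard(p_b, n, m, j):
    """Assumed P_word,j: j corrections are consumed by hard failures, so at most
    m - j retention failures are tolerated among the other n - j bits."""
    if j > m:
        return 0.0
    return sum(math.comb(n - j, i) * p_b**i * (1.0 - p_b)**(n - j - i)
               for i in range(m - j + 1))


def array_ok_with_hard(p_b, s, n, m, p_hard):
    """P_array as the product of P_word,j ** n_j for j = 0..m (log domain)."""
    log_p = 0.0
    for j in range(m + 1):
        n_j = expected_words_with_hard_failures(s, n, p_hard, j)
        log_p += n_j * math.log(word_ok_with_hard(p_b, n, m, j))
    return math.exp(log_p)


s, n, m = 262144, 137, 1
n0 = expected_words_with_hard_failures(s, n, 1e-5, 0)
n1 = expected_words_with_hard_failures(s, n, 1e-5, 1)
```

Even though only a few hundred words carry a hard failure in this setting, each of them has lost its single correction, so they contribute disproportionately to the array failure probability.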
FAILURE-AWARE ECC BASED CORRECTION
Due to their simple structure and decoding scheme, BCH codes [Wilkerson et al. 2010; Strukov 2006] are commonly used in memory design. Specifically, the Hamming code, which can be viewed as a special case of BCH codes, has found widespread use in memory ECC. The number of additional encoded bits required for ECC is determined by the desired correction capability and the word length at which ECC is applied. Furthermore, ECC can be employed to correct errors with known locations. These types of errors are called erasures [Evain et al. 2014]. The concept of potentially erroneous bits with known locations is well established in digital communications but, surprisingly, is not commonly used in memory systems [Evain et al. 2014]. Theoretically, a code with minimum Hamming distance d can correct t random errors and r erasures if d > 2 * t + r [Walker et al. 1979; Carter and McCarthy 1976; Siewiorek and Swarz 1998; Chen and Hsiao 1984; Evain et al. 2014; Seong et al. 2010; Fujiwara 1989]. Therefore, if the positions of all errors are known, we can set t = 0 and use the code for correcting erasures only.
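The condition d > 2 * t + r can be checked directly. The short sketch below applies it to a SECDED code, whose minimum distance is 4; the function and constant names are illustrative.

```python
def can_correct(d_min, t_errors, r_erasures):
    """True if a code with minimum distance d_min handles t random errors plus r erasures."""
    return d_min > 2 * t_errors + r_erasures


SECDED_DISTANCE = 4  # extended Hamming (SECDED) codes have minimum distance 4

one_soft_one_known = can_correct(SECDED_DISTANCE, 1, 1)   # one soft error plus one erasure
two_known = can_correct(SECDED_DISTANCE, 0, 2)            # two erasures
two_soft = can_correct(SECDED_DISTANCE, 2, 0)             # two errors at unknown locations
```

The first two cases satisfy the bound and the third does not, which is the property that allows a SECDED code to handle a second error once its location is known.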
In FaECC, we use the concept of correcting erasures through ECC [Evain et al. 2014]. Figure 6 illustrates the FaECC methodology. In this methodology, a SECDED code is used and erasure information is used to enable Double Error Correction (DEC) capability for stuck-at-fault errors. In this method, the encoding and decoding are performed as in a SECDED scheme, and DEC decoding is enabled only if double errors are detected by the normal SECDED decoder. Once the DEC decoding is enabled, the erasure information is retrieved and used to correct the stuck-at-fault errors. In the next subsection, we explain the FaECC scheme in detail.
Failure-Aware ECC Scheme
In Hamming code, the data bits are encoded through an encoder to obtain an encoded word (c) to be stored in the memory:
in which a is the input word expressed as:
Note that k is the number of data bits. Further, c is expressed as:
in which n is the total number of bits to be stored including encoded bits, the expression ≡ is equivalent to:
The encoded word is stored in the memory. Once the codeword is read from the memory (c), it may contain one or more errors. In regular Hamming decoders, the syndrome, z, can be calculated as follows:
where H T is the parity check matrix and e is the error pattern belonging to the syndrome. The error pattern is expressed as:
The error pattern may contain s number of 1s (s is equal to 1 for Hamming), which corresponds to the number of errors that are corrected in the codeword:
To this end, every syndrome leads to exactly one error pattern with a single error. Therefore, there are n unique patterns possible in e 1 :
Meaning that each and every single-error pattern would result in a unique syndrome. Furthermore, if the error pattern is all zeros, the codeword is correct; otherwise, decoding can be performed based on a syndrome table.
The same single error pattern corresponds to error patterns with 2 (duets) or 3 (triplets) errors. In order to distinguish between the single-error occurrence and the higher number of errors, an extra parity bit is added (constructing a SECDED coding). This parity bit clarifies whether there was a single error in the codeword or two errors. If there is only one error, the decoder asserts the error based on the individual single error pattern:
in which ĉ is the corrected codeword. On the other hand, if two errors are detected, the error patterns can be expressed as d m 2 e 2 . However, there are several double-error patterns that correspond to the same syndrome:
Therefore, if there is no additional information, the codeword cannot be uniquely selected and the normal decoder would assert an error to the output.
On the other hand, in the FaECC scheme, we consider these double-error pattern codewords and resolve which one should be considered to calculate the correct word. For this purpose, let us call the two nonzero bits in every d m 2 "active bits". Under such conditions, these error patterns are orthogonal, meaning that for every i,j that satisfies Equation (27) for the same syndrome, we have:
Particularly, each of these codewords contains a unique pair of active bits; if bit i and bit j are active in codeword x, neither of them is active in any of the remaining codewords that satisfy Equation (22) for the same syndrome as x. In other words, every bit in the codeword is active in at most one of the possible error patterns. In order to identify the correct candidate error pattern, there is a need to identify the location of one of the active bits. If both of the erroneous bits are soft errors, there is no way to find out which bits were erroneous. However, if one of the errors is a stuck-at-fault, meaning that it is possible to detect the location of the error, the correct value of both of the bits can be retrieved. In order to find the location of one of the active bits, the erroneous codeword can be inverted, rewritten into the same line, and read back [Chen and Hsiao 1984]. This inversion enables the decoder to detect any stuck-at-fault location and assists in finding the correct codeword. At the next step, the codeword that was read the second time is compared to the codeword that was read the first time, and the locations of the faulty bits are derived. Finally, the active bits associated with the locations of the faulty bits are fixed. Thus, the correct candidate codeword can be selected and the corrected word can be retrieved.
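The invert, rewrite, and reread step can be sketched with a toy model of a memory word. The class below is purely illustrative (it is not part of the article's framework) and assumes ideal writes to healthy cells, so any bit that fails to flip after the inverted write is taken to be stuck.

```python
class StuckWord:
    """Toy model of one memory word with stuck-at bits (illustrative only)."""

    def __init__(self, width, stuck_mask, stuck_values):
        self.mask = (1 << width) - 1
        self.stuck_mask = stuck_mask                    # 1 = this bit cannot change
        self.stuck_values = stuck_values & stuck_mask   # value held by each stuck bit
        self.cells = self.stuck_values

    def write(self, value):
        free = value & ~self.stuck_mask & self.mask     # healthy bits take the new value
        self.cells = free | self.stuck_values           # stuck bits keep their value

    def read(self):
        return self.cells


def locate_stuck_bits(word):
    """Return a mask of bits that did not flip after writing the inverted word."""
    first = word.read()
    word.write(~first & word.mask)         # write back the inverted word
    second = word.read()
    word.write(first)                      # restore the original contents
    return ~(first ^ second) & word.mask   # unchanged bits are stuck


w = StuckWord(8, stuck_mask=0b00010000, stuck_values=0b00010000)
w.write(0b10100101)                        # bit 4 is stuck at 1
stuck = locate_stuck_bits(w)
```

In this example the returned mask has only bit 4 set, and that location is the erasure information that the decoder needs to select among the candidate double-error patterns.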
As explained above, the decoding scheme is capable of correcting a single error, two stuck-at-faults or a single stuck-at-fault and a single soft error. Table I compares the correction capability of SECDED, FaECC, and DECTED, where deg is the degree of the Galois Field used to realize the coding scheme. If there are two soft errors, this scheme will not be able to correct it and will assert a fault as the output. Furthermore, although we used this scheme to correct errors in STT-MRAM memory arrays, it can be used to improve the yield of any type of memory array under the aforementioned conditions.
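The decoding idea can be illustrated on a toy (8,4) extended Hamming code instead of the (137,128) code considered in this article. The sketch below is an assumption-laden simplification: position 0 holds the overall parity bit, positions 1 to 7 form a Hamming(7,4) codeword whose syndrome is the XOR of the positions of the set bits, and the known stuck-at position is assumed to be one of the two erroneous bits. It is not the lookup-table decoder described later.

```python
def syndrome(bits):
    """XOR of the positions (1..7) of all set bits; zero for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if bits[pos]:
            s ^= pos
    return s


def encode(data4):
    """Place 4 data bits at positions 3, 5, 6, 7 and fill in the parity bits."""
    bits = [0] * 8
    for pos, d in zip((3, 5, 6, 7), data4):
        bits[pos] = d
    s = syndrome(bits)
    for p in (1, 2, 4):
        if s & p:
            bits[p] = 1                 # parity bits cancel the data syndrome
    bits[0] = sum(bits[1:]) % 2         # overall parity bit (SECDED extension)
    return bits


def decode(word, stuck_pos=None):
    """SECDED decode; a known stuck-at position enables double-error correction."""
    word = list(word)
    s = syndrome(word)
    odd = sum(word) % 2
    if s == 0 and not odd:
        return word, "no error"
    if odd:                             # odd overall parity: single error at position s
        word[s] ^= 1
        return word, "single error corrected"
    if stuck_pos is None:               # even parity, nonzero syndrome: double error
        return word, "double error detected"
    other = s ^ stuck_pos               # assumes the stuck bit is one of the two errors
    word[stuck_pos] ^= 1
    word[other] ^= 1
    return word, "double error corrected"


codeword = encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[3] ^= 1                       # stuck-at fault at a known position
corrupted[6] ^= 1                       # soft error at an unknown position
repaired, status = decode(corrupted, stuck_pos=3)
```

Because every position appears in at most one candidate pair for a given syndrome, knowing one erroneous position determines the other as the XOR of the syndrome with that position.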
CROSS LAYER SIMULATION FRAMEWORK
In order to analyze the reliability of STT-MRAM memory arrays, we developed a cross-layer simulation framework that captures the impact of various design parameters at different levels of abstraction (device, circuit, and architecture) on STT-MRAM memory array reliability. Figure 7 shows the simulation framework and its different stages of analysis. The framework takes MTJ characteristics, memory specifications, and design constraints as inputs, and optimizes the memory array for the desired efficiency. We describe the simulation framework and its models at each level of abstraction in further detail next.
Device Level
The simulation framework utilizes the device level model based on [Fong et al. 2012], which consists of a magnetization dynamics solver and a Non-Equilibrium Green's Function (NEGF)-based electron transport solver [Danielewicz 1984]. Initially, the NEGF solver is utilized to obtain RA P,AP vs. T MgO and V MTJ . Next, the magnetization dynamics is obtained from the critical switching currents of the free layer, J C (AP→P) and J C (P→AP). The free layer is modeled as a monodomain ferromagnet. The magnetization of the monodomain ferromagnet is simulated by solving the Landau-Lifshitz-Gilbert equation, including the Slonczewski spin-torque term (LLGS) [Lee et al. 2005].
where m FL and m PL are the unit magnetization vectors of the free layer (FL) and pinned layer (PL), respectively. Both FL and PL are considered to have the same M S . γ is the gyromagnetic ratio, α is the FL damping factor, and H EFF is the effective magnetic field. q is the electronic charge, J MTJ is the current density through the MTJ, and P is the material-dependent spin polarization efficiency defined in Slonczewski [1996]. The characteristics of the MTJ are encapsulated in a Verilog-A model [Fong et al. 2012], which is used in HSPICE simulation [HSPICE 2013]. Table II shows the device parameters assumed in this work. These and other bit-cell parameters were derived from Fong et al. [2012], and the MTJ model was calibrated to experimental data published in the literature [Yuasa et al. 2004].
Circuit Level
The circuit level model of an STT-MRAM bit-cell consists of 32nm MOSFET models [Synopsys Inc. 2014] and the MTJ Verilog-A model. HSPICE was used to simulate the circuit level behavior of the bit-cell. The load line method [Fong et al. 2012] was used to obtain the probability of failure for different failure mechanisms. Figure 8 illustrates the load line method. In this method, we consider the effect of variations in t MgO on the ability to write into the bit-cell (write failures), the ability to correctly sense R MTJ of the bit-cell (decision failures), and the ability of the MTJ to retain its configuration when the bit-cell is being read (disturb failures). In order to determine write failures, the MTJ cross-sectional area is assumed to have a Gaussian distribution. For each MTJ cross-sectional area, the critical current density (J C ) is determined. At the next step, using the transistor I D -V DS characteristics (obtained using Monte Carlo simulations in HSPICE), we find the voltage across the MTJ (V MTJ ) from the DC load line analysis as shown in Figure 8(a). Finally, the maximum R MTJ (and the corresponding maximum t MgO ) that allows a successful write in the MTJ is calculated. Hence, any bit-cell having an MTJ with the same area but a thicker t MgO will not be written in the targeted write time. Therefore, the bit-cell fails the write operation. A similar analysis is performed for read disturb failures. However, in read disturb failures, the bit-cells with thinner t MgO are considered to fail.
Decision failures occur when the sense amplifier outputs H for a bit-cell in P configuration (R L ) or L for a bit-cell in AP configuration (R H ). The probability that a functioning sense amplifier senses the bit-cell configuration incorrectly is called the read decision failure probability. The reference current (I REF ) needs to be chosen to minimize this probability. For a bit-cell with an MTJ of a particular cross-sectional area, a certain t MgO will result in a bit-cell current equal to I REF . If the MTJ is in AP (P), a thinner (thicker) t MgO will result in a smaller (larger) R MTJ and a bit-cell current higher (lower) than I REF . Figure 8(b) illustrates the method used to determine the decision failures for each I REF .
The optimum read reference current is the reference current that minimizes the read probability of failure. In our analysis, we perform a linear search between the nominal read currents of P and AP configurations to obtain the optimum reference current.
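The linear search for the reference current can be sketched as follows. Note that this is a simplified stand-in for the load line analysis: it assumes the P and AP bit-cell currents are Gaussian, with hypothetical means and standard deviations in microamperes, and it weights both stored values equally. None of the numbers or names come from this article.

```python
import math


def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def decision_failure(i_ref, i_p, sigma_p, i_ap, sigma_ap):
    """Average probability of misreading P (current below I_ref) or AP (current above)."""
    p_fail_p = normal_cdf(i_ref, i_p, sigma_p)
    p_fail_ap = 1.0 - normal_cdf(i_ref, i_ap, sigma_ap)
    return 0.5 * (p_fail_p + p_fail_ap)


def best_reference(i_p, sigma_p, i_ap, sigma_ap, steps=1000):
    """Linear search between the nominal AP and P read currents."""
    candidates = [i_ap + (i_p - i_ap) * k / steps for k in range(steps + 1)]
    return min(candidates,
               key=lambda i: decision_failure(i, i_p, sigma_p, i_ap, sigma_ap))


# Hypothetical nominal currents: 60 uA (P, low resistance) and 30 uA (AP).
i_ref_equal = best_reference(60.0, 3.0, 30.0, 3.0)
i_ref_skewed = best_reference(60.0, 6.0, 30.0, 3.0)
```

With equal spreads the optimum sits at the average of the two nominal currents, while a wider P distribution pushes the optimum toward the AP current, which is why the average is not always the best choice under process variations.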
The variations in the MTJ considered were variations in the cross-sectional area and the oxide thickness. Both were considered to have normal distribution with 2% variance. Moreover, in order to capture the variations in the access transistor, 1e4 Monte-Carlo simulations were performed and the aforementioned methods were used to obtain the probability of failure for different failure mechanisms.
Array Level
At the array level, CACTI [Muralimanohar et al. 2009 ] was modified to include the access time and energy model of the STT-MRAM. We considered a two-finger layout of the bit-cell as explained in . The layout of the bit-cell is illustrated in Figure 9 . Additionally, the number of encoding bits was added to the CACTI model to capture the impact of the extra bit cells on the memory efficiency.
To analyze the impact of ECC, we implemented the ECC codecs. The encoders for the Hamming code and the proposed FaECC were identical, whereas the Hamming decoder and the FaECC decoder were different. For the Hamming decoder, our implementation was based on [Opencores 2015]. For the FaECC decoder, an RTL HDL description was developed using the lookup table method [Howell et al. 1977]. Synopsys Design Compiler [Design Compiler 2011] was then used to implement the decoders in the 32nm technology node. Table III shows the characteristics of the decoders.
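To make the lookup table idea concrete, the sketch below shows a syndrome-indexed table decoder for the small Hamming(7,4) code in Python. It is only an illustration of table-based decoding in general: it is neither the 128-bit codec nor the FaECC decoder RTL used in this work, and the bit ordering and function names are assumptions.

```python
DATA_POSITIONS = (3, 5, 6, 7)     # 1-indexed codeword positions holding data
PARITY_POSITIONS = (1, 2, 4)      # power-of-two positions hold parity bits

def syndrome(code):
    """XOR of the 1-indexed positions of all set bits; 0 means no error seen."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s

def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    code = [0] * 7
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos - 1] = bit
    s = syndrome(code)
    for p in PARITY_POSITIONS:    # choose parity so the final syndrome is 0
        code[p - 1] = 1 if s & p else 0
    return code

# Lookup table: syndrome -> index of the bit to flip (None when no error).
SYNDROME_TABLE = {0: None}
for idx in range(7):
    pattern = [0] * 7
    pattern[idx] = 1
    SYNDROME_TABLE[syndrome(pattern)] = idx

def decode(code):
    """Correct up to one bit error and return the 4 data bits."""
    fixed = list(code)
    idx = SYNDROME_TABLE[syndrome(fixed)]
    if idx is not None:
        fixed[idx] ^= 1
    return [fixed[pos - 1] for pos in DATA_POSITIONS]
```

The table is built once from the single-bit error patterns, so decoding reduces to one syndrome computation and one table access, which is the property that makes the approach attractive for a hardware decoder.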
To calculate the efficiency of the memory at the array level, each write operation was assumed to encode the bits and write them into the memory; therefore, the overheads associated with encoding were counted towards the total efficiency of the memory. For the read operation, however, the results were obtained as a weighted average over the three possible scenarios, weighted by how often each occurs:
(1) read operation and error detection, (2) read operation and single-error correction, and (3) read operation and double-error correction using reread, rewrite, and FaECC decoder.
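The weighted average over the three read scenarios listed above can be written as a short Python helper. The function and parameter names are hypothetical, and the cost values in any call are placeholders rather than measured results; the same helper applies to either energy or latency.

```python
def mean_read_cost(n_clean, n_single, n_double, base, detect, single_fix, double_fix):
    """Weighted average cost (energy or latency) per read operation.

    n_clean, n_single, n_double : number of reads in scenarios (1), (2), (3)
    base       : cost of the array read itself
    detect     : cost of error detection, paid on every read
    single_fix : extra cost of single-error correction
    double_fix : extra cost of the reread, rewrite, and FaECC decoding step
    """
    total = n_clean + n_single + n_double
    if total == 0:
        raise ValueError("at least one read is required")
    cost_clean = base + detect
    cost_single = base + detect + single_fix
    cost_double = base + detect + double_fix
    weighted = n_clean * cost_clean + n_single * cost_single + n_double * cost_double
    return weighted / total
```

Because scenarios (2) and (3) are rare when the bit error probability is low, the mean stays close to the cost of a read with detection only, even if the double-error path is expensive.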
RESULTS AND DISCUSSION
We designed a 1MB STT-MRAM cache to evaluate the proposed ECC techniques. Table IV presents the characteristics of the cache. To capture the impact of process variations on the MTJ, the volume of the free layer was assumed to vary with a standard deviation of 2% of the nominal value. The same variation level was considered for the cross section of the MTJ. Also, to analyze the impact of variations on the access transistor, as explained earlier, the load line method [Fong et al. 2012] was used to obtain read and write failures. Initially, we investigated the read operation and analyzed the impact of different parameters on read reliability. Figure 10(a) illustrates the probability of read failure vs. the access transistor width. The read decision failure probability increases slightly with an increase in the transistor width. This is due to the degradation in the bit-cell TMR with higher transistor widths. Furthermore, the probability of failure is slightly higher for V read = 200mV; however, the difference is smaller than an order of magnitude. Our results show that read disturb failures are negligible for our design. Therefore, read decision failures dominate the probability of read failure, which makes the probability of read failure virtually independent of E B .
Next, we investigated write operation reliability. Figure 10(b) illustrates the probability of write failure vs. access transistor size for two different write pulse widths. V dd was set to 1V. As observed, the probability of write failure decreases with an increase in the width of the access transistor. This is due to the increase in the write current for higher transistor widths. Due to process variations, some of the bit-cells have a higher than nominal critical current; when the write current is increased, these bit-cells are written successfully. This trend is observed for both of the write pulse widths shown in Figure 10(b). However, the probability of failure is larger for the 6ns pulse width than for the 8ns pulse width. This is due to the inverse relationship between the critical current of the MTJ and the write pulse width: the critical current is smaller for the 8ns pulse width than for 6ns. Therefore, for a given nominal transistor width (which results in a given write current), the number of bit-cells whose write current is less than their critical current is smaller for 8ns than for 6ns.
Let us consider the relationship between E B and the probability of write error. Figure 11 illustrates the probability of write error with respect to access transistor width for different values of E B at a fixed pulse width of 8ns. As observed, for a given transistor width, the probability of error is higher for higher E B . This relation stems from the increased critical current of the bit-cell at higher E B .
Once the design space was explored and the bit-cells were characterized, we designed caches optimized for different design metrics. For this purpose, we considered the target yield to be 99.9% and the run-time reliability to be 1 FIT. As observed in Figure 10(a), the read probability of failure is of the order of 1e-6 and does not change substantially with the transistor width. Furthermore, to define the write probability of failure, the nominal write pulse width of 8ns is used. As observed in Figure 10(b), the write failure probability increases drastically with a decrease in the access transistor width. Therefore, if the read and write probabilities of failure are considered jointly, the probability of bit-cell failure cannot be made lower than ∼1e-6 by adjusting the transistor size and/or the read voltage. Consequently, it is not possible to meet the reliability target without applying ECC. This result matches the results in Xu et al. [2009], Del Bel et al. [2014], Pajouhi et al. [2015], and Kwon et al. [2015].
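A back-of-the-envelope check of this conclusion is sketched below in Python. It assumes independent bit-cell failures with probability 1e-6, a 1MB array of 128-bit data words, and a (137,128) single-error-correcting organization; the 9 check bits and the simple binomial yield model are assumptions made for this illustration and are not the yield model of the simulation framework.

```python
import math

def array_yield_without_ecc(p_bit, n_bits):
    """Probability that every one of n_bits cells is fault free."""
    return math.exp(n_bits * math.log1p(-p_bit))

def word_failure_probability(p_bit, word_bits, correctable):
    """P(more than `correctable` failing cells in one encoded word)."""
    return sum(
        math.comb(word_bits, k) * p_bit**k * (1.0 - p_bit) ** (word_bits - k)
        for k in range(correctable + 1, word_bits + 1)
    )

def array_yield_with_ecc(p_bit, n_words, word_bits, correctable):
    """Probability that every encoded word is correctable."""
    p_word = word_failure_probability(p_bit, word_bits, correctable)
    return math.exp(n_words * math.log1p(-p_word))

P_BIT = 1e-6                      # bit-cell failure probability quoted above
DATA_BITS = 8 * 2**20             # 1MB of data
N_WORDS = DATA_BITS // 128        # 128-bit data words
no_ecc = array_yield_without_ecc(P_BIT, DATA_BITS)
sec = array_yield_with_ecc(P_BIT, N_WORDS, 137, 1)   # 137 = 128 + 9 (assumed)
print(f"yield without ECC: {no_ecc:.2e}, with single-error correction: {sec:.5f}")
```

Under these assumptions the unprotected array falls far short of the 99.9% yield target, whereas correcting one error per word brings the estimate slightly above it. This leaves little margin for run-time errors, which is consistent with the motivation for a scheme that masks hard faults separately.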
Next, we considered a cache with ECC. To have a fair comparison between the different ECC schemes, we considered a 128-bit ECC for SECDED, FaECC, and DECTED. Figure 12 compares the energy, the area, and the read/write latency for caches with different ECC configurations when the caches were optimized for minimum area. As observed, the area of a cache with FaECC is 20% and 13% less than that of a cache with SECDED and DECTED, respectively. Notably, the access transistor width is smaller for FaECC than for SECDED. This reduced transistor width stems from the higher coverage of read and write errors provided by FaECC compared to SECDED. Furthermore, since SECDED and DECTED are used for yield enhancement as well as for enhancing run-time reliability, it may not be easy to reduce E B (run-time errors may increase). On the other hand, if FaECC is used, there exists an opportunity to optimize E B while still maintaining good coverage for run-time errors.
Next, we optimized the cache for minimum energy consumption. To have a fair comparison between the three cache configurations, we considered the mean energy consumption of every read operation. Specifically, the energy associated with the error detection unit is considered for each and every read operation. However, for SECDED and DECTED, the decoder energy is considered only when an error is detected. Further, for FaECC, if a single error is detected, the decoding procedure involves correcting a single error; thus, it does not include the additional write and read step. On the other hand, if two errors are detected, the decoding involves extra write and read operations as well as the use of the additional decoding step. Therefore, the energy associated with each of these two conditions is added to the total energy based on the number of times each condition is applicable. Figure 13 compares the read/write energy and the area of the cache after energy optimization is performed. As observed, the read energy of the cache with FaECC is 8% less than SECDED and 4% less than DECTED. Moreover, the write energy of FaECC is 21% and 11% smaller than that of caches with SECDED and DECTED, respectively. Also, as observed in Figure 13, the area of the cache with FaECC is 20% and 11% less than the caches with SECDED and DECTED, respectively.
We also optimized the cache for improved write performance. To have a fair comparison, we compared the mean delay of the three different ECC schemes. For this purpose, the write latency was calculated based on the latency required for a successful write operation as well as the latency for calculating the encoding bits. To obtain the mean read latency, similar to calculating the energy, we considered the weighted average of the delays of the different ECC schemes based on the number of times they are invoked. For all ECC schemes, the error detection delay is included in every read operation. However, for SECDED and DECTED, the decoder delay is considered only when an error is to be corrected. For FaECC, as observed in Figure 6, error detection is performed for every data read and the correction unit is used only if there is an error.
If there is only a single error, the FaECC decoder has the same delay as a SECDED decoder. The FaECC scheme differs from SECDED only when two errors are detected. Moreover, a hard error contributes to the detected errors only if the data value written into the hard-error location is different from the data read from that location: if a bit-cell with a hard error of "0" is storing "0" (the same value), it will be read without any error. When FaECC is activated, the read latency is dominated by a write and a read operation. This is because the estimation of the candidate codewords proceeds in parallel with the additional write and read procedure. Note that the hard-error locations are required only at the end of the correction procedure. Therefore, the worst-case delay associated with FaECC is longer than that of SECDED or DECTED.
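The observation that a hard error is visible only when the stored value differs from the stuck value can be illustrated with a few lines of Python. This is a toy model of stuck-at cells, with hypothetical function names, and it does not reproduce the FaECC correction procedure itself.

```python
def read_with_stuck_cells(written, stuck):
    """Return the word as read back from an array with stuck-at cells.

    written : list of bits that were written
    stuck   : dict mapping bit index -> value the faulty cell is stuck at
    """
    return [stuck.get(i, bit) for i, bit in enumerate(written)]

def visible_errors(written, stuck):
    """Indices where the value read back differs from the value written."""
    read = read_with_stuck_cells(written, stuck)
    return [i for i, (w, r) in enumerate(zip(written, read)) if w != r]

# Cells 0 and 2 are stuck at "0"; only cell 2 was asked to store a "1".
word = [0, 1, 1, 0]
faults = {0: 0, 2: 0}
print(visible_errors(word, faults))   # prints [2]
```

In this toy model, each stuck cell shows up as an error only for data that disagrees with its stuck value, so the two-error path is exercised only for a subset of the data written to a faulty word.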
On the other hand, in STT-MRAMs, due to the long latency associated with STT switching, the write pulse width dominates the write latency. Therefore, for write performance optimization, the write pulse width should be reduced. To have a fair comparison, we reduced the write pulse width of all three ECC configurations to 6ns and optimized the cache for performance. It can be observed from Figure 10(b) that if the write pulse duration is 6ns instead of 8ns, the probability of write failure increases for a fixed access transistor width. To compensate for this increased probability of failure, the access transistor can be upsized to ensure complete STT switching. However, upsizing the access transistor negatively impacts the read performance. Therefore, there is a tradeoff between the read performance and the write performance. Figure 14 depicts the area and read/write latency for a cache with different ECC configurations. As observed, the read latency of FaECC is 16% less than that of the cache with SECDED and 11% less than that of the cache with DECTED. Furthermore, the area is 19% and 14% less than that of SECDED and DECTED, respectively.
CONCLUSION
In this article, we analyzed the impact of process variations on the run-time reliability of STT-MRAM memory arrays. Furthermore, we analyzed the efficacy of ECC in relaxing the E B requirement of the MTJ under process variations. We also analyzed the efficacy of ECC for yield enhancement and run-time reliability. Our results showed that if SECDED is used for yield enhancement in addition to run-time reliability, it may be difficult to relax the value of E B (and thereby lower the write current). Thus, we proposed using FaECC, in which permanent faults are masked while the correction capability for soft errors is maintained. To analyze the efficacy of FaECC, we developed a simulation framework that considers different levels of design abstraction. Using the simulation framework, we performed a case study of a 1MB cache in the 32nm technology node. We showed that in our proposed scheme, the area of the memory array is reduced by up to 20% compared to a cache with SECDED and by up to 13% compared to a cache with DECTED, at iso-reliability. Furthermore, the write energy can be reduced by up to 21% and 11% compared to caches with SECDED and DECTED correction capabilities, respectively.
