Abstract-Since the advent of complementary metal oxide semiconductor (CMOS) technology, the number of transistors per die has continued to increase, today reaching several billion. As a result, it has been possible to design and fabricate smart devices able to run at high speed. However, the power consumption of systems-on-chip has increased significantly due to high-density integration and the high leakage power of current CMOS transistors. Consequently, the limits of heat dissipation make further performance improvement difficult, and achieving a high level of autonomy for battery-powered devices remains a real challenge. To deal with these issues, spin-transfer-torque magnetic random-access memory (STT-MRAM) technology is seen as a promising solution. In addition to its attractive performance features, STT-MRAM can bring nonvolatility to a system, allowing full data retention after a complete shutdown while maintaining a fast wake-up time. Considering two 32-bit embedded processors, this letter shows how STT-MRAM can improve the energy efficiency and reliability of future embedded systems thanks to normally-off computing and checkpointing/rollback techniques. A detailed analysis is performed to evaluate the cost related to the backup/recovery of the system.
Index Terms-Spintronic memory and logic, embedded processor, spin-transfer-torque, magnetic random-access memory.
I. INTRODUCTION
The scaling limits of complementary metal oxide semiconductor (CMOS) technology are mainly due to the high heat dissipation observed in current systems-on-chip. As a consequence, speed and density are limited, and thermal constraints force parts of the system to be turned off using power-gating techniques [Or-Bach 2015]. However, this solution is clearly constrained by the inherent volatility of CMOS devices, since turning off the memory also means losing the execution state. For beyond-CMOS systems, spin-transfer-torque magnetic random-access memory (STT-MRAM) is a promising solution because it combines nonvolatility, high density, low leakage, and competitive access time compared to CMOS-based memories such as static random-access memory (SRAM) and flash memory [Kültürsay et al. 2013, Ikegami et al. 2014, Meena et al. 2014, Jiang et al. 2016, Lacoste et al. 2016]. Recently, new technologies known as spin-Hall-effect MRAM [Aradhya 2016, Kang 2017] and voltage-controlled MRAM [Noguchi 2016, Grezes 2017] have appeared promising for fast and ultralow-power applications. Since these technologies are quite young and require further development, they are not considered in this letter. In contrast, several prototypes have demonstrated the maturity of STT-MRAM technology [Lu 2015, Chung 2016, Song 2016, Rho 2017].
Considering two 32-bit embedded processors (the MIPS-like SecretBlaze [Barthe 2011] and the ARM-like Amber [Santifort 2010]), this letter shows how perpendicular STT-MRAM can help to design fast, low-power, and reliable devices. The rest of the letter is organized as follows. Section II gives the basics of STT-MRAM technology. Section III presents the experimental setup used to evaluate a nonvolatile processor based on STT-MRAM. Section IV describes two features enabled by the nonvolatility of STT-MRAM for future computing: normally-off computing and rollback. Section V presents a detailed performance/energy analysis of the considered STT-MRAM-based embedded processors under various architecture scenarios. Section VI concludes this letter.
Compared to the state of the art, this letter validates the normally-off computing and rollback techniques on two different 32-bit embedded processors, which strengthens our approach. Moreover, this letter also validates a recovery of the system at runtime (e.g., in the case of a soft error), unlike related works that only considered the recovery process after a shutdown of the system [Koike et al. 2013, Layer et al. 2016].
II. SPIN-TRANSFER TORQUE MRAM: BASICS
A bit of information is stored as the resistance of a magnetic tunnel junction (MTJ), which consists of two ferromagnetic layers separated by a thin insulating barrier (see Fig. 1). The parallel (antiparallel) state yields a low (high) resistance value and can be interpreted as a logic zero (one). A read operation consists of measuring the resistance using a sensing current flowing through the MTJ. For the write operation, a spin-polarized current flips the magnetization of the storage layer by direct transfer of spin angular momentum from the spin-polarized electrons [Khvalkovskiy et al. 2013]. The direction of the current flow through the MTJ determines the final state of the bit cell.
Regarding the latest advances in STT-MRAM technology, IBM and SAMSUNG [Nowak 2016] demonstrated that a specific MTJ stack with perpendicular magnetic anisotropy is capable of delivering good STT performance down to a write error rate (WER) of 10⁻⁶ over a broad range of device sizes from 50 to 11 nm, on a statistically relevant sample of several hundred devices. They also demonstrated an individual 11 nm device switching down to WER = 7 × 10⁻¹⁰ using only 7.5 µA.
III. EXPERIMENTAL SETUP
The first step in evaluating the integration of STT-MRAM into the considered processors is to validate two capabilities brought by nonvolatility: normally-off computing, for near-zero leakage power during sleep mode, and rollback, to tolerate soft errors and power failures. The synthesizable hardware description language code of each processor was modified to implement these techniques. The registers retaining the state of the processor were duplicated to emulate the nonvolatile STT-MRAM registers, and control logic was added to enable the backup/recovery of the system state.
The second step is to quantify the cost, in terms of speed, energy, and area, of implementing these two capabilities. Data from state-of-the-art STT-MRAM-based flip-flops (FFs) [Chabi 2014] are used to evaluate the cost at the register level. The costs at the cache and main memory levels are quantified using NVSim [Dong 2012]. Tables 1 and 2 detail, respectively, the performance data of each STT-MRAM-based memory element and the three architecture scenarios considered in this letter. The SRAM cache used in scenario 3 has a read latency of 0.43 ns, a write latency of 0.4 ns, a read energy of 8 pJ, and a write energy of 7 pJ.
IV. NONVOLATILE COMPUTING

A. Normally-Off Computing
Normally-off computing consists of saving the state of the processor before a complete shutdown, then restoring that state at the next power-up. To make this possible, STT-MRAM must be inserted at both the register level and the main memory level. Among all of the FFs of the SecretBlaze (Amber), 1986 (1644) FFs contain the state of the processor. A typical FF based on STT-MRAM is designed to have a dual-storage facility [see Fig. 2(a)] [Jovanovic 2015]. The CMOS stage of the FF uses cross-coupled inverters (a latch) to store one data bit in its electrical (volatile) form, whereas the magnetic stage uses an MTJ to store one nonvolatile data bit. Such a hybrid CMOS/MTJ FF allows fast processing during active mode, while the leakage power is significantly reduced during sleep mode thanks to the nonvolatility of the MTJ. Fig. 2(b) shows a snapshot of a silicon prototype with two FFs based on STT-MRAM (200 nm) and 28 nm fully depleted silicon-on-insulator (FDSOI) CMOS technology. In brief, assuming that the FFs and the main memory of the processor are based on STT-MRAM, a backup/recovery of the processor proceeds as described in Algorithm 1.
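The save-then-restore sequence described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the letter's Algorithm 1: the class names `HybridFlipFlop` and `NormallyOffCore` are hypothetical, each FF is reduced to one volatile bit plus one nonvolatile bit, and power-off is modeled by discarding the volatile bit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HybridFlipFlop:
    """Toy model of a dual-storage CMOS/MTJ flip-flop (hypothetical name)."""
    volatile: Optional[int] = 0   # CMOS latch: lost when power is removed
    nonvolatile: int = 0          # MTJ: retained without power

    def backup(self) -> None:
        # Copy the latch value into the magnetic stage.
        if self.volatile is None:
            raise RuntimeError("cannot back up an unpowered flip-flop")
        self.nonvolatile = self.volatile

    def restore(self) -> None:
        # Reload the latch from the magnetic stage.
        self.volatile = self.nonvolatile


class NormallyOffCore:
    """Processor state reduced to a list of hybrid FFs (illustrative only)."""

    def __init__(self, n_ffs: int) -> None:
        self.ffs: List[HybridFlipFlop] = [HybridFlipFlop() for _ in range(n_ffs)]

    def load_state(self, bits: List[int]) -> None:
        for ff, bit in zip(self.ffs, bits):
            ff.volatile = bit

    def state(self) -> List[Optional[int]]:
        return [ff.volatile for ff in self.ffs]

    def shutdown(self) -> None:
        for ff in self.ffs:
            ff.backup()
        for ff in self.ffs:
            ff.volatile = None    # power removed: latch contents are lost

    def wake_up(self) -> None:
        for ff in self.ffs:
            ff.restore()


core = NormallyOffCore(n_ffs=8)
saved = [1, 0, 1, 1, 0, 0, 1, 0]
core.load_state(saved)
core.shutdown()
state_while_off = core.state()     # every latch is empty here
core.wake_up()
state_after_wake = core.state()    # equal to the state saved before shutdown
```

The main memory does not appear in the model because, being STT-MRAM, it is assumed to retain its contents through the shutdown without any explicit copy.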
B. Checkpointing/Rollback Mechanism
The rollback technique is the ability to restore a safe state of the processor in the case of a system failure (see Fig. 3). This letter assumes that an error detection mechanism is available in the processor architecture to identify errors during execution, for instance, as proposed in Wali [2016]. To avoid restarting the application, checkpoints can be created at runtime by saving the state of the processor either periodically or at strategic instants during the execution of the application. Then, if a system failure occurs, the last checkpoint is recovered. To keep a backup of the system state, a checkpoint has to retain the state of the registers and the main memory. Thanks to the dual-storage structure of the STT-MRAM FFs, a checkpoint of the registers is performed by copying data from the volatile CMOS stage to the nonvolatile magnetic stage of each FF.
After a checkpoint, the main memory contents will most probably be modified. For the rollback procedure, a dual-bank memory architecture can be considered. One bank (the main memory) is dedicated to the execution of the application, whereas the other bank (the checkpoint memory) is used for the backup. It is worth noting that the checkpoint memory would be smaller than the main memory in a real application, because only a few memory locations are modified between two checkpoints. The size of the checkpoint memory depends on the application and the checkpointing period. Fig. 4 describes how this work implements the backup of the main memory. A buffer is used to save the addresses of the memory locations modified during the execution of the application. When a checkpoint is requested, only the memory locations modified since the last checkpoint are backed up. If the address buffer is full, a checkpoint is forced. In a similar way, if a rollback is needed before the next checkpoint (e.g., because of an execution error), only the modified memory locations are restored. An alternative solution to perform a checkpoint at the memory level could be the use of a double-context nonvolatile SRAM cell as proposed in Jovanovic [2015]. With a nonvolatile memory based on this cell, it is possible to reduce the silicon area overhead and to greatly simplify the backup of the main memory. In brief, under the aforementioned assumptions, the checkpointing/rollback mechanism proceeds as described in Algorithm 2.
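The address-buffer scheme described above can be sketched as follows. This is a minimal behavioral model, not the letter's Algorithm 2 or the logic of Fig. 4. The class name `DualBankMemory` and its methods are hypothetical, and for simplicity the checkpoint bank is assumed to mirror the full address space, whereas the letter anticipates a smaller checkpoint memory in a real application.

```python
from typing import List


class DualBankMemory:
    """Main bank plus checkpoint bank with an address buffer (illustrative)."""

    def __init__(self, size_words: int, buffer_capacity: int) -> None:
        self.main: List[int] = [0] * size_words
        # Assumption: full-size mirror of the main bank, initially identical.
        self.checkpoint_bank: List[int] = [0] * size_words
        self.buffer_capacity = buffer_capacity
        self.modified: List[int] = []     # addresses written since last checkpoint
        self.forced_checkpoints = 0

    def write(self, addr: int, value: int) -> None:
        if addr not in self.modified:
            if len(self.modified) == self.buffer_capacity:
                # Address buffer full: force a checkpoint before recording more.
                self.create_checkpoint()
                self.forced_checkpoints += 1
            self.modified.append(addr)
        self.main[addr] = value

    def create_checkpoint(self) -> None:
        # Back up only the locations modified since the last checkpoint.
        for addr in self.modified:
            self.checkpoint_bank[addr] = self.main[addr]
        self.modified.clear()

    def rollback(self) -> None:
        # Restore only the locations modified since the last checkpoint.
        for addr in self.modified:
            self.main[addr] = self.checkpoint_bank[addr]
        self.modified.clear()


mem = DualBankMemory(size_words=16, buffer_capacity=2)
mem.write(3, 111)
mem.create_checkpoint()
mem.write(3, 222)            # modified after the checkpoint
mem.write(5, 333)            # modified after the checkpoint
mem.rollback()
after_rollback = (mem.main[3], mem.main[5])

mem.write(1, 10)
mem.write(2, 20)
mem.write(4, 40)             # third distinct address overflows the 2-entry buffer
```

In this model a forced checkpoint copies only the buffered addresses, so its cost is bounded by the buffer capacity; a complete system checkpoint would also save the registers, which the sketch leaves out.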
C. Validation
A complete backup/recovery of the system state has been validated through register-transfer-level simulations for the SecretBlaze and the Amber processors. A checkpoint/rollback procedure is demonstrated in Fig. 5, which shows the output terminals of both processors running the Data Encryption Standard and the Blowfish cipher algorithms. In both cases, a rollback is carried out at runtime, and the applications are then properly re-executed from the checkpoint.
V. ANALYSIS OF STT-MRAM-BASED EMBEDDED PROCESSOR
This section analyzes the cost of integrating the normally-off computing and rollback features into the considered processors. All the results are summarized in Tables 3 and 4. The core frequency is set to 50 MHz, and this work assumes a checkpoint memory size of 4 kB. The cost has been evaluated separately for each memory element (i.e., FFs, cache, and main memory), and the total cost for the complete system is then reported. This work assumes that the backup/recovery processes of the FFs, cache, and main memory are carried out in parallel. Therefore, the latency cost at the system level is the highest latency among the three memory elements, whereas the energy cost at the system level is the sum of the three energy costs.
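The system-level aggregation rule (maximum of the latencies, sum of the energies) can be written out in a few lines. The sketch below is illustrative: the function name `system_cost` is hypothetical, and the inputs are the rollback figures quoted in the subsections that follow for the SecretBlaze with an STT-MRAM cache (scenario 2). The resulting totals are a worked example of the rule and are not taken from Tables 3 and 4.

```python
from typing import Iterable, Tuple


def system_cost(latencies_s: Iterable[float],
                energies_j: Iterable[float]) -> Tuple[float, float]:
    """Parallel backup/recovery: latency is the maximum, energy is the sum."""
    return max(latencies_s), sum(energies_j)


# Rollback figures quoted in the text (SecretBlaze, scenario 2).
restore_latency_s = {"ffs": 0.8e-9, "cache": 5.12e-6, "main_memory": 16e-6}
restore_energy_j = {"ffs": 24e-12, "cache": 19.6e-9, "main_memory": 62.2e-9}

latency_s, energy_j = system_cost(restore_latency_s.values(),
                                  restore_energy_j.values())
print(f"system rollback latency: {latency_s * 1e6:.2f} us")
print(f"system rollback energy:  {energy_j * 1e9:.2f} nJ")
```

Under this rule the main memory dominates both the latency and the energy of a rollback, while the FFs contribute a negligible share.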
A. Register Level
Considering data from Table 1, each FF consumes 500 fJ (12 fJ) to save (restore) the state of the CMOS stage into (from) the nonvolatile magnetic stage. As a result, the energy cost to back up the system at the register level comes to about 1 nJ for both processors (0.99 nJ for the 1986 FFs of the SecretBlaze and 0.82 nJ for the 1644 FFs of the Amber), and the recovery energy comes to 24 pJ and 19.7 pJ, respectively, for the SecretBlaze and the Amber.
Regarding the backup latency, it is worth noting that backing up all the FFs at the same time can lead to a high peak current, which could cause electrical integrity issues. Therefore, a progressive backup is considered in this letter: a maximum of 500 FFs are backed up in parallel at a time. As 4 ns is required to back up one FF, it takes 16 ns (four batches) to save all the FFs for both processors. This corresponds to a latency of one clock cycle if the system frequency is lower than or equal to 62.5 MHz. On the other hand, it takes only 0.8 ns to restore all the FFs (500 FFs restored at a time, 0.2 ns restore time per FF).
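The register-level figures can be rechecked with a short calculation. The sketch below uses only numbers quoted in the text (FF counts, per-FF energies and times, batches of 500 FFs); the constant and function names are hypothetical.

```python
import math

FF_COUNT = {"SecretBlaze": 1986, "Amber": 1644}   # state-holding FFs
E_SAVE_J, E_RESTORE_J = 500e-15, 12e-15           # energy per FF
T_SAVE_S, T_RESTORE_S = 4e-9, 0.2e-9              # time per batch of FFs
MAX_PARALLEL_FFS = 500                            # progressive backup limit


def register_cost(n_ffs: int) -> dict:
    """Energy and latency to back up or restore n_ffs flip-flops in batches."""
    batches = math.ceil(n_ffs / MAX_PARALLEL_FFS)
    return {
        "batches": batches,
        "backup_energy_j": n_ffs * E_SAVE_J,
        "restore_energy_j": n_ffs * E_RESTORE_J,
        "backup_latency_s": batches * T_SAVE_S,
        "restore_latency_s": batches * T_RESTORE_S,
    }


costs = {name: register_cost(n) for name, n in FF_COUNT.items()}

# Highest clock frequency at which the backup still fits in one cycle.
max_single_cycle_clock_hz = 1.0 / costs["SecretBlaze"]["backup_latency_s"]

for name, c in costs.items():
    print(f"{name}: {c['batches']} batches, "
          f"backup {c['backup_energy_j'] * 1e9:.3f} nJ "
          f"in {c['backup_latency_s'] * 1e9:.1f} ns, "
          f"restore {c['restore_energy_j'] * 1e12:.1f} pJ "
          f"in {c['restore_latency_s'] * 1e9:.1f} ns")
```

Both processors need four batches, which is why they share the same 16 ns backup latency despite their different FF counts.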
B. Cache Level
During a backup, there is no cost associated with the cache memory, even for scenario 3, which considers a volatile SRAM cache, since a write-through policy is used. When restoring the system state, the first cost is related to the cache warmup. The second cost is related to the flush process needed to avoid memory inconsistency after a rollback at runtime. The considered cache architecture has 256 lines. This work implemented the cache flush process by invalidating one cache line per clock cycle. As a result, at the 50 MHz core frequency (20 ns per cycle), the latency cost of the rollback at the cache level comes to 5.12 µs. In terms of energy, the rollback costs 19.6 nJ and 0.376 nJ, respectively, for the second and third scenarios. It is worth noting that invalidating a cache line only requires a write into the tag array. Therefore, the energy costs reported in Table 4 for the cache correspond to the energy of 256 writes into the tag array.
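The cache flush cost follows directly from the line count and the clock period. The short sketch below rechecks the latency and derives the per-line tag-write energy implied by the quoted totals; the per-line values are a back-calculation from the text, not figures taken from Table 1, and the variable names are hypothetical.

```python
CACHE_LINES = 256
CORE_FREQ_HZ = 50e6          # one cache line invalidated per clock cycle

flush_latency_s = CACHE_LINES / CORE_FREQ_HZ

# Rollback energy at cache level quoted in the text.
ROLLBACK_ENERGY_J = {
    "scenario 2 (STT-MRAM cache)": 19.6e-9,
    "scenario 3 (SRAM cache)": 0.376e-9,
}

# Implied energy of one tag-array write (total divided by 256 invalidations).
energy_per_tag_write_j = {
    name: total / CACHE_LINES for name, total in ROLLBACK_ENERGY_J.items()
}

print(f"flush latency: {flush_latency_s * 1e6:.2f} us")
for name, e in energy_per_tag_write_j.items():
    print(f"{name}: {e * 1e12:.2f} pJ per tag write")
```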
C. Main Memory Level
For normally-off computing, there is no cost related to the main memory. Since the latter is based on STT-MRAM, data are preserved across a complete shutdown of the system. For the checkpointing/rollback implementation, the cost to perform a checkpoint depends on the number of bytes that must be saved into the checkpoint memory. In this letter, the worst case is considered (i.e., the full 4 kB checkpoint memory). The memory architecture is implemented with a 32-bit word width. Therefore, the energy cost to back up 4 kB (i.e., 1024 words) is given by
$$E_{\mathrm{backup}} = N_{\mathrm{words}} \, (E_{\mathrm{read}} + E_{\mathrm{write}}) \qquad (1)$$
where $N_{\mathrm{words}}$, $E_{\mathrm{read}}$, and $E_{\mathrm{write}}$ are, respectively, the number of words to back up, the read energy per access of the main memory, and the write energy per access of the checkpoint memory. As a result, considering the data in Table 1, the backup energy cost at the main memory level comes to 72.2 nJ for both processors. In a similar way, the restore energy comes to 62.2 nJ. Regarding the latency, creating or restoring a checkpoint takes about 16 µs. It is worth noting that this letter does not consider the cost related to error correction code (ECC) to minimize errors. However, it has been demonstrated that every cell of an 8 Mb STT-MRAM chip could be written without the use of ECC down to a pulse length of 4.5 ns [Jan 2014].
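The worst-case checkpoint cost can be explored with a small helper. The sketch assumes that the relation referred to as (1) is the word count multiplied by the sum of one read and one write energy, as the symbol definitions suggest. Because the Table 1 access energies are not reproduced in the text, the code only back-calculates the per-word figures implied by the quoted totals; the function and variable names are hypothetical.

```python
CHECKPOINT_BYTES = 4 * 1024
WORD_BYTES = 4                                  # 32-bit word width
N_WORDS = CHECKPOINT_BYTES // WORD_BYTES        # words copied in the worst case


def transfer_energy_j(n_words: int, e_read_j: float, e_write_j: float) -> float:
    """Energy to copy n_words: one read plus one write per word."""
    return n_words * (e_read_j + e_write_j)


# Totals quoted in the text for the worst case.
BACKUP_ENERGY_J = 72.2e-9
RESTORE_ENERGY_J = 62.2e-9
CHECKPOINT_LATENCY_S = 16e-6

# Implied per-word figures (read + write energy, and time per word).
backup_energy_per_word_j = BACKUP_ENERGY_J / N_WORDS
restore_energy_per_word_j = RESTORE_ENERGY_J / N_WORDS
latency_per_word_s = CHECKPOINT_LATENCY_S / N_WORDS

print(f"{N_WORDS} words")
print(f"backup:  {backup_energy_per_word_j * 1e12:.1f} pJ per word")
print(f"restore: {restore_energy_per_word_j * 1e12:.1f} pJ per word")
print(f"latency: {latency_per_word_s * 1e9:.2f} ns per word")
```

Since the cost scales linearly with the number of words, a checkpoint that touches fewer locations than the 4 kB worst case costs proportionally less in both energy and latency.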
D. Area Analysis
Although STT-MRAM is denser than SRAM thanks to its small bit cell structure, the drawback of this technology is the large peripheral circuitry area due to the large CMOS transistors required to generate sufficient write current. As a result, hybrid CMOS/MTJ FFs are 1.5 to 3 times larger than standard CMOS FFs [Chabi 2014]. The STT-MRAM cache of scenario 2 (0.048 mm²) is also larger than the SRAM cache of scenario 3 (0.026 mm²), by a ratio of about 2. On the other hand, the STT-MRAM-based 1 MB main memory is clearly denser (0.98 mm²) than its SRAM equivalent (2.5 mm²), by a ratio of about 2.5. This is because, for a memory of this size, the cell array occupies a large proportion of the total memory area compared to the peripheral circuitry [Senni 2016].
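The area ratios quoted above can be rechecked from the reported areas. The snippet below is a simple arithmetic check using only the figures in the text; the dictionary keys are hypothetical labels.

```python
# Areas in mm^2 as quoted in the text.
AREA_MM2 = {
    "cache_stt_mram": 0.048,    # scenario 2
    "cache_sram": 0.026,        # scenario 3
    "main_stt_mram": 0.98,      # 1 MB main memory
    "main_sram": 2.5,           # 1 MB SRAM equivalent
}

# STT-MRAM cache is larger than the SRAM cache.
cache_overhead = AREA_MM2["cache_stt_mram"] / AREA_MM2["cache_sram"]
# STT-MRAM main memory is smaller than the SRAM main memory.
main_memory_gain = AREA_MM2["main_sram"] / AREA_MM2["main_stt_mram"]

print(f"cache: STT-MRAM is {cache_overhead:.2f}x larger than SRAM")
print(f"main memory: STT-MRAM is {main_memory_gain:.2f}x smaller than SRAM")
```

The crossover between the two cases reflects the fixed cost of the write circuitry, which weighs heavily on a small cache but is amortized over the much larger cell array of the 1 MB memory.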
VI. CONCLUSION
This letter showed how STT-MRAM could help design energy-efficient and reliable devices, thanks to the normally-off computing and checkpointing/rollback mechanisms. Future work will aim to strengthen these results by building a silicon prototype of a nonvolatile processor based on STT-MRAM.
ACKNOWLEDGEMENT
This work was supported in part by the European Union's Horizon 2020 research and innovation programme under Grant Agreement 687973 (GREAT project), and in part by the French National Research Agency under Grant ANR-15-CE24-0033-01 (MASTA project).
