et al.. A high-reliability and low-power computing-in-memory implementation within STT-MRAM. Microelectronics Journal, Elsevier, 2018, 81, pp.
ACCEPTED MANUSCRIPT
A High-Reliability and Low-Power Computing-in-Memory Implementation Within STT-MRAM
Introduction
In the current big data era, a huge amount of data is generated, stored and processed all the time. In the conventional Von Neumann computer architecture, data are transferred back and forth between the processor and the memory to be processed [1]. The processor frequency and the memory access efficiency, however, are imbalanced, especially in computer systems with multicore processors. More energy is consumed by data transport than by the computation itself. Transferring and processing data is highly inefficient because of the limited bandwidth between the processor and the memory, and this is even worse in data-intensive applications [2, 3]. According to [...] CMOS based memories, it is complex and cost-inefficient to integrate the processing unit and the storage unit together until the 3D stacking technology emerges [8]. Moreover, these technologies also suffer from reliability issues [9, 10].
With the emergence of non-volatile memories (NVMs), CIM has regained interest from both academia and industry. It becomes possible to address the "memory wall" issue thanks to several advantages of NVMs over conventional memories [11-15]. For example, STT-MRAM has been considered one of the most promising candidates because of its distinctive advantages, such as non-volatility, good scalability, compatibility with CMOS, ultra-fast access speed and good endurance [16-20]. Several CIM paradigms have been proposed to implement logic operations within NVMs. These paradigms fall into two categories. One uses two or more reference cells to distinguish the different resistance states of the data cells participating in the logic operations, while the other executes the logic operations through its complementary structure [5, 21, 22]. Most of the CIM paradigms have been presented and assessed at the architecture level [6, 23-25]. They are based on the idea of adding the necessary peripheral circuits to the memory, which gives the memory computation capability and storage capability at the same time [26-28]. However, we note that few of these CIM paradigms come with detailed circuit-level implementations, and few studies evaluate their reliability and performance carefully [29-32]. Therefore, we implement a complementary CIM scheme, ComRef, at the circuit level within STT-MRAM, and then measure its reliability and performance. The proposal is based on the idea that logic operations can be executed within STT-MRAM by enabling multiple word lines (WLs) simultaneously, after which the result is obtained by a memory-like read operation. This CIM implementation performs logic operations with higher reliability, lower power consumption and higher access speed than the dual reference (DualRef) CIM implementation [33].
The rest of this paper is organized as follows. Section 2 presents the ComRef implementation. The functionality of the ComRef CIM implementation is validated in Section 3. After that, the reliability and performance assessment are included in Section [...]
Figure 1: Resistance distribution of two MTJs aligned in low resistance parallel state (R_P//R_P), antiparallel state (R_P//R_AP) and high resistance parallel state (R_AP//R_AP). Axes: resistance (Ohm) versus number of occurrences. The three distributions correspond to the bit patterns (0, 0), (0, 1) or (1, 0), and (1, 1), respectively; Ref1 and Ref2 mark the two reference resistances. TMR = 120%, standard deviation σ = 0.1*μ. The MTJ model used comes from [34].
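As a rough illustration of the three resistance states in Figure 1, the sketch below computes the nominal resistance of two MTJs in parallel. It assumes the usual definition TMR = (R_AP - R_P)/R_P. The value of R_P_OHM and the use of midpoints as reference levels are hypothetical choices for illustration, not values taken from the paper.

```python
# Nominal resistance of two MTJs connected in parallel (illustrative sketch).
# Assumptions: TMR = (R_AP - R_P) / R_P; R_P_OHM is a hypothetical value.

R_P_OHM = 3200.0   # hypothetical parallel-state resistance in ohms
TMR = 1.2          # 120 %, as in Figure 1


def parallel(r1: float, r2: float) -> float:
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def pair_states(r_p: float, tmr: float) -> dict:
    """Return the three nominal resistances of an MTJ pair."""
    r_ap = r_p * (1.0 + tmr)
    return {
        "P//P": parallel(r_p, r_p),      # bit pattern (0, 0)
        "P//AP": parallel(r_p, r_ap),    # bit pattern (0, 1) or (1, 0)
        "AP//AP": parallel(r_ap, r_ap),  # bit pattern (1, 1)
    }


states = pair_states(R_P_OHM, TMR)
# Hypothetical reference levels: midpoints between neighbouring states.
ref1 = (states["P//P"] + states["P//AP"]) / 2
ref2 = (states["P//AP"] + states["AP//AP"]) / 2

for name, value in states.items():
    print(f"{name}: {value:.0f} Ohm")
print(f"Ref1 ~ {ref1:.0f} Ohm, Ref2 ~ {ref2:.0f} Ohm")
```

In this nominal model the gap between neighbouring states depends only on R_P and the TMR, so a spread of 10% around each mean can already make the distributions overlap when the TMR is small.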
ComRef CIM Implementation
The ComRef CIM scheme is implemented within STT-MRAM by adding the necessary peripheral circuits. It can work in both computing mode and memory mode, to perform bitwise logic operations or to write/read data, and can switch between the two modes freely. We show the working procedure of the ComRef using a four-by-four ComRef CIM array as an example.
Basic Principles
There are two differences between the ComRef and other CIM paradigms. One is that each bit is expressed by two complementary data cells. Generally, the low-resistance state R_L of a data cell indicates data "0", and its high-resistance state R_H represents data "1". In the ComRef CIM implementation, data "0" is expressed by the resistance pattern (R_L, R_H), [...] bitwise logic operation to be performed, and the execution of an OR bitwise logic operation is determined by the resistance pattern (R_H, R_L), as shown in TABLE 1. In our proposal, bitwise logic operations are performed between two columns activated simultaneously in a ComRef CIM array.
As depicted in Fig. 1, two MTJs in parallel have three resistance states. The key idea of performing bitwise logic operations within STT-MRAM is to differentiate one resistance state from the others. In the conventional CIM paradigms implemented with dual reference cells, the resistances of the two reference cells are designed to distinguish the three resistance states, as shown in Fig. 1. In the ComRef CIM implementation, the principle we exploit is that the distance between the two resistance states is expanded by adding the operation-selected bit cells in parallel. This idea results in higher reliability, which is validated by the simulations in Section 4.
ComRef CIM Array
[...] added as the operation-selected data cells, the address decoders are replaced by the EAD (Enhanced Address Decoder), and two sensing amplifiers (SAs) are added to perform bitwise logic operations. The EAD is designed to activate one WL in read/write operations (memory mode) and three WLs simultaneously in bitwise logic operations (computing mode), according to the Address and CIM_Type signals. Two complementary data cells are used to represent one bit of data in the ComRef CIM array. Therefore, the WD (Write Driver) is also enhanced: the input Data are first converted to the complementary form and then written into the corresponding data cells. The readout operation in this CIM array is the same as in normal STT-MRAM, but the readout results can be more reliable because of the complementary structure. When performing a bitwise logic operation between the second and the third column, as shown in Fig. 2(a), the corresponding WL0 and WL1 should be selected and activated. Besides WL0 and WL1, WL_OpSel should also be activated to determine the kind of operation. When WL_Sel is enabled, the execution of the operation starts. As can be seen, the ComRef CIM implementation is realized by revising the peripheral circuits and adding the necessary SAs, which enables it to work in both the computing and memory modes and to switch between them freely.
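The behaviour of the enhanced address decoder and the enhanced write driver can be sketched at a purely behavioural level as follows. The function names, the string labels for the word lines and the boolean mode flag are hypothetical. Only the one-WL versus three-WL behaviour and the complementary encoding come from the text, and data "1" is assumed to be stored as (R_H, R_L), in line with how the operation-selected cells are configured in the working procedure.

```python
# Behavioural sketch of the EAD and the enhanced write driver.
# Function names, labels and signal encodings are hypothetical.

def ead_active_wordlines(cim_mode: bool, rows: tuple) -> list:
    """Return the word lines the decoder would activate.

    Memory mode: exactly one data word line.
    Computing mode: two data word lines plus the operation-select line.
    """
    if not cim_mode:
        if len(rows) != 1:
            raise ValueError("memory mode addresses exactly one row")
        return [f"WL{rows[0]}"]
    if len(rows) != 2:
        raise ValueError("computing mode needs two operand rows")
    return ["WL_OpSel", f"WL{rows[0]}", f"WL{rows[1]}"]


def to_complementary(bit: int) -> tuple:
    """Encode one bit as a (cell, complementary cell) resistance pattern.

    Data "0" -> (R_L, R_H) as in the text; data "1" -> (R_H, R_L) is
    assumed to be its complement.
    """
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return ("R_H", "R_L") if bit else ("R_L", "R_H")


print(ead_active_wordlines(False, (2,)))   # memory mode: one word line
print(ead_active_wordlines(True, (0, 1)))  # computing mode: three word lines
print(to_complementary(0), to_complementary(1))
```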

Working Procedure
In this subsection, a one-bit OR bitwise logic operation is chosen as an example to present the working procedure. The components participating in the operation are abstracted as shown in Fig. 2(b). The data cells with the cyan background are the two operation-selected data cells, which should be set to data "1". In other words, (MTJ_OpSel0, MTJ_OpSel1) are configured to (R_AP, R_P). Assume that the OR bitwise logic operation is executed between data "0" and "1". The MTJs (MTJ00, MTJ10) in the data cells with the green background are therefore configured to (R_P, R_AP), and the MTJs (MTJ01, MTJ11) in the data cells with the yellow background are configured to (R_AP, R_P). The two inverters (MP1, MN0) and (MP2, MN1), along with the PMOS transistors MP0 and MP3, comprise the precharge sensing amplifier (PCSA), which is used to distinguish the resistance difference between the two discharge branches. In the beginning, V_Pre, V_Sel, WL_OpSel, WL0 and WL1 are all at the low-voltage level, and the PCSA is precharged. To execute the OR bitwise logic operation, WL_OpSel, WL0 and WL1 are activated first by setting them to the high-voltage level, and then V_Pre and V_Sel are switched from the low to the high-voltage level. Now the PCSA begins to discharge. The discharge currents in the two branches differ because of the different resistances of the data cells. In this operation, the resistance of the data cells in the left branch is larger than that of the right branch, so the discharge current in the right branch is larger than that in the left branch. Therefore, the PCSA reaches the stable state (Q = "1", Q_Bar = "0") when the voltage on the drain electrode of MN1 first falls below the threshold voltage of the inverter (MP2, MN1). The final operation result is obtained at Q, which is con[...]
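The OR example above can be reproduced with a small behavioural model that compares the resistances of the two discharge branches. This is a sketch under stated assumptions, not the circuit itself: the three cells of each branch are treated as ideal resistors in parallel, the PCSA is reduced to an ideal comparison, R_P is a hypothetical value, and the AND operation is assumed to use operation-selected cells set to data "0".

```python
# Behavioural sketch of one ComRef operation as a resistance comparison.
# Assumptions: three cells per branch in parallel, ideal comparison,
# hypothetical R_P; AND is assumed to use operation-select data "0".

R_P = 5000.0              # hypothetical low (parallel) resistance in ohms
TMR = 3.0                 # 300 %, the value used in the simulations
R_AP = R_P * (1.0 + TMR)  # high (antiparallel) resistance


def parallel(*rs: float) -> float:
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in rs)


def cell_pair(bit: int) -> tuple:
    """(left-branch MTJ, right-branch MTJ) for one complementary bit."""
    return (R_AP, R_P) if bit else (R_P, R_AP)


def comref(op: str, a: int, b: int) -> int:
    """Return the value read at Q for a one-bit OR or AND operation."""
    if op not in ("OR", "AND"):
        raise ValueError("op must be 'OR' or 'AND'")
    op_sel = 1 if op == "OR" else 0
    pairs = [cell_pair(op_sel), cell_pair(a), cell_pair(b)]
    r_left = parallel(*(p[0] for p in pairs))
    r_right = parallel(*(p[1] for p in pairs))
    # The less resistive branch discharges faster; Q = 1 when the left
    # branch is the more resistive one, as in the OR example in the text.
    return 1 if r_left > r_right else 0


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "OR:", comref("OR", a, b), "AND:", comref("AND", a, b))
```

Because each branch holds an odd number of cells, the two branch resistances can never be equal in this nominal model, so the comparison always has a defined outcome.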
In summary, we present the basic principles of the ComRef CIM implementation, and then show the CIM array and the working procedure. The reliability and performance of the ComRef CIM implementation fluctuate with PVT (Process, Voltage, Temperature) variations. In the following, we first validate the feasibility of the ComRef CIM implementation, and then measure the operation error rate, sensing margin, operation delay and energy consumption to assess its reliability and performance.
Functionality Verification
The ComRef CIM implementation is validated with hybrid MTJ/CMOS transient simulation. [...] model and a 40 nm compact perpendicular magnetic anisotropy MTJ model [35-37]. The supply voltage is fixed at V_dd = 1.1 V and the TMR at 300%; the other related parameters keep their default values in the MTJ model. A larger CMOS transistor channel width is used in the sensing circuit to enlarge the sensing margin, and the minimum channel width is employed in the write circuit and the logic circuit to reduce the influence of parasitic capacitance as far as possible.
Fig. 3 depicts the transient simulation waveforms of the ComRef CIM implementation. AND bitwise logic operations are carried out first, followed by the OR bitwise logic operations. Although XOR/XNOR and other bitwise logic operations cannot be executed directly within the ComRef CIM array, their results can be obtained by adding the essential logic circuits. Therefore, only the waveforms of the OR/AND bitwise logic operations are presented in this simulation. Assume that all the MTJs participating in this simulation are in the high-resistance state (antiparallel state). The first step in this simulation is to write data "0" to the three complemen[...] [...]erations are executed in the following stages. The output results are consistent with TABLE 1.
To the best of our knowledge, STT-MRAM suffers from many reliability issues, which are caused by PVT variations. In the following, we measure the operation error rate, sensing margin, operation delay and energy dissipation to explore the design space of the CIM implementation. Finally, we compare the ComRef CIM implementation with the DualRef CIM implementation to assess their robustness and performance under PVT fluctuations.
Reliability and Performance Measurement
In this section, we carry out six groups of simulations to assess the ComRef CIM implementation in terms of its operation error rate, sensing margin, operation delay, dynamic energy consumption, static power dissipation and area overhead. We also compare it with the DualRef CIM implementation in terms of reliability and performance. [...] variations of the CMOS transistors. The operation error rate is defined as the arithmetic mean of the error rates at the bit patterns (00), (01), (10) and (11) for OR or AND bitwise logic operations. These operation error rates are calculated with the process variations varying from 0 to 20% of the mean µ and the temperature varying from 300 K to 400 K. In these simulations, the TMR of the MTJ devices is set to 300%, in accordance with recently reported values [38]. The temperature is fixed at 300 K when the operation error rates are obtained with respect to the process variations, while the process variation is set to 5% when the operation error rate is evaluated with respect to the temperature. The measured results are shown in Fig. 4 and Fig. 5, respectively. As can be seen from the two figures, the operation error rates stay very low at small process variations and at low temperatures. However, they rise rapidly beyond 6% process variation and 330 K. With large process variations or higher temperatures, the resistances of both the data cells and the reference cells change, and the driving currents of the transistors change accordingly. These changes reduce the sensing margin, which explains why more and more operation errors occur. As shown in Fig. 4 and Fig. 5, the operation error rate of the ComRef CIM implementation is lower than that of the DualRef under both process and temperature variations, and the difference is more pronounced under process variations.
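The definition of the operation error rate as the arithmetic mean over the four bit patterns can be illustrated with a small Monte Carlo sketch. It is a strongly simplified stand-in for the circuit simulations: only the MTJ resistances vary (with a Gaussian spread), the transistors and the temperature are ignored, the sense amplifier is an ideal comparator, and R_P, the trial count and the seed are hypothetical. Its numbers are therefore not comparable with Fig. 4 or Fig. 5.

```python
# Monte Carlo sketch of the operation error rate (mean over 4 bit patterns).
# Simplifications: Gaussian MTJ resistance spread only, ideal comparison,
# hypothetical R_P; transistor and temperature effects are ignored.
import random

R_P = 5000.0              # hypothetical parallel-state resistance in ohms
TMR = 3.0                 # 300 %, as in the simulations
R_AP = R_P * (1.0 + TMR)


def sample(nominal: float, sigma_frac: float, rng: random.Random) -> float:
    """Draw one resistance with standard deviation sigma_frac * nominal."""
    return max(rng.gauss(nominal, sigma_frac * nominal), 1.0)


def trial(op_sel: int, a: int, b: int, sigma_frac: float,
          rng: random.Random) -> int:
    """One operation with randomly perturbed cells; returns the value at Q."""
    g_left = g_right = 0.0
    for bit in (op_sel, a, b):
        left_nom, right_nom = (R_AP, R_P) if bit else (R_P, R_AP)
        g_left += 1.0 / sample(left_nom, sigma_frac, rng)
        g_right += 1.0 / sample(right_nom, sigma_frac, rng)
    # Lower conductance on the left means the left branch is more resistive.
    return 1 if g_left < g_right else 0


def operation_error_rate(op: str, sigma_frac: float,
                         trials: int = 2000, seed: int = 0) -> float:
    """Arithmetic mean of the error rates at (00), (01), (10) and (11)."""
    if op not in ("OR", "AND"):
        raise ValueError("op must be 'OR' or 'AND'")
    rng = random.Random(seed)
    op_sel = 1 if op == "OR" else 0
    rates = []
    for a in (0, 1):
        for b in (0, 1):
            expected = (a | b) if op == "OR" else (a & b)
            errors = sum(trial(op_sel, a, b, sigma_frac, rng) != expected
                         for _ in range(trials))
            rates.append(errors / trials)
    return sum(rates) / len(rates)


for sigma in (0.0, 0.05, 0.10, 0.20):
    print(f"sigma = {sigma:.2f}: OR error rate = "
          f"{operation_error_rate('OR', sigma):.4f}")
```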
Under the same process variations, the ComRef CIM implementation reduces the operation error rate by 67.1% when compared with the DualRef CIM implementation, as shown in Fig. 4.
Figure 6: The sensing margins of bitwise logic operations. The sensing margin is defined as the difference between the currents of the two discharge branches in the SAs.
Sensing Margin
As described in Section 2, these bitwise logic op[...] variations that the ComRef CIM implementation can tolerate to ensure computing without errors. The difference between the low-resistance and high-resistance states of the MTJ device varies with the TMR. Therefore, the sensing margins are checked by increasing the TMR from 100% to 300%. The calculated sensing margins are shown in Fig. 6. Every data point is the average of the sensing margins at the four bit patterns. As can be seen from this figure, the smallest sensing margin is more than 16 µA, and the largest is about 50 µA. The sensing margins of both the OR and AND bitwise logic operations increase linearly with the TMR. We can also see from the figure that the sensing margin of the ComRef is larger than that of the DualRef [...] cell becomes large as the TMR rises, which increases the sensing margins.
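The dependence of the sensing margin on the TMR can be illustrated with an idealized resistive model, in which the margin is the absolute difference between the two branch currents averaged over the four bit patterns. The branch voltage V_READ and the resistance R_P are hypothetical, and the transistors of the sensing path are ignored, so the sketch only shows the trend that the margin grows with the TMR; it does not reproduce the simulated values or their linear shape.

```python
# Idealized sensing margin: |I_left - I_right| for purely resistive branches.
# V_READ and R_P are hypothetical; the sensing transistors are ignored.

V_READ = 0.1    # hypothetical voltage across each branch in volts
R_P = 5000.0    # hypothetical parallel-state resistance in ohms


def branch_currents(op_sel: int, a: int, b: int, tmr: float) -> tuple:
    """Currents of the left and right discharge branches in amperes."""
    r_ap = R_P * (1.0 + tmr)
    g_left = g_right = 0.0
    for bit in (op_sel, a, b):
        r_left, r_right = (r_ap, R_P) if bit else (R_P, r_ap)
        g_left += 1.0 / r_left
        g_right += 1.0 / r_right
    return V_READ * g_left, V_READ * g_right


def mean_sensing_margin(op: str, tmr: float) -> float:
    """Average current difference over the four bit patterns."""
    if op not in ("OR", "AND"):
        raise ValueError("op must be 'OR' or 'AND'")
    op_sel = 1 if op == "OR" else 0
    margins = []
    for a in (0, 1):
        for b in (0, 1):
            i_left, i_right = branch_currents(op_sel, a, b, tmr)
            margins.append(abs(i_left - i_right))
    return sum(margins) / len(margins)


for tmr in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"TMR = {tmr:.0%}: "
          f"OR {mean_sensing_margin('OR', tmr) * 1e6:.2f} uA, "
          f"AND {mean_sensing_margin('AND', tmr) * 1e6:.2f} uA")
```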
Dynamic Energy Consumption
The dynamic energy consumption of an operation is calculated by integrating the product of the voltage and the current over the operation delay; the dynamic energies of both discharge branches in the SAs are counted. The results, obtained with the temperature fixed at 300 K, are shown in Fig. 8. As can be seen from the figure, the dynamic energy consumption of both the ComRef and DualRef CIM implementations increases with the supply voltage V_dd but decreases as the TMR rises. Increasing V_dd naturally raises the dynamic energy consumption. The resistance of the discharge branches rises with the TMR, which reduces the discharge currents, so the dynamic energy consumed by a single bitwise logic operation decreases as the TMR rises. It is also found that the ComRef CIM implementation reduces the average dynamic energy consumption by 23.4% when compared with the DualRef CIM implementation.
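The energy calculation described above amounts to a numerical integration of v(t) * i(t) over the operation delay, summed over both discharge branches. The sketch below uses the trapezoidal rule on synthetic waveforms; the delay, the initial currents and the exponential decay are hypothetical placeholders for samples that would normally come from a transient simulation.

```python
# Dynamic energy as the integral of v(t) * i(t) over the operation delay.
# The waveforms are synthetic placeholders, not simulation results.
import math


def trapezoid_energy(times: list, volts: list, amps: list) -> float:
    """Integrate p(t) = v(t) * i(t) with the trapezoidal rule (joules)."""
    if not (len(times) == len(volts) == len(amps)) or len(times) < 2:
        raise ValueError("need at least two samples of equal length")
    energy = 0.0
    for k in range(1, len(times)):
        p_prev = volts[k - 1] * amps[k - 1]
        p_next = volts[k] * amps[k]
        energy += 0.5 * (p_prev + p_next) * (times[k] - times[k - 1])
    return energy


V_DD = 1.1         # supply voltage in volts, as in the paper
DELAY = 200e-12    # hypothetical operation delay in seconds
N = 201            # number of samples

times = [DELAY * k / (N - 1) for k in range(N)]
volts = [V_DD] * N


def decaying_current(i0: float, tau: float) -> list:
    """Hypothetical discharge current that decays exponentially."""
    return [i0 * math.exp(-t / tau) for t in times]


left = decaying_current(40e-6, 80e-12)    # hypothetical left branch
right = decaying_current(60e-6, 80e-12)   # hypothetical right branch

# Both discharge branches are counted, as stated in the text.
total = (trapezoid_energy(times, volts, left)
         + trapezoid_energy(times, volts, right))
print(f"dynamic energy ~ {total * 1e15:.2f} fJ")
```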
Static Power Dissipation
The static power dissipation is calculated as the product of the supply voltage V_dd and the leakage currents of the CMOS transistors, and is shown in Fig. 9.
Area Overheads
Compared with the normal STT-MRAM array, an SA is added for every two rows in one array, and the address decoder and write driver are enhanced to support CIM operations. Because of the complementary structure, two data cells are used to represent one bit, so realizing the ComRef CIM implementation incurs more area overhead. For the DualRef CIM implementation, two SAs with the related reference cells, an XOR gate and a multiplexer are needed for every row in one array. In summary, both the ComRef and DualRef CIM implementations have area overheads, but the area overhead of the ComRef is larger than that of the DualRef.
In summary, the reliability and performance of the ComRef and DualRef CIM implementations are analyzed quantitatively. Both implementations suffer from PVT variations. However, apart from its area overhead, the ComRef CIM implementation has a lower operation error rate, faster operation speed and lower energy consumption than the DualRef CIM implementation.
Conclusion
