In-memory computing is a promising approach to addressing the processor-memory data transfer bottleneck in computing systems. We propose spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with spin-transfer torque magnetic RAM (STT-MRAM). The unique properties of spintronic memory allow multiple wordlines within an array to be simultaneously enabled, opening up the possibility of directly sensing functions of the values stored in multiple rows using a single access. We propose modifications to STT-MRAM peripheral circuits that leverage this principle to perform logic, arithmetic, and complex vector operations. We address the challenge of reliable in-memory computing under process variations by extending error-correction code schemes to detect and correct errors that occur during CiM operations. We also address the question of how STT-CiM should be integrated within a general-purpose computing system. To this end, we propose architectural enhancements to processor instruction sets and on-chip buses that enable STT-CiM to be utilized as a scratchpad memory. Finally, we present data mapping techniques to increase the effectiveness of STT-CiM. We evaluate STT-CiM using a device-to-architecture modeling framework, and integrate cycle-accurate models of STT-CiM with a commercial processor and on-chip bus (Nios II and Avalon from Intel). Our system-level evaluation shows that STT-CiM provides system-level performance improvements of 3.93 times on average (up to 10.4 times), and concurrently reduces memory system energy by 3.83 times on average (up to 12.4 times).
I. INTRODUCTION
THE growth in data processed and increase in the number of cores place high demands on the memory systems of modern computing platforms. Consequently, a growing fraction of transistors, area, and power are utilized toward memories. CMOS memories (SRAM and embedded DRAM) have been the mainstays of memory design for the past several decades. However, recent technology scaling challenges in CMOS memories, along with an increased demand for memory capacity and performance, have fueled an active interest in alternative memory technologies. Spintronic memories have emerged as a promising candidate for future memories due to several desirable attributes, such as nonvolatility, high density, and near-zero leakage. In particular, spin-transfer torque magnetic RAM (STT-MRAM) has garnered significant interest with various prototype demonstrations and early commercial offerings [1]-[3]. There have been several research efforts to boost the efficiency of STT-MRAM at the device, circuit, and architectural levels [4]-[30]. In this paper, we explore in-memory computing with STT-MRAM. By exploiting the ability to simultaneously enable multiple wordlines (WLs) within a memory array, we enhance STT-MRAM arrays to perform a range of arithmetic, logic, and vector operations. We propose circuit and architectural techniques for reliable computation under process variations and to enable the proposed design to be used in a programmable processor-based system.
In-memory computing is motivated by the observation that the movement of data from bit-cells in the memory to the processor and back (across the bitlines, memory interface, and system interconnect) is a major performance and energy bottleneck in computing systems. Efforts that have explored the closer integration of logic and memory are variedly referred to in the literature as logic-in-memory, computing-in-memory, and processing-in-memory. These efforts may be classified into two categories: moving logic closer to memory, or near-memory computing [31]-[44], and performing computations within memory structures, or in-memory computing [45]-[57], which is the focus of this paper. In-memory computing reduces the number of memory accesses and the amount of data transferred between processor and memory, and exploits the wider internal bandwidth available within memory systems.
Our proposal is based on the observation that by enabling multiple WLs simultaneously and sensing the effective resistance of each bitline (BL), it is possible to directly compute logic functions of the values stored in the bit-cells. Based on this insight, we propose spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with STT-MRAM that can perform a range of arithmetic, logic, and vector operations. In STT-CiM, the core data array is the same as standard STT-MRAM; hence, memory density and the efficiency of read and write operations are maintained. Reliable sensing under the limited tunneling magnetoresistance (TMR) of STT-MRAM bit-cells is known to be a challenge [12]-[16], [29], and we show that this challenge is further aggravated for in-memory computations. In order to enhance the robustness of STT-CiM under process variations, we extend error-correction codes (ECCs) to errors that occur during in-memory computations. To evaluate the benefits of STT-CiM, we utilize it as a scratchpad in the memory hierarchy of the Intel Nios II [58] processor. We propose enhancements to the on-chip bus and extend the instruction set of the processor to support CiM operations and expose them to software. We also present suitable data mapping techniques to maximize the benefits of STT-CiM.
We note that earlier efforts (see [48] ) have proposed enabling multiple WLs to perform computations within nonvolatile memories (NVMs). Although this paper shares this principle, we differ from the previous work in several key aspects: 1) we address reliable in-memory computing under process variations; 2) we go beyond bitwise logic operations to also perform arithmetic and vector operations, which are commonly present in modern computing workloads; and 3) we propose architectural enhancements [bus and instruction set architecture (ISA) extensions], and data mapping techniques to enable in-memory computation in the context of on-chip scratchpad memories.
In summary, the key contributions of this paper are as follows.
1) We explore CiM with spintronic memories as an approach to improving system performance and energy.
2) We propose STT-CiM, an enhanced STT-MRAM array that can perform a range of arithmetic, logic, and vector CiM operations without modifying either the bit-cells or the core data array.
3) We address a key challenge in STT-CiM, i.e., reliably performing in-memory operations under process variation, by demonstrating suitable error correction mechanisms.
4) We propose extensions to the instruction set and on-chip bus to integrate STT-CiM into a programmable processor system and demonstrate the viability of these extensions using Intel's Nios II processor and Avalon on-chip bus.
5) We evaluate the performance and energy benefits of STT-CiM, achieving average improvements of 3.83 times (up to 12.4 times) and 3.93 times (up to 10.4 times) in the total memory energy and system performance, respectively.
The rest of this paper is organized as follows. Section II presents an overview of prior research efforts related to in-memory computation. Section III provides the necessary background on STT-MRAM. Section IV describes the STT-CiM design and how it supports in-memory computation. Section V outlines architectural enhancements for STT-CiM. Section VI describes the experimental methodology, and experimental results are presented in Section VII. Section VIII concludes this paper.
II. RELATED WORK
The closer integration of logic and memory is variedly referred to in the literature as logic-in-memory, computing-in-memory, and processing-in-memory. These efforts can be broadly classified into two categories, as shown in Fig. 1. We limit the scope of our discussion to approaches that improve the efficiency of active computation. For example, we do not discuss the embedding of NVM elements into a logic circuit [59]-[62] in order to enable the system to shut down and wake up efficiently for improved power management.
Near-memory computing refers to bringing logic or processing units closer to memory. Notwithstanding the closer integration, processing units still remain distinct from memory arrays. Near-memory computing has been explored at various levels of the memory hierarchy [31] - [35] , [37] - [44] . Intelligent RAM [31] is an early example, which integrated a processor and DRAM in the same chip to improve the bandwidth between them. Embedding simple processing units within each page of main memory [32] and within secondary storage [33] enables computations to be performed near memory. An application-specific example of near-memory computation is memory that can generate interpolated values, enabling the evaluation of complex mathematical functions [42] . Near-memory computing has gained significant interest in recent years, with industry efforts such as hybrid memory cube [43] and high bandwidth memory [44] .
In-memory computing [45] - [48] , [51] - [54] integrates logic operations into memory arrays, fundamentally blurring the distinction between processing and memory. The key challenge of in-memory computing is to realize it without impacting the desirability of the resulting design as a standard memory (i.e., density or efficiency of standard read and write operations). Due to these constraints, in-memory computing is typically limited to performing a small number of simple operations.
We can classify previous proposals for in-memory computing based on whether they target application-specific or general-purpose computations, and based on the underlying memory technology that they consider. Application-specific examples of in-memory computing include vector-matrix multiplication [54] - [57] and sum-of-absolute difference [46] computation. Ternary content-addressable memory [45] , ROM-embedded RAM [63] , AC-DIMM [52] , and Micron's automata processor [64] can also be viewed as examples of in-memory computing that target specific operations, such as pattern matching or evaluation of transcendental functions. Unlike these application-specific designs, we focus on embedding a broader set of operations (arithmetic, logic, and vector operations) within memory.
In-memory evaluation of bitwise logic operations has been explored for memristive memories [48]-[50] and DRAM [51]. This paper differs from these efforts in several important aspects. First, we focus on in-memory computing for spintronic memory, which involves fundamentally different prospects and design challenges. For example, the proposed operations are not destructive to the contents stored in the accessed bit-cells (unlike [51]). On the other hand, the much lower ratio of ON-to-OFF resistance in spintronic memory leads to lower sensing margins. Second, we use a different sensing and reference generation circuitry (RGC), which enables us to natively realize a wider variety of operations. For example, the proposed design requires only one array access (unlike two in the case of [48]) to perform bitwise XOR operations. Third, our design goes beyond bitwise logic operations and realizes arithmetic as well as complex vector operations. Fourth, we propose architectural extensions (bus and ISA extensions) and data mapping techniques to enable in-memory computing within a general-purpose processor system. Finally, we address a key challenge associated with in-memory computing, viz., reliable operation under process variations.
A different approach to in-memory computing with spintronic memories [47] uses an extra transistor in each bit-cell (2T-1R cells), which sacrifices the density benefits of standard (1T-1R) STT-MRAM while potentially enabling more complex functions to be evaluated within the array. In contrast, our proposal enables in-memory computation within a standard STT-MRAM array with no changes to the bit-cells. We note that a concurrent effort [53] has explored bitwise AND/OR operations in STT-MRAM. The bitwise XOR operation cannot be realized atomically using the design proposed in [53] . Furthermore, these efforts restrict themselves to device and circuit-level considerations, and do not address the architectural challenges of in-memory computing.
III. BACKGROUND
An STT-MRAM bit-cell consists of an access transistor and a magnetic tunnel junction (MTJ), as shown in Fig. 2 . An MTJ, in turn, consists of a pinned layer that has a fixed magnetic orientation and a free layer whose magnetic orientation can be switched. The magnetic layers are separated by a tunneling oxide. The relative magnetic orientation of the free and pinned layers determines the resistance offered by the MTJ (the resistance for the parallel configuration R P is lower than the antiparallel resistance R AP ). The two resistance states encode a bit (we assume that parallel represents logic "1," and antiparallel represents logic "0"). A read operation is performed by applying a bias (V read ) between the BL and the source line (SL), and enabling the WL. The resultant current flowing through the bit-cell (I P or I AP ) is compared against a global reference to determine the logic state stored in the bit-cell.
A write is performed by passing a current greater than the critical switching current of the MTJ through the bit-cell. The logic value written is dependent on the direction of the write current, as shown in Fig. 2. The write operation in STT-MRAM is stochastic in nature, and the duration and magnitude of the write current determine the write failure rate. Apart from write failures, STT-MRAMs are also subject to read decision failures, where the value stored in a bit-cell is incorrectly sensed due to process variations, and read disturb failures, where a read operation inadvertently ends up writing into the bit-cell. These failures are addressed through a range of techniques, including device and circuit optimization, manufacturing test and self-repair, and error-correcting codes [12], [13], [15], [16]. Apart from write/read failures, memories may also have failures due to thermal noise, which causes stochastic flipping in the bit-cells. However, such failures are negligible in STT-MRAM due to the high energy barrier between the two resistance states.
IV. STT-MRAM-BASED COMPUTE-IN-MEMORY
In this section, we describe STT-MRAM-based compute-in-memory (STT-CiM), a design for in-memory computing using standard STT-MRAM arrays.
A. STT-CiM Overview
The key idea behind STT-CiM is to enable multiple WLs simultaneously in an STT-MRAM array, leading to multiple bit-cells being connected to each BL. With enhancements that we propose to the sensing and RGC, we can directly compute logic functions of the enabled words. Note that such an operation is feasible in STT-MRAMs, since the bit-cells are resistive, and since the write currents are typically much higher than read currents. In contrast, enabling multiple WLs in SRAM can lead to short-circuit paths through the memory array, leading to loss of data stored in the bit-cells. Fig. 3 explains the principle of operation of STT-CiM. First, consider the resistive equivalent circuit of a single STT-MRAM bit-cell shown in Fig. 3 (a). R t represents the ON-resistance of the access transistor and R i represents the resistance of the MTJ. When a voltage V read is applied between the BL and the SL, the net current I i flowing through the bit-cell can take two possible values depending on the MTJ configuration, as shown in Fig. 3(b) . A read operation involves using a sensing mechanism to distinguish between these two current values. Fig. 3 (c) demonstrates a CiM operation, where two WLs (WL i and WL j ) are enabled, and a voltage bias (V read ) is applied to the BL. The resultant current flowing through the SL (denoted I SL ) is a summation of the currents flowing through each of the bit-cells (I i and I j ), which in turn depends on the logic states stored in these bit-cells. The possible values of I SL are shown in Fig. 3(d) . We propose enhanced sensing mechanisms to distinguish between these values and thereby compute logic functions of the values stored in the enabled bit-cells. We discuss the details of these operations in turn as follows. 
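1) Bitwise OR (NOR): With the reference current chosen to lie between I AP-AP and I AP-P , the sense amplifier separates the case where both bit-cells are in the AP configuration from all remaining cases. Consequently, only the case where both bit-cells are in the AP configuration, i.e., both store "0," leads to an output of logic "0" ("1") at the positive (negative) output of the sense amplifier, while all other cases lead to logic "1" ("0"). Thus, the positive and negative outputs of the sense amplifier evaluate the logic OR and NOR of the values stored in the enabled bit-cells.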
2) Bitwise AND (NAND): A bitwise AND (NAND) operation is realized at the positive (negative) terminal of the sense amplifier by using the sensing scheme shown in Fig. 4(b) . Note that in this scheme, a different reference current (I ref-AND ) is fed to the sense amplifier.
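The OR and AND sensing schemes can be illustrated with a small numerical sketch. The voltage and resistance values are illustrative assumptions rather than values from this paper, the references are simply placed midway between adjacent current levels, and the name I_REF_OR is an assumed counterpart to I ref-AND.

```python
# Illustrative model of two-row sensing on a single bitline.
# All numeric values are assumptions chosen for illustration only.
V_READ = 0.1      # read bias in volts (assumed)
R_T = 1_000.0     # access transistor ON-resistance in ohms (assumed)
R_P = 3_000.0     # parallel MTJ resistance in ohms (assumed)
R_AP = 6_000.0    # antiparallel MTJ resistance in ohms (assumed)


def cell_current(bit: int) -> float:
    """Current through one bit-cell; parallel (low resistance) encodes 1."""
    r_mtj = R_P if bit == 1 else R_AP
    return V_READ / (R_T + r_mtj)


def sl_current(bit_i: int, bit_j: int) -> float:
    """Source-line current with two wordlines enabled: I_SL = I_i + I_j."""
    return cell_current(bit_i) + cell_current(bit_j)


# The three distinct source-line current levels.
I_AP_AP = sl_current(0, 0)
I_AP_P = sl_current(0, 1)
I_P_P = sl_current(1, 1)

# References placed midway between adjacent levels (an assumption).
I_REF_OR = (I_AP_AP + I_AP_P) / 2
I_REF_AND = (I_AP_P + I_P_P) / 2


def sense(bit_i: int, bit_j: int) -> dict:
    """Compare I_SL against both references, as two sense amplifiers would."""
    i_sl = sl_current(bit_i, bit_j)
    o_or = int(i_sl > I_REF_OR)
    o_and = int(i_sl > I_REF_AND)
    return {"or": o_or, "nor": 1 - o_or, "and": o_and, "nand": 1 - o_and}
```

With these assumed values, only the AP-AP case falls below I_REF_OR and only the P-P case rises above I_REF_AND, which is what makes the two comparisons behave as OR and AND.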
3) Bitwise XOR: A bitwise XOR operation is realized when the two sensing schemes shown in Fig. 4 are used in tandem and their outputs are combined; the sensing circuit described in Section IV-B includes a CMOS NOR gate for this purpose. Note that it is not necessary to distinguish between the cases where the two bit-cells connected to a BL store "10" and "01."
4) ADD Operation: An ADD operation is realized by leveraging the ability to concurrently perform multiple bitwise logical operations, as illustrated in Fig. 5. Suppose A n and B n (the nth bits of two words, A and B) are stored in two different bit-cells of the same column within an STT-CiM array. Suppose that we wish to compute the full-adder logic function (the nth stage of an adder that adds words A and B). As shown in Fig. 5, S n (the sum) and C n (the carry out) can be computed using A n XOR B n and A n AND B n , in addition to C n−1 (carry input from the previous stage). Fig. 5 also expresses the ADD operation in terms of the outputs of bitwise operations, O AND and O XOR . Three additional logic gates are required to enable this computation. Note that the sensing schemes discussed enable us to perform the bitwise XOR and AND operations simultaneously, thereby performing an ADD operation with a single array access.
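As a minimal sketch of how the bitwise outputs feed an adder, the code below derives XOR as the NOR of the AND and NOR outputs (an assumption consistent with the CMOS NOR gate in the sensing circuit) and then builds one full-adder stage from O XOR , O AND , and the carry input. The three extra gates used here (XOR, AND, OR) are one standard decomposition; the exact gates in Fig. 5 may differ. All function names are hypothetical.

```python
def cim_xor(o_and: int, o_nor: int) -> int:
    """XOR from the two sensed outputs: XOR = NOR(AND, NOR)."""
    return int(not (o_and or o_nor))


def full_adder_stage(o_xor: int, o_and: int, c_in: int) -> tuple:
    """Sum and carry-out of one adder stage using three extra gates."""
    s = o_xor ^ c_in                  # gate 1: XOR
    c_out = o_and | (o_xor & c_in)    # gates 2 and 3: AND, OR
    return s, c_out


def cim_add(a: int, b: int, width: int = 32) -> int:
    """Ripple the carry across columns; the result is modulo 2**width."""
    carry, result = 0, 0
    for n in range(width):
        a_n, b_n = (a >> n) & 1, (b >> n) & 1
        # Stand-ins for the sensed AND and NOR outputs of column n.
        o_and, o_nor = a_n & b_n, 1 - (a_n | b_n)
        s, carry = full_adder_stage(cim_xor(o_and, o_nor), o_and, carry)
        result |= s << n
    return result
```

Because every column produces its AND and XOR outputs in the same access, only the carry chain is sequential in this model.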
B. STT-CiM Array
In this section, we present the array-level design of STT-CiM using the above-described circuit-level techniques. As shown in Fig. 6, the proposed STT-CiM memory array takes an additional input CiMType that indicates the type of CiM operation that needs to be performed for every memory access. The CiM decoder interprets this input and generates appropriate control signals to perform the desired logic operation. In order to enable CiM operations, the read peripheral circuits present in each column (sensing circuit and global reference generation circuit in Fig. 6) are enhanced, while the core data array remains the same as in the standard STT-MRAM. The address (row) decoder needs to enable multiple WLs for CiM operations. Specifically, we utilize two address decoders, each decoding one of the two input addresses. The corresponding outputs of the decoders are ORed and connected to each WL. This configuration allows either decoder to activate an arbitrary WL. While the row decoder overhead is roughly doubled, it represents a small fraction of total area and power for configurations involving large arrays (1.8% in our evaluation). The write peripheral circuits are unchanged, as write operations are identical to the standard STT-MRAM. We next describe enhancements to sensing and reference generation circuits to enable CiM operations.
1) Sensing Circuitry: Fig. 6 shows the sensing circuit enhanced to support all the logic operations discussed in Section IV-A. It consists of two sense amplifiers, a CMOS NOR gate, three multiplexers, and three additional logic gates for the ADD operation. We note that the area and power overheads associated with these enhancements are minimal, since the sensing circuit constitutes a small fraction of the total memory area/power. As shown in Fig. 6 , the reference currents (I refl and I refr ) produced by the global reference generation circuit are fed to the two sense amplifiers in order to realize the sensing schemes discussed in Section IV-A. The three MUX control signals (sel 0 , sel 1 , and sel 2 ) are generated by the CiM decoder to select the desired CiM operation.
2) Reference Generation: Fig. 6 illustrates the modified reference generation circuit used to produce the additional reference currents necessary for the proposed sensing schemes. It includes two reference stacks, one for each of the two sense amplifiers in the sensing circuit. Each stack consists of three bit-cells programmed to offer resistances R P , R AP , and R REF , respectively. R REF represents the fixed-resistance reference MTJ used in a standard STT-MRAM to perform read operations. The CiM decoder generates control signals (rwl 0 , rwl 1 , . . . , rwr 1 , rwr 2 ) that enable a subset of these bit-cells in the reference stacks, which in turn produces the desired reference currents. Table II presents the values of these control signals that achieve the required reference currents.
The STT-CiM array can perform both regular memory operations and a range of CiM operations. The normal read operation is performed by enabling a single WL and setting sel 0 , sel 1 , and rwl 0 to logic "1." On the other hand, a CiM operation is performed by enabling two WLs and setting CiMType to the appropriate value, which results in computing the desired function of the enabled words. The control signal values for a read operation as well as CiM operations are shown in Table II.
TABLE II
STT-CIM OPERATIONS CONTROL SIGNALS
C. CiM Operation Under Process Variations
The STT-CiM array suffers from the same failure mechanisms (read disturb failures, read decision failures, and write failures) that are observed in the standard STT-MRAM. In this section, we compare the failure rates in the STT-CiM and standard STT-MRAM. Normal read/write operations in STT-CiM have the same failure rate as in a standard STT-MRAM, since the read/write mechanisms are identical. However, CiM operations differ in their failure rates, since the currents that flow through each bit-cell differ when enabling two WLs simultaneously. In order to analyze the read disturb and read decision failures under process variations for CiM operations, we performed a Monte Carlo circuit-level simulation on 1 million samples considering variations in MTJ oxide thickness (σ/μ = 2%), transistor V T (σ/μ = 5%), and MTJ cross-sectional area (σ/μ = 5%) [12] . Fig. 7 shows the probability density distribution of the possible currents obtained during the read and CiM operations on these 1 million samples.
1) CiM Disturb Failures: As shown in Fig. 7 , the overall current flowing through the SL is slightly higher in case of a CiM operation compared with a normal read. However, this increased current is divided between the two parallel paths, and consequently, the net read current flowing through each bit-cell (MTJ) is reduced. Hence, the read disturb failure rate is lower for CiM operations than for normal read operations.
2) CiM Decision Failures: The net current flowing through the SL (I SL ) in case of a CiM operation can have three possible values, i.e., I P-P , I AP-P (I P-AP ), and I AP-AP . A read decision failure occurs during a CiM operation when the current I P-P is interpreted as I AP-P (or vice versa), or when I AP-AP is inferred as I AP-P (or vice versa). In contrast to normal reads, CiM operations have two read margins: one between I P-P and I AP-P and another between I AP-P and I AP-AP [see Fig. 7(b)]. Our simulation results show that the read margins for CiM operations are lower compared with normal reads; therefore, they are more prone to decision failures. Moreover, the read margins in CiM operations are unequal. Thus, more failures arise at the read margin between I P-P and I AP-P .
3) ECC for STT-CiM: In order to mitigate these failures in STT-MRAM, various ECC schemes have been previously explored [12] - [14] . We show that ECC techniques that provide single error correction and double error detection (SECDED) and double error correction and triple error detection can be used to address the decision failures in CiM operations as well. This is feasible because the codeword properties for most ECC codes are retained for a CiM XOR operation. Fig. 8 shows the codeword retention property of a CiM XOR operation using a simple Hamming code. As shown in Fig. 8 , word 1 and word 2 are augmented with ECC bits ( p 1 , p 2 , and p 3 ) and stored in memory as InMemW 1 and InMemW 2 , respectively. A CiM XOR operation performed on these stored words (InMemW 1 and InMemW 2 ) results in the ECC codeword for word 1 XOR word 2 ; therefore, the codewords are preserved for CiM XORs. We leverage this codeword retention property of CiM XORs to detect and correct errors in all CiM operations. This is enabled by the fact that STT-CiM always computes bitwise XOR (CiM XOR) irrespective of the CiM operation that is being performed.
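The codeword retention property follows from the linearity of the code and can be checked exhaustively for a small case. The sketch below uses a standard Hamming(7,4) code with parity bits p1, p2, and p3; the bit ordering is an assumption and may not match Fig. 8, and the function names are hypothetical.

```python
from itertools import product


def hamming74_encode(d: list) -> list:
    """Encode data bits [d1, d2, d3, d4] as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def xor_words(x: list, y: list) -> list:
    """Bitwise XOR of two equal-length bit vectors."""
    return [a ^ b for a, b in zip(x, y)]


def xor_preserves_codewords() -> bool:
    """Check encode(w1) XOR encode(w2) == encode(w1 XOR w2) for all pairs."""
    words = [list(w) for w in product((0, 1), repeat=4)]
    return all(
        xor_words(hamming74_encode(w1), hamming74_encode(w2))
        == hamming74_encode(xor_words(w1, w2))
        for w1 in words
        for w2 in words
    )
```

The same check fails for AND or OR in general, which is why the XOR output is the one used for error detection.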
We demonstrate the proposed error detection and correction (EDC) mechanism for CiM operations in Fig. 9. Let us assume that data bit d 1 suffers from a decision failure during CiM operations, as shown in Fig. 9. As a result, the combination of logic 1 and 1 in the two bit-cells (I P-P ) is inferred as logic 1 and 0 (I AP-P ), leading to erroneous CiM outputs. An error detection logic operating on the CiM XOR output (see Fig. 9) detects an error in the d 1 data bit. This error can be corrected directly for a CiM XOR operation by simply flipping the erroneous bit. For other CiM operations, we perform two conventional reads on words InMemW 1 and InMemW 2 , and correct the erroneous bits by recomputing them using an EDC unit (discussed in Section V). Note that such corrections lead to overheads, as we need to access the memory array three times (compared with two times in STT-MRAM). However, our variation analysis shows that error corrections on CiM operations are infrequent, leading to overall improvements.
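The detection and correction flow can be sketched as follows. A single-error-correcting Hamming(7,4) code stands in for the stronger code used in the actual design, and the function names, operation labels, and access counts are illustrative assumptions.

```python
def hamming74_encode(d: list) -> list:
    """Encode data bits [d1, d2, d3, d4] as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    return [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]


def syndrome(cw: list) -> int:
    """Return the 1-based position of a single-bit error, or 0 if none."""
    p1, p2, d1, p3, d2, d3, d4 = cw
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    return s1 + 2 * s2 + 4 * s3


def cim_with_edc(stored_1, stored_2, op, sensed_xor, sensed_out):
    """Return (result, array_accesses) for one CiM operation with EDC.

    stored_1/stored_2 model what conventional reads would return;
    sensed_xor/sensed_out model the possibly erroneous CiM outputs.
    """
    pos = syndrome(sensed_xor)        # the XOR output is always checked
    if pos == 0:
        return sensed_out, 1          # no error: single array access
    if op == "xor":
        fixed = list(sensed_xor)
        fixed[pos - 1] ^= 1           # flip the erroneous bit in place
        return fixed, 1
    # Other operations: two conventional reads, then recompute the result.
    gates = {"and": lambda a, b: a & b, "or": lambda a, b: a | b}
    recomputed = [gates[op](a, b) for a, b in zip(stored_1, stored_2)]
    return recomputed, 3
```

In this model an error in a CiM AND or OR costs three array accesses, while an error in a CiM XOR is repaired without any further access.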
4) ECC Design Methodology: We use the methodology employed in [12] to determine ECC requirements for both the baseline STT-MRAM and the proposed STT-CiM design. The approach uses circuit-level simulations to determine the bit-level error probability, which is then used to estimate the array-level yield. The ECC scheme is then selected based on the target yield requirement. Our simulation shows that the 1-bit failure probabilities of normal reads and CiM operations are 4.2 × 10^-8 and 6 × 10^-5, respectively. With these bit-level failure rates and assuming a target yield of 99%, the ECC requirement for 1-Mb STT-MRAM is SECDED, whereas the ECC requirement for 1-Mb STT-CiM is three-error correction and four-error detection (3EC4ED). Note that the overheads of the ECC schemes [12], [65] are fully considered and reflected in our experimental results. Moreover, our simulation shows that the probability of CiM operations having errors is at most 0.1, i.e., no more than 1 in 10 CiM operations will have an error. Errors on all CiM operations are detected by using the 3EC4ED code on the XOR output. Detected errors are directly corrected for CiM XORs using the 3EC4ED code, and by reverting to near-memory computation for other CiM operations.
Apart from ECC schemes, STT-CiM can also leverage various reliability improvement techniques proposed for STT-MRAMs [27]-[30]. Furthermore, recent efforts [27], [28] that increase the TMR of the MTJ and improve sensing margins will reduce read failures in CiM operations as well. These techniques can be used along with ECC to cost-effectively mitigate failures in STT-CiM operations.
V. STT-CIM ARCHITECTURE
In order to evaluate the application-level benefits of STT-CiM, we integrate it as a scratchpad memory within the memory hierarchy of a programmable processor [58] . This section describes architectural enhancements for STT-CiM and hardware/software optimizations to increase its efficiency.
A. Optimizations for STT-CiM
To further improve the efficiency of STT-CiM, we propose the following additional optimizations.
1) Vector CiM Operations: Modern computing workloads exhibit significant data parallelism. To further enhance the efficiency of STT-CiM for data-parallel computations, we introduce vector CiM (VCiM) operations. The key idea behind VCiM operations is to perform CiM operations on all the elements of a vector concurrently. Fig. 10 shows how the internal memory bandwidth (32 × N bits) can be significantly larger than the limited I/O bandwidth (32 bits) visible to the processor. We exploit the memory's internal bandwidth to perform vector operations (N words wide) within STT-CiM.
Note that the data resulting from a vector operation may also be a vector, and hence, transferring it back to the processor is subject to the limited I/O bandwidth. To address this issue, we observe that vector operations are often followed by reduction operations. For example, a vector dot-product involves elementwise multiplication of two vectors followed by a summation (reduction) of the resulting vector of products to produce a scalar value. Based on this observation, we introduce a reduce unit (RU) before the column multiplexer, as shown in Fig. 10. The RU takes an array of data elements as inputs and reduces it to a single data element. The RU can support various reduction operations, such as summation, Euclidean distance, L1 and L2 norm, and zero-comparison, of which two are described in Table III. Consider the computation of

$$S = \sum_{k=1}^{N} \left(A[k] + B[k]\right)$$

where arrays A and B are stored in rows i and j, respectively (shown in Fig. 10). To compute the desired function using a VCiM operation, we activate rows i and j simultaneously, and configure the sensing circuitry to perform an ADD operation and the RU to perform accumulation of the resulting output. Note that the summation would require 2N memory accesses in a conventional memory. With scalar CiM operations, it would require N memory accesses. With the proposed VCiM operations, only a single memory access is required.
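The access counts quoted above can be reproduced with a simple counting model that sums A[k] + B[k] over N elements in three ways. The function names are hypothetical, and the model counts array accesses only; it ignores timing, energy, and the restriction on supported vector lengths.

```python
def sum_conventional(row_a: list, row_b: list) -> tuple:
    """Load every element of both rows, then add in the processor."""
    total, accesses = 0, 0
    for a, b in zip(row_a, row_b):
        accesses += 2          # one load for A[k], one load for B[k]
        total += a + b
    return total, accesses


def sum_scalar_cim(row_a: list, row_b: list) -> tuple:
    """One CiM ADD per element pair; the processor accumulates."""
    total, accesses = 0, 0
    for a, b in zip(row_a, row_b):
        accesses += 1          # a single access returns A[k] + B[k]
        total += a + b
    return total, accesses


def sum_vcim(row_a: list, row_b: list) -> tuple:
    """One vector CiM ADD over both rows, reduced inside the memory."""
    elementwise = [a + b for a, b in zip(row_a, row_b)]  # all ADDs, one access
    return sum(elementwise), 1                           # reduce unit accumulates
```

For N = 8 words per row, the three variants return the same sum with 16, 8, and 1 accesses, respectively.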
The overheads of the RU depend on two factors: 1) the number of different reduction operations supported and 2) the maximum vector length allowed (between 2 and N words). To limit the overheads, we restrict our design to vector lengths of 4 and 8.
2) Error Detection and Correction: To enable correction of erroneous bits for CiM operations, we introduce an EDC unit that implements the 3EC4ED ECC scheme. The EDC unit checks for errors using the CiM XOR output (recall that the XOR is evaluated along with all CiM operations) and signals the controller (shown in Fig. 10 ) upon the detection of erroneous computations. Upon receiving this error detection signal, the controller performs the required corrective actions.
B. Architectural Extensions for STT-CiM
To integrate STT-CiM in a programmable processor-based system, we propose the following architectural enhancements.
1) ISA Extension: We extend the ISA of a programmable processor to support CiM operations. To this end, we introduce a set of new instructions in the ISA (CiMXOR, CiMNOT, CiMAND, CiMADD, ...) that are used to invoke the different types of operations that can be performed in the STT-CiM array. In a load instruction, the requested address is sent to the memory, and the memory returns the data stored at the addressed location. However, in the case of a CiM instruction, the processor is required to provide addresses of two memory locations instead of a single one, and the memory operates on the two data values to return the final output.
Equation (1) shows the format of a CiM instruction with an example. As shown, both the addresses required to perform CiMXOR operations are provided through registers. The format is similar to a regular arithmetic instruction that accesses two register values, performs the computation, and stores the result back in a register.
2) Program Transformation: To exploit the proposed CiM instructions at the application level, an assembly-level program transformation is performed, wherein specific sequences of instructions in the compiled program are mapped to suitable CiM instructions in the ISA. Fig. 11 shows an example transformation, where two load instructions followed by an XOR instruction are mapped to a single CiMXOR instruction.
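A minimal sketch of such a transformation is a peephole pass over an instruction list. The tuple encoding and the mnemonics "ld", "xor", and "cimxor" are hypothetical stand-ins, not Nios II syntax, and the sketch omits checks that a real pass would need, such as confirming that the loaded registers are not used later and that the two operands satisfy the placement constraints of the array.

```python
def fuse_cim_xor(program: list) -> list:
    """Replace each (ld, ld, xor) triple with a single cimxor instruction.

    Instructions are tuples: ("ld", dest_reg, addr_reg) and
    ("xor", dest_reg, src_reg_1, src_reg_2). Other tuples pass through.
    """
    out, i = [], 0
    while i < len(program):
        window = program[i:i + 3]
        if (
            len(window) == 3
            and window[0][0] == "ld"
            and window[1][0] == "ld"
            and window[2][0] == "xor"
        ):
            (_, r1, addr1), (_, r2, addr2), (_, rd, s1, s2) = window
            # The XOR must consume exactly the two loaded values, and the
            # second address must not depend on the first loaded value.
            if r1 != r2 and {s1, s2} == {r1, r2} and addr2 != r1:
                out.append(("cimxor", rd, addr1, addr2))
                i += 3
                continue
        out.append(program[i])
        i += 1
    return out
```

For example, the sequence ld r1, (r4); ld r2, (r5); xor r3, r1, r2 becomes cimxor r3, r4, r5, while any triple that does not match is left unchanged.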
3) Bus and Interface Support: In a programmable processor-based system, the processor and the memory communicate via a system bus or on-chip network. This makes it essential to analyze the impact of CiM operations on the bus and the corresponding bus interface. As discussed earlier, a CiM operation is similar to a load instruction, with the key difference that it sends two addresses to the memory. Conventional system buses only allow sending a single address onto the bus via the address channel. In order to send the second address for CiM operations, we utilize the write data channel of the system bus, which is otherwise unused during a CiM operation. Besides the two addresses, the processor also sends the type of CiM operation (CIMType) that needs to be performed. Note that it may be possible to overlay the CIMType signal onto the existing bus control signals; however, such optimizations strongly depend on the specifics of the bus protocol being used. In our design, we assume that three control bits are added to the bus to carry CIMType, and account for the resulting overheads in our experiments.
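As a rough illustration of this scheme, the sketch below models a bus request in which the second operand address travels on the write data channel and CIMType occupies three control bits. The field names, the numeric CIMType encodings, and the use of 0 to mean "not a CiM operation" are all assumptions made for the example; the actual encodings are not specified here.

```python
from dataclasses import dataclass

# Hypothetical 3-bit CIMType encodings (assumed values, for illustration only).
CIM_TYPE = {
    "NONE": 0b000,    # ordinary load, no in-memory computation
    "CiMXOR": 0b001,
    "CiMNOT": 0b010,
    "CiMAND": 0b011,
    "CiMADD": 0b100,
}


@dataclass(frozen=True)
class BusRequest:
    address: int     # address channel: first operand address
    write_data: int  # write data channel: second operand address for CiM
    cim_type: int    # the three added control bits


def load_request(addr: int) -> BusRequest:
    """A regular load uses only the address channel."""
    return BusRequest(address=addr, write_data=0, cim_type=CIM_TYPE["NONE"])


def cim_request(op: str, addr_a: int, addr_b: int) -> BusRequest:
    """A CiM operation sends the second address on the write data channel."""
    code = CIM_TYPE[op]
    if not 0 < code < 8:
        raise ValueError("CIMType must be a nonzero 3-bit value")
    return BusRequest(address=addr_a, write_data=addr_b, cim_type=code)


req = cim_request("CiMXOR", 0x1000, 0x2000)
```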
C. Data Mapping
In order to perform a CiM instruction, the locations of its operands in memory must satisfy certain constraints. Let us consider a memory organization consisting of several banks where each bank is an array that contains rows and columns. In this case, a CiM operation can be performed on two data elements only if they satisfy three key criteria: 1) they are stored in the same bank; 2) they are mapped to different rows; and 3) they are stored in the same set of columns.
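The three criteria can be expressed as a simple check on decoded addresses. The sketch below assumes a hypothetical geometry (8 words per row, 4 rows per bank) and a hypothetical word-address layout; only the three conditions themselves come from the text.

```python
from typing import NamedTuple

# Hypothetical geometry and address layout, chosen only for illustration.
WORDS_PER_ROW = 8
ROWS_PER_BANK = 4


class Location(NamedTuple):
    bank: int
    row: int
    col: int


def locate(word_addr: int) -> Location:
    """Decode a word address into (bank, row, column) under the assumed layout."""
    col = word_addr % WORDS_PER_ROW
    row = (word_addr // WORDS_PER_ROW) % ROWS_PER_BANK
    bank = word_addr // (WORDS_PER_ROW * ROWS_PER_BANK)
    return Location(bank, row, col)


def can_use_cim(addr_a: int, addr_b: int) -> bool:
    """Same bank, different rows, same set of columns."""
    a, b = locate(addr_a), locate(addr_b)
    return a.bank == b.bank and a.row != b.row and a.col == b.col
```

Under this layout, word addresses 3 and 11 satisfy all three criteria, whereas 3 and 4 share a row and 3 and 35 fall in different banks.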
Consequently, a suitable data placement technique is required that maximizes the use of CiM operations. We observe that the target applications for STT-CiM have well-defined computation patterns, facilitating such a data placement. Fig. 12 shows three general computation patterns. We next discuss these compute patterns and the corresponding data placement techniques.
Type I: This pattern, shown in the top row of Fig. 12, involves element-to-element operations (OPs) between two arrays, e.g., A and B. In order to effectively utilize STT-CiM for this compute pattern, we utilize the array alignment technique [shown in Fig. 12(a)], which aligns the corresponding elements of A and B so that each element-to-element operation can be converted into a CiM operation. An extension to this technique is the row-interleaved placement shown in Fig. 12(b). This technique is applicable to larger data structures that do not fully reside in the same memory bank. It ensures that the corresponding elements, i.e., A[i] and B[i], are mapped to the same bank for any value of i, and satisfy the alignment criteria for a CiM operation.
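One possible reading of the row-interleaved placement is sketched below: rows of A and rows of B alternate within each bank, so that A[i] and B[i] always share a bank and a column but sit in adjacent rows. The geometry constants and the function place_pair are hypothetical, and the exact layout in Fig. 12(b) may differ.

```python
WORDS_PER_ROW = 8  # hypothetical row width, in words
ROWS_PER_BANK = 4  # hypothetical bank height; must be even for this layout


def place_pair(i: int):
    """Return the (bank, row, col) locations of A[i] and B[i]."""
    col = i % WORDS_PER_ROW
    row_pair = i // WORDS_PER_ROW        # index of the (A row, B row) pair
    pairs_per_bank = ROWS_PER_BANK // 2
    bank = row_pair // pairs_per_bank
    row_a = 2 * (row_pair % pairs_per_bank)  # A uses even rows
    row_b = row_a + 1                        # B uses the following odd row
    return (bank, row_a, col), (bank, row_b, col)


placements = [place_pair(i) for i in range(64)]
```

Arrays that exceed one bank simply spill into the next bank in row pairs, which keeps every A[i] and B[i] pair eligible for a CiM operation.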
Type II: This pattern, shown in the middle row of Fig. 12, involves a nested loop in which the inner loop iteration consists of a single element of array A being operated with several elements of array B. For this one-to-many compute pattern, we introduce a spare row technique for data alignment. In this technique, a spare row is reserved in each memory bank to store copies of an element of A. As shown in Fig. 12(c), in the kth iteration of the outer loop, a special write operation is used to fill the spare rows in all banks with A[k]. This results in each element of array B becoming aligned with a copy of A[k], thereby allowing CiM operations to be performed on them. Note that the special write operation introduces energy and performance overheads, but this overhead is amortized over all inner loop iterations, and is observed to be quite insignificant in our evaluations.
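The amortization argument can be made concrete with a small functional model of the spare row technique. This is a sketch under assumptions: banks are modeled as nested Python lists, the special write is counted as one operation per outer-loop iteration, and the function name one_to_many is hypothetical.

```python
from typing import Callable, List


def one_to_many(A: List[int], banks: List[List[List[int]]],
                op: Callable[[int, int], int]):
    """banks[b][r][c] holds one element of B; each bank also has a spare row."""
    results = []
    special_writes = 0
    cim_ops = 0
    for a_k in A:  # kth iteration of the outer loop
        # Special write: fill the spare row of every bank with copies of A[k].
        spare_rows = [[a_k] * len(bank[0]) for bank in banks]
        special_writes += 1
        for bank, spare in zip(banks, spare_rows):
            for row in bank:
                for col, b in enumerate(row):
                    # spare[col] and b share a bank and a column, in different rows.
                    results.append(op(spare[col], b))
                    cim_ops += 1
    return results, special_writes, cim_ops


A = [1, 2]
banks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
results, special_writes, cim_ops = one_to_many(A, banks, lambda x, y: x ^ y)
```

In this toy configuration, two special writes enable sixteen CiM operations; the ratio grows with the size of B.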
Type III: In this pattern, shown in the bottom row of Fig. 12, operations are performed on an element drawn from a small array A and an element from a much larger array B. The elements are selected arbitrarily, i.e., without any predictable pattern. For example, consider searching for a small sequence of characters within a much larger input string. For this pattern, we propose a column replication technique to enable CiM operations, as shown in Fig. 12(d). In this technique, a single element of the small array A is replicated across columns to fill an entire row. This ensures that each element of A is aligned with every element of B, enabling a CiM operation to be utilized. Note that the initial overhead due to data replication is very small compared with the number of memory accesses to the larger array.
Fig. 13. STT-CiM device-to-architecture evaluation framework.
VI. EXPERIMENTAL METHODOLOGY
In this section, we discuss the device-to-architecture simulation framework (see Fig. 13) and application benchmarks used to evaluate the performance and energy benefits of STT-CiM at the array level and system level.
1) Device/Circuit Modeling: We first characterize the bitcells using SPICE-compatible MTJ models that are based on the self-consistent solution of Landau-Lifshitz-Gilbert magnetization dynamics and nonequilibrium Green's function electron transport [66]. Table IV shows the MTJ device parameters [67] used in our experiments. Using the 45-nm bulk CMOS technology and the MTJ models, the memory array along with the associated peripherals and extracted parasitics was simulated in SPICE for read, write, and CiM operations to obtain array-level timing and energy characteristics. The obtained characteristics were then used as technology parameters in a modified version of CACTI [68] that is capable of estimating system-level properties for
2) System-Level Simulation: We evaluated STT-CiM as a 1-MB scratchpad for an Intel Nios II processor [58]. Fig. 14 shows the integration of STT-CiM in the memory hierarchy of the programmable processor. In order to expose the STT-CiM operations to software, we extended the Nios II processor's instruction set with custom instructions. The Avalon on-chip bus was also extended to support CiM operations. Cycle-accurate RTL simulation was used to obtain the execution time and the memory access traces for various benchmarks. These traces along with the energy results obtained through the modified CACTI tool were used to estimate the total memory energy.
3) Benchmark Applications: We evaluate STT-CiM on a suite of 12 algorithms drawn from various applications (see Table V).
VII. RESULTS
In this section, we first present an array-level analysis of STT-CiM and then quantify its benefits through system-level energy and performance evaluation.
A. Array-Level Analysis
1) Energy: The second and third bars in Fig. 15 show the energy consumed by a standard read operation and STT-CiM uses a 3EC4ED ECC scheme (compared with SECDED in the baseline STT-MRAM), which accounts for about 3% of the 4.4% energy overhead. The CiMXOR operation consumes higher energy than a standard read operation mainly due to the charging of multiple WLs and a slightly higher SL current. However, since a CiM operation replaces two normal read operations, we also present the energy required for two reads in a standard STT-MRAM (last bar in Fig. 15). Note that an array-level comparison greatly understates the benefits of STT-CiM, since it does not consider the system-level impact of reduced data transfers between the processor and memory (system-level evaluation is presented in Sections VII-B and VII-C). Nevertheless, it is worth noting that even at the array level, STT-CiM consumes 34.2% less energy than STT-MRAM. The benefits mainly arise from a lower BL dynamic energy (BitL), since only a single access to the memory array is required for STT-CiM.
2) Area and Access Time: Fig. 16 shows the area breakdown for two STT-CiM designs that support vector operations of length 4 (VEC4) and 8 (VEC8). Compared with the STT-MRAM baseline, the area overheads for VEC4 and VEC8 are 14.2% and 16.6%, respectively. As shown in Fig. 16, peripheral circuits, ECC storage, and ECC logic are the causes of area overheads (5%, 3.6%, and 3.2%, respectively). Peripheral circuits include the enhanced address decoder (1.8%), sense amplifier (0.9%), and RU (2.3%). Note that the total area is still dominated by the core array, which remains unchanged.
Finally, the access time overhead for STT-CiM was found to be only ∼0.8%, because the WL and BL delays dominate the total memory access latency.
B. Application-Level Memory Energy
We next present the system-level memory energy benefits of using STT-CiM in the programmable processor-based system shown in Fig. 14. We evaluated the total memory energy consumed by STT-CiM across the application benchmarks, and compared it with a baseline design that uses the standard STT-MRAM. Fig. 17 shows the breakdown of different energy components, viz., read, write, and CiM, that contribute to the overall memory energy in both the STT-MRAM and proposed STT-CiM designs. In addition, it also shows the energy overheads due to near-memory corrections on failing CiM operations. The total memory energy for an application is normalized to the memory energy consumed by the baseline design. For the proposed STT-CiM design, we evaluated a version without vector operations (STT-CiM), and two versions with the vector lengths of 4 and 8 (STT-CiM+VEC4 and STT-CiM+VEC8, respectively). Across all benchmarks, we observe 1.26 times, 2.77 times, and 3.83 times average improvement in energy for STT-CiM, STT-CiM+VEC4, and STT-CiM+VEC8, respectively.
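A first-order model of how the read, write, and CiM components combine into a normalized memory energy is sketched below. It is a simplification under stated assumptions: the per-operation energies are illustrative placeholders rather than measured values, vector operations and error-correction overheads are ignored, and each CiM operation is assumed to replace exactly two convertible reads.

```python
def normalized_energy(writes: int, plain_reads: int, convertible_reads: int,
                      e_read: float = 1.0, e_write: float = 3.0,
                      e_cim: float = 1.3):
    """Return (baseline_energy, cim_energy, improvement) for one access profile.

    The default energies are illustrative placeholders in arbitrary units.
    """
    baseline = writes * e_write + (plain_reads + convertible_reads) * e_read
    # Each CiM operation replaces a pair of convertible reads.
    cim = (writes * e_write + plain_reads * e_read
           + (convertible_reads / 2) * e_cim)
    return baseline, cim, baseline / cim


# Two hypothetical access profiles with 100 accesses each.
read_heavy = normalized_energy(writes=10, plain_reads=10, convertible_reads=80)
write_heavy = normalized_energy(writes=40, plain_reads=40, convertible_reads=20)
```

Under these placeholder values, the profile dominated by convertible reads shows a clearly larger improvement factor than the write-heavy profile, since writes and non-convertible reads cost the same in both designs.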
To provide further insights into the energy benefits, Fig. 19 presents a breakdown of the memory accesses made by each application into three categories: writes, reads that cannot be converted into CiM operations [CiM nonconvertible reads (CNC-Reads)], and CiM convertible reads (CC-Reads). We see that applications where CC-Reads dominate the total memory accesses [Knuth-Morris-Pratt (KMP), bitblit (BLIT), generalized learning vector quantization (GLVQ), K-means clustering (KMEANS), optical character recognition (OCR), image segmentation (IMGSEG), multilayer perceptron (MLP), and support vector machine (SVM) in Fig. 19] experience higher energy benefits from STT-CiM (see Fig. 17). Among these applications, those that benefit from vectorization achieve the highest savings (GLVQ, KMEANS, OCR, MLP, and SVM). Applications with relatively fewer CC-Reads or more frequent writes (AHC, LCS, RC4, and EDIST) exhibit relatively lower energy savings. CNC-Reads and writes do not benefit from STT-CiM, and writes, in particular, consume significantly (∼3 times) higher energy than reads. The energy overheads due to additional writes incurred for data alignment in Types II and III compute patterns were observed to be 0.8% and 0.3%, respectively.
Fig. 17. Application-level memory energy.
C. System-Level Performance
Fig. 18 shows the speedup for the Nios II processor system integrated with STT-CiM across various applications. The speedup shown in Fig. 18 is with respect to the baseline design, i.e., the processor system integrated with a standard STT-MRAM-based memory. As discussed in Section V-B, CiM lowers the total number of memory accesses as well as the number of instructions executed, which leads to performance benefits at the system level. Overall, for STT-CiM without vector operations, we observe performance benefits ranging from 1.07 times to 1.36 times. With vector operations, the average speedup increased to 3.25 times and 3.93 times for the vector lengths of 4 and 8, respectively. Comparing Figs. 18 and 19, we see that the factors that indicate higher energy savings for an application (a large fraction of memory accesses are CC-Reads, and opportunities for vectorization exist) are also predictive of higher performance improvements.
Fig. 18. Application-level system performance.
In order to demonstrate the performance sensitivity to memory latency, we vary the memory latency and evaluate the execution time for each application. Fig. 20 shows the results of this sensitivity analysis. The Y-axis shows the speedup of STT-CiM over STT-MRAM, and the X-axis shows the memory latency. We observe that STT-CiM yields higher performance benefits at higher memory latency. This is because the reduced number of memory accesses for STT-CiM has a larger impact on system performance when each access is slower. On average, we achieve 1.13 times speedup for a memory latency
VIII. CONCLUSION
STT-MRAM is a promising candidate for future on-chip memories. In this paper, we proposed STT-CiM, an enhanced STT-MRAM that can perform a range of arithmetic, logic, and VCiM operations. We addressed a key challenge associated with these in-memory operations, i.e., reliable computation under process variations. We utilized the proposed design (STT-CiM) as a scratchpad in the memory hierarchy of a programmable processor, and introduced ISA extensions and on-chip bus enhancements to support in-memory computations. We proposed architectural optimizations and data mapping techniques to enhance the efficiency of STT-CiM. A device-to-architecture simulation framework was used to evaluate the benefits of STT-CiM. Our experiments indicate that STT-CiM achieves substantial improvements in energy and performance, and shows considerable promise in alleviating the processor-memory gap.
