Abstract: Spin-transfer torque random access memory (STT-RAM), an emerging nonvolatile memory technology, provides a very dense array structure and extremely low leakage power consumption. It shows great potential to replace conventional static random access memory technology in the next-generation on-chip cache memory of microprocessors and graphics processing units. The multilevel cell (MLC) design of STT-RAM, which stores two or more bits in one cell, potentially offers higher storage capacity and faster system performance, and has attracted significant attention. In this paper, we first quantitatively evaluated the data storage density of MLC STT-RAM. Our results revealed limited density improvement because of the large access transistor required by the high write current amplitude and the asymmetry of the switching behavior. Moreover, the read and write accesses of existing MLC STT-RAM cache designs require two-step operations. The system-level evaluation shows that the long access latency could offset the performance gain brought by the larger cache size, and even degrade the system performance for some applications. To unleash the potential of the MLC STT-RAM cache, we proposed a new design through cross-layer co-optimization. The memory cell structure integrates reversed stacking of magnetic tunneling junctions for a more balanced device and design tradeoff. In architecture development, we presented an adaptive mode switching mechanism: based on the application's memory access behavior, the MLC STT-RAM cache can dynamically change between a low-latency single-level cell mode and a high-capacity MLC mode. Furthermore, we divided cache lines into fast and slow regions and investigated new data migration policies to allocate frequently accessed data to fast regions. 
Simulation results show that the proposed techniques can improve the system performance by 10.2% and reduce the energy consumption on cache by 9.5% compared with conventional MLC STT-RAM cache design.
shows of paramount importance to fill the bandwidth gap between CPU cores and off-chip main memory. Furthermore, the area and power consumption of a processor chip are dramatically affected by the on-chip cache memory [2]. Traditional static random access memory (SRAM) suffers from large leakage power and degraded reliability as fabrication technology scales down further, which severely limits its future application.
In recent years, emerging nonvolatile memory technologies have been extensively studied. Examples include spin-transfer torque random access memory (STT-RAM) [3]-[5], phase change memory (PCM) [6], resistive memory (ReRAM) [7], and ferroelectric memory (FeRAM) [8]. Among these technologies, STT-RAM is believed to have the greatest potential in developing the next-generation on-chip cache memory [9]-[11]. By storing data on magnetic tunneling junctions (MTJs), STT-RAM obtains high data storage capacity, fast random access speed, and ultralow power consumption.
Compared with the single-level cell (SLC) design, the multilevel cell (MLC), which stores two or even more bits in one memory cell, is more efficient in data storage density. The MLC design has been successfully adopted in Flash memory and PCM technologies by dividing the threshold voltage of Flash and the resistance range of the PCM cell into multiple levels, respectively [12], [13]. The use of MLC in STT-RAM cache design has also been investigated. For example, Chen et al. [14] examined the read/write scheme and proposed a set remapping solution to extend its lifetime. Zhang et al. [15] compared series and parallel MLC STT-RAM designs, concluding that series MLC STT-RAM is more resilient to process variations. Jiang et al. [16] addressed the performance issue through line pairing and line swapping methods, particularly for the parallel MLC STT-RAM design. Nevertheless, a number of circuit and architectural challenges remain unsolved in the MLC design, including the limited density benefit and the degraded performance induced by multistep accesses.
An SLC STT-RAM cell is composed of an MTJ for data storage and an nMOS transistor for access control. Its area is mainly determined by the transistor size, the selection of which must take many factors into consideration, including the MTJ resistance, the MTJ switching current requirement, and the biasing condition of the transistor. Unlike MLC PCM, which obtains multiple logic bits by partitioning the resistance range without changing the cell structure, the MLC STT-RAM design needs to insert an extra MTJ pillar to represent the second logic bit. The change in cell structure greatly complicates the design tradeoff and makes the use of a minimal-sized selective transistor very difficult. In fact, our evaluation shows that the conventional MLC structure [14], [15], [17] is even in danger of losing the density competition to the SLC design. We note that reverse MTJ stacking has been successfully utilized in SLC STT-RAM [18], [19]. In this paper, we explore its use in MLC design. The new device structure expands the design space of MLC STT-RAM. Our simulations show that the new cell structure made of reverse MTJ connections can achieve the smallest area and retain the density advantage.
Besides the storage capacity, the access speed is another key metric in cache design. By nature, accessing an MLC design is slower than accessing an SLC design, simply because its logic detection in a read operation requires two sensing stages and writing an MLC cell involves two-step programming. At the system level, the enlarged storage capacity and the prolonged access latency of MLC STT-RAM have contradictory impacts on the overall system performance. Which effect dominates is determined by the application's requirements. Applications with large data sets benefit from the high cache capacity, which reduces the cache miss rate and costly accesses to main memory. In contrast, applications with small data sets may suffer from the long read and write latencies, performing even worse than on a system integrated with an SLC STT-RAM cache.
We observed that an MLC STT-RAM cache can support the SLC operation mode, which provides fast accesses but sacrifices half of its storage capacity. Based on this observation, an architectural level solution, named application-aware speed enhancement (ASE), was proposed: according to the application's memory access behavior, the MLC STT-RAM cache dynamically changes between the MLC mode with high capacity and the SLC mode that offers low access latency. Furthermore, we presented a cell split mapping (CSM) method, which divides a cache line into a fast and a slow region to reduce the mode switching cost. To take full advantage of the proposed architectural solutions, new data migration policies that allocate frequently used data to fast regions were also studied.
In brief, this paper makes the following contributions.
1) We investigated four MLC STT-RAM cell designs and found that the reverse connection structure provides the highest storage density and the best operation condition, under the given MTJ parameters.
2) We proposed ASE, an MLC/SLC mode switch design that takes advantage of the high speed of SLC and the large capacity of MLC.
3) We integrated CSM and data migration to reduce the cost of mode switching and further improve the speed and power efficiency of the MLC design.
The remainder of this paper is organized as follows. Section II describes the basic knowledge of STT-RAM technology. Section III explains the design challenges of MLC STT-RAM and explores different MLC STT-RAM cell structures. We then present the ASE method and describe the optimization of ASE for performance improvement in Sections IV and V, respectively. The system evaluation setup and results are presented and discussed in Section VI. Finally, we summarize the related work in Section VII and conclude this paper in Section VIII. 
II. FUNDAMENTALS OF STT-RAM
Unlike conventional memory technologies (e.g., SRAM and DRAM) that use electric charge for data storage, STT-RAM belongs to the class of magnetoresistive RAM, where MTJs are used as data storage elements. An MTJ is composed of three layers, as shown in Fig. 1(a): two ferromagnetic layers, namely, the reference layer and the free layer, are separated by an oxide barrier, e.g., MgO. The magnetization direction (MD) of the reference layer is fixed, while the MD of the free layer can be changed through a spin-polarized current [3]. Applying a current larger than the critical switching current ($I_C$) from the free layer to the reference layer switches the MD of the free layer to be parallel to that of the reference layer, and vice versa. When the MDs of the two ferromagnetic layers are parallel (P) or antiparallel (AP), the MTJ demonstrates a low- or high-resistance state, representing logic "0" or "1," respectively. Fig. 1(b) shows the most popular SLC STT-RAM design, which contains one nMOS selective transistor and one MTJ [3], [4].
MLC STT-RAM is developed by integrating two MTJs into one single cell. For example, parallel MLC STT-RAM divides the free layer of an MTJ into a hard domain and a soft domain to represent two logic bits [20] . This design demonstrates poor reliability due to its high sensitivity to process variations [15] . Instead, series MLC STT-RAM that stacks two MTJs in series is more feasible and has been widely accepted [17] . Its cell structure is shown in Fig. 2(a) .
In both parallel and series MLC cells, the two MTJ pillars representing different logic bits have different areas. As shown in Fig. 2(a), we name the data stored in the small and big MTJs the soft bit and hard bit, respectively. Because both the resistance-area product (RA) and the critical switching current density ($J_C$) remain constant in a given magnetic process, the soft bit has a larger resistance value but requires a smaller switching current $I_C$ than the hard bit. Fig. 2(b) summarizes the write procedure of an MLC STT-RAM. Programming an MLC cell needs two stages. First, a current larger than the hard-bit critical current (i.e., $I_{WH} > I_{C,\mathrm{Hard}}$) is applied, which inevitably switches both the hard bit and the soft bit. Then, a smaller current that satisfies $I_{C,\mathrm{Soft}} < I_{WS} < I_{C,\mathrm{Hard}}$ is applied to switch only the soft bit. Reading data from an MLC STT-RAM requires two sensing steps too: first detect the soft bit; then, according to the value of the soft bit, apply another reference voltage to detect the hard-bit data. The procedure is shown in Fig. 2(c).
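The two-step write and two-step read described above can be sketched as a small behavioral model. This is only an illustration for a conventional series cell: the class `SeriesMlcCell`, the function `read`, and every numeric constant are hypothetical and are not taken from Table I, and the model senses resistance directly rather than a bit-line voltage.

```python
from dataclasses import dataclass

# All numeric values below are illustrative assumptions, not taken from Table I.
I_C_SOFT = 40.0    # critical current of the small (soft-bit) MTJ, in uA
I_C_HARD = 80.0    # critical current of the large (hard-bit) MTJ, in uA
R_HARD_P = 1000.0  # parallel-state resistance of the hard-bit MTJ, in ohms
TMR = 1.0          # (R_AP - R_P) / R_P, assumed bias independent
AREA_RATIO = 2     # hard-bit area / soft-bit area


@dataclass
class SeriesMlcCell:
    soft: int = 0
    hard: int = 0

    def pulse(self, amplitude: float, value: int) -> None:
        """Apply one write pulse; every MTJ whose I_C is exceeded takes `value`."""
        if amplitude > I_C_SOFT:
            self.soft = value
        if amplitude > I_C_HARD:
            self.hard = value

    def write(self, soft: int, hard: int) -> None:
        """Two-step programming of a conventional series cell."""
        self.pulse(1.2 * I_C_HARD, hard)               # step 1: sets both bits
        self.pulse(0.5 * (I_C_SOFT + I_C_HARD), soft)  # step 2: soft bit only

    def resistance(self) -> float:
        """Total series resistance; the smaller soft MTJ has the larger resistance."""
        r_soft_p = R_HARD_P * AREA_RATIO
        r_soft = r_soft_p * (1 + TMR) if self.soft else r_soft_p
        r_hard = R_HARD_P * (1 + TMR) if self.hard else R_HARD_P
        return r_soft + r_hard


def read(cell: SeriesMlcCell) -> tuple[int, int]:
    """Two-step sensing: resolve the soft bit, then pick the hard-bit reference."""
    level = {(s, h): SeriesMlcCell(s, h).resistance() for s in (0, 1) for h in (0, 1)}
    r = cell.resistance()
    soft = int(r > (level[(0, 1)] + level[(1, 0)]) / 2)
    hard = int(r > (level[(soft, 0)] + level[(soft, 1)]) / 2)
    return soft, hard
```

Under these assumed values, an area ratio of 2 places the four resistance states at equal spacing, and the soft bit alone decides whether the total resistance lies in the lower or upper half of the range.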
In this paper, we adopted the 32-nm PTM CMOS model [21] and the MTJ parameters from [22] for circuit analysis. The area ratio of the two MTJs is set to 2 in order to balance the difference between adjacent resistance states [15]. The key design and device parameters are summarized in Table I.
III. MLC STT-RAM CELL DESIGN EXPLORATION

A. Design Challenges of Conventional MLC STT-RAM
Higher density is the major motivation to promote MLC design. In STT-RAM, the MTJ pillar is realized at the minimal allowable dimension to reduce the switching current requirement. Hence, the cell area is mainly determined by the selective transistor. On the one hand, a small transistor is preferred to improve data storage density. On the other hand, the transistor must be large enough to provide sufficient current to switch MTJ during programming. In an MLC STT-RAM cell, an extra MTJ is introduced to stand for the second logic bit. The structural modification, however, exacerbates the size requirement of the selective transistor for the following two reasons.
1) Increased Switching Current Requirement: The two MTJs in an MLC cell must have different areas in order to differentiate the two logic bits. The soft bit uses the smallest pillar, which is the same as that in the SLC design. The hard-bit pillar is enlarged accordingly [15]. Note that $J_C$ is fixed and $I_C$ increases proportionally with the MTJ area. So $I_{C,\mathrm{Hard}}$ for hard-bit programming is much bigger than $I_{C,\mathrm{Soft}}$ required for soft-bit switching, as shown in Table I.
2) Aggravated Asymmetry in Write Operation: As illustrated in Fig. 1(b), the current flows in the SL-to-BL direction when writing logic 1 (write-1) to an SLC STT-RAM cell. The voltage drop on the MTJ causes $V_{GS}$ degradation and limits the drivability of the selective transistor. By comparison, write-0 is easier and faster, because $V_{GS} = V_{DD}$. Moreover, the required MTJ switching currents in write-1 and write-0 operations are different, usually $J_{C,0\to1} > J_{C,1\to0}$ [4]. This scenario is called asymmetric writes. An MLC design with more MTJs stacked in series increases the overall resistance. Thus, the $V_{GS}$ degradation becomes worse and the current in the SL-to-BL direction further reduces.
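The proportionality between $I_C$ and MTJ area noted in item 1) can be illustrated with a short calculation. The soft-bit currents are the values quoted in Section III-B of this paper and the area ratio is the one stated in Section II; the hard-bit values printed below are derived here by proportional scaling only and are an assumption, not the entries of Table I. The helper name `scale_critical_current` is hypothetical.

```python
# Soft-bit critical currents quoted in Section III-B of this paper, in uA.
I_C_SOFT = {"0->1": 51.47, "1->0": 34.31}
AREA_RATIO = 2.0  # hard-bit area / soft-bit area (Section II)


def scale_critical_current(i_c_soft: float, area_ratio: float) -> float:
    """I_C = J_C * area, so at fixed J_C the current scales with the area."""
    return i_c_soft * area_ratio


# Derived by proportional scaling only; these are not the Table I entries.
i_c_hard = {d: scale_critical_current(i, AREA_RATIO) for d, i in I_C_SOFT.items()}

for direction in I_C_SOFT:
    print(f"{direction}: soft {I_C_SOFT[direction]:.2f} uA, "
          f"hard ~{i_c_hard[direction]:.2f} uA")
```

Under this scaling, write-1 to the hard bit is the most demanding case, which matches the constraint discussed next for the conventional design.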
The conventional MLC STT-RAM design in Fig. 2(a) is mainly constrained by the "write-1 to hard-bit" operation. First, it requires the highest switching current ($I_{C,\mathrm{Hard},0\to1}$). Moreover, the selective transistor is under the weakest biasing condition and produces the lowest driving current when the soft bit is 1. Even if the soft bit originally stores 0, the large $I_{WH1}$ will quickly flip it to 1, bringing the design into the worst case condition. The scenario is illustrated in Fig. 3(a).
During the following evaluation, the transistor of the baseline SLC cell is set to 4.5F, which is sufficient to write logic 0 and 1 into an MTJ with an area of 32 nm × 64 nm. F represents the technology feature size, which is 32 nm in this paper. Further reducing the transistor size does not increase density because the layout design rules, e.g., metal wire and via connection of BL and SL, start dominating the cell area [23].
We simulated the driving current when writing 1 or 0 to the hard bit of an MLC STT-RAM under the worst case conditions. As can be seen in Fig. 4, enlarging the selective transistor helps improve the driving current. However, the conventional MLC with a transistor of 9F (2× that of SLC) cannot supply sufficient driving current to flip the hard bit to 1 ($I_{WH1} < I_{C,\mathrm{Hard},0\to1}$). Further increasing the transistor size results in an even lower data density than the SLC STT-RAM cache, which defeats the purpose of the MLC design.
B. Exploring More MLC STT-RAM Cell Structures
The conventional MLC structure in Fig. 2(a) has two MTJs in regular connection. In fact, it is not the only possible cell structure. The free layer in an MTJ can also be fabricated underneath the reference layer to form a reverse connection [18]. The reverse connection has been successfully utilized in SLC STT-RAM for cell area reduction [19]. Fig. 5 shows three new MLC STT-RAM cell designs. Based on the stacking connections of the soft and hard bits, we name these designs soft-bit reversed MLC (SR-MLC), hard-bit reversed MLC (HR-MLC), and soft- and hard-bits reversed MLC (SHR-MLC), respectively. Since the device characteristics are solely determined by material engineering, the change in MTJ connection does not affect the switching current requirement. So programming a hard bit is still more difficult than programming its corresponding soft bit. For comparison purposes, we simulated the driving currents when writing 1 or 0 to the hard bit of these MLC designs under the worst case conditions. The results are shown in Fig. 4.
It can be seen that reversing the MTJ connection helps alleviate the asymmetry in write operations. For example, the worst case condition of $I_{WH1}$ in an SR-MLC cell is relaxed to the case in which the soft bit is 0. This is because even if the initial logic of the soft bit is 1, writing 1 to the hard bit will quickly switch the soft bit to 0 and raise $I_{WH1}$. The scenario is illustrated in Fig. 3(b). Compared to the conventional MLC design, the worst case $I_{WH1}$ of SR-MLC grows much faster as the selective transistor size increases. As a tradeoff, $I_{WH0}$ is smaller than that of the conventional design, but still more than sufficient to conduct a successful write-0 to the hard bit.
HR-MLC reverses the hard bit, changing the direction of $I_{WH1}$ from SL → BL to BL → SL. Therefore, when writing 1 to the hard bit, the selective transistor does not suffer from $V_{GS}$ degradation. The amplitude of $I_{WH1}$ grows even higher than that of SR-MLC. However, its $I_{WH0}$ degrades significantly and can barely exceed $I_{C,\mathrm{Hard},1\to0}$.
Table II summarizes the write current margins provided by four MLC designs over the required MTJ switching current, such as
and
The size of the selective transistor is set to 4.5F, corresponding to twice the data density of the baseline SLC. Fig. 6 compares the write performance of all four cells for both write-1 and write-0 operations. Because all four types of cells have the same latency requirement for soft-bit writing, only the hard-bit write latency is presented in the figure. When the selective transistor size is set to 4.5F, the SHR-MLC design has the most balanced performance for both write-1 and write-0. The SR-MLC requires a shorter write-0 latency than SHR-MLC, but its write-1 latency is much higher, especially at smaller transistor sizes. Therefore, the SHR-MLC provides the best overall write performance among the four types of cell structures. The hard-bit write energy comparison among the four types of cells is presented in Fig. 7. For each design, the longer of the write-1 and write-0 latencies is adopted for the energy calculation. SHR-MLC demonstrates the lowest energy consumption mainly because it has the shortest overall write time.
Moreover, reversing the MTJ connection helps alleviate the read disturbance and therefore improves the read stability, with the MTJ parameters used in this work. For an MLC cell, the read disturbance mainly happens to the soft bit, because the current density received at the soft bit is always twice the current density of the hard bit. When the soft bit is reversed, the read current from BL to SL is along the direction of the soft bit's write-1 operation. The required current to switch the soft bit becomes $I_{C,\mathrm{Soft},0\to1}$ = 51.47 μA, which is much larger than $I_{C,\mathrm{Soft},1\to0}$ = 34.31 μA in the conventional cell structure. This implies that the design with a reversed soft bit is more resilient to read disturbance. The read stability can also be affected by the amplitude of the read current (or more precisely, the ratio of the read current to the critical switching current, $I_{read}/I_C$), which in turn affects the sense margin. We calibrate the relation between the sense margin and $I_{read}/I_C$ for the four cell designs. According to the cell structures, $I_C = I_{C,\mathrm{Soft},0\to1}$ for SR-MLC and SHR-MLC, while $I_C = I_{C,\mathrm{Soft},1\to0}$ for conventional MLC and HR-MLC. The results in Fig. 8 show that at any given $I_{read}/I_C$, the SR-MLC and SHR-MLC designs have higher sense margins. In other words, under the same probability of read disturbance, these two types of cells can tolerate a higher read current.
The impact of process variation on the MLC cell has been analyzed in detail in [15]. Since the reverse-connected MLC structure uses the same read/write mechanism as the conventional MLC, the same analysis approach can be applied. Basically, a larger sense margin and write current margin help improve the tolerance against process variation. As shown in Fig. 8 and Table II, the SHR-MLC cell provides a higher sense margin and the most balanced write current margin for both write-0 and write-1 under the given device parameters, and therefore, it is the most robust against process variation.
Based on the previous analysis, the SHR-MLC cell provides the best read/write performance and the smallest cell area with the parameters in Table II, and is therefore adopted in the following discussion at the architecture level. It is worthwhile to mention that this conclusion is not general, but determined by the given device characteristics, including MTJ resistance, TMR, and critical switching current. In general, whether the hard bit should be reversed depends on the difference between the critical switching current densities of write-1 and write-0. With a larger difference, it is more likely that the current from SL to BL is insufficient for the write-1 operation to the hard bit, and reversing the hard bit helps to increase the write-1 current. On the other hand, reversing the soft bit is necessary when the MTJ resistance and TMR are high. For such a cell, the "1" state of the soft bit shows very high resistance, causing significant $V_{GS}$ degradation in the nonreversed connection for current from SL to BL. However, both approaches could possibly sacrifice write-0 performance. A detailed analysis shall be performed based on the given MTJ parameters to find the most balanced choice of cell.
IV. APPLICATION-AWARE SPEED ENHANCEMENT SCHEME
A. Observations and Motivation
From the system perspective, the major motivation for promoting the use of MLC STT-RAM cache is to increase the capacity and hence reduce the cache miss rate. Though the two-step read/write prolongs cache access latency, it is expected that the reduction in costly main memory accesses can amortize the impact and eventually enhance the overall system performance. However, this is not always the case.
Fig. 9. Miss-rate statistics at different sets of a four-way L2 cache for h264ref.
First of all, different applications behave differently and demonstrate different data access patterns. Although some of them need to hold a large amount of data and demand big cache capacity, many others constrain data accesses within only a small data set that can fit into a limited cache space. For the latter cases, increasing cache capacity does not have a significant impact on the cache miss rate. Moreover, many applications show a streaming-like data access behavior: data fetched from the lower level memory hierarchy will be accessed only once and then evicted. The cache miss rate in these applications is always relatively high and usually independent of cache capacity.
Second, even within a single application, the usage of different cache sets could be very different. For example, Fig. 9 presents the set-based miss rate of h264ref in a four-way L2 cache. Many sets obtain a close-to-zero miss rate, implying that these locations are unlikely to benefit from a capacity increase. Considering these factors, directly replacing an SLC STT-RAM cache with MLC could result in system performance degradation for many applications. This has been observed in our simulations, which are discussed in Section VI.
By observing the MLC STT-RAM design and read/write operation mechanism in Fig. 2, we found that a series MLC can support SLC-like accesses: 1) reading data from a soft bit needs only one sensing stage because the soft bit itself determines whether the total resistance falls into the lower half or the higher half of the range and 2) programming a soft bit requires only a small current $I_{WS}$, which does not affect the corresponding hard bit.
Based on the observations at circuit and architectural levels, in this paper, we proposed an application-aware speed enhancement (ASE) scheme for MLC STT-RAM design.
B. Application-Aware Speed Enhancement
Our approach enables two types of access modes for each cache set. In MLC mode, full storage capacity is provided but a read/write operation needs two steps to complete. The cache set can also switch to SLC mode, in which only soft bits will be read/written. Thus, an access can be completed quickly (in one step) though half of the data storage capacity is sacrificed. According to the cache set accesses, the ASE scheme dynamically switches between the two access modes. Fig. 10 illustrates the utilization of the set-based ASE in an eight-way cache memory. Controlled by a mode predictor, a cache set can stay in the MLC mode supporting eight-way accesses or change to the SLC mode in which only four ways are accessible. We chose to change the number of ways instead of sets, because the latter scheme requires modifying the word-line decoding circuitry, which induces larger overheads in hardware and latency. In this example, ways W4-W7 are discarded when switching from MLC to SLC mode, while W0-W3 are always active in both modes. Such a set-based eight-/four-way configuration will be used in the following discussion.
1) Read Performance Improvement: We can further improve the read performance of SLC mode by resetting all the hard bits to "0." Note that an MLC cell can be in the "00," "01," "10," or "11" state. Here, the first and second bits represent the data of the soft and hard bits, respectively. As shown in Fig. 11, when detecting the soft bit, the reference voltage shall be set to ref1. The sense margin, defined as the difference between the reference voltage and the bit-line voltage, is SM1 = 25 mV. Erasing the hard bit to "0" reduces the possible data states to "00" and "10" only. We can shift the reference voltage to ref2 and improve the sense margin of soft-bit detection to SM2 = 51 mV.
In Table III, the sense margin and write performance/energy are compared between standard SLC and MLC operating in SLC mode; a 4.5F transistor width is used for both cells. Only the cell-level latency and energy are shown. Please note that although the MLC cell can possibly provide a higher write-1 current to achieve better write-1 performance, we have limited the write-1 current in order to prevent disturbing the hard bit, which should stay at 0 after entering the SLC mode.
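The reference shift described for SLC-mode reads can be illustrated numerically. The sketch below places the soft-bit reference midway between the closest soft-0 and soft-1 levels. The bit-line voltages in `LEVELS_MV` are hypothetical: they are chosen only so that the resulting margins match the SM1 and SM2 values quoted in the text, and they assume that soft-1 states produce the higher bit-line voltage. The function name `soft_bit_reference` is also an assumption.

```python
# Hypothetical bit-line voltages in mV for states written as "<soft><hard>".
# They are chosen only so that the margins match the SM1 and SM2 values quoted
# in the text; the real levels come from circuit simulation (Fig. 11).
LEVELS_MV = {"00": 300, "01": 352, "10": 402, "11": 454}


def soft_bit_reference(levels: dict[str, int], states: list[str]) -> tuple[float, float]:
    """Place the reference midway between the closest soft=0 and soft=1 levels.

    Returns (reference, sense margin), both in mV.
    """
    soft0 = max(levels[s] for s in states if s[0] == "0")
    soft1 = min(levels[s] for s in states if s[0] == "1")
    return (soft0 + soft1) / 2, (soft1 - soft0) / 2


ref1, sm1 = soft_bit_reference(LEVELS_MV, ["00", "01", "10", "11"])  # MLC levels
ref2, sm2 = soft_bit_reference(LEVELS_MV, ["00", "10"])              # hard bits erased
print(f"ref1 = {ref1} mV, SM1 = {sm1} mV")
print(f"ref2 = {ref2} mV, SM2 = {sm2} mV")
```

Removing the "01" and "11" states widens the gap that the reference has to split, which is why the margin roughly doubles in this illustration.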
2) Mode Switching Control: The mode predictor determines whether the MLC or SLC mode shall be applied, based on the cache access pattern. In the set-based ASE, the mode predictor is a saturating counter with a structure similar to that in [24]. It is incremented by Latency-Reduction when a hit occurs in W0-W3. If a hit falls on W4-W7, indicating that a miss can be avoided in MLC mode, the mode predictor is decremented by Miss-Penalty.
A set changes to the MLC mode when its mode predictor decreases to 0. Conversely, if the mode predictor reaches the preset threshold $M_{Th}$, the set switches to the SLC mode. At that moment, data in W4-W7 are evicted to the lower level memory hierarchy, followed by resetting the hard bits to "0." Considering the associated high latency cost, frequent mode changing is unaffordable and can be constrained by increasing $M_{Th}$. However, a very big $M_{Th}$ could cause the mode change to lag so that cache sets cannot adjust to the suitable mode in time. Therefore, $M_{Th}$, as a key design parameter, shall be carefully selected for the best performance. More discussion will be presented in Section VI-E.
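The per-set mode predictor described above can be sketched as a saturating counter. This is a minimal illustration rather than the authors' hardware: the class name `ModePredictor`, the default weights, the initial counter value, and the initial mode are all assumptions, and the sketch does not model how hits to W4-W7 are detected while a set is in SLC mode.

```python
from dataclasses import dataclass


@dataclass
class ModePredictor:
    """Saturating counter for one cache set (weights and initial state assumed)."""
    threshold: int              # M_Th: reaching it switches the set to SLC mode
    latency_reduction: int = 1  # increment on a hit in W0-W3 (assumed value)
    miss_penalty: int = 4       # decrement on a hit in W4-W7 (assumed value)
    counter: int = 0
    mode: str = "MLC"

    def on_hit(self, way: int) -> str:
        """Update the counter for a hit on `way` and return the resulting mode."""
        if way < 4:  # W0-W3: SLC mode would have served this hit faster
            self.counter = min(self.threshold, self.counter + self.latency_reduction)
        else:        # W4-W7: only MLC mode keeps this block in the cache
            self.counter = max(0, self.counter - self.miss_penalty)
        if self.counter >= self.threshold:
            self.mode = "SLC"
        elif self.counter == 0:
            self.mode = "MLC"
        return self.mode
```

Between the two extremes the set keeps its current mode, so a larger `threshold` makes switching rarer at the price of a slower reaction to phase changes.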
C. Logic to Physical Mapping Strategies
The effective mapping of the logic data and physical cells is critical in the ASE scheme. It not only determines the performance but also affects the overhead induced by the SLC/MLC mode switching. In this paper, we propose direct mapping and CSM methods.
1) Direct Mapping: A straightforward way to utilize an MLC STT-RAM cache is to directly map every N logic bits to N/2 MLC cells. For instance, as illustrated in Fig. 12(a), a cache line of 64 bytes (512 bits) can be allocated to 256 MLC cells: half of the data bits are stored in the soft bits and the other half are saved in the hard bits. We name this method of mapping logic data to physical cells direct mapping (DM).
Because of its simplicity, DM has been naturally adopted in MLC STT-RAM cache designs [14]. Since a cache line contains both soft and hard bits, each read/write access needs to take two operation steps. Moreover, DM incurs relatively high overhead during mode switching, as illustrated in Fig. 12(a). When changing from the MLC to the SLC mode, data stored in W4-W7 need to be read out and written to the lower level memory before they can be discarded. Then, W0-W3 need to be remapped, which introduces an extra round of read and write. When switching back from the SLC to the MLC mode, such a remapping needs to be performed one more time.
2) Cell Split Mapping: DM is not able to leverage the fast soft-bit access, which requires only a one-step operation. Moreover, mapping a cache line to both soft and hard bits will cause data reorganization whenever a mode switch occurs. To solve these issues, we propose a new cache line mapping method, named CSM. Fig. 12(b) shows the cache architecture when adopting CSM. Half of the cache lines (W0-W3) are mapped to soft bits, while W4-W7 are mapped to hard bits. Recall that W0-W3 are also mapped to soft bits in the SLC mode, so these ways remain unchanged during the SLC/MLC mode switching. Also, W4-W7 can be activated without affecting the data in the corresponding W0-W3, minimizing the cost of changing from the SLC to the MLC mode. When switching from MLC to SLC, the data stored in hard ways shall be evicted into the lower level memory hierarchy if they are marked dirty. So a read operation on the hard way and a write operation to the lower level memory are needed. In general, CSM eliminates the data reorganization during mode switching and therefore greatly improves the efficiency of ASE. In the following discussion, we use "soft ways" to represent W0-W3, which contain only soft bits, and denote W4-W7 as "hard ways."
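The difference between the two mappings can be made concrete with a small sketch for the eight-way, 512-bit-line example. The function names are hypothetical, and the even/odd interleaving used for DM is an assumption, since the text only states that half of the bits go to soft bits and half to hard bits.

```python
LINE_BITS = 512  # 64-byte cache line
NUM_SOFT_WAYS = 4


def dm_location(way: int, bit: int) -> tuple[int, int, str]:
    """Direct mapping: each way owns 256 cells (assumed: even bits soft, odd hard)."""
    return (way, bit // 2, "soft" if bit % 2 == 0 else "hard")


def csm_location(way: int, bit: int) -> tuple[int, int, str]:
    """CSM: W0-W3 use the soft bits and W4-W7 the hard bits of the same 512 cells."""
    group = way % NUM_SOFT_WAYS
    layer = "soft" if way < NUM_SOFT_WAYS else "hard"
    return (group, bit, layer)


def access_steps(locate, way: int) -> int:
    """One step if a line lives only in soft bits, two steps otherwise."""
    layers = {locate(way, b)[2] for b in range(LINE_BITS)}
    return 1 if layers == {"soft"} else 2


for name, locate in (("DM", dm_location), ("CSM", csm_location)):
    print(name, [access_steps(locate, w) for w in range(8)])
```

In this model W0 and W4 resolve to the same cell group under CSM, which is the coupling that later makes intracell swapping inexpensive.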
Note that CSM induces nonuniform data access latencies, determined by both the operation type and the data location. A cache hit on a soft way, whether it is a read or a write operation, can be completed in one step, which is the same as an SLC operation. The accesses to hard ways, however, are more costly and complex. First, reading data from a hard way behaves the same as that in an MLC cache with direct mapping. When writing to a hard way, however, the data in the corresponding soft way shall be protected by following the sequence of reading the soft-way data, programming the hard way, and restoring the soft-way data. The write access latency of a hard way can be denoted as

$$L_{CSM,W,H} = T_{per} + T_{RS} + T_{WH} + T_{WS} \tag{1}$$

where $T_{per}$ is the latency of the peripheral circuitry such as the signal routing and address decoding components. $T_{RS}$ is the sensing time to detect the soft way. $T_{WH}$ and $T_{WS}$ are the times to program the hard way and the soft way, respectively. Note that $L_{CSM,W,H}$ is longer than the write latency of an MLC cache with DM, which is

$$L_{DM,W} = T_{per} + T_{WH} + T_{WS}.$$

Fortunately, the extra read occurs on the same MLC cells as the original write, so $T_{per}$ can be shared. CSM shares some similarities with line pairing for parallel MLC STT-RAM [16], which pairs two cache lines in different banks into one group and reorganizes the data. However, due to the complex characteristics of parallel MLC, the line pairing scheme divides a cache line into write-fast-read-slow and read-fast-write-slow forms, which cannot efficiently handle data blocks requiring frequent read and write accesses. Nor can it provide natural support for the SLC mode as proposed in this paper.
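The write sequences described above can be compared with a simple additive latency model. The timing values are arbitrary placeholders rather than simulation results, the names `Timing`, `csm_soft_way_write`, `csm_hard_way_write`, and `dm_write` are hypothetical, and the soft-way write is assumed to cost one peripheral delay plus one soft-bit programming step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """Delays in arbitrary time units; all values are assumed placeholders."""
    t_per: int = 10  # peripheral circuitry: signal routing and address decoding
    t_rs: int = 5    # sensing the soft way
    t_wh: int = 20   # programming the hard way
    t_ws: int = 15   # programming the soft way


def csm_soft_way_write(t: Timing) -> int:
    """One-step write, as in an SLC cell (assumed form)."""
    return t.t_per + t.t_ws


def csm_hard_way_write(t: Timing) -> int:
    """Read the soft way, program the hard way, restore the soft way."""
    return t.t_per + t.t_rs + t.t_wh + t.t_ws


def dm_write(t: Timing) -> int:
    """Two-step write of a directly mapped line: hard bits, then soft bits."""
    return t.t_per + t.t_wh + t.t_ws


t = Timing()
print("CSM soft-way write:", csm_soft_way_write(t))
print("CSM hard-way write:", csm_hard_way_write(t))
print("DM write:          ", dm_write(t))
```

Because the peripheral delay is shared, the only extra cost of a CSM hard-way write over a DM write in this model is the soft-way sensing time.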
V. OPTIMIZATION OF CACHE WITH CELL SPLIT MAPPING
In an MLC STT-RAM cache with CSM, soft ways and hard ways evenly split the capacity. Without any optimization, about half of the cache hits occur on the hard ways and suffer from long access latency. In order to reduce the hits on hard ways and maximize the usage of soft ways, we propose an optimization methodology which includes the intracell swapping mechanism, the data migration method, the shifting replacement policy during a cache miss, and the associated tag array design. Details of the optimization method will be explained in this section.
A. Intracell Swapping
Data migration is very common in caches with nonuniform access latencies. It is usually performed by swapping data between fast and slow regions that are assigned to different physical locations or even implemented with different memory technologies, e.g., between SRAM and STT-RAM [10], [16]. However, such data swapping usually introduces large overheads in latency and energy consumption.
In the proposed MLC STT-RAM cache, a soft way and a hard way in the same group of memory cells, e.g., W0 and W4 in Fig. 12(b), are coupled. The data swapping between coupled ways, namely, intracell swapping, is natural and easy. So our design adopts only intracell swapping to reduce the data migration overhead.
For example, if swapping W0 with a way belonging to other MLCs, say, W5, the latency of such an inter-cell swapping is where the suffix number 0/5 represents the way index. For comparison, the latency to complete an intracell swapping between W0 and W4 is much shorter, such as
where T RH is the hard-bit sensing latency once the soft bit is known. The benefit of constraining the data swapping within the same MLCs is obvious by comparing T inter and T intra .
Executing data swapping when memory is idle can alleviate the impact on system performance but cannot avoid the extra energy overhead. Instead, our approach hides the swap operation inside normal read/write accesses to hard ways. Fig. 13 shows the timing diagram of data swapping enabled by a hard-way access, which can be a write or a read.
1) Write & Swap:
For a data swap triggered by a hard-way write, we can move the data of its corresponding soft way to the hard way and allocate the new data to the soft way. It is not necessary to read the hard way, which will be overwritten by the incoming data. This operation is exactly the same as a normal hard-way write, with a latency summarized in (1). It does not induce extra latency or energy overhead.
2) Read & Swap: Data swapping can also be initiated by a hard-way read. The soft way is read out first, followed by the hard-way read. Then, the two data blocks are swapped and written into the hard way and the soft way in sequence. Note that the read-out data can be used for further operation without waiting for the writes to complete, so the swapping does not add delay to this read access. Also, a large part of the energy cost of swapping, such as decoding and sensing, is absorbed by the normal read access.
Although intercell swapping can provide more flexible data migration and enhance the soft-way utilization, its large latency overhead cannot be completely hidden by normal operations. Our evaluation in Section VI shows that intracell swapping together with a simple data migration policy can allocate more than 90% of cache hits to soft ways. Thus, we did not adopt intercell swapping between different MLC cells in this paper.
B. Migration Method
Data migration is possible with the support of the swapping mechanism. Our objective is to move frequently accessed data blocks to soft ways, which require only one step in read and write operations. Here, we propose two methods, namely, counter-based migration (CM) and aggressive migration (AM), to control the data movement between soft and hard ways.
1) Counter-Based Migration:
A counter Hcnt is assigned to each pair of coupled soft and hard ways to track access frequency. When a hit occurs on the hard way, Hcnt increases by one; when a hit occurs on the soft way, Hcnt decreases by one. If Hcnt reaches a preset threshold (H Th ), indicating that more accesses hit the hard way than the soft way, we swap their data and reset Hcnt to 0. This flow is shown in Fig. 14(a) . The overhead of CM mainly comes from the counters.
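The CM flow can be illustrated with a minimal Python sketch. It is not the hardware implementation: the names `CoupledPair` and `cm_access` are hypothetical, the counter is assumed to count up on hard-way hits (so that reaching H Th signals hard-way dominance), and clamping Hcnt at 0 is an assumption that the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class CoupledPair:
    """One soft way and one hard way sharing the same MLC cells (hypothetical model)."""
    soft: str        # block currently stored in the soft way
    hard: str        # block currently stored in the hard way
    hcnt: int = 0    # access-frequency counter Hcnt shared by the pair


def cm_access(pair: CoupledPair, hit_on_hard: bool, h_th: int = 32) -> bool:
    """Update Hcnt for one cache hit; return True if the pair was swapped."""
    if hit_on_hard:
        pair.hcnt += 1
    else:
        # Assumption: the counter saturates at 0 instead of going negative.
        pair.hcnt = max(pair.hcnt - 1, 0)

    if pair.hcnt >= h_th:
        # Hard way is hit more often than the soft way: swap and reset.
        pair.soft, pair.hard = pair.hard, pair.soft
        pair.hcnt = 0
        return True
    return False
```

With the default H Th of 32 used in Section VI, a block on the hard way must accumulate 32 more hits than its coupled soft-way block before the two are exchanged.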
2) Aggressive Migration: It is a simpler scheme without counters. Considering the fact that modern embedded processors usually utilize write-back L1 cache for energy reduction [25] , a large portion of writes to L2 cache are caused by dirty line eviction from L1 cache. Many of these data could be sent back to L1 cache again. AM exploits this fact and triggers data swapping whenever a write hits on a hard-way, as shown in Fig. 14(b) . It guarantees that the most recently written data always stay on soft-ways. AM will cause more data swaps than CM, because every write-hit on hard-way triggers a swap. However, the swapping itself does not induce any overhead, because it is totally hidden by write operation as previously discussed. Moreover, AM does not require counters or other complex logic so the area overhead is negligible.
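A comparable sketch of AM, under the same caveats: `am_access` and the dictionary representation of a coupled pair are hypothetical, and only the data placement is modeled, not timing or energy.

```python
def am_access(pair, op, way, data=None):
    """Model one hit on a coupled pair under aggressive migration.

    pair: dict with keys "soft" and "hard" holding the stored blocks.
    op:   "read" or "write".
    way:  "soft" or "hard", the way on which the hit occurred.
    Returns (value_read_or_None, swapped).
    """
    if op == "read":
        # Read hits never trigger migration in AM.
        return pair[way], False

    if way == "hard":
        # Write & swap: the old soft-way block moves to the hard way and
        # the incoming data is allocated to the soft way.
        pair["hard"] = pair["soft"]
        pair["soft"] = data
        return None, True

    # Write hit on the soft way: plain overwrite, no swap needed.
    pair["soft"] = data
    return None, False
```

For instance, starting from `{"soft": "A", "hard": "B"}`, a write of `"B2"` that hits the hard way leaves `"B2"` on the soft way and `"A"` on the hard way, so the most recently written block is the one with single-step access.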
C. Shifting Replacement Policy
When a cache miss occurs, an old cache line is evicted and replaced with new data fetched from lower-level memory. A widely adopted replacement policy such as least recently used (LRU) chooses the least recently used data as the candidate for replacement. When our proposed data migration method is applied, such data are likely to be located on a hard way. This can hurt performance, because the new data usually incurs more frequent accesses and should be placed in a soft way that offers better access speed. Thus, we propose a shifting replacement policy, which is a modified version of LRU. An example is illustrated in Fig. 15: if a hard way (e.g., W4 ) is chosen to be evicted by the LRU replacement policy, instead of putting the new data directly into the hard way, we place it in the corresponding soft way (e.g., W0) and shift the old data of W0 to W4. With the proposed replacement policy, the new data will always be placed in a soft way, which guarantees fast access. The latency of such a shifting replacement remains the same as a hard-way write as described in (1).
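The placement rule of the shifting replacement policy can be sketched as follows. This is an illustrative model only: it assumes an 8-way set in which W0-W3 are soft ways and W4-W7 are hard ways, with way i coupled to way i+4 as in the W0/W4 example, and the function name `shifting_replace` is hypothetical. LRU bookkeeping is left out.

```python
NUM_SOFT = 4  # assumption: ways 0-3 are soft, ways 4-7 are their coupled hard ways


def shifting_replace(ways, lru_victim, new_block):
    """Insert new_block into a cache set and return the evicted block.

    ways:       list of blocks indexed by way number (modified in place).
    lru_victim: index of the way selected by the LRU policy.
    """
    evicted = ways[lru_victim]
    if lru_victim >= NUM_SOFT:
        soft = lru_victim - NUM_SOFT
        # Shift the coupled soft-way block down to the hard way, then
        # place the newly fetched block on the soft way.
        ways[lru_victim] = ways[soft]
        ways[soft] = new_block
    else:
        # Victim is already a soft way: replace it directly.
        ways[lru_victim] = new_block
    return evicted
```

In both branches the newly fetched block ends up on a soft way, which is the property the policy is designed to guarantee.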
D. Tag Array Design Utilizing CSM
Out of concern for system performance, previous MLC STT-RAM cache designs usually use SLC to implement tag arrays. The major drawbacks of this approach are the large array area and the increased design complexity caused by different types of cell structures. Here, we propose to apply MLC in the tag array. Besides the smaller design area that helps reduce the fabrication cost, another major advantage of the MLC-based tag array is having the same structure for both the tag and the data arrays. The compatibility in array design style reduces the design cost by sharing read/write peripheral circuitry and easing the layout organization. Similar to the data array design, we utilize CSM to reduce the tag search latency. An illustration is shown in Fig. 16 , where the physical locations of tag and data blocks have a one-to-one correspondence, i.e., both Tag0 and W0 use soft bits, while Tag7 and W7 are located on hard bits. Accordingly, a two-round tag searching method takes advantage of the possible fast accesses of CSM. In the first round, only the tags located on soft bits (i.e., Tag0-Tag3 in Fig. 16 ) are read out and compared with the target address. If a match is found, the data on the corresponding way can be identified and the tag search is completed. Otherwise, the second round of search is performed on the hard bits (i.e., Tag4-Tag7) and the remaining ways are searched. During the procedure, the data read out in the first round is kept and used in reading the hard bits in the second round. Thanks to the data migration methods that guarantee the majority of hits happen to soft bits, most tag searches require only one round with a latency equal to that of an SLC tag. Thus, the system performance after applying the new tag design is close to that of the previous SLC tag design, while the area can be greatly reduced.
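The two-round search order can be expressed as a short behavioral sketch. It assumes the 8-way arrangement of Fig. 16 (Tag0-Tag3 on soft bits, Tag4-Tag7 on hard bits); the function name `two_round_tag_search` is hypothetical, and the reuse of first-round sensing results for hard-bit reads is not modeled.

```python
NUM_SOFT_TAGS = 4  # assumption: Tag0-Tag3 on soft bits, Tag4-Tag7 on hard bits


def two_round_tag_search(tags, target):
    """Return (matching_way_or_None, rounds_used) for one tag lookup."""
    # Round 1: compare only the tags stored on soft bits (fast, SLC-like read).
    for way in range(NUM_SOFT_TAGS):
        if tags[way] == target:
            return way, 1

    # Round 2: fall back to the tags stored on hard bits.
    for way in range(NUM_SOFT_TAGS, len(tags)):
        if tags[way] == target:
            return way, 2

    # Miss: both rounds were needed to rule out every way.
    return None, 2
```

A lookup that matches one of the first four tags finishes after one round; hard-way hits and misses both need the second round.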
VI. ARCHITECTURAL LEVEL EVALUATION
A. Experimental Setup
We conducted the performance evaluation by using the cycle-accurate simulator MacSim [27] . Its built-in cache model was modified to implement our architecture-level techniques. The baseline architecture is a dual-core embedded processor with a two-level cache hierarchy, similar to Intel Atom [25] . The configuration details of the CPU core and L1 cache are summarized in Table IV. SPEC CPU2006 benchmarks [28] were adopted in the architecture simulations. For each benchmark, we fast-forwarded 500 million instructions and then executed 1 billion instructions. The processor performance is measured in instructions per cycle (IPC). In this work, we compared the following STT-RAM L2 cache designs.
1) SLC: SLC STT-RAM cache.
2) Conv-MLC: Conventional MLC STT-RAM cache.
3) ASE: Our proposed ASE MLC STT-RAM cache design, using the DM method.
4) ASE+CSM: The ASE cache with CSM.
5) ASE+CSM+CM: The ASE cache with CSM, applying counter-based data migration.
6) ASE+CSM+AM: The ASE cache with CSM, integrating AM.
Our proposed ASE MLC STT-RAM cache adopted the SHR-MLC cell structure in Section III, which offers 2× the data capacity of the SLC cache. Both SLC and MLC cells utilized a 4.5F transistor. Further decreasing the transistor size does not reduce the actual cell size, because the layout design rules start dominating the cell area [23] . The data array of both SLC and MLC caches is composed of subarrays with a size of 1024×1024, and the bit-line latency of MLC is 2.696 ps higher than that of SLC because of the small difference in resistance value. The CSM-based MLC tag array was used for all the CSM-related cache designs; otherwise, an SLC tag array was deployed.
By default, the Hcnt threshold (H Th ) is set to 32 and the threshold of the mode predictor (M Th ) is set to 1024. Table V summarizes the configurations of the STT-RAM L2 caches, where the latency and energy parameters were obtained by using NVsim [29] . The MTJ and CMOS technology parameters can be found in Table I . Fig. 17 compares the system performance when utilizing the SLC, Conv-MLC, and ASE cache designs. The IPC performance was measured on 19 benchmarks and their arithmetic average is denoted as "avg." Compared with SLC, the average IPC of Conv-MLC improves by 1.2%, while the effectiveness varies significantly across applications. The performance improvement (e.g., bzip2) mainly comes from the miss-rate reduction, benefiting from the large capacity of the MLC cache, as shown in Fig. 18 . For benchmarks that cannot take advantage of the larger cache capacity, the system performance degrades because of the two-step access of Conv-MLC. These benchmarks either demonstrate extremely low cache miss rates (e.g., gamess) or see few misses removed even when the cache capacity is enlarged (e.g., lbm).
B. ASE MLC Cache
The ASE cache performs MLC/SLC mode switching dynamically by monitoring the cache miss rate. Its IPC in bzip2 is as high as that of Conv-MLC, mainly due to the miss-rate reduction induced by the enlarged capacity in the MLC mode. For the benchmarks with few cache misses, e.g., gamess, it stays in the SLC mode, which offers fast accesses. On average, the ASE cache improves performance by 3.4% and 2.1% compared with SLC and Conv-MLC, respectively. However, limited by the long access latency of conventional DM in the MLC mode, the performance gain of ASE is not significant. Fig. 19 shows the normalized dynamic energy consumption on both the STT-RAM L2 cache and main memory. SLC consumes the least dynamic energy on the STT-RAM cache, because both read and write operations can complete within one step. However, it has the highest energy consumption on main memory among all the designs due to its high cache miss rate. Conv-MLC increases L2 cache energy by 55% because of the complex and long read/write operations. However, the overall energy is reduced by 3.3% on average, thanks to the doubled cache capacity and the resulting reduction in main memory accesses. ASE keeps the main memory energy benefits of MLC, and further reduces the energy on cache memory by 6.4% because of the low energy cost in the SLC mode.
C. ASE Cache with CSM
Applying CSM to ASE not only accelerates the accesses to half of the cache lines but also leverages the extra data capacity. In addition, CSM naturally supports the switching between SLC and MLC modes with minimal overhead. As shown in Fig. 20 , all the benchmarks obtain performance enhancement after adopting CSM. Even without utilizing any data migration scheme, ASE+CSM obtains on average 3.8% and 2.2% IPC improvements over Conv-MLC and ASE, respectively.
The energy consumption on main memory remains almost the same when integrating CSM with ASE. This is because the change of mapping method has little effect on the miss rate. Energy on the L2 cache is reduced by 5% as shown in Fig. 21 , since accessing soft ways requires less energy than accessing conventional mixed ways containing both soft bits and hard bits. Also, unlike the conventional mapping method, ASE+CSM does not need data remapping when switching between MLC and SLC modes. However, without specific data control, almost half of the accesses in the MLC mode go to hard ways, so the energy reduction over conventional DM is not very significant.
D. Data Migration Scheme Comparison
1) Effectiveness of Data Migration:
The proposed data migration schemes attempt to move the cache lines with frequent accesses to soft ways. Their effectiveness can be evaluated by using the soft-hit fraction F S , defined as F S = (#hits-on-soft-ways)/(#total-hits). Fig. 22 compares F S of different policies. Without applying any data migration, F S is in the range between 50% and 60% for most benchmarks, with an average of 56%. Simply utilizing the shifting replacement policy (denoted as shift) increases F S to 67%, because it always puts the recently fetched data on soft ways. Not surprisingly, the design adopting the shifting replacement policy and CM (CM+shift) obtains the highest average F S of 90.4%, since it counts the hits of both read and write accesses and moves the frequently accessed lines to soft ways. AM moves a cache line to a soft way only when a write hits a hard way. Read hits are ignored, so some data swapping opportunities could be missed. The average F S of AM+shift is 84%, which is still significantly higher than that of the design without any migration.
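As a small illustration of the metric, F S can be computed from a trace of hit locations. The sketch below assumes the same 8-way layout as before (ways 0-3 soft, ways 4-7 hard); `soft_hit_fraction` is a hypothetical helper and not part of the simulator.

```python
def soft_hit_fraction(hit_ways, num_soft=4):
    """Compute F_S = (#hits on soft ways) / (#total hits).

    hit_ways: iterable of way indices, one entry per cache hit.
    num_soft: number of soft ways; ways with index < num_soft are soft (assumption).
    """
    hit_ways = list(hit_ways)
    if not hit_ways:
        raise ValueError("F_S is undefined for a trace with no hits")
    soft_hits = sum(1 for way in hit_ways if way < num_soft)
    return soft_hits / len(hit_ways)
```

For example, a trace with hits on ways 0, 1, 4, and 2 gives F S = 3/4.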
2) Performance and Energy: After moving most accesses to soft ways, the CM and AM migration policies obtained 4.4% and 3.1% performance improvement over ASE+CSM without a data migration scheme, respectively. Compared with SLC and conventional MLC, the overall performance improvement of ASE+CSM+CM is 12.4% and 10.2%, respectively. The cache energy consumption of ASE+CSM+CM is 1.5% higher than that of ASE+CSM because of the data swapping overhead, but it is still 9.5% lower than that of a Conv-MLC cache design. ASE+CSM+AM shows slightly lower IPC performance but significantly lower cache energy than ASE+CSM+CM. Compared with Conv-MLC, ASE+CSM+AM improves IPC by 8.8% and saves 26% of cache energy, because the swapping of AM occurs with hard-way writes only, and "write & swap" does not incur latency or energy overhead (Section V).
E. Sensitivity Study
A sweet spot of M Th in terms of performance exists, as shown in Fig. 23(a) . A small M Th aggressively forces more cache sets to stay in the SLC mode, resulting in a high cache miss rate. A large M Th , on the other hand, delays the switching to the SLC mode even when a cache set shows an extremely low miss rate. Based on our exploration, an M Th of 1024 is optimal for average performance. When switching from MLC to SLC mode, the energy overhead associated with dirty data eviction and hard-way resetting must be considered. Fig. 23(b) shows the relation between this energy overhead and M Th . When M Th decreases from 2048 to 512, the energy overhead increases by 34%. Fortunately, the energy overhead caused by mode switching accounts for less than 1% of the total energy even when M Th is decreased to 512, so it has little effect on the energy benefits of the proposed ASE MLC STT-RAM cache.
The threshold of Hcnt (H Th ) is used to control the data swapping frequency in the CM migration policy. Fig. 24(a) shows the system performance under different H Th values. Although "read & swap" does not delay the ongoing read operation, the extra write induced by data swapping might stall the following cache accesses. If H Th is too small, the probability of such stalls increases quickly and hurts system performance. Moreover, the high occurrence of swapping increases the energy overhead. Fig. 24(b) demonstrates that the dynamic energy on the L2 cache increases significantly as H Th decreases.
F. Further Discussion
In this paper, we proposed ASE, which lets the MLC STT-RAM cache switch between MLC and SLC modes for better performance. During the investigation, we noted that the read speed in the SLC mode could be further improved. In the current design, the hard bit is fixed to 0 in the SLC mode and only the soft bit toggles between logic 0 and 1. As can be seen from the previous analysis, the change in resistance states, and therefore the sense margin, is limited. A possible solution is to use a stronger current to program both the soft and hard bits together when operating in the SLC mode. In this way, we can obtain the full range of resistance change, improving the sense margin as well as reducing the sensing delay. The concerns about this solution include the larger write energy overhead and the STT-RAM cell endurance. A more detailed design analysis is necessary and is left for future study.
VII. RELATED WORKS
Following the progress in fabrication process development, utilizing STT-RAM as on-chip storage has emerged as an attractive topic in the embedded system and computer architecture communities [3] , [4] . There have been many circuit-level studies on process variation tolerance and write speed/energy improvement. For example, a corner-aware dynamic gate voltage scheme [30] was proposed to achieve constant current sensing under process variations, and a dual reference voltage sensing scheme [31] was invented to maintain high read yield under process variations while keeping acceptable read speed and energy. Using a low threshold voltage device for the select transistor has been investigated to improve the write margin [32] . The high leakage of low threshold voltage devices was reduced by an all-digital write driver. Farkhani et al. [33] proposed a write-assist technique which applies a negative voltage to the bitline when programming logic 1 in order to balance the speeds of write-0 and write-1 operations.
One major application of STT-RAM technology is on-chip cache, so many architectural-level solutions have been investigated. The long read-penalty issue when using STT-RAM as L1 cache was addressed by means of microarchitectural modifications along with code transformation [34] . Li et al. [35] proposed retention-relaxed STT-RAM for L1 cache to improve the performance. The data in retention-relaxed STT-RAM requires refresh, the overhead of which was reduced through rearranging the data layout at compile time. Hybrid SRAM and STT-RAM cache structures that trade off system performance and energy consumption have been widely studied [36] .
Using STT-RAM for cache or register file designs in GPUs has become a popular research topic recently. For example, a last-level cache for GPUs that mixes high-retention and low-retention STT-RAM was proposed with a dynamic data migration scheme [37] . A hybrid register file design combining SRAM and STT-RAM technologies was proposed to leverage the warp schedule on GPUs with a warp-aware write-back strategy [38] . Moreover, techniques that increase the parallelism of read/write accesses and reduce the number of repeated write accesses were investigated for better performance and energy of STT-RAM-based register files [39] .
Since MLC STT-RAM was presented [17] , [20] , it has gained a lot of attention for density improvement. The MLC STT-RAM cache design in [40] utilizes a partially protected scheme to improve the energy efficiency while achieving the target reliability requirement. A two-step state transition minimization scheme is proposed in [41] to improve the lifetime of MLC STT-RAM when it is employed in cache design. A rescheduling scheme was used to minimize the waiting time of issued warps for an MLC-based register bank, as presented in [42] . Jiang et al. [16] investigated a line-pairing method which divides the parallel MLC design into read-fast-write-slow and write-fast-read-slow regions. Previous studies showed that, of the two MLC STT-RAM cell structures, the parallel MLC [20] is more sensitive to process variations and has poor reliability. The series MLC structure [17] demonstrates overwhelming benefits in read and write reliability and great potential in commercial usage [15] .
VIII. CONCLUSION
In this paper, we studied the design challenges in implementing MLC STT-RAM as on-chip caches. Our analysis showed that the conventional design may not continue to deliver the expected density benefit under scaled technology, and may even degrade system performance. Accordingly, a cross-layer solution was proposed to address these design challenges. At the circuit level, we introduced the reversed MTJ connection to the MLC STT-RAM cell design. Through a proper device and design tradeoff, two times the capacity of SLC is achieved. At the architectural level, the ASE scheme was proposed, which can adaptively adjust the cache configuration to trade off capacity and speed. Moreover, CSM differentiates the fast and slow regions in the cache architecture, and the corresponding data migration methods allocate the frequently used data to fast regions. Compared with the conventional MLC STT-RAM cache design, the proposed MLC cache design can improve the system performance by 10.2% while reducing dynamic energy consumption on the L2 cache by 9.5%.
