DRAM consumes a significant amount of energy in mobile computing devices today. Emerging non-volatile memory such as magnetoresistive memory (MRAM) offers a DRAM alternative and can potentially lead to a more energy-efficient memory system. MRAM technology is already mature, but because the memory industry is highly standardized, MRAM has yet to appear in mainstream products. To tackle this problem, we design an LPDDRx-compatible MRAM interface that accounts for both the strengths and the weaknesses of MRAM. Our design solves the pin-compatibility and performance issues caused by MRAM's small page size, and it optimizes the interface protocol by leveraging the non-destructive reads that are unique to MRAM. Combining our techniques, we boost MRAM performance by 14% and provide a DRAM-swappable MRAM solution that consumes 20% less energy.
INTRODUCTION
Battery-backed mobile devices require low energy consumption. Unfortunately, the memory subsystem in mobile devices is not energy-efficient: for example, the DRAM in a smartphone today can consume 34.5% of the total energy [8]. This is because DRAM is volatile by nature: it needs periodic refreshes, which can waste 20% of the energy [17]. Even worse, DRAM refresh will soon become a system performance bottleneck [19]. Therefore, it is necessary to explore alternative memory technologies.
Magnetoresistive memory (MRAM), also known as spin-transfer torque memory (STT-RAM), is an emerging non-volatile memory (NVM) with the potential to provide an energy-efficient memory subsystem [15, 18]. However, the energy-saving benefit of MRAM does not come for free. It is widely agreed that MRAM cannot compete with DRAM in terms of performance and, more importantly, that the internal structure of an MRAM chip is incompatible with today's memory interfaces. This compatibility is not only essential for successful early adoption of MRAM, but also critical for enabling a tiered memory system that combines refresh-free MRAM with high-performance DRAM [10, 20]. Everspin's effort [22] to produce a DDR3-compatible MRAM is another example of the importance of compatibility.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. ISLPED '14, August 11-13, 2014
The goal of this paper is to introduce an optimized MRAM interface that is fully compatible with the state-of-the-art LPDDR3 specification originally designed for DRAM. All the optimizations are done by tweaking the existing timing parameters. We start with a background introduction in Section 2, followed by a discussion in Section 3 of two unique MRAM properties: small page size and non-destructive reads. These two properties bring both challenges and opportunities in reaching our goal. Then, we detail three optimization techniques in Section 4: ComboAS and DynLat, which solve the compatibility and performance issues caused by the small MRAM page size, and EarlyPA, which leverages the non-destructive reads. Combining all of these, we show in Section 5 that MRAM performance is improved by 14% (up to 36%), and that this DRAM-swappable MRAM solution reduces energy consumption by 20%.
BACKGROUND
We briefly explain DRAM and MRAM technology background first in this section.
MRAM Technology
Compared to DRAM, MRAM is non-volatile and consumes zero standby power [6, 13, 22, 27]. Figure 1 illustrates the basic concept of MRAM. Instead of using electrical charges, MRAM uses magnetic tunnel junctions (MTJs) to store its binary data. Each MTJ consists of two ferromagnetic layers: a pinned layer with a fixed magnetization direction and a free layer with a switchable direction. The relative direction of these two layers determines the stored data. Previous work [6] has shown that the unit cell dimension of MRAM below 30 nm can be smaller than 8F², which is comparable to DRAM's 6F² size.
DRAM and LPDDRx Interface
The LPDDRx memory interface is dominant in modern mobile devices. JEDEC released the first LPDDR specification in 2009. Today, almost all mobile SoCs use LPDDR2 or LPDDR3 [12]. Figure 2 shows an exemplary LPDDRx configuration, which is a 1-channel, 2-rank memory subsystem with four x32 DRAM chips.
LPDDRx uses a multiplexed command/address (CA) bus to reduce the pin count. The 10-bit CA bus carries command, address, and bank information. Each LPDDRx DRAM internally has 8 banks, and each bank can independently process a different memory request. As in DDRx, LPDDRx accesses begin with an activation command (ACT), which includes a row access signal, a bank address, and a row address. Memory controllers send ACT commands to DRAMs, and the corresponding DRAM row is then activated (opened). After that, memory controllers issue column read or write commands with a column access signal and the starting column address for burst accesses.
MRAM Device Projection
Everspin announced the world's first MRAM (based on STT-RAM) DDR3 device in 2013 [22]. This product is DDR3-compatible, but its capacity is only 64Mb, far below that of modern gigabit-scale DRAMs. Before detailing how to build DRAM-comparable gigabit-scale MRAM, we first project what future MRAM will look like.
We use our modified CACTI [26] and NVSim [7] to simulate a 4Gb LPDDR3 DRAM and a 4Gb LPDDR3 MRAM on a 28nm process, respectively. Table 1 lists the timing and power parameters. We verified that the estimated numbers match the actual LPDDR3 DRAM and MRAM prototypes. The major differences between DRAM and MRAM LPDDR3 devices are:
• Small page size: The MRAM page (row buffer) size is 16 times smaller than that of DRAM. This results in an unbalanced MRAM row/column address bit ratio of 18:6, versus DRAM's 14:10. The details of this constraint are discussed later in Section 3.
• Non-volatility: MRAM is non-volatile and needs no refresh. Hence, MRAM has zero tREF, tRFC, auto-refresh current (IDD5¹), and self-refresh current (IDD6).
• Non-destructive read: MRAM has smaller tRTP and can issue precharge command sooner because MRAM reads are non-destructive and do not need write-back.
• Fast page close: DRAM precharge needs to equalize the bitline pair (BL and its complement) to VDD/2, but MRAM precharge can skip this step. Therefore, MRAM precharge (tRP) is faster.
• Slow page open: An MRAM MTJ has a small on/off resistance ratio (e.g. 200%), which makes the data hard to sense. Therefore, MRAM row activation (tRCD) is slower.

¹ In auto-refresh mode, MRAM peripheral circuitry still consumes power, so IDD5 is essentially IDD2P.
CHALLENGE AND OPPORTUNITY
The underlying technology difference poses both challenges and opportunities in designing an LPDDRx-compatible MRAM.
Challenge: Small Page Size
Although most MRAM parameters have DRAM counterparts, as listed in Table 1, a key difference is MRAM's page size.
The fundamental cause of MRAM's small page size is that MRAM reads require current sensing. A current-mode S/A is much more complicated than the voltage-mode S/A used in DRAM, and it is also significantly larger. Our circuit simulation shows that an MRAM S/A is 16 times larger than a DRAM S/A. To maintain chip area utilization, MRAM has to use fewer S/As, which leads to a smaller page size (e.g. 16x smaller). The MRAM industry agrees on this point. For example, a 2012 Everspin patent [2] discloses that their MRAM page is only 512 bits, 32x smaller than a DRAM page. Unfortunately, such a small page size makes MRAM incompatible with existing memory interfaces and, worse, degrades system performance.
Unbalanced address bits: A 16x smaller page size means that a same-sized MRAM requires 4 more row address bits but 4 fewer column address bits². However, LPDDRx uses a multiplexed CA bus for both command and address, and it can only carry 20 bits per cycle. For a row activation command, 2 bits are needed for command decoding and 3 bits for the bank address, which leaves at most 15 bits for the row address. An MRAM with 256K rows, which needs 18 row address bits, is therefore unsupported. Even if more CA pins could be added to a future memory interface, MRAM's unbalanced row/column address bit ratio would still make row/column multiplexing highly inefficient.

Performance degradation: Although a smaller page size is preferred to avoid over-activation in a memory system that uses a close-page policy [28], modern mobile devices heavily adopt an open-page policy, where a smaller page size means a higher page miss rate and lower performance. Figure 3 shows the page hit rate and the performance impact of MRAM's 16x smaller page size (see Section 5 for simulation methodology details). On average, the page hit ratio decreases by 66% and the resulting performance degradation is around 10%³.
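The address-bit arithmetic above can be checked with a short sketch. The function names are illustrative, and the sketch assumes that the 20 bits per cycle come from each of the 10 CA pins transferring one bit on each clock edge; the 2 command bits, 3 bank bits, and the 14:10 DRAM split are taken from the text.

```python
import math

CA_PINS = 10      # LPDDRx CA bus width, from the text
CMD_BITS = 2      # command decoding bits in an ACT command
BANK_BITS = 3     # 8 banks per device

def row_bits_available(extra_pins=0):
    """Row address bits that one ACT command can carry.

    Assumption: each CA pin transfers 2 bits per cycle (both clock edges),
    which is how the 10-pin bus reaches the 20 bits quoted in the text.
    """
    return (CA_PINS + extra_pins) * 2 - CMD_BITS - BANK_BITS

def shrink_page(row_bits, col_bits, factor):
    """Move log2(factor) address bits from the column field to the row field."""
    shift = int(math.log2(factor))
    return row_bits + shift, col_bits - shift

# DRAM uses a 14:10 row/column split; the MRAM page is 16x smaller.
mram_row_bits, mram_col_bits = shrink_page(14, 10, 16)
overflow = mram_row_bits - row_bits_available()

print(f"MRAM row/column bits: {mram_row_bits}:{mram_col_bits}")
print(f"Rows addressed: {2 ** mram_row_bits // 1024}K")
print(f"Row bits available in ACT: {row_bits_available()}")
print(f"Overflowing row bits: {overflow}")
print(f"Fits with 2 extra CA pins: {row_bits_available(2) >= mram_row_bits}")
```

Under these assumptions the MRAM needs 18 row bits while an ACT command carries only 15, so 3 row bits do not fit unless pins are added or the bits are carried elsewhere.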
Opportunity: Non-destructive Read
DRAM reads are destructive and need data restoration, so the DRAM S/A in an active bank always connects to the bitlines and serves as a row buffer. In contrast, MRAM reads are non-destructive. In other words, we can treat each row buffer as a copy of the original MRAM row data. This extra redundancy allows us to use the MRAM row buffer more aggressively, which is a unique opportunity for MRAM performance optimization.
SOLUTIONS
To overcome the drawback and leverage the advantage of MRAM, we propose three optimization techniques in turn.
ComboAS: Balance Row/Column Address
As explained in Section 3.1, the LPDDRx interface only carries 15 row address bits, but our targeted MRAM has a highly skewed row/column bit ratio (e.g. 18:6). Adding two more CA pins can temporarily solve this problem, but it requires an industry-wide PHY redesign. Worse, it implies that such MRAM is not DRAM-swappable and rules out any mixed use of DRAM and MRAM.
Therefore, the first technique we propose is Combinational Row/Column Address Strobe (ComboAS), and its goal is to rebalance the address bits carried by RAS (row access strobe, e.g. ACT command) and CAS (column access strobe, e.g. READ and WRITE commands). The basic concept is straightforward: offloading the overflowed row address from RAS to CAS.
Since we split the row address between RAS and CAS, ComboAS needs both commands before activating a new row. Consequently, instead of waiting for tRCD, we issue a CAS command immediately after every RAS command. Figure 4 compares the timing diagram of ComboAS against the incompatible solution of adding 2 more CA pins. In ComboAS, the actual row activation is delayed by 1 cycle to wait for the remaining row address bits from CAS; read or write accesses are delayed by tRCD0 (the original row activation delay). Table 2 lists the detailed adjustments.
To implement ComboAS, the modifications include:

MRAM device: We need three minor changes. (1) A new register to hold the partial row address bits carried by ACT, which are later combined with the remaining row bits from READ or WRITE commands. (2) A signal generator to latch the remaining row address from the first-arrived CAS command. (3) A small register set to temporarily hold column addresses. This is needed because the latency to activate a row (tRCD0) is larger than the minimum delay between two column commands (tCCD0), so multiple CAS commands might arrive during a new row activation. In this work, the size of this register set is 3 (i.e. ⌈tRCD0/tCCD0⌉).

Memory controller: Only a small latch (4-bit in this work) after the PHY interface is needed to temporarily hold the extra row address bits from the RAS command and later deposit them into the next CAS command. The second and later CAS commands do not need to carry any row address bits. In addition, the memory timing parameters are adjusted according to Table 2.

Figure 4 shows that ideally ComboAS only causes a 1-cycle delay. However, that is not always true. ComboAS unconditionally adds tRCD0 on top of every tRL, tWL, and tRTP to avoid internal bus conflicts, which is unnecessary for non-back-to-back accesses. For those non-ideal cases, we define a metric, bubble, to indicate the difference between the actual interval and the minimum interval⁴. ComboAS can be further improved if bubbles exist.
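The row-address split behind ComboAS can be illustrated with a minimal sketch. The class and function names are hypothetical, and the choice of sending the upper 15 bits in ACT and the lower 3 bits in the first CAS is an assumption; the text does not say which bits travel in which command.

```python
ROW_BITS = 18                              # MRAM row address width (18:6 split)
ACT_ROW_BITS = 15                          # row bits an ACT (RAS) command can carry
OVERFLOW_BITS = ROW_BITS - ACT_ROW_BITS    # bits offloaded to the first CAS

def split_row(row):
    """Controller side: split a row address into the ACT part and the CAS part.

    Assumption: the upper bits go in ACT and the lower bits in the first CAS.
    """
    if not 0 <= row < (1 << ROW_BITS):
        raise ValueError("row address out of range")
    act_part = row >> OVERFLOW_BITS
    cas_part = row & ((1 << OVERFLOW_BITS) - 1)
    return act_part, cas_part

class ComboASDevice:
    """Device side: a register holds the partial row until the first CAS arrives."""

    def __init__(self):
        self.partial_row = None   # partial row address latched from ACT
        self.open_row = None      # full row address once activation can start

    def on_act(self, act_part):
        self.partial_row = act_part
        self.open_row = None

    def on_cas(self, column, cas_part=None):
        if self.open_row is None:
            # The first CAS after ACT must carry the remaining row bits.
            if self.partial_row is None or cas_part is None:
                raise RuntimeError("first CAS needs a prior ACT and the row bits")
            self.open_row = (self.partial_row << OVERFLOW_BITS) | cas_part
        # Later CAS commands reuse the already assembled row address.
        return self.open_row, column

row = 0x2ABCD
act_part, cas_part = split_row(row)
dev = ComboASDevice()
dev.on_act(act_part)
first = dev.on_cas(column=5, cas_part=cas_part)
second = dev.on_cas(column=6)
print(first, second)
```

The sketch models only the address bookkeeping; the timing adjustments of Table 2 and the column-address register set are left out.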
DynLat: Remove Unnecessary Latencies
Looking more closely at this issue, the conventional tRL0, tWL0, and tRTP0 are all static values determined by the memory hardware limitations, whereas the new tRL, tWL, and tRTP parameters in ComboAS become variable because they include the row activation latency (tRCD0), which only needs to be paid once per opened row. We can deduct this tRCD0 overhead from tRL/tWL/tRTP if we find bubbles on the command bus. This observation leads us to a dynamic timing parameter setting in which tRL, tWL, and tRTP are adjustable on the fly. We call this technique Dynamic Latency (DynLat).

To demonstrate the idea and the benefit of DynLat, we use Figure 5 as an example. The differences between the timing diagrams without and with DynLat are:
• Accesses R1 and R2 are back-to-back. To avoid an internal hardware conflict in the memory chip, R2 in both ComboAS and DynLat has the original tRL setting (tRL0+tRCD0).
• Accesses R2 and R3 are not back-to-back. In ComboAS, the tRL of R3 remains tRL0+tRCD0, which causes bubbles on both the internal command bus and the data bus (the bubble between DATA2 and DATA3).
In DynLat, the bubble on the data bus is eliminated by setting the tRL of R3 to max(tRL − bubbleLength, tRL0). By keeping tRL no smaller than tRL0, we ensure that the command meets the internal hardware constraint of the memory chip; by subtracting bubbleLength, we guarantee that the bubble is removed.

We track the accumulated bubble length (ABL) of each memory rank. We reset ABL to 0 upon every ACT command and accumulate the bubble length upon every READ or WRITE command according to Equation 1.

$$\mathrm{ABL}' = \mathrm{ABL} + (\mathrm{curCycle} - \mathrm{lastCmdCycle}) - \mathrm{minReqDelay} \tag{1}$$

ABL keeps increasing during a page open cycle. In practice, we can bound the ABL value by tRCD0 to reduce counter overhead. Based on ABL, we then calculate the new tRL, tWL, and tRTP,

$$\begin{aligned}
\mathrm{tRL} &= \max(\mathrm{tRL0} + \mathrm{tRCD0} - \mathrm{ABL},\ \mathrm{tRL0}) \\
\mathrm{tWL} &= \max(\mathrm{tWL0} + \mathrm{tRCD0} - \mathrm{ABL},\ \mathrm{tWL0}) \\
\mathrm{tRTP} &= \max(\mathrm{tRTP0} + \mathrm{tRCD0} - \mathrm{ABL},\ \mathrm{tRTP0})
\end{aligned} \tag{2}$$

where tRL0, tWL0, tRTP0, and tRCD0 are the original timing parameters defined by the memory device.

To implement DynLat, the hardware changes we need are:

MRAM device: Since DynLat introduces variable read and write latencies, the memory device must track the latest tRL and tWL so that it can return the data for a read or latch the data for a write at the correct cycle. For this purpose, we add a new component called TimeCtrl to each memory device, as shown in Figure 6. TimeCtrl tracks the ABL value and updates the timing parameters of the device's internal signal delaying circuitry according to Equation 2. If a memory rank contains multiple memory devices, their TimeCtrl logic operates in lockstep.
Memory controller: The same TimeCtrl logic is duplicated in the memory controller so that the optimized command intervals can be correctly generated by the controller. The number of copies equals the rank count of the memory subsystem. For example, in Figure 6, we instantiate two TimeCtrl units in the memory controller for a 2-rank configuration.
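The ABL bookkeeping of DynLat can be sketched as a small per-rank model. This is an illustration rather than the authors' TimeCtrl design: the timing values are hypothetical placeholders (not taken from Table 1), the clamp of negative bubbles to zero is an added guard, and capping ABL at tRCD0 follows the "in practice" remark in the text.

```python
class TimeCtrl:
    """Per-rank tracker of the accumulated bubble length (ABL)."""

    def __init__(self, tRL0, tWL0, tRTP0, tRCD0):
        self.tRL0, self.tWL0, self.tRTP0, self.tRCD0 = tRL0, tWL0, tRTP0, tRCD0
        self.abl = 0
        self.last_cmd_cycle = None

    def on_act(self, cycle):
        # ABL is reset to 0 upon every ACT command.
        self.abl = 0
        self.last_cmd_cycle = cycle

    def on_column_cmd(self, cycle, min_req_delay):
        # Equation 1: ABL' = ABL + (curCycle - lastCmdCycle) - minReqDelay
        bubble = (cycle - self.last_cmd_cycle) - min_req_delay
        bubble = max(bubble, 0)                          # guard (assumption)
        self.abl = min(self.abl + bubble, self.tRCD0)    # cap at tRCD0
        self.last_cmd_cycle = cycle
        return self.latencies()

    def latencies(self):
        # Latencies shrink by ABL but never drop below the device minimum.
        def adjust(base):
            return max(base + self.tRCD0 - self.abl, base)
        return {"tRL": adjust(self.tRL0),
                "tWL": adjust(self.tWL0),
                "tRTP": adjust(self.tRTP0)}

# Hypothetical timing values, in cycles.
ctrl = TimeCtrl(tRL0=12, tWL0=6, tRTP0=6, tRCD0=10)
ctrl.on_act(cycle=0)
r1 = ctrl.on_column_cmd(cycle=1, min_req_delay=1)    # CAS right after ACT
r2 = ctrl.on_column_cmd(cycle=5, min_req_delay=4)    # back-to-back with R1
r3 = ctrl.on_column_cmd(cycle=15, min_req_delay=4)   # a 6-cycle bubble before R3
r4 = ctrl.on_column_cmd(cycle=30, min_req_delay=4)   # ABL saturates at tRCD0
print(r1["tRL"], r2["tRL"], r3["tRL"], r4["tRL"])
```

With these placeholder values, the back-to-back accesses keep the full tRL0+tRCD0 latency, the access after a bubble gets a shorter tRL, and once ABL reaches tRCD0 the latency settles at tRL0.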
EarlyPA: Leverage Non-destructive Read
DynLat can remove the unnecessary latency and alleviate the performance drop brought by ComboAS, but it cannot mitigate the performance drop caused by the reduced page hit ratio (as shown in Figure 3), which is another side effect of the small MRAM page size. As discussed in Section 3.2, non-destructive MRAM reads give us an opportunity to improve performance. Thus, we devise our third optimization technique, Early Precharge/Activation (EarlyPA), in which we decouple sense amplifiers and row buffers so that bitlines can be precharged right after data sensing and the next ACT command can be issued earlier.
To implement EarlyPA, we first decouple the "data latching" out of a normal MRAM S/A by extracting the last stage amplifier and evolving it into a full SRAM cell. The decoupled row buffer allows bitlines to be early-precharged during the buffer column accesses. An example is illustrated in Figure 7 :
• Time slot 1: Upon the first ACT, the S/A starts sensing, and a self-precharge counter starts counting down from tRCD0.
• Time slot 2: The counter triggers a bitline self-precharge (an internal PRE command⁵) after tRCD0. The S/A finishes data sensing, and the row buffer holds a copy of the data.
• Time slot 3: When the second ACT arrives, bitlines and S/As are ready for another row activation (row1). At the same time, all the column read accesses to row0 keep proceeding from the row buffer to I/Os.
EarlyPA improves read performance by issuing PRE and ACT for the next row in advance. Write accesses are handled as follows. If a write arrives before the self-precharge is internally issued, we postpone the self-precharge by updating the self-precharge counter, which takes advantage of the unfinished row activation cycle. If it arrives after the self-precharge, then besides resetting the self-precharge counter, we need to turn on the corresponding wordline again to write the data, which adds a small latency overhead (3 cycles in this work).
The EarlyPA implementation can be transparent to the memory controller and only requires some timing parameter manipulations. The controlling policy for column access commands (READ/WRITE) remains the same. As shown in Figure 7 , we modify two precharge-related timing parameters:
• tRAS (activation-to-precharge) of EarlyPA is tRCD0+tRP0.
• tRP (precharging time) is set to 1 so that the next ACT command can be issued immediately when the self-precharge is finished.

Figure 8: Normalized IPC of main memory system with each technique: ComboAS as the baseline, Unlimited-pin, DynLat, EarlyPA, and DRAM systems.

To implement EarlyPA, the hardware modifications are:

MRAM device: Devices ignore all PRE commands from the memory controller, as EarlyPA automatically precharges the bitlines in advance. Instead, a self-precharge counter is added to the control logic of each memory device. The counter is set to tRCD0 after every ACT command. After every WRITE command, it is reset to tWL+BL/2+tWR if the counter has not yet reached zero when the WRITE arrives, or to 3+tWL+BL/2+tWR if it has. Furthermore, the memory device skips precharge-related timing rules (e.g. tRTP checking) except for tRAS checking, as S/As and row buffers are decoupled in EarlyPA mode.
Memory controller: Symmetrically, the memory controller manipulates the timing parameters in the same way as memory devices do. An additional modification to the memory controller is the write-to-precharge latency control: after issuing a WRITE command, the minimum required delay for the next PRE command is tWL+BL/2+tWR+tRP0 instead of tWL+BL/2+tWR.
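The self-precharge counter described above can be modeled with a short sketch. The class name and the timing values are hypothetical placeholders, and the model covers only the counter rules stated in the text (set on ACT, reset on WRITE with or without the 3-cycle wordline re-open overhead), not the full device control logic.

```python
class SelfPrechargeCounter:
    """Cycle-level model of the EarlyPA self-precharge counter."""

    REOPEN_PENALTY = 3   # cycles to turn the wordline back on (from the text)

    def __init__(self, tRCD0, tWL, tWR, BL):
        self.tRCD0, self.tWL, self.tWR, self.BL = tRCD0, tWL, tWR, BL
        self.remaining = 0
        self.precharged = True   # bitlines start out precharged

    def on_act(self):
        # The counter is set to tRCD0 after every ACT command.
        self.remaining = self.tRCD0
        self.precharged = False

    def on_write(self):
        base = self.tWL + self.BL // 2 + self.tWR
        if self.precharged:
            # WRITE arrived after the self-precharge: re-open the wordline.
            self.remaining = self.REOPEN_PENALTY + base
            self.precharged = False
        else:
            # WRITE arrived before the self-precharge: postpone it.
            self.remaining = base
        return self.remaining

    def tick(self):
        """Advance one cycle; return True when the internal PRE fires."""
        if self.precharged:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.remaining = 0
            self.precharged = True
            return True
        return False

# Hypothetical timing values, in cycles.
ctr = SelfPrechargeCounter(tRCD0=10, tWL=6, tWR=8, BL=8)
ctr.on_act()
fired = [ctr.tick() for _ in range(10)]   # internal PRE fires on the 10th cycle
late_write = ctr.on_write()               # WRITE after the self-precharge

ctr2 = SelfPrechargeCounter(tRCD0=10, tWL=6, tWR=8, BL=8)
ctr2.on_act()
for _ in range(3):
    ctr2.tick()
early_write = ctr2.on_write()             # WRITE before the self-precharge
print(fired, late_write, early_write)
```

In this model a late WRITE costs the extra 3 cycles, while an early WRITE only postpones the self-precharge, which mirrors the two cases described for EarlyPA.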
EXPERIMENTS
To quantify the performance and energy improvement achieved by our techniques, we detail our simulation methodology and experiment results in this section.
Simulation Methodology
We model a 2GHz out-of-order ARMv7 microprocessor using our modified version of gem5 [3] . DRAMSim2 [23] is integrated and modified to model the main memory system. Open-page policy with FR-FCFS [21] scheduling is accurately modeled.
We use the parameters in Table 1 to simulate DRAM and MRAM and use an open-page policy with row-interleaving to maximize memory-level parallelism. More details of the simulation settings are provided in Table 3. The memory-intensive benchmarks are selected from SPEC 2006 [24], EEMBC 2.0 [9], and HPEC [11]. We fast-forward each simulation to the pre-defined breakpoint at the code region of interest, warm up for 10 million instructions, and simulate for at least 1 billion instructions.

Performance speedup

Figure 8 shows the performance speedup of the MRAM system with each proposed technique. The DRAM system performance is also provided for comparison. We use ComboAS, which has the worst performance, as the baseline. The second bar is the performance of an impractical implementation in which 2 more pins are added (referred to as Unlimited-pin in the chart). Compared to Unlimited-pin, Figure 8 shows that the performance of ComboAS is degraded by 5% on average. However, after adopting DynLat, the system performance recovers by 3% on average (up to 14%) and is comparable to Unlimited-pin in most cases. In addition, by leveraging the MRAM non-destructive read with EarlyPA, we improve the performance further. The total performance speedup is 14% (up to 36%). Generally, memory-intensive workloads benefit more from our proposed techniques because more efficient memory accesses translate into larger system performance improvements.

After adopting all the proposed techniques, the overall MRAM performance is competitive with its DRAM counterpart (about 98%).
Energy consumption analysis
While the performance of our optimized MRAM system is similar to that of the conventional DRAM system, the decisive advantage is the energy saving. Figure 9 compares the energy consumption of the DRAM and MRAM systems, in which each value is divided into refresh energy, burst energy, activation/precharge energy, and background peripheral circuit energy⁶. The energy overhead of each proposed technique is also included.
Compared to DRAM, the MRAM-based system consumes zero refresh energy because of its non-volatility, and this is the major source of the MRAM energy saving. However, as shown in Table 1, the read/write energy of MRAM is larger than that of DRAM because MRAM has a smaller sense margin and its memory cell is difficult to write. Thus, the MRAM burst energy is usually larger than the DRAM burst energy. The energy consumed by peripheral circuits is similar between DRAM and MRAM because we do not apply any circuit optimization in this work. We note, however, that the peripheral energy of MRAM can be further reduced if MRAM is allowed to enter power-collapse mode frequently during the idle state.

Figure 9 shows that the ComboAS MRAM system reduces the total energy consumption by 17% on average compared to the DRAM system. After adopting the DynLat and EarlyPA techniques, the energy consumption of the MRAM system is further reduced by 4% on average, since the performance is increased and the total execution time is reduced. Considering the comparable performance and lower energy consumption, MRAM is an attractive candidate for building the main memory system.
RELATED WORK
Many previous works are focused on increasing DRAM energy efficiency by re-designing DRAM organization architecture [1, 28, 29] . Others are focused on improving DRAM performance [14, 25] . The posted-RAS scheme [28] is the most similar work to our ComboAS technique as we both issue RAS and CAS commands back-to-back. However, posted-RAS only works for close-page policy where there is only one CAS command after opening one row, and it does not optimize the timing parameters to mitigate the performance overhead.
Other work has aimed at NVM optimizations. PCM has been studied as a main memory candidate, and several techniques have been proposed to reduce its energy overhead and improve its performance [4, 5, 30]. Lee et al. [16] proposed separating sense amplifiers and row buffers in PCM-based main memory, but unlike EarlyPA, their technique did not issue precharge commands in advance to hide the latency. Meza et al. [18] and Emre et al. [15] also evaluated MRAM as a main memory alternative, but they did not discuss how to build a compatible interface. Everspin [22] demonstrates a DDR3-compatible MRAM, but its capacity is so small (i.e. 64Mb) that the pin-compatibility problem is naturally hidden. Also, Everspin does not optimize for MRAM performance.
CONCLUSION
The shift from PCs to mobile devices demands low-power memory solutions, and non-volatile memories such as MRAM are promising candidates. Compared to DRAM, MRAM has unique features such as a small page size and non-destructive reads. The smaller page size makes it challenging to design commodity MRAM that can be deployed on the same LPDDR interface as DRAM, and it can degrade performance in mobile systems where the page hit ratio is important. In this work, we propose three techniques: ComboAS and DynLat to solve the DRAM-compatibility issue, and EarlyPA to further improve performance. Combined, our solution enables a commodity MRAM on the LPDDR3 interface with substantially improved performance (14% on average and up to 36%). It gives LPDDR3 MRAM competitive performance while consuming 20% less energy than LPDDR3 DRAM. The proposed architecture is a step toward future energy-efficient memory design.
Acknowledgments
This research was funded by NSF grants 1218867, 1313052, 1409798, and Department of Energy under Award Number DE-SC0005026.
