Abstract: This paper describes a write-once-memory-code phase change memory (WOM-code PCM) architecture for next-generation non-volatile memory applications. Specifically, we address the long latency of the write operation in PCM, which is attributed to PCM SET, by proposing a novel PCM memory architecture that integrates the $\langle 2^2 \rangle^2/3$ WOM-code at the memory organization and memory controller levels. To further improve the write latency of WOM-code PCM, we propose a PCM-refresh approach that uses idle cycles to preemptively set PCM rows to the initial WOM-code state. Finally, to balance write latency improvements against WOM-code PCM overhead, we propose a WOM-code cached PCM (WCPCM) architecture that uses WOM-code PCM as the cache alongside conventional PCM main memory. Since WOM-code techniques inherently impact PCM endurance by increasing the number of bit-writes in comparison to unencoded PCM, we incorporate additional transitions from the $\langle 2^2 \rangle^2/3$ WOM-code transition graph to realize endurance-WOM-code (e-WOM-code) architectures. Transitions between the e-WOM-code states on writes to memory are integrated into an incremental coding for endurance (ICE) approach that exploits redundancies in the conventional WOM-code to reduce the number of bit-writes over unencoded PCM. Simulation results show that the proposed e-WOM-code PCM architecture is able to reduce memory write (read) latency by 19.8 percent (14.7 percent) and the number of bit-writes over unencoded PCM without (with) data-comparison write (DCW), a read-modify-write process that only updates changed cells, by 83.0 percent (22.1 percent) on average across general-purpose (SPEC CPU2006), embedded (MiBench), and high-performance (SPLASH-2) benchmarks. 
Further, e-WOM-code PCM with PCM-refresh can reduce memory write (read) latency by 51.5 percent (44.1 percent) and the number of bit-writes over unencoded PCM without DCW by 76.5 percent on average across the benchmarks; there is, however, an increase of 19 percent in the number of bit-writes over unencoded PCM with DCW. Finally, for just 4.7 percent memory overhead, the e-WOM-code cached PCM (e-WCPCM) architecture reduces memory write (read) latency by 47.5 percent (41.6 percent) and the number of bit-writes over unencoded PCM without DCW by 68.1 percent on average across the benchmarks; again, there is a 49 percent increase in the number of bit-writes over unencoded PCM with DCW.
Index Terms: Emerging technologies, general computer systems organization, primary memory, design styles, memory structures, hardware

INTRODUCTION
Next-generation memory systems must be capable of supporting fast read and write accesses, while also offering the memory capacity necessary to maintain large data structures [1]. For example, exascale computing services potentially require 1,000× the memory capacity of current petaflop-class systems [2]. DRAM has played a major role in supporting the demands on memory capacity and performance for decades. However, whether DRAM can be scaled below 22 nm is currently unknown [3], which makes DRAM less suitable for next-generation main memory in the big data era.
Of the next-generation non-volatile memory candidates, phase-change memory (PCM), which offers better read latency than resistive RAM (ReRAM) and higher cell density than spin-transfer-torque RAM (STT-RAM) [4], is a promising candidate to fill this scalability gap. Unfortunately, PCM is an asymmetrical read-write technology with a write latency that is much longer than that of DRAM. The write process in PCM requires strict control of the duration and strength of the programming current. Writing "0" to a PCM cell, i.e., the RESET operation, uses a short but high current pulse to program the phase change material to the amorphous state. However, writing "1", i.e., the SET operation, utilizes a 5-10× longer but lower current pulse to program the phase change material to the polycrystalline state. When a page of data is written to PCM, the write latency is determined by the long SET operation, which is usually 5-10× the read latency [1], [5], [6]. In general-purpose applications, an extensive study has already shown that the long write latency in PCM may result in as much as 61 percent performance degradation [7]. Thus, the long write latency inherent to PCM remains the biggest challenge that has to be overcome in order to realize scalable PCM memory for next-generation memory architectures. Scaling the PCM cell [8] and new phase-change material and cell designs [9], [10] have been proposed to decrease PCM write latency. Given current technology and existing phase change materials, most write-latency-aware PCM studies have focused on write scheduling [7], [11], [12], [13] and architecture-based write improvement [14], [15], [16], [17] solutions. Write scheduling solutions primarily attempt to distribute writes among idle bank cycles [7], [11], [12], [13]; however, this is not suitable for high-performance computing, where there are few, if any, idle cycles between memory accesses. 
On the other hand, architecture-based write improvement solutions rely on technology-specific features such as multi-level cells (MLCs) and PCM division write mode to reduce write latency [14] , [15] .
As a result, neither approach is general enough to address the limitation of long writes in PCM. Latency-aware coding schemes were proposed to improve PCM write latency [17], [18]. However, these approaches need to SET a minimum number of PCM bits in each write operation, limiting improvements in write latency. Note that other write-related issues such as power consumption, cell endurance, and error detection/correction have been addressed through design [19], [20], [21], [22], [23] and coding [24], [25], [26], [27] approaches in the literature. This paper describes a novel write-once-memory-code phase change memory (WOM-code PCM) architecture that integrates WOM-codes at the memory organization and memory controller levels to reduce the write latency of PCM. It extends the original contributions in [28] by explicitly addressing PCM endurance. The rest of this paragraph summarizes the contributions in [28]; the extensions to endurance are summarized in the following paragraph. First, it addresses the long latency of the write operation (attributed to PCM SET) by using an "inverted" WOM-code that transforms PCM writes so that they consist of low-latency RESET operations. Two memory organization methods are described to provision the extra memory space for the WOM-code encoded data: the wide-column method and the hidden-page method. In practice, the wide-column approach is suitable for fixed WOM-code PCM architectures that provide minimum memory controller support and demand better performance; the hidden-page approach is suitable for PCM memory with dynamic WOM-code capabilities, with the memory controller choosing between codes for flexible memory utilization and dynamic performance tradeoffs. Second, it also describes a PCM-refresh approach to further improve the performance of WOM-code PCM. It is motivated by the observation that the write occurring after the WOM-code reaches its rewrite limit is the bottleneck that gates the performance of WOM-code PCM. 
PCM-refresh periodically and opportunistically "refreshes" a PCM page that has reached the rewrite limit in idle cycles, improving the performance of WOM-code PCM. Finally, a WOM-code cached PCM (WCPCM) architecture that uses a small amount of WOM-code PCM array as the cache (WOM-cache) on top of the large conventional PCM main memory array is proposed. The WOM-cache increases memory access speed by exploiting locality in the memory access stream, requiring less memory capacity in comparison to implementing WOM-code across the entire PCM array. To the best of our knowledge, this is the first paper proposing a memory architecture that integrates WOM-codes with PCM to successfully mitigate the long write latency issue in PCM (attributed to the time-consuming SET operation).
However, the WOM-code PCM techniques described in [28] inherently impact PCM endurance by increasing the number of bit-writes in comparison to unencoded PCM. In order to address this shortcoming, we use additional transitions in the conventional $\langle 2^2 \rangle^2/3$ WOM-code graph to explicitly reduce the number of bit-writes to the memory cells (termed the endurance-WOM-code, i.e., e-WOM-code in this paper). Transitions between e-WOM-code states are integrated into an incremental coding for endurance (ICE) approach that exploits redundancies in the conventional WOM-code to reduce the number of bit-writes over unencoded PCM. First, e-WOM-code integrates additional state transitions between the WOM-code states to explicitly delay the long latency PCM SET operation. Second, ICE transforms incoming data on the memory bus to incrementally encode only those bits that have changed in comparison to previous data. Third, residual redundancies in the e-WOM-code-encoded data are used to ensure that only differential writes, i.e., changed bits, are propagated as updates to the PCM array. The reduction in the number of bit-writes, coupled with the reduction in the frequency of PCM SET operations, provides significant improvement in the endurance and memory access latency of the PCM array. Finally, a round-robin cache migration (RRCM) scheme for e-WCPCM that integrates e-WOM-code PCM and PCM-refresh with WCPCM to effectively improve the latency while mitigating the impact on endurance is also described. We extended the DRAMSim2 [29] memory simulator to evaluate the WOM-code PCM architecture across a broad set of general-purpose (SPEC CPU2006 [30]), embedded (MiBench [31]), and high-performance (SPLASH-2 [32]) computing benchmarks. 
Results show that the proposed e-WOM-code PCM architecture is able to reduce memory write (read) latency by 19.8 percent (14.7 percent) and the number of bit-writes over unencoded PCM without (with) data-comparison write (DCW) [33], a read-modify-write process that only updates changed cells, by 83.0 percent (22.1 percent) on average across the benchmarks. Further, e-WOM-code PCM with PCM-refresh can reduce memory write (read) latency by 51.5 percent (44.1 percent) and the number of bit-writes over unencoded PCM without DCW by 76.5 percent on average across the benchmarks; however, there is an increase of 19 percent in the number of bit-writes over unencoded PCM with DCW. Furthermore, for just 4.7 percent memory overhead, e-WCPCM reduces memory write (read) latency by 47.5 percent (41.6 percent), and the number of bit-writes over unencoded PCM without DCW by 68.1 percent on average across the benchmarks; again, there is a 49 percent increase in the number of bit-writes over unencoded PCM with DCW. Lifetime evaluation of the e-WOM-code PCM device shows that the lifetime can be extended by about 13.5× (1.4×) in comparison to unencoded PCM without (with) DCW. Finally, we evaluate the read and write energy for two of the proposed PCM architectures and provide comparisons to DRAM using CACTI [34] and NVSim [35].
BACKGROUND AND MOTIVATION
Phase Change Memory
PCM, an emerging non-volatile memory, is a one-transistor-one-resistor (1T1R) or one-diode-one-resistor (1D1R) memory device technology. PCM stores data by switching chalcogenide material, such as Ge$_2$Sb$_2$Te$_5$ (GST), between amorphous and polycrystalline states. These two states are characterized by different resistance levels; amorphous chalcogenide material has high resistance, usually in the MΩ range, whereas polycrystalline chalcogenide material has low resistance, usually in the kΩ range [17].
There are three primary operations integral to the use of PCM in a modern memory system: read, SET, and RESET. The read operation loads the data from the memory to the processor or the cache hierarchy. The SET (RESET) operation writes the bit "1" ("0") to the memory cell, i.e., the SET (RESET) operation changes the state of the chalcogenide material in the cell to polycrystalline (amorphous) state. Each of the three operations has its own associated latency and this is discussed in the following paragraphs.
A PCM cell can be read by either applying a small voltage across the cell and sensing the current flow, or by injecting a small current through the cell and sensing the voltage across the cell. Due to the large difference between the two resistance levels of the chalcogenide material, the sensed current/voltage of these two states differ by three or more orders of magnitude. The latency of the read operation in PCM cells is typically tens of nanoseconds [4] .
In the write operation, the programming circuit applies different heat-time profiles to switch cells from one state to the other. To RESET a PCM cell, a strong programming current pulse of short duration is required. The temperature of the chalcogenide material is raised by this programming pulse. After the chalcogenide material reaches its melting point, the programming pulse is quickly terminated. Subsequently, the small region of melted material cools quickly, leaving the chalcogenide material programmed in the amorphous state. Since the region of the melted chalcogenide material is small, the required duration of the RESET programming pulse is short, about tens of nanoseconds. Thus, the RESET latency of PCM is typically comparable to its read latency [4] , [11] .
In contrast, to SET a PCM cell, a long programming current pulse that is weaker than the RESET programming current is applied to program it from the amorphous state to the polycrystalline state. In the SET operation, the temperature of the chalcogenide material should be raised above its crystallization temperature but below the melting point for a sufficient amount of time. As the crystallization rate is a function of temperature and since there is variability between PCM cells in an array, reliable crystallization requires a programming pulse of hundreds of nanoseconds [11]. Therefore, the SET latency of a PCM cell is longer than both the RESET latency and the read latency, usually 5-10× the read latency [5], [6].
From a system perspective, in a typical PCM write, hundreds of bits in a memory line need to be programmed. It is highly likely that both RESET and SET operations occur in such a write. Thus, the write latency is determined by the slower of the two operations, which is the SET operation. Therefore, the write latency in PCM memory is longer than the read latency.
Write-Once-Memory
Memory technologies whose storage cells transition irreversibly between states, e.g., punch cards and optical discs, have been common since the beginning of data storage technology. Most of these technologies can change the state "0" to state "1", but not vice versa. The "write-once" nature largely limits the application space of such memory technologies. To overcome the write-once nature and explore the true capabilities of such write-once memories (WOMs), Rivest and Shamir proposed a model to efficiently reuse WOMs [36]. They showed that for a fixed k, it is possible to allow t writes of k-bit data using only $t + O(t)$ bits of WOM memory.
In the WOM model, a WOM is an array of "write-once bits", i.e., wits, which are manufactured or predefined in a "0" state. Each wit can be transformed into the "1" state independently but irreversibly. A WOM-code is a coding scheme that is able to write the WOM multiple times. A WOM-code can be defined as follows: a "$\langle v \rangle^t/n$ WOM-code" is a coding scheme that uses n wits to represent one of v values so that the WOM can be written a total of t times. We use a simple $\langle 2^2 \rangle^2/3$ WOM-code as a typical example to explain the basic idea of rewriting wits using Table 1. In the first write (WOM$_1$), any 2-bit value x is represented with the pattern $r(x)$ given in Table 1. Then, when another 2-bit value y ($x \neq y$) is written into the same 3-wit WOM, it is represented with pattern $r'(y)$ (WOM$_2$). Observe that for any 2-bit value, the representation in the second write only changes wits from state "0" to state "1". For decoding this WOM-code, the pattern $r(x) =$ "abc" written in WOM can be decoded to 2-bit data
With this WOM-code, a set of 3-wit WOM can write 2-bit data twice.
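The two-write behavior described above can be illustrated with a minimal Python sketch. It assumes that Table 1 lists the classical Rivest-Shamir patterns (the table is not reproduced here, so the pattern values below are an assumption), and the names `FIRST`, `SECOND`, `decode`, and `write` are hypothetical helpers introduced only for illustration.

```python
# Assumed first- and second-write patterns of the classical Rivest-Shamir
# code; Table 1 of the paper may differ.
FIRST = {"00": "000", "01": "100", "10": "010", "11": "001"}
SECOND = {"00": "111", "01": "011", "10": "101", "11": "110"}


def decode(wits: str) -> str:
    """Decode a 3-wit pattern "abc" to 2-bit data (b XOR c, a XOR c)."""
    a, b, c = (int(w) for w in wits)
    return f"{b ^ c}{a ^ c}"


def write(wits: str, value: str) -> str:
    """Return the new pattern storing `value`, changing wits only 0 -> 1."""
    if decode(wits) == value:
        return wits  # the value is already stored; nothing to program
    for table in (FIRST, SECOND):
        target = table[value]
        # A pattern is reachable only if no wit would go from 1 back to 0.
        if all(not (old == "1" and new == "0") for old, new in zip(wits, target)):
            return target
    raise ValueError("rewrite limit reached; the wits must be erased first")


state = write("000", "01")   # first write
state = write(state, "10")   # second write, only 0 -> 1 transitions
```

In this sketch a third write of a different value raises an error, which mirrors the rewrite limit of two for this code.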
Most recently, flash memories based on floating-gate cells have become a very promising storage technology, due to their high data density, fast read speed, and physical robustness [37] . However, the programming mechanism in flash memories causes problems similar to that of WOMs. The SET operation in flash is relatively simple and fast, realized by injecting charge into the flash cell. But removing the charge, i.e., the RESET operation, is difficult and requires the removal of the entire charge from a large block of cells, i.e., block erasure, which is very time consuming. This programming mechanism makes it easier to change a bit from "0" to "1" (versus a change from "1" to "0") in flash memories. Other WOM-code schemes, such as rank-modulation (primarily proposed for mitigating charge-leakage) for flash-memories, have shown smaller write times due to the use of a single "push-to-top" operation instead of precise incremental writes [38] , [39] . With WOM-codes, the block of bits in flash can be rewritten multiple times without a RESET operation.
The asymmetric write latency problem in PCM is analogous to that of flash. Even though the PCM cell is able to change from "0" to "1", the performance cost of doing so is 5-10× higher than the cost of changing from "1" to "0". Using an inverted WOM-code in the PCM cell so that the rewrite to the PCM cell consists of only low latency RESET operations can reduce the write latency in PCM. Since various WOM coding schemes for flash and WOMs have been studied intensively in the literature, this paper focuses exclusively on the implementation challenges of the $\langle 2^2 \rangle^2/3$ WOM-code in PCM at the architecture level; the integration of other existing WOM-codes into an equivalent PCM framework remains an area for future research. Note that the only work applying WOM-codes to PCM focuses on energy reduction [40] and is not directly related to the work described in this paper.
PCM versus DRAM for Main Memory
DRAM has dominated main memory technology for the past few decades. However, as process technology moves to smaller geometries, DRAM scaling faces challenges. DRAM stores data in a capacitor that is accessed by a transistor. As the size of this capacitor decreases with scaling, charge storage and sensing become increasingly unreliable. DRAM scaling obstacles below 22 nm have driven researchers to find alternate memory technologies to replace DRAM. In recent years, PCM has emerged as a suitable replacement candidate for DRAM [4]. Whereas PCM offers scaling and density advantages, it suffers from higher power/energy requirements and higher access latencies. Unlike DRAM, PCM also suffers from asymmetric write times (writing a "1" takes more time than writing a "0"). PCM prototypes in 20 nm technology have been successfully demonstrated [6], [7], and the scaling potential of PCM to 5 nm [8], [9] has also been demonstrated. In addition, the PCM cell size is about 70 percent of that of a DRAM cell [6], which translates to improved memory capacity. In Section 6, we evaluate and compare the read/write latency and the power/energy requirements of all the PCM architectures proposed in this paper with DRAM.
WOM-CODE PCM AND PCM-REFRESH
In this section, we first discuss the implementation of WOM-code PCM and its benefits. We then describe PCM-refresh, which achieves faster write performance in WOM-code PCM.
WOM-Code in PCM
In the conventional WOM-code, wits are all preset to the "0" state. In each write, wits can only be programmed from "0" to "1", i.e., the SET operation. However, in PCM, the SET operation has longer latency than the RESET operation. Thus, the basic principle in WOM-code PCM is that wits in PCM can only be programmed from "1" to "0". There are two approaches to apply the conventional WOM-code to PCM: the first is to connect an inverter to each bitline of the sense amplifier/write driver in the PCM array, as shown in Fig. 1a; the other is to use an inverted WOM-code. An example of the inverted WOM-code is shown in Fig. 1b. In each of the rewrites in this WOM-code, wits either stay unchanged or are programmed from state "1" to state "0". Both methods are compatible with any existing WOM-code. Since the inverted WOM-code can be generated off-line in advance, the runtime complexity of the two methods is identical. However, the inverted WOM-code method is preferable in practice, since it does not require any additional inverters within the row buffer.
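As a rough illustration of the inverted WOM-code idea, the sketch below complements every pattern of the classical Rivest-Shamir code and checks that a rewrite then needs only "1" to "0" (RESET) transitions. The pattern tables are an assumption (Fig. 1b is not reproduced here and may differ), and `invert`, `only_resets`, and `decode` are hypothetical names.

```python
# Assumed conventional patterns (classical Rivest-Shamir code).
FIRST = {"00": "000", "01": "100", "10": "010", "11": "001"}
SECOND = {"00": "111", "01": "011", "10": "101", "11": "110"}


def invert(pattern: str) -> str:
    """Complement every wit so that the erased state becomes all ones."""
    return "".join("1" if w == "0" else "0" for w in pattern)


# The inverted code can be generated off-line from the conventional one.
INV_FIRST = {value: invert(p) for value, p in FIRST.items()}
INV_SECOND = {value: invert(p) for value, p in SECOND.items()}


def only_resets(old: str, new: str) -> bool:
    """True if going from `old` to `new` never programs a wit from 0 to 1."""
    return all(not (o == "0" and n == "1") for o, n in zip(old, new))


def decode(wits: str) -> str:
    """XOR decoding; complementing both operands leaves each XOR unchanged,
    so the same rule decodes conventional and inverted patterns."""
    a, b, c = (int(w) for w in wits)
    return f"{b ^ c}{a ^ c}"


# Every first write from the all-ones state and every rewrite to a
# different value uses RESET operations only.
rewrites_ok = all(
    only_resets("111", INV_FIRST[x]) and only_resets(INV_FIRST[x], INV_SECOND[y])
    for x in FIRST for y in FIRST if x != y
)
```

Under these assumptions `rewrites_ok` evaluates to true, which is the property the inverted code is meant to provide.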
Compared to the original data pattern, WOM-code utilizes extra bits to support multiple writes. For example, to encode the data of a page (e.g., 4 KB) with the $\langle 2^2 \rangle^2/3$ WOM-code, the memory controller should be able to manage a 0.5× increase in page size (i.e., to 6 KB) for every rewrite. There are two approaches to integrate WOM-code in the memory architecture: the wide-column WOM-code PCM architecture and the hidden-page WOM-code PCM architecture. We propose the wide-column WOM-code PCM architecture for PCM in this paper. The hidden-page WOM-code PCM architecture is left as an area for future research.
Wide-column WOM-code PCM architecture (Fig. 2). In the conventional memory organization, memory cells are organized in a memory array with X rows, Y columns per row, and Z bits per column. In a memory architecture with data width D (e.g., D = 64), D/Z devices are organized together for memory access in a rank. In a memory access, the memory address is decoded as the row address and the column address. The row select signal is enabled based on the row address, resulting in the connection of the row buffer (a buffer of size YZ bits) to the selected row. The column select signal is also enabled based on the column address, and the given column of data (of width Z bits) is either output to the data-out buffer or input from the data-in buffer. To accommodate WOM-code-encoded data in the memory array, the width of a column, i.e., Z, should be increased. Given that the $\langle 2^2 \rangle^2/3$ WOM-code is used, the width of a column should be increased to 1.5Z. Accordingly, the total number of bits in a row should be increased to 1.5YZ. Similar modifications are needed for the row buffer and the data-in/data-out buffers. In this memory architecture, memory data is encoded in the unit of a column. The encoded column data is stored in 1.5Z consecutive bits in the memory array.
PCM-Refresh
In the WOM-code PCM architecture presented above, wits are programmed based on the WOM-code scheme before they reach the rewrite limit, resulting in PCM write latency that is as short as PCM RESET latency. However, once the wits in a given page reach the rewrite limit, the next write to this page follows the pattern of the first write in this scheme. This requires both the SET and the RESET operation. Thus, the write latency of the WOM-code PCM on the first write pattern is still as long as the write latency in conventional PCM.
Assume that a k-rewrite WOM-code scheme is used in PCM with a RESET latency of L and a SET latency of SL, where $S \geq 1$ is the slowdown factor. The PCM wit can only be rewritten k times. Thus, the total write latency of any k consecutive writes in WOM-code PCM is $(k-1)L + SL$. For classical unencoded PCM without the WOM-code, the write latency of any k consecutive writes is $kSL$. The write latency of k-rewrite WOM-code PCM relative to unencoded PCM is thus at best $(k-1+S)/(kS)$, i.e., the performance improvement is bounded by a factor of $kS/(k-1+S)$. A higher limit on the number of rewrites increases this upper bound. However, a WOM-code with a higher limit on the number of rewrites incurs a larger memory overhead.
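The latency expressions above can be checked numerically with a small Python sketch. It computes the latency ratio $(k-1+S)/(kS)$ and its reciprocal, the speedup bound; the function names and the example values of k and S are hypothetical and chosen only for illustration.

```python
def wom_total_latency(k: int, L: float, S: float) -> float:
    """Total latency of k consecutive writes in k-rewrite WOM-code PCM:
    (k - 1) RESET-only writes plus one write that includes SET."""
    return (k - 1) * L + S * L


def unencoded_total_latency(k: int, L: float, S: float) -> float:
    """Total latency of k consecutive writes in unencoded PCM."""
    return k * S * L


def latency_ratio(k: int, S: float) -> float:
    """WOM-code latency relative to unencoded PCM: (k - 1 + S) / (k S)."""
    return (k - 1 + S) / (k * S)


def speedup_bound(k: int, S: float) -> float:
    """Reciprocal of the latency ratio; approaches S as k grows."""
    return (k * S) / (k - 1 + S)


# Illustrative values only: a 2-rewrite code and a slowdown factor of 8.
example_speedup = speedup_bound(2, 8.0)
```

With these example values the bound is 16/9, and it grows toward S as the rewrite limit k increases, in line with the trade-off against memory overhead noted above.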
Another way to improve this bound is to reduce the latency of the first write after the rewrite limit is reached (termed the a-write in this paper). We propose PCM-refresh, which uses idle cycles to hide the long latency of the a-write by periodically and opportunistically refreshing the pages that are at the rewrite limit. Idle cycles in this paper refer to those clock cycles without active accesses or operations going on in a given memory bank. PCM-refresh performed during these idle cycles does not block other normal memory accesses or impact memory performance. PCM-refresh is inspired by the DRAM refresh operation, which is necessary for data readout and restoration in DRAM. In PCM-refresh, the PCM controller sends a PCM-refresh command to "refresh" a page in all banks in a target rank. The PCM controller periodically checks the status of all ranks in the PCM memory, and picks a target rank from the pool of idle ranks in round-robin fashion. The target row address of a given bank is given by a row address table that is implemented in the PCM controller. Each entry of this row address table is a row address buffer of a bank, which records the five most recent pages that have reached the rewrite limit. After the PCM-refresh command is issued, row addresses are sent to all banks in the target rank, and the data is read out and written back into the row following the first write pattern of the WOM-code. The entire PCM-refresh operation is handled in burst mode. Hence, the latency of PCM-refresh is $t_{WR} + (N_{bank} \times L_{burst}/2)$, where $t_{WR}$ is the latency of the normal write in PCM, $N_{bank}$ is the number of banks in a rank, and $L_{burst}$ is the burst length. Note that $L_{burst}/2$ is the data burst duration in the DDR3 standard. Since $L_{burst}$ is much smaller than $t_{WR}$, batch processing PCM-refresh operations of multiple banks in burst mode can reduce the performance impact of PCM-refresh. 
To further reduce the impact of PCM-refresh on memory access blocking, we combine write pausing [7] with PCM-refresh. Write pausing prioritizes writes and reads that access banks undergoing a PCM-refresh by enabling them to preempt the ongoing PCM-refresh operation. PCM-refresh has limited impact on energy consumption. The energy consumption of PCM-refresh is equal to the energy consumption of a single row read followed by a single row write.
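The PCM-refresh latency expression and the per-bank row address buffer described above can be sketched in Python as follows. The function and variable names are hypothetical, the cycle counts in the example are illustrative rather than measured values, and a bounded deque is only one possible way to model a buffer that keeps the five most recent pages.

```python
from collections import deque


def pcm_refresh_latency(t_wr: float, n_bank: int, l_burst: int) -> float:
    """Latency (in cycles) of one burst-mode PCM-refresh over all banks of a
    rank: t_WR + N_bank * L_burst / 2."""
    return t_wr + n_bank * l_burst / 2


def new_row_address_buffer() -> deque:
    """Per-bank buffer holding the five most recent rows at the rewrite limit.
    Older entries are dropped automatically when a sixth row is recorded."""
    return deque(maxlen=5)


# Illustrative numbers only: 150-cycle write, 32 banks, burst length 8.
example_latency = pcm_refresh_latency(150, 32, 8)

buffer = new_row_address_buffer()
for row in range(7):          # rows 0..6 reach the rewrite limit in order
    buffer.append(row)
```

In this example the burst term contributes 128 cycles on top of a single write latency, which illustrates why batching the banks of a rank in burst mode is cheaper than paying a full write latency per bank.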
The "refreshed" PCM row can be immediately written by the pattern of the second write in the WOM-code. Ideally, if all memory accesses are perfectly distributed over time and memory ranks, PCM-refresh can hide the latency of the a-write in the idle cycles, and achieve a performance improvement of S×, which is not limited by the rewrite limit of the WOM-code. However, it is possible that some rows reach the rewrite limit before PCM-refresh, since we refresh the PCM opportunistically. In this case, the next write access to this row will require a long write latency, due to the SET operations in some of the bits in that row. If a write access is issued to row X while PCM-refresh is currently in progress in row Y, the write access can preempt the PCM-refresh by pausing the PCM-refresh, since we combine write pausing [7] with PCM-refresh. To further improve the efficiency of PCM-refresh in practice, we introduce a refresh threshold parameter $r_{th}$ percent. When the PCM controller selects the target ranks from the idle ranks, only the ranks where more than $r_{th}$ percent of the banks have at least one page that has reached the rewrite limit are selected for PCM-refresh.
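The threshold-based rank selection can be sketched as below. This is a simplified model under stated assumptions: the controller state is represented as a dictionary of per-bank flags, round-robin order is modeled by scanning rank indices after the last refreshed rank, and all names (`eligible_ranks`, `pick_rank`, `banks_at_limit`) are hypothetical.

```python
from typing import Dict, List, Optional


def eligible_ranks(idle_ranks: List[int],
                   banks_at_limit: Dict[int, List[bool]],
                   r_th: float) -> List[int]:
    """Idle ranks in which more than r_th percent of the banks have at least
    one page at the rewrite limit. banks_at_limit[rank][bank] is that flag."""
    selected = []
    for rank in idle_ranks:
        flags = banks_at_limit[rank]
        if 100.0 * sum(flags) / len(flags) > r_th:
            selected.append(rank)
    return selected


def pick_rank(idle_ranks: List[int],
              banks_at_limit: Dict[int, List[bool]],
              r_th: float,
              last_rank: int,
              n_ranks: int) -> Optional[int]:
    """Pick the next eligible rank after `last_rank` in round-robin order,
    or None if no idle rank passes the threshold."""
    candidates = set(eligible_ranks(idle_ranks, banks_at_limit, r_th))
    for offset in range(1, n_ranks + 1):
        rank = (last_rank + offset) % n_ranks
        if rank in candidates:
            return rank
    return None


# Three idle ranks with 4 banks each; 25, 75, and 50 percent of their banks
# hold a page at the rewrite limit.
state = {0: [True, False, False, False],
         1: [True, True, True, False],
         2: [True, True, False, False]}
target = pick_rank([0, 1, 2], state, r_th=40.0, last_rank=1, n_ranks=3)
```

With a threshold of 40 percent, rank 0 is skipped because refreshing it would benefit only one of its four banks.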
WOM-CODE CACHED PCM
Whereas WOM-code PCM can significantly reduce the write latency, it comes at the cost of memory capacity. For example, the $\langle 2^2 \rangle^2/3$ WOM-code PCM requires 50 percent overhead to store the extra bits of encoded data. On one hand, the use of WOM-code PCM requires 50 percent more area than conventional unencoded PCM. On the other hand, unencoded PCM has higher latency and better endurance in comparison to WOM-code PCM. We propose a WOM-code-cached PCM (WCPCM) architecture that integrates a small WOM-code PCM array as the cache (WOM-cache) on top of the large conventional PCM main memory array, as shown in Fig. 3. The proposed cache resides behind the memory controller (i.e., DIMM controller) in the main memory along with the other PCM banks. For simplicity, our implementation of WCPCM uses the wide-column architecture.
WCPCM is implemented at the rank level, where each rank has its own WOM-cache array (wide-column design with PCM-refresh) with the same number of rows and columns as a conventional PCM array in a bank of the PCM main memory. Besides the wide column in each row, there is a selector field associated with each row that includes T, the tag of the row, and V, the valid bit of a page. For a PCM memory architecture with $N_{bank}$ banks per rank, the WOM-cache is an $N_{bank}$-way associative cache, with the width of the tag T equal to $\log_2(N_{bank})$ bits. The valid bit V is only one bit wide. Thus, for a memory architecture with 32 banks per rank, the width of the selector field is only 6 bits per row.
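The selector field width follows directly from the number of banks per rank. A minimal Python sketch, assuming the tag needs just enough bits to name any bank in the rank (the function name is hypothetical):

```python
def selector_field_bits(n_bank: int) -> int:
    """Bits per WOM-cache row for the selector field: a tag wide enough to
    identify any of n_bank banks, plus one valid bit."""
    tag_bits = (n_bank - 1).bit_length()  # equals log2(n_bank) for powers of two
    valid_bits = 1
    return tag_bits + valid_bits


bits_for_32_banks = selector_field_bits(32)
```

For 32 banks per rank this gives 5 tag bits plus 1 valid bit, matching the 6 bits per row stated above.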
Write Protocol in WCPCM
Since the objective of the WCPCM design is to provide fast write accesses in PCM, the WOM-cache only caches write accesses. When a write access is issued to the corresponding rank of the PCM, the memory controller first selects the target row in the WOM-cache. After selecting the row, the controller checks the WOM-cache tag and the valid bit. In the case of a WOM-cache hit, where either the valid bit indicates that the entry is invalid or the WOM-cache tag matches the bank address of this write access, the memory controller forwards the data in the row buffer and programs the cells accordingly. In the case of a WOM-cache miss, where the valid bit indicates a valid entry and the WOM-cache tag does not match the bank address of this write access, the memory controller first outputs the current data and the bank address, i.e., the victim data, to a register. Then the controller programs the cells and updates the WOM-cache tag according to the new data. Finally, the write request of the victim data in the register is inserted into the queue of memory accesses that is issued to the PCM main memory.
Read Protocol in WCPCM
In a read access, the accesses to the corresponding WOM-cache entry and to the physical address in the PCM main memory are issued in parallel. In the case of a WOM-cache hit, the data in the PCM WOM-cache is forwarded to the output buffer. Otherwise, the data in the PCM main memory is forwarded to the output buffer. Note that regardless of a WOM-cache hit or miss, the content in the corresponding WOM-cache entry will not be changed in the read access.
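The write and read protocols of WCPCM can be summarized with a small behavioral sketch in Python. It is a functional model only, under several assumptions: timing, WOM-code encoding, and PCM-refresh are omitted; the parallel cache and main memory lookup on a read is modeled as a simple priority choice; main memory is a plain dictionary; and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CacheRow:
    valid: bool = False   # V: valid bit of the page
    tag: int = 0          # T: bank address of the cached row
    data: bytes = b""


class WomCache:
    """Behavioral model of one rank-level WOM-cache (one entry per row)."""

    def __init__(self, n_rows: int) -> None:
        self.rows: List[CacheRow] = [CacheRow() for _ in range(n_rows)]
        # Victim writes waiting to be issued to the PCM main memory.
        self.victim_queue: List[Tuple[int, int, bytes]] = []

    def write(self, bank: int, row: int, data: bytes) -> bool:
        """Write through the WOM-cache; returns True on a hit."""
        entry = self.rows[row]
        hit = (not entry.valid) or entry.tag == bank
        if not hit:
            # Miss: hand the victim (bank address and data) to main memory.
            self.victim_queue.append((entry.tag, row, entry.data))
        entry.valid, entry.tag, entry.data = True, bank, data
        return hit

    def read(self, bank: int, row: int,
             main_memory: Dict[Tuple[int, int], bytes]) -> bytes:
        """Return cached data on a hit, main-memory data otherwise.
        The cache entry is never modified by a read."""
        entry = self.rows[row]
        if entry.valid and entry.tag == bank:
            return entry.data
        return main_memory[(bank, row)]
```

A short usage example: writing row 3 of bank 0 and then row 3 of bank 1 produces one hit followed by one miss, and the first write becomes a victim queued for main memory. The sketch does not model reads that race with a victim still waiting in the queue.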
The WCPCM architecture has the following advantages:
Exploiting WOM-cache locality: The locality in the on-chip last level cache (LLC) miss stream is very high [41]. The large entry size of the WOM-cache (e.g., the size of a row), and the high locality in the incoming access stream to the main memory enable the effective utilization of the short write access latency of the WOM-cache, thereby improving the performance of the PCM memory system.
Reducing WOM-code overhead: Instead of provisioning at least 50 percent spare memory capacity when integrating WOM-codes into the PCM memory system, WCPCM only applies WOM-codes to a portion of the memory space. The number of addressable units in a WOM-cache is the same as that of a bank. Given that 50 percent memory capacity in each addressable unit is required by WOM-codes, the memory overhead of the entire WOM-cache is only 1.5/32 = 4.7% in a PCM architecture with 32 banks per rank. For memory architectures with more banks per rank, the memory overhead of WOM-cache is even lower.
Increasing memory access speed: In the case of a WOM-cache hit, the latency of a write is reduced to the latency of the RESET operation. On the other hand, in the case of a WOM-cache miss, the penalty of using WOM-cache is one to two clock cycles for the tag comparison. Given that the write latency is of the order of hundreds of cycles, the penalty of using WOM-cache is negligible in practice. Similar to the WOM-cache miss, a read access in the WCPCM only requires the overhead of the tag comparison, since the accesses to the WOM-cache and the PCM main memory are performed in parallel.
Practical cached memory solution: Whereas the functionality of WCPCM is similar to that of hybrid DRAM/PCM [19], WCPCM is more practical since it uses only PCM cells and is hence easier to fabricate. Hybrid DRAM/PCM approaches not only need to integrate the fabrication of DRAM and PCM, but also face scalability issues inherent to DRAM. Note that the proposed WOM-cache can also be implemented as a dense on-chip cache.
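The memory overhead figure quoted above can be reproduced with a one-line calculation. The sketch below assumes one WOM-cache array per rank, sized like one bank and expanded by the 1.5 factor of the code, which is how the text derives 1.5/32; the function name and default parameter are hypothetical.

```python
def wom_cache_overhead(n_bank: int, expansion: float = 1.5) -> float:
    """Fraction of extra memory for one WOM-cache array per rank, relative to
    the n_bank conventional banks in that rank."""
    return expansion / n_bank


overhead_32 = wom_cache_overhead(32)   # 1.5 / 32 = 0.046875, about 4.7 percent
overhead_64 = wom_cache_overhead(64)   # lower with more banks per rank
```

The overhead falls as the number of banks per rank grows, consistent with the observation above.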
Endurance of WCPCM
Though WCPCM improves latency, its endurance is lower because the cache potentially receives higher write traffic than the rest of the memory array. To mitigate this problem, we propose a round-robin cache migration (RRCM) scheme. A discussion of the RRCM architecture is deferred to Section 5.5.
WOM-CODE BIT-WRITE REDUCTION
Whereas the WOM-code PCM architectures proposed above can reduce memory latency for limited overhead, the performance of PCM is affected by a number of practical considerations, the most significant of which is poor cell endurance [7], [11], [12], [13]. In PCM, this is a result of the physical stresses on the phase change material during write operations. The repeated heating and cooling of each cell causes the phase change material to slightly expand and contract on every write. Over many cycles, these stresses can create gaps in the material or cause the material to break away from the heating element entirely. This ruins the electrical conductivity of the cell, causing it to be "stuck" in one state. PCM cells can sustain only on the order of 10^8 writes before failure, whereas DRAM cells can sustain on the order of 10^15 writes [17]. Furthermore, limits on the maximum amount of power that the device can consume require the memory to write only a limited amount of data at a time, further increasing the overall write latency of the device [15], [17].
The WOM-code PCM architectures described so far have to address the following endurance considerations. First, implementing WOM-codes increases the effective size of the written data. When using the ⟨2^2⟩^2/3 WOM-code, the WOM-code PCM controller encodes every 2 bits of written data as 3 bits of WOM data, making the effective written data 50 percent larger. Thus, for every write operation in WOM-code PCM architectures, up to 50 percent more PCM cells may be overwritten. Second, both PCM-refresh and WCPCM require additional writes to the PCM array. In PCM-refresh, additional write operations are issued during the periodic PCM-refresh command. In WCPCM, a write request may incur two write operations: a write operation in the WOM-cache array and a write operation in the PCM main memory when the entry later becomes a replacement victim. The rest of this section describes effective solutions to address these challenges.
Background and Related Work
Existing solutions to the endurance problem in PCM use a variety of approaches. One set of solutions tackles the endurance problem through complex data migration or address translation techniques [42], [43], [44]. Although these schemes potentially improve endurance, they incur penalties in latency and hardware overhead. A second set of solutions uses compression techniques to address the endurance problem by reducing the effective size of written data. Compression techniques have been investigated widely for classical memory technologies, largely to improve the capacity of both main memory systems and last level caches [45], [46], [47], [48], [49]. Compression has been used with limited success within the context of PCM: to improve performance in multi-level cell PCM [50] and to improve PCM capacity [23]. The reduced data size that results from compression can improve endurance because the system writes less data. However, compression techniques have the following limitations. First, the compression and decompression of data usually incur non-negligible latency overheads, and may require complex address translation or space reallocation procedures to account for the variable size of the compressed data. Second, it is difficult to integrate compression techniques with the WOM-code architectures, as compression techniques may require both SET and RESET when writing the compressed data, which again compromises the benefit of using WOM-code.
Another popular set of solutions comprises the bit-write reduction solutions, which use encoding to reduce the number of bit-writes that occur as data in memory is overwritten by new data. State-of-the-art methods, flip-n-write (FNW) [18] and data comparison write (DCW) [33], have been successful in reducing the number of bit-writes during memory operation to improve performance. FNW [18] improves power, endurance, and write latency of PCM by reducing the maximum number of bits changed during a write access: it conditionally "flips" the data to be written, reducing the number of bit-writes between the old and new data by at least half. DCW [33] reads the existing data from the target address before overwriting it with a new word. A comparison between the existing data and the new data is performed to check which bits need to be updated. Only the changed bits are overwritten in the following write operations. Thus, the improvements in FNW and DCW depend on both the existing data and the new data. These methods show how the non-volatility of PCM technologies can be leveraged by performing write operations only on the bits that need to be changed, while leaving the other bits unmodified. Unlike DRAM, where the entire memory line needs to be written back after each row activation during a read or write operation, refreshing the unchanged bits in PCM is not necessary, so extra power or time need not be expended on them. However, FNW requires both SET and RESET operations in a write access, regardless of whether it flips the data or not. Thus, FNW is incompatible with WOM-code PCM.
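The two bit-write reduction ideas discussed above can be illustrated with a minimal sketch. It models the general principle only and is not the circuit-level design of [18] or [33]; the function names and the way the flip bit is charged as one extra cell are assumptions made for illustration.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def dcw_bit_writes(old: int, new: int) -> int:
    """DCW: read the old word, write only the cells whose value changes."""
    return popcount(old ^ new)


def fnw_write(stored: int, flip: int, new: int, width: int):
    """FNW-style write: store `new` or its complement, whichever changes fewer cells.

    Returns (stored_word, flip_bit, bit_writes). The flip bit is counted as a cell.
    """
    mask = (1 << width) - 1
    inverted = ~new & mask
    plain_cost = popcount(stored ^ new) + (flip != 0)
    inverted_cost = popcount(stored ^ inverted) + (flip != 1)
    if inverted_cost < plain_cost:
        return inverted, 1, inverted_cost
    return new, 0, plain_cost


def fnw_read(stored: int, flip: int, width: int) -> int:
    """Undo the conditional inversion on a read."""
    mask = (1 << width) - 1
    return (~stored & mask) if flip else stored
```

Because the two candidate costs always sum to `width + 1` in this model, the cheaper one never exceeds half of the cells (rounded down), which is the bound that motivates FNW.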
Another set of solutions are based on hybrid-memory architectures that integrate both volatile and non-volatile memories side-by-side. Solutions [51] and [52] propose hybrid-memory architectures for last-level caches that integrate a small capacity SRAM alongside a larger STT-RAM. By mapping frequently written addresses to the SRAM and less frequently written addresses to the STT-RAM area, these solutions improve the overall endurance of the system. Similarly, several solutions have been proposed using hybrid DRAM-PCM for main memory [19] , [53] , [54] . However, these solutions have certain limitations. First, unlike [51] and [52] , it is difficult to integrate DRAM and PCM on the same chip for main memory systems. Hence, most of the hybrid main memory solutions use multi-module architectures that require additional hardware and software resources. Second, [19] , [53] , [54] not only use separate memory buses for the DRAM and PCM modules, but also require the OS to manage pages that should be mapped to DRAM/PCM. This results in hardware and software overhead. However, efficient hardware and software optimization for multi-die/single-die integration of DRAM and NVM for main memory is an area for future research.
The following subsection describes incremental coding for endurance as a bit-write reduction solution to overcome endurance limitations of the WOM-code PCM architectures described so far in this paper.
Endurance-WOM-Code (e-WOM-Code) and Incremental Coding for Endurance
In this section we describe a technique to reduce the number of bit-writes that occur in the PCM array. The simulation results in Fig. 10 show that WOM-code PCM architectures experience an increase in the number of bit-writes in comparison to unencoded PCM. Conventional algorithms like data comparison write [33], which have been proven to reduce the number of bit-writes, are not effective in the presence of WOM-code. This is because DCW cannot prevent the WOM states from changing even when the input value remains the same. Hence, DCW is rendered ineffective in controlling WOM state transitions. Our simulation results show that applying the DCW algorithm to WOM-code PCM over multiple benchmarks reduces the bit-writes by only about 8.7 percent in comparison to WOM-code PCM without DCW. We identify that a significant number of bit-writes can potentially be avoided by operating on the incoming raw data instead of the encoded data. Operating on the input data provides control over the WOM state transitions in the encoded data. To take advantage of this redundancy, we apply an endurance-WOM-code (e-WOM-code) that prevents state transitions when the incoming data is unchanged.
Endurance-WOM-Code
For the WOM-code described in Section 3, we discussed that the penalty for resetting from the WOM_2 to the WOM_1 state is higher than that of WOM_1 to WOM_2 state transitions. Hence, the performance of WOM-code can be improved by reducing the number of transitions from the WOM_2 to the WOM_1 state. Without loss of generality, and for ease of understanding, we use inverted WOM-codes for our discussions in this section. Let us consider the SET (RESET) operation as the transition of a bit from 0 to 1 (1 to 0) in the encoded data. The fundamental property of WOM-code is to advance in such a way that the long duration (SET) operations are minimized. In order to have a long chain of RESET-only state advances before having to perform the SET operation, the number of RESET operations per state transition must be minimized. In other words, an effective code would have only a single RESET operation in all its state transitions. Consider the state transition graph shown in Fig. 4a, which is obtained by modifying the WOM-code described in Fig. 1 for endurance. In this endurance-WOM-code (e-WOM-code), state transitions occur only when there is a change in the input 2-tuple. No write operation (or state advancement) is performed if the input does not change. Additionally, to take full advantage of the redundancy that the ⟨2^2⟩^2/3 WOM-code presents, the state transition graph accommodates two additional changes, explained as follows. First, in comparison to WOM-code, the e-WOM-code modifies the edges of the state diagram to allow a state transition from the 111 WOM_1 state to the 011, 101, or 110 WOM_1 states. Second, it allows a transition from the 100, 010, or 001 WOM_2 states to the 000 WOM_2 state. Both these cases provide one extra state transition in the state diagram before the long duration (WOM_2 to WOM_1) operation.
However, since both the 000 WOM_2 and the 111 WOM_1 states represent the same input 2-tuple 00, a transition from one to the other is not possible in our scheme. Hence, the two modifications discussed above cannot provide us with advantages simultaneously. Again, without loss of generality, we use the first case in our analysis (Fig. 4b). The equivalent state transition graph for the non-inverted WOM-code is shown in Fig. 4c. Note that the state transition graph for the e-WOM-code is equivalent to the state transition graph for the original ⟨2^2⟩^2/3 WOM-code proposed by Rivest and Shamir [36], except for the self transitions. The use of this scheme converts 25 percent of the WOM_1-WOM_2 edges to WOM_1-WOM_1 edges, improving e-WOM-code performance. Note also that this work focuses on the integration of the ⟨2^2⟩^2/3 WOM-code into a system that uses PCM as the main memory; the integration of other WOM-codes proposed in the literature into PCM-based systems remains an area for future research.
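One way to arrive at the 25 percent figure, assuming that edges are counted only between states that encode different 2-tuples: each of the four first-generation states has three outgoing data-changing edges, and the first modification redirects the three edges that leave the 111 state so that they stay within the first generation.

```latex
\frac{\text{redirected edges}}{\text{edges from } \mathrm{WOM}_1 \text{ to } \mathrm{WOM}_2}
  = \frac{3}{4 \times 3}
  = 25\%
```

This counting is an interpretation of the text rather than a derivation taken from Fig. 4.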
Incremental Coding for Endurance
The implementation of e-WOM-code in the PCM system requires additional logic realized within a simple e-WOM-code controller. Consider incoming data DATA_IN meant for location ADDR_IN. The e-WOM-code controller first reads the existing data at address ADDR_IN, decodes it from the e-WOM-code, and compares every 2-tuple in the existing data with the corresponding 2-tuple in DATA_IN. Note that this decode-and-compare occurs in parallel in hardware. The 2-tuples that differ are then encoded into the next e-WOM-code state and written into the PCM array using classical DCW. The use of DCW eliminates redundant writes into the PCM array. Thus, e-WOM-code, along with incremental coding for endurance (ICE), significantly improves the endurance (due to the reduced number of bit-writes) and memory access latency (due to reduced WOM_2-WOM_1 transitions) of the PCM array.
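The read, decode, compare, encode, and write sequence described above can be sketched in software. This is a minimal illustration under stated assumptions, not the controller hardware: the code table is assumed to be the bitwise complement of the classic Rivest-Shamir table (the exact assignment of 01, 10, and 11 to patterns is illustrative), only the first e-WOM modification (111 to another first-generation state) is modelled, and the names `next_code` and `ice_write` are hypothetical.

```python
# Assumed inverted table: SET is 0 -> 1 (slow), RESET is 1 -> 0 (fast).
WOM1 = {0b00: 0b111, 0b01: 0b011, 0b10: 0b101, 0b11: 0b110}  # first generation
WOM2 = {0b00: 0b000, 0b01: 0b100, 0b10: 0b010, 0b11: 0b001}  # second generation
DECODE = {code: data for table in (WOM1, WOM2) for data, code in table.items()}


def next_code(code: int, data: int) -> int:
    """Next e-WOM-code state of one 3-cell group for an incoming 2-tuple."""
    if DECODE[code] == data:
        return code            # unchanged input: no state advance, no write
    if code == 0b111:
        return WOM1[data]      # extra e-WOM edge: single RESET, stays in WOM_1
    if code in WOM1.values():
        return WOM2[data]      # RESET-only advance from WOM_1 to WOM_2
    return WOM1[data]          # WOM_2 cannot advance further: a SET is needed


def ice_write(stored, data_in):
    """Write a word given as a list of 2-tuples over a list of stored 3-bit codes.

    Returns (new_codes, bit_writes, needs_set).
    """
    new_codes = [next_code(c, d) for c, d in zip(stored, data_in)]
    # DCW: only cells that differ between the old and new codes are written.
    bit_writes = sum(bin(c ^ n).count("1") for c, n in zip(stored, new_codes))
    # A SET is required if any cell has to change from 0 to 1.
    needs_set = any(n & ~c & 0b111 for c, n in zip(stored, new_codes))
    return new_codes, bit_writes, needs_set
```

For example, starting from four groups in the 111 state, writing the 2-tuples 01, 00, 10, 11 touches three cells and needs no SET, because the group holding 00 is left untouched and the others each take a single RESET.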
Example: Consider an example of four data write requests to the same location from an 8-bit data stream, as shown in Fig. 5.
1) Fig. 5a illustrates unencoded PCM without any enhancements. Whenever there is a data write to a location, the memory controller writes the entire word into the PCM array, resulting in 4 × 8 = 32 bit-writes.
2) Fig. 5b illustrates unencoded PCM with classical DCW to reduce bit-writes. In DCW, only those bits that differ from the previous data are updated; in this example, only 12 cells are updated instead of 32.
3) Fig. 5c illustrates WOM-code PCM without DCW. In WOM-code PCM, each 2-tuple is encoded into an equivalent 3-tuple. Without DCW, every bit in the PCM array is updated. In this case, each write requires 12 cell updates, for a total of 4 × 12 = 48 bit-writes.
4) Fig. 5d illustrates the integration of DCW with case (c), resulting in 28 bit-writes. Note that 28 bit-writes is greater than the 12 bit-writes in case (b), i.e., unencoded PCM with DCW. The simulation results in Section 6 support this observation and motivate the development of e-WOM-code solutions.
5) Fig. 5e illustrates the use of e-WOM-code with ICE to improve memory endurance. The e-WOM-code state advances only when there is a change in the input data and/or the state changes from 111 to a WOM_1 state. Unlike case (d), where four state transitions occur in all cases, ICE has its four 3-tuples advance by only three, two, one, and three state transitions, respectively, for a total of nine transitions.
The example illustrates how the e-WOM-code with ICE can be used to reduce memory access latency and improve endurance in practice.
WCPCM Endurance and the Round-Robin Cache Migration Architecture
Though the WCPCM architecture was proposed to improve memory latency, the cache potentially receives higher write traffic than the rest of the memory array. It is possible to mitigate the impact on endurance through the round-robin cache migration (RRCM) scheme described in this section. Instead of maintaining a fixed designated cache, RRCM periodically changes the physical memory that is used to represent the cache in a round-robin fashion. First, we make the following three observations about WCPCM:
1) Whenever a write request appears at the DIMM controller, the data always gets written to the cache, regardless of a cache hit/miss.
2) During a cache miss, the data and the tag bits both need to be updated, and during a cache hit the tag bits can remain the same, but the data should be updated to the new value. Therefore, we base our analysis of the endurance of the cache on data bits, since they receive more bit-writes than the tag bits.
3) The cache memory and data banks are written by the same DIMM controller. During cache writes, the DIMM controller activates the ICE hardware to encode the data using e-WOM-code; however, during writes to data banks, the data is written in unencoded form. Hence, physically, there is no difference between the cache and the data banks.
Consider a memory module with (2k + 6) physical banks, of which 2k physical banks are address-mapped into k logical banks to realize the main memory. Of the remaining six banks, three physical banks are logically mapped to realize the WOM-cache in the WCPCM architecture, while the remaining three physical banks are used as spare banks. The use of spare banks is described later in this section. Since a write to any memory location is always written to the cache, the cache potentially receives k× as many writes as the rest of the memory banks, which affects its endurance. However, with (i) RRCM, (ii) e-WOM-code PCM, and (iii) PCM-refresh, it is possible to mitigate this problem in practice.
First, in RRCM, instead of a fixed designated cache, we propose to periodically change the physical banks used to represent the cache using a simple round-robin algorithm. Since the cache memory is identical to the address-mapped memory, we can logically swap address-mapped data banks periodically with the cache memory so that the wear is spread across all the banks. Therefore, the additional factor of k writes that would impact the endurance of a single cache is now spread across (k + 1) logical banks, i.e., (2k + 3) physical banks. Thus, with RRCM, we see only a factor of ⌈2k/(2k + 3)⌉ increase in the number of writes to any physical bank. Second, our evaluation of e-WOM-code PCM (in Section 6) shows that it can improve the endurance of memory by 1.4× on average in comparison to unencoded data with DCW. Thus, by employing e-WOM-code cached PCM, we can actively improve memory latency while mitigating the corresponding impact on endurance.
Third, we employ PCM-refresh (described in Section 3.2) to periodically refresh the cache to hide the long latency writes. Whenever the cache reaches the refresh threshold, the data from the cache banks can either be discarded without migration, which lowers the cache efficiency, or be migrated to the spare banks, where banks receive data in the WOM_1 state. Once the data migration is completed, the spare banks are brought online as the cache. The retired cache banks are then used as spares until the new cache banks reach the refresh threshold. Therefore, by using PCM-refresh, the latency of the cache can be improved. Our simulations show that PCM-refresh with e-WOM-code potentially reduces the write latency by approximately 51.5 percent.
Consider a main memory module with 70 physical banks, where 64 physical banks are used to realize 32 logical data banks, three physical banks are used to realize the WOM-cache, and the remaining three banks are used as spare banks. In a fixed cache architecture, the cache would see a 32× increase in traffic compared to the address-mapped data banks. This potentially makes the cache memory fail 32× faster than the data banks. However, with the RRCM scheme, the potential writes for any bank reduce to [1 + ⌈(32 × 2)/(64 + 3)⌉] = 2×. Further, with the use of e-WOM-code PCM for the cache, the potential bit-writes for any physical bank are [1 + ⌈(32 × 2)/(64 + 3)⌉/1.4] ≈ 1.7×. Therefore, with the RRCM scheme, the endurance of the memory module can be increased by 19× in comparison to the e-WOM-code cached PCM scheme without RRCM. Thus, we believe that the use of RRCM and e-WOM-code PCM effectively mitigates the endurance degradation associated with a fixed cache scheme for WCPCM.
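The arithmetic in this example can be reproduced with a short sketch. It assumes that the 1.4× e-WOM-code gain divides only the extra cache-induced writes, which is the reading that yields the 1.7× and 19× figures; the function and parameter names are hypothetical.

```python
import math


def rrcm_write_factor(k: int, cache_banks: int = 3, ewom_gain: float = 1.0) -> float:
    """Relative writes seen by any physical bank under RRCM.

    k logical data banks map onto 2k physical banks; the extra cache traffic
    is spread over the 2k + cache_banks physical banks that rotate as cache.
    ewom_gain scales down the cache-induced writes (assumed interpretation).
    """
    extra = math.ceil(2 * k / (2 * k + cache_banks))
    return 1 + extra / ewom_gain


fixed_cache_factor = 32                      # k = 32 logical banks, fixed cache
with_rrcm = rrcm_write_factor(32)            # RRCM only
with_rrcm_ewom = rrcm_write_factor(32, ewom_gain=1.4)
endurance_gain = fixed_cache_factor / with_rrcm_ewom
print(with_rrcm, round(with_rrcm_ewom, 1), round(endurance_gain))
```

Under these assumptions the three printed values correspond to the 2×, 1.7×, and 19× figures in the text.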
SIMULATION SETUP AND RESULTS
The proposed PCM architecture was evaluated using trace-driven simulations. We use Pin [55] to capture the memory access trace on a machine with an Intel Core i7 CPU 980 running at 3.33 GHz. We collect memory access traces from a set of workloads: integer benchmarks (400.perlbench, 401.bzip2, 456.hmmer, 462.libq, 464.h264ref) and floating point benchmarks (410.bwaves, 436.cactusADM, 465.tonto, 470.lbm, and 482.sphinx3) from SPEC CPU2006 [30] in the general-purpose computing domain; qsort, mad, FFT, typeset, and stringsearch from MiBench [31] in the embedded computing domain; and ocean, water-ns, water-sp, raytrace, and LU-non-contiguous-block from SPLASH-2 [32] in the high-performance computing domain. Note that in reality, along with application processes, kernel processes add memory access traffic, which should be accounted for in an accurate evaluation. However, this requires substantial instrumentation effort with a full-system simulator and also depends on the emulated operating system.
For accurate simulation of memory access latencies in the proposed PCM architecture, the simulation framework was configured based on accepted parameters and standards for memory modeling obtained from manufacturer data sheets. The PCM simulator is based on DRAMSim2 [29] and follows standard JEDEC protocols for DDR3 memory, since the PCM access protocol will not differ much from DRAM [56] , [57] . The fundamental modifications in DRAMSim2 for PCM simulation are [57] : row read delay is 27 ns, row write delay is 250 ns, the RESET latency is 40 ns, the SET latency is 250 ns, and the PCM-refresh period is 4,000 ns. We used the PCM main memory architecture proposed in [56] as the baseline PCM memory architecture. For evaluating e-WCPCM, we need to consider the hit/miss statistics of the WOM-cache. When there is a hit in the e-WCPCM cache, the latency of this write is not deterministic, since it depends on whether the target row reaches its rewrite limit. In the case that the target row reaches the rewrite limit, the write latency is the SET latency, which is also 250 ns. Otherwise, the write latency is the RESET latency, which is 40 ns in our simulation setup. When there is a miss in the e-WCPCM cache, the latency of this write operation is the write latency of normal PCM memory, which is 250 ns in our simulation setup. For comparing the performance of PCM to DRAM, we mirror the configuration of DRAM to PCM (described in the following paragraph). All the timing parameters for DRAM were obtained from [29] .
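The hit/miss latency cases for e-WCPCM described above can be summarized in a small model. This is a sketch only: it uses the latency values from the simulation setup, ignores the one-to-two-cycle tag comparison, and the names `hit_rate` and `limit_rate` are hypothetical inputs rather than measured statistics.

```python
RESET_NS = 40        # RESET latency from the simulation setup
SET_NS = 250         # SET latency from the simulation setup
PCM_WRITE_NS = 250   # row write delay of the unencoded PCM main memory


def ewcpcm_write_latency_ns(cache_hit: bool, at_rewrite_limit: bool) -> int:
    """Latency of a single write in e-WCPCM (tag comparison ignored)."""
    if not cache_hit:
        return PCM_WRITE_NS
    return SET_NS if at_rewrite_limit else RESET_NS


def expected_write_latency_ns(hit_rate: float, limit_rate: float) -> float:
    """Average write latency for a given WOM-cache hit rate and the fraction
    of hits that find the target row at its rewrite limit."""
    hit_latency = limit_rate * SET_NS + (1 - limit_rate) * RESET_NS
    return hit_rate * hit_latency + (1 - hit_rate) * PCM_WRITE_NS
```

The model makes the dependence explicit: the average latency approaches the RESET latency only when both the hit rate is high and rows rarely sit at their rewrite limit.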
The PCM device was configured as follows: the 16 GB PCM main memory architecture is organized in a single channel with 32 banks/rank. Each PCM device has 32,768 rows, 2,048 columns/row, and 4 bits of data/column; thus, 16 devices are used in parallel to form the 64-bit data width of main memory.
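As a consistency check on this configuration, the sketch below recomputes the total capacity and data width. It assumes that the row and column counts are per bank and that the module has a single rank; neither assumption is stated explicitly in the text.

```python
ROWS_PER_BANK = 32_768
COLUMNS_PER_ROW = 2_048
BITS_PER_COLUMN = 4
BANKS_PER_DEVICE = 32   # 32 banks/rank, assumed to be a single rank
DEVICES = 16

bits_per_bank = ROWS_PER_BANK * COLUMNS_PER_ROW * BITS_PER_COLUMN
bytes_per_device = bits_per_bank * BANKS_PER_DEVICE // 8
total_gib = bytes_per_device * DEVICES / 2**30
data_width_bits = DEVICES * BITS_PER_COLUMN

print(f"capacity: {total_gib:.0f} GiB, data width: {data_width_bits} bits")
```

Under these assumptions the numbers agree with the 16 GB capacity and 64-bit data width given above.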
Memory Access Latency
e-WOM-Code PCM Architectures
In Fig. 6a, we report the average write latency of e-WOM-code PCM (red), e-PCM-refresh (green), and e-WCPCM (purple), normalized to the average write latency of unencoded PCM (blue). Unencoded PCM is conventional PCM that serves as the baseline architecture to report normalized write latencies. Note that the use of DCW (differential writes) is irrelevant in the measurement of memory access latency. (1) We observe that the write latency of PCM is reduced by 19.8 percent on average in the case of e-WOM-code PCM. The best improvement occurs in the 464.h264ref benchmark, where the write latency is reduced by 48.7 percent. (2) e-PCM-refresh further reduces the write latency, by 51.5 percent on average across the benchmarks. The best improvement with e-PCM-refresh is in the 464.h264ref benchmark, where it achieves 66.1 percent lower write latency. (3) We also observe that the performance of e-WCPCM, in terms of write latency, is between e-PCM-refresh and e-WOM-code PCM. e-WCPCM achieves 47.5 percent lower write latency on average, in comparison to unencoded PCM.
We also evaluate the average read latency of the proposed PCM architectures, as shown in Fig. 6b, normalized to the average latency of unencoded PCM. As the longer write latency in PCM is likely to block read accesses to the same bank, we find significant improvement in the read latency of our proposed PCM architectures, compared to unencoded PCM. It is consistent with the improvement of write latency shown in Fig. 6a, even though the degree of improvement is smaller. e-WOM-code PCM can reduce the read latency by 14.7 percent on average. Similar to write latency, e-PCM-refresh has the lowest read latency, 44.1 percent lower than unencoded PCM on average across the benchmarks. e-WCPCM has 41.6 percent lower read access latency than unencoded PCM, which is clearly an advantage over e-WOM-code PCM with ICE.
DRAM Versus Baseline Unencoded PCM
Fig. 6 also reports the read and write latency numbers of DRAM normalized to unencoded PCM. The read latency of DRAM is about 52.6 percent lower than unencoded PCM, while the write latency of DRAM is about 87.4 percent lower than unencoded PCM. In other words, PCM read latency is about 2.4× that of DRAM and PCM write latency is about 8× that of DRAM. This is in agreement with results reported based on application profiling in [56], where it was reported that the write latency of PCM is about 12.0× and read latency is about 4.4× greater than DRAM.
e-WOM-Code PCM Versus WOM-Code PCM
In order to show the impact of e-WOM-code and ICE on memory access latency, we compare the write and read latencies of e-WOM-code PCM to WOM-code PCM, normalized to the write latency of WOM-code PCM, in Fig. 7. Observe that in 15 out of 20 benchmarks, ICE reduces both the write latency and the read latency in WOM-code PCM. This improvement can be attributed to the finer granularity of WOM-code updates in e-WOM-code PCM, which makes it possible to rewrite the WOM-code beyond the rewrite limit. The remaining five benchmarks show longer write and read latencies in e-WOM-code PCM. The benchmark 456.hmmer has the largest increase in write latency, 9.03 percent. In these benchmarks, the benefits of fine-granularity WOM-code updates in e-WOM-code PCM do not compensate for the penalty of the additional read latency in ICE.
We also evaluate the sensitivity of the tag design in the e-WCPCM. In Fig. 8, we measure the WOM-cache hit rate in e-WCPCM with four PCM organizations: four banks/rank (blue), eight banks/rank (red), 16 banks/rank (green), and 32 banks/rank (purple). It is worth noting that the tag in the WOM-cache is the bank address. Changing the number of banks/rank impacts the associativity of the WOM-cache. The more banks/rank, the lower the hit rate in the WOM-cache. However, increasing the number of banks/rank provides better parallelism in memory accesses, which helps in reducing access conflicts. In Fig. 9, we report the normalized write latency in e-WCPCM with four PCM organizations, similar to Fig. 8. The write latency is normalized to that of the four banks/rank organization. We observe that the write latency in e-WCPCM decreases as the number of banks/rank increases.
Memory Endurance
Memory endurance is determined by two factors: first, the total bit-writes to the PCM array and, second, the peak bit-writes to a given location. In the trivial case, the lifetime of the memory depends on the number of writes that are received by a given bit-cell. This can be non-uniform, and hence the lifetime of the PCM would depend on the cell that gets the peak write activity. However, in practice, wear leveling schemes like start-gap [42] and security-refresh [44] can be used to even out the peak bit-writes to a given location. Although not implemented in this work, we believe that wear leveling is orthogonal to the innovation described in this work. This makes total bit-writes to any given location as relevant an endurance parameter as peak bit-writes at that location. Since the proposed architecture is agnostic to the choice of wear leveling technique, the reduction in the total bit-writes by e-WOM-code translates to improved endurance.
We evaluate the total number of bit-writes in our proposed WOM-code PCM architectures and compare them to unencoded PCM without (and with) DCW. In Fig. 10, we compare the total number of bit-writes in five architectures: unencoded PCM without DCW, WOM-code PCM without DCW, unencoded PCM with DCW, WOM-code PCM with DCW, and e-WOM-code PCM, i.e., WOM-code PCM with both e-WOM-code and ICE. The total number of bit-writes is normalized to that of unencoded PCM without DCW. Since WOM-code PCM without DCW rewrites all bits at an address, it always has 50 percent more bit-writes than unencoded PCM without DCW.
We also analyze the reduction in the number of bit-writes using the e-WOM-code with ICE or WOM-code with DCW. We show the total number of bit-writes in WOM-code PCM with DCW and e-WOM-code PCM in Fig. 11. Both are normalized to that of unencoded PCM without DCW. e-WOM-code with ICE significantly reduces the total number of bit-writes, by 87.6 percent on average, in comparison to WOM-code PCM with DCW.
Finally, we measure the number of bit-writes while combining e-WOM-code and ICE with our proposed PCM-refresh and WCPCM, i.e., e-PCM-refresh and e-WCPCM, as shown in Fig. 12. Our results show that on average, both PCM-refresh and WCPCM, which do not integrate ICE and e-WOM-code, incur 87.7 and 125.9 percent more bit-writes, respectively, in comparison to unencoded PCM without DCW. In contrast, e-PCM-refresh and e-WCPCM reduce the additional bit-writes to 19 and 46 percent, respectively. Furthermore, Fig. 12 shows that e-PCM-refresh and e-WCPCM reduce bit-writes by 76.5 and 68.1 percent in comparison to PCM-refresh and WCPCM.
Lifetime Measurement
The endurance of a device cannot be determined solely by total bit-writes. If the write traffic is unevenly distributed across the logical address space, then the lifetime of the PCM device depends on the application. The distribution of writes over the address space was simulated for two benchmarks from the SPLASH benchmark suite, volrend and lu_contig, as shown in Fig. 13. Volrend is a highly computational application that renders a 3D figure to a 2D image. On the other hand, lu_contig is a simple memory-intensive application in which the processor accesses contiguous locations. We observe that the memory write access distribution is fairly spiky for volrend and smooth for lu_contig. As the bit-write distribution can be either even or uneven, we cannot measure memory endurance using just the total bit-writes, which motivates peak bit-writes as an equally important parameter for measuring endurance.
In order to evaluate the lifetime of a PCM memory device for a given application, we assume that the application runs forever in the system. The write traffic from the memory trace is tracked for every line in the memory until one of the lines becomes defective. This is similar to the methods used in [52]. We use this method to estimate the lifetime of the e-WOM-code PCM in comparison to unencoded PCM without DCW and unencoded PCM with DCW. The normalized lifetime with respect to unencoded PCM without DCW is shown in Fig. 14. e-WOM-code increases lifetime by about 13.5× over unencoded PCM without DCW; when the PCM incorporates DCW, e-WOM-code improves lifetime by 1.4× over unencoded PCM with DCW. However, it is important to note that in practice, various wear leveling schemes (not considered here) can be used to even out the wear in the memory to further improve the lifetime of the device [42].
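The lifetime estimation procedure described above can be sketched as follows. This is a simplified illustration, not the simulator used for Fig. 14: it assumes that the trace is a sequence of line addresses, that every write adds one unit of wear to its line, and that a line fails after a fixed number of writes; the function name is hypothetical.

```python
from collections import Counter


def lifetime_in_trace_runs(line_writes, endurance: int) -> int:
    """Number of complete trace replays before the most-written line wears out.

    line_writes: iterable of line addresses, one entry per write in the trace.
    endurance:   writes a line can sustain before it becomes defective.
    """
    counts = Counter(line_writes)
    if not counts:
        raise ValueError("trace contains no writes")
    peak_writes_per_run = max(counts.values())
    return endurance // peak_writes_per_run


# A spiky trace wears out sooner than a smooth one with the same total writes.
spiky = [0, 0, 0, 0, 1, 2]
smooth = [0, 0, 1, 1, 2, 2]
print(lifetime_in_trace_runs(spiky, 10**8), lifetime_in_trace_runs(smooth, 10**8))
```

Dividing the result for one architecture by the result for a baseline gives a normalized lifetime of the kind reported in the text, under the stated simplifications.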
Energy and Power
In this sub-section we discuss energy and power requirements of the proposed PCM architectures in comparison to a conventional DRAM architecture. First we provide an analytical background for the energy penalties, followed by an evaluation of unencoded PCM with DCW and e-WOM-code PCM. Last, we qualitatively describe the power requirements for using PCM-based architectures in main memory.
Energy
Earlier works have investigated the energy requirements of PCM in comparison to DRAM [56]. PCM array read and write energies are reported to be 2.1× and 43.1× greater than those for DRAM. This places stricter restrictions on the number of bits that can be SET/RESET at a given time in a PCM-based main memory. Note that the use of differential writes, i.e., DCW, can actually reduce PCM write energy to only the bits that have to be modified [33]; however, DCW necessitates the implementation of a read-modify-write protocol, which adds a read energy component to every write operation. Since the active read/write energy in both PCM and DRAM is directly proportional to the number of bit-reads and the number of bit-writes, respectively, we compare the energy requirements of PCM (unencoded PCM, WOM-code PCM, e-WOM-code PCM, e-PCM-refresh, and e-WCPCM) to DRAM using the number of bit-reads and bit-writes in Table 2 and the following equations:
where p (q) is the number of simultaneous reads (writes); N is the ratio of the WOM-code word size to the unencoded word size; r is the ratio of the average number of bit-writes in PCM to that in DRAM for a given architecture; R E;DRAM and W E;DRAM are the read and write energy for a word in DRAM, respectively; and R E;PCM and W E;PCM are the read and write energy for a word in PCM, respectively. Note that the ratio of DRAM read energy to write energy, R E;DRAM /W E;DRAM is % 3 as reported in [56] . The remainder of this sub-section compares the energy consumption for the three architectures: DRAM, unencoded PCM with DCW, and e-WOM-code PCM. The energy and power parameters for DRAM and PCM were obtained from CACTI [34] and NVSim [35] , respectively. The runtimes were obtained from DRAMSim2 using timestamps in the memory traces. To compare the energy consumption of DRAM and PCM, we consider a main memory architecture of 4 GB capacity organized into eight banks (scaled architecture of 16 GB, 32 banks) using the 90 nm process node. Since the technology options for the comparison are vast, we have made the following choices for our evaluation. Out of the DRAM cell choices provided by CACTI, we chose the commodity DRAM cell, since it is the standard choice for main memory, and LOP (low operating power) transistors for the peripheral circuits [56] . On the other hand, the PCM cell was chosen to be a 9F 2 cell in 90nm process node with ON (OFF) resistance of 1 kV (1 MV), SET (RESET) current of 150 mA (300 mA), and SET (RESET) pulse durations of 150 ns (40 ns) [56] . Though the PCM cell by itself does not contribute to leakage power, the peripheral circuits contribute to leakage power. As indicated in [58] , the peripheral circuits of an inactive PCM bank can be powered down completely without losing data due to its non-volatile nature. This limits the leakage to be only from the banks that are active at any given time. 
Further, since the e-WOM-code architecture requires 50 percent area overhead, we assume that its leakage power is 1.5× the leakage power of unencoded PCM with DCW.
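As a rough illustration of the accounting described in this sub-section, the sketch below models the dynamic energy of a single DCW write as one word read plus a write pulse for each changed bit, and models leakage as coming from active banks only, scaled by 1.5 for the e-WOM-code array. It is not a restatement of the equations used in the evaluation: the function names and the per-bit energy values are hypothetical placeholders rather than CACTI or NVSim parameters, and the sketch assumes a single per-bit write energy for both SET and RESET pulses.

```python
def dcw_write_energy(old_word, new_word, width, e_read_bit, e_write_bit):
    """Dynamic energy of one data-comparison write (arbitrary units).

    Assumes a read of the whole word followed by pulses on changed bits only,
    with one per-bit write energy for both SET and RESET (a simplification).
    """
    mask = (1 << width) - 1
    changed_bits = bin((old_word ^ new_word) & mask).count("1")
    return width * e_read_bit + changed_bits * e_write_bit


def leakage_energy(p_leak_bank, active_banks, runtime, ewom=False):
    """Leakage energy from active banks only; inactive banks are powered down.

    The 1.5 factor mirrors the 50 percent area overhead assumed for e-WOM-code.
    """
    scale = 1.5 if ewom else 1.0
    return scale * p_leak_bank * active_banks * runtime


# Illustrative placeholder values, not measured parameters.
e = dcw_write_energy(0b10110010, 0b10010011, width=8,
                     e_read_bit=1.0, e_write_bit=20.0)
print(e)  # 8 bit-reads + 2 changed bits * 20.0 = 48.0
```

In this toy setting, a write that changes no bits still costs the full word read, which is the read energy component that DCW adds to every write operation.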
DRAM energy has three components: leakage, refresh, and dynamic, whereas PCM energy has two components: leakage and dynamic. Fig. 15a shows the constituent components of total energy for the three architectures, where the total energy is normalized to the total energy of DRAM. Fig. 15b compares the total energy for the three architectures. As shown in the figure, unencoded PCM with DCW dissipates about 1.4× the total energy of DRAM, and e-WOM-code PCM dissipates about 1.9× the total energy of DRAM. This is in agreement with [56], wherein it is reported that replacing DRAM with PCM would incur an energy penalty of about 1.4× to 3.4×. Fig. 15c shows the comparison of the dynamic energy component alone for the three architectures. As shown in the figure, unencoded PCM with DCW dissipates about 16 percent dynamic energy and e-WOM-code PCM dissipates about 10 percent dynamic energy in comparison to DRAM.
Power
The power specification of a PCM device has two components: the average-power budget and the peak-power budget. The average-power budget limits the average number of bit-flips for a given cycle, which impacts memory bandwidth. In contrast, the peak-power budget determines the maximum number of bit-writes for a given cycle. The use of WOM-code increases the number of bit-writes in comparison to unencoded data, which potentially necessitates budgeting for higher power. However, with the use of e-WOM-code and ICE, the average number of bit-writes is reduced in comparison to unencoded PCM (Fig. 12). Thus, for the same average-power budget, more data write operations can be sustained, and this can potentially result in improved memory bandwidth. On the other hand, the peak power dissipation ultimately manifests as a constraint on the number of bits that can be SET/RESET at a given time. Since both e-WOM-code PCM and conventional PCM result in roughly the same number of SET/RESET operations in the worst case, it can be concluded that the use of e-WOM-code complies with the power budget of conventional unencoded PCM.
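To make the average-power argument concrete, the following sketch assumes that average write power is proportional to the number of bit-writes per unit time, with a fixed energy per bit-write. Under that assumption, a fractional reduction in bit-writes per data write translates into a proportional increase in the data-write rate that fits within a fixed budget. The function name is hypothetical, and the 22.1 percent input is the average bit-write reduction over unencoded PCM with DCW reported for e-WOM-code PCM.

```python
def sustainable_write_gain(bit_write_reduction):
    """Factor by which the data-write rate can grow under a fixed average-power budget.

    Assumes power scales linearly with bit-writes per unit time (a simplification
    that ignores read, leakage, and peripheral power).
    """
    if not 0.0 <= bit_write_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - bit_write_reduction)


gain = sustainable_write_gain(0.221)
print(f"{gain:.2f}x")  # 1.28x
```

The result should be read as an idealized upper bound on the bandwidth benefit, since the peak-power budget and the remaining power components are outside this simple model.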
CONCLUSIONS
In this paper, we proposed a WOM-code PCM architecture for next-generation non-volatile memory applications. Two memory organization methods, wide-column and hidden-page, were proposed to provide the memory capacity required by WOM-codes. A PCM-refresh policy was proposed to increase the write performance of WOM-code PCM by periodically refreshing PCM rows that have reached the WOM-code rewrite limit. To balance memory overhead and write performance improvements in WOM-code PCM, we proposed a WOM-code cached PCM (WCPCM) architecture that uses WOM-code PCM as the cache alongside a conventional PCM main memory array. Finally, we used additional transitions from the conventional $\langle 2^2 \rangle^2/3$ WOM-code transition graph to evaluate endurance-WOM-code (e-WOM-code) architectures that explicitly address PCM endurance. The e-WOM-code is realized in practice through the incremental coding for endurance (ICE) algorithm embedded in the memory controller, and it reduces the number of bit-writes over unencoded PCM memory. Evaluation on traces from multiple benchmarks using the modified DRAMSim2 simulator shows that the e-WOM-code PCM architecture can reduce memory write (read) latency by 19.8 percent (14.7 percent) and the number of bit-writes over unencoded PCM without (with) DCW by 83.0 percent (22.1 percent) on average across general-purpose (SPEC CPU2006), embedded (MiBench), and high-performance (SPLASH-2) benchmarks. When PCM-refresh is integrated with e-WOM-code PCM, memory write (read) latency is reduced by 51.5 percent (44.1 percent) and the number of bit-writes over unencoded PCM without DCW is reduced by 76.5 percent on average across the benchmarks; there is, however, an increase of 19 percent in the number of bit-writes over unencoded PCM with DCW. Finally, for just 4.7 percent memory overhead, e-WCPCM reduces memory write (read) latency by 47.5 percent (41.6 percent) and the number of bit-writes over unencoded PCM without DCW by 68.1 percent on average across the benchmarks. 
Although there is an increase of 49 percent in the number of bit-writes over unencoded PCM with DCW, e-WCPCM provides a practical, low-overhead memory architecture solution to address the write latency and endurance problems in PCM.
