Multilevel/triple-level cell nonvolatile memories (MLC/TLC NVMs) such as phase-change memory (PCM) and resistive RAM (RRAM) are the subject of active research and development as replacement candidates for DRAM, which is limited by its high refresh power and poor scaling potential. In addition to the benefits of nonvolatility (low refresh power) and improved scalability, MLC/TLC NVMs offer high data density and memory capacity over DRAM. However, the viability of MLC/TLC NVMs is limited primarily due to the high programming energy and latency as well as the low endurance of NVM cells; these are primarily attributed to the iterative program-and-verify procedure necessary for programming the NVM cells.
INTRODUCTION
The ITRS projects that resistance-class nonvolatile memory (NVM) technologies such as phase-change memory (PCM) and resistive RAM (RRAM) [Baek …

This research was supported in part by NSF CAREER Award CCF-1208933 and in part by NSF Award CCF-1217738. Authors' addresses: P. M. Palangappa and K. Mohanram, Department of Electrical and Computer Engineering, 1238 Benedum Hall, Pittsburgh, PA 15261; emails: pmp30@pitt.edu, kartik.mohanram@gmail.com.

[Figure, panel (b)] Performance penalties for MLC/TLC RRAM in comparison to SLC RRAM for the Stream benchmarks [McCalpin 1995] using MARSS [Patel et al. 2011] and DRAMSim2 [Rosenfeld et al. 2011]. The key takeaway is that the move to denser technologies like MLC/TLC usually degrades IPC, bandwidth, and energy.
Extended contributions. This article extends the work of Palangappa and Mohanram [2016] with the following contributions. First, we propose CompEx++ coding, a flexible CompEx coding scheme that leverages the variable compressibility of pattern-based compression techniques: it pairs each compression pattern with a custom expansion code to extract the maximum energy/latency benefit of CompEx coding. Along the lines of CompEx coding, we evaluate CompEx++ coding for FPC and BΔI by integrating $(3,2)_8$ and $(3,1)_8$ expansion coding. Second, we provide a mathematical model for the analysis of the energy gains of compression, expansion coding, and CompEx++ coding, independently. Finally, we discuss the integration of error detection and correction (EDAC) with CompEx++ coding using both theory and simulations. CompEx++ coding extends the NVM EDAC capability by integrating with stronger EDAC schemes; we show that CompEx++ coding reduces the NVM scrub overhead by 40% over standard EDAC schemes for no additional overhead.

Evaluation. In this work, the efficiency of CompEx/CompEx++ coding is evaluated on a 64-bit word, 64-byte cache line architecture; that is, every word is stored using 22 ($\lceil 64/3 \rceil$) TLCs, and every cache line is stored using 171 ($\lceil 512/3 \rceil$) TLCs. CompEx/CompEx++ coding relies on a codec embedded in the NVM module controller, which abstracts all data manipulations from the memory controller on the processor side, allowing the processor to communicate seamlessly with the NVM module. On memory writes, the CompEx/CompEx++ codec attempts to compress the data at the word level using FPC or at the cache line level using BΔI. If compression is successful, the data is encoded using $(3,2)_8$/$(3,1)_8$ expansion coding before it is written into the NVM array; otherwise, the data is written as is into the NVM array.
Note that although a single tag bit is necessary to record the outcome of CompEx/CompEx++ coding, it is concatenated with the data (513 logical bits for BΔI and 65 logical bits for FPC) and absorbed into the last TLC at no cost ($\lceil 513/3 \rceil = \lceil 512/3 \rceil = 171$ TLCs for BΔI and $\lceil 65/3 \rceil = \lceil 64/3 \rceil = 22$ TLCs for FPC). On memory reads, the tag bit is recovered from the last TLC and used to determine if the word must be decoded using the CompEx/CompEx++ codec or if it can be forwarded directly to the NVM module controller. The logic overhead of the CompEx/CompEx++ codec is ≈10k/12.5k gates (<0.1% per NVM module); although the codec has a latency of two to three cycles, we show that this can be hidden through memory access pipelining.

Results. Our full-system simulations of a system that integrates TLC RRAM using MARSS [Patel et al. 2011] and DRAMSim2 [Rosenfeld et al. 2011] for workloads from the SPEC CPU2006 [SPEC 2006] benchmark suite show that CompEx coding improves system performance by 5.7%, as measured using IPC, and memory-system performance by 11.8%, as measured using memory bandwidth, in comparison to binary encoding using data-comparison write (DCW) (a read-modify-write process that only updates changed cells). Additionally, CompEx++ coding extends the benefits of CompEx coding to improve IPC and bandwidth by 10.6% and 19.9% over DCW and 5.2% and 6.4% over CompEx coding. For a deep, multibillion-instruction evaluation of CompEx/CompEx++ coding, we use NVMain [Poremba and Xie 2012] with the memory traces of SPEC CPU2006 benchmarks generated using the Intel Pin toolset [Luk et al. 2005]. Simulations of these traces show that CompEx/CompEx++ coding reduces total (active) write energy by 57%/61% (76%/78%), 16%/16% (15%/15%), and 25%/32% (23%/29%) on average over DCW, FPC, and BΔI, respectively.
Simultaneously, CompEx/CompEx++ coding reduces write latency by 23.5%/26%, 18.5%/19%, and 23.5%/28% on average in comparison to DCW, FPC, and BΔI, respectively, and improves TLC RRAM lifetime by 1.8×.
This article is organized as follows. Section 2 provides the background for energy/latency problems in MLC/TLC NVMs. Section 3 describes CompEx coding and CompEx++ coding with examples. Section 4 presents the evaluation methodology and the simulation setup. Section 5 presents evaluation results. Section 6 discusses related work, and Section 7 presents our conclusions.
BACKGROUND AND MOTIVATION
MLC/TLC NVM technologies such as PCM and RRAM are the subject of active research and development as replacement candidates for DRAM, which is limited by its high refresh power and poor scaling potential [Lee et al. 2008; Kang et al. 2011; Wong et al. 2012; Yue and Zhu 2013]. However, most MLC/TLC NVMs suffer from higher write energy [Yue and Zhu 2013] and latency [Jiang et al. 2012d] per cell, limited lifetime, and asymmetric read/write latencies [Kang et al. 2007; Choi et al. 2012; Jog et al. 2012; Yue and Zhu 2012]. In this section, we discuss the procedure used for programming an MLC/TLC NVM cell and its associated limitations.

MLC/TLC NVM P&V. MLC/TLC NVM exhibits nondeterministic behavior due to process variation, variation in material composition, and resistance drift. This nondeterminism makes it difficult to program an MLC/TLC NVM cell to the target resistance range using a single precise pulse of current or voltage. Therefore, in practice, MLC/TLC NVMs are programmed using an iterative program-and-verify (P&V) procedure to bring the resistance of the cell to the required range. P&V uses a combination of set-to-reset (STR) and reset-to-set (RTS). In STR (RTS), the cell is first brought to a full SET (RESET) state by applying a long (short) pulse of small (high) magnitude current. Following this, multiple short-duration RESET (SET) pulses are applied until the resistance of the cell is brought to the required range. However, since a TLC NVM has eight different resistance states, using only STR or only RTS leads to high write latency. Hence, TLC NVMs are programmed using a combination of both STR and RTS, depending on the proximity of the target state to the full SET/RESET states, as shown in Figure 2. Therefore, the write energy and latency for MLC/TLC NVM vary depending on the final state to which the cell is being programmed. For example, Figure 2 shows the average energy and latency for programming a TLC RRAM into different states. States 0 and 7 require the lowest energy/latency since they can be programmed using a single P&V iteration, whereas states 3 and 4 require the maximum energy/latency since they require the largest number of P&V iterations to be brought to the desired resistance range.

[Figure 2: energy and latency of programming a TLC RRAM using P&V across eight states.] States 0 through 3 and 4 through 7 are programmed using RTS and STR, respectively. Programming the central states (2, 3, 4, and 5) requires more energy and latency in comparison to the terminal states (0, 1, 6, and 7). The key takeaway is that the overall energy and latency can be reduced by encoding data using only the low energy states.

Impact of P&V on energy and latency. Iterative P&V has a negative impact on the energy and latency of MLC/TLC NVMs. First, writing to MLC/TLC NVMs requires much higher current and latency than their SLC counterparts due to the iterative P&V method.
Second, to prevent the modules from overloading the power supply and from overheating, NVM modules are restricted to writing a limited amount of data at once, which increases the write latency and impacts performance [Yue and Zhu 2012]. Third, since a single write access can require the MLC/TLCs of a word to be brought to different target states, the latency of the write operation is determined by the longest-latency cell write, creating a bottleneck for individual write operations. Finally, a single write for MLC/TLC NVMs involves multiple P&V cycles, which limits the lifetime of these memory cells to around $10^5$ to $10^8$ writes, in comparison to SLC NVM, which has a lifetime of $10^8$ to $10^{10}$ writes [Wong et al. 2012; Yue and Zhu 2013].
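The third point can be illustrated with a toy cost model in Python. This is only a sketch: the per-state iteration counts are hypothetical placeholders, not the measured values of Figure 2, and the model assumes that all cells of a word are programmed in parallel. The only properties taken from the text are that states 0 and 7 need a single P&V iteration, that states 3 and 4 need the most, and that the slowest cell sets the write latency.

```python
# Hypothetical P&V iteration counts per TLC state (NOT the values of Figure 2).
# Only the shape follows the text: terminal states cheap, central states costly.
PV_ITERATIONS = {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 3, 6: 2, 7: 1}

def programming_mode(state: int) -> str:
    """States 0-3 use reset-to-set (RTS); states 4-7 use set-to-reset (STR)."""
    return "RTS" if state <= 3 else "STR"

def word_write_cost(states):
    """Return (latency, total_iterations) for one word of TLC target states.

    In this toy model the latency is the iteration count of the slowest cell,
    and the total iteration count stands in for the write energy.
    """
    per_cell = [PV_ITERATIONS[s] for s in states]
    return max(per_cell), sum(per_cell)

with_central = word_write_cost([0, 3, 7, 4, 1, 6])   # two central-state cells
terminal_only = word_write_cost([0, 1, 7, 6, 1, 6])  # terminal states only
```

Under these assumed counts, a single central-state cell is enough to double the word latency, which is the bottleneck that encoding with only the low energy states avoids.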
CONTRIBUTIONS
Moving to an MLC/TLC technology from an SLC technology (and similarly to a TLC technology from an MLC technology) enables static trade-offs between memory density/capacity and memory energy/latency. We motivated the technology considerations and the system-level implications of these trade-offs in Figure 1. However, to the best of our knowledge, there has been no systematic effort to dynamically leverage these trade-offs to realize the density/capacity benefits of an MLC NVM (TLC NVM) while also reaping the low energy/latency benefits of an SLC NVM (SLC/MLC NVM). In other words, we believe that the ability to integrate a dense high-capacity MLC/TLC technology for the memory while dynamically operating that memory in the SLC/MLC/TLC mode (fully or partially, as developed in the theory of expansion codes in this section) has far-reaching implications for simultaneously improving NVM energy, latency, density/capacity, and lifetime (the lifetime improvement is indirect, due to a reduction in the P&V effort needed for memory operation). CompEx/CompEx++ coding, as instances of such a dynamic trade-off approach, relies on the holistic integration of two independent/orthogonal but complementary areas of research (data compression and data coding). The core idea of CompEx/CompEx++ coding, developed over the rest of this section, is to apply expansion coding to selectively compressed data such that the resulting data in expansion-coded form does not exceed the original data width. Section 3.1 introduces two pattern-based compression techniques that can be used in CompEx coding: frequent pattern compression (FPC) [Alameldeen and Wood 2004] and BΔI [Pekhimenko et al. 2012]. Section 3.2 provides a formal treatment of expansion coding. Section 3.3 describes CompEx coding from first principles with examples and discusses a practical architecture for CompEx coding. Section 3.4 presents CompEx++ coding, which extends the energy/latency benefits of CompEx coding.
Finally, Section 3.5 demonstrates the integration of CompEx/CompEx++ coding with ECC and encryption.
Compression
Although CompEx/CompEx++ coding is agnostic to the choice of compression technique, good candidates for CompEx/CompEx++ coding should have features such as low latency, low overhead, high compressibility, and low complexity. Based on our survey of compression techniques, two pattern-based compression techniques, FPC and BΔI, possess these desirable traits for integration into CompEx/CompEx++ coding.
Frequent pattern compression.
FPC is a pattern-based compression scheme that leverages program data statistics to successfully compress a wide range of data. FPC was originally proposed for 32-bit data word compression in L2 caches to increase their memory capacity [Alameldeen and Wood 2004]. More recently, FPC was extended to reduce bit writes in NVMs [Dgien et al. 2014]. In this work, we extend FPC to a 64-bit word using the patterns tabulated in Table I. FPC maintains a table of the seven most frequent data patterns, against which the incoming data is compared. When the incoming data matches one of the frequent patterns, it is compressed and stored along with a 3-bit prefix that identifies the pattern (column 1 of the table). A 1-bit tag is used to indicate whether the written data is in its compressed or uncompressed form. During a read access, the tag bit and prefix are used to uncompress the data to a full word. Columns 3, 4, and 5 show examples for each of the data patterns along with their compressed form and compressed data size. Column 6 represents the range of data values that each pattern can compress, and column 7 represents the percentage of data words compressed by each pattern (FPC cumulatively compresses about 60% of write accesses) in trace-based simulations of the SPEC CPU2006 [SPEC 2006] benchmark suite.
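The match-and-prefix flow described above can be sketched in Python as follows. The three patterns and their prefix values are illustrative assumptions only (an all-zero word and two sign-extended widths); they are not the seven patterns or the prefix assignments of Table I, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sign_extends_from(word: int, bits: int) -> bool:
    """True if the 64-bit word equals the sign extension of its low `bits` bits."""
    low = word & ((1 << bits) - 1)
    if low >> (bits - 1):                       # sign bit set: extend with ones
        low |= MASK64 & ~((1 << bits) - 1)
    return low == (word & MASK64)

# (prefix, payload width in bits, matcher); patterns and prefixes are hypothetical.
PATTERNS = [
    (0b000, 0,  lambda w: w == 0),
    (0b001, 8,  lambda w: sign_extends_from(w, 8)),
    (0b010, 32, lambda w: sign_extends_from(w, 32)),
]

def fpc_compress(word: int):
    """Return (tag, prefix, payload_bits, payload); tag 0 means stored as is."""
    for prefix, width, matches in PATTERNS:
        if matches(word):
            return 1, prefix, width, word & ((1 << width) - 1)
    return 0, None, 64, word & MASK64
```

The first matching row wins, so rows are ordered from the smallest payload to the largest; a word that matches no row is written uncompressed with the tag bit cleared.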
Base-Delta-Immediate. BΔI was originally proposed for on-chip cache data compression [Pekhimenko et al. 2012]; it stores the compressed data using a "base" (B) and a "delta" (Δ), a series of offsets with respect to B. BΔI proposes that a cache line $C = \{V_0, V_1, \ldots, V_{n-1}\}$ can be compressed and written as $X = \{B, \Delta_0, \Delta_1, \ldots, \Delta_{n-1}\}$, along with a 4-bit tag, where $B = V_0$ (by definition), $\Delta_i = V_i - B$, $n = \text{sizeof(cache line)}/k$, and $k = \text{sizeof}(\Delta)$. It is reported that data stored in a cache line is often regular, with limited dynamic range [Pekhimenko et al. 2012]. Whereas the regularity in the data is due to the common use of array data structures to store program data, the limited dynamic range in the stored data is due to the nature of computation. Similar to FPC, BΔI maintains a pattern table for compression. BΔI leverages regularity in data to compress cache lines using 64-byte patterns; Table II shows the eight different cache line patterns that BΔI is capable of compressing, along with illustrative examples for each of the patterns. The last column represents the percentage of data words compressed by each pattern (BΔI cumulatively compresses about 46% of write accesses) in trace-based simulations of the SPEC CPU2006 [SPEC 2006] benchmark suite. Given a cache line, it is matched against each row (pattern) of the table. If the cache line data matches a pattern, then the compressed data along with the prefix is stored in the data memory; the tag bit is set to indicate that the data stored is in compressed format. In contrast, if the data is not compressed, then it is stored as is in the uncompressed format; the tag bit is reset to record this information.
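A minimal Python sketch of the base-plus-delta idea, using the definitions B = V_0 and Δ_i = V_i − B from the text. It assumes signed deltas of a fixed byte width and operates on a list of integers rather than raw cache line bytes; the function names are hypothetical, and the pattern table of Table II is not modeled.

```python
def bdi_compress(values, delta_bytes):
    """Try to encode `values` as (base, deltas) with signed deltas of delta_bytes.

    Returns (base, deltas) on success, or None if any delta does not fit.
    """
    base = values[0]                          # B = V_0 by definition
    lo = -(1 << (8 * delta_bytes - 1))        # smallest signed delta (assumption)
    hi = (1 << (8 * delta_bytes - 1)) - 1     # largest signed delta (assumption)
    deltas = [v - base for v in values]       # delta_i = V_i - B
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None

def bdi_decompress(base, deltas):
    """Recover the original values V_i = B + delta_i."""
    return [base + d for d in deltas]

# Eight 8-byte values forming one 64-byte line, e.g. addresses of array elements.
line = [0x1000 + 8 * i for i in range(8)]
packed = bdi_compress(line, delta_bytes=1)
```

For this regular line, every delta fits in one byte, so the line could be stored as one 8-byte base plus eight 1-byte deltas instead of 64 bytes.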
Due to the low overhead and high compression ratio of FPC and BΔI, this work is motivated by the potential of integrating one of these compression techniques with expansion coding for a holistic solution that reduces energy and latency without any additional memory overhead.
Expansion Coding
As described in Section 2, the iterative P&V procedure used for programming MLC/TLC NVMs results in the central MLC/TLC states requiring more energy and latency in comparison to the terminal states. This motivates data encoding using only the low energy states, avoiding the high energy states to reduce the overall energy and latency. This work uses expansion coding, which encodes data using only these lower energy states. The rest of this section provides formal definitions for expansion coding and illustrates expansion coding with examples.

Definition. A $(k,m)_q$ expansion code, $m < k$, is a linear block code with $q$-ary codewords of length $k$ encoding $q^m$ $q$-ary message words. An expansion code encodes data using the $p$ lowest energy states of the total $q$ states ($p < q$), where $p = q^{m/k}$, incurring a memory overhead of at least $(k/m) - 1$ (see Chapter 3 of Costello and Lin [2004]).

[Figure caption fragment: Energy for $(3,2)_8$ expansion coding < $(6,5)_8$ expansion coding < binary coding.]
We elaborate on the $(k,m)_q$ expansion code with the help of Figure 3. Given a technology where each cell can represent $q$ states, we can encode $\log_2 q$ bits of data in each cell. The total number of cells in each message is $m$. Hence, each message word is one of the $q^m$ possible words, where each word is $q$-ary. This is represented by the left-hand side of the figure. All of these messages are mapped to $k$ cells, where each cell can represent only $p$ states logically (although physically they are still capable of storing $q$ states). This is represented by the right-hand side of the figure. The code expands an $m$-cell word to a $k$-cell word after encoding, hence the name expansion code. Expansion coding lowers both NVM energy and latency in practice, as the lowest energy TLC states also require fewer P&V cycles (Figure 2).
Example. For ease of understanding, we illustrate $(3,2)_8$ expansion coding and incomplete data mapping (IDM), which is an instance of $(6,5)_8$ expansion coding. First, consider an example of the $(3,2)_8$ expansion code, which encodes 16-bit data using TLC RRAM as shown in Figure 4. In the regular case, data is encoded using all eight TLC states, as shown in Figure 4(a). Every slice of three logical data bits is stored in one TLC, requiring six ($= \lceil 16/3 \rceil$) TLCs and 120.1 pJ. In contrast, the $(3,2)_8$ expansion code uses only four ($= 8^{2/3}$) out of the eight TLC states to encode the data (Figure 4(b)). Every slice of two logical data bits is stored in one TLC, requiring eight ($= 16/2$) TLCs and 39.2 pJ. In this example, although expansion coding results in energy and latency reduction, it incurs 50% memory overhead. IDM seeks to lower the memory overhead to 20% by using the $(6,5)_8$ expansion code, which utilizes the six lowest energy TLC states. Whereas binary coding encodes three information bits into each TLC, IDM encodes five data bits into two TLCs. Since two TLCs together can store a maximum of $6 \times 6 = 36$ states, they can easily encode the $2^5 = 32$ states required for storing 5 bits of information. For 16-bit data (Figure 4(c)), binary encoding uses six TLCs and requires 120.1 pJ, whereas IDM encodes the 16 data bits using seven TLCs (with six states each) and 94.3 pJ. Although IDM in combination with dynamic data remapping has been shown to improve the lifetime of MLC/TLC NVMs, it provides only a marginal reduction in energy (up to 15%) for 20% memory overhead. In this work, we look to data statistics as a potential source of flexibility to realize the full energy and latency benefits of expansion coding for negligible memory and logic overhead.
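The cell counts and overheads quoted in this example follow directly from the definition of a $(k,m)_8$ code and can be checked with a short Python sketch. The helper names are hypothetical, and binary coding is treated here as the degenerate case k = m = 1, which is an assumption made only so that one formula covers all three cases.

```python
Q = 8  # states per TLC

def cells_needed(data_bits: int, k: int, m: int) -> int:
    """TLCs needed to store data_bits under a (k,m)_8 expansion code.

    A group of k restricted cells carries the information of m unrestricted
    cells, i.e. 3*m bits, so the count is ceil(data_bits * k / (3 * m)).
    """
    return -(-(data_bits * k) // (3 * m))    # integer ceiling division

def memory_overhead(k: int, m: int) -> float:
    """Lower bound on the memory overhead, (k/m) - 1."""
    return k / m - 1

def states_used(k: int, m: int) -> float:
    """p = q^(m/k); a practical code rounds this up to a whole number of states."""
    return Q ** (m / k)

binary_cells = cells_needed(16, 1, 1)   # no expansion
exp32_cells = cells_needed(16, 3, 2)    # (3,2)_8
idm_cells = cells_needed(16, 6, 5)      # (6,5)_8, i.e. IDM
```

For 16-bit data this gives six, eight, and seven TLCs, respectively, with overhead bounds of 50% for $(3,2)_8$ and 20% for $(6,5)_8$. Note that $8^{5/6} \approx 5.66$, which is why IDM needs six states per cell.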
CompEx Coding
The core idea of this work is to apply expansion coding selectively to compressed data such that the resulting data in expanded form does not exceed the original data width. For example, if $n$-bit raw data is compressed to $\le 2n/3$ bits, the $(3,2)_8$ expansion code will incur no memory overhead and also yield the full energy and latency benefits of expansion coding. However, to be successful in practice, the compression scheme must compress a large fraction of the memory traffic while also being lossless and simple to design with low performance overhead. FPC and BΔI are excellent candidates for this purpose due to their ability to compress a wide range of data for low overhead (refer to Tables I and II). Furthermore, since the largest size of an FPC-compressed 64-bit word (or BΔI-compressed 64-byte cache line) is only 35 bits (292 bits), we can easily layer $(3,2)_8$ expansion coding atop FPC (or BΔI) without incurring additional memory overhead. In this work, all examples and evaluation of CompEx coding use the $(3,2)_8$ expansion code (and the $(6,5)_8$ code as necessary/appropriate).
Example. Without loss of generality, we illustrate CompEx coding using a 16-bit word as an example, which is compressed to 8 bits using FPC as shown in Figure 5. In the first case, we encode the data using all 8 TLC states from Figure 2(a). This allows us to encode the data using just 3 of the 6 TLCs in the memory location for 97.0 pJ. However, using CompEx coding, as shown in Figure 5(b), the 8-bit compressed data can be encoded using the $(3,2)_8$ expansion code. This uses 4 of the 6 TLCs for 24.9 pJ. This example illustrates the integration of FPC with expansion coding without any additional memory overhead. Therefore, for a 64-bit word, CompEx coding encodes an FPC-compressed 64-bit word to a maximum of 18 ($= \lceil 35/2 \rceil$) TLCs. Similarly, CompEx coding encodes a BΔI-compressed 64-byte cache line to a maximum of 146 ($= \lceil (36 \times 8 + 4)/2 \rceil$) TLCs.
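The following Python sketch mirrors this example: it packs two bits per TLC using only states 0, 1, 6, and 7, and checks that the expanded data still fits in the TLCs of the original word. The assignment of 2-bit slices to states is an assumption (the mapping used in Figures 4 and 5 is not given in the text), the function names are hypothetical, and the tag bit and compression prefix are left out for brevity.

```python
# Assumed mapping of 2-bit slices onto the four low-energy states.
SLICE_TO_STATE = {0b00: 0, 0b01: 1, 0b10: 6, 0b11: 7}
STATE_TO_SLICE = {s: b for b, s in SLICE_TO_STATE.items()}

def expand_32(bits: str):
    """Encode a bit string, two bits per TLC, using states 0, 1, 6, 7 only."""
    if len(bits) % 2:
        bits += "0"                          # pad the last slice (assumption)
    return [SLICE_TO_STATE[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2)]

def contract_32(states, n_bits: int) -> str:
    """Decode TLC states back to the first n_bits of the bit string."""
    bits = "".join(format(STATE_TO_SLICE[s], "02b") for s in states)
    return bits[:n_bits]

def compex_fits(word_bits: int, compressed_bits: int) -> bool:
    """CompEx applies only if the expanded data fits in the word's own TLCs."""
    available = -(-word_bits // 3)           # ceil(word_bits / 3)
    needed = -(-compressed_bits // 2)        # ceil(compressed_bits / 2)
    return needed <= available

states = expand_32("10110001")               # an 8-bit compressed payload
```

With these definitions, the 8-bit payload occupies four of the six TLCs of a 16-bit word, and the worst cases quoted above (35 bits in a 64-bit word, 292 bits in a 512-bit line) also fit.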
No tag overhead. Although a single tag bit is necessary to record the outcome of CompEx coding, it is concatenated with the data (65 bits for the FPC-based and 513 bits for the BΔI-based CompEx codec) and absorbed into the last TLC at no cost ($\lceil 65/3 \rceil = 22$ TLCs for FPC and $\lceil 513/3 \rceil = 171$ TLCs for BΔI). Consider the 16-bit example shown in Figure 5(b), where the tag bit is appended to the end of the data word and encoded into the last TLC. Since $2^n$ ($n \ge 1$) can never be a multiple of 3, the tag bit can always be encoded within the existing TLCs without any overhead for word sizes that are integer powers of 2. Furthermore, to ensure that the tag bit does not impact latency, we reserve state "7" of the TLC to indicate CompEx-coded data.

3.3.1. Theoretical Analysis of CompEx Coding. As we have established that CompEx coding can be applied without additional memory overhead, we now theoretically estimate the energy reduction for TLC RRAM whose energy and latency numbers are given by Figure 2. We start by calculating the average energy for TLCs with eight states, followed by the average energy for only the four lowest energy TLC states. First, we assume that during writes, all permitted states of a TLC are equally likely to be programmed [Cho and Lee 2009]. Thus, the average/expected energy for a TLC with eight equally likely states (i.e., binary coding) is given by the sum of the energy entries for all states in Figure 2 divided by 8, that is, $E[\text{Energy}_{\text{binary}}] = \frac{1}{8}\sum_{i=0}^{7} e_i = 16\,\text{pJ}$. Here, $e_i$ is the energy required for programming the TLC RRAM to the $i$th state. Second, the average energy for a TLC that utilizes only states 0, 1, 6, and 7, i.e., $(3,2)_8$ expansion coding, is given by the sum of the energies of this subset of states divided by 4, that is, $E[\text{Energy}_{(3,2)_8}] = \frac{1}{4}\sum_{i \in \{0,1,6,7\}} e_i = 4.7\,\text{pJ}$.

Since the $(3,2)_8$ expansion code uses $(3/2)\times$ more TLCs than binary encoding, we also need to factor this number into our energy computation. Therefore, the overall energy reduction using the $(3,2)_8$ expansion code is $E[\text{Energy}_{\text{binary}}]/E[\text{Energy}_{(3,2)_8}] = (16/4.7) \times (2/3) \approx 2.3$. Additionally, since a fraction of the data is left uncompressed by FPC, the overall reduction in energy by using CompEx coding in comparison to binary encoding is $E[\text{Energy}_{\text{binary}}]/E[\text{Energy}_{\text{CompEx}}] = 2.3\,p$, where $p$ is the compression ratio of the compression scheme. Similarly, for the $(6,5)_8$ expansion code, the average energy reduction can be derived to be 1.4× using the same procedure (not discussed here for brevity). Our evaluation of expansion coding using real-world workloads (discussed in Section 5.1) closely agrees with the theoretical results: a 2.1×/1.3× practical energy reduction in comparison to the 2.3×/1.4× derived theoretically for $(3,2)_8$/$(6,5)_8$ expansion coding.

The CompEx codec (encoder-decoder) logic is embedded completely inside the NVM module controller, between the NVM array and the data bus, to seamlessly encode and decode the accessed data, as shown in Figure 6(a). The CompEx code encoder is made up of compression logic followed by an expansion code encoder. Similarly, the CompEx code decoder is made up of an expansion code decoder followed by decompression logic. FPC-based CompEx coding uses word-size write/read accesses, whereas BΔI-based CompEx coding uses cache line size accesses.

[Figure 7 caption fragment: (a) Encoder: The incoming 64-bit write data is first encoded using the FPC encoder (latency = 2 cycles) followed by the expansion coding encoder (latency = 1 cycle) to obtain the encoded data and tag bit. (b) Decoder: The tag bit and the encoded data are first decoded for expansion coding (latency = 1 cycle) and then decoded for FPC (latency = 1 cycle). Our implementation hides these latencies using memory access pipelining.]
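The arithmetic of the theoretical estimate in Section 3.3.1 can be reproduced with a few lines of Python. The two averages (16 pJ and 4.7 pJ) are the values quoted in the text; the function name is hypothetical, and the uniform-state assumption is the one stated in the analysis.

```python
E_BINARY = 16.0   # pJ, average over all eight TLC states (from the text)
E_EXP_32 = 4.7    # pJ, average over states 0, 1, 6, and 7 (from the text)

def expansion_energy_gain(e_full: float, e_subset: float, k: int, m: int) -> float:
    """Energy ratio binary / (k,m)_8 for the same data, assuming uniform states.

    The expansion code writes k/m times as many cells, each at the cheaper
    subset average, so the per-cell ratio is scaled by m/k.
    """
    return (e_full / e_subset) * (m / k)

gain_32 = expansion_energy_gain(E_BINARY, E_EXP_32, 3, 2)
```

Here `gain_32` evaluates to roughly 2.27, which rounds to the 2.3× reduction quoted for $(3,2)_8$ expansion coding.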
Write. Whenever the MLC/TLC NVM module receives a write access from the memory controller on the processor side, the CompEx codec handles it as follows. First, the incoming data is passed through the compression logic, which compares the data to all of the compression patterns to attempt data compression. If the data is compressible, then the compressed data is passed through the expansion code encoder; incompressible data is sent directly to the write circuit. The expansion code encoder encodes every 2-bit slice of the compressed data to a 3-bit codeword. Since the width of the encoded data is always less than 64 bits (for FPC) and 64 bytes (for BΔI), CompEx coding has zero memory overhead. Note that the tag bit may need to be updated to record the outcome of CompEx coding, regardless of whether it is successful or unsuccessful. However, as discussed earlier, this tag bit can be concatenated and absorbed into the data using the last TLC without requiring an extra cell for the tag bit.

Read. When the MLC/TLC NVM module receives a read access from the memory controller, the CompEx codec inside the NVM module controller decodes the data before forwarding it to the memory controller. The read circuit reads the cells of the whole word and forwards the data to the CompEx codec. The tag bit is recovered from the last TLC of the read word and used to determine whether the data has to be CompEx decoded or forwarded directly to the NVM module controller.
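The claim that the tag bit never costs an extra cell can be checked numerically. The Python sketch below (helper names are hypothetical) compares the TLC count with and without the appended tag bit for word sizes that are powers of 2.

```python
def tlcs(bits: int) -> int:
    """Number of 3-bit TLCs needed for `bits` logical bits: ceil(bits / 3)."""
    return -(-bits // 3)

def tag_is_free(word_bits: int) -> bool:
    """True if appending one tag bit does not add a TLC."""
    return tlcs(word_bits + 1) == tlcs(word_bits)

# 2^n is never a multiple of 3, so the last TLC always has a spare bit.
free_for_powers_of_two = all(tag_is_free(2 ** n) for n in range(1, 16))
```

The check fails for a width such as 6 bits, which fills its two TLCs exactly; this is why the argument is stated only for power-of-2 word sizes.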
3.3.3. CompEx Codec Logic Overhead. To estimate the logic overhead, we designed and synthesized a 64-bit FPC-based CompEx codec (Figure 7). The design of the BΔI-based compressor is adopted from the original BΔI proposal [Pekhimenko et al. 2012]. The estimated logic overhead for the CompEx codec, tabulated in Figure 6(b), is ≈10k 2-input NAND gates (<0.1% per NVM module). Furthermore, although the CompEx codec has an estimated latency of 3 cycles for encoding and 2 cycles for decoding, our implementation uses memory access pipelining to hide this in practice.
CompEx++ Coding
This section presents CompEx++ coding, which extends the energy/latency benefits of CompEx coding by leveraging the variable compressibility of pattern-based compression schemes. Pattern-based compression schemes like FPC and BΔI use multiple compression patterns, which compress data to different sizes. This variability in compressed data size motivates us to explore a scheme with a custom expansion code for each compression pattern, as opposed to a single expansion code for the entire scheme. In other words, we design CompEx coding for each of the compression patterns separately such that the custom expansion code along with the compression pattern maximizes the energy/latency advantages.
CompEx coding using FPC (BΔI) compresses 64-bit (512-bit) data, which occupies 22 (171) TLCs, to between 3 and 35 bits (3 and 292 bits), as illustrated in Table I (Table II). Although the $(3,2)_8$ expansion code is a good choice when we consider FPC or BΔI as a whole, it might not be the optimum choice for each pattern individually. To find the best choice for each of the patterns, we need to integrate a custom expansion code, based on the compressed data size, for each compression pattern.
Example. Without loss of generality, and for ease of understanding, consider a 16-bit word FPC example, which requires six TLCs for storage, as illustrated in Figure 8(a). The incoming write data, as illustrated in Figure 8(a)(i), requires six TLCs and 77.2 pJ. However, the write data is recognized to be a sign-extended half-byte that can be written using only 4 bits after compression. Note that although FPC requires the storage of a 3-bit prefix along with the data, we ignore the 3-bit prefix in this example for brevity. First, the compressed data is encoded using conventional binary encoding (3 bits per TLC) for 77.9 pJ, as illustrated in Figure 8(a)(ii). Second, regular CompEx coding with $(3,2)_8$ expansion coding encodes the compressed data by packing two logical bits into each TLC; each 2-bit group is encoded using the four low energy states of the TLC (0, 1, 6, and 7), resulting in 16.7 pJ, as illustrated in Figure 8(a)(iii). Finally, since the compressed data size is about a quarter of the original size, it is possible to encode the data using $(3,1)_8$ expansion coding. Figure 8(a)(iv) illustrates data encoding using $(3,1)_8$ expansion coding; this requires five TLCs for data encoding, which is fewer than the six originally available TLCs, for only 8.5 pJ. Thus, by making a smart choice of expansion codes (shown in Table III) that is customized for each compression pattern, CompEx++ coding reduces the overall energy/latency over CompEx coding.
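The per-pattern choice in this example can be sketched as a small selection rule in Python: pick the code that stores the fewest bits per TLC while still fitting in the cells of the original word. This is a simplified illustration, not the selection of Table III; it ignores the tag bit and the 3-bit prefix (as the example does for the prefix), and the function name is hypothetical.

```python
def pick_expansion_code(compressed_bits: int, available_tlcs: int):
    """Return the bits-per-TLC value m of the sparsest (3,m)_8 code that fits.

    m = 1 is (3,1)_8, m = 2 is (3,2)_8, and m = 3 is plain binary coding.
    Returns None if even binary coding does not fit.
    """
    for m in (1, 2, 3):
        if -(-compressed_bits // m) <= available_tlcs:   # ceil(bits / m)
            return m
    return None

half_byte = pick_expansion_code(4, 6)    # the 4-bit payload of this example
one_byte = pick_expansion_code(8, 6)     # the 8-bit payload of the CompEx example
```

For the 4-bit payload the rule selects one bit per TLC, i.e. $(3,1)_8$, and for an 8-bit payload it falls back to $(3,2)_8$, in line with the two worked examples.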
Encoding the compression prefix. Since CompEx coding encodes using only one type of expansion coding, encoding/decoding of the prefix is trivial: it depends only on the compression tag bit. However, CompEx++ coding uses multiple types of expansion coding, and hence it becomes important to carefully encode the prefix bits to eliminate ambiguity in decoding and latency overheads during encoding. Figure 8(b) illustrates the encoding and decoding of compression prefixes. Since neither CompEx coding nor CompEx++ coding requires any modification to the read circuit, the encoded data is always first decoded as conventional binary encoding; that is, each TLC is decoded to be in one of the eight possible states. Following the read circuit, the data is expansion code decoded using the compression tag bit. However, in the case of CompEx++ coding, it is not clear if the data should be decoded for $(3,2)_8$ expansion coding or $(3,1)_8$ expansion coding. This ambiguity is resolved as follows. $(3,2)_8$ expansion coding encodes the 3-bit compression prefix using two TLCs, and $(3,1)_8$ expansion coding encodes the same prefix using three TLCs. If the first of the two or three prefix TLCs encodes state 1 or 6, the data is decoded using $(3,2)_8$ expansion coding. In contrast, if the first of these TLCs encodes state 0 or 7, the data is decoded using $(3,1)_8$ expansion coding. Such an encoding inherently prevents the compression prefix from becoming a latency bottleneck during encoding. Note that such a compression prefix encoding enables CompEx++ coding to encode a maximum of four compression patterns using each of $(3,2)_8$ and $(3,1)_8$ expansion coding. However, if more than four compression patterns are encoded using one type of expansion coding (e.g., BΔI-based CompEx++ coding, as shown in Table III), then additional prefix bits become necessary.
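The disambiguation rule reduces to a lookup on the state of the first prefix TLC. A minimal Python sketch, assuming the data has already been identified as compressed via the tag bit (the function name and the returned labels are hypothetical):

```python
def expansion_code_of(first_prefix_state: int) -> str:
    """Pick the expansion decoder from the first prefix TLC of CompEx++ data."""
    if first_prefix_state in (1, 6):
        return "(3,2)_8"
    if first_prefix_state in (0, 7):
        return "(3,1)_8"
    # States 2-5 are the high-energy central states avoided by both codes.
    raise ValueError("unexpected state in an expansion-coded prefix")
```

Because the decision depends on a single cell that the read circuit has already resolved to one of eight states, the decoder selection needs no extra read of the array.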
3.4.1. Theoretical Analysis of CompEx++ Coding. We now evaluate the theoretical energy gains of CompEx++ coding using TLC RRAM as an example. We start by mathematically describing a pattern-based compression scheme and establish the bit-write (cell-update) reduction for a given compression scheme. This is followed by the mathematical description of custom expansion codes for each of the compression patterns. Finally, we integrate the closed-form expressions for compression and expansion coding to obtain a closed-form expression for the energy reduction due to CompEx++ coding.
Compression. Consider a pattern-based compression scheme with $n$ different compression patterns, represented by $P_i$, where $i = 0, 1, \ldots, (n-1)$. The $i$th pattern $P_i$ has a compression ratio of $r_i$ ($0 \le r_i \le 1$) and matches a fraction $w_i$ ($0 \le w_i \le 1$) of the whole write traffic. Such a compression scheme would reduce bit writes to $\sum_{i=0}^{n-1} r_i w_i$ ($\le 1$) times the original bit writes. This implies that a compression scheme with low $r$ (high compressibility) and high $w$ (matching a large fraction of the traffic) can significantly reduce the bit writes.
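As a small worked example of the bit-write factor, the sketch below evaluates the weighted sum of $r_i w_i$ for a made-up set of patterns. It assumes that incompressible traffic is modeled as one more pattern with $r_i = 1$, so that the $w_i$ sum to 1; the pattern values and the name `bit_write_factor` are hypothetical.

```python
def bit_write_factor(patterns):
    """Return sum(r_i * w_i) for a list of (r_i, w_i) pairs."""
    total_w = sum(w for _, w in patterns)
    if abs(total_w - 1.0) > 1e-9:
        raise ValueError("traffic fractions w_i must sum to 1")
    return sum(r * w for r, w in patterns)


# Hypothetical scheme: two compressible patterns plus incompressible traffic.
example = [(0.25, 0.3), (0.5, 0.2), (1.0, 0.5)]
factor = bit_write_factor(example)  # 0.075 + 0.1 + 0.5
```

In this made-up case the scheme writes about 67.5% of the original bits; lowering $r_i$ or shifting traffic toward the compressible patterns lowers the factor.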
Expansion coding. The optimum expansion code for the $i$th compression pattern leverages the entire space recovered using compression but does not spill over the original data size. This implies that for a data word (cache line) size of $L$, for a word-level (cache line-level) compression scheme, the ideal expansion code would be the $(L, r_i \cdot L)_8$ expansion code. Furthermore, the $(L, r_i \cdot L)_8$ expansion code encodes data using only $8^{r_i}$ (refer to Section 3.2 for details) low energy states (the set of these low energy states is defined as $S_i$) out of the eight states for TLC RRAM. Therefore, the energy reduction from encoding a TLC using $(L, r_i \cdot L)_8$

3.4.2. CompEx++ Codec Logic Overhead. Since CompEx++ coding requires multiple expansion code codecs, the logic overhead is higher in comparison to CompEx coding. However, it is important to note that since these expansion codecs are multiplexed (chosen) depending on the compression pattern and do not appear in series, there is no additional latency penalty for CompEx++ coding over CompEx coding. Our evaluation of the 64-bit FPC CompEx++ codec shows that each additional expansion code codec requires ≈2.5k 2-input NAND gates. Therefore, along with the original 10k gates for the rest of the logic (CompEx coding), the total logic overhead for CompEx++ coding is ≈12.5k 2-input NAND gates. However, this might not be a steep price to pay if the energy gain from CompEx++ coding is significantly higher than that of CompEx coding.
CompEx/CompEx++ Coding, ECC, and Encryption
This section discusses the integration of error correction support and encryption with CompEx/CompEx++ coding while preserving the low energy and latency benefits of CompEx/CompEx++ coding. First, we describe the integration of error correction support with CompEx/CompEx++ coding. Second, we establish the improvement in NVM scrub overhead due to CompEx/CompEx++ coding using theory and evaluate the same using simulations. Finally, we discuss encryption in CompEx/CompEx++ coding.
CompEx++ECC (CompEx/CompEx++ Coding with ECC)
MLC/TLC NVMs are susceptible to both soft and hard errors, which necessitates the use of EDAC techniques such as ECC [Awasthi et al. 2012] and ECP [Schechter et al. 2010]. Since a write may alter both the data and the EDAC fields, the benefits of CompEx/CompEx++ coding the data field may be nullified by high latency writes in the EDAC field. However, by smartly organizing data and EDAC bits, we can preserve the energy/latency benefits of CompEx/CompEx++ coding for no additional overhead.
NVMs with EDAC usually use separable coding techniques (i.e., the data field is stored separate from the EDAC field). Therefore, we propose that whenever the data field is compressible, the EDAC field be written in expansion coded form; the additional cells required for expansion coding the EDAC field can be obtained from the residual cells after compression of the data field. From Tables I and II

Without loss of generality and for ease of understanding, consider a 64-bit word, 8-bit ECC example, as illustrated in Figure 9. The 64-bit data and the 8-bit ECC require $\lceil 64/3 \rceil + \lceil 8/3 \rceil = 25$ TLCs using conventional binary encoding, as illustrated in the figure. If the data is compressible, then the compressed data field requires at most $\lceil 35/2 \rceil = 18$ TLCs, whereas the 8-bit ECC requires $\lceil 8/2 \rceil = 4$ TLCs using $(3,2)_8$ expansion coding (i.e., one additional TLC in comparison to conventional binary encoding). However, the additional TLC required to encode the 8-bit ECC can be repurposed from extra TLCs remaining unused in the data field after encoding using CompEx/CompEx++ coding, as illustrated in the figure. In contrast, if the data field is incompressible, then both the data and EDAC fields are written using conventional binary coding. Thus, CompEx/CompEx++ coding can be used even in the presence of ECC while preserving its energy and latency benefits.
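The cell budget in the 64-bit word, 8-bit ECC example can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text (3 bits per TLC for binary encoding, 2 bits per TLC for $(3,2)_8$ expansion coding); the helper name `tlcs_needed` and the variable names are hypothetical.

```python
from math import ceil


def tlcs_needed(bits: int, bits_per_tlc: int) -> int:
    """Number of TLCs required to hold `bits` at the given density."""
    return ceil(bits / bits_per_tlc)


# Conventional binary encoding: 3 bits per TLC.
budget = tlcs_needed(64, 3) + tlcs_needed(8, 3)  # 22 + 3 TLCs

# (3,2)_8 expansion coding: 2 bits per TLC.
data_tlcs = tlcs_needed(35, 2)  # compressed data field, worst case
ecc_tlcs = tlcs_needed(8, 2)    # 8-bit ECC in expansion coded form

spare_tlcs = budget - data_tlcs - ecc_tlcs
max_edac_bits = (budget - data_tlcs) * 2  # if all non-data TLCs hold EDAC
```

Under these numbers three TLCs remain unused, and the seven TLCs left after the data field could hold up to 14 EDAC bits, which appears consistent with the 14-bit EDAC field quoted for FPC-based coding.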
3.5.2. Improvement in the NVM Scrub Interval due to CompEx++ECC. As we have already established, CompEx/CompEx++ coding can easily support ECC without compromising its energy/latency benefits. In addition to supporting the existing error correction capabilities, CompEx/CompEx++ coding can further increase the error correction capability by leveraging the unused TLCs (after accommodating both data and EDAC fields), shown in gray in Figure 9. Therefore, FPC-based (BΔI-based) CompEx/CompEx++ coding can use a 14-bit (94-bit) EDAC field, as opposed to the 8-bit (64-bit) EDAC field, resulting in stronger error correction capability. This increase in the error correction strength improves not only NVM lifetime but also system performance, as follows.
MLC/TLC NVMs suffer from resistance drift, where a programmed cell resistance tends to continuously increase (drift) due to the ambient temperature. If the resistance drift is left unchecked, it moves the cell resistance to a different range altogether, which leads to data corruption. To prevent such data corruption, it is proposed that the NVM be periodically refreshed (popularly known as scrubbing in the literature): the data is read, corrected for errors, and rewritten to memory [Awasthi et al. 2012]. Although periodic scrubbing of NVMs can mitigate errors, it incurs performance penalties due to the mandatory reads and writes. Whereas stronger ECC can reduce the scrub frequency, this is at the expense of memory overhead. However, CompEx/CompEx++ coding has the potential to reduce the performance overhead of periodic scrubbing for no overhead. CompEx/CompEx++ coding encodes about 46% to 60% of the data residing in the memory using $(3,2)_8$/$(3,1)_8$ expansion coding. The fraction of the memory encoded using CompEx/CompEx++ coding enjoys superior error correction strength due to stronger ECC, which improves its resilience toward resistance drift. Therefore, the scrub rate for the data encoded using CompEx/CompEx++ coding can be much lower in comparison to binary encoded data, as shown in Figure 10. Note that strengthening error correction capability using the memory cells recovered after compression is not a contribution of our work; several earlier schemes have leveraged compression for improved reliability [Chen et al. 2013; Palframan et al. 2015]. We integrate the principles from such schemes to evaluate not just the improvement in error correction capabilities of CompEx/CompEx++ coding but also its effect on the scrub rate, which directly impacts system performance and availability.

[Table (partial): scrub intervals across error correction schemes. Recoverable rows: Free ECC [Chen et al. 2013]: 2, 1×, 0%; COP [Palframan et al. 2015]: 2, 1×, 0%.]

Theory. This section builds the theoretical framework for evaluating CompEx++ECC. Consider a $k$-bit data line and an $m$-bit EDAC field, which together constitute an $n$-bit ($n = k + m$) cache line stored using TLC NVM. Let $p(t)$ be the soft error rate, which is defined as the probability that an NVM cell is not in its originally programmed state, where $t$ is the time elapsed since the cell was last programmed. Let us consider an ECC scheme that can correct up to $n_e$ soft errors; a stronger ECC corresponds to a higher value of $n_e$ and vice versa. Assuming that the soft errors are independent and uncorrelated [Awasthi et al. 2012], we use the binomial distribution to find the uncorrectable block error rate (UBER), which is the probability that a cache line (block) has more than $n_e$ errors, as follows:
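A binomial tail that is consistent with these definitions can be written as below. This is a sketch under a stated assumption: $N$ is a symbol introduced here for the number of TLCs that hold the $n$-bit line, and the paper's Equation (1) may be expressed in different symbols.

```latex
\mathrm{UBER}(t)
  = \sum_{i=n_e+1}^{N} \binom{N}{i}\, p(t)^{i}\, \bigl(1 - p(t)\bigr)^{N-i}
  = 1 - \sum_{i=0}^{n_e} \binom{N}{i}\, p(t)^{i}\, \bigl(1 - p(t)\bigr)^{N-i}
```

The first form sums the probabilities of every uncorrectable error count; the second is its complement, the probability that at most $n_e$ cells are in error.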
Whereas a programmed NVM cache line exhibits almost no errors immediately after programming, the probability of error increases continuously as time progresses. Thus, UBER($t$) is a monotonically increasing function of $p(t)$, which is in turn a monotonically increasing function of time due to resistance drift. When a cache line accrues more errors than its ECC can correct, the data in the cache line is of no use for all practical purposes; such an event triggers expensive system-level memory error exceptions resulting in program rollbacks, program termination, and so forth. Thus, the upper limit on UBER($t$), $\mathrm{UBER}_{max}$, is an important system-level design parameter ($\mathrm{UBER}_{max} = 10^{-10}$ in this work), which constrains the maximum time between successive scrubs (i.e., $t_{scrub} = \mathrm{UBER}^{-1}(\mathrm{UBER}_{max})$).
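The relationship between ECC strength and the scrub interval can be illustrated numerically. The sketch below is not the evaluation used in this work: it assumes independent cell errors, a hypothetical line of 24 cells, and a toy linear drift model `toy_p` that stands in for the real $p(t)$; the function names are also hypothetical. It inverts UBER by bisection, which relies only on the monotonicity argued above.

```python
from math import comb


def uber(p: float, n_cells: int, n_e: int) -> float:
    """Probability that more than n_e of n_cells independent cells are in error."""
    return sum(comb(n_cells, i) * p**i * (1.0 - p) ** (n_cells - i)
               for i in range(n_e + 1, n_cells + 1))


def scrub_interval(p_of_t, n_cells, n_e, uber_max=1e-10, t_hi=1e9, iters=200):
    """Largest t in [0, t_hi] with UBER(p(t)) <= uber_max, found by bisection."""
    lo, hi = 0.0, t_hi
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if uber(p_of_t(mid), n_cells, n_e) <= uber_max:
            lo = mid
        else:
            hi = mid
    return lo


# Toy drift model (assumption): error probability grows linearly, capped at 1.
def toy_p(t: float) -> float:
    return min(1.0, 1e-9 * t)


t_weak = scrub_interval(toy_p, n_cells=24, n_e=1)
t_strong = scrub_interval(toy_p, n_cells=24, n_e=2)
```

With these made-up parameters the stronger code tolerates a much longer time between scrubs, which is the qualitative effect the text describes; the absolute values carry no physical meaning.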
Evaluation. In the rest of this section, we compare and contrast the scrub intervals across various error correction schemes for TLC PCM [Seong et al. 2013; Stanisavljevic et al. 2016] using Equation (1); this is tabulated in [Chen et al. 2013] and Compress and Protect (COP) [Palframan et al. 2015] , which are the state-of-the-art compression-based ECC schemes. Both Free ECC and COP integrate the standard (72,64) SECDED Hamming code, which results in a scrub interval equal to that of the baseline; however, it is important to note that they do so for no overhead. Thus, we conclude that CompEx++ECC can improve the scrub interval of the memory system, which improves system performance and availability.
3.5.3. CompEx/CompEx++ Coding and Encryption. NVM nonvolatility has emerged as a serious data security concern [Chhabra and Solihin 2011; Young et al. 2015; Kong and Zhou 2010]: the stored data continues to persist in the memory even after the system is powered down, exposing sensitive data to a malicious intruder. The dominant proposal to thwart such attacks advocates processor-side data encryption before the data is written to the NVM main memory. However, encryption scrambles the data and potentially reduces the data regularity that is necessary for CompEx/CompEx++ coding. Thus, if CompEx/CompEx++ coding follows encryption, there would be no energy/latency benefits. Therefore, we propose a key modification to the CompEx-encryption data path, which allows CompEx/CompEx++ coding to retain its energy/latency benefits even in the presence of data encryption. We propose the following sequences for the write and read data paths. On a write, the write data is first compressed, and the compressed data is then encrypted in the processor-side memory controller. The resulting compressed encrypted data, which is smaller than the original word/cache line size due to compression, is stored in the memory using expansion coding (inside the NVM DIMM). On a read, the data stored in the memory is first decoded for expansion coding (inside the NVM DIMM), which is followed by decryption and decompression in the processor-side memory controller. Thus, data encryption can be seamlessly integrated with CompEx/CompEx++ coding to achieve data security without compromising the energy/latency benefits of CompEx/CompEx++ coding.

Although data encryption is a robust data security solution, encrypted data incurs a high cell-update overhead due to its high randomness. Furthermore, the move from SLC to MLC to TLC increases the average cell updates from 50% to 75% to 87.5%, respectively, which translates to prohibitively high energy and latency overheads. Therefore, although CompEx/CompEx++ coding is compatible with data encryption, we believe that translating data security solutions from SLC to MLC to TLC is best addressed as a subject of future research.
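The 50%, 75%, and 87.5% figures follow from a simple counting argument: if the old and new contents of a cell are independent and uniformly random, a $b$-bit cell keeps its state with probability $2^{-b}$. A minimal sketch of that calculation, with the hypothetical helper name `random_update_fraction`:

```python
def random_update_fraction(bits_per_cell: int) -> float:
    """Probability that a cell must change when old and new contents are
    independent and uniform over 2**bits_per_cell states."""
    return 1.0 - 1.0 / (2 ** bits_per_cell)


fractions = {
    name: random_update_fraction(bits)
    for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3))
}
```

The uniform-randomness assumption is what encryption approximates, which is why encrypted traffic approaches these worst-case update rates.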
METHODOLOGY
Our evaluation of CompEx/CompEx++ coding is based on (i) full-system simulations to evaluate the system-level performance and (ii) trace-based simulations for deep, multibillion-instruction evaluation of memory-level energy and latency of CompEx/CompEx++ coding.

Note: The WPKI and MPKI numbers are sensitive to the CPU and memory system architecture. The numbers presented in this table (Table V(b)) are compiled for the architecture defined in Table V(a) using MARSS.
Full-System Simulation
CompEx++ coding is evaluated using full-system simulations of a system that integrates TLC RRAM memory using MARSS [Patel et al. 2011], a full-system multicore simulator, and DRAMSim2 [Rosenfeld et al. 2011], a cycle-accurate main memory simulator. MARSS uses x86 core models from PTLSim [Yourst 2007], a cycle-accurate x86 microarchitecture simulator, and plugs them into QEMU [Bellard 2005], a binary-translation system for emulating full systems. QEMU provides the capability of emulating various I/O devices (HDD, Ethernet, HID, etc.) that can be used to boot up entire operating systems without any modification (Linux in this work). Note that here, we modify MARSS to propagate write data along with the address throughout its memory hierarchy.
We use DRAMSim2, a cycle-accurate main memory simulator, for simulating the DDR3 NVM main memory system. MARSS and DRAMSim2 are integrated to provide a monolithic, seamless, cycle-accurate simulation of the entire system. Since each access in a TLC NVM memory can potentially have a different access latency, we modify DRAMSim2 to account for this.

MARSS setup. MARSS was configured to simulate a standard four-core out-of-order system running at 3GHz. Each core has its own L1 cache with two separate instances of 32kB SRAM for data and instructions; the L2 cache is private, with each core having its own instance of 256kB SRAM; finally, the L3 cache is a single, shared, write-back cache of size 8MB. The latency of each cache level is tabulated in Table V(a).

DRAMSim2 setup. For accurate timing simulation, we modified the DDR timing parameters along the lines of for substituting DRAM with NVM. We use TLC RRAM with latency parameters extracted and summarized in .

Workloads. We evaluate CompEx/CompEx++ coding using the SPEC CPU2006 [SPEC 2006] benchmark suite, which reflects a variety of integer and floating-point workloads used by modern computing systems. To evaluate real-world usage, we use nine composite workloads, with each workload containing four SPEC CPU2006 benchmarks. These composite workloads are derived from Ham et al. [2013], where benchmarks are selectively picked due to their memory-intensive nature. Table V(b) lists the constituent benchmarks for each composite workload and their corresponding writes per kilo-instruction (WPKI) and misses per kilo-instruction (MPKI) as reported by MARSS.
Trace-Driven Simulation
For running deep, multibillion-instruction simulations, CompEx/CompEx++ coding is evaluated using both an in-house trace-driven simulator for evaluating the memory array-level dynamic energy and NVMain [Poremba and Xie 2012], an architectural-level main memory simulator for emerging NVMs, for evaluating the total energy at the memory module level. We modified NVMain to reflect the variable-write-latency behavior of CompEx/CompEx++ coding and also configured NVMain to simulate an architecture equivalent to that in Table V(a). The traces are generated from the SPEC CPU2006 [SPEC 2006] benchmark suite using the Intel Pin binary instrumentation tool [Luk et al. 2005] on a machine running a 3.3GHz Intel Core i7 CPU. Note that we also use the Gem5 system simulator [Binkert et al. 2011] to validate these results using an equivalent architecture; the results are consistent with what is reported here and are not discussed for brevity. Our simulation framework captures memory accesses from the processor, recording only those accesses sent to main memory. During trace generation, the benchmarks are first run through $5 \times 10^5$ memory writes to ignore the write accesses from program initialization; they are then run until $4 \times 10^6$ memory write operations (equivalent to about 4 billion instructions on average) have been recorded or until the program terminates.
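The warm-up and capture window described above can be sketched as a slice over a stream of write records. This is only an illustration of the windowing rule, not the in-house tool: the name `capture_window` and the representation of a trace as a Python iterable are assumptions.

```python
from itertools import islice

WARMUP_WRITES = 500_000     # writes skipped to ignore program initialization
CAPTURE_WRITES = 4_000_000  # writes recorded after the warm-up


def capture_window(writes, warmup=WARMUP_WRITES, limit=CAPTURE_WRITES):
    """Skip the first `warmup` writes, then yield at most `limit` writes.

    If the stream ends early (the program terminates), the window is
    simply shorter.
    """
    return islice(writes, warmup, warmup + limit)
```

For example, `list(capture_window(range(10), warmup=3, limit=4))` keeps the records numbered 3 through 6.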
EVALUATION AND RESULTS
This section presents the evaluation of CompEx/CompEx++ coding at the memory and system levels. First, we present the energy and latency results at the memory level for different encodings: baseline (binary encoding with DCW), compression techniques (FPC and BΔI), expansion coding ($(3,2)_8$ and $(6,5)_8$), CompEx coding (FPC-based and BΔI-based CompEx coding, using $(3,2)_8$ expansion coding), and CompEx++ coding (FPC-based and BΔI-based). Second, we present the results for system-level evaluation of CompEx/CompEx++ coding, primarily to evaluate the impact of latency improvements of CompEx/CompEx++ coding on system performance.
Memory Energy/Latency
Summary. Table VI summarizes and compares the total module energy, memory array dynamic energy, latency, and overhead for the nine encoding techniques considered in this article. In summary, FPC-based and BΔI-based CompEx coding reduce the memory array write energy by 33% and 76% and write latency by 22% and 25%, respectively, in comparison to binary encoding for no overhead. The energy and latency benefits of CompEx coding are further extended by CompEx++ coding; FPC-based and BΔI-based CompEx++ coding reduce the memory write energy by 1.2% and 9.3% and write latency by 0% (no improvement) and 6.7% in comparison to FPC-based and BΔI-based CompEx coding, respectively. Table II), the dynamic energy performance of BΔI-based CompEx coding is better than that of FPC-based CompEx coding. Furthermore, FPC-based CompEx++ coding and BΔI-based CompEx++ coding reduce the dynamic (total) energy by 33% (18%) and 88% (61%) in comparison to binary encoding, respectively, which is an improvement of 0% (1.2%) and 16% (7%) over CompEx coding, respectively.
Energy. Our simulation framework tracks all cell writes that occur from the beginning of program execution to compute the cumulative energy required. Note that although CompEx/CompEx++ coding does not require any additional memory overhead in comparison to classical binary encoding, the static energy from peripheral circuits and the memory array is indirectly influenced by the reduction in the latency of each write operation. A lower write latency translates to a lower energy expense to keep the peripheral circuits active, which is evaluated using NVMain [Poremba and Xie 2012].

Figure 11(a) and (b) show the memory array dynamic energy and the total memory module energy, respectively, for FPC, BΔI, $(3,2)_8$ and $(6,5)_8$ expansion coding, and FPC-based and BΔI-based CompEx coding, normalized to binary coding. The last entries of Figure 11 represent the geometric mean (GM) of the energy reduction across all benchmarks, which is equivalent to simulating all of these benchmarks for the same execution time. Note that all cases use classical read-modify-write (DCW) for writing only the modified cells to the NVM array.
Our simulations show that FPC-based CompEx coding reduces the memory array dynamic energy (total energy) by 33% (18%) and 15% (16%) in comparison to binary coding and FPC, respectively, whereas BΔI-based CompEx coding reduces energy by 76% (57%) and 16% (25%) in comparison to binary coding and BΔI, respectively. Additionally, $(3,2)_8$ and $(6,5)_8$ expansion coding in isolation show a reduction of 2.1× and 1.3× in memory array write energy in comparison to binary coding, which is in close agreement with the theoretical estimates of 2.3× and 1.4×, respectively, as derived in Section 3.3.1. Furthermore, our trace-based evaluations show that FPC-based and BΔI-based CompEx++ coding extend the energy benefits of CompEx coding by 0% (1.2%) and 16% (7%), respectively, in terms of dynamic (total) energy. CompEx++ coding shows improvements in energy over CompEx coding; however, the improvement is quite small for FPC-based CompEx++ coding. The reason for this small difference in energy numbers is because of the low percentage of pattern match

Latency. We evaluate the impact on memory array write latency due to CompEx/CompEx++ coding. As detailed in Section 2, the states with lower energy also require lower write latencies due to the iterative P&V procedure. Whereas the energy required for writing a word is given by the sum of the energy required for all cells, the latency for writing a word depends on the cell that requires the longest latency. Therefore, since CompEx/CompEx++ coding encodes data using only lower energy states, the program latency for writing compressed data is also reduced. Our simulation determines the latency for each write access individually by tracking the maximum-latency cell write for the word access. The write latencies of all accesses are then accumulated to obtain the overall latency for program execution.
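The sum-versus-maximum rule for word writes can be sketched as follows. The per-state cost tables are hypothetical placeholders in arbitrary units, chosen only so that the terminal states 0 and 7 are cheapest; they are not measured TLC RRAM values, and `word_write_cost` is an illustrative name. Only cells whose state changes are written, in line with DCW.

```python
# Hypothetical per-state costs (arbitrary units), NOT measured RRAM values.
STATE_ENERGY = [1, 2, 4, 6, 6, 4, 2, 1]
STATE_LATENCY = [1, 2, 3, 4, 4, 3, 2, 1]


def word_write_cost(old_states, new_states):
    """Return (energy, latency) for writing one word of TLC states.

    Energy is summed over the modified cells; latency is set by the
    slowest modified cell (0 if nothing changes).
    """
    changed = [new for old, new in zip(old_states, new_states) if old != new]
    energy = sum(STATE_ENERGY[s] for s in changed)
    latency = max((STATE_LATENCY[s] for s in changed), default=0)
    return energy, latency


def trace_cost(accesses):
    """Accumulate energy and latency over (old_states, new_states) pairs."""
    costs = [word_write_cost(old, new) for old, new in accesses]
    return sum(e for e, _ in costs), sum(l for _, l in costs)
```

A single cell driven to an expensive intermediate state sets the latency of the whole word, which is why restricting all cells to low-latency states helps even when few cells change.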
Figure 12 shows that FPC-based CompEx coding is able to reduce the overall write latency by 22% and 19% in comparison to binary coding and FPC, respectively, and BΔI-based CompEx coding reduces overall write latency by about 25% and 26% in comparison to binary coding and BΔI, respectively. Furthermore, CompEx++ coding extends the latency benefits of CompEx coding by 0.6% and 6.7% over FPC-based and BΔI-based CompEx coding, respectively. Since the latency improvements of FPC-based CompEx++ coding over FPC-based CompEx coding, as observed using trace-based evaluations, are marginal, we do not evaluate FPC-based CompEx++ coding separately using full-system simulations.
Full-System Evaluation
In this section, we report and discuss the impact of CompEx/CompEx++ coding on system performance. We use two metrics: (i) IPC, which is a metric for full-system performance, and (ii) main memory bandwidth, which is a metric for main memory performance.

Summary. Table VII summarizes the results: $(3,2)_8$ expansion coding, FPC-based CompEx coding, BΔI-based CompEx coding, and BΔI-based CompEx++ coding improve IPC by 34%, 6.3%, 5.1%, and 10.6% and memory bandwidth by 76%, 11.1%, 12.6%, and 19.9% in comparison to binary encoding, respectively. In addition, $(3,2)_8$ expansion coding, which is an upper bound for CompEx coding, improves IPC by 34% over classical binary encoding.

Instructions per cycle. To evaluate the impact of CompEx/CompEx++ coding on IPC, we use a full-system simulator based on MARSS [Patel et al. 2011] and DRAMSim2 [Rosenfeld et al. 2011]. The simulation setup is described in detail in Section 4. We simulate nine composite workloads from Table V(b). Figure 13 shows the IPC for each benchmark in each workload. The last set of bars in each workload in Figure 13 represents the mean IPC for that workload (since each benchmark is run on a separate and exclusive core, the mean IPC is given by the arithmetic mean of the IPCs of the constituent benchmarks of the workload; see Chapter 1 of Hennessy and Patterson [2011]). The last set of bars in Figure 13 represents the harmonic mean (HM) of the IPCs of all workloads (computing the HM is equivalent to running all workloads in the same system for a fixed number of instructions).
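The two averaging steps can be sketched with Python's `statistics` module. The IPC values in the usage note are made-up numbers for illustration only, and the function names are hypothetical.

```python
from statistics import fmean, harmonic_mean


def workload_ipc(benchmark_ipcs):
    """Arithmetic mean of the per-core IPCs within one composite workload."""
    return fmean(benchmark_ipcs)


def overall_ipc(workload_ipcs):
    """Harmonic mean of the workload IPCs across all workloads."""
    return harmonic_mean(workload_ipcs)
```

For instance, `workload_ipc([1.0, 2.0, 3.0, 2.0])` gives 2.0, and `overall_ipc([1.0, 2.0])` gives 4/3; the harmonic mean weights slow workloads more heavily, as expected when every workload runs the same number of instructions.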
First, our simulations show a good correlation between IPC and both WPKI and MPKI. For example, consider the workloads WD$_3$ (highest MPKI) and WD$_5$ (lowest MPKI). As expected, the high cache miss rate of WD$_3$ lowers the IPC in comparison to WD$_5$, which has a lower cache miss rate. To understand the dependence of IPC on WPKI, let us consider the workloads WD$_7$ (highest WPKI) and WD$_4$ (lowest WPKI). CompEx coding reduces latency and improves IPC only during write accesses. Thus, the workload with higher WPKI should show a higher improvement in IPC in comparison to the workload with lower WPKI; our simulations, in agreement with the preceding argument, show that FPC (BΔI)-based CompEx coding improves the IPC by 17.5% (14.8%) for WD$_7$ (high WPKI) in comparison to 5.5% (1%) for WD$_4$ (low WPKI). Second, our simulations also show similar correlations at the individual benchmark level. Intuitively, memory-intensive benchmarks like milc (WD$_6$) or lbm (WD$_3$) should have lower IPC in comparison to less memory-intensive benchmarks like gemsFDTD (WD$_6$) or omnetpp (WD$_4$); our simulations are in excellent agreement with the preceding argument. Finally, our simulations show that the overall IPC improvement using FPC-based CompEx coding, BΔI-based CompEx coding, and BΔI-based CompEx++ coding is about 6.3%, 5.1%, and 10.6%, respectively, in comparison to classical binary encoding.⁵ In addition, $(3,2)_8$ expansion coding, an upper bound for CompEx coding, improves IPC by 34% over classical binary encoding.

⁵ We see a good correlation between the trace-based latency results and the full-system IPC results. Benchmarks like milc and libquantum, which have better improvement in latency using FPC-

Memory bandwidth. Figure 14 shows the main memory bandwidth across nine composite workloads for the baseline and CompEx/CompEx++ coding in comparison to classical binary encoding. In the figure, the bars represent FPC-based CompEx coding and BΔI-based CompEx/CompEx++ coding normalized to the baseline (binary encoding with DCW). The last set of bars in the figure represents the GM of the normalized improvements (computing the GM is equivalent to running each workload for the same execution time). Intuitively, similar to IPC, workloads that have high WPKI should contribute to higher bandwidth improvements in comparison to those with low WPKI. This can be seen using the example of workloads WD$_6$/WD$_7$ (high WPKI), which show an improvement of 30% in bandwidth using BΔI-based CompEx coding, in comparison to WD$_5$ (low WPKI), which shows an improvement of only 6%. On the whole, our simulations show that FPC-based CompEx coding, BΔI-based CompEx coding, and BΔI-based CompEx++ coding improve the average memory bandwidth by 11.1%, 12.59%, and 19.9%, respectively, in comparison to binary encoding. Furthermore, the bandwidth improvement of $(3,2)_8$ expansion coding, which is an upper bound for CompEx coding, over classical binary encoding is 76%.⁶
Lifetime. In this article, we theoretically evaluate the lifetime gains of CompEx/CompEx++ coding using TLC RRAM as an example. discuss the three primary mechanisms for cell failure in RRAM in detail, and each mechanism contributes to limit the lifetime of a cell; at the memory array level, the lifetime of an RRAM cell is specified as a limit on the number of programming cycles (SET/RESET) that it can endure before the cell becomes dysfunctional. Therefore, the lifetime of MLC/TLC RRAM is lower in comparison to SLC RRAM, since programming an MLC/TLC RRAM requires a higher number of P&V cycles in comparison to SLC RRAM. Since CompEx/CompEx++ coding effectively reduces the number of P&V cycles for programming an NVM cell (by limiting cell states to only low energy/latency states), it also improves the lifetime of the MLC/TLC NVM. Trace-based lifetime evaluations of CompEx/CompEx++ coding for TLC RRAM, along the lines of , , and Jiang et al. [2012a], show that CompEx/CompEx++ coding increases the lifetime by 1.8× over classical binary coding.
RELATED WORK
MLC/TLC NVMs benefit from the broad set of solutions developed to improve energy, latency, and lifetime of SLC NVMs [Zhao et al. 2014]. However, SLC-based solutions do not address the energy/latency problems of MLC/TLC NVMs that are primarily due to the iterative nature of the cell write operation (i.e., iterative P&V). Solutions that explicitly address the challenges of working with MLC/TLC NVMs have focused on data compression and data coding by exploiting the physical properties of the NVM cell and the locality in data traffic [Yang et al. 2000; Alameldeen and Wood 2004; Wang et al. 2011].
Memory compression. Increasing cache capacity through compression techniques that leverage different localities has been studied extensively [Sardashti and Wood 2013; Ahn et al. 2001; Chen et al. 2010; Yang et al. 2000; Yang and Gupta 2002]. Along similar lines, there are solutions that compress main memory traffic for the benefits of bandwidth, power, and capacity [Mahapatra et al. 2005; Yim et al. 2004; Dgien et al. 2014; Ekman and Stenstrom 2005; Arelakis and Stenstrom 2014; Shafee et al. 2014]. In the context of MLC/TLC NVMs, compressing main memory traffic yields fewer cell writes per write access, thus lowering energy. Whereas some solutions reduce energy/latency by excluding undesirable states at the cost of additional memory area (summarized in the following), CompEx coding, to the best of our knowledge, is the first work to explicitly investigate the trade-offs between the area recovered from compression and solutions that exclude undesirable MLC/TLC states.
⁶ Note that the improvement in bandwidth is not due to compression of a cache line/word to a smaller size: data transfer between the memory controller and the NVM module is always in the uncompressed format in the CompEx/CompEx++ coding architecture; the bandwidth improvement is primarily attributable to the faster cell writes due to expansion coding, which avoids high latency states of MLC/TLC NVM cells. Thus, the data transfer of a 64-byte cache line always requires eight 64-bit word transfers (in a 64-bit bus architecture) irrespective of whether the data is compressible or incompressible.

Excluding undesirable states. Arjomand et al. [2011] and propose circuit and architectural changes to dynamically configure MLC PCM cells as either MLC or SLC for latency benefits. In contrast, Jiang et al. [2012b] excludes the hard-to-reset cells that require high programming current and recovers the lost data using ECC, thereby reducing power consumption and improving lifetime. Extending the observations of Jiang et al. [2012b], Elastic RESET (ER) [Jiang et al. 2012a] proposes data coding that eliminates the undesirable terminal RESET state (high programming current) and uses only two or three of the four states of a cell to realize lifetime/power/latency improvements. Similar to ER, hybrid MLC/SLC proposes to opportunistically use PCM cells as either MLC or SLC for energy and latency benefits. Independently, for MLC PCM, Wang et al. [2011] propose energy-efficient data coding to reduce the usage of intermediate high energy states by mapping the most frequent data patterns to the low energy states. However, Wang et al. [2011] require online computation and storage of the most frequent patterns for every memory line at runtime, incurring compute, memory, and logic overhead. For MLC PCM, Yoon et al. [2013] propose data coding that eliminates one of the intermediate resistance states; this improves cell retention but incurs memory overhead.
For TLC RRAM, propose data coding that uses six out of eight TLC RRAM states to improve latency and energy by eliminating the use of intermediate resistance states [Wang et al. 2011; . By combining IDM with dynamic data remapping and error-correcting pointers, lifetime improvements for 20% memory overhead and negligible impact on energy/latency are reported. Xu et al. [2015] observed a superlinear relationship between RESET latency and the number of 1s written to the array and advocated writing smaller chunks of data using compression to reduce the RESET latency. For MLC PCM, form switch (FS) [Jiang et al. 2012c ] first introduced the notion of writing data in SLC/MLC depending on the result of compression. However, FS may not always result in energy/latency reduction, as it depends on the compression technique used and the energy/latency profile of the NVM cell. In practice, it is necessary to balance dynamic trade-offs between data compression, the NVM energy/latency profile, and data encoding: CompEx/CompEx++ coding is a step in this direction to realize simultaneous improvements in energy, latency, and lifetime of MLC/TLC NVMs.
CONCLUSIONS
This article described CompEx coding, a low overhead, dynamic trade-off framework that synergistically integrates pattern-based compression with expansion coding to realize simultaneous energy, latency, and lifetime improvements in MLC/TLC NVMs. The core idea of CompEx coding is to selectively apply expansion codes (i.e., linear block codes that encode data using only the low energy states of an MLC/TLC cell) to compressed data, thereby ensuring that the resulting data in expansion coded form will not exceed the original data width. CompEx coding is agnostic to the choice of compression technique; in this work, we evaluated CompEx coding using both FPC and BΔI compression. We also proposed CompEx++ coding, which extends CompEx coding by leveraging the variable compressibility of pattern-based compression techniques. CompEx++ coding integrates a custom expansion code for each of the compression patterns to exploit the maximum energy/latency benefits of CompEx coding. Our full-system simulations of a system that integrates TLC RRAM show that CompEx/CompEx++ coding reduces total memory energy by 57%/61% and cell latency by 23.5%/26%; these improvements translate to a 5.7%/10.6% improvement in IPC, an 11.8%/19.9% improvement in main memory bandwidth, and a 1.8×/1.8× improvement in lifetime over classical binary coding using DCW. CompEx/CompEx++ coding thus addresses the programming energy/latency as well as the lifetime challenges of MLC/TLC NVMs that pose a serious technological roadblock to their adoption in high-performance computing systems.
