This paper describes a low overhead, offline frequent value encoding (FVE) solution to reduce the write energy in multi-level/triple-level cell (MLC/TLC) non-volatile memories (NVMs). The proposed solution, which does not require any runtime software support, clusters a set of general-purpose applications according to their data frequency profiles and generates a dedicated offline FVE that minimizes write energy for each cluster. Results show that the write energy reduction of the evaluation sets, using FVEs generated for the training sets, is close (equal) to that of the best known solution for MLC (TLC) NVM encoding; however, our solution incurs a memory overhead that is 16× (5.7×) less than the best comparable scheme in the literature for MLC (TLC) NVMs.
Introduction
It is acknowledged that DRAM scalability will slow beyond the 22 nm feature size due to its considerable static and dynamic power requirements [1]. Non-volatile memory (NVM) technologies, such as phase change memory (PCM) [2] and resistive RAM (RRAM) [3], are being actively investigated to replace DRAM. Such technologies are highly scalable, have extended data retention times, and consume negligible static power. NVM technologies, however, have their own limitations, e.g., higher write energy, higher write latency, and limited endurance in comparison to DRAM [2, 4, 5].
Data encoding solutions (e.g., [6-8]) explicitly reduce write energy in NVM technologies. They also implicitly improve NVM endurance [9], and (in some cases) improve access latency [10]. Frequent value encoding (FVE) was originally introduced to encode data and address buses [11-13]. FVE can achieve high reductions in NVM write energy by mapping frequent values (words) into low energy codewords. Recently, FVE has been proposed to encode single-level cell (SLC) memory words in PCM [14]. The authors investigated both static (offline) and dynamic (online) encoding. They report that although offline encoding (originally introduced as "find-once for a given program" in [11]) achieves higher energy reductions on average, it requires compiler and operating system support, which is not desirable in practice. In contrast, online encoding may be better, but it requires a non-trivial online profiling effort that may impact system performance. However, to the best of our knowledge, FVE has not been applied to MLC/TLC NVMs, where a single physical cell can store more than one logical bit to realize density and cost advantages [15, 16]. Since MLC/TLC NVMs have higher write energy and access latency as well as lower endurance in comparison to SLC NVMs, there is strong motivation to develop data encoding and wear-leveling techniques for MLC/TLC NVMs [9, 10, 17-19].
This research was supported by King Fahd University of Petroleum and Minerals and NSF Award CCF-1217738.
GLSVLSI '16, May 18-20, 2016, Boston, MA, USA
In this paper, we present an offline FVE for MLC/TLC NVMs that achieves an average write energy reduction approaching, and sometimes exceeding, that of optimal offline encoding [11] . The proposed method, which does not require compiler or operating system support, is based on finding multiple optimal FVEs. Each FVE is derived by aggregating the data frequency profiles of a group of compatible applications. To the best of our knowledge, this is the first work that presents a feasible version of an offline FVE approaching the average energy reductions of optimal offline FVE. This work makes the following contributions:
• It describes the use of the k-medoids algorithm [20] to cluster a set of general-purpose benchmark applications (SPEC CPU2006 [21]) into k compatible groups. The applications are represented as an observation matrix consisting of rows of feature sets that are essential inputs to the k-medoids algorithm (Section 3.1).
• We process the output of the k-medoids algorithm to produce offline, low overhead, energy-efficient FVEs. The objective is to reach the write energy of optimal offline FVE while avoiding its disadvantages (Section 3.2). The proposed codec (coder/decoder) architecture uses read-only memories (ROMs) that are known to consume low power/energy in comparison to lookup tables (LUTs) and content-addressable memories (CAMs) (Section 3.3).
We divide the set of SPEC CPU2006 benchmarks into training and evaluation sets. The training set is used to derive the k offline FVE mappings, which are then used to encode the applications in the evaluation set. We evaluate the proposed solution for the MLC PCM prototype in [22] and the TLC RRAM prototype in [16]. Results (Section 4) indicate that the average write energy in the MLC case is only 5% more than that of optimal offline FVE. The result is even better in the TLC case, where the average write energy of the proposed technique is 1% less than that of optimal offline FVE. In comparison to the write energies of MLC and TLC data comparison write (DCW) [23], the proposed solution achieves 39% and 35% energy savings, respectively, while the memory overhead in both cases does not exceed 3.5%. We report results for different values of k (the number of clusters), and for some values, our method results in an average write energy that is only 4% more than that of the state-of-the-art MLC PCM encoding technique in [24]. However, [24] requires 50% NVM overhead, which is 16× that of the proposed solution. Although the 50% overhead of [24] can be reduced, we argue that such a reduction will result in exponentially larger LUTs in comparison to the codec ROMs required by our solution. For TLC RRAM, our method results in the same average write energy as incomplete data mapping (IDM) [10]. However, IDM has a memory overhead of 20%, i.e., 5.7× the overhead of the solution proposed in this paper.
Background and Motivation
This section provides an illustrative overview of offline FVE [11] and highlights its potential advantages. Motivated by the fact that most computer applications have high data locality, FVE encodes frequent values such that they are assigned to the least energy codewords. It has been reported that these frequent values may account for 32% of all the values written to memory by a specific application [12]. Consider an example where an application has only four possible values: A, B, C, and D. In this table, the 1st row is the possible value, the 2nd is the corresponding frequency of occurrence of that value, and the 3rd is the corresponding write energy for each value. Therefore, the total energy consumed by this application is 7,805 pJ. The following value-codeword assignment:
reduces the total energy to 4,483 pJ, i.e., by 43%. The assignment ranks the values in decreasing order of frequency and the codewords in increasing order of energy, and assigns values with high frequency to codewords with low energy in the order of the ranking. For example, since A has the highest frequency and D has the lowest energy, we assign value A the codeword D. This optimal offline code assignment will only be good if used for this specific application (or a similar one), and therefore it is referred to as optimal perfect knowledge assignment, i.e., find-once for a given program [11]. Henceforth, we simply refer to this method as the perfect knowledge method. In practice, if we have more than one application, we aggregate their frequency profiles and generate a combined code assignment resulting in lower energy. However, combining frequencies of incompatible applications may eliminate skew that is essential to achieve high energy reductions, as illustrated in Fig. 1. The plots show (1) the aggregate frequency plot of 32 distinct benchmark applications (28 SPEC CPU2006 [21] and four Splash-2 [25]) and (2) the frequency plot of the leslie3d benchmark. Even though most offline frequency profiles of individual applications are skewed in favor of FVE, aggregating frequency profiles may eliminate the skew and flatten the graph into a plateau for almost 98% of the possible values. This is in comparison to the individual frequency plot of the benchmark leslie3d, in which the last elbow in the plot appears after 23% of the possible values, implying high energy reduction.
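The rank-based assignment described above can be sketched in a few lines of Python. This is only an illustration: the frequencies and energies below are hypothetical placeholders (they are not the values of the paper's example table), and the names `assign_codewords` and `total_energy` are introduced here for convenience.

```python
def assign_codewords(freq, energy):
    """Pair the most frequent value with the lowest-energy codeword, and so on."""
    values_by_freq = sorted(freq, key=lambda v: -freq[v])          # decreasing frequency
    codewords_by_energy = sorted(energy, key=lambda c: energy[c])  # increasing energy
    return dict(zip(values_by_freq, codewords_by_energy))


def total_energy(freq, energy, mapping=None):
    """Total write energy; with no mapping, every value is written as itself."""
    if mapping is None:
        mapping = {v: v for v in freq}
    return sum(freq[v] * energy[mapping[v]] for v in freq)


# Hypothetical frequencies (occurrences) and write energies (pJ).
freq = {"A": 50, "B": 30, "C": 15, "D": 5}
energy = {"A": 40, "B": 30, "C": 20, "D": 10}

mapping = assign_codewords(freq, energy)
baseline = total_energy(freq, energy)
encoded = total_energy(freq, energy, mapping)
print(mapping, baseline, encoded)
```

With these placeholder numbers, value A (most frequent) is assigned codeword D (lowest energy), mirroring the example in the text.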
Most applications have skewed frequency profiles in favor of perfect knowledge FVE. A primary motivation for this work is the careful aggregation of the frequency profiles of applications to derive multiple encodings, in which each encoding is still a result of a skewed aggregate frequency profile.
The authors in [14] evaluated perfect knowledge FVE on SLC PCM. While they reported that average write energy is greatly reduced, perfect knowledge FVE requires profiling support during the compilation process, implying a long compilation time for most applications. Another disadvantage of perfect knowledge FVE is that it requires operating system support, since for every application, we need to store the code assignment specific to that application in its executable. Perfect knowledge FVE is therefore expensive in practice.
On the other hand, online profiling occurs during application execution, e.g., [14] . Frequent values are identified on the fly and may also be replaced by other more frequent values. Although online profiling provides adaptive write energy reduction, it requires significant hardware and may affect overall system performance.
Contributions
The main contribution of this paper is the integration of an offline FVE solution for the purpose of MLC/TLC NVM write energy reduction approaching the reductions of perfect knowledge FVE without its disadvantages: the need for operating system and compiler support. Specifically, this work makes the following contributions:
• We use the k-medoids algorithm to cluster a set of applications into k compatible subsets. Precisely, our contribution is encoding the input applications as an observation matrix that consists of rows of feature sets, which is an essential input to the k-medoids algorithm (Section 3.1).
• We process the output of the k-medoids algorithm to derive a set of offline frequent value encodings. The objective is to achieve an average write energy that is as close as possible to the average write energy of perfect knowledge FVE (Section 3.2). We propose a codec architecture that utilizes ROMs that are known to consume low dynamic power in comparison to LUTs and CAMs (Section 3.3).
The proposed solution requires a slight modification to the traditional NVM memory word structure. This modified word structure is only internal to our codec and does not require any architectural changes outside the boundaries of the codec. As in Fig. 2a, an NVM word is composed of one or two tag cells, followed by a number of fixed length data slices. Each slice stores a frequent value in one of k encoded forms, and the tag cell(s) helps in the decoding process. The logical width of a data slice is referred to as frequent value length (FVL), and is equal to ⌈log2 vmax⌉, where vmax is the maximum possible frequent value. The FVL parameter and the number of possible frequent values to be encoded (FVN) characterize FVE. In this work, we fix FVL to 8 logical bits in the MLC case (Fig. 2b) and 9 logical bits in the TLC case (Fig. 2c). For FVN, the obvious choice is 2^FVL, and this is used in this work. It is possible, however, that FVN < 2^FVL, and this can be combined with compression to lower the number of bit writes, e.g., [14]. However, the cells in which the compressed values are written will potentially wear out faster. Hence, in this work, FVN is fixed to 2^FVL, i.e., 256 for MLC and 512 for TLC, to avoid biased wear of the memory cells.
Memory Trace Clustering
Finding data clusters is an unsupervised learning process in which data is divided into a given number of subgroups. Classical clustering algorithms include k-means [26] and k-medoids [20] . Clustering has played an important role in many areas, including artificial intelligence, pattern recognition, medical research, business intelligence, psychology, and political science [26, 27] .
We utilize the k-medoids algorithm to cluster 32 memory traces of SPEC CPU2006 [21] and Splash-2 [25] benchmarks. These benchmarks represent a wide range of real user applications, and it is expected that most applications will have similar workloads. The k-medoids algorithm takes the following inputs: the number of clusters k, an observation matrix, and a similarity distance metric. For this work, the observation matrix consists of 32 rows (one row per benchmark), while the number of columns equals the feature set size, which is 256 for MLC and 512 for TLC.
Feature set extraction: The first step in generating the feature set of a given application is to extract its frequency information. Let f(v) be a function returning the frequency of value v in the given application, where 0 ≤ v < FVN. Using this function, we form a set of pairs {(v, f(v)) : 0 ≤ v < FVN} and sort it in decreasing order with respect to f(v) to generate an ordered list of these pairs. Each pair in the ordered list is composed of two entries: a value v and its frequency f(v). Concatenating all the first entries of the pairs in the ordered list results in the feature set row vector v_d. This feature set vector characterizes the given application: two applications have the same vector iff their values are ranked identically by frequency, i.e., they share similar, if not identical, frequency profiles. The rows of the observation matrix are constructed from these feature set vectors for all the 32 benchmarks. Note that a feature set vector is a permutation of all the values from 0 to FVN−1.
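As a rough illustration of the feature set extraction, the sketch below builds the vector from a list of slice values. The function name `feature_vector`, the toy trace, and the tie-breaking rule (smaller value first) are assumptions made here; the text does not specify how ties in frequency are resolved.

```python
from collections import Counter


def feature_vector(trace, fvn):
    """Return the values 0..fvn-1 ordered by decreasing frequency in `trace`.

    Ties are broken by the smaller value first (an assumption of this sketch).
    """
    counts = Counter(trace)  # missing values count as frequency 0
    return sorted(range(fvn), key=lambda v: (-counts[v], v))


# Toy trace of written slice values with FVN = 4 (hypothetical data).
trace = [3, 3, 3, 0, 0, 2]
vd = feature_vector(trace, 4)
print(vd)
```

The result is always a permutation of 0 to FVN−1, as noted in the text, because every possible value appears exactly once in the ordering.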
The similarity metric provides a means of measuring the distance between two applications, or more precisely, two feature sets or observations. It helps assign an observation to a cluster. In this work, since the rows of the observation matrix are permutations of the same set, a rank-based metric (Spearman rank correlation) was adopted. However, in most cases (see Section 4), other classic metrics, which are not rank correlated, perform equally well, i.e., the cosine and square Euclidean metrics.
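For intuition, a Spearman-based distance between two feature set vectors can be computed as below. This is a sketch under two assumptions that are not stated in the text: the distance is taken as one minus the Spearman rank correlation, and the closed-form expression for tie-free data is used, which applies here because each vector is a permutation of distinct integers (so each entry is its own rank).

```python
def spearman_distance(x, y):
    """One minus Spearman's rank correlation for two tie-free vectors.

    For permutations of 0..n-1 each entry equals its own rank, so the
    closed form rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies directly.
    """
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    rho = 1 - 6 * d2 / (n * (n * n - 1))
    return 1 - rho


same = spearman_distance([0, 1, 2, 3], [0, 1, 2, 3])
opposite = spearman_distance([0, 1, 2, 3], [3, 2, 1, 0])
print(same, opposite)
```

Identical vectors are at distance 0 and fully reversed vectors are at the maximum distance of 2 under this definition.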
As illustrated in Fig. 3 , given k, an observation matrix, and a similarity metric, the k-medoids algorithm assigns a cluster ID (ranging from 0 to k − 1) to every observation. Obviously, the size of the clusters may not be equal. Also, some of the clusters may have a single observation. For the sake of separating training data from evaluation data, we split the output of the k-medoids algorithm into a training set and an evaluation set. This strategy is widely used in the literature and its purpose is to measure and compare the quality of clustering in both sets. A good clustering results in relatively close qualities, provided that the right number of clusters is chosen.
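The clustering step can be illustrated with a minimal k-medoids sketch that alternates between assigning observations to the nearest medoid and re-selecting each cluster's medoid. This is a simplified variant written for illustration; it is not necessarily the algorithm of [20] or the implementation used in this work, and the deterministic initialization (the first k observations) is an assumption.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster n observations given an n x n distance matrix.

    Returns (medoids, labels), where labels[i] is the cluster ID (0..k-1)
    of observation i. Initial medoids are the first k observations.
    """
    n = len(dist)
    medoids = list(range(k))
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimizes the distance to its cluster members.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster becomes empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Toy one-dimensional observations (hypothetical), with absolute difference as distance.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(dist, 2)
print(medoids, labels)
```

In the setting of this paper, `dist` would instead hold the pairwise distances between the 32 feature set vectors under the chosen metric.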
Since the size of the clusters may be different, and some clusters may contain an odd number of observations, we cannot always choose half the observations in each cluster to construct the training set. Instead, we create k lists, L[i], where each list is initially identical to the corresponding cluster. We remove one observation at a time from the list containing the largest number of observations until the total number of observations in the lists is halved. We assume that L is available as input to the code generation algorithm. The observations remaining in the cluster lists represent the training set and the removed observations represent the evaluation set.
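The splitting procedure above might be sketched as follows. The text does not say which observation is removed from the largest list or how ties between equally large lists are resolved; this sketch removes the last element and picks the first of the largest lists, both of which are assumptions, as is rounding down when the total is odd.

```python
def split_training_evaluation(clusters):
    """Shrink the cluster lists until they hold half of all observations.

    Returns (lists, evaluation): the shrunken lists form the training set
    and the removed observations form the evaluation set.
    """
    lists = [list(c) for c in clusters]
    target = sum(len(l) for l in lists) // 2
    evaluation = []
    while sum(len(l) for l in lists) > target:
        largest = max(lists, key=len)      # first list among the largest ones
        evaluation.append(largest.pop())   # remove its last observation
    return lists, evaluation


# Toy clustering of 8 observations into 3 clusters (hypothetical IDs).
training, evaluation = split_training_evaluation([[0, 1, 2, 3, 4], [5, 6], [7]])
print(training, evaluation)
```

Note that small clusters are left intact by this procedure, so every cluster keeps at least one training observation in this example.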
Code Generation
Consider a function e(v) that calculates the energy of frequent value v as the sum of the write energies of the cells composing v. For example, with respect to the PCM MLC prototype in [22] , if v = 0 and FVL=8, then e(0) = 36 × 4 = 144 pJ, since the value 0 is composed of 4 MLCs whose values are all 00. Another example is e(0) = 2 × 3 = 6 pJ in case of RRAM TLC [16] .
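A possible sketch of e(v) is given below. Only the energies of the all-zero states (36 pJ for the MLC state 00 and 2 pJ for the TLC state 000) follow from the examples in the text; every other entry in the two tables is a hypothetical placeholder, and the name `slice_energy` is introduced here.

```python
def slice_energy(v, fvl, bits_per_cell, cell_energy):
    """Sum the write energies of the cells that compose the fvl-bit value v."""
    mask = (1 << bits_per_cell) - 1
    total = 0
    for _ in range(fvl // bits_per_cell):
        total += cell_energy[v & mask]  # energy of the lowest remaining cell
        v >>= bits_per_cell
    return total


# Per-state write energies in pJ. Only state 0 of each table comes from the
# text; the remaining entries are hypothetical placeholders.
MLC_ENERGY = {0b00: 36, 0b01: 300, 0b10: 500, 0b11: 20}
TLC_ENERGY = {s: 2 + 10 * s for s in range(8)}

e0_mlc = slice_energy(0, 8, 2, MLC_ENERGY)  # 4 MLCs, all in state 00
e0_tlc = slice_energy(0, 9, 3, TLC_ENERGY)  # 3 TLCs, all in state 000
print(e0_mlc, e0_tlc)
```

The two calls reproduce the worked examples e(0) = 144 pJ for MLC and e(0) = 6 pJ for TLC.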
Next, we form a set of pairs {(v, e(v)) : 0 ≤ v < FVN} and sort it in increasing order by e(v) to generate an ordered list of these pairs. Concatenating the first entries of all these ordered pairs, i.e., the v portion of the pairs, results in the vector v_e. Similar to v_d, the elements of v_e form a permutation of the integers in the range 0 to FVN−1. Note that v_e depends on the underlying MLC/TLC NVM technology and not on the application. It is a list of all the codewords ordered in increasing order by their energies. In the code generation algorithm, this vector is referenced item-wise, i.e., v_e[i] refers to the i-th element in the vector. Also, in the same algorithm, we assume that the function Fp(o, v) returns the perfect knowledge frequency of value v in observation (application) o. Further, L[i] is the i-th cluster list introduced earlier. Algorithm 1 outlines the process of code generation. The final output of the algorithm is a two-dimensional mapping M(i, v), where 0 ≤ i < k and 0 ≤ v < FVN are the encoding number and the value-to-be-encoded, respectively. Each cluster list generates one code as follows. First, the perfect knowledge frequency profiles of the applications in the cluster are aggregated, resulting in a set of pairs {(v, f(v)) : ∀v}. This set is sorted in decreasing order by frequency. The sorted list is stored in structure P, where P[j] is the j-th pair with respect to the sort order, and P[j].v refers to the v entry of the pair. The final step is to build the i-th dictionary, M(i, v), for all v. This is achieved by associating P[j].v with the j-th codeword, v_e[j]. These steps are repeated to build the codes for each of the other cluster lists.
Algorithm 1: Generating the k codes for the k clusters
for i = 0 to k−1 do
    for v = 0 to FVN−1 do
        f(v) = sum of Fp(o, v) over all observations o in L[i]
    Sort the pairs {(v, f(v)) : ∀v} in descending order by f(v), and let P[j] be the j-th pair of this list. Further, let P[j].v be the value part of the pair
    for j = 0 to FVN−1 do
        M(i, P[j].v) = v_e[j]
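The code generation steps described in the text for Algorithm 1 can be sketched in Python as follows. The data layout (a dictionary `fp` of per-application frequency lists), the tie-breaking rule, and the toy inputs are assumptions made for illustration; `ve` stands for the codeword list ordered by increasing energy.

```python
def generate_codes(cluster_lists, fp, ve):
    """Build one value-to-codeword table per cluster.

    cluster_lists[i]: training applications in cluster i
    fp[o][v]:         perfect knowledge frequency of value v in application o
    ve:               codewords in increasing order of write energy
    Returns M, where M[i][v] is the codeword assigned to value v by code i.
    """
    fvn = len(ve)
    M = []
    for members in cluster_lists:
        # Aggregate the frequency profiles of the applications in this cluster.
        f = [sum(fp[o][v] for o in members) for v in range(fvn)]
        # Values in decreasing aggregate frequency (ties: smaller value first).
        P = sorted(range(fvn), key=lambda v: (-f[v], v))
        code = [None] * fvn
        for j in range(fvn):
            code[P[j]] = ve[j]  # j-th most frequent value gets j-th cheapest codeword
        M.append(code)
    return M


# Toy inputs with FVN = 4 (all numbers hypothetical).
ve = [2, 0, 3, 1]
fp = {"a": [1, 5, 0, 2], "b": [0, 4, 1, 3], "c": [9, 0, 0, 1]}
M = generate_codes([["a", "b"], ["c"]], fp, ve)
print(M)
```

Each row of `M` is a permutation of the codewords, so every code is invertible, which the decoding path relies on.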
Hardware Realization
The resulting codes from Algorithm 1 are used to program k ROMs, one ROM per cluster. ROM i is programmed with the portion of the mapping tables corresponding to cluster i. Each slice of the incoming memory word is associated with its own encoding ROM, and the incoming memory word is encoded k times, one time per code. Then, the encoded version that results in minimum energy is selected for writing, taking into account the current content of the target address. Fig. 4 shows the encoding path for k = 2. Since we have two clusters, the code assignment algorithm will result in two mappings. Tag cell T0 (T1) takes the value 0 (1) to indicate that the word is encoded using the 0th (1st) mapping. The encoding overhead in bits due to using ROMs equals the size of one encoding ROM × the number of slices per word × the number of words in the memory line × k. Note that all encoding ROMs have the same size, which is FVN × FVL bits.
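The selection of the minimum-energy encoding can be modeled in software as below. This is a behavioral sketch, not the hardware design: it assumes that only cells whose state changes are written, it ignores the energy of writing the tag cell, and the function names and per-state energies are hypothetical.

```python
def cells(v, fvl, bpc):
    """Split an fvl-bit slice value into its cell states (bpc bits per cell)."""
    mask = (1 << bpc) - 1
    return [(v >> s) & mask for s in range(0, fvl, bpc)]


def write_cost(new, old, fvl, bpc, cell_energy):
    """Energy to overwrite slice `old` with `new`, writing only differing cells."""
    pairs = zip(cells(new, fvl, bpc), cells(old, fvl, bpc))
    return sum(cell_energy[n] for n, o in pairs if n != o)


def encode_word(slices, stored, M, fvl, bpc, cell_energy):
    """Encode the word with each of the k codes and keep the cheapest one.

    Returns (tag, encoded_slices); tag is the index of the chosen code.
    """
    best = None
    for tag, code in enumerate(M):
        enc = [code[v] for v in slices]
        cost = sum(write_cost(n, o, fvl, bpc, cell_energy) for n, o in zip(enc, stored))
        if best is None or cost < best[0]:
            best = (cost, tag, enc)
    return best[1], best[2]


# Toy setup: 2-bit slices (one MLC each), k = 2 codes, hypothetical energies in pJ.
energy = {0: 36, 1: 300, 2: 500, 3: 20}
M = [[0, 1, 2, 3], [3, 2, 1, 0]]
tag, enc = encode_word([0, 0], [3, 3], M, 2, 2, energy)
print(tag, enc)
```

In this toy case the second code maps the incoming slices onto the states already stored, so no cell needs to be rewritten and that code is selected.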
For decoding, we invert the mapping tables into ROMs as shown in the decoding architecture in Fig. 5 . To avoid extra delay in the read path, the same ROM is duplicated in each slice across the whole memory line. One or more tag cells store the cluster number to decode the encoded slices inside the NVM array. Given a tag cell and an encoded slice, a decoding ROM outputs the decoded frequent value. Despite the duplication, the overhead is low in practice. The decoding overhead equals the size of one ROM × the number of slices per word × the number of words in the memory line, and ROM size is k × FVN × FVL bits.
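The table inversion and the ROM overhead formulas can be checked with a short sketch. The function names are introduced here, and the example figures are computed from the formulas in the text using 8 slices per 64-bit word and 8 words per 512-bit line for MLC (and 19 slices per 171-bit word, 3 words per 513-bit line for TLC); they are not taken from Table 2.

```python
def invert_codes(M):
    """Build decoding tables D such that D[i][M[i][v]] == v for every code i."""
    D = []
    for code in M:
        inv = [None] * len(code)
        for v, c in enumerate(code):
            inv[c] = v
        D.append(inv)
    return D


def rom_overhead_bits(k, fvn, fvl, slices_per_word, words_per_line):
    """ROM bits on the encoding path (the decoding path gives the same product)."""
    return fvn * fvl * slices_per_word * words_per_line * k


D = invert_codes([[3, 2, 1, 0], [2, 3, 1, 0]])
mlc_bits = rom_overhead_bits(8, 256, 8, 8, 8)    # MLC, k = 8
tlc_bits = rom_overhead_bits(8, 512, 9, 19, 3)   # TLC, k = 8
print(D, mlc_bits, tlc_bits)
```

Under these assumptions the fully replicated MLC encoder for k = 8 needs 1,048,576 ROM bits (128 KiB), and the decoder needs the same amount, since k × FVN × FVL bits per decoding ROM times the number of slices and words yields the same product.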
In Figs. 4 and 5, encoders and decoders are duplicated across all the slices in the memory line for simplicity. If this design is not affordable, an alternative lower overhead design is also possible using multiplexers and pipelining.
Evaluation and Results
In this section, we present the simulation results of the proposed solution for the 4-state MLC PCM prototype in [22] and the 8-state TLC RRAM prototype in [16] . For brevity, we refer to these prototypes in this section as MLC and TLC, respectively.
Simulation setup: We evaluated the proposed solution using a trace-driven memory simulator. We used the Pin binary instrumentation tool [28] to extract the memory traces of 32 distinct applications: 28 SPEC CPU2006 [21] benchmarks and 4 Splash-2 [25] benchmarks. Table 1 lists the simulation configuration and the memory overhead due to using the tag cell(s). The number of logical bits in the memory line is 512 bits for MLC and 513 bits for TLC due to padding. In MLC, a 64-bit logical word is constructed from 32 physical cells. To avoid unnecessary padding in the TLC case, we chose a word size of 57 physical cells (171 logical bits). Note that this word size is local to the memory controller and does not imply a change to the word size of the CPU.
Memory overhead: As indicated in Table 1, NVM overhead due to tag cell(s) is always 3.1% in MLC, since the ratio between the number of tag and data cells in the word is fixed. In TLC, the number of slices per word is constant, but the number of tag cells is one or two, and therefore, the overhead is 1.8% or 3.5%, respectively. The number of tag cells in the MLC and TLC cases is log2 k, where k is the number of clusters. Further, Table 2 reports the overhead for the codec ROMs for different values of k. Clearly, the codec overhead is negligible in comparison to the sizes of state-of-the-art NVMs.
Hardware and latency overheads: We implemented Verilog modules for our codec circuitry. In our design, we avoid replicating the codec across all the words of the memory line, since this results in higher area. Instead, a single codec instance is shared (and pipelined to avoid latency penalty) among all the memory words across the memory line. Using Design Compiler, we synthesize these modules on a 45 nm technology node [29]. Since k = 8 provides the best results for MLC, we chose to synthesize this case to examine the overheads. We implement the codec ROMs using case statements and use the cosine metric.
The area of the synthesized codec is 0.23 mm², which is negligible in comparison to the area of state-of-the-art PCM memories, e.g., less than 0.4% of the area of the 8Gb PCM in [30]. The latencies (energies) of the decoder and encoder paths are 1.93 ns (1.1 pJ/cell) and 3.91 ns (2.3 pJ/cell), which are also negligible in comparison to the PCM program-and-verify (P&V) write latencies (energies) [22]. Assuming P&V writes 34 cells at once, we can reduce the total memory line latency to a single word latency, since by the time the P&V modules complete writing one PCM word, the encoder will have encoded the entire line. The decoding latency of the complete memory line, however, equals the number of words per line times the word latency, i.e., 8 × 1.93 ns = 15.44 ns. If this is too high, the decoder can be replicated to achieve a read latency that is as low as 1.93 ns for the memory line. Although this replication nearly triples the codec area (0.62 mm²), it also reduces the read latency by 8×.
Energy reductions: Fig. 6 compares state-of-the-art methods to the proposed solution for MLC PCM. The bars represent the geometric mean (GM) of the write energy for each method across all the 32 applications. GMs are normalized to the GM energy consumed by DCW [23]. The methods from left to right are: DCW, the 50% overhead method [24], MFNW [18], cell remapping [9], and perfect knowledge FVE [11]. NVM overheads of these methods are 0%, 50%, 3.1%, 3.1%, and 0%, respectively. The following bars belong to the proposed solution, labeled by the number of clusters, followed by a comma, followed by the distance metric used, which is either cosine, square Euclidean, or Spearman rank correlation (denoted by cos, sqeuc, and spearman, respectively). The first bar assumes no clustering, i.e., a metric is not required. Clustering is clearly beneficial: energy reductions generally increase as k increases from 2 to 16, although k = 8 results in an energy reduction greater than or equal to that of 16 clusters. Although the GM of the write energy of the proposed solution is measured across all the 32 applications, the training set, which is used to generate the k FVE mappings, is only composed of 16 applications. Fig. 7 compares the GM energy of the proposed solution across all applications, training set applications, and evaluation set applications. Energy reduction across the evaluation set for k = 1 is almost equivalent to DCW. As k increases, the reduction across the evaluation set improves. Although energy reduction across all the applications for k = 8 is better than for k = 16, the energy reduction across the evaluation sets is marginally better (≈ 1%) for k = 16 in comparison to k = 8. Fig. 8 is the TLC equivalent of Fig. 6. The methods from left to right are: DCW, MFNW for TLC, incomplete data mapping (IDM) [10], and perfect knowledge FVE. NVM overheads of these methods are 0%, 1.8%, 20%, and 0%, respectively.
Similar to the MLC case, as k increases, the energy reduction also improves. To compare energy reductions across the training and evaluation sets, we also provide a breakdown of this result in Fig. 9. Similar to MLC, when no clustering is performed (k = 1), energy reduction is only 5% better than DCW across the evaluation set. As k increases, the energy reduction across the evaluation set improves. NVM overheads of the proposed solution are 1.8% for k ≤ 4, and 3.5% for k = 8 and 16.
In all the previous charts, we divide the 32 applications into two halves: training set and evaluation set. To show the robustness of the clustering, we re-run the simulations using a training set size of 8 applications and an evaluation set size of 24 applications. Since we use only 8 benchmarks for training, 16 clusters are no longer possible, and therefore, we only show the results for k = 2, 4, and 8. Fig. 10 shows the robustness chart in the MLC case. Clearly, the clustering is robust and the energy of the 25% training set size case (8 applications) is only 3.34% more on average in comparison to the 50% training set size (16 applications). The result is similar in the TLC case (Fig. 11), as the average energy is only 3.5% more.
Figure 6: The x-axis lists state-of-the-art MLC PCM encoding techniques followed by our proposed method. The y-axis is the geometric mean (GM) write energy normalized to data comparison write (DCW) [23]. From left to right, the encodings are: DCW, 50% overhead encoding [24], MFNW [18], cell remapping [9], and perfect knowledge FVE [11]. This is followed by aggregate FVE without clustering (1,no metric), and variations of our method, broken down by the number of clusters and distance metric, e.g., (2,cos) to indicate two clusters with the cosine metric.
In Fig. 6 , the 50% overhead method produces the lowest MLC write energy, which is almost identical to perfect knowledge, but costs 50% NVM overhead, which is about 16× more than our overhead. Moreover, the average write energy of the proposed solution is only 5% more than that of perfect knowledge when using 8 clusters. In comparison to MFNW and cell remapping, the proposed method consumes 29% and 24% lower energy, respectively.
For the TLC result in Fig. 8, IDM results in marginally better energy reduction than perfect knowledge (≈ 1%). Note that our proposed solution is comparable to IDM when k = 16 for the square Euclidean and Spearman metrics. However, IDM has an NVM overhead of 20%, i.e., about 5.7× the NVM overhead of the proposed solution.
Conclusions
This paper proposed an energy efficient, low overhead offline FVE for MLC/TLC NVMs that approaches the performance of perfect knowledge FVE. The main idea is to cluster the perfect knowledge frequency profiles of a broad set of general-purpose applications to generate combined FVE mappings for each cluster. Results show that the average write energy for MLC PCM of the proposed method is only 5% more than the average write energy of perfect knowledge FVE. For TLC RRAM, the average write energy of the proposed solution is better by 1%. Furthermore, the low NVM and codec overheads make the proposed method easy to implement and integrate into modern memory controllers.
