Phase change memory (PCM) is a promising technology for computer memory systems. However, the non-volatile nature of PCM poses serious threats to computer privacy. The low programming endurance of PCM devices also limits the lifetime of PCM-based main memory (PRAM). In this paper, we first adopt counter mode encryption for privacy protection and show that encryption significantly reduces the effectiveness of some previously proposed wear-leveling techniques for PRAM. To mitigate such adverse impact, we propose simple, yet effective extensions to the encryption scheme. In addition, we propose to reuse the encryption counters as age counters and to dynamically adjust the strength of error correction code (ECC) to extend the lifetime of PRAM. Our experiments show that our mechanisms effectively achieve privacy protection and lifetime extension for PRAM with very low performance overhead.
Introduction
Phase change memory (PCM) is an emerging memory technology that features low access latency, byte addressability, high integration density, and low leakage energy consumption. Recently, there has been strong interest in using PCM as main memory (PRAM) in computer systems [13] [20] [27].
Despite these strengths, PCM raises new challenges for computer system design. One key property of PCM is its non-volatility: the information stored in PRAM persists without a power supply and may last for more than 10 years [18]. Although non-volatility is a fundamental reason for power efficiency, it makes PRAM much more vulnerable to malicious security attacks than volatile DRAM. Another important issue of PCM is its unreliability, particularly due to its limited write endurance [18] [27].
In this paper, we adopt a hardware encryption scheme, counter-mode encryption, to protect the privacy of PRAM, given its proven security and high performance. However, encryption negates the effectiveness of some previously proposed wear-leveling techniques, redundant bit-write removal [27] and partial writes [13] in particular.
To mitigate the encryption impact upon wear-leveling techniques, we propose two new schemes. First, we propose simple yet effective extensions to the encryption scheme to revive partial writes. Second, we propose to use encryption counters as age counters and to dynamically adjust error protection strengths. In this work, we use error correction code (ECC) and design an efficient way to manage ECC protection. Our experiments using a cycle-accurate timing simulator show that the performance overhead introduced by encryption and ECC is small given the long latency of PRAM accesses.
Our main contributions include (1) to our knowledge, this is the first work to investigate the impact of encryption on wear-leveling techniques for PRAM; (2) we propose a new encryption counter scheme to revive partial writes to reduce write traffic; (3) we propose to leverage encryption counters as age counters and design an efficient way for ECC management to extend PRAM lifetime.
The remainder of the paper is organized as follows. In Section 2, we present background and discuss related work.
In Section 3, we analyze the encryption impact on wear leveling techniques and we propose a new encryption counter scheme to mitigate it. In Section 4, we present our design to leverage encryption counters as age counters for efficient ECC management. In Section 5, the overall architecture is presented and the interaction between encryption and ECC is discussed. Detailed experimental results are presented in Section 6. Section 7 concludes the paper.
Background and related work
In PCM, a memory cell can be transformed between a low-resistivity state and a high-resistivity state by atomic arrangements. Compared to FLASH memory, PCM is reported to be of higher performance [20] [27]. In this section, we discuss the challenges of using PCM as main memory (PRAM); some of them can be applied to the FLASH technology as well.
Privacy
The non-volatile nature of PRAM aggravates the privacy concerns over the contents residing in main memory. It was reported that secret keys for disk encryption can still be retrieved from volatile DRAM even minutes after a computer is powered off [10]. With PRAM, attackers may simply dump the plaintext image of PRAM to extract critical information. Such a one-time storage dump is also a major security concern for disk storage systems, and disk encryption is often used for security protection [8]. In a stricter security model, it is assumed that more complex security attacks exist and that attackers are able to monitor the dynamic data read from and stored to main memory. Protection against such advanced security attacks is addressed with secure processor architectures [14]. The security of those secure architectures relies on the assumption that processor cores are unbreakable. The secret keys involved are generated or sealed inside processors.
Given the privacy issues with PRAM, we opt to use counter-mode encryption for its proven security and high performance [3] and assume that the secret keys are inside processors. Figure 1 shows the counter-mode encryption proposed for secure processors [3]. The counter data block is composed of a counter, two address offsets and a logical page identifier (LPID). The counter is associated with a cache line and is incremented by one every time the cache line is written back to main memory. One address offset is the cache line offset within a page. The other is the plaintext data block offset within the cache line when the cache line size is larger than the size of an encryption/plaintext data block. The LPID is assigned uniquely for each allocated page in the main memory. The security strength of the counter-mode encryption comes from the fact that the value of the counter data block for each plaintext data block is unique both spatially and temporally. The LPID and address offsets provide spatial uniqueness while the counter ensures temporal uniqueness. For resource efficiency, counters of limited sizes are used.
As a result, when a counter overflows, a new unique LPID is generated and the whole page would be re-encrypted to ensure uniqueness of counter data blocks [3] . The performance advantage of the counter-mode encryption is that the block cipher (i.e., encryption) latency can be overlapped with the latency of fetching the encrypted data block.
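The composition of the counter data block and its use as a one-time pad can be sketched as follows. This is a minimal illustration under stated assumptions, not the design of [3]: the field widths in `counter_block` are hypothetical, and `pad` uses HMAC-SHA256 truncated to 16 bytes purely as a stand-in for the AES block cipher so that the sketch runs with the Python standard library alone.

```python
import hashlib
import hmac
import struct

BLOCK_BYTES = 16  # one encryption block (128 bits)


def counter_block(lpid: int, line_offset: int, block_offset: int, counter: int) -> bytes:
    """Pack LPID, cache-line offset, block offset and counter into one seed."""
    # Hypothetical field widths: 64-bit LPID, 16-bit offsets, 32-bit counter.
    return struct.pack(">QHHI", lpid, line_offset, block_offset, counter)


def pad(key: bytes, seed: bytes) -> bytes:
    """Stand-in for the block cipher applied to the seed (assumption: AES in a real design)."""
    return hmac.new(key, seed, hashlib.sha256).digest()[:BLOCK_BYTES]


def xor_with_pad(key: bytes, seed: bytes, block: bytes) -> bytes:
    """Encrypt or decrypt one block; XOR with the pad is its own inverse."""
    return bytes(a ^ b for a, b in zip(block, pad(key, seed)))


key = b"\x01" * 16
plaintext = b"sixteen byte blk"

# The same plaintext written back twice: only the counter changes.
seed_old = counter_block(lpid=7, line_offset=3, block_offset=0, counter=41)
seed_new = counter_block(lpid=7, line_offset=3, block_offset=0, counter=42)
cipher_old = xor_with_pad(key, seed_old, plaintext)
cipher_new = xor_with_pad(key, seed_new, plaintext)
```

Because the pad depends only on the seed and the key, it can be computed while the ciphertext is still being fetched, which is the latency overlap described above.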
Wear-leveling techniques
Various wear-leveling techniques have been proposed to address the limited endurance of PRAM to extend its lifetime. The granularity of the approaches can be at the segment/page level, the cache-line level, and the individual bit level. They can be classified into two categories. One is using swapping/shifting/rotation to even the wear-out among hot (more write accessed) and cold (less write accessed) spots. For example, segment swapping [27] happens when one segment becomes so hot that it has to be swapped with some cold segment to even out wear-out.
Similarly, in the start-gap scheme [19], each cache line would be rotated through the whole memory with the support from one spare cache line.
Wear-leveling techniques that operate at or above the cache line granularity, such as segment swapping [27], start-gap [19] and line-level write-back [20], are not affected by encryption. For redundant bit-write removal [27], with encryption, the new ciphertext data values would be largely different from the old ciphertext data values. Figure 3 shows such an example with AES encryption. Even if the to-be-written-back plaintext data block is the same as the old one, the two ciphertext data blocks are largely different from each other due to the encryption counter update. Therefore, the effectiveness of redundant bit-write removal is greatly reduced. For the partial-write scheme [13], the problem with encryption is that the encryption counter for a cache line is incremented for every write back. As a result, the whole cache line needs to be re-encrypted and modified, thereby completely disabling partial writes even if there is only one dirty word in the cache line.
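The loss of bit-level redundancy can be illustrated with a small experiment. The sketch below is an assumption-laden illustration: `keystream` is a hypothetical stand-in that derives pads with HMAC-SHA256 instead of AES, and the seed contains only a counter and a block index. It writes the same plaintext line twice and counts how many bits a redundant bit-write removal scheme would still have to program.

```python
import hashlib
import hmac

LINE_BYTES = 256  # last-level cache line size used in the experiments


def keystream(key: bytes, counter: int, length: int) -> bytes:
    """Hypothetical stand-in for the counter-mode pads covering one cache line."""
    out = b""
    block = 0
    while len(out) < length:
        seed = counter.to_bytes(8, "big") + block.to_bytes(4, "big")
        out += hmac.new(key, seed, hashlib.sha256).digest()[:16]
        block += 1
    return out[:length]


def encrypt_line(key: bytes, counter: int, line: bytes) -> bytes:
    """XOR the line with the pads derived from the current counter value."""
    return bytes(a ^ b for a, b in zip(line, keystream(key, counter, len(line))))


def changed_bits(old: bytes, new: bytes) -> int:
    """Number of bit positions whose stored value differs between two writes."""
    return sum(bin(a ^ b).count("1") for a, b in zip(old, new))


key = b"\x02" * 16
line = bytes(LINE_BYTES)  # identical plaintext is written back twice
total_bits = 8 * LINE_BYTES

plain_changed = changed_bits(line, line)
cipher_changed = changed_bits(encrypt_line(key, 1, line), encrypt_line(key, 2, line))
plain_fraction = plain_changed / total_bits
cipher_fraction = cipher_changed / total_bits
```

Without encryption no bit changes, so every bit write is redundant. With the counter update, the two ciphertexts behave like independent random strings, so roughly half of the bits differ.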
A new encryption counter scheme to mitigate the encryption impact
Based on the analysis in Section 3.1, we propose to extend the original encryption counter scheme to revive the partial-write wear leveling. Our extension includes additional counters at the encryption-block granularity. In other words, for each cache line, besides one cache-line level counter, we add multiple block-level counters.
Encryption for each data block is done using the 
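One way the combination of a cache-line level counter and small block-level counters could behave is sketched below. The overflow policy is an assumption made for illustration: when a dirty block's counter is saturated, the line counter is incremented, all block counters are reset, and the whole line is rewritten. The class name `LineCounters` and its method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

BLOCKS_PER_LINE = 16    # 256-byte line / 16-byte encryption block
BLOCK_COUNTER_BITS = 2  # counter size selected in the evaluation


@dataclass
class LineCounters:
    line_counter: int = 0
    block_counters: List[int] = field(default_factory=lambda: [0] * BLOCKS_PER_LINE)

    def write_back(self, dirty_blocks: Set[int]) -> Set[int]:
        """Update the counters and return the blocks that must be re-encrypted and written."""
        limit = (1 << BLOCK_COUNTER_BITS) - 1
        if any(self.block_counters[b] == limit for b in dirty_blocks):
            # Assumed overflow policy: new line counter, fresh block counters,
            # and the whole line is re-encrypted so that no seed is reused.
            self.line_counter += 1
            self.block_counters = [0] * BLOCKS_PER_LINE
            return set(range(BLOCKS_PER_LINE))
        for b in dirty_blocks:
            self.block_counters[b] += 1
        return set(dirty_blocks)


counters = LineCounters()
# Four consecutive write-backs that each dirty only block 5.
written = [len(counters.write_back({5})) for _ in range(4)]
```

In this sketch the first three write-backs touch a single block each, and only the fourth forces a full-line write, so the pair (line counter, block counter) never repeats for any block.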
Adaptive ECC management
A straightforward way to extend PRAM lifetime using ECC is to allocate enough ECC storage to cover the maximum number of bit errors that are expected. However, during most of the PRAM lifetime, the number of bit errors is much lower than the expected maximum. Such a fixed allocation is therefore wasteful in space and potentially harmful to performance. The reasons are (a) some memory space is allocated for unnecessary ECC storage and (b) the logic for correcting a high number of bit errors is slower than that for a small number of bit errors.
In this paper, we propose to dynamically manage ECC strength according to the reliability/wear-out status of PRAM. To keep track of the wear-out status of PRAM, we use the number of write-accesses to PRAM as a metric. We assume that the raw bit error rate increases as PRAM ages [9] . Therefore, we gradually increase the strength of ECC protection by allocating more ECC bits when PRAM is gradually worn out. For efficient ECC management, two main challenges need to be addressed. The first is where to store the ECC bits as the size varies for different error correction requirements and how to access them accordingly. The second is to obtain the write-access counters as they are necessary for monitoring the wear-out status of PRAM.
The ECC space and address mapping
To accommodate dynamic ECC management, we propose a hardware-software integrated approach, shown in Figure 5 .
Figure 5. The design of our proposed scheme for dynamic ECC management
Instead of having separate memory chips for ECC, we propose to store both data and their ECC bits in a unified memory space (i.e., virtual memory space). The operating system (OS) will assist the dynamic allocation of ECC and data pages and maintain information on the ECC pages so that data pages will not be mapped to the same places.
The virtual memory is partitioned into groups and each group contains multiple data pages and the corresponding ECC pages. All the data pages in the same group have the same level of ECC protection. The assumption is that some aforementioned rotation-based wear-leveling techniques
The process of determining the physical address of the ECC bits of a data access is shown in Figure 6. First, the physical address of the data (data_cacheline_addr in Figure 6, as the unit of data operation is the cache line size) is used to decide which group contains the data (ECC_base_addr in Figure 6). Then, the offset within the group (ECC_offset_addr) is calculated to see where the data is located within the group at the cache line granularity. Based on how many ECC bits are allocated for each cache line, the address of the ECC bits is generated. Note that due to the use of powers of two, the multiplication and division in Figure 6 can be implemented using simple shifts and adds.
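The address generation steps can be sketched as follows. This is an illustration under assumptions, not the exact logic of Figure 6: the group size `GROUP_SHIFT`, the per-group table `ecc_base_addr`, and the rounding of each line's ECC slot to a power-of-two number of bytes (`ecc_bytes_shift`) are all hypothetical choices that keep the arithmetic to shifts and adds.

```python
from typing import Dict

LINE_SHIFT = 8    # 256-byte cache lines
GROUP_SHIFT = 20  # hypothetical: 1 MB of data pages per group


def ecc_address(data_cacheline_addr: int,
                ecc_base_addr: Dict[int, int],
                ecc_bytes_shift: int) -> int:
    """Map a data address to the address of the ECC bits for its cache line."""
    # Step 1: the upper address bits select the group and its ECC base address.
    group = data_cacheline_addr >> GROUP_SHIFT
    # Step 2: offset of the access inside the group, at cache-line granularity.
    offset_in_group = data_cacheline_addr & ((1 << GROUP_SHIFT) - 1)
    line_index = offset_in_group >> LINE_SHIFT
    # Step 3: scale by the ECC slot size (2**ecc_bytes_shift bytes per line).
    return ecc_base_addr[group] + (line_index << ecc_bytes_shift)


# Hypothetical layout: group 2 keeps its ECC pages at 0x40000000, 8 bytes per line.
bases = {2: 0x40000000}
addr_a = ecc_address(0x200511, bases, ecc_bytes_shift=3)  # line 5 of group 2
addr_b = ecc_address(0x2005FF, bases, ecc_bytes_shift=3)  # same cache line
addr_c = ecc_address(0x200600, bases, ecc_bytes_shift=3)  # next cache line
```

Raising the protection level of a group then amounts to changing its slot size and ECC base, while the data addresses stay untouched.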
The group size is mainly dependent on the PRAM endurance characteristics. Since all the data in the same group have the same ECC, we essentially assume that the memory in a group will wear out at a similar rate. If the endurance variation is expected to be large, we prefer a small group size to prevent a small region from affecting a large amount of memory. Note that although an ECC page is 
Leveraging encryption counters as write-access (age) counters
There are several ways to track the number of write accesses to memory pages in PRAM. One is to associate a local counter with each cache line. The maximum among all local counters in a page would be the age of the page.
Another way is to have a single counter per page that records the total number of write accesses to the page. The first approach incurs high space overhead (local counter size * number of lines in a page). The second approach has much less space overhead (just one counter) but is less accurate, since write accesses to different lines in a page are accumulated, leading to a highly overestimated age. To reduce space overhead without losing much accuracy, we choose to use a two-level counter scheme similar to the one used in [22].
In the counter-mode encryption described in [3], there already is a local counter (LC) for each line in a page. In addition, there is a 64-bit logical page identifier (LPID) assigned for each memory page when it is allocated. To account for a high number of write accesses, we add another global counter (GC) for each page. When a local counter overflows, GC is incremented by one and all the local counters in the same page are reset to zero. In this case, as discussed in Section 2.1, a new unique LPID is generated and the page is re-encrypted to ensure security [3]. Such combined two-level global and local counters are much more accurate than one global counter and have much less space overhead than the local counter scheme. Note that as the overflow happens rarely, the associated performance overhead is negligible [22].
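The two-level counter update can be sketched as follows. The class `PageCounters`, its parameters, and in particular the `age` formula (global counter scaled by the local counter range, plus the largest local counter) are assumptions made for illustration; the text above only specifies the overflow behavior.

```python
from typing import List


class PageCounters:
    """Per-page write tracking with local counters (LC) and one global counter (GC)."""

    def __init__(self, lines_per_page: int, lc_bits: int) -> None:
        self.lc_max = (1 << lc_bits) - 1
        self.local: List[int] = [0] * lines_per_page
        self.global_counter = 0
        self.reencryptions = 0  # each LC overflow triggers a new LPID and page re-encryption

    def record_write(self, line: int) -> None:
        if self.local[line] == self.lc_max:
            # LC overflow: increment GC and reset every LC in the page.
            self.global_counter += 1
            self.local = [0] * len(self.local)
            self.reencryptions += 1
        else:
            self.local[line] += 1

    def age(self) -> int:
        """Assumed age estimate: writes seen by the hottest line of the page."""
        return self.global_counter * (self.lc_max + 1) + max(self.local)


page = PageCounters(lines_per_page=16, lc_bits=2)  # tiny LC so the overflow is visible
for _ in range(5):
    page.record_write(0)
age_after_five_writes = page.age()
```

With 2-bit local counters the fourth write to line 0 overflows, so after five writes the global counter is 1, the local counter of line 0 is 1, and the estimated age is 5.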
Overall Architecture
The overall architecture for improving privacy and lifetime is shown in Figure 7. With the proposed architecture, a memory access proceeds as follows. When an encrypted cache line is to be fetched from PRAM, its memory address is used to locate the corresponding counters and generate the seed for the block cipher. As the counters are stored along with the ciphertext data in the PRAM, directly accessing them would postpone the seed generation process and expose the block cipher latency. To overcome this performance issue and overlap this latency with the PRAM access latency, a counter cache is included as in previous work [3]. Note that if the counter scheme proposed in Section 3.2 is used, each local counter (LC) will be appended with a few small block-level counters. The encrypted data are stored in a data page in PRAM. The ECC bits are stored in an ECC page in the same group as the data page. In our scheme, ECC and data pages have the same organization and the counters in either type of page are updated when there is a write access. The only difference is that the counters in the ECC page are only used to track the age and not for decryption.
There is an interesting interaction between encryption and ECC generation. Two options exist: the ECC bits can be computed either based on plaintext data or ciphertext data. If ECC is computed using plaintext data, the write access time can be reduced as the ECC computation and encryption can be performed in parallel. On the other hand, Figure 7 .
In an alternative memory organization, which uses PCM as main memory and DRAM as another level of cache [20] , the counter cache can reside in the DRAM cache. This DRAM cache can also be used to store the uncompressed data when memory compression technologies [1] [7] are employed. Since encryption makes data less compressible, encryption should happen after the compression stage. In this case, the plaintext data shown in 
Experimental results

Methodology
Our experiments are conducted using a cycle-accurate timing simulator developed upon the SimpleScalar toolset [2]. The underlying processor model is MIPS R10000 and the default configuration is listed in Table 1. The L2 cache size is set to 1 MB to increase the memory traffic. The PRAM access latency is 1024 cycles [19]. The block cipher engine is a 128-bit AES cipher and the encryption latency is assumed to be 80 cycles [3]. Each cache line in the counter cache stores the counters for one page. It is composed of a 64-bit LPID [3], multiple local counters and one global counter (GC). The error correction code used is a binary cyclic code (BCH) [17]. The BCH latency depends on the number of bit errors that can be corrected and the maximum latency is 120 cycles for correcting 8 bit errors [11] [26]. For each message data of k bits, a BCH codeword (containing both data and the redundant ECC bits) with a length of n bits can be constructed to correct up to t bit errors out of the entire codeword. The length of the codeword n should satisfy $2^{m-1} - 1 < n \le 2^{m} - 1$ and $m \cdot t \le n - k$, where m is the minimum number of redundant ECC bits required for every bit error correction. In our experiments, 4 BCH codes are interleaved to protect the data at the granularity of the last-level cache line size (256 bytes). For each BCH codeword, k = 512 bits (64 bytes) and m = 10. Therefore $n - 512 \ge 10t$ must be satisfied. This indicates that each additional bit error correction needs an additional 10 redundant ECC bits.
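The arithmetic behind these constraints can be worked through in a few lines. The sketch below only uses the parameters stated above (k = 512, m = 10, 4 interleaved codewords per 256-byte line); the helper names are hypothetical, and the codeword length is taken at its minimum, n = k + m*t.

```python
M = 10                  # redundant ECC bits per corrected bit error
K = 512                 # data bits per BCH codeword
CODEWORDS_PER_LINE = 4  # interleaved codewords per 256-byte cache line
LINE_BITS = 256 * 8


def min_codeword_bits(t: int) -> int:
    """Smallest n that satisfies n - k >= m * t."""
    return K + M * t


def fits_code_length(t: int) -> bool:
    """Check 2^(m-1) - 1 < n <= 2^m - 1 for the minimal codeword length."""
    n = min_codeword_bits(t)
    return (1 << (M - 1)) - 1 < n <= (1 << M) - 1


def ecc_bits_per_line(t: int) -> int:
    """ECC bits stored for one cache line when each codeword corrects t errors."""
    return CODEWORDS_PER_LINE * M * t


def space_overhead(t: int) -> float:
    """ECC storage relative to the data it protects."""
    return ecc_bits_per_line(t) / LINE_BITS


per_error_overhead = space_overhead(1)  # 40 / 2048
max_strength_bits = ecc_bits_per_line(8)
```

Each additional correctable error per codeword costs 40 bits per cache line, about 2.0% of the line, and the strongest setting considered here (t = 8) needs 320 ECC bits per line.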
Memory-intensive benchmarks from SPEC2000 and SPEC2006 with high cache miss rates are used in the experiments. For lifetime analysis, an in-order processor model is used for simulation speed and each benchmark is simulated for one hundred billion instructions or until completion.
Impact of encryption on wear-leveling techniques
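In this section, we show the impact of encryption on redundant bit-write removal and partial writes. Without encryption, the benchmark with the highest ratio, mcf, has around 99.9% of its total bit writes redundant.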
Even for the lowest one, equake, the redundant bit writes are around 68%. In contrast, with encryption, every benchmark has only around 50% of total bit writes redundant. On average using the geometric mean (Gmean), redundant bit-write removal can save around 90% of the bit-write traffic to PRAM. With encryption, however, it would only save half of the bit-write traffic to PRAM.
Figure 8. Impact of encryption on redundant bit-write removal
978-1-4244-7501-8/10/$26.00 ©2010 IEEE
As discussed in Section 3.1, encryption completely removes the benefit of partial writes. In this experiment, we first confirm the benefits of partial writes in reducing write traffic when encryption is not used. The traffic reductions normalized to the baseline, in which every replacement of a dirty cache line writes the whole cache line to PRAM, are shown in Figure 9. It can be seen that partial writes at word granularity (4 bytes) or encryption block granularity (16 bytes) reduce the write traffic by around 45% or 35%, respectively, on Gmean. Note that we conduct the experiments conservatively, as we do not simulate the memory buffer organization. With coalescing effects in memory buffers, the results may be better [13]. With encryption, however, partial writes fail to reduce any write traffic.
Figure 9. Impact of encryption on partial writes (legend: word granularity, encryption block granularity, with encryption)
Effect of the new proposed encryption counter scheme on partial writes
Figure 10 shows the effect of our proposed new encryption counter scheme (Section 3.2) to revive partial writes to reduce write traffic to PRAM. The baseline is the one in which the whole dirty cache line is written back to PRAM upon replacement. In our experiment, we examined block-level encryption counters of different sizes: 1-bit, 2-bit, 3-bit, and 4-bit. We also compared them to the ideal case when the block-level counters can be arbitrarily large (labeled 'MAX-bit counter' in Figure 10). space overhead. Therefore, we choose 2-bit block-level counters, which achieve 26.8% write traffic reduction on average using Gmean.
Impact of memory compression on wear leveling techniques
In this section, we examine the impact of memory compression. It can be used as another wear-leveling technique to reduce write traffic. However, it also varies the bit-level redundancy compared to uncompressed data.
In our experiments, Frequent Pattern Compression (FPC) [1] and LZSS [21] similarity is also reported in [7]. Note that it is not practical to combine compression with partial writes as compression may change the data layout in the compressed data. There are several observations. First, there are two to abundant non-dirty data as shown in Figure 9 and high bit-level redundancy as shown in Figure 8. Overall, compression combined with bit-write removal achieves similar write-traffic reduction (87.7% on Gmean) to redundant bit-write removal only (89.5% on Gmean).
PRAM lifetime comparison
In this experiment, we map write traffic to a PRAM lifetime estimate. We assume that write traffic to PRAM is .. This is much more efficient than the fixed ECC allocation scheme, which means the space otherwise reserved for ECC can be allocated for program use.
In the next experiment, we examine redundant bit writes in ECC with and without encryption. Two ECC schemes, SEC-DED [24] and BCH [26], are used in this experiment and the results are shown in Figure 13. The figure shows that, without encryption, there is a high number of redundant bit writes, which can be eliminated with the redundant bit-write removal technique. However, with encryption, for both SEC-DED and BCH, the ratio of redundant bit writes becomes 50%, indicating the same behavior as the bit-write traffic in regular data.
Performance overhead of encryption and ECC
As discussed in Section 5, we choose to compute ECC based on encrypted data. To examine the overall performance impact, we model the proposed scheme in our timing simulator. For each benchmark, we skip the first one billion instructions and execute the next three hundred million instructions. The performance results, which are normalized to the baseline without encryption or ECC, are shown in Figure 14.
Figure 14. Performance overhead of encryption and ECC
Figure 14 shows that adding encryption and ECC has a very small performance impact (0.3% on average). The benchmark mcf_06 has the worst performance degradation (1%) due to its high memory traffic. The reason for such a small performance overhead is that the PRAM access latency is large enough to dominate the overall performance.
Summary
In summary, without encryption, redundant bit-write removal and partial writes can achieve 51.3 years and 7.9 years of PRAM lifetime, respectively. With encryption, those two combined can only achieve 2.6 years of lifetime.
Our proposed new encryption scheme can improve it to around 5.0 years at 1.6% space cost. Redundant bit-write removal combined with compression can achieve 27.5 years without encryption and 2.6 years with encryption.
The encryption counters can be leveraged to monitor the PRAM wear-out status, and adaptive ECC management can be deployed at an additional 2.0% space cost and a dozen-cycle latency for each additional bit error to be corrected.
Conclusion
Phase change memory is a promising technology for computer systems. In this paper, we investigate the largely overlooked privacy issue of PRAM due to its non-volatility.
We show that if encryption is used for privacy protection, some previously proposed wear-leveling schemes will be severely affected. To mitigate the adverse impact of encryption, we propose to extend the counter-mode encryption to revive a wear-leveling technique: partial writes. We also investigate memory compression and show that it reduces memory traffic but hurts bit-level redundancy. Then we study error correction code (ECC) as an essential mechanism for PRAM lifetime extension. We propose a dynamic ECC management scheme to vary ECC protection strength according to the age of PRAM, which is conveniently provided by the encryption counters. Our experimental results show that the performance overhead
