Abstract: Hybrid memory, where the DRAM acts as a buffer to the PRAM, is a promising configuration for main memory systems. It has the advantages of fast access time, high storage density and very low standby power. However, it also has reliability issues that need to be addressed. This paper focuses on low-cost Error Control Coding (ECC)-based schemes for improving the reliability of hybrid memory. We propose three candidate systems that all guarantee a block failure rate of $10^{-8}$ but differ in whether the DRAM and/or PRAM data get coded and in the strength of the corresponding ECC code. The candidate systems are evaluated with respect to lifetime, Instructions Per Cycle (IPC) and energy. We show that (1) at lower Data Storage Time (DST), the proposed system which has different ECC schemes for DRAM and PRAM has the longest lifetime and one of the highest IPCs; (2) at higher DST, stronger ECC codes are necessary for all the systems, and longer lifetime can be achieved at the cost of a decrease in IPC.
I. INTRODUCTION
Phase-Change Random Access Memory (PRAM) is a promising memory technology that has higher storage density and lower standby power compared to Dynamic Random Access Memory (DRAM) and is likely to replace DRAM in main memory systems. However, several challenges need to be overcome before incorporating PRAM into main memory [1]: (1) larger read and write latency: PRAM is 2x and 10x slower for reads and writes, respectively, thereby negatively affecting system performance; (2) poor write endurance: PRAM-based main memory is only able to sustain $10^{6}$ to $10^{8}$ programming cycles, as opposed to more than $10^{16}$ for DRAM-based main memory; and (3) higher write energy: the programming energy of PRAM is 50x higher than that of DRAM, so the write traffic to PRAM should be reduced significantly. A hybrid main memory, where DRAM serves as a cache or buffer for PRAM, can exploit the benefits of both memories. Such a memory has lower write latency, higher endurance and lower read and write power consumption compared to a PRAM-only memory, and significantly higher storage capacity and lower standby power compared to a DRAM-only memory.
There are several studies on latency reduction, lifetime improvement and power management for hybrid DRAM+PRAM main memory [2], [3], [4]. The hybrid architecture in [2] uses a lazy write policy to successfully reduce the write traffic, thereby hiding the slow latency and high energy consumption of PRAM and indirectly addressing the endurance problem. Methods based on DRAM bypass and dirty data keeping are proposed to reduce refresh energy and memory access latency in [3]. For low cache hit rates, the DRAM cache can be bypassed when reading data from PRAM, resulting in a 23.5% to 94.7% reduction in energy consumption compared to DRAM-based main memory. Also, dirty data in DRAM can be kept for a longer period of time in order to decrease the number of write-backs to PRAM. The architecture in [4] achieves the same effect as lazy write in [2] by restricting CPU writes to DRAM and only writing to PRAM when a dirty page is evicted from DRAM. Such a scheme has been shown to yield significant improvement in endurance, performance and energy consumption.
*This work was supported in part by NSF CNS 1218183.
In this paper, we focus on the use of low-cost Error Control Coding (ECC) schemes for improving the reliability of DRAM+PRAM hybrid systems. Our contributions are as follows: (1) we outline and analyze the error characteristics of DRAM+PRAM hybrid main memory; (2) we propose three candidate systems that all guarantee a predetermined reliability constraint but differ in what type of ECC is used and where; (3) we present a thorough evaluation and discussion of Instructions Per Cycle (IPC), lifetime and energy consumption of the three systems. We find that if Data Storage Time (DST) is low, the system that has different ECC schemes for DRAM and PRAM has the longest lifetime and one of the highest IPCs with negligible energy penalty. At higher DST, stronger ECC codes are required for all three systems to satisfy the reliability constraints, and longer lifetime can be achieved at the cost of a decrease in IPC.
The rest of this paper is organized as follows. Section II reviews the basics and error models of PRAM and summarizes the techniques to reduce the error rate in PRAM. We propose three candidate systems and their corresponding ECC schemes in Section III. In Section IV, we describe the evaluation platform based on CACTI [5] and GEM5 [6] and present the tradeoffs between IPC performance, energy and lifetime of the three candidate systems before concluding in Section V.
II. BACKGROUND

A. Basics of PRAM
A Single Level Cell (SLC) PRAM consists of two states, RESET state (logical '0') corresponding to the high resistance amorphous phase; and SET state (logical '1') corresponding to the low resistance crystalline phase. A 2-bit Multiple Level Cell (MLC) PRAM has 4 states: '00' is full amorphous state with the highest resistance, '11' is full crystalline state with the lowest resistance, and '01' and '10' are the two intermediate states. 
B. Errors in MLC PRAM
There are two main types of errors in an MLC PRAM as shown in Fig. 1 . The soft errors increase with increase in DST, defined as the time that the data is stored in memory between two consecutive writes. When DST increases, the resistance of state '11' remains almost constant, while the resistances of the other states increase resulting in soft errors. Specifically, soft error Es('01'->'00') is due to the resistance of state '01' crossing the threshold resistance Rth(01,00), and Es('10'->'01') is caused by the resistance of state '10' crossing the threshold resistance Rth(10,01). Thus soft error rates depend on the distributions of the resistances of states '10' and '01' and the values of Rth(10,01) and Rth(01,00).
Hard errors are mainly caused by repeated cycling that leads to Sb enrichment at the bottom electrode [9] . As a result, the bottom electrode cannot heat the GST material sufficiently, and the resistance of state '00' becomes lower than the desired level for reset state. Thus hard errors are mainly due to resistance of state '00' crossing Rth(01,00). These errors increase as the Number of Programming Cycles (NPC) increases. In this paper, we use the accurate error models that we had developed in [8] based on our device models [7] to calculate the soft and hard error rates.
C. Techniques to Reduce the Error Rates in PRAM
In order to reduce the Bit Error Rate (BER) in PRAM, a multilevel approach including techniques at the circuit and the architectural levels has been proposed in our previous work [8] .
At the circuit level, threshold resistance tuning assigns an optimal value to Rth(01,00) to minimize the total error rate. This value is a function of NPC and DST. At the architecture level, Gray code and 2-bit interleaving are used to divide the data into two groups: even blocks and odd blocks. By using Gray code, '10' is mapped to '11' and '11' is mapped to '10', and thus soft error Es('10'->'01') is encoded into Es('11'->'01'). There are fewer errors in the odd blocks that house the most significant bit, making it possible to use a simple ECC scheme, such as a Hamming code. In contrast, the even blocks have a larger number of errors and require stronger ECC. Moreover, while the errors in the odd blocks depend only on DST, the errors in the even blocks increase with NPC and DST. In addition, we use non-iterative sub-block flipping, which helps in transforming the visible errors into invisible errors and eliminating most of the visible errors through flipping after the verify-after-write process [8]. Figure 2 describes the soft and hard error rates for different combinations of DST and NPC as a function of Rth(01,00); the other threshold resistances are fixed. These error rates are obtained after applying Gray coding, bit interleaving and sub-block flipping. As Rth(01,00) increases, the hard error rate increases monotonically while the soft error rate drops at the beginning due to the reduction in Es('01'->'00') and then remains constant. Both trends can be explained with the help of Fig. 1. In this paper, we use the following notation: BERodd,soft and BEReven,soft are the soft bit error rates in odd blocks and even blocks, respectively, BEReven,hard is the hard bit error rate in even blocks, and BERodd,hard ≈ 0.
Fig. 2. Soft and hard error rates as a function of Rth(01,00) for different combinations of DST and NPC (the NPC values include $10^{6.0}$ and $10^{6.6}$ cycles).
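The Gray relabelling and the split into odd and even blocks can be illustrated with a minimal Python sketch. It assumes that each 2-bit cell contributes its most significant bit to the odd block and its least significant bit to the even block (the text states only that the odd blocks house the most significant bit). The function names and the list-based representation are hypothetical and chosen purely for illustration.

```python
# Gray relabelling from the text: '10' <-> '11'; '00' and '01' are unchanged.
# For 2-bit values this mapping is its own inverse.
GRAY = {0b00: 0b00, 0b01: 0b01, 0b10: 0b11, 0b11: 0b10}


def split_blocks(cells):
    """Split Gray-labelled 2-bit cell values into (odd, even) bit lists.

    Assumption: the odd block collects the MSBs and the even block the LSBs.
    """
    odd = [(c >> 1) & 1 for c in cells]
    even = [c & 1 for c in cells]
    return odd, even


def merge_blocks(odd, even):
    """Rebuild the 2-bit cell values from the odd (MSB) and even (LSB) blocks."""
    return [(m << 1) | l for m, l in zip(odd, even)]


def flipped_bits(state_before, state_after):
    """Return (msb_flipped, lsb_flipped) for a drift between two resistance states.

    States are named by their labels before Gray relabelling.
    """
    diff = GRAY[state_before] ^ GRAY[state_after]
    return bool(diff & 0b10), bool(diff & 0b01)
```

Under these assumptions, `flipped_bits(0b10, 0b01)` reports a flip in the MSB only, so the soft error Es('10'->'01') lands in the odd block, while `flipped_bits(0b01, 0b00)` reports a flip in the LSB only, so Es('01'->'00') lands in the even block.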
D. Error Characteristics
From Fig. 2, we see that for each combination of DST and NPC, the total error rate reaches a minimum. At this point, the hard error rate equals the soft error rate, and the corresponding resistance is referred to as the optimal Rth(01,00) [8]. For the minimum error point, BEReven,hard, BEReven,soft and BERodd,soft can be obtained from Fig. 2 by the following steps: (1) BEReven,hard is equal to BERsoft (the sum of BERodd,soft and BEReven,soft); (2) BERodd,soft is given by the value of the BERsoft curve in the constant region, marked in Fig. 2; and (3) BEReven,soft is the difference between BERsoft and BERodd,soft, since BERodd,soft is unchanged for fixed DST.
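The three steps above can be sketched in Python for curves that have been sampled at discrete values of Rth(01,00). The input lists, the function name, and the choice to read the constant region of the soft-error curve from its last sample are assumptions made for illustration; the actual curves come from Fig. 2 and the models in [8].

```python
def min_error_point(ber_hard, ber_soft):
    """Locate the sampled minimum-total-BER point and split it into components.

    ber_hard, ber_soft: equal-length lists sampled at increasing Rth(01,00).
    Assumption: the last sample of ber_soft lies in its constant region.
    Returns (index, ber_even_hard, ber_even_soft, ber_odd_soft).
    """
    totals = [h + s for h, s in zip(ber_hard, ber_soft)]
    idx = min(range(len(totals)), key=totals.__getitem__)
    ber_odd_soft = ber_soft[-1]                    # step (2): plateau value
    ber_even_soft = ber_soft[idx] - ber_odd_soft   # step (3): remainder of BERsoft
    ber_even_hard = ber_hard[idx]                  # step (1): ~ BERsoft at the optimum
    return idx, ber_even_hard, ber_even_soft, ber_odd_soft
```

On a coarse grid the sampled minimum only approximates the point where the hard and soft curves cross, so a finer Rth(01,00) sweep gives a better match to step (1).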
III. HYBRID MAIN MEMORY SYSTEMS
In this section, we describe three different PRAM+DRAM hybrid memory systems. In all three cases, the PRAM is of size 1GB and is organized into blocks of 512 bits; the DRAM cache size varies from 512KB to 32MB. A write-back policy is employed, where PRAM writes occur when a DRAM block is replaced. The Least Recently Used (LRU) scheme is used as the replacement policy for DRAM. We assume that the raw BER of DRAM is at most $10^{-12}$. This is obtained by extracting the Failures In Time (FIT) of DRAM systems from [12] and computing the BER from FIT using the method in [10]. The BER of PRAM is a function of DST and NPC as described in Section II.
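The write-back and LRU policies described above can be sketched as follows. This is an illustrative model only: it assumes a fully associative cache and write-allocate on a miss, neither of which is stated in the text, and the class and counter names are hypothetical.

```python
from collections import OrderedDict


class WriteBackLRUCache:
    """Toy DRAM cache in front of PRAM: LRU replacement, write-back on eviction."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()  # block address -> dirty flag, oldest first
        self.pram_reads = 0
        self.pram_writes = 0

    def access(self, addr, is_write):
        """Access one block; return True on a DRAM hit, False on a miss."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # mark as most recently used
            self.blocks[addr] = self.blocks[addr] or is_write
            return True
        if len(self.blocks) >= self.num_blocks:
            _, dirty = self.blocks.popitem(last=False)  # evict the LRU block
            if dirty:
                self.pram_writes += 1  # PRAM is written only on dirty eviction
        self.pram_reads += 1  # assumed write-allocate: fetch the block from PRAM
        self.blocks[addr] = is_write
        return False
```

Counting `pram_writes` for a given access trace gives a rough feel for how a larger DRAM cache lowers the PRAM write traffic, which in turn lengthens the time between consecutive writes to a PRAM block.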
A. Error Protection for the Hybrid Memory
The three candidate schemes are shown in Fig. 3. Fig. 3(a) illustrates a traditional system wherein only data in PRAM are protected from errors by ECC. Thus, such a system works only if the error rate in DRAM is fairly low. Systems 2 and 3 have ECC protection for data in the DRAM cache as well as in PRAM, as shown in Fig. 3(b) and Fig. 3(c), respectively. In System 2, data in the DRAM cache and PRAM are protected by the same ECC unit and hence the ECC scheme should be stronger. In System 3, different ECC schemes are used for the DRAM and PRAM to better exploit the error characteristics of these two memories. The different ECC schemes applied to the three candidate systems are listed in Table I. We focus on Systems 1, 2, 3 here; Systems 1*, 2*, 3* are for higher DST applications and will be described later. Hamming is abbreviated to capital 'H' in this table. For System 1, a shortened Hamming code (265,256) is used for odd blocks of PRAM. However, BCH with multiple error correction capability, such as BCH (274,256) with t=2, is required for even blocks. System 2 also needs error correction capability with t=2 since it has to correct both DRAM and PRAM errors. Since System 2 operates on 512-bit blocks, it uses a BCH (532,512) code. As for System 3, a shortened Hamming code (522,512) is employed to protect 512 bits of information in the DRAM cache against errors occurring with a BER of $10^{-12}$. Note that while System 3 adds a simple Hamming code in the DRAM cache, it uses the same ECC scheme for PRAM as System 1, since most of the errors are due to PRAM. The three candidate systems have different latencies and lifetimes; their trade-offs will be described in Section IV.
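The code lengths quoted above are consistent with the usual parity bound for binary BCH codes, in which a t-error-correcting code built over GF(2^m) needs at most m·t parity bits (a Hamming code being the t=1 case). The following Python sketch, with hypothetical function names, reproduces the (n, k) pairs under that assumption. It is a sanity check on the parameters, not a description of how the codes were actually constructed.

```python
def bch_parity_bits(k, t):
    """Parity bits m*t for a shortened binary BCH code with k data bits.

    m is the smallest field degree whose natural length 2**m - 1 can hold
    the k data bits plus the m*t parity bits (assumed m*t parity bound).
    """
    m = 1
    while (1 << m) - 1 < k + m * t:
        m += 1
    return m * t


def code_params(k, t):
    """Return (n, k) for a shortened t-error-correcting code on k data bits."""
    return k + bch_parity_bits(k, t), k


# Codes named in the text for Systems 1, 2 and 3.
for k, t in [(256, 1), (256, 2), (512, 2), (512, 1)]:
    print(code_params(k, t), "t =", t)
```

With 256-bit odd and even blocks the field degree is m=9, while 512-bit blocks need m=10, which is why the 512-bit codes carry one extra parity bit per corrected error.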
B. Block Failure Rates for Three Systems
We use the block failure rate (BFR) as the reliability metric and set a constraint of BFR = $10^{-8}$ [8]. While BFR is a function of block usage and is thus application dependent, we assume that the baseline PRAM employs a simple wear-leveling mechanism [2] to guarantee uniform usage of the different blocks in the PRAM.
We used MATLAB [13] to build a simulation environment for analyzing the reliability of the hybrid systems. Errors are introduced into the DRAM corresponding to BER = $10^{-12}$ and into the PRAM according to BEReven,hard, BERodd,soft and BEReven,soft (from Fig. 2). For each system, the MATLAB simulation engine generates the BER after decoding of the odd bits (BERodd) and even bits (BEReven). In Systems 1 and 3, these BERs are used to calculate BFRodd and BFReven using the binomial distribution as shown in eqn. (1)-(2). The failure of a 512-bit block is due to errors in even blocks, odd blocks or both even and odd blocks, and this has been taken into account in the calculation of the BFR of System 1 described in eqn. (3). The BFR equations for System 3 are the same as for System 1. However, since BERodd and BEReven of Systems 1 and 3 are different, the final BFRs are different. System 2 has a single ECC decoder that operates on 512-bit blocks (instead of 256-bit even and odd blocks). BERodd and BEReven are added to generate the BER after decoding of the 512-bit block. This is then used to calculate the BFR of System 2 using eqn. (4).
Since a simple wear leveling scheme [2] is employed, for a 1GB PRAM and 4MB DRAM cache, DST can be as long as $10^{4}$ s if PRAM is updated after every 100 programming cycles. For larger DRAM, the number of PRAM accesses is lower and the DST is larger. Thus in this study we choose DST = $10^{4}$ s to $10^{5}$ s. The BFRs of System 2 and System 3 increase monotonically as NPC increases with DST of $10^{4}$ s. The BFR of System 1 stays unchanged in the beginning since most errors are due to the DRAM cache, which has a fixed BER. However, as NPC increases, the PRAM errors increase and the BFR increases. When DST grows to $10^{5}$ s, the BFRs of System 1 and System 3 remain constant in the beginning and then increase after NPC exceeds a certain value. This is because when NPC is low and DST is high, BEReven,hard is quite low and so BFR is dominated by BERodd,soft, which is only protected by a weak ECC scheme (Hamming code).
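As a rough illustration of how a binomial model turns a bit error rate into a block failure rate, the Python sketch below computes the probability that more than t of n independent bits are in error, and combines the odd and even block failure rates by inclusion-exclusion. This is a textbook formulation under an independence assumption, working from a raw BER; it is not the paper's eqn. (1)-(4), which start from the simulated post-decoding BERs. The function names are hypothetical.

```python
import math


def block_failure_rate(n, t, ber):
    """P(more than t of n bits are in error), assuming independent bit errors.

    The binomial tail is summed in log space to stay accurate for tiny BERs.
    """
    if ber <= 0.0:
        return 0.0
    if ber >= 1.0:
        return 1.0 if t < n else 0.0
    log_p = math.log(ber)
    log_q = math.log1p(-ber)
    total = 0.0
    for i in range(t + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return total


def combined_bfr(bfr_odd, bfr_even):
    """Failure of the odd block, the even block, or both (independence assumed)."""
    return bfr_odd + bfr_even - bfr_odd * bfr_even
```

For example, `block_failure_rate(522, 1, 1e-12)` models a 522-bit Hamming-protected DRAM block at the assumed DRAM BER, and is dominated by the probability of exactly two bit errors.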
As Fig. 4 shows, System 1 has the highest BFR since it cannot recover from any errors occurring in the DRAM cache. Thus employing ECC schemes to protect data in PRAM when data has already been corrupted in the DRAM cache is not an effective way to reduce error rates. The BFR curves of System 2 and System 3 have a crossover point. For instance, for DST = $10^{4}$ s (Fig. 4(a)), System 2 has a lower BFR than System 3 until NPC = $10^{5.86}$ cycles, and a higher BFR after that. We also find that this crossover point shifts right as DST increases. Thus we can see that System 3 outperforms System 2 at a lower DST or a larger NPC.
When the BFR constraint is fixed at $10^{-8}$, which is shown by the dashed line in Fig. 4, we can see that the different systems have different lifetimes (in terms of NPC). System 1 has the shortest lifetime due to its highest BFR. When DST is $10^{4}$ s (see Fig. 4(a)), the lifetime of System 3 is longer than that of System 2 by 28.8%, while when DST is $10^{5}$ s (see Fig. 4(b)), the lifetime of System 3 is longer by 20.2%. With a more stringent BFR constraint (< $10^{-8}$), we see that stronger ECC schemes are necessary for both Systems 1 and 3. In order to increase the lifetime when DST is larger, we propose stronger ECC schemes as shown in Table I and marked as System 1*, System 2* and System 3*. For System 1* and System 3*, BCH (274,256) with t=2 is used instead of Hamming to enhance the error correction in the odd blocks, while the ECC schemes remain unchanged in the even blocks. Since the BFR of System 1 and System 3 is dominated by BERodd,soft when DST is high, the use of a stronger ECC scheme in odd blocks improves the lifetime of both systems. System 2* uses a stronger t=3 code, BCH (542,512), and thus has better error performance than the other systems. Figure 5 shows the comparison of the BFRs between the original systems and the systems with stronger ECC, marked by system*. As shown in Fig. 5, System 1* achieves a large enhancement in lifetime relative to System 1 since it overcomes System 1's limited error-correcting capability in odd blocks. The BFR of System 2* is significantly lower than that of the other systems since it can detect and correct three errors in the hybrid main memory regardless of whether the errors occurred in the PRAM or in the DRAM. System 3* also has significantly better performance compared to System 3. Its lifetime is improved by 17.5% compared with System 3, and it has a much lower BFR than System 3 at lower NPC. For example, at an NPC of $10^{5.9}$, the BFR of System 3* is three orders of magnitude lower than that of System 3.
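Reading a lifetime off a BFR versus NPC curve amounts to finding where the curve crosses the BFR constraint. The Python sketch below does this by linear interpolation on log-log axes for a sampled, increasing curve; the function names, the interpolation choice and any sample points passed in are assumptions for illustration, not values from Fig. 4 or Fig. 5.

```python
import math


def lifetime_log10_npc(log10_npc, bfr, target=1e-8):
    """Return log10(NPC) at which an increasing BFR curve first reaches target.

    log10_npc: increasing sample positions; bfr: positive BFR at each sample.
    Returns None if the curve never reaches the target in the sampled range.
    """
    log_target = math.log10(target)
    logs = [math.log10(b) for b in bfr]
    if logs[0] >= log_target:
        return log10_npc[0]
    for i in range(1, len(logs)):
        if logs[i] >= log_target:
            x0, x1 = log10_npc[i - 1], log10_npc[i]
            y0, y1 = logs[i - 1], logs[i]
            return x0 + (log_target - y0) * (x1 - x0) / (y1 - y0)
    return None


def relative_lifetime_gain(log10_npc_a, log10_npc_b):
    """Fractional lifetime gain of system A over system B, in NPC terms."""
    return 10.0 ** (log10_npc_a - log10_npc_b) - 1.0
```

Because lifetimes are compared in NPC rather than in log10(NPC), a shift of 0.1 on the log axis already corresponds to a lifetime gain of roughly 26%.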
IV. EVALUATION
A. BCH Based ECC schemes
All ECC schemes are based on BCH codes. The 2t-folded SiBM architecture [11] is used to minimize the circuit overhead of the key equation solver at the expense of an increase in latency. The syndromes are calculated in parallel and a parallel factor of 8 is used for calculations in the Chien search blocks [8]. The BCH encoders and decoders are synthesized in 45 nm technology using the Nangate cell library [16] and Synopsys Design Compiler [17].
B. Simulation Setup
1) CACTI Setup
We obtained the PRAM cell memory circuit parameters, such as write/read current, resistance, and access latency, using HSPICE [7], and embedded them into CACTI [5]. Since PRAM is a resistive memory, the equations for bitline energy and latency had to be modified as well. The rest of the parameters are the same as the default parameters used in the DRAM memory simulator, with the ITRS Low Operating Power (LOP) setting used for peripheral circuits. 256 cells corresponding to a 512-bit block were simulated for write/read operations. Table II shows CACTI results for the write energy of all transitions and the read energy (8 steps and 60 ns current pulse width) [8]. For the DRAM cache, we use CACTI in low power mode. Table III lists read and write energy and leakage power for different sizes of DRAM: 2MB, 4MB and 8MB.
2) GEM5 Setup
We use a single core setting in GEM5 [6] to simulate the performance of a system with PRAM based main memory. Our workload includes the benchmarks of SPEC CPU INT 2006 [14] and DaCapo-9.12 [15] . For GEM5 simulations, the PRAM memory latency obtained by CACTI and ECC latency obtained through synthesis using 45nm technology are expressed in number of cycles corresponding to the processor frequency of 2GHz. Read latency from hybrid memory includes 95 cycles of wire routing delay, memory read operation latency and ECC decoder latency.
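The conversion of latencies into processor cycles can be sketched in Python as follows. The 2GHz frequency and the 95-cycle routing delay come from the text; rounding each component up to a whole cycle, the function names, and any nanosecond values passed in are assumptions for illustration.

```python
import math


def ns_to_cycles(latency_ns, freq_ghz=2.0):
    """Convert a latency in nanoseconds to whole cycles (rounding up is assumed)."""
    return math.ceil(latency_ns * freq_ghz)


def read_latency_cycles(mem_read_ns, ecc_decode_ns, wire_cycles=95, freq_ghz=2.0):
    """Read latency = wire routing delay + memory read + ECC decoding, in cycles."""
    return (wire_cycles
            + ns_to_cycles(mem_read_ns, freq_ghz)
            + ns_to_cycles(ecc_decode_ns, freq_ghz))
```

With hypothetical inputs of a 50 ns memory read and a 2.5 ns ECC decode, `read_latency_cycles(50.0, 2.5)` gives 95 + 100 + 5 cycles. A slower decoder adds directly to every read, which is why a stronger code on the read path lowers IPC.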
C. DRAM Cache Size
We investigated the impact of DRAM cache size on the energy and latency of the system. Fig. 6 presents the normalized energy of the hybrid memory, wherein the PRAM has a size of 1GB and the size of the DRAM cache is varied from 512KB to 8MB. We see that (1) the energy of PRAM goes down while the energy of the DRAM cache goes up, as the number of read and write accesses to PRAM decreases with increasing DRAM cache size; (2) the total energy of the hybrid system drops when the size of the DRAM cache increases, since the decrease in PRAM energy is larger than the increase in DRAM cache energy. We also find that the latency of PRAM reads increases while the latency of the DRAM cache decreases with increasing DRAM cache size, and that the total latency of the system reduces mildly with increasing DRAM cache size. Since DRAM caches of 2MB, 4MB and 8MB have similar energy and latency, we study these three DRAM cache sizes in the rest of the paper.
D. System Performance
1) System Performance at DST of $10^{4}$ s
In this section, IPC, energy and lifetime are used to evaluate the different hybrid systems. The lifetime of the different systems is obtained from the BFR vs NPC curves. Specifically, lifetime is defined as the NPC corresponding to BFR = $10^{-8}$. The energy of read and write operations for DRAM and PRAM is obtained from CACTI, and access (read/write) times to DRAM and PRAM are obtained from GEM5. The ECC encoding and decoding latencies and energy are obtained from synthesis results using Synopsys. Total energy includes PRAM read/write energy along with the energy consumed by parity storage, ECC encoding/decoding energy and the leakage energy of DRAM. It is worth mentioning that the ECC encoding/decoding energy is trivial compared to the read/write energy of the system. Table IV compares the lifetime, IPC and energy of the different systems at DST of $10^{4}$ s for a hybrid system with a PRAM memory of size 1GB and a 4MB DRAM cache. As shown in Table IV, System 1 has the highest IPC but poor lifetime owing to the absence of ECC protection for the DRAM cache. Compared to System 1, System 3 obtains a 51.4% enhancement in lifetime at the expense of only a 0.1% loss of IPC. All systems consume comparable energy since the ECC energy is insignificant relative to the PRAM energy. We conclude that System 3, which has stronger ECC, is the best choice at DST of $10^{4}$ s after weighing all three metrics.
2) System Performance at DST of $10^{5}$ s
Figure 7 describes the lifetime, IPC and energy of the different systems at DST of $10^{5}$ s with a 4MB DRAM cache. The IPC and energy of System 1 are not shown since its lifetime (< $10^{5.7}$ NPC) is much smaller than that of the others. As Fig. 7(a) shows, System 2* has a much longer lifetime compared to the other systems due to its stronger error correction capability, but has the lowest IPC because of the long latency of the BCH t=3 code. It has higher redundancy and thus consumes slightly more energy.
Note that System 3 and System 3* have the same IPC since they utilize the same ECC schemes in the critical path, which determines the decoding latency. Similarly, System 1 and System 1* share the same IPC. Fig. 7(b) shows that System 3* achieves 17.5% longer lifetime, while only incurring 1.2% more energy consumption with no change in IPC compared to System 3. Fig. 7 also shows that longer lifetime can be achieved at the price of lower IPC. For example, compared to System 2*, System 3* has 10.7% increase in IPC at the price of 20.6% loss in lifetime. 
3) System Performance for different cache sizes
In order to find the trends of energy and IPC with increasing DRAM cache size, DRAM sizes of 16MB and 32MB are also considered in the following analysis. Fig. 8(a) shows the IPC of the different systems with increasing DRAM cache size. We find that the IPC increases monotonically with DRAM cache size due to the increasing hit rate in the DRAM cache. However, the increase in IPC is quite small when the DRAM cache size increases beyond 16MB. This is because, for a large cache, the improvement in the DRAM cache hit rate is limited by factors such as block size. Finally, the IPC of System 2 is about 10% lower than that of the other two systems because System 2 uses stronger ECC, resulting in longer coding latency for accesses to either the DRAM cache or the PRAM memory. Fig. 8(b) shows that the energy of all systems decreases with increasing DRAM cache size. This is because a larger DRAM cache results in fewer accesses to PRAM, whose read and write energies dominate the total energy. When the DRAM cache becomes very large, leakage energy plays a more dominant role. The increase in the leakage energy can then outweigh the decrease in the PRAM access energy.
Compared to the hybrid main memory system with a 1GB DRAM cache and 32GB PRAM, which has an average lifetime of 9.7 years [2], our system consisting of 8MB DRAM and 1GB PRAM has an average lifetime of 2.1 years at DST of $10^{5}$ s for the test benchmarks. When the DRAM cache size increases to 32MB, the average lifetime of our system is close to 13 years for these benchmarks. The lifetime would improve even further if a larger PRAM were used. The enhancement in lifetime is due to the multilevel approach [8] that we used to reduce the BER of the PRAM.
4) System Area Overhead
The area overhead of the candidate systems is due to parity storage in PRAM and DRAM and to the ECC encoding/decoding circuitry, the latter being comparatively negligible. When the DRAM is of size 4MB, the parity storage overhead of each system is below 7.5%. In fact, all systems have comparable area overhead, ranging from 4.70% to 7.43%.
V. CONCLUSION
In this paper we present three PRAM+DRAM systems that satisfy the same reliability constraint but differ in what type of ECC is used for PRAM and DRAM. We analyze the lifetime (in terms of NPC), IPC and energy of the systems and find that (1) System 3, with two layers of ECC for the DRAM cache and PRAM, outperforms the other systems; it has longer lifetime, higher IPC and lower energy at DST of $10^{4}$ s; (2) for longer DST (when DST equals $10^{5}$ s), stronger ECC is required for all systems, and System 3* achieves 10.7% higher IPC with a penalty of 20.6% loss in lifetime compared to System 2*, which uses only one ECC scheme for the whole system.
