Abstract-Most server-grade systems provide Chipkill-Correct error protection at the expense of power and performance. In this paper we present a low overhead solution to improving the reliability of commodity DRAM systems with no change in the existing memory architecture. Specifically, we propose five erasure and error correction (E-ECC) schemes that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. Synthesis results at the 28 nm node show that the decoding latency of these codes is negligible compared to the DRAM access latency. In addition, we make use of erasure codes to extend the lifetime of the DRAM systems. Specifically, once a chip is marked faulty due to persistent errors, all E-ECC schemes correct erasures due to that faulty chip and also correct an additional random error in a second chip. Evaluation with SPEC2006 workloads shows that, compared to x4 Chipkill-Correct schemes, Scheme 5 has the highest IPC improvement (mean of 7 percent) and Scheme 4 has the largest power reduction (mean of 18 percent) and the largest increase in energy efficiency (mean of 25 percent).

INTRODUCTION
Memory reliability is a major challenge in the design of large scale computing systems. More than 40 percent of hardware related failures are attributed to memory systems [1], and this number is projected to increase in the future. Memory systems are vulnerable to different kinds of faults (e.g., hard, intermittent or random) [2], [3]. These faults manifest as single bit errors, multiple errors along a row and/or along a column of a chip and even a whole chip failure. The challenge is in designing schemes that have higher reliability than current systems but with lower power and performance overhead.
High performance servers are typically expected to have Chipkill-Correct level protection, that is, the ability to correct errors due to failures of a DRAM chip with 12.5 percent storage overhead [4], [5]. Chipkill-Correct was first implemented by striping data across multiple chips so that single bit error correction and double bit detection (SEC-DED) code could be used to correct errors due to a chip failure [4]. The bit-level Chipkill-Correct code had high power consumption and low system performance and so symbol-based Chipkill-Correct was proposed for x4 DRAM systems in [6], [7]. Such a scheme activated 36 chips across two ranks for every memory access and thus also had high power consumption.
To reduce the power consumption, many systems moved to x8 or x16 DRAMs, which activate fewer chips per memory access. For instance, V-ECC [8] activates 18 x8 chips across two ranks while LOT-ECC [9] and Multi-ECC [10] only activate nine chips per rank. Although LOT-ECC and Multi-ECC can reduce the memory power by an average of more than 25 percent compared to Chipkill-Correct, they cannot fully correct a chip failure at run time.
In order to activate fewer chips and maintain high reliability, many systems employ two-tier schemes [8] , [9] , [10] , [11] , [12] , where the first tier is used for error detection and the second tier is used for error correction. For example, V-ECC [8] uses two check symbols in the first tier to perform error detection and uses the third check symbol in the second tier to perform error correction. The second tier is usually cached to reduce the read latency for error correction and to reduce the number of writes for updating the tier-2 ECC symbols.
Existing schemes such as those in [9] , [10] also rely on replacement of faulty chips to extend the lifetime of the DRAM memory systems. Some commercial systems use memory sparing or bit-steering [10] , [13] , [14] to re-route the faulty bits or re-route data from a faulty chip to a healthy chip. This not only reduces the usable physical memory size but also increases the overhead required to perform re-routing or mapping.
In this paper, we propose a very different approach to providing at least Chipkill-Correct error protection for commodity DRAM systems. Our approach is based on the use of stronger symbol based codes which are chosen to handle the constraints of the different memory systems. Use of a stronger code adds very little overhead to existing systems. Synthesis results in 28 nm node show that the decoding latencies of these codes are very small and do not affect the DRAM timing performance. Moreover, unlike the existing multi-tiered schemes, our schemes do not require extra memory read/write operations to access data for error correction.
Furthermore, instead of employing chip sparing to increase the lifetime of memory systems, we make use of erasure correction, where an erasure is defined as an error whose location is known. We utilize the machine check architecture (MCA) [2] , [14] , [15] to record the error information of each chip. Once the number of errors in a certain chip increases beyond a threshold, this chip is marked as faulty and errors due to this chip are treated as erasures.
In the rest of this paper, we present five erasure and error correction (E-ECC) schemes that provide at least Chipkill-Correct reliability for x4, x8 and x16 DRAM systems. All the proposed E-ECC schemes can correct errors due to a chip failure on the fly and can correct one more random error when the chip is marked as faulty. We analyze the tradeoffs between reliability and system performance (timing, power and energy) of these proposed schemes. Overall, this paper makes the following key contributions:
For x4 DRAM systems, we present three schemes that all have 12.5 percent storage overhead but differ in the number of ranks being activated (one or two). The specific ECC codes used in these three schemes are the rotational (144,128) code over GF(2^4) and the RS (36,32) code over GF(2^8). The schemes that activate two ranks per memory access have lower timing, power and energy performance but higher reliability compared to the one that activates only one rank.
For x8 DRAM systems, we propose a scheme which is also based on the RS (36,32) code over GF(2^8); it has the lowest power consumption and highest energy efficiency among all five schemes.
For x16 DRAM systems, we propose a scheme which is based on the RS (20,16) code over GF(2^8); its storage overhead is 25 percent but it has the highest timing performance among all the schemes.
Finally, compared to existing schemes, the proposed E-ECC schemes have superior reliability. They all achieve at least Chipkill-Correct reliability and one of our x4 E-ECC schemes can even correct errors due to two chip failures. Our x8 E-ECC scheme has similar timing and power performance compared to [8] but with higher reliability and lower storage overhead. Compared to [9] and [10], it has slightly lower power/energy efficiency but stronger reliability since LOT-ECC cannot handle row failures and Multi-ECC cannot handle column failures.
The rest of the paper is organized as follows. Section 2 introduces the DRAM architecture, DRAM error characteristics, existing methods and our strategy. Section 3 presents the proposed schemes; details of the decoding algorithms are given in the Appendix. Section 4 includes the synthesis results of the E-ECC decoders. Section 5 discusses the timing, power, energy and reliability of the proposed schemes. Section 6 concludes this paper.
BACKGROUND
DRAM Memory Systems
A DRAM memory system is organized into channels, ranks, chips and banks [16]. The DRAM memory controller (MC) acts as an interface between the last level cache and the DRAM. It can access data from one or more channels. Each channel consists of dual in-line memory modules (DIMMs), each of which consists of one or more ranks. A rank is the minimum unit that is activated in a read or write access. Each rank is composed of multiple chips (also called devices), and the number of chips to be activated depends on the width of the data bus. For DDR3, the I/O width (N) is typically 4, 8 or 16 bits. Since the 64 bit data path is fixed, a rank consists of 64/N data chips. A DRAM system built with sixteen x4 DRAM chips is referred to as a 16 x4 system. The common DRAM system configurations are 16 x4, 8 x8 and 4 x16 (number of chips per rank x data I/O width). In an x4 system, there are 8 extra ECC bits for every 64 bits of data, resulting in two extra ECC chips per rank. In an x8 system, one extra chip per rank is used for ECC. The DRAM architecture for an x8 system is shown in Fig. 1.
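The rank arithmetic above can be sketched in a few lines. This is only an illustration: the function name `chips_per_rank` and the constants are hypothetical, and the rounding up of the ECC device count for x16 parts is an assumption based on the fact that a device cannot be split.

```python
DATA_BUS_BITS = 64  # fixed DDR3 data path width, as stated in the text
ECC_BUS_BITS = 8    # 8 ECC bits accompany every 64 data bits


def chips_per_rank(io_width: int) -> tuple[int, int]:
    """Return (data_chips, ecc_chips) for a rank built from xN devices."""
    data_chips = DATA_BUS_BITS // io_width
    # Assumption: a device cannot be split, so round the ECC chips up.
    ecc_chips = -(-ECC_BUS_BITS // io_width)
    return data_chips, ecc_chips


for n in (4, 8, 16):
    data, ecc = chips_per_rank(n)
    print(f"x{n}: {data} data chips + {ecc} ECC chip(s) per rank")
```

For x4 and x8 devices this reproduces the 16 + 2 and 8 + 1 configurations described above.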
In DDR3 systems, data is accessed in the burst mode; typically, burst length is eight or four (chopped burst) [17] . A burst length of eight means that eight beats of data are transferred per memory access [16] . Some DRAM systems operate in the lock-step mode [8] , [16] . In such a mode, two physical channels operate as a single logical channel. A single 64B cache line is fetched using two memory channels; one half of the cache line is accessed from the first channel while the second half is obtained from the second channel.
DRAM Error Characteristics
DRAM errors can be broadly classified into soft errors and hard errors. Soft errors are caused by transient faults which occur randomly and cause incorrect data to be read from a memory location; they disappear when the location is overwritten. Hard errors are caused by permanent faults or intermittent faults. A permanent fault causes a memory location to consistently return an incorrect value, such as a stuck-at-0 fault. An intermittent fault causes a memory location to sometimes return incorrect values. Note that a single fault can result in multiple error instances [2] , [3] .
DRAM errors have been analyzed in detail in [2], [3], [18], [19], [20]. The studies in [2], [18], [19] show that a large fraction of errors are hard errors and these manifest as repeating errors occurring at the same address, row, column or chip. In addition, permanent faults tend to be clustered [18]; these errors have strong correlations in space and time. The repeated errors contaminate the nearby rows and columns and increase dramatically in the presence of prior errors. The studies in [18], [19] also show that the number of errors in any memory system increases over time. A more recent analysis of DRAM faults performed over a period of 15 months shows that while the failure rate due to transient faults increases mildly, the failure rate due to permanent faults is higher in the beginning and becomes almost the same as that of transient faults around months six to eight [3]. It is projected that the failure rate would again increase towards the end of the device's lifetime.
In general, if a chip has persistent errors, then that chip can be marked as faulty and all data from that chip can be treated as erasures. Note that erasures are defined as errors whose locations are known [21] . Thus if erasures can be corrected, the faulty chip can continue to be used instead of being retired. In this work, we utilize the error recording mechanism of machine-check architecture [2] , [14] , [15] to mark a chip as faulty. The MCA has registers to store the error address, time or type (corrected or uncorrected) of errors. Error events are recorded both during memory scrubbing [13] , [22] and during normal read operation. Now if the number of errors is larger than a threshold value (the threshold value is system-dependent), the chip is marked as faulty by the MC.
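A minimal sketch of this threshold-based marking is given below. It is not the MCA interface itself: the class `ChipFaultTracker`, its fields and the per-chip counter are hypothetical names chosen for illustration, and the threshold value is left to the caller because the text states that it is system-dependent.

```python
from collections import Counter


class ChipFaultTracker:
    """Counts recorded errors per chip and marks chips as faulty (illustrative)."""

    def __init__(self, threshold: int):
        self.threshold = threshold      # system-dependent, chosen by the caller
        self.error_counts = Counter()   # chip id -> errors seen so far
        self.faulty = set()             # chips whose data is treated as erasures

    def record_error(self, chip_id: int) -> bool:
        """Record one error event (scrub or read); return True if chip is faulty."""
        self.error_counts[chip_id] += 1
        # The chip is marked once its count is larger than the threshold.
        if self.error_counts[chip_id] > self.threshold:
            self.faulty.add(chip_id)
        return chip_id in self.faulty
```

Once a chip id is in `faulty`, a decoder would treat every symbol from that chip as an erasure rather than as an error of unknown location.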
Existing ECC Mechanisms
Chipkill-Correct is the most common error protection scheme for DRAM memory systems [4], [5], [19], [23]. It can correct errors due to failure of one chip and also detect errors due to two chip failures. The original Chipkill-Correct solution from IBM [4] used single bit error correction and double bit error detection code (SEC-DED). Current systems such as Sun UltraSPARC-T1/T2 [24] and AMD Opteron systems use symbol based Chipkill-Correct codes. An example of such a code is the rotational (144,128) code [6], which is a (36,32) code over GF(2^4). In an x4 memory system, this code results in activation of 36 devices across 2 ranks and thus consumes a lot of power. Next we describe several methods that try to achieve a balance between reduction in power consumption and Chipkill-Correct reliability.
Virtualized ECC (V-ECC) [8] provides Chipkill-Correct capability for x4 and x8 DRAM systems. It is based on a 3 check symbol code, where 2 check symbols are used for detection (tier-1) and a third check symbol is used for correction (tier-2). In an x8 system, V-ECC activates 18 chips in 2 ranks. It caches the tier-2 symbols to reduce the read latency and write frequency. However, it still incurs extra read/write operations to perform error correction or to update the ECC bits. The storage overhead of V-ECC is 18.75 percent.
Localized and tiered ECC (LOT-ECC) [9] activates only nine chips per rank in x8 DRAM systems to reduce power consumption. It uses multiple layers of XOR operations to deal with memory errors. Data along with local and global ECC parity bits are stored in the same DRAM row to improve access efficiency. If errors are detected, global parity bits are read by a second access. LOT-ECC is not a Chipkill-Correct solution since it can only correct a stuck-at-0 or stuck-at-1 chip failure. The storage overhead of LOT-ECC is 26.5 percent, which is higher than the existing schemes.
Multi-line error correction (Multi-ECC) [10] also activates only nine chips per rank in x8 DRAM systems. Multi-ECC uses a different approach where errors are first detected along rows and then column checksums are used to locate the errors. The row parity bits are then used to correct these errors. This method requires a large number of data reads when an error is detected. In addition, since Multi-ECC uses one's complement for column checksums, it cannot fully detect errors due to column failures. The storage overhead of this method is only 12.9 percent, which is a small increase compared to Chipkill-Correct.
Adaptive reliability chipkill correct (ARCC) [25] also provides two tiers of error protection. It reduces the power consumption by activating only one rank when there are no errors. When errors are detected, ARCC adaptively adjusts the ECC strength by combining adjacent codewords to perform error correction. Once two ranks are combined, the cache line size is increased from 64 to 128B. ARCC does not increase the ECC storage overhead; the only overhead is that the last level cache needs to be modified to accommodate both 64 and 128B cache lines.
Bamboo-ECC [26] is a recently proposed single-tier error protection scheme that provides good system reliability with low storage overhead. It uses an 8-bit symbol based RS code for x4 DRAM systems to handle error events ranging from correcting single bit errors with 3.1 percent storage overhead to correcting errors due to two chip failures with 25 percent storage overhead. Furthermore, by grouping per-pin data to form ECC symbols, it is able to correct double pin failures and thus provides higher error protection compared to Chipkill-Correct.
Apart from the academic solutions described above, there are several commercial solutions. Many server systems use page retiring or chip sparing to improve reliability [13] , [18] , [27] . For example, Intel systems use double device data correction (DDDC) to correct double device errors sequentially. In each rank, one DRAM device is reserved as a spare chip; when a chip is marked faulty, the spare chip is utilized. IBM ProteXion [14] uses redundant bit steering to re-route the faulty bits to backup bits to deal with bit failures. Instead of using eight bits to protect 64 bits of data it uses only six bits and uses the remaining two bits as backup bits.
Our method does not use any spare bits/chips or re-routing; instead, we handle errors due to faulty chips through erasure correction. The decoding circuit for erasure correction is quite small and its latency is negligible compared to the DRAM access latency (Section 4). Thus we believe erasure correction is a more efficient way to deal with errors due to chips that have been marked faulty.
PROPOSED E-ECC SCHEMES
In this section we describe the proposed E-ECC schemes for x4 DRAM systems in Section 3.1, for x8 DRAM systems in Section 3.2 and for x16 DRAM systems in Section 3.3. For each scheme, we describe the data access pattern and the decoding flowchart. The decoding algorithms are described in the Appendix and the corresponding synthesis results are included in Section 4.
x4 DRAM Systems
E-ECC Scheme 1
Chipkill-Correct uses 4-check symbol codes to correct errors due to a single chip failure and detect errors due to two chip failures. We use the rotational (144,128) code [6] as the representative Chipkill-Correct code in this paper. Here, 36 devices are activated across two ranks in each access. Each device provides 4 bits of data per beat, that is, 36 × 4 = 144 bits per beat, to the ECC decoding unit. Each set of 144 bits is decoded to obtain 128 data bits and a total of 4 × 128 = 512 data bits is sent to the last level cache.
The rotational (144,128) code has a minimum distance of 4 and so this code can support the following cases: (i) single symbol correction and double symbol detection, (ii) single erasure correction, (iii) single erasure and single error correction, (iv) double erasure correction and (v) double erasure and single error detection. Current Chipkill-Correct x4 systems implement only single symbol correction and double symbol detection (case (i)). Here, we propose an enhancement which makes use of the same (144,128) code to handle erasures; we refer to it as E-ECC Scheme 1.
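The five cases follow from a standard coding-theory bound rather than from anything specific to this code: a code with minimum distance d can simultaneously correct t errors, detect s ≥ t errors and fill f erasures when t + s + f + 1 ≤ d. The sketch below checks each case against that bound; the function name `supports` and its parameters are hypothetical.

```python
def supports(d_min: int, correct: int = 0, detect: int = 0, erasures: int = 0) -> bool:
    """Check the bound t + s + f + 1 <= d_min (with s >= t) for one operating mode."""
    detect = max(detect, correct)  # every correctable pattern is also detectable
    return correct + detect + erasures + 1 <= d_min


D_ROTATIONAL = 4  # minimum distance of the rotational (144,128) code, from the text

cases = {
    "(i) single correction, double detection": dict(correct=1, detect=2),
    "(ii) single erasure": dict(erasures=1),
    "(iii) single erasure + single error correction": dict(correct=1, erasures=1),
    "(iv) double erasure": dict(erasures=2),
    "(v) double erasure + single error detection": dict(detect=1, erasures=2),
}
for name, mode in cases.items():
    print(name, supports(D_ROTATIONAL, **mode))
```

Under the same bound, double error correction would need a minimum distance of at least 5, which is why it is not among the cases listed for this code.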
If a chip is marked as faulty, it leads to a single erasure and Scheme 1 performs single erasure correction (case (ii)). When one more random error occurs in another chip, Scheme 1 can still correct it (case (iii)). The decoder first checks if it is a single erasure event. If so, case (ii) is executed; otherwise, case (iii) is activated. If a second chip fails, the MC marks it as faulty. The decoder first checks if it is a double erasure event. If so, double erasure correction (case (iv)) is implemented; otherwise, double erasure and single error detection (case (v)) is activated. The decoding flowchart for Scheme 1 is shown in Fig. 2 and the details of the decoding algorithm are included in the Appendix.
E-ECC Scheme 2
To enhance the error correction capability of x4 DRAM systems, we investigate ECC codes operating in a higher-order finite field. We combine data from two beats to obtain 256 data bits and 32 ECC bits. Since there is no RS (72,64) code over GF(2^4), we move to GF(2^8). We propose using RS (36,32) in GF(2^8), which can be derived from RS (255,251). RS (36,32) can provide double error correction, that is, it can correct double chip failures on the fly instead of only detecting them as in Scheme 1. We refer to this method as E-ECC Scheme 2.
In Scheme 2, two ranks (with 18 chips per rank) are activated per access. In each read/write, two consecutive 4-bit symbols from the same bank form a single 8-bit symbol, and thus a total of 36 symbols are read out. Fig. 3a illustrates the data access pattern. The proposed RS (36,32) E-ECC code has a minimum distance of 5 and supports the following cases: (i) single error correction, (ii) double error correction, (iii) single erasure correction, (iv) single erasure and single error correction, (v) double erasure correction and (vi) double erasure and single error correction.
The default state of Scheme 2 is single error and double error correction. Since the RS code has a special algebraic structure, the decoder can use the syndrome to distinguish between case (i) and case (ii) efficiently [28]. When a chip fails, it leads to a single error in the E-ECC codeword and a single error can be corrected easily. When the MC marks this chip as faulty, the error becomes an erasure in an ECC codeword and the decoding circuitry for single erasure correction (case (iii)) is activated. Correcting a single erasure is simpler than correcting a single error (see Appendix). Furthermore, if there is an additional error in another chip, it can be corrected as well (case (iv)). We assume that the errors build up over time and so a second faulty chip can start generating repeated errors. When the MC marks the second chip as faulty, the E-ECC decoder activates double erasure correction (case (v)). Once two chips are marked as faulty, it can correct one more random error (case (vi)). The decoding flowchart is shown in Fig. 3b and details of the decoding algorithm for each case are given in the Appendix.
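The symbol formation in Scheme 2 can be illustrated with a small packing routine. This is a sketch under assumptions: the function name `pack_scheme2_symbols` is hypothetical, and placing the first beat in the upper four bits is an arbitrary choice, since the text does not specify the bit ordering.

```python
CHIPS_PER_ACCESS = 36  # two ranks of 18 x4 chips, as described in the text


def pack_scheme2_symbols(beat0: list[int], beat1: list[int]) -> list[int]:
    """Combine two beats of 4-bit chip outputs into 36 eight-bit RS symbols."""
    if len(beat0) != CHIPS_PER_ACCESS or len(beat1) != CHIPS_PER_ACCESS:
        raise ValueError("each beat must carry one 4-bit value per chip")
    # Assumption: the first beat supplies the high nibble of each symbol.
    return [((hi & 0xF) << 4) | (lo & 0xF) for hi, lo in zip(beat0, beat1)]


symbols = pack_scheme2_symbols([0xA] * 36, [0x5] * 36)
print(len(symbols), hex(symbols[0]))
```

Because both nibbles of a symbol come from the same chip, a single chip failure corrupts exactly one symbol of the codeword.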
E-ECC Scheme 3
Although Scheme 2 improves the reliability compared to Chipkill-Correct and E-ECC Scheme 1, it still activates 36 devices and consumes a significant amount of power. This motivates us to find another scheme that activates fewer chips at the cost of some loss in reliability. We investigate codes in GF(2^4) and GF(2^8) with the constraint that 18 chips can be activated in each access. Under this constraint, in each beat, 72 bits (64 data bits + 8 ECC bits) are accessed from 18 chips. Since there is no RS (18,16) code over GF(2^4), we move to GF(2^8). In GF(2^8), the only available code is RS (9,8), which has a minimum distance of 2 and so cannot even correct one single error. However, if we combine data from two beats (144 bits), there are two candidates: one is the rotational (144,128) code and the other is the RS (18,16) code over GF(2^8). The rotational code cannot correct one chip failure (there are 2 x4 symbol errors due to a chip failure) and hence it is not suitable. The RS (18,16) code has a minimum distance of 3 and can perform single error correction. It seems to be used in AMD Chipkill-Correct [26]. Although this code provides Chipkill-Correct capability, we choose a stronger code whose error correction capability is competitive with the other proposed schemes. Specifically, we choose the RS (36,32) code over GF(2^8); the corresponding scheme is referred to as Scheme 3.
In Scheme 3, four consecutive 4-bit symbols from the same bank contribute towards a codeword. Fig. 4a describes the corresponding data access pattern. As mentioned earlier, the RS (36,32) code has a minimum distance of 5 and supports the following cases: (i) single error correction, (ii) double error correction, (iii) double erasure correction, (iv) double erasure and single error correction and (v) double erasure and double error detection.
The default state of Scheme 3 is also single error and double error correction. When a chip fails, it leads to two errors in the E-ECC codeword and these two errors can be corrected on the fly. When MC marks a chip as faulty, the decoding circuitry for double erasure correction (case (iii)) is activated. The double erasure correction methods for Schemes 2 and 3 are different. In Scheme 2, two chip failures lead to two erasure symbols in a codeword, which may not be adjacent. In Scheme 3, one chip failure leads to double erasures and these two erasure symbols are adjacent in a codeword.
Correcting two erasures is simpler than correcting two errors (see Appendix) and these two erasure addresses are consecutive in Scheme 3. Furthermore, if there is an additional error in another chip, it can be corrected as well (case (iv)). If a second chip fails, there are two erasures (from the chip marked faulty) and two errors (from the other faulty chip) which can be detected but not corrected (case (v)). Note that the system cannot use case (iv) and case (v) at the same time and a choice has to be made. For DRAM systems whose errors increase over time, the MC can activate the case (iv) decoder, and once the number of single errors is larger than a threshold, it can activate the circuitry for case (v). The decoding flowchart is shown in Fig. 4b and details of the decoding algorithm for each case are given in the Appendix.
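The reason case (iv) and case (v) cannot be combined can be seen from the standard bound for a code of minimum distance $d$ that corrects $t$ errors, detects $s \ge t$ errors and fills $f$ erasures. The bound is general coding theory and is used here only as a worked check; the symbols $t$, $s$ and $f$ are introduced for this illustration.

```latex
\[
  t + s + f + 1 \le d, \qquad d = 5 \ \text{for RS}(36,32)
\]
\begin{align*}
  \text{case (iv):}\quad & f = 2,\ t = 1,\ s = 1: & 1 + 1 + 2 + 1 &= 5 \le 5 \\
  \text{case (v):}\quad  & f = 2,\ t = 0,\ s = 2: & 0 + 2 + 2 + 1 &= 5 \le 5 \\
  \text{both:}\quad      & f = 2,\ t = 1,\ s = 2: & 1 + 2 + 2 + 1 &= 6 > 5
\end{align*}
```

With two erasures already consuming part of the distance, the decoder can either correct one additional error or detect two, but not guarantee both.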
x8 DRAM systems -E-ECC Scheme 4
In an x8 DRAM system, nine chips (8 data chips + 1 ECC chip) are accessed every time and since fewer chips are activated compared to x4 DRAM, the power consumption is lower. Hence, a lot of research effort has focused on designing Chipkill-Correct x8 DRAM systems [8] , [9] , [10] , [25] , [29] .
Deriving a Chipkill-Correct solution for x8 DRAM with low overhead is still quite a challenge. If only one rank is activated, in each beat, nine symbols are accessed from nine chips. A possible choice is the RS (9,8) code over GF(2^8). Unfortunately, this code can only detect a single symbol error and thus is not suitable. If two beats of data are combined, the RS (18,16) code can be used over GF(2^8) to perform single error correction. However, since one faulty chip leads to two error symbols per codeword, this code cannot correct them. To provide Chipkill-Correct protection, either an extra ECC chip (3-check symbol code) has to be used or the extra ECC symbols have to be stored in another rank as in [8].
Our solution is to operate the x8 DRAM system in lock-step mode as in Intel [13], HP [22] and Dell [27] systems. In the lock-step mode, two ranks are activated and in each beat, 144 bits (128 data bits + 16 ECC bits) are accessed from 18 chips across two ranks. If we use the rotational (144,128) code, when a chip fails, the faulty chip leads to 2 x4 error symbols in a codeword and since this code cannot correct two random errors, it is not Chipkill-Correct. In GF(2^8), the RS (18,16) code is a possible choice. It can correct errors due to a chip failure. When the MC marks a chip as faulty, it can correct a single erasure and detect one more random error in another chip. However, we choose a stronger code, namely, RS (36,32), since it has better error correction capability compared to RS (18,16) and the same 12.5 percent storage overhead. We refer to this scheme as E-ECC Scheme 4 and this scheme was also presented in [30].
In each beat, 2 × 9 = 18 symbols are read out from two ranks and a total of 2 × 18 = 36 symbols are accessed in two beats. Fig. 5 demonstrates the data access pattern of Scheme 4. As mentioned earlier, the RS (36,32) code supports: (i) single error correction, (ii) double error correction, (iii) double erasure correction, (iv) double erasure and single error correction and (v) double erasure and double error detection. Scheme 4 uses the same RS (36,32) code as Scheme 3, though their access patterns are different (Scheme 4 is designed for x8 DRAM systems while Scheme 3 is for x4 DRAM systems). The decoding flow of Scheme 4 is the same as Scheme 3 (shown in Fig. 4b). When none of the chips are marked as faulty, Scheme 4 performs cases (i) and (ii). When a chip is marked as faulty, both symbols from that chip are treated as erasures. The decoder checks whether it is a double erasure event. If so, it launches double erasure correction (case (iii)); otherwise, it activates double erasure and single error correction (case (iv)) or double erasure and double error detection (case (v)). Note that case (iv) and case (v) cannot be chosen at the same time.
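The code-selection argument for x8 systems comes down to comparing the number of codeword symbols a single chip contributes with the number of symbol errors the code can correct. The sketch below encodes that comparison; the function names and parameter layout are hypothetical, and the minimum distances are the ones quoted in the text.

```python
import math


def symbols_per_chip(io_width: int, beats: int, symbol_bits: int) -> int:
    """Number of codeword symbols that a single chip contributes."""
    return math.ceil(io_width * beats / symbol_bits)


def is_chipkill_correct(d_min: int, io_width: int, beats: int, symbol_bits: int) -> bool:
    """True if every symbol from one failed chip can be corrected as a random error."""
    correctable = (d_min - 1) // 2
    return symbols_per_chip(io_width, beats, symbol_bits) <= correctable


# (description, d_min, chip width, beats per codeword, bits per symbol)
candidates = [
    ("RS (18,16), one x8 rank, two beats", 3, 8, 2, 8),
    ("rotational (144,128), lock-step x8, one beat", 4, 8, 1, 4),
    ("RS (18,16), lock-step x8, one beat", 3, 8, 1, 8),
    ("RS (36,32), lock-step x8, two beats", 5, 8, 2, 8),
]
for name, d_min, width, beats, bits in candidates:
    print(name, is_chipkill_correct(d_min, width, beats, bits))
```

The first two candidates fail because one x8 chip contributes two symbols while the code corrects only one, which matches the reasoning given for rejecting them.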
x16 DRAM Systems -E-ECC Scheme 5
x16 DRAM activates fewer chips per rank compared to x4 and x8 DRAM systems and thus has even lower power. Since each rank has four chips for data, there has to be at least one additional chip per rank to provide Chipkill-Correct reliability. Thus, such a scheme results in a storage overhead of 25 percent. If only one rank is activated, five 16-bit symbols are read out in each beat. This can be configured into ten 8-bit symbols and the RS (10, 8) Fig. 4b ), though its access pattern is quite different. As before, when none of the chips are marked as faulty, Scheme 5 performs cases (i) and (ii). When a chip is marked as faulty, one 16-bit symbol failure from that chip is treated as two 8-bit erasures. The decoder checks whether it is a double erasure event. If so, Scheme 5 launches double erasure correction (case (iii)); otherwise, it activates double erasure and single error correction (case (iv)) or double erasure and double error detection (case (v)). Note that case (iv) and case (v) cannot be used at the same time.
SYNTHESIS RESULTS
We implemented the E-ECC decoders in Verilog hardware description language and synthesized them with a 28 nm industrial process. Recall that Scheme 1 is based on the rotational (144,128) code, Schemes 2, 3, 4 are based on the RS (36,32) code over GF(2^8), and Scheme 5 is based on the RS (20,16) code over GF(2^8). The latency, power consumption and area of each block in the three E-ECC codes are presented in Tables 1, 2 and 3, respectively. The two main hardware components are finite field multiplication and finite field inversion. We used a fully parallel implementation for finite field multiplication. We implemented finite field inversion using a look-up table of size 16 × 4 for GF(2^4) and 256 × 8 for GF(2^8). The syndrome calculation unit is activated in every read operation and so it is important that its latency be minimized. We implemented this unit using 144 GF(2^4), 144 GF(2^8) and 80 GF(2^8) multiplications for the rotational (144,128) code, RS (36,32) code and RS (20,16) code, respectively, followed by a tree of XOR gates.
Latency. The syndrome calculation unit of Scheme 1 (based on the (144,128) code) has a latency of 0.41 ns. If the syndrome vector is not zero and the error event is classified as single error, correcting the error takes an additional 0.41 ns. This low latency comes at the cost of additional area due to parallelization. If a chip is marked as faulty, single erasure correction takes 0.3 ns. In addition to the erasures, if there is one more random error, it takes another 1.1 ns. If two chips are marked as faulty, Scheme 1 takes 0.39 ns to do double erasure correction. If there is one more random error, Scheme 1 can detect this error but cannot correct it, and the corresponding timing delay is also 0.39 ns.
The syndrome calculation unit of Scheme 2 (based on RS (36, 32) Power. We list the dynamic power and static power consumption for each E-ECC unit in Table 2. The dynamic power is based on input switching probability of 50 percent. The syndrome calculation unit, which is activated in every 2 respectively. Scheme 2 has the largest number of decoding units and hence it has the largest area. Schemes 3 and 4 both have the same area since they activate the same set of decoding units. Scheme 5 has smaller area compared to Schemes 3 and 4 because it uses a smaller sized RS code. Overall the area of the E-ECC decoders is quite small. The largest E-ECC has an area of only 0.025 mm^2, which is fairly small compared to a typical die size (≈ 100 mm^2).
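To make the roles of the multiplier, the inversion table and the syndrome unit concrete, the following software sketch models them for a 36-symbol codeword over GF(2^8). It is not the synthesized Verilog design. The field polynomial 0x11D, the choice of syndrome roots alpha^0 to alpha^3, and all function names are assumptions made for this illustration; the paper's actual decoding algorithms are in its Appendix.

```python
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed; the text names no polynomial)


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements by shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


# Inversion by table lookup, mirroring the 256 x 8 look-up table in the text.
INV_TABLE = [0] * 256
for x in range(1, 256):
    for y in range(1, 256):
        if gf_mul(x, y) == 1:
            INV_TABLE[x] = y
            break

# Powers of the primitive element alpha = 2.
ALPHA_POW = [1] * 255
for i in range(1, 255):
    ALPHA_POW[i] = gf_mul(ALPHA_POW[i - 1], 2)


def syndromes(received: list[int], num_checks: int = 4) -> list[int]:
    """S_j = XOR_i received[i] * alpha^(i*j) for j = 0..num_checks-1.

    For 36 symbols and 4 syndromes this is 36 x 4 = 144 multiplications.
    """
    out = []
    for j in range(num_checks):
        s = 0
        for i, symbol in enumerate(received):
            s ^= gf_mul(symbol, ALPHA_POW[(i * j) % 255])
        out.append(s)
    return out


def correct_single_error(received: list[int]) -> list[int]:
    """Correct at most one symbol error; raise if the pattern is not a single error."""
    s = syndromes(received)
    if not any(s):
        return list(received)              # zero syndrome: nothing to correct
    if s[0] == 0 or s[1] == 0:
        raise ValueError("not a single-error pattern")
    locator = gf_mul(s[1], INV_TABLE[s[0]])  # equals alpha^position
    position = ALPHA_POW.index(locator)
    consistent = gf_mul(s[1], locator) == s[2] and gf_mul(s[2], locator) == s[3]
    if position >= len(received) or not consistent:
        raise ValueError("not a single-error pattern")
    fixed = list(received)
    fixed[position] ^= s[0]                # with roots starting at alpha^0, S_0 is the error value
    return fixed
```

As a usage example, the all-zero word is a valid codeword of any linear code, so corrupting one symbol of it and calling `correct_single_error` should return all zeros. If the position were already known (an erasure), the inversion and locator search would be unnecessary, which is consistent with the remark that erasure correction is simpler than error correction.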
EVALUATION
We evaluate the timing performance of the different ECC schemes by using an open source full-system simulator, gem5 [31]. We model a 4-way out-of-order processor with a two-level cache hierarchy. The L1 instruction and data caches are private to each core while the L2 caches are shared. We also model a detailed DDR3 DRAM with a data rate of 1,600 MT/s. Table 4 describes the configuration of our setup.
We use a detailed cycle-based DRAM simulator, DRAMSim2 [32] , to evaluate DRAM power consumption. We generate DRAM access traces from the aforementioned gem5 simulation setup. We then estimate the power consumption of the five different schemes by simulating at least 2 million memory accesses from these traces. We use the Micron 4 Gb data sheets [33] to obtain the input parameter values for DRAMSim2.
Workloads
We evaluate the timing and power performance impact of the E-ECC schemes using both sequential and multiprogrammed workloads. We use 11 DRAM-sensitive sequential applications from the SPEC2006 benchmark suite (Table 5) . We use Simpoints [34] to identify a single, 250-million instruction representative region for each sequential workload. In addition, we also evaluate the multi-core system performance. We create two 4-core multiprogrammed workload mixes of the SPEC2006 benchmarks to realistically model the multiprogrammed application execution scenarios. The workload mixes are also summarized in Table 5 . For our multiprogrammed simulations, in order to enable the start of different benchmarks in each workload at the same time, we fast-forward two billion instructions from the program start and simulate in detail until all the benchmarks have simulated for at least 250 million instructions. We collect the statistics after the slowest benchmark has completed simulating 250 million instructions.
Timing, Power and Energy Results
In this part, we analyze the timing, power and energy performance of the five E-ECC schemes and the three existing schemes (V-ECC, LOT-ECC and Multi-ECC). In order to obtain the timing performance of V-ECC, LOT-ECC and Multi-ECC, we also synthesized their corresponding ECC units in 28 nm technology. The delay of the syndrome calculation unit in V-ECC (using RS (19,16) code over GF($2^8$)) is 0.42 ns, LOT-ECC (using multi-level one's complement addition) is 0.52 ns, and Multi-ECC (using RS (9,8) code over GF($2^{16}$)) is 0.24 ns. Since the latencies of syndrome calculation of all schemes are within one memory cycle (1.25 ns), we add one additional cycle in the memory read access time in gem5 simulation.
We present the simulation results of all schemes in an error-free system. V-ECC activates two x8 ranks per access and has the same configuration as E-ECC Scheme 4. LOT-ECC and Multi-ECC both activate one x8 rank per access and so we have added simulation results for this configuration. V-ECC, LOT-ECC and Multi-ECC all use an ECC cache to store the tier-2 ECC symbols. We provide the timing performance for the best case scenario when the ECC cache has a hit rate of 100 percent. All performance values are normalized to those of the x4 Chipkill-Correct baseline scheme. Since Schemes 1 and 2 activate 36 x4 chips in two ranks as in the baseline, they have identical timing, power and energy performance and are not shown in the figures. Fig. 7 shows the IPC performance of all schemes for a subset of the sequential and multiprogrammed workloads. We measure the performance of the multiprogrammed workloads using the weighted speedup, which is given by $\sum_i \mathrm{IPC}_i^{\mathrm{shared}} / \mathrm{IPC}_i^{\mathrm{alone}}$ [35], [36]. We use the geometric mean to report the average values.
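The two summary metrics used here can be sketched in a few lines. The sketch below assumes the commonly used definition of weighted speedup (the sum, over co-running programs, of the IPC in the shared run divided by the IPC when the program runs alone); the function names and the IPC values are illustrative and do not come from the paper.

```python
import math

def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-program IPC ratios (shared run over stand-alone run).

    Assumes the commonly used definition of weighted speedup; both lists
    must be ordered by program.
    """
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one stand-alone IPC per co-running program")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def geometric_mean(values):
    """Geometric mean of positive values, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Illustrative (made-up) IPC numbers for a 4-core mix.
mix_shared = [0.8, 0.5, 1.2, 0.9]
mix_alone = [1.0, 1.0, 1.5, 1.0]
print(weighted_speedup(mix_shared, mix_alone))

# Averaging normalized results across workloads with the geometric mean.
print(geometric_mean([1.05, 1.10, 1.02]))
```

A weighted speedup equal to the number of cores would mean that no program slows down when sharing the memory system.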
Timing Performance Comparison
Overall, the IPC performance increases as the data width increases from x4 to x8 to x16. The performance improvement can be attributed to higher rank-level parallelism. For example, Schemes 1 and 2 operate on one logical rank whereas Schemes 3 and 4 operate on two logical ranks and thus have better IPC performance. Fig. 7 shows that Scheme 5, which operates on four logical ranks, has the best IPC performance among all the schemes. The improvement is 7 percent on average, with a maximum of 21 percent in multiprogrammed workloads and a maximum of 18 percent in sequential workloads.
Of the existing schemes, since V-ECC has the same configuration as E-ECC Scheme 4 and V-ECC uses an ECC cache to reduce the number of write operations, we project that V-ECC has timing performance similar to E-ECC Scheme 4. LOT-ECC and Multi-ECC both activate one x8 rank per memory access. While the tier-2 ECC symbols of LOT-ECC are located in the same row as the accessed data, they are located in another row in Multi-ECC. Since both schemes also use an ECC cache, their performance is likely to be the same. Overall, we can expect the order of timing performance improvement (high to low) to be Scheme 5, followed by LOT-ECC and Multi-ECC, followed by Scheme 4 and V-ECC, followed by Scheme 3, followed by Schemes 1 and 2 and Chipkill-Correct.
Power Performance Comparison
We compare the DRAM power consumption of the candidate ECC schemes; the memory configurations are provided in detail in Table 6. We generate the power consumption for each scheme by running the simulations in DRAMSim2. Fig. 8 shows the power consumption of the sequential and multiprogrammed SPEC workloads normalized to baseline. The power consumption is due to the off-chip DRAM since the power consumption due to ECC decoding in logic die is negligible (20 mW compared to 3 W for DRAM). Scheme 4 obtains the best power reduction among the five schemes; compared to baseline, it achieves a power reduction of 18 percent on average, with a maximum of 23.5 percent in multiprogrammed workloads and a maximum of 21.3 percent in sequential workloads. Scheme 5 does not perform as well due to its higher storage overhead (25 percent compared to 12.5 percent) and its 2 kB row buffer compared to the 1 kB row buffer used in x4 and x8 DRAM systems. Even though x16 DRAM activates fewer DRAM chips, the larger row buffer size adversely affects its power performance. Of the existing schemes, V-ECC has power performance comparable to our E-ECC Scheme 4. LOT-ECC and Multi-ECC have the lowest power consumption since they only activate one x8 rank per memory access. Overall, we can expect the order of power efficiency (high to low) to be LOT-ECC and Multi-ECC, followed by Scheme 4, followed by Scheme 3, followed by Scheme 5, followed by Schemes 1 and 2 and Chipkill-Correct.
Energy Performance Comparison
We next present our energy efficiency comparison for the different ECC schemes. We derive the energy number by multiplying cycles per instruction (CPI) and power for each benchmark with 250 million instructions and normalizing it to the energy number of the baseline. Overall, Scheme 4 outperforms all other proposed schemes in energy efficiency. Fig. 9 shows that Scheme 3, Scheme 4 and Scheme 5 improve the energy efficiency by a mean of 17.4, 25.4 and 22 percent, respectively. Scheme 5 outperforms Schemes 3 and 4 only when the system is heavily used. However, if the system is underutilized, Scheme 5 performs worse compared to other schemes. We conclude that even though Scheme 5 has the best timing performance, it does not have the best power performance and thus not the best energy performance.
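The energy metric described above can be illustrated with a short sketch. Because every benchmark runs the same number of instructions at the same clock frequency, those factors cancel in the normalization and only the CPI and power ratios remain. The function name and the numeric inputs are hypothetical and are not taken from Fig. 9.

```python
def normalized_energy(cpi, power, cpi_base, power_base):
    """Energy of a scheme relative to the baseline.

    Energy is proportional to CPI * power when the instruction count and
    the clock frequency are identical for both runs, so those constants
    cancel in the ratio.
    """
    return (cpi * power) / (cpi_base * power_base)

# Hypothetical scheme: 5 percent lower CPI and 18 percent lower power
# than the baseline (values chosen only for illustration).
ratio = normalized_energy(cpi=0.95, power=0.82, cpi_base=1.0, power_base=1.0)
print(f"normalized energy: {ratio:.3f}")
```

A ratio below 1 means the scheme spends less energy than the baseline for the same amount of work.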
Of the existing schemes, V-ECC has energy efficiency comparable to our E-ECC Scheme 4. LOT-ECC and Multi-ECC have the best energy efficiency; they outperform E-ECC Scheme 4 by around 15 percent. Although LOT-ECC and Multi-ECC have attractive energy performance, they are not as reliable as Chipkill-Correct schemes.
Reliability
In this section, we analyze the reliability of the competing schemes including Chipkill-Correct, V-ECC [8] , LOT-ECC [9] , Multi-ECC [10] and our E-ECC Schemes. Table 7 summarizes the error detection and correction capability of all schemes. For each error event, we provide the rates for detectable and correctable errors (DCE), detectable but uncorrectable errors (DUE) and silent data corruption (SDC). These rates are calculated by performing 10 million Monte Carlo simulations.
First of all, all schemes can correct a single bit error with 100 percent probability. If there are multiple bit errors in a row, LOT-ECC uses a 7-bit checksum to perform local error detection and can only detect a whole row being stuck-at-0 or stuck-at-1. However, if there are multiple random bit errors in a row, LOT-ECC cannot fully detect them. The DCE rate of this scheme is 87.5 percent and the SDC rate is 12.5 percent as given in [9]. The rest of the schemes can correct multiple random bit errors in a single row.
Next we consider multiple bit errors in a single column. Multi-ECC uses one's complement to generate the column checksum and so if there is an even number of random errors or a combination of stuck-at-0 and stuck-at-1 failures in a single column, Multi-ECC can detect the errors using row parity bits but cannot use the column checksums to locate them correctly. If every bit in a single column is flipped with 50 percent probability, then for 10 million runs, Multi-ECC has DCE of 75.01 percent and DUE of 24.99 percent. In contrast, the rest of the schemes can deal with any combination of random errors or permanent errors within a single column. LOT-ECC and Multi-ECC have performance comparable to ECC x8 one rank and V-ECC has performance comparable to E-ECC x8 Scheme 4.
TABLE 7 Error Detection and Correction Comparison
Columns: Chipkill-Correct; V-ECC [8]; LOT-ECC [9]; Multi-ECC [10]; E-ECC schemes
When a chip fails, it results in multiple column failures or row failures. LOT-ECC can detect a chip failure with DCE less than 87.5 percent, which is the same as its row failure event. Similarly, Multi-ECC has a chip failure probability which is the same as its column failure probability. In contrast, all five E-ECC schemes can correct errors due to a chip failure. When one chip fails and a single bit failure occurs in another chip, Chipkill-Correct and V-ECC can fully detect this event. LOT-ECC can detect this event with less than 87.5 percent probability while Multi-ECC can detect it with $1 - 2^{-16} \approx 99.9985\%$ probability. Scheme 1 can detect this event with 100 percent probability and Scheme 2 can correct it with 100 percent probability. However, Schemes 3, 4 and 5 cannot handle this error event.
If the MC marks a chip as faulty and there is one more random error in another chip, all 5 E-ECC Schemes can correct it. V-ECC can also be designed to correct one erasure and one random error. LOT-ECC, which uses XOR operation to recover from a faulty chip, cannot handle this error event. Multi-ECC can correct the extra error if a spare chip is used to replace the faulty chip. If the MC marks a chip as faulty and there is one more chip failure, Schemes 1 and 2 can fully correct the errors.
Of all the schemes, E-ECC Scheme 2 has the highest error protection capability since it can correct double chip failures on the fly. In addition, when two chips are marked as faulty, it can correct errors due to a third chip failure.
Overhead Comparison
All proposed E-ECC Schemes (except for Scheme 5) have 12.5 percent storage overhead while V-ECC, LOT-ECC and Multi-ECC have storage overheads more than 12.5 percent (18.5, 26.5 and 12.9 percent). Although Scheme 5 has 25 percent storage overhead, it only utilizes one additional chip per four chips in x16 DRAM. Since direct implementations of V-ECC, LOT-ECC and Multi-ECC incur extra reads or writes to access or update tier-2 ECC symbols, all three schemes use ECC cache to store the tier-2 ECC symbols and avoid performance degradation. To support use of ECC cache, OS needs to be modified and extra hardware units (ECC address mapping) have to be added as indicated in [8] , [9] , [10] .
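The storage overhead figures of the E-ECC schemes follow directly from the code parameters. As a minimal sketch, assuming overhead is measured as parity symbols divided by data symbols, the helper below (a hypothetical name) reproduces the 12.5 and 25 percent values for the codes named in this paper.

```python
from fractions import Fraction

def storage_overhead(n, k):
    """Redundancy relative to data for an (n, k) code: (n - k) / k."""
    return Fraction(n - k, k)

# Code parameters as named in the paper; the overhead definition is an assumption.
codes = {
    "rotational (144,128)": (144, 128),
    "RS (36,32)": (36, 32),
    "RS (20,16)": (20, 16),
}
for name, (n, k) in codes.items():
    print(f"{name}: {float(storage_overhead(n, k)) * 100:.1f} percent")
```

For the RS (20,16) code this is four parity symbols per sixteen data symbols, which matches one additional chip per four data chips in an x16 system.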
The E-ECC schemes rely on MCA to record the corrected errors in each chip. MCA is already used in several servers [2] , [14] , [15] and does not add to the overhead. When errors are caused by permanent faults, V-ECC, LOT-ECC and Multi-ECC also rely on MCA logs to decide when to use spare rows, page retirement and chip replacement. However, use of these methods results in significant overhead to reroute, remap or retire the data in DRAM. Overall, the proposed E-ECC schemes use more logic die area but have the lowest storage and infrastructure overhead compared to the existing schemes. A comparison of the overhead of the competing schemes is given in Table 8 .
CONCLUSION
In this paper, we present five erasure and error correction (E-ECC) schemes that provide superior error protection for x4, x8 and x16 DRAM systems. All our schemes use strong symbol based codes and provide higher error protection compared to existing systems. Furthermore, when a chip is marked faulty, our schemes make use of erasure correction to increase the lifetime of the memory system with no additional cost. Synthesis results show that the decoding latency of these codes is very small and the additional latency does not affect the timing performance of the memory system. Also, our schemes require no extra memory accesses to perform error correction and more importantly, do not require any change in the memory architecture.
All the proposed schemes can correct errors due to a chip failure, and when the chip is marked faulty, they can correct one more random error. Of all these schemes, Scheme 2, which is designed for x4 systems and uses the RS (36, 32) code, has the highest reliability. Simulations on SPEC 2006 benchmarks show that Schemes 3, 4 and 5 have better timing, power and energy performance compared to x4 Chipkill-Correct solutions. Of these schemes, Scheme 4, which is designed for x8 systems and uses the RS (36, 32) code, has the lowest power consumption and highest energy performance, while Scheme 5, which is designed for x16 systems and uses the RS (20, 16) code, has the best timing performance. Overall, our proposed schemes provide a low cost solution to increasing the lifetime of commodity DRAM memory systems with lower power and performance overhead.
APPENDIX
Decoding algorithm of the E-ECC schemes based on the (144,128) rotational code: This is a (36,32) code over GF($2^4$) that has a minimum distance of 4 and supports the following cases: (i) single error correction and double error detection, (ii) single erasure correction, (iii) single erasure and single error correction, (iv) double erasure correction and (v) double erasure and single error detection.
The first step is syndrome calculation, where the syndrome vector $S = (s_0, s_1, s_2, s_3)^T$ is calculated by multiplying the parity check matrix H with the codeword. The parity check matrix of the rotational (144,128) code is of size $4 \times 36$, where each entry is a 4-bit symbol, and is given in [6]. Case (i) is the traditional single symbol correction and double symbol detection. The decoder compares the syndrome vector with all columns of the parity check matrix to determine the error.
Cases (ii) to (v) involve correction of erasure symbols. Note that since the location of the faulty chip is known, the corresponding symbols are marked as erasure symbols. Each erasure symbol is replaced with the zero symbol in the received codeword and then the syndrome vector is generated. For case (ii), the syndrome vector is compared with the column in the parity check matrix corresponding to the erasure address. If the erasure address is $i$, the decoder checks if $S = e_i h_i$ for column $i$ in H. If it holds, the decoder recovers the erasure value $e_i$ in address $i$ of the codeword. Otherwise, the single erasure and single error correction (case (iii)) unit is activated.
For case (iii), assume that the erasure address is $i$ and the erasure value is $e_i$; similarly, the error address is $j$ and the error value is $e_j$, where $i \neq j$. The decoder needs to check whether the syndrome vector, S, is a linear combination of $h_i$ and $h_j$, where $j$ is from 0 to 35 and $i \neq j$. The hardware consists of 36 sub-decoders, where the $j$th decoder has $h_j$ embedded in it. The MC feeds column $h_i$ to all but the $i$th sub-decoder. If the $i$th and $j$th columns of the parity check matrix are $h_i = (h_{i0}, h_{i1}, h_{i2}, h_{i3})^T$ and $h_j = (h_{j0}, h_{j1}, h_{j2}, h_{j3})^T$, the first two syndrome equations, $s_0 = e_i \cdot h_{i0} + e_j \cdot h_{j0}$ and $s_1 = e_i \cdot h_{i1} + e_j \cdot h_{j1}$, are used to obtain $e_i$ and $e_j$. The decoded $e_i$ and $e_j$ values are substituted back to calculate $\hat{s}_2 = e_i \cdot h_{i2} + e_j \cdot h_{j2}$ and $\hat{s}_3 = e_i \cdot h_{i3} + e_j \cdot h_{j3}$. If $\hat{s}_2 = s_2$ and $\hat{s}_3 = s_3$ both hold, the decoder declares that the error is located in location $j$ of the received codeword and corrects it.
For case (iv), the MC sends two erasure addresses (assume $i$ and $j$) to the decoder and the decoder extracts the two columns, $h_i$ and $h_j$, corresponding to the two erasure addresses. It uses these two columns to check whether the syndrome is a linear combination of $h_i$ and $h_j$. If the condition holds, the decoder recovers the two erased symbols. Otherwise, it declares that there are two erasures and one error, which is case (v).
Decoding algorithm of the E-ECC schemes based on the RS (36,32) code over GF($2^8$): The $i$th column of the parity check matrix H is $h_i = (1, \alpha^i, \alpha^{2i}, \alpha^{3i})^T$ for $i = 0, 1, \ldots, 35$, where $\alpha$ is a primitive element of GF($2^8$). For case (i) and case (ii), the decoder implements single error and double error correction. The decoder can classify the two cases by using the syndrome vector [28]. Let the syndrome vector be $S = (s_0, s_1, s_2, s_3)^T$. The condition
corresponds to a single error event; otherwise, it is a double error event. We implement the double error correction based on the method used in [37]. To find the roots of the error locator polynomial, we use a deterministic method rather than the Chien search method. The error locator polynomial can be simplified to $y^2 + y + K$. A deterministic way to solve for $y$ by using linearized polynomials is shown in [37].
Case (iii) is trivial and easier than case (i). Suppose $h_i$ is the $i$th column of H and the erasure position is at $i$. The decoder checks whether $S = e_i \cdot h_i$ holds. If it holds, the decoder declares a single erasure event. If it does not hold, it activates case (iv). Suppose the erasure address is $i$ (known) and the error address is $j$ (unknown). The decoder checks whether the syndrome vector, S, is a linear combination of $h_i$ and $h_j$, with $e_j = (s_0 \alpha^i + s_1)/(\alpha^i + \alpha^j)$ and $e_i = s_0 + e_j$. The decoder first finds the error address $j$, then derives $e_i$ and $e_j$.
In E-ECC Scheme 2, if two chips are marked faulty, then there are two erasures (case (v)). Assume the positions of the two erasure symbols are $i$ and $j$, where $i$ and $j$ are from 0 to 35 and $i \neq j$. The relation between the syndrome vector and the columns corresponding to the erasure positions in H is as follows:
$S = e_i \cdot h_i + e_j \cdot h_j$, that is, $s_0 = e_i + e_j$, $s_1 = e_i \cdot \alpha^{i} + e_j \cdot \alpha^{j}$, $s_2 = e_i \cdot \alpha^{2i} + e_j \cdot \alpha^{2j}$ and $s_3 = e_i \cdot \alpha^{3i} + e_j \cdot \alpha^{3j}$,
where $e_i$ and $e_j$ are the erasure values for positions $i$ and $j$. The value $e_j$ is calculated by $e_j = (s_0 \alpha^i + s_1)/(\alpha^i + \alpha^j)$ and $e_i$ is obtained by $e_i = s_0 + e_j$. The decoded $e_i$ and $e_j$ values are used to calculate $\hat{s}_2$ and $\hat{s}_3$, namely, $\hat{s}_2 = e_i \cdot \alpha^{2i} + e_j \cdot \alpha^{2j}$ and $\hat{s}_3 = e_i \cdot \alpha^{3i} + e_j \cdot \alpha^{3j}$. If $\hat{s}_2 = s_2$ and $\hat{s}_3 = s_3$ hold, the decoder declares a double erasure event and corrects these two erased symbols (case (v)). Otherwise, the decoder activates the double erasure and single error correction unit corresponding to case (vi).
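The erasure cases for the RS (36,32) code can be sketched in software as follows. This is only an illustration under stated assumptions: the field GF(2^8) is built with the primitive polynomial 0x11D (the paper does not state which polynomial is used), the parity check columns are taken to be (1, α^i, α^{2i}, α^{3i})^T as implied by the formulas for e_j and the recomputed syndromes, and all function names are hypothetical. The closed-form search for the error address in `decode_erasure_plus_error` is one possible derivation and is not claimed to be the circuit used in the paper.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial for GF(2^8)
N_SYMBOLS = 36     # codeword length of the RS (36,32) code

# Exponential and logarithm tables for alpha = 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _n in range(255):
    EXP[_n] = _x
    LOG[_x] = _n
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY
for _n in range(255, 510):
    EXP[_n] = EXP[_n - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

def syndrome(pattern):
    """s_k = sum_i pattern[i] * alpha^(k*i) for k = 0..3 (addition is XOR)."""
    s = [0, 0, 0, 0]
    for i, sym in enumerate(pattern):
        for k in range(4):
            s[k] ^= gf_mul(sym, EXP[(k * i) % 255])
    return s

def correct_double_erasure(s, i, j):
    """Case (v): return (e_i, e_j), or None if s_2 and s_3 do not match."""
    ai, aj = EXP[i], EXP[j]
    e_j = gf_div(gf_mul(s[0], ai) ^ s[1], ai ^ aj)
    e_i = s[0] ^ e_j
    s2_hat = gf_mul(e_i, EXP[2 * i]) ^ gf_mul(e_j, EXP[2 * j])
    s3_hat = gf_mul(e_i, EXP[3 * i]) ^ gf_mul(e_j, EXP[3 * j])
    return (e_i, e_j) if (s2_hat, s3_hat) == (s[2], s[3]) else None

def decode_erasure_plus_error(s, i):
    """Cases (iii)/(iv): return (e_i, j, e_j); j is None for a lone erasure.

    Uses t_k = s_(k+1) + alpha^i * s_k = e_j * alpha^(k*j) * (alpha^i + alpha^j),
    so alpha^j = t_1 / t_0. Returns None if the syndrome is inconsistent.
    """
    ai = EXP[i]
    t = [s[k + 1] ^ gf_mul(ai, s[k]) for k in range(3)]
    if t == [0, 0, 0]:
        return s[0], None, 0  # case (iii): S = e_i * h_i
    if 0 in t or gf_mul(t[1], t[1]) != gf_mul(t[0], t[2]):
        return None
    j = LOG[gf_div(t[1], t[0])]
    if j >= N_SYMBOLS or j == i:
        return None
    e_j = gf_div(t[0], ai ^ EXP[j])
    return s[0] ^ e_j, j, e_j

# By linearity, the syndrome depends only on the erased/erroneous symbol
# values, so the demo feeds that pattern directly instead of a full codeword.
pattern = [0] * N_SYMBOLS
pattern[3], pattern[17] = 0x53, 0xCA
print(correct_double_erasure(syndrome(pattern), 3, 17))
print(decode_erasure_plus_error(syndrome(pattern), 3))
```

In the demo, the first call treats positions 3 and 17 as two known erasures, while the second call knows only the erasure at position 3 and has to locate the error at position 17 from the syndrome.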
In case (vi), let the address of the error symbol be $k$, the error value be $e_k$ and the addresses of the erasure symbols be $i$ and $j$, where $i \neq j$, $j \neq k$ and $i \neq k$. The relation between the syndrome vector and the double erasure and one error is given as follows: $S = e_i \cdot h_i + e_j \cdot h_j + e_k \cdot h_k$.
In E-ECC Schemes 3 and 4, if a single chip is marked faulty, there are two erasures but those erasures are in consecutive locations. So if the position of the first erasure is $2i$, the position of the second erasure is $2i + 1$, where $i = 0, 1, \ldots, 17$. The procedures for case (v) and case (vi) in E-ECC Schemes 3 and 4 are very similar to those in E-ECC Scheme 2 and are not described here.
The RS (20, 16) code over GF($2^8$) has a smaller parity check matrix but it has the same decoding algorithm as the RS (36, 32) code. Hence, we omit the description of its decoding units here.
Hsing-Min Chen received the BS and MS degrees in computer science from National Chiao Tung University, Taiwan. Currently, he is working toward the PhD degree in electrical engineering at Arizona State University. He is supervised by Dr. Chakrabarti. His research interests include error correction codes, reliable memory design, memory system performance evaluation, and front-end RTL design.
Supreet Jeloka received the BTech degree from NIT, Warangal, India and the MS degree in electrical engineering in 2013, from the University of Michigan, Ann Arbor, where he is currently working toward the PhD degree. His current research interests include low-power circuits, memory design, memory-based computing, interconnect fabrics, and hardware security.
Akhil Arunkumar received the MS degree in electrical engineering from the University of North Carolina at Charlotte in 2012. He is currently working toward the PhD degree in computer science at Arizona State University. His main research interests include memory hierarchy design and computer architecture. He is a student member of the IEEE and ACM.
David Blaauw received the BS degree in physics and computer science from Duke University in 1986, and the PhD degree in computer science from the University of Illinois, Urbana, in 1991. After his studies, he was with Motorola, Inc. in Austin, TX, where he was the manager of the High-Performance Design Technology group. Since August 2001, he has been a professor at the University of Michigan. He has published more than 450 papers and holds 40 patents. His work has focussed on VLSI design with particular emphasis on ultra-low power and high-performance design. He was the technical program chair and general chair for the International Symposium on Low Power Electronics and Design. He was also the technical program cochair of the ACM/IEEE Design Automation Conference and a member of the ISSCC Technical Program Committee. He is a fellow of the IEEE.
Carole-Jean Wu received the BS degree in electrical and computer engineering from Cornell University in 2006, and the MA and PhD degrees in electrical engineering from Princeton University, in 2008 and 2012, respectively. She is currently an assistant professor with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University. Her research interests are in the areas of processor architectures and memory hierarchy designs to achieve high performance and improved energy efficiency.
Trevor Mudge received the PhD degree in computer science from the University of Illinois, Urbana. He is now the Bredt family professor of computer science for pioneering contributions to low-power computer architecture and received the University of Illinois Distinguished Alumni Award. He is a life fellow of the IEEE, a member of the ACM, the IET, and the British Computer Society.
Chaitali Chakrabarti received the BTech degree in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1984, and the PhD degree in electrical engineering from the University of Maryland, College Park, in 1990. She is a professor with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe. Her research interests include VLSI algorithm-architecture co-design of signal processing and communication systems and all aspects of low-power embedded systems design. She is a fellow of the IEEE.
