64-bit ARM based server with a Linux OS. To experiment under different DRAM temperatures, we also implemented a thermal testbed that allows fine-tuning the temperature of each DIMM on the server.
• We characterize the power-reliability trade-off of 72 server-grade DRAM chips under a scaled refresh period using conventional data-pattern micro-benchmarks (DPBenches) as well as single- and multi-threaded HPC workloads. Our study reveals that the DRAM error behavior varies among the workloads and within the execution of these workloads.
• We demonstrate that the total memory power can be reduced by 11.5 % on average in such a server with a complete software stack by relaxing the refresh period by 35×. This was possible without compromising system availability, since Single-Error-Correction Double-Error-Detection Error Correction Code (SECDED ECC) was adequate for correcting all manifested errors up to 60 °C, avoiding any system crashes.
The rest of the paper is organized as follows. Section 2 describes the background and the open challenges, while Section 3 presents our experimental framework. Section 4 analyses the results of our experimental campaign. Finally, conclusions are drawn in Section 5.
DRAM BACKGROUND AND CHALLENGES
A main memory sub-system based on DRAMs is organized hierarchically into channels supporting a number of DRAM modules. Each Dual In-line Memory Module (DIMM) usually has two ranks, each of which consists of DRAM chips. Within each chip, DRAM cells are organized into banks, which are two-dimensional arrays addressed by rows and columns, as shown in Figure 1. The main drawback of the DRAM technology is the limited retention time [11] of the cell's charge. To avoid any error induced by charge leakage over time, DRAM employs an Auto-Refresh mechanism that periodically recharges all cells in the array based on the worst retention time across them. Conventionally, all DDR technologies today adopt a refresh period, T_REFP, of 64 ms for periodically refreshing each cell of the DIMM, even though in reality many cells may have a much higher retention time than T_REFP and the conditions in the field after deployment may not be as bad as the ones assumed [7].
Existing Studies
Several experimental studies have revealed the large spatial variation of cells in terms of their retention time and tried to exploit it for relaxing T_REFP for most of the cells [10, 12, 16]. Typically, such approaches rely on characterizing the retention time of cells on custom FPGA-based setups using i) a set of worst-case DPBenches and ii) elevated temperatures, which are the main factors that excite worst-case circuit-level effects on DRAMs [8, 10, 11, 17].
DPBench. Typically, a set of worst-case static (all 1s, all 0s, checkerboard) and dynamic (random, uniformly distributed data) data patterns (DPs) is used during characterization, since the retention time of each cell depends not only on the data stored in the cell itself but also on the data in the neighbouring cells. In fact, recent studies have shown that various circuit-level crosstalk effects are excited by the stored data and may cause the retention time of each cell to vary [11]. Such findings render ineffective any existing approach that relies on offline characterization, since the number and position of the identified weak cells cannot be guaranteed after DRAM deployment.
Temperature. Many studies have shown that the retention time of DRAM cells decreases exponentially as the DRAM temperature T rises, as described by $A e^{-0.055 \times T} + C$ [8, 11, 17]. In all existing studies, the DRAM temperature is controlled by placing the whole FPGA system in a thermal chamber [8, 10, 11, 17] or, more recently, by using heating elements on DIMMs [7].
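To give a feel for the exponential temperature dependence, the following minimal sketch evaluates the model, assuming the coefficient in the expression above reads 0.055 per °C with a negative sign. The fit constants A and C are not given in the text, so they are left as parameters, and the function names are illustrative only.

```python
import math

# Coefficient of the exponential term, per degree Celsius. Assumption: the
# expression in the text reads A * exp(-0.055 * T) + C.
K = 0.055


def retention_model(temp_c, a, c):
    """Retention time predicted by A * exp(-K * T) + C (arbitrary units)."""
    return a * math.exp(-K * temp_c) + c


def exponential_ratio(t_low_c, t_high_c):
    """Ratio of the exponential terms at two temperatures (A cancels out)."""
    return math.exp(-K * (t_high_c - t_low_c))


# Relative size of the exponential term at 60 C versus 50 C.
ratio = exponential_ratio(50.0, 60.0)
print(f"exp term at 60 C / exp term at 50 C = {ratio:.3f}")
```

Under this assumption, a 10 °C rise shrinks the temperature-dependent part of the retention time to roughly 0.58 of its value, independently of A.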
To address such effects, some recent studies suggested the use of error correction codes [10, 16] that strengthen the conventional SECDED ECC [6]. Alternatively, other works have tried to mask errors by exploiting the inherent error resilience of some applications or the implicit refresh incurred by every memory access; however, they did so mainly within simulators [1, 13] rather than on a real system, basing their fault injection schemes on the retention times discovered by DPBenches.
Challenges and Objectives
Prior studies have revealed various circuit-level data- and temperature-dependent effects and suggested error mitigation methods. However, they left a number of questions unanswered. First, existing studies were performed on custom FPGA setups or simulators, which may simplify characterization and evaluation but do not capture system-level effects, such as the reuse time in [1], which may be excited within a server and affect DRAM reliability.
Several parameters of the deep memory hierarchies on servers, like the organization, the size of the caches and the supported memory bandwidth, can directly affect the number and frequency of accesses to DRAM and thus its reliability. A thorough study of the DRAM error behavior requires the execution of relevant single- and multi-threaded workloads on servers. Furthermore, real workloads are expected to dynamically change the stored DPs and load different components of a system, and thus may cause various system-level effects that may not excite the worst-case scenarios targeted by conventional DPs.
As temperature has a significant effect on DRAM reliability [8, 11, 17], it is also essential to develop a special thermal testbed and investigate DRAM reliability when running real workloads under high temperatures, which may occur in worst-case scenarios.
Finally, there is a need to investigate if the available ECC on server-grade DRAMs or stronger ECCs [6] are required to ensure
EXPERIMENTAL FRAMEWORK
To address the aforementioned challenges, we develop an experimental framework on a server as described below.
Server Details
The basis of our experimental framework is a state-of-the-art commodity 64-bit ARMv8 based server, the X-Gene2 Server-on-a-Chip, which is the latest generation of the X-Gene family of chips used in the popular HP Moonshot servers [20]. As depicted in Figure 1, the X-Gene2 SoC consists of four Processor Modules (PMDs), each with two 64-bit ARMv8 cores running at 2.4 GHz. The implemented memory hierarchy is representative of modern high-performance systems: a 32 KB L1 data cache and a 32 KB L1 instruction cache per core, a 256 KB L2 cache shared between the two cores of each PMD, and an 8 MB L3 cache shared across all four PMDs through the cache-coherent Central Switch (CSW).
The X-Gene2 has two Memory Controller Bridges (MCBs), which are connected to the CSW and provide access to DRAM. In turn, each MCB is connected to two DDR3 Memory Controller Units (MCUs). Each MCU has one channel of DDR3 memory and supports up to two DIMMs with two ranks each. In our campaign, we experiment with four Micron DDR3 8 GB DIMMs at 1866 MHz [15], one DIMM per MCU. In total, we characterize 72 4Gb x8 DDR3 chips [14], since each DIMM includes 16 DRAM chips for data storage and 2 for ECC.
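The chip count and capacities follow directly from the configuration described above. The short sketch below redoes that arithmetic; the constant names are illustrative.

```python
# Configuration as described in the text.
DIMMS = 4                  # one DIMM per MCU
DATA_CHIPS_PER_DIMM = 16   # chips holding data
ECC_CHIPS_PER_DIMM = 2     # chips holding ECC bits
CHIP_DENSITY_GBIT = 4      # 4Gb x8 DDR3 devices

# Every chip (data and ECC) is part of the characterization.
total_chips = DIMMS * (DATA_CHIPS_PER_DIMM + ECC_CHIPS_PER_DIMM)

# Only the data chips contribute to the usable capacity (8 bits per byte).
data_gb_per_dimm = DATA_CHIPS_PER_DIMM * CHIP_DENSITY_GBIT // 8
total_data_gb = DIMMS * data_gb_per_dimm

print(total_chips, data_gb_per_dimm, total_data_gb)
```

This gives 72 chips in total, 8 GB of data capacity per DIMM and 32 GB for the whole server.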
The X-Gene2 provides access to a separate Scalable Light-weight Intelligent Management Processor (SLIMpro), a special management core, which is used to boot the system and provide access to on-board sensors for measuring the temperature and power of the SoC and DRAM. The SLIMpro also reports to the Linux kernel all memory errors corrected or detected by SECDED ECC, providing information about the DIMM, bank, rank, row and column where the error occurred. The available ECC can detect and correct single-bit errors in a 64-bit word, which we refer to as correctable errors (CEs), and detect two-bit errors that cannot be corrected, which we refer to as uncorrectable errors (UEs). Finally, SLIMpro allows configuring the parameters of the MCUs, such as timings and T_REFP, which can be raised from the nominal 64 ms to 2.283 s, the maximum T_REFP allowed in the X-Gene2 server. The server runs a fully-fledged OS based on CentOS 7 with the default Linux kernel 4.3.0 for ARMv8 and support for 64 KB pages.
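The relaxation factor implied by these two refresh periods can be checked with a few lines. The sketch below also estimates the refresh command rate, assuming the usual DDR3 convention of 8192 refresh commands per refresh period; that figure is not stated in the text and is an assumption here.

```python
NOMINAL_TREFP_S = 0.064   # nominal refresh period, 64 ms
RELAXED_TREFP_S = 2.283   # maximum refresh period allowed on the server

# Assumption (not from the text): a DDR3 device receives 8192 refresh
# commands per refresh period.
REFRESH_CMDS_PER_PERIOD = 8192


def refresh_commands_per_second(trefp_s):
    """Average number of refresh commands issued per second."""
    return REFRESH_CMDS_PER_PERIOD / trefp_s


# How many times longer the relaxed period is than the nominal one.
relaxation_factor = RELAXED_TREFP_S / NOMINAL_TREFP_S

print(f"relaxation factor: {relaxation_factor:.2f}")
print(f"nominal: {refresh_commands_per_second(NOMINAL_TREFP_S):.0f} cmds/s")
print(f"relaxed: {refresh_commands_per_second(RELAXED_TREFP_S):.0f} cmds/s")
```

The ratio comes out slightly above 35, which is consistent with the 35× relaxation reported in the paper.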
DRAM reliability analysis
We use the aforementioned error reporting mechanisms to record the CEs and UEs manifested within each 64-bit word under relaxed T_REFP.
In addition, to account for any potential errors of more than two bits in a 64-bit word, which cannot be detected by ECC, we compare the output of each execution with a golden reference output obtained when DRAM is operating under the nominal T_REFP. In this way, we are able to measure any Silent Data Corruption (SDC) that could go undetected by SECDED ECC.
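The error categories used here can be summarized in a small sketch. It models only the SECDED guarantees described in the text (one flipped bit is corrected, two are detected) and a golden-output comparison; the function names and the use of SHA-256 for the comparison are illustrative assumptions, not the authors' tooling.

```python
import hashlib

MASK64 = (1 << 64) - 1  # ECC protects 64-bit data words


def flipped_bits(expected, observed):
    """Number of bit positions that differ within a 64-bit word."""
    return bin((expected ^ observed) & MASK64).count("1")


def classify_word(expected, observed):
    """Map the number of flipped bits to the SECDED outcome."""
    n = flipped_bits(expected, observed)
    if n == 0:
        return "OK"
    if n == 1:
        return "CE"   # correctable error
    if n == 2:
        return "UE"   # detected but uncorrectable error
    return "BEYOND_SECDED"  # no guarantee: may go undetected


def is_silent_corruption(output, golden, reported_ues):
    """Output differs from the golden run although no UE was reported."""
    differs = hashlib.sha256(output).digest() != hashlib.sha256(golden).digest()
    return differs and reported_ues == 0


print(classify_word(0b1010, 0b1000))                   # one bit flipped
print(is_silent_corruption(b"result", b"result", 0))   # identical outputs
```

Only the last category requires the golden-output comparison, since the first three are either harmless or reported by the hardware.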
At the end of the experimental campaign, we analyse the results and quantify DRAM reliability using a set of metrics.
We calculate the 64-bit word error rate, WER, for the amount of memory used by an application as:
$$WER = \frac{\#ofUniqError}{SizeOfMemoryUsedInWords} \qquad (1)$$
WER shows the probability of a 64-bit word being erroneous, independently of the memory size allocated by the application.
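As a concrete illustration of Equation (1), the sketch below computes WER from a list of erroneous word addresses. The numerator is taken to be the number of unique erroneous 64-bit words, which is an assumption based on how unique error locations are counted elsewhere in the paper; the function name and the sample addresses are hypothetical.

```python
WORD_BYTES = 8  # one 64-bit word


def word_error_rate(error_word_addrs, allocated_bytes):
    """Unique erroneous 64-bit words divided by the allocated words."""
    words = allocated_bytes // WORD_BYTES
    if words == 0:
        raise ValueError("allocation smaller than one 64-bit word")
    # Repeated reports for the same word are counted once.
    return len(set(error_word_addrs)) / words


# Hypothetical example: three distinct erroneous words in an 8 GiB allocation.
wer = word_error_rate([0x0, 0x8, 0x8, 0x10], 8 * 2**30)
print(f"WER = {wer:.3e}")
```

Normalizing by the allocation size is what makes runs with 8 GB and 28 GB allocations comparable.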
To compare the number of error-prone locations discovered by the DPBenches and the HPC workloads, we calculate the coverage, Co, of unique erroneous locations detected when running a specific workload over the total number of error-prone locations discovered by all benchmarks, as:
$$Co(j) = \frac{\#ofUniqError(j)}{\#ofUniqError(\text{all benchmarks})}$$
where #ofUniqError(j) is the number of unique erroneous locations, in terms of 64-bit words, discovered when running the specific benchmark j. Furthermore, we calculate the rate of change of Co in time as $\Delta Co(X, t) = Co(X, t) - Co(X, t - T_{step})$, where T_step = 10 minutes. This allows us to investigate how fast each benchmark discovers error-prone locations and when Co converges to a stable value, which can be used as an indicator that we have run an acceptable number of experiments.
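The coverage metric and its rate of change can be sketched as follows, assuming the denominator is the union of the unique erroneous locations found by all benchmarks. The function names and the sample data are hypothetical.

```python
def coverage(errors_by_benchmark, name):
    """Share of all known error-prone words that benchmark `name` found."""
    all_errors = set().union(*errors_by_benchmark.values())
    if not all_errors:
        return 0.0
    return len(set(errors_by_benchmark[name])) / len(all_errors)


def coverage_delta(coverage_series):
    """Change of coverage between consecutive samples taken T_step apart."""
    return [
        coverage_series[i] - coverage_series[i - 1]
        for i in range(1, len(coverage_series))
    ]


# Hypothetical word addresses reported while running two benchmarks.
errors = {"random": {1, 2, 3}, "all0": {3, 4}}
print(coverage(errors, "random"), coverage(errors, "all0"))

# Hypothetical coverage sampled every T_step; small deltas suggest convergence.
print(coverage_delta([0.25, 0.5, 0.75]))
```

Because the denominator is a union, the coverages of different benchmarks need not sum to one: a location found by several benchmarks counts for each of them.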
DRAM Thermal Testbed on a Server
To perform the experiments under controlled temperature, we implement a temperature-controlled testbed for DRAMs on a server. Our approach is based on heating elements, similar to [7]. Each adapter consists of a resistive element, thermally conductive tape that transfers the heat of the element uniformly to all the chips of a DIMM, and a thermocouple to measure the temperature. The temperature of each element is regulated by a controller board, which contains eight solid-state relays controlling the resistive elements of each DIMM and rank independently. During our experiments, the maximum deviation from the set temperature is less than ±1 °C.
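The text does not describe the control algorithm running on the controller board. Purely as an illustration of how a relay-driven heater can hold a set temperature, the sketch below shows a simple on/off decision with hysteresis; the function name and the 0.5 °C band are hypothetical.

```python
def relay_should_heat(temp_c, setpoint_c, heating_now, band_c=0.5):
    """On/off relay decision with hysteresis around the setpoint.

    The relay closes once the thermocouple reads below setpoint - band,
    opens once it reads above setpoint + band, and otherwise keeps its
    current state to avoid rapid switching.
    """
    if temp_c <= setpoint_c - band_c:
        return True
    if temp_c >= setpoint_c + band_c:
        return False
    return heating_now


# One decision per DIMM rank, each with its own thermocouple reading.
readings = [58.9, 60.2, 60.7]
states = [relay_should_heat(t, 60.0, heating_now=False) for t in readings]
print(states)
```

Keeping the hysteresis band narrower than the target tolerance is one way to stay within a deviation such as ±1 °C, although the achievable accuracy depends on the thermal inertia of the setup.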
EXPERIMENTAL EVALUATION
In this section, we present the results of our characterization campaign of 72 DRAM chips using the described framework under the maximum allowed T_REFP, i.e. 2.283 s, in the X-Gene2 server at 50 °C and 60 °C.
Considered Benchmarks
We choose to run the conventional DPBenches based on all 0s, all 1s, checkerboard and random DPs, as in most existing studies [9, 11]. We also select four HPC applications from the Rodinia Benchmark Suite, which is typically used for benchmarking parallel systems [3]. In particular, we use backprop, nw, srad and kmeans to cover a range of domains, i.e. Machine Learning, Bioinformatics, Image Processing and Data Mining, respectively. We run the HPC benchmarks with 1 and 8 threads to evaluate how parallel execution affects the DRAM error behavior. In our study, we are trying to understand whether DRAM operating under relaxed T_REFP can be characterized with DPBenches. To investigate this, we need to identify all cells that have a small retention time and thus are more prone to failures. This is achieved by executing DPBenches that cover all memory available to user space, i.e. 28 GB on our experimental setup, while the remaining 4 GB is used by Linux for the kernel, drivers and other system services. For HPC workloads, we limit the data size of the benchmarks to 8 GB, while each execution has a random allocation in memory, as it goes through the normal allocation policies of Linux. We intentionally do not use all memory available to user space, in order to capture a possible effect of these policies on DRAM reliability and to explore the ability of DPBenches to cover the error-prone locations discovered when running real workloads.
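The four data patterns can be illustrated with a minimal generator. This is a sketch only: the word-level checkerboard encoding, the seed handling and the function names are assumptions, and a Python list does not control where the data lands in physical DRAM. On the ECC-protected server, errors are observed through the ECC reports rather than through a read-back comparison like the one shown here.

```python
import random

MASK64 = (1 << 64) - 1


def make_pattern(name, n_words, seed=0):
    """Return n_words 64-bit values for one of the four data patterns."""
    if name == "all0":
        return [0] * n_words
    if name == "all1":
        return [MASK64] * n_words
    if name == "checkerboard":
        # Alternate 1010... and 0101... so that adjacent bits and adjacent
        # words hold opposite values.
        return [
            0xAAAAAAAAAAAAAAAA if i % 2 == 0 else 0x5555555555555555
            for i in range(n_words)
        ]
    if name == "random":
        rng = random.Random(seed)  # seeded so a round can be reproduced
        return [rng.getrandbits(64) for _ in range(n_words)]
    raise ValueError(f"unknown pattern: {name}")


def find_mismatches(written, read_back):
    """Indices of words whose read-back value differs from what was written."""
    return [i for i, (w, r) in enumerate(zip(written, read_back)) if w != r]


pattern = make_pattern("checkerboard", 4)
print([hex(w) for w in pattern])
print(find_mismatches([1, 2, 3], [1, 0, 3]))
```

A DPBench round then amounts to writing a pattern over the allocated memory, waiting long enough for weak cells to leak under the relaxed refresh period, and reading the memory back.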
Characterization with DPBenches
We start by running each of the considered DPBenches, allocating 8 GB and 28 GB of memory, for 120 minutes to ensure that each DPBench completes at least 16 rounds, as discussed in [11]. In all our experiments with DPBenches at 50 °C and 60 °C, we observe only CEs, which were corrected by the available SECDED ECC, and no UEs or SDCs.
Coverage.
To identify whether the different DPBenches excite different error locations for DRAM operating at a T_REFP of 2.283 s, we calculate the Co of error locations discovered by each DPBench, as seen in Figure 2a and 2b. The highest Co is observed for the random DPBench, which is above 70 % for both temperatures, while the static DPBenches discover less than 25 % of all reported error-prone locations. These results are consistent with the observations made by Liu et al. [11], who also reported that the highest Co is observed for the random DPBench. In Figure 2a and 2b, the size of each circle and the accompanying number indicate the rate of change of Co, ΔCo, over the last 10 minutes. We see that ΔCo does not exceed 2 % for any DPBench at the end of the execution, which implies that 120 minutes are sufficient for identifying the majority of error-prone locations. Note that for the calculation of Co above, we only consider the errors incurred by the DPBenches.
WER.
We calculate WER for the DPBenches, as can be seen in Figure 3. Similarly to our experiments with Co, we observe that the random DPBench has the highest WER compared to checkerboard, all 0s and all 1s, specifically $6.7 \times 10^{-7}$ and $1.2 \times 10^{-5}$ at 50 °C and 60 °C, respectively. Comparing the WER of DPBenches for the 8 GB and 28 GB memory allocations, we get similar WERs for both temperatures. This shows that WER does not change with the size of the allocated memory, which is also an indication of the small variation of manifested errors and weak cells across the considered DIMMs.
Characterization with HPC benchmarks
Next, we execute single- and multi-threaded versions of the memory-intensive HPC benchmarks under relaxed T_REFP for 2 hours, as in the case of the DPBenches. Similarly to the DPBench experiments, we observe only CEs, which were corrected by the available SECDED ECC, and no UEs or SDCs.
Coverage.
Figure 2c shows Co for the DPBenches and the HPC benchmarks for DRAM operating under relaxed T_REFP at 50 °C and 60 °C. In this figure, we note the number of threads used after the name of each HPC benchmark. We see that random exhibits the highest coverage, 48 % and 51 % (at 50 °C and 60 °C), and essentially excites more error-prone locations than most of the HPC benchmarks. However, it is lower than the Co for random discovered in the previous experiment, where we consider only the error-prone locations detected with the DPBenches (see Figure 2). We also observe that kmeans(8) has the highest Co among the HPC benchmarks at 50 °C, which is even slightly higher than the Co achieved by the random DPBench allocating 8 GB of memory. These results imply that real workloads induce errors in more locations than the static DPBenches and, in some cases, even more than the random benchmark. Finally, as no application demonstrates a high Co, we conclude that each benchmark can trigger errors in memory locations that are not detected when running other benchmarks.
WER.
The obtained WERs for HPC workloads are depicted in Figure 3. We observe that the HPC benchmarks incur a higher WER than the static DPBenches, but a lower WER than random and random(28GB) at 60 °C. Nonetheless, the WER incurred by kmeans(8) at 50 °C is higher than the WER obtained for the random benchmarks allocating 8 GB and 28 GB. It follows that real benchmarks may trigger errors in memory locations that are not covered by DPBenches, which implies that DRAM cannot be fully characterized with DPBenches. We also see that WER varies across benchmarks: for example, the WER incurred by nw is 2.5× higher than the WER obtained for kmeans(8). What is more, we found that the WER incurred by kmeans at 50 °C grows by 2.77× if we run it in parallel (8 threads). Thus, we may conclude that the DRAM error behavior is workload-dependent and may change significantly with the level of concurrency in programs.
DRAM error behavior variation within workloads
In our study, we also discovered that DRAM error behavior may vary significantly within an application run. For example, Figure 4a shows how the number of CEs induced by kmeans(8) changes in time for DRAM operating under a T_REFP of 2.283 s at 60 °C. We can identify two different phases in this benchmark: in the first phase the average number of CEs per second is less than 20, while in the second phase this number reaches almost 300 per second. By profiling and analysing this benchmark, we found that the first phase corresponds to the I/O phase of kmeans, where the input data for this benchmark is retrieved from a file. In the second phase, the benchmark processes the input data, reading and writing to DRAM intensively, which explains the high error rate obtained in this phase. Nonetheless, Figure 4c shows that the average number of CEs per second detected in this phase drops to 50 for the single-threaded version of this benchmark, while the duration of this phase increases by 7.8×. Note that we also observe variations of the DRAM error behavior within the nw benchmark. Figure 4b and Figure 4d depict the number of CEs per second detected for the parallel and 1-thread versions of nw, respectively. We see two distinct phases, with a low and a high error rate. Moreover, similarly to kmeans, the error rate is lower for the 1-thread version than for the parallel version of this benchmark.
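A per-second error series such as the ones plotted in Figure 4 can be derived from timestamped CE reports. The sketch below shows one way to do so and to compare the average rate of two phases; the input format (seconds since the start of the run), the function names and the sample data are assumptions.

```python
from collections import Counter


def ce_per_second(timestamps_s, duration_s):
    """Count correctable errors in each one-second bin of the run."""
    counts = Counter(int(t) for t in timestamps_s if 0 <= t < duration_s)
    return [counts.get(s, 0) for s in range(duration_s)]


def mean_rate(series, start_s, end_s):
    """Average number of CEs per second within [start_s, end_s)."""
    window = series[start_s:end_s]
    return sum(window) / len(window) if window else 0.0


# Hypothetical CE timestamps: a quiet first phase, a busier second phase.
series = ce_per_second([0.1, 0.5, 2.2, 2.9, 2.95], duration_s=4)
print(series)
print(mean_rate(series, 0, 2), mean_rate(series, 2, 4))
```

Comparing the mean rate before and after a phase boundary, for instance the end of the file I/O, makes the kind of contrast reported above easy to quantify.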
Based on these grounds, we conclude that the DRAM error behavior may vary not only across workloads but also within a workload run.
Efficacy of ECC
Power gain
Figure 5 depicts the memory power and the reduction of power for each benchmark, averaged across the DIMMs and the two temperatures, when we relax T_REFP. These savings can be attributed to the 35× fewer refresh operations issued by the MCUs, which follows from relaxing T_REFP by the same factor. We observe the greatest reduction of power for nw(8), at 27.3 %, while many workloads have a reduction close to 10 %. On average, we observe that we can save 11.5 % of the total memory power.
CONCLUSIONS
In this paper, we present a comprehensive characterization of 72 DRAM chips under a relaxed refresh period and under various temperatures within a commodity server, using conventional data patterns along with a set of HPC workloads. To enable our study, we develop an experimental framework based on a state-of-the-art 64-bit ARM based server integrated with a thermal testbed. We demonstrate that the number of excited error-prone locations varies from workload to workload and may differ from the ones discovered by the conventional data patterns. Moreover, we show that the DRAM error behavior may also change within a workload run. Finally, we show that the DRAM refresh period can be relaxed by 35× on such a commodity system, with all errors being corrected by the available ECC up to DRAM temperatures of 60 °C, resulting in 11.5 % power savings.
