Abstract-Today's rapid generation of data and the increased need for higher memory capacity have triggered many studies on aggressive scaling of the refresh period, which is currently set according to rare worst case conditions. Such studies analysed in detail the data-dependent circuit level factors and indicated the need for online DRAM characterization due to the variable cell retention time. They have done so by executing a few test data patterns on FPGAs under controlled temperatures using thermal testbeds, which, however, are not available in the field. Moreover, the existing studies were not able to reveal any system level effects, which may be excited under the execution of workloads on real systems and directly or indirectly affect DRAM reliability. In this paper, we develop an experimental framework based on a state-of-the-art 64-bit ARM based server with Linux OS, in which we enable DRAM characterization under a relaxed refresh period by executing conventional test data patterns as well as popular HPC and Cloud workloads. Our results indicate that common test patterns are ineffective in identifying error-prone locations at low DRAM temperatures. Furthermore, we reveal that there is a strong correlation between the SOC utilization and DRAM reliability. By exploiting such findings, we developed a benchmark which can indirectly raise the DRAM temperature and can thus be used for characterization in the field without any complicated thermal equipment. Our study shows that the refresh period can be relaxed by 35 times on such a commodity system with all errors being corrected by the available error correcting codes, resulting in 11.5% power savings on average.
I. INTRODUCTION
The rapid growth of the connected Internet-of-Things (IoT) is estimated to generate 24.3 exabytes of data by 2020 [1], creating immense needs for more data storage capacity and aggressive scaling of DRAMs. Such needs have already turned DRAM based subsystems into one of the main power consumers, especially in servers, with estimations indicating that soon they will be accountable for almost half of the consumed power [2].
The main drawback of DRAM technologies is the limited retention time [3] of the cell's charge. To avoid any error induced by charge leakage over time, DRAM employs an Auto-Refresh mechanism that periodically recharges all cells in the array. Conventionally, all DDR technologies today adopt a refresh period, TREFP, of 64 ms for periodically refreshing each cell of the DIMM, based on the worst case retention time across all cells. However, many cells have a much higher retention time, and the operating conditions may not be as bad as the ones assumed [4], [5].
Many studies have tried to relax the refresh period and reduce the associated power consumption, which is estimated to incur a 40 % power overhead in future 64Gb densities [6]. The majority of the proposed schemes rely on offline identification of the weak cells using a few known data patterns and on the adoption of different refresh periods for various cells, rows and pages [6], [7], [8], [9], [10], [11]. However, recent studies have shown such schemes to be ineffective, since they revealed that the cell retention time varies dynamically due to data-dependent circuit level crosstalk effects [6], [12]. Such findings call for DRAM characterization under a dynamically scaled refresh rate based on the actual data executed in the field after deployment [12].
The majority of the existing characterization campaigns were performed on FPGAs under elevated DRAM temperatures [12], [3], [13], [4], which did not allow them to study any dynamic system level effects that may be excited by an application executing within a server and directly or indirectly affect DRAM reliability. Such indirect effects of system utilization on DRAM reliability were also suspected in a long-term study in a Google data center [14] but have never been thoroughly studied. Investigating such effects requires executing applications on a server without controlling the DRAM temperature, and collecting and analysing system performance and temperature data.
This paper attempts to address such challenges by making the following contributions:
• We develop an experimental framework for characterizing DRAMs under relaxed refresh period, within a state-of-the-art 64-bit ARM based server.
• We perform a characterization of the power-reliability trade-off of 144 server grade DRAM chips under scaled refresh period using conventional data patterns as well as High-Performance Computing (HPC) and Cloud workloads. Our results reveal that the identified erroneous locations and the DRAM temperature vary across applications, which has not been considered in the previous FPGA studies, where the DRAM temperature was fixed during experiments.
• We demonstrate that the System-on-Chip (SOC) utilization may indirectly affect the DRAM temperature and thus its reliability.
• We develop a micro-benchmark that combines the random data pattern and a higher SOC utilization to indirectly increase the DRAM temperature, providing an effective mechanism for online DRAM characterization without any thermal testbed.
• We demonstrate for the first time that the total memory power could be reduced by 11.5 % on average in such a server by relaxing the refresh period by 35x. This was possible without compromising the system availability since SECDED Error Correcting Codes (ECC) were adequate for correcting all manifested errors and avoiding any system crashes.
The rest of the paper is organized as follows. Section II presents our experimental framework, while Section III analyses the results of our experimental campaign. Finally, conclusions are drawn in Section IV.
II. EXPERIMENTAL CHARACTERIZATION FRAMEWORK

A. Infrastructure Details
The basis of our experimental framework is a state-of-the-art commodity 64-bit ARMv8 based server, APM X-Gene2 [15]. The X-Gene2 server has four Memory Control Units (MCUs). Each MCU has one channel of DDR3 memory and supports up to two DIMMs. In our campaign, we experiment with two different sets of 4 Micron DDR3 8GB DIMMs at 1866 MHz [16], one DIMM per MCU. In total, we characterize 144 chips of 4Gb x8 DDR3 [17]. The X-Gene2 has a special management core (SLIMpro) which provides access to on-board sensors, including temperature and power sensors, and reports to the Linux kernel all errors detected by SECDED ECC. The available ECC can detect and correct single-bit errors in a 64-bit word, which we refer to as correctable errors (CEs), and detect two-bit errors that cannot be corrected, which we refer to as uncorrectable errors (UEs). Finally, SLIMpro allows configuring the parameters of the MCUs, such as TREFP.
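The chip count quoted above can be checked with a short piece of arithmetic. The breakdown below is a sketch that assumes each 8GB ECC DIMM is organized as two ranks of nine x8 chips (eight data chips plus one chip for the check bits); the per-rank organization is an assumption and is not stated in the text.

```latex
\[
\underbrace{2}_{\text{DIMM sets}} \times
\underbrace{4}_{\text{DIMMs per set}} \times
\underbrace{2 \times (8 + 1)}_{\text{chips per DIMM (assumed)}} = 144,
\qquad
16 \text{ data chips} \times 4\,\text{Gb} = 64\,\text{Gb} = 8\,\text{GB per DIMM}.
\]
```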
B. Characterization Benchmarks
For our characterization campaign, we selected a set of data-pattern micro-benchmarks (DPBenchs) which were used in all previous retention characterization studies [3], [18]. Apart from the DPBenchs, we also used a set of server workloads: the Speckle Reducing Anisotropic Diffusion (srad) and Needleman-Wunsch (nw) kernels from the Rodinia HPC Benchmark Suite [19]. In our experiments, we run single-threaded (srad(1), nw(1)) and eight-threaded (srad(8), nw(8)) versions of the benchmarks to evaluate how parallel access patterns affect memory reliability. Furthermore, we selected popular Cloud workloads from CloudSuite [20]: the memcached, graph-analytics and web-search workloads deployed with Docker.
C. DRAM Characterization Flow
To characterize the DRAM within a commodity server, we relax TREFP from 64 ms to 2.283 s, which is the maximum TREFP allowed in the X-Gene2 server, while keeping the supply voltage at the default 1.5 V. Note that in our experiments the ARM cores operate at the default frequency of 2.4 GHz. The server is placed in a rack within a room with a controlled ambient temperature of 18 °C. Such an environment is representative of the operating conditions within a data center, which may vary from 15 °C to 22 °C [5].
Error Accounting. We used the aforementioned error reporting mechanisms to record the CEs and UEs manifested within each 64-bit word under relaxed TREFP. In addition, to account for any potential errors of more than two bits in a 64-bit word that cannot be detected by ECC, we compared the output of each execution with a golden reference output obtained when the DRAM is operating at the nominal TREFP. In this way, we were able to measure any Silent Data Corruption (SDC) that could go undetected by SECDED ECC.
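The accounting step described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the record format `(kind, word_address)` and the function name `account_errors` are hypothetical, and the way errors are read from the kernel is left out.

```python
from collections import Counter


def account_errors(ecc_events, run_output, golden_output):
    """Summarize one run under a relaxed refresh period.

    ecc_events: iterable of (kind, word_address) records, where kind is
    "CE" or "UE" (hypothetical format for the errors reported by ECC).
    run_output / golden_output: outputs of the relaxed and nominal runs.
    """
    events = list(ecc_events)  # allow generators to be traversed twice
    counts = Counter(kind for kind, _ in events)
    locations = {address for _, address in events}
    return {
        "CE": counts["CE"],
        "UE": counts["UE"],
        # Unique 64-bit words that reported at least one error.
        "error_prone_words": len(locations),
        # A mismatch against the golden output flags a potential SDC.
        "output_mismatch": run_output != golden_output,
    }
```

Keeping the set of unique word addresses separate from the raw event counts matters because the same word may report a CE many times during a two-hour run.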
We run each DPBench for two hours, which ensures that each benchmark completes at least 16 rounds as in [3]. Since we are interested in comparing the DRAM behaviour under the execution of the DPBenchs and the real workloads, we also run each HPC and Cloud workload for two hours. During our experiments, we allocated 28 GB of memory for each DPBench, the maximum available to user space. In the case of the HPC and Cloud workloads, we chose to configure all of them with a memory allocation of 8 GB to enable a fair comparison of the number of triggered errors between them.
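One round of a data-pattern benchmark can be pictured as a write phase, a waiting interval and a read-back phase. The sketch below is only an illustration under stated assumptions: the pattern names follow the ones reported in Section III, the checkerboard is approximated by alternating bits in the logical address space (software does not know the physical cell adjacency), and the function names are hypothetical.

```python
import random
import time

WORD_BYTES = 8  # one 64-bit word


def make_pattern(name, n_words, seed=0):
    """Return n_words 64-bit words filled with the named data pattern."""
    n_bytes = n_words * WORD_BYTES
    if name == "all0s":
        return bytes(n_bytes)
    if name == "all1s":
        return b"\xff" * n_bytes
    if name == "checkerboard":
        # Assumption: alternating bits (0b10101010) in logical order.
        return b"\xaa" * n_bytes
    if name == "random":
        rng = random.Random(seed)
        return bytes(rng.getrandbits(8) for _ in range(n_bytes))
    raise ValueError(f"unknown pattern: {name}")


def mismatching_words(buffer, reference):
    """Indices of 64-bit words whose content differs from the reference."""
    return [
        offset // WORD_BYTES
        for offset in range(0, len(reference), WORD_BYTES)
        if buffer[offset:offset + WORD_BYTES] != reference[offset:offset + WORD_BYTES]
    ]


def run_round(name, n_words, wait_seconds=0.0, seed=0):
    """Write the pattern, wait, then read it back and report bad words."""
    reference = make_pattern(name, n_words, seed)
    buffer = bytearray(reference)   # write phase
    time.sleep(wait_seconds)        # interval during which cells may leak
    return mismatching_words(buffer, reference)  # read-back phase
```

On an ECC-protected server, single-bit errors are corrected by hardware before software reads the data, so a read-back like this would only expose uncorrected corruption; the CE counts come from the error reports instead.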
D. Analysis Phase
At the end of the experimental campaign, we analyse the results and quantify the DRAM reliability profile using a set of metrics.
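First of all, we calculate the percentage of error-prone locations discovered when running a DPBench or workload over the total number of error-prone locations detected by all benchmarks (PDEL) as:
$$\mathrm{PDEL}_X = \frac{Num^{X}_{locations}}{Num^{all}_{locations}} \times 100\,\%$$
where $Num^{X}_{locations}$ is the number of unique error-prone locations, i.e. 64-bit words, discovered when running the benchmark X, and $Num^{all}_{locations}$ is the number of unique error-prone locations detected by all benchmarks. We also calculate the fraction of failing 64-bit words (FFW) for a specific benchmark using:
$$\mathrm{FFW}_X = \frac{Num^{X}_{locations}}{size^{X}_{memory}}$$
where $size^{X}_{memory}$ is the size of memory, measured in 64-bit words, allocated by the benchmark X.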
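The two metrics above can be computed directly from the sets of failing word addresses. The following is a minimal sketch; the function names and the dictionary layout (benchmark name mapped to a set of addresses) are assumptions made for illustration.

```python
def pdel(locations_x, locations_by_benchmark):
    """Percentage of all discovered error-prone words found by benchmark X.

    locations_x: set of failing 64-bit word addresses found by benchmark X.
    locations_by_benchmark: dict mapping each benchmark to its own set.
    """
    all_locations = set().union(*locations_by_benchmark.values())
    if not all_locations:
        return 0.0
    return 100.0 * len(set(locations_x)) / len(all_locations)


def ffw(locations_x, allocated_bytes):
    """Fraction of failing 64-bit words in the memory allocated by X."""
    allocated_words = allocated_bytes // 8
    return len(set(locations_x)) / allocated_words
```

For example, with hypothetical data in which one benchmark finds one word and another finds three words including the same one, the first benchmark's coverage is one third of the three unique locations. Taking the union of the sets, rather than summing their sizes, is what keeps locations found by several benchmarks from being counted more than once.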
III. CHARACTERIZATION RESULTS
In this section, we present the results of our three-month experimental campaign, focusing on those obtained when relaxing TREFP to the maximum allowed on the server, i.e. 2.283 s.
A. Experiments with Benchmarks under non-controlled temperature
Initially, we present the total number of errors, PDEL and FFW obtained after running the DPBenchs and the HPC and Cloud workloads without fixing the DRAM temperature.
DPBenchs: Figure 1 depicts how PDEL, the total number of errors and FFW vary across the DPBenchs for the two DIMM sets. Surprisingly, PDEL is zero for the random, all1s and all0s DPBenchs, since they do not manifest any errors on either of the two DIMM sets. For checkerboard, PDEL is also very small, since it triggered only one CE during all our experiments. Note also that we have not discovered any UEs or SDCs in experiments with the DPBenchs.
HPC and Cloud Workloads: Similarly to our experiments with the DPBenchs, we discovered only CEs and no UEs or SDCs when we executed the considered HPC and Cloud workloads. At the same time, we see in Figure 1 that HPC and Cloud workloads trigger many more errors than the DPBenchs. Moreover, the same workload may manifest different errors, in terms of both location and number, on separate DIMM sets. For example, nw(8) and srad(8) trigger errors only on the second DIMM set, which can be attributed to manufacturing process variations. Meanwhile, the graph-analytics benchmark has the highest PDEL, the highest total number of reported errors and the highest FFW among all benchmarks on both DIMM sets, which implies a certain dependence between the running application and the DRAM error behaviour.
B. DRAM Temperature Variation across Benchmarks
To further investigate these results, we measure the temperature per DIMM slot (one DIMM per MCU) by using the temperature sensor on the SPD chip [21] of each DIMM across all benchmarks. Figure 2 shows the DRAM temperature averaged over the experiments with two DIMM sets with 95 % confidence intervals. We observe that the DRAM temperature averaged over DIMMs varies from 33 °C up to 43 °C.
In our experiments, we found that there is a strong connection between the DRAM temperature reached during the execution of each benchmark and the measured errors. In particular, we observe that the DPBenchs, which reached 33 °C on average during execution, manifested only one CE. On the other hand, benchmarks such as graph-analytics, which led to an elevated DRAM temperature (i.e. 41.8 °C) during their execution, resulted in a higher number of errors. Specifically, we observe 28 error-prone memory locations (FFW is 9.31 × 10⁻⁹) for this benchmark. These findings indicate that if the conventional DPBenchs are used to characterize DRAMs online, without any mechanism to elevate the temperature, they will be ineffective.
In our study, we run experiments by placing the available DIMMs in different slots. We found that DIMMs placed in the 1st memory slot have the highest temperature, which explains the fact that the majority of errors were reported for this slot. Figure 3 shows the spatial and density distribution of the errors between memory slots and memory ranks aggregated over two DIMM sets when we run the Cloud workloads. We present the distribution as a polar plot where the Θ-axis specifies the DIMM slot and rank, while the ρ-axis reflects the number of errors. We see that the highest number of errors occurred in the DIMM in the 1st slot, which we found to be the case for the HPC benchmarks as well.
We observe that the DIMM in the 1st slot has the highest temperature for all runs (see Figure 2). Yet all memory accesses should be equally distributed between DIMMs due to the implemented interleaving mechanism, and thus we would expect the temperatures of all DIMMs to be similar.
C. SOC Impact on DRAM Temperature
We suggest that the temperature of the DIMMs is highly correlated with the SOC temperature, which can be seen in Figure 2. Table I shows the Spearman's rank correlation coefficient (r_s) between PDEL, the number of errors, the SOC utilization, the SOC temperature and the DRAM temperature averaged over all DIMMs. This coefficient reflects the monotonic relationship between two variables [22]. We found that the SOC and the DRAM temperatures are highly correlated, as the correlation coefficient r_s is 0.74, which indicates a positive direction of the correlation, i.e. if the SOC temperature rises then the temperature of all DIMMs also increases. We explain this by the fact that all memory slots are placed near the chip. [...] a corresponding thermal photograph of the board. We see that the 1st and 3rd slots are closer to the chip than the other slots. Thus, their temperatures should be higher than those of the 2nd and 4th slots. However, our study shows that DIMMs in the 2nd slot often have a higher temperature than DIMMs placed in the 3rd slot, which can be attributed to the air flow or the proximity to other heat sources.
Note that the SOC temperature is in turn related to the SOC utilization, with a correlation coefficient r_s of 0.7. These findings imply that the SOC temperature affects the temperature of the DIMMs, especially of those that are closer to the SOC, and thus DRAM reliability. Finally, we also observe a correlation between the SOC utilization and the DRAM temperature on Intel Xeon based servers, such as [23], which adopt a similar layout with DIMMs being placed adjacent to the SOC on the motherboard. However, the level of this correlation depends on the available cooling system within each server.
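For readers who want to reproduce this kind of analysis on their own measurements, Spearman's rank correlation can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python illustration with tied values given their average rank; the function names are hypothetical and no measurements from this study are included.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Because only the ranks enter the computation, the coefficient captures any monotonic relationship, for instance between SOC temperature and DRAM temperature samples, without assuming that the relationship is linear.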
D. Temperature Stress Benchmark
As discussed above, the temperature has a significant impact on DRAM reliability [12]. To exploit our findings concerning the indirect impact of the SOC activity on the DRAM temperature, we implemented a stress benchmark (SBench, see Algorithm 1) which incurs many L2 cache accesses and invokes parallel threads across all the cores apart from one core, where the DPBenchs are executed to directly stress the data-dependent circuit level factors on the DRAM. In this benchmark, we specifically stress the L2 caches since we found that this significantly increases the SOC temperature and power, as was also shown in [24]. Figure 5 shows PDEL and the SOC and DRAM temperatures measured after running each benchmark for 2 hours, including the DPBenchs with the co-running SBench. The DRAM and SOC temperatures measured for the DPBenchs grow above 40 °C and 85 °C respectively when running with SBench (see Figure 5). After executing each DPBench in parallel with SBench, which spawns 7 threads, we observe the manifestation of 759 CEs in total for both DIMM sets. We see that random-stress has the highest PDEL; however, it covers less than 50 % of all discovered error-prone memory locations. These findings suggest that real applications trigger errors in memory locations which are not covered by the DPBenchs. Note that, even though the DRAM temperature grows above 40 °C for all DPBenchs running in parallel with SBench, random-stress covers many more error-prone locations than the other micro-benchmarks. This difference is attributable to specific data and memory access patterns, which may significantly affect memory error behaviour, as previous research studies have shown [12].
Overall, these results imply that SBench could be effectively applied to increase the DRAM temperature and stress DRAM reliability without using any complicated thermal testbed that will not be available in the field during an online characterization.
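The structure of such a stress benchmark can be sketched as a set of worker processes that repeatedly touch a cache-sized buffer. The code below is only an illustration and not the authors' Algorithm 1: the 256 KiB buffer size, the 64-byte cache line and the function names are assumptions, and a real implementation would use native code, since an interpreted loop generates far less cache traffic.

```python
import multiprocessing as mp

CACHE_LINE_BYTES = 64  # assumed cache line size


def l2_stress(buffer_bytes, passes):
    """Touch one byte per cache line of a buffer, repeatedly.

    Returns a checksum so that the work cannot be skipped.
    """
    buf = bytearray(buffer_bytes)
    checksum = 0
    for _ in range(passes):
        for offset in range(0, buffer_bytes, CACHE_LINE_BYTES):
            buf[offset] = (buf[offset] + 1) & 0xFF  # read-modify-write
            checksum += buf[offset]
    return checksum


def run_sbench(n_workers=7, buffer_bytes=256 * 1024, passes=1000):
    """Run the stress loop on n_workers processes in parallel.

    Seven workers mirror the 7 threads mentioned in the text; the buffer
    size is an assumed value chosen to fit in an L2 cache.
    """
    with mp.Pool(n_workers) as pool:
        return pool.starmap(l2_stress, [(buffer_bytes, passes)] * n_workers)
```

In use, `run_sbench()` would be called under an `if __name__ == "__main__":` guard while a data-pattern benchmark runs on the remaining core. Processes are used instead of threads here because Python threads would not execute the loop on several cores at once.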
E. Power Reduction
To measure the power reduction, we run all benchmarks at the nominal and the relaxed TREFP and take DRAM power measurements using the on-board DIMM power sensors. By relaxing TREFP from 64 ms to 2.283 s, we manage to reduce the total memory power by 11.5 % on average without compromising reliability, as all errors were corrected by ECC.
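The relaxation factor follows directly from the two refresh periods. The short calculation below uses only the values given in the text; the last step, which reads the ratio as the relative number of refresh operations per unit time, assumes that the refresh rate scales inversely with the refresh period.

```latex
\[
\frac{T_{\mathrm{relaxed}}}{T_{\mathrm{nominal}}}
  = \frac{2.283\,\mathrm{s}}{0.064\,\mathrm{s}} \approx 35.7,
\qquad
\frac{T_{\mathrm{nominal}}}{T_{\mathrm{relaxed}}}
  = \frac{0.064\,\mathrm{s}}{2.283\,\mathrm{s}} \approx 0.028 .
\]
```

Under that assumption, the relaxed setting issues roughly 2.8 % of the refresh operations of the nominal setting, which corresponds to the relaxation of about 35x quoted in the paper.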
IV. CONCLUSION
In this paper, we present a comprehensive study on DRAM reliability characterization under a relaxed refresh period within a commodity server, using conventional data patterns along with a set of HPC and Cloud workloads under non-controlled temperature. Our results suggest that conventional data patterns are not effective for online DRAM characterization without a temperature stress mechanism. In addition, we quantify for the first time the indirect impact of system level factors, such as the SOC utilization. These findings led to the development of a stress benchmark which raises the temperature when executed in parallel with data patterns and significantly increases the number of discovered error-prone locations, thus facilitating DRAM characterization under a scaled refresh period in the field without any thermal testbed. Finally, we show that the DRAM refresh period can be relaxed by 35x on such a commodity system with all errors being corrected by the available ECC, resulting in 11.5% power savings.
