Abstract: To provide high system resilience, it is important to understand the nature of the faults that occur in the field. This study analyzes fault rates from a production system that was monitored for five years, capturing data for the entire operational lifetime of the system. The data show that devices in this system did not show any sign of aging during the monitoring period, suggesting that the lifetime of a system may be longer than five years. In DRAM, the relative incidence of fault modes changed insignificantly over the system's lifetime: the relative rate of each fault mode at the end of the system's lifetime was within 1.4 percentage points of the rate observed during the first year. SRAM caches in the system exhibited different fault modes, including cache-way faults and single-bit faults. Overall, this study provides insights on how fault modes and types in a system evolve over the system's lifetime.
I. INTRODUCTION
The key to designing and improving a reliable system is to first understand the nature of faults and errors that will occur. One way to do this is to study the faults occurring in current production systems. Server architects, system designers, and data center administrators can use insights obtained from such a study to improve the resilience of future systems by identifying new issues and developing stronger mitigation techniques, operational policies, and application designs.
Unfortunately, investigating fault and error characteristics on operational production systems is often challenging. First, obtaining recent and statistically significant field data is difficult. Often field data are not released or are collected only for short periods, limiting the possibility of performing statistically sound analysis. Second, collecting, storing, and managing the large amount of data produced by large-scale production systems requires a non-trivial effort. Finally, accurate interpretation of collected data is difficult because of the complexity of the system architectures under investigation and the interaction among different factors responsible for faults.
To address these challenges, this paper presents an analysis of DRAM and SRAM faults occurring in a production environment. Data used in this study were collected over the five-year operational lifetime of the Cielo supercomputer, an 8,500-compute-node supercomputer at Los Alamos National Laboratory. Our study examines corrected errors that occurred in the main memory (DRAM) and CPU structures (SRAM). To our knowledge, this is the first study that captures data collected over the entire operational life of a system. Such data provide insights on the evolution of faults in a system, and are a good indicator of overall system reliability over time. The results of our study can inform future decisions about the architecture, design, operation, and decommissioning of leadership-class systems and data centers.
The specific findings of this paper are:
• There was no observed sign of an increase in fault rate during the operational lifetime of the Cielo supercomputer. This suggests that the operational lifetime of the processors and memory used in Cielo may be longer than five years.
• The type of DRAM faults experienced by the system changes substantially during the second year of operational lifetime, shifting from primarily permanent faults to primarily transient faults.
• The DRAM fault modes experienced by the system do not change significantly over the system's lifetime. The incidence of each fault mode in the final year of system operation is within 1.4 percentage points of the incidence of that fault mode during the first year of operation.
• SRAM caches in the system experienced multi-bit way faults. However, the rate of these faults is exceptionally low compared to the rate of single-bit faults.
The rest of this paper is organized as follows. Section II defines the terminology used in this paper. Section III describes the system configuration of Cielo. Section IV describes our experimental setup. Section V describes our analysis methodology. Sections VI and VII present the fault analysis results, along with the insights we observed, for the DRAM and SRAM structures in the system, respectively. Section VIII discusses related work. Finally, Section IX concludes this paper.
II. TERMINOLOGY
In this paper, we distinguish between faults and errors as follows [2]:
• A fault is a state corruption in a memory system where one or more bits become corrupted, for instance due to hardware defects or particle strikes.
• When a faulty bit is accessed, the outcome of that access is called an error. Errors may be detected and possibly corrected by higher level mechanisms such as parity or error correcting codes (ECC). They may also go uncorrected, or in the worst case, completely undetected.
Hardware faults can further be classified as:
• Transient faults, which cause incorrect data to be read from a memory location until the location is overwritten with correct data. These faults occur randomly and are not indicative of device damage [3]. Particle-induced upsets (soft errors), which have been extensively studied in the literature [3] [18], are one type of transient fault.
• Hard faults, which cause a memory location to consistently return an incorrect value (e.g., a stuck-at-0 fault). Generally, hard faults can be repaired only by disabling the component in question or by replacing the faulty device [5].
• Intermittent faults, which cause a memory location to sometimes return incorrect values. Unlike hard faults, intermittent faults occur only under specific conditions such as elevated temperature [4]. Unlike transient faults, however, an intermittent fault is indicative of device damage or malfunction.
Distinguishing a hard fault from an intermittent fault in a running system requires knowing the exact memory access pattern to determine whether a memory location returns the wrong data on every access. In practice, this is impossible in a large-scale field study such as ours. Therefore, we group intermittent and hard faults together in a category of permanent faults.
III. SYSTEM CONFIGURATION
Our study comprises data from Cielo, a decommissioned production system located at Los Alamos National Laboratory, New Mexico. Our data collection process started in June 2011, Cielo's third month of operation, and lasted until the system was decommissioned in April 2016. Cielo contained 8,500 compute nodes. Each node contained two 8-core AMD Opteron™ processors based on 45-nm process technology. Each processor had eight 32KB L1 data caches, eight 512KB L2 caches, and one 12MB L3 cache. Each node had eight 4GB DDR-3 registered DIMMs for a total of 32GB of DRAM.
Cielo contained DRAM from three different DRAM vendors, referred to in this study as vendors A, B, and C. All DIMMs on Cielo (from all vendors) were physically identical: Each DIMM was double-sided, DRAM devices were laid out in two rows of nine devices per side, and DIMMs had no heatsinks.
The processor used in Cielo employs a DRAM scrubber that is configured to periodically access every memory location to correct any correctable ECC errors resident in DRAM. The DRAM scrub interval on Cielo was 24 hours. The processor also employs scrubbers on its L2 and L3 caches. In Cielo, the L2 SRAM scrub interval was 10 seconds, and the L3 SRAM scrub interval was 129 seconds.
IV. EXPERIMENTAL SETUP
For our analysis we use two different data sets: corrected error messages from console logs and hardware inventory logs. Together, these two logs provided the ability to map each error message to the specific hardware present in the system at that point in time.
Corrected error logs contain events from nodes at specific time stamps. Each node in the system has a hardware memory controller that logs corrected error events in registers provided by the x86 machine check architecture (MCA) [1] . Each node's operating system is configured to poll the MCA registers once every few seconds and record any events it finds to the node's console log. Console logs contain a variety of other information, including the physical address and ECC syndrome associated with each error. These events are decoded further using configuration information to determine the physical DRAM/SRAM location associated with each error. For each DRAM error, we decoded the location to show the DIMM, as well as the DRAM bank, column, row, and chip. For SRAM errors, we decoded the source SRAM structure (e.g., L3 cache, L2 cache), as well as the index and way in error in the caches.
Hardware inventory logs are separate logs and provide snapshots of the hardware present in each machine at different points in its lifetime. They contain an explicit description of each host's hardware, including configuration information and information regarding each DIMM such as the vendor and part number. For confidentiality purposes, we anonymize all DIMM vendor information.
Our data include 50 months of corrected errors, from June 2011 to April 2016, consisting of approximately 85 billion DRAM device-hours of operation. Previous studies have observed a significant DRAM vendor effect on the fault rate trends [17] . Figure 1 shows that the per-vendor device usage rates are constant over time on Cielo. This implies that trends in fault rates are not due to changes in vendor composition over time (e.g., as system operators perform DIMM replacements). Figure 1 also shows that DRAM from each vendor was present in the system for its entire lifetime, allowing aging assessment for each vendor.
Our observation period consists of 12.63, 52.41, and 19.68 billion device-hours for DRAM vendors A, B, and C, respectively, for a total of approximately 85 billion DRAM device-hours of operation. Therefore, we have enough operational hours on each vendor to make statistically meaningful measurements of each vendor's fault rate. Time periods where the system was not in a consistent state or was in a transition state were excluded from our data, as depicted by the gaps in the timeline in Figure 1.
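As a quick check of the totals above, the following sketch sums the per-vendor device-hours quoted in the text and derives each vendor's share of the observation period. The helper names are illustrative and are not part of the study's tooling; only the three device-hour figures come from the text.

```python
from decimal import Decimal

# Device-hours per vendor, in billions, as quoted in the text.
DEVICE_HOURS_BILLIONS = {
    "A": Decimal("12.63"),
    "B": Decimal("52.41"),
    "C": Decimal("19.68"),
}


def total_device_hours(hours):
    """Sum the per-vendor device-hours (illustrative helper)."""
    return sum(hours.values(), Decimal("0"))


def vendor_shares(hours):
    """Fraction of the total observation period contributed by each vendor."""
    total = total_device_hours(hours)
    return {vendor: h / total for vendor, h in hours.items()}


total = total_device_hours(DEVICE_HOURS_BILLIONS)
print(f"total: {total} billion device-hours")  # 84.72, i.e. approximately 85
for vendor, share in vendor_shares(DEVICE_HOURS_BILLIONS).items():
    print(f"vendor {vendor}: {float(share):.1%} of device-hours")
```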
V. ANALYSIS METHODOLOGY
In this section, we discuss our methodology for extracting fault rates from error logs and look into DIMM/CPU replacement information for Cielo.
A. Determining Fault Rate
Sridharan et al. showed that fault rates are a better predictor of system health than error rates [15]. Therefore, we choose to report fault rates in this work. Cielo includes a hardware scrubber on both its DRAM and SRAM subsystems. A fault can be considered permanent if it survives a scrub (write) operation. Accordingly, we classify a fault as permanent when a DRAM/SRAM device generates errors that are separated in time by at least one scrub operation. A fault on a device whose errors fall entirely within a single scrub interval is classified as transient.
This process is depicted in Figure 2 . For each device, we group time into epochs. An epoch begins with an error and lasts for one full scrub interval. If a device reports errors only within one epoch, we classify the fault as a potential transient fault. If a device reports errors in multiple epochs, we classify the fault as permanent.
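The epoch grouping described above can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the function name, the use of seconds as the time unit, and the choice to treat an error landing exactly on an epoch boundary as opening a new epoch are all assumptions.

```python
from typing import Iterable


def classify_fault_type(error_times_s: Iterable[float],
                        scrub_interval_s: float) -> str:
    """Classify one device's fault as 'transient' or 'permanent'.

    An epoch begins with an error and lasts one full scrub interval.
    Errors confined to one epoch indicate a (potential) transient fault;
    errors in multiple epochs indicate a permanent fault.
    """
    times = sorted(error_times_s)
    if not times:
        raise ValueError("at least one error timestamp is required")

    epochs = 0
    epoch_end = float("-inf")
    for t in times:
        # Assumption: an error at or after the end of the current epoch
        # opens a new epoch.
        if t >= epoch_end:
            epochs += 1
            epoch_end = t + scrub_interval_s
    return "transient" if epochs == 1 else "permanent"


DRAM_SCRUB_INTERVAL_S = 24 * 3600  # 24-hour DRAM scrub interval on Cielo

# Three errors inside one scrub interval: a single epoch.
print(classify_fault_type([0, 3600, 80000], DRAM_SCRUB_INTERVAL_S))
# A second error more than one scrub interval later: two epochs.
print(classify_fault_type([0, 90000], DRAM_SCRUB_INTERVAL_S))
```

The same function applies to the SRAM structures by passing the corresponding cache scrub interval instead of the DRAM interval.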
The methodology to extract DRAM fault modes used in this paper is similar to that used by Sridharan et al. [15] , with modifications due to the much longer monitoring period in this study. DRAM fault modes include both single-bit and multi-bit fault modes [16] . Unlike previous studies, we did not assume that multiple errors from different locations in a single DRAM device were always caused by a single multi-bit fault. Instead, we assume that multiple single-bit faults can occur in a single device due to our longer monitoring period.
We use a two-pass fault classification algorithm. We first assume that each DRAM device has a single fault, and classify faults into different fault modes. We then examine every multi-bit fault in our data and determine the number of bits in error. Prior studies have shown that single-row, single-column, and single-bank faults tend to have hundreds or thousands of bits in error [16]. Therefore, if a DRAM device has a small number of bits in error, it is likely due to multiple single-bit faults occurring in the same device and not due to a multi-bit fault. In our study, if a device contained two, three, or four bits in error, we classified these as independent single-bit faults rather than as a multi-bit fault. The resulting single-bit faults can be permanent or transient, depending on the error pattern. If a device contained more than four bits in error, we classified the fault as a multi-bit fault and determined its mode based on the error pattern. We chose the threshold of five single-bit faults (i.e., an average of one single-bit fault per year) based on the measured rate of single-bit faults in our study. The likelihood of encountering five single-bit faults in one device is extremely low; it is more likely that such errors result from a multi-bit fault.
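The second pass of this algorithm can be illustrated with a short sketch. The threshold of five bits comes from the text; the function name, the mode strings, and the representation of a bit location as a (bank, row, column, bit) tuple are assumptions made for illustration only.

```python
from typing import Hashable, Iterable, List

# From the text: more than four bits in error is treated as one multi-bit fault.
MULTI_BIT_THRESHOLD = 5


def second_pass(first_pass_mode: str,
                bits_in_error: Iterable[Hashable]) -> List[str]:
    """Return the fault modes assigned to one DRAM device after the second pass.

    first_pass_mode is the mode assigned under the single-fault assumption
    (e.g. "single-row"); bits_in_error holds the bit locations that reported
    errors (duplicates are ignored).
    """
    n_bits = len(set(bits_in_error))
    if n_bits == 0:
        raise ValueError("a fault needs at least one bit in error")
    if n_bits < MULTI_BIT_THRESHOLD:
        # Few bits in error: treat each one as an independent single-bit fault.
        return ["single-bit"] * n_bits
    # Many bits in error: keep the multi-bit mode found in the first pass.
    return [first_pass_mode]


# Two distinct bits in the same row: reclassified as two single-bit faults.
print(second_pass("single-row", [(0, 5, 1, 0), (0, 5, 9, 3), (0, 5, 9, 3)]))
# Six distinct bits in the same row: kept as one single-row fault.
print(second_pass("single-row", [(0, 5, col, 0) for col in range(6)]))
```

Each resulting single-bit fault would then be classified as transient or permanent from its own error timestamps.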
The methodology to extract SRAM fault modes is similar to the DRAM process with modifications to account for the different fault modes and rates that occur in SRAM. Our algorithm analyzes the logged physical addresses and MCA status registers to associate each error to a specific cache way and cache index, creating a "map" of faulty locations in each cache and allowing us to infer a fault mode in each cache.
We identify three fault modes across all SRAMs: 1) Single-bit: All errors map to a single bit (same cache way, same cache index); 2) Cache-way: All errors map to a single cache way; 3) Cache-index: All errors map to a single cache index. Just like DRAM, we use a two-pass fault classification algorithm and use error counts and error timestamps per structure to distinguish between a multi-bit fault and multiple single-bit faults. However, the error threshold is not disclosed in this paper for confidentiality reasons.
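The three SRAM fault modes can be expressed as a simple check over the "map" of faulty locations. The sketch below is an illustration under stated assumptions: the function name and the (way, index) tuple representation are hypothetical, and the second pass is omitted entirely because its error threshold is not disclosed.

```python
from typing import Iterable, Tuple


def classify_sram_mode(errors: Iterable[Tuple[int, int]]) -> str:
    """Infer a fault mode from (cache_way, cache_index) pairs of one cache."""
    locations = set(errors)
    if not locations:
        raise ValueError("at least one error location is required")
    ways = {way for way, _ in locations}
    indices = {index for _, index in locations}

    if len(locations) == 1:
        return "single-bit"    # same cache way, same cache index
    if len(ways) == 1:
        return "cache-way"     # many indices, one way
    if len(indices) == 1:
        return "cache-index"   # many ways, one index
    return "unclassified"      # would need the (undisclosed) second pass


print(classify_sram_mode([(2, 101), (2, 101)]))            # single-bit
print(classify_sram_mode([(2, 100), (2, 103), (2, 250)]))  # cache-way
print(classify_sram_mode([(0, 7), (3, 7)]))                # cache-index
```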
B. DIMM/CPU Replacement
DIMM and CPU replacements are an important factor in the observed behavior of a system because a high rate of device replacements can affect the observed fault rates over time. Unfortunately, we do not have DIMM or CPU replacement information for Cielo.
We do, however, know the replacement policies used by the system operators. A DIMM replacement will occur if a DIMM experiences an uncorrected error or if it consistently generates corrected errors over time. Therefore, we can bound the overall number of DIMM replacements and determine if this replacement rate will substantially change the observed fault rate over time.
First, we find out how many DIMMs in Cielo have ever reported faults and try to estimate a proxy of the replacement data. Table I showed a higher fault rate due to aging, the effects would be visible at a system level. Similar conclusions hold for CPU replacements, although absolute fault rates are not disclosed for confidentiality reasons.
VI. DRAM FAULTS AND ERRORS
A. Fault Types
Figure 3 shows the aggregate fault rates per fault type for each vendor across the entire operational lifetime of Cielo. Overall, 55% of DRAM faults were transient faults, and 45% were permanent faults. However, the transient fault rate for vendor A's DRAM is approximately 5.29x and 11.2x higher than the transient fault rate for DRAM from vendors B and C, respectively. This indicates that the mix of DRAM fault types experienced by a system is heavily dependent on the mix of DRAM vendors in that system.
Figure 4 shows the DRAM fault rate over time. We omit the first two months of the data set to avoid overcounting permanent faults that developed between the beginning of the system's lifetime (April 2011) and the start of our measurement interval (June 2011). The figure shows that the fault rate declined in the first two years of operation and then remained roughly constant for the remainder of the system's lifetime. Figure 4 also shows that this trend was driven entirely by the change in permanent fault rate over time; the transient fault rate in the system remained approximately constant over the entire operational lifetime of the system. Finally, the figure shows that DRAM faults shifted from primarily permanent to primarily transient faults after approximately 18 months of system operation.
Figures 5a, 5b, and 5c show the DRAM fault rate over time for vendors A, B, and C, respectively. The figures show that the permanent fault rate declined for DRAM devices from all three vendors. DRAM from vendors B and C shows a sharp decline in permanent fault rate, whereas DRAM from vendor A shows a slight decline over the first few months. Transient fault rates are fairly constant over time for all vendors.
The data suggest that the DRAM in Cielo did not experience any increase in fault rate over time, which would be expected if the devices were near the end of their operational lifetime [7] . This implies that the operational life of systems may be longer than five years, as suggested by prior work [11] . 
B. Fault Modes
Our analysis showed that DRAM in Cielo experienced the same fault modes identified by previous work, including single-bit, single-word, single-row, single-column, single-bank, multi-bank (full-chip), and multi-rank faults [16]. Figure 6 shows the rates of these fault modes as a percentage of all DRAM faults on Cielo. The figure shows that 67.8% of faults in Cielo are single-bit faults, while 32.2% are multi-bit faults. These values are consistent with the results observed in a study of the first 15 months of operation of Cielo, which showed 67.7% single-bit and 32.3% multi-bit faults [17].
We studied the trend of different DRAM fault modes over time. Primarily, we looked for: i) how the relative incidence of fault modes changed over time, and ii) whether newer faults have any correlation with existing faults or fault locations. Figure 7 shows the change in different fault modes over time. Note the break in the y-axis. The figure shows that the relative incidence of each fault mode in the final year was within 1.4 percentage points of the incidence observed in the first year. We also zoom in on the first year and show the relative incidence per month in Figure 8. The figure shows that there is variability in relative incidence during the first few months, but after the sixth month the changes in incidence are less than 1 percentage point.
These figures show that each fault mode occurs at an approximately constant incidence relative to other fault modes. The implication is that the first few months (e.g., 6 months) of a system's lifetime can be used to classify the expected fault modes for the remainder of its operational life. In addition, fault modes appear to be uncorrelated over time. Finally, DRAM faults appear to have a uniform random distribution in a device, implying that DRAM faults are equally likely to occur in any region of any DRAM device.
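The percentage-point comparison used in this section can be sketched as follows. The function names are illustrative, and the fault counts in the example are made-up placeholders chosen only to demonstrate the calculation; they are not data from Cielo.

```python
from typing import Dict


def relative_incidence(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-mode fault counts into percentages of all faults."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no faults in this period")
    return {mode: 100.0 * n / total for mode, n in counts.items()}


def max_point_shift(first: Dict[str, int], last: Dict[str, int]) -> float:
    """Largest change in relative incidence, in percentage points."""
    a = relative_incidence(first)
    b = relative_incidence(last)
    modes = set(a) | set(b)
    return max(abs(a.get(m, 0.0) - b.get(m, 0.0)) for m in modes)


# Hypothetical counts, for illustration only (not measurements from Cielo).
first_year = {"single-bit": 680, "single-row": 120, "single-column": 200}
final_year = {"single-bit": 670, "single-row": 130, "single-column": 200}

shift = max_point_shift(first_year, final_year)
print(f"largest shift: {shift:.1f} percentage points")
```

Comparing percentage points rather than relative percentages keeps rare fault modes from dominating the comparison when their small absolute incidence fluctuates.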
VII. SRAM FAULTS AND ERRORS
In this section, we look at SRAM faults occurring in the L3 cache on Cielo. We first classify the faults as permanent or transient. Figure 9 shows an example of a transient fault and a permanent fault in the L3 cache. The x-axis plots the time and the y-axis plots the cache indices in error. Every 'x' denotes one or more errors. A transient fault occurred on cache index 100 during the fourth month. A permanent fault on cache index 103 resulted in multiple errors over time.
Once we classify the fault type, we classify each fault into one of the three fault modes: cache-way fault, cache-index fault, and single-bit fault. Figure 10 is an example of a cache-way fault, where many errors occur across different cache indices over time but in the same cache way (cache way 2). The x-axis plots the time of the error and the y-axis plots the cache indices in error. We have not observed any cache-index faults in Cielo. A single-bit fault is one where all errors occur in the same cache way and cache index tuple. Figure 11 is an example of a single-bit fault in a cache. This fault is in cache way 2 and cache index 101.
99.36% of the faults in the L3 cache are transient, and 0.64% are permanent. This ratio of transient faults to permanent faults matches the existing literature on SRAM faults [17] .
Next we look at the fault modes in the L3 cache. Table II shows the distribution of different fault modes across different fault types in the L3 cache. We find that 99.98% of the faults are single-bit faults and 0.02% of the faults are multi-bit cache-way faults. There are no cache-index faults. We also looked at the distribution of the fault modes across fault types. 99.36% of the faults are single-bit transient faults, 0.62% of the faults are single-bit permanent faults, and 0.02% of the faults are cache-way permanent faults. There are no cache-way transient faults in the data.
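The percentages quoted for the L3 cache are mutually consistent, which the following sketch verifies by summing the joint (mode, type) shares into per-mode and per-type totals. The dictionary layout and the helper name are illustrative; the four percentages are the ones stated in the text.

```python
from decimal import Decimal

# Share of all L3 faults by (fault mode, fault type), in percent, from the text.
L3_FAULT_SHARE = {
    ("single-bit", "transient"): Decimal("99.36"),
    ("single-bit", "permanent"): Decimal("0.62"),
    ("cache-way", "transient"): Decimal("0"),
    ("cache-way", "permanent"): Decimal("0.02"),
}


def marginal(shares, position, label):
    """Sum the shares whose key matches label at position (0=mode, 1=type)."""
    return sum((v for k, v in shares.items() if k[position] == label),
               Decimal("0"))


print(marginal(L3_FAULT_SHARE, 0, "single-bit"))  # 99.98
print(marginal(L3_FAULT_SHARE, 0, "cache-way"))   # 0.02
print(marginal(L3_FAULT_SHARE, 1, "transient"))   # 99.36
print(marginal(L3_FAULT_SHARE, 1, "permanent"))   # 0.64
```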
VIII. RELATED WORK
There have been several studies examining failures in production systems over the past years. Schroeder et al. studied failures in supercomputer systems at LANL [12] and published a large-scale field study using Google's server fleet [13]. Li et al. published a study of memory errors on three different data sets, including a server farm of an Internet service provider [9], and published another expanded study of memory errors on the same farm and other sources [8]. Hwang et al. presented an expanded study on Google's server fleet, as well as two IBM Blue Gene clusters [6]. Siddiqua et al. published a study of DRAM failures from client and server systems [14]. Sridharan and Liberty presented a study of DRAM failures in a high-performance computing system [16]. Sridharan et al. presented a study of DRAM/SRAM faults with a focus on positional and vendor effects [17], and presented another DRAM/SRAM study focused on the reliability impact of hardware resilience schemes in high-performance computing systems [15]. Meza et al. analyzed the memory errors from a Facebook server fleet and showed that errors follow a Pareto distribution with decreasing hazard rate [10]. Our work distinguishes itself from these studies because it is, to our knowledge, the first to present an analysis of approximately 85 billion DRAM device-hours over the five-year lifetime of a production supercomputer.
IX. CONCLUSIONS
Reliability is a first-class problem for server systems, and must be treated as a first-class constraint by every component of the system. Moreover, systems may encounter reliability issues in the field that were unknown at design time. In this paper, we presented data on DRAM and SRAM faults collected on a production system over a five-year period. Our findings demonstrate how DRAM and SRAM fault modes, types, and rates change over time, and give system architects, designers, and operators a better understanding of system reliability behavior over time.
X. ACKNOWLEDGEMENT
AMD®, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product
