Large-scale computing systems today are assembled from numerous computing units to deliver the massive computational capability needed to solve problems at scale, which makes failures common events in supercomputing scenarios. Considering the demanding resilience requirements of today's supercomputers, we present a quantitative study on fine-grained failure modeling for contemporary and future large-scale computing systems. We integrate various types of failures from different system hierarchical levels and system components, and formally summarize the overall system failure rates. Given that the system-wide failure rate needs to be capped under a threshold value for reliability and cost-efficiency purposes, we quantitatively discuss different scenarios of system resilience, and analyze the impact of resilience to different error types on the variation of system failure rates and on the correlation of hierarchical failure rates. Moreover, we formalize and showcase the resilience efficiency of today's failure-bounded supercomputers.
Introduction
Due to the growing demand for High Performance Computing (HPC) and fast-advancing HPC technology, large-scale computing systems today are assembled from a large number of computing units equipped with supporting components, forming a highly capable and reliable HPC ecosystem. Various studies show that failures are not rare events in such HPC systems due to the numerous interconnected components. Beyond the fact that the growing number of components in HPC systems raises the aggregate failure rate, root causes of failures in supercomputers include radiation-induced effects such as particle strikes from cosmic radiation, circuit-aging effects, and faults due to chip manufacturing defects and design bugs [1]. Most such faults remain undetected during post-silicon validation and eventually manifest themselves during the operation of HPC systems, e.g., in runs of HPC applications, and in upgrades or maintenance of system software and devices. As process technology continues to shrink and HPC systems tend to operate at low supply voltage for power efficiency, e.g., near-threshold voltage computing [2], hardware components of supercomputers become more susceptible to all types of faults. Therefore, the Mean Time To Failure (MTTF) of the system is expected to decrease dramatically for forthcoming exascale supercomputers.
Resilience of HPC systems to various types of failures has become a first-class citizen in building scalable and cost-efficient HPC systems. In general, it is expensive to detect and correct such failures in large-scale computing systems with resilience techniques due to: (a) software costs, e.g., performance loss of the applications due to additional resilience code for calculating checksums/residues and saving checkpoints, and (b) hardware costs, e.g., extra components needed for modular redundancy like ECC memory and more disk space for checkpoint storage. Generally, resilience requires different extents of redundancy at various system levels in both time and space. Numerous studies have been conducted to improve the efficiency of existing resilience techniques for HPC systems. State-of-the-art solutions include Algorithm-Based Fault Tolerance (ABFT) [3] and scalable multi-level checkpointing (SCR) [4]. However, there is a lack of investigation into holistic failure analysis and fine-grained failure quantification for large-scale computing systems that covers realistic resilience scenarios in supercomputers to date. Such analysis helps in understanding the failure patterns of operational supercomputers and thus in devising more effective and efficient fault tolerance solutions.
In this paper, we discuss various types of errors and failures at different architectural levels of today's supercomputers, and quantify them in an integrated failure model that summarizes overall system failure rates hierarchically, in different HPC scenarios. The primary contributions of this paper are: (a) studying the quantitative correlation of failure rates among different components, different failure types, and different system layers of supercomputers, under specific overall system failure rate bounds, (b) discussing the quantitative impact of system resilience levels (referred to as the significance index in the later text) on overall system failure rates, and (c) formalizing the resilience efficiency of failure-bounded HPC systems.
The remainder of the paper is organized as follows: Section 2 introduces background knowledge. Section 3 discusses empirical failure models used in this work, and a holistic quantitative study (a refined failure model included) on failure-bounded supercomputers is presented in Section 4. Section 5 discusses related work, and Section 6 concludes.
Background
A supercomputer today is a massively parallel and complex integration of numerous components, primarily categorized into computing units, network, storage, and supporting devices, e.g., cooling infrastructure, cables, and power supply. Figure 1 overviews the hardware architecture of contemporary supercomputers hierarchically (taking as an example the supercomputer Trinity [5] at Los Alamos National Laboratory, which ranks 10th in the latest TOP500 list [6]). From top to bottom, a supercomputer comprises a number of cabinets (or racks), denoted as $N_{cabinet}$; each cabinet comprises a number of chassis (or blades), denoted as $N_{chassis}$; and each chassis comprises a number of compute nodes, denoted as $N_{node}$ (without loss of generality, we ignore the very small number of head nodes in the system that mostly perform management work). Finally, each compute node consists of hardware components including processors, storage, network, SRAM (on-chip), and DRAM (off-chip). According to the TOP500 list, top-ranked supercomputers to date have hundreds of cabinets, thousands of chassis, and hundreds of thousands of compute nodes overall. In the figure, we illustrate component details for only one node (interconnects and other devices between nodes, chassis, and cabinets are omitted due to space limitations), and assume that all nodes and counterpart components (e.g., all cables) in the system are homogeneous (and thus equally susceptible to failures) to simplify our discussion. Note that in Figure 1, we use simplified terms to describe the node configuration. Specifically, processors can be CPUs and/or accelerators such as GPUs and co-processors, which include functional units and control units.
Failure Model
Based on the system architecture shown in Figure 1, we denote the failure rates of the levels of the system hierarchy as $\lambda_{sys}$, $\lambda_{cabinet}$, $\lambda_{chassis}$, and $\lambda_{node}$ respectively.
As formulated in Equation (1), it is straightforward to calculate the overall failure rate $\lambda_{sys}$ for an HPC system illustrated in Figure 1 based on probability theory. It essentially shows that failures are distributed over all available nodes in the system. We assume there are no idle nodes at any level of the hierarchy when we consider failures, and thus all nodes are probabilistically equivalent for all types of errors.
For a compute node in a supercomputer as described above, there are two types of faults by nature: soft errors and hard errors. The former are transient (e.g., memory bit-flips and logic circuit miscalculations), while the latter are usually permanent (e.g., node crashes from dysfunctional hardware and system aborts from power outages). We denote the failure rates of soft errors and hard errors as $\lambda_{soft}$ and $\lambda_{hard}$ respectively. In Equation (2), we formulate the nodal failure rate as the integration (⊕) of $\lambda_{soft}$ and $\lambda_{hard}$ (note that ⊕ is used here instead of mathematical addition (+), given the different nature of soft errors and hard errors).
The parameters α and β applied to $\lambda_{soft}$ and $\lambda_{hard}$ respectively are referred to as the significance index (SI) of failure rates. For HPC systems equipped with different hardware and software resilience techniques, the SI values of $\lambda_{soft}$ and $\lambda_{hard}$ vary. In general, SI represents the resilience of a given system to failures, and it is negatively correlated with the failure coverage of the resilience techniques employed in the system, i.e., the more resilient the system is, the more failures can be recovered from, and the smaller the SI value is. Consequently, in Equation (2) the nodal failure rate $\lambda_{node}$ changes with the SI values introduced.
Due to the demanding requirements of system-wide power efficiency and resilience set as the goal of the US Department of Energy (DOE) for the upcoming exascale computers [7], current and future large-scale HPC systems need to be not only power-bounded but also failure-bounded, which means the overall system failure rate needs to be capped under a threshold value $\lambda^{cap}_{sys}$, given a power budget [8]. For simplicity of discussion, we define ⊕ as the explicit sum of soft and hard error rates. Therefore, based on Equations (1) and (2), we can reformulate the capped failure rates for soft errors and hard errors, under the specified system failure rate cap $\lambda^{cap}_{sys}$, as Equation (3).
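The bodies of Equations (1) to (3) are not reproduced in this text. A form consistent with the definitions above, and with the worked numbers used later (100,000 nodes, a 5,000,000 FIT system cap, α = 0.2, β = 0.4), is sketched below. The exact typesetting is an assumption, and ⊕ is taken as plain addition in the third line, as stated above.

```latex
\begin{align}
\lambda_{sys}  &= N_{node}\, N_{chassis}\, N_{cabinet}\; \lambda_{node} \\
\lambda_{node} &= \alpha\, \lambda_{soft} \oplus \beta\, \lambda_{hard} \\
\lambda^{cap}_{sys} &= N_{node}\, N_{chassis}\, N_{cabinet}\,
                      \bigl(\alpha\, \lambda^{cap}_{soft} + \beta\, \lambda^{cap}_{hard}\bigr)
\end{align}
```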
According to the definition of soft errors and hard errors, at node level we assume that processors, on-chip SRAM, and off-chip DRAM are the primary sources of soft errors, and that storage and network are the main contributors to hard errors (in practice power supply contributes considerably to hard errors as well, which is covered in the refined failure model in Section 4.3, where we assume power supply faults occur at chassis and cabinet levels). Without loss of generality (more components can be incorporated if needed), we look into the components above within a node, and formulate $\lambda_{soft}$ and $\lambda_{hard}$ more specifically as Equations (4) and (5).
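The node-level model can be sketched in a few lines of Python. This is a minimal illustration, assuming that Equations (4) and (5) are plain sums over the components named above and that ⊕ is addition; the names `NodeRates`, `node_rate`, and `system_rate` are hypothetical, and the component values in the example are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRates:
    """Per-node component failure rates in FIT (hypothetical container)."""
    processor: float
    sram: float
    dram: float
    storage: float
    network: float

    @property
    def soft(self) -> float:
        # Assumed form of Equation (4): processors, SRAM, and DRAM
        # are the sources of soft errors.
        return self.processor + self.sram + self.dram

    @property
    def hard(self) -> float:
        # Assumed form of Equation (5): storage and network
        # are the sources of hard errors.
        return self.storage + self.network


def node_rate(rates: NodeRates, alpha: float, beta: float) -> float:
    """Nodal failure rate, with the integration operator taken as addition."""
    return alpha * rates.soft + beta * rates.hard


def system_rate(rates: NodeRates, alpha: float, beta: float,
                total_nodes: int) -> float:
    """System failure rate when all nodes are active and homogeneous."""
    return total_nodes * node_rate(rates, alpha, beta)


if __name__ == "__main__":
    # Illustrative component values only; they are not measurements.
    example = NodeRates(processor=90, sram=50, dram=60, storage=20, network=30)
    print(example.soft, example.hard)
    print(system_rate(example, alpha=0.2, beta=0.4, total_nodes=100_000))
```

Keeping the SI values as explicit arguments makes it easy to compare systems with different resilience techniques against the same component data.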
Failure-bounded Quantitative Study
In this section, we conduct an exploratory quantitative discussion of several common scenarios in state-of-the-art HPC systems. With the failure models established above, our goals are: (a) given acquired failure data of system components, to infer unknown failure rate caps of other components, and (b) to estimate the system- and component-level failure rate ranges under known failure rate caps.
Capping Failures by Types
Per the mechanism of detection and correction, soft errors can be categorized 
Given that $N_{node} N_{chassis} N_{cabinet}$ is a constant that refers to the total number of active compute nodes system-wide, and given the assumed values for α and β, we can solve for any one of the three remaining variables in Equation (6) when the other two are provided.
The failure rate λ can be expressed in terms of either Mean Time To Failure (MTTF) [10] or Failure In Time (FIT) [11]. FIT is inversely proportional to MTTF, and one FIT is defined as one failure in a billion hours. Here we adopt FIT as the calculation unit because, unlike MTTF, it is additive. Existing studies demonstrate that for current HPC architectures, SRAM failure rates range from 10 FIT to 100 FIT [12], and DRAM failure rates are of the order of 100 FIT [13]. Therefore, without loss of generality, assume a supercomputer of 100,000 nodes with $\lambda^{cap}_{soft}$ = 200 FIT. Meanwhile, as a premise, $\lambda^{cap}_{sys}$ cannot exceed 5,000,000 FIT, as required for system-level resilience. With these parameters known, we can solve for $\lambda^{cap}_{hard}$:

$$\lambda^{cap}_{hard} = \frac{\lambda^{cap}_{sys} / (N_{node} N_{chassis} N_{cabinet}) - \alpha \lambda^{cap}_{soft}}{\beta} = \frac{50 - 0.2 \times 200}{0.4} = 25 \text{ FIT}$$

which indicates that, in order to keep $\lambda^{cap}_{sys}$ no greater than 5,000,000 FIT given $\lambda^{cap}_{soft}$ = 200 FIT, α = 0.2, and β = 0.4, the threshold value of $\lambda^{cap}_{hard}$ is 25 FIT.

Figure 2 illustrates this scenario according to Equation (3). It represents HPC systems that have higher resilience to soft errors than to hard errors. We can see that although $\lambda^{cap}_{sys}$ is linear in both $\lambda^{cap}_{soft}$ and $\lambda^{cap}_{hard}$, the higher resilience to soft errors makes $\lambda^{cap}_{sys}$ more sensitive to variation in $\lambda^{cap}_{hard}$ than to variation in $\lambda^{cap}_{soft}$. Figure 2 also shows that this trend holds for all $\lambda^{cap}_{soft}$ and $\lambda^{cap}_{hard}$ values.

Figure 3 illustrates a second setting of Equation (3), which reflects an HPC system with higher resilience to hard errors than to soft errors. Likewise, due to the higher tolerance to hard errors, $\lambda^{cap}_{sys}$ tends to be affected more by $\lambda^{cap}_{soft}$ than by $\lambda^{cap}_{hard}$, i.e., for the same amount of change in $\lambda^{cap}_{soft}$ and $\lambda^{cap}_{hard}$, $\lambda^{cap}_{sys}$ varies more with the change in $\lambda^{cap}_{soft}$, as shown in Figure 3.
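The Scenario 1 arithmetic can be checked with a short script. This is a sketch assuming the capped relation is $\lambda^{cap}_{sys} = N(\alpha \lambda^{cap}_{soft} + \beta \lambda^{cap}_{hard})$ with N the total node count; the function names are hypothetical. It also converts the system cap to MTTF using the definition of FIT as failures per billion hours.

```python
FIT_HOURS = 1e9  # one FIT is one failure per billion hours


def hard_cap(sys_cap: float, total_nodes: int, soft_cap: float,
             alpha: float, beta: float) -> float:
    """Solve sys_cap = total_nodes * (alpha*soft_cap + beta*hard_cap)."""
    per_node_budget = sys_cap / total_nodes
    return (per_node_budget - alpha * soft_cap) / beta


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a failure rate in FIT to a mean time to failure in hours."""
    return FIT_HOURS / fit


if __name__ == "__main__":
    cap = hard_cap(5_000_000, 100_000, soft_cap=200, alpha=0.2, beta=0.4)
    print(f"hard-error cap per node: {cap:.1f} FIT")
    print(f"system MTTF at the cap: {fit_to_mttf_hours(5_000_000):.0f} hours")
```

Under this relation, one extra FIT of $\lambda^{cap}_{hard}$ per node costs Nβ FIT at system level while one extra FIT of $\lambda^{cap}_{soft}$ costs Nα, which is why the type with the larger SI dominates the system rate.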
Capping Failures by Components
Instead of capping failure rates of soft/hard errors at system level, HPC systems today also have resilience requirements for specific components. Given the system-wide failure cap and some acquired failure data from other components, we can obtain the capped failure rates for the components of interest. Substituting Equations (4) and (5) into the system-level cap relates the component failure rates to $\lambda^{cap}_{sys}$. Likewise, $N_{node} N_{chassis} N_{cabinet}$ is a constant. We employ the same hypothesized SI values as in Scenario 1, α = 0.2 and β = 0.4, and the same premise of an HPC system of 100,000 nodes with $\lambda^{cap}_{sys}$ = 5,000,000 FIT. In addition, we assume that failure data for the processor, SRAM, and network have been acquired from historical system logs, with $\lambda_{processor}$ = 90 FIT. Substituting the acquired values gives

$$\lambda^{cap}_{DRAM} + 2\lambda^{cap}_{storage} = 50 \quad (9)$$

which indicates that, in order to preserve the assumed failure rates, the quantitative relationship between $\lambda^{cap}_{DRAM}$ and $\lambda^{cap}_{storage}$ in (9) must be satisfied.
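The component-level relation can be derived mechanically. The sketch below assumes the same additive forms as before and returns the constants of the constraint $\lambda^{cap}_{DRAM} + k \lambda^{cap}_{storage} = c$. The SRAM and network rates are not given in this text, so the values used in the example (50 FIT and 30 FIT) are hypothetical placeholders chosen only because they satisfy $\lambda_{SRAM} + 2\lambda_{network} = 110$, which is what (9) requires under these assumptions.

```python
def dram_storage_constraint(sys_cap: float, total_nodes: int,
                            alpha: float, beta: float,
                            processor: float, sram: float,
                            network: float) -> tuple[float, float]:
    """Return (c, k) such that dram_cap + k * storage_cap = c.

    Derived from
        sys_cap / total_nodes = alpha * (processor + sram + dram_cap)
                              + beta  * (storage_cap + network)
    by dividing through by alpha and moving known terms to one side.
    """
    per_node_budget = sys_cap / total_nodes
    k = beta / alpha
    c = per_node_budget / alpha - processor - sram - k * network
    return c, k


if __name__ == "__main__":
    # sram=50 and network=30 are hypothetical, not values from the study.
    c, k = dram_storage_constraint(5_000_000, 100_000, alpha=0.2, beta=0.4,
                                   processor=90, sram=50, network=30)
    print(f"dram_cap + {k:g} * storage_cap = {c:g}")
```

Any other pair of SRAM and network rates with the same weighted sum yields the same right-hand side, so the constraint alone does not identify the two values.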

Refining Failure Model from System Hierarchy
Although an HPC system is composed of compute nodes, failures may happen not only at local nodes, but also at interconnects between nodes, power supplies, and other devices at chassis or cabinet level. When such failures occur at higher levels rather than at a single node, all nodes at the related levels are affected.
For example, if the power supply at cabinet level fails, all nodes within the affected cabinets will be down. Considering the occurrence of failures hierarchically at different system layers, we refine the failure model as Equation (10). Note that in Equation (10), the parameters $\lambda^{node}_{chassis}$ and $\lambda^{node}_{cabinet}$ are failure rates of non-node devices/components at chassis and cabinet levels respectively, the parameters α′, β′, and γ′ are the SI of node, chassis, and cabinet failure rates respectively, and the constants $N^{total}_{node}$, $N^{total}_{chassis}$, and $N^{total}_{cabinet}$ refer to the total number of nodes, chassis, and cabinets in the system respectively. The remaining quantities follow from the previous models.

With reference to Table 1 [14], we can group all off-node failures in terms of $\lambda_{chassis}$ and $\lambda_{cabinet}$ given the specific location of failures. For simplicity of discussion, failures that occur between chassis and between cabinets are counted in $\lambda_{chassis}$ and $\lambda_{cabinet}$ respectively. In order to study the relationship among the failure rates at node, chassis, and cabinet level, under a predefined system failure rate cap and with resilience techniques employed, we consider the following scenario.

We adopt the same system architectural configuration as in previous examples: 100,000 nodes (100 nodes per chassis, 10 chassis per cabinet, and 100 cabinets in the system), with $\lambda^{cap}_{sys}$ = 5,000,000 FIT. Figure 4 shows the node, chassis, and cabinet failure rate curve for another HPC system scenario, where we assume that the node, chassis, and cabinet failures in this system are each tolerated to some extent by the employed resilience techniques, and consequently α′ = 0.2, β′ = 0.6, and γ′ = 0.5. Therefore, Equation (10) can be instantiated with these values, giving Equation (11).
Specifically, Figure 4 is an illustrated version of Equation (11). We can see that as $\lambda^{node}_{chassis}$ and $\lambda^{node}_{cabinet}$ change, the variation of $\lambda_{node}$ is comparatively small, i.e., $\lambda^{node}_{chassis}$ and $\lambda^{node}_{cabinet}$ both range from 0 to 500 FIT, while $\lambda_{node}$ ranges only from 230 to 250 FIT. This is because there are many more nodes than chassis and cabinets in the system overall. However, statistically, the failure rate of a single node is smaller than that of a single chassis or a single cabinet. In general, with a capped system failure rate, an increase in the failure rate at any hierarchy level (node, chassis, or cabinet) leads to a decrease in the failure rates at the other two levels. We can also see that the variation of $\lambda^{node}_{chassis}$ has a greater impact on the variation of $\lambda_{node}$ than the variation of $\lambda^{node}_{cabinet}$ does.
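The ranges quoted above can be reproduced under one plausible reading of the refined model. The sketch assumes Equation (10) is a weighted sum over the three levels, $\lambda_{sys} = \alpha' N^{total}_{node} \lambda_{node} + \beta' N^{total}_{chassis} \lambda^{node}_{chassis} + \gamma' N^{total}_{cabinet} \lambda^{node}_{cabinet}$; this form is an assumption inferred from the description and the figure ranges, and `node_cap` is a hypothetical helper name.

```python
def node_cap(sys_cap: float,
             total_nodes: int, total_chassis: int, total_cabinets: int,
             si_node: float, si_chassis: float, si_cabinet: float,
             chassis_rate: float, cabinet_rate: float) -> float:
    """Largest per-node failure rate (FIT) that still meets the system cap.

    Assumes the system rate is the SI-weighted sum of node, chassis,
    and cabinet contributions (assumed form of the refined model).
    """
    off_node = (si_chassis * total_chassis * chassis_rate
                + si_cabinet * total_cabinets * cabinet_rate)
    return (sys_cap - off_node) / (si_node * total_nodes)


if __name__ == "__main__":
    # 100 nodes per chassis, 10 chassis per cabinet, 100 cabinets.
    config = dict(sys_cap=5_000_000, total_nodes=100_000,
                  total_chassis=1_000, total_cabinets=100,
                  si_node=0.2, si_chassis=0.6, si_cabinet=0.5)
    for chassis, cabinet in [(0, 0), (500, 0), (0, 500), (500, 500)]:
        rate = node_cap(chassis_rate=chassis, cabinet_rate=cabinet, **config)
        print(f"chassis={chassis:3d} cabinet={cabinet:3d} -> node={rate:.2f} FIT")
```

With these inputs the node-level cap moves between about 233.75 and 250 FIT, in line with the 230 to 250 FIT range read from Figure 4, and each FIT at chassis level lowers it by 0.03 FIT against 0.0025 FIT per FIT at cabinet level.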
Failure-bounded HPC System Time Usage and Resilience Efficiency
Regarding the impact of resilience on HPC systems, a breakdown of system time usage by functionality (e.g., system idle, operation, computation, or I/O) is highly beneficial, since it makes fine-grained efficiency analysis feasible. Figure 5 overviews the general time usage of typical HPC systems today [15]. We can clearly see that the time used for resilience purposes, $T_r$, is a part of the system run time $T_u$, while the other part is the solve time $T_s$, which is in general application-specific. Without loss of generality, we assume that the time components highlighted in Figure 5 ($T_o$, $T_p$, $T_u$, $T_r$, and $T_s$) account for the majority of the total system time $T_{sys}$. Furthermore, the resilience efficiency of an HPC system can be formalized as in Equation (12).

It is well studied that supercomputers today (up to petascale) are exposed to high failure rates due to various root causes, with MTTF ranging from 50 minutes to 230 minutes [16]. Forthcoming exascale supercomputers are expected to suffer from increased failure rates due to a greater number of components, with predicted MTTF ranging from 22 minutes to 120 minutes [7]. With the expected failure rates, we can estimate the resilience efficiency of future exascale supercomputers using our models. Assume an exascale system that operates for 10,000 hours, in which one failure occurs every 120 minutes, with 40% hard errors and 60% soft errors. The employed resilience techniques successfully capture every failure, and take 0.7 hours and 0.2 hours to detect and recover from a hard error and a soft error respectively. Using Equation (12), we calculate the resilience efficiency below:
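$$\text{resilience efficiency} = \frac{1400 + 600}{10000} = 20\% \quad (15)$$

Here the 5,000 failures over 10,000 hours consist of 2,000 hard errors costing 1,400 hours and 3,000 soft errors costing 600 hours. From the calculation above, we can see that in order to obtain higher resilience efficiency for failure-bounded HPC systems in this era, we need to develop more cost-effective resilience techniques, or increase the MTTF of future supercomputers.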
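The worked example can be reproduced with a few lines of Python. This sketch computes the share of operating time spent detecting and recovering from failures, which is the quantity evaluated in (15); the function name and parameter names are hypothetical, and it assumes every failure is caught and recovery times do not overlap.

```python
def resilience_time_fraction(total_hours: float, mttf_hours: float,
                             hard_fraction: float,
                             hard_recovery_hours: float,
                             soft_recovery_hours: float) -> float:
    """Fraction of operating time spent on detection and recovery."""
    failures = total_hours / mttf_hours
    hard_time = failures * hard_fraction * hard_recovery_hours
    soft_time = failures * (1.0 - hard_fraction) * soft_recovery_hours
    return (hard_time + soft_time) / total_hours


if __name__ == "__main__":
    # 10,000 hours of operation, one failure every 120 minutes (2 hours),
    # 40% hard errors at 0.7 h each and 60% soft errors at 0.2 h each.
    share = resilience_time_fraction(10_000, 2.0, 0.4, 0.7, 0.2)
    print(f"time spent on resilience: {share:.0%}")
```

Because the result depends only on the per-failure cost divided by the MTTF, halving the recovery times or doubling the MTTF has the same effect on this share.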
Related Work
Modeling methods have been extensively used for large-scale computing systems, for the purposes of failure prediction [17] [18], trade-off optimization [19] [20], and vulnerability reduction [21] [22]. Gainaru et al. [17] proposed characterizing the normal and faulty behavior of HPC systems by using signal analysis to model the flow of each state event during the HPC system lifetime. The extracted models accurately reflected system outputs and improved the effectiveness of fault prediction. The subsequent work [18] leveraged data mining techniques to offer an adaptive failure prediction module for accurate fault prediction, and was evaluated on two large-scale systems for prediction precision and recall. Instead of focusing on analyzing system state data (referred to as system events in [17] and [18]), our work investigates failure rate correlation at different system hierarchical levels and system component levels.
Rafiev et al. [19] studied the interplay between critical dimensions in HPC, i.e., performance, energy, and reliability, using a modeling framework based on a resource-driven graph representation. The layer-agnostic models applied efficiently to large-scale systems and diverse types of concurrency. Tan et al. [20] quantitatively modeled the integrated energy efficiency in terms of performance per Watt and showcased the trade-offs among typical HPC parameters, by extending Amdahl's Law and the Karp-Flatt Metric. The proposed models were evaluated to help find the optimal HPC configuration for the highest integrated energy efficiency with resilience. Our work focuses on the resilience of HPC systems only, and our failure model is based on probability theory. Casas et al. [21] presented an approach that analyzes the vulnerability of sparse scientific applications to hardware faults at large scales, and reduced their vulnerability by protecting the most vulnerable components and by predicting failures. Leveraging register vulnerability, Tan et al. [22] investigated the validity of failure rates in HPC systems at near-threshold voltage, and empirically evaluated the power saving opportunities without incurring an observable number of soft errors during HPC runs. Our work differs from these in that the proposed model aims at better understanding the failure patterns of operational supercomputer architectures today and thus at devising more feasible resilience solutions accordingly.
Conclusions
Due to the growth of HPC systems in size and duration of use, it is critical to maintain the resilience of today's supercomputers. For resilience purposes, it is beneficial to quantify failures in existing failure-bounded HPC systems in a fine-grained fashion. In this paper, we conduct an exploratory quantitative study on holistic failure modeling for contemporary large-scale computing systems, which also sheds light on potential failures in forthcoming supercomputers of the exascale era, and helps devise more feasible resilience solutions at scale. Specifically, we integrate different failures from the perspective of system hierarchy, and formally summarize the overall system failure rate. We also discuss various scenarios of HPC system resilience categorized by error types, system components, and hierarchical levels, and formalize the significance index of failure rates and the resilience efficiency of today's supercomputers under a system failure rate cap.
