Towards Adaptive Resilience in High Performance Computing by Ghiasvand, Siavash & Ciorba, Florina M.
arXiv:1706.04345v1 [cs.DC] 14 Jun 2017
Towards Adaptive Resilience in
High Performance Computing
Siavash Ghiasvand
Center for Information Services
and High Performance Computing
Technische Universität Dresden, Germany
siavash.ghiasvand@tu-dresden.de
Florina M. Ciorba
Department of Mathematics
and Computer Science
Universität Basel, Switzerland
florina.ciorba@unibas.ch
With the current growth in computing capabilities of high
performance computing (HPC) systems, Exascale¹ HPC systems
are expected to arrive by 2020 [1]. As systems become larger
and more complex, they also become more error prone [2].
The failure rate of HPC systems increases rapidly, such that
failures become the norm rather than the exception. In such
an unreliable environment, HPC systems must be resilient to
failures in order to remain operational.
Different approaches have been introduced in HPC systems to
prevent failures (e.g., redundancy) or at least to minimize their
impact (e.g., via checkpoint and restart). In most cases, when
these approaches are employed to increase the resilience of
certain parts of a system, performance degrades significantly
and/or energy consumption increases rapidly.
In general, failures can be divided into two groups: avoidable
and unavoidable. Since there is no ‘eternal’ hardware, failures
cannot, in theory, be truly ‘avoided’. However, one can
significantly decrease the probability of their occurrence or,
in some cases, postpone them for a certain amount of time.
Avoidable failures are defined in this work as failures that
can be hidden from a specific system layer. In contrast,
unavoidable failures are failures that cannot be hidden from a
specific system layer. We liken the failure sources in an
HPC system to a tree, which has its roots in the lowest system
layer and its leaves in the highest layers. As we traverse the
system from top to bottom, the diversity of failures decreases
while their impact increases. During propagation across system
layers, failures may retain their original characteristics or
may morph into other types of failures. Therefore, each system
layer requires its own protection to prevent the propagation of
specific types of failures to the upper layers.
When protection layers are added between system layers
to identify, address, and prevent failures from propagating
upward, certain overheads are imposed on the system. As
long as the failure protection layers are in place, they impose
these overheads, regardless of whether failures actually occur.
In certain cases, the fault tolerance gained may not be worth
the added overhead.
This work in progress proposes an approach that employs
a probabilistic failure predictor to estimate the situations in
which failures will occur in the future. Based on the probability
of failures, it can be decided whether or not the available failure
protection layers need to be activated. By modeling
general HPC systems and estimating their failure rates, we
observed that applying a logical topology may reduce the failure
probability. The logical topology also provides a uniform
topology for any HPC system, so that the proposed failure
probability predictor can be applied to it. The proposed approach
has three main goals: improved resilience, progress in
computation, and energy saving.
¹ Capable of performing 10^18 floating point operations per second.
In this approach, the HPC system is considered in its
entirety and resilience mechanisms (e.g., checkpointing,
isolation, and migration) are activated on demand. With this
approach, the unavoidable increase in total performance
degradation and energy consumption is smaller than with the
typical checkpoint/restart and redundancy-based resilience
mechanisms. Our work aims to mitigate a large number of
failures occurring at various layers in the system, to prevent
their propagation, and to minimize their impact, all of this
in an energy-saving manner. In the case of failures that are
estimated to occur but cannot be mitigated using the proposed
approach (e.g., no surrogate resource is available), the system
administrators are notified so that they can investigate
further and react.
A resilient HPC system is a system that can complete the
users' requests even if certain units² of the system encounter
failures and are no longer functioning. A failure is a complete
outage of a given system unit. Thus, a malfunctioning or
misbehaving unit is not a failed unit and is therefore
beyond the scope of this work. Failures and the propagation of
their impact have an effectiveness zone. After each failure, one
can expect new failures within this effectiveness
zone. A sequence of successive failures is called a failure
chain. Failures always propagate horizontally within a single
layer, as well as from bottom to top across the horizontal
system layers.
² A unit can be any component of the system, from a single transistor to an
entire rack of computers, depending on the assumed component granularity.
The common architecture of today’s computers is based on
the von Neumann description, first introduced by John von
Neumann in 1945 [3], [4]. In the von Neumann model, the
computer consists of three main components: central processing
unit (CPU), memory, and input/output devices (I/O).
HPC systems are clusters of computing nodes that
are connected together via a network of switches. Therefore,
the entire HPC system can also be modeled via the von Neumann
basic components. There are different physical topologies
for connecting computational nodes to form an HPC
system. The fat-tree topology is one of the most popular
topologies [5]. The fat-tree, an indirect bidirectional
multistage topology, has a natural fault-tolerant property
owing to its symmetric layered formation and lack of a root
bottleneck. In specific HPC systems, the numbers of nodes,
chassis, and racks are chosen based on certain application-,
budget-, and space-related considerations to increase
performance and reduce acquisition costs. However, this may
decrease the innate resilience of the fat-tree topology. As
stated earlier, an HPC system can be modeled via the von
Neumann basic components. To reduce the system complexity and
to facilitate a generic model, it is assumed that all components
have equal importance and that the impact of the failure of one
component on other components is instantaneous, which in reality
may not always be the case. The three basic components of the
von Neumann description are serially connected to each other.
However, a computer may also have several parallel units. In
such cases, all similar parallel components can be treated as
one super-component. Based on these assumptions, the failure
rate of a computer can be estimated via Eq. (1), in which
$F_{N_n}$ is the estimated failure probability of computer $n$
and $F_{CMP_{cmp}}$ denotes the estimated failure probability
of a single component $cmp$.
$$F_{N_n} = 1 - \prod_{cmp=\mathrm{first}}^{\mathrm{last}} \left(1 - F_{CMP_{cmp}}\right) \tag{1}$$
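As an illustration, Eq. (1) can be evaluated with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name and the per-component probabilities below are hypothetical, and the components are assumed to fail independently.

```python
from math import prod


def node_failure_probability(component_failure_probs):
    """Eq. (1): a node fails if at least one of its serially
    connected components fails (independence is assumed)."""
    for p in component_failure_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    # Probability that every component survives, then the complement.
    return 1.0 - prod(1.0 - p for p in component_failure_probs)


# Hypothetical failure probabilities for CPU, memory, and I/O.
example = node_failure_probability([0.01, 0.02, 0.005])
```

Because the components are in series, the node failure probability is never lower than that of its least reliable component.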
Each chassis of an HPC system consists of a set of parallel
computers (hereafter, nodes). Each rack is a set of chassis, and
the HPC system is a set of racks. Based on this topology, a
chassis is considered to be "failed" when all of its internal
nodes fail. The same definition applies to the racks and the
HPC system. Thus, the proposed failure estimation model can
be recursively expanded from a single node, to chassis, to
racks, and to the entire HPC system.
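The recursive expansion described above might be sketched as follows. The sketch makes several assumptions: all nodes are identical, failures are independent, and shared components such as network switches are left out. All function names and parameters are hypothetical.

```python
from math import prod


def serial_failure(probs):
    """A serial group fails if any member fails (as in Eq. (1))."""
    return 1.0 - prod(1.0 - p for p in probs)


def parallel_failure(probs):
    """A parallel group fails only when all of its members fail."""
    return prod(probs)


def system_failure(component_probs, nodes_per_chassis,
                   chassis_per_rack, racks):
    """Expand the estimate from node to chassis to rack to system."""
    node = serial_failure(component_probs)
    chassis = parallel_failure([node] * nodes_per_chassis)
    rack = parallel_failure([chassis] * chassis_per_rack)
    return parallel_failure([rack] * racks)


# Hypothetical 4-4-4 layout with three components per node.
estimate = system_failure([0.01, 0.02, 0.005], 4, 4, 4)
```

Under these assumptions, each additional parallel unit multiplies the group failure probability by a factor below one, so the benefit of adding further units diminishes quickly.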
Applying this model to the statistics obtained from [2]
led us to the following set of results: (1) The failure impact
of nodes’ internal components (e.g., CPU and memory) on the
whole-system failure rate is negligible in comparison with the
failure impact of external components (e.g., network
switches); (2) Having fewer than 4 nodes in a chassis is not
efficient in the fat-tree logical topology; (3) Having more than
4 nodes in a chassis has no significant impact on reducing
failure probability; (4) In the fat-tree logical topology, using
more branches with fewer leaves on each branch is more
efficient than using few branches with many leaves on each
branch; (5) In the fat-tree logical topology, only the two
topmost component layers have a significant impact on the
whole-system failure probability.
Repeating the calculations on higher system layers and
changing the granularity provided the following results:
(1) Redundancy in higher layers is more beneficial; (2) Having
fewer than 4 chassis in a rack is not efficient for failure
prevention; (3) Having more than 8 chassis in a rack has no
significant impact on decreasing failure probability; (4) Having
more than 2 racks improves the failure rate; (5) Having more
than 8 racks has no significant impact on failure reduction;
(6) As expected, shared resources (e.g., network switches)
have the dominant impact on the resilience of the HPC system.
Based on these results, the proposed approach: (1) applies
a 4-4-4 logical fat-tree topology to the HPC system;
(2) estimates the system failure probability based on the von
Neumann model of the HPC system; (3) upon detection or
prediction of failures, activates failure protection layers
within the effectiveness zone of that failure; and (4) thereby
prevents or minimizes the formation of failure chains in the
HPC system. Via these steps, this approach can
provide on-demand resilience for HPC systems, which paves
the way towards adaptive resilience.
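The on-demand activation step can be sketched as a simple threshold rule. The text does not specify the decision rule, so the threshold, the unit names, and the representation of effectiveness zones as a mapping are all assumptions made for illustration only.

```python
def units_to_protect(predicted, zones, threshold=0.5):
    """Select the units whose protection layers should be active.

    predicted: unit name -> predicted failure probability
    zones:     unit name -> units inside its effectiveness zone
    threshold: hypothetical cut-off above which protection is enabled
    """
    selected = set()
    for unit, probability in predicted.items():
        if probability >= threshold:
            # Protect the unit at risk and everything in its zone.
            selected.add(unit)
            selected.update(zones.get(unit, ()))
    return selected


# Hypothetical predictions and effectiveness zones.
predicted = {"node-a": 0.02, "node-b": 0.71, "node-c": 0.10}
zones = {"node-b": ["node-c"]}
to_protect = units_to_protect(predicted, zones)
```

Units outside the returned set run without protection layers, which is where the intended savings in overhead and energy would come from.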
In general, this work makes the following contributions.
(1) It proposes an approach to reduce the cost of resilience
in HPC systems. (2) It recommends a logical topology that
takes advantage of the built-in resilience of tree-like
topologies and increases system resilience at minimum overhead.
(3) It proposes a general model to estimate failure probability
in HPC systems based on the von Neumann model. (4) To the
best of our knowledge, this paper is the first attempt to use
a logical topology to decrease the cost of resilience in HPC
systems. (5) It is also, to the best of our knowledge, the first
study that uses system-wide failure probability estimation
as the decision factor to provide on-demand resilience by
controlling failure protection mechanisms (e.g., reconfiguring
the checkpoint/restart and redundancy mechanisms) on HPC
systems.
As future work, besides improving the failure prediction
model based on failure correlations [6], we plan to analyze the
propagation pattern of failures in different system layers and
to quantify the impact of failures within their effectiveness
zone.
REFERENCES
[1] W. E. Nagel, D. Hackenberg, G. Juckeland, H. Brunst, and H.-J. Bungartz,
“Planning for exascale systems: The challenge to be prepared,”
Algorithms and Scheduling Techniques for Exascale Systems - Dagstuhl
Reports, vol. 3, no. 9, p. 122, 2014.
[2] B. Schroeder and G. A. Gibson, “Understanding failures in petascale
computers,” Journal of Physics: Conference Series, vol. 78, p. 012022,
Jul. 2007.
[3] J. von Neumann, “First Draft of a Report on the EDVAC,” Letter, no. 1,
1945.
[4] M. D. Godfrey and D. F. Hendry, “The computer as von Neumann planned
it,” IEEE Annals of the History of Computing, vol. 15, no. 1, pp. 11–21,
1993.
[5] Top500, “www.top500.org,” 2016. [Online]. Available:
http://top500.org/list/2014/06/
[6] S. Ghiasvand, F. M. Ciorba, R. Tschüter, and W. E. Nagel, “Lessons
learned from spatial and temporal correlation of node failures in high
performance computers,” in 24th Euromicro International Conference on
Parallel, Distributed and Network-Based Processing, Heraklion, Crete,
Greece, Feb. 2016, pp. 377–381.
