Heterogeneity aware fault tolerance for extreme scale computing

Abstract

Upcoming Extreme Scale, or Exascale, Computing Systems are expected to deliver a peak performance of at least 10^18 floating point operations per second (FLOPS), primarily through significant expansion in scale. A major concern for such large scale systems, however, is how to deal with failures in the system. This is because the impact of failures on system efficiency, while utilizing existing fault tolerance techniques, generally also increases with scale. Hence, current research effort in this area has been directed at optimizing various aspects of fault tolerance techniques to reduce their overhead at scale. One characteristic that has been overlooked so far, however, is heterogeneity, specifically in the rate at which individual components of the underlying system fail, and in the execution profile of a parallel application running on such a system. In this thesis, we investigate the implications of such types of heterogeneity for fault tolerance in large scale high performance computing (HPC) systems. To that end, we 1) study how knowledge of heterogeneity in system failure likelihoods can be utilized to make current fault tolerance schemes more efficient, 2) assess the feasibility of utilizing application imbalance for improved fault tolerance at scale, and 3) propose and evaluate changes to system level resource managers in order to achieve reliable job placement over resources with unequal failure likelihoods. The results in this thesis, taken together, demonstrate that heterogeneity in failure likelihoods significantly changes the landscape of fault tolerance for large scale HPC systems

    Similar works