12 research outputs found
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
system. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit-flip) on the fly of a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result while one process failure has happened. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly
Exploiting Data Representation for Fault Tolerance
We explore the link between data representation and soft errors in dot
products. We present an analytic model for the absolute error introduced should
a soft error corrupt a bit in an IEEE-754 floating-point number. We show how
this finding relates to the fundamental linear algebra concepts of
normalization and matrix equilibration. We present a case study illustrating
that the probability of experiencing a large error in a dot product is
minimized when both vectors are normalized. Furthermore, when data is
normalized we show that the absolute error is less than one or very large,
which allows us to detect large errors. We demonstrate how this finding can be
used by instrumenting the GMRES iterative solver. We count all possible errors
that can be introduced through faults in arithmetic in the computationally
intensive orthogonalization phase, and show that when scaling is used the
absolute error can be bounded above by one
Algorithmic Cholesky factorization fault recovery
Abstract—Modeling and analysis of large scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this often will be performed in high performance clusters containing many processors. Assuming a constant failure rate per processor, the probability of a failure occurring during the execution increases linearly with additional processors. Fault tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault tolerant Cholesky factorization algorithm that does not require checkpointing for recovery from fail-stop failures. Rather, this algorithm uses redundant data added in an additional set of processors. This differs from previous works with algorithmic methods as it addresses fail-stop failures rather than fail-continue cases. The implemen-tation and experimentation using ScaLAPACK demonstrates that this method has decreasing overhead in relation to overall runtime as the matrix size increases, and thus shows promise to reduce the expected runtime for Cholesky factorizations on very large matrices
Towards resilient parallel linear Krylov solvers: recover-restart strategies
The advent of extreme scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hardware faults. High Performance Computing (HPC) applications that aim at exploiting all these resources will thus need to be resilient, \emph{i.e.}, be able to compute a correct solution in presence of faults. In this work, we investigate possible remedies in the framework of the solution of large sparse linear systems that is often the inner most numerical kernel in many scientific and engineering applications and also one of the most time consuming part. More precisely, we present recovery followed by restarting strategies in the framework of Krylov subspace solvers where lost entries of the iterate are interpolated to define a new initial guess before restarting. In particular, we consider two interpolation policies that preserve key numerical properties of well-known solvers, namely the monotony decrease of the A-norm of the error of the conjugate gradient (CG) or the residual norm decrease of GMRES. We assess the impact of the recovery method, the fault rate and the number of processors on the robustness of the resulting linear solvers. We consider experiments with CG, GMRES and Bi-CGStab
Analysis and design of algorithm-based fault-tolerant systems
An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems
Recommended from our members
Toward Resilience and Data Reduction in Exascale Scientific Computing
Because of the ever-increasing execution scale, reliability and data management are becoming more and more important for scientific applications. On the one hand, exascale systems are anticipated to be more susceptible to soft errors ,e.g. silent data corruptions, due to the reduction in the size of transistors and the increase of the number of components. These errors will lead to corrupted results without warning, making the output of the computation untrustable. On the other hand, large volumes of highly variable data are produced by scientific computing with high velocity on exascale systems or advanced instruments, and the I/O time on storing these data is prohibitive due to the I/O bottleneck in parallel file systems. In this work, we leverage algorithm-based fault tolerance (ABFT) and error-bound lossy compression to tackle the two problems, in order to support efficient scientific computing on exascale systems.We propose an efficient fault tolerant scheme to tolerant soft errors in Fast Fourier Transform (FFT), one of the most important computation kernels widely used in scientific computing. Traditional redundancy approaches will at least double the execution time or resources, limiting the usage in practice because of the large overhead. Previous works on offline ABFT algorithms for FFT mitigate this problem by providing resilient FFT with lower overhead, but these algorithms fail to make progress in vulnerable environments with high error rates because they can only detect and correct errors after the whole computation finishes. We propose an online ABFT scheme for large-scale FFT inspired by the divide-and-conquer nature of the FFT computation. We devise fault tolerant schemes for both computational and memory errors in FFT, with both serial and parallel optimizations. Experimental results demonstrate that the proposed approach provides more timely error detection and recovery as well as better fault coverage with less overhead, compared to the offline ABFT algorithm.To alleviate the I/O bottleneck in the parallel file systems, we work on a prediction-based error-bounded lossy compressor to significantly reduce the size of scientific datasets while retaining the accuracy of the decompressed data, with adaptive prediction algorithms and compression models. We first propose a regression-based predictor for better prediction accuracy than traditional approaches under large error bounds, followed by an adaptive algorithm that dynamically selects between the traditional Lorenzo predictor and the proposed regression-based predictor, leading to very high compression ratios with little visual distortion. We further unify the prediction-based model and transform-baed model by using transform-based compressors as a predictor, with novel optimizations toward efficient coefficient encoding for both the two models. The proposed adaptive multi-algorithm design provides better compression ratios given the same distortion, significantly reducing storage requirements and I/O time.We further adapt the compression algorithms and compressors to different requirements and/or objectives in realistic scenarios. We leverage a logarithmic transform to precondition the data, which turns a relative-error-bound compression problem into an absolute-error-bound compression problem. This transform aligns two different error requirements while improving the compression quality, efficiently reducing the workload for compressor design. We also correlate the compression algorithm with system information to achieve better I/O performance compared to traditional single compressor deployment. These studies further improve the efficiency of lossy compression from the perspective of efficient I/O in the context of scientific simulation, making scientific applications running on exascale systems more efficient
Algorithm Based Fault Tolerance: A Perspective from Algorithmic and Communication Characteristics of Parallel Algorithms
Checkpoint and recovery cost imposed by checkpoint/restart (CP/R) is a crucial performance issue for high-performance computing (HPC) applications. In comparison, Algorithm-Based Fault Tolerance (ABFT) is a promising fault tolerance method with low recovery overhead, but it suffers from the inadequacy of universal applicability, i.e., tied to a specific application or algorithm. Till date, providing fault tolerance for matrix-based algorithms for linear systems has been the research focus of ABFT schemes. As a consequence, it necessitates a comprehensive exploration of ABFT research to widen its scope to other types of parallel algorithms and applications. In this thesis, we go beyond traditional ABFT and focus on other types of parallel applications not covered by traditional ABFT. In that regard, rather than an emphasis on a single application at a time, we consider the algorithmic and communication characteristics of a class of parallel applications to design efficient fault tolerance and recovery strategies for that class of parallel applications. The communication characteristics determine how to distributively replicate the fault recovery data (we call it the {\em critical data}) of a process, and the algorithmic characteristics determine what the application-specific data is to be replicated to minimize fault tolerance and recovery cost. Based on communication characteristics, parallel algorithms can be broadly classified as (i) embarrassingly parallel algorithms, where processes have infrequent or rare interactions, and (ii) communication-intensive parallel algorithms, where processes have significant interactions. In this thesis, through different case studies, we design ABFT for these two categories of algorithms by considering their algorithmic and communication characteristics. Analysis of these parallel algorithms reveals that a process contains sufficient information that can help to rebuild a computational state if any failure occurs during the computation. We define this information as critical data, the minimal application-level data required to be saved (securely) so that a failed process can be fully recovered from a most recent consistent state using this fault recovery data. How the communication dependencies among processes are utilized to replicate fault recovery data is directly related to the system’s fault tolerance performance. We propose ABFT for parallel search algorithms, which belong to the class of embarrassingly parallel algorithms. Parallel search algorithms are the well-known solution techniques for discrete optimization problems (DOP). DOP covers a broad class of (parallel) applications from search problems in AI to computer games, e.g., Chess and various games, traveling salesman problem, various AI search problems. As a case study, we choose the parallel iterative deepening A* (PIDA*) algorithm and integrate application-level fault tolerance with the algorithm by replicating critical data periodically to make it resilient. In the category of communication-intensive algorithms, we choose Dynamic programming (DP) which is a widely used algorithm paradigm for optimization problems. We choose parallel DP algorithm as a case study and propose ABFT for such applications. We present a detailed analysis of the characteristics of parallel DP algorithms and show that the algorithmic features reduce the cardinality of critical data into a single data in case of -data dependent task. We demonstrate the idea with two popular DP class of applications: (i) the traveling salesman problem (TSP), and (ii) the longest common subsequence (LCS) problem. Minimal storage and recovery overhead are the prime concern in FT design. On that regard, we demonstrate that further optimization in critical data is possible for particular DP class of problems, where the degree of dependency for a subproblem is small and fixed at each iteration. We discuss it with the 0/1 knapsack problem as a case study and propose an ABFT scheme where, instead of replicating the critical data, we replicate a bit-vector flag in peer process's memory which is later used to rebuild the lost data of a failed process. Theoretical and experimental results demonstrate that our proposed methods perform significantly better than the conventional CP/R in terms of fault tolerance and recovery overheads, and also in storage overhead in the presence of single and multiple simultaneous failures