84 research outputs found

    Algorithmic Based Fault Tolerance Applied to High Performance Computing

    Full text link
    We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly

    New-Sum: A Novel Online ABFT Scheme for General Iterative Methods

    Get PDF
    Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0:4% and 2:2%) and preconditioned BiCGSTAB (1:0% and 4:0%) for the largest SPD matrix from UFL Sparse Matrix Collection. The evaluation also demonstrates the exibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods

    Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

    Full text link
    General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While researchers often strive for faster performance by using large compute platforms, the increased scale of these systems can raise concerns about hardware and software reliability. In this paper, we present a design for a high-performance GEMM with algorithm-based fault tolerance for use on GPUs. We describe fault-tolerant designs for GEMM at the thread, warp, and threadblock levels, and also provide a baseline GEMM implementation that is competitive with or faster than the state-of-the-art, proprietary cuBLAS GEMM. We present a kernel fusion strategy to overlap and mitigate the memory latency due to fault tolerance with the original GEMM computation. To support a wide range of input matrix shapes and reduce development costs, we present a template-based approach for automatic code generation for both fault-tolerant and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs. Experimental results demonstrate that our baseline GEMM presents comparable or superior performance compared to the closed-source cuBLAS. The fault-tolerant GEMM incurs only a minimal overhead (8.89\% on average) compared to cuBLAS even with hundreds of errors injected per minute. For irregularly shaped inputs, the code generator-generated kernels show remarkable speedups of 160%∼183.5%160\% \sim 183.5\% and 148.55%∼165.12%148.55\% \sim 165.12\% for fault-tolerant and non-fault-tolerant GEMMs, outperforming cuBLAS by up to 41.40%41.40\%.Comment: 11 pages, 2023 International Conference on Supercomputin

    Analysis and design of algorithm-based fault-tolerant systems

    Get PDF
    An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems
    • …
    corecore