84 research outputs found
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
system. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit-flip) on the fly of a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result while one process failure has happened. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly
New-Sum: A Novel Online ABFT Scheme for General Iterative Methods
Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0:4% and 2:2%) and preconditioned BiCGSTAB (1:0% and 4:0%) for the largest SPD matrix from UFL Sparse Matrix Collection. The evaluation also demonstrates the exibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
General Matrix Multiplication (GEMM) is a crucial algorithm for various
applications such as machine learning and scientific computing, and an
efficient GEMM implementation is essential for the performance of these
systems. While researchers often strive for faster performance by using large
compute platforms, the increased scale of these systems can raise concerns
about hardware and software reliability. In this paper, we present a design for
a high-performance GEMM with algorithm-based fault tolerance for use on GPUs.
We describe fault-tolerant designs for GEMM at the thread, warp, and
threadblock levels, and also provide a baseline GEMM implementation that is
competitive with or faster than the state-of-the-art, proprietary cuBLAS GEMM.
We present a kernel fusion strategy to overlap and mitigate the memory latency
due to fault tolerance with the original GEMM computation. To support a wide
range of input matrix shapes and reduce development costs, we present a
template-based approach for automatic code generation for both fault-tolerant
and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA
Tesla T4 and A100 server GPUs. Experimental results demonstrate that our
baseline GEMM presents comparable or superior performance compared to the
closed-source cuBLAS. The fault-tolerant GEMM incurs only a minimal overhead
(8.89\% on average) compared to cuBLAS even with hundreds of errors injected
per minute. For irregularly shaped inputs, the code generator-generated kernels
show remarkable speedups of and
for fault-tolerant and non-fault-tolerant GEMMs, outperforming cuBLAS by up to
.Comment: 11 pages, 2023 International Conference on Supercomputin
Analysis and design of algorithm-based fault-tolerant systems
An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems
Recommended from our members
Toward Resilience and Data Reduction in Exascale Scientific Computing
Because of the ever-increasing execution scale, reliability and data management are becoming more and more important for scientific applications. On the one hand, exascale systems are anticipated to be more susceptible to soft errors ,e.g. silent data corruptions, due to the reduction in the size of transistors and the increase of the number of components. These errors will lead to corrupted results without warning, making the output of the computation untrustable. On the other hand, large volumes of highly variable data are produced by scientific computing with high velocity on exascale systems or advanced instruments, and the I/O time on storing these data is prohibitive due to the I/O bottleneck in parallel file systems. In this work, we leverage algorithm-based fault tolerance (ABFT) and error-bound lossy compression to tackle the two problems, in order to support efficient scientific computing on exascale systems.We propose an efficient fault tolerant scheme to tolerant soft errors in Fast Fourier Transform (FFT), one of the most important computation kernels widely used in scientific computing. Traditional redundancy approaches will at least double the execution time or resources, limiting the usage in practice because of the large overhead. Previous works on offline ABFT algorithms for FFT mitigate this problem by providing resilient FFT with lower overhead, but these algorithms fail to make progress in vulnerable environments with high error rates because they can only detect and correct errors after the whole computation finishes. We propose an online ABFT scheme for large-scale FFT inspired by the divide-and-conquer nature of the FFT computation. We devise fault tolerant schemes for both computational and memory errors in FFT, with both serial and parallel optimizations. Experimental results demonstrate that the proposed approach provides more timely error detection and recovery as well as better fault coverage with less overhead, compared to the offline ABFT algorithm.To alleviate the I/O bottleneck in the parallel file systems, we work on a prediction-based error-bounded lossy compressor to significantly reduce the size of scientific datasets while retaining the accuracy of the decompressed data, with adaptive prediction algorithms and compression models. We first propose a regression-based predictor for better prediction accuracy than traditional approaches under large error bounds, followed by an adaptive algorithm that dynamically selects between the traditional Lorenzo predictor and the proposed regression-based predictor, leading to very high compression ratios with little visual distortion. We further unify the prediction-based model and transform-baed model by using transform-based compressors as a predictor, with novel optimizations toward efficient coefficient encoding for both the two models. The proposed adaptive multi-algorithm design provides better compression ratios given the same distortion, significantly reducing storage requirements and I/O time.We further adapt the compression algorithms and compressors to different requirements and/or objectives in realistic scenarios. We leverage a logarithmic transform to precondition the data, which turns a relative-error-bound compression problem into an absolute-error-bound compression problem. This transform aligns two different error requirements while improving the compression quality, efficiently reducing the workload for compressor design. We also correlate the compression algorithm with system information to achieve better I/O performance compared to traditional single compressor deployment. These studies further improve the efficiency of lossy compression from the perspective of efficient I/O in the context of scientific simulation, making scientific applications running on exascale systems more efficient
- …