15 research outputs found

    Correcting Soft Errors Online in Fast Fourier Transform

    While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detects soft errors online, before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.
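
    As a rough illustration of the checksum idea behind ABFT for the FFT, the sketch below checks one simple invariant of the discrete Fourier transform: the sum of all output bins equals N times the first input sample. This is an offline check on a naive DFT, not the online scheme the abstract describes and not FFTW code; the names `dft` and `checksum_ok` and the tolerance are assumptions made for the example.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def checksum_ok(x, X, tol=1e-9):
    """Check the invariant sum_k X[k] == n * x[0], up to rounding.

    The tolerance is scaled by the magnitude of the output so that
    ordinary floating-point rounding is not reported as an error.
    """
    n = len(x)
    expected = n * x[0]
    observed = sum(X)
    scale = max(1.0, sum(abs(v) for v in X))
    return abs(observed - expected) <= tol * scale


x = [1.0, 2.0, -0.5, 3.0, 0.25, -1.0, 4.0, 0.5]
X = dft(x)
clean = checksum_ok(x, X)          # uncorrupted output passes the check

X_bad = list(X)
X_bad[3] += 0.01                   # simulate a soft error in one output bin
corrupted = checksum_ok(x, X_bad)  # the mismatch is detected
```

    A single unweighted sum like this misses errors that cancel each other and cannot locate the faulty element; practical schemes use weighted checksums and, as in the paper, verify intermediate stages rather than only the final result.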

    New-Sum: A Novel Online ABFT Scheme for General Iterative Methods

    Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from the UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering from various types of soft errors in general iterative methods.
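
    The basic checksum relation for matrix-vector multiplication can be sketched as follows: if c holds the column sums of A, then the sum of the entries of y = A x must equal the dot product of c and x. This is the classic textbook ABFT check, shown only for orientation; it is not the New-Sum encoding itself, and the function names and tolerance are assumptions made for the example.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def column_checksum(A):
    """Column sums of A, i.e. the row vector 1^T A."""
    return [sum(col) for col in zip(*A)]


def verify(c, x, y, tol=1e-9):
    """Check sum(y) == c . x up to a relative rounding tolerance."""
    expected = sum(ci * xi for ci, xi in zip(c, x))
    observed = sum(y)
    scale = max(1.0, abs(expected), abs(observed))
    return abs(expected - observed) <= tol * scale


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]

c = column_checksum(A)   # computed once, before the iterations start
y = matvec(A, x)
ok = verify(c, x, y)     # passes for a fault-free product

y_bad = list(y)
y_bad[1] += 1e-3         # simulate a soft error in one entry of y
detected = not verify(c, x, y_bad)
```

    Because c depends only on A, it can be computed once and reused in every iteration of a solver, so each check costs one extra dot product and one extra sum.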

    Silent Data Corruption Resilient Matrix Factorizations on Distributed Memory System

    The lack of efficient resilience solutions is expected to be a major problem for the coming exascale supercomputers, as the chance that a long-running, large-scale computation can finish without faults is diminishing quickly. In this dissertation I develop algorithmic techniques to provide fault tolerance for the commonly used matrix factorization algorithms and their high-performance implementations on distributed-memory massively parallel systems, with very low overhead and high scalability. Specifically, I design numerical error-correcting encodings of matrices and the corresponding algorithms to tolerate hardware faults during matrix factorizations. This has much in common with the error-correcting codes (ECC) widely used in communication and storage systems, which use codes to detect and correct errors that occurred during communication or at rest in storage cells. The salient difference is that while ECC protects invariable data, I need an ECC for a variable matrix that is under factorization. My previous and current work covers the design of such algorithmic fault tolerance techniques for the six most widely used matrix factorizations: LU, QR, Cholesky, SVD, Hessenberg reduction, and tridiagonal reduction, which comprise the core functionality of the de facto dense linear algebra package ScaLAPACK (Scalable Linear Algebra PACKage). The novel approach I use extensively is on-line ABFT, which not only designs the numerical codes but also modifies the algorithm to maintain the checksum in flight. For LU/QR/Cholesky factorizations, the on-line transformation results in vastly improved fault tolerance at a small extra cost. For SVD/Hessenberg/tridiagonal factorizations, where no ABFT existed, the on-line ABFT fills this void and produces similarly highly scalable, resilient, and efficient algorithms and implementations.
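
    To make "maintaining the checksum in flight" concrete, here is a minimal single-node sketch for Gaussian elimination without pivoting: each row is extended with its row sum, and because elimination applies the same row operations to the checksum column, every row's stored checksum still matches its entries after each step. This is an assumed toy version, not the dissertation's distributed ScaLAPACK encoding; `lu_with_checksum` and `rows_consistent` are hypothetical names, and nonzero pivots are assumed.

```python
def lu_with_checksum(A):
    """Reduce A to upper-triangular form while carrying a checksum column.

    Returns the n x (n+1) working matrix [U | checksum]. No pivoting is
    done, so all pivots are assumed to be nonzero.
    """
    n = len(A)
    M = [row[:] + [sum(row)] for row in A]   # append row sums as column n
    for k in range(n):
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            # The same row update is applied to the checksum entry (j == n).
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    return M


def rows_consistent(M, tol=1e-9):
    """Check that each row's entries still sum to its checksum entry."""
    n = len(M)
    return all(
        abs(sum(row[:n]) - row[n]) <= tol * max(1.0, sum(abs(v) for v in row))
        for row in M
    )


A = [[4.0, 3.0, 2.0],
     [2.0, 1.0, 3.0],
     [3.0, 4.0, 1.0]]

M = lu_with_checksum(A)
intact = rows_consistent(M)        # invariant holds after elimination

M[2][2] += 0.5                     # simulate a soft error in one entry of U
flagged = not rows_consistent(M)   # the affected row no longer matches
```

    Since the invariant holds after every elimination step, the same check could be run between steps to catch an error soon after it occurs rather than only at the end.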

    Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition


    Energy Efficient Parallel Matrix-Matrix Multiplication for DVFS-Enabled Clusters

    Excessive energy consumption has become one of the major challenges in high performance computing. Reducing the energy consumption of frequently used high performance computing applications not only saves energy cost but also reduces greenhouse gas emissions. This paper focuses on developing energy efficient algorithms and software for the widely used matrix-matrix multiplication, so that it consumes less energy in a DVFS-enabled cluster with little sacrifice in performance. The state-of-the-art practical parallel matrix-matrix multiplication algorithm in ScaLAPACK partitions matrices into small blocks and distributes them using a two-dimensional block cyclic distribution approach. Experimental results demonstrate that our energy efficient matrix-matrix multiplication algorithm can save up to 26.35% of energy with about 1% performance penalty, and that the modified PDGEMM of ScaLAPACK saves more than 20% of energy with less than 2% performance loss.

    Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support

