Search CORE

15 research outputs found

Correcting Soft Errors Online in Fast Fourier Transform

Author: Chen Jieyang
Chen Zizhong
Li Hongbo
Li Sihuan
Liang Xin
Liu Yuanlai
Ouyang Kaiming
Song Fengguang
Tao Dingwen
Wu Panruo
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2017

While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detect soft errors online before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.

IUPUIScholarWorks
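
As a rough illustration of the offline style of FFT checking that the abstract above contrasts with its online scheme, the sketch below verifies a transform after it finishes using two elementary DFT identities (X[0] equals the sum of the inputs, and the sum of the outputs equals n times x[0]). This is not the paper's method and not FFTW code; the naive `dft` stand-in, the function names, and the tolerance are assumptions made for the example, and a check this simple can miss errors that happen to cancel.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used here only as a stand-in for a real FFT routine."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def checksum_ok(x, X, tol=1e-9):
    """Offline check after the transform: X[0] == sum(x) and sum(X) == n * x[0]."""
    n = len(x)
    scale = max(1.0, n * max(abs(v) for v in x))  # assumed tolerance scaling
    return (abs(X[0] - sum(x)) <= tol * scale
            and abs(sum(X) - n * x[0]) <= tol * scale)

x = [1.0, 2.0, -0.5, 3.0, 0.25, -1.0, 4.0, 0.0]
X = dft(x)
clean = checksum_ok(x, X)          # fault-free transform passes the check

X_bad = list(X)
X_bad[3] += 1e-3                   # simulate a soft error in one output element
corrupted = checksum_ok(x, X_bad)  # the perturbed output fails the check
```

An online scheme, as the abstract describes it, would apply checks of this kind to intermediate stages so that a corrupted run can be stopped before the full transform completes.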

New-Sum: A Novel Online ABFT Scheme for General Iterative Methods

Author: For full list of authors see publisher's website.
Krishnamoorthy Sriram
Liang Xin
Song Shuaiwen Leon
Tao Dingwen
Wu Panruo
Zhang Eddy Z.
Publication venue: Scholars' Mine
Publication date: 31/05/2016

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from the UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering from various types of soft errors in general iterative methods.

Missouri University of Science and Technology (Missouri S&T): Scholars' Mine
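
For readers unfamiliar with checksum encodings for matrix-vector multiplication, the sketch below shows the classic textbook form of the idea: precompute the column checksum c = eᵀA once, then after computing y = Ax confirm that the sum of y matches c·x. This is only the standard baseline, not the New-Sum encoding from the abstract above; the function names and the tolerance are assumptions chosen for the example.

```python
def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def column_checksum(A):
    """c[j] = sum over i of A[i][j], i.e. c = e^T A (computed once per matrix)."""
    return [sum(col) for col in zip(*A)]

def verify(c, x, y, tol=1e-9):
    """Check the invariant sum(y) == c . x within an assumed relative tolerance."""
    expected = sum(cj * xj for cj, xj in zip(c, x))
    return abs(expected - sum(y)) <= tol * max(1.0, abs(expected))

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]

c = column_checksum(A)
y = matvec(A, x)
ok = verify(c, x, y)        # fault-free product satisfies the invariant

y_bad = list(y)
y_bad[1] += 0.5             # simulate a soft error in one result element
bad = verify(c, x, y_bad)   # the corrupted product violates the invariant
```

In an iterative solver the checksum vector can be reused across iterations, since the matrix does not change; only the cheap dot product and sum are repeated.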

Silent Data Corruption Resilient Matrix Factorizations on Distributed Memory System

Author: Wu Panruo
Publication venue: eScholarship, University of California
Publication date: 01/01/2016

The lack of efficient resilience solutions is expected to be a major problem for the coming exascale supercomputers, as the chance that a long-running, large-scale computation can finish without faults is diminishing quickly. In this dissertation I develop algorithmic techniques to provide fault tolerance for the commonly used matrix factorization algorithms and their high-performance implementations on distributed-memory massively parallel systems, with very low overhead and high scalability. Specifically, I design numerical error-correcting encodings of matrices and the corresponding algorithms to tolerate hardware faults during matrix factorizations. This is similar to the error correcting codes (ECC) widely used in communication and storage systems, which detect and correct errors that occur during communication or at rest in storage cells. The salient difference is that while ECC protects invariable data, I need an ECC for a variable matrix that is under factorization. My previous and current work covers the design of such algorithmic fault tolerance techniques for the six most widely used matrix factorizations (LU, QR, Cholesky, SVD, Hessenberg reduction, and tridiagonal reduction), which comprise the core functionality of the de facto dense linear algebra package ScaLAPACK (Scalable Linear Algebra PACKage). The novel approach I use extensively is on-line ABFT, which not only designs the numerical codes but also modifies the algorithm to maintain the checksum in flight. For LU/QR/Cholesky factorizations, the on-line transformation results in vastly improved fault tolerance at a small extra cost. For SVD/Hessenberg/tridiagonal factorizations, where no ABFT existed, the on-line ABFT fills this void and produces similarly highly scalable, resilient, and efficient algorithms and implementations.

Ezid

eScholarship - University of California
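
The idea of maintaining a checksum "in flight" during a factorization can be sketched for LU as follows: append a row-sum column to A, run Gaussian elimination on the encoded matrix, and re-check every row's checksum after each elimination step. Because row operations are linear, the relation survives the updates in a fault-free run. This is a minimal serial sketch without pivoting, not the dissertation's ScaLAPACK implementation; the `inject` parameter is a hypothetical hook added only to simulate a soft error, and the tolerance is an assumption.

```python
def lu_with_checksum(A, tol=1e-9, inject=None):
    """Gaussian elimination (no pivoting) on [A | A*e], verified after every step.

    inject is a hypothetical test hook (step, row, col, delta) that perturbs
    one entry after the given step to simulate a soft error.
    Returns the encoded upper-triangular factor [U | U*e].
    """
    n = len(A)
    # Encode: append each row's sum as an extra checksum column.
    M = [list(row) + [sum(row)] for row in A]
    for k in range(n - 1):
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            # The checksum column (index n) is updated like any other column.
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
        if inject is not None and inject[0] == k:
            _, r, c, delta = inject
            M[r][c] += delta
        # Online check: every row must still sum to its checksum entry.
        for i in range(n):
            if abs(sum(M[i][:n]) - M[i][n]) > tol * max(1.0, abs(M[i][n])):
                raise ArithmeticError(f"checksum mismatch in row {i} after step {k}")
    return M

A = [[4.0, 3.0, 2.0],
     [2.0, 1.0, 3.0],
     [3.0, 4.0, 1.0]]

U = lu_with_checksum(A)  # fault-free run completes

try:
    lu_with_checksum(A, inject=(0, 1, 1, 0.1))
    detected = False
except ArithmeticError:
    detected = True      # the error is caught right after step 0
```

Checking after every step is what makes the scheme online: the fault is flagged at the step where it occurs rather than after the whole factorization.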

Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition

Author: Doug Hakkarinen
Panruo Wu
Zizhong Chen
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Energy Efficient Parallel Matrix-Matrix Multiplication for DVFS-Enabled Clusters

Author: Chen Longxiang
Chen Zizhong
Ge Rong
Wu Panruo
Publication venue: e-Publications@Marquette
Publication date: 01/01/2012

Excessive energy consumption has become one of the major challenges in high performance computing. Reducing the energy consumption of frequently used high performance computing applications not only saves energy costs but also reduces greenhouse gas emissions. This paper focuses on developing energy efficient algorithms and software for the widely used matrix-matrix multiplication, so that it consumes less energy in a DVFS-enabled cluster with little sacrifice in performance. The state-of-the-art practical parallel matrix-matrix multiplication algorithm in ScaLAPACK partitions matrices into small blocks and distributes them using a two-dimensional block-cyclic distribution. Experimental results demonstrate that our energy efficient matrix-matrix multiplication algorithm can save up to 26.35% of energy with about a 1% performance penalty. The modified PDGEMM of ScaLAPACK saves more than 20% of energy with less than 2% performance loss.

epublications@Marquette

Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support

Author: James Z. Wang
Jia Li
Jianbo Ye
Panruo Wu
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Symmetric Indefinite Linear Solver using OpenMP Task on Manycore Architectures

Author: Dongarra Jack
Kurzak Jakub
Wu Panruo
Yamazaki Ichitaro
Zounon Mawussi
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

Crossref

The University of Manchester - Institutional Repository

Correcting Soft Errors Online in Fast Fourier Transform

Author: Chen Jieyang
For full list of authors see publisher's website.
Li Hongbo
Li Sihuan
Liang Xin
Tao Dingwen
Wu Panruo
Publication venue: Association for Computing Machinery (ACM)
Publication date: 12/11/2017

While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detect soft errors online before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.

Missouri University of Science and Technology (Missouri S&T): Scholars' Mine