Search CORE

4 research outputs found

A Backward/Forward Recovery Approach for the Preconditioned Conjugate Gradient Method

Author: Fasi Massimiliano
Langou Julien
Robert Yves
Uçar Bora
Publication date: 04/11/2015

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167--176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every $d$ iterations, and to checkpoint every $c \times d$ iterations. When a silent error is detected by the verification mechanism, one can rollback to and re-execute from the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for error detection and correction, allowing a forward recovery (and no rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the preconditioned conjugate gradient algorithm. Finally, we validate our new approach through a set of simulations.
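The verify-every-d, checkpoint-every-c×d schedule with rollback described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `run_with_recovery`, its parameters, the toy iteration, and the fault-injection set `faults` are all assumptions introduced here, and the toy invariant check merely stands in for a real stability test or ABFT check.

```python
import copy


def run_with_recovery(step, verify, corrupt, state, n_iters, d, c, faults=()):
    """Run n_iters iterations, verifying every d and checkpointing every c*d.

    All names are hypothetical. `faults` lists iteration numbers at which a
    silent error is injected (once each), purely for demonstration.
    """
    checkpoint = (0, copy.deepcopy(state))  # the initial state is trusted
    pending = set(faults)
    it = 0
    rollbacks = 0
    while it < n_iters:
        state = step(state)
        it += 1
        if it in pending:
            pending.discard(it)  # each injected fault strikes only once
            state = corrupt(state)
        if it % d != 0:
            continue  # no verification scheduled at this iteration
        if not verify(state):
            # Backward recovery: restore the last checkpoint and re-execute.
            it, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            rollbacks += 1
        elif it % (c * d) == 0:
            # Checkpoint only a state that has just passed verification.
            checkpoint = (it, copy.deepcopy(state))
    return state, rollbacks


# Toy iteration (an assumption): state (k, s) must satisfy s == 1 + 2 + ... + k.
def step(state):
    k, s = state
    return (k + 1, s + k + 1)


def verify(state):
    k, s = state
    return s == k * (k + 1) // 2


def corrupt(state):
    k, s = state
    return (k, s + 1000)


final, rollbacks = run_with_recovery(
    step, verify, corrupt, (0, 0), n_iters=20, d=2, c=3, faults={7}
)
print(final, rollbacks)
```

In this run the error injected at iteration 7 is caught by the verification at iteration 8, and the computation restarts from the checkpoint taken at iteration 6. Larger values of d and c lower the verification and checkpoint overhead but increase the amount of work lost per detected error, which is the trade-off the abstract's performance model is meant to capture.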

arXiv.org e-Print Archive

HAL-ENS-LYON

Durham Research Online

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

MIMS EPrints

Hal-Diderot

A backward/forward recovery approach for the preconditioned conjugate gradient method

Author: Anfinson
Aupy
Benoit
Benson
Benzi
Benzi
Bosilca
Bougeret
Bridges
Bronevetsky
Cappello
Cappello
Cappello
Chen
Chen
Chung
Daly
Davis
Du
Du
Edelman
Elliott
Elliott
Fasi
Fasi
Fiala
Hakkarinen
Heroux
Higham
Higham
Hoemmen
Hoemmen
Huang
Hwang
Kaya
Lu
Lyons
Mitzenmacher
Moody
Saad
Sao
Schroeder
Shantharam
Sloan
Stoyanov
Yao
Young
Publication venue: Elsevier BV

Crossref

Efficient fault tolerance for selected scientific computing algorithms on heterogeneous and approximate computer architectures

Author: Schöll Alexander
Publication date: 01/01/2018

Scientific computing and simulation technology play an essential role in solving central challenges in science and engineering. The high computational power of heterogeneous computer architectures makes it possible to accelerate applications in these domains, which are often dominated by compute-intensive mathematical tasks. Scientific, economic and political decision processes increasingly rely on such applications and therefore create a strong demand for correct and trustworthy results. However, continued semiconductor technology scaling increasingly poses serious threats to the reliability and efficiency of upcoming devices. Different reliability threats can cause crashes or erroneous results without indication. Software-based fault tolerance techniques can protect algorithmic tasks by adding appropriate operations to detect and correct errors at runtime. Major challenges arise from the runtime overhead of such operations and from rounding errors in floating-point arithmetic that can cause false positives. The end of Dennard scaling makes it harder to further increase compute efficiency between semiconductor technology generations. Approximate computing exploits the inherent error resilience of different applications to achieve efficiency gains with respect to, for instance, power, energy, and execution times. However, scientific applications often impose strict accuracy requirements, which call for careful use of approximation techniques. This thesis provides fault tolerance and approximate computing methods that enable the reliable and efficient execution of linear algebra operations and Conjugate Gradient solvers using heterogeneous and approximate computer architectures. The presented fault tolerance techniques detect and correct errors at runtime with low runtime overhead and high error coverage. At the same time, these fault tolerance techniques are exploited to enable the execution of the Conjugate Gradient solvers on approximate hardware by monitoring the underlying error resilience while adjusting the approximation error accordingly. In addition, parameter evaluation and estimation methods are presented that determine the computational efficiency of application executions on approximate hardware. An extensive experimental evaluation shows the efficiency and efficacy of the presented methods with respect to the runtime overhead to detect and correct errors, the error coverage, and the achieved energy reduction in executing the Conjugate Gradient solvers on approximate hardware.
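To illustrate why rounding errors matter for software-based error detection, here is a minimal sketch of a checksum test for a matrix-vector product, the kernel at the heart of Conjugate Gradient. It relies on the identity that the sum of the entries of y = A x equals (column sums of A) · x. The function names and the tolerance rule are assumptions made for this example and are not taken from the thesis; the relative tolerance is a simple heuristic, not a rigorous rounding-error bound.

```python
import math


def column_checksums(A):
    """Checksum row of A: entry j is the sum of column j."""
    return [math.fsum(row[j] for row in A) for j in range(len(A[0]))]


def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [math.fsum(a * v for a, v in zip(row, x)) for row in A]


def checksum_ok(col_sums, x, y, rel_tol=1e-9):
    """Compare sum(y) with col_sums . x, allowing for rounding error.

    An exact equality test would risk flagging harmless rounding
    differences as errors (false positives), so a tolerance is used.
    The scaling below is a heuristic assumption.
    """
    expected = math.fsum(c * v for c, v in zip(col_sums, x))
    actual = math.fsum(y)
    scale = max(1.0, math.fsum(abs(c * v) for c, v in zip(col_sums, x)))
    return abs(expected - actual) <= rel_tol * scale


A = [[0.1, 0.2], [0.3, 0.4]]
x = [0.7, 0.9]
col_sums = column_checksums(A)

y = matvec(A, x)
clean_ok = checksum_ok(col_sums, x, y)

y_bad = list(y)
y_bad[0] += 1e-3  # simulated silent error in one output entry
corrupted_ok = checksum_ok(col_sums, x, y_bad)

print(clean_ok, corrupted_ok)
```

The choice of `rel_tol` reflects the tension named in the abstract: a tolerance that is too tight produces false positives from ordinary rounding, while one that is too loose lets small errors pass undetected.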