28 research outputs found

    Evaluating the Impact of SDC on the GMRES Iterative Solver

    Increasing parallelism and transistor density, along with increasingly tighter energy and peak power constraints, may force exposure of occasionally incorrect computation or storage to application codes. Silent data corruption (SDC) will likely be infrequent, yet one SDC suffices to make numerical algorithms like iterative linear solvers cease progress towards the correct answer. Thus, we focus on resilience of the iterative linear solver GMRES to a single transient SDC. We derive inexpensive checks to detect the effects of an SDC in GMRES that work for a more general SDC model than presuming a bit flip. Our experiments show that when GMRES is used as the inner solver of an inner-outer iteration, it can "run through" SDC of almost any magnitude in the computationally intensive orthogonalization phase. That is, it gets the right answer using faulty data without any required rollback. Those SDCs which it cannot run through are caught by our detection scheme.
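    A minimal sketch of the inner-outer structure this abstract describes, assuming SciPy's gmres as the (possibly unreliable) inner solver: the outer loop recomputes the true residual reliably, so a corrupted inner correction is "run through" and merely delays convergence. The injected fault, tolerances, and test matrix below are illustrative assumptions, not the paper's fault model or its derived detection checks.

```python
# Inner-outer iteration sketch: the inner GMRES solve may silently return a
# corrupted correction; the outer loop checks the explicitly recomputed
# residual, so a bad correction only delays convergence.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(0)
n = 200
A = diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()  # illustrative matrix
b = rng.standard_normal(n)

x = np.zeros(n)
for outer in range(50):
    r = b - A @ x                       # reliable: true residual of the current iterate
    if np.linalg.norm(r) <= 1e-8 * np.linalg.norm(b):
        break
    d, _ = gmres(A, r, restart=20, maxiter=20)   # "unreliable" inner solve
    if outer == 2:                      # inject one transient SDC into the correction
        d[rng.integers(n)] *= 1e6
    x = x + d                           # the corrupted step is corrected in later
                                        # outer iterations; no rollback is needed
print("outer iterations:", outer, "final residual:", np.linalg.norm(b - A @ x))
```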

    Resilience in Numerical Methods: A Position on Fault Models and Methodologies

    Future extreme-scale computer systems may expose silent data corruption (SDC) to applications, in order to save energy or increase performance. However, resilience research struggles to come up with useful abstract programming models for reasoning about SDC. Existing work randomly flips bits in running applications, but this only shows average-case behavior for a low-level, artificial hardware model. Algorithm developers need to understand worst-case behavior with the higher-level data types they actually use, in order to make their algorithms more resilient. Also, we know so little about how SDC may manifest in future hardware that it seems premature to draw conclusions about the average case. We argue instead that numerical algorithms can benefit from a numerical unreliability fault model, where faults manifest as unbounded perturbations to floating-point data. Algorithms can use inexpensive "sanity" checks that bound or exclude error in the results of computations. Given a selective reliability programming model that requires reliability only when and where needed, such checks can make algorithms reliable despite unbounded faults. Sanity checks, and in general a healthy skepticism about the correctness of subroutines, are wise even if hardware is perfectly reliable.
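    As an illustration of the kind of inexpensive "sanity" check the abstract advocates, the sketch below wraps a dot product with the Cauchy-Schwarz bound |x·y| <= ||x|| ||y||: any correctly computed result satisfies it up to rounding, so an unbounded perturbation that lands outside the bound is flagged. The helper name checked_dot, the slack factor, and the assumption that the norms themselves are computed reliably (selective reliability) are choices made for this sketch, not the paper's interface.

```python
# Illustrative "sanity check" under a numerical-unreliability fault model:
# a correctly computed dot product must satisfy |x.y| <= ||x|| ||y||
# (Cauchy-Schwarz), so results far outside that bound are rejected.
import numpy as np

def checked_dot(x, y, slack=1.0 + 1e-12):
    """Return (value, ok): ok is False if the result violates Cauchy-Schwarz."""
    val = np.dot(x, y)                               # possibly computed unreliably
    bound = np.linalg.norm(x) * np.linalg.norm(y)    # assumed computed reliably
    return val, abs(val) <= slack * bound

x = np.random.default_rng(1).standard_normal(1000)
y = np.random.default_rng(2).standard_normal(1000)

val, ok = checked_dot(x, y)
print(ok)                         # True: an honest result passes the check

val_faulty = val + 1e30           # model an unbounded perturbation of the output
print(abs(val_faulty) <= np.linalg.norm(x) * np.linalg.norm(y))   # False: caught
```

    A check like this excludes arbitrarily large errors but deliberately tolerates perturbations that stay inside the bound, which matches the bounded-error guarantee the abstract describes.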

    Exploiting Data Representation for Fault Tolerance

    We explore the link between data representation and soft errors in dot products. We present an analytic model for the absolute error introduced should a soft error corrupt a bit in an IEEE-754 floating-point number. We show how this finding relates to the fundamental linear algebra concepts of normalization and matrix equilibration. We present a case study illustrating that the probability of experiencing a large error in a dot product is minimized when both vectors are normalized. Furthermore, when data is normalized, we show that the absolute error is either less than one or very large, which allows us to detect large errors. We demonstrate how this finding can be used by instrumenting the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase, and show that when scaling is used the absolute error can be bounded above by one.
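    The claim about normalized data can be illustrated with a small bit-flip experiment (a sketch under assumptions, not the paper's analytic model): flip each of the 64 bits of an IEEE-754 double in turn and record the absolute error, once for a value of magnitude at most one and once for a large value.

```python
# Bit-flip sketch: for a value with |x| <= 1, most single-bit flips in its
# IEEE-754 double representation perturb it by less than one, while flips in
# the high exponent bits are off by hundreds of orders of magnitude and are
# therefore easy to flag. Larger values see more flips land in between.
import numpy as np

def flip_bit(x, k):
    """Return float64 x with bit k (0 = least significant) flipped."""
    u = np.array(x, dtype=np.float64).view(np.uint64)
    mask = np.uint64(1) << np.uint64(k)
    return float((u ^ mask).view(np.float64))

for x in (0.75, 1.0e8):               # a value of magnitude <= 1 vs. a large one
    errs = [abs(flip_bit(x, k) - x) for k in range(64)]
    small = sum(e < 1.0 for e in errs)
    print(f"x = {x:>12}: {small} of 64 single-bit flips give |error| < 1; "
          f"largest error = {max(errs):.3e}")
```

    Roughly, the small value produces errors that are either below one or astronomically large, while the large value yields many errors in an intermediate range that is harder to distinguish from legitimate data; this is the detection gap that normalization closes.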

    Communication-Avoiding Krylov Subspace Methods

    The cost of an algorithm includes both arithmetic and communication. We use "communication" in a general sense to mean data movement, either between levels of a memory hierarchy ("sequential") or between processors ("parallel"). Communication costs include both bandwidth terms, which are proportional to the number of words sent, and latency terms, which are proportional to the number of messages in which the data is sent. Communication costs are much higher than arithmetic costs, and the gap is increasing rapidly for technological reasons. This suggests that for best performance, algorithms should minimize communication, even if that may require some redundant arithmetic operations. We call such algorithms "communication-avoiding." Krylov subspace methods (KSMs) are iterative algorithms for solving large, sparse linear systems and eigenvalue problems. Current KSMs rely on sparse matrix-vector multiply (SpMV) and vector-vector operations (like dot products and vector sums). All of these operations are communication-bound. Furthermore, data dependencies between them mean that only a small amount of that communication can be hidden. Many important scientific and engineering computations spend much of their time in Krylov methods, so the performance of many codes could be improved by introducing KSMs that communicate less. Our goal is to take s steps of a KSM for the same communication cost as 1 step, which would be optimal. We call the resulting KSMs "communication-avoiding Krylov methods." We can do this under certain assumptions on the matrix, and for certain KSMs, both in theory (for many KSMs) and in practice (for some KSMs). Our algorithms are based on the so-called "s-step" Krylov methods, which break up the data dependency between the sparse matrix-vector multiply and the dot products in standard Krylov methods.
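    To make the data-dependency point concrete, here is a minimal sketch of the s-step idea: generate the whole Krylov basis up front with s back-to-back SpMVs (a matrix powers kernel can do this for roughly the communication cost of one SpMV under suitable assumptions on the matrix), then orthogonalize all of the vectors in one block operation instead of interleaving each SpMV with dot products. The monomial basis and dense QR below are deliberate simplifications; the thesis uses better-conditioned bases and communication-avoiding kernels such as TSQR.

```python
# s-step sketch: build the Krylov basis [v, Av, ..., A^s v] with s SpMVs and
# no intervening dot products, then orthogonalize all s+1 vectors at once.
import numpy as np
from scipy.sparse import diags

n, s = 1000, 8
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()  # illustrative matrix
v = np.random.default_rng(0).standard_normal(n)

V = np.empty((n, s + 1))
V[:, 0] = v / np.linalg.norm(v)
for j in range(s):                      # s SpMVs, no dot products in between
    V[:, j + 1] = A @ V[:, j]

Q, R = np.linalg.qr(V)                  # one block orthogonalization (TSQR stand-in)
print(np.linalg.norm(Q.T @ Q - np.eye(s + 1)))   # ~1e-15: orthonormal basis
```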
    This idea has been around for a while, but in contrast to prior work (discussed in detail in Section 1.5), this thesis makes the following contributions:
    * We have fast kernels replacing SpMV, that can compute the results of s calls to SpMV for the same communication cost as one call (Section 2.1).
    * We have fast dense kernels as well, such as Tall Skinny QR (TSQR -- Section 2.3) and Block Gram-Schmidt (BGS -- Section 2.4), which can do the work of Modified Gram-Schmidt applied to s vectors for a factor of Theta(s^2) fewer messages in parallel, and a factor of Theta(s/W) fewer words transferred between levels of the memory hierarchy (where W is the fast memory capacity in words). (A minimal TSQR sketch follows this abstract.)
    * We have new communication-avoiding Block Gram-Schmidt algorithms for orthogonalization in more general inner products (Section 2.5).
    * We have new communication-avoiding versions of the following Krylov subspace methods for solving linear systems: the Generalized Minimum Residual method (GMRES -- Section 3.4), both unpreconditioned and preconditioned, and the Method of Conjugate Gradients (CG), both unpreconditioned (Section 5.4) and left-preconditioned (Section 5.5).
    * We have new communication-avoiding versions of the following Krylov subspace methods for solving eigenvalue problems, both standard (Ax = λx, for a nonsingular matrix A) and "generalized" (Ax = λMx, for nonsingular matrices A and M): Arnoldi iteration (Section 3.3), and Lanczos iteration, both for Ax = λx (Section 4.2) and Ax = λMx (Section 4.3).
    * We propose techniques for developing communication-avoiding versions of nonsymmetric Lanczos iteration (for solving nonsymmetric eigenvalue problems Ax = λx) and the Method of Biconjugate Gradients (BiCG) for solving linear systems. See Chapter 6 for details.
    * We can combine more stable numerical formulations that use different bases of Krylov subspaces with our techniques for avoiding communication. For a discussion of different bases, see Chapter 7. To see an example of how the choice of basis affects the formulation of the Krylov method, see Section 3.2.2.
    * We have faster numerical formulations. For example, in our communication-avoiding version of GMRES, CA-GMRES (see Section 3.4), we can pick the restart length r independently of the s-step basis length s. Experiments in Section 3.5.5 show that this ability improves numerical stability. We show in Section 3.6.3 that it also improves performance in practice, resulting in a 2.23× speedup in the CA-GMRES implementation described below.
    * We combine all of these numerical and performance techniques in a shared-memory parallel implementation of our communication-avoiding version of GMRES, CA-GMRES. Compared to a similarly highly optimized version of standard GMRES, when both are running in parallel on 8 cores of an Intel Clovertown (see Appendix A), CA-GMRES achieves 4.3× speedups over standard GMRES on standard sparse test matrices (described in Appendix B.5). When both are running in parallel on 8 cores of an Intel Nehalem (see Appendix A), CA-GMRES achieves 4.1× speedups. See Section 3.6 for performance results and Section 3.5 for corresponding numerical experiments. We first reported performance results for this implementation on the Intel Clovertown platform in Demmel et al. [78].
    * We have incorporated preconditioning into our methods. Note that we have not yet developed practical communication-avoiding preconditioners; this is future work.
    We have accomplished the following:
    * We show (in Sections 2.2 and 4.3) what the s-step basis should compute in the preconditioned case for many different types of Krylov methods and s-step bases. We explain why this is hard in Section 4.3.
    * We have identified two different structures that a preconditioner may have, in order to achieve the desired optimal reduction of communication by a factor of s. See Section 2.2 for details.
    We present a detailed survey of related work, including s-step KSMs (Section 1.5, especially Table 1.1) and other techniques for reducing the amount of communication in iterative methods (Section 1.6).
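    The TSQR kernel mentioned in the contribution list above can be sketched in a few lines (a minimal two-level NumPy illustration, not the thesis implementation; it assumes every row block has at least as many rows as columns): factor each row block independently, combine only the small R factors in a single reduction QR, and reassemble the global Q.

```python
# Two-level Tall Skinny QR (TSQR) sketch: local QR per row block, one small
# reduction QR over the stacked R factors, then recover the global Q. In a
# parallel or out-of-core setting, only the n-by-n R factors move between
# processors or memory levels.
import numpy as np

def tsqr_two_level(V, nblocks=4):
    m, n = V.shape
    blocks = np.array_split(V, nblocks, axis=0)
    # Independent local factorizations (no communication between blocks).
    Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))
    # Single reduction step: QR of the stacked small R factors.
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Reassemble the global Q from the local Qs and the pieces of Q2.
    Q = np.vstack([Qs[i] @ Q2[i * n:(i + 1) * n, :] for i in range(nblocks)])
    return Q, R

V = np.random.default_rng(0).standard_normal((10000, 20))    # tall and skinny
Q, R = tsqr_two_level(V)
print(np.linalg.norm(Q @ R - V), np.linalg.norm(Q.T @ Q - np.eye(20)))   # both ~1e-13
```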