Evaluating the Impact of SDC on the GMRES Iterative Solver
Increasing parallelism and transistor density, along with ever tighter energy
and peak power constraints, may force exposure of occasionally
incorrect computation or storage to application codes. Silent data corruption
(SDC) will likely be infrequent, yet one SDC suffices to make numerical
algorithms like iterative linear solvers cease progress towards the correct
answer. Thus, we focus on resilience of the iterative linear solver GMRES to a
single transient SDC. We derive inexpensive checks to detect the effects of an
SDC in GMRES that hold under a more general SDC model than the usual bit-flip
assumption.
Our experiments show that when GMRES is used as the inner solver of an
inner-outer iteration, it can "run through" SDC of almost any magnitude in the
computationally intensive orthogonalization phase. That is, it gets the right
answer using faulty data, without any rollback required. Those SDCs that it
cannot run through are caught by our detection scheme.
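Such a check can be sketched in a few lines. The following is a hypothetical illustration (the function name, tolerance, and the dense test problem are ours, not the paper's): the solver's internally tracked residual norm is compared against a directly recomputed one, and a large discrepancy flags a possible SDC.

```python
import numpy as np

def residual_check(A, b, x, reported_rnorm, tol=1e-8):
    """Hypothetical inexpensive sanity check: recompute the true
    residual norm and flag a possible SDC when it disagrees with the
    norm the solver tracked internally (tolerance is illustrative)."""
    true_rnorm = np.linalg.norm(b - A @ x)
    return bool(abs(true_rnorm - reported_rnorm)
                <= tol * max(1.0, true_rnorm))

# An SDC in the orthogonalization phase typically makes the solver's
# internal bookkeeping diverge from the true residual.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)  # stand-in for a converged inner solve
assert residual_check(A, b, x, np.linalg.norm(b - A @ x))
assert not residual_check(A, b, x, 1.0)  # corrupted bookkeeping is caught
```

The check costs one matrix-vector product, which is cheap relative to the orthogonalization work it guards.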
Resilience in Numerical Methods: A Position on Fault Models and Methodologies
Future extreme-scale computer systems may expose silent data corruption (SDC)
to applications, in order to save energy or increase performance. However,
resilience research struggles to come up with useful abstract programming
models for reasoning about SDC. Existing work randomly flips bits in running
applications, but this only shows average-case behavior for a low-level,
artificial hardware model. Algorithm developers need to understand worst-case
behavior with the higher-level data types they actually use, in order to make
their algorithms more resilient. Also, we know so little about how SDC may
manifest in future hardware, that it seems premature to draw conclusions about
the average case. We argue instead that numerical algorithms can benefit from a
numerical unreliability fault model, where faults manifest as unbounded
perturbations to floating-point data. Algorithms can use inexpensive "sanity"
checks that bound or exclude error in the results of computations. Given a
selective reliability programming model that requires reliability only when and
where needed, such checks can make algorithms reliable despite unbounded
faults. Sanity checks, and in general a healthy skepticism about the
correctness of subroutines, are wise even if hardware is perfectly reliable.
Comment: Position Paper
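As a hypothetical illustration of such a sanity check (the helper name and the choice of the Frobenius-norm bound are ours, not a prescription from the paper), a matrix-vector product computed on unreliable hardware can be screened with a cheap bound that every correct result must satisfy:

```python
import numpy as np

def passes_sanity(A, x, y):
    """Return True if y is a plausible value of A @ x: any correct
    product satisfies ||A x||_2 <= ||A||_F * ||x||_2, so a result that
    violates this bound, or is non-finite, must be corrupted. Under
    selective reliability, this check itself runs reliably."""
    bound = np.linalg.norm(A, 'fro') * np.linalg.norm(x)
    return bool(np.all(np.isfinite(y)) and np.linalg.norm(y) <= bound)

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
x = rng.standard_normal(50)
y = A @ x
assert passes_sanity(A, x, y)          # correct result passes

y_bad = y.copy()
y_bad[7] = 1e30                        # unbounded perturbation from an SDC
assert not passes_sanity(A, x, y_bad)  # excluded by the bound
```

Note that the bound excludes only unbounded faults; small perturbations pass, which is why such checks are paired with algorithms that tolerate bounded error.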
Exploiting Data Representation for Fault Tolerance
We explore the link between data representation and soft errors in dot
products. We present an analytic model for the absolute error introduced should
a soft error corrupt a bit in an IEEE-754 floating-point number. We show how
this finding relates to the fundamental linear algebra concepts of
normalization and matrix equilibration. We present a case study illustrating
that the probability of experiencing a large error in a dot product is
minimized when both vectors are normalized. Furthermore, we show that when the
data is normalized, the absolute error is either less than one or very large,
which allows large errors to be detected. We demonstrate how this finding can be
used by instrumenting the GMRES iterative solver. We count all possible errors
that can be introduced through faults in arithmetic in the computationally
intensive orthogonalization phase, and show that when scaling is used, the
absolute error can be bounded above by one.
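The effect described above can be reproduced with a short experiment (our own sketch; `flip_bit`, the vector length, and the thresholds are illustrative, not the paper's code): flipping any single bit of an entry of a normalized vector perturbs the dot product either by less than one or by an enormous amount, with no middle ground for an error to hide in.

```python
import numpy as np

def flip_bit(x, k):
    """Flip bit k (0 = least significant significand bit, 63 = sign)
    of an IEEE-754 double, mimicking a transient soft error."""
    bits = np.float64(x).view(np.uint64)
    return (bits ^ np.uint64(1 << k)).view(np.float64)

rng = np.random.default_rng(0)
u = rng.standard_normal(100); u /= np.linalg.norm(u)  # normalized vectors,
v = rng.standard_normal(100); v /= np.linalg.norm(v)  # so every |entry| < 1

clean = u @ v
for k in range(64):                 # try a fault in every bit position
    w = u.copy()
    w[0] = flip_bit(w[0], k)        # corrupt one bit of one entry
    err = abs(w @ v - clean)
    assert err < 1.0 or err > 1e6   # error is small or unmistakably huge
```

Significand and sign flips move an entry that is already below one in magnitude by less than one, while a flip in the high exponent bits produces an error so large that a simple magnitude test catches it.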
Communication-Avoiding Krylov Subspace Methods
The cost of an algorithm includes both arithmetic and communication. We use "communication" in a general sense to mean data movement, either between levels of a memory hierarchy ("sequential") or between processors ("parallel"). Communication costs include both bandwidth terms, which are proportional to the number of words sent, and latency terms, which are proportional to the number of messages in which the data is sent. Communication costs are much higher than arithmetic costs, and the gap is increasing rapidly for technological reasons. This suggests that for best performance, algorithms should minimize communication, even if that may require some redundant arithmetic operations. We call such algorithms "communication-avoiding."

Krylov subspace methods (KSMs) are iterative algorithms for solving large, sparse linear systems and eigenvalue problems. Current KSMs rely on sparse matrix-vector multiply (SpMV) and vector-vector operations (like dot products and vector sums). All of these operations are communication-bound. Furthermore, data dependencies between them mean that only a small amount of that communication can be hidden. Many important scientific and engineering computations spend much of their time in Krylov methods, so the performance of many codes could be improved by introducing KSMs that communicate less.

Our goal is to take s steps of a KSM for the same communication cost as 1 step, which would be optimal. We call the resulting KSMs "communication-avoiding Krylov methods." We can do this under certain assumptions on the matrix, and for certain KSMs, both in theory (for many KSMs) and in practice (for some KSMs). Our algorithms are based on the so-called "s-step" Krylov methods, which break up the data dependency between the sparse matrix-vector multiply and the dot products in standard Krylov methods.
This idea has been around for a while, but in contrast to prior work (discussed in detail in Section 1.5), this thesis makes the following contributions:

* We have fast kernels replacing SpMV, which can compute the results of s calls to SpMV for the same communication cost as one call (Section 2.1).
* We have fast dense kernels as well, such as Tall Skinny QR (TSQR -- Section 2.3) and Block Gram-Schmidt (BGS -- Section 2.4), which can do the work of Modified Gram-Schmidt applied to s vectors for a factor of Theta(s^2) fewer messages in parallel, and a factor of Theta(s/W) fewer words transferred between levels of the memory hierarchy (where W is the fast memory capacity in words).
* We have new communication-avoiding Block Gram-Schmidt algorithms for orthogonalization in more general inner products (Section 2.5).
* We have new communication-avoiding versions of the following Krylov subspace methods for solving linear systems: the Generalized Minimum Residual method (GMRES -- Section 3.4), both unpreconditioned and preconditioned, and the Method of Conjugate Gradients (CG), both unpreconditioned (Section 5.4) and left-preconditioned (Section 5.5).
* We have new communication-avoiding versions of the following Krylov subspace methods for solving eigenvalue problems, both standard (Ax = λx, for a nonsingular matrix A) and generalized (Ax = λMx, for nonsingular matrices A and M): Arnoldi iteration (Section 3.3), and Lanczos iteration, both for Ax = λx (Section 4.2) and Ax = λMx (Section 4.3).
* We propose techniques for developing communication-avoiding versions of nonsymmetric Lanczos iteration (for solving nonsymmetric eigenvalue problems Ax = λx) and the Method of Biconjugate Gradients (BiCG) for solving linear systems. See Chapter 6 for details.
* We can combine more stable numerical formulations that use different bases of Krylov subspaces with our techniques for avoiding communication. For a discussion of different bases, see Chapter 7. To see an example of how the choice of basis affects the formulation of the Krylov method, see Section 3.2.2.
* We have faster numerical formulations. For example, in our communication-avoiding version of GMRES, CA-GMRES (see Section 3.4), we can pick the restart length r independently of the s-step basis length s. Experiments in Section 3.5.5 show that this ability improves numerical stability. We show in Section 3.6.3 that it also improves performance in practice, resulting in a 2.23× speedup in the CA-GMRES implementation described below.
* We combine all of these numerical and performance techniques in a shared-memory parallel implementation of our communication-avoiding version of GMRES, CA-GMRES. Compared to a similarly highly optimized version of standard GMRES, when both are running in parallel on 8 cores of an Intel Clovertown (see Appendix A), CA-GMRES achieves 4.3× speedups over standard GMRES on standard sparse test matrices (described in Appendix B.5). When both are running in parallel on 8 cores of an Intel Nehalem (see Appendix A), CA-GMRES achieves 4.1× speedups. See Section 3.6 for performance results and Section 3.5 for corresponding numerical experiments. We first reported performance results for this implementation on the Intel Clovertown platform in Demmel et al. [78].
* We have incorporated preconditioning into our methods. Note that we have not yet developed practical communication-avoiding preconditioners; this is future work. We have accomplished the following:
  * We show (in Sections 2.2 and 4.3) what the s-step basis should compute in the preconditioned case for many different types of Krylov methods and s-step bases. We explain why this is hard in Section 4.3.
  * We have identified two different structures that a preconditioner may have, in order to achieve the desired optimal reduction of communication by a factor of s. See Section 2.2 for details.
* We present a detailed survey of related work, including s-step KSMs (Section 1.5, especially Table 1.1) and other techniques for reducing the amount of communication in iterative methods (Section 1.6).
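The s-step kernel at the heart of these methods can be illustrated with a small dense sketch (ours, not the thesis's code; a real implementation uses a communication-avoiding matrix powers kernel over a sparse matrix): it computes the monomial basis [v, Av, ..., A^s v], whose columns span the Krylov subspace, after which one orthogonalization kernel such as TSQR can replace s rounds of dot products.

```python
import numpy as np

def monomial_basis(A, v, s):
    """Compute the monomial s-step basis [v, A v, ..., A^s v].
    Communication-avoiding Krylov methods produce these s+1 vectors in
    one fused kernel for the communication cost of a single SpMV; this
    dense loop only illustrates the mathematics, not the avoided
    communication."""
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

# The columns span the Krylov subspace K_{s+1}(A, v), so a single QR
# factorization (a stand-in here for TSQR) recovers an orthonormal
# basis in one reduction.
A = np.diag([1.0, 2.0, 3.0])
v = np.ones(3)
V = monomial_basis(A, v, 2)
Q, _ = np.linalg.qr(V)
assert np.allclose(Q.T @ Q, np.eye(3))
```

In practice the monomial basis becomes ill-conditioned as s grows, which is why the choice of better-conditioned bases (Chapter 7) matters.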