2,176 research outputs found
Soft Error Vulnerability of Iterative Linear Algebra Methods
Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use features so small at sufficiently low voltages that soft errors are becoming important even at terrestrial altitudes. Due to their large number of components, supercomputers are particularly susceptible to soft errors. Since many large scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications overall vulnerability. Many users consider these methods invulnerable to most soft errors since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques
Recommended from our members
Soft Error Vulnerability of Iterative Linear Algebra Methods
Devices become increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft errors primarily caused problems for space and high-atmospheric computing applications. Modern architectures now use features so small at sufficiently low voltages that soft errors are becoming significant even at terrestrial altitudes. The soft error vulnerability of iterative linear algebra methods, which many scientific applications use, is a critical aspect of the overall application vulnerability. These methods are often considered invulnerable to many soft errors because they converge from an imprecise solution to a precise one. However, we show that iterative methods can be vulnerable to soft errors, with a high rate of silent data corruptions. We quantify this vulnerability, with algorithms generating up to 8.5% erroneous results when subjected to a single bit-flip. Further, we show that detecting soft errors in an iterative method depends on its detailed convergence properties and requires more complex mechanisms than simply checking the residual. Finally, we explore inexpensive techniques to tolerate soft errors in these methods
Exploiting Data Representation for Fault Tolerance
We explore the link between data representation and soft errors in dot
products. We present an analytic model for the absolute error introduced should
a soft error corrupt a bit in an IEEE-754 floating-point number. We show how
this finding relates to the fundamental linear algebra concepts of
normalization and matrix equilibration. We present a case study illustrating
that the probability of experiencing a large error in a dot product is
minimized when both vectors are normalized. Furthermore, when data is
normalized we show that the absolute error is less than one or very large,
which allows us to detect large errors. We demonstrate how this finding can be
used by instrumenting the GMRES iterative solver. We count all possible errors
that can be introduced through faults in arithmetic in the computationally
intensive orthogonalization phase, and show that when scaling is used the
absolute error can be bounded above by one
Resilience in Numerical Methods: A Position on Fault Models and Methodologies
Future extreme-scale computer systems may expose silent data corruption (SDC)
to applications, in order to save energy or increase performance. However,
resilience research struggles to come up with useful abstract programming
models for reasoning about SDC. Existing work randomly flips bits in running
applications, but this only shows average-case behavior for a low-level,
artificial hardware model. Algorithm developers need to understand worst-case
behavior with the higher-level data types they actually use, in order to make
their algorithms more resilient. Also, we know so little about how SDC may
manifest in future hardware, that it seems premature to draw conclusions about
the average case. We argue instead that numerical algorithms can benefit from a
numerical unreliability fault model, where faults manifest as unbounded
perturbations to floating-point data. Algorithms can use inexpensive "sanity"
checks that bound or exclude error in the results of computations. Given a
selective reliability programming model that requires reliability only when and
where needed, such checks can make algorithms reliable despite unbounded
faults. Sanity checks, and in general a healthy skepticism about the
correctness of subroutines, are wise even if hardware is perfectly reliable.Comment: Position Pape
Evaluating the Impact of SDC on the GMRES Iterative Solver
Increasing parallelism and transistor density, along with increasingly
tighter energy and peak power constraints, may force exposure of occasionally
incorrect computation or storage to application codes. Silent data corruption
(SDC) will likely be infrequent, yet one SDC suffices to make numerical
algorithms like iterative linear solvers cease progress towards the correct
answer. Thus, we focus on resilience of the iterative linear solver GMRES to a
single transient SDC. We derive inexpensive checks to detect the effects of an
SDC in GMRES that work for a more general SDC model than presuming a bit flip.
Our experiments show that when GMRES is used as the inner solver of an
inner-outer iteration, it can "run through" SDC of almost any magnitude in the
computationally intensive orthogonalization phase. That is, it gets the right
answer using faulty data without any required roll back. Those SDCs which it
cannot run through, get caught by our detection scheme
Fine-grained bit-flip protection for relaxation methods
[EN] Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Orti was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.Anzt, H.; Dongarra, J.; Quintana OrtĂ, ES. (2019). Fine-grained bit-flip protection for relaxation methods. Journal of Computational Science. 36:1-11. https://doi.org/10.1016/j.jocs.2016.11.013S11136Chow, E., & Patel, A. (2015). Fine-Grained Parallel Incomplete LU Factorization. SIAM Journal on Scientific Computing, 37(2), C169-C193. doi:10.1137/140968896Karpuzcu, U. R., Kim, N. S., & Torrellas, J. (2013). Coping with Parametric Variation at Near-Threshold Voltages. IEEE Micro, 33(4), 6-14. doi:10.1109/mm.2013.71Bronevetsky, G., & de Supinski, B. (2008). Soft error vulnerability of iterative linear algebra methods. Proceedings of the 22nd annual international conference on Supercomputing - ICS ’08. doi:10.1145/1375527.1375552Sao, P., & Vuduc, R. (2013). Self-stabilizing iterative solvers. Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA ’13. doi:10.1145/2530268.2530272Calhoun, J., Snir, M., Olson, L., & Garzaran, M. (2015). Understanding the Propagation of Error Due to a Silent Data Corruption in a Sparse Matrix Vector Multiply. 2015 IEEE International Conference on Cluster Computing. doi:10.1109/cluster.2015.101Chazan, D., & Miranker, W. (1969). Chaotic relaxation. Linear Algebra and its Applications, 2(2), 199-222. doi:10.1016/0024-3795(69)90028-7Frommer, A., & Szyld, D. B. (2000). On asynchronous iterations. Journal of Computational and Applied Mathematics, 123(1-2), 201-216. doi:10.1016/s0377-0427(00)00409-xDuff, I. S., & Meurant, G. A. (1989). The effect of ordering on preconditioned conjugate gradients. BIT, 29(4), 635-657. doi:10.1007/bf01932738Aliaga, J. I., Barreda, M., Dolz, M. F., MartĂn, A. F., Mayo, R., & Quintana-OrtĂ, E. S. (2014). Assessing the impact of the CPU power-saving modes on the task-parallel solution of sparse linear systems. Cluster Computing, 17(4), 1335-1348. doi:10.1007/s10586-014-0402-
Evaluating application vulnerability to soft errors in multi-level cache hierarchy
As the capacity of cache increases dramatically with new processors, soft errors originating in cache has become a major reliability concern for high performance processors. This paper presents application specific soft error vulnerability analysis in order to understand an application's responses to soft errors from different levels of caches. Based on a high-performance processor simulator called Graphite, we have implemented a fault injection framework that can selectively inject bit flips to different levels of caches. We simulated a wide range of relevant bit error patterns and measured the applications' vulnerabilities to bit errors. Our experimental results have shown the various vulnerabilities of applications to bit errors from different levels of caches; the results have also indicated the probabilities of different behaviors from the applications
A vulnerability factor for ECC-protected memory
Fault injection studies and vulnerability analyses have been used to estimate the reliability of data structures in memory. We survey these metrics and look at their adequacy to describe the data stored in ECC-protected memory. We also introduce FEA, a new metric improving on the memory derating factor by ignoring a class of false errors. We measure all metrics using simulations and compare them to the outcomes of injecting errors in real runs. This in-depth study reveals that FEA provides more accurate results than any state-of-the-art vulnerability metric. Furthermore, FEA gives an upper bound on the failure probability due to an error in memory, making this metric a tool of choice to quantify memory vulnerability. Finally, we show that ignoring these false errors reduces the failure rate on average by 12.75% and up to over 45%.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316- P), by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017- SGR-1328), by the Spanish Government (Severo Ochoa grant SEV-2015- 0493) and by the European Union’s Horizon 2020 research and innovation programme (grant agreements 671697 and 779877). L. Jaulmes has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2013/06982. M. Moreto and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowships RYC-2016-21104 and RYC-2017-23269.Peer ReviewedPostprint (author's final draft
- …