516 research outputs found

    Fine-grained bit-flip protection for relaxation methods

    Full text link
    [EN] Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Orti was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.Anzt, H.; Dongarra, J.; Quintana Ortí, ES. (2019). Fine-grained bit-flip protection for relaxation methods. Journal of Computational Science. 36:1-11. https://doi.org/10.1016/j.jocs.2016.11.013S11136Chow, E., & Patel, A. (2015). Fine-Grained Parallel Incomplete LU Factorization. SIAM Journal on Scientific Computing, 37(2), C169-C193. doi:10.1137/140968896Karpuzcu, U. R., Kim, N. S., & Torrellas, J. (2013). Coping with Parametric Variation at Near-Threshold Voltages. IEEE Micro, 33(4), 6-14. doi:10.1109/mm.2013.71Bronevetsky, G., & de Supinski, B. (2008). Soft error vulnerability of iterative linear algebra methods. Proceedings of the 22nd annual international conference on Supercomputing - ICS ’08. doi:10.1145/1375527.1375552Sao, P., & Vuduc, R. (2013). Self-stabilizing iterative solvers. Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA ’13. doi:10.1145/2530268.2530272Calhoun, J., Snir, M., Olson, L., & Garzaran, M. (2015). Understanding the Propagation of Error Due to a Silent Data Corruption in a Sparse Matrix Vector Multiply. 2015 IEEE International Conference on Cluster Computing. doi:10.1109/cluster.2015.101Chazan, D., & Miranker, W. (1969). Chaotic relaxation. Linear Algebra and its Applications, 2(2), 199-222. doi:10.1016/0024-3795(69)90028-7Frommer, A., & Szyld, D. B. (2000). On asynchronous iterations. Journal of Computational and Applied Mathematics, 123(1-2), 201-216. doi:10.1016/s0377-0427(00)00409-xDuff, I. S., & Meurant, G. A. (1989). The effect of ordering on preconditioned conjugate gradients. BIT, 29(4), 635-657. doi:10.1007/bf01932738Aliaga, J. I., Barreda, M., Dolz, M. F., Martín, A. F., Mayo, R., & Quintana-Ortí, E. S. (2014). Assessing the impact of the CPU power-saving modes on the task-parallel solution of sparse linear systems. Cluster Computing, 17(4), 1335-1348. doi:10.1007/s10586-014-0402-

    Resilience for Asynchronous Iterative Methods for Sparse Linear Systems

    Get PDF
    Large scale simulations are used in a variety of application areas in science and engineering to help forward the progress of innovation. Many spend the vast majority of their computational time attempting to solve large systems of linear equations; typically arising from discretizations of partial differential equations that are used to mathematically model various phenomena. The algorithms used to solve these problems are typically iterative in nature, and making efficient use of computational time on High Performance Computing (HPC) clusters involves constantly improving these iterative algorithms. Future HPC platforms are expected to encounter three main problem areas: scalability of code, reliability of hardware, and energy efficiency of the platform. The HPC resources that are expected to run the large programs are planned to consist of billions of processing units that come from more traditional multicore processors as well as a variety of different hardware accelerators. This growth in parallelism leads to the presence of all three problems. Previously, work on algorithm development has focused primarily on creating fault tolerance mechanisms for traditional iterative solvers. Recent work has begun to revisit using asynchronous methods for solving large scale applications, and this dissertation presents research into fault tolerance for fine-grained methods that are asynchronous in nature. Classical convergence results for asynchronous methods are revisited and modified to account for the possible occurrence of a fault, and a variety of techniques for recovery from the effects of a fault are proposed. Examples of how these techniques can be used are shown for various algorithms, including an analysis of a fine-grained algorithm for computing incomplete factorizations. Lastly, numerous modeling and simulation tools for the further construction of iterative algorithms for HPC applications are developed, including numerical models for simulating faults and a simulation framework that can be used to extrapolate the performance of algorithms towards future HPC systems

    Resiliency in numerical algorithm design for extreme scale simulations

    Get PDF
    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz, Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft

    An extensive study on iterative solver resilience : characterization, detection and prediction

    Get PDF
    Soft errors caused by transient bit flips have the potential to significantly impactan applicalion's behavior. This has motivated the design of an array of techniques to detect, isolate, and correct soft errors using microarchitectural, architectural, compilation­based, or application-level techniques to minimize their impact on the executing application. The first step toward the design of good error detection/correction techniques involves an understanding of an application's vulnerability to soft errors. This work focuses on silent data e orruption's effects on iterative solvers and efforts to mitigate those effects. In this thesis, we first present the first comprehensive characterizalion of !he impact of soft errors on !he convergen ce characteris tics of six iterative methods using application-level fault injection. We analyze the impact of soft errors In terms of the type of error (single-vs multi-bit), the distribution and location of bits affected, the data structure and statement impacted, and varialion with time. We create a public access database with more than 1.5 million fault injection results. We then analyze the performance of soft error detection mechanisms and present the comparalive results. Molivated by our observations, we evaluate a machine-learning based detector that takes as features that are the runtime features observed by the individual detectors to arrive al their conclusions. Our evalualion demonstrates improved results over individual detectors. We then propase amachine learning based method to predict a program's error behavior to make fault injection studies more efficient. We demonstrate this method on asse ssing the performance of soft error detectors. We show that our method maintains 84% accuracy on average with up to 53% less cost. We also show, once a model is trained further fault injection tests would cost 10% of the expected full fault injection runs.“Soft errors” causados por cambios de estado transitorios en bits, tienen el potencial de impactar significativamente el comportamiento de una aplicación. Esto, ha motivado el diseño de una variedad de técnicas para detectar, aislar y corregir soft errors aplicadas a micro-arquitecturas, arquitecturas, tiempo de compilación y a nivel de aplicación para minimizar su impacto en la ejecución de una aplicación. El primer paso para diseñar una buna técnica de detección/corrección de errores, implica el conocimiento de las vulnerabilidades de la aplicación ante posibles soft errors. Este trabajo se centra en los efectos de la corrupción silenciosa de datos en soluciones iterativas, así como en los esfuerzos para mitigar esos efectos. En esta tesis, primeramente, presentamos la primera caracterización extensiva del impacto de soft errors sobre las características convergentes de seis métodos iterativos usando inyección de fallos a nivel de aplicación. Analizamos el impacto de los soft errors en términos del tipo de error (único vs múltiples-bits), de la distribución y posición de los bits afectados, las estructuras de datos, instrucciones afectadas y de las variaciones en el tiempo. Creamos una base de datos pública con más de 1.5 millones de resultados de inyección de fallos. Después, analizamos el desempeño de mecanismos de detección de soft errors actuales y presentamos los resultados de su comparación. Motivados por las observaciones de los resultados presentados, evaluamos un detector de soft errors basado en técnicas de machine learning que toma como entrada las características observadas en el tiempo de ejecución individual de los detectores anteriores al llegar a su conclusión. La evaluación de los resultados obtenidos muestra una mejora por sobre los detectores individualmente. Basados en estos resultados propusimos un método basado en machine learning para predecir el comportamiento de los errores en un programa con el fin de hacer el estudio de inyección de errores mas eficiente. Presentamos este método para evaluar el rendimiento de los detectores de soft errors. Demostramos que nuestro método mantiene una precisión del 84% en promedio con hasta un 53% de mejora en el tiempo de ejecución. También mostramos que una vez que un modelo ha sido entrenado, las pruebas de inyección de errores siguientes costarían 10% del tiempo esperado de ejecución.Postprint (published version

    Benefits from using mixed precision computations in the ELPA-AEO and ESSEX-II eigensolver projects

    Get PDF
    We first briefly report on the status and recent achievements of the ELPA-AEO (Eigenvalue Solvers for Petaflop Applications - Algorithmic Extensions and Optimizations) and ESSEX II (Equipping Sparse Solvers for Exascale) projects. In both collaboratory efforts, scientists from the application areas, mathematicians, and computer scientists work together to develop and make available efficient highly parallel methods for the solution of eigenvalue problems. Then we focus on a topic addressed in both projects, the use of mixed precision computations to enhance efficiency. We give a more detailed description of our approaches for benefiting from either lower or higher precision in three selected contexts and of the results thus obtained

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest

    Novel Monte Carlo Methods for Large-Scale Linear Algebra Operations

    Get PDF
    Linear algebra operations play an important role in scientific computing and data analysis. With increasing data volume and complexity in the Big Data era, linear algebra operations are important tools to process massive datasets. On one hand, the advent of modern high-performance computing architectures with increasing computing power has greatly enhanced our capability to deal with a large volume of data. One the other hand, many classical, deterministic numerical linear algebra algorithms have difficulty to scale to handle large data sets. Monte Carlo methods, which are based on statistical sampling, exhibit many attractive properties in dealing with large volume of datasets, including fast approximated results, memory efficiency, reduced data accesses, natural parallelism, and inherent fault tolerance. In this dissertation, we present new Monte Carlo methods to accommodate a set of fundamental and ubiquitous large-scale linear algebra operations, including solving large-scale linear systems, constructing low-rank matrix approximation, and approximating the extreme eigenvalues/ eigenvectors, across modern distributed and parallel computing architectures. First of all, we revisit the classical Ulam-von Neumann Monte Carlo algorithm and derive the necessary and sufficient condition for its convergence. To support a broad family of linear systems, we develop Krylov subspace Monte Carlo solvers that go beyond the use of Neumann series. New algorithms used in the Krylov subspace Monte Carlo solvers include (1) a Breakdown-Free Block Conjugate Gradient algorithm to address the potential rank deficiency problem occurred in block Krylov subspace methods; (2) a Block Conjugate Gradient for Least Squares algorithm to stably approximate the least squares solutions of general linear systems; (3) a BCGLS algorithm with deflation to gain convergence acceleration; and (4) a Monte Carlo Generalized Minimal Residual algorithm based on sampling matrix-vector products to provide fast approximation of solutions. Secondly, we design a rank-revealing randomized Singular Value Decomposition (R3SVD) algorithm for adaptively constructing low-rank matrix approximations to satisfy application-specific accuracy. Thirdly, we study the block power method on Markov Chain Monte Carlo transition matrices and find that the convergence is actually depending on the number of independent vectors in the block. Correspondingly, we develop a sliding window power method to find stationary distribution, which has demonstrated success in modeling stochastic luminal Calcium release site. Fourthly, we take advantage of hybrid CPU-GPU computing platforms to accelerate the performance of the Breakdown-Free Block Conjugate Gradient algorithm and the randomized Singular Value Decomposition algorithm. Finally, we design a Gaussian variant of Freivalds’ algorithm to efficiently verify the correctness of matrix-matrix multiplication while avoiding undetectable fault patterns encountered in deterministic algorithms
    corecore