
    Evaluating the Impact of SDC on the GMRES Iterative Solver

    Increasing parallelism and transistor density, along with ever tighter energy and peak power constraints, may force exposure of occasionally incorrect computation or storage to application codes. Silent data corruption (SDC) will likely be infrequent, yet one SDC suffices to make numerical algorithms like iterative linear solvers cease progress towards the correct answer. Thus, we focus on resilience of the iterative linear solver GMRES to a single transient SDC. We derive inexpensive checks to detect the effects of an SDC in GMRES that work for a more general SDC model than one that presumes a bit flip. Our experiments show that when GMRES is used as the inner solver of an inner-outer iteration, it can "run through" SDC of almost any magnitude in the computationally intensive orthogonalization phase. That is, it gets the right answer using faulty data without requiring any rollback. Those SDCs that it cannot run through are caught by our detection scheme.
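The abstract does not spell out its checks, so the following is only a minimal sketch of one inexpensive check of this general kind, not the authors' scheme. It relies on a standard fact: in the Arnoldi process with orthonormal basis vectors, every upper Hessenberg entry satisfies |h_ij| <= ||A||_2 <= ||A||_F, so an entry outside that bound cannot come from a fault-free computation. The function name `arnoldi_with_check`, the `fault` parameter, and the test matrix are assumptions made for illustration.

```python
import numpy as np


def arnoldi_with_check(A, b, m, fault=None):
    """Run m Arnoldi steps (modified Gram-Schmidt) and flag suspicious entries.

    fault is a hypothetical injection hook: a tuple (i, j, scale) that
    multiplies h[i, j] by scale right after it is computed.
    Returns (H, flagged), where flagged lists the (i, j) entries whose
    magnitude exceeds ||A||_F or is not finite.
    """
    n = A.shape[0]
    bound = np.linalg.norm(A, "fro")  # cheap upper bound on ||A||_2
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    flagged = []
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        w = A @ Q[:, j]
        for i in range(j + 1):
            H[i, j] = Q[:, i] @ w
            if fault is not None and (i, j) == fault[:2]:
                H[i, j] *= fault[2]  # simulated SDC in the orthogonalization
            if not np.isfinite(H[i, j]) or abs(H[i, j]) > bound:
                flagged.append((i, j))
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if not np.isfinite(H[j + 1, j]) or H[j + 1, j] > bound:
            flagged.append((j + 1, j))
        if H[j + 1, j] == 0.0:
            break  # Krylov subspace exhausted
        Q[:, j + 1] = w / H[j + 1, j]
    return H, flagged


# Small diagonal test problem, chosen only for illustration.
A = np.diag(np.arange(1.0, 7.0))
b = np.ones(6)
_, clean = arnoldi_with_check(A, b, m=4)
_, faulty = arnoldi_with_check(A, b, m=4, fault=(0, 0, 1e6))
print("flagged without fault:", clean)
print("flagged with fault:", faulty)
```

The check costs one comparison per Hessenberg entry plus a single Frobenius norm, which is why bounds of this type are attractive. It only catches corruptions large enough to violate the bound; smaller ones pass undetected, which matches the abstract's point that the solver must be able to "run through" such faults.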

    Master of Science thesis

    To minimize resource consumption and maximize performance, computer architecture research has been investigating approaches that may compute inaccurate solutions. Such hardware inaccuracies may induce a wide variety of program behaviors which are not obs

    Cross-layer reliability evaluation, moving from the hardware architecture to the system level: A CLERECO EU project overview

    Advanced computing systems realized in forthcoming technologies hold the promise of a significant increase in computational capabilities. However, the same path that is leading technologies toward these remarkable achievements is also making electronic devices increasingly unreliable. Developing new methods to evaluate the reliability of these systems at an early design stage has the potential to save costs, produce optimized designs and have a positive impact on the product time-to-market. The CLERECO European FP7 research project addresses early reliability evaluation with a cross-layer approach across different computing disciplines, across computing system layers and across computing market segments. The fundamental objective of the project is to investigate in depth a methodology to assess system reliability early in the design cycle of the future systems of the emerging computing continuum. This paper presents a general overview of the CLERECO project, focusing on the main tools and models under development that could be of interest to the research community and engineering practice.

    Ground-truth prediction to accelerate soft-error impact analysis for iterative methods

    Understanding the impact of soft errors on applications can be expensive. Often, it requires an extensive error injection campaign involving numerous runs of the full application in the presence of errors. In this paper, we present a novel approach to arriving at the ground truth (the true impact of an error on the final output) for iterative methods by observing a small number of iterations to learn deviations between normal and error-impacted execution. We develop a machine-learning-based predictor for three iterative methods to generate ground-truth results without running them to completion for every error injected. We demonstrate that this approach achieves greater accuracy than alternative prediction strategies, including three existing soft error detection strategies. We demonstrate the effectiveness of the ground-truth prediction model in evaluating vulnerability and the effectiveness of soft error detection strategies in the context of iterative methods. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number 66905, program manager Lucy Nowell. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830. Peer reviewed. Postprint (author's final draft).

    Operating System Support for Redundant Multithreading

    Failing hardware is a fact, and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers have proposed various solutions to this issue, each with different downsides: specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs. ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication. I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode.
    The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB). I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter I show three case studies that evaluate approaches to protecting RCB components and thereby aim to achieve a software stack that is fully protected against hardware errors.
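The difference between error detection and error correction in replicated execution can be shown with a toy majority vote. This is a conceptual sketch only: Romain replicates whole processes at the operating system level, whereas the hypothetical `vote` helper below merely calls ordinary Python functions and compares their results. The names `vote`, `square`, and `square_with_bit_flip` are assumptions for illustration.

```python
from collections import Counter


def vote(replicas, x):
    """Run every replica on x and compare the results.

    Returns (value, status). A strict majority lets the voter correct a
    faulty replica; without one, the mismatch is only detected.
    """
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes == len(results):
        return value, "agree"
    if votes > len(results) // 2:
        return value, "corrected"
    raise RuntimeError("replica mismatch detected, no majority")


def square(x):
    return x * x


def square_with_bit_flip(x):
    # Simulated transient fault: flip bit 3 of the correct result.
    return square(x) ^ (1 << 3)


# Triple-modular redundancy: the two healthy replicas outvote the faulty one.
tmr = vote([square, square_with_bit_flip, square], 5)
print(tmr)
```

With only two replicas the same voter can still detect the mismatch but has no majority to fall back on, so it raises an error instead of returning a value. Three replicas are the minimum for correcting a single faulty replica by voting.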

    Exploiting non-constant safe memory in resilient algorithms and data structures

    We extend the Faulty RAM model by Finocchi and Italiano (2008) by adding a safe memory of arbitrary size $S$, and we then derive tradeoffs between the performance of resilient algorithmic techniques and the size of the safe memory. Let $\delta$ and $\alpha$ denote, respectively, the maximum number of faults that can occur during the execution of an algorithm and the actual number of faults that occurred, with $\alpha \leq \delta$. We propose a resilient algorithm for sorting $n$ entries which requires $O\left(n\log n+\alpha (\delta/S + \log S)\right)$ time and uses $\Theta(S)$ safe memory words. Our algorithm outperforms previous resilient sorting algorithms which do not exploit the available safe memory and require $O\left(n\log n+ \alpha\delta\right)$ time. Finally, we exploit our sorting algorithm for deriving a resilient priority queue. Our implementation uses $\Theta(S)$ safe memory words and $\Theta(n)$ faulty memory words for storing $n$ keys, and requires $O\left(\log n + \delta/S\right)$ amortized time for each insert and deletemin operation. Our resilient priority queue improves the $O\left(\log n + \delta\right)$ amortized time required by the state of the art. Comment: To appear in Theoretical Computer Science, 201
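To get a feel for the tradeoff, the fault-dependent terms of the two sorting bounds can be evaluated numerically. This is a rough illustration, not part of the paper: it drops the constants hidden by the $O(\cdot)$ notation, assumes a base-2 logarithm, and uses the hypothetical helper names `sorting_fault_overhead` and `baseline_fault_overhead`.

```python
import math


def sorting_fault_overhead(alpha, delta, S):
    """Fault-dependent term alpha * (delta / S + log S), constants dropped."""
    return alpha * (delta / S + math.log2(S))


def baseline_fault_overhead(alpha, delta):
    """Fault-dependent term alpha * delta of the earlier bound."""
    return alpha * delta


# Illustrative parameters: the worst case alpha = delta = 1024.
alpha = delta = 1024
for S in (1, 64, 1024):
    print(S, sorting_fault_overhead(alpha, delta, S),
          baseline_fault_overhead(alpha, delta))
```

For $S = 1$ the two terms coincide, since $\delta/S = \delta$ and $\log S = 0$. As $S$ grows toward $\delta$, the $\delta/S$ part shrinks while the $\log S$ part grows only logarithmically, which is where the improvement comes from.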