
    Exploiting Inherent Program Redundancy for Fault Tolerance

    Technology scaling has led to growing concerns about reliability in microprocessors. Current fault tolerance studies rely on creating explicitly redundant execution for fault detection or recovery, which usually imposes a high cost in performance, power, or hardware. In our study, we find that exploiting a program's inherent redundancy offers a better trade-off among reliability, performance, and hardware cost. This work proposes two approaches to enhancing program reliability. The first approach investigates the additional fault resilience available at the application level. We explore a definition of program correctness that views correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, multiple numerical outputs can be deemed correct as long as they are acceptable to users, so faults that cause the program to produce such outputs can also be tolerated. We find that programs which produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and find that they are common in multimedia workloads as well as artificial intelligence (AI) workloads. Programs that compute only exact numerical outputs offer less error resilience at the application level. However, all programs we have studied exhibit some enhanced fault resilience at the application level, including those traditionally considered exact computations, e.g., SPECint CPU2000. We conduct fault injection experiments and evaluate the additional fault tolerance at the application level compared to the traditional architectural level. We also exploit the relaxed requirements on numerical integrity under application-level correctness to reduce checkpointing cost: our lightweight recovery mechanism checkpoints a minimal set of program state, including the program counter, architectural register file, and stack; our soft-checkpointing technique identifies computations that are resilient to errors and excludes their output state from the checkpoint. Both techniques incur much smaller runtime overhead than traditional checkpointing, yet can successfully recover all or most program crashes in soft computations. The second approach studies value predictability as a means of reducing the fault rate. Value prediction is treated as additional execution, and its results are compared with the corresponding computational outputs; any mismatch is taken as a symptom of a potential fault and triggers a restoration process. To reduce the misprediction rate caused by limitations of the predictor itself, we characterize fault vulnerability at the instruction level and apply value prediction only to instructions that are highly susceptible to faults. We also vary the confidence-estimation threshold according to an instruction's vulnerability: instructions with high vulnerability are assigned a low confidence threshold, while instructions with low vulnerability are assigned a high confidence threshold. Our experimental results show the benefit of such selective prediction and adaptive confidence thresholds in balancing reliability and performance.
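
    A minimal sketch of the selective value prediction idea described above, assuming a last-value predictor with saturating confidence counters; the Instruction and ValuePredictor names, the vulnerability cutoff, and the threshold values are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    pc: int
    vulnerability: float   # estimated fault vulnerability, 0.0 - 1.0 (assumed metric)
    result: int            # value actually computed by the pipeline

class ValuePredictor:
    """Last-value predictor with per-PC saturating confidence counters."""
    def __init__(self):
        self.last_value = {}
        self.confidence = {}

    def predict(self, pc):
        return self.last_value.get(pc), self.confidence.get(pc, 0)

    def update(self, pc, value):
        if self.last_value.get(pc) == value:
            self.confidence[pc] = min(self.confidence.get(pc, 0) + 1, 7)
        else:
            self.confidence[pc] = 0
        self.last_value[pc] = value

def check_instruction(inst, predictor, high_vuln=0.6):
    """Return True if a prediction mismatch should be treated as a fault symptom."""
    predicted, conf = predictor.predict(inst.pc)
    # Adaptive threshold: highly vulnerable instructions are checked even at low
    # predictor confidence; low-vulnerability ones only at high confidence.
    threshold = 2 if inst.vulnerability >= high_vuln else 6
    suspicious = (predicted is not None and conf >= threshold
                  and predicted != inst.result)
    predictor.update(inst.pc, inst.result)
    return suspicious

# Illustrative trace: three matching results build confidence, then a deviation
# at a highly vulnerable instruction triggers the restoration symptom.
pred = ValuePredictor()
trace = [Instruction(pc=0x40, vulnerability=0.8, result=5)] * 3 + \
        [Instruction(pc=0x40, vulnerability=0.8, result=9)]
for inst in trace:
    if check_instruction(inst, pred):
        print(f"mismatch symptom at pc={inst.pc:#x}: trigger restoration")
```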

    Quantifying fault recovery in multiprocessor systems

    Various aspects of reliable computing are formalized and quantified with emphasis on efficient fault recovery. The mathematical model that proves most appropriate is provided by the theory of graphs. New measures for fault recovery are developed, and the values of the elements of the fault recovery vector are observed to depend not only on the computation graph H and the architecture graph G, but also on the specific location of a fault. In the examples, a hypercube is chosen as a representative parallel computer architecture, and a pipeline as a typical configuration for program execution. The dependability qualities of such a system are defined with or without a fault. These qualities are determined by the resiliency triple, defined by three parameters: multiplicity, robustness, and configurability. Parameters for measuring recovery effectiveness are also introduced in terms of distance, time, and the number of new, used, and moved nodes and edges.
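
    As a rough illustration of recovery measures of this kind, the sketch below assumes a 3-dimensional hypercube as the architecture graph G and a four-stage pipeline as the computation graph H, and reports the number of moved nodes and the recovery distance after a single node fault; the embedding and the recovery policy are assumptions for illustration, not the paper's model.

```python
def neighbors(node, dim=3):
    """Hypercube neighbors differ in exactly one bit position."""
    return [node ^ (1 << i) for i in range(dim)]

def hamming(a, b):
    """Hamming distance between two hypercube node labels."""
    return bin(a ^ b).count("1")

# Initial embedding: pipeline stage i -> hypercube node (Gray-code path 0-1-3-2).
mapping = {0: 0, 1: 1, 2: 3, 3: 2}

def recover(mapping, faulty_node, dim=3):
    """Remap any stage hosted on the faulty node to a free neighboring node."""
    used = set(mapping.values())
    new_mapping = dict(mapping)
    moved_nodes, total_distance = 0, 0
    for stage, node in mapping.items():
        if node == faulty_node:
            spare = next(n for n in neighbors(node, dim)
                         if n not in used and n != faulty_node)
            new_mapping[stage] = spare
            used.add(spare)
            moved_nodes += 1
            total_distance += hamming(node, spare)
    return new_mapping, moved_nodes, total_distance

after, moved, dist = recover(mapping, faulty_node=3)
print(after, "moved nodes:", moved, "recovery distance:", dist)
```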

    Survivable algorithms and redundancy management in NASA's distributed computing systems

    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are by contrast little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given.
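
    A minimal sketch of hierarchical consensus by majority voting, in the spirit of the partitioning idea mentioned above: nodes are split into groups, each group votes locally, and a second round votes over the group results. Group sizes and the voting rule are assumptions for illustration, not the paper's algorithm.

```python
from collections import Counter

def majority(values):
    """Return the most common value, or None on a tie for first place."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def hierarchical_consensus(node_values, group_size=3):
    """Two-level vote: local majority per group, then majority over group results."""
    groups = [node_values[i:i + group_size]
              for i in range(0, len(node_values), group_size)]
    group_results = [majority(g) for g in groups]
    return majority([r for r in group_results if r is not None])

# Nine redundant executions, two of which return faulty values.
readings = [42, 42, 42, 42, 17, 42, 42, 99, 42]
print(hierarchical_consensus(readings))  # -> 42
```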

    Avionics systems integration technology

    A dramatic and continuing explosion in digital electronics technology has taken place over the last decade. The prudent and timely application of this technology will provide Army aviation with the capability to prevail against a numerically superior enemy threat. The Army and NASA have exploited this technology explosion in the development and application of avionics systems integration technology for new and future aviation systems. A few selected Army avionics integration technology base efforts are discussed, as is the Avionics Integration Research Laboratory (AIRLAB) that NASA has established at Langley for research into the integration and validation of avionics systems and the evaluation of advanced technology in a total-systems context.

    An approach to rollback recovery of collaborating mobile agents

    Fault tolerance is one of the main problems that must be resolved to improve the adoption of the agent computing paradigm. In this paper, we analyse the execution model of agent platforms and the significance of the faults affecting their constituent components for the reliable execution of agent-based applications, in order to develop a pragmatic framework for agent system fault tolerance. The developed framework deploys a communication-pairs-independent checkpointing strategy to offer a low-cost, application-transparent model for reliable agent-based computing that covers all possible faults that might invalidate reliable agent execution, migration, and communication, and maintains the exactly-once execution property.
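
    A minimal sketch of independent checkpointing with rollback for a single agent, assuming the exactly-once property is preserved by recording completed task identifiers inside the checkpoint; the class and method names are illustrative only, not the framework's API.

```python
import copy
import random

class Agent:
    def __init__(self, tasks):
        self.pending = list(tasks)
        self.completed = []          # task ids already executed (exactly once)
        self.state = {}
        self._checkpoint = None

    def checkpoint(self):
        """Independently save a consistent copy of the agent's state."""
        self._checkpoint = copy.deepcopy(
            (self.pending, self.completed, self.state))

    def rollback(self):
        """Restore the last checkpoint after a detected fault."""
        self.pending, self.completed, self.state = copy.deepcopy(self._checkpoint)

    def run(self, fail_prob=0.3):
        while self.pending:
            self.checkpoint()
            task = self.pending[0]
            try:
                if random.random() < fail_prob:          # injected fault
                    raise RuntimeError("host crashed during task")
                self.state[task] = f"result of {task}"    # perform the work
                self.completed.append(task)
                self.pending.pop(0)
            except RuntimeError:
                self.rollback()                           # retry from checkpoint

agent = Agent(["collect", "filter", "report"])
agent.run()
print(agent.completed)   # each task appears exactly once despite injected faults
```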

    Software reliability and dependability: a roadmap

    This roadmap identifies several directions: shifting the focus from software reliability to user-centred measures of dependability in complete software-based systems; influencing design practice to facilitate dependability assessment; propagating awareness of dependability issues and the use of existing, useful methods; injecting some rigour into the use of process-related evidence for dependability assessment; and better understanding issues of diversity and variation as drivers of dependability. Bev Littlewood is founder-Director of the Centre for Software Reliability and Professor of Software Engineering at City University, London. Prof. Littlewood has worked for many years on problems associated with the modelling and evaluation of the dependability of software-based systems; he has published many papers in international journals and conference proceedings and has edited several books. Much of this work has been carried out in collaborative projects, including the successful EC-funded projects SHIP, PDCS, PDCS2, and DeVa. He has been employed as a consultant t