Search CORE

5,838 research outputs found

Recommended from our members

High performance fault tolerance through predictive instruction re-execution

Author: Jones TM
Soman J
Publication venue: 2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017
Publication date: 01/01/2017
Field of study

Processor designers face the challenge of defect formation, leading to permanent faults, during fabrication and operation. Permanent or hard fault tolerance is an important problem in computing systems, solutions to which can help improve yield during fabrication and reduce the cost of transistor mortality during the service life of the processor. This paper presents PreFix, a method to handle hard errors to keep a faulty core running and correctly executing instructions. Instead of turning off faulty structures, PreFix predicts early on whether an instruction is likely to use faulty components, then refines this prediction later in the pipeline to actually detect when an error has occurred. Instructions marked as possibly- faulty in the front-end are queued for duplicate execution on a separate core. At commit, results from the original and duplicate instructions are compared. Upon a mismatch, the original instruction is patched up, the pipeline flushed and execution continues. Using PreFix, faulty components can continue performing useful work when their errors do not manifest in architecturally visible state changes. This enhances processor lifetime with minimal performance overhead

Apollo (Cambridge)

Cross-layer system reliability assessment framework for hardware faults

Author: Bosio Alberto
Canal Corretger Ramon
Chatzidimitriou Athanansios
Di Carlo Stefano
Di Natale Giorgio
Gizipoulos Dimitris
González Colás Antonio María
Kaliorakis Manolis
Kooli Maha
Politano Gianfranco
Riera Villanueva Marc
Savino Alessandro
Tselonis Sotiris
Vallero Alessandro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored in such an estimation model, the delivered reliability reports must be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.Peer ReviewedPostprint (author's final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Cross-layer reliability evaluation, moving from the hardware architecture to the system level: A CLERECO EU project overview

Author: Bosio A.
Di Carlo S.
Di Natale G.
Foutris N.
Gizopoulos D.
Kaliorakis M.
Kooli M.
Politano G.
Savino A.
Tselonis S.
Vallero A.
Publication venue: 'Elsevier BV'
Publication date: 01/06/2015
Field of study

Advanced computing systems realized in forthcoming technologies hold the promise of a significant increase of computational capabilities. However, the same path that is leading technologies toward these remarkable achievements is also making electronic devices increasingly unreliable. Developing new methods to evaluate the reliability of these systems in an early design stage has the potential to save costs, produce optimized designs and have a positive impact on the product time-to-market. CLERECO European FP7 research project addresses early reliability evaluation with a cross-layer approach across different computing disciplines, across computing system layers and across computing market segments. The fundamental objective of the project is to investigate in depth a methodology to assess system reliability early in the design cycle of the future systems of the emerging computing continuum. This paper presents a general overview of the CLERECO project focusing on the main tools and models that are being developed that could be of interest for the research community and engineering practice

Crossref

HAL Descartes

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Hal-Diderot

PORTO Publications Open Repository TOrino

Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview

Author: Alger L. S.
Babikyan C. A.
Butler B. P.
Friend S. A.
Ganska R. J.
Harper R. E.
Lala J. H.
Masotto T. K.
Meyer A. J.
Morton D. P.
Publication venue
Publication date
Field of study

Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation

NASA Technical Reports Server

Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

Author: Alger L. S.
Babikyan C. A.
Butler B. P.
Friend S. A.
Ganska R. J.
Harper R. E.
Lala J. H.
Masotto T. K.
Meyer A. J.
Morton D. P.
Publication venue
Publication date
Field of study

Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions

NASA Technical Reports Server

Enhancing Program Soft Error Resilience through Algorithmic Approaches

Author: Chen Sui
Publication venue: LSU Digital Commons
Publication date: 01/01/2016
Field of study

The rising count and shrinking feature size of transistors within modern computers is making them increasingly vulnerable to various types of soft faults. This problem is especially acute in high-performance computing (HPC) systems used for scientific computing, because these systems include many thousands of compute cores and nodes, all of which may be utilized in a single large-scale run. The increasing vulnerability of HPC applications to errors induced by soft faults is motivating extensive work on techniques to make these applications more resilient to such faults, ranging from generic techniques such as replication or checkpoint/restart to algorithm-specific error detection and tolerance techniques. Effective use of such techniques requires a detailed understanding of how a given application is affected by soft faults to ensure that (i) efforts to improve application resilience are spent in the code regions most vulnerable to faults, (ii) the appropriate resilience techniques is applied to each code region, and (iii) the understanding be obtained in an efficient manner. This thesis presents two tools: FaultTelescope helps application developers view the routine and application vulnerability to soft errors while ErrorSight helps perform modular fault characteristics analysis for more complex applications. This thesis also illustrates how these tools can be used in the context of representative applications and kernels. In addition to providing actionable insights into application behavior, the tools automatically selects the number of fault injection experiments required to efficiently generation error profiles of an application, ensuring that the information is statistically well-grounded without performing unnecessary experiments

Louisiana State University