3,259 research outputs found

    Havens: Explicit Reliable Memory Regions for HPC Applications

    Full text link
    Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, US

    Selective SWIFT-R. A Flexible Software-Based Technique for Soft Error Mitigation in Low-Cost Embedded Systems

    Get PDF
    Commercial off-the-shelf microprocessors are the core of low-cost embedded systems due to their programmability and cost-effectiveness. Recent advances in electronic technologies have allowed remarkable improvements in their performance. However, they have also made microprocessors more susceptible to transient faults induced by radiation. These non-destructive events (soft errors), may cause a microprocessor to produce a wrong computation result or lose control of a system with catastrophic consequences. Therefore, soft error mitigation has become a compulsory requirement for an increasing number of applications, which operate from the space to the ground level. In this context, this paper uses the concept of selective hardening, which is aimed to design reduced-overhead and flexible mitigation techniques. Following this concept, a novel flexible version of the software-based fault recovery technique known as SWIFT-R is proposed. Our approach makes possible to select different registers subsets from the microprocessor register file to be protected on software. Thus, design space is enriched with a wide spectrum of new partially protected versions, which offer more flexibility to designers. This permits to find the best trade-offs between performance, code size, and fault coverage. Three case studies have been developed to show the applicability and flexibility of the proposal.This work was funded by the Ministry of Science and Innovation in Spain with the project ‘RENASER+: Integral Analysis of Digital Circuits and Systems for Aerospace Applications’ (TEC2010-22095-C03-01)

    Dependable Computing on Inexact Hardware through Anomaly Detection.

    Full text link
    Reliability of transistors is on the decline as transistors continue to shrink in size. Aggressive voltage scaling is making the problem even worse. Scaled-down transistors are more susceptible to transient faults as well as permanent in-field hardware failures. In order to continue to reap the benefits of technology scaling, it has become imperative to tackle the challenges risen due to the decreasing reliability of devices for the mainstream commodity market. Along with the worsening reliability, achieving energy efficiency and performance improvement by scaling is increasingly providing diminishing marginal returns. More than any other time in history, the semiconductor industry faces the crossroad of unreliability and the need to improve energy efficiency. These challenges of technology scaling can be tackled by categorizing the target applications in the following two categories: traditional applications that have relatively strict correctness requirement on outputs and emerging class of soft applications, from various domains such as multimedia, machine learning, and computer vision, that are inherently inaccuracy tolerant to a certain degree. Traditional applications can be protected against hardware failures by low-cost detection and protection methods while soft applications can trade off quality of outputs to achieve better performance or energy efficiency. For traditional applications, I propose an efficient, software-only application analysis and transformation solution to detect data and control flow transient faults. The intelligence of the data flow solution lies in the use of dynamic application information such as control flow, memory and value profiling. The control flow protection technique achieves its efficiency by simplifying signature calculations in each basic block and by performing checking at a coarse-grain level. For soft applications, I develop a quality control technique. The quality control technique employs continuous, light-weight checkers to ensure that the approximation is controlled and application output is acceptable. Overall, I show that the use of low-cost checkers to produce dependable results on commodity systems---constructed from inexact hardware components---is efficient and practical.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113341/1/dskhudia_1.pd

    Reliable Software for Unreliable Hardware - A Cross-Layer Approach

    Get PDF
    A novel cross-layer reliability analysis, modeling, and optimization approach is proposed in this thesis that leverages multiple layers in the system design abstraction (i.e. hardware, compiler, system software, and application program) to exploit the available reliability enhancing potential at each system layer and to exchange this information across multiple system layers

    Application-Based Analysis of Register File Criticality for Reliability Assessment in Embedded Microprocessors

    Get PDF
    There is an increasing concern to reduce the cost and overheads during the development of reliable systems. Selective protection of most critical parts of the systems represents a viable solution to obtain a high level of reliability at a fraction of the cost. In particular to design a selective fault mitigation strategy for processor-based systems, it is mandatory to identify and prioritize the most vulnerable registers in the register file as best candidates to be protected (hardened). This paper presents an application-based metric to estimate the criticality of each register from the microprocessor register file in microprocessor-based systems. The proposed metric relies on the combination of three different criteria based on common features of executed applications. The applicability and accuracy of our proposal have been evaluated in a set of applications running in different microprocessors. Results show a significant improvement in accuracy compared to previous approaches and regardless of the underlying architecture.This work was funded in part by the Spanish Ministry of Education, Culture and Sports with the project “Developing hybrid fault tolerance techniques for embedded microprocessors” (PHB2012-0158-PC)

    Soft-error resilient on-chip memory structures

    Get PDF
    Soft errors induced by energetic particle strikes in on-chip memory structures, such as L1 data/instruction caches and register files, have become an increasing challenge in designing new generation reliable microprocessors. Due to their transient/random nature, soft errors cannot be captured by traditional verification and testing process due to the irrelevancy to the correctness of the logic. This dissertation is thus focusing on the reliability characterization and cost-effective reliable design of on-chip memories against soft errors. Due to various performance, area/size, and energy constraints in various target systems, many existing unoptimized protection schemes on cache memories may eventually prove significantly inadequate and ineffective. This work develops new lifetime models for data and tag arrays residing in both the data and instruction caches. These models facilitate the characterization of cache vulnerability of the stored items at various lifetime phases. The design methodology is further exemplified by the proposed reliability schemes targeting at specific vulnerable phases. Benchmarking is carried out to showcase the effectiveness of these approaches. The tag array demands high reliability against soft errors while the data array is fully protected in on-chip caches, because of its crucial importance to the correctness of cache accesses. Exploiting the address locality of memory accesses, this work proposes a Tag Replication Buffer (TRB) to protect information integrity of the tag array in the data cache with low performance, energy and area overheads. To provide a comprehensive evaluation of the tag array reliability, this work also proposes a refined evaluation metric, detected-without-replica-TVF (DOR-TVF), which combines the TVF and access-with-replica (AWR) analysis. Based on the DOR-TVF analysis, a TRB scheme with early write-back (TRB-EWB) is proposed, which achieves a zero DOR-TVF at a negligible performance overhead. Recent research, as well as the proposed optimization schemes in this cache vulnerability study, have focused on the design of cost-effective reliable data caches in terms of performance, energy, and area overheads based on the assumption of fixed error rates. However, for systems in operating environments that vary with time or location, those schemes will be either insufficient or over-designed for the changing error rates. This work explores the design of a self-adaptive reliable data cache that dynamically adapts its employed reliability schemes to the changing operating environments in order to maintain a target reliability. The experimental evaluation shows that the self-adaptive data cache achieves similar reliability to a cache protected by the most reliable scheme, while simultaneously minimizing the performance and power overheads. Besides the data/instruction caches, protecting the register file and its data buses is crucial to reliable computing in high-performance microprocessors. Since the register file is in the critical path of the processor pipeline, any reliable design that increases either the pressure on the register file or the register file access latency is not desirable. This work proposes to exploit narrow-width register values, which represent the majority of generated values, for making the duplicates within the same register data item. A detailed architectural vulnerability factor (AVF) analysis shows that this in-register duplication (IRD) scheme significantly reduces the AVF in the register file compared to the conventional design. The experimental evaluation also shows that IRD provides superior read-with-duplicate (RWD) and error detection/recovery rates under heavy error injection as compared to previous reliability schemes, while only incurring a small power overhead. By integrating the proposed reliable designs in data/instruction caches and register files, the vulnerability of the entire microprocessor is dramatically reduced. The new lifetime model, the self-adaptive design and the narrow-width value duplication scheme proposed in this work can also provide guidance to architects toward highly efficient reliable system design
    corecore