786 research outputs found

    Isomorphism between Linear Codes and Arithmetic Codes for Safe Data Processing in Embedded Software Systems

    We present a transformation rule to convert linear codes into arithmetic codes. Linear codes are usually used for error detection and correction in broadcast and storage systems. In contrast, arithmetic codes are well suited to protecting software processing in computer systems. This paper shows how to transform linear codes that protect the data stored in a computer system into arithmetic codes that safeguard the operations built on this data. Combining the advantages of both coding mechanisms increases the error detection capability of safety-critical embedded applications by detecting and correcting arbitrary hardware faults.
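
As a rough illustration of why arithmetic codes can protect operations and not only stored data, the sketch below uses an AN code, one common family of arithmetic codes. It does not reproduce the paper's transformation rule. The constant `A = 61` and the helper names are assumptions chosen only for this example.

```python
A = 61  # hypothetical code constant, chosen only for illustration


def encode(x: int) -> int:
    """Encode a value as a multiple of A."""
    return A * x


def is_valid(codeword: int) -> bool:
    """A codeword is valid only if it is divisible by A."""
    return codeword % A == 0


def decode(codeword: int) -> int:
    """Check the codeword and recover the original value."""
    if not is_valid(codeword):
        raise ValueError("arithmetic code check failed")
    return codeword // A


def flip_bit(word: int, bit: int) -> int:
    """Model a single-bit hardware fault."""
    return word ^ (1 << bit)


# Addition is performed directly on encoded operands:
# A*a + A*b == A*(a + b), so the sum is again a valid codeword.
encoded_sum = encode(20) + encode(22)
result = decode(encoded_sum)

# A single bit flip changes the word by +/- 2**k, which is never a
# multiple of the odd constant A, so the check fails.
corrupted = flip_bit(encoded_sum, 3)
detected = not is_valid(corrupted)
```

Because the check is a property of the value itself, it survives the addition, which is the kind of protection a code over stored bits alone does not give.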

    A Pattern Language for High-Performance Computing Resilience

    High-performance computing (HPC) systems provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance on the order of quadrillions of floating-point calculations per second by aggregating the power of millions of compute, memory, networking, and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment make the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified, in the form of design patterns, the well-known techniques for handling faults, errors, and failures that have been devised, applied, and improved upon over the past three decades. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.
    Comment: Proceedings of the 22nd European Conference on Pattern Languages of Program

    Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications

    The challenging deployment of compute-intensive applications from domains such as Artificial Intelligence (AI) and Digital Signal Processing (DSP) forces the computing systems community to explore new design approaches. Approximate Computing appears as an emerging solution, allowing the quality of results to be tuned in the design of a system in order to improve energy efficiency and/or performance. This radical paradigm shift has attracted interest from both academia and industry, resulting in significant research on approximation techniques and methodologies at different design layers (from system down to integrated circuits). Motivated by the wide appeal of Approximate Computing over the last 10 years, we conduct a two-part survey to cover key aspects (e.g., terminology and applications) and review the state-of-the-art approximation techniques from all layers of the traditional computing stack. In Part II of our survey, we classify and present the technical details of application-specific and architectural approximation techniques, both of which target the design of resource-efficient processors/accelerators and systems. Moreover, we present a detailed analysis of the application spectrum of Approximate Computing and discuss open challenges and future directions.
    Comment: Under Review at ACM Computing Survey

    Low-cost error detection through high-level synthesis

    System-on-chip design is becoming increasingly complex as technology scaling enables more and more functionality on a chip. This scaling and complexity have resulted in a variety of reliability and validation challenges including logic bugs, hot spots, wear-out, and soft errors. To make matters worse, as we reach the limits of Dennard scaling, efforts to improve system performance and energy efficiency have resulted in the integration of a wide variety of complex hardware accelerators in SoCs. Thus the challenge is to design complex, custom hardware that is efficient, but also correct and reliable. High-level synthesis shows promise to address the problem of complex hardware design by providing a bridge from the high-productivity software domain to the hardware design process. Much research has been done on high-level synthesis efficiency optimizations. This thesis shows that high-level synthesis also has the power to address validation and reliability challenges through two solutions. One solution for circuit reliability is modulo-3 shadow datapaths: performing lightweight shadow computations in modulo-3 space for each main computation. We leverage the binding and scheduling flexibility of high-level synthesis to detect control errors through diverse binding and minimize area cost through intelligent checkpoint scheduling and modulo-3 reducer sharing. We introduce logic and dataflow optimizations to further reduce cost. We evaluated our technique with 12 high-level synthesis benchmarks from the arithmetic-oriented PolyBench benchmark suite using FPGA-emulated netlist-level error injection. We observe coverages of 99.1% for stuck-at faults, 99.5% for soft errors, and 99.6% for timing errors with a 25.7% area cost and negligible performance impact. Leveraging a mean error detection latency of 12.75 cycles (4150x faster than an end-result check) for soft errors, we also explore a rollback recovery method with an additional area cost of 28.0%, observing a 175x increase in reliability against soft errors.
    Another solution, for rapid post-silicon validation of accelerator designs, is Hybrid Quick Error Detection (H-QED): inserting signature generation logic in a hardware design to create a heavily compressed signature stream that captures the internal behavior of the design at a fine temporal and spatial granularity. This stream is compared with a reference set of signatures generated by high-level simulation to detect bugs. Using H-QED, we demonstrate an improvement in error detection latency (the time elapsed from when a bug is activated to when it manifests as an observable failure) of two orders of magnitude and a threefold improvement in bug coverage compared to traditional post-silicon validation techniques. H-QED also uncovered previously unknown bugs in the CHStone benchmark suite, which is widely used by the HLS community. H-QED incurs less than 10% area overhead for the accelerator it validates with negligible performance impact, and we also introduce techniques to minimize any possible intrusiveness introduced by H-QED.
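
The idea behind a modulo-3 shadow computation can be sketched in software. The example below is a minimal illustration under stated assumptions, not the thesis's hardware design: the function names are hypothetical, the main computation is a single multiply-accumulate, and Python's unbounded integers are used, so the overflow handling that a fixed-width datapath would need is ignored.

```python
def mac(a: int, b: int, c: int) -> int:
    """Main computation: a multiply-accumulate on full-width values."""
    return a * b + c


def shadow_mac_mod3(a: int, b: int, c: int) -> int:
    """Shadow computation: the same operation carried out on residues mod 3."""
    return ((a % 3) * (b % 3) + (c % 3)) % 3


def residues_agree(main_result: int, shadow_result: int) -> bool:
    """The main result is accepted only if its residue matches the shadow."""
    return main_result % 3 == shadow_result


a, b, c = 7, 5, 4
main_result = mac(a, b, c)
shadow_result = shadow_mac_mod3(a, b, c)
fault_free_ok = residues_agree(main_result, shadow_result)

# Model a soft error as a single bit flip in the main result. The flip
# changes the value by +/- 2**k, and no power of two is divisible by 3,
# so the residue always changes and the mismatch is detected.
faulty_result = main_result ^ (1 << 2)
fault_detected = not residues_agree(faulty_result, shadow_result)
```

The shadow path only ever handles values in {0, 1, 2}, which is why such a check can be much cheaper than duplicating the full-width computation.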

    Dependable Embedded Systems

    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. The book introduces the most prominent reliability concerns from today’s point of view and briefly recapitulates the progress made in the community so far. Unlike other books that focus on a single abstraction level, such as the circuit level or the system level alone, this book deals with reliability challenges across different levels, starting from the physical level all the way to the system level (cross-layer approaches). The book aims to demonstrate how new hardware/software co-design solutions can be proposed to effectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with the latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are proactively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many-core systems.

    Sources of Variations in Error Sensitivity of Computer Systems

    Technology scaling is reducing the reliability of integrated circuits. This makes it important to provide computers with mechanisms that can detect and correct hardware errors. This thesis deals with the problem of assessing the hardware error sensitivity of computer systems. Error sensitivity, which is the likelihood that a hardware error will escape detection and produce an erroneous output, measures a system’s inability to detect hardware errors. This thesis presents the results of a series of fault injection experiments that investigated how error sensitivity varies for different system characteristics, including (i) the inputs processed by a program, (ii) a program’s source code implementation, and (iii) the use of compiler optimizations. The study focused on the impact of transient hardware faults that result in bit errors in CPU registers and main memory locations. We investigated how the error sensitivity varies for single-bit errors vs. double-bit errors, and how error sensitivity varies with respect to the machine instructions that were targeted for fault injection. The results show that the input profile and source code implementation of the investigated programs had a major impact on error sensitivity, while using different compiler optimizations caused only minor variations. There was no significant difference in error sensitivity between single-bit and double-bit errors. Finally, the error sensitivity seems to depend more on the type of data processed by an instruction than on the instruction type.
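
The definition of error sensitivity above can be made concrete with a toy fault injection campaign. This is a minimal sketch under assumptions that are not from the thesis: faults are injected into the input data of a Python function rather than into CPU registers or memory, every single-bit flip is tried exhaustively, the workload has no detection mechanism, and the names `workload` and `error_sensitivity` are hypothetical.

```python
from typing import Callable, Sequence

WORD_BITS = 8  # assumed word width of each data element


def workload(data: Sequence[int]) -> int:
    """Toy program under test: returns the largest element."""
    return max(data)


def error_sensitivity(program: Callable[[Sequence[int]], int],
                      data: Sequence[int],
                      bits: int = WORD_BITS) -> float:
    """Fraction of single-bit flips that yield an undetected wrong output."""
    golden = program(data)  # fault-free reference output
    failures = 0
    trials = 0
    for index in range(len(data)):
        for bit in range(bits):
            faulty = list(data)
            faulty[index] ^= 1 << bit  # inject one transient bit error
            trials += 1
            # With no detection mechanism, any wrong output goes undetected.
            if program(faulty) != golden:
                failures += 1
    return failures / trials


# Same program, two input profiles.
sensitivity_one_large = error_sensitivity(workload, [0, 0, 0, 255])
sensitivity_all_zero = error_sensitivity(workload, [0, 0, 0, 0])
```

With the input `[0, 0, 0, 255]` only flips in the largest element change the output, giving a sensitivity of 0.25, while with `[0, 0, 0, 0]` every flip changes it, giving 1.0. Even this toy case shows the same program's sensitivity shifting with its input profile.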

    Calcul sur architecture non fiable

    Although circuits could in theory be fabricated error-free at the huge cost of worst-case design methodologies, they remain susceptible to transient faults caused by radiation, temperature sensitivity, and similar effects. In contrast, an error-resilient design relieves the manufacturing process of the variability issue and thereby saves material cost. Because variability and transient upsets worsen as new fabrication processes emerge and feature sizes shrink, robust design is becoming an urgent requirement. This thesis addresses the problem of designing on unreliable circuits. The main contributions are fourfold. First, a fault-tolerant method with fast error correction and low-cost redundancy is presented. Second, we introduce a judicious two-dimensional criterion to estimate the reliability and the hardware efficiency of a circuit. Third, a general-purpose model offers low-redundancy error resilience for contemporary logic systems as well as future nanoelectronic architectures. Finally, a decoder that tolerates internal transient faults is designed in this work.
    In theory, electronic circuits designed with the worst-case method are supposed to guarantee error-free operation at a high implementation cost. In practice, circuits remain subject to transient errors because of their sensitivity to hazards such as radiation and temperature. By contrast, a design that takes fault tolerance into account makes it possible to cope with such hazards, including variability in the fabrication process. Moreover, transient errors and fabrication variability are intensifying with the emergence of new fabrication processes and ever smaller circuits. The demand for designs that integrate fault tolerance is therefore becoming paramount. This thesis aims to frame the problem of designing circuits on unreliable chips and makes contributions in four areas. First, we propose a fault-tolerance method based on error correction and low-cost redundancy. Then, we present a judicious two-dimensional criterion for evaluating the reliability and hardware efficiency of circuits. Next, we propose a universal model that provides low-redundancy fault tolerance for today's logic systems and tomorrow's nanoelectronic architectures. Finally, we present a decoder that tolerates internal transient faults.
