3 research outputs found

    Trace Sanitizer: Eliminating the Effects of Non-Determinism of Error Propagation Analysis

    Modern computing systems typically relax execution determinism, for instance by allowing the CPU scheduler to interleave the execution of several threads. While beneficial for performance, execution non-determinism affects programs' execution traces and hampers the comparability of repeated executions. We present TraceSanitizer, a novel approach for execution trace comparison in Error Propagation Analyses (EPA) of multi-threaded programs. TraceSanitizer can identify and compensate for non-determinism caused either by dynamic memory allocation or by non-deterministic scheduling. We formulate a condition under which TraceSanitizer is guaranteed to achieve a 0% false positive rate, and automate its verification using Satisfiability Modulo Theory (SMT) solving techniques. TraceSanitizer is comprehensively evaluated using execution traces from the PARSEC and Phoenix benchmarks. In contrast with other approaches, TraceSanitizer eliminates false positives without increasing the false negative rate (for a specific class of programs), with reasonable performance overheads.

    Practical GPGPU Application Resilience Estimation and Fortification

    Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications but remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. One of the major challenges in the domain of GPU reliability is to accurately measure general purpose GPU (GPGPU) application resilience to transient faults. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Alternatively, fault site selection techniques have been proposed to approach high accuracy with fewer fault injection experiments. However, most of the existing methods in the literature focus only on the single-bit fault model and on a single input. In this dissertation, we offer solutions to the two problems above. We extend a progressive fault site pruning technique to two multi-bit fault models: (a) multi-bit faults in the same word; (b) multiple single-bit faults in different words accessed by the same thread. We devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of application error resilience. Key to the SUGAR estimation methodology is the identification of repeating thread patterns that develop as a function of the size of the input. These patterns allow for accurate prediction of application error resilience for arbitrarily large inputs. With input-aware estimation strategies in place, we are able to pinpoint the vulnerabilities in a GPGPU application and propose low-overhead protection techniques accordingly.
    Based on the variation in thread resilience in GPGPU applications, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. Our technique allows engaging partial protection mechanisms at the warp level. We illustrate that threads can be remapped into reliable or unreliable warps with only minimal overhead, after which selective protection via replication is applied to the unreliable warps. We show how this remapping facilitates warp replication for error detection and correction and achieves a significant reduction in execution cycles compared to standard techniques. In addition to input-aware estimation and fortification, we present a detailed characterization comparing microarchitecture-level and software-level fault injection and show the gap in resilience estimates introduced by injecting faults into different layers of the system execution stack. We also implement a software-level redundancy protection mechanism and measure its effectiveness using microarchitecture-level and software-level fault injection.

    Logs and Models in Engineering Complex Embedded Production Software Systems
