2 research outputs found

    Doctor of Philosophy

    Get PDF
    dissertationCurrent scaling trends in transistor technology, in pursuit of larger component counts and improving power efficiency, are making the hardware increasingly less reliable. Due to extreme transistor miniaturization, it is becoming easier to flip a bit stored in memory elements built using these transistors. Given that soft errors can cause transient bit-flips in memory elements, caused due to alpha particles and cosmic rays striking those elements, soft errors have become one of the major impediments in system resilience as we move towards exascale computing. Soft errors escaping the hardware-layer may silently corrupt the runtime application data of a program, causing silent data corruption in the output. Also, given that soft errors are transient in nature, it is notoriously hard to trace back their origins. Therefore, techniques to enhance system resilience hinge on the availability of efficient error detectors that have high detection rates, low false positive rates, and lower computational overhead. It is equally important to have a flexible infrastructure capable of simulating realistic soft error models to promote an effective evaluation of newly developed error detectors. In this work, we present a set of techniques for efficiently detecting soft errors affecting control-flow, data, and structured address computations in an application. We evaluate the efficacy of the proposed techniques by evaluating them on a collection of benchmarks through fault-injection driven studies. As an important requirement, we also introduce two new LLVM-based fault injectors, KULFI and VULFI, which are geared towards scalar and vector architectures, respectively. Through this work, we aim to make contributions to the system resilience community by making our research tools (in the form of error detectors and fault injectors) publicly available
    corecore