2 research outputs found

    GPGPU Reliability Analysis: From Applications to Large Scale Systems

    Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure can result in a significant loss of scientific productivity and system resources. Worse still, HPC systems face severe resilience challenges as they progress towards exascale computing, so it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by examining the effects of soft errors both on the entire system and on specific applications. To understand system-level reliability, a large-scale field study of GPU soft errors is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes, so that costly error-protection mechanisms can be turned on and off proactively and dynamically based on the prediction results. To understand the effects of soft errors at the application level, a fault-injection framework is designed to characterize the reliability and resilience of GPGPU applications. The framework reduces the enormous number of candidate fault-injection sites to a manageable size while preserving high accuracy, and it is validated with both single-bit and multi-bit fault models across various GPGPU benchmarks. Lastly, building on the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at the kernel, CTA, and warp levels. In addition, given that some application outputs corrupted by soft errors may still be acceptable, we present a use case showing how to enable low-overhead yet reliable GPU computing for GPGPU applications.
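    To make the single-bit and multi-bit fault models concrete, the following is a minimal, hypothetical sketch (in Python; it is not the dissertation's actual framework) of how a register-level fault injector typically perturbs a value: a fault is emulated by flipping one or more randomly chosen bits of a 32-bit destination-register value. All names here are illustrative.

        import random

        def inject_fault(reg_value: int, num_bits: int = 1) -> int:
            """Flip num_bits distinct bit positions in a 32-bit register value."""
            positions = random.sample(range(32), num_bits)  # choose distinct bits
            for p in positions:
                reg_value ^= (1 << p)        # XOR flips the selected bit
            return reg_value & 0xFFFFFFFF    # keep the result within 32 bits

        # Single-bit fault model (num_bits=1) vs. a 2-bit multi-bit fault model
        golden = 0x3F800000                  # bit pattern of the float 1.0f
        faulty = inject_fault(golden, num_bits=2)
        print(f"golden={golden:#010x} faulty={faulty:#010x}")

    Whether such a corrupted value leads to a crash, a silent data corruption, or a masked (benign) outcome is exactly what a fault-injection campaign over many such sites measures.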

    Scalable coordination of distributed in-memory transactions

    Coordinating transactions involves ensuring serializability in the presence of concurrent data accesses. Accomplishing this in a scalable manner for distributed in-memory transactions is the aim of this thesis. To this end, the work makes three contributions. First, it experimentally demonstrates that transaction latency and throughput scale considerably well when an atomic multicast service is offered to transaction nodes by a crash-tolerant ensemble of dedicated nodes, and that using such a service is more scalable than the practices advocated in the literature. Second, we design, implement, and evaluate a crash-tolerant and non-blocking atomic broadcast protocol, called ABcast, which is then used as the foundation for building the aforementioned multicast service. ABcast is a hybrid protocol consisting of a primary and a backup protocol executing in parallel. The primary protocol is a deterministic atomic broadcast protocol that provides high performance when node crashes are absent, but blocks in their presence until a group membership service detects such failures. The backup protocol, Aramis, is a probabilistic protocol that does not block in the event of node crashes and allows message delivery to continue post-crash until the primary protocol is able to resume. The Aramis design avoids blocking by assuming that message delays remain within a known bound with a high probability that can be estimated in advance, provided that recent delay estimates are used to (i) continually adjust that bound and (ii) regulate flow control. Aramis delivers broadcasts in total order with a probability that can be tuned to be close to 1; comprehensive evaluations show that this probability can be 99.99% or more. Finally, we assess the effect of low-probability order violations on implementing the various isolation levels commonly considered in transaction systems. Together, these three contributions advance the state of the art in two major ways: (i) identifying a service-based approach to transactional scalability, and (ii) establishing a practical alternative to the complex Paxos-style approach to building such a service, using novel but simple protocols and open-source software frameworks.
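    To illustrate the delay-bound idea behind Aramis, here is a minimal, hypothetical sketch (in Python; it is not the thesis's protocol and covers only the bound adjustment, not flow control). A receiver buffers timestamped broadcasts and delivers them in timestamp order only after an estimated maximum delay delta has elapsed; delta is continually re-estimated from recently observed delays, and a message arriving later than delta is what can cause a low-probability order violation. All names are illustrative.

        import heapq
        import time

        class ProbabilisticOrderedDelivery:
            def __init__(self, initial_delta: float = 0.05):
                self.delta = initial_delta   # current delay bound (seconds)
                self.buffer = []             # min-heap keyed by send timestamp
                self.recent_delays = []      # sliding window of observed delays

            def receive(self, send_ts: float, msg: str) -> None:
                """Record an incoming broadcast and refresh the delay bound."""
                delay = time.time() - send_ts
                self.recent_delays = (self.recent_delays + [delay])[-100:]
                # Keep the bound a margin above the worst recent delay.
                self.delta = 1.2 * max(self.recent_delays)
                heapq.heappush(self.buffer, (send_ts, msg))

            def deliverable(self) -> list:
                """Deliver, in timestamp order, messages whose bound expired."""
                now, out = time.time(), []
                while self.buffer and self.buffer[0][0] + self.delta <= now:
                    out.append(heapq.heappop(self.buffer)[1])
                return out

    Enlarging the safety margin (here a fixed factor of 1.2) trades delivery latency against the chance of an order violation, mirroring the tunable ordering probability the thesis evaluates.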