GPGPU Reliability Analysis: From Applications to Large Scale Systems
Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure could result in a significant loss of scientific productivity and system resources. Moreover, since HPC systems face severe resilience challenges as they progress toward exascale computing, it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale field study of GPU soft errors is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes, so that costly error protection mechanisms can be turned on or off proactively and dynamically based on the prediction results. To understand the effects of soft errors at the application level, a fault-injection framework is designed to characterize the reliability and resilience of GPGPU applications. The framework reduces the very large number of fault-injection locations to a manageable size while preserving accuracy. It is validated with both single-bit and multi-bit fault models for various GPGPU benchmarks. Lastly, building on the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at the kernel, CTA, and warp levels.
In addition, given that some application outputs corrupted by soft errors may still be acceptable, we present a use case showing how to enable low-overhead yet reliable GPU computing for GPGPU applications.
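The single-bit and multi-bit fault models mentioned above can be illustrated with a minimal sketch. It is not the dissertation's framework: the helper name `flip_bits` and the choice of a 32-bit IEEE 754 float as the injection target are assumptions made only for illustration.

```python
import struct


def flip_bits(value: float, bit_positions) -> float:
    """Return `value`, stored as an IEEE 754 float32, with the given bits flipped.

    `flip_bits` is a hypothetical helper for illustration; bit 0 is the least
    significant mantissa bit and bit 31 is the sign bit.
    """
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    for bit in bit_positions:
        if not 0 <= bit < 32:
            raise ValueError(f"bit position out of range: {bit}")
        raw ^= 1 << bit  # XOR with a one-hot mask flips exactly that bit
    # Reinterpret the corrupted integer as a float32 again.
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw))
    return corrupted


# Single-bit fault model: one flipped bit per injection.
single_bit_result = flip_bits(1.0, [31])
# Multi-bit fault model: several bits flipped in the same value.
multi_bit_result = flip_bits(1.0, [31, 23])
```

Flipping the sign bit of 1.0 gives -1.0, while flipping the lowest exponent bit halves it. This shows why the location of an injected fault matters for whether the corrupted output is still acceptable.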
Scalable coordination of distributed in-memory transactions
PhD Thesis. Coordinating transactions involves ensuring serializability in the presence of concurrent data
accesses. Accomplishing this in a scalable manner for distributed in-memory transactions is the
aim of this thesis work. To this end, the work makes three contributions. It first experimentally
demonstrates that transaction latency and throughput scale considerably well when an atomic
multicast service is offered to transaction nodes by a crash-tolerant ensemble of dedicated nodes
and that using such a service is the most scalable approach compared to practices advocated in
the literature. Secondly, we design, implement and evaluate a crash-tolerant and non-blocking
atomic broadcast protocol, called ABcast, which is then used as the foundation for building the
aforementioned multicast service.
ABcast is a hybrid protocol, which consists of a pair of primary and backup protocols executing
in parallel. The primary protocol is a deterministic atomic broadcast protocol that provides
high performance when node crashes are absent, but blocks in their presence until a group
membership service detects such failures. The backup protocol, Aramis, is a probabilistic protocol
that does not block in the event of node crashes and allows message delivery to continue
post-crash until the primary protocol is able to resume. The Aramis design avoids blocking by assuming
that message delays remain within a known bound with a high probability that can be
estimated in advance, provided that recent delay estimates are used to (i) continually adjust
that bound and (ii) regulate flow control. Aramis's delivery of broadcasts preserves total order
with a probability that can be tuned to be close to 1. Comprehensive evaluations show that this
probability can be 99.99% or more.
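The idea of continually adjusting a delay bound from recent delay estimates can be sketched as follows. This is not the Aramis algorithm, which the abstract does not specify. The class name `DelayBoundEstimator`, the sliding window, the empirical quantile, and the safety factor are all assumptions chosen for illustration.

```python
import math
from collections import deque


class DelayBoundEstimator:
    """Hypothetical sliding-window estimator of a message delay bound."""

    def __init__(self, window=1000, target_probability=0.9999, safety_factor=1.5):
        self.samples = deque(maxlen=window)  # only the most recent delays are kept
        self.target_probability = target_probability
        self.safety_factor = safety_factor

    def observe(self, delay_ms: float) -> None:
        """Record one measured message delay; the oldest sample drops out when full."""
        self.samples.append(delay_ms)

    def bound(self) -> float:
        """Return a bound that covers the target fraction of recent delays."""
        if not self.samples:
            raise ValueError("no delay samples observed yet")
        ordered = sorted(self.samples)
        # Smallest sample such that at least the target fraction is <= it.
        rank = math.ceil(self.target_probability * len(ordered)) - 1
        rank = min(max(rank, 0), len(ordered) - 1)
        # The safety factor adds headroom above the empirical quantile.
        return ordered[rank] * self.safety_factor
```

A sender in such a scheme could call `observe` for each measured delay and re-read `bound` before deciding how long to wait for a message. Because old samples fall out of the window, the bound tracks current network conditions instead of historical ones.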
Finally, we assess the effect of low-probability order violations on implementing various
isolation levels commonly considered in transaction systems. These three contributions together
advance the state of the art in two major ways: (i) identifying a service-based approach
to transactional scalability and (ii) establishing a practical alternative to the complex PAXOS-style
approach to building such a service, by using novel but simple protocols and open-source
software frameworks. Red Ha