
    GPU Concurrency: Weak Behaviours and Programming Assumptions

    Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming---often supported by official tutorials---as false. As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
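
    The litmus tests in question are short concurrent programs whose outcomes reveal weak behaviours. As a hedged illustration (not a test from the paper's generated suite; the kernel and variable names are invented), the classic message-passing (MP) shape looks like this in CUDA, where the weak outcome r0 == 1 && r1 == 0 may be observed between thread blocks under the kind of memory stress the study applies:

        // Minimal MP litmus test sketch. One thread writes a payload then a
        // flag; another reads the flag then the payload. Without fences, the
        // accesses may appear to happen out of order across blocks.
        #include <cstdio>

        __global__ void mp_test(volatile int *data, volatile int *flag,
                                int *r0, int *r1) {
            if (blockIdx.x == 0 && threadIdx.x == 0) {
                *data = 1;        // T0: write payload
                *flag = 1;        // T0: write flag, no fence in between
            } else if (blockIdx.x == 1 && threadIdx.x == 0) {
                *r0 = *flag;      // T1: read flag
                *r1 = *data;      // T1: read payload
            }
        }

        int main() {
            int *data, *flag, *r0, *r1;
            cudaMallocManaged(&data, sizeof(int));
            cudaMallocManaged(&flag, sizeof(int));
            cudaMallocManaged(&r0, sizeof(int));
            cudaMallocManaged(&r1, sizeof(int));
            *data = *flag = *r0 = *r1 = 0;
            mp_test<<<2, 1>>>(data, flag, r0, r1);
            cudaDeviceSynchronize();
            printf("r0=%d r1=%d\n", *r0, *r1);  // weak outcome: r0=1 r1=0
            return 0;
        }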

    Master of Science

    Graphics Processing Units (GPUs) are highly parallel shared memory microprocessors, and as such, they are prone to the same concurrency considerations as their traditional multicore CPU counterparts. In this thesis, we consider shared memory consistency, i.e. what values can be read when reads are issued concurrently with writes on current GPU hardware. While memory consistency has been relatively well studied for CPUs, GPUs present substantially different concurrency systems with an explicit thread and memory hierarchy. Because documentation on GPU memory models is limited, it remains unclear what behaviors are allowed by current GPU implementations. To this end, this work focuses on testing shared memory consistency behavior on NVIDIA GPUs. We present a format for describing GPU memory consistency tests (dubbed GPU litmus tests) which includes the placement of testing threads into the GPU thread hierarchy (e.g. cooperative thread arrays, warps) and of memory locations into GPU memory regions (e.g. shared, global). We then present a framework for running GPU litmus tests under system stress designed to trigger weak memory model behaviors, that is, executions that do not correspond to an interleaving of the instructions of the concurrent program. We discuss GPU-specific incantations (i.e. heuristics) which we found to be crucial for observing weak memory model executions; these include bank conflicts and custom GPU memory stressing functions. We then report the results of running GPU litmus tests in this framework and show that we observe a controversial relaxed coherence behavior on older NVIDIA chips. We present several examples of published GPU applications which may exhibit unintended behavior due to the lack of fence synchronization; one such example is a spin-lock published in the popular CUDA by Example book. We then run several families of tests and compare our results to a proposed operational GPU memory model, showing that the model is unsound (i.e. it disallows behaviors that we observe on hardware). Our techniques are implemented in a modified version of a memory model testing tool named litmus.
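
    To make the fence issue concrete, here is a hedged sketch of a GPU spin-lock of the kind the thesis critiques, written for this listing rather than copied from CUDA by Example; the names (mutex, lock, unlock) are illustrative. On chips with relaxed coherence, omitting the __threadfence() calls can let critical-section writes escape the lock or arrive late at the next lock holder:

        // Spin-lock sketch. atomicCAS acquires, atomicExch releases; the
        // fences order critical-section accesses with acquire and release.
        __device__ int mutex = 0;

        __device__ void lock() {
            while (atomicCAS(&mutex, 0, 1) != 0) { }  // spin until acquired
            __threadfence();  // order later reads after the acquisition
        }

        __device__ void unlock() {
            __threadfence();  // make critical-section writes visible first
            atomicExch(&mutex, 0);
        }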

    Mixed-proxy extensions for the NVIDIA PTX memory consistency model

    In recent years, there has been a trend towards the use of accelerators and architectural specialization to continue scaling performance in spite of a slowing of Moore's Law. GPUs have always relied on dedicated hardware for graphics workloads, but modern GPUs now also incorporate compute-domain accelerators such as NVIDIA's Tensor Cores for machine learning. For these accelerators to be successfully integrated into a general-purpose programming language such as C++ or CUDA, there must be a forward- and backward-compatible API for the functionality they provide. To the extent that all of these accelerators interact with program threads through memory, they should be incorporated into the GPU's memory consistency model. Unfortunately, the use of accelerators and/or special non-coherent paths into memory produces non-standard memory behavior that existing GPU memory models cannot capture. In this work, we describe the "proxy" extensions added to version 7.5 of NVIDIA's PTX ISA for GPUs. A proxy is an extra tag abstractly applied to every memory or fence operation. Proxies generalize the notion of address translation and specialized non-coherent cache hierarchies into an abstraction that cleanly describes the resulting non-standard behavior. The goal of proxies is to facilitate integration of these specialized memory accesses into the general-purpose PTX programming model in a fully composable manner. This paper characterizes the behaviors that proxies can capture, the microarchitectural intuition behind them, the necessary updates to the formal memory model, and the tooling that we built in order to ensure that the resulting model is both sound and sufficient for the business-critical workloads that these extensions are designed to support.
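
    As a hedged sketch of what a proxy fence looks like to a programmer (assuming two pointers that alias the same physical memory through different virtual mappings, e.g. set up with the CUDA virtual memory management APIs; the function and pointer names here are invented), the alias proxykind from the PTX ISA can be issued via inline PTX:

        // Accesses through different virtual aliases of one physical page
        // go through different "proxies"; fence.proxy.alias orders them.
        __device__ void write_then_read(int *via_alias_a, int *via_alias_b) {
            *via_alias_a = 42;                               // write via alias A
            asm volatile("fence.proxy.alias;" ::: "memory"); // cross proxies
            int v = *via_alias_b;                            // read via alias B
            (void)v;  // without the fence, v may not observe the write
        }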

    Efficient Precise Dynamic Data Race Detection For Cpu And Gpu

    Data races are notorious bugs. They introduce non-determinism into program behavior and complicate program semantics, making parallel programs challenging to debug. To make parallel programming easier, efficient data race detection has been a research topic over the last few decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their use in real-world applications and scenarios. This dissertation proposes approaches that improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically in GPU programs, which have a programming and execution model quite different from CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of data race detection for GPUs, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs.
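
    A common foundation for precise dynamic detectors is the vector-clock check popularised by FastTrack; the sketch below (host-side C++, all names invented, not the dissertation's algorithm) shows the kind of per-variable metadata whose redundancy such work aims to identify and strip:

        #include <vector>

        struct Epoch { int tid; long clock; };   // "clock c at thread tid"

        struct VarState {
            Epoch last_write{-1, 0};
            std::vector<long> read_clocks;  // last read per thread; size = #threads
        };

        // vc[t] is thread t's vector clock. Returns true if a write by
        // `tid` races with an earlier access to `x`.
        bool check_write(VarState &x, int tid,
                         const std::vector<std::vector<long>> &vc) {
            // write-write race: previous write not happens-before this one
            if (x.last_write.tid >= 0 && x.last_write.tid != tid &&
                vc[tid][x.last_write.tid] < x.last_write.clock)
                return true;
            // read-write race: some earlier read not happens-before this write
            for (int t = 0; t < (int)x.read_clocks.size(); ++t)
                if (t != tid && vc[tid][t] < x.read_clocks[t])
                    return true;
            x.last_write = {tid, vc[tid][tid]};  // record this write's epoch
            return false;
        }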

    Automating C++ Execution Exploration to Solve the Out-of-thin-air Problem

    Modern computers are marvels of engineering: customisable reasoning engines that can be programmed to complete complex mathematical tasks at incredible speed. Decades of engineering have taken computers from room-sized machines to near-invisible devices in all aspects of life. With this engineering has come more complex and ornate design, a substantial leap forward being multiprocessing. Modern processors can execute threads of program logic in parallel, coordinating shared resources like memory and device access. Parallel computation leads to significant scaling of compute power, but yields a substantial complexity cost for both processor designers and programmers. Parallel access to shared memory requires coordination on which thread can use a particular fragment of memory at a given time. Simple mechanisms like locks and mutexes, which ensure only one thread at a time can access memory, give an easy-to-use programming model, but they eschew the benefits of parallel computation. Instead, processors today have complex mechanisms to permit concurrent shared memory access. These mechanisms prevent simple programmer reasoning and require complex formal descriptions to define: memory models. Early memory model research focused on weak memory behaviours which are observable because of hardware design; over time it has become obvious that not only hardware but also compilers are capable of making new weak behaviours observable. Substantial and rapid success has been achieved formalising the behaviour of these machines: researchers refined new specifications for shared-memory concurrency and used mechanisation to automate validation of their models. As the models were refined and new behaviours of the hardware were discovered, researchers also began working with processor vendors – helping to inform design choices in new processor designs to keep the weak behaviours within some sensible bounds. Unfortunately, when reasoning about shared memory accesses of highly optimised programming languages like C and C++, deep questions are still left open about how best to describe the behaviour of shared memory accesses in the presence of dependency-removing compiler optimisations. Until very recently it has not been possible to properly specify the behaviours of these programs without forbidding optimisations which are used and observable, or allowing program behaviours which are nonsense and never observable. In this thesis I explore the development of memory models through the lens of tooling: taking at first an industrial approach, and then exploring memory models for highly optimised programming languages. I show that taming the complexity of these models with automated tools aids bug finding even where formal evaluation has not. Further, building tools creates a focus on the computational complexity of the memory model, which in turn can steer development of the model towards simpler designs. We will look at three case studies: the first is an industrial hardware model of NVIDIA GPUs which we extend to encompass more hardware features than before. This extension was validated using an automated testing process generating tests of finite size, and then verified against the original memory model in Coq. The second case study is an exploration of the first memory model for an optimised programming language which takes proper account of dependencies. We build a tool to automate execution of this model over a series of tests, and in the process discovered subtleties in the definitions which were unexpected – leading to refinement of the model. In the final case study, we develop a memory model that gives a direct definition for compiler-preserved dependencies. This model is the first that can be integrated with relative ease into the C/C++ programming language standard. We built this model alongside its own tooling, yielding a fast tool for giving determinations on a large number of litmus tests – a novelty for this sort of memory model. This model fits well with the existing C/C++ specifications, and we are working with the International Standards Organisation to understand how best to fit this model into the standard.
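
    The canonical instance of the problem is the load-buffering litmus test with relaxed atomics, reproduced here as a standard C++ example (this exact shape is folklore in the literature, not code taken from the thesis):

        #include <atomic>

        std::atomic<int> x{0}, y{0};
        int r1, r2;

        void thread1() {
            r1 = x.load(std::memory_order_relaxed);
            y.store(r1, std::memory_order_relaxed);  // data-dependent on r1
        }

        void thread2() {
            r2 = y.load(std::memory_order_relaxed);
            x.store(r2, std::memory_order_relaxed);  // data-dependent on r2
        }

        // The outcome r1 == r2 == 42 would mean 42 appeared "out of thin
        // air". C++ intends to forbid it, yet naive rules that forbid it by
        // respecting every syntactic dependency also outlaw optimisations
        // compilers actually perform.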

    Automatically Comparing Memory Consistency Models

    A memory consistency model (MCM) is the part of a programming language or computer architecture specification that defines which values can legally be read from shared memory locations. Because MCMs take into account various optimisations employed by architectures and compilers, they are often complex and counterintuitive, which makes them challenging to design and to understand. We identify four tasks involved in designing and understanding MCMs: generating conformance tests, distinguishing two MCMs, checking compiler optimisations, and checking compiler mappings. We show that all four tasks are instances of a general constraint-satisfaction problem to which the solution is either a program or a pair of programs. Although this problem is intractable for automatic solvers when phrased over programs directly, we show how to solve analogous constraints over program executions, and then construct programs that satisfy the original constraints. Our technique, which is implemented in the Alloy modelling framework, is illustrated on several software- and architecture-level MCMs, both axiomatically and operationally defined. We automatically recreate several known results, often in a simpler form, including: distinctions between variants of the C11 MCM; a failure of the ‘SC-DRF guarantee’ in an early C11 draft; that x86 is ‘multi-copy atomic’ and Power is not; bugs in common C11 compiler optimisations; and bugs in a compiler mapping from OpenCL to AMD-style GPUs. We also use our technique to develop and validate a new MCM for NVIDIA GPUs that supports a natural mapping from OpenCL.
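
    For instance, the multi-copy atomicity result is witnessed by the classic IRIW litmus test, shown here in standard C++11 atomics as a hedged illustration (the paper works with its own litmus-test formats): two readers disagreeing on the order of two independent writes is forbidden under the x86 compiler mapping but observable under the Power mapping.

        #include <atomic>

        std::atomic<int> x{0}, y{0};
        int r1, r2, r3, r4;

        void writer_x() { x.store(1, std::memory_order_relaxed); }
        void writer_y() { y.store(1, std::memory_order_relaxed); }

        void reader_xy() {
            r1 = x.load(std::memory_order_relaxed);
            r2 = y.load(std::memory_order_relaxed);
        }

        void reader_yx() {
            r3 = y.load(std::memory_order_relaxed);
            r4 = x.load(std::memory_order_relaxed);
        }

        // Weak outcome: r1 == 1, r2 == 0, r3 == 1, r4 == 0, i.e. the two
        // readers see the writes to x and y in opposite orders.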

    Doctor of Philosophy

    Graphics processing units (GPUs) are highly parallel processors that are now commonly used in the acceleration of a wide range of computationally intensive tasks. GPU programs often suffer from data races and deadlocks, necessitating systematic testing. Conventional GPU debuggers are ineffective at finding and root-causing races since they detect errors only with respect to specific platforms, inputs, and thread schedules. Recent tools based on formal and semiformal analysis have improved the situation much, but they still have some problems. Our research goal is to apply scalable formal analysis that is not bound to a specific platform and that covers all relevant inputs and thread schedules for GPU programs. To achieve this objective, we create a novel symbolic analysis and test case generator tailored for C++ GPU programs, the entire framework consisting of three stages: GKLEE, GKLEEp, and SESA. Moreover, this dissertation not only shows that our framework is capable of uncovering many concurrency errors effectively in real-world CUDA programs, such as the latest CUDA SDK kernels and the Parboil and LoneStarGPU benchmarks, but also demonstrates that a high degree of test automation is achievable in the space of GPU programs through SMT-based symbolic execution, picking representative executions through thread abstraction, and combined static and dynamic analysis.
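
    The races such tools target are often as simple as a missing barrier between a read phase and a write phase. A hedged sketch (an invented kernel, not one of the cited benchmarks) of the kind of bug a GKLEE-style symbolic analysis would flag on every input:

        // Each thread reads its right neighbour's element and overwrites its
        // own; thread i's write to a[i] races with thread i-1's read of a[i].
        __global__ void shift_left(int *a, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n - 1) {
                int next = a[i + 1];  // racing read
                a[i] = next;          // racing write; a barrier is needed
            }                         // between the phases (and __syncthreads()
        }                             // would only cover threads in one block)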

    On expressing different concurrency paradigms on virtual execution environment

    Virtual execution environments (VEEs) such as the Java Virtual Machine (JVM) and the Microsoft Common Language Runtime (CLR) were designed when the dominant computer architecture featured a von Neumann interface to programs: a single processor hiding all the complexity of parallel computation inside its design. Programs are expressed in an intermediate form executed by the VEE, which defines an abstract computational model; the concurrency model has been influenced by these design choices, and it basically exposes the multi-threading model of the underlying operating system. Recently, computer systems have introduced computational units in which concurrency is explicit and under program control. Relevant examples are graphics processing units (GPUs, such as those from Nvidia or AMD) and the Cell BE architecture, which allow for explicit control of individual processing units, local memories, and communication channels. Unfortunately, programs designed for virtual machines cannot access these resources, since they are not available through the abstractions provided by the VEE. A major redesign of VEEs seems necessary to bridge this gap. In this thesis we study the problem of exposing non-von Neumann computing resources within the virtual machine without redesigning the whole execution infrastructure. In this work we express parallel computations relying on extensible meta-data and reflection to encode information. Meta-programming techniques are then used to rewrite the program into an equivalent one that uses the special-purpose underlying architecture. We provide a case study in which this approach is applied to compiling Common Intermediate Language (CIL) methods to multi-core GPUs; we show that it is possible to access these non-standard computing resources without any change to the virtual machine design.