
    A Survey of Techniques for Improving Security of GPUs

    The graphics processing unit (GPU), although a powerful performance booster, also has many security vulnerabilities. Due to these, the GPU can act as a safe haven for stealthy malware and the weakest 'link' in the security 'chain'. In this paper, we present a survey of techniques for analyzing and improving GPU security. We classify the works on key attributes to highlight their similarities and differences. Beyond informing users and researchers about GPU security techniques, this survey aims to increase their awareness of GPU security vulnerabilities and potential countermeasures.

    Data Resource Management in Throughput Processors

    Graphics Processing Units (GPUs) are becoming common in data centers for tasks like neural network training and image processing due to their high performance and efficiency. GPUs maintain high throughput by running thousands of threads simultaneously, issuing instructions from ready threads to hide latency in others that are stalled. While this is effective for keeping the arithmetic units busy, the challenge in GPU design is moving the data for computation at the same high rate. Any inefficiency in data movement and storage will compromise the throughput and energy efficiency of the system. Since energy consumption and cooling make up a large part of the cost of provisioning and running a data center, making GPUs more suitable for this environment requires removing the bottlenecks and overheads that limit their efficiency. The performance of GPU workloads is often limited by the throughput of the memory resources inside each GPU core, and though many of the power-hungry structures in CPUs are not found in GPU designs, there is overhead for storing each thread's state. When sharing a GPU between workloads, contention for resources also causes interference and slowdown. This thesis develops techniques to manage and streamline the data movement and storage resources in GPUs in each of these places.

    The first part of this thesis resolves data movement restrictions inside each GPU core. The GPU memory system is optimized for sequential accesses, but many workloads load data in irregular or transposed patterns that cause a throughput bottleneck even when all loads are cache hits. This work identifies and leverages opportunities to merge requests across threads before sending them to the cache. While requests are waiting for merges, they can be reordered to achieve a higher cache hit rate. These methods yielded a 38% speedup for memory-throughput-limited workloads.

    Another opportunity for optimization is found in the register file. Since it must store the registers for thousands of active threads, it is the largest on-chip data storage structure on a GPU. The second work in this thesis replaces the register file with a smaller, more energy-efficient register buffer. Compiler directives allow the GPU to know ahead of time which registers will be accessed, allowing the hardware to store only the registers that will be imminently accessed in the buffer, with the rest moved to main memory. This technique reduced total GPU energy by 11%.

    Finally, in a data center, many different applications will be launching GPU jobs, and just as multiple processes can share the same CPU to increase its utilization, running multiple workloads on the same GPU can increase its overall throughput. However, co-runners interfere with each other in unpredictable ways, especially when sharing memory resources. The final part of this thesis controls this interference, allowing a GPU to be shared between two tiers of workloads: one tier with a high performance target and another suitable for batch jobs without deadlines. At a 90% performance target, this technique increased GPU throughput by 9.3%.

    GPUs' high efficiency and performance make them a valuable accelerator in the data center. The contributions in this thesis further increase their efficiency by removing data movement and storage overheads and unlock additional performance by enabling resources to be shared between workloads while controlling interference.

    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/146122/1/jklooste_1.pd
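
    The coalescing problem targeted by the first technique can be illustrated with a small CUDA sketch (hypothetical kernels, not code from the thesis): two kernels read the same square matrix, one in row order, so that a warp's loads merge into a few wide transactions, and one in column order, so that each load touches a different cache line and throughput collapses even when every access hits in cache.

        #include <cuda_runtime.h>

        // Coalesced: consecutive threads read consecutive addresses, so the
        // 32 loads of a warp merge into a few wide memory transactions.
        __global__ void read_rows(const float* in, float* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n * n) out[i] = in[i];
        }

        // Strided: the same threads read a column of a row-major matrix, so
        // consecutive threads load addresses n floats apart. This is the
        // transposed pattern that bottlenecks throughput even on cache hits.
        __global__ void read_cols(const float* in, float* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n * n) {
                int row = i % n, col = i / n;
                out[i] = in[row * n + col];
            }
        }

        int main() {
            const int n = 1024;
            float *in, *out;
            cudaMalloc(&in,  n * n * sizeof(float));
            cudaMalloc(&out, n * n * sizeof(float));
            dim3 block(256), grid((n * n + 255) / 256);
            read_rows<<<grid, block>>>(in, out, n);  // fast path
            read_cols<<<grid, block>>>(in, out, n);  // markedly slower
            cudaDeviceSynchronize();
            cudaFree(in); cudaFree(out);
            return 0;
        }

    Merging the column-order requests across threads before they reach the cache, as the thesis proposes, recovers throughput without changing the program itself.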

    Master of Science

    The advent of the era of cheap and pervasive many-core and multicore parallel systems has highlighted the disparity in performance achieved between novice and expert developers targeting parallel architectures. This disparity is most noticeable with software for running general-purpose computations on graphics processing units (GPGPU programs). Current methods for implementing GPGPU programs require an expert-level understanding of the memory hierarchy and execution model of the hardware to reach peak performance. Even for experts, rewriting a program to exploit these hardware features can be tedious and error-prone. Compilers and their ability to make code transformations can assist in the implementation of GPGPU programs, handling many of the target-specific details. This thesis presents CUDA-CHiLL, a source-to-source compiler transformation and code generation framework for the parallelization and optimization of computations expressed in sequential loop nests for running on many-core GPUs. This system uniquely uses a complete scripting language to describe composable compiler transformations that can be written, shared, and reused by non-expert application and library developers. CUDA-CHiLL is built on the polyhedral program transformation and code generation framework CHiLL, which is capable of robust composition of transformations while preserving the correctness of the program at each step. Through its use of powerful abstractions and a scripting interface, CUDA-CHiLL allows a developer to focus on optimization strategies and ignore the error-prone details and low-level constructs of GPGPU programming. The high-level framework can be used inside an orthogonal auto-tuning system that can quickly evaluate the space of possible implementations. Although specific to CUDA at the moment, many of the abstractions would hold for any GPGPU framework, particularly OpenCL. The contributions of this thesis include a programming-language approach to providing transformation abstraction and composition, a unifying framework for general and GPU-specific transformations, and a demonstration of the framework on standard benchmarks showing it capable of matching or outperforming hand-tuned GPU kernels.
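
    To make the recipe-driven approach concrete, here is a hedged sketch of the kind of kernel such a framework generates (illustrative code, not actual CUDA-CHiLL output). The sequential loop nest appears as a comment; the kernel below corresponds roughly to a recipe that tiles the two outer loops across thread blocks, maps the points of each tile to threads, and stages the operand tiles in shared memory. TILE stands in for the tile size a recipe or auto-tuner would choose, and n is assumed to be a multiple of TILE.

        #include <cuda_runtime.h>

        // Sequential source the transformation recipe starts from:
        //   for (i = 0; i < n; i++)
        //     for (j = 0; j < n; j++)
        //       for (k = 0; k < n; k++)
        //         c[i][j] += a[i][k] * b[k][j];

        #define TILE 16

        __global__ void mm_tiled(const float* a, const float* b,
                                 float* c, int n) {
            __shared__ float as[TILE][TILE];   // staged tile of a
            __shared__ float bs[TILE][TILE];   // staged tile of b
            int row = blockIdx.y * TILE + threadIdx.y;
            int col = blockIdx.x * TILE + threadIdx.x;
            float acc = 0.0f;
            for (int t = 0; t < n / TILE; t++) {
                as[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
                bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
                __syncthreads();               // tiles fully staged
                for (int k = 0; k < TILE; k++)
                    acc += as[threadIdx.y][k] * bs[k][threadIdx.x];
                __syncthreads();               // safe to overwrite tiles
            }
            c[row * n + col] = acc;
        }

        int main() {
            const int n = 512;                 // multiple of TILE
            float *a, *b, *c;
            cudaMalloc(&a, n * n * sizeof(float));
            cudaMalloc(&b, n * n * sizeof(float));
            cudaMalloc(&c, n * n * sizeof(float));
            dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
            mm_tiled<<<grid, block>>>(a, b, c, n);
            cudaDeviceSynchronize();
            cudaFree(a); cudaFree(b); cudaFree(c);
            return 0;
        }

    Expressing the tiling, thread mapping, and shared-memory staging as a reusable recipe, rather than writing them by hand as above, is precisely the tedium the framework removes.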

    BEC: Bit-Level Static Analysis for Reliability against Soft Errors

    Soft errors are a type of transient digital signal corruption that occurs in digital hardware components such as the internal flip-flops of CPU pipelines, the register file, memory cells, and even internal communication buses. Soft errors are caused by environmental radioactivity, magnetic interference, lasers, and temperature fluctuations, either unintentionally or as part of a deliberate attempt to compromise a system and expose confidential data. We propose a bit-level error coalescing (BEC) static program analysis and its two use cases to understand and improve program reliability against soft errors. The BEC analysis tracks each bit corruption in the register file and classifies the effect of the corruption by its semantics at compile time. The usefulness of the proposed analysis is demonstrated in two scenarios: fault injection campaign pruning and reliability-aware program transformation. Experimental results show that bit-level analysis pruned up to 30.04% of exhaustive fault injection campaigns (13.71% on average), without loss of accuracy. Program vulnerability was reduced by up to 13.11% (4.94% on average) through bit-level vulnerability-aware instruction scheduling. The analysis has been implemented within LLVM and evaluated on the RISC-V architecture. To the best of our knowledge, the proposed BEC analysis is the first bit-level compiler analysis for program reliability against soft errors. The proposed method is generic and not limited to a specific computer architecture.
    Comment: 13 pages, 4 figures, to be published in International Symposium on Code Generation and Optimization (CGO) 202
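
    A minimal sketch of the exhaustive fault-injection baseline that such an analysis prunes (a hypothetical host-side harness, not the paper's tooling): flip each bit of a live register value, rerun the consuming computation, and compare against the golden result to classify the flip as masked or corrupting. In this toy example only the low byte is live, so a bit-level analysis could prove the other 24 injections masked at compile time and skip them.

        #include <cstdint>
        #include <cstdio>

        // Toy stand-in for the consumer of one register: only bits 0-7
        // of the input can influence the result.
        static uint32_t consume(uint32_t r) {
            return (r & 0xFFu) + 1;
        }

        int main() {
            uint32_t reg = 0x1234ABCDu;
            uint32_t golden = consume(reg);
            int masked = 0, corrupting = 0;
            // Exhaustive single-bit campaign over one 32-bit register.
            for (int bit = 0; bit < 32; bit++) {
                uint32_t faulty = consume(reg ^ (1u << bit));
                if (faulty == golden) masked++; else corrupting++;
            }
            // A bit-level static analysis can classify bits 8-31 as dead
            // here and prune those injections without running them.
            printf("masked=%d corrupting=%d\n", masked, corrupting);
            return 0;
        }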

    A formally verified compiler back-end

    This article describes the development and formal verification (proof of semantic preservation) of a compiler back-end from Cminor (a simple imperative intermediate language) to PowerPC assembly code, using the Coq proof assistant both for programming the compiler and for proving its correctness. Such a verified compiler is useful in the context of formal methods applied to the certification of critical software: the verification of the compiler guarantees that the safety properties proved on the source code hold for the executable compiled code as well.
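
    The preservation property can be summarized informally; the following is a simplified rendering (not the exact Coq statement), where S is the source program, C the generated assembly, and B an observable behavior that is well defined for S:

        \[
          \mathit{compile}(S) = \mathsf{OK}(C)
          \;\wedge\; S \Downarrow B
          \;\Longrightarrow\; C \Downarrow B
        \]

    Because the proof covers the formal semantics of both Cminor and the assembly target, any safety property implied by the behavior B of S carries over to C.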