69 research outputs found

    Disengaged Scheduling for Fair, Protected Access to Fast Computational Accelerators

    Today's operating systems treat GPUs and other computational accelerators as if they were simple devices, with bounded and predictable response times. With accelerators assuming an increasing share of the workload on modern machines, this strategy is already problematic, and likely to become untenable soon. If the operating system is to enforce fair sharing of the machine, it must assume responsibility for accelerator scheduling and resource management. Fair, safe scheduling is a particular challenge on fast accelerators, which allow applications to avoid kernel-crossing overhead by interacting directly with the device. We propose a disengaged scheduling strategy in which the kernel intercedes between applications and the accelerator on an infrequent basis, to monitor their use of accelerator cycles and to determine which applications should be granted access over the next time interval. Our strategy assumes a well-defined, narrow interface exported by the accelerator. We build upon such an interface, systematically inferred for the latest Nvidia GPUs. We construct several example schedulers, including Disengaged Timeslice with overuse control, which guarantees fairness, and Disengaged Fair Queueing, which is probabilistic but effective in limiting resource idleness. Both schedulers ensure fair sharing of the GPU, even among uncooperative or adversarial applications; Disengaged Fair Queueing incurs a 4% overhead on average (18% maximum) compared to direct device access.
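
    A minimal sketch of the disengaged-timeslice idea follows, assuming hypothetical hooks grant_access(), revoke_access(), and read_gpu_cycles() that stand in for the paper's actual kernel interface: the OS grants one application unmediated access per slice and engages only occasionally to sample a usage counter and debit overuse against future slices.

    import time
    from collections import deque

    QUANTUM = 0.030      # nominal timeslice, in seconds (assumed value)
    PROBE_PERIOD = 10    # engage (sample counters) once every N slices

    class Process:
        def __init__(self, name):
            self.name = name
            self.debt = 0.0          # GPU time owed from past overuse

    def disengaged_timeslice(procs, grant_access, revoke_access, read_gpu_cycles):
        queue = deque(procs)
        slice_no = 0
        while queue:                 # scheduler loop; runs while processes remain
            p = queue.popleft()
            share = max(QUANTUM - p.debt, 0.0)   # shrink slice to repay debt
            p.debt = max(p.debt - QUANTUM, 0.0)
            if share > 0:
                grant_access(p)      # map the device to p; the OS steps aside
                time.sleep(share)    # p talks to the GPU directly, unmonitored
                revoke_access(p)     # unmap; p can no longer submit directly
            slice_no += 1
            if slice_no % PROBE_PERIOD == 0:
                used = read_gpu_cycles(p)        # engage: sample true usage
                if used > share:
                    p.debt += used - share       # charge overuse to later slices
            queue.append(p)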

    Glider: A GPU Library Driver for Improved System Security

    Legacy device drivers implement both device resource management and isolation. This results in a large code base with a wide high-level interface, making the driver vulnerable to security attacks. This is particularly problematic for increasingly popular accelerators like GPUs, which have large, complex drivers. We solve this problem with library drivers, a new driver architecture. A library driver implements resource management as an untrusted library in the application process address space, and implements isolation as a kernel module that is smaller and has a narrower, lower-level interface (i.e., closer to hardware) than a legacy driver. We articulate a set of device and platform hardware properties that are required to retrofit a legacy driver into a library driver. To demonstrate the feasibility and superiority of library drivers, we present Glider, a library driver implementation for two popular GPU brands, Radeon and Intel. Glider reduces the TCB size and attack surface by about 35% and 84% respectively for a Radeon HD 6450 GPU, and by about 38% and 90% respectively for an Intel Ivy Bridge GPU. Moreover, it incurs no performance cost. Indeed, Glider outperforms a legacy driver for applications requiring intensive interactions with the device driver, such as applications using the OpenGL immediate mode API.
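
    The split the abstract describes can be illustrated with a toy sketch: a small trusted isolation layer that only tracks page ownership, and an untrusted per-process library that implements allocation policy on top of it. All class and method names here are invented for illustration and are not Glider's interface.

    class IsolationModule:
        """Trusted kernel part: only enforces that a process touches
        pages it owns; no allocation policy lives here."""
        def __init__(self, vram_pages):
            self.owner = {}            # page -> owning pid
            self.free = set(range(vram_pages))

        def grant_page(self, pid):
            page = self.free.pop()     # raises KeyError when VRAM is exhausted
            self.owner[page] = pid
            return page

        def check_access(self, pid, page):
            if self.owner.get(page) != pid:
                raise PermissionError(f"pid {pid} does not own page {page}")

    class LibraryDriver:
        """Untrusted part, linked into the application: implements buffer
        management on top of pages obtained from the isolation module."""
        PAGE = 4096

        def __init__(self, iso, pid):
            self.iso, self.pid, self.pages = iso, pid, []

        def alloc_buffer(self, nbytes):
            need = -(-nbytes // self.PAGE)   # ceiling division
            got = [self.iso.grant_page(self.pid) for _ in range(need)]
            self.pages.extend(got)
            return got                       # "buffer" = list of owned pages

    A command-submission path in the trusted module would call check_access on every page a command references, while all policy decisions stay in the untrusted library.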

    FairGV: Fair and Fast GPU Virtualization

    Increasingly, high-performance computing (HPC) application developers are opting to use cloud resources due to their higher availability. Virtualized GPUs would be an obvious and attractive option for HPC application developers using cloud hosting services. Unfortunately, existing GPU virtualization software is not ready to address the fairness, utilization, and performance limitations associated with consolidating mixed HPC workloads. This paper presents FairGV, a radically redesigned GPU virtualization system that achieves system-wide weighted fair sharing and strong performance isolation in mixed workloads that use GPUs with varying degrees of intensity. To achieve its objectives, FairGV introduces a trap-less GPU processing architecture, a new fair queuing method integrated with work-conserving and GPU-centric co-scheduling policies, and a collaborative scheduling method for non-preemptive GPUs. Our prototype implementation achieves near-ideal fairness (≥ 0.97 Min-Max Ratio) with little performance degradation (≤ 1.02 aggregated overhead) in a range of mixed HPC workloads that leverage GPUs.
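
    The fair-queuing component can be illustrated with textbook start-time fair queueing over non-preemptive kernels; the sketch below is generic, not FairGV's exact algorithm. Each tenant has a weight, kernels run to completion once dispatched, and per-tenant virtual time advances by runtime divided by weight, so heavier tenants earn proportionally more GPU time.

    import heapq

    class FairQueue:
        def __init__(self):
            self.vtime = {}      # tenant -> virtual finish time of last kernel
            self.pending = []    # heap of (start_tag, seq, tenant, kernel)
            self.seq = 0         # tie-breaker so tuples never compare tenants

        def global_vtime(self):
            return self.pending[0][0] if self.pending else 0.0

        def submit(self, tenant, kernel, weight, est_runtime):
            start = max(self.vtime.get(tenant, 0.0), self.global_vtime())
            heapq.heappush(self.pending, (start, self.seq, tenant, kernel))
            self.seq += 1
            # Heavier tenants advance their clock more slowly, earning
            # proportionally more GPU time in the long run.
            self.vtime[tenant] = start + est_runtime / weight

        def next_kernel(self):
            """Dispatch the pending kernel with the smallest start tag;
            the GPU then runs it to completion (non-preemptive)."""
            if not self.pending:
                return None
            _, _, tenant, kernel = heapq.heappop(self.pending)
            return tenant, kernel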

    Abstraction of GPUs as First-Class Computational Resources in Multi-Tenant Cloud Environments


    LoGA : Low-Overhead GPU Accounting Using Events

    Over the last few years, GPUs have become common in computing. However, current GPUs are not designed for a shared environment like a cloud, creating a number of challenges whenever a GPU must be multiplexed between multiple users. In particular, the round-robin scheduling used by today's GPUs does not distribute the available GPU computation time fairly among applications. Most of the previous work addressing this problem resorted to scheduling all GPU computation in software, which induces high overhead. While there is a GPU scheduler called NEON which reduces the scheduling overhead compared to previous work, NEON's accounting mechanism frequently disables GPU access for all but one application, resulting in considerable overhead if that application does not saturate the GPU by itself. In this paper, we present LoGA, a novel accounting mechanism for GPU computation time. LoGA monitors the GPU's state to detect GPU-internal context switches, and infers the amount of GPU computation time consumed by each process from the time between these context switches. This method allows LoGA to measure the GPU computation time consumed by applications while keeping all applications running concurrently. As a result, LoGA achieves a lower accounting overhead than previous work, especially for applications that do not saturate the GPU by themselves. We have developed a prototype which combines LoGA with the pre-existing NEON scheduler. Experiments with that prototype have shown that LoGA induces no accounting overhead while still delivering accurate measurements of applications' consumed GPU computation time.
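
    A sketch of the event-based accounting idea, with invented names: each GPU context-switch event closes an interval that is charged to the process whose context was resident, so measurement never requires stopping the other applications.

    from collections import defaultdict

    class GpuAccountant:
        def __init__(self):
            self.usage = defaultdict(float)  # pid -> accumulated GPU seconds
            self.resident = None             # pid whose context is on the GPU
            self.last_switch = None          # timestamp of the last switch event

        def on_context_switch(self, timestamp, next_pid):
            if self.resident is not None and self.last_switch is not None:
                # Charge the elapsed interval to the outgoing context.
                self.usage[self.resident] += timestamp - self.last_switch
            self.resident, self.last_switch = next_pid, timestamp

    # Usage with a synthetic event trace: pid 1 runs 0.0-0.4 s, pid 2 runs 0.4-0.5 s.
    acct = GpuAccountant()
    for t, pid in [(0.0, 1), (0.4, 2), (0.5, 1)]:
        acct.on_context_switch(t, pid)
    print(dict(acct.usage))   # {1: 0.4, 2: 0.1}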

    Efficient and portable multi-tasking for heterogeneous systems

    Modern computing systems comprise heterogeneous designs which combine multiple and diverse architectures on a single system. These designs offer the potential for high performance under reduced power requirements, but require advanced resource management and workload scheduling across the available processors. Programmability frameworks, such as OpenCL and CUDA, enable resource management and workload scheduling on heterogeneous systems. These frameworks assign control of resource allocation and scheduling entirely to the application. This design adequately serves the needs of dedicated application systems but introduces significant challenges for multi-tasking environments where multiple users and applications compete for access to system resources. This thesis considers these challenges and presents three major contributions that enable efficient multi-tasking on heterogeneous systems. The presented contributions are compatible with existing systems, remain portable across vendors, and require no application changes or recompilation. The first contribution of this thesis is an optimization technique that reduces host-device communication overhead for OpenCL applications. It does this without modification or recompilation of the application source code and is portable across platforms. This work enables efficiency and performance improvements for the diverse application workloads found on multi-tasking systems. The second contribution is the design and implementation of a secure, user-space virtualization layer that integrates the accelerator resources of a system with the standard multi-tasking and user-space virtualization facilities of the commodity Linux OS. It enables fine-grained sharing of mixed-vendor accelerator resources, targets the heterogeneous systems found in data center nodes, and requires no modification to the OS, OpenCL, or applications. Lastly, the third contribution is a technique and software infrastructure that enable resource-sharing control on accelerators while supporting software-managed scheduling, as sketched below. The infrastructure remains transparent to existing systems and applications, requires no modifications or recompilation, and enforces the fair accelerator sharing that multi-tasking requires.
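
    One generic way to realize such transparent sharing control is credit-based throttling at an interposed submission point; the sketch below is purely illustrative and is not the thesis' actual mechanism.

    class CreditThrottle:
        def __init__(self, shares):
            self.shares = shares                    # pid -> relative share
            self.credits = dict.fromkeys(shares, 0.0)

        def replenish(self, budget):
            """Called once per accounting period: hand out the period's
            accelerator-time budget in proportion to each share."""
            total = sum(self.shares.values())
            for pid, share in self.shares.items():
                self.credits[pid] = budget * share / total

        def try_submit(self, pid, est_cost, submit):
            """Interposed between application and accelerator: forward the
            kernel only if the process still has credit this period."""
            if self.credits[pid] >= est_cost:
                self.credits[pid] -= est_cost
                submit()
                return True
            return False            # held back until the next replenish()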

    Searching Slice Counts for Real-Time Guarantees and Better Schedulability of GPUs

    Master's thesis, Seoul National University, Department of Computer Science and Engineering, August 2021; advised by Chang-Gun Lee. This paper proposes a conditionally optimal slice-count search algorithm to improve GPUs' real-time guarantees and schedulability. Despite the growing importance of GPUs driven by recent advances in deep learning, techniques for using them in real time are still lacking. The paper models a GPU as a uniprocessor and schedules GPU kernels with non-preemptive EDF. It then resolves the schedulability loss caused by the non-preemptive uniprocessor assumption by searching for a slice count for each kernel that makes the GPU task set schedulable.
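
    The slicing idea can be sketched as follows: splitting a kernel into k slices shrinks its largest non-preemptive chunk to C/k, which bounds how long it can block a more urgent kernel under non-preemptive EDF. The blocking tolerance used below (D - C) is a deliberately simplified placeholder, and the greedy search is illustrative rather than the thesis' conditionally optimal algorithm.

    def search_slice_counts(tasks, max_slices=64):
        """tasks: list of (C, D) pairs, execution time C and relative
        deadline D. Returns one slice count per task, or None on failure."""
        if any(D < C for C, D in tasks):
            return None                        # no tolerance at all: hopeless
        slices = [1] * len(tasks)

        def chunk(j):                          # largest non-preemptive chunk
            return tasks[j][0] / slices[j]

        while True:
            violated = None
            for i, (C, D) in enumerate(tasks):
                tol = D - C                    # simplified blocking tolerance
                others = [(chunk(j), j) for j in range(len(tasks)) if j != i]
                if others and max(others)[0] > tol:
                    violated = max(others)[1]  # index of the worst blocker
                    break
            if violated is None:
                return slices                  # all tasks tolerate worst blocking
            if slices[violated] >= max_slices:
                return None                    # cannot slice any finer; give up
            slices[violated] += 1              # slice the worst blocker once more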

    GPrioSwap : Towards a Swapping Policy for GPUs

    Over the last few years, Graphics Processing Units (GPUs) have become popular in computing, and have found their way into a number of cloud platforms. However, integrating a GPU into a cloud environment requires the cloud provider to efficiently virtualize the GPU. While several research projects have addressed this challenge in the past, few of these projects attempt to properly enable sharing of GPU memory between multiple clients: To date, GPUswap is the only project that enables sharing of GPU memory without inducing unnecessary application overhead, while maintaining both fairness and high utilization of GPU memory. However, GPUswap includes only a rudimentary swapping policy, and therefore induces a rather large application overhead. In this paper, we work towards a practicable swapping policy for GPUs. To that end, we analyze the behavior of various GPU applications to determine their memory access patterns. Based on our insights about these patterns, we derive a swapping policy that includes a developer-assigned priority for each GPU buffer in its swapping decisions. Experiments with our prototype implementation show that a swapping policy based on buffer priorities can significantly reduce the swapping overhead.
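
    The priority-based policy can be sketched as straightforward victim selection (invented names, not GPUswap's interface): when GPU memory runs short, swap out the buffers the developer marked lowest priority until enough space is freed.

    def pick_victims(buffers, bytes_needed):
        """buffers: list of (buffer_id, size_bytes, priority); lower priority
        means the developer marked the buffer as cheaper to access from
        system RAM. Returns the buffer ids to swap out to host memory."""
        victims, freed = [], 0
        for buf_id, size, _prio in sorted(buffers, key=lambda b: b[2]):
            if freed >= bytes_needed:
                break
            victims.append(buf_id)
            freed += size
        return victims

    # Example: need 256 MiB; buffers tagged with priorities by the developer.
    bufs = [("weights", 512 << 20, 9), ("scratch", 128 << 20, 1),
            ("staging", 256 << 20, 2)]
    print(pick_victims(bufs, 256 << 20))   # ['scratch', 'staging']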