446 research outputs found

    Enabling preemptive multiprogramming on GPUs

    GPUs are increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. As a result, such systems cannot meet key requirements of multiprogrammed workloads, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
    We would like to thank the anonymous reviewers, Alexander Veidenbaum, Carlos Villavieja, Lluis Vilanova, Lluc Alvarez, and Marc Jorda for their comments and help improving our work and this paper. This work is supported by the European Commission through the TERAFLUX (FP7-249013), Mont-Blanc (FP7-288777), and RoMoL (GA-321253) projects, NVIDIA through the CUDA Center of Excellence program, the Spanish Government through Programa Severo Ochoa (SEV-2011-0067), and the Spanish Ministry of Science and Technology through the TIN2007-60625 and TIN2012-34557 projects.
    Peer reviewed. Postprint (author's final draft).
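The abstract above mentions a policy that distributes GPU cores among concurrent kernels according to their priorities, but it does not give the rule. The sketch below is only one plausible reading: it assumes a proportional split with largest-remainder rounding. The function name `distribute_sms`, the numeric priority weights, and the core count in the test are hypothetical and are not taken from the paper.

```python
def distribute_sms(num_sms, priorities):
    """Split num_sms cores among kernels in proportion to their priority.

    Assumption: proportional sharing with largest-remainder rounding.
    This is an illustrative rule, not the policy from the paper.
    """
    total = sum(priorities.values())
    # Ideal (fractional) share for each kernel.
    exact = {name: num_sms * p / total for name, p in priorities.items()}
    # Start from the integer part of each share.
    shares = {name: int(value) for name, value in exact.items()}
    leftover = num_sms - sum(shares.values())
    # Hand the remaining cores to the kernels with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares
```

A real scheduler would rerun a rule like this whenever a kernel arrives or finishes and would use a preemption mechanism to reclaim cores from kernels whose share shrinks.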

    HeteroCore GPU to exploit TLP-resource diversity


    Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels

    Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels concurrently. On these GPUs, the thread block scheduler (TBS) uses a FIFO policy to schedule thread blocks. We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. To improve both, we propose using the preemptive Shortest Remaining Time First (SRTF) policy instead. Although SRTF requires an estimate of the runtime of GPU kernels, we show that such an estimate can be obtained easily using online profiling and a simple observation about the grid structure of GPU kernels. Specifically, we propose a novel Structural Runtime Predictor. Using a simple Staircase model of GPU kernel execution, we show that the runtime of a kernel can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench and find that it can estimate the actual runtime reasonably well after the execution of only a single thread block. Next, we design a thread block scheduler that is concurrent-kernel-aware and uses this predictor. We implement the SRTF policy and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. Compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive, which controls the resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x, and fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves system throughput within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
    Comment: 14 pages, full pre-review version of PACT 2014 poster.
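The abstract above describes predicting kernel runtime from a few profiled thread blocks and then scheduling by shortest remaining time. A minimal sketch of that idea follows. It assumes a simple "staircase" in which thread blocks run in equal-length waves of `concurrent_blocks` blocks each. The `Kernel` fields, the function names, and the wave formula are illustrative assumptions and are not the authors' predictor.

```python
import math
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    total_blocks: int   # thread blocks in the kernel's grid
    done_blocks: int    # thread blocks already completed
    block_time: float   # measured duration of one profiled thread block


def predicted_remaining(kernel, concurrent_blocks):
    """Estimate remaining runtime as (remaining waves) * (time per block).

    Assumption: blocks execute in waves of `concurrent_blocks`, and every
    wave takes as long as the profiled block.
    """
    remaining = kernel.total_blocks - kernel.done_blocks
    waves = math.ceil(remaining / concurrent_blocks)
    return waves * kernel.block_time


def pick_srtf(kernels, concurrent_blocks):
    """Return the unfinished kernel with the shortest predicted remaining time."""
    runnable = [k for k in kernels if k.done_blocks < k.total_blocks]
    return min(runnable, key=lambda k: predicted_remaining(k, concurrent_blocks))
```

Under this model, a scheduler would call `pick_srtf` each time a thread block slot frees up, so a newly arrived short kernel can overtake a long one that is already running.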

    GPU System Optimization for Efficient System Resource Utilization of General-Purpose GPU Computing Applications in a Multitasking Environment

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Yeom Heon Young.
    General-purpose GPU (GPGPU) applications now play key roles in many research fields, such as high-performance computing (HPC) and deep learning (DL). What these applications have in common is that they all require massive computation power, which matches the highly parallel nature of the graphics processing unit (GPU). However, because the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the resources of a GPU system to reach its best performance, since the GPU system is designed to provide system-level fairness to all applications rather than to optimize for a specific type. GPU multitasking can address this issue by co-locating multiple kernels with diverse resource usage patterns so that they share GPU resources in parallel. However, the current GPU multitasking scheme focuses only on co-launching kernels rather than on making them execute more efficiently. In addition, the current GPU multitasking scheme is not open source, which makes it harder to optimize, since the GPGPU applications and the GPU system are unaware of each other's features. In this dissertation, we claim that a framework placed between the GPU system and the GPGPU applications can yield better performance without modifying the applications. We design and implement such a framework while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between host memory and device memory to address the problem that GPU memory cannot be oversubscribed in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid GPU resource under-utilization in a multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches solve these problems better than existing approaches while delivering better performance.
    Contents:
    Chapter 1 Introduction: 1.1 Motivation; 1.2 Contribution; 1.3 Outline
    Chapter 2 Background: 2.1 Graphics Processing Unit (GPU) and CUDA; 2.2 Checkpoint and Restart; 2.3 Resource Sharing Model; 2.4 CUDA Context; 2.5 GPU Thread Block Scheduling; 2.6 Multi-Process Service with Hyper-Q
    Chapter 3 Checkpoint based solution for GPU memory over-subscription problem: 3.1 Motivation; 3.2 Related Work; 3.3 Design and Implementation (3.3.1 System Design; 3.3.2 CUDA API wrapping module; 3.3.3 Scheduler); 3.4 Evaluation (3.4.1 Evaluation setup; 3.4.2 Overhead of FlexGPU; 3.4.3 Performance with GPU Benchmark Suites; 3.4.4 Performance with Real-world Workloads; 3.4.5 Performance of workloads composed of multiple applications); 3.5 Summary
    Chapter 4 A Workload-aware Fine-grained Resource Management Framework for GPGPUs: 4.1 Motivation; 4.2 Related Work (4.2.1 GPU resource sharing; 4.2.2 GPU scheduling); 4.3 Design and Implementation (4.3.1 System Architecture; 4.3.2 CUDA API Wrapping Module; 4.3.3 smCompactor Runtime; 4.3.4 Implementation Details); 4.4 Analysis on the relation between performance and workload usage pattern (4.4.1 Workload Definition; 4.4.2 Analysis on performance saturation; 4.4.3 Predict the necessary SMs and thread blocks for best performance); 4.5 Evaluation (4.5.1 Evaluation Methodology; 4.5.2 Overhead of smCompactor; 4.5.3 Performance with Different Thread Block Counts on Different Number of SMs; 4.5.4 Performance with Concurrent Kernel and Resource Sharing); 4.6 Summary
    Chapter 5 Conclusion
    Abstract (in Korean)

    Supporting Preemptive Task Executions and Memory Copies in GPGPUs

    GPGPUs (General Purpose Graphics Processing Units) provide massive computational power. However, applying GPGPU technology to real-time computing is challenging because GPGPUs are non-preemptive. In particular, neither a job running on a GPGPU nor a data copy between a GPGPU and the CPU can be preempted. As a result, a high-priority job that arrives in the middle of a low-priority job execution or memory copy suffers from priority inversion. To address the problem, we present a new lightweight approach to supporting preemptive memory copies and job executions in GPGPUs. Moreover, in our approach, a GPGPU job and a memory copy between the GPGPU and the host CPU run concurrently to improve responsiveness. To show the feasibility of our approach, we have implemented a prototype system for preemptive job executions and data copies in a GPGPU. The experimental results show that our approach can bound response times reliably. In addition, the response time of our approach is significantly shorter than that of both the unmodified GPGPU runtime system, which supports no preemption, and an advanced GPGPU model designed to support prioritization and performance isolation via preemptive data copies.
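The abstract above does not say how the memory copies are made preemptible. One common way, assumed here purely for illustration, is to split a large copy into chunks and check for higher-priority work between chunks. The sketch below simulates that on plain host buffers. The function name `preemptible_copy`, the `should_yield` callback, and the chunking itself are hypothetical and are not taken from the paper.

```python
def preemptible_copy(src, dst, chunk_size, should_yield, start=0):
    """Copy src into dst one chunk at a time, starting at offset `start`.

    Assumption: preemption points exist only between chunks. Before each
    chunk, should_yield() is consulted; if it returns True the copy stops.
    Returns the offset reached so the copy can be resumed from there.
    """
    offset = start
    while offset < len(src):
        if should_yield():
            break  # a higher-priority request is waiting; give up the channel
        end = min(offset + chunk_size, len(src))
        dst[offset:end] = src[offset:end]
        offset = end
    return offset
```

With this structure, the worst-case delay seen by a high-priority request is roughly the time to copy one chunk, so a smaller `chunk_size` trades throughput for responsiveness.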

    CUsched: multiprogrammed workload scheduling on GPU architectures

    Graphics Processing Units (GPUs) are now widely used in High Performance Computing (HPC) applications to speed up the execution of massively parallel codes. GPUs are well suited to such HPC environments because these applications share a common characteristic with the gaming codes GPUs were designed for: only one application uses the GPU at a time. Although minimal support for multi-programmed systems exists, modern GPUs do not allow resource sharing among different processes. This lack of support restricts the use of GPUs in desktop and mobile environments to a small number of applications (e.g., games and multimedia players). In this paper we study the multi-programming support available in current GPUs and show that it is not sufficient. We propose a set of hardware extensions to current GPU architectures to efficiently support multi-programmed GPU workloads, allowing concurrent execution of codes from different user processes. We implement several hardware schedulers on top of these extensions and analyze the behaviour of different work scheduling algorithms using system-wide and per-process metrics.
    Postprint (published version).