243 research outputs found

    Enabling preemptive multiprogramming on GPUs

    Get PDF
    GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend the NVIDIA GK110 (Kepler) like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve execution time of high-priority processes by 15.6x, the average application turnaround time between 1.5x to 2x, and system fairness up to 3.4x.We would like to thank the anonymous reviewers, Alexan- der Veidenbaum, Carlos Villavieja, Lluis Vilanova, Lluc Al- varez, and Marc Jorda on their comments and help improving our work and this paper. This work is supported by Euro- pean Commission through TERAFLUX (FP7-249013), Mont- Blanc (FP7-288777), and RoMoL (GA-321253) projects, NVIDIA through the CUDA Center of Excellence program, Spanish Government through Programa Severo Ochoa (SEV-2011-0067) and Spanish Ministry of Science and Technology through TIN2007-60625 and TIN2012-34557 projects.Peer ReviewedPostprint (authorโ€™s final draft

    HeteroCore GPU to exploit TLP-resource diversity

    Get PDF

    A scheduling theory framework for GPU tasks e๏ฌƒcient execution

    Get PDF
    Concurrent execution of tasks in GPUs can reduce the computation time of a workload by overlapping data transfer and execution commands. However it is di๏ฌƒcult to implement an e๏ฌƒcient run- time scheduler that minimizes the workload makespan as many execution orderings should be evaluated. In this paper, we employ scheduling theory to build a model that takes into account the device capabili- ties, workload characteristics, constraints and objec- tive functions. In our model, GPU tasks schedul- ing is reformulated as a ๏ฌ‚ow shop scheduling prob- lem, which allow us to apply and compare well known methods already developed in the operations research ๏ฌeld. In addition we develop a new heuristic, specif- ically focused on executing GPU commands, that achieves better scheduling results than previous tech- niques. Finally, a comprehensive evaluation, showing the suitability and robustness of this new approach, is conducted in three di๏ฌ€erent NVIDIA architectures (Kepler, Maxwell and Pascal).Proyecto TIN2016- 0920R, Universidad de Mรกlaga (Campus de Excelencia Internacional Andalucรญa Tech) y programa de donaciรณn de NVIDIA Corporation

    Supporting Preemptive Task Executions and Memory Copies in GPGPUs

    Get PDF
    GPGPUs (General Purpose Graphic Processing Units) provide massive computational power. However, applying GPGPU technology to real-time computing is challenging due to the non-preemptive nature of GPGPUs. Especially, a job running in a GPGPU or a data copy between a GPGPU and CPU is non-preemptive. As a result, a high priority job arriving in the middle of a low priority job execution or memory copy suffers from priority inversion. To address the problem, we present a new lightweight approach to supporting preemptive memory copies and job executions in GPGPUs. Moreover, in our approach, a GPGPU job and memory copy between a GPGPU and the hosting CPU are run concurrently to enhance the responsiveness. To show the feasibility of our approach, we have implemented a prototype system for preemptive job executions and data copies in a GPGPU. The experimental results show that our approach can bound the response times in a reliable manner. In addition, the response time of our approach is significantly shorter than those of the unmodified GPGPU runtime system that supports no preemption and an advanced GPGPU model designed to support prioritization and performance isolation via preemptive data copies

    A C-DAG task model for scheduling complex real-time tasks on heterogeneous platforms: preemption matters

    Full text link
    Recent commercial hardware platforms for embedded real-time systems feature heterogeneous processing units and computing accelerators on the same System-on-Chip. When designing complex real-time application for such architectures, the designer needs to make a number of difficult choices: on which processor should a certain task be implemented? Should a component be implemented in parallel or sequentially? These choices may have a great impact on feasibility, as the difference in the processor internal architectures impact on the tasks' execution time and preemption cost. To help the designer explore the wide space of design choices and tune the scheduling parameters, in this paper we propose a novel real-time application model, called C-DAG, specifically conceived for heterogeneous platforms. A C-DAG allows to specify alternative implementations of the same component of an application for different processing engines to be selected off-line, as well as conditional branches to model if-then-else statements to be selected at run-time. We also propose a schedulability analysis for the C-DAG model and a heuristic allocation algorithm so that all deadlines are respected. Our analysis takes into account the cost of preempting a task, which can be non-negligible on certain processors. We demonstrate the effectiveness of our approach on a large set of synthetic experiments by comparing with state of the art algorithms in the literature

    ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ GPU๋ฅผ ์‚ฌ์šฉํ•œ ๋ฒ”์šฉ์  ๊ณ„์‚ฐ ์‘์šฉ์˜ ํšจ์œจ์ ์ธ ์‹œ์Šคํ…œ ์ž์› ํ™œ์šฉ์„ ์œ„ํ•œ GPU ์‹œ์Šคํ…œ ์ตœ์ ํ™”

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2020. 8. ์—ผํ—Œ์˜.Recently, General Purpose GPU (GPGPU) applications are playing key roles in many different research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature exists in these applications is that all of them require massive computation power, which follows the high parallelism characteristics of the graphics processing unit (GPU). However, because of the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU systems resources to achieve the best performance of the GPU since the GPU system is designed to provide system-level fairness to all applications instead of optimizing for a specific type. GPU multitasking can address the issue by co-locating multiple kernels with diverse resource usage patterns to share the GPU resource in parallel. However, the current GPU mul- titasking scheme focuses just on co-launching the kernels rather than making them execute more efficiently. Besides, the current GPU multitasking scheme is not open-sourced, which makes it more difficult to be optimized, since the GPGPU applications and the GPU system are unaware of the feature of each other. In this dissertation, we claim that using the support from framework between the GPU system and the GPGPU applications without modifying the application can yield better performance. We design and implement the frame- work while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be over-subscripted in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a i multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches can solve the problems related to GPGPU applications than the existing approaches while delivering better performance.์ตœ๊ทผ ๋ฒ”์šฉ GPU (GPGPU) ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์€ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… (HPC) ๋ฐ ๋”ฅ ๋Ÿฌ๋‹ (DL)๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ ๋ถ„์•ผ์—์„œ ํ•ต์‹ฌ์ ์ธ ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ์‘ ์šฉ ๋ถ„์•ผ์˜ ๊ณตํ†ต์ ์ธ ํŠน์„ฑ์€ ๊ฑฐ๋Œ€ํ•œ ๊ณ„์‚ฐ ์„ฑ๋Šฅ์ด ํ•„์š”ํ•œ ๊ฒƒ์ด๋ฉฐ ๊ทธ๋ž˜ํ”ฝ ์ฒ˜๋ฆฌ ์žฅ์น˜ (GPU)์˜ ๋†’์€ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ ํŠน์„ฑ๊ณผ ๋งค์šฐ ์ ํ•ฉํ•˜๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ GPU ์‹œ์Šคํ…œ์€ ํŠน์ • ์œ  ํ˜•์˜ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์— ์ตœ์ €ํ™”ํ•˜๋Š” ๋Œ€์‹  ๋ชจ๋“  ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์— ์‹œ์Šคํ…œ ์ˆ˜์ค€์˜ ๊ณต์ • ์„ฑ์„ ์ œ๊ณตํ•˜๋„๋ก ์„ค๊ณ„๋˜์–ด ์žˆ์œผ๋ฉฐ ๊ฐ GPGPU ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ์ž์› ์‚ฌ์šฉ ํŒจํ„ด์ด ๋‹ค์–‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์ผ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์ด GPU ์‹œ์Šคํ…œ์˜ ๋ฆฌ์†Œ์Šค๋ฅผ ์™„์ „ํžˆ ํ™œ์šฉํ•˜์—ฌ GPU์˜ ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑ ํ•  ์ˆ˜๋Š” ์—†๋‹ค. ๋”ฐ๋ผ์„œ GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น์€ ๋‹ค์–‘ํ•œ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ ํŒจํ„ด์„ ๊ฐ€์ง„ ์—ฌ๋Ÿฌ ์‘์šฉ ํ”„๋กœ๊ทธ ๋žจ์„ ํ•จ๊ป˜ ๋ฐฐ์น˜ํ•˜์—ฌ GPU ๋ฆฌ์†Œ์Šค๋ฅผ ๊ณต์œ ํ•จ์œผ๋กœ์จ GPU ์ž์› ์‚ฌ์šฉ๋ฅ  ์ €ํ•˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ๊ธฐ์ˆ ์€ ์ž์› ์‚ฌ์šฉ๋ฅ  ๊ด€์ ์—์„œ ์‘ ์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ํšจ์œจ์ ์ธ ์‹คํ–‰๋ณด๋‹ค ๊ณต๋™์œผ๋กœ ์‹คํ–‰ํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘”๋‹ค. ๋˜ํ•œ ํ˜„์žฌ GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ๊ธฐ์ˆ ์€ ์˜คํ”ˆ ์†Œ์Šค๊ฐ€ ์•„๋‹ˆ๋ฏ€๋กœ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ๊ณผ GPU ์‹œ์Šคํ…œ์ด ์„œ๋กœ์˜ ๊ธฐ๋Šฅ์„ ์ธ์‹ํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ตœ์ ํ™”ํ•˜๊ธฐ๊ฐ€ ๋” ์–ด๋ ค์šธ ์ˆ˜๋„ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์„ ์ˆ˜์ • ์—†์ด GPU ์‹œ์Šคํ…œ๊ณผ GPGPU ์‘์šฉ ์‚ฌ ์ด์˜ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉํ•˜๋ฉด ๋ณด๋‹ค ๋†’์€ ์‘์šฉ์„ฑ๋Šฅ๊ณผ ์ž์› ์‚ฌ์šฉ์„ ๋ณด์ผ ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…ํ•˜๊ณ ์ž ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด GPU ํƒœ์Šคํฌ ๊ด€๋ฆฌ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ฐœ๋ฐœํ•˜์—ฌ GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์˜€๋‹ค. ์ฒซ์งธ, ๋ฉ€ํ‹ฐ ํƒœ ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ GPU ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ณผ ํ• ๋‹นํ•  ์ˆ˜ ์—†๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ํ˜ธ์ŠคํŠธ ๋ฉ”๋ชจ๋ฆฌ์™€ ๋””๋ฐ”์ด์Šค ๋ฉ”๋ชจ๋ฆฌ์— ์ฒดํฌํฌ์ธํŠธ ๋ฐฉ์‹์„ ๋„์ž…ํ•˜์˜€๋‹ค. ๋‘˜์งธ, ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜ ๊ฒฝ์—์„œ GPU ์ž์› ์‚ฌ์šฉ์œจ ์ €ํ•˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋”์šฑ ์„ธ๋ถ„ํ™” ๋œ GPU ์ปค๋„ ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์„ ์ œ์‹œํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•๋“ค์˜ ํšจ๊ณผ๋ฅผ ์ฆ๋ช…ํ•˜๊ธฐ ์œ„ํ•ด ์‹ค์ œ GPU ์‹œ์Šคํ…œ์— 92 ๊ตฌํ˜„ํ•˜๊ณ  ๊ทธ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ์ œ์•ˆํ•œ ์ ‘๊ทผ๋ฐฉ์‹์ด ๊ธฐ์กด ์ ‘๊ทผ ๋ฐฉ์‹๋ณด๋‹ค GPGPU ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ๊ณผ ๊ด€๋ จ๋œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.Chapter 1 Introduction 1 1.1 Motivation 2 1.2 Contribution . 7 1.3 Outline 8 Chapter 2 Background 10 2.1 GraphicsProcessingUnit(GPU) and CUDA 10 2.2 CheckpointandRestart . 11 2.3 ResourceSharingModel. 11 2.4 CUDAContext 12 2.5 GPUThreadBlockScheduling . 13 2.6 Multi-ProcessServicewithHyper-Q 13 Chapter 3 Checkpoint based solution for GPU memory over- subscription problem 16 3.1 Motivation 16 3.2 RelatedWork. 18 3.3 DesignandImplementation . 20 3.3.1 System Design 21 3.3.2 CUDAAPIwrappingmodule 22 3.3.3 Scheduler . 28 3.4 Evaluation. 31 3.4.1 Evaluationsetup . 31 3.4.2 OverheadofFlexGPU 32 3.4.3 Performance with GPU Benchmark Suits 34 3.4.4 Performance with Real-world Workloads 36 3.4.5 Performance of workloads composed of multiple applications 39 3.5 Summary 42 Chapter 4 A Workload-aware Fine-grained Resource Manage- ment Framework for GPGPUs 43 4.1 Motivation 43 4.2 RelatedWork. 45 4.2.1 GPUresourcesharing 45 4.2.2 GPUscheduling . 46 4.3 DesignandImplementation . 47 4.3.1 SystemArchitecture . 47 4.3.2 CUDAAPIWrappingModule . 49 4.3.3 smCompactorRuntime . 50 4.3.4 ImplementationDetails . 57 4.4 Analysis on the relation between performance and workload usage pattern 60 4.4.1 WorkloadDefinition . 60 4.4.2 Analysisonperformancesaturation 60 4.4.3 Predict the necessary SMs and thread blocks for best performance . 64 4.5 Evaluation. 69 4.5.1 EvaluationMethodology. 70 4.5.2 OverheadofsmCompactor . 71 4.5.3 Performance with Different Thread Block Counts on Dif- ferentNumberofSMs 72 4.5.4 Performance with Concurrent Kernel and Resource Sharing 74 4.6 Summary . 79 Chapter 5 Conclusion. 81 ์š”์•ฝ. 92Docto

    Maruchitenantona kuraudo kankyลka deno GPU no daiikkyลซ keisan shigen toshite no chลซshลka

    Get PDF

    Cooperative kernels: GPU multitasking for blocking algorithms

    Get PDF
    There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking , so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploit- ing scheduling quirks of todayโ€™s GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative kernels , an extension to the traditional GPU programming model geared towards writing blocking algorithms. Workgroups of a cooperative kernel are fairly scheduled, and multitasking is supported via a small set of language extensions through which the kernel and scheduler cooperate. We describe a prototype implementation of a cooperative kernel frame- work implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking

    Inter-workgroup barrier synchronisation on graphics processing units

    Get PDF
    GPUs are parallel devices that are able to run thousands of independent threads concurrently. Traditional GPU programs are data-parallel, requiring little to no communication, i.e. synchronisation, between threads. However, classical concurrency in the context of CPUs often exploits synchronisation idioms that are not supported on GPUs. By studying such idioms on GPUs, with an aim to facilitate them in a portable way, a wider and more generic space of GPU applications can be made possible. While the breadth of this thesis extends to many aspects of GPU systems, the common thread throughout is the global barrier: an execution barrier that synchronises all threads executing a GPU application. The idea of such a barrier might seem straightforward, however this investigation reveals many challenges and insights. In particular, this thesis includes the following studies: Execution models: while a general global barrier can deadlock due to starvation on GPUs, it is shown that the scheduling guarantees of current GPUs can be used to dynamically create an execution environment that allows for a safe and portable global barrier across a subset of the GPU threads. Application optimisations: a set GPU optimisations are examined that are tailored for graph applications, including one optimisation enabled by the global barrier. It is shown that these optimisations can provided substantial performance improvements, e.g. the barrier optimisation achieves over a 10X speedup on AMD and Intel GPUs. The performance portability of these optimisations is investigated, as their utility varies across input, application, and architecture. Multitasking: because many GPUs do not support preemption, long-running GPU compute tasks (e.g. applications that use the global barrier) may block other GPU functions, including graphics. A simple cooperative multitasking scheme is proposed that allows graphics tasks to meet their deadlines with reasonable overheads.Open Acces
    • โ€ฆ
    corecore