4 research outputs found

    Acceleration-as-a-Service: Exploiting Virtualised GPUs for a Financial Application

    'How can GPU acceleration be obtained as a service in a cluster?' This question has become increasingly significant because installing GPUs on every node of a cluster is inefficient. The research reported in this paper addresses it by employing rCUDA (remote CUDA), a framework that provides Acceleration-as-a-Service (AaaS) so that the nodes of a cluster can request acceleration from a set of remote GPUs on demand. The rCUDA framework exploits virtualisation and allows multiple nodes to share the same GPU. In this paper we test the feasibility of the rCUDA framework on a real-world application from the financial risk industry that can benefit from AaaS in a production setting. The results confirm the feasibility of rCUDA and highlight that it achieves performance similar to CUDA, provides consistent results, and, more importantly, allows a single application to use all the GPUs available in the cluster without losing efficiency. Comment: 11th IEEE International Conference on eScience (IEEE eScience) - Munich, Germany, 201
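    rCUDA is designed to be transparent to applications: an unmodified CUDA program keeps issuing ordinary CUDA runtime calls, and the rCUDA client library forwards them to a GPU server elsewhere in the cluster. The minimal CUDA sketch below illustrates this from the application's point of view; nothing rCUDA-specific appears in the source, which is the point. Remote GPUs are selected through environment variables at run time (the variable names mentioned in the comment are assumptions taken from the rCUDA documentation, not verified here).

        #include <cstdio>
        #include <cuda_runtime.h>

        // Toy kernel: scale a vector in place.
        __global__ void scale(float *v, float factor, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) v[i] *= factor;
        }

        int main() {
            // Under rCUDA the CUDA runtime library is replaced by a wrapper that
            // forwards these same calls to a remote GPU server; the source code is
            // unchanged. Remote GPUs are typically advertised through environment
            // variables (e.g. RCUDA_DEVICE_COUNT, RCUDA_DEVICE_0=server:0 -- assumed
            // names from the rCUDA user guide).
            int devices = 0;
            cudaGetDeviceCount(&devices);        // with rCUDA: number of remote GPUs offered
            printf("visible GPUs: %d\n", devices);

            const int n = 1 << 20;
            float *d = nullptr;
            cudaMalloc(&d, n * sizeof(float));   // allocation lands on the (possibly remote) device
            cudaMemset(d, 0, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
            cudaDeviceSynchronize();
            cudaFree(d);
            return 0;
        }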

    Diplomat: Mapping of multi-kernel applications using a static dataflow abstraction

    In this paper we propose a novel approach to heterogeneous embedded systems programmability using a taskgraph-based framework called Diplomat. Diplomat exploits the potential of static dataflow modeling and analysis to deliver performance estimation and CPU/GPU mapping. An application has to be specified only once, and the framework can then automatically propose good mappings. We evaluate Diplomat with a computer vision application on two embedded platforms. Using the Diplomat-generated mappings we observed a 16% performance improvement on average, and up to a 30% improvement, over the best existing hand-coded implementation.
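    The abstract does not expose Diplomat's programming interface, so the following C++ sketch is purely illustrative of the idea it describes: the application is specified once as a static dataflow graph whose actors carry performance estimates, and a mapper then decides automatically which actors run on the CPU and which on the GPU. The Actor type, its cost fields, and the greedy map_actors() policy are hypothetical stand-ins, not Diplomat's actual API or mapping algorithm.

        #include <cstdio>
        #include <string>
        #include <vector>

        // Hypothetical taskgraph-based CPU/GPU mapping: each actor in a static
        // dataflow graph carries estimated execution times on CPU and GPU, and the
        // mapper assigns it to the device with the lower estimated cost.
        struct Actor {
            std::string name;
            double cpu_ms;   // estimated CPU time per firing
            double gpu_ms;   // estimated GPU time per firing
        };

        enum class Device { CPU, GPU };

        static std::vector<Device> map_actors(const std::vector<Actor>& graph) {
            std::vector<Device> mapping;
            for (const Actor& a : graph)
                mapping.push_back(a.gpu_ms < a.cpu_ms ? Device::GPU : Device::CPU);
            return mapping;
        }

        int main() {
            // A three-stage vision-style pipeline, specified once as a static graph.
            std::vector<Actor> pipeline = {
                {"decode",  4.0,  6.0},   // cheap on the CPU, not worth the transfer
                {"filter", 30.0,  5.0},   // data-parallel, much faster on the GPU
                {"track",   8.0,  9.0},
            };
            std::vector<Device> mapping = map_actors(pipeline);
            for (size_t i = 0; i < pipeline.size(); ++i)
                printf("%s -> %s\n", pipeline[i].name.c_str(),
                       mapping[i] == Device::GPU ? "GPU" : "CPU");
            return 0;
        }

    A real mapper would also need to account for data-transfer costs between devices and for pipeline-level throughput rather than per-actor latency; the sketch only illustrates the per-actor decision.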

    GPU System Optimizations for Efficient System Resource Utilization of General-Purpose Computing Applications Using GPUs in a Multitasking Environment

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: ์—ผํ—Œ์˜.
    Recently, general-purpose GPU (GPGPU) applications have come to play key roles in many research fields, such as high-performance computing (HPC) and deep learning (DL). What these applications have in common is that they all require massive computational power, which matches the high parallelism of the graphics processing unit (GPU). However, because the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU system's resources and reach the GPU's peak performance: the GPU system is designed to provide system-level fairness to all applications rather than to optimize for a specific type. GPU multitasking can address this issue by co-locating multiple kernels with diverse resource usage patterns so that they share the GPU resources in parallel. However, current GPU multitasking schemes focus on co-launching kernels rather than making them execute more efficiently. In addition, the current GPU multitasking scheme is not open source, which makes it harder to optimize, since the GPGPU applications and the GPU system remain unaware of each other's characteristics. In this dissertation, we claim that support from a framework placed between the GPU system and the GPGPU applications, without modifying the applications, can yield better performance. We design and implement such a framework while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between host memory and device memory to address the problem that GPU memory cannot be over-subscribed in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that the proposed approaches solve problems of GPGPU applications that existing approaches do not, while delivering better performance.
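    The first contribution is a checkpointing scheme that moves GPU buffers to host memory so that the aggregate working set of co-located applications can temporarily exceed device memory. The abstract does not detail the mechanism (the table of contents below suggests it lives in a CUDA API wrapping module of the system evaluated as FlexGPU), so the CUDA sketch that follows only illustrates the basic evict/restore idea with standard runtime calls; the Checkpointed type and the evict()/restore() helpers are illustrative names, not the dissertation's code.

        #include <cuda_runtime.h>

        // Illustrative evict/restore of one device buffer to a pinned host
        // checkpoint -- the basic operation behind checkpoint-style handling of GPU
        // memory over-subscription. A real implementation would trigger these steps
        // transparently from intercepted CUDA API calls.
        struct Checkpointed {
            void*  dev   = nullptr;  // device copy (freed while evicted)
            void*  host  = nullptr;  // pinned host backing store
            size_t bytes = 0;
        };

        static void evict(Checkpointed& b) {
            // Save device data to the host checkpoint, then release device memory
            // so another application's kernels can use it.
            cudaMemcpy(b.host, b.dev, b.bytes, cudaMemcpyDeviceToHost);
            cudaFree(b.dev);
            b.dev = nullptr;
        }

        static void restore(Checkpointed& b) {
            // Re-allocate on the device and copy the checkpoint back before the
            // owning application's next kernel launch.
            cudaMalloc(&b.dev, b.bytes);
            cudaMemcpy(b.dev, b.host, b.bytes, cudaMemcpyHostToDevice);
        }

        int main() {
            Checkpointed buf;
            buf.bytes = 64u << 20;                 // 64 MiB working set
            cudaMallocHost(&buf.host, buf.bytes);  // pinned host backing store
            cudaMalloc(&buf.dev, buf.bytes);
            cudaMemset(buf.dev, 0, buf.bytes);

            evict(buf);    // device memory handed back to the system
            restore(buf);  // brought back on demand

            cudaFree(buf.dev);
            cudaFreeHost(buf.host);
            return 0;
        }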
๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์„ ์ˆ˜์ • ์—†์ด GPU ์‹œ์Šคํ…œ๊ณผ GPGPU ์‘์šฉ ์‚ฌ ์ด์˜ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉํ•˜๋ฉด ๋ณด๋‹ค ๋†’์€ ์‘์šฉ์„ฑ๋Šฅ๊ณผ ์ž์› ์‚ฌ์šฉ์„ ๋ณด์ผ ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…ํ•˜๊ณ ์ž ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด GPU ํƒœ์Šคํฌ ๊ด€๋ฆฌ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ฐœ๋ฐœํ•˜์—ฌ GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์˜€๋‹ค. ์ฒซ์งธ, ๋ฉ€ํ‹ฐ ํƒœ ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ GPU ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ณผ ํ• ๋‹นํ•  ์ˆ˜ ์—†๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ํ˜ธ์ŠคํŠธ ๋ฉ”๋ชจ๋ฆฌ์™€ ๋””๋ฐ”์ด์Šค ๋ฉ”๋ชจ๋ฆฌ์— ์ฒดํฌํฌ์ธํŠธ ๋ฐฉ์‹์„ ๋„์ž…ํ•˜์˜€๋‹ค. ๋‘˜์งธ, ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜ ๊ฒฝ์—์„œ GPU ์ž์› ์‚ฌ์šฉ์œจ ์ €ํ•˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋”์šฑ ์„ธ๋ถ„ํ™” ๋œ GPU ์ปค๋„ ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์„ ์ œ์‹œํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•๋“ค์˜ ํšจ๊ณผ๋ฅผ ์ฆ๋ช…ํ•˜๊ธฐ ์œ„ํ•ด ์‹ค์ œ GPU ์‹œ์Šคํ…œ์— 92 ๊ตฌํ˜„ํ•˜๊ณ  ๊ทธ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ์ œ์•ˆํ•œ ์ ‘๊ทผ๋ฐฉ์‹์ด ๊ธฐ์กด ์ ‘๊ทผ ๋ฐฉ์‹๋ณด๋‹ค GPGPU ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ๊ณผ ๊ด€๋ จ๋œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.Chapter 1 Introduction 1 1.1 Motivation 2 1.2 Contribution . 7 1.3 Outline 8 Chapter 2 Background 10 2.1 GraphicsProcessingUnit(GPU) and CUDA 10 2.2 CheckpointandRestart . 11 2.3 ResourceSharingModel. 11 2.4 CUDAContext 12 2.5 GPUThreadBlockScheduling . 13 2.6 Multi-ProcessServicewithHyper-Q 13 Chapter 3 Checkpoint based solution for GPU memory over- subscription problem 16 3.1 Motivation 16 3.2 RelatedWork. 18 3.3 DesignandImplementation . 20 3.3.1 System Design 21 3.3.2 CUDAAPIwrappingmodule 22 3.3.3 Scheduler . 28 3.4 Evaluation. 31 3.4.1 Evaluationsetup . 31 3.4.2 OverheadofFlexGPU 32 3.4.3 Performance with GPU Benchmark Suits 34 3.4.4 Performance with Real-world Workloads 36 3.4.5 Performance of workloads composed of multiple applications 39 3.5 Summary 42 Chapter 4 A Workload-aware Fine-grained Resource Manage- ment Framework for GPGPUs 43 4.1 Motivation 43 4.2 RelatedWork. 45 4.2.1 GPUresourcesharing 45 4.2.2 GPUscheduling . 46 4.3 DesignandImplementation . 47 4.3.1 SystemArchitecture . 47 4.3.2 CUDAAPIWrappingModule . 49 4.3.3 smCompactorRuntime . 50 4.3.4 ImplementationDetails . 57 4.4 Analysis on the relation between performance and workload usage pattern 60 4.4.1 WorkloadDefinition . 60 4.4.2 Analysisonperformancesaturation 60 4.4.3 Predict the necessary SMs and thread blocks for best performance . 64 4.5 Evaluation. 69 4.5.1 EvaluationMethodology. 70 4.5.2 OverheadofsmCompactor . 71 4.5.3 Performance with Different Thread Block Counts on Dif- ferentNumberofSMs 72 4.5.4 Performance with Concurrent Kernel and Resource Sharing 74 4.6 Summary . 79 Chapter 5 Conclusion. 81 ์š”์•ฝ. 92Docto

    Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

    No full text