Search CORE

446 research outputs found

Enabling preemptive multiprogramming on GPUs

Author: Cabezas Javier
Gelado Fernandez Isaac
Navarro Nacho
Ramírez Bellido Alejandro
Tanasic Ivan
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014
Field of study

GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend the NVIDIA GK110 (Kepler) like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve execution time of high-priority processes by 15.6x, the average application turnaround time between 1.5x to 2x, and system fairness up to 3.4x.We would like to thank the anonymous reviewers, Alexan- der Veidenbaum, Carlos Villavieja, Lluis Vilanova, Lluc Al- varez, and Marc Jorda on their comments and help improving our work and this paper. This work is supported by Euro- pean Commission through TERAFLUX (FP7-249013), Mont- Blanc (FP7-288777), and RoMoL (GA-321253) projects, NVIDIA through the CUDA Center of Excellence program, Spanish Government through Programa Severo Ochoa (SEV-2011-0067) and Spanish Ministry of Science and Technology through TIN2007-60625 and TIN2012-34557 projects.Peer ReviewedPostprint (author’s final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC

HeteroCore GPU to exploit TLP-resource diversity

Author: Eeckhout Lieven
Wang Zhiying
Zhao Xia
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

Ghent University Academic Bibliography

Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels

Author: Govindarajan R.
Pai Sreepathi
Thazhuthaveetil Matthew J.
Publication venue
Publication date: 01/01/2014
Field of study

Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels concurrently. On these GPUs, the thread block scheduler (TBS) uses the FIFO policy to schedule their thread blocks. We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose use of the preemptive Shortest Remaining Time First (SRTF) policy instead. Although SRTF requires an estimate of runtime of GPU kernels, we show that such an estimate of the runtime can be easily obtained using online profiling and exploiting a simple observation on GPU kernels' grid structure. Specifically, we propose a novel Structural Runtime Predictor. Using a simple Staircase model of GPU kernel execution, we show that the runtime of a kernel can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench, and find that it can estimate the actual runtime reasonably well after the execution of only a single thread block. Next, we design a thread block scheduler that is both concurrent kernel-aware and uses this predictor. We implement the SRTF policy and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. When compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive which controls resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves system throughput to within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.Comment: 14 pages, full pre-review version of PACT 2014 poste

arXiv.org e-Print Archive

CiteSeerX

Crossref

멀티 태스킹 환경에서 GPU를 사용한 범용적 계산 응용의 효율적인 시스템 자원 활용을 위한 GPU 시스템 최적화

Author: CHEN QICHEN
Publication venue: 서울대학교 대학원
Publication date: 01/08/2020
Field of study

학위논문 (박사) -- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2020. 8. 염헌영.Recently, General Purpose GPU (GPGPU) applications are playing key roles in many different research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature exists in these applications is that all of them require massive computation power, which follows the high parallelism characteristics of the graphics processing unit (GPU). However, because of the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU systems resources to achieve the best performance of the GPU since the GPU system is designed to provide system-level fairness to all applications instead of optimizing for a specific type. GPU multitasking can address the issue by co-locating multiple kernels with diverse resource usage patterns to share the GPU resource in parallel. However, the current GPU mul- titasking scheme focuses just on co-launching the kernels rather than making them execute more efficiently. Besides, the current GPU multitasking scheme is not open-sourced, which makes it more difficult to be optimized, since the GPGPU applications and the GPU system are unaware of the feature of each other. In this dissertation, we claim that using the support from framework between the GPU system and the GPGPU applications without modifying the application can yield better performance. We design and implement the frame- work while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be over-subscripted in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a i multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches can solve the problems related to GPGPU applications than the existing approaches while delivering better performance.최근 범용 GPU (GPGPU) 응용 프로그램은 고성능 컴퓨팅 (HPC) 및 딥 러닝 (DL)과 같은 다양한 연구 분야에서 핵심적인 역할을 수행하고 있다. 이러한 응 용 분야의 공통적인 특성은 거대한 계산 성능이 필요한 것이며 그래픽 처리 장치 (GPU)의 높은 병렬 처리 특성과 매우 적합하다. 그러나 GPU 시스템은 특정 유 형의 응용 프로그램에 최저화하는 대신 모든 응용 프로그램에 시스템 수준의 공정 성을 제공하도록 설계되어 있으며 각 GPGPU 응용 프로그램의 자원 사용 패턴이 다양하기 때문에 단일 응용 프로그램이 GPU 시스템의 리소스를 완전히 활용하여 GPU의 최고 성능을 달성 할 수는 없다. 따라서 GPU 멀티 태스킹은 다양한 리소스 사용 패턴을 가진 여러 응용 프로그 램을 함께 배치하여 GPU 리소스를 공유함으로써 GPU 자원 사용률 저하 문제를 해결할 수 있다. 그러나 기존 GPU 멀티 태스킹 기술은 자원 사용률 관점에서 응 용 프로그램의 효율적인 실행보다 공동으로 실행하는 데 중점을 둔다. 또한 현재 GPU 멀티 태스킹 기술은 오픈 소스가 아니므로 응용 프로그램과 GPU 시스템이 서로의 기능을 인식하지 못하기 때문에 최적화하기가 더 어려울 수도 있다. 본 논문에서는 응용 프로그램을 수정 없이 GPU 시스템과 GPGPU 응용 사 이의 프레임워크를 통해 사용하면 보다 높은 응용성능과 자원 사용을 보일 수 있음을 증명하고자 한다. 그러기 위해 GPU 태스크 관리 프레임워크를 개발하여 GPU 멀티 태스킹 환경에서 발생하는 두 가지 문제를 해결하였다. 첫째, 멀티 태 스킹 환경에서 GPU 메모리 초과 할당할 수 없는 문제를 해결하기 위해 호스트 메모리와 디바이스 메모리에 체크포인트 방식을 도입하였다. 둘째, 멀티 태스킹 환 경에서 GPU 자원 사용율 저하 문제를 해결하기 위해 더욱 세분화 된 GPU 커널 관리 시스템을 제시하였다. 본 논문에서는 제안한 방법들의 효과를 증명하기 위해 실제 GPU 시스템에 92 구현하고 그 성능을 평가하였다. 제안한 접근방식이 기존 접근 방식보다 GPGPU 응용 프로그램과 관련된 문제를 해결할 수 있으며 더 높은 성능을 제공할 수 있음을 확인할 수 있었다.Chapter 1 Introduction 1 1.1 Motivation 2 1.2 Contribution . 7 1.3 Outline 8 Chapter 2 Background 10 2.1 GraphicsProcessingUnit(GPU) and CUDA 10 2.2 CheckpointandRestart . 11 2.3 ResourceSharingModel. 11 2.4 CUDAContext 12 2.5 GPUThreadBlockScheduling . 13 2.6 Multi-ProcessServicewithHyper-Q 13 Chapter 3 Checkpoint based solution for GPU memory over- subscription problem 16 3.1 Motivation 16 3.2 RelatedWork. 18 3.3 DesignandImplementation . 20 3.3.1 System Design 21 3.3.2 CUDAAPIwrappingmodule 22 3.3.3 Scheduler . 28 3.4 Evaluation. 31 3.4.1 Evaluationsetup . 31 3.4.2 OverheadofFlexGPU 32 3.4.3 Performance with GPU Benchmark Suits 34 3.4.4 Performance with Real-world Workloads 36 3.4.5 Performance of workloads composed of multiple applications 39 3.5 Summary 42 Chapter 4 A Workload-aware Fine-grained Resource Manage- ment Framework for GPGPUs 43 4.1 Motivation 43 4.2 RelatedWork. 45 4.2.1 GPUresourcesharing 45 4.2.2 GPUscheduling . 46 4.3 DesignandImplementation . 47 4.3.1 SystemArchitecture . 47 4.3.2 CUDAAPIWrappingModule . 49 4.3.3 smCompactorRuntime . 50 4.3.4 ImplementationDetails . 57 4.4 Analysis on the relation between performance and workload usage pattern 60 4.4.1 WorkloadDefinition . 60 4.4.2 Analysisonperformancesaturation 60 4.4.3 Predict the necessary SMs and thread blocks for best performance . 64 4.5 Evaluation. 69 4.5.1 EvaluationMethodology. 70 4.5.2 OverheadofsmCompactor . 71 4.5.3 Performance with Different Thread Block Counts on Dif- ferentNumberofSMs 72 4.5.4 Performance with Concurrent Kernel and Resource Sharing 74 4.6 Summary . 79 Chapter 5 Conclusion. 81 요약. 92Docto

SNU Open Repository and Archive

Supporting Preemptive Task Executions and Memory Copies in GPGPUs

Author: Basaran Can
Kang Kyoung-Don
Publication venue: The Open Repository @ Binghamton (The ORB)
Publication date: 01/07/2012
Field of study

GPGPUs (General Purpose Graphic Processing Units) provide massive computational power. However, applying GPGPU technology to real-time computing is challenging due to the non-preemptive nature of GPGPUs. Especially, a job running in a GPGPU or a data copy between a GPGPU and CPU is non-preemptive. As a result, a high priority job arriving in the middle of a low priority job execution or memory copy suffers from priority inversion. To address the problem, we present a new lightweight approach to supporting preemptive memory copies and job executions in GPGPUs. Moreover, in our approach, a GPGPU job and memory copy between a GPGPU and the hosting CPU are run concurrently to enhance the responsiveness. To show the feasibility of our approach, we have implemented a prototype system for preemptive job executions and data copies in a GPGPU. The experimental results show that our approach can bound the response times in a reliable manner. In addition, the response time of our approach is significantly shorter than those of the unmodified GPGPU runtime system that supports no preemption and an advanced GPGPU model designed to support prioritization and performance isolation via preemptive data copies

The Open Repository @Binghamton (The ORB)

CUsched: multiprogrammed workload scheduling on GPU architectures

Author: Cabezas Javier
Gelado Fernandez Isaac
Navarro Nacho
Ramírez Bellido Alejandro
Tanasic Ivan
Valero Cortés Mateo
Publication venue
Publication date: 01/01/2013
Field of study

Graphic Processing Units (GPUs) are currently widely used in High Performance Computing (HPC) applications to speed-up the execution of massively-parallel codes. GPUs are well-suited for such HPC environments because applications share a common characteristic with the gaming codes GPUs were designed for: only one application is using the GPU at the same time. Although, minimal support for multi-programmed systems exist, modern GPUs do not allow resource sharing among different processes. This lack of support restricts the usage of GPUs in desktop and mobile environment to a small amount of applications (e.g., games and multimedia players). In this paper we study the multi-programming support available in current GPUs, and show how such support is not sufficient. We propose a set of hardware extensions to the current GPU architectures to efficiently support multi-programmed GPU workloads, allowing concurrent execution of codes from different user processes. We implement several hardware schedulers on top of these extensions and analyze the behaviour of different work scheduling algorithms using system wide and per process metrics.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC