351 research outputs found
Cooperative kernels: GPU multitasking for blocking algorithms
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking , so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploit- ing scheduling quirks of todayโs GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative kernels , an extension to the traditional GPU programming model geared towards writing blocking algorithms. Workgroups of a cooperative kernel are fairly scheduled, and multitasking is supported via a small set of language extensions through which the kernel and scheduler cooperate. We describe a prototype implementation of a cooperative kernel frame- work implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking
Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking
With the help of the parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to explore the potential of FPGA resources fully. Intermediate frameworks above hardware circuits are proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era.
In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform, on which programs written in an OpenCL-like programming model can launch. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment including intermediate software framework, and automatic system builders. Designers\u27 programs can be either synthesized as hardware processing elements (PEs) or compiled to executable files running on software PEs. Benefiting from nontrivial features of re-loadable PEs, and independent group-level schedulers, the multitasking is enabled for both software and hardware PEs to improve the efficiency of utilizing hardware resources.
The PolyPC framework is evaluated regarding performance, area efficiency, and multitasking. The results show a maximum 66 times speedup over a dual-core ARM processor and 1043 times speedup over a high-performance MicroBlaze with 125 times of area efficiency. It delivers a significant improvement in response time to high-priority tasks with the priority-aware scheduling. Overheads of multitasking are evaluated to analyze trade-offs. With the help of the design flow, the OpenCL application programs are converted into executables through the front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users\u27 specifications
A scheduling theory framework for GPU tasks e๏ฌcient execution
Concurrent execution of tasks in GPUs can reduce the computation time of a workload by
overlapping data transfer and execution commands.
However it is di๏ฌcult to implement an e๏ฌcient run-
time scheduler that minimizes the workload makespan
as many execution orderings should be evaluated. In
this paper, we employ scheduling theory to build a
model that takes into account the device capabili-
ties, workload characteristics, constraints and objec-
tive functions. In our model, GPU tasks schedul-
ing is reformulated as a ๏ฌow shop scheduling prob-
lem, which allow us to apply and compare well known
methods already developed in the operations research
๏ฌeld. In addition we develop a new heuristic, specif-
ically focused on executing GPU commands, that
achieves better scheduling results than previous tech-
niques. Finally, a comprehensive evaluation, showing
the suitability and robustness of this new approach,
is conducted in three di๏ฌerent NVIDIA architectures
(Kepler, Maxwell and Pascal).Proyecto TIN2016- 0920R, Universidad de Mรกlaga (Campus de Excelencia Internacional Andalucรญa Tech) y programa de donaciรณn de NVIDIA Corporation
๋ฉํฐ ํ์คํน ํ๊ฒฝ์์ GPU๋ฅผ ์ฌ์ฉํ ๋ฒ์ฉ์ ๊ณ์ฐ ์์ฉ์ ํจ์จ์ ์ธ ์์คํ ์์ ํ์ฉ์ ์ํ GPU ์์คํ ์ต์ ํ
ํ์๋
ผ๋ฌธ (๋ฐ์ฌ) -- ์์ธ๋ํ๊ต ๋ํ์ : ๊ณต๊ณผ๋ํ ์ ๊ธฐยท์ปดํจํฐ๊ณตํ๋ถ, 2020. 8. ์ผํ์.Recently, General Purpose GPU (GPGPU) applications are playing key roles in many different research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature exists in these applications is that all of them require massive computation power, which follows the high parallelism characteristics of the graphics processing unit (GPU). However, because of the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU systems resources to achieve the best performance of the GPU since the GPU system is designed to provide system-level fairness to all applications instead of optimizing for a specific type. GPU multitasking can address the issue by co-locating multiple kernels with diverse resource usage patterns to share the GPU resource in parallel. However, the current GPU mul- titasking scheme focuses just on co-launching the kernels rather than making them execute more efficiently. Besides, the current GPU multitasking scheme is not open-sourced, which makes it more difficult to be optimized, since the GPGPU applications and the GPU system are unaware of the feature of each other. In this dissertation, we claim that using the support from framework between the GPU system and the GPGPU applications without modifying the application can yield better performance. We design and implement the frame- work while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be over-subscripted in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a
i
multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches can solve the problems related to GPGPU applications than the existing approaches while delivering better performance.์ต๊ทผ ๋ฒ์ฉ GPU (GPGPU) ์์ฉ ํ๋ก๊ทธ๋จ์ ๊ณ ์ฑ๋ฅ ์ปดํจํ
(HPC) ๋ฐ ๋ฅ ๋ฌ๋ (DL)๊ณผ ๊ฐ์ ๋ค์ํ ์ฐ๊ตฌ ๋ถ์ผ์์ ํต์ฌ์ ์ธ ์ญํ ์ ์ํํ๊ณ ์๋ค. ์ด๋ฌํ ์ ์ฉ ๋ถ์ผ์ ๊ณตํต์ ์ธ ํน์ฑ์ ๊ฑฐ๋ํ ๊ณ์ฐ ์ฑ๋ฅ์ด ํ์ํ ๊ฒ์ด๋ฉฐ ๊ทธ๋ํฝ ์ฒ๋ฆฌ ์ฅ์น (GPU)์ ๋์ ๋ณ๋ ฌ ์ฒ๋ฆฌ ํน์ฑ๊ณผ ๋งค์ฐ ์ ํฉํ๋ค. ๊ทธ๋ฌ๋ GPU ์์คํ
์ ํน์ ์ ํ์ ์์ฉ ํ๋ก๊ทธ๋จ์ ์ต์ ํํ๋ ๋์ ๋ชจ๋ ์์ฉ ํ๋ก๊ทธ๋จ์ ์์คํ
์์ค์ ๊ณต์ ์ฑ์ ์ ๊ณตํ๋๋ก ์ค๊ณ๋์ด ์์ผ๋ฉฐ ๊ฐ GPGPU ์์ฉ ํ๋ก๊ทธ๋จ์ ์์ ์ฌ์ฉ ํจํด์ด ๋ค์ํ๊ธฐ ๋๋ฌธ์ ๋จ์ผ ์์ฉ ํ๋ก๊ทธ๋จ์ด GPU ์์คํ
์ ๋ฆฌ์์ค๋ฅผ ์์ ํ ํ์ฉํ์ฌ GPU์ ์ต๊ณ ์ฑ๋ฅ์ ๋ฌ์ฑ ํ ์๋ ์๋ค.
๋ฐ๋ผ์ GPU ๋ฉํฐ ํ์คํน์ ๋ค์ํ ๋ฆฌ์์ค ์ฌ์ฉ ํจํด์ ๊ฐ์ง ์ฌ๋ฌ ์์ฉ ํ๋ก๊ทธ ๋จ์ ํจ๊ป ๋ฐฐ์นํ์ฌ GPU ๋ฆฌ์์ค๋ฅผ ๊ณต์ ํจ์ผ๋ก์จ GPU ์์ ์ฌ์ฉ๋ฅ ์ ํ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ ์ ์๋ค. ๊ทธ๋ฌ๋ ๊ธฐ์กด GPU ๋ฉํฐ ํ์คํน ๊ธฐ์ ์ ์์ ์ฌ์ฉ๋ฅ ๊ด์ ์์ ์ ์ฉ ํ๋ก๊ทธ๋จ์ ํจ์จ์ ์ธ ์คํ๋ณด๋ค ๊ณต๋์ผ๋ก ์คํํ๋ ๋ฐ ์ค์ ์ ๋๋ค. ๋ํ ํ์ฌ GPU ๋ฉํฐ ํ์คํน ๊ธฐ์ ์ ์คํ ์์ค๊ฐ ์๋๋ฏ๋ก ์์ฉ ํ๋ก๊ทธ๋จ๊ณผ GPU ์์คํ
์ด ์๋ก์ ๊ธฐ๋ฅ์ ์ธ์ํ์ง ๋ชปํ๊ธฐ ๋๋ฌธ์ ์ต์ ํํ๊ธฐ๊ฐ ๋ ์ด๋ ค์ธ ์๋ ์๋ค.
๋ณธ ๋
ผ๋ฌธ์์๋ ์์ฉ ํ๋ก๊ทธ๋จ์ ์์ ์์ด GPU ์์คํ
๊ณผ GPGPU ์์ฉ ์ฌ ์ด์ ํ๋ ์์ํฌ๋ฅผ ํตํด ์ฌ์ฉํ๋ฉด ๋ณด๋ค ๋์ ์์ฉ์ฑ๋ฅ๊ณผ ์์ ์ฌ์ฉ์ ๋ณด์ผ ์ ์์์ ์ฆ๋ช
ํ๊ณ ์ ํ๋ค. ๊ทธ๋ฌ๊ธฐ ์ํด GPU ํ์คํฌ ๊ด๋ฆฌ ํ๋ ์์ํฌ๋ฅผ ๊ฐ๋ฐํ์ฌ GPU ๋ฉํฐ ํ์คํน ํ๊ฒฝ์์ ๋ฐ์ํ๋ ๋ ๊ฐ์ง ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ์๋ค. ์ฒซ์งธ, ๋ฉํฐ ํ ์คํน ํ๊ฒฝ์์ GPU ๋ฉ๋ชจ๋ฆฌ ์ด๊ณผ ํ ๋นํ ์ ์๋ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ๊ธฐ ์ํด ํธ์คํธ ๋ฉ๋ชจ๋ฆฌ์ ๋๋ฐ์ด์ค ๋ฉ๋ชจ๋ฆฌ์ ์ฒดํฌํฌ์ธํธ ๋ฐฉ์์ ๋์
ํ์๋ค. ๋์งธ, ๋ฉํฐ ํ์คํน ํ ๊ฒฝ์์ GPU ์์ ์ฌ์ฉ์จ ์ ํ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ๊ธฐ ์ํด ๋์ฑ ์ธ๋ถํ ๋ GPU ์ปค๋ ๊ด๋ฆฌ ์์คํ
์ ์ ์ํ์๋ค.
๋ณธ ๋
ผ๋ฌธ์์๋ ์ ์ํ ๋ฐฉ๋ฒ๋ค์ ํจ๊ณผ๋ฅผ ์ฆ๋ช
ํ๊ธฐ ์ํด ์ค์ GPU ์์คํ
์
92
๊ตฌํํ๊ณ ๊ทธ ์ฑ๋ฅ์ ํ๊ฐํ์๋ค. ์ ์ํ ์ ๊ทผ๋ฐฉ์์ด ๊ธฐ์กด ์ ๊ทผ ๋ฐฉ์๋ณด๋ค GPGPU ์์ฉ ํ๋ก๊ทธ๋จ๊ณผ ๊ด๋ จ๋ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ ์ ์์ผ๋ฉฐ ๋ ๋์ ์ฑ๋ฅ์ ์ ๊ณตํ ์ ์์์ ํ์ธํ ์ ์์๋ค.Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Contribution . 7
1.3 Outline 8
Chapter 2 Background 10
2.1 GraphicsProcessingUnit(GPU) and CUDA 10
2.2 CheckpointandRestart . 11
2.3 ResourceSharingModel. 11
2.4 CUDAContext 12
2.5 GPUThreadBlockScheduling . 13
2.6 Multi-ProcessServicewithHyper-Q 13
Chapter 3 Checkpoint based solution for GPU memory over- subscription problem 16
3.1 Motivation 16
3.2 RelatedWork. 18
3.3 DesignandImplementation . 20
3.3.1 System Design 21
3.3.2 CUDAAPIwrappingmodule 22
3.3.3 Scheduler . 28
3.4 Evaluation. 31
3.4.1 Evaluationsetup . 31
3.4.2 OverheadofFlexGPU 32
3.4.3 Performance with GPU Benchmark Suits 34
3.4.4 Performance with Real-world Workloads 36
3.4.5 Performance of workloads composed of multiple applications 39
3.5 Summary 42
Chapter 4 A Workload-aware Fine-grained Resource Manage- ment Framework for GPGPUs 43
4.1 Motivation 43
4.2 RelatedWork. 45
4.2.1 GPUresourcesharing 45
4.2.2 GPUscheduling . 46
4.3 DesignandImplementation . 47
4.3.1 SystemArchitecture . 47
4.3.2 CUDAAPIWrappingModule . 49
4.3.3 smCompactorRuntime . 50
4.3.4 ImplementationDetails . 57
4.4 Analysis on the relation between performance and workload usage pattern 60
4.4.1 WorkloadDefinition . 60
4.4.2 Analysisonperformancesaturation 60
4.4.3 Predict the necessary SMs and thread blocks for best performance . 64
4.5 Evaluation. 69
4.5.1 EvaluationMethodology. 70
4.5.2 OverheadofsmCompactor . 71
4.5.3 Performance with Different Thread Block Counts on Dif- ferentNumberofSMs 72
4.5.4 Performance with Concurrent Kernel and Resource Sharing 74
4.6 Summary . 79
Chapter 5 Conclusion. 81
์์ฝ. 92Docto
Inter-workgroup barrier synchronisation on graphics processing units
GPUs are parallel devices that are able to run thousands of
independent threads concurrently. Traditional GPU programs are
data-parallel, requiring little to no communication,
i.e. synchronisation, between threads. However, classical concurrency
in the context of CPUs often exploits synchronisation idioms that are
not supported on GPUs. By studying such idioms on GPUs, with an aim to
facilitate them in a portable way, a wider and more generic space of
GPU applications can be made possible.
While the breadth of this thesis extends to many aspects of GPU
systems, the common thread throughout is the global barrier: an
execution barrier that synchronises all threads executing a GPU
application. The idea of such a barrier might seem straightforward,
however this investigation reveals many challenges and insights. In
particular, this thesis includes the following studies:
Execution models: while a general global barrier can deadlock due to
starvation on GPUs, it is shown that the scheduling guarantees of
current GPUs can be used to dynamically create an execution
environment that allows for a safe and portable global barrier
across a subset of the GPU threads.
Application optimisations: a set GPU optimisations are examined that
are tailored for graph applications, including one optimisation
enabled by the global barrier. It is shown that these optimisations
can provided substantial performance improvements, e.g. the barrier
optimisation achieves over a 10X speedup on AMD and Intel GPUs. The
performance portability of these optimisations is investigated, as
their utility varies across input, application, and architecture.
Multitasking: because many GPUs do not support preemption,
long-running GPU compute tasks (e.g. applications that use the
global barrier) may block other GPU functions, including graphics. A
simple cooperative multitasking scheme is proposed that allows
graphics tasks to meet their deadlines with reasonable overheads.Open Acces
MURAC: A unified machine model for heterogeneous computers
Includes bibliographical referencesHeterogeneous computing enables the performance and energy advantages of multiple distinct processing architectures to be efficiently exploited within a single machine. These systems are capable of delivering large performance increases by matching the applications to architectures that are most suited to them. The Multiple Runtime-reconfigurable Architecture Computer (MURAC) model has been proposed to tackle the problems commonly found in the design and usage of these machines. This model presents a system-level approach that creates a clear separation of concerns between the system implementer and the application developer. The three key concepts that make up the MURAC model are a unified machine model, a unified instruction stream and a unified memory space. A simple programming model built upon these abstractions provides a consistent interface for interacting with the underlying machine to the user application. This programming model simplifies application partitioning between hardware and software and allows the easy integration of different execution models within the single control ow of a mixed-architecture application. The theoretical and practical trade-offs of the proposed model have been explored through the design of several systems. An instruction-accurate system simulator has been developed that supports the simulated execution of mixed-architecture applications. An embedded System-on-Chip implementation has been used to measure the overhead in hardware resources required to support the model, which was found to be minimal. An implementation of the model within an operating system on a tightly-coupled reconfigurable processor platform has been created. This implementation is used to extend the software scheduler to allow for the full support of mixed-architecture applications in a multitasking environment. Different scheduling strategies have been tested using this scheduler for mixed-architecture applications. The design and implementation of these systems has shown that a unified abstraction model for heterogeneous computers provides important usability benefits to system and application designers. These benefits are achieved through a consistent view of the multiple different architectures to the operating system and user applications. This allows them to focus on achieving their performance and efficiency goals by gaining the benefits of different execution models during runtime without the complex implementation details of the system-level synchronisation and coordination
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing
Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
Many applications such as autonomous driving and augmented reality, require
the concurrent running of multiple deep neural networks (DNN) that poses
different levels of real-time performance requirements. However, coordinating
multiple DNN tasks with varying levels of criticality on edge GPUs remains an
area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited
and lack hardware-level resource management mechanisms for avoiding resource
contention. Therefore, we propose Miriam, a contention-aware task coordination
framework for multi-DNN inference on edge GPU. Miriam consolidates two main
components, an elastic-kernel generator, and a runtime dynamic kernel
coordinator, to support mixed critical DNN inference. To evaluate Miriam, we
build a new DNN inference benchmark based on CUDA with diverse representative
DNN workloads. Experiments on two edge GPU platforms show that Miriam can
increase system throughput by 92% while only incurring less than 10\% latency
overhead for critical tasks, compared to state of art baselines
- โฆ