
    GPU System Optimization for Efficient System Resource Utilization of General-Purpose GPU Computing Applications in a Multitasking Environment

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: 염헌영 (Heon Y. Yeom).
    Recently, General-Purpose GPU (GPGPU) applications have been playing key roles in many different research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature of these applications is that all of them require massive computational power, which matches the high-parallelism characteristics of the graphics processing unit (GPU). However, because the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU system's resources to achieve the best performance of the GPU, since the GPU system is designed to provide system-level fairness to all applications instead of being optimized for a specific type. GPU multitasking can address this issue by co-locating multiple kernels with diverse resource usage patterns to share the GPU resources in parallel. However, the current GPU multitasking scheme focuses only on co-launching the kernels rather than making them execute more efficiently. Moreover, the current GPU multitasking scheme is not open-source, which makes it more difficult to optimize, since the GPGPU applications and the GPU system are unaware of each other's features. In this dissertation, we claim that support from a framework between the GPU system and the GPGPU applications, without modifying the applications, can yield better performance. We design and implement such a framework while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be over-subscribed in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches solve the problems related to GPGPU applications better than the existing approaches while delivering better performance.
    Table of contents: Chapter 1 Introduction (1.1 Motivation; 1.2 Contribution; 1.3 Outline). Chapter 2 Background (2.1 Graphics Processing Unit (GPU) and CUDA; 2.2 Checkpoint and Restart; 2.3 Resource Sharing Model; 2.4 CUDA Context; 2.5 GPU Thread Block Scheduling; 2.6 Multi-Process Service with Hyper-Q). Chapter 3 Checkpoint-based Solution for the GPU Memory Over-subscription Problem (3.1 Motivation; 3.2 Related Work; 3.3 Design and Implementation: System Design, CUDA API Wrapping Module, Scheduler; 3.4 Evaluation: Evaluation Setup, Overhead of FlexGPU, Performance with GPU Benchmark Suites, Performance with Real-world Workloads, Performance of Workloads Composed of Multiple Applications; 3.5 Summary). Chapter 4 A Workload-aware Fine-grained Resource Management Framework for GPGPUs (4.1 Motivation; 4.2 Related Work: GPU Resource Sharing, GPU Scheduling; 4.3 Design and Implementation: System Architecture, CUDA API Wrapping Module, smCompactor Runtime, Implementation Details; 4.4 Analysis of the Relation between Performance and Workload Usage Pattern: Workload Definition, Analysis of Performance Saturation, Predicting the Necessary SMs and Thread Blocks for Best Performance; 4.5 Evaluation: Evaluation Methodology, Overhead of smCompactor, Performance with Different Thread Block Counts on Different Numbers of SMs, Performance with Concurrent Kernels and Resource Sharing; 4.6 Summary). Chapter 5 Conclusion. Abstract in Korean.
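
    The dissertation's first scheme (FlexGPU in the table of contents) checkpoints device memory to host memory so that co-located applications can together allocate more than the physical GPU memory. Below is a minimal CUDA sketch of that swap-out/swap-in step, assuming the API-wrapping module tracks every device allocation; the struct and function names are illustrative, not FlexGPU's actual interface.

        // Minimal sketch of checkpoint-based GPU memory over-subscription,
        // assuming the framework tracks every cudaMalloc'd region; names are
        // illustrative, not FlexGPU's actual API.
        #include <cuda_runtime.h>
        #include <vector>

        struct TrackedBuffer {
            void*  dev;     // device allocation owned by a suspended application
            void*  host;    // host staging area holding the checkpoint
            size_t bytes;
        };

        // Checkpoint: copy device buffers to host and free device memory so a
        // co-running application can use it.
        void checkpointOut(std::vector<TrackedBuffer>& bufs) {
            for (auto& b : bufs) {
                cudaMemcpy(b.host, b.dev, b.bytes, cudaMemcpyDeviceToHost);
                cudaFree(b.dev);
                b.dev = nullptr;
            }
        }

        // Restore: re-allocate device memory and copy the saved contents back
        // before the suspended application resumes.
        void checkpointIn(std::vector<TrackedBuffer>& bufs) {
            for (auto& b : bufs) {
                cudaMalloc(&b.dev, b.bytes);
                cudaMemcpy(b.dev, b.host, b.bytes, cudaMemcpyHostToDevice);
            }
        }

    Since cudaMalloc on restore may return a different address, a wrapping layer would also have to translate device pointers passed to later CUDA calls, which is one reason to interpose on the CUDA API rather than modify applications.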

    Inter-workgroup barrier synchronisation on graphics processing units

    GPUs are parallel devices that are able to run thousands of independent threads concurrently. Traditional GPU programs are data-parallel, requiring little to no communication, i.e. synchronisation, between threads. However, classical concurrency in the context of CPUs often exploits synchronisation idioms that are not supported on GPUs. By studying such idioms on GPUs, with an aim to facilitate them in a portable way, a wider and more generic space of GPU applications can be made possible. While the breadth of this thesis extends to many aspects of GPU systems, the common thread throughout is the global barrier: an execution barrier that synchronises all threads executing a GPU application. The idea of such a barrier might seem straightforward; however, this investigation reveals many challenges and insights. In particular, this thesis includes the following studies. Execution models: while a general global barrier can deadlock due to starvation on GPUs, it is shown that the scheduling guarantees of current GPUs can be used to dynamically create an execution environment that allows for a safe and portable global barrier across a subset of the GPU threads. Application optimisations: a set of GPU optimisations tailored for graph applications is examined, including one optimisation enabled by the global barrier. It is shown that these optimisations can provide substantial performance improvements, e.g. the barrier optimisation achieves over a 10X speedup on AMD and Intel GPUs. The performance portability of these optimisations is investigated, as their utility varies across input, application, and architecture. Multitasking: because many GPUs do not support preemption, long-running GPU compute tasks (e.g. applications that use the global barrier) may block other GPU functions, including graphics. A simple cooperative multitasking scheme is proposed that allows graphics tasks to meet their deadlines with reasonable overheads.
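
    The safe-by-construction barrier described under "execution models" relies on an occupancy-bound guarantee: only as many workgroups are synchronised as can be simultaneously resident on the GPU. CUDA exposes the same guarantee through cooperative groups, so the sketch below shows the idea in CUDA terms (illustrative kernel; assumes a device supporting cooperative launch, and compilation with nvcc -rdc=true):

        // Safe grid-wide barrier via CUDA cooperative groups: the launch is
        // sized to the number of co-resident blocks, so grid.sync() cannot
        // deadlock through starvation.
        #include <cooperative_groups.h>
        #include <cuda_runtime.h>
        namespace cg = cooperative_groups;

        __global__ void twoPhase(float* a, float* b, int n) {
            cg::grid_group grid = cg::this_grid();
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) a[i] *= 2.0f;                  // phase 1
            grid.sync();                              // global barrier across all blocks
            if (i < n) b[i] = a[i] + a[(i + 1) % n];  // phase 2 reads phase-1 results
        }

        int main() {
            int device = 0, smCount = 0, blocksPerSm = 0, threads = 256;
            cudaGetDevice(&device);
            cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
            // Ask how many blocks of this kernel can be co-resident, and launch
            // exactly that many.
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, twoPhase, threads, 0);
            int blocks = smCount * blocksPerSm;
            int n = blocks * threads;                 // size the problem to the launch
            float *a, *b;
            cudaMalloc(&a, n * sizeof(float));
            cudaMalloc(&b, n * sizeof(float));
            cudaMemset(a, 0, n * sizeof(float));
            void* args[] = { &a, &b, &n };
            cudaLaunchCooperativeKernel((void*)twoPhase, dim3(blocks), dim3(threads), args, 0, 0);
            cudaDeviceSynchronize();
            cudaFree(a); cudaFree(b);
            return 0;
        }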

    Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking

    With the help of the parallelism provided by their fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to fully explore the potential of FPGA resources. Intermediate frameworks above hardware circuits have been proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era. In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform on which programs written in an OpenCL-like programming model can be launched. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment, including an intermediate software framework and automatic system builders. Designers' programs can be either synthesized as hardware processing elements (PEs) or compiled to executable files running on software PEs. Benefiting from nontrivial features such as re-loadable PEs and independent group-level schedulers, multitasking is enabled for both software and hardware PEs to improve the efficiency of utilizing hardware resources. The PolyPC framework is evaluated regarding performance, area efficiency, and multitasking. The results show a maximum 66-times speedup over a dual-core ARM processor and a 1043-times speedup over a high-performance MicroBlaze, with 125 times the area efficiency. It delivers a significant improvement in response time to high-priority tasks with priority-aware scheduling. The overheads of multitasking are evaluated to analyze trade-offs. In the design flow, OpenCL application programs are converted into executables through front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users' specifications.
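
    For context, the work-item style of program that an OpenCL-like model such as PolyPC's consumes looks as follows. This is a CUDA analogue with made-up names, not PolyPC code, since CUDA shares the same kernel/work-item abstraction; the point is that each work item is independent, which is what lets a framework map groups of them onto either hardware or software PEs.

        #include <cuda_runtime.h>

        // One work item per element; a PolyPC-style framework can schedule
        // groups of such items onto any PE type.
        __global__ void scale(const float* in, float* out, float k, int n) {
            int gid = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: get_global_id(0)
            if (gid < n) out[gid] = k * in[gid];
        }

        int main() {
            int n = 4096;
            float *in, *out;
            cudaMalloc(&in, n * sizeof(float));
            cudaMalloc(&out, n * sizeof(float));
            cudaMemset(in, 0, n * sizeof(float));   // dummy input data
            scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
            cudaDeviceSynchronize();
            cudaFree(in); cudaFree(out);
            return 0;
        }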

    GPU Optimization of Algorithms for Improving Enhancement and Segmentation in Liver Images

    This doctoral thesis investigates GPU acceleration for liver image enhancement and segmentation. With this motivation, detailed research is carried out as a compendium of articles. The work is structured in three scientific contributions: the first addresses enhancement and tumor segmentation, the second explores vessel segmentation, and the third covers liver segmentation. These works are implemented on the GPU with significant speedups, giving them great scientific impact and relevance within this doctoral thesis. The first work proposes cross-modality-based contrast enhancement for tumor segmentation on the GPU. To do this, it takes a target image and a guidance image as input and enhances the low-quality target image by applying a two-dimensional histogram approach. It has further been observed that the enhanced image yields more accurate tumor segmentation using GPU-based dynamic seeded region growing. The second contribution concerns fast, parallel, gradient-based seeded region growing, where a static approach is proposed and implemented on the GPU for accurate vessel segmentation. The third contribution describes GPU acceleration of the Chan-Vese model and cross-modality-based contrast enhancement for liver segmentation.
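
    Seeded region growing, which the first two contributions run on the GPU, maps naturally onto one thread per pixel with repeated relaxation passes until no pixel changes. The following is a minimal CUDA sketch of one pass over an assumed 8-bit grayscale image; the threshold, layout, and names are illustrative assumptions, not taken from the articles.

        #include <cuda_runtime.h>

        // One growth iteration: a pixel joins the region if some 4-neighbour is
        // already labelled and the intensity difference stays under a threshold.
        // Races on label are benign, since labels only ever flip 0 -> 1.
        __global__ void growRegionStep(const unsigned char* img, unsigned char* label,
                                       int w, int h, int thresh, int* changed) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= w || y >= h) return;
            int idx = y * w + x;
            if (label[idx]) return;                 // already in the region
            const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
            for (int k = 0; k < 4; ++k) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                int nidx = ny * w + nx;
                if (label[nidx] && abs((int)img[idx] - (int)img[nidx]) < thresh) {
                    label[idx] = 1;
                    *changed = 1;                   // host relaunches until stable
                    break;
                }
            }
        }

        int main() {
            int w = 512, h = 512, thresh = 8;
            unsigned char *img, *label;
            int *changed, hChanged = 1;
            cudaMalloc(&img, w * h);
            cudaMalloc(&label, w * h);
            cudaMalloc(&changed, sizeof(int));
            cudaMemset(img, 0, w * h);              // dummy uniform image
            cudaMemset(label, 0, w * h);
            unsigned char seed = 1;                 // seed pixel at the image centre
            cudaMemcpy(label + (h / 2) * w + w / 2, &seed, 1, cudaMemcpyHostToDevice);
            dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
            while (hChanged) {                      // iterate until the region stops growing
                cudaMemset(changed, 0, sizeof(int));
                growRegionStep<<<grid, block>>>(img, label, w, h, thresh, changed);
                cudaMemcpy(&hChanged, changed, sizeof(int), cudaMemcpyDeviceToHost);
            }
            cudaFree(img); cudaFree(label); cudaFree(changed);
            return 0;
        }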

    Towards multiprogrammed GPUs

    Programmable Graphics Processing Units (GPUs) have recently become the most pervasive massively parallel processors. They have come a long way, from fixed-function ASICs designed to accelerate graphics tasks to a programmable architecture that can also execute general-purpose computations. Because of their performance and efficiency, an increasing amount of software relies on them to accelerate data-parallel and computationally intensive sections of code. They have earned a place in many systems, from low-power mobile devices to the biggest data centers in the world. However, GPUs are still plagued by the fact that they essentially have no multiprogramming support, resulting in low system performance if the GPU is shared among multiple programs. In this dissertation we set out to provide rich GPU multiprogramming support by improving the multitasking capabilities and increasing the virtual memory functionality and performance. The main issue hindering multitasking support in GPUs is the non-preemptive execution of GPU kernels. Here we propose two preemption mechanisms with different design philosophies that can be used by a scheduler to preempt execution on GPU cores and make room for some other process. We also argue for spatial sharing of the GPU and propose a concrete hardware scheduler implementation that dynamically partitions the GPU cores among running kernels according to their set priorities. Opposing the assumptions made in related work, we demonstrate that preemptive execution is feasible and the desired approach to GPU multitasking. We further show improved system fairness and responsiveness with our scheduling policy. We also pinpoint that at the core of the insufficient virtual memory support lies the exception handling mechanism used by modern GPUs. Currently, GPUs offload the actual exception handling work to the CPU, while the faulting instruction is stalled in the GPU core. This stall-on-fault model prevents some virtual memory features and optimizations and is especially harmful in multiprogrammed environments because it prevents context switching the GPU unless all the in-flight faults are resolved. In this dissertation, we propose three GPU core organizations with varying performance-complexity trade-offs that get rid of the stall-on-fault execution and enable preemptible exceptions on the GPU (i.e., the faulting instruction can be squashed and restarted later). Building on this support, we implement two use cases and demonstrate their utility. One is a scheme that performs a context switch of the faulted threads and tries to find some other useful work to do in the meantime, hiding the latency of the fault and improving system performance. The other enables the fault handling code to run locally on the GPU instead of relying on CPU offloading, and shows that local fault handling can also improve performance.
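
    The dissertation's priority-driven spatial partitioning is a hardware scheduler, so it cannot be reproduced in user code. The closest software analogue on current CUDA systems is stream priorities, sketched below with placeholder kernels; it shows how priorities bias block dispatch as SM resources free up, but, unlike the proposed design, running blocks are never preempted.

        #include <cuda_runtime.h>

        __global__ void busyKernel(float* x, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) for (int k = 0; k < 1000; ++k) x[i] = x[i] * 1.0001f + 1e-6f;
        }

        int main() {
            // Numerically, the greatest priority is the smallest value.
            int least = 0, greatest = 0;
            cudaDeviceGetStreamPriorityRange(&least, &greatest);
            cudaStream_t sLow, sHigh;
            cudaStreamCreateWithPriority(&sLow,  cudaStreamNonBlocking, least);
            cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, greatest);
            int n = 1 << 22;
            float *a, *b;
            cudaMalloc(&a, n * sizeof(float));
            cudaMalloc(&b, n * sizeof(float));
            // Blocks from the high-priority stream are preferentially
            // dispatched while both kernels compete for the GPU.
            busyKernel<<<n / 256, 256, 0, sLow>>>(a, n);
            busyKernel<<<n / 256, 256, 0, sHigh>>>(b, n);
            cudaDeviceSynchronize();
            cudaStreamDestroy(sLow); cudaStreamDestroy(sHigh);
            cudaFree(a); cudaFree(b);
            return 0;
        }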

    A study of replacing CUDA by OpenCL in KGPU

    Undergraduate monograph, Universidade de Brasília, Faculdade UnB Gama, 2015. The GPU is a highly parallel device that has become popular. Nowadays, many processors already come with a minimal GPU on the same die, a characteristic that creates a new and unexplored application area for this device. CUDA and OpenCL are two non-graphics libraries commonly used to take advantage of the GPU. CUDA was created by NVIDIA and designed to run on NVIDIA's GPUs; OpenCL, on the other hand, was created to run on many different devices. These libraries interact with the operating system through device drivers, and usually this is the only connection between them. A group of researchers from Utah proposed using the GPU as a coprocessor, and they developed a CUDA-based device driver to achieve this goal (they called it KGPU). In this work we improved KGPU's code, added OpenCL support, and analyzed the possibility of using this project as a mature solution.
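
    The porting surface involved in replacing CUDA by OpenCL is largely a one-to-one API mapping. The sketch below is a trivial CUDA program with the corresponding OpenCL host/kernel calls noted in comments; it is an illustrative mapping, not code from KGPU.

        #include <cuda_runtime.h>

        __global__ void addOne(int* buf, int n) {          // OpenCL: __kernel void addOne(__global int* buf, int n)
            int i = blockIdx.x * blockDim.x + threadIdx.x; // OpenCL: get_global_id(0)
            if (i < n) buf[i] += 1;
        }

        int main() {
            int n = 1024, *d;
            cudaMalloc(&d, n * sizeof(int));               // OpenCL: clCreateBuffer
            cudaMemset(d, 0, n * sizeof(int));             // OpenCL: clEnqueueFillBuffer
            addOne<<<n / 256, 256>>>(d, n);                // OpenCL: clSetKernelArg + clEnqueueNDRangeKernel
            cudaDeviceSynchronize();                       // OpenCL: clFinish
            cudaFree(d);                                   // OpenCL: clReleaseMemObject
            return 0;
        }

    The asymmetry the monograph deals with is that the kernel language also changes: CUDA kernels are compiled offline by nvcc, whereas OpenCL kernels are typically compiled at runtime from source via clBuildProgram.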

    Workload-aware Scheduling Techniques for General Purpose Applications on Graphics Processing Units

    In the last decade, there has been a wide-scale adoption of Graphics Processing Units (GPUs) as a co-processor for accelerating data-parallel general-purpose applications. A primary driver of this adoption is that GPUs offer orders of magnitude higher floating point arithmetic throughput and memory bandwidth compared to their CPU counterparts. As GPU architectures are designed as throughput processors, they adopt a manycore architecture with tens to hundreds of cores, each with multiple vector processing pipelines. A significant amount of the die area is dedicated to floating point units, at the expense of not having the hardware units used for memory latency hiding in conventional CPU architectures. The quintessential technique used for memory latency tolerance is exploiting data-level parallelism in the workload and interleaving the execution of multiple SIMD threads, overlapping the latency of threads waiting on data from memory with computation from other threads. With each architecture generation, GPU architectures provide an increasing amount of floating point throughput and memory bandwidth, and alongside, the architectures support an increasing number of simultaneously active threads. We envision that to continue making advancements in GPU computing, workload-aware scheduling techniques are required. In the GPU computing workflow, scheduling is performed at three levels: the system or chip level, the core level, and the thread level. The work proposed in this research aims at designing novel workload-aware scheduling techniques at each of the three levels of scheduling. We show that GPU computing workloads have significantly varying characteristics, and design techniques that monitor the hardware state to aid each of the three levels of scheduling. Each technique is implemented in a cycle-level GPU architecture simulator, and its effect on performance is analyzed against state-of-the-art scheduling techniques used in GPU architectures.
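
    At the core level, the degree of latency hiding depends on how many thread blocks can be co-resident on an SM, which in turn depends on the kernel's register and shared-memory usage. A scheduler can query this directly; a minimal CUDA sketch using the occupancy API, with a placeholder kernel:

        #include <cuda_runtime.h>
        #include <cstdio>

        __global__ void saxpy(float a, const float* x, float* y, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = a * x[i] + y[i];
        }

        int main() {
            int device = 0, smCount = 0, blocksPerSm = 0, threads = 256;
            cudaGetDevice(&device);
            cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
            // How many 256-thread blocks of this kernel fit on one SM, given its
            // register and shared-memory footprint; more resident threads means
            // more opportunities to overlap memory stalls with computation.
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, saxpy, threads, 0);
            printf("resident threads per SM: %d (device total: %d)\n",
                   blocksPerSm * threads, blocksPerSm * threads * smCount);
            return 0;
        }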

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed include programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.