Search CORE

15 research outputs found

Investigation of Parallel Data Processing Using Hybrid High Performance CPU + GPU Systems and CUDA Streams

Author: Czarnul Paweł
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 16/12/2020
Field of study

The paper investigates parallel data processing in a hybrid CPU+GPU(s) system using multiple CUDA streams for overlapping communication and computations. This is crucial for efficient processing of data, in particular incoming data stream processing that would naturally be forwarded using multiple CUDA streams to GPUs. Performance is evaluated for various compute time to host-device communication time ratios, numbers of CUDA streams, for various numbers of threads managing computations on GPUs. Tests also reveal benefits of using CUDA MPS for overlapping communication and computations when using multiple processes. Furthermore, using standard memory allocation on a GPU and Unified Memory versions are compared, the latter including programmer added prefetching. Performance of a hybrid CPU+GPU version as well as scaling across multiple GPUs are demonstrated showing good speed-ups of the approach. Finally, the performance per power consumption of selected configurations are presented for various numbers of streams and various relative performances of GPUs and CPUs

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory

Author: Gong Xiangyang
Long Xinjian
Zhou Huiyang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 06/04/2022
Field of study

This paper proposes a novel intelligent framework for oversubscription management in CPU-GPU UVM. We analyze the current rule-based methods of GPU memory oversubscription with unified memory, and the current learning-based methods for other computer architectural components. We then identify the performance gap between the existing rule-based methods and the theoretical upper bound. We also identify the advantages of applying machine intelligence and the limitations of the existing learning-based methods. This paper proposes a novel intelligent framework for oversubscription management in CPU-GPU UVM. It consists of an access pattern classifier followed by a pattern-specific Transformer-based model using a novel loss function aiming for reducing page thrashing. A policy engine is designed to leverage the model's result to perform accurate page prefetching and pre-eviction. We evaluate our intelligent framework on a set of 11 memory-intensive benchmarks from popular benchmark suites. Our solution outperforms the state-of-the-art (SOTA) methods for oversubscription management, reducing the number of pages thrashed by 64.4\% under 125\% memory oversubscription compared to the baseline, while the SOTA method reduces the number of pages thrashed by 17.3\%. Our solution achieves an average IPC improvement of 1.52X under 125\% memory oversubscription, and our solution achieves an average IPC improvement of 3.66X under 150\% memory oversubscription. Our solution outperforms the existing learning-based methods for page address prediction, improving top-1 accuracy by 6.45\% (up to 41.2\%) on average for a single GPGPU workload, improving top-1 accuracy by 10.2\% (up to 30.2\%) on average for multiple concurrent GPGPU workloads.Comment: arXiv admin note: text overlap with arXiv:2203.1267

arXiv.org e-Print Archive

Holistic Performance Analysis and Optimization of Unified Virtual Memory

Author: Allen Tyler
Publication venue: Clemson University Libraries
Publication date: 01/08/2022
Field of study

The programming difficulty of creating GPU-accelerated high performance computing (HPC) codes has been greatly reduced by the advent of Unified Memory technologies that abstract the management of physical memory away from the developer. However, these systems incur substantial overhead that paradoxically grows for codes where these technologies are most useful. While these technologies are increasingly adopted for use in modern HPC frameworks and applications, the performance cost reduces the efficiency of these systems and turns away some developers from adoption entirely. These systems are naturally difficult to optimize due to the large number of interconnected hardware and software components that must be untangled to perform thorough analysis. In this thesis, we take the first deep dive into a functional implementation of a Unified Memory system, NVIDIA UVM, to evaluate the performance and characteristics of these systems. We show specific hardware and software interactions that cause serialization between host and devices. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. Through lower-level analysis, we find that the driver workload is dependent on the interactions among application access patterns, GPU hardware constraints, and Host OS components. These findings indicate that the cost of host OS components is significant and present across UM implementations. We also provide a proof-of-concept asynchronous approach to memory management in UVM that allows for reduced system overhead and improved application performance. This study provides constructive insight into future implementations and systems, such as Heterogeneous Memory Management

Clemson University: TigerPrints

CUDA Unified Memory를 위한 데이터 관리 및 프리페칭 기법

Author: 정재훈
Publication venue: 서울대학교 대학원
Publication date: 01/08/2022
Field of study

학위논문(박사) -- 서울대학교대학원 : 공과대학 컴퓨터공학부, 2022. 8. 이재진.Unified Memory (UM) is a component of CUDA programming model which provides a memory pool that has a single address space and can be accessed by both the host and the GPU. When UM is used, a CUDA program does not need to explicitly move data between the host and the device. It also allows GPU memory oversubscription by using CPU memory as a backing store. UM significantly lessens the burden of a programmer and provides great programmability. However, using UM solely does not guarantee good performance. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that are going to be accessed during the CUDA kernel execution. In this thesis, we propose three frameworks that exploits UM to improve the ease-of-programming while maximizing the application performance. The first framework is HUM, which hides host-to-device memory copy time of traditional CUDA program without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and fault mechanisms. The evaluation result shows that executing the applications under HUM is, on average, 1.21 times faster than executing them under original CUDA. The speedup is comparable to the average speedup 1.22 of the hand-optimized implementations for Unified Memory. The second framework is DeepUM which exploits UM to allow GPU memory oversubscription for deep neural networks. While UM allows memory oversubscription using a page fault mechanism, page fault handling introduces enormous overhead. We use a correlation prefetching technique to solve the problem and hide the overhead. The evaluation result shows that DeepUM achieves comparable performance to the other state-of-the-art approaches. At the same time, our framework can run larger batch size that other methods fail to run. The last framework is SnuRHAC that provides an illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed to use a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes workload to multiple GPUs in a cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends Unified Memory and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and to maximize performance. The evaluation result shows that while SnuRHAC significantly improves ease-of-programming, it shows scalable performance for the cluster environment depending on the application characteristics.Unified Memory (UM)는 CUDA 프로그래밍 모델에서 제공하는 기능 중 하나로 단일 메모리 주소 공간에 CPU와 GPU가 동시에 접근할 수 있도록 해준다. 이에 따라, UM을 사용할 경우 CUDA 프로그램에서 명시적으로 프로세서간에 데이터를 이동시켜주지 않아도 된다. 또한, CPU 메모리를 backing store로 사용하여 GPU의 메모리 크기 보다 더 많은 양의 메모리를 필요로 하는 프로그램을 실행할 수 있도록 해준다. 결과적으로, UM은 프로그래머의 부담을 크게 덜어주고 쉽게 프로그래밍 할 수 있도록 도와준다. 하지만, UM을 있는 그대로 사용하는 것은 성능 측면에서 좋지 않다. UM은 page fault mechanism을 통해 동작하는데 page fault를 처리하기 위해서는 많은 오버헤드가 발생하기 때문이다. UM을 사용하면서 최대의 성능을 얻기 위해서는 프로그래머가 소스 코드에 여러 힌트나 앞으로 CUDA 커널에서 사용될 메모리 영역에 대한 프리페치 명령을 삽입해주어야 한다. 본 논문은 UM을 사용하면서도 쉬운 프로그래밍과 최대의 성능이라는 두마리 토끼를 동시에 잡기 위한 방법들을 소개한다. 첫째로, HUM은 기존 CUDA 프로그램의 소스 코드를 수정하지 않고 호스트와 디바이스 간에 메모리 전송 시간을 최소화한다. 이를 위해, UM과 fault mechanism을 사용하여 호스트-디바이스 간 메모리 전송을 호스트 계산 혹은 CUDA 커널 실행과 중첩시킨다. 실험 결과를 통해 HUM을 통해 애플리케이션을 실행하는 것이 그렇지 않고 CUDA만을 사용하는 것에 비해 평균 1.21배 빠른 것을 확인하였다. 또한, Unified Memory를 기반으로 프로그래머가 소스 코드를 최적화한 것과 유사한 성능을 내는 것을 확인하였다. 두번째로, DeepUM은 UM을 활용하여 GPU의 메모리 크기 보다 더 많은 양의 메모리를 필요로 하는 딥 러닝 모델을 실행할 수 있게 한다. UM을 통해 GPU 메모리를 초과해서 사용할 경우 CPU와 GPU간에 페이지가 매우 빈번하게 이동하는데, 이때 많은 오버헤드가 발생한다. 두번째 방법에서는 correlation 프리페칭 기법을 통해 이 오버헤드를 최소화한다. 실험 결과를 통해 DeepUM은 기존에 연구된 결과들과 비슷한 성능을 보이면서 더 큰 배치 사이즈 혹은 더 큰 하이퍼파라미터를 사용하는 모델을 실행할 수 있음을 확인하였다. 마지막으로, SnuRHAC은 클러스터에 장착된 여러 GPU를 마치 하나의 통합된 GPU처럼 보여준다. 따라서, 프로그래머는 여러 GPU를 대상으로 프로그래밍 하지 않고 하나의 가상 GPU를 대상으로 프로그래밍하면 클러스터에 장착된 모든 GPU를 활용할 수 있다. 이는 SnuRHAC이 Unified Memory를 클러스터 환경에서 동작하도록 확장하고, 필요한 데이터를 자동으로 GPU간에 전송하고 관리해주기 때문이다. 또한, UM을 사용하면서 발생할 수 있는 오버헤드를 최소화하기 위해 다양한 프리페칭 기법을 소개한다. 실험 결과를 통해 SnuRHAC이 쉽게 GPU 클러스터를 위한 프로그래밍을 할 수 있도록 도와줄 뿐만 아니라, 애플리케이션 특성에 따라 최적의 성능을 낼 수 있음을 보인다.1 Introduction 1 2 Related Work 7 3 CUDA Unified Memory 12 4 Framework for Maximizing the Performance of Traditional CUDA Program 17 4.1 Overall Structure of HUM 17 4.2 Overlapping H2Dmemcpy and Computation 19 4.3 Data Consistency and Correctness 23 4.4 HUM Driver 25 4.5 HUM H2Dmemcpy Mechanism 26 4.6 Parallelizing Memory Copy Commands 29 4.7 Scheduling Memory Copy Commands 31 5 Framework for Running Large-scale DNNs on a Single GPU 33 5.1 Structure of DeepUM 33 5.1.1 DeepUM Runtime 34 5.1.2 DeepUM Driver 35 5.2 Correlation Prefetching for GPU Pages 36 5.2.1 Pair-based Correlation Prefetching 37 5.2.2 Correlation Prefetching in DeepUM 38 5.3 Optimizations for GPU Page Fault Handling 42 5.3.1 Page Pre-eviction 42 5.3.2 Invalidating UM Blocks of Inactive PyTorch Blocks 43 6 Framework for Virtualizing a Single Device Image for a GPU Cluster 45 6.1 Overall Structure of SnuRHAC 45 6.2 Workload Distribution 48 6.3 Cluster Unified Memory 50 6.4 Additional Optimizations 57 6.5 Prefetching 58 6.5.1 Static Prefetching 58 6.5.2 Dynamic Prefetching 61 7 Evaluation 62 7.1 Framework for Maximizing the Performance of Traditional CUDA Program 62 7.1.1 Methodology 63 7.1.2 Results 64 7.2 Framework for Running Large-scale DNNs on a Single GPU 70 7.2.1 Methodology 70 7.2.2 Comparison with Naive UM and IBM LMS 72 7.2.3 Parameters of the UM Block Correlation Table 78 7.2.4 Comparison with TensorFlow-based Approaches 79 7.3 Framework for Virtualizing Single Device Image for a GPU Cluster 81 7.3.1 Methodology 81 7.3.2 Results 84 8 Discussions and Future Work 91 9 Conclusion 93 초록 111박

SNU Open Repository and Archive

Recommended from our members

Architecture Optimizations for Memory Systems of Throughput Processors

Author: Gu Yongbin
Publication venue: 'Oregon State University'
Publication date
Field of study

Throughput-oriented processors, such as graphics processing units (GPUs), have been increasingly used to accelerate general purpose computing, including machine learning models that are being utilized in numerous disciplines. Thousands of concurrently running threads in a GPU demand a highly efficient memory subsystem for data supply in GPUs. In this dissertation, we have studied the memory architecture of the traditional GPUs and revealed that the traditional memory architecture, initially designed for graphics processing, is less efficient in handling general purpose computing tasks. We propose several memory architecture optimizations for two primary objectives: (1) optimize current memory architecture for more efficient handling of general purpose computing tasks; (2) improve the overall performance of GPUs. This dissertation has four major parts: (1) The first part deals with the L2 cache inefficiency. A key factor that affects the memory subsystem is the order of memory accesses. While reordering memory accesses at L2 cache has large potential benefits to both cache and DRAM, little work has been conducted to exploit this. In this work, we investigate the largely unexplored opportunity of L2 cache access reordering. We propose Cache Access Reordering Tree (CART), a novel architecture that can improve memory subsystem efficiency by actively reordering memory accesses at L2 cache to be cache-friendly and DRAM-friendly. (2) The second part deals with miss handling architecture (MHA) in GPUs. Conventional MHA is static in sense that it provides a fixed number of MSHR entries to track primary misses, and a fixed number of slots within each entry to track secondary misses. This leads to severe entry or slot under-utilization and poor match to practical workloads, as the number of memory requests to different cache lines can vary significantly. We propose Dynamically Linked MSHR (DL-MSHR), a novel approach that dynamically forms MSHR entries from a pool of available slots. This approach can self-adapt to primary-miss-predominant applications by forming more entries with fewer slots, and self-adapt to secondary-miss-predominant applications by having fewer entries but more slots per entry. (3) The third part aims to improve the performance of Unified Virtual Memory (UVM), which is recently introduced into GPUs. We propose CAPTURE(Capacity-Aware Prefetch with True Usage Reflected Eviction), a novel microarchitecture scheme that implements coordinated prefetch-eviction for GPU UVM management. CAPTURE utilizes GPU memory status and memory access history to dynamically adjust the prefetching and ``capture'' accurate remaining page reusing opportunities for improved eviction. (4) In the fourth part, we propose a comprehensive UVM benchmark suite named UVMBench to facilitate future research on the UVM research

ScholarsArchive@OSU

Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Author: A Dziekonski
AV Knyazev
AV Knyazev
C Yang
G Ortega
JW Choi
M Knap
M Shao
P Maris
P Maris
X Yang
Y Wang
Publication venue: eScholarship, University of California
Publication date: 01/01/2020
Field of study

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOPBCG GPU implementation achieves a 2.8

{\times }

–4.3

{\times }

speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9

{\times }

and 48.2

{\times }

speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively

Crossref

eScholarship - University of California

GPU accelerated path tracing of massive scenes

Author: Jaroš Milan
Strakoš Petr
Říha Lubomír
Špeťko Matěj
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021
Field of study

This article presents a solution to path tracing of massive scenes on multiple GPUs. Our approach analyzes the memory access pattern of a path tracer and defines how the scene data should be distributed across up to 16 CPUs with minimal effect on performance. The key concept is that the parts of the scene that have the highest amount of memory accesses are replicated on all GPUs. We propose two methods for maximizing the performance of path tracing when working with partially distributed scene data. Both methods work on the memory management level and therefore path tracer data structures do not have to be redesigned, making our approach applicable to other path tracers with only minor changes in their code. As a proof of concept, we have enhanced the open-source Blender Cycles path tracer. The approach was validated on scenes of sizes up to 169 GB. We show that only 1 5% of the scene data needs to be replicated to all machines for such large scenes. On smaller scenes we have verified that the performance is very close to rendering a fully replicated scene. In terms of scalability we have achieved a parallel efficiency of over 94% using up to 16 GPUs.Web of Science402art. no. 1

DSpace at VSB Technical University of Ostrava

GPU Resource Optimization and Scheduling for Shared Execution Environments

Author: Luley Ryan Seamus
Publication venue: SURFACE at Syracuse University
Publication date: 23/08/2020
Field of study

General purpose graphics processing units have become a computing workhorse for a variety of data- and compute-intensive applications, from large supercomputing systems for massive data analytics to small, mobile embedded devices for autonomous vehicles. Making effective and efficient use of these processors traditionally relies on extensive programmer expertise to design and develop kernel methods which simultaneously trade off task decomposition and resource exploitation. Often, new architecture designs force code refinements in order to continue to achieve optimal performance. At the same time, not all applications require full utilization of the system to achieve that optimal performance. In this case, the increased capability of new architectures introduces an ever-widening gap between the level of resources necessary for optimal performance and the level necessary to maintain system efficiency. The ability to schedule and execute multiple independent tasks on a GPU, known generally as concurrent kernel execution, enables application programmers and system developers to balance application performance and system efficiency. Various approaches to develop both coarse- and fine-grained scheduling mechanisms to achieve a high degree of resource utilization and improved application performance have been studied. Most of these works focus on mechanisms for the management of compute resources, while a small percentage consider the data transfer channels. In this dissertation, we propose a pragmatic approach to scheduling and managing both types of resources – data transfer and compute – that is transparent to an application programmer and capable of providing near-optimal system performance. Furthermore, the approaches described herein rely on reinforcement learning methods, which enable the scheduling solutions to be flexible to a variety of factors, such as transient application behaviors, changing system designs, and tunable objective functions. Finally, we describe a framework for the practical implementation of learned scheduling policies to achieve high resource utilization and efficient system performance

Syracuse University Research Facility and Collaborative Environment