4,926 research outputs found

    CUDA Unified Memory๋ฅผ ์œ„ํ•œ ๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ ๋ฐ ํ”„๋ฆฌํŽ˜์นญ ๊ธฐ๋ฒ•

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2022. 8. ์ด์žฌ์ง„.Unified Memory (UM) is a component of CUDA programming model which provides a memory pool that has a single address space and can be accessed by both the host and the GPU. When UM is used, a CUDA program does not need to explicitly move data between the host and the device. It also allows GPU memory oversubscription by using CPU memory as a backing store. UM significantly lessens the burden of a programmer and provides great programmability. However, using UM solely does not guarantee good performance. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that are going to be accessed during the CUDA kernel execution. In this thesis, we propose three frameworks that exploits UM to improve the ease-of-programming while maximizing the application performance. The first framework is HUM, which hides host-to-device memory copy time of traditional CUDA program without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and fault mechanisms. The evaluation result shows that executing the applications under HUM is, on average, 1.21 times faster than executing them under original CUDA. The speedup is comparable to the average speedup 1.22 of the hand-optimized implementations for Unified Memory. The second framework is DeepUM which exploits UM to allow GPU memory oversubscription for deep neural networks. While UM allows memory oversubscription using a page fault mechanism, page fault handling introduces enormous overhead. We use a correlation prefetching technique to solve the problem and hide the overhead. The evaluation result shows that DeepUM achieves comparable performance to the other state-of-the-art approaches. At the same time, our framework can run larger batch size that other methods fail to run. The last framework is SnuRHAC that provides an illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed to use a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes workload to multiple GPUs in a cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends Unified Memory and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and to maximize performance. The evaluation result shows that while SnuRHAC significantly improves ease-of-programming, it shows scalable performance for the cluster environment depending on the application characteristics.Unified Memory (UM)๋Š” CUDA ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์—์„œ ์ œ๊ณตํ•˜๋Š” ๊ธฐ๋Šฅ ์ค‘ ํ•˜๋‚˜๋กœ ๋‹จ์ผ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ ๊ณต๊ฐ„์— CPU์™€ GPU๊ฐ€ ๋™์‹œ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ด์ค€๋‹ค. ์ด์— ๋”ฐ๋ผ, UM์„ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ CUDA ํ”„๋กœ๊ทธ๋žจ์—์„œ ๋ช…์‹œ์ ์œผ๋กœ ํ”„๋กœ์„ธ์„œ๊ฐ„์— ๋ฐ์ดํ„ฐ๋ฅผ ์ด๋™์‹œ์ผœ์ฃผ์ง€ ์•Š์•„๋„ ๋œ๋‹ค. ๋˜ํ•œ, CPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ backing store๋กœ ์‚ฌ์šฉํ•˜์—ฌ GPU์˜ ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ ๋ณด๋‹ค ๋” ๋งŽ์€ ์–‘์˜ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ•„์š”๋กœ ํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ด์ค€๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ, UM์€ ํ”„๋กœ๊ทธ๋ž˜๋จธ์˜ ๋ถ€๋‹ด์„ ํฌ๊ฒŒ ๋œ์–ด์ฃผ๊ณ  ์‰ฝ๊ฒŒ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋„์™€์ค€๋‹ค. ํ•˜์ง€๋งŒ, UM์„ ์žˆ๋Š” ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ์„ฑ๋Šฅ ์ธก๋ฉด์—์„œ ์ข‹์ง€ ์•Š๋‹ค. 
UM์€ page fault mechanism์„ ํ†ตํ•ด ๋™์ž‘ํ•˜๋Š”๋ฐ page fault๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋งŽ์€ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. UM์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ ์ตœ๋Œ€์˜ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ”„๋กœ๊ทธ๋ž˜๋จธ๊ฐ€ ์†Œ์Šค ์ฝ”๋“œ์— ์—ฌ๋Ÿฌ ํžŒํŠธ๋‚˜ ์•ž์œผ๋กœ CUDA ์ปค๋„์—์„œ ์‚ฌ์šฉ๋  ๋ฉ”๋ชจ๋ฆฌ ์˜์—ญ์— ๋Œ€ํ•œ ํ”„๋ฆฌํŽ˜์น˜ ๋ช…๋ น์„ ์‚ฝ์ž…ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ UM์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ๋„ ์‰ฌ์šด ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ์ตœ๋Œ€์˜ ์„ฑ๋Šฅ์ด๋ผ๋Š” ๋‘๋งˆ๋ฆฌ ํ† ๋ผ๋ฅผ ๋™์‹œ์— ์žก๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•๋“ค์„ ์†Œ๊ฐœํ•œ๋‹ค. ์ฒซ์งธ๋กœ, HUM์€ ๊ธฐ์กด CUDA ํ”„๋กœ๊ทธ๋žจ์˜ ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ์ˆ˜์ •ํ•˜์ง€ ์•Š๊ณ  ํ˜ธ์ŠคํŠธ์™€ ๋””๋ฐ”์ด์Šค ๊ฐ„์— ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ์‹œ๊ฐ„์„ ์ตœ์†Œํ™”ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด, UM๊ณผ fault mechanism์„ ์‚ฌ์šฉํ•˜์—ฌ ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๊ฐ„ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ํ˜ธ์ŠคํŠธ ๊ณ„์‚ฐ ํ˜น์€ CUDA ์ปค๋„ ์‹คํ–‰๊ณผ ์ค‘์ฒฉ์‹œํ‚จ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด HUM์„ ํ†ตํ•ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์‹คํ–‰ํ•˜๋Š” ๊ฒƒ์ด ๊ทธ๋ ‡์ง€ ์•Š๊ณ  CUDA๋งŒ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์— ๋น„ํ•ด ํ‰๊ท  1.21๋ฐฐ ๋น ๋ฅธ ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋˜ํ•œ, Unified Memory๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ”„๋กœ๊ทธ๋ž˜๋จธ๊ฐ€ ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ์ตœ์ ํ™”ํ•œ ๊ฒƒ๊ณผ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋‘๋ฒˆ์งธ๋กœ, DeepUM์€ UM์„ ํ™œ์šฉํ•˜์—ฌ GPU์˜ ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ ๋ณด๋‹ค ๋” ๋งŽ์€ ์–‘์˜ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ•„์š”๋กœ ํ•˜๋Š” ๋”ฅ ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค. UM์„ ํ†ตํ•ด GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ดˆ๊ณผํ•ด์„œ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ CPU์™€ GPU๊ฐ„์— ํŽ˜์ด์ง€๊ฐ€ ๋งค์šฐ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ด๋™ํ•˜๋Š”๋ฐ, ์ด๋•Œ ๋งŽ์€ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ๋‘๋ฒˆ์งธ ๋ฐฉ๋ฒ•์—์„œ๋Š” correlation ํ”„๋ฆฌํŽ˜์นญ ๊ธฐ๋ฒ•์„ ํ†ตํ•ด ์ด ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด DeepUM์€ ๊ธฐ์กด์— ์—ฐ๊ตฌ๋œ ๊ฒฐ๊ณผ๋“ค๊ณผ ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉด์„œ ๋” ํฐ ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ ํ˜น์€ ๋” ํฐ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, SnuRHAC์€ ํด๋Ÿฌ์Šคํ„ฐ์— ์žฅ์ฐฉ๋œ ์—ฌ๋Ÿฌ GPU๋ฅผ ๋งˆ์น˜ ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ GPU์ฒ˜๋Ÿผ ๋ณด์—ฌ์ค€๋‹ค. ๋”ฐ๋ผ์„œ, ํ”„๋กœ๊ทธ๋ž˜๋จธ๋Š” ์—ฌ๋Ÿฌ GPU๋ฅผ ๋Œ€์ƒ์œผ๋กœ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํ•˜์ง€ ์•Š๊ณ  ํ•˜๋‚˜์˜ ๊ฐ€์ƒ GPU๋ฅผ ๋Œ€์ƒ์œผ๋กœ ํ”„๋กœ๊ทธ๋ž˜๋ฐํ•˜๋ฉด ํด๋Ÿฌ์Šคํ„ฐ์— ์žฅ์ฐฉ๋œ ๋ชจ๋“  GPU๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” SnuRHAC์ด Unified Memory๋ฅผ ํด๋Ÿฌ์Šคํ„ฐ ํ™˜๊ฒฝ์—์„œ ๋™์ž‘ํ•˜๋„๋ก ํ™•์žฅํ•˜๊ณ , ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ž๋™์œผ๋กœ GPU๊ฐ„์— ์ „์†กํ•˜๊ณ  ๊ด€๋ฆฌํ•ด์ฃผ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ, UM์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ํ”„๋ฆฌํŽ˜์นญ ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. 
์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด SnuRHAC์ด ์‰ฝ๊ฒŒ GPU ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์œ„ํ•œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋„์™€์ค„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํŠน์„ฑ์— ๋”ฐ๋ผ ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋‚ผ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค.1 Introduction 1 2 Related Work 7 3 CUDA Unified Memory 12 4 Framework for Maximizing the Performance of Traditional CUDA Program 17 4.1 Overall Structure of HUM 17 4.2 Overlapping H2Dmemcpy and Computation 19 4.3 Data Consistency and Correctness 23 4.4 HUM Driver 25 4.5 HUM H2Dmemcpy Mechanism 26 4.6 Parallelizing Memory Copy Commands 29 4.7 Scheduling Memory Copy Commands 31 5 Framework for Running Large-scale DNNs on a Single GPU 33 5.1 Structure of DeepUM 33 5.1.1 DeepUM Runtime 34 5.1.2 DeepUM Driver 35 5.2 Correlation Prefetching for GPU Pages 36 5.2.1 Pair-based Correlation Prefetching 37 5.2.2 Correlation Prefetching in DeepUM 38 5.3 Optimizations for GPU Page Fault Handling 42 5.3.1 Page Pre-eviction 42 5.3.2 Invalidating UM Blocks of Inactive PyTorch Blocks 43 6 Framework for Virtualizing a Single Device Image for a GPU Cluster 45 6.1 Overall Structure of SnuRHAC 45 6.2 Workload Distribution 48 6.3 Cluster Unified Memory 50 6.4 Additional Optimizations 57 6.5 Prefetching 58 6.5.1 Static Prefetching 58 6.5.2 Dynamic Prefetching 61 7 Evaluation 62 7.1 Framework for Maximizing the Performance of Traditional CUDA Program 62 7.1.1 Methodology 63 7.1.2 Results 64 7.2 Framework for Running Large-scale DNNs on a Single GPU 70 7.2.1 Methodology 70 7.2.2 Comparison with Naive UM and IBM LMS 72 7.2.3 Parameters of the UM Block Correlation Table 78 7.2.4 Comparison with TensorFlow-based Approaches 79 7.3 Framework for Virtualizing Single Device Image for a GPU Cluster 81 7.3.1 Methodology 81 7.3.2 Results 84 8 Discussions and Future Work 91 9 Conclusion 93 ์ดˆ๋ก 111๋ฐ•

    Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

    Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops. Comment: 11 pages, 7 figures, to appear in the Proceedings of Supercomputing 2010 (submitted April 12, 2010).
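
    A hedged sketch of the communication-computation overlap pattern described above, not QUDA source: the interior of the local sub-lattice is updated on one CUDA stream while the halo is exchanged with neighboring ranks through non-blocking MPI, and the boundary update runs once the halo has arrived. Kernel bodies, buffer sizes, and neighbor ranks are placeholders.

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void update_interior(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1.0f;            // stand-in for the interior stencil
    }

    __global__ void update_boundary(float *v, const float *halo, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += halo[i];         // stand-in for the boundary stencil
    }

    // One time step on one rank: interior work overlaps the halo exchange.
    void step(float *d_field, float *d_halo_send, float *d_halo_recv,
              float *h_send, float *h_recv, int n, int halo_n,
              int left, int right, cudaStream_t s_comp, cudaStream_t s_comm) {
        // 1. Start the interior update; it does not need neighbor data.
        update_interior<<<(n + 255) / 256, 256, 0, s_comp>>>(d_field, n);

        // 2. Meanwhile, stage the boundary and exchange it with the neighbors.
        cudaMemcpyAsync(h_send, d_halo_send, halo_n * sizeof(float),
                        cudaMemcpyDeviceToHost, s_comm);
        cudaStreamSynchronize(s_comm);

        MPI_Request reqs[2];
        MPI_Irecv(h_recv, halo_n, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(h_send, halo_n, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        cudaMemcpyAsync(d_halo_recv, h_recv, halo_n * sizeof(float),
                        cudaMemcpyHostToDevice, s_comm);
        cudaStreamSynchronize(s_comm);

        // 3. Boundary update; the interior kernel has been overlapping with
        //    the communication above.
        update_boundary<<<(halo_n + 255) / 256, 256, 0, s_comp>>>(d_field, d_halo_recv, halo_n);
        cudaStreamSynchronize(s_comp);
    }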

    GPUs as Storage System Accelerators

    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, like any order-of-magnitude drop in the cost per unit of performance for a class of system components, creates the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content-addressable storage system that facilitates online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications. Comment: IEEE Transactions on Parallel and Distributed Systems, 201
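
    A rough illustration of the hashing-based similarity detection described above, not the paper's implementation: each GPU thread hashes one fixed-size chunk, and the host compares per-chunk hashes across two versions of a file. FNV-1a stands in for the cryptographic hashes a real content-addressable store would use, and the chunk layout is an assumption.

    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    __global__ void hash_chunks(const uint8_t *data, size_t chunk, size_t nchunks,
                                uint64_t *out) {
        size_t c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= nchunks) return;
        uint64_t h = 14695981039346656037ULL;         // FNV-1a offset basis
        for (size_t i = 0; i < chunk; ++i) {
            h ^= data[c * chunk + i];
            h *= 1099511628211ULL;                     // FNV-1a prime
        }
        out[c] = h;
    }

    int main() {
        const size_t chunk = 4096, nchunks = 1024, bytes = chunk * nchunks;
        std::vector<uint8_t> v1(bytes, 0), v2(bytes, 0);
        v2[5 * chunk] = 1;                             // "modify" chunk 5 in version 2

        uint8_t *d_data; uint64_t *d_hash;
        cudaMalloc(&d_data, bytes);
        cudaMalloc(&d_hash, nchunks * sizeof(uint64_t));

        std::vector<uint64_t> h1(nchunks), h2(nchunks);
        for (int ver = 0; ver < 2; ++ver) {
            cudaMemcpy(d_data, ver ? v2.data() : v1.data(), bytes, cudaMemcpyHostToDevice);
            hash_chunks<<<(nchunks + 255) / 256, 256>>>(d_data, chunk, nchunks, d_hash);
            cudaMemcpy((ver ? h2 : h1).data(), d_hash, nchunks * sizeof(uint64_t),
                       cudaMemcpyDeviceToHost);
        }

        size_t same = 0;
        for (size_t c = 0; c < nchunks; ++c) same += (h1[c] == h2[c]);
        printf("%zu of %zu chunks unchanged\n", same, nchunks);
        cudaFree(d_data); cudaFree(d_hash);
        return 0;
    }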

    MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms

    The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for using multi-GPU platforms. However, existing multi-GPU GNN systems optimize the computation and communication individually based on the conventional practice of scaling dense DNNs. For irregularly sparse and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule/optimize the computation and communication operations for high-performance delivery. To this end, we propose MGG, a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its novel dynamic software pipeline to facilitate fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime design with analytical modeling and optimization heuristics to dynamically improve the execution performance. Extensive evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems across various settings: on average 4.41X, 4.81X, and 10.83X faster than DGL, MGG-UVM, and ROC, respectively.
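
    A deliberately simplified, host-side sketch of the overlap idea only: remote features are fetched from a peer GPU while local neighbor features are aggregated, using two streams and a peer-to-peer copy. MGG itself pipelines at a much finer grain, inside a single kernel, so this is not its design; it assumes two peer-capable GPUs, and all buffers and sizes are illustrative (and left uninitialized).

    #include <cuda_runtime.h>

    __global__ void aggregate(float *out, const float *feat, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] += feat[i];                // stand-in for neighbor aggregation
    }

    int main() {
        const int n = 1 << 20;
        float *feat0, *out0, *remote, *feat1;

        cudaSetDevice(1);
        cudaMalloc(&feat1, n * sizeof(float));       // "remote" features live on GPU 1

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);            // error is ignored on non-P2P systems
        cudaMalloc(&feat0, n * sizeof(float));
        cudaMalloc(&out0, n * sizeof(float));
        cudaMalloc(&remote, n * sizeof(float));

        cudaStream_t s_comp, s_comm;
        cudaStreamCreate(&s_comp);
        cudaStreamCreate(&s_comm);

        // Overlap: pull remote features from GPU 1 while aggregating local ones.
        cudaMemcpyPeerAsync(remote, 0, feat1, 1, n * sizeof(float), s_comm);
        aggregate<<<(n + 255) / 256, 256, 0, s_comp>>>(out0, feat0, n);

        // Aggregate the remote part once the peer copy has completed.
        cudaStreamSynchronize(s_comm);
        aggregate<<<(n + 255) / 256, 256, 0, s_comp>>>(out0, remote, n);
        cudaDeviceSynchronize();
        return 0;
    }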

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) Application-level scalability of DNN training, 2) Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on these experiments, we present two key insights: 1) Overall, No-gRPC designs achieve better performance compared to gRPC-based approaches for most configurations, and 2) The performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster. Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review.
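
    A minimal sketch of the CUDA-Aware MPI pattern the paper builds on: with a CUDA-aware MPI build, device pointers can be passed directly to MPI_Allreduce, so gradient buffers are reduced without an explicit staging copy through host memory. The buffer size and the one-GPU-per-rank mapping are assumptions for illustration.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;                 // e.g., one flattened gradient tensor
        float *d_grad;
        cudaSetDevice(rank);                   // assumes one GPU per rank
        cudaMalloc(&d_grad, n * sizeof(float));
        cudaMemset(d_grad, 0, n * sizeof(float));

        // With CUDA-aware MPI the device buffer is a legal argument; the summed
        // gradients land back in d_grad on every rank.
        MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("allreduced %d floats across %d ranks\n", n, size);
        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }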

    HeTM: Transactional Memory for Heterogeneous Systems

    Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques and aims at hiding the inherently large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU's and the GPU's side, providing the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the application's workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a port of a popular object caching system. Comment: The current work was accepted in the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT'19).
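
    HeTM's transactional semantics do not fit in a short sketch, but the "single memory region shared among CPU and GPU" part of the abstraction can be illustrated with stock CUDA features: managed memory plus system-scope atomics let both processors update the same location through one pointer. This shows only the shared-region aspect, not HeTM's atomic transactions or speculation, and it assumes a GPU/OS combination with concurrentManagedAccess (e.g., Pascal or newer on Linux).

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void gpu_increment(int *counter, int iters) {
        for (int i = 0; i < iters; ++i)
            atomicAdd_system(counter, 1);           // system scope: visible to the CPU too
    }

    int main() {
        int *counter;
        cudaMallocManaged(&counter, sizeof(int));   // one region, one address space
        *counter = 0;

        gpu_increment<<<1, 32>>>(counter, 100);     // GPU updates the shared counter...
        for (int i = 0; i < 1000; ++i)              // ...while the CPU updates it too
            __sync_fetch_and_add(counter, 1);

        cudaDeviceSynchronize();
        printf("counter = %d (expected %d)\n", *counter, 32 * 100 + 1000);
        cudaFree(counter);
        return 0;
    }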
    • โ€ฆ