760 research outputs found

    Data Management and Prefetching Techniques for CUDA Unified Memory

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2022. Advisor: Jaejin Lee.
    Unified Memory (UM) is a component of the CUDA programming model that provides a memory pool with a single address space which can be accessed by both the host and the GPU. When UM is used, a CUDA program does not need to explicitly move data between the host and the device. It also allows GPU memory oversubscription by using CPU memory as a backing store. UM significantly lessens the programmer's burden and provides great programmability. However, using UM alone does not guarantee good performance, because UM operates through a page fault mechanism and page fault handling incurs substantial overhead. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that will be accessed during CUDA kernel execution. In this thesis, we propose three frameworks that exploit UM to improve ease of programming while maximizing application performance. The first framework is HUM, which hides the host-to-device memory copy time of a traditional CUDA program without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and its fault mechanism. The evaluation shows that executing applications under HUM is, on average, 1.21 times faster than executing them under the original CUDA, which is comparable to the average speedup of 1.22 achieved by hand-optimized Unified Memory implementations. The second framework is DeepUM, which exploits UM to allow GPU memory oversubscription for deep neural networks. While UM allows memory oversubscription using the page fault mechanism, page fault handling introduces enormous overhead; we use a correlation prefetching technique to hide this overhead. The evaluation shows that DeepUM achieves performance comparable to other state-of-the-art approaches while running larger batch sizes (or models with larger hyperparameters) that the other methods fail to run. The last framework is SnuRHAC, which provides the illusion of a single GPU for multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed for a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes the workload across the GPUs in the cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends Unified Memory to the cluster environment and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and maximize performance. The evaluation shows that SnuRHAC significantly improves ease of programming and delivers scalable performance in the cluster environment, depending on the application characteristics.
    Table of contents:
    1 Introduction
    2 Related Work
    3 CUDA Unified Memory
    4 Framework for Maximizing the Performance of Traditional CUDA Program
      4.1 Overall Structure of HUM
      4.2 Overlapping H2Dmemcpy and Computation
      4.3 Data Consistency and Correctness
      4.4 HUM Driver
      4.5 HUM H2Dmemcpy Mechanism
      4.6 Parallelizing Memory Copy Commands
      4.7 Scheduling Memory Copy Commands
    5 Framework for Running Large-scale DNNs on a Single GPU
      5.1 Structure of DeepUM
        5.1.1 DeepUM Runtime
        5.1.2 DeepUM Driver
      5.2 Correlation Prefetching for GPU Pages
        5.2.1 Pair-based Correlation Prefetching
        5.2.2 Correlation Prefetching in DeepUM
      5.3 Optimizations for GPU Page Fault Handling
        5.3.1 Page Pre-eviction
        5.3.2 Invalidating UM Blocks of Inactive PyTorch Blocks
    6 Framework for Virtualizing a Single Device Image for a GPU Cluster
      6.1 Overall Structure of SnuRHAC
      6.2 Workload Distribution
      6.3 Cluster Unified Memory
      6.4 Additional Optimizations
      6.5 Prefetching
        6.5.1 Static Prefetching
        6.5.2 Dynamic Prefetching
    7 Evaluation
      7.1 Framework for Maximizing the Performance of Traditional CUDA Program
        7.1.1 Methodology
        7.1.2 Results
      7.2 Framework for Running Large-scale DNNs on a Single GPU
        7.2.1 Methodology
        7.2.2 Comparison with Naive UM and IBM LMS
        7.2.3 Parameters of the UM Block Correlation Table
        7.2.4 Comparison with TensorFlow-based Approaches
      7.3 Framework for Virtualizing Single Device Image for a GPU Cluster
        7.3.1 Methodology
        7.3.2 Results
    8 Discussions and Future Work
    9 Conclusion
    Abstract (in Korean)
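    To make the hint-based baseline in the abstract concrete, here is a minimal sketch (not taken from the thesis) of a plain CUDA program that uses Unified Memory together with the kind of user hints the author refers to: cudaMallocManaged for the single-address-space allocation, and cudaMemAdvise / cudaMemPrefetchAsync to migrate pages before the kernel touches them. The kernel name scale, the array size, and the launch configuration are illustrative assumptions.

```cuda
// Minimal sketch of Unified Memory plus manual prefetch hints (illustrative,
// not code from the thesis). Without the hint calls, the kernel's first
// accesses fault pages over one by one, which is the overhead the thesis
// frameworks aim to hide automatically.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, size_t n, float a) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;                    // 16M floats (64 MiB), assumption
    float *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));    // single address space: CPU and GPU
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;  // first touch on the host

    int dev = 0;
    cudaGetDevice(&dev);
    // User hints: tell the driver where the data should live and move it
    // to the GPU before the kernel runs, avoiding on-demand page faults.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);

    // Prefetch back to the host before the CPU reads the result.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```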

    Comprehensive characterization of an open source document search engine

    This work performs a thorough characterization and analysis of the open source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent online document search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low power servers for document search, and examine the sources of performance degradation and the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies, but with diminishing benefits as incoming query traffic increases; (b) given enough partitioning, low power servers can provide the same average and tail response times as conventional high performance servers; (c) index search is a CPU-intensive, cache-friendly application; and (d) C-states are the main culprits for performance degradation in document search.

    Holistic Performance Analysis and Optimization of Unified Virtual Memory

    The programming difficulty of creating GPU-accelerated high performance computing (HPC) codes has been greatly reduced by the advent of Unified Memory technologies that abstract the management of physical memory away from the developer. However, these systems incur substantial overhead that paradoxically grows for the codes where these technologies are most useful. While these technologies are increasingly adopted in modern HPC frameworks and applications, the performance cost reduces the efficiency of these systems and turns some developers away from adoption entirely. These systems are naturally difficult to optimize due to the large number of interconnected hardware and software components that must be untangled to perform thorough analysis. In this thesis, we take the first deep dive into a functional implementation of a Unified Memory system, NVIDIA UVM, to evaluate the performance and characteristics of these systems. We show specific hardware and software interactions that cause serialization between the host and devices. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. Through lower-level analysis, we find that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. These findings indicate that the cost of host OS components is significant and present across UM implementations. We also provide a proof-of-concept asynchronous approach to memory management in UVM that reduces system overhead and improves application performance. This study provides constructive insight into future implementations and systems, such as Heterogeneous Memory Management.
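    As a concrete illustration of the oversubscription scenario this analysis studies, the following hedged sketch (not code from the thesis) allocates a managed buffer larger than the GPU's total memory and streams a kernel over it in chunks; a command-line flag toggles explicit prefetching so the fault-driven and hint-driven migration paths can be contrasted. The inc kernel, the chunk size, and the 1.5x oversubscription factor are assumptions for illustration.

```cuda
// Oversubscription sketch: a managed allocation ~1.5x larger than device
// memory, processed in chunks. With plain UVM, pages fault in on demand and
// get evicted back to host memory; passing any argument enables bulk
// prefetching instead of per-page fault handling.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void inc(char *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main(int argc, char **argv) {
    int dev = 0;
    cudaGetDevice(&dev);
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);

    size_t n = total_b + total_b / 2;            // oversubscribe the GPU
    char *buf = nullptr;
    if (cudaMallocManaged(&buf, n) != cudaSuccess) return 1;

    bool use_prefetch = (argc > 1);              // toggle hint-driven migration
    const size_t chunk = 256ull << 20;           // 256 MiB per launch, assumption
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = (off + chunk < n) ? chunk : n - off;
        if (use_prefetch)
            cudaMemPrefetchAsync(buf + off, len, dev);  // bulk migrate this chunk
        inc<<<(unsigned)((len + 255) / 256), 256>>>(buf + off, len);
    }
    cudaDeviceSynchronize();
    printf("processed %zu bytes with%s prefetch\n", n, use_prefetch ? "" : "out");
    cudaFree(buf);
    return 0;
}
```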

    IMP: Indirect Memory Prefetcher

    Machine learning, graph analytics, and sparse linear algebra-based applications are dominated by irregular memory accesses that result from following edges in a graph or non-zero elements in a sparse matrix. These accesses have little temporal or spatial locality and thus incur long memory stalls and large bandwidth requirements. A traditional streaming or striding prefetcher cannot capture these irregular access patterns. A majority of these irregular accesses come from indirect patterns of the form A[B[i]]. We propose an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency. We also propose a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure caused by the lack of spatial locality. Evaluated on 7 applications, IMP shows a 56% speedup on average (up to 2.3x) compared to a baseline 64-core system with streaming prefetchers. This is within 23% of an idealized system. With partial cacheline accessing, we see another 9.4% speedup on average (up to 46.6%).
    Intel Science and Technology Center for Big Data
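    The indirect pattern A[B[i]] described above is easiest to see in CSR sparse matrix-vector multiply, sketched below as host code. IMP itself is a hardware prefetcher; the __builtin_prefetch calls here are only a software analogue of its idea, namely look ahead in the sequential index stream col[] and fetch the irregularly indexed x[col[j + D]] early. The array names and the lookahead distance D are illustrative assumptions.

```cuda
// CSR SpMV with an explicit software look-ahead on the indirect accesses.
// x[col[j]] is the A[B[i]] pattern: col[] streams sequentially, while the
// loads from x[] jump around with little spatial or temporal locality.
#include <cstddef>

void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y) {
    const int D = 16;  // prefetch lookahead distance (assumption)
    for (int r = 0; r < nrows; ++r) {
        double sum = 0.0;
        int end = rowptr[r + 1];
        for (int j = rowptr[r]; j < end; ++j) {
            if (j + D < end)
                __builtin_prefetch(&x[col[j + D]], /*rw=*/0, /*locality=*/1);
            sum += val[j] * x[col[j]];   // indirect access through col[]
        }
        y[r] = sum;
    }
}
```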

    TLB pre-loading for Java applications

    The increasing memory requirements of today's applications are putting more stress on the memory system. This puts pressure on the available caches, and specifically on the TLB. TLB misses are responsible for a considerable share of total memory latency: on average, 10% of execution time is wasted on miss penalties. Java applications are in no better position; their attractive features increase the memory footprint, and their TLB miss rate tends to be a multiple of that of non-Java applications. The high miss rate causes applications to lose valuable execution time; our experiments show that, on average, the miss penalty can constitute about 24% of execution time. Several hardware modifications have been suggested to reduce TLB misses for general applications, but to the best of our knowledge there have been no similar efforts for Java applications. Here we propose a software-based prediction model that relies on information available to the virtual machine. The model uses the write barrier operation to predict TLB misses with an average accuracy of 41%.
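    The sketch below illustrates only the general idea behind TLB pre-loading: touch one byte on each page a predictor expects to be accessed soon, so the page walk happens off the critical path and the TLB entry is already warm when the real access arrives. The thesis's actual model lives inside the Java virtual machine and drives its predictions from the write barrier; the address-list predictor here is a hypothetical stand-in, written as host code.

```cuda
// Generic TLB pre-loading sketch (not the thesis's JVM mechanism): given a
// list of addresses a predictor expects to be accessed soon, touch one byte
// per covered page so each translation is installed in the TLB ahead of use.
#include <cstdint>
#include <vector>
#include <unistd.h>   // sysconf

void preload_tlb(const std::vector<void *> &predicted) {
    const uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    volatile uint8_t sink = 0;
    for (void *p : predicted) {
        uintptr_t base = (uintptr_t)p & ~(page - 1);   // page-align the address
        sink += *(volatile const uint8_t *)base;       // load forces a TLB fill
    }
    (void)sink;   // keep the loads from being optimized away
}
```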
    • โ€ฆ
    corecore