721 research outputs found

    Data Management and Prefetching Techniques for CUDA Unified Memory

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2022. Advisor: Jaejin Lee.
    Unified Memory (UM) is a component of the CUDA programming model that provides a memory pool with a single address space, accessible by both the host and the GPU. When UM is used, a CUDA program does not need to move data explicitly between the host and the device. UM also allows GPU memory oversubscription by using CPU memory as a backing store. UM significantly lessens the burden on the programmer and provides great programmability. However, using UM alone does not guarantee good performance: UM operates through a page fault mechanism, and handling those faults introduces substantial overhead. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that will be accessed during CUDA kernel execution. In this thesis, we propose three frameworks that exploit UM to improve ease of programming while maximizing application performance. The first framework is HUM, which hides the host-to-device memory copy time of a traditional CUDA program without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and its fault mechanism. The evaluation shows that executing applications under HUM is, on average, 1.21 times faster than executing them under original CUDA; the speedup is comparable to the average speedup of 1.22 achieved by hand-optimized Unified Memory implementations. The second framework is DeepUM, which exploits UM to allow GPU memory oversubscription for deep neural networks. While UM allows memory oversubscription through its page fault mechanism, page fault handling introduces enormous overhead; we use a correlation prefetching technique to hide it. The evaluation shows that DeepUM achieves performance comparable to other state-of-the-art approaches while running larger batch sizes that the other methods fail to run. The last framework is SnuRHAC, which provides the illusion of a single GPU for multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed to use a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes the workload to the GPUs in the cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends Unified Memory and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and maximize performance. The evaluation shows that while SnuRHAC significantly improves ease of programming, it also shows scalable performance for the cluster environment depending on the application characteristics.
UM์€ page fault mechanism์„ ํ†ตํ•ด ๋™์ž‘ํ•˜๋Š”๋ฐ page fault๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋งŽ์€ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. UM์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ ์ตœ๋Œ€์˜ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ”„๋กœ๊ทธ๋ž˜๋จธ๊ฐ€ ์†Œ์Šค ์ฝ”๋“œ์— ์—ฌ๋Ÿฌ ํžŒํŠธ๋‚˜ ์•ž์œผ๋กœ CUDA ์ปค๋„์—์„œ ์‚ฌ์šฉ๋  ๋ฉ”๋ชจ๋ฆฌ ์˜์—ญ์— ๋Œ€ํ•œ ํ”„๋ฆฌํŽ˜์น˜ ๋ช…๋ น์„ ์‚ฝ์ž…ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ UM์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ๋„ ์‰ฌ์šด ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ์ตœ๋Œ€์˜ ์„ฑ๋Šฅ์ด๋ผ๋Š” ๋‘๋งˆ๋ฆฌ ํ† ๋ผ๋ฅผ ๋™์‹œ์— ์žก๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•๋“ค์„ ์†Œ๊ฐœํ•œ๋‹ค. ์ฒซ์งธ๋กœ, HUM์€ ๊ธฐ์กด CUDA ํ”„๋กœ๊ทธ๋žจ์˜ ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ์ˆ˜์ •ํ•˜์ง€ ์•Š๊ณ  ํ˜ธ์ŠคํŠธ์™€ ๋””๋ฐ”์ด์Šค ๊ฐ„์— ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ์‹œ๊ฐ„์„ ์ตœ์†Œํ™”ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด, UM๊ณผ fault mechanism์„ ์‚ฌ์šฉํ•˜์—ฌ ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๊ฐ„ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ํ˜ธ์ŠคํŠธ ๊ณ„์‚ฐ ํ˜น์€ CUDA ์ปค๋„ ์‹คํ–‰๊ณผ ์ค‘์ฒฉ์‹œํ‚จ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด HUM์„ ํ†ตํ•ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์‹คํ–‰ํ•˜๋Š” ๊ฒƒ์ด ๊ทธ๋ ‡์ง€ ์•Š๊ณ  CUDA๋งŒ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์— ๋น„ํ•ด ํ‰๊ท  1.21๋ฐฐ ๋น ๋ฅธ ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋˜ํ•œ, Unified Memory๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ”„๋กœ๊ทธ๋ž˜๋จธ๊ฐ€ ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ์ตœ์ ํ™”ํ•œ ๊ฒƒ๊ณผ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋‘๋ฒˆ์งธ๋กœ, DeepUM์€ UM์„ ํ™œ์šฉํ•˜์—ฌ GPU์˜ ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ ๋ณด๋‹ค ๋” ๋งŽ์€ ์–‘์˜ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ•„์š”๋กœ ํ•˜๋Š” ๋”ฅ ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค. UM์„ ํ†ตํ•ด GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ดˆ๊ณผํ•ด์„œ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ CPU์™€ GPU๊ฐ„์— ํŽ˜์ด์ง€๊ฐ€ ๋งค์šฐ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ด๋™ํ•˜๋Š”๋ฐ, ์ด๋•Œ ๋งŽ์€ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ๋‘๋ฒˆ์งธ ๋ฐฉ๋ฒ•์—์„œ๋Š” correlation ํ”„๋ฆฌํŽ˜์นญ ๊ธฐ๋ฒ•์„ ํ†ตํ•ด ์ด ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด DeepUM์€ ๊ธฐ์กด์— ์—ฐ๊ตฌ๋œ ๊ฒฐ๊ณผ๋“ค๊ณผ ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉด์„œ ๋” ํฐ ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ ํ˜น์€ ๋” ํฐ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, SnuRHAC์€ ํด๋Ÿฌ์Šคํ„ฐ์— ์žฅ์ฐฉ๋œ ์—ฌ๋Ÿฌ GPU๋ฅผ ๋งˆ์น˜ ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ GPU์ฒ˜๋Ÿผ ๋ณด์—ฌ์ค€๋‹ค. ๋”ฐ๋ผ์„œ, ํ”„๋กœ๊ทธ๋ž˜๋จธ๋Š” ์—ฌ๋Ÿฌ GPU๋ฅผ ๋Œ€์ƒ์œผ๋กœ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํ•˜์ง€ ์•Š๊ณ  ํ•˜๋‚˜์˜ ๊ฐ€์ƒ GPU๋ฅผ ๋Œ€์ƒ์œผ๋กœ ํ”„๋กœ๊ทธ๋ž˜๋ฐํ•˜๋ฉด ํด๋Ÿฌ์Šคํ„ฐ์— ์žฅ์ฐฉ๋œ ๋ชจ๋“  GPU๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” SnuRHAC์ด Unified Memory๋ฅผ ํด๋Ÿฌ์Šคํ„ฐ ํ™˜๊ฒฝ์—์„œ ๋™์ž‘ํ•˜๋„๋ก ํ™•์žฅํ•˜๊ณ , ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ž๋™์œผ๋กœ GPU๊ฐ„์— ์ „์†กํ•˜๊ณ  ๊ด€๋ฆฌํ•ด์ฃผ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ, UM์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ํ”„๋ฆฌํŽ˜์นญ ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. 
    Contents:
    1 Introduction
    2 Related Work
    3 CUDA Unified Memory
    4 Framework for Maximizing the Performance of Traditional CUDA Program
      4.1 Overall Structure of HUM
      4.2 Overlapping H2Dmemcpy and Computation
      4.3 Data Consistency and Correctness
      4.4 HUM Driver
      4.5 HUM H2Dmemcpy Mechanism
      4.6 Parallelizing Memory Copy Commands
      4.7 Scheduling Memory Copy Commands
    5 Framework for Running Large-scale DNNs on a Single GPU
      5.1 Structure of DeepUM
        5.1.1 DeepUM Runtime
        5.1.2 DeepUM Driver
      5.2 Correlation Prefetching for GPU Pages
        5.2.1 Pair-based Correlation Prefetching
        5.2.2 Correlation Prefetching in DeepUM
      5.3 Optimizations for GPU Page Fault Handling
        5.3.1 Page Pre-eviction
        5.3.2 Invalidating UM Blocks of Inactive PyTorch Blocks
    6 Framework for Virtualizing a Single Device Image for a GPU Cluster
      6.1 Overall Structure of SnuRHAC
      6.2 Workload Distribution
      6.3 Cluster Unified Memory
      6.4 Additional Optimizations
      6.5 Prefetching
        6.5.1 Static Prefetching
        6.5.2 Dynamic Prefetching
    7 Evaluation
      7.1 Framework for Maximizing the Performance of Traditional CUDA Program
        7.1.1 Methodology
        7.1.2 Results
      7.2 Framework for Running Large-scale DNNs on a Single GPU
        7.2.1 Methodology
        7.2.2 Comparison with Naive UM and IBM LMS
        7.2.3 Parameters of the UM Block Correlation Table
        7.2.4 Comparison with TensorFlow-based Approaches
      7.3 Framework for Virtualizing Single Device Image for a GPU Cluster
        7.3.1 Methodology
        7.3.2 Results
    8 Discussions and Future Work
    9 Conclusion
    Abstract (in Korean)
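    For readers who want to see what the "user hints" mentioned in the abstract look like in practice, the following is a minimal sketch using the standard CUDA runtime API. The kernel, array sizes, and advice choices are illustrative assumptions rather than code from the thesis; frameworks such as HUM and DeepUM exist precisely to make these manual calls unnecessary.

```cuda
// Minimal sketch: manually hinted Unified Memory, assuming a toy SAXPY
// kernel. Only cudaMallocManaged, cudaMemAdvise, and cudaMemPrefetchAsync
// are real API; everything else here is invented for illustration.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(float a, const float *x, float *y, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const size_t n = 1 << 26;   // 64M floats (~256 MB per array)
    const int dev = 0;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // single address space,
    cudaMallocManaged(&y, n * sizeof(float));  // visible to host and GPU

    for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // pages now on host

    // The user hints: without them, the kernel demand-faults page by page;
    // with them, the driver migrates the pages before the kernel needs them.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetReadMostly, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0);
    cudaMemPrefetchAsync(y, n * sizeof(float), dev, 0);

    saxpy<<<(int)((n + 255) / 256), 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);   // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```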

    Lightweight process migration and memory prefetching in openMosix

    We propose a lightweight process migration mechanism and an adaptive memory prefetching scheme called AMPoM (Adaptive Memory Prefetching in openMosix), whose goal is to reduce the migration freeze time in openMosix while ensuring the execution efficiency of migrants. To minimize the freeze time, our system transfers only a few pages to the destination node during process migration. After the migration, AMPoM analyzes the spatial locality of memory access and iteratively prefetches memory pages from the remote node to hide the latency of inter-node page faults. AMPoM adopts a unique algorithm to decide which and how many pages to prefetch. It tends to prefetch more aggressively when a sequential access pattern has developed, when the paging rate of the process is high, or when the network is busy. This strategy makes AMPoM highly adaptive to different application behaviors and system dynamics. The HPC Challenge benchmark results show that AMPoM can avoid 98% of migration freeze time while preventing 85-99% of page fault requests after the migration. Compared to openMosix, which has no remote page faults, AMPoM incurs a modest overhead of 0-5% additional runtime. When the working set of a migrant is small, AMPoM outperforms openMosix considerably due to the reduced amount of data transfer. These results indicate that by exploiting memory access locality and prefetching, process migration can be a lightweight operation with little software overhead in remote paging. © 2008 IEEE. The 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), Miami, FL, 14-18 April 2008. In Proceedings of the 22nd IPDPS, 2008, p. 1-1
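    As a purely illustrative sketch of how such an adaptive policy might combine these signals, the fragment below chooses a prefetch degree from sequentiality, paging rate, and network utilization. The structure, thresholds, and weights are invented here and are not AMPoM's actual algorithm.

```cuda
// Toy prefetch-degree heuristic in the spirit of AMPoM (not its actual
// algorithm). All field names and thresholds are assumptions.
#include <cstddef>

struct MigrantState {
    size_t last_page = 0;          // last page requested from the home node
    size_t run_length = 0;         // consecutive sequential requests seen
    double fault_rate = 0.0;       // recent remote page faults per ms
    double net_utilization = 0.0;  // 0.0 (idle) .. 1.0 (saturated)
};

// Decide how many pages to pull from the home node along with page `p`.
size_t prefetch_degree(MigrantState &s, size_t p) {
    s.run_length = (p == s.last_page + 1) ? s.run_length + 1 : 0;
    s.last_page = p;

    size_t degree = 1;                         // always fetch the faulting page
    if (s.run_length >= 4) degree += 8;        // sequential pattern developed
    if (s.fault_rate > 1.0) degree += 4;       // paging heavily: batch more
    if (s.net_utilization > 0.8) degree += 4;  // busy network: amortize messages
    return degree;
}
```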

    3PO: Programmed Far-Memory Prefetching for Oblivious Applications

    Using memory located on remote machines, or far memory, as a swap space is a promising approach to meeting the increasing memory demands of modern datacenter applications. Operating systems have long relied on prefetchers to mask the increased latency of fetching pages from swap space into main memory. Unfortunately, with traditional prefetching heuristics, performance still degrades when applications use far memory. In this paper we propose a new prefetching technique for far-memory applications. We focus our efforts on memory-intensive, oblivious applications whose memory access patterns are independent of their inputs, such as matrix multiplication. For this class of applications we observe that we can prefetch pages perfectly, without relying on heuristics. However, prefetching perfectly without requiring significant application modifications is challenging. In this paper we describe the design and implementation of 3PO, a system that provides pre-planned prefetching for general oblivious applications. We demonstrate that 3PO can accelerate applications, e.g., running them 30-150% faster than with Linux's prefetcher with 20% local memory. We also use 3PO to understand the fundamental software overheads of prefetching in a paging-based system, and the minimum performance penalty they impose when we run applications under constrained local memory. Comment: 14 pages
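    To make "pre-planned prefetching" concrete, here is a minimal sketch under assumed parameters (block size, lookahead distance, and a precomputed access plan). It is not 3PO's implementation; it only illustrates the idea that an oblivious application's block order can be computed once and replayed, with madvise(MADV_WILLNEED) asking the kernel to page each block in ahead of use.

```cuda
// Sketch of pre-planned prefetching for an oblivious computation.
// kBlock, kLookahead, and the plan are assumptions for illustration.
#include <sys/mman.h>
#include <cstddef>
#include <vector>

constexpr size_t kBlock = 2u << 20;  // 2 MiB prefetch granularity (assumed)
constexpr size_t kLookahead = 4;     // stay 4 blocks ahead of the computation

// `plan` holds the input-independent order in which the computation will
// touch blocks of `base` -- computable offline because the app is oblivious.
void run_with_planned_prefetch(char *base, const std::vector<size_t> &plan,
                               void (*compute_block)(char *)) {
    for (size_t i = 0; i < plan.size(); ++i) {
        if (i + kLookahead < plan.size())  // issue the fetch early...
            madvise(base + plan[i + kLookahead] * kBlock, kBlock, MADV_WILLNEED);
        compute_block(base + plan[i] * kBlock);  // ...so this access does not
    }                                            // block on far memory
}
```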

    An accurate prefetching policy for object oriented systems

    PhD thesis. In the latest high-performance computers, there is a growing requirement for accurate prefetching (AP) methodologies for advanced object management schemes in virtual memory and migration systems. The major issue in achieving this goal is finding a simple way of accurately predicting the objects that will be referenced in the near future and grouping them so that they can be fetched at the same time. The basic notion of AP is to build relationships that logically group related objects and to prefetch them, rather than relying on their physical grouping and on demand fetching as existing restructuring or grouping schemes do. In this way, AP tries to overcome some of the shortcomings of physical grouping methods. Prefetching also makes use of the properties of object-oriented languages to build inter- and intra-object relationships as a means of logical grouping. This thesis describes how these relationships can be established at compile time and how they can be used for accurate object prefetching in virtual memory systems. In addition, AP performs control-flow and data-dependency analysis to reinforce the relationships and to find the dependencies of a program. The user program is decomposed into prefetching blocks which contain all the information needed for block prefetching, such as long branches and function calls at major branch points. The proposed prefetching scheme is implemented by extending a C++ compiler and evaluated on a virtual memory simulator. The results show a significant reduction both in the number of page faults and in memory pollution. In particular, AP can suppress many page faults that occur during transition phases, which are unmanageable by other fetching policies. AP can be applied to local and distributed virtual memory systems to reduce the fault rate by fetching groups of objects at the same time, consequently lessening operating system overhead. British Council
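    A hypothetical sketch of the fault-time behavior this design implies: each object carries a pointer to its compile-time prefetch block, and a fault on one member brings in the whole logical group. All types and names below are invented for illustration.

```cuda
// Invented sketch of logical-group prefetching: fetch an object's whole
// compile-time group on a fault, instead of the single faulting object.
#include <vector>

struct Object;

struct PrefetchBlock {
    std::vector<Object *> members;  // logically related objects, grouped by
};                                  // the compiler, not by physical layout

struct Object {
    PrefetchBlock *block;  // the group this object belongs to
    bool resident;         // already in local memory?
};

// Stub standing in for a disk or remote fetch of one object.
void fetch_from_backing_store(Object *o) { o->resident = true; }

// On a fault for `o`, prefetch every non-resident member of its group, so
// that near-future references to related objects hit local memory.
void fault_handler(Object *o) {
    for (Object *m : o->block->members)
        if (!m->resident)
            fetch_from_backing_store(m);
}
```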

    Traffic Characteristics of a Distributed Memory System

    We believe that many distributed computing systems of the future will use distributed shared memory as a technique for interprocess communication. Thus, traffic generated by memory requests will be a major component of the traffic on any network which connects nodes in such a system. In this paper, we study memory reference strings gathered with a tracing program we devised. We study several models. First, we look at raw reference data, as would be seen if the network were a backplane. Second, we examine references in units of blocks, first using a one-block cache model and then an infinite cache. Finally, we study the effect of predictive prepaging of these blocks on the traffic. We provide a novel representation of memory reference data which can be used to calculate interarrival distributions directly. Integrating communication with computation can be used to control both traffic and performance
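    The one-block cache model is simple enough to state directly in code. The sketch below (block size assumed) filters a raw address trace down to the block transfers such a cache would place on the network; timestamps attached to the surviving references would then give the interarrival distribution.

```cuda
// Sketch of the one-block cache model: a block reference generates network
// traffic only when it differs from the currently cached block.
// The 4 KiB block size is an assumption.
#include <cstdint>
#include <vector>

constexpr uint64_t kBlockShift = 12;  // 4 KiB blocks

std::vector<uint64_t> one_block_cache_traffic(const std::vector<uint64_t> &addrs) {
    std::vector<uint64_t> traffic;
    uint64_t cached = UINT64_MAX;  // nothing cached initially
    for (uint64_t a : addrs) {
        uint64_t block = a >> kBlockShift;
        if (block != cached) {     // miss: the block crosses the network
            traffic.push_back(block);
            cached = block;
        }
    }
    return traffic;
}
```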

    On the classification and evaluation of prefetching schemes

    Abstract available: p. [2]

    Prefetching techniques for client server object-oriented database systems

    The performance of many object-oriented database applications suffers from page fetch latency, which is determined by the expense of disk access. In this work we suggest several prefetching techniques to avoid, or at least to reduce, page fetch latency. In practice no prediction technique is perfect, and no prefetching technique can entirely eliminate the delay due to page fetch latency. We are therefore interested in the trade-off between the level of accuracy required to obtain good results in terms of elapsed-time reduction and the processing overhead needed to achieve that level of accuracy. If prefetching accuracy is high, the total elapsed time of an application can be reduced significantly; if it is low, many incorrect pages are prefetched and the extra load on the client, network, server, and disks degrades whole-system performance. Access patterns of object-oriented databases are often complex and usually hard to predict accurately. The ..
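    The trade-off described here can be summarized with a back-of-the-envelope model; the linear form and the names below are our assumption, not the paper's analysis.

```cuda
// Rough cost model for the accuracy/overhead trade-off: a prefetch pays off
// only if the latency hidden by correct predictions exceeds the cost of the
// incorrect ones. All parameters are assumptions.
double expected_saving_per_prefetch(double accuracy,       // P(page is used)
                                    double fetch_latency,  // hidden on a hit (ms)
                                    double wrong_cost) {   // wasted load per miss (ms)
    // > 0: prefetching helps; <= 0: demand fetching is no worse.
    return accuracy * fetch_latency - (1.0 - accuracy) * wrong_cost;
}
```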