    vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

    The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers researchers' flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can be utilized simultaneously for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with an 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.
    Comment: Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016.
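
    As a rough illustration of the mechanism the abstract describes, the sketch below offloads a layer's feature map to pinned host memory on a side CUDA stream after the forward pass and prefetches it back before the matching backward pass. Buffer names and sizes are made up for illustration; this is not the vDNN implementation.

        // Hypothetical sketch of vDNN-style feature-map offloading: after a layer's
        // forward pass, its output is copied to pinned host memory on a side stream
        // so the device buffer can be reused; before the matching backward pass the
        // data is prefetched back. Names (offload_stream, fmap) are illustrative.
        #include <cuda_runtime.h>

        int main() {
            const size_t bytes = 64 << 20;            // one feature map (illustrative size)
            float *d_fmap, *h_fmap;
            cudaStream_t compute_stream, offload_stream;

            cudaStreamCreate(&compute_stream);
            cudaStreamCreate(&offload_stream);
            cudaMalloc((void**)&d_fmap, bytes);
            cudaMallocHost((void**)&h_fmap, bytes);   // pinned host memory for async copies

            // ... forward pass of layer i runs on compute_stream and fills d_fmap ...

            // Offload: overlap the device-to-host copy with the next layer's compute.
            cudaMemcpyAsync(h_fmap, d_fmap, bytes, cudaMemcpyDeviceToHost, offload_stream);

            // ... later, just before layer i's backward pass ...
            // Prefetch: bring the feature map back while earlier layers still compute.
            cudaMemcpyAsync(d_fmap, h_fmap, bytes, cudaMemcpyHostToDevice, offload_stream);
            cudaStreamSynchronize(offload_stream);    // backward pass may now read d_fmap

            cudaFreeHost(h_fmap);
            cudaFree(d_fmap);
            cudaStreamDestroy(compute_stream);
            cudaStreamDestroy(offload_stream);
            return 0;
        }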

    Efficient Memory Management for GPU-based Deep Learning Systems

    GPUs (graphics processing units) are used for many data-intensive applications, and deep learning systems are among their most important consumers today. As deep learning applications adopt deeper and larger models to achieve higher accuracy, memory management becomes an important research topic for deep learning systems, given that GPU memory capacity is limited. Many approaches have been proposed for this issue, e.g., model compression and memory swapping, but they either degrade the model accuracy or require substantial manual intervention. In this paper, we propose two orthogonal approaches to reduce the memory cost from the system perspective. Our approaches are transparent to the models and thus do not affect model accuracy. They exploit the iterative nature of the deep learning training algorithm to derive the lifetime and read/write order of all variables. With the lifetime semantics, we implement a memory pool with minimal fragmentation. The underlying optimization problem is NP-complete, so we propose a heuristic algorithm that reduces memory usage by up to 13.3% compared with NVIDIA's default memory pool, at equal time complexity. With the read/write semantics, variables that are not in use can be swapped out from GPU to CPU to reduce the memory footprint. We propose multiple swapping strategies that automatically decide which variable to swap and when to swap it out (and back in), reducing the memory cost by up to 34.2% without communication overhead.
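
    The lifetime-based memory pool can be pictured with a toy planner like the one below: variables whose lifetimes (known from the fixed training schedule) never overlap may share offsets in one large allocation. This first-fit sketch is only an illustration of the idea, not the paper's heuristic, and all sizes are invented.

        // Hypothetical sketch of lifetime-based memory-pool planning: variables whose
        // lifetimes (derived from the iterative training schedule) do not overlap may
        // share the same offset in a single pool. Simple first-fit illustration only.
        #include <cstdio>
        #include <vector>

        struct Var { int alloc_step, free_step; size_t size, offset; };

        static bool lifetimesOverlap(const Var& a, const Var& b) {
            return a.alloc_step < b.free_step && b.alloc_step < a.free_step;
        }

        // Place each variable at the lowest offset that conflicts with no
        // already-placed variable whose lifetime overlaps it.
        static size_t planPool(std::vector<Var>& vars) {
            size_t pool_size = 0;
            for (size_t i = 0; i < vars.size(); ++i) {
                size_t offset = 0;
                bool moved = true;
                while (moved) {
                    moved = false;
                    for (size_t j = 0; j < i; ++j) {
                        if (lifetimesOverlap(vars[i], vars[j]) &&
                            offset < vars[j].offset + vars[j].size &&
                            vars[j].offset < offset + vars[i].size) {
                            offset = vars[j].offset + vars[j].size;  // skip past the conflict
                            moved = true;
                        }
                    }
                }
                vars[i].offset = offset;
                if (offset + vars[i].size > pool_size) pool_size = offset + vars[i].size;
            }
            return pool_size;  // one cudaMalloc of this size could back every variable
        }

        int main() {
            std::vector<Var> vars = {{0, 3, 256, 0}, {1, 2, 128, 0}, {3, 5, 512, 0}};
            printf("pool bytes needed: %zu\n", planPool(vars));
            return 0;
        }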

    TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments

    Deep neural networks (DNNs) have become core computation components within low latency Function as a Service (FaaS) prediction pipelines, including image recognition, object detection, natural language processing, speech synthesis, and personalized recommendation pipelines. Cloud computing, as the de-facto backbone of modern computing infrastructure for both enterprise and consumer applications, has to handle user-defined pipelines of diverse DNN inference workloads while maintaining isolation and latency guarantees and minimizing resource waste. The current solution for guaranteeing isolation within FaaS is suboptimal: it suffers from "cold start" latency, and a major cause of this inefficiency is the need to move large amounts of model data within and across servers. We propose TrIMS as a novel solution to these issues. Our proposed solution consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy; an efficient resource management layer that provides isolation; and a succinct set of application APIs and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework, achieving up to 24x speedup in latency for image classification models, up to 210x speedup for large models, and up to 8x improvement in system throughput.
    Comment: In Proceedings CLOUD 201
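
    One way to picture the persistent, GPU-resident model store is with CUDA IPC: a long-lived process keeps the weights on the GPU and exports a memory handle, and inference processes map that allocation instead of reloading the model. The sketch below is a hand-rolled illustration of that idea, not the TrIMS API; the file name and sizes are made up.

        // Hypothetical sketch of sharing GPU-resident model weights between processes,
        // in the spirit of a persistent model store. A "server" process keeps the
        // weights on the GPU and exports a CUDA IPC handle; an inference "client"
        // maps the same allocation instead of reloading the model.
        #include <cuda_runtime.h>
        #include <cstdio>

        int main(int argc, char** argv) {
            const size_t bytes = 100 << 20;   // illustrative model size
            if (argc > 1) {                   // "server": load weights once, publish handle
                float* d_weights;
                cudaMalloc((void**)&d_weights, bytes);
                // ... copy model weights into d_weights ...
                cudaIpcMemHandle_t handle;
                cudaIpcGetMemHandle(&handle, d_weights);
                FILE* f = fopen("model.handle", "wb");   // hand the handle to clients
                fwrite(&handle, sizeof(handle), 1, f);
                fclose(f);
                getchar();                    // keep the allocation alive while clients run
            } else {                          // "client": map the already-resident weights
                cudaIpcMemHandle_t handle;
                FILE* f = fopen("model.handle", "rb");
                fread(&handle, sizeof(handle), 1, f);
                fclose(f);
                void* d_weights;
                cudaIpcOpenMemHandle(&d_weights, handle, cudaIpcMemLazyEnablePeerAccess);
                // ... run inference kernels against d_weights without copying the model ...
                cudaIpcCloseMemHandle(d_weights);
            }
            return 0;
        }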

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities TensorFlow offers for distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = (InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
    Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review.
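
    The CUDA-aware aggregation path boils down to handing device pointers straight to MPI_Allreduce, as in the minimal sketch below. It assumes an MPI build with CUDA support (e.g. MVAPICH2-GDR or a CUDA-aware Open MPI) and illustrates only the call pattern, not the proposed Allreduce design itself.

        // Hypothetical sketch of gradient aggregation with a CUDA-aware MPI Allreduce:
        // device pointers are passed directly to MPI_Allreduce, so gradients are summed
        // across ranks without an explicit staging copy to host memory.
        #include <mpi.h>
        #include <cuda_runtime.h>
        #include <cstdio>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            printf("rank %d of %d\n", rank, size);

            const int count = 1 << 20;                 // number of gradient elements
            float* d_grad;
            cudaMalloc((void**)&d_grad, count * sizeof(float));
            // ... backward pass fills d_grad on this rank's GPU ...

            // Sum gradients across all ranks, in place, straight from GPU memory.
            MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

            // Average by 1/size before the optimizer step (a small kernel in practice).
            // ...

            cudaFree(d_grad);
            MPI_Finalize();
            return 0;
        }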

    Data Management and Prefetching Techniques for CUDA Unified Memory

    Thesis (Ph.D.), Seoul National University Graduate School, Department of Computer Science and Engineering, August 2022 (advisor: Jaejin Lee). Unified Memory (UM) is a component of the CUDA programming model that provides a memory pool with a single address space, accessible by both the host and the GPU. When UM is used, a CUDA program does not need to explicitly move data between the host and the device. UM also allows GPU memory oversubscription by using CPU memory as a backing store. UM significantly lessens the burden on the programmer and provides great programmability. However, using UM alone does not guarantee good performance: UM operates through a page fault mechanism, and handling page faults introduces substantial overhead. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that will be accessed during CUDA kernel execution. In this thesis, we propose three frameworks that exploit UM to improve ease of programming while maximizing application performance. The first framework is HUM, which hides the host-to-device memory copy time of traditional CUDA programs without any code modification. It overlaps host-to-device memory copies with host computation or CUDA kernel computation by exploiting Unified Memory and its fault mechanism. Evaluation shows that executing applications under HUM is, on average, 1.21 times faster than executing them under original CUDA, comparable to the average 1.22x speedup of hand-optimized Unified Memory implementations. The second framework is DeepUM, which exploits UM to allow GPU memory oversubscription for deep neural networks. While UM allows memory oversubscription through its page fault mechanism, page fault handling introduces enormous overhead; we use a correlation prefetching technique to hide it. Evaluation shows that DeepUM achieves performance comparable to other state-of-the-art approaches while running larger batch sizes that other methods fail to run. The last framework is SnuRHAC, which provides the illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed for a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes the workload to the GPUs and manages data across the nodes by extending Unified Memory and exploiting its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and maximize performance. Evaluation shows that SnuRHAC significantly improves ease of programming while delivering scalable performance in the cluster environment, depending on application characteristics.
    Contents: 1 Introduction; 2 Related Work; 3 CUDA Unified Memory; 4 Framework for Maximizing the Performance of Traditional CUDA Programs; 5 Framework for Running Large-scale DNNs on a Single GPU; 6 Framework for Virtualizing a Single Device Image for a GPU Cluster; 7 Evaluation; 8 Discussions and Future Work; 9 Conclusion.
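
    The "user hints" the abstract refers to are the standard CUDA managed-memory calls shown in the minimal sketch below: the allocation is advised and prefetched to the GPU before the kernel launch so the kernel does not pay page-fault cost on first touch. The kernel and sizes are illustrative.

        // Minimal sketch of Unified Memory with explicit prefetch hints.
        #include <cuda_runtime.h>

        __global__ void scale(float* x, int n, float a) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= a;
        }

        int main() {
            const int n = 1 << 24;
            float* x;
            cudaMallocManaged((void**)&x, n * sizeof(float)); // single address space, host + GPU
            for (int i = 0; i < n; ++i) x[i] = 1.0f;          // first touched on the host

            int dev = 0;
            cudaGetDevice(&dev);
            // User hints: declare the preferred location and prefetch pages to the GPU
            // ahead of the kernel launch instead of relying on demand paging.
            cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
            cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0);

            scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);

            // Prefetch back to the CPU before host code reads the result.
            cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId, 0);
            cudaDeviceSynchronize();

            cudaFree(x);
            return 0;
        }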

    DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms

    Datacenters are increasingly becoming heterogeneous and are starting to include specialized hardware for networking, video processing, and especially deep learning. To leverage the heterogeneous compute capability of modern datacenters, we develop an approach for compiler-level partitioning of deep neural networks (DNNs) onto multiple interconnected hardware devices. We present a general framework for heterogeneous DNN compilation, offering automatic partitioning and device mapping. Our scheduler integrates both an exact solver, through a mixed integer linear programming (MILP) formulation, and a modularity-based heuristic for scalability. Furthermore, we propose a theoretical lower bound formula for the optimal solution, which enables assessment of the heuristic solutions' quality. We evaluate our scheduler in optimizing both conventional DNNs and randomly-wired neural networks, subject to latency and throughput constraints, on a heterogeneous system consisting of a CPU and two distinct GPUs. Compared to naïvely running DNNs on the fastest GPU, the proposed framework achieves more than 3× lower latency and up to 2.9× higher throughput by automatically leveraging both data and model parallelism to deploy DNNs on our sample heterogeneous server node. Moreover, our modularity-based "splitting" heuristic improves the solution runtime by up to 395× without noticeably sacrificing solution quality compared to an exact MILP solution, and outperforms all other heuristics by 30-60% in solution quality. Finally, our case study shows how we can extend our framework to schedule large language models across multiple heterogeneous servers by exploiting symmetry in the hardware setup. Our code can be easily plugged into existing frameworks and is available at https://github.com/abdelfattah-lab/diviml.
    Comment: Accepted at ICCAD'23.
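
    To make the device-mapping problem concrete, the toy sketch below greedily assigns each DNN module to whichever device finishes it earliest, given made-up per-device runtimes. DiviML itself uses an MILP formulation plus a modularity-based heuristic under latency and throughput constraints; this greedy list scheduler only illustrates the shape of the problem.

        // Toy stand-in for heterogeneous device mapping: greedily place every module
        // on the device with the smallest resulting finish time. Costs are invented
        // and dependencies are ignored; this is not the DiviML scheduler.
        #include <cstdio>
        #include <vector>

        int main() {
            // cost[m][d]: runtime of module m on device d (ms), for {CPU, GPU0, GPU1}.
            std::vector<std::vector<double>> cost = {
                {4.0, 1.5, 1.2},
                {6.0, 2.0, 2.5},
                {3.0, 1.0, 0.9},
                {5.0, 2.2, 2.0},
            };
            const int num_devices = 3;
            std::vector<double> finish(num_devices, 0.0);  // accumulated load per device

            for (size_t m = 0; m < cost.size(); ++m) {
                int best = 0;
                double best_finish = finish[0] + cost[m][0];
                for (int d = 1; d < num_devices; ++d) {
                    double f = finish[d] + cost[m][d];
                    if (f < best_finish) { best_finish = f; best = d; }
                }
                finish[best] = best_finish;
                printf("module %zu -> device %d (finish %.1f ms)\n", m, best, best_finish);
            }

            double makespan = 0.0;
            for (double f : finish) if (f > makespan) makespan = f;
            printf("estimated makespan: %.1f ms\n", makespan);
            return 0;
        }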