A Model-Similarity-Based Scheduling Policy for Deep Learning Training Workload in a GPU Cluster

Abstract

Deep Learning (DL) has witnessed a surge in popularity in recent years, evident from its extensive use across diverse domains. DL has played a pivotal role in tackling challenges such as image recognition, video segmentation, and natural language processing. As its name suggests, a DL model consists of several layers whose matrix computations are carried out layer by layer. Moreover, training a DL model demands a large volume of data before the model becomes proficient at a specific task. Consequently, training DL models entails considerable time and resources. The resource demand of DL training depends on factors such as the model architecture and the size of the training dataset, which makes it difficult to reduce directly. However, one effective strategy for tackling the time-consuming nature of DL training is the use of Graphics Processing Units (GPUs). GPUs are preferred for their parallel processing capabilities, which are essential for training DL models efficiently, especially on large datasets. Because of this capability for parallelisation, distributed training across multiple GPUs has emerged as a practical way to complete training within a reasonable time, and it is typically carried out on GPU clusters equipped with many GPUs.

A GPU cluster may offer several GPU architectures, and each architecture delivers different training performance depending on the DL model. To maximise the utilisation of a GPU cluster, the scheduler plays an important role by allocating resources to jobs appropriately. When handling DL training jobs, an effective scheduling policy should consider the varying training performance of each GPU architecture across DL models. Furthermore, factors such as the number of GPUs used for distributed training and the batch size significantly affect training performance. Addressing this variability in training performance and accounting for these influential factors are critical for optimising resource usage.

In this thesis, we propose a model-similarity-based scheduling policy designed specifically for managing DL training jobs in a heterogeneous GPU cluster. To account for the variability in training performance across DL models, a similarity measure compares the characteristics of a submitted job's model with those of reference models in a database. The training behaviour of the closest reference model is then provided to the scheduler, which makes scheduling decisions based on cluster availability. The findings illustrate that employing the model-similarity-based scheduling policy and allowing the batch size to be adjusted according to the scheduling objective can significantly decrease the makespan. Furthermore, our scheduling policy surpasses the performance of a state-of-the-art scheduling policy.

To enhance the model-similarity-based scheduling policy, we incorporate cutting-edge scheduling approaches, namely a round-based mechanism and job packing. The round-based mechanism enables the scheduler to periodically revise its scheduling decisions, optimising resource allocation over time, whereas job packing improves GPU utilisation by placing an additional job on a GPU that is training a small model. The results demonstrate that the round-based mechanism effectively reduces the makespan compared with scenarios without it, and that integrating job packing further decreases the makespan and reduces queuing delay.
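To make the similarity-matching step concrete, the following is a minimal sketch, not the thesis implementation. It assumes, purely for illustration, that model characteristics are encoded as numeric feature vectors (here: layer count, parameter count, FLOPs per iteration) and that similarity is the Euclidean distance between normalised vectors; the reference database, feature choices, metric, and throughput figures below are all hypothetical.

```python
import numpy as np

# Hypothetical reference database: model name -> (feature vector,
# profiled training behaviour per GPU architecture, e.g. throughput
# in samples/s). Values are placeholders for illustration only.
REFERENCE_DB = {
    "resnet50":  (np.array([50, 25.6e6, 4.1e9]), {"V100": 410.0, "K80": 95.0}),
    "vgg16":     (np.array([16, 138e6, 15.5e9]), {"V100": 230.0, "K80": 48.0}),
    "mobilenet": (np.array([28, 4.2e6, 0.57e9]), {"V100": 980.0, "K80": 260.0}),
}


def closest_reference(job_features: np.ndarray):
    """Return the reference model whose features are closest to the job's."""
    names = list(REFERENCE_DB)
    matrix = np.stack([REFERENCE_DB[n][0] for n in names])
    # Normalise each feature dimension so large-magnitude features
    # (e.g. FLOPs) do not dominate the distance.
    scale = matrix.max(axis=0)
    dists = np.linalg.norm(matrix / scale - job_features / scale, axis=1)
    best = names[int(np.argmin(dists))]
    return best, REFERENCE_DB[best][1]


# Example: an unseen model with 34 layers, ~21.8M parameters, ~3.6 GFLOPs/iter.
name, behaviour = closest_reference(np.array([34, 21.8e6, 3.6e9]))
print(name, behaviour)
```

In the spirit of the abstract, the scheduler would then combine the returned training-behaviour profile of the closest reference model with current cluster availability to decide where, and with how many GPUs, the job should run.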
