
    A Model-Similarity-Based Scheduling Policy for Deep Learning Training Workload in a GPU Cluster

    Deep Learning (DL) has witnessed a surge in popularity in recent years, evident from its extensive utilisation and diverse applications spanning various domains. DL has played a pivotal role in tackling challenges such as image recognition, video segmentation, and natural language processing. As its name suggests, DL involves several layers whose matrix computations are carried out layer by layer. Furthermore, training a DL model demands a significant volume of data for the model to become proficient at a specific task. Consequently, training DL models entails considerable time and resource consumption. The resource demand depends on factors such as the DL architecture and the size of the training dataset, which makes it difficult to reduce. The time-consuming nature of DL training, however, can be addressed effectively with Graphics Processing Units (GPUs). A GPU is preferred for its parallel processing capabilities, which are necessary for training DL models efficiently, especially with large datasets. Owing to this parallelism, distributed training across multiple GPUs has emerged as a practical way to complete training within a reasonable time, and it is typically carried out in GPU clusters equipped with multiple GPUs.

    A GPU cluster may offer several GPU architectures, and each architecture delivers different training performance depending on the DL model. To maximise the utilisation of a GPU cluster, the scheduler plays an important role in managing resources by allocating them appropriately to jobs. When handling DL training tasks, an effective scheduling policy ought to consider the varying training performance of each GPU architecture for different DL models. Furthermore, factors such as the number of GPUs used for distributed training and the batch size significantly affect training performance. Accounting for this variability and for these influential factors is critical for optimising resource usage.

    In this thesis, we propose a model-similarity-based scheduling policy designed specifically for managing DL training tasks in a heterogeneous GPU cluster. To account for the variability in training performance across DL models, a similarity measurement is used to compare the characteristics of a submitted job's model with those of reference models in a database. The training behaviour of the closest reference model is then provided to the scheduler to inform scheduling decisions based on cluster availability. The findings illustrate that employing the model-similarity-based scheduling policy and allowing the batch size to be adjusted according to the scheduling objective can significantly decrease the makespan. Furthermore, our scheduling policy surpasses the performance of a state-of-the-art scheduling policy. To enhance the model-similarity-based scheduling policy, we incorporate scheduling approaches such as a round-based mechanism and job packing. The round-based mechanism enables the scheduler to periodically revise its scheduling decisions, optimising resource allocation over time, while job packing improves GPU utilisation by placing an additional job on a GPU that is training a smaller model. The results demonstrate that the round-based mechanism effectively reduces the makespan compared to scenarios without it, and that integrating job packing decreases the makespan further and reduces queuing delay.
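    The abstract does not include code, so the following is a minimal illustrative sketch of the matching step it describes: a submitted job's model characteristics are compared against profiled reference models, and the closest one is reused to inform scheduling. The feature names, the choice of cosine similarity, and the example numbers are assumptions for illustration, not the thesis's actual implementation.

```python
# Illustrative sketch only. Feature choice (parameter count in millions,
# number of layers, GFLOPs per sample), the similarity metric, and the
# profiled values below are assumptions, not the thesis's implementation.
import math

REFERENCE_MODELS = {
    "resnet50":  [25.6, 50, 4.1],
    "vgg16":     [138.0, 16, 15.5],
    "bert_base": [110.0, 12, 22.3],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest_reference(job_features):
    """Return the reference model whose characteristics are most similar
    to the submitted job's model; the scheduler then reuses that model's
    recorded training behaviour when making its placement decision."""
    return max(REFERENCE_MODELS,
               key=lambda name: cosine_similarity(job_features, REFERENCE_MODELS[name]))

# A new job described by the same (assumed) features is mapped to the
# closest profiled model.
print(closest_reference([120.0, 14, 20.0]))   # -> "bert_base"
```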

    Scheduling Deep Learning Training in GPU Cluster Using the Model-Similarity-Based Policy

    Training large neural networks on huge amounts of data using multiple Graphics Processing Units (GPUs) became widespread with the emergence of Deep Learning (DL) technology. Such training is usually carried out in datacenters featuring multiple GPU clusters, which are shared amongst users. However, different GPU architectures co-exist on the market and differ in training performance. To maximise the utilisation of a GPU cluster, the scheduler plays an important role in managing the resources by dispatching jobs to the GPUs. An efficient scheduling strategy should take into account that the training performance of each GPU architecture varies across DL models. In this work, an original model-similarity-based scheduling policy is introduced that matches GPU architectures with DL models. The results show that using the model-similarity-based scheduling policy for distributed training of a DL model across multiple GPUs with a large batch size can reduce the makespan.
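    To complement the matching step sketched above, the following is a minimal sketch of the placement idea this abstract describes: once a job is matched to a reference model, the scheduler favours the free GPU whose architecture gives that model the best profiled training performance. The architecture names, throughput figures, and the greedy rule are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: architecture names, throughput numbers, and the
# greedy placement rule are assumptions, not the paper's implementation.

# Assumed profiled throughput (samples/s) of each reference model per GPU architecture.
THROUGHPUT = {
    "bert_base": {"V100": 180.0, "P100": 95.0, "K80": 40.0},
    "resnet50":  {"V100": 900.0, "P100": 520.0, "K80": 210.0},
}

def place_job(reference_model, free_gpus):
    """Pick the free GPU whose architecture gives the matched reference
    model the highest profiled throughput; return None if none is free."""
    candidates = [g for g in free_gpus if g["arch"] in THROUGHPUT[reference_model]]
    if not candidates:
        return None
    return max(candidates, key=lambda g: THROUGHPUT[reference_model][g["arch"]])

free = [{"id": 3, "arch": "P100"}, {"id": 7, "arch": "K80"}]
print(place_job("bert_base", free))  # -> {'id': 3, 'arch': 'P100'} (all V100s busy)
```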