
    Algorithms for Preemptive Co-scheduling of Kernels on GPUs

    Modern GPUs allow concurrent kernel execution and preemption to improve hardware utilization and responsiveness. Currently, the decision on the simultaneous execution of kernels is made by the hardware, which can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs in high-competition scenarios. We propose a novel graph-based preemptive co-scheduling algorithm that focuses on reducing the number of preemptions. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time. Based on this solution, we propose a graph-theoretical model and an algorithm to build preemptive schedules that minimize the number of preemptions. We show, however, that finding the minimum number of preemptions among all preemptive solutions of optimal makespan is an NP-hard problem. We performed experiments on real-world GPU applications, and our approach achieves the optimal makespan while preempting only 6 to 9% of the tasks.
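    The abstract does not give the Linear Program itself; the following is a minimal, illustrative sketch of one standard way to compute an optimal preemptive makespan with a configuration-based LP, written with PuLP and a hypothetical pairwise compatibility relation between kernels. It enumerates feasible co-run sets explicitly (which is exponential, unlike the polynomial-time construction claimed in the paper) and assumes a co-scheduled kernel still progresses at full rate.

```python
# Illustrative configuration LP for an optimal preemptive makespan (a sketch,
# not the paper's formulation).  x[cfg] = time spent running the set of
# kernels cfg together; every kernel must accumulate its processing time.
from itertools import combinations
import pulp

proc_time = {"k0": 4.0, "k1": 3.0, "k2": 5.0, "k3": 2.0}   # hypothetical kernel work
compatible = {("k0", "k1"), ("k0", "k3"), ("k2", "k3")}     # pairs that fit on the GPU together

kernels = list(proc_time)

def feasible(subset):
    # A set of kernels may co-run only if every pair in it is compatible.
    return all((a, b) in compatible or (b, a) in compatible
               for a, b in combinations(subset, 2))

# Enumerate feasible configurations (exponential here; the paper's LP is polynomial-size).
configs = [s for r in range(1, len(kernels) + 1)
           for s in combinations(kernels, r) if feasible(s)]

prob = pulp.LpProblem("preemptive_makespan", pulp.LpMinimize)
x = {cfg: pulp.LpVariable(f"x_{i}", lowBound=0) for i, cfg in enumerate(configs)}

prob += pulp.lpSum(x.values())                               # makespan = total time in all configurations
for k in kernels:                                            # each kernel receives its processing time
    prob += pulp.lpSum(x[cfg] for cfg in configs if k in cfg) >= proc_time[k]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("optimal preemptive makespan:", pulp.value(prob.objective))
for cfg in configs:
    if pulp.value(x[cfg]) > 1e-9:
        print(cfg, "runs for", round(pulp.value(x[cfg]), 3))
```

    Reading the nonzero time slices off such a solution gives the raw material from which a concrete preemptive schedule, and hence the preemption count the paper seeks to minimize, can be assembled.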

    Independent tasks on 2 resources with co-scheduling effects

    Concurrent kernel execution is a relatively new feature in modern GPUs, designed to improve hardware utilization and overall system throughput. However, the decision on the simultaneous execution of tasks is made by the hardware with a leftover policy, which assigns as many resources as possible to one task and then assigns the remaining resources to the next. This can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs with and without preemption, focusing on determining the kernel submission order so as to reduce the number of preemptions and the kernels' makespan, respectively. We propose a graph-based theoretical model to build preemptive and non-preemptive schedules. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time, and we propose an algorithm based on this solution that minimizes the number of preemptions. We also propose an algorithm that transforms a preemptive solution of optimal makespan into a non-preemptive solution with the smallest possible preemption overhead. We show, however, that finding the minimum number of preemptions among all preemptive solutions of optimal makespan is an NP-hard problem, and that computing the optimal non-preemptive schedule is also NP-hard. In addition, we study the non-preemptive problem directly, without first searching for a good preemptive solution, and present a Mixed Integer Linear Program for it. We performed experiments on real-world GPU applications, and our approach achieves the optimal makespan while preempting only 6 to 9% of the tasks. Our non-preemptive approach, on the other hand, obtains makespans within 2.5% of the optimal preemptive schedules, while previous approaches exceed the preemptive makespan by 5 to 12%.
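    The abstract mentions a Mixed Integer Linear Program for the non-preemptive problem but does not spell it out; the sketch below is a hedged, round-based stand-in, not the authors' formulation. Tasks are grouped into rounds, each round must fit within the two resource capacities, a round lasts as long as its slowest task, and the makespan is the sum of round lengths. The task data, resource demands, and capacities are invented for illustration.

```python
# Round-based MILP sketch for non-preemptive co-scheduling on 2 resources
# (illustrative only).  y[t][r] = 1 if task t runs in round r; d[r] = length
# of round r; makespan = sum of round lengths.
import pulp

# Hypothetical tasks: processing time and normalized demand on each resource.
tasks = {
    "k0": {"p": 4.0, "r1": 0.6, "r2": 0.3},
    "k1": {"p": 3.0, "r1": 0.4, "r2": 0.5},
    "k2": {"p": 5.0, "r1": 0.7, "r2": 0.6},
    "k3": {"p": 2.0, "r1": 0.3, "r2": 0.4},
}
rounds = range(len(tasks))            # at most one round per task
cap = {"r1": 1.0, "r2": 1.0}          # normalized resource capacities

prob = pulp.LpProblem("nonpreemptive_coschedule", pulp.LpMinimize)
y = pulp.LpVariable.dicts("y", (tasks, rounds), cat="Binary")   # task -> round assignment
d = pulp.LpVariable.dicts("d", rounds, lowBound=0)              # round length

prob += pulp.lpSum(d[r] for r in rounds)                        # minimize the makespan

for t in tasks:                                                  # each task runs exactly once
    prob += pulp.lpSum(y[t][r] for r in rounds) == 1
for r in rounds:
    for res in ("r1", "r2"):                                     # co-run tasks must fit both resources
        prob += pulp.lpSum(tasks[t][res] * y[t][r] for t in tasks) <= cap[res]
    for t in tasks:                                              # a round lasts as long as its slowest task
        prob += d[r] >= tasks[t]["p"] * y[t][r]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("non-preemptive makespan:", pulp.value(prob.objective))
for r in rounds:
    chosen = [t for t in tasks if pulp.value(y[t][r]) > 0.5]
    if chosen:
        print("round", r, "->", chosen, "length", pulp.value(d[r]))
```

    Symmetry between rounds makes this formulation slow at realistic scales; the point is only to show the general shape of a non-preemptive co-scheduling MILP.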

    GPU Resource Optimization and Scheduling for Shared Execution Environments

    General purpose graphics processing units have become a computing workhorse for a variety of data- and compute-intensive applications, from large supercomputing systems for massive data analytics to small, mobile embedded devices for autonomous vehicles. Making effective and efficient use of these processors traditionally relies on extensive programmer expertise to design and develop kernel methods which simultaneously trade off task decomposition and resource exploitation. Often, new architecture designs force code refinements in order to continue to achieve optimal performance. At the same time, not all applications require full utilization of the system to achieve that optimal performance. In this case, the increased capability of new architectures introduces an ever-widening gap between the level of resources necessary for optimal performance and the level necessary to maintain system efficiency. The ability to schedule and execute multiple independent tasks on a GPU, known generally as concurrent kernel execution, enables application programmers and system developers to balance application performance and system efficiency. Various approaches to develop both coarse- and fine-grained scheduling mechanisms to achieve a high degree of resource utilization and improved application performance have been studied. Most of these works focus on mechanisms for the management of compute resources, while a small percentage consider the data transfer channels. In this dissertation, we propose a pragmatic approach to scheduling and managing both types of resources, data transfer and compute, that is transparent to an application programmer and capable of providing near-optimal system performance. Furthermore, the approaches described herein rely on reinforcement learning methods, which enable the scheduling solutions to be flexible to a variety of factors, such as transient application behaviors, changing system designs, and tunable objective functions. Finally, we describe a framework for the practical implementation of learned scheduling policies to achieve high resource utilization and efficient system performance.
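    The dissertation abstract does not detail its learning formulation; the sketch below only illustrates the general shape of a tabular, epsilon-greedy Q-learning loop for a scheduling decision. The states, actions, rewards, and the toy step() environment are all hypothetical placeholders; a real implementation would derive the reward from measured GPU throughput or latency after applying each scheduling decision.

```python
# Minimal tabular Q-learning sketch for a GPU scheduling decision (illustrative;
# not the dissertation's method).  State = (queued kernels, occupancy bucket).
import random
from collections import defaultdict

ACTIONS = ["launch_alone", "co_schedule", "defer"]    # hypothetical scheduling choices
alpha, gamma, eps = 0.1, 0.9, 0.2                     # learning rate, discount, exploration rate

Q = defaultdict(float)                                # Q[(state, action)] -> estimated value

def choose(state):
    # Epsilon-greedy action selection over the Q-table.
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def step(state, action):
    # Toy stand-in for the real system: returns (reward, next_state).
    # A real scheduler would observe throughput/latency on the GPU instead.
    queue, occupancy = state
    reward = random.uniform(0, 1) - (0.5 if action == "defer" and queue > 2 else 0.0)
    next_state = (max(0, queue - 1),
                  min(3, occupancy + (1 if action != "defer" else 0)))
    return reward, next_state

state = (3, 0)                                        # start with 3 queued kernels, idle GPU
for _ in range(1000):
    action = choose(state)
    reward, nxt = step(state, action)
    best_next = max(Q[(nxt, a)] for a in ACTIONS)     # bootstrap from the best next action
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = nxt if nxt[0] > 0 else (3, 0)             # refill the toy queue when empty
```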