38 research outputs found

    HeteroCore GPU to exploit TLP-resource diversity


    THROUGHPUT OPTIMIZATION AND RESOURCE ALLOCATION ON GPUS UNDER MULTI-APPLICATION EXECUTION

    Platform heterogeneity prevails as a solution to the throughput and computational challenges imposed by parallel applications and technology scaling. Specifically, Graphics Processing Units (GPUs) are based on the Single Instruction Multiple Thread (SIMT) paradigm and can offer tremendous speed-up for parallel applications. However, GPUs were designed to execute a single application at a time. Under simultaneous multi-application execution, the GPUs’ massive multi-threading paradigm causes applications to compete destructively for the shared resources (caches and memory controllers), resulting in significant throughput degradation. This thesis presents a methodology for minimizing interference in shared resources and providing efficient concurrent execution of multiple applications on GPUs. In particular, the proposed methodology (i) performs application classification; (ii) analyzes the per-class interference; (iii) finds the best matching between classes; and (iv) employs an efficient resource allocation. Experimental results showed that the proposed approach increases system throughput for two concurrent applications by an average of 36% compared to other optimization techniques, while for three concurrent applications it achieved an average gain of 23%.
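As a rough illustration of step (iii), the sketch below pairs applications so that the summed co-run slowdown is smallest. It is not the thesis's method: the class names, the slowdown table, and the functions `pair_cost` and `best_pairing` are hypothetical, the numbers are invented, and the exhaustive search is only practical for a handful of applications.

```python
# Hypothetical slowdown when two application classes share the GPU.
# The values are invented for illustration, not measured results.
SLOWDOWN = {
    frozenset(["memory"]): 1.8,             # memory-bound with memory-bound
    frozenset(["memory", "compute"]): 1.2,  # mixed pair
    frozenset(["compute"]): 1.3,            # compute-bound with compute-bound
}


def pair_cost(class_a, class_b):
    """Look up the assumed slowdown for co-running two classes."""
    return SLOWDOWN[frozenset([class_a, class_b])]


def best_pairing(apps):
    """Return (total_cost, pairs) for the cheapest way to pair all apps.

    apps maps an application name to its class and must hold an even
    number of entries. All perfect pairings are tried recursively.
    """
    if len(apps) % 2 != 0:
        raise ValueError("expected an even number of applications")
    names = sorted(apps)
    if not names:
        return 0.0, []
    first, rest = names[0], names[1:]
    best = None
    for partner in rest:
        remaining = {n: apps[n] for n in rest if n != partner}
        cost, pairs = best_pairing(remaining)
        cost += pair_cost(apps[first], apps[partner])
        if best is None or cost < best[0]:
            best = (cost, [(first, partner)] + pairs)
    return best


if __name__ == "__main__":
    apps = {"a": "memory", "b": "memory", "c": "compute", "d": "compute"}
    print(best_pairing(apps))
```

With this invented table, the search prefers mixing a memory-bound application with a compute-bound one over pairing two applications of the same class.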

    Towards multiprogrammed GPUs

    Programmable Graphics Processing Units (GPUs) have recently become the most pervasive massively parallel processors. They have come a long way, from fixed-function ASICs designed to accelerate graphics tasks to a programmable architecture that can also execute general-purpose computations. Because of their performance and efficiency, an increasing amount of software relies on them to accelerate data-parallel and computationally intensive sections of code. They have earned a place in many systems, from low-power mobile devices to the biggest data centers in the world. However, GPUs are still plagued by the fact that they have essentially no multiprogramming support, resulting in low system performance if the GPU is shared among multiple programs. In this dissertation we set out to provide rich GPU multiprogramming support by improving the multitasking capabilities and increasing the virtual memory functionality and performance. The main issue hindering multitasking support in GPUs is the nonpreemptive execution of GPU kernels. Here we propose two preemption mechanisms with different design philosophies that can be used by a scheduler to preempt execution on GPU cores and make room for another process. We also argue for spatial sharing of the GPU and propose a concrete hardware scheduler implementation that dynamically partitions the GPU cores among running kernels according to their set priorities. Contrary to the assumptions made in related work, we demonstrate that preemptive execution is feasible and is the desired approach to GPU multitasking. We further show improved system fairness and responsiveness with our scheduling policy. We also pinpoint that at the core of the insufficient virtual memory support lies the exception handling mechanism used by modern GPUs. Currently, GPUs offload the actual exception handling work to the CPU, while the faulting instruction is stalled in the GPU core. 
This stall-on-fault model prevents some virtual memory features and optimizations, and it is especially harmful in multiprogrammed environments because it prevents context switching the GPU until all in-flight faults are resolved. In this dissertation, we propose three GPU core organizations with varying performance-complexity trade-offs that eliminate stall-on-fault execution and enable preemptible exceptions on the GPU (i.e., the faulting instruction can be squashed and restarted later). Building on this support, we implement two use cases and demonstrate their utility. The first is a scheme that context switches the faulted threads and tries to find other useful work to do in the meantime, hiding the latency of the fault and improving system performance. The second enables the fault handling code to run locally on the GPU instead of being offloaded to the CPU, and shows that local fault handling can also improve performance.
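To make the idea of partitioning GPU cores among running kernels by priority more concrete, here is a minimal software sketch. The abstract does not say how priorities map to core counts, so the proportional-share rule below is an assumption; the function `partition_cores`, the integer weights, and the kernel names are hypothetical and do not describe the proposed hardware scheduler.

```python
def partition_cores(num_cores, priorities):
    """Split num_cores among kernels in proportion to their priority weights.

    priorities maps a kernel name to a positive weight (an assumed encoding
    of priority). Fractional shares are rounded with the largest-remainder
    rule so that the returned counts always sum to num_cores.
    """
    if not priorities or any(w <= 0 for w in priorities.values()):
        raise ValueError("need at least one kernel and positive weights")
    total = sum(priorities.values())
    exact = {k: num_cores * w / total for k, w in priorities.items()}
    shares = {k: int(v) for k, v in exact.items()}
    leftover = num_cores - sum(shares.values())
    # Give the remaining cores to the kernels with the largest remainders.
    by_remainder = sorted(
        priorities, key=lambda k: exact[k] - shares[k], reverse=True
    )
    for k in by_remainder[:leftover]:
        shares[k] += 1
    return shares


if __name__ == "__main__":
    print(partition_cores(16, {"render": 3, "compute": 1}))
```

A dynamic scheduler would rerun such a computation whenever a kernel arrives or finishes, and would rely on a preemption mechanism to reclaim cores from kernels whose share shrinks.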

    D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

    Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNNs). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's high-end accelerators. Although spatial multiplexing of the GPU leads to higher GPU utilization and higher inference throughput, a number of challenges remain: finding the GPU percentage that right-sizes the GPU for each DNN through profiling; determining an optimal batching of requests that balances throughput improvement against application-specific deadlines and service level objectives (SLOs); and maximizing throughput by appropriately scheduling DNNs. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run in the GPU concurrently. To help allocate the appropriate GPU percentage (we call it the "Knee"), we develop and validate a model that estimates the parallelism each DNN can utilize. We also develop a lightweight optimization formulation to find an efficient batch size for each DNN operating with D-STACK. We bring together our optimizations and our spatio-temporal scheduler to provide a holistic inference framework, and we demonstrate its ability to provide high throughput while meeting application SLOs. We compare D-STACK with an ideal scheduler that can allocate the right GPU percentage for every DNN kernel: D-STACK achieves more than 90 percent of the ideal scheduler's throughput and GPU utilization. We also compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus), using popular DNN models. Our controlled experiments with multiplexing several popular DNN models achieve up to 1.6X improvement in GPU utilization and up to 4X improvement in inference throughput.
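One simple way to picture the "Knee" is as the smallest GPU percentage beyond which extra GPU share yields little additional throughput. The sketch below picks that point from profiled measurements. It is an illustrative heuristic, not the analytical model the paper develops; the function `find_knee`, the 95 percent threshold, and the sample profile are assumptions.

```python
def find_knee(profile, fraction=0.95):
    """Return the smallest GPU percentage reaching `fraction` of peak throughput.

    profile is a list of (gpu_percent, throughput) measurements from
    profiling one DNN at different GPU shares. The threshold is an
    assumed cut-off for "close enough to peak".
    """
    if not profile:
        raise ValueError("profile must contain at least one measurement")
    peak = max(throughput for _, throughput in profile)
    for gpu_percent, throughput in sorted(profile):
        if throughput >= fraction * peak:
            return gpu_percent


if __name__ == "__main__":
    # Invented measurements: throughput flattens out around 40% of the GPU.
    profile = [(10, 100), (20, 190), (30, 270), (40, 290), (50, 295), (100, 300)]
    print(find_knee(profile))
```

Under this heuristic, the GPU share left over after each model receives its knee allocation is what a spatial multiplexer could hand to other models running concurrently.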