
    Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

    With their massively multithreaded execution model, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPU workloads). However, using GPUs to accelerate computation does not always yield a good performance improvement, mainly because of three inefficiencies in modern GPU and system architectures. First, not all parallel threads have a uniform amount of work, so the GPU's computation capability is not fully utilized; this sub-optimal-performance problem is called warp criticality. To mitigate warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates the execution of critical warps by allocating larger execution time slices and additional cache resources to them. The evaluation shows that with CAWA, GPUs achieve an average speedup of 1.23x. Second, the shared cache storage in GPUs is often insufficient to accommodate the demands of the large number of concurrent threads. As a result, cache thrashing is common in GPU cache memories, particularly in the L1 data caches. To alleviate cache contention and thrashing, I develop an instruction-aware Control-Loop-Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation shows that Ctrl-C effectively improves cache utilization in GPUs and achieves an average speedup of 1.42x for cache-sensitive GPGPU workloads. Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system can contend for memory resources at multiple levels, resulting in significant performance degradation. To maximize system throughput and balance the performance degradation across all co-located applications, I design a scalable performance degradation predictor for heterogeneous systems, called HeteroPDP. HeteroPDP predicts application execution times and schedules OpenCL workloads to run on different devices according to the optimization goal. The evaluation shows that HeteroPDP improves system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gains an additional 50% speedup compared with always offloading the OpenCL workload to the GPU. In summary, this dissertation aims to provide insights for future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs. (Doctoral dissertation, Computer Engineering.)
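
    The abstract only names Ctrl-C's key idea: a feedback control loop that decides what share of memory requests to bypass around the L1. The sketch below illustrates that idea alone, as a per-epoch proportional controller nudging a bypass probability toward a target hit rate. All names, constants, and the controller form are assumptions for illustration; the dissertation's actual instruction-aware algorithm is not described in this listing.

        // Hypothetical epoch-based bypass controller: when the observed L1 hit
        // rate falls below target, bypass more requests to relieve contention.
        #include <cstdio>
        #include <cstdlib>

        struct BypassController {
            double bypassFraction = 0.0;  // fraction of requests sent past the L1
            double targetHitRate  = 0.6;  // desired hit rate for cached requests
            double gain           = 0.5;  // proportional gain of the control loop

            void update(double observedHitRate) {          // called once per epoch
                bypassFraction += gain * (targetHitRate - observedHitRate);
                if (bypassFraction < 0.0) bypassFraction = 0.0;
                if (bypassFraction > 1.0) bypassFraction = 1.0;
            }
            bool shouldBypass() const {                    // called per memory request
                return (double)std::rand() / RAND_MAX < bypassFraction;
            }
        };

        int main() {
            BypassController ctrl;
            // Stand-in per-epoch hit rates that a simulator would measure.
            const double obs[] = {0.20, 0.25, 0.40, 0.55, 0.62};
            for (double h : obs) {
                ctrl.update(h);
                std::printf("hit=%.2f -> bypass fraction=%.2f\n", h, ctrl.bypassFraction);
            }
        }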

    ¿Son las GPUs dispositivos eficientes energéticamente? (Are GPUs energy-efficient devices?)

    With energy consumption emerging as one of the biggest issues in the development of HPC (High Performance Computing) applications, detailed power-related research has become a priority. In recent years, GPU coprocessors have been increasingly used to accelerate many of these costly systems, even though they pack millions of transistors onto their chips, bringing an immediate increase in power requirements. This paper analyzes a set of applications from the Rodinia benchmark suite in terms of CPU and GPU performance and energy consumption. Specifically, it compares single-threaded and multi-threaded CPU versions with GPU implementations, and characterizes execution time, true instantaneous power, and average energy consumption to test the idea that GPUs are power-hungry computing devices.
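
    The abstract does not say how power was measured. As one concrete way to obtain instantaneous power and average energy on the GPU side, the sketch below samples board power through NVIDIA's NVML and integrates the samples over a fixed window (E ≈ Σ P·Δt). The window length, sampling period, and device index are placeholder assumptions, and the CPU-side measurement needed for the paper's comparison is not covered.

        // Sample GPU board power via NVML and integrate to energy.
        // Compile with: nvcc energy_sample.cu -lnvidia-ml
        #include <nvml.h>
        #include <chrono>
        #include <cstdio>
        #include <thread>

        int main() {
            nvmlInit();
            nvmlDevice_t dev;
            nvmlDeviceGetHandleByIndex(0, &dev);

            const double dt = 0.010;  // nominal 10 ms sampling period
            double energyJ = 0.0;
            auto end = std::chrono::steady_clock::now() + std::chrono::seconds(5);

            // Sample until the fixed 5 s window ends; a real harness would
            // bracket the benchmark's execution instead.
            while (std::chrono::steady_clock::now() < end) {
                unsigned int mw = 0;
                nvmlDeviceGetPowerUsage(dev, &mw);  // instantaneous power, milliwatts
                energyJ += (mw / 1000.0) * dt;      // accumulate P * dt
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
            std::printf("average power: %.1f W, energy: %.1f J over 5 s\n",
                        energyJ / 5.0, energyJ);
            nvmlShutdown();
        }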

    Performance and Energy Trade-Offs for Parallel Applications on Heterogeneous Multi-Processing Systems

    This work proposes a methodology for finding performance and energy trade-offs for parallel applications running on Heterogeneous Multi-Processing systems with a single instruction-set architecture. Such systems offer flexibility in the form of different core types and voltage/frequency pairings, defining a vast design space to explore. Consequently, for a given application, choosing a configuration that optimizes performance and energy consumption is not straightforward. Our method proposes novel analytical models for performance and power consumption whose parameters can be fitted using only a few strategically sampled offline measurements. These models are then used to estimate an application's performance and energy consumption over the whole configuration space. In turn, these offline predictions define a set of estimated Pareto-optimal configurations, which inform the selection of the configuration on which the application should be executed. The methodology was validated on an ODROID-XU3 board with eight programs from the PARSEC benchmark suite, the Phoronix Test Suite, and Rodinia. The generated Pareto-optimal configuration space represented a 99% reduction of the universe of all available configurations. Energy savings of up to 59.77%, 61.38%, and 17.7% were observed when compared to the performance, ondemand, and powersave Linux governors, respectively, with higher or similar performance.
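
    The selection step lends itself to a compact illustration: once the fitted models can predict time and energy for any configuration, the estimated Pareto-optimal set is simply the non-dominated points of the enumerated space. The predictor formulas and the big.LITTLE-style configuration space below are invented placeholders, not the paper's fitted models; only the dominance filter reflects the described technique.

        // Enumerate a configuration space, predict time/energy with placeholder
        // models, and keep the Pareto-optimal (non-dominated) configurations.
        #include <cstdio>
        #include <vector>

        struct Config { int big, little; double ghz; };
        struct Point  { Config cfg; double time, energy; };

        double predictTime(const Config& c) {    // placeholder analytical model
            return 10.0 / (c.big * c.ghz + 0.4 * c.little);
        }
        double predictEnergy(const Config& c) {  // placeholder analytical model
            return predictTime(c) * (1.5 * c.big * c.ghz * c.ghz + 0.2 * c.little);
        }

        int main() {
            std::vector<Point> pts;
            for (int b = 0; b <= 4; ++b)         // enumerate core counts and freqs
                for (int l = 0; l <= 4; ++l)
                    for (int f = 6; f <= 20; f += 2) {
                        if (b == 0 && l == 0) continue;
                        Config c{b, l, f / 10.0};
                        pts.push_back({c, predictTime(c), predictEnergy(c)});
                    }
            // A point is Pareto-optimal if no other point is at least as good on
            // both objectives and strictly better on one.
            for (const Point& p : pts) {
                bool dominated = false;
                for (const Point& q : pts)
                    if (q.time <= p.time && q.energy <= p.energy &&
                        (q.time < p.time || q.energy < p.energy)) { dominated = true; break; }
                if (!dominated)
                    std::printf("big=%d little=%d %.1f GHz: t=%.2f s, E=%.2f J\n",
                                p.cfg.big, p.cfg.little, p.cfg.ghz, p.time, p.energy);
            }
        }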

    Software-only triple diverse redundancy on GPUs for autonomous driving platforms

    Autonomous driving (AD) imposes the need for safe computation in high-performance computing (HPC) components such as GPUs, which must be able to detect and recover from errors, since a safe state may no longer exist. This can be achieved with Triple Modular Redundancy (TMR) for computation components. Furthermore, error detection needs to provide some form of diversity to avoid the case where a single fault drives all redundant executions to the same error, which would go undetected. In our past work, we assessed GPUs under dual modular redundancy (DMR) with diversity, showing their potential and limitations for providing diverse redundancy built on reset-and-restart recovery. However, such a recovery scheme may be too slow for some applications. This paper proposes a software-only solution to deliver diverse TMR on commercial off-the-shelf (COTS) GPUs. Our work details how staggered execution can be achieved and assesses the performance of TMR on COTS GPUs. Moreover, we identify those elements where diversity cannot be guaranteed and compare the DMR and TMR cases for those elements. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 871467 (SELENE). Leonidas Kosmidis has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under a Juan de la Cierva Formación postdoctoral fellowship, number FJCI-2017-34095.
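
    As a rough picture of what software-only TMR looks like at the CUDA API level, the sketch below launches three copies of a kernel on separate streams and majority-votes the results on the host. It is a minimal sketch only: the replica-diversity mechanisms and the paper's actual staggering scheme are not shown, and the saxpy kernel and launch shape are illustrative assumptions.

        // Triple-redundant kernel execution with a host-side majority vote.
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void saxpy(float a, const float* x, float* y, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = a * x[i] + y[i];
        }

        int main() {
            const int n = 1 << 20;
            float *x, *y[3];
            cudaMallocManaged(&x, n * sizeof(float));
            for (int r = 0; r < 3; ++r) cudaMallocManaged(&y[r], n * sizeof(float));
            for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[0][i] = y[1][i] = y[2][i] = 2.0f; }

            cudaStream_t s[3];
            for (int r = 0; r < 3; ++r) cudaStreamCreate(&s[r]);
            // Sequential launches on independent streams stagger the replicas'
            // start times, so a transient fault is unlikely to hit all three alike.
            for (int r = 0; r < 3; ++r)
                saxpy<<<(n + 255) / 256, 256, 0, s[r]>>>(3.0f, x, y[r], n);
            cudaDeviceSynchronize();

            int mismatches = 0;
            for (int i = 0; i < n; ++i) {
                if (y[0][i] == y[1][i] || y[0][i] == y[2][i]) {
                    // y[0] agrees with at least one replica: it is the majority value
                } else if (y[1][i] == y[2][i]) {
                    // y[1]/y[2] outvote a faulty y[0]
                } else {
                    ++mismatches;  // all three disagree: detected but unrecoverable
                }
            }
            std::printf("unrecoverable mismatches: %d\n", mismatches);
            return 0;
        }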

    High-Integrity GPU Designs for Critical Real-Time Automotive Systems

    Autonomous Driving (AD) imposes the use of high-performance hardware, such as GPUs, to perform object recognition and tracking in real time. However, unlike in the consumer electronics market, critical real-time AD functionalities require a high degree of resilience against faults, in line with the requirements of the ISO 26262 automotive functional safety standard. ISO 26262 imposes some source of independent redundancy for the most critical functionalities, so that a single fault cannot lead to a failure, with dual-core lockstep (DCLS) with diversity being the preferred choice for computing devices. Unfortunately, GPUs do not support diverse DCLS by construction, and thus fail to meet ISO 26262 requirements efficiently. In this paper we propose lightweight modifications to GPUs to enable diverse DCLS for critical real-time applications without diminishing their performance for non-critical applications. In particular, we show how enabling specific mechanisms for software-controlled kernel scheduling in the GPU guarantees that redundant kernels can be executed on different resources, so that a single fault cannot lead to a failure, as imposed by ISO 26262. Our results on a GPU simulator and an NVIDIA GPU prove the viability of the approach and its effectiveness for the high-performance GPU designs needed in AD systems. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramón y Cajal postdoctoral fellowship number RYC-2013-14717. Carles Hernandez is jointly funded by the MINECO and FEDER funds through grant TIN2014-60404-JIN.
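
    Current COTS GPUs expose no control over which streaming multiprocessor (SM) runs a given thread block, which is precisely the gap the proposed hardware mechanisms fill. What software can do today is observe placement: the sketch below reads the %smid special register so an experiment can check whether two redundant kernel instances actually landed on disjoint SMs. On a stock GPU the two replicas will typically reuse the same SMs, which is the limitation the paper's scheduling support removes; the kernel and launch shape are illustrative assumptions.

        // Record which SM executed each block of two redundant kernel launches.
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void recordSm(unsigned int* smOfBlock) {
            if (threadIdx.x == 0) {
                unsigned int smid;
                asm("mov.u32 %0, %%smid;" : "=r"(smid));  // SM executing this block
                smOfBlock[blockIdx.x] = smid;
            }
        }

        int main() {
            const int blocks = 8;
            unsigned int *smA, *smB;
            cudaMallocManaged(&smA, blocks * sizeof(unsigned int));
            cudaMallocManaged(&smB, blocks * sizeof(unsigned int));

            // Two redundant kernel instances; the paper's hardware mechanisms
            // would guarantee they execute on different resources.
            recordSm<<<blocks, 32>>>(smA);
            recordSm<<<blocks, 32>>>(smB);
            cudaDeviceSynchronize();

            for (int b = 0; b < blocks; ++b)
                std::printf("block %d: replica A on SM %u, replica B on SM %u\n",
                            b, smA[b], smB[b]);
            return 0;
        }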