16 research outputs found

    Predictive runtime code scheduling for heterogeneous architectures

    Heterogeneous architectures are currently widespread. With the advent of easy-to-program general-purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all their heterogeneous resources are continuously utilized by different applications with versioned critical parts to be able to better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component. In this paper, we propose a novel predictive user-level scheduler based on past performance history for heterogeneous systems. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in a single-application mode.
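    The core idea of a history-based predictive scheduler can be illustrated with a minimal sketch. This is not the paper's implementation: the class, device names, task names, and timings below are invented for illustration. The scheduler tries each device once, then routes a task to the device with the lowest mean runtime observed so far.

    ```python
    import statistics
    from collections import defaultdict

    class HistoryScheduler:
        """Toy history-based device picker (illustrative, not the paper's code)."""

        def __init__(self, devices):
            self.devices = devices
            self.history = defaultdict(list)  # (task, device) -> past runtimes

        def pick(self, task):
            # Explore: run on any device we have never tried for this task.
            for dev in self.devices:
                if not self.history[(task, dev)]:
                    return dev
            # Exploit: choose the device with the lowest mean past runtime.
            return min(self.devices,
                       key=lambda d: statistics.mean(self.history[(task, d)]))

        def record(self, task, dev, runtime):
            self.history[(task, dev)].append(runtime)

    sched = HistoryScheduler(["cpu", "gpu"])
    # Warm-up observations (fabricated numbers).
    sched.record("matmul", "cpu", 2.0)
    sched.record("matmul", "gpu", 0.5)
    print(sched.pick("matmul"))  # prints "gpu"
    ```

    As the history grows, the prediction adapts: if later GPU runs turn out slow (e.g. under contention), the mean shifts and the scheduler falls back to the CPU.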

    CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

    Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP’s best features, its flexibility: typical OpenMP applications can run on any number of CPUs. GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over only using CPUs or one GPU while increasing the overall flexibility of Accelerated OpenMP.
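    One common building block of such accelerator-aware runtimes is splitting a loop's iteration space among devices in proportion to their measured throughput. The sketch below is a hypothetical illustration of that idea, not CoreTSAR's actual API; device names and rates are invented.

    ```python
    def split_iterations(n, throughput):
        """Divide n iterations among devices proportionally to throughput
        (iterations/second). Illustrative sketch, not CoreTSAR's code."""
        total = sum(throughput.values())
        devices = list(throughput)
        shares = {}
        assigned = 0
        for dev in devices[:-1]:
            shares[dev] = int(n * throughput[dev] / total)
            assigned += shares[dev]
        # Give the remainder to the last device so every iteration is covered.
        shares[devices[-1]] = n - assigned
        return shares

    print(split_iterations(1000, {"cpu": 1.0, "gpu0": 4.0, "gpu1": 5.0}))
    # prints {'cpu': 100, 'gpu0': 400, 'gpu1': 500}
    ```

    Re-measuring throughput after each pass and recomputing the split is what lets such a runtime adapt to a mix of CPUs and multiple GPUs without programmer intervention.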

    StarPU: a unified runtime system for heterogeneous multicore architectures

    National audience. In conjunction with multicore processors, which are now ubiquitous, the use of specialized architectures such as graphics processors or the Cell is a strong trend in high-performance computing. Reaching the theoretical performance of these architectures is a difficult goal. While considerable effort has already been devoted to accelerators, using all computing resources simultaneously remains a real challenge. We therefore designed StarPU, an original runtime system that provides a unified execution model to exploit the machine's entire computing power while relieving the programmer of the difficulties of data management. StarPU also makes it easy to design portable and efficient scheduling strategies. We implemented several scheduling strategies that can be selected transparently at execution time, which allowed us to study the impact of scheduling on several linear algebra algorithms. Beyond a substantial reduction in execution times, StarPU achieves super-linear speedups thanks to its ability to take real advantage of the specific features of heterogeneous machines.

    Phase-based tuning: better utilized performance asymmetric multicores

    The latest trend towards performance asymmetry among cores on a single chip of a multicore processor is posing new challenges. For effective utilization of these performance-asymmetric multicore processors, code sections of a program must be assigned to cores such that the resource needs of code sections closely match resource availability at the assigned core. Determining this assignment manually is tedious, error prone, and significantly complicates software development. To solve this problem, this thesis describes a transparent and fully-automatic process called phase-based tuning which adapts an application to effectively utilize performance-asymmetric multicores. The basic idea behind this technique is to statically compute groups of program segments which are expected to behave similarly at runtime. Then, at runtime, the behavior of a few code segments is used to infer the behavior and preferred core assignment of all similar code segments with low overhead. Compared to the stock Linux scheduler, for systems asymmetric with respect to clock frequency, a 36% average process speedup is observed, while maintaining fairness and with negligible overheads. A key component of phase-based tuning is grouping program segments with similar behavior. The importance of various similarity metrics is likely to differ for each target asymmetric multicore processor. Determining groups using too many metrics may result in a grouping that differentiates between program segments based on irrelevant properties for a target machine. Using too few metrics may cause relevant metrics to be ignored, thereby treating segments with different behavior as similar. Therefore, to solve this problem and enable phase-based tuning for a wide range of performance-asymmetric multicores, this thesis also describes a new technique called lazy grouping. Lazy grouping statically (at compile and install times) groups program segments that are expected to have similar behavior. The basic idea is to use extensive compile-time analysis with intelligent install-time (when the target system is known) group assignment. The accuracy of lazy grouping for a wide range of machines is shown to be more than 90% for nearly all target machines and asymmetric multicores.
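    The grouping step can be pictured with a minimal sketch, assuming each program segment is summarized by a vector of behavior metrics (the segment names, metrics, and distance threshold below are invented for illustration and are not the thesis's actual algorithm). Segments whose vectors fall within a small distance of a group's first member share a group, and hence a core assignment.

    ```python
    def group_segments(features, threshold=0.1):
        """Group segments whose behavior vectors are close (illustrative
        sketch of the grouping idea, not the thesis's lazy-grouping code)."""
        groups = []  # list of (representative_vector, [segment_ids])
        for seg, vec in features.items():
            for rep, members in groups:
                dist = sum((a - b) ** 2 for a, b in zip(rep, vec)) ** 0.5
                if dist <= threshold:
                    members.append(seg)
                    break
            else:
                groups.append((vec, [seg]))
        return [members for _, members in groups]

    # Fabricated metric vectors, e.g. (cache-miss rate, branch rate).
    segs = {"loop_a": (0.90, 0.10), "loop_b": (0.92, 0.11),
            "loop_c": (0.10, 0.80)}
    print(group_segments(segs))  # loop_a and loop_b behave alike
    ```

    With such groups in hand, observing one member at runtime is enough to pick a preferred core for every segment in the same group, which is what keeps the runtime overhead low.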

    Parallel For Loops on Heterogeneous Resources

    In recent years, Graphics Processing Units (GPUs) have piqued the interest of researchers in scientific computing. Their immense floating point throughput and massive parallelism make them ideal for not just graphical applications, but many general algorithms as well. Load balancing applications and taking advantage of all computational resources in a machine is a difficult challenge, especially when the resources are heterogeneous. This dissertation presents the clUtil library, which vastly simplifies developing OpenCL applications for heterogeneous systems. The core focus of this dissertation lies in clUtil's ParallelFor construct and our novel PINA scheduler which can efficiently load balance work onto multiple GPUs and CPUs simultaneously.
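    The shape of a ParallelFor-style construct can be sketched as follows. This is a hypothetical host-thread toy, not clUtil's API: the real library dispatches OpenCL kernels to GPUs and CPUs, whereas this version only chunks an index range across a thread pool.

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def parallel_for(start, end, body, workers=4):
        """Run body(i) for i in [start, end), split into chunks across a
        worker pool. Illustrative sketch only, not clUtil's ParallelFor."""
        chunk = max(1, (end - start) // workers)
        ranges = [(i, min(i + chunk, end)) for i in range(start, end, chunk)]

        def run(r):
            return [body(i) for i in range(*r)]

        with ThreadPoolExecutor(max_workers=workers) as pool:
            out = []
            # pool.map preserves chunk order, so results come back in index order.
            for part in pool.map(run, ranges):
                out.extend(part)
        return out

    print(parallel_for(0, 8, lambda i: i * i))  # prints [0, 1, 4, 9, 16, 25, 36, 49]
    ```

    A scheduler such as the one the dissertation describes would additionally size each chunk per device based on observed performance rather than splitting the range evenly.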

    A framework for scientific computing with GPUs

    Dissertation for the degree of Master in Computer Engineering. Commodity hardware nowadays includes not only many-core CPUs but also Graphics Processing Units (GPUs) whose highly data-parallel computational capabilities have been growing at an exponential rate. This computational power can be used for purposes other than graphics-oriented applications, like processor-intensive algorithms as found in the scientific computing setting. This thesis proposes a framework that is capable of distributing computational jobs over a network of CPUs and GPUs alike. The source code for each job is an OpenCL kernel, and thus universal and independent from the specific architecture and CPU/GPU type where it will be executed. This approach releases the software developer from the burden of specific, customized revisions of the same applications for each type of processor/hardware, at the cost of a possibly sub-optimal but still very efficient solution. The proposed run-time scales up as more and more powerful computing resources become available, with no need to recompile the application. Experiments allowed us to conclude that, although performance improvement achievements clearly depend on the nature of the problem and how it is coded, speedups in a distributed system containing both GPUs and multi-core CPUs can be up to two orders of magnitude. Centro de Informática e Tecnologias da Informação (CITI), and Fundação para a Ciência e Tecnologia (FCT/MCTES): research projects PTDC/EIA/74325/2006, PTDC/EIA-EIA/108963/2008, PTDC/EIA-EIA/102579/2008, and PTDC/EIA-EIA/113613/200