    StarPU: a unified runtime system for heterogeneous multicore architectures

    In conjunction with the now-ubiquitous multicore processors, the use of specialized architectures such as graphics processors or the Cell is a strong trend in high-performance computing. Reaching the theoretical performance of these architectures is a difficult goal. While much effort has already been devoted to accelerators, using all computing resources simultaneously remains a real challenge. We have therefore designed StarPU, an original runtime system that provides a unified execution model to exploit the machine's entire computing power while freeing the programmer from the difficulties of data management. StarPU also makes it easy to design portable and efficient scheduling strategies. We implemented several scheduling strategies that can be selected transparently at execution time, which allowed us to study the impact of scheduling on several linear algebra algorithms. Beyond a substantial reduction in execution times, StarPU achieves super-linear speedups thanks to its ability to take real advantage of the specific features of heterogeneous machines.

    A unified runtime system for heterogeneous multicore architectures

    Approaching the theoretical performance of heterogeneous multicore architectures, equipped with specialized accelerators, is a challenging issue. Unlike regular CPUs that can transparently access the whole global memory address range, accelerators usually embed local memory on which they perform all their computations using a specific instruction set. While many research efforts have been devoted to offloading parts of a program onto such coprocessors, the real challenge is to find a programming model providing a unified view of all available computing units. In this paper, we present an original runtime system providing a high-level, unified execution model allowing seamless execution of tasks over the underlying heterogeneous hardware. The runtime is based on a hierarchical memory management facility and on a codelet scheduler. We demonstrate the efficiency of our solution with an LU decomposition on both homogeneous (a 3.8 speedup on 4 cores) and heterogeneous machines (95% efficiency). We also show that "granularity-aware" scheduling can improve execution time by 35%.
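
    To make the codelet and task notions above concrete, here is a minimal sketch against StarPU's public C API (a hedged illustration, not the paper's own code; identifiers follow recent StarPU releases and may differ slightly in the versions contemporary with this paper). A codelet wrapping a vector-scaling kernel is declared once, the data is registered with the runtime's memory manager, and a task is submitted for the scheduler to place on a processing unit:

        #include <starpu.h>

        /* CPU implementation of the codelet: scale a vector in place. */
        static void scal_cpu(void *buffers[], void *cl_arg)
        {
            struct starpu_vector_interface *vec = buffers[0];
            unsigned n   = STARPU_VECTOR_GET_NX(vec);
            float *x     = (float *)STARPU_VECTOR_GET_PTR(vec);
            float factor = *(float *)cl_arg;
            for (unsigned i = 0; i < n; i++)
                x[i] *= factor;
        }

        /* A codelet bundles the implementations with the data access modes. */
        static struct starpu_codelet scal_cl = {
            .cpu_funcs = { scal_cpu },
            .nbuffers  = 1,
            .modes     = { STARPU_RW },
        };

        int main(void)
        {
            float x[1024], factor = 2.0f;
            for (unsigned i = 0; i < 1024; i++) x[i] = 1.0f;

            if (starpu_init(NULL) != 0) return 1;

            /* Hand the vector over to the runtime's memory manager
             * (STARPU_MAIN_RAM is plain node 0 in older releases). */
            starpu_data_handle_t handle;
            starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                        (uintptr_t)x, 1024, sizeof(float));

            /* Submit one task; the scheduler picks the processing unit. */
            struct starpu_task *task = starpu_task_create();
            task->cl          = &scal_cl;
            task->handles[0]  = handle;
            task->cl_arg      = &factor;
            task->cl_arg_size = sizeof(factor);
            starpu_task_submit(task);

            starpu_task_wait_for_all();
            starpu_data_unregister(handle);
            starpu_shutdown();
            return 0;
        }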

    StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines

    Multicore machines equipped with accelerators are becoming increasingly popular. The TOP500-leading Roadrunner machine is probably the most famous example of a parallel computer mixing IBM Cell Broadband Engines and AMD Opteron processors. Other architectures, featuring GPU accelerators, are expected to appear in the near future. To fully tap into the potential of these hybrid machines, pure offloading approaches, in which the main part of the application runs on regular processors and offloads specific parts onto accelerators, are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we propose a new runtime system capable of scheduling tasks over heterogeneous, accelerator-based machines. Our system features a software virtual shared memory that provides a weak consistency model. The system keeps track of data copies within the accelerators' embedded memories and features a data-prefetching engine. Such facilities, together with a database of self-tuned per-task performance models, can be used to greatly improve the quality of scheduling policies in this context. We demonstrate the relevance of our approach by benchmarking various parallel numerical kernel implementations over our runtime system. We obtain significant speedups and a very high efficiency on various typical workloads over multicore machines equipped with multiple accelerators.
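
    The "database of self-tuned per-task performance models" mentioned above maps, in StarPU's C API, to a performance-model structure attached to each codelet. A minimal sketch, assuming the scal_cpu function from the previous sketch (a history-based model records measured execution times per data size and per processing unit, and feeds them back to cost-aware scheduling policies):

        /* History-based model: StarPU persists timings under this symbol
         * and interpolates them when estimating a new task's duration. */
        static struct starpu_perfmodel scal_model = {
            .type   = STARPU_HISTORY_BASED,
            .symbol = "scal",
        };

        static struct starpu_codelet scal_cl_modeled = {
            .cpu_funcs = { scal_cpu },
            .nbuffers  = 1,
            .modes     = { STARPU_RW },
            .model     = &scal_model,  /* consulted by cost-aware schedulers */
        };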

    Exploiting the Cell/BE architecture with the StarPU unified runtime system

    Core specialization is currently one of the most promising ways to design power-efficient multicore chips. However, approaching the theoretical peak performance of such heterogeneous multicore architectures, equipped with specialized accelerators, is a complex issue. While substantial effort has been devoted to efficiently offloading parts of the computation, designing an execution model that unifies all computing units is the main challenge. We therefore designed the StarPU runtime system to provide portable support for heterogeneous multicore processors to high-performance applications and compiler environments. StarPU provides a high-level, unified execution model which is tightly coupled to an expressive data management library. In addition to our previous results on using multicore processors alongside graphics processors, we show that StarPU is flexible enough to efficiently exploit the heterogeneous resources of the Cell processor. We present a scalable design supporting several different accelerators while minimizing the overhead on the overall system. Using experiments with classical linear algebra algorithms, we show that StarPU improves programmability and provides performance portability.
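
    The unified execution model rests on codelets carrying one implementation per kind of processing unit, so the scheduler is free to place a task on any of them. A hedged sketch using StarPU's CUDA backend as the accelerator side (the Cell backend discussed in the paper follows the same pattern but is not shown; scal_kernel_launcher is a hypothetical wrapper around the device kernel, scal_cpu is the CPU function from the earlier sketch, and the file is assumed to be built with CUDA support enabled):

        #include <starpu.h>

        void scal_cpu(void *buffers[], void *cl_arg);  /* as in the earlier sketch */

        /* Hypothetical helper launching the device kernel on a stream. */
        void scal_kernel_launcher(float *x, unsigned n, float factor,
                                  cudaStream_t stream);

        static void scal_cuda(void *buffers[], void *cl_arg)
        {
            struct starpu_vector_interface *vec = buffers[0];
            unsigned n   = STARPU_VECTOR_GET_NX(vec);
            /* With CUDA, the runtime hands us a device pointer here. */
            float *x     = (float *)STARPU_VECTOR_GET_PTR(vec);
            float factor = *(float *)cl_arg;
            scal_kernel_launcher(x, n, factor, starpu_cuda_get_local_stream());
        }

        /* One codelet, two implementations: the scheduler picks either. */
        static struct starpu_codelet scal_cl_hybrid = {
            .cpu_funcs  = { scal_cpu },
            .cuda_funcs = { scal_cuda },
            .cuda_flags = { STARPU_CUDA_ASYNC },  /* kernel launched asynchronously */
            .nbuffers   = 1,
            .modes      = { STARPU_RW },
        };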

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    GPU clusters are becoming widespread HPC platforms. Exploiting them is however challenging, as this requires two separate paradigms (MPI and CUDA or OpenCL) and careful load balancing due to node heterogeneity. Current paradigms usually either limit themselves to offloading part of the computation, leaving CPUs idle, or require static CPU/GPU work partitioning. We have thus previously proposed StarPU, a runtime system able to dynamically schedule tasks within a single heterogeneous node. We show how we extended the task paradigm of StarPU with MPI to easily map the task graph onto MPI clusters and automatically benefit from optimized execution.

    Data-Aware Task Scheduling on Multi-Accelerator based Platforms

    To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches, in which the main body of the application runs on regular cores while only specific parts are offloaded onto accelerators, are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a high-level programming interface and automates data transfers between processing units so as to enable dynamic task scheduling. We now present how we extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency over multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications over clusters featuring multiple GPUs per node, showing how our runtime system can combine with MPI.
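
    A hedged sketch of how such a transfer-cost-aware policy is selected through StarPU's C API (that the policy evaluated here is the one named "dmda" in released StarPU versions is an assumption; the same choice can also be made without recompiling, via the STARPU_SCHED environment variable):

        #include <starpu.h>

        int main(void)
        {
            /* "dmda" (deque model, data aware) places each task on the
             * unit minimizing predicted execution time plus predicted
             * data transfer time, using the per-task performance models. */
            struct starpu_conf conf;
            starpu_conf_init(&conf);
            conf.sched_policy_name = "dmda";
            if (starpu_init(&conf) != 0) return 1;

            /* ... register data and submit tasks as usual; the runtime
             * issues prefetches once task placement is decided ... */

            starpu_shutdown();
            return 0;
        }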

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    GPUs have largely entered HPC clusters, as shown by the top entries of the latest TOP500 list. Exploiting such machines is however very challenging, not only because two separate paradigms, MPI and CUDA or OpenCL, must be combined, but also because nodes are heterogeneous and thus require careful load balancing within the nodes themselves. Current paradigms are usually limited to offloading parts of the computation and leaving CPUs idle, or they require static work partitioning between CPUs and GPUs. To handle single-node architecture heterogeneity, we previously proposed StarPU, a runtime system capable of dynamically scheduling tasks in an optimized way on such machines. We show here how the task paradigm of StarPU has been combined with MPI communications, and how we extended the task paradigm itself to allow mapping the task graph onto MPI clusters so as to automatically achieve an optimized distributed execution. We show how a sequential-looking Cholesky source code can easily be extended into a scalable distributed parallel execution, already exhibiting a speedup of 5 on 6 nodes.
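
    A minimal sketch of the resulting programming style with StarPU-MPI's C API (hedged: identifiers follow StarPU releases from the 1.2 era, where the call is named starpu_mpi_task_insert, earlier releases spelled it starpu_mpi_insert_task; vec_incr_cl is a hypothetical codelet incrementing a vector, built like the single-node codelets above). Each data handle is given an owner rank and an MPI tag, after which inserting tasks on every node triggers the required transfers automatically:

        #include <starpu.h>
        #include <starpu_mpi.h>

        extern struct starpu_codelet vec_incr_cl;  /* hypothetical, takes no cl_arg */

        int main(int argc, char **argv)
        {
            int rank;
            starpu_init(NULL);
            starpu_mpi_init(&argc, &argv, 1);  /* 1: let StarPU call MPI_Init */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* The owner registers real memory; other ranks register a
             * placeholder (home node -1: no local copy yet). */
            float x[1024] = { 0.0f };
            starpu_data_handle_t handle;
            if (rank == 0)
                starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                            (uintptr_t)x, 1024, sizeof(float));
            else
                starpu_vector_data_register(&handle, -1, 0, 1024, sizeof(float));

            /* Attach an owner rank and an MPI tag; transfers are then
             * inferred from the task graph, with no explicit send/recv. */
            starpu_mpi_data_register(handle, /* tag */ 42, /* owner */ 0);

            /* Every rank inserts the same task; only the relevant ranks
             * actually execute or communicate. */
            starpu_mpi_task_insert(MPI_COMM_WORLD, &vec_incr_cl,
                                   STARPU_RW, handle, 0);

            starpu_task_wait_for_all();
            starpu_data_unregister(handle);
            starpu_mpi_shutdown();
            starpu_shutdown();
            return 0;
        }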

    Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators.

    Although hardware has dramatically changed over the last few years, nodes combining multicore chips with Graphics Processing Units (GPUs) seem to be a trend of major importance. Previous approaches to scheduling dense linear algebra operations on such a complex node achieved high performance, but at the double cost of leaving part of the cores' potential unused and of producing static, non-generic code. In this extended abstract, we present a new approach for scheduling dense linear algebra operations on multicore architectures with GPU accelerators using a dynamic scheduler capable of using the full potential of the node [1]. We highlight the benefits both in terms of programmability and performance. We illustrate our approach with a Cholesky factorization relying on cutting-edge GPU and CPU kernels [2], [3], achieving roughly 900 Gflop/s on an eight-core node accelerated with three NVIDIA Tesla GPUs.
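
    As an illustration of the programmability claim, a tiled Cholesky factorization reduces, with a task-insertion API in the style of StarPU's C interface, to a few sequential-looking loops; the runtime infers inter-task dependencies from the data access modes and schedules the resulting DAG across cores and GPUs. A hedged sketch (the potrf/trsm/syrk/gemm codelets wrapping the CPU and GPU kernels are assumptions, taken to be defined elsewhere):

        #include <starpu.h>

        /* Tile codelets wrapping the BLAS/LAPACK kernels; assumed defined
         * elsewhere with both CPU and CUDA implementations. */
        extern struct starpu_codelet potrf_cl, trsm_cl, syrk_cl, gemm_cl;

        /* Submit the task graph of a tiled Cholesky factorization of the
         * lower-triangular part; A[i][j] are handles on the NT x NT tiles.
         * Dependencies are inferred from the R/RW access modes. */
        void cholesky_submit(starpu_data_handle_t **A, unsigned NT)
        {
            for (unsigned k = 0; k < NT; k++) {
                starpu_task_insert(&potrf_cl, STARPU_RW, A[k][k], 0);
                for (unsigned i = k + 1; i < NT; i++)
                    starpu_task_insert(&trsm_cl,
                                       STARPU_R, A[k][k],
                                       STARPU_RW, A[i][k], 0);
                for (unsigned i = k + 1; i < NT; i++) {
                    starpu_task_insert(&syrk_cl,
                                       STARPU_R, A[i][k],
                                       STARPU_RW, A[i][i], 0);
                    for (unsigned j = k + 1; j < i; j++)
                        starpu_task_insert(&gemm_cl,
                                           STARPU_R, A[i][k],
                                           STARPU_R, A[j][k],
                                           STARPU_RW, A[i][j], 0);
                }
            }
            starpu_task_wait_for_all();
        }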

    Scheduling Tasks over Multicore Machines Enhanced with Accelerators: a Runtime System’s Perspective

    Although accelerators have become an integral part of high-performance computing, the performance gains come at a direct cost in programmability, so that a runtime system offering portable abstractions is essential to fully and portably exploit all the available computing power despite the complexity of the underlying machine. In this thesis, we propose a runtime system model offering an expressive interface that addresses the challenges raised in terms of scheduling and data management. We demonstrate the relevance of our approach with the StarPU platform, designed in the course of this thesis.

    Multicore machines equipped with accelerators are becoming increasingly popular in the High Performance Computing ecosystem. Hybrid architectures provide significantly improved energy efficiency, so that they are likely to generalize in the Manycore era. However, the complexity introduced by these architectures has a direct impact on programmability, so that it is crucial to provide portable abstractions in order to fully tap into the potential of these machines. Pure offloading approaches, which consist in running an application on regular processors while offloading predetermined parts of the code on accelerators, are not sufficient. The real challenge is to build systems where the application would be spread across the entire machine, that is, where computation would be dynamically scheduled over the full set of available processing units. In this thesis, we thus propose a new task-based model of runtime system specifically designed to address the numerous challenges introduced by hybrid architectures, especially in terms of task scheduling and of data management. In order to demonstrate the relevance of this model, we designed the StarPU platform. It provides an expressive interface along with flexible task scheduling capabilities tightly coupled to an efficient data management. Using these facilities, together with a database of auto-tuned per-task performance models, it for instance becomes straightforward to develop efficient scheduling policies that take into account both computation and communication costs. We show that our task-based model is not only powerful enough to provide support for clusters, but also to scale on hybrid manycore architectures. We analyze the performance of our approach on both synthetic and real-life workloads, and show that we obtain significant speedups and a very high efficiency on various types of multicore platforms enhanced with accelerators.