146 research outputs found

    GLT: A Unified API for Lightweight Thread Libraries

    Get PDF
    In recent years, several lightweight thread (LWT) libraries have emerged to tackle exascale challenges. These offer programming models (PMs) based on user-level threads and incorporate their own lightweight mechanisms. However, each library proposes its own PM, exposing different semantics and hindering portability. To address this drawback, we have designed Generic Lightweight Thread (GLT), an application programming interface that frames the functionality of the most popular LWT libraries for high-performance computing under a single PM. We implement GLT on top of Argobots, MassiveThreads, and Qthreads. We provide GLT as a dynamic library, as well as in the form of a static version based on macro preprocessing resolution to reduce overhead. This paper discusses the GLT PM and demonstrates its minimal performance impact.Researchers from the Universitat Jaume I de Castelló were supported by project TIN2014-53495-R of the MINECO, the Generalitat Valenciana fellowship programme Vali+d 2015, and FEDER. Antonio J. Peña is cofinancied by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DE-AC02-06CH11357.Peer ReviewedPostprint (author's final draft

    StarPU : un support exécutif unifié pour les architectures multicoeurs hétérogènes

    Get PDF
    National audienceEn conjonction avec les processeurs multicoeurs, désormais omniprésents, l'utilisation d'architectures spécialisées telles que les processeurs graphiques ou le Cell est une tendance forte du calcul haute performance. Atteindre les performances théoriques de ces architectures est un objectif difficile. Si de nombreux efforts ont d'ores et déjà été portés sur les accélérateurs, l'utilisation de toutes les ressources de calcul, simultanément, reste un véritable défi. Nous avons donc conçu StarPU, un support exécutif original qui fournit un modèle d'exécution unifié afin d'exploiter l'intégralité de la puissance de calcul tout en s'affranchissant des difficultés liées à la gestion des données. StarPU offre par ailleurs la possibilité de concevoir facilement des stratégies d'ordonnancement portables et efficaces. Nous avons mis en oeuvre quelques stratégies d'ordonnancement sélectionnables de manière transparente lors de l'exécution. Cela nous a permis d'étudier l'impact de l'ordonnancement sur quelques algorithmes d'algèbre linéaire. Au-delà d'une réduction substantielle des temps d'exécution, StarPU obtient des accélérations super-linéaires grâce à sa capacité à tirer un réel avantage des spécificités des machines hétérogènes

    Vers des supports d'exécution capables d'exploiter les machines multicœurs hétérogènes

    Get PDF
    Approaching the theoretical performance of heterogeneous multicore architectures, equipped with specialized accelerators, is a challenging issue. Unlike regular CPUs that can transparently access the whole global memory address range, accelerators usually embed local memory on which they perform all their computations using a specific instruction set. While many research efforts have been devoted to offloading parts of a program over such coprocessors, the real challenge is to find a programming model providing a unified view of all available computing units. In this document, we present an original runtime system providing a high-level, unified execution model allowing seamless execution of tasks over the underlying heterogeneous hardware. The runtime is based on a hierarchical memory management facility and on a codelet scheduler. We demonstrate the efficiency of our solution with a LU decomposition for both homogeneous (3.8 speedup on 4 cores) and heterogeneous machines (95% efficiency). We also show that a "granularity aware" scheduling can improve execution time by 35%

    Scheduling data flow program in xkaapi: A new affinity based Algorithm for Heterogeneous Architectures

    Get PDF
    Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by computations, it is essential to reduce the total volume of communicated data. The literature therefore abounds with ad-hoc methods to reach that balance, but that are architecture and application dependent. We propose here a generic mechanism to automatically optimize the scheduling between CPUs and GPUs, and compare two strategies within this mechanism: the classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new, parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which consists in grouping the tasks by affinity before running a fast dual approximation. We ran experiments on a heterogeneous parallel machine with six CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra kernels from the PLASMA library have been ported on top of the Xkaapi runtime. We report their performances. It results that HEFT and DADA perform well for various experimental conditions, but that DADA performs better for larger systems and number of GPUs, and, in most cases, generates much lower data transfers than HEFT to achieve the same performance

    A unified runtime system for heterogeneous multicore architectures

    Get PDF
    International audienceApproaching the theoretical performance of heterogeneous multicore architectures, equipped with specialized accelerators, is a challenging issue. Unlike regular CPUs that can transparently access the whole global memory address range, accelerators usually embed local memory on which they perform all their computations using a specific instruction set. While many research efforts have been devoted to offloading parts of a program over such coprocessors, the real challenge is to find a programming model providing a unified view of all available computing units. In this paper, we present an original runtime system providing a high-level, unified execution model allowing seamless execution of tasks over the underlying heterogeneous hardware. The runtime is based on a hierarchical memory management facility and on a codelet scheduler. We demonstrate the efficiency of our solution with a LU decomposition for both homogeneous (3.8 speedup on 4 cores) and heterogeneous machines (95% efficiency). We also show that a "granularity aware" scheduling can improve execution time by 35%

    StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines

    Get PDF
    Multicore machines equipped with accelerators are becoming increasingly popular. The TOP500-leading RoadRunner machine is probably the most famous example of a parallel computer mixing IBM Cell Broadband Engines and AMD opteron processors. Other architectures, featuring GPU accelerators, are expected to appear in the near future. To fully tap into the potential of these hybrid machines, pure offloading approaches, in which the main core of the application runs on regular processors and offloads specific parts on accelerators, are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we propose a new runtime system capable of scheduling tasks over heterogeneous, accelerator-based machines. Our system features a software virtual shared memory that provides a weak consistency model. The system keeps track of data copies within accelerator embedded-memories and features a data-prefetching engine. Such facilities, together with a database of self-tuned per-task performance models, can be used to greatly improve the quality of scheduling policies in this context. We demonstrate the relevance of our approach by benchmarking various parallel numerical kernel implementations over our runtime system. We obtain significant speedups and a very high efficiency on various typical workloads over multicore machines equipped with multiple accelerators

    Мониторинг резистентности грамотрицательных возбудителей, выделенных у пациентов с хирургическими инфекциями кожи и мягких тканей

    Get PDF
    МЯГКИЕ ТКАНИХИРУРГИЧЕСКИЕ ИНФЕКЦИИГРАМОТРИЦАТЕЛЬНЫЕ БАКТЕРИИГРАМОТРИЦАТЕЛЬНЫЕ БАКТЕРИАЛЬНЫЕ ИНФЕКЦИИРЕЗИСТЕНТНОСТЬ БАКТЕРИЙ К АНТИБИОТИКАМНОЗОКОМИАЛЬНАЯ ИНФЕКЦИЯСИНЕГНОЙНАЯ ПАЛОЧКААЦИНЕТОБАКТЕ

    Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures

    Get PDF
    International audienceMulticore architectures featuring specialized accelerators are getting an increasing amount of attention, and this success will probably influence the design of future High Performance Computing hardware. Unfortunately, programmers are actually having a hard time trying to exploit all these heterogeneous computing units efficiently, and most existing efforts simply focus on providing tools to offload some computations on available accelerators. Recently, some runtime systems have been designed that exploit the idea of scheduling -- as opposed to offloading -- parallel tasks over the whole set of heterogeneous computing units. Scheduling tasks over heterogeneous platforms makes it necessary to use accurate prediction models in order to assign each task to its most adequate computing unit. A deep knowledge of the application is usually required to model per-task performance models, based on the algorithmic complexity of the underlying numeric kernel. We present an alternate, auto-tuning performance prediction approach based on performance history tables dynamically built during the application run. This approach does not require that the programmer provides some specific information. We show that, thanks to the use of a carefully chosen hash-function, our approach quickly achieves accurate performance estimations automatically. Our approach even outperforms regular algorithmic performance models with several linear algebra numerical kernels

    TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism

    Get PDF
    As chip multi-processors (CMPs) are becoming more and more complex, software solutions such as parallel programming models are attracting a lot of attention. Task-based parallel programming models offer an appealing approach to utilize complex CMPs. However, the increasing number of cores on modern CMPs is pushing research towards the use of fine grained parallelism. Task-based programming models need to be able to handle such workloads and offer performance and scalability. Using specialized hardware for boosting performance of task-based programming models is a common practice in the research community. Our paper makes the observation that task creation becomes a bottleneck when we execute fine grained parallel applications with many task-based programming models. As the number of cores increases the time spent generating the tasks of the application is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX offers a solution for minimizing task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. We draw the requirements for this hardware in order to boost execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric multicore systems, we obtain performance improvements up to 15×, averaging to 3.1× over the baseline.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 671697 and No. 779877. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, the authors would like to thank Thomas Grass for his valuable help with the simulator.Peer ReviewedPostprint (author's final draft
    corecore