14 research outputs found

    Data-Aware Task Scheduling on Multi-Accelerator based Platforms

    Get PDF
    To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches in which the main trunk of the application runs on regular cores while only specific parts are offloaded to accelerators are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a high-level programming interface and automates data transfers between processing units so as to enable a dynamic scheduling of tasks. We now present how we have extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency over multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications over clusters featuring multiple GPUs per node, showing how our runtime system can be combined with MPI.
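    The transfer-cost-aware scheduling idea can be pictured with a small sketch. The code below is illustrative only; the structs, the predict_* helpers and all numbers are hypothetical, not StarPU's API. A task is placed on the processing unit that minimizes its predicted completion time, where the prediction adds the estimated cost of moving the data that unit is missing to the estimated execution time, as the abstract describes.

```c
#include <float.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch, not StarPU's API: pick the processing unit that
 * minimizes the predicted completion time, including the cost of moving
 * the data the unit does not yet hold. */

struct worker {
    int    is_gpu;
    double ready_at;            /* when this unit becomes free (s)          */
};

struct task {
    size_t bytes_missing[8];    /* data to transfer if run on worker i (B)  */
};

/* Toy cost models (assumed numbers); a real runtime calibrates these. */
static double predict_exec_time(const struct task *t, const struct worker *w)
{
    (void)t;
    return w->is_gpu ? 0.2e-3 : 1.0e-3;     /* assumed kernel times (s)     */
}

static double predict_transfer_time(size_t bytes, const struct worker *w)
{
    double bw = w->is_gpu ? 8e9 : 20e9;     /* assumed PCIe vs RAM bytes/s  */
    return (double)bytes / bw;
}

static int pick_worker(const struct task *t, const struct worker *workers, int n)
{
    int best = -1;
    double best_end = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double end = workers[i].ready_at
                   + predict_transfer_time(t->bytes_missing[i], &workers[i])
                   + predict_exec_time(t, &workers[i]);
        if (end < best_end) { best_end = end; best = i; }
    }
    return best;    /* index of the unit with the earliest predicted finish */
}

int main(void)
{
    struct worker w[2] = { { .is_gpu = 0, .ready_at = 0.0 },
                           { .is_gpu = 1, .ready_at = 0.0 } };
    struct task t = { .bytes_missing = { 0, 64u << 20 } };  /* GPU lacks 64 MiB */
    printf("run on worker %d\n", pick_worker(&t, w, 2));    /* data locality wins */
    return 0;
}
```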

    A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures

    Full text link
    In order to improve system performance efficiently, many systems are equipped with both multi-core and many-core processors (such as GPUs). Due to their discrete memories, these heterogeneous architectures comprise a distributed system within a computer. A data-flow programming model is attractive in this setting for its ease of expressing concurrency: programmers only need to define task dependencies without considering how to schedule them on the hardware. However, mapping the resulting task graph onto hardware efficiently remains a challenge. In this paper, we propose a graph-partition scheduling policy for mapping data-flow workloads to heterogeneous hardware. According to our experiments, our graph-partition-based scheduling achieves performance comparable to conventional queue-based approaches. Comment: Presented at the DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241).
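    As a toy illustration of the general idea (not the paper's actual algorithm), the sketch below places each task of a small data-flow graph on the device that already holds the largest volume of its inputs, a greedy way of reducing the edge cut, i.e. cross-device transfers. Task indices, edge weights and device count are made up; a real policy would also balance load across devices.

```c
#include <stdio.h>

/* Greedy graph-partition-style placement: each task follows its data.
 * Everything here is illustrative. */

#define NTASKS   6
#define NDEVICES 2

/* bytes[i][j] = data produced by task i and consumed by task j (0 = no edge). */
static const long bytes[NTASKS][NTASKS] = {
    {0, 4096, 4096, 0,    0,    0},
    {0, 0,    0,    8192, 0,    0},
    {0, 0,    0,    0,    8192, 0},
    {0, 0,    0,    0,    0,    2048},
    {0, 0,    0,    0,    0,    2048},
    {0, 0,    0,    0,    0,    0},
};

int main(void)
{
    int place[NTASKS];
    /* Tasks are assumed to be indexed in a topological order. */
    for (int t = 0; t < NTASKS; t++) {
        long local[NDEVICES] = {0};
        for (int p = 0; p < t; p++)              /* predecessors already placed */
            local[place[p]] += bytes[p][t];
        int best = t % NDEVICES;                 /* round-robin for source tasks */
        for (int d = 0; d < NDEVICES; d++)
            if (local[d] > local[best]) best = d;
        place[t] = best;                         /* greedy: follow the data      */
        printf("task %d -> device %d\n", t, best);
    }
    return 0;
}
```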

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    Get PDF
    GPUs have largely entered HPC clusters, as shown by the top entries of the latest TOP500 list. Exploiting such machines is however very challenging, not only because two separate paradigms, MPI and CUDA or OpenCL, must be combined, but also because nodes are heterogeneous and thus require careful load balancing within the nodes themselves. Current paradigms are usually limited to offloading only parts of the computation while leaving CPUs idle, or they require static work partitioning between CPUs and GPUs. To handle single-node architecture heterogeneity, we previously proposed StarPU, a runtime system capable of dynamically scheduling tasks in an optimized way on such machines. We show here how the task paradigm of StarPU has been combined with MPI communications, and how we extended the task paradigm itself to allow mapping the task graph onto MPI clusters so as to automatically achieve an optimized distributed execution. We show how a sequential-like Cholesky source code can easily be extended into a scalable distributed parallel execution, already exhibiting a speedup of 5 on 6 nodes.
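    As a rough sketch of the programming model the abstract describes, the code below uses the StarPU-MPI task-insertion interface as documented (exact signatures may vary between StarPU releases); the block-cyclic distribution and the codelet are invented for the example. Every rank registers the data it owns, declares the owner of every handle, and then submits the same sequential-looking stream of tasks; the runtime runs each task on the owner of its data and posts the required MPI transfers itself.

```c
#include <stdint.h>
#include <starpu.h>
#include <starpu_mpi.h>

#define NBLOCKS 8
#define BLOCK   1024

static void scale_cpu(void *buffers[], void *arg)
{
    (void)arg;
    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 0; i < n; i++) v[i] *= 2.0f;   /* toy kernel */
}

static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(int argc, char **argv)
{
    int rank, size;
    float blocks[NBLOCKS][BLOCK];
    starpu_data_handle_t h[NBLOCKS];

    starpu_init(NULL);
    starpu_mpi_init(&argc, &argv, 1);            /* also initializes MPI */
    starpu_mpi_comm_rank(MPI_COMM_WORLD, &rank);
    starpu_mpi_comm_size(MPI_COMM_WORLD, &size);

    for (int b = 0; b < NBLOCKS; b++) {
        int owner = b % size;                    /* block-cyclic ownership */
        if (rank == owner) {
            for (int i = 0; i < BLOCK; i++) blocks[b][i] = (float)b;
            starpu_vector_data_register(&h[b], STARPU_MAIN_RAM,
                                        (uintptr_t)blocks[b], BLOCK, sizeof(float));
        } else {
            starpu_vector_data_register(&h[b], -1, 0, BLOCK, sizeof(float));
        }
        starpu_mpi_data_register(h[b], b, owner);  /* MPI tag, owning rank */
    }

    /* Same loop on every rank: tasks run on the owner, transfers are implicit. */
    for (int b = 0; b < NBLOCKS; b++)
        starpu_mpi_task_insert(MPI_COMM_WORLD, &scale_cl, STARPU_RW, h[b], 0);

    starpu_task_wait_for_all();
    for (int b = 0; b < NBLOCKS; b++)
        starpu_data_unregister(h[b]);
    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}
```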

    Task-Based Fast Multipole Method for Clusters of Multicore Processors

    Get PDF
    Most high-performance scientific libraries have adopted hybrid parallelization schemes, such as the popular MPI+OpenMP hybridization, to benefit from the capacities of modern distributed-memory machines. While these approaches have been shown to achieve high performance, they require a lot of effort to design and maintain sophisticated synchronization/communication strategies. On the other hand, task-based programming paradigms aim at delegating this burden to a runtime system for maximizing productivity. In this article, we assess the potential of task-based fast multipole methods (FMM) on clusters of multicore processors. We propose both a hybrid MPI+task FMM parallelization and a pure task-based parallelization where the MPI communications are implicitly handled by the runtime system. The latter approach yields a very compact code following a sequential task-based programming model. We show that task-based approaches can compete with a highly optimized hybrid MPI+OpenMP code and, furthermore, that the compact task-based scheme fully matches the performance of the sophisticated hybrid MPI+task version, ensuring performance while maximizing productivity. We illustrate our discussion with the ScalFMM FMM library and the StarPU runtime system.
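    The shape of the task graph matters more here than the kernels themselves, so the following sketch is entirely illustrative: a toy quadtree, stub operators printed in submission order, and no ScalFMM or StarPU calls. It only shows how the FMM passes unfold as a sequential stream of tasks whose read/write accesses on cell expansions would let a runtime infer the dependencies and, in the distributed variant, the communications.

```c
#include <stdio.h>

/* Stub "task submissions": each call prints the task it stands for.
 * The access-mode comments indicate what a runtime would track. */

#define LEVELS 4

static void p2m(int l, int c) { printf("P2M  l=%d c=%d\n", l, c); }
static void m2m(int l, int c) { printf("M2M  l=%d c=%d\n", l, c); }
static void m2l(int l, int c) { printf("M2L  l=%d c=%d\n", l, c); }
static void l2l(int l, int c) { printf("L2L  l=%d c=%d\n", l, c); }
static void l2p(int c)        { printf("L2P  c=%d\n", c); }
static void p2p(int c)        { printf("P2P  c=%d\n", c); }

static int ncells(int level) { return 1 << (2 * level); }   /* quadtree toy */

int main(void)
{
    int leaf = LEVELS - 1;
    for (int c = 0; c < ncells(leaf); c++) p2m(leaf, c);      /* RW leaf multipole    */
    for (int l = leaf - 1; l >= 1; l--)                       /* upward pass          */
        for (int c = 0; c < ncells(l); c++) m2m(l, c);        /* R children, RW cell  */
    for (int l = 1; l <= leaf; l++)                           /* transfer pass        */
        for (int c = 0; c < ncells(l); c++) m2l(l, c);        /* R interaction list   */
    for (int l = 1; l < leaf; l++)                            /* downward pass        */
        for (int c = 0; c < ncells(l); c++) l2l(l, c);        /* R cell, RW children  */
    for (int c = 0; c < ncells(leaf); c++) { l2p(c); p2p(c); } /* far + near field    */
    return 0;
}
```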

    Development of a multi-core and multi-accelerator platform for approximate computing

    Get PDF
    Graduation project (Licentiate in Electronics Engineering), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería Electrónica, 2017. The changing environment of current technologies has introduced a gap between the ever-growing needs of users and the state of present designs. As data-intensive and computation-heavy applications advance, the current trend reaches for ever greater performance. Approximate computing enters this scheme to boost a system's overall attributes by exploiting intrinsic error-tolerant characteristics both in software and hardware. This work proposes a multi-core and multi-accelerator platform design that uses both exact and approximate versions, also providing interaction with a software counterpart to ensure usage of both layouts. A set of five approximate accelerator versions and one exact version is presented for three different image processing filters (Laplace, Sobel and Gauss), along with their respective characterization in terms of power, area and delay time; design versions 2 and 3 show the best results. Three different interface designs for the accelerators are then presented, together with a softcore processor, Altera's NIOS II. The results gathered demonstrate a definitive improvement when using approximate accelerators in comparison with software and exact-accelerator implementations. Memory access and filter operation times, for two different matrix sizes, show gains of 500, 2000 and 1500 cycles for the Laplace, Gauss and Sobel filters respectively when contrasted with software times, and decreases in the range of 28-84, 20-40 and 68-100 ticks against the use of an exact accelerator.
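    To make the exact-versus-approximate trade-off concrete, here is a small software sketch of one of the three filters. The approximation used, replacing the Euclidean gradient magnitude by |gx| + |gy|, is a common textbook simplification chosen for illustration, not necessarily the one implemented in the thesis's accelerator versions.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* 3x3 Sobel at pixel (x, y) of a w-wide 8-bit image.
 * approx != 0 uses the cheaper |gx| + |gy| magnitude, which in hardware
 * drops a multiplier and a square root at a small accuracy cost. */
static int sobel(const unsigned char *img, int w, int x, int y, int approx)
{
    int gx = -img[(y-1)*w + (x-1)] +   img[(y-1)*w + (x+1)]
           - 2*img[ y   *w + (x-1)] + 2*img[ y   *w + (x+1)]
           -   img[(y+1)*w + (x-1)] +   img[(y+1)*w + (x+1)];
    int gy = -img[(y-1)*w + (x-1)] - 2*img[(y-1)*w + x] - img[(y-1)*w + (x+1)]
           +   img[(y+1)*w + (x-1)] + 2*img[(y+1)*w + x] + img[(y+1)*w + (x+1)];
    int m = approx ? abs(gx) + abs(gy)
                   : (int)sqrt((double)(gx * gx + gy * gy));
    return m > 255 ? 255 : m;                    /* clamp to 8-bit output */
}

int main(void)
{
    unsigned char img[25];
    for (int i = 0; i < 25; i++)                 /* 5x5 diagonal ramp test tile */
        img[i] = (unsigned char)((i % 5 + i / 5) * 20);
    printf("exact=%d approx=%d\n",
           sobel(img, 5, 2, 2, 0), sobel(img, 5, 2, 2, 1));
    return 0;
}
```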

    Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model

    Get PDF
    The emergence of accelerators as standard computing resources on supercomputers and the subsequent architectural complexity increase revived the need for high-level parallel programming paradigms. The sequential task-based programming model has been shown to efficiently meet this challenge on a single multicore node, possibly enhanced with accelerators, which motivated its support in the OpenMP 4.0 standard. In this paper, we show that this paradigm can also be employed to achieve high performance on modern supercomputers composed of multiple such nodes, with extremely limited changes in the user code. To prove this claim, we have extended the StarPU runtime system with an advanced inter-node data management layer that supports this model by posting communications automatically. We illustrate our discussion with the task-based tile Cholesky algorithm that we implemented on top of this new runtime system layer. We show that it enables very high productivity while achieving performance competitive with both the pure Message Passing Interface (MPI)-based ScaLAPACK Cholesky reference implementation and the DPLASMA Cholesky code, which implements another (non-sequential) task-based programming paradigm.
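    The sequential task-based tile Cholesky mentioned in the abstract boils down to a plain loop nest in which every kernel call is a task submission. The sketch below only prints the task stream (the tile count and stub names are illustrative); the point is that a runtime reconstructs the DAG, and in the distributed case the communications, from the tiles each task reads and writes.

```c
#include <stdio.h>

/* Right-looking tile Cholesky written as a sequential task stream.
 * The stubs stand for task submissions; access modes are noted in comments. */

#define NT 4   /* tiles per dimension (assumed) */

static void potrf(int k)               { printf("POTRF(%d)\n", k); }           /* RW A[k][k]                    */
static void trsm (int k, int i)        { printf("TRSM(%d,%d)\n", k, i); }       /* R A[k][k], RW A[i][k]         */
static void syrk (int k, int i)        { printf("SYRK(%d,%d)\n", k, i); }       /* R A[i][k], RW A[i][i]         */
static void gemm (int k, int i, int j) { printf("GEMM(%d,%d,%d)\n", k, i, j); } /* R A[i][k], A[j][k], RW A[i][j] */

int main(void)
{
    for (int k = 0; k < NT; k++) {
        potrf(k);
        for (int i = k + 1; i < NT; i++) trsm(k, i);
        for (int i = k + 1; i < NT; i++) {
            syrk(k, i);
            for (int j = k + 1; j < i; j++) gemm(k, i, j);
        }
    }
    return 0;
}
```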

    Portability and performance in heterogeneous many core Systems

    Get PDF
    Master's dissertation in Informatics. Current computing systems have a multiplicity of computational resources with different architectures, such as multi-core CPUs and GPUs. These platforms are known as heterogeneous many-core systems (HMS), and as computational resources evolve they offer more parallelism as well as becoming more heterogeneous. Exploiting these devices requires the programmer to be aware of the multiplicity of associated architectures, computing models and development frameworks. Portability issues, disjoint memory address spaces, work distribution and irregular workload patterns are major examples that need to be tackled in order to efficiently exploit the computational resources of an HMS. This dissertation's goal is to design and evaluate a base architecture that enables the identification and preliminary evaluation of the potential bottlenecks and limitations of a runtime system that addresses HMS. It proposes a runtime system that eases the programmer's burden of handling all the devices available in a heterogeneous system. The runtime provides a programming and execution model with a unified address space managed by a data management system. An API is proposed to enable the programmer to express applications and data in an intuitive way. Four different scheduling approaches are evaluated, combining different data partitioning mechanisms with different work assignment policies, and a performance model is used to provide performance insights to the scheduler. The runtime's efficiency was evaluated with three different applications (matrix multiplication, image convolution and an n-body Barnes-Hut simulation) running on multicore CPUs and GPUs. In terms of productivity the results look promising; however, combining scheduling and data partitioning revealed some inefficiencies that compromise load balancing and need to be revised, as does the data management system, which plays a crucial role in such systems. Performance-model-driven decisions were also evaluated, revealing that the accuracy of the performance model is also a compromising component.
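    One of the ingredients described above, a performance model steering work distribution, can be sketched in a few lines. The device throughputs and problem size below are invented and the code is not the dissertation's runtime API; it merely splits a contiguous index range proportionally to measured speeds so that all devices are predicted to finish together.

```c
#include <stdio.h>

/* Proportional (performance-model-driven) partitioning of a 1-D index range
 * across heterogeneous devices. All numbers are assumed for illustration. */

struct device { const char *name; double items_per_s; };

int main(void)
{
    struct device dev[] = { {"cpu", 2.0e8}, {"gpu0", 9.0e8}, {"gpu1", 9.0e8} };
    const int  ndev   = 3;
    const long nitems = 1L << 24;            /* total work items (assumed)   */

    double total = 0;
    for (int d = 0; d < ndev; d++) total += dev[d].items_per_s;

    long start = 0;
    for (int d = 0; d < ndev; d++) {         /* contiguous partitioning      */
        long share = (d == ndev - 1)
                   ? nitems - start          /* last device takes the rest   */
                   : (long)(nitems * dev[d].items_per_s / total);
        printf("%-5s gets [%ld, %ld)\n", dev[d].name, start, start + share);
        start += share;
    }
    return 0;
}
```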

    Feasibility Analysis of Implementing a Communication Protocol on a Multicore Processor

    Get PDF
    The work performed as part of this Master's thesis was done in the context of an industrially sponsored project. The objective is to understand the runtime behavior of a class of systems in specific contexts. We place this project at the intersection of the principles of task scheduling, runtime systems, Network Functions Virtualization (NFV) and, especially, the constraints associated with the virtualization of an LTE (Long Term Evolution) stack, the most prominent cellular telecommunication standard at the moment. A literature review is proposed to explain these concepts in detail, in order to give a clear idea of the target environment. First, a study of a real-time processing cluster is carried out in relation to the implementation of the so-called Cloud Radio Access Network (C-RAN), which supports on a cloud platform the electronics that perform the signal processing required for a cellular access point. The study developed in this thesis evaluates the various bottlenecks that can occur from the receipt of an LTE packet within a Common Public Radio Interface (CPRI) frame to the sending of this packet from a master server to the slaves. We evaluate the latencies and bandwidths observed for the different protocols used by the platform components. In particular, we characterize the CPRI communications from the antennas to the pool of virtual base stations, a Quick Path Interconnect (QPI) communication between processing cores and an FPGA, a dedicated point-to-point communication between the FPGA and a NIC (Network Interface Card), ending with the sending of Ethernet frames to the slave servers. This study allows us to infer that the virtualization of an LTE stack is viable on a real-time computation cluster with the implied architecture. Then, to validate the effectiveness of different scheduling algorithms, an emulation of an LTE uplink stack virtualization is performed. Using a runtime system called StarPU coupled with profiling tools, we deliver results to assess the need for dedicated threads or cores to manage tasks within a server.

    Scheduling (ir)regular applications on heterogeneous platforms

    Get PDF
    Master's dissertation in Informatics Engineering. Current computational platforms have become increasingly heterogeneous and parallel over recent years, as a consequence of incorporating accelerators whose architectures are parallel and different from the CPU. As a result, several frameworks have been developed to aid in programming these platforms, mainly targeting better productivity. In this context, the GAMA framework is being developed by the research group involved in this work, targeting both regular and irregular algorithms to run efficiently on heterogeneous platforms. Scheduling is a key issue for GAMA-like frameworks. The state-of-the-art solutions for scheduling on heterogeneous platforms are efficient for regular applications but lack adequate mechanisms for irregular ones. The scheduling of irregular applications is particularly complex due to the unpredictability of, and the differences in, the execution time of their composing computational tasks. This dissertation comprises the design and validation of a dynamic scheduler's model and implementation, to simultaneously address regular and irregular algorithms. The devised scheduling mechanism is validated within the GAMA framework, running relevant scientific algorithms, which include SAXPY, the Fast Fourier Transform and two n-body solvers. The proposed mechanism is validated with respect to its efficiency in finding good scheduling decisions, and to the efficiency and scalability of GAMA when using it. The results show that the model of the devised dynamic scheduler is capable of working on heterogeneous systems with high efficiency and of finding good scheduling decisions in the general tested cases. It not only reaches the scheduling decision that reflects the real capacity of the devices in the platform, but also enables GAMA to achieve more than 100% efficiency, as defined in [3], when running a relevant irregular scientific algorithm. Under the devised scheduling model, GAMA was also able to beat efficient CPU and GPU libraries for SAXPY, an important scientific algorithm. GAMA's scalability under the devised dynamic scheduler was also demonstrated, properly leveraging the platform's computational resources in trials with one quad-core CPU and two GPU accelerators.
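    A minimal simulation can illustrate why a dynamic, demand-driven assignment helps with irregular workloads. This is an illustration of the general principle, not GAMA's scheduler, and the device speeds and chunk costs are made up: chunks are handed to whichever device becomes idle first, so an unexpectedly expensive chunk only delays the device that took it.

```c
#include <stdio.h>

/* Demand-driven assignment of irregular chunks to heterogeneous devices.
 * All speeds and costs are illustrative. */

#define NDEV    3
#define NCHUNKS 12

int main(void)
{
    double speed[NDEV] = { 1.0, 4.0, 4.0 };                          /* CPU, GPU, GPU (relative) */
    double cost[NCHUNKS] = { 1, 1, 9, 1, 2, 1, 7, 1, 1, 3, 1, 1 };   /* irregular task costs     */
    double busy_until[NDEV] = { 0 };

    for (int c = 0; c < NCHUNKS; c++) {
        int d = 0;                               /* pick the device that frees up first */
        for (int i = 1; i < NDEV; i++)
            if (busy_until[i] < busy_until[d]) d = i;
        busy_until[d] += cost[c] / speed[d];
        printf("chunk %2d -> device %d (busy until %.2f)\n", c, d, busy_until[d]);
    }

    double makespan = 0;
    for (int d = 0; d < NDEV; d++)
        if (busy_until[d] > makespan) makespan = busy_until[d];
    printf("makespan: %.2f\n", makespan);
    return 0;
}
```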