14 research outputs found
Data-Aware Task Scheduling on Multi-Accelerator based Platforms
To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches, in which the main trunk of the application runs on regular cores while only specific parts are offloaded to accelerators, are not sufficient. The real challenge is to build systems where the application permanently spreads across the entire machine, that is, where parallel tasks are dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a high-level programming interface and automates data transfers between processing units so as to enable dynamic task scheduling. We now present how we have extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency on multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications on clusters featuring multiple GPUs per node, showing how our runtime system can combine with MPI.
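The transfer-cost-aware decision described above can be pictured with a small greedy model: for each task, predict each worker's finish time as its availability plus the kernel's compute cost plus a transfer penalty for every input not already resident on that worker, then pick the minimum. This is an illustrative sketch under made-up costs, not StarPU's actual API or scheduler code; every name below is hypothetical.

```python
# Hypothetical sketch of transfer-cost-aware greedy scheduling (not StarPU code).

def pick_worker(task, workers, located_on, compute_cost, transfer_cost):
    """Choose the worker minimizing predicted finish time, where the
    prediction includes the cost of moving any input the worker lacks."""
    best, best_time = None, float("inf")
    for w in workers:
        t = w["ready_at"] + compute_cost[(task["kernel"], w["kind"])]
        # Add a transfer penalty for every input not already on this worker.
        t += sum(transfer_cost[d["size"]] for d in task["inputs"]
                 if w["id"] not in located_on[d["id"]])
        if t < best_time:
            best, best_time = w, t
    return best, best_time

def schedule(tasks, workers, compute_cost, transfer_cost):
    """Assign tasks greedily; replicate data on workers like a software VSM."""
    located_on = {d["id"]: set(d["home"]) for t in tasks for d in t["inputs"]}
    plan = []
    for task in tasks:
        w, finish = pick_worker(task, workers, located_on,
                                compute_cost, transfer_cost)
        w["ready_at"] = finish
        for d in task["inputs"]:
            located_on[d["id"]].add(w["id"])   # data now cached on this worker
        plan.append((task["name"], w["id"]))
    return plan
```

Note how a second task reusing the same input incurs no transfer penalty on the worker that already fetched it, which is the effect prefetching and the VSM cache aim for; StarPU's real schedulers use calibrated, history-based performance models rather than fixed cost tables.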
A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures
To improve system performance efficiently, many systems combine multi-core CPUs with many-core processors such as GPUs. Because of their discrete memories, these heterogeneous architectures effectively form a distributed system within a single computer. A data-flow programming model is attractive in this setting for its ease of expressing concurrency: programmers only need to define task dependencies, without considering how to schedule the tasks on the hardware. However, mapping the resulting task graph onto the hardware efficiently remains a challenge. In this paper, we propose a graph-partition scheduling policy for mapping data-flow workloads to heterogeneous hardware. According to our experiments, our graph-partition-based scheduling achieves performance comparable to conventional queue-based approaches.
Comment: Presented at the DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241)
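The contrast between partition-based and queue-based mapping can be illustrated with a toy partitioner: split the task graph into one part per device so that few edges (i.e. cross-device data transfers) cross the cut, instead of distributing tasks to devices with no regard for locality. This is a hypothetical sketch of the idea, not the paper's actual policy.

```python
# Toy graph-partition mapping: greedy BFS split into balanced parts,
# compared by the number of cut edges (cross-device transfers).
from collections import deque

def bfs_partition(adj, n_parts):
    """Greedily assign nodes 0..len(adj)-1 to n_parts balanced parts,
    filling each part with a BFS region so neighbours stay together."""
    n = len(adj)
    cap = -(-n // n_parts)               # ceil: max nodes per part
    part, seen = {}, set()
    current, filled = 0, 0
    for start in range(n):
        if start in seen:
            continue
        q = deque([start])
        seen.add(start)
        while q:
            u = q.popleft()
            if filled == cap:            # current part is full, open the next
                current, filled = current + 1, 0
            part[u] = current
            filled += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
    return part

def cut_edges(adj, part):
    """Count edges whose endpoints land on different devices."""
    return sum(1 for u, nbrs in enumerate(adj) for v in nbrs
               if u < v and part[u] != part[v])
```

On a chain of eight dependent tasks split over two devices, the BFS partition cuts a single edge while a locality-oblivious round-robin assignment cuts every edge, which is the kind of transfer saving a partition-based policy targets.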
StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators
GPUs have largely entered HPC clusters, as shown by the top entries of the latest Top500 list. Exploiting such machines is however very challenging, not only because two separate paradigms, MPI and CUDA or OpenCL, must be combined, but also because nodes are heterogeneous and thus require careful load balancing within the nodes themselves. Current paradigms are usually limited to offloading only parts of the computation and leaving CPUs idle, or they require static work partitioning between CPUs and GPUs. To handle single-node architecture heterogeneity, we previously proposed StarPU, a runtime system capable of dynamically scheduling tasks in an optimized way on such machines. We show here how the task paradigm of StarPU has been combined with MPI communications, and how we extended the task paradigm itself to allow mapping the task graph onto MPI clusters so as to automatically achieve an optimized distributed execution. We show how a sequential-like Cholesky source code can easily be extended into a scalable distributed parallel execution, already exhibiting a speedup of 5 on 6 nodes.
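A common way to map a task graph onto MPI ranks automatically, in the spirit described above, is the owner-computes rule: each task runs on the rank that owns its output data, and the runtime infers the required sends and receives from the task's inputs, so the application contains no explicit MPI calls. The sketch below is a hypothetical illustration of that inference, not StarPU-MPI's actual implementation.

```python
# Hypothetical owner-computes mapping: derive the executing rank and the
# implied inter-node messages from data ownership alone.

def infer_comms(tasks, owner):
    """Return (rank_of_task, messages), where each message (src, dst, data)
    is a transfer implied by running a task on its output's owner rank."""
    rank_of, messages = {}, []
    for t in tasks:
        dst = owner[t["output"]]         # owner-computes: run where the
        rank_of[t["name"]] = dst         # output tile lives
        for d in t["inputs"]:
            src = owner[d]
            if src != dst:               # input lives elsewhere: a send is
                messages.append((src, dst, d))   # posted automatically
    return rank_of, messages
```

For a tiny Cholesky-like fragment where rank 0 owns tile A00 and rank 1 owns A10 and A11, only the factorized A00 needs to travel; all other tasks find their inputs locally, which is how a sequential-looking source yields a distributed execution.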
A Task-Based Fast Multipole Method for Clusters of Multicore Processors
Most high-performance scientific libraries have adopted hybrid parallelization schemes, such as the popular MPI+OpenMP hybridization, to benefit from the capacities of modern distributed-memory machines. While these approaches have been shown to achieve high performance, they require a lot of effort to design and maintain sophisticated synchronization/communication strategies. On the other hand, task-based programming paradigms aim at delegating this burden to a runtime system in order to maximize productivity. In this article, we assess the potential of task-based fast multipole methods (FMM) on clusters of multicore processors. We propose both a hybrid MPI+task FMM parallelization and a pure task-based parallelization where the MPI communications are implicitly handled by the runtime system. The latter approach yields a very compact code following a sequential task-based programming model. We show that task-based approaches can compete with a highly optimized hybrid MPI+OpenMP code, and furthermore that the compact task-based scheme fully matches the performance of the sophisticated hybrid MPI+task version, ensuring performance while maximizing productivity. We illustrate our discussion with the ScalFMM FMM library and the StarPU runtime system.
Development of a multi-core and multi-accelerator platform for approximate computing
Graduation project (Licenciatura in Electronics Engineering), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería Electrónica, 2017. The changing environment of current technologies has introduced a gap between the ever-growing needs of users and the state of present designs. As data-intensive and computation-heavy applications advance, the current trend demands ever greater performance. Approximate computing enters this scheme to boost a system's overall attributes by exploiting the intrinsic error tolerance present in both software and hardware. This work proposes a multicore and multi-accelerator platform design that uses both exact and approximate accelerator versions, also providing interaction with a software counterpart to ensure usage of both layouts. A set of five different approximate accelerator versions and one exact version are presented for three different image processing filters, Laplace, Sobel and Gauss, along with their respective characterization in terms of power, area and delay time; this characterization shows the best results for design versions 2 and 3. Three different interface designs for the accelerators are then presented, together with a softcore processor, Altera's NIOS II. The results gathered demonstrate a definitive improvement when using approximate accelerators compared with software and exact-accelerator implementations. For two different matrix sizes, memory access and filter operation times show gains of 500, 2000 and 1500 cycles for the Laplace, Gauss and Sobel filters respectively when compared against software times, and decreases in the range of 28-84, 20-40 and 68-100 ticks against the use of an exact accelerator.
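The accuracy trade-off behind such approximate accelerators can be mimicked in software: for example, a multiplier that drops the low-order bits of the pixel operand is cheaper in hardware but introduces a small, bounded error in the filter output. The sketch below is purely illustrative (a software stand-in, not the thesis's hardware designs), applied to the standard 3x3 Laplace kernel.

```python
# Illustrative approximate-computing trade-off: truncated multiply vs exact
# multiply inside a 3x3 Laplace filter. Not the thesis's actual accelerators.

def approx_mul(coeff, pixel, k=2):
    """Multiply a kernel coefficient by a pixel whose k low-order bits are
    zeroed, mimicking a cheaper truncated hardware multiplier (pixel >= 0)."""
    return coeff * (pixel & ~((1 << k) - 1))

LAPLACE = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # standard 3x3 Laplace kernel

def convolve3x3(img, mul):
    """Apply the Laplace kernel to every interior pixel, using `mul` as the
    multiplier so exact and approximate variants share one code path."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(mul(LAPLACE[j][i], img[y + j - 1][x + i - 1])
                            for j in range(3) for i in range(3))
    return out
```

With k = 2 each pixel loses at most 3 from truncation, so the per-output error is bounded by 3 times the sum of the kernel's absolute coefficients (24 here), which is the kind of tolerable, characterizable error that approximate designs trade for power, area, and delay.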
Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model
The emergence of accelerators as standard computing resources on supercomputers and the subsequent increase in architectural complexity revived the need for high-level parallel programming paradigms. The sequential task-based programming model has been shown to efficiently meet this challenge on a single multicore node possibly enhanced with accelerators, which motivated its support in the OpenMP 4.0 standard. In this paper, we show that this paradigm can also be employed to achieve high performance on modern supercomputers composed of multiple such nodes, with extremely limited changes in the user code. To prove this claim, we have extended the StarPU runtime system with an advanced inter-node data management layer that supports this model by posting communications automatically. We illustrate our discussion with the task-based tile Cholesky algorithm that we implemented on top of this new runtime system layer. We show that it enables very high productivity while achieving a performance competitive with both the pure Message Passing Interface (MPI)-based ScaLAPACK Cholesky reference implementation and the DPLASMA Cholesky code, which implements another (non-sequential) task-based programming paradigm.
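The appeal of the sequential task-based model is visible in the tile Cholesky loop nest itself: the code reads as plain sequential calls on tiles (POTRF, TRSM, SYRK/GEMM), and a runtime would turn each call into a task, inferring the dependency graph (and, across nodes, the communications) from the tile accesses. The toy version below simply executes the tasks in program order to show that the sequential source is already the complete algorithm; it uses NumPy and is not the paper's implementation.

```python
# Sequential tile Cholesky: the loop nest a task-based runtime would
# parallelize. Executes tasks in program order; factorization is in place.
import numpy as np

def tile_cholesky(A, nb):
    """Lower-triangular Cholesky factor of symmetric positive-definite A,
    processed in nb x nb tiles (n must be a multiple of nb)."""
    n = A.shape[0]
    for k in range(0, n, nb):
        kk = slice(k, k + nb)
        A[kk, kk] = np.linalg.cholesky(A[kk, kk])        # POTRF task
        for i in range(k + nb, n, nb):
            ii = slice(i, i + nb)
            # TRSM task: A_ik <- A_ik * L_kk^{-T}
            A[ii, kk] = np.linalg.solve(A[kk, kk], A[ii, kk].T).T
        for i in range(k + nb, n, nb):
            ii = slice(i, i + nb)
            for j in range(k + nb, i + nb, nb):
                jj = slice(j, j + nb)
                # SYRK/GEMM task: A_ij <- A_ij - A_ik * A_jk^T
                A[ii, jj] -= A[ii, kk] @ A[jj, kk].T
    return np.tril(A)
```

A runtime sees, for instance, that the TRSM on tile (i, k) reads the POTRF output of tile (k, k) and writes tile (i, k), and builds the task graph from exactly these accesses; no explicit dependencies or messages appear in the source.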
Portability and performance in heterogeneous many core Systems
Master's dissertation in Informatics. Current computing systems have a multiplicity of computational resources with different architectures, such as multi-core CPUs and GPUs. These platforms are known as heterogeneous many-core systems (HMS), and as computational resources evolve they are offering more parallelism as well as becoming more heterogeneous. Exploiting these devices requires the programmer to be aware of the multiplicity of associated architectures, computing models and development frameworks. Portability issues, disjoint memory address spaces, work distribution and irregular workload patterns are major examples of what needs to be tackled in order to efficiently exploit the computational resources of an HMS.
This dissertation's goal is to design and evaluate a base architecture that enables the identification and preliminary evaluation of the potential bottlenecks and limitations of a runtime system that addresses HMS. It proposes a runtime system that eases the programmer's burden of handling all the devices available in a heterogeneous system. The runtime provides a programming and execution model with a unified address space managed by a data management system. An API is proposed to enable the programmer to express applications and data in an intuitive way. Four different scheduling approaches are evaluated, combining different data partitioning mechanisms with different work assignment policies, and a performance model is used to provide performance insights to the scheduler.
The runtime's efficiency was evaluated with three different applications - matrix multiplication, image convolution and an n-body Barnes-Hut simulation - running on multicore CPUs and GPUs.
In terms of productivity the results look promising; however, combining scheduling and data partitioning revealed some inefficiencies that compromise load balancing and need to be revised, as does the data management system, which plays a crucial role in such systems. Performance-model-driven decisions were also evaluated, revealing that the accuracy of the performance model is also a limiting component.
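A minimal form of the performance-model-driven decisions mentioned above is to fit a simple per-device cost model (time ≈ a + b·size) from past executions and assign each task to the device with the lowest predicted completion time, given current load. The device names and numbers below are made up; this is a sketch of the idea, not the dissertation's scheduler.

```python
# Hypothetical performance-model-driven assignment: linear cost model per
# device, fitted by least squares from (size, time) execution history.

def fit_linear(samples):
    """Least-squares fit of time = a + b*size from (size, time) pairs."""
    n = len(samples)
    sx = sum(s for s, _ in samples); sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples); sxy = sum(s * t for s, t in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def assign(sizes, history):
    """Map each task size to the device minimizing predicted finish time
    (current load plus the model's predicted execution time)."""
    models = {dev: fit_linear(h) for dev, h in history.items()}
    load = {dev: 0.0 for dev in history}
    plan = []
    for size in sizes:
        dev = min(load,
                  key=lambda d: load[d] + models[d][0] + models[d][1] * size)
        load[dev] += models[dev][0] + models[dev][1] * size
        plan.append((size, dev))
    return plan
```

With a high-startup, high-throughput "gpu" model and a low-startup, low-throughput "cpu" model, small tasks go to the CPU and large ones to the GPU, and a mis-fitted model would make the wrong choice, which is exactly why model accuracy becomes a limiting component.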
Feasibility Analysis of Implementing a Communication Protocol on a Multicore Processor
ABSTRACT: The work performed as part of this Master's thesis is done in the context of an industrially sponsored project. The objective is to understand the runtime behavior of a class of systems in specific contexts. We place this project at the intersection of the principles of task scheduling, runtime systems, Network Functions Virtualization (NFV) and, especially, the constraints associated with the virtualization of an LTE (Long Term Evolution) stack, the most prominent cellular telecommunication standard at the moment. A literature review is provided to explain these concepts in detail, in order to give a clear idea of the target environment. First, a study of a real-time processing cluster is carried out in relation to the implementation of a so-called Cloud Radio Access Network (C-RAN), which supports on a cloud platform all the electronics performing the signal processing required for a cellular access point. The study developed in this thesis aims to evaluate the various bottlenecks that can occur from the receipt of an LTE packet within a Common Public Radio Interface (CPRI) frame to the sending of that packet from a master server to the slaves. We evaluate the latencies and bandwidths observed for the different protocols used by the platform components. In particular, we characterize the CPRI communications from the antennas to the pool of virtual base stations, a Quick Path Interconnect (QPI) communication between processing cores and an FPGA, a dedicated point-to-point communication between the FPGA and a NIC (Network Interface Card), and finally the sending of Ethernet frames to the slave servers. This study allows us to infer that the virtualization of an LTE stack is viable on such a real-time computing cluster.
Then, to validate the effectiveness of different scheduling algorithms, an emulation of a virtualized LTE uplink stack is carried out. Using a runtime system called StarPU coupled with profiling tools, we deliver results that assess the need for dedicated threads or cores to manage tasks within a server.
Scheduling (ir)regular applications on heterogeneous platforms
Master's dissertation in Informatics Engineering. Current computational platforms have become increasingly heterogeneous and parallel over recent years, as a consequence of incorporating accelerators whose architectures are parallel and different from the CPU's. As a result, several frameworks were developed to aid in programming these platforms, mainly targeting better productivity. In this context, the GAMA framework is being developed by the research group involved in this work, targeting both regular and irregular algorithms so that they run efficiently on heterogeneous platforms.
Scheduling is a key issue in GAMA-like frameworks. State-of-the-art scheduling solutions for heterogeneous platforms are efficient for regular applications but lack adequate mechanisms for irregular ones. The scheduling of irregular applications is particularly complex due to the unpredictability of, and the differences in, the execution times of their constituent computational tasks.
This dissertation comprises the design and validation of a dynamic scheduler model and its implementation, addressing regular and irregular algorithms simultaneously. The devised scheduling mechanism is validated within the GAMA framework when running relevant scientific algorithms, which include SAXPY, the Fast Fourier Transform and two n-body solvers. The proposed mechanism is validated with regard to its efficiency in finding good scheduling decisions, and with regard to the efficiency and scalability of GAMA when using it.
The results show that the devised dynamic scheduler model is capable of working on heterogeneous systems with high efficiency, finding good scheduling decisions in the general tested cases. It not only reaches the scheduling decision that reflects the real capacity of the devices in the platform, but also enables GAMA to achieve more than 100% efficiency, as defined in [3], when running a relevant irregular scientific algorithm.
Under the devised scheduling model, GAMA was also able to beat efficient CPU and GPU libraries at SAXPY, an important scientific algorithm. GAMA's scalability under the devised dynamic scheduler was also demonstrated: it properly leveraged the platform's computational resources in trials with one central quad-core CPU chip and two GPU accelerators.
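Why dynamic scheduling matters for irregular workloads can be shown with a toy simulation: when devices repeatedly grab the next chunk of work as they become free, unpredictable task costs self-balance, whereas a static split by task count can leave one device with all the expensive tasks. This is an illustrative sketch, not GAMA's actual mechanism.

```python
# Toy comparison of dynamic self-scheduling vs a static even split, for a
# workload whose per-task costs are irregular. Lower makespan is better.
import heapq

def dynamic_makespan(costs, device_speeds, chunk=1):
    """Greedy self-scheduling: the earliest-free device takes the next
    chunk of tasks. Returns the resulting makespan."""
    free_at = [(0.0, d) for d in range(len(device_speeds))]
    heapq.heapify(free_at)
    i = 0
    while i < len(costs):
        t, d = heapq.heappop(free_at)
        work = sum(costs[i:i + chunk])
        heapq.heappush(free_at, (t + work / device_speeds[d], d))
        i += chunk
    return max(t for t, _ in free_at)

def static_makespan(costs, device_speeds):
    """Even split by task count, ignoring per-task cost: prone to imbalance
    when expensive tasks cluster in one half of the workload."""
    n_dev = len(device_speeds)
    per = -(-len(costs) // n_dev)        # ceil: tasks per device
    return max(sum(costs[d * per:(d + 1) * per]) / device_speeds[d]
               for d in range(n_dev))
```

For sixteen tasks where the cheap and expensive halves cluster together, the static split gives one of two equal-speed devices almost all the work, while dynamic self-scheduling reaches the ideal makespan; this self-balancing behaviour is what a dynamic scheduler buys on irregular applications.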