8 research outputs found

    Improving Reactivity and Communication Overlap in MPI using a Generic I/O Manager

    MPI applications may waste thousands of CPU cycles if they do not efficiently overlap communication and computation. In this paper, we present a generic and portable I/O manager that makes communication progress asynchronously using tasklets. It automatically chooses the most appropriate communication method depending on the context: whether or not the application is multi-threaded, and whether or not the machine is an SMP. We have implemented and evaluated our I/O manager with Mad-MPI, our own MPI implementation, and compared it to other existing MPI implementations regarding the ability to efficiently overlap communication and computation.
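
    The overlap this abstract refers to can be illustrated with a minimal sketch. It is not the I/O manager described in the paper and does not use Mad-MPI or tasklets: a plain Python thread stands in for the asynchronous progress engine, and a `time.sleep` call stands in for the network transfer. The function name `overlapped_exchange` and its parameters are hypothetical.

```python
import threading
import time


def overlapped_exchange(transfer_s=0.05, n=200_000):
    """Run a simulated transfer in the background while computing."""
    inbox = []

    def progress():
        # Stand-in for the network transfer (assumption: a real progress
        # engine would poll the network interface here instead of sleeping).
        time.sleep(transfer_s)
        inbox.append("payload")

    start = time.perf_counter()
    worker = threading.Thread(target=progress)
    worker.start()                           # communication starts
    result = sum(i * i for i in range(n))    # computation proceeds meanwhile
    worker.join()                            # wait for the transfer to complete
    elapsed = time.perf_counter() - start
    return result, inbox[0], elapsed


if __name__ == "__main__":
    result, message, elapsed = overlapped_exchange()
    print(result, message, f"{elapsed:.3f}s")
```

    With full overlap, the elapsed time approaches the longer of the two phases rather than their sum.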

    An analysis of the impact of multi-threading on communication performance

    Although processors are becoming massively multicore and new programming models therefore mix message passing and multi-threading, the effects of threads on communication libraries remain neglected. Designing an efficient modern communication library requires precautions in order to limit the impact of thread-safety mechanisms on performance. In this paper, we present various approaches to building a thread-safe communication library and we study their benefits and their impact on performance. We also describe and evaluate techniques that exploit idle cores to balance the communication library load across multicore machines.

    A multicore-enabled multirail communication engine

    The current trend in cluster architecture leads toward a massive use of multicore chips. This hardware evolution raises bottleneck issues at the network interface level. Using multiple parallel networks makes it possible to overcome this problem, as it provides a higher aggregate bandwidth. But this bandwidth remains theoretical, as only a few communication libraries are able to exploit multiple networks. In this paper, we present an optimization strategy for the NewMadeleine communication library. This strategy is able to efficiently exploit parallel interconnect links. By sampling each network's capabilities, it is possible to estimate a transfer duration a priori. Splitting messages and sending the chunks over parallel links can thus be performed efficiently to reach the theoretical aggregate bandwidth. NewMadeleine is multithreaded and exploits multicore chips to send small packets, which involve CPU-consuming copies, in parallel.
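
    The idea of estimating transfer durations from sampled capabilities and splitting a message accordingly can be sketched as follows. This is not NewMadeleine's strategy or API: it assumes a simple linear cost model (time = latency + size / bandwidth) and chooses chunk sizes so that all rails finish at the same time. The `Rail` type, `split_message` function, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rail:
    name: str
    latency_s: float       # sampled startup latency (hypothetical value)
    bandwidth_bps: float   # sampled bandwidth in bytes per second


def split_message(size, rails):
    """Return ({rail name: chunk size}, estimated duration).

    Assumed model: sending c bytes on rail i takes latency_i + c / bandwidth_i.
    All used rails finish at a common time T, so
    chunk_i = (T - latency_i) * bandwidth_i and the chunks sum to size.
    """
    active = sorted(rails, key=lambda r: r.latency_s)
    while True:
        total_bw = sum(r.bandwidth_bps for r in active)
        weighted = sum(r.latency_s * r.bandwidth_bps for r in active)
        finish = (size + weighted) / total_bw
        if len(active) == 1 or all(finish > r.latency_s for r in active):
            break
        # The slowest-to-start rail cannot contribute before the others
        # are done, so it is dropped and the split is recomputed.
        active.pop()
    chunks = {r.name: (finish - r.latency_s) * r.bandwidth_bps for r in active}
    return chunks, finish


if __name__ == "__main__":
    rails = [Rail("rail0", 0.0, 1e9), Rail("rail1", 0.0, 3e9)]
    print(split_message(4_000_000, rails))
```

    Under this model a small message is sent on the lowest-latency rail only, while a large message is divided roughly in proportion to each rail's bandwidth.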

    Bibliothèque de communication multi-threadée pour architectures multi-coeurs (A multi-threaded communication library for multicore architectures)

    The architecture of compute clusters has changed enormously over the past few years. Whereas until recently most nodes had only a few compute cores, machines equipped with dozens of cores are becoming commonplace. This hardware evolution has been accompanied by a change in programming models: pure MPI approaches are giving way to models that mix message passing and multi-threading. The design of modern communication libraries must therefore take into account concurrent accesses and the scalability problems related to multicore processors. This article presents different approaches to designing a communication library suited to current architectures. We study the impact of these methods on performance and detail several techniques for exploiting unused cores. The evaluations show that such mechanisms make it possible to distribute the load caused by network processing and to overlap communication with computation.

    A multithreaded communication engine for multicore architectures

    The current trend in clusters leads towards an increase in the number of cores per node. As a result, an increasing number of parallel applications mix message passing and multithreading in an attempt to better match the structure of the underlying architecture. This naturally raises the problem of designing efficient, multithreaded implementations of MPI. In this paper, we present the design of a multithreaded communication engine able to exploit idle cores to speed up communications in two ways: it can move CPU-intensive operations out of the critical path (e.g. by offloading PIO transfers), and it can let rendezvous transfers progress asynchronously. We have implemented these methods in the PM2 software suite and evaluated their behavior in typical cases, and we have observed good performance results in overlapping communication and computation.

    A scalable and generic task scheduling system for communication libraries

    Since the advent of multi-core processors, the physiognomy of typical clusters has dramatically evolved. This new massively multi-core era is a major change in architecture, causing programming models to evolve towards hybrid MPI+threads and therefore requiring new low-level features. Modern communication subsystems now have to deal with multi-threading: the impact of thread-safety, the contention on network interfaces, and the consequences of data locality on performance have to be studied carefully. In this paper, we present PIOMan, a scalable and generic lightweight task scheduling system for communication libraries. It is designed to ensure that multiple tasks of a communication library (polling, offload, multi-rail) progress concurrently through the use of multiple cores, while preserving locality to avoid contention and to scale to a large number of cores and threads. We have implemented the model, evaluated its performance, and compared it to state-of-the-art solutions regarding overhead, scalability, and communication and computation overlap.

    A NUMA Aware Scheduler for a Parallel Sparse Direct Solver

    Over the past few years, parallel sparse direct solvers have made significant progress and are now able to efficiently solve industrial three-dimensional problems with several million unknowns. To solve these problems efficiently, solvers such as PaStiX and WSMP provide a hybrid MPI-thread implementation well suited for SMP nodes or multi-core architectures. This drastically reduces the memory overhead of the factorization and improves the scalability of the algorithms. However, modern architectures introduce new hierarchical memory accesses that are not handled in these solvers. In this paper, we present three improvements to the PaStiX solver that improve performance on modern architectures: memory allocation, communication overlap, and dynamic scheduling. Results on numerical test cases are presented to demonstrate the efficiency of the approach on NUMA architectures.

    NewMadeleine: An Efficient Support for High-Performance Networks in MPICH2

    This paper describes how the NewMadeleine communication library has been integrated within the MPICH2 MPI implementation and the benefits this brings. NewMadeleine is integrated as a Nemesis network module, but the upper layers, and in particular the CH3 layer, have been modified. By doing so, we allow NewMadeleine to fully deliver its performance to an MPI application. NewMadeleine features sophisticated strategies for sending messages and natively supports multirail network configurations, even heterogeneous ones. It also uses a software element called PIOMan that uses multithreading in order to enhance reactivity and create more efficient progress engines. We show various results demonstrating that NewMadeleine is indeed well suited as a low-level communication library for building MPI implementations.