34 research outputs found

    A multicore-enabled multirail communication engine

    Get PDF
    International audienceThe current trend in clusters architecture leads toward a massive use of multicore chips. This hardware evolution raises bottleneck issues at the network interface level. The use of multiple parallel networks allows to overcome this problem as it provides an higher aggregate bandwidth. But this bandwidth remains theoretical as only a few communication libraries are able to exploit multiple networks. In this paper, we present an optimization strategy for the NewMadeleine communication library. This strategy is able to efficiently exploit parallel interconnect links. By sampling each network's capabilities, it is possible to estimate a transfer duration a priori. Splitting messages and sending chunks of messages over parallel links can thus be performed efficiently to reach the theoretical aggregate bandwidth. NewMadeleine is multithreaded and exploits multicore chips to send small packets, that involve CPU-consuming copies, in parallel

    NewMadeleine: An Efficient Support for High-Performance Networks in MPICH2

    Get PDF
    International audienceThis paper describes how the NewMadeleine communication library has been integrated within the MPICH2 MPI implementation and the benefits brought. NewMadeleine is integrated as a Nemesis network module but the upper layers and in particular the CH3 layer has been modified. By doing so, we allow NewMadeleine to fully deliver its performance to an MPI application. NewMadeleine features sophisticated strategies for sending messages and natively supports multirail network configurations, even heterogeneous ones. It also uses a software element called PIOMan that uses multithreading in order to enhance reactivity and create more efficient progress engines. We show various results that prove that NewMadeleine is indeed well suited as a low-level communication library for building MPI implementations

    A scalable and generic task scheduling system for communication libraries

    Get PDF
    International audienceSince the advent of multi-core processors, the physionomy of typical clusters has dramatically evolved. This new massively multi-core era is a major change in architecture, causing the evolution of programming models towards hybrid MPI+threads, therefore requiring new features at low-level. Modern communication subsystems now have to deal with multi-threading: the impact of thread-safety, the contention on network interfaces or the consequence of data locality on performance have to be studied carefully. In this paper, we present PIOMan, a scalable and generic lightweight task scheduling system for communication libraries. It is designed to ensure concurrent progression of multiple tasks of a communication library (polling, offload, multi-rail) through the use of multiple cores, while preserving locality to avoid contention and allow a scalability to a large number of cores and threads. We have implemented the model, evaluated its performance, and compared it to state of the art solutions regarding overhead, scalability, and communication and computation overlap

    Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access

    Get PDF
    International audienceMulticore processors have not only reintroduced Non-Uniform Memory Access (NUMA) architectures in nowadays parallel computers, but they are also responsible for non-uniform access times with respect to Input/Output devices (NUIOA). In clusters of multicore machines equipped with several Network Interfaces, performance of communication between processes thus depends on which cores these processes are scheduled on, and on their distance to the Network Interface Cards involved. We propose a technique allowing multirail communication between processes to carefully distribute data among the network interfaces so as to counterbalance NUIOA effects. We demonstrate the relevance of our approach by evaluating its implementation within OpenMPI on a Myri-10G + InfiniBand cluster

    Impacts des effets NUMA sur les communications haute performance dans les grappes de calcul

    Get PDF
    La multiplication des processeurs et des coeurs dans les machines a conduit les architectes à renoncer aux bus mémoire centralisés. Les effets NUMA (Non-Uniform Memory Access), surtout connus pour leur impact sur l'efficacité de l'ordonnancement de processus, révèlent également une influence sensible sur les performances des entrées-sorties. Dans cet article, nous présentons une évaluation de leur incidence sur les performances réseau dans les grappes de calcul, en exhibant leur ampleur et leur aspect parfois asymétrique sur le débit. Nous proposons une solution de placement automatique et portable des tâches de communication dans la bibliothèque NewMadeleine qui permet d'obtenir des performances analogues à celles d'un placement manuel, en utilisant des informations de topologie collectées auprès du système

    Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective

    Get PDF
    Large scale InfiniBand clusters are becoming increasingly popular, as reflected by the TOP 500 Supercomputer rankings. At the same time, fat tree has become a popular interconnection topology for these clusters, since it allows multiple paths to be available in between a pair of nodes. However, even with fat tree, hot-spots may occur in the network depending upon the route configuration between end nodes and communication pattern(s) in the application. To make matters worse, the deterministic routing nature of InfiniBand limits the application from effective use of multiple paths transparently and avoid the hot-spots in the network. Simulation based studies for switches and adapters to implement congestion control have been proposed in the literature. However, these studies have focused on providing congestion control for the communication path, and not on utilizing multiple paths in the network for hot-spot avoidance. In this paper, we design an MPI functionality, which provides hot-spot avoidance for different communications, without a priori knowledge of the pattern. We leverage LMC (LID Mask Count) mechanism of InfiniBand to create multiple paths in the network and present the design issues (scheduling policies, selecting number of paths, scalability aspects) of our design. We implement our design and evaluate it with Pallas collective communication and MPI applications. On an InfiniBand cluster with 48 processes, collective operations like MPI All-to-all Personalized and MPI Reduce Scatter show an improvement of 27% and 19% respectively. Our evaluation with MPI applications like NAS Parallel Benchmarks and PSTSWM on 64 processes shows significant improvement in execution time with this functionality

    A sampling-based approach for communication libraries auto-tuning

    Get PDF
    International audienceCommunication performance is a critical issue in HPC applications, and many solutions have been proposed on the literature (algorithmic, protocols, etc.) In the meantime, computing nodes become massively multicore, leading to a real imbalance between the number of communication sources and the number of physical communication resources. Thus it is now mandatory to share network boards between computation flows, and to take this sharing into account while performing communication optimizations. In previous papers, we have proposed a model and a framework for on-the-fly optimizations of multiplexed concurrent communication flows, and implemented this model in the \nm communication library. This library features optimization strategies able for example to aggregate several messages to reduce the number of packets emitted on the network, or to split messages to use several NICs at the same time. In this paper, we study the tuning of these dynamic optimization strategies. We show that some parameters and thresholds (\rdv threshold, aggregation packet size) depend on the actual hardware, both host and NICs. We propose and implement a method based on sampling of the actual hardware to auto-tune our strategies. Moreover, we show that multi-rail can greatly benefit from performance predictions. We propose an approach for multi-rail that dynamically balance the data between NICs using predictions based on sampling

    A Generic and High Performance Approach for Fault Tolerance in Communication Library

    Get PDF
    With the increase of the number of nodes in clusters, the probability of failures increases. In this paper, we study the failures in the network stack for high performance networks. We present the design of several fault-tolerance mechanisms for communication libraries to detect failures and to ensure message integrity. We have implemented these mechanisms in the N EW M ADELEINE communication library with a quick detection of failures in a portable way, and with fallback to available links when an error occurs. Our mechanisms ensure the integrity of messages without lowering too much the networking performance. Our evaluation show that ensuring fault-tolerance does not impact significantly the performance of most applications

    High-Performance Multi-Rail Support with the NewMadeleine Communication Library

    Get PDF
    International audienceThis paper focuses on message transfers across multiple heterogeneous high-performance networks in the NewMadeleine Communication Library. NewMadeleine features a modular design that allows the user to easily implement load-balancing strategies efficiently exploiting the underlying network but without being aware of the low-level interface. Several strategies are studied and preliminary results are given. They show that performance of network transfers can be improved by using carefully designed strategies that take into account NIC activity
    corecore