
    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Multicore and many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchy exist not only at the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies. The Message Passing Interface (MPI), the programming model most widely adopted in the HPC community, suffers from decreased performance and portability due to this increased, multi-level hardware complexity. We identified three critical issues specific to collective communication: first, a gap exists between logical collective topologies and the underlying hardware topologies; second, current MPI implementations lack efficient shared-memory message delivery approaches; third, on distributed memory machines such as multicore clusters, no single approach can encompass the extreme variations not only in bandwidth and latency, but also in features such as the ability to drive multiple concurrent copies. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms in order to dynamically reshape communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes in a multi-level hierarchy mapped onto the hardware characteristics. Meanwhile, we adopted the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader (non-root) processes to evenly distribute copy workloads among the available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms into a multi-level algorithm with tight interoperability between the levels; this tight collaboration results in more overlap between inter- and intra-node communication. Experimental results confirm that, by combining kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, MPI collectives not only reach their potential maximum performance on a wide variety of platforms, but also deliver a level of performance that is immune to changes in the underlying process-core binding.
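
    The multi-level composition described above can be made concrete with a simplified sketch. The code below is only an assumed two-level analogue written with standard MPI-3 calls: it is not the dissertation's distance-aware framework (which uses hardware distance information and graph partitioning) nor its KNEM-based copy path. Ranks sharing a node form an intra-node communicator, one leader per node forms an inter-node communicator, and a broadcast is composed from an inter-node phase among leaders followed by an intra-node phase.

    /* Hedged sketch (not the authors' framework): a two-level communicator
     * hierarchy and a composed broadcast using only standard MPI-3 calls. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Intra-node communicator: all ranks sharing a compute node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, &node_comm);

        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        /* Inter-node communicator: one leader (node_rank == 0) per node. */
        MPI_Comm leader_comm;
        MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &leader_comm);

        /* Composed broadcast: leaders exchange across nodes, then each
         * leader rebroadcasts inside its own node. */
        int payload = (world_rank == 0) ? 42 : 0;
        if (node_rank == 0)
            MPI_Bcast(&payload, 1, MPI_INT, 0, leader_comm);
        MPI_Bcast(&payload, 1, MPI_INT, 0, node_comm);

        if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }

    In the dissertation's approach, the grouping would instead be driven by measured hardware distances and graph partitioning, and the intra-node phase could use KNEM single-copy transfers rather than a plain MPI_Bcast, which is what enables offloading copy work onto non-root processes.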

    High performance RDMA-based multi-port all-gather on multi-rail QsNet II

    Scientific applications written in MPI use collective communications intensively. Efficient and scalable implementation of such collective operations is therefore crucial to the performance of MPI applications running on clusters. Quadrics QsNet II is a high-performance network that implements some collectives in its Elan user-level library, and its MPI implementation uses those primitives directly. Quadrics communication software supports point-to-point message striping over multi-rail QsNet II networks; however, multi-rail collectives other than broadcast are not supported. In this work, we propose, design, and implement a number of RDMA-based multi-port algorithms for the all-gather operation over multi-rail QsNet II clusters directly at the Elan level. Our performance results indicate that the proposed multi-port all-gather Direct algorithm achieves an improvement factor of up to 1.49 for 32 KB messages over the native elan_gather() collective.
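
    For context, a "direct" all-gather has every process exchange its block with every other process in a single round, which is the structure that lets multiple ports or rails be driven concurrently. The sketch below is only an assumed illustration written with plain MPI nonblocking point-to-point calls, not the paper's Elan-level RDMA implementation; the rail-selection comment (peer % num_rails) is a hypothetical policy.

    /* Hedged illustration: the schedule of a direct all-gather, expressed
     * with MPI nonblocking point-to-point rather than Elan RDMA. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void direct_allgather(const char *sendbuf, char *recvbuf, int blocksize,
                          MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        MPI_Request *reqs = malloc(2 * (size - 1) * sizeof(MPI_Request));
        int nreq = 0;

        /* Own block goes straight into place. */
        memcpy(recvbuf + (size_t)rank * blocksize, sendbuf, blocksize);

        for (int peer = 0; peer < size; peer++) {
            if (peer == rank) continue;
            /* A multi-rail version would pick the rail for each transfer
             * here, e.g. peer % num_rails (hypothetical policy). */
            MPI_Irecv(recvbuf + (size_t)peer * blocksize, blocksize, MPI_BYTE,
                      peer, 0, comm, &reqs[nreq++]);
            MPI_Isend((void *)sendbuf, blocksize, MPI_BYTE,
                      peer, 0, comm, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }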