2 research outputs found

    Scalable, High-Performance NIC-Based All-to-All Broadcast over Myrinet/GM

    All-to-all broadcast is a common collective operation involving dense communication among all processes in a parallel program. Programmable Network Interface Cards (NICs) have previously been leveraged to support collective operations efficiently, including barrier, broadcast, and reduce. This paper explores the characteristics of all-to-all broadcast and proposes new algorithms that exploit the potential advantages of NIC programmability. Alongside these algorithms, strategies are employed to provide scalable topology management, global buffer management, efficient communication processing, and message reliability. The algorithms have been incorporated into a NIC-based collective protocol over Myrinet/GM. Compared to the host-based operation, the NIC-based all-to-all broadcast improves bandwidth on 16 nodes by a factor of 3. Furthermore, the NIC-based operations achieve better scalability to large systems and very low host CPU utilization.
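
    For context, the host-based baseline that the paper measures against can be expressed in a few lines of MPI; the NIC-based protocol itself runs in Myrinet/GM NIC firmware and cannot be reproduced in portable code. The ring schedule and the name ring_allgather below are illustrative assumptions, not the paper's algorithm.

```c
/* A minimal host-based all-to-all broadcast (allgather) in C/MPI.
 * Sketch of the host-based baseline only; the ring schedule is an
 * illustrative assumption, not the paper's NIC-based algorithm. */
#include <mpi.h>
#include <string.h>

/* Each rank contributes `count` bytes; afterwards every rank holds all
 * contributions in rank order in `recvbuf` (size nprocs * count). */
static void ring_allgather(const char *sendbuf, char *recvbuf,
                           int count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Place the local contribution in its slot. */
    memcpy(recvbuf + (size_t)rank * count, sendbuf, count);

    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;

    /* In step s, forward the block that originated s hops upstream. */
    for (int s = 0; s < nprocs - 1; s++) {
        int send_block = (rank - s + nprocs) % nprocs;
        int recv_block = (rank - s - 1 + nprocs) % nprocs;
        MPI_Sendrecv(recvbuf + (size_t)send_block * count, count, MPI_BYTE,
                     right, 0,
                     recvbuf + (size_t)recv_block * count, count, MPI_BYTE,
                     left, 0, comm, MPI_STATUS_IGNORE);
    }
}
```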

    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Multicore or many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies exist not only at the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies. The Message Passing Interface (MPI), the programming model most widely adopted in the HPC community, suffers from decreased performance and portability due to this increased, multi-level hardware complexity. We identified three critical issues specific to collective communication: first, the gap between logical collective topologies and the underlying hardware topologies; second, the lack of efficient shared-memory message delivery approaches in current MPI implementations; and last, on distributed memory machines such as multicore clusters, the fact that no single approach can encompass the extreme variations not only in bandwidth and latency, but also in features such as the ability to perform multiple copies concurrently. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms in order to dynamically reshape communication patterns to suit the hardware's capabilities. Based on process-distance information, we used graph-partitioning techniques to organize the MPI processes into a multi-level hierarchy mapped onto the hardware characteristics. Meanwhile, we adopted the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/non-root processes to distribute copy workloads evenly among the available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms into a single multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlap between inter- and intra-node communication. Experimental results confirm that, by combining several technologies, namely kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, MPI collectives not only reach their potential maximum performance on a wide range of platforms, but also deliver a level of performance that is immune to changes in the underlying process-core binding.
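
    As a rough illustration of the multi-level composition described above, the sketch below builds a two-level broadcast using only standard MPI; the dissertation's actual framework additionally uses hardware-distance information, graph partitioning, and KNEM kernel-assisted copies, none of which appear here. The function name hier_bcast and the assumption that the root is global rank 0 are both hypothetical.

```c
/* A minimal sketch of hierarchical collective composition in C/MPI.
 * Processes sharing a node form a subcommunicator; a broadcast is then
 * composed from an inter-node stage among node leaders and an
 * intra-node stage within each node. Assumes the root is global rank 0. */
#include <mpi.h>

void hier_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;
    int node_rank;

    /* Group processes that can share memory, i.e., live on the same node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader per node joins the inter-node communicator; ranks there
     * follow the original order, so global rank 0 becomes leader rank 0. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   0, &leader_comm);

    /* Stage 1: broadcast across nodes among the leaders only. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Stage 2: each leader re-broadcasts within its own node. */
    MPI_Bcast(buf, count, type, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```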