    NIC-based Reduction Algorithms for Large-scale Clusters

    Abstract: Efficient algorithms for reduction operations across a group of processes are crucial for good performance in many large-scale, parallel scientific applications. While previous algorithms limit processing to the host CPU, we utilize the programmable processors and local memory available on modern cluster network interface cards (NICs) to explore a new dimension in the design of reduction algorithms. In this paper, we present the benefits and challenges, design issues and solutions, analytical models, and experimental evaluations of a family of NIC-based reduction algorithms. Performance and scalability evaluations were conducted on the ASCI Linux Cluster (ALC), a 960-node, 1920-processor machine at Lawrence Livermore National Laboratory, which uses the Quadrics QsNet interconnect. We find NIC-based reductions on modern interconnects to be more efficient than host-based implementations in both scalability and consistency. In particular, at large scale (1812 processes), NIC-based reductions of small integer and floating-point arrays provided respective speedups of 121% and 39% over the host-based, production-level MPI implementation. In addition, the standard deviations in timings for the NIC-based reductions were as much as two orders of magnitude smaller than for the host-based reductions.