10 research outputs found

    Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand

    No full text
    Remote memory access (RMA) is becoming an increasingly important communication model due to its excellent potential for overlapping communication and computation and for achieving high performance on modern networks with RDMA hardware such as InfiniBand. RMA plays a vital role in supporting the emerging global address space languages and the management of advanced distributed data structures. This paper describes how RMA communication can be implemented efficiently over InfiniBand using a zero-copy approach. Capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently with a novel host-assisted approach, while still achieving zero-copy communication and supporting a high degree of overlap between computation and communication. For the contiguous case we achieve a small-message latency of 7.44 µs and a peak bandwidth of 730 MB/s for 'put', and a small-message latency of 15 µs and a peak bandwidth of 689 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For the noncontiguous case, our host-assisted approach sustains close to the peak bandwidth achieved for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks, owing to the minimal host involvement in our approach as compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: 99% overlap for contiguous and up to 95% for noncontiguous transfers was achieved for large message sizes. Finally, the NAS MG and parallel matrix multiplication benchmarks were used to validate the effectiveness of our approach and demonstrated excellent overall performance.
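    The zero-copy put/get operations described above map onto one-sided RDMA operations posted directly from user space. As a rough illustration only, the sketch below shows how a contiguous put could be posted with the modern libibverbs API (the paper itself measures the older Mellanox VAPI layer); the helper name rdma_put and its parameters are assumptions for illustration, and all connection setup (device, protection domain, queue pair, memory registration, rkey exchange) is omitted.

```c
/* Sketch: posting a zero-copy RDMA write ("put") with libibverbs.
 * Illustrative only; assumes a connected RC queue pair `qp`, a registered
 * memory region `mr`, and a remote address/rkey exchanged out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int rdma_put(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;   /* source is the user buffer itself: no staging copy */
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided put */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* completion polled later, so the
                                                    host can compute in the meantime */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);      /* returns 0 on success */
}
```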

    OSU-CISRC-4/07-TR28 High Performance MPI over iWARP: Early Experiences

    No full text
    Modern interconnects and the corresponding high-performance MPI implementations have been feeding the surge in the popularity of compute clusters and cluster computing applications. Recently, with the introduction of the iWARP (Internet Wide Area RDMA Protocol) standard, RDMA and zero-copy data transfer capabilities have been introduced and standardized for Ethernet networks. While traditional Ethernet networks had largely been limited to kernel-based TCP/IP stacks and hence their limitations, the iWARP capabilities of newer GigE and 10 GigE adapters have broken this barrier and exposed the available performance potential. To enable applications to harness the performance benefits of iWARP, and to study the quantitative extent of such improvements, we present MPIiWARP, a high-performance MPI implementation over OpenFabrics verbs. Our preliminary results with Chelsio T3B adapters show an improvement of up to 37% in bandwidth, 75% in latency, and 80% in MPI Allreduce as compared to MPICH2 over TCP/IP. To the best of our knowledge, this is the first design, implementation, and evaluation of a high-performance MPI over the iWARP standard.
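    The latency figures quoted above are typically obtained with a point-to-point ping-pong test. The following is a minimal, generic sketch of such a microbenchmark, not the authors' actual test harness; the message size and iteration count are arbitrary choices for illustration.

```c
/* Sketch: minimal MPI ping-pong to estimate small-message one-way latency.
 * Run with two ranks, e.g. `mpirun -np 2 ./pingpong`. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS    1000
#define MSG_SIZE 8            /* small message, as in latency comparisons */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];
    double start, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    elapsed = MPI_Wtime() - start;

    if (rank == 0)    /* one-way latency is half the average round-trip time */
        printf("avg one-way latency: %.2f us\n",
               elapsed / (2.0 * ITERS) * 1e6);

    MPI_Finalize();
    return 0;
}
```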

    Selection of Microhabitat by the Red-Backed Vole, Clethrionomys gapperi

    No full text
    Clethrionomys gapperi were captured in microhabitats with greater densities of overall cover than at noncapture or random sites within the study area. Variables describing cover density and distance from free water were selected in a discriminant function analysis to differentiate between vole capture and noncapture sites. Vole capture sites had greater amounts of cover within 4 dm above the ground surface and were farther from standing water than noncapture sites. The preferential use by C. gapperi of microhabitats with greater densities of cover is in agreement with laboratory and field assessments of habitat use reported in the literature.

    Compiler Optimizations for Non-contiguous Remote Data Movement

    No full text
    Remote Memory Access (RMA) programming is one of the core concepts behind modern parallel programming languages such as UPC and Fortran 2008 or high-performance libraries such as MPI-3 One Sided or SHMEM. Many applications have to communicate non-contiguous data due to their data layout in main memory. Previous studies showed that such non-contiguous transfers can reduce communication performance by up to an order of magnitude. In this work, we demonstrate a simple scheme for statically optimizing non-contiguous RMA transfers by combining partial packing, communication overlap, and remote access pipelining. We determine accurate performance models for the various operations to find near-optimal pipeline parameters. The proposed approach is applicable to all RMA languages and does not depend on the availability of special hardware features such as scatter-gather lists or strided copies. We show that our proposed superpipelining leads to significant improvements compared to either full packing or sending each contiguous segment individually. We outline how our approach can be used to optimize non-contiguous data transfers in PGAS programs automatically. We observed a 37% performance gain over the fastest of either packing or individual sending for a realistic application.
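    The general idea of combining partial packing with communication overlap can be sketched as below, here expressed over MPI-3 one-sided communication rather than the authors' implementation; the function names, buffer layout (the target receives the data already packed), chunk size, and double-buffered flushing are assumptions for illustration, and any real overlap of packing with transfer depends on the MPI library making asynchronous progress.

```c
/* Sketch: pipelined partial packing of a strided buffer over MPI-3 RMA.
 * While one packed chunk is in flight, the next chunk is being packed. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS          1024   /* number of contiguous segments           */
#define BLOCK_BYTES       512   /* size of each contiguous segment         */
#define STRIDE           2048   /* distance between segment starts (bytes) */
#define BLOCKS_PER_CHUNK   64   /* pipeline granularity (tuning parameter) */

/* Pack blocks [first, first+count) of the strided source into dst. */
static void pack_chunk(const char *src, char *dst, int first, int count)
{
    for (int b = 0; b < count; b++)
        memcpy(dst + b * BLOCK_BYTES,
               src + (size_t)(first + b) * STRIDE, BLOCK_BYTES);
}

/* Transfer the whole strided region to rank `target` through window `win`,
 * chunk by chunk, using two staging buffers so packing of chunk i can
 * overlap the transfer of chunk i-1. */
static void strided_put_pipelined(const char *src, MPI_Win win, int target)
{
    char *stage[2];
    stage[0] = malloc((size_t)BLOCKS_PER_CHUNK * BLOCK_BYTES);
    stage[1] = malloc((size_t)BLOCKS_PER_CHUNK * BLOCK_BYTES);

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    for (int first = 0, i = 0; first < NBLOCKS; first += BLOCKS_PER_CHUNK, i++) {
        int count = (NBLOCKS - first < BLOCKS_PER_CHUNK)
                  ? NBLOCKS - first : BLOCKS_PER_CHUNK;
        char *buf = stage[i % 2];

        pack_chunk(src, buf, first, count);      /* may overlap the previous put */
        MPI_Put(buf, count * BLOCK_BYTES, MPI_BYTE, target,
                (MPI_Aint)first * BLOCK_BYTES,
                count * BLOCK_BYTES, MPI_BYTE, win);

        if (i % 2 == 1)              /* keep at most two chunks in flight, so a
                                        staging buffer is never reused too early */
            MPI_Win_flush(target, win);
    }
    MPI_Win_flush(target, win);      /* drain the pipeline */
    MPI_Win_unlock(target, win);

    free(stage[0]);
    free(stage[1]);
}
```

    Full packing corresponds to BLOCKS_PER_CHUNK = NBLOCKS (no overlap), while sending each segment individually corresponds to BLOCKS_PER_CHUNK = 1 (many small transfers); the pipelined scheme sits in between, which is why choosing the chunk size from a performance model matters.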