
    Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem

    NewMadeleine: An Efficient Support for High-Performance Networks in MPICH2

    This paper describes how the NewMadeleine communication library has been integrated within the MPICH2 MPI implementation and the benefits this integration brings. NewMadeleine is integrated as a Nemesis network module, but the upper layers, in particular the CH3 layer, have also been modified. By doing so, we allow NewMadeleine to fully deliver its performance to an MPI application. NewMadeleine features sophisticated strategies for sending messages and natively supports multirail network configurations, even heterogeneous ones. It also uses a software element called PIOMan that relies on multithreading to enhance reactivity and build more efficient progress engines. We show various results demonstrating that NewMadeleine is indeed well suited as a low-level communication library for building MPI implementations.

    High Throughput Intra-Node MPI Communication with Open-MX

    The increasing number of cores per node in high-performance computing requires an efficient intra-node MPI communication subsystem. Most existing MPI implementations rely on two copies across a shared memory-mapped file. Open-MX offers a single-copy mechanism that is tightly integrated in its regular communication stack, making it transparently available to the MX backend of many MPI layers. We describe this implementation and its offloaded copy backend using I/OAT hardware. Memory pinning requirements are then discussed, and overlapped pinning is introduced so that Open-MX intra-node data transfers can start earlier. Performance evaluation shows that this local communication stack performs better than MPICH2 and Open MPI for large messages, reaching up to 70% better throughput in micro-benchmarks when using I/OAT copy offload. Because only a single copy is involved, Open-MX intra-node communication throughput also does not depend heavily on cache sharing between processing cores, making these performance improvements easier to observe in real applications.
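    The memory pinning mentioned above keeps the pages of a message buffer resident so that a DMA engine or an in-kernel copy can safely access them. The sketch below is only a conceptual illustration using the standard POSIX mlock/munlock calls; Open-MX performs its pinning inside the kernel, and the buffer size used here is arbitrary.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        #define BUF_SIZE (64 * 1024)   /* arbitrary message buffer size */

        int main(void)
        {
            /* Allocate the buffer that would be handed to the communication layer. */
            char *buf = malloc(BUF_SIZE);
            if (buf == NULL) {
                perror("malloc");
                return EXIT_FAILURE;
            }
            memset(buf, 0, BUF_SIZE);   /* touch the pages so they are actually mapped */

            /* Pin the pages: they can no longer be swapped out, so a DMA engine
             * or an in-kernel copy can safely access them. */
            if (mlock(buf, BUF_SIZE) != 0) {
                perror("mlock");
                free(buf);
                return EXIT_FAILURE;
            }

            /* ... the pinned buffer would be posted for a send or receive here ... */

            /* Unpin and release once the transfer has completed. */
            munlock(buf, BUF_SIZE);
            free(buf);
            return EXIT_SUCCESS;
        }

    Overlapped pinning, as described in the abstract, starts transferring the pages that are already pinned while the remaining pages are still being pinned, instead of waiting for the whole buffer.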

    Evaluation of messaging middleware for high-performance cloud computing

    This is a post-peer-review, pre-copyedit version of an article published in Personal and Ubiquitous Computing. The final authenticated version is available online at: http://dx.doi.org/10.1007/s00779-012-0605-3
    Cloud computing poses several challenges, such as security, fault tolerance, access interface singularity, and network constraints, both in terms of latency and bandwidth. In this scenario, the performance of communications depends both on the network fabric and on its efficient support in virtualized environments, which ultimately determines the overall system performance. To overcome the current network constraints in cloud services, providers are deploying high-speed networks, such as 10 Gigabit Ethernet. This paper presents an evaluation of high-performance computing message-passing middleware on a cloud computing infrastructure, Amazon EC2 cluster compute instances, equipped with 10 Gigabit Ethernet. The analysis of the experimental results, compared against a similar testbed, shows the significant impact that virtualized environments still have on communication performance, which demands more efficient communication middleware support to overcome the current cloud network limitations.
    Funding: Ministerio de Ciencia e Innovación (TIN2010-16735); Ministerio de Educación y Ciencia (AP2010-434)
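    Evaluations of message-passing middleware such as the one above rely heavily on point-to-point micro-benchmarks. The following sketch is a minimal MPI ping-pong in C that reports the half round-trip latency for a fixed message size; it only illustrates the measurement methodology, and the message size and iteration count chosen here are arbitrary rather than those used in the paper.

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define MSG_SIZE   1024      /* message size in bytes (arbitrary)  */
        #define ITERATIONS 1000      /* number of ping-pong round trips    */

        int main(int argc, char **argv)
        {
            int rank, size;
            char *buf = malloc(MSG_SIZE);

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            if (size < 2) {
                if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
                MPI_Abort(MPI_COMM_WORLD, 1);
            }

            MPI_Barrier(MPI_COMM_WORLD);
            double start = MPI_Wtime();

            for (int i = 0; i < ITERATIONS; i++) {
                if (rank == 0) {
                    /* rank 0 sends the ping and waits for the pong */
                    MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    /* rank 1 echoes every message back */
                    MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }

            double elapsed = MPI_Wtime() - start;
            if (rank == 0)
                printf("%d-byte half round-trip latency: %.2f us\n",
                       MSG_SIZE, elapsed / (2.0 * ITERATIONS) * 1e6);

            free(buf);
            MPI_Finalize();
            return 0;
        }

    Bandwidth is typically measured the same way, using large messages and dividing the transferred volume by the elapsed time.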

    Design of Scalable Java Communication Middleware for Multi-Core Systems

    This is a post-peer-review, pre-copyedit version of an article published in The Computer Journal. The final authenticated version is available online at: https://doi.org/10.1093/comjnl/bxs122
    This paper presents smdev, a shared memory communication middleware for multi-core systems. smdev provides a simple and powerful messaging application programming interface that exploits the underlying multi-core architecture, replacing inter-process and network-based communications with threads and shared memory transfers. The performance evaluation of smdev on several multi-core systems has shown noticeable improvements over other Java shared memory solutions, matching and even exceeding the performance of natively compiled libraries. Thus, smdev has obtained start-up latencies around 0.76 μs and almost 90 Gbps bandwidth for point-to-point communications, as well as high performance and scalability both for collective operations and for representative messaging kernels. This has motivated the integration of smdev in F-MPJ, our message-passing implementation in Java.
    Funding: Ministerio de Ciencia e Innovación (TIN2010-1673)
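    The central idea of smdev, replacing inter-process and network transfers with exchanges between threads through shared memory, can be illustrated outside Java as well. The following C/pthreads single-slot mailbox is a heavily simplified sketch of that pattern; it is not smdev's API, which is a Java message-passing interface.

        #include <pthread.h>
        #include <stdio.h>
        #include <string.h>

        #define SLOT_SIZE 256

        /* Single-slot mailbox shared by the two threads. */
        struct mailbox {
            pthread_mutex_t lock;
            pthread_cond_t  cond;
            int             full;           /* 1 when a message is waiting */
            char            data[SLOT_SIZE];
        };

        static struct mailbox box = {
            .lock = PTHREAD_MUTEX_INITIALIZER,
            .cond = PTHREAD_COND_INITIALIZER,
            .full = 0
        };

        /* "Send": copy the message into the shared slot and wake the receiver. */
        static void mailbox_send(const char *msg)
        {
            pthread_mutex_lock(&box.lock);
            while (box.full)                    /* wait until the slot is free */
                pthread_cond_wait(&box.cond, &box.lock);
            strncpy(box.data, msg, SLOT_SIZE - 1);
            box.data[SLOT_SIZE - 1] = '\0';
            box.full = 1;
            pthread_cond_signal(&box.cond);
            pthread_mutex_unlock(&box.lock);
        }

        /* "Receive": wait for a message and copy it out of the shared slot. */
        static void mailbox_recv(char *out)
        {
            pthread_mutex_lock(&box.lock);
            while (!box.full)                   /* wait until a message arrives */
                pthread_cond_wait(&box.cond, &box.lock);
            strncpy(out, box.data, SLOT_SIZE);
            box.full = 0;
            pthread_cond_signal(&box.cond);
            pthread_mutex_unlock(&box.lock);
        }

        static void *receiver(void *arg)
        {
            char msg[SLOT_SIZE];
            (void)arg;
            mailbox_recv(msg);
            printf("received: %s\n", msg);
            return NULL;
        }

        int main(void)
        {
            pthread_t t;
            pthread_create(&t, NULL, receiver, NULL);
            mailbox_send("hello through shared memory");
            pthread_join(t, NULL);
            return 0;
        }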

    Efficient Intranode Communication in GPU-Accelerated Systems

    Accelerator awareness has become a pressing issue in data movement models, such as MPI, because of the rapid deployment of systems that utilize accelerators. In our previous work, we developed techniques to enhance MPI with accelerator awareness, thus allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend this work with techniques to perform efficient data movement between accelerators within the same node, using a DMA-assisted, peer-to-peer intranode communication technique that was recently introduced for NVIDIA GPUs. We present a detailed design of our new approach to intranode communication and evaluate the improvement it brings to communication and application performance using micro-kernel benchmarks and a 2D stencil application kernel.
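    The DMA-assisted, peer-to-peer intranode mechanism referred to above is exposed to applications through the CUDA runtime's peer access calls. The host-side sketch below shows the basic primitive (enable peer access, then copy directly between two devices' memories); it illustrates the underlying mechanism, not the MPI integration developed in the paper, and error checking is omitted for brevity.

        #include <cuda_runtime.h>
        #include <stdio.h>

        #define NBYTES (1 << 20)   /* 1 MiB transfer (arbitrary size) */

        int main(void)
        {
            int can_access = 0;
            void *src = NULL, *dst = NULL;

            /* Check whether device 0 can directly access device 1's memory. */
            cudaDeviceCanAccessPeer(&can_access, 0, 1);
            if (!can_access) {
                fprintf(stderr, "peer access between GPU 0 and GPU 1 unavailable\n");
                return 1;
            }

            /* Enable peer access from device 0 to device 1 and allocate buffers. */
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);
            cudaMalloc(&src, NBYTES);

            cudaSetDevice(1);
            cudaMalloc(&dst, NBYTES);

            /* DMA the buffer directly from GPU 0 to GPU 1 without staging
             * through host memory. */
            cudaSetDevice(0);
            cudaMemcpyPeer(dst, 1, src, 0, NBYTES);
            cudaDeviceSynchronize();

            cudaFree(src);
            cudaSetDevice(1);
            cudaFree(dst);
            return 0;
        }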

    Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis

    The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the double-buffering strategy it shares with many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.
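    The vmsplice-based strategy mentioned above attaches the sender's user pages to a pipe instead of copying them into an intermediate shared buffer, leaving a single copy on the receiver side. The following sketch shows the bare system-call pattern between a parent (sender) and a child (receiver); the actual MPICH2 implementation adds rendezvous handshakes, chunking, and support for noncontiguous datatypes.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/uio.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #define LEN 65536   /* one chunk; real implementations pipeline larger messages */

        int main(void)
        {
            int pipefd[2];
            static char send_buf[LEN], recv_buf[LEN];

            memset(send_buf, 'x', LEN);
            if (pipe(pipefd) != 0) {
                perror("pipe");
                return EXIT_FAILURE;
            }

            if (fork() == 0) {
                /* Receiver: the single memory copy happens here, from the pipe
                 * (which only references the sender's pages) into recv_buf. */
                close(pipefd[1]);
                size_t got = 0;
                while (got < LEN) {
                    ssize_t n = read(pipefd[0], recv_buf + got, LEN - got);
                    if (n <= 0) break;
                    got += n;
                }
                printf("receiver got %zu bytes\n", got);
                _exit(0);
            }

            /* Sender: vmsplice() attaches the user pages to the pipe without
             * copying them. */
            close(pipefd[0]);
            struct iovec iov = { .iov_base = send_buf, .iov_len = LEN };
            if (vmsplice(pipefd[1], &iov, 1, 0) < 0)
                perror("vmsplice");
            close(pipefd[1]);
            wait(NULL);
            return 0;
        }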

    Efficient shared memory message passing for inter-VM communications

    Thanks to recent advances in virtualization technologies, it is now possible to benefit from the flexibility brought by virtual machines at little cost in terms of CPU performance. However, on HPC clusters some overheads remain that prevent the widespread use of virtualization. In this article, we tackle the issue of inter-VM MPI communications when VMs are located on the same physical machine. To achieve this, we introduce a virtual device that provides a simple message-passing API to the guest OS. This interface can then be used to implement an efficient MPI library for virtual machines. The use of a virtual device makes our solution easily portable across multiple guest operating systems, since it only requires a small driver to be written for this device. We present an implementation based on Linux, the KVM hypervisor, and QEMU as its userspace device emulator. Our implementation achieves near-native performance in terms of MPI latency and bandwidth.

    KNEM: a Generic and Scalable Kernel-Assisted Intra-node MPI Communication Framework

    The growing number of cores in today's architectures increases the importance of intra-node communication in modern clusters and its impact on overall parallel application performance. Although several proposals have focused on this issue in the past, there is still a need for a portable and hardware-independent solution that addresses the requirements of both point-to-point and collective MPI operations inside shared-memory computing nodes. This paper presents the KNEM module for the Linux kernel, which provides MPI implementations with a flexible and scalable interface for performing kernel-assisted single-copy data transfers between local processes. It enables high-performance communication within most existing MPI implementations and brings significant application performance improvements thanks to more efficient point-to-point and collective operations.
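    KNEM itself is driven through an ioctl interface on a character device, which is not reproduced here. The Linux cross-memory-attach syscalls that later entered the mainline kernel expose the same kernel-assisted single-copy idea, and the sketch below uses them as an illustration: one local process pulls data directly out of another process's address space with a single kernel-level copy. The peer PID and remote buffer address are placeholders; in an MPI implementation they would be obtained from a rendezvous handshake.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/uio.h>
        #include <unistd.h>

        /* Single-copy pull of `len` bytes from `remote_addr` in process `pid`
         * into `local_buf`. The kernel copies directly between the two address
         * spaces, so no intermediate shared buffer is needed. */
        static ssize_t pull_from_peer(pid_t pid, void *remote_addr,
                                      void *local_buf, size_t len)
        {
            struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
            struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

            /* process_vm_readv(): one kernel-level copy from the peer's memory. */
            return process_vm_readv(pid, &local, 1, &remote, 1, 0);
        }

        int main(void)
        {
            /* Placeholder values: in a real MPI transfer the sender's PID and
             * buffer address come from the rendezvous handshake. */
            pid_t peer_pid = 0;
            void *peer_buf = NULL;
            char local_buf[4096];

            if (peer_pid == 0) {
                fprintf(stderr, "no peer configured; this is only a sketch\n");
                return 0;
            }

            ssize_t n = pull_from_peer(peer_pid, peer_buf, local_buf,
                                       sizeof(local_buf));
            if (n < 0)
                perror("process_vm_readv");
            else
                printf("pulled %zd bytes from peer %d\n", n, (int)peer_pid);
            return 0;
        }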