156 research outputs found

    Implementation of MPICH on top of MPLi̲te

    Get PDF
    The goal of this thesis is to develop a new Channel Interface device for the MPICH implementation of the MPI (Message Passing Interface) standard using MPLi̲te. MPLi̲te is a lightweight message-passing library that is not a full MPI implementation, but offers high performance. MPICH (Message Passing Interface CHameleon) is a full implementation of the MPI standard that has the p4 library as the underlying communication device for TCP/IP networks. By integrating MPLi̲te as a Channel Interface device in MPICH, a parallel programmer can utilize the full MPI implementation of MPICH as well as the high bandwidth offered by MPLi̲te. There are several layers in the MPICH library where one can tie a new device. The Channel Interface is the lowest layer that requires very few functions to add a new device. By attaching MPLi̲te to MPICH at the lowest level, the Channel Interface, almost all of the performance of the MPLi̲te library can be delivered to the applications using MPICH. MPLi̲te can be implemented either as a blocking or a non-blocking Channel Interface device. The performance was measured on two separate test clusters, the PC and the Alpha mini-clusters, having Gigabit Ethernet connections. The PC cluster has two 1.8 GHz Pentium 4 PCs and the Alpha cluster has two 500 MHz Compaq DS20 workstations. Different network interface cards like Netgear, TrendNet and SysKonnect Gigabit Ethernet cards were used for the measurements. Both the blocking and non-blocking MPICH-MPLi̲te Channel Interface devices perform close to raw TCP, whereas a performance loss of 25-30% is seen in the MPICH-p4 Channel Interface device for larger messages. The superior performance offered by the MPICH-MPLi̲te device compared to the MPICH-p4 device can be easily seen on the SysKonnect cards using jumbo frames. The throughput curve also improves considerably by increasing the Eager/Rendezvous threshold

    Optimizing computation-communication overlap in asynchronous task-based programs

    Get PDF
    Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively.Peer ReviewedPostprint (author's final draft

    Implementation of MPICH on Top of MP_Lite

    Get PDF
    The goal of this thesis is to develop a new Channel Interface device for the MPICH Implementation of the MPI (Message Passing Interface) standard using MP{_}Lite. MP{_}Lite is a lightweight message-passing library that is not a full MPI implementation, but offers high performance MPICH (Message Passing Interface CHameleon) is a full implementation of the MPI standard that has the p4 library as the underlying communication device for TCP/IP networks. By integrating MP{_}Lite as a Channel Interface device in MPICH, a parallel programmer can utilize the full MPI implementation of MPICH as well as the high bandwidth offered by MP{_}Lite. There are several layers in the MPICH library where one can tie a new device. The Channel Interface is the lowest layer that requires very few functions to add a new device. By attaching MP{_}Lite to MPICH at the lowest level, the Channel Interface, almost all of the performance of the MP{_}Lite library can be delivered to the applications using MPICH. MP{_}Lite can be implemented either as a blocking or a non-blocking Channel Interface device. The performance was measured on two separate test clusters, the PC and the Alpha miniclusters, having Gigabit Ethernet connections. The PC cluster has two 1.8 GHz Pentium 4 PCs and the Alpha cluster has two 500 MHz Compaq DS20 workstations. Different network interface cards like Netgear, TrendNet and SysKonnect Gigabit Ethernet cards were used for the measurements. Both the blocking and non-blocking MPICH-MP{_}Lite Channel Interface devices perform close to raw TCP, whereas a performance loss of 25-30% is seen in the MPICH-p4 Channel Interface device for larger messages. The superior performance offered by the MPICH-MP{_}Lite device compared to the MPICH-p4 device can be easily seen on the SysKonnect cards using jumbo frames. The throughput curve also improves considerably by increasing the Eager/Rendezvous threshold

    McMPI – a managed-code message passing interface library for high performance communication in C#

    Get PDF
    This work endeavours to achieve technology transfer between established best-practice in academic high-performance computing and current techniques in commercial high-productivity computing. It shows that a credible high-performance message-passing communication library, with semantics and syntax following the Message-Passing Interface (MPI) Standard, can be built in pure C# (one of the .Net suite of computer languages). Message-passing has been the dominant paradigm in high-performance parallel programming of distributed-memory computer architectures for three decades. The MPI Standard originally distilled architecture-independent and language-agnostic ideas from existing specialised communication libraries and has since been enhanced and extended. Object-oriented languages can increase programmer productivity, for example by allowing complexity to be managed through encapsulation. Both the C# computer language and the .Net common language runtime (CLR) were originally developed by Microsoft Corporation but have since been standardised by the European Computer Manufacturers Association (ECMA) and the International Standards Organisation (ISO), which facilitates portability of source-code and compiled binary programs to a variety of operating systems and hardware. Combining these two open and mature technologies enables mainstream programmers to write tightly-coupled parallel programs in a popular standardised object-oriented language that is portable to most modern operating systems and hardware architectures. This work also establishes that a thread-to-thread delivery option increases shared-memory communication performance between MPI ranks on the same node. This suggests that the thread-as-rank threading model should be explicitly specified in future versions of the MPI Standard and then added to existing MPI libraries for use by thread-safe parallel codes. This work also ascertains that the C# socket object suffers from undesirable characteristics that are critical to communication performance and proposes ways of improving the implementation of this object

    An implementation of active objects in Java

    Full text link
    Active objects are a form of multitasking for computer systems. Active objects manipulate their own execution thread instead of using the execution thread of the object that created them. When a method is invoked on an active object, the call returns immediately and the caller continues execution. Thus, active objects can be utilized to develop parallel applications; Active object model can be implemented in a number of different ways: with patterns, external libraries or extending the language. The solution proposed by this thesis is to implement active objects by extending the Java language with new keywords. We have modified Sun\u27s open-source Java Compiler to accept the added keywords and to translate them into regular Java syntax; An \u27active\u27 modifier was introduced to mark active objects; active objects can be created on remote machines and communication with them is done using RMI; active object\u27s methods can be executed asynchronously and the results obtained later using a new \u27waitfor\u27 statement

    Formal specification of MPI 2.0: case study in specifying a practical concurrent programming API

    Get PDF
    technical reportWe describe the rst formal speci cation of a non-trivial subset of MPI, the dominant communication API in high performance computing. Engineering a formal speci cation for a non-trivial concurrency API requires the right combination of rigor, executability, and traceability, while also serving as a smooth elaboration of a pre-existing informal speci - cation. It also requires the modularization of reusable speci cation components to keep the length of the speci cation in check. Long-lived APIs such as MPI are not usually `textbook minimalistic' because they support a diverse array of applications, a diverse community of users, and have ef cient implementations over decades of computing hardware. We choose the TLA+ notation to write our speci cations, and describe how we organized the speci - cation of 150 of the 300 MPI 2.0 functions. We detail a handful of these functions in this paper, and assess our speci cation with respect to the aforesaid requirements. We close with a description of possible approaches that may help render the act of writing, understanding, and validating speci cations much more productive

    Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

    Full text link
    Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate injection of delays can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics. They are useful in spotting certain patterns in parallel execution that will easily go unnoticed with traditional tracing tools. We investigate five different microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers, and the LULESH and HPCG proxy applications.Comment: 18 pages, 14 figures, 7 tables. Corrected Fig. 4 layou
    corecore