1,594 research outputs found
Efficient Message Passing Interface (MPI) for Parallel Computing on Clusters of Workstations
Parallel computing on clusters of workstations and personal computers has very high
potential, since it leverages existing hardware and software. Parallel programming
environments offer the user a convenient way to express parallel computation and communication.
In fact, recently, a Message Passing Interface (MPI) has been proposed as an industrial
standard for writing "portable" message-passing parallel programs. The communication
part of MPI consists of the usual point-to-point communication as well as collective
communication. However, existing implementations of programming environments for clusters
are built on top of a point-to-point communication layer (send and receive) over local
area networks (LANs) and, as a result, suffer from poor performance in the collective
communication part.
In this paper, we present an efficient design and implementation of the collective
communication part in MPI that is optimized for clusters of workstations. Our system consists
of two main components: the MPI-CCL layer that includes the collective communication
functionality of MPI and a User-level Reliable Transport Protocol (URTP) that interfaces
with the LAN Data-link layer and leverages the fact that the LAN is a broadcast medium.
Our system is integrated with the operating system via an efficient kernel extension
mechanism that we developed. The kernel extension significantly improves the performance of
our implementation as it can handle part of the communication overhead without involving
user space.
We have implemented our system on a collection of IBM RS/6000 workstations con-
nected via a lOMbit Ethernet LAN. Our performance measurements are taken from typical
scientific programs that run in a parallel mode by means of the MPI. The hypothesis behind
our design is that system's performance will be bounded by interactions between the kernel
and user space rather than by the bandwidth delivered by the LAN Data-Link Layer. Our
results indicate that the performance of our MPI Broadcast (on top of Ethernet) is about
twice as fast as a recently published software implementation of broadcast on top of ATM
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface
Application development for distributed computing "Grids" can benefit from
tools that variously hide or enable application-level management of critical
aspects of the heterogeneous environment. As part of an investigation of these
issues, we have developed MPICH-G2, a Grid-enabled implementation of the
Message Passing Interface (MPI) that allows a user to run MPI programs across
multiple computers, at the same or different sites, using the same commands
that would be used on a parallel computer. This library extends the Argonne
MPICH implementation of MPI to use services provided by the Globus Toolkit for
authentication, authorization, resource allocation, executable staging, and
I/O, as well as for process creation, monitoring, and control. Various
performance-critical operations, including startup and collective operations,
are configured to exploit network topology information. The library also
exploits MPI constructs for performance management; for example, the MPI
communicator construct is used for application-level discovery of, and
adaptation to, both network topology and network quality-of-service mechanisms.
We describe the MPICH-G2 design and implementation, present performance
results, and review application experiences, including record-setting
distributed simulations.Comment: 20 pages, 8 figure
Recommended from our members
Leveraging legacy codes to distributed problem solving environments: A web service approach
This paper describes techniques used to leverage high performance legacy codes as CORBA components to a distributed problem solving environment. It first briefly introduces the software architecture adopted by the environment. Then it presents a CORBA oriented wrapper generator (COWG) which can be used to automatically wrap high performance legacy codes as CORBA components. Two legacy codes have been wrapped with COWG. One is an MPI-based molecular dynamic simulation (MDS) code, the other is a finite element based computational fluid dynamics (CFD) code for simulating incompressible Navier-Stokes flows. Performance comparisons between runs of the MDS CORBA component and the original MDS legacy code on a cluster of workstations and on a parallel computer are also presented. Wrapped as CORBA components, these legacy codes can be reused in a distributed computing environment. The first case shows that high performance can be maintained with the wrapped MDS component. The second case shows that a Web user can submit a task to the wrapped CFD component through a Web page without knowing the exact implementation of the component. In this way, a user’s desktop computing environment can be extended to a high performance computing environment using a cluster of workstations or a parallel computer
Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations
This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered
COMPARATIVE ANALYSIS OF PVM AND MPI FOR THE DEVELOPMENT OF PHYSICAL APPLICATIONS IN PARALLEL AND DISTRIBUTED SYSTEMS
This research is aimed to explore each of these two Parallel Virtual Machine (PVM) and Message Passing Interface(MPI) vehicles for DPP (Distributed Parallel Processing) considering capability, ease of use, and availability, and compares their distinguishing features and also explores programmer interface and their utilization for solving real world parallel processing applications. This work recommends a potential research issue, that is, to study the feasibility of creating a programming environment that allows access to the virtual machine features of PVM and the message passing features of MPI. PVM and MPI, two systems for programming clusters, are often compared. Each system has its unique strengths and this will remain so in to the foreseeable future. The comparisons usually start with the unspoken assumption that PVM and MPI represent different solutions to the same problem. In this paper we show that, in fact, the two systems often are solving different problems. In cases where the problems do match but the solutions chosen by PVM and MPI are different, we explain the reasons for the differences. Usually such differences can be traced to explicit differences in the goals of the two systems, their origins, or the relationship between their specifications and their implementations. This paper also compares PVM and MPI features, pointing out the situations where one may be favored over the other; it explains the deference’s between these systems and the reasons for such deference’s
PCODE: an efficient and reliable collective communication protocol for unreliable broadcast domain
Existing programming environments for clusters are typically built on top of a point-to-point communication layer (send and receive) over local area networks (LANs) and, as a result, suffer from poor performance in the collective communication part. For example, a broadcast that is implemented using a TCP/IP protocol (which is a point-to-point protocol) over a LAN is obviously inefficient as it is not utilizing the fact that the LAN is a broadcast medium. We have observed that the main difference between a distributed computing paradigm and a message passing parallel computing paradigm is that, in a distributed environment the activity of every processor is independent while in a parallel environment the collection of the user-communication layers in the processors can be modeled as a single global program. We have formalized the requirements by defining the notion of a correct global program. This notion provides a precise specification of the interface between the transport layer and the user-communication layer. We have developed PCODE, a new communication protocol that is driven by a global program and proved its correctness.
We have implemented the PCODE protocol on a collection of IBM RS/6000 workstations and on a collection of Silicon Graphics Indigo workstations, both communicating via UDP broadcast. The experimental results we obtained indicate that the performance advantage of PCODE over the current point-to-point approach (TCP) can be as high as an order of magnitude on a cluster of 16 workstations
State-of-the-Art in Parallel Computing with R
R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly useful for general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems four different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix
- …