178 research outputs found
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface
Application development for distributed computing "Grids" can benefit from
tools that variously hide or enable application-level management of critical
aspects of the heterogeneous environment. As part of an investigation of these
issues, we have developed MPICH-G2, a Grid-enabled implementation of the
Message Passing Interface (MPI) that allows a user to run MPI programs across
multiple computers, at the same or different sites, using the same commands
that would be used on a parallel computer. This library extends the Argonne
MPICH implementation of MPI to use services provided by the Globus Toolkit for
authentication, authorization, resource allocation, executable staging, and
I/O, as well as for process creation, monitoring, and control. Various
performance-critical operations, including startup and collective operations,
are configured to exploit network topology information. The library also
exploits MPI constructs for performance management; for example, the MPI
communicator construct is used for application-level discovery of, and
adaptation to, both network topology and network quality-of-service mechanisms.
We describe the MPICH-G2 design and implementation, present performance
results, and review application experiences, including record-setting
distributed simulations. Comment: 20 pages, 8 figures
Cooperative fault-tolerant distributed computing U.S. Department of Energy Grant DE-FG02-02ER25537 Final Report
The Harness project has developed novel software frameworks for the execution of high-end simulations in a fault-tolerant manner on distributed resources. The H2O subsystem comprises the kernel of the Harness framework and controls the key functions of resource management across multiple administrative domains, especially issues of access and allocation. It is based on a “pluggable” architecture that enables the aggregated use of distributed heterogeneous resources for high-performance computing. The major contributions of the Harness II project significantly enhance the overall computational productivity of high-end scientific applications by enabling robust, failure-resilient computations on cooperatively pooled resource collections.
Advanced Message Routing for Scalable Distributed Simulations
The Joint Forces Command (JFCOM) Experimentation Directorate (J9)'s recent Joint Urban Operations (JUO)
experiments have demonstrated the viability of Forces Modeling and Simulation in a distributed environment. The
JSAF application suite, combined with the RTI-s communications system, provides the ability to run distributed
simulations with sites located across the United States, from Norfolk, Virginia to Maui, Hawaii. Interest-aware
routers are essential for communications in large, distributed environments, and the current RTI-s framework
provides such routers connected in a straightforward tree topology. This approach is successful for small to
medium-sized simulations, but faces a number of significant limitations for very large simulations over high-latency, wide
area networks. In particular, traffic is forced through a single site, drastically increasing distances messages must
travel to sites not near the top of the tree. Aggregate bandwidth is limited to the bandwidth of the site hosting the
top router, and failures in the upper levels of the router tree can result in widespread communications losses
throughout the system.
To resolve these issues, this work extends the RTI-s software router infrastructure to accommodate more
sophisticated, general router topologies, including both the existing tree framework and a new generalization of the
fully connected mesh topologies used in the SF Express ModSAF simulations of 100K fully interacting vehicles.
The new software router objects incorporate the scalable features of the SF Express design, while optionally using
low-level RTI-s objects to perform actual site-to-site communications. The (substantial) limitations of the original
mesh router formalism have been eliminated, allowing fully dynamic operations. The mesh topology capabilities
allow aggregate bandwidth and site-to-site latencies to match actual network performance. The heavy resource load at
the root node can now be distributed across routers at the participating sites
Gestion des réseaux multi-grappes hétérogènes avec la bibliothèque Madeleine III (Managing heterogeneous multi-cluster networks with the Madeleine III library)
This paper introduces the new version of the Madeleine portable multi-protocol communication library. Madeleine version III now includes full, flexible multi-cluster support combined with a redesigned version of the transparent multi-network message forwarding mechanism. Madeleine III works together with a new configuration management module to handle a wide range of network-heterogeneous multi-cluster configurations. The integration of a new topology information system allows programmers of parallel computing applications to build highly optimized distributed algorithms on top of the transparent multi-network communication system provided by Madeleine III's virtual networks. The preliminary experiments we conducted on the new virtual network capabilities of Madeleine III showed interesting results, with an asymptotic bandwidth of 43 MB/s over a virtual link made of a SISCI/SCI and a BIP/Myrinet physical link.
Grid-enabling problem solving environments: a case study of SCIRun and NetSolve
Journal Article. Combining the functionality of NetSolve, a grid-based middleware solution, with SCIRun, a graphically based problem solving environment (PSE), yields a platform for creating and executing grid-enabled applications. Using this integrated system, hardware and/or software resources not previously accessible to a user become available completely behind the scenes. Neither the SCIRun system nor the SCIRun user needs to know any details about how these resources are located and utilized. A SCIRun module merely makes an RPC-style call to NetSolve via the NetSolve C language API to invoke a certain routine and to pass its data. Distributed computation and the details of remote communication are completely abstracted away from the SCIRun framework and its end user.
RELEASE: A High-level Paradigm for Reliable Large-scale Server Software
Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project and describes the progress in the first six months. The project aim is to scale Erlang's radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Currently Erlang has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large-scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state-of-the-art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the effectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene