MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface

Author: Foster I.
Karonis N. T.
Toonen B.
Publication date: 01/01/2002

Application development for distributed computing "Grids" can benefit from tools that variously hide or enable application-level management of critical aspects of the heterogeneous environment. As part of an investigation of these issues, we have developed MPICH-G2, a Grid-enabled implementation of the Message Passing Interface (MPI) that allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer. This library extends the Argonne MPICH implementation of MPI to use services provided by the Globus Toolkit for authentication, authorization, resource allocation, executable staging, and I/O, as well as for process creation, monitoring, and control. Various performance-critical operations, including startup and collective operations, are configured to exploit network topology information. The library also exploits MPI constructs for performance management; for example, the MPI communicator construct is used for application-level discovery of, and adaptation to, both network topology and network quality-of-service mechanisms. We describe the MPICH-G2 design and implementation, present performance results, and review application experiences, including record-setting distributed simulations. Comment: 20 pages, 8 figures

arXiv.org e-Print Archive

CiteSeerX

Introduction to the Globus toolkit

Author: Aloisio G
Cafaro M
Publication venue: CERN
Publication date: 01/01/2000

CERN Document Server

Cooperative fault-tolerant distributed computing U.S. Department of Energy Grant DE-FG02-02ER25537 Final Report

Author: Sunderam Vaidy S.
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 09/01/2007

The Harness project has developed novel software frameworks for the execution of high-end simulations in a fault-tolerant manner on distributed resources. The H2O subsystem comprises the kernel of the Harness framework, and controls the key functions of resource management across multiple administrative domains, especially issues of access and allocation. It is based on a “pluggable” architecture that enables the aggregated use of distributed heterogeneous resources for high performance computing. The major contributions of the Harness II project significantly enhance the overall computational productivity of high-end scientific applications by enabling robust, failure-resilient computations on cooperatively pooled resource collections.

UNT Digital Library

Advanced Message Routing for Scalable Distributed Simulations

Author: Barrett Brian
Gottschalk Thomas
Publication date: 01/12/2004

The Joint Forces Command (JFCOM) Experimentation Directorate (J9)'s recent Joint Urban Operations (JUO) experiments have demonstrated the viability of Forces Modeling and Simulation in a distributed environment. The JSAF application suite, combined with the RTI-s communications system, provides the ability to run distributed simulations with sites located across the United States, from Norfolk, Virginia to Maui, Hawaii. Interest-aware routers are essential for communications in large, distributed environments, and the current RTI-s framework provides such routers connected in a straightforward tree topology. This approach is successful for small- to medium-sized simulations, but faces a number of significant limitations for very large simulations over high-latency, wide area networks. In particular, traffic is forced through a single site, drastically increasing the distances messages must travel to sites not near the top of the tree. Aggregate bandwidth is limited to the bandwidth of the site hosting the top router, and failures in the upper levels of the router tree can result in widespread communications losses throughout the system. To resolve these issues, this work extends the RTI-s software router infrastructure to accommodate more sophisticated, general router topologies, including both the existing tree framework and a new generalization of the fully connected mesh topologies used in the SF Express ModSAF simulations of 100K fully interacting vehicles. The new software router objects incorporate the scalable features of the SF Express design, while optionally using low-level RTI-s objects to perform actual site-to-site communications. The (substantial) limitations of the original mesh router formalism have been eliminated, allowing fully dynamic operations. The mesh topology capabilities allow aggregate bandwidth and site-to-site latencies to match actual network performance. The heavy resource load at the root node can now be distributed across routers at the participating sites.

Caltech Authors

Gestion des réseaux multi-grappes hétérogènes avec la bibliothèque Madeleine III [Managing heterogeneous multi-cluster networks with the Madeleine III library]

Author: Aumage Olivier
Publication venue: HAL CCSD
Publication date: 01/02/2002

This paper introduces the new version of the Madeleine portable multi-protocol communication library. Madeleine version III now includes full, flexible multi-cluster support combined with a redesigned version of the transparent multi-network message forwarding mechanism. Madeleine III works together with a new configuration management module to handle a wide range of network-heterogeneous multi-cluster configurations. The integration of a new topology information system allows programmers of parallel computing applications to build highly optimized distributed algorithms on top of the transparent multi-network communication system provided by Madeleine III's virtual networks. The preliminary experiments we conducted on the new virtual network capabilities of Madeleine III showed an asymptotic bandwidth of 43 MB/s over a virtual link made of a SISCI/SCI and a BIP/Myrinet physical link.

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

Scalable networked information processing environment (SNIPE)

Author: Davidson
Fagg
Fagg
Foster
Graham E Fagg
Grimshaw
Herlihy
Jack J Dongarra
Keith Moore
Skeen
Smarr
Publication venue: 'Elsevier BV'

Crossref

Grid-enabling problem solving environments: a case study of SCIRun and NetSolve

Author: Johnson Christopher R.
Miller Michelle
Publication venue: 'Society for Modeling and Simulation International (SCS)'
Publication date: 01/01/2001

Journal Article. Combining the functionality of NetSolve, a grid-based middleware solution, with SCIRun, a graphically-based problem solving environment (PSE), yields a platform for creating and executing grid-enabled applications. Using this integrated system, hardware and/or software resources not previously accessible to a user become available completely behind the scenes. Neither the SCIRun system nor the SCIRun user needs to know any details about how these resources are located and utilized. A SCIRun module merely makes an RPC-style call to NetSolve via the NetSolve C language API to invoke a certain routine and to pass its data. Distributed computation and the details of remote communication are completely abstracted away from the SCIRun framework and its end user.

The University of Utah: J. Willard Marriott Digital Library

RELEASE: A High-level Paradigm for Reliable Large-scale Server Software

Author: Chechina Natalia
Trinder Phil
Publication date: 01/01/2012

Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project, and describes the progress in the first six months. The project aim is to scale Erlang's radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Currently Erlang has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state of the art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the effectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene.

Enlighten