Search CORE

31 research outputs found

Scalable networked information processing environment (SNIPE)

Author: Davidson
Fagg
Fagg
Foster
Graham E Fagg
Grimshaw
Herlihy
Jack J Dongarra
Keith Moore
Skeen
Smarr
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Harness: The next generation beyond PVM

Author: A. Geist
A. Geist
M. Snir
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

A reconfigurable component-based problem solving environment

Author: Coddington P.
Hawick K.
James H.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001
Field of study

©2001 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.Problem solving environments are an attractive approach to the integration of calculation and management tools for various scientific and engineering applications. These applications often require high performance computing components in order to be computationally feasible. It is therefore a challenge to construct integration technology, suitable for problem solving environments, that allows both flexibility as well as the embedding of parallel and high performance computing systems. Our DISCWorld system is designed to meet these needs and provides a Java-based middleware to integrate component applications across wide-area networks. Key features of our design are the abilities to: access remotely stored data; compose complex processing requests either graphically or through a scripting language; execute components on heterogeneous and remote platforms; reconfigure task sub-graphs to run across multiple servers. Operators in task graphs can be slow (but portable) “pure Java” implementations or wrappers to fast (platform specific) supercomputer implementations.K. Hawick, H. James, P. Coddingto

CiteSeerX

Crossref

Adelaide Research & Scholarship

FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World

Author: G. E. Fagg
M. Migliardi
P. H. Worley
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

Recommended from our members

An Exploration in Implementing Fault Tolerance in Scientific Simulation Application Software

Author: DRAKE RICHARD R.
SUMMERS RANDALL M.
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 01/05/2003
Field of study

The ability for scientific simulation software to detect and recover from errors and failures of supporting hardware and software layers is becoming more important due to the pressure to shift from large, specialized multi-million dollar ASCI computing platforms to smaller, less expensive interconnected machines consisting of off-the-shelf hardware. As evidenced by the CPlant{trademark} experiences, fault tolerance can be necessary even on such a homogeneous system and may also prove useful in the next generation of ASCI platforms. This report describes a research effort intended to study, implement, and test the feasibility of various fault tolerance mechanisms controlled at the simulation code level. Errors and failures would be detected by underlying software layers, communicated to the application through a convenient interface, and then handled by the simulation code itself. Targeted faults included corrupt communication messages, processor node dropouts, and unacceptable slowdown of service from processing nodes. Recovery techniques such as re-sending communication messages and dynamic reallocation of failing processor nodes were considered. However, most fault tolerance mechanisms rely on underlying software layers which were discovered to be lacking to such a degree that mechanisms at the application level could not be implemented. This research effort has been postponed and shifted to these supporting layers

UNT Digital Library

An Exploration in Implementing Fault Tolerance in Scientific Simulation Application Software

Author
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date
Field of study

Crossref

Improving Scalability and Usability of Parallel Runtime Environments for High Availability and High Performance Systems

Author: Angskun Thara
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2007
Field of study

The number of processors embedded in high performance computing platforms is growing daily to solve larger and more complex problems. Hence, parallel runtime environments have to support and adapt to the underlying platforms that require scalability and fault management in more and more dynamic environments. This dissertation aims to analyze, understand and improve the state of the art mechanisms for managing highly dynamic, large scale applications. This dissertation demonstrates that the use of new scalable and fault-tolerant topologies, combined with rerouting techniques, builds parallel runtime environments, which are able to efficiently and reliably deliver sets of information to a large number of processes. Several important graph properties are provided to illustrate the theoretical capability of these topologies in terms of both scalability and fault-tolerance, such as reasonable degree, regular graph, low diameter, symmetric graph, low cost factor, low message traffic density, optimal connectivity, low fault-diameter and strongly resilient. The dissertation builds a communication framework based on these topologies to support parallel runtime environments. Such a framework can handle multiple types of messages, e.g., unicast, multicast, broadcast and all-gather. Additionally, the communication framework has been formally verified to work in both normal and failure circumstances without creating any of the common problems such as broadcast storm, deadlock and non-progress cycle

University of Tennessee, Knoxville: Trace

A resource management architecture for metacomputing systems

Author: B. C. Neuman
B. C. Neuman
I. Foster
J. Czyzyk
J. Jones
J. Weissman
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

A Policy-Based Resource Brokering Environment for Computational Grids

Author: Al-Theneyan Ahmed Hamdan
Publication venue: ODU Digital Commons
Publication date: 01/01/2002
Field of study

With the advances in networking infrastructure in general, and the Internet in particular, we can build grid environments that allow users to utilize a diverse set of distributed and heterogeneous resources. Since the focus of such environments is the efficient usage of the underlying resources, a critical component is the resource brokering environment that mediates the discovery, access and usage of these resources. With the consumer\u27s constraints, provider\u27s rules, distributed heterogeneous resources and the large number of scheduling choices, the resource brokering environment needs to decide where to place the user\u27s jobs and when to start their execution in a way that yields the best performance for the user and the best utilization for the resource provider. As brokering and scheduling are very complicated tasks, most current resource brokering environments are either specific to a particular grid environment or have limited features. This makes them unsuitable for large applications with heterogeneous requirements. In addition, most of these resource brokering environments lack flexibility. Policies at the resource-, application-, and system-levels cannot be specified and enforced to provide commitment to the guaranteed level of allocation that can help in attracting grid users and contribute to establishing credibility for existing grid environments. In this thesis, we propose and prototype a flexible and extensible Policy-based Resource Brokering Environment (PROBE) that can be utilized by various grid systems. In designing PROBE, we follow a policy-based approach that provides PROBE with the intelligence to not only match the user\u27s request with the right set of resources, but also to assure the guaranteed level of the allocation. PROBE looks at the task allocation as a Service Level Agreement (SLA) that needs to be enforced between the resource provider and the resource consumer. The policy-based framework is useful in a typical grid environment where resources, most of the time, are not dedicated. In implementing PROBE, we have utilized a layered architecture and façade design patterns. These along with the well-defined API, make the framework independent of any architecture and allow for the incorporation of different types of scheduling algorithms, applications and platform adaptors as the underlying environment requires. We have utilized XML as a base for all the specification needs. This provides a flexible mechanism to specify the heterogeneous resources and user\u27s requests along with their allocation constraints. We have developed XML-based specifications by which high-level internal structures of resources, jobs and policies can be specified. This provides interoperability in which a grid system can utilize PROBE to discover and use resources controlled by other grid systems. We have implemented a prototype of PROBE to demonstrate its feasibility. We also describe a test bed environment and the evaluation experiments that we have conducted to demonstrate the usefulness and effectiveness of our approach

Old Dominion University

Fault Tolerant MPI for the HARNESS Meta-computing System

Author: E. Gabriel
G. E. Fagg
G. E. Fagg
G. E. Fagg
J. L. Traff
M. Frigo
M. Migliardi
P. H. Worley
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref