
2 research outputs found

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication date: 31/12/2004

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.

Comment: 11 pages

arXiv.org e-Print Archive

CiteSeerX

Dependable High Performance Computing on a Parallel Sysplex Cluster

Authors: Blochinger Wolfgang, Bündgen Reinhard, Heinemann Andreas
Publication venue: CSREA Press
Publication date: 01/01/2000

In this paper we address the issue of dependable distributed high performance computing in the field of Symbolic Computation. We describe the extension of a middleware infrastructure designed for high performance computing with efficient checkpointing mechanisms. An IBM Parallel Sysplex Cluster is used as the target platform. We consider the satisfiability checking problem for Boolean formulae as an example application from the realm of Symbolic Computation. Time measurements for an implementation of this application on top of the described system environment are given.

TUbiblio