MPI/FT TM: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing *Work performed in part with support from NASA under

Abstract

MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features. Non-portable code for event handlers and recovery management is isolated

    Similar works

    Full text

    thumbnail-image

    Available Versions