Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication date: 01/01/2008

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure occurs during the run. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
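The checksum idea behind Algorithmic Based Fault Tolerance can be illustrated with a minimal serial sketch. This is not the authors' parallel subroutine: the function names are hypothetical, the matrices are small integer lists so that checksums compare exactly, and only a single corrupted entry of the product is handled. With floating-point data the equality tests would need a tolerance.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def abft_multiply(a, b):
    """Multiply the encoded operands; the result carries both checksums."""
    return matmul(with_column_checksum(a), with_row_checksum(b))


def locate_error(cf):
    """Return (row, col) of a single corrupted data entry, or None if consistent."""
    n, m = len(cf) - 1, len(cf[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one error; this sketch cannot correct it")


def correct_error(cf):
    """Rebuild a single corrupted entry in place from its row checksum."""
    loc = locate_error(cf)
    if loc is not None:
        i, j = loc
        m = len(cf[0]) - 1
        cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return cf
```

The mismatching row checksum and column checksum intersect at the damaged entry, which is why one extra row and one extra column are enough to both locate and repair a single error.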

arXiv.org e-Print Archive

CiteSeerX

Recommended from our members

How are we doing? A self-assessment of the quality of services and systems at NERSC (October 1, 1996--September 30, 1997)

Author: Kramer W. T.
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 01/01/1998

Since its inception nearly 25 years ago, the National Energy Research Scientific Computing Center has provided its ever-expanding client base with the latest in scientific computing resources. A key element of NERSC's successful operation is its ability to anticipate and meet the diverse needs of clients. In order to further this strong working relationship, NERSC staff and clients meet periodically via ERSUG to share views, offer training and identify problems and solutions. The success of NERSC is measured in large part by the quality of science produced by its clients. NERSC's job is to give them the reliable tools they need: client support, software and access to computing resources. To ensure that those needs are being met, a set of 10 performance goals pertaining to NERSC systems and service has been established. The goals cover the following areas: reliable and timely service; innovative assistance; timely and accurate information; new technologies; wise technology integration; progress measurement; high-performance computing center leadership; technology transfer; staff effectiveness; and protected infrastructure. This report, covering work from October 1996 through September 1997, has been produced to give NERSC clients, sponsors and staff a better idea of how NERSC is performing.

UNT Digital Library

Distributed Operating Systems

Author: ADAMS C. J.
ALMES G. T.
Andrew S. Tanenbaum
AVIZIENIS A.
BALL J. E.
BIRMAN K. P.
BIRRELL A. D.
BLACK A. P.
BOGGS D. R.
BROWNBRIDGE D. R.
BRYANT R. M.
CHERITON D. R.
CHESSON G.
CHOW T. C.
CHOW Y. C.
CHU W. W.
DELLAR C.
EFE K.
FARBER D. J.
FINKEL R. A.
FITZGERALD R.
FRIDRICH M.
GAGLIANELLO R. D.
GLIGOR V. D.
GYLYS V. B.
HWANG K.
ISLOOR S.
JANSON P.
JENSEN E. D.
JESSOP W. H.
LAMPSON B. W.
LAZOWSKA E. D.
LISKOV B.
LO V. M.
LUDERER G. W.
MAMRAK S. A.
MENASCE D.
MILLSTEIN R. E.
MOHAN C. K.
MULLENDER S. J.
OKI B. M.
OUSTERHOUT J. K.
PASHTAN A.
POPEK G.
PU C.
RASHID R. F.
REED D. P.
Robbert Van Renesse
SATYANARAYANAN M.
SCHROEDER M.
SMITH R.
SOLOMON M. H.
STANKOVIC J. A.
STONE H. S.
SVOBODOVA L.
SW HART
TANENBAUM A. S.
VAN TILBORG A. M.
WAMBECQ A.
WEINSTEIN M. J.
WITTIE L.
WITTIE L. D.
WUPIT A.
ZIMMERMANN H.
Publication date: 01/01/1985

Distributed operating systems have many aspects in common with centralized ones, but they also differ in certain ways. This paper is intended as an introduction to distributed operating systems, and especially to current university research about them. After a discussion of what constitutes a distributed operating system and how it is distinguished from a computer network, various key design issues are discussed. Then several examples of current research projects are examined in some detail, namely, the Cambridge Distributed Computing System, Amoeba, V, and Eden. © 1985, ACM. All rights reserved

Keeping checkpoint/restart viable for exascale systems

Author: Ferreira Kurt
Publication venue: UNM Digital Repository
Publication date: 01/12/2011

Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
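A minimal sketch of hash-based incremental checkpointing is given here for illustration only. It runs on the CPU with `hashlib` rather than on a GPU, assumes a fixed-size state held as `bytes`, and uses hypothetical names (`block_hashes`, `incremental_checkpoint`, `restore`) and a hypothetical block size; none of these come from the work itself. The scheme is probabilistic in the sense that a hash collision could hide a changed block, which is very unlikely with SHA-256.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size in bytes, tiny for readability


def block_hashes(state, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the state."""
    return [hashlib.sha256(state[i:i + block_size]).digest()
            for i in range(0, len(state), block_size)]


def incremental_checkpoint(state, previous_hashes, block_size=BLOCK_SIZE):
    """Return the blocks whose hash changed, plus the new hash list."""
    hashes = block_hashes(state, block_size)
    changed = {}
    for idx, digest in enumerate(hashes):
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = state[start:start + block_size]
    return changed, hashes


def restore(base, deltas, block_size=BLOCK_SIZE):
    """Apply a sequence of incremental checkpoints on top of a full one."""
    buf = bytearray(base)
    for changed in deltas:
        for idx, block in changed.items():
            start = idx * block_size
            buf[start:start + len(block)] = block
    return bytes(buf)
```

Only the changed blocks are written at each checkpoint, so the commit time scales with how much of the state was modified instead of with its total size.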

Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems

Author: Long Junsheng

This thesis studies a forward recovery strategy using checkpointing and optimistic execution in parallel and distributed systems. The approach uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. To reduce overall redundancy, this approach employs a lower static redundancy in the common error-free situation to detect errors than the standard N-Modular Redundancy (NMR) scheme does to mask errors. For the rare occurrence of an error, this approach uses some extra redundancy for recovery. To reduce the run-time recovery overhead, look-ahead processes are used to advance computation speculatively and a rollback process is used to produce a diagnosis for correct look-ahead processes without rollback of the whole system. Both analytical and experimental evaluations have shown that this strategy can provide a nearly error-free execution time even under faults with a lower average redundancy than NMR.
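The detection and diagnosis steps can be sketched in a heavily simplified form. The simulation below is an assumption-laden illustration, not the thesis's scheme: `step` is a hypothetical deterministic task, two replicas are compared at every checkpoint, and on a mismatch a validation run re-executes from the last agreed checkpoint to decide which replica was correct. The speculative look-ahead execution that overlaps with this validation is omitted.

```python
def step(state, fault=False):
    """Hypothetical task step: a deterministic update, optionally corrupted."""
    new_state = state * 2 + 1
    return new_state + 1000 if fault else new_state


def run_with_comparison(initial, n_steps, faults):
    """Run two replicas and compare their checkpoints after every step.

    faults is a set of (step_index, replica_index) pairs where an error
    is injected. Returns the final state and the number of validation runs.
    """
    checkpoint = initial
    validations = 0
    for i in range(n_steps):
        a = step(checkpoint, (i, 0) in faults)
        b = step(checkpoint, (i, 1) in faults)
        if a == b:
            checkpoint = a  # checkpoints agree: no error detected
            continue
        # Mismatch: re-execute from the last agreed checkpoint to diagnose.
        validations += 1
        v = step(checkpoint)
        if v == a:
            checkpoint = a
        elif v == b:
            checkpoint = b
        else:
            raise RuntimeError("validation run disagrees with both replicas")
    return checkpoint, validations
```

In the error-free case only two copies run, and the third execution is paid for only when a comparison fails, which is the source of the lower average redundancy compared with NMR.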