Optimal Message Log Reclamation for Independent Checkpointing
Coordinated Science Laboratory was formerly known as Control Systems Laboratory; National Aeronautics and Space Administration / NASA NAG 1-613; Department of the Navy managed by the Office of the Chief of Naval Research / N00014-91-J-128
Performance comparison of hierarchical checkpoint protocols grid computing
A grid infrastructure is a large set of nodes,
geographically distributed and connected by a communication network. In
this context, fault tolerance is a necessity imposed by the
distribution, which poses a number of problems related to the
heterogeneity of hardware, operating systems, networks,
middleware, and applications, dynamic resources, scalability,
the lack of common memory, the lack of a common clock, and
asynchronous communication between processes. To improve the
robustness of supercomputing applications in the presence of
failures, many techniques have been developed to provide
resistance to these faults of the system. Fault tolerance is intended
to allow the system to provide service as specified in spite of
occurrences of faults. It appears as an indispensable element in
distributed systems. To meet this need, several techniques have
been proposed in the literature. We will study the protocols based
on rollback recovery. These protocols are classified into two
categories: coordinated checkpointing and rollback protocols and
log-based independent checkpointing protocols or message
logging protocols. However, the performance of a protocol
depends on the characteristics of the system, network and
applications running. Faced with the constraints of large-scale
environments, many algorithms from the literature have proved
inadequate. Given an application environment and a system, it is
not easy to identify the recovery protocol that is most appropriate
for a cluster or hierarchical environment such as grid computing.
While some protocols have been used successfully at small scale,
they are not suitable for use at large scale. Hence there is a need
to implement these protocols in a hierarchical fashion to compare
their performance in grid computing. In this paper, we propose
hierarchical versions of four well-known protocols. We have
implemented these protocols and compared their performance in
clusters and grid computing using the Omnet++ simulator.
Lazy Checkpoint Coordination for Bounding Rollback Propagation
Coordinated Science Laboratory was formerly known as Control Systems Laboratory; National Aeronautics and Space Administration / NASA NAG 1-613; Department of the Navy managed by the Office of the Chief of Naval Research / N00014-91-J-128
On the Implementation and Use of Message Logging
We present a number of experiments showing that for compute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead than coordinated checkpointing. Message logging protocols, however, result in much shorter output latency than coordinated checkpointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise. We also present an unorthodox message logging design that uses coordinated checkpointing with message logging, departing from the conventional approaches that use independent checkpointing. This combination of message logging and coordinated checkpointing offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection, and reduced complexity. Meanwhile, the new protocols retain the advantages of the conventional message logging protocols with respect to output commit. Finally, we discuss three “lessons learned” from an implementation of various message logging protocols.
Recuperacao de processos em sistemas distribuidos
Computing systems become more dependable when appropriate fault recovery techniques are applied to them. These techniques are based on component or data redundancy. Considering the implicit redundancy of distributed systems, it seems natural to implement recovery facilities in these systems. This paper is a tutorial on the concepts related to recovery in distributed systems and illustrates fault recovery through examples of recovery protocols implemented in operating systems.
Algorithm Based Fault Tolerance in Massively Parallel Systems
A complex computer system consists of billions of transistors, miles of wires, and many interactions with an unpredictable environment. Correct results must be produced despite faults that dynamically occur in some of these components. Many techniques have been developed for fault tolerant computation. General purpose methods are independent of the application, yet incur an overhead cost which may be unacceptable for massively parallel systems. Algorithm-specific methods, which can operate at lower cost, are a developing alternative [1, 72]. This paper first reviews the general-purpose approach and then focuses on the algorithm-specific method, with an eye toward massively parallel processors. Algorithm-based fault tolerance has the attraction of low overhead; furthermore, it addresses both the detection and the correction problems. The principle is to build low-cost checking and correcting mechanisms based exclusively on the redundancies inherent in the system.
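As an illustration of the checking-and-correcting principle described in this abstract, here is a minimal Python sketch of one well-known algorithm-based technique: checksum-augmented matrix multiplication. The abstract does not say which algorithms the paper covers, so the choice of matrix multiplication, the helper names (`matmul`, `column_checksum`, `row_checksum`, `locate_and_correct`), and the single-error assumption are all assumptions made here for illustration. Integer entries are used so that checksum comparisons are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_checksum(a):
    """Return a copy of `a` with one extra row holding each column's sum."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of `b` with one extra column holding each row's sum."""
    return [list(row) + [sum(row)] for row in b]


def locate_and_correct(cf):
    """Check a full-checksum product matrix and repair one bad data element.

    `cf` is the product of a column-checksum matrix and a row-checksum
    matrix, so its last column holds row sums and its last row holds
    column sums. Returns the (row, col) that was corrected in place, or
    None if all checksums agree. Assumption for this sketch: at most one
    data element (not a checksum element) is corrupted.
    """
    n, p = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:p]) != cf[i][p]]
    bad_cols = [j for j in range(p)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum tells us by how much the element is off.
        cf[i][j] += cf[i][p] - sum(cf[i][:p])
        return (i, j)
    raise ValueError("error pattern is not correctable by this sketch")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    cf = matmul(column_checksum(a), row_checksum(b))
    cf[1][0] = 99                      # inject a single fault
    print(locate_and_correct(cf))      # position of the repaired element
    print([row[:2] for row in cf[:2]])  # recovered product
```

The redundancy costs one extra row and one extra column, which is the kind of low overhead the abstract attributes to algorithm-specific methods; a general-purpose alternative would duplicate the whole computation.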
Proxy Module for System on Mobile Devices (SyD) Middleware
The number of mobile device users is growing. These users expect to communicate continuously from their mobile devices while they are on the move. Therefore, mobile device platforms need to provide disconnection tolerance for transactions, together with synchronization management. System on Mobile Devices (SyD) is taken as one example of such a platform. The thesis studies the existing SyD architecture, from its framework down to its kernel, and introduces a proxy module enhancement in SyD to handle disconnection tolerance, including its synchronization. The SyD kernel has been extended to enable the proxy module. SyDSync has been constructed for synchronization with the proxy. The timeout has been studied for seamless proxy invocation. A Camera application that tries to catch a stolen vehicle has been simulated to demonstrate a practical use of the proxy module extension.
Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems
Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called the recovery line. If the processes are allowed to take uncoordinated checkpoints, this rollback propagation may result in the domino effect, which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived, and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated.
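The rollback propagation and domino effect described in this abstract can be sketched in a few lines of Python. This is not the paper's garbage collection algorithm; it is only a hedged illustration of how a recovery line is found by repeatedly undoing orphan messages. The `Message` record, the interval numbering (interval k lies between checkpoint k and checkpoint k+1 of a process), and the choice to start from every process's latest checkpoint are assumptions of this sketch.

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    sender: int         # index of the sending process
    send_interval: int  # sent after the sender's checkpoint with this index
    receiver: int       # index of the receiving process
    recv_interval: int  # received after the receiver's checkpoint with this index


def recovery_line(latest: List[int], messages: List[Message]) -> List[int]:
    """Roll back from the latest checkpoints until no orphan message remains.

    A message is an orphan with respect to a candidate line if its receive
    is recorded (it happened before the receiver's chosen checkpoint) but
    its send is not (it happened after the sender's chosen checkpoint).
    Each orphan forces the receiver back to the checkpoint that precedes
    the receive. Checkpoint indices only decrease, so the loop terminates.
    """
    line = list(latest)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_undone = m.send_interval >= line[m.sender]
            recv_recorded = m.recv_interval < line[m.receiver]
            if send_undone and recv_recorded:
                line[m.receiver] = m.recv_interval
                changed = True
    return line


if __name__ == "__main__":
    # Two processes, each with checkpoints 0, 1, 2, exchanging messages in a
    # zigzag pattern so that every rollback triggers another one.
    zigzag = [
        Message(sender=0, send_interval=2, receiver=1, recv_interval=1),
        Message(sender=1, send_interval=1, receiver=0, recv_interval=1),
        Message(sender=0, send_interval=1, receiver=1, recv_interval=0),
        Message(sender=1, send_interval=0, receiver=0, recv_interval=0),
    ]
    print(recovery_line([2, 2], zigzag))      # domino effect: back to the start
    print(recovery_line([2, 2], zigzag[:3]))  # one message fewer stops the cascade
```

In the first call each rollback exposes a new orphan message, so both processes end up at their initial checkpoints, which is the domino effect. The useful-checkpoint bound N(N+1)/2 quoted in the abstract concerns how many checkpoints must be retained, which this sketch does not compute.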
Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell
As the number of cores in manycore systems grows exponentially, the number of failures is
also predicted to grow exponentially. Hence massively parallel computations must be able to
tolerate faults. Moreover new approaches to language design and system architecture are needed
to address the resilience of massively parallel heterogeneous architectures.
Symbolic computation has underpinned key advances in Mathematics and Computer Science,
for example in number theory, cryptography, and coding theory. Computer algebra software
systems facilitate symbolic mathematics. Developing these at scale has its own distinctive
set of challenges, as symbolic algorithms tend to employ complex irregular data and control
structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel
High Performance Computing platforms. A key element of SymGridParII is a domain specific
language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for
scalable distributed-memory parallelism, and employs work stealing to load balance dynamically
generated irregular task sizes.
To investigate providing scalable fault tolerant symbolic computation we design, implement
and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles
faults, using task replication as a key recovery strategy. The scheduler supports load balancing
with a fault tolerant work stealing protocol. The reliable scheduler is invoked with two fault
tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel
skeletons that encapsulate common parallel programming patterns. The user is oblivious to
many failures; they are instead handled by the scheduler.
An operational semantics describes small-step reductions on states. A simple abstract machine
for scheduling transitions and task evaluation is presented. It defines the semantics of
supervised futures, and the transition rules for recovering tasks in the presence of failure. The
transition rules are demonstrated with a fault-free execution, and three executions that recover
from faults.
The fault tolerant work stealing has been abstracted into a Promela model. The SPIN
model checker is used to exhaustively search the intersection of states in this automaton to
validate a key resiliency property of the protocol. It asserts that an initially empty supervised
future on the supervisor node will eventually be full in the presence of all possible combinations
of failures.
The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling
achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when
executing Summatory Liouville on up to 1400 cores of an HPC architecture. Moreover, supervision
overheads are consistently low scaling up to 1400 cores. Low recovery overheads are observed in
the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey
mechanism has been developed for stress testing resiliency with random failure combinations.
All unit tests pass in the presence of random failure, terminating with the expected results
Integrated Data, Message, and Process Recovery for Failure Masking in Web Services
Modern Web Services applications encompass multiple distributed interacting components, possibly including millions of lines of code written in different programming languages. With this complexity, some bugs often remain undetected despite extensive testing procedures, and occasionally cause transient system failures. Incorrect failure handling in applications often leads to incomplete or unintentionally repeated request executions. A family of recovery protocols called interaction contracts provides a generic solution to this problem by means of system-integrated data, process, and message recovery for multi-tier applications. It is able to mask failures, and allows programmers to concentrate on the application logic, thus speeding up the development process. This thesis consists of two major parts. The first part formally specifies the interaction contracts using the state-and-activity chart language. Moreover, it presents a formal specification of a concrete Web Service that makes use of interaction contracts, and contains no other error-handling actions. The formal specifications undergo verification where crucial safety and liveness properties expressed in temporal logics are mathematically proved by means of model checking. In particular, it is shown that each end-user request is executed exactly once. The second part of the thesis demonstrates the viability of the interaction framework in a real world system. More specifically, a cascadable Web Service platform, EOS, is built based on widely used components, the Microsoft Internet Explorer browser and the PHP application server, with interaction contracts integrated into them.