
2 research outputs found

Non-blocking Synchronous Checkpointing Based On Rollback-dependency Trackability

Authors: Garcia I.C., Sakata T.C.

This article proposes an original approach that applies the Rollback-Dependency Trackability (RDT) property to implement a new non-blocking synchronous checkpointing protocol, called RDT-NBS, that takes mutable checkpoints and efficiently supports concurrent initiators. Mutable checkpoints can be saved in non-stable storage and make it possible for non-blocking synchronous checkpointing protocols to save a minimal number of checkpoints in stable storage during the construction of a consistent global checkpoint. We prove that this minimality property does not hold in the presence of concurrent checkpointing initiations. Even so, RDT-NBS uses mutable checkpoints to reduce the use of stable storage while assuring the existence of a consistent global checkpoint in stable storage. We also present simulation results that compare RDT-NBS to quasi-synchronous RDT. © 2006 IEEE.
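The idea of a mutable checkpoint described in the abstract can be illustrated with a small sketch. This is not the RDT-NBS protocol: the `Process` class and its `take_mutable`, `promote`, and `discard` methods are hypothetical names, and the conditions under which a real protocol promotes or discards a checkpoint are not modelled. The sketch only shows why a checkpoint held in non-stable storage costs a stable-storage write only when it turns out to be needed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """Toy process holding at most one mutable checkpoint (hypothetical model)."""
    state: dict
    mutable: Optional[dict] = None              # kept in non-stable (local) memory
    stable: list = field(default_factory=list)  # stands in for stable storage

    def take_mutable(self) -> None:
        # Cheap step: shallow-copy the state locally, no stable-storage write.
        self.mutable = dict(self.state)

    def promote(self) -> None:
        # Assumed case: the checkpoint is needed for the global checkpoint,
        # so it is written to stable storage now.
        if self.mutable is not None:
            self.stable.append(self.mutable)
            self.mutable = None

    def discard(self) -> None:
        # Assumed case: the checkpoint is not needed, so stable storage
        # is never touched.
        self.mutable = None


p = Process(state={"x": 1})
p.take_mutable()
p.state["x"] = 2   # execution continues without blocking
p.promote()        # stable storage now holds the state from before the update

q = Process(state={"y": 0})
q.take_mutable()
q.discard()        # no stable-storage write happened for q
```

In this toy run only `p` writes to stable storage; `q` took a checkpoint but never paid the stable-storage cost.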

Repositorio da Producao Cientifica e Intelectual da Unicamp

Locality-driven checkpoint and recovery

Author: Wei Zunce
Publication date: 01/01/2010

Checkpoint and recovery are important fault-tolerance techniques for distributed systems. When applied to large-scale distributed systems, the two categories of existing strategies incur unacceptable performance cost either at run time or upon failure recovery. In particular, the large number of messages and processes in these systems causes either considerable checkpointing and logging overhead or a catastrophic system-wide recovery effect. This thesis proposes a locality-driven strategy for efficiently checkpointing and recovering such systems with both affordable runtime cost and controllable failure recoverability. Messages establish dependencies between distributed processes, which can be either preserved by coordinated checkpoints or removed via logging. Existing strategies enforce a uniform handling policy for all message dependencies, and hence gain an advantage at one end but bear a disadvantage at the other. In this thesis, a generic theory of Quasi-Atomic Recovery has been formulated to accommodate message handling requirements of both kinds and to allow different message handling methods to be used together. Quasi-atomicity of recovery blocks implies proper confinement of recoveries, and thus enables localization of checkpointing and recovery around such a block and, consequently, a hybrid strategy with combined advantages from both ends. A strategy of group checkpointing with selective logging has been proposed, based on the observation of message localization around 'locality regions' in distributed systems. In essence, a group-wise coordinated checkpoint is created around such a region and only the few inter-region messages are logged subsequently. Runtime overhead is reduced because far fewer messages are logged, and recovery spread is confined to a single region. Various protocols have been developed to provide trade-offs between flexibility and performance.
Also proposed is the idea of a process clone, which can be used to effectively remove program-order recovery dependencies among successive group checkpoints and thus to stop inter-group recovery spread. Distributed executions exhibit locality of message interactions. Such locality originates from resolving distributed dependency localization via message passing, and appears as a hierarchical 'region-transition' pattern. A bottom-up approach has been proposed to identify those regions, by detecting popular recurrence patterns from individual processes as 'locality intervals', and then composing them into 'locality regions' based on the tight message coupling between them. Experiments conducted on real-life applications have shown the existence of hierarchical locality regions and have justified the feasibility of this approach. Performance optimization of group checkpoint strategies depends on how they use locality. An abstract performance measure has been proposed to properly integrate both runtime overhead and failure recoverability in a region-wise manner. Taking this measure as the optimization objective, a greedy heuristic has been introduced to decompose a given distributed execution into optimized regions. Analysis implies that an execution pattern with good locality leads to good optimized performance, and the locality pattern itself can serve as a good candidate for the optimal decomposition. Consequently, checkpoint protocols have been developed to efficiently identify optimized regions in such an execution, with the assistance of either design-time or runtime knowledge.
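The selective-logging idea in the abstract (log only the messages that cross a region boundary) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's protocol: the region assignment is taken as given, and `select_messages_to_log`, `region_of`, and the message tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed message layout: (sender, receiver, payload).
Message = Tuple[str, str, str]


def select_messages_to_log(
    messages: List[Message], region_of: Dict[str, str]
) -> List[Message]:
    """Keep only messages whose sender and receiver lie in different regions.

    Intra-region messages are skipped, on the assumption that a group-wise
    coordinated checkpoint of the region already preserves those dependencies.
    """
    return [m for m in messages if region_of[m[0]] != region_of[m[1]]]


# Hypothetical example: p1 and p2 form region A, p3 is alone in region B.
region_of = {"p1": "A", "p2": "A", "p3": "B"}
messages = [
    ("p1", "p2", "a"),  # intra-region, not logged
    ("p2", "p3", "b"),  # crosses A -> B, logged
    ("p3", "p1", "c"),  # crosses B -> A, logged
    ("p2", "p1", "d"),  # intra-region, not logged
]
logged = select_messages_to_log(messages, region_of)
```

With good locality most traffic stays inside a region, so the logged list stays short relative to the total message count.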

Concordia University Research Repository