439 research outputs found

    TANDEM: taming failures in next-generation datacenters with emerging memory

    The explosive growth of online services, reaching unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions. To be recoverable, applications must take additional measures during failure-free execution to maintain a recoverable state of their data and computation logic. However, these precautionary measures have severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their architectural attributes, which differ significantly from those of traditional memory devices, introduce new semantic challenges for implementing recovery, complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to enforce correctness efficiently, with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability). This thesis addresses these challenges through three approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms that efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
    We start our exploration with non-volatile memory (NVM), which offers fast persistence directly accessible through the processor's load-store (memory) interface. Notably, this capability can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data rarely persists in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM, known as persistency models. We outline the precise specification of a novel persistency model, Release Persistency (RP), that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To enforce RP efficiently, we propose a novel microarchitectural mechanism, lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal performance overhead.
    We continue with memory disaggregation, which decouples memory from traditional monolithic servers and offers a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in a setting where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) does not work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol to ensure correct, non-blocking, and fast recovery in a DKVS. Our experimental implementation demonstrates that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel targeted litmus-testing framework, DART, to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we found several critical bugs in Pandora, highlighting the need for robust end-to-end testing in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, eliminating any intervention from programmers.
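    The ordering problem at the heart of the NVM discussion can be made concrete with a toy model. The sketch below is illustrative only, not the thesis's RP/LRP mechanism: it simulates a volatile cache that drains to persistent memory in arbitrary order at a crash, and shows how a persist barrier (the analogue of flush-and-fence instructions) placed before the publishing "release" store keeps a log-free append recoverable.

```python
# Toy simulation of why persist ordering matters for log-free data
# structures on NVM. All names are illustrative, not the thesis's API:
# "nvm" stands in for persistent memory, "cache" for volatile cache
# lines that may reach NVM in any order.
import random

cache = {}   # volatile: writes buffered here, drained in ANY order
nvm = {}     # persistent: what survives a crash

def store(addr, val):
    cache[addr] = val

def persist_barrier():
    # Analogue of flushing cache lines and fencing (e.g. clwb + sfence):
    # everything written so far is now durably in NVM.
    nvm.update(cache)
    cache.clear()

def crash():
    # A crash drains an arbitrary subset of the cache to NVM.
    for addr in random.sample(list(cache), random.randint(0, len(cache))):
        nvm[addr] = cache[addr]
    cache.clear()

def append(value, use_barrier):
    store("node.payload", value)       # 1. write the new node
    if use_barrier:
        persist_barrier()              # 2. persist it BEFORE publishing
    store("head", "node")              # 3. "release" store publishes it

def recover_ok():
    # Recovery is correct if a published node always has its payload.
    return nvm.get("head") != "node" or "node.payload" in nvm

append(42, use_barrier=False); crash()
print("without barrier, recoverable:", recover_ok())  # may print False
nvm.clear()
append(42, use_barrier=True); crash()
print("with barrier, recoverable:", recover_ok())     # always True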

    Objcache: An Elastic Filesystem over External Persistent Storage for Container Clusters

    Container virtualization enables emerging AI workloads, such as model serving, highly parallelized training, and machine learning pipelines, to be easily scaled on demand on elastic cloud infrastructure. In particular, AI workloads require persistent storage for data such as training inputs, models, and checkpoints. An external storage system like cloud object storage is a common choice because of its elasticity and scalability. To mitigate the access latency of external storage, caching in a local filesystem is essential. However, building local caches on scaling clusters must cope with explosive disk usage, redundant networking, and unexpected failures. We propose objcache, an elastic filesystem over external storage. Objcache introduces an internal transaction protocol over Raft logging to enable atomic updates of distributed persistent states placed with consistent hashing. The transaction protocol also manages inode dirtiness by maintaining consistency between the local cache and external storage. Objcache supports scaling down to zero by automatically evicting dirty files to external storage. Our evaluation reports that objcache sped up model-serving startup by 98.9% compared to direct copies via S3 interfaces, and that scaling up with 1024 dirty files completed in 2 to 14 seconds.
    Comment: 13 pages.
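    As a sketch of the placement scheme the abstract mentions, the following minimal consistent-hashing ring shows how keys (for instance, cached file chunks or metadata) map to nodes so that adding or removing a node remaps only a small fraction of keys. The names and parameters (HashRing, VNODES) are illustrative assumptions, not objcache's actual code.

```python
# Minimal consistent-hashing sketch of the kind a distributed cache
# could use to place persistent state on nodes.
import bisect, hashlib

VNODES = 64  # virtual nodes per server, smooths the key distribution

def _h(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((_h(f"{n}#{v}"), n)
                            for n in nodes for v in range(VNODES))
        self._keys = [k for k, _ in self._ring]

    def owner(self, key: str) -> str:
        # The first virtual node clockwise from the key's hash owns it.
        i = bisect.bisect(self._keys, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("/models/resnet50.ckpt"))
```

    Because adding or removing a node only remaps the keys on the affected arcs of the ring, a cache cluster can scale up or down (even to zero) without reshuffling all cached state.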

    Deep Learning Methods for Estimation of Elasticity and Backscatter Quantitative Ultrasound

    Ultrasound (US) imaging is increasingly attracting the attention of both academic and industrial researchers because it is a real-time, nonionizing imaging modality that is also less expensive and more portable than other medical imaging techniques. However, the granular appearance of US images complicates their interpretation, hindering wider adoption. This granular appearance (also referred to as speckle) arises from echoes backscattered by microstructural components smaller than the ultrasound wavelength, called scatterers. While significant effort has gone into reducing the appearance of speckle, it encodes scatterer properties that are highly correlated with the microstructure of the tissue and can be employed to diagnose different types of disease. Many clinically valuable properties can be extracted from speckle, such as the elasticity and organization of scatterers. Analyzing the motion of scatterers in the presence of an internal or external force yields the elastic properties of the tissue; this technique, called elastography, has been widely used to characterize tissue. Estimating the scatterer organization (scatterer number density and coherent-to-diffuse scattering power) is also crucial, as it provides information about tissue microstructure and can potentially aid disease diagnosis and treatment monitoring. This thesis proposes several deep learning-based methods to facilitate and improve the estimation of speckle motion and scatterer properties, potentially simplifying the interpretation of US images. In particular, we propose new methods for displacement estimation in Chapters 2 to 6 and introduce novel techniques in Chapters 7 to 11 to quantify scatterer number density and organization.
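    As context for the displacement-estimation chapters, here is a minimal sketch of the classical baseline that deep learning methods aim to improve: window-based speckle tracking with normalized cross-correlation (NCC). All window and search sizes are illustrative assumptions.

```python
# 1D speckle tracking: estimate the axial shift of an RF segment
# between pre- and post-compression frames via normalized
# cross-correlation; the gradient of displacement gives strain.
import numpy as np

def displacement(pre, post, start, win=64, search=16):
    ref = pre[start:start + win] - pre[start:start + win].mean()
    best, best_ncc = 0, -np.inf
    for lag in range(-search, search + 1):
        seg = post[start + lag:start + lag + win]
        seg = seg - seg.mean()
        ncc = np.dot(ref, seg) / (
            np.linalg.norm(ref) * np.linalg.norm(seg) + 1e-12)
        if ncc > best_ncc:
            best, best_ncc = lag, ncc
    return best  # displacement in samples at this window

rng = np.random.default_rng(0)
pre = rng.standard_normal(1024)     # synthetic speckle signal
post = np.roll(pre, 5)              # uniform 5-sample shift
print(displacement(pre, post, start=300))  # -> 5
```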

    Optimistic Adaptation of Decentralised Role-based Software Systems

    The complexity of computer networks has been rising over the last decades. Increasing interconnectivity between devices, the growing complexity of performed tasks, and strong collaboration between nodes drive this phenomenon. Internet-of-Things devices, whose relevance has risen in recent years, are one example. The increasing number of devices requiring updates and supervision makes maintenance more difficult, and human intervention is costly and time-consuming. To overcome this, self-adaptive software systems (SAS) can be used. SAS are a subset of autonomous systems that monitor themselves and their environment and adapt to changes without human interaction. In the literature, different approaches for engineering SAS have been proposed, including techniques for executing adaptations on multiple devices based on generated plans for reacting to changes. Among these are decentralised approaches. To the best of our knowledge, however, no existing approach to engineering a SAS tolerates errors during adaptation execution in a decentralised setting. While some approaches for role-based execution reset the application after a single failure during the adaptation process, others make no assumptions about errors or do not consider an erroneous environment. In a real-world environment, errors will likely occur at run-time and disturb the adaptation process.
    This work performs adaptations in a decentralised way on role-based systems under a relaxed consistency constraint, i.e., errors during the adaptation phase are tolerated. This increases the availability of nodes, since no rollbacks are required in case of a failure. Moreover, a subset of applications, such as drone swarms, benefits from relaxed consistency, since parts of the system that adapted successfully can already operate in the adapted configuration instead of waiting for other peers to apply the changes in a later iteration. Eliminating the need for atomic adaptation execution also makes asynchronous execution possible: the adaptation process can be supervised over a long period, ensuring that every peer takes the planned actions as soon as its internal task execution allows. To enable this relaxed-consistency execution, we develop a decentralised adaptation execution protocol that supports the notion of eventual consistency. As soon as devices reconnect after network congestion or restore their internal state after local failures, the protocol coordinates the recovery process among multiple devices to re-establish a globally consistent state. Because no central instance is needed, every peer that receives information about failing peers can start the recovery process. The approach can restore a consistent global configuration even if almost all peers fail. Moreover, it supports asynchronous adaptations, i.e., peers execute planned adaptations as soon as they are ready, which increases overall availability when single nodes adapt late. The protocol is evaluated with a proof-of-concept implementation, run in five different experiments with thousands of iterations to show its applicability and reliability.
    The protocol's execution time and the number of exchanged messages were measured to compare different error cases and system sizes and to show the scalability of the approach. The solution was also compared to a blocking, atomic approach to show its feasibility. Applicability in a real-world scenario is illustrated by an empirical study of a fire-extinguishing drone swarm. The results show that an optimistic approach to adaptation is suitable and that specific scenarios benefit from the improved availability, since no rollbacks are required: systems can continue their work in large-scale settings regardless of failures of participating nodes.
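    A toy sketch of the optimistic idea follows; the state names, recovery policy, and message flow are illustrative assumptions, not the thesis's actual protocol.

```python
# Optimistic, eventually consistent adaptation: peers apply planned
# changes independently and keep operating; any peer that learns of a
# failure may trigger compensation, with no coordinator or rollback.
from enum import Enum, auto

class State(Enum):
    PLANNED = auto()
    ADAPTED = auto()
    FAILED = auto()

class Peer:
    def __init__(self, name):
        self.name, self.state = name, State.PLANNED

    def execute(self, ok=True):
        # Optimistic: apply the planned role change locally; there is
        # no global commit barrier, and successful peers are not
        # rolled back when others fail.
        self.state = State.ADAPTED if ok else State.FAILED

def compensate(peers):
    # Eventual consistency: failed peers simply re-attempt the planned
    # adaptation once they are able to.
    for p in peers:
        if p.state is State.FAILED:
            p.execute(ok=True)

swarm = [Peer("drone-1"), Peer("drone-2"), Peer("drone-3")]
swarm[0].execute()            # adapts and keeps operating immediately
swarm[1].execute(ok=False)    # a local failure disturbs the adaptation
swarm[2].execute()
compensate(swarm)             # converges to a consistent configuration
assert all(p.state is State.ADAPTED for p in swarm)
```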

    Design and Evaluation of a Novel Lens-Based SPECT System Based on Laue Lens Gamma Diffraction: GEANT4/GAMOS Monte Carlo Study

    While improvements in SPECT imaging techniques constitute a significant advance in biomedical science and cancer diagnosis, their limited spatial resolution has hindered their application to small-animal research and early tumour detection. Building on recent breakthroughs from the high-energy astrophysics community, focusing X-ray optics offers a way past this low-resolution paradigm and the possibility of imaging small objects with sub-millimetre resolution. This thesis tackles the constraints of current SPECT designs by focusing high-energy photons through Laue lens diffraction, developing a means of gamma-ray imaging that does not rely on parallel-hole or pinhole collimators. The gradual development of the novel system is discussed, progressing from single-lens to modular and multi-lens Laue-based SPECT. A customized 3D reconstruction algorithm was developed to reconstruct an accurate 3D radioactivity distribution from focused projections. A plug-in implementing the Laue diffraction concept was developed to model gamma-ray focusing in the GEANT4 toolkit; the plug-in will be incorporated into GEANT4 upon final approval from its developers. The single-lens, modular-lens, and multi-lens SPECT models detected one hit per 42 source photons (a sensitivity of 790), three hits per 42 source photons (a sensitivity of 2,373), and one hit per 20 source photons (a sensitivity of 1,670), respectively. Based on the generated 3D reconstructed images, the achievable spatial resolution was found to be 0.1 mm full width at half maximum (FWHM). The proposed design's performance parameters were compared against the existing Siemens parallel-hole LEHR and multi-pinhole (5-MWB-1.0) Inveon SPECT systems. The achievable spatial resolution is decoupled from the sensitivity of the system, in stark contrast with existing collimators, which suffer from a resolution-sensitivity trade-off and are limited to a resolution of about 2 mm. The proposed system allows discrimination between adjacent volumes as small as 0.113 mm³, substantially smaller than what can be imaged by any existing SPECT or PET system. The proposed design could lay the foundation for a new SPECT imaging technology akin to a combination of tomosynthesis and lightfield imaging.
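    For context, Laue lenses focus photons via Bragg diffraction through the volume of crystals. The worked example below uses assumed, illustrative values (not figures from the thesis) to show the small deflection angles involved:

```latex
% Bragg condition for Laue diffraction; \lambda follows from photon energy E:
\[
  n\lambda = 2d\sin\theta, \qquad
  \lambda\,[\mathrm{nm}] \approx \frac{1.2398}{E\,[\mathrm{keV}]}
\]
% Example with assumed values: a 140.5 keV Tc-99m photon has
% \lambda \approx 8.8 \times 10^{-3}\,\mathrm{nm}; for Cu(111) planes with
% d \approx 0.209\,\mathrm{nm} and n = 1,
% \theta = \arcsin(\lambda / 2d) \approx 0.021\,\mathrm{rad} \approx 1.2^{\circ},
% a deflection so small that a lens must stack many crystal tiles to focus.
```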

    A Survey on Transactional Stream Processing

    Transactional stream processing (TSP) strives to create a cohesive model that merges the advantages of both transactional and stream-oriented guarantees. Over the past decade, numerous endeavors have contributed to the evolution of TSP solutions, uncovering similarities and distinctions among them. Despite these advances, a universally accepted standard approach for integrating transactional functionality with stream processing remains to be established. Existing TSP solutions predominantly concentrate on specific application characteristics and involve complex design trade-offs. This survey intends to introduce TSP and present our perspective on its future progression. Our primary goals are twofold: to provide insights into the diverse TSP requirements and methodologies, and to inspire the design and development of groundbreaking TSP systems.

    Fault-based Analysis of Industrial Cyber-Physical Systems

    The fourth industrial revolution, known as Industry 4.0, tries to bridge the gap between traditional Electronic Design Automation (EDA) technologies and the need to innovate in many industrial fields, e.g., automotive, avionics, and manufacturing. This complex digitalization process involves every industrial facility and comprises the transformation of methodologies, techniques, and tools to improve the efficiency of every industrial process. Enhancing functional safety in Industry 4.0 applications requires exploiting both model-based and data-driven analyses of the deployed Industrial Cyber-Physical System (ICPS). An ICPS can be modeled at different abstraction levels, depending on the physical details included in the model and necessary to describe specific system behaviors. Modeling is nevertheless extremely complicated, because an ICPS is composed of heterogeneous components from different physical domains, e.g., digital, electrical, and mechanical. In addition, not only nominal but also faulty behaviors must be considered to perform more specific analyses, e.g., predictive maintenance of specific assets. Such faulty data, however, are usually absent or not directly available from the industrial machinery. To overcome these limitations, constructing a virtual model of an ICPS extended with different classes of faults enables characterizing the faulty behaviors of the system under different faults. In the literature, these topics are addressed with non-uniform approaches and without standardized, automated methodologies for describing and simulating faults across the domains composing an ICPS.
    This thesis attempts to close these state-of-the-art gaps by proposing novel methodologies, techniques, and tools to: model and simulate analog and multi-domain systems; abstract low-level models to higher-level behavioral models; and monitor industrial systems based on the Industrial Internet of Things (IIoT) paradigm. Specifically, the contributions involve extending state-of-the-art fault injection practices to improve ICPS safety, developing frameworks to automate safety operations, and defining a monitoring framework for ICPSs. Fault injection in analog and digital models is the state of the practice for ensuring functional safety, as referenced in the automotive ISO 26262 standard. Starting from state-of-the-art defects defined for analog descriptions, new defects are proposed to enhance the IEEE P2427 draft standard for analog defect modeling and coverage. Moreover, different techniques for abstracting a transistor-level model to a behavioral model are proposed to speed up the simulation of faulty circuits. Unlike the electrical domain, the mechanical domain makes no extensive use of fault injection techniques; extending fault injection to the mechanical and thermal fields thus supports the definition and evaluation of more reliable safety mechanisms. Hence, a taxonomy of mechanical faults is derived from the electrical domain by exploiting physical analogies, and specific tools are built for automatically instrumenting different descriptions with multi-domain faults. The entire work is proposed as a basis for creating increasingly resilient and secure ICPSs that preserve functional safety in any operating context.
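    As an illustration of cross-domain fault injection via physical analogies, and assuming nothing about the thesis's actual tools or taxonomy, the sketch below simulates a mass-spring-damper (the mechanical analogue of a series RLC circuit) and injects a parametric stiffness-degradation fault mid-run; comparing nominal and faulty traces is the kind of characterization such a fault-extended virtual model enables.

```python
# Parametric fault injection into a simulated mechanical model:
# m x'' = -k x - c x', with the spring stiffness k degraded at t=3 s.
import numpy as np

def simulate(t_end=10.0, dt=1e-3, fault_at=None, k_faulty=2.0):
    k, m, c = 20.0, 1.0, 0.5          # stiffness, mass, damping
    x, v = 1.0, 0.0                   # initial displacement, velocity
    trace = []
    for step in range(int(t_end / dt)):
        if fault_at is not None and step * dt >= fault_at:
            k = k_faulty              # injected fault: weakened spring
        a = (-k * x - c * v) / m      # Newton's second law
        v += a * dt                   # semi-implicit Euler integration
        x += v * dt
        trace.append(x)
    return np.array(trace)

nominal = simulate()
faulty = simulate(fault_at=3.0)
# A monitor could flag the fault from the diverging behaviour, e.g. the
# drop in oscillation frequency (omega = sqrt(k/m)) after t = 3 s:
print("max divergence:", np.abs(nominal - faulty).max())
```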

    Diffeomorphic Transformations for Time Series Analysis: An Efficient Approach to Nonlinear Warping

    The proliferation and ubiquity of temporal data across many disciplines have sparked interest in similarity, classification, and clustering methods specifically designed for time series. A core issue when dealing with time series is determining their pairwise similarity, i.e., the degree to which a given time series resembles another. Traditional distance measures such as the Euclidean distance are not well suited due to the time-dependent nature of the data. Elastic metrics such as dynamic time warping (DTW) offer a promising approach but are limited by their computational complexity, non-differentiability, and sensitivity to noise and outliers. This thesis proposes novel elastic alignment methods that use parametric and diffeomorphic warping transformations to overcome the shortcomings of DTW-based metrics. The proposed method is differentiable and invertible, well suited for deep learning architectures, robust to noise and outliers, computationally efficient, and expressive and flexible enough to capture complex patterns. Furthermore, a closed-form solution was developed for the gradient of these diffeomorphic transformations, allowing an efficient search in the parameter space and leading to better solutions at convergence. Leveraging these closed-form diffeomorphic transformations, this thesis proposes a suite of advancements that include: (a) an enhanced temporal transformer network for time series alignment and averaging; (b) a deep learning-based time series classification model that simultaneously aligns and classifies signals with high accuracy; (c) an incremental time series clustering algorithm that is warping-invariant and scalable and can operate under limited computational and time resources; and (d) a normalizing flow model that enhances the flexibility of affine transformations in coupling and autoregressive layers.
    Comment: PhD thesis, defended at the University of Navarra on July 17, 2023. 277 pages, 8 chapters, 1 appendix.
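    A minimal sketch of the core idea follows, using a one-parameter warp as a stand-in for the thesis's closed-form diffeomorphic transformations: gamma_theta(t) = t + theta*sin(pi*t) is smooth and invertible for |theta| < 1/pi, and the gradient of an alignment loss with respect to theta has a closed form, so alignment becomes plain gradient descent rather than dynamic programming. The warp family, signals, and learning rate are illustrative assumptions.

```python
# Differentiable, parametric time warping as an alternative to DTW.
import numpy as np

t = np.linspace(0.0, 1.0, 200)
target = np.sin(2 * np.pi * t)                              # reference
source = np.sin(2 * np.pi * (t + 0.2 * np.sin(np.pi * t)))  # warped copy

def loss_and_grad(theta):
    g = t + theta * np.sin(np.pi * t)   # gamma_theta(t), monotone warp
    warped = np.interp(g, t, source)    # (source o gamma_theta)(t)
    resid = warped - target
    dsource = np.gradient(source, t)    # numeric source'(t)
    # Chain rule: d warped / d theta = source'(gamma_theta(t)) * sin(pi t)
    grad = 2.0 * np.mean(resid * np.interp(g, t, dsource) * np.sin(np.pi * t))
    return np.mean(resid ** 2), grad

theta = 0.0
for _ in range(300):                    # gradient descent on theta
    loss, grad = loss_and_grad(theta)
    theta -= 0.02 * grad
print(f"loss={loss:.5f}  theta={theta:.3f}")  # theta converges near -0.2
```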

    Mobile Forensics – The File Format Handbook

    This open access book summarizes knowledge about several file systems and file formats commonly used in mobile devices. In addition to the fundamental description of the formats, it gives hints about the forensic value of possible artefacts, along with an outline of tools that can decode the relevant data. The book is organized into two distinct parts.
    Part I describes several file systems that are commonly used in mobile devices:
    · APFS is the file system used in all modern Apple devices, including iPhones, iPads, and even Apple computers such as the MacBook series.
    · Ext4 is very common in Android devices and is the successor of the Ext2 and Ext3 file systems that were commonly used on Linux-based computers.
    · The Flash-Friendly File System (F2FS) is a Linux file system that Samsung Electronics developed in 2012 explicitly for the NAND flash memory common in removable storage and mobile devices.
    · The QNX6 file system is present in smartphones delivered by Blackberry (e.g., devices running Blackberry 10) and in modern vehicle infotainment systems that use QNX as their operating system.
    Part II describes five file formats that are commonly used on mobile devices:
    · SQLite is nearly omnipresent in mobile devices, with an overwhelming majority of all mobile applications storing their data in such databases.
    · The second leading file format in the mobile world is Property Lists, which are predominantly found on Apple devices.
    · Java Serialization is a popular technique for storing object states in the Java programming language; mobile application (app) developers very often resort to it to make their application state persistent.
    · The Realm database format has emerged in recent years as a possible successor to the now ageing SQLite format and has begun to appear in some modern applications on mobile devices.
    · Protocol Buffers provide a compact binary format for serializing structured data, a technique commonly used in mobile devices.
    The aim of this book is to act as a knowledge base and reference guide for digital forensic practitioners who need knowledge about a specific file system or file format. It is also hoped to provide useful insight and knowledge for students and other aspiring professionals who want to work in the field of digital forensics. The book is written with the assumption that the reader has some existing knowledge and understanding of computers, mobile devices, file systems, and file formats.
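    As a small illustration of the Protocol Buffers wire format mentioned above (the standard varint rule, not an example taken from the book): integers are stored as base-128 varints, seven payload bits per byte, with the high bit set while more bytes follow.

```python
# Decode a Protocol Buffers base-128 varint from a byte buffer.
def decode_varint(buf, pos=0):
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift   # low 7 bits carry the payload
        if not b & 0x80:                # high bit clear: last byte
            return result, pos
        shift += 7

# 300 = 0b100101100 is encoded as the two bytes 0xAC 0x02 on the wire:
print(decode_varint(bytes([0xAC, 0x02])))  # -> (300, 2)
```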