5 research outputs found

    Plate: persistent memory management for nonvolatile main memory

    Over the past few years, nonvolatile memory has been actively researched and developed. To encourage its widespread use, it is therefore crucial to study operating system (OS) designs that assume nonvolatile main memory, as well as methods for managing persistent data in virtual memory. However, the main memory in most computers today is volatile, and replacing high-capacity main memory with nonvolatile memory is cost-prohibitive. This paper proposes an OS structure for nonvolatile main memory. The proposed OS structure consists of three functions for studying and developing OSs for computers with nonvolatile main memory. First, we propose a structure called a plate, which manages persistent data under the assumption that nonvolatile main memory is present in the computer. Second, we propose a persistent-data mechanism that makes volatile memory function as nonvolatile main memory, which serves as a basis for developing OSs for computers with nonvolatile main memory. Third, we propose continuous operation control using the persistent-data mechanism and plates. This paper describes the design and implementation of the OS structure based on these three functions on The ENduring operating system for Distributed EnviRonment, and presents the evaluation results of the proposed functions.

    A fast restart mechanism for checkpoint/recovery protocols in networked environments

    Checkpoint/recovery has been studied extensively, and various optimization techniques have been presented for its improvement. Despite this considerable research effort, little work has been done on improving restart latency. The time spent retrieving and loading the checkpoint image during recovery is non-trivial, especially in networked environments. With ever-increasing application memory footprints and system failure rates, this is becoming more of an issue. In this paper, we present a Fast REstart Mechanism called FREM. It allows fast restart of a failed process without requiring the availability of the entire checkpoint image. By dynamically tracking the process data accesses after each checkpoint, FREM masks restart latency by overlapping the computation of the resumed process with the retrieval of its checkpoint image. We have implemented FREM with the BLCR checkpointing tool in Linux systems. Our experiments with the SPEC benchmarks indicate that it can effectively reduce restart latency by 61.96% on average in networked environments.
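
One way to picture the overlap described in this abstract is as a retrieval plan: pages the process touched soon after its last checkpoint are fetched first, so a resumed process could start before the whole image arrives. The sketch below is only an illustration of that ordering idea under simplifying assumptions (integer page ids, a recorded access trace); the function name `plan_restart_order` and its parameters are hypothetical and are not taken from FREM or BLCR.

```python
from typing import Iterable, List


def plan_restart_order(image_pages: Iterable[int],
                       access_trace: Iterable[int]) -> List[int]:
    """Order checkpoint pages for retrieval (hypothetical helper).

    Pages seen in the access trace come first, in first-touch order;
    the remaining pages of the image follow in ascending order.
    Trace entries that are not part of the image are ignored.
    """
    image = set(image_pages)
    seen = set()
    order: List[int] = []
    for page in access_trace:
        if page in image and page not in seen:
            seen.add(page)
            order.append(page)
    # Pages never touched in the trace can be fetched in the background.
    order.extend(page for page in sorted(image) if page not in seen)
    return order


if __name__ == "__main__":
    # Image has pages 0..5; page 9 in the trace is outside the image.
    print(plan_restart_order(range(6), [4, 1, 4, 9, 2]))
```

In this toy run the traced pages 4, 1 and 2 are scheduled ahead of the untouched pages 0, 3 and 5. How the real system tracks accesses and handles a page that is needed before it arrives is not specified in the abstract and is left out here.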

    A checkpointing mechanism for virtual clusters using memory-bound time-multiplexed data transfers

    Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs), that is, clusters of virtual machines (VMs), offer an attractive fault tolerance capability for cloud data centers. However, existing mechanisms have suffered from high checkpoint downtimes and overheads. This paper introduces Mekha, a novel hypervisor-level, in-memory coordinated checkpoint-restart mechanism for VCs that leverages precopy live migration. During a VC checkpoint event, Mekha creates a shadow VM for each VM and employs a novel memory-bound timed-multiplex data (MTD) transfer mechanism to replicate the state of each VM to its corresponding shadow VM. We also propose a global ending condition that enables the checkpoint coordinator to control the termination of the MTD algorithm for every VM in a VC, thereby reducing overall checkpoint latency. Furthermore, the checkpoint protocols of Mekha are designed based on barrier synchronizations and virtual time, ensuring the global consistency of checkpoints and utilizing existing data retransmission capabilities to handle message loss. We conducted several experiments to evaluate Mekha using a message passing interface (MPI) application from the NASA advanced supercomputing (NAS) parallel benchmark. The results demonstrate that Mekha significantly reduces checkpoint downtime compared to traditional checkpoint mechanisms. Consequently, Mekha effectively decreases checkpoint overheads while offering efficiency and practicality, making it a viable solution for cloud computing environments.
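
For readers unfamiliar with the precopy technique this abstract builds on, the following is a minimal simulation of a generic precopy loop for a single VM: memory is sent while the VM keeps running, pages dirtied in the meantime are resent, and the loop ends once the remaining dirty set is small enough to copy during a short pause. It is not Mekha's MTD algorithm or its global ending condition, which the abstract does not detail; the function `simulate_precopy`, the fixed `dirty_percent` model, and all parameter names are assumptions made for illustration.

```python
from typing import Tuple


def simulate_precopy(total_pages: int, dirty_percent: int,
                     stop_threshold: int, max_rounds: int) -> Tuple[int, int, int]:
    """Simulate generic precopy rounds (hypothetical model).

    Assumes that while N pages are being sent, N * dirty_percent // 100
    pages are dirtied and must be resent in the next round.
    Returns (pages sent while running, pages left for the pause, rounds).
    """
    to_send = total_pages
    sent_live = 0
    rounds = 0
    while rounds < max_rounds:
        sent_live += to_send
        rounds += 1
        # Pages dirtied during this round form the next round's work.
        to_send = min(total_pages, to_send * dirty_percent // 100)
        if to_send <= stop_threshold:
            break
    return sent_live, to_send, rounds


if __name__ == "__main__":
    print(simulate_precopy(total_pages=1000, dirty_percent=10,
                           stop_threshold=20, max_rounds=10))
```

The pages left for the pause are what determine downtime in this model, which is why a well-chosen ending condition matters; with a dirty rate of 100% the loop never converges and only `max_rounds` stops it.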