157 research outputs found

    Checkpointing of parallel applications in a Grid environment

    The Grid environment is generic, heterogeneous, and dynamic, with many unreliable resources, which leaves it highly exposed to failures. The environment is unreliable because it is geographically dispersed across multiple autonomous administrative domains and is composed of a large number of components. Examples of failures in the Grid environment include application crashes, Grid node crashes, network failures, and Grid system component failures. These failures can affect the execution of parallel/distributed applications in the Grid environment, so protection against them is crucial. It is therefore essential to develop efficient fault-tolerance mechanisms that allow users to execute Grid applications successfully. One of the research challenges in Grid computing is to develop a fault-tolerant solution that ensures Grid applications are executed reliably with minimum overhead. While checkpointing is the most common method of achieving fault tolerance, much work remains to improve the efficiency of the mechanism. This thesis provides an in-depth description of a novel solution for checkpointing parallel applications executed on a Grid. The implemented checkpointing mechanism allows an application to be checkpointed at regions where no interprocess communication is involved, thereby reducing the checkpointing overhead and the checkpoint size.
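The idea of checkpointing only where no interprocess communication is involved can be illustrated with a toy sketch. This is not the thesis's implementation: the `Worker` class, its `in_communication` flag, and the in-memory checkpoint list are all hypothetical names chosen for the example, and a real system would detect communication-free regions in the application itself.

```python
import pickle


class Worker:
    """Toy process whose run alternates compute and communication regions."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self.in_communication = False  # True while a message exchange is in progress
        self.checkpoints = []          # serialized snapshots (kept in memory for the sketch)

    def maybe_checkpoint(self):
        # Skip the checkpoint inside a communication region, so that no
        # in-flight messages or channel state need to be saved.
        if self.in_communication:
            return False
        self.checkpoints.append(pickle.dumps(self.state))
        return True

    def compute(self, value):
        self.state["total"] += value
        self.state["step"] += 1

    def restore_latest(self):
        # Roll back to the most recent snapshot.
        self.state = pickle.loads(self.checkpoints[-1])


worker = Worker()
worker.compute(5)
took_first = worker.maybe_checkpoint()    # communication-free region: checkpoint taken

worker.in_communication = True
worker.compute(7)
took_second = worker.maybe_checkpoint()   # inside communication: checkpoint skipped
worker.in_communication = False

worker.restore_latest()                   # simulate a failure followed by a rollback
```

Because the snapshot is taken outside any message exchange, only the process's own state is serialized, which is the intuition behind the smaller checkpoint size claimed in the abstract.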

    Compiler assisted checkpointing of message-passing applications in heterogeneous environments

    [Abstract] With the evolution of high performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart needs. Whether due to a failure in the execution or to a migration of the processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples include opaque state (e.g. data structures for communications support) and diverse interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad-hoc representations renders checkpointing tools incapable of working across different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data needs to be stored, where to store it, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to offer both portability and transparency while preserving the scalability of the executed applications. It consists of a library and a compiler. The CPPC library contains routines for variable-level checkpointing, using portable code and protocols. The CPPC compiler achieves transparency by relieving the user of time-consuming tasks, such as performing code analyses and adding instrumentation code.

    Resource-Efficient Replication and Migration of Virtual Machines.

    Continuous replication and live migration of Virtual Machines (VMs) are two vital tools in a virtualized environment, but they are resource-expensive. Continuously replicating a VM's checkpointed state to a backup host maintains high availability (HA) of the VM despite host failures, but checkpoint replication can generate significant network traffic. Each replicated VM also incurs a 100% memory overhead, since the backup unproductively reserves the same amount of memory to hold the redundant VM state. Live migration, though widely used for load balancing, power saving, etc., can also generate excessive network traffic by transferring VM state iteratively. In addition, it can incur a long completion time and degrade application performance. This thesis explores ways to replicate VMs for HA using resources efficiently, and to migrate VMs fast, with minimal execution disruption and efficient use of resources. First, we investigate the tradeoffs in using different compression methods to reduce the network traffic of checkpoint replication in an HA system. We evaluate gzip, delta and similarity compression based on metrics that are specifically important in an HA system, and then suggest guidelines for their selection. Next, we propose HydraVM, a storage-based HA approach that eliminates the unproductive memory reservation made in backup hosts. HydraVM maintains a recent image of a protected VM in shared storage by taking and consolidating incremental VM checkpoints. When a failure occurs, HydraVM quickly resumes the execution of a failed VM by loading a small amount of essential VM state from the storage. As the VM executes, the VM state not yet loaded is supplied on demand. Finally, we propose application-assisted live migration, which skips transfer of VM memory that need not be migrated to execute running applications at the destination.
We develop a generic framework for the proposed approach, and then use the framework to build JAVMM, a system that migrates VMs running Java applications skipping transfer of garbage in Java memory. Our evaluation results show that compared to Xen live migration, which is agnostic of running applications, JAVMM can reduce the completion time, network traffic and application downtime caused by Java VM migration, all by up to over 90%.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111575/1/karenhou_1.pd
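Taking and consolidating incremental checkpoints, as the abstract describes for HydraVM, can be sketched in a few lines. This is an illustrative model only, not HydraVM's code: memory is represented as a dictionary of page number to bytes, and the helper names `take_incremental` and `consolidate` are assumptions made for the example.

```python
def take_incremental(memory, dirty_pages):
    """Copy only the pages modified since the previous checkpoint."""
    return {page: memory[page] for page in dirty_pages}


def consolidate(image, increment):
    """Fold an incremental checkpoint into the stored image."""
    merged = dict(image)
    merged.update(increment)
    return merged


# Hypothetical VM memory: page number -> page contents.
memory = {0: b"boot", 1: b"heap-v1", 2: b"stack-v1"}
image = dict(memory)  # initial full checkpoint kept in storage

# The VM dirties page 1; only that page is checkpointed and merged.
memory[1] = b"heap-v2"
image = consolidate(image, take_incremental(memory, {1}))

# The VM then dirties page 2; again only the changed page is transferred.
memory[2] = b"stack-v2"
image = consolidate(image, take_incremental(memory, {2}))
```

After each consolidation the stored image matches the VM's current memory, while each checkpoint carries only the pages that changed.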

    Efficient Checkpointing for Heterogeneous Collaborative Environments: Representation, Coordination, and Automation.

    Checkpointing can be used to adapt resource utilization in heterogeneous distributed environments. In checkpointing, the state of a process is captured and later restored on a computer to restart execution from the point where the state was captured. This capability can be applied to process migration, in which resource utilization is adapted toward high performance by moving a running process from one computer to another. For a heterogeneous environment, problems in checkpointing fall into three domains, concerning the mechanisms to capture and restore the execution state, the memory state, and the communication state of a process. Although a few solutions have been proposed, a well-defined solution does not yet exist. This thesis presents a practical solution to capture and restore the process state in heterogeneous distributed environments. The solution is based on three novel mechanisms: the data transfer mechanism, the memory space representation model and its associated data collection and restoration mechanisms, and the reliable communication and process migration protocols. These mechanisms define machine-independent representations of the execution state, the memory state, and the communication state. They work in coordination to perform process migration in a heterogeneous environment. A software system is designed and implemented to migrate a process automatically. A number of process migration experiments are conducted on sequential and collaborative processes. Experimental results support the correctness and practicality of our solution.