34,545 research outputs found

    Checkpointing as a Service in Heterogeneous Cloud Environments

    Get PDF
    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As means of overhead reduction, the library offers a build-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks

    Chartering Turnaround: Leveraging Public Charter School Autonomy to Address Failure

    Get PDF
    Persistently low-achieving public schools around the country have received $5.8 billion from the federal School Improvement Grant (SIG) program, in addition to district and state funds, and other supplementary federal funds. Despite all of these sources of funding, most of the schools receiving them have failed to make a dramatic difference in improving student achievement. However, according to a new report jointly released by the National Alliance for Public Charter Schools and the Center on School Turnaround, autonomy provided by state charter laws can be better leveraged to improve school turnaround efforts.The report, Chartering Turnaround: Leveraging Public Charter School Autonomy to Address Failure, provides case studies of three charter management organizations (CMOs) that have successfully restarted low-achieving public schools, adding a valuable component to the limited body of research that exists about turnaround models. The report highlights the freedoms that benefit poor-performing schools most significantly, including: the autonomy to hire, retain and reward staff; the ability to adjust the length of school year, academic program and curriculum; and, the option to develop tailored approaches for finances and facilities
    • 

    corecore