30 research outputs found

    AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing

    With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, making a restart from scratch unfeasible. Checkpoint-Restart (CR), the main method for surviving failures in such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead that checkpointing imposes on the application, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we introduce the assumption that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed to stable storage. Large-scale experiments show up to 60% improvement when compared to state-of-the-art checkpointing approaches, all achievable with an extra memory requirement of less than 5% of the total application memory.
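
    The flush-ordering idea the abstract describes can be illustrated with a minimal sketch (names and data structures below are illustrative assumptions, not AI-Ckpt's actual interface): pages that were written earliest in past iterations are flushed first, so the asynchronous flusher stays ahead of the application's first writes.

        # Sketch: order dirty pages for asynchronous flushing by the access
        # pattern observed in past iterations (hypothetical names, not AI-Ckpt's API).
        def flush_order(dirty_pages, first_write_time):
            # first_write_time[p]: when page p was first written in earlier
            # iterations; pages never seen before are flushed last.
            return sorted(dirty_pages,
                          key=lambda p: first_write_time.get(p, float("inf")))

        def async_checkpoint(dirty_pages, first_write_time, flush):
            # Flushing soon-to-be-written pages first minimizes the chance that
            # the application blocks on a page the checkpoint has not yet copied.
            for page in flush_order(dirty_pages, first_write_time):
                flush(page)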

    BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds

    Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, as well as support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low-overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
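
    The selective copy-on-write technique named in the abstract can be sketched as follows (a minimal illustration under assumed names, not BlobCR's actual code): only blocks the VM overwrites before the background flusher reaches them need their pre-image copied aside.

        # Sketch of selective copy-on-write snapshotting (illustrative, not
        # BlobCR's interface): the snapshot is persisted in the background,
        # and a block is copied aside only if overwritten before it is flushed.
        class SnapshotDisk:
            def __init__(self, blocks):
                self.blocks = list(blocks)              # live virtual-disk blocks
                self.pending = set(range(len(blocks)))  # not yet persisted
                self.saved = {}                         # pre-images of overwritten blocks

            def write(self, i, data):
                if i in self.pending:                   # snapshot has not captured it yet
                    self.saved[i] = self.blocks[i]      # selective copy-on-write
                self.blocks[i] = data

            def background_flush(self, persist):
                for i in sorted(self.pending):
                    persist(i, self.saved.pop(i, self.blocks[i]))
                self.pending.clear()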

    Portable Checkpointing for Parallel Applications

    High Performance Computing (HPC) systems represent the peak of modern computational capability. As ever-increasing demands for computational power have fuelled the demand for ever-larger computing systems, modern HPC systems have grown to incorporate hundreds, thousands or as many as 130,000 processors. At these scales, the huge number of individual components in a single system makes the probability that a single component will fail quite high, with today's large HPC systems featuring mean times between failures on the order of hours or a few days. As many modern computational tasks require days or months to complete, fault tolerance becomes critical to HPC system design. The past three decades have seen significant amounts of research on parallel system fault tolerance. However, as most of it has been either theoretical or has focused on low-level solutions that are embedded into a particular operating system or type of hardware, this work has had little impact on real HPC systems. This thesis attempts to address this lack of impact by describing a high-level approach for implementing checkpoint/restart functionality that decouples the fault tolerance solution from the details of the operating system, system libraries and the hardware, and instead connects it to the APIs implemented by the above components. The resulting solution enables applications that use these APIs to become self-checkpointing and self-restarting regardless of the software/hardware platform that may implement the APIs. The particular focus of this thesis is on the problem of checkpoint/restart of parallel applications. It presents two theoretical checkpointing protocols, one for the message passing communication model and one for the shared memory model. The former is the first protocol to be compatible with application-level checkpointing of individual processes, while the latter is the first protocol that is compatible with arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel APIs, respectively, and are the first to provide checkpoint/restart for arbitrary implementations of these popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms and are shown to feature low overheads.
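
    The self-checkpointing pattern the thesis builds on can be sketched in a few lines (a minimal single-process illustration under assumed file names, not the thesis's actual protocols): the application periodically serializes its own state and, on restart, resumes from the last complete checkpoint.

        # Sketch of application-level checkpoint/restart (illustrative, not the
        # thesis's protocol): state is saved atomically every `interval` steps,
        # and a restarted process resumes from the last complete checkpoint.
        import os, pickle

        CKPT = "state.ckpt"                     # hypothetical checkpoint file

        def load_or_init():
            if os.path.exists(CKPT):
                with open(CKPT, "rb") as f:
                    return pickle.load(f)       # restart path
            return {"step": 0, "acc": 0.0}      # fresh start

        def run(total_steps, interval=100):
            state = load_or_init()
            while state["step"] < total_steps:
                state["acc"] += state["step"]   # stand-in for real computation
                state["step"] += 1
                if state["step"] % interval == 0:
                    with open(CKPT + ".tmp", "wb") as f:
                        pickle.dump(state, f)
                    os.replace(CKPT + ".tmp", CKPT)  # write-then-rename: atomic commit
            return state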

    Efficient Passive Clustering and Gateway Selection in MANETs

    Passive clustering does not employ control packets to collect topological information in ad hoc networks. In our proposal, we avoid making frequent changes in the cluster architecture due to repeated election and re-election of cluster heads and gateways. Our primary objective has been to make passive clustering more practical by employing an optimal number of gateways and reducing the number of rebroadcast packets.
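
    In cluster-based flooding of this kind, the rebroadcast saving comes from a simple forwarding rule, sketched below (an illustrative assumption about the mechanism, not the paper's exact protocol): only cluster heads and gateways forward a packet, so keeping fewer gateways directly cuts redundant rebroadcasts.

        # Sketch of the forwarding rule in cluster-based flooding (illustrative,
        # not the paper's exact protocol): ordinary members never rebroadcast.
        def should_rebroadcast(role, already_seen):
            # role: "head", "gateway", or "ordinary"; a node forwards a packet
            # only the first time it sees it, and only if it is head or gateway.
            return role in ("head", "gateway") and not already_seen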

    Locality-driven checkpoint and recovery

    Checkpoint and recovery are important fault-tolerance techniques for distributed systems. The two categories of existing strategies incur unacceptable performance costs either at run time or upon failure recovery when applied to large-scale distributed systems. In particular, the large number of messages and processes in these systems causes either considerable checkpointing and logging overhead, or a catastrophic global recovery effect. This thesis proposes a locality-driven strategy for efficiently checkpointing and recovering such systems with both affordable runtime cost and controllable failure recoverability. Messages establish dependencies between distributed processes, which can be either preserved by coordinated checkpoints or removed via logging. Existing strategies enforce a uniform handling policy for all message dependencies, and hence gain an advantage at one end but bear a disadvantage at the other. In this thesis, a generic theory of Quasi-Atomic Recovery has been formulated to accommodate message handling requirements of both kinds, and to allow using different message handling methods together. Quasi-atomicity of recovery blocks implies proper confinement of recoveries, and thus enables localization of checkpointing and recovery around such a block, and consequently a hybrid strategy with combined advantages from both ends. A strategy of group checkpointing with selective logging has been proposed, based on the observation of message localization around 'locality regions' in distributed systems. In essence, a group-wise coordinated checkpoint is created around such a region and only the few inter-region messages are logged subsequently. Runtime overhead is optimized due to largely reduced logging effort, and recovery spread is localized region-wise. Various protocols have been developed to provide trade-offs between flexibility and performance. Also proposed is the idea of a process clone, which can be used to effectively remove program-order recovery dependencies among successive group checkpoints and thus to stop inter-group recovery spread. Distributed executions exhibit locality of message interactions. Such locality originates from resolving distributed dependencies via message passing, and appears as a hierarchical 'region-transition' pattern. A bottom-up approach has been proposed to identify those regions, by detecting popular recurrence patterns in individual processes as 'locality intervals', and then composing them into 'locality regions' based on their tight message-coupling relations with each other. Experiments conducted on real-life applications have shown the existence of hierarchical locality regions and have justified the feasibility of this approach. Performance optimization of group checkpoint strategies depends on how they exploit locality. An abstract performance measure has been proposed to properly integrate both runtime overhead and failure recoverability in a region-wise manner. Taking this measure as the optimization objective, a greedy heuristic has been introduced to decompose a given distributed execution into optimized regions. Analysis implies that an execution pattern with good locality leads to good optimized performance, and that the locality pattern itself can serve as a good candidate for the optimal decomposition. Consequently, checkpoint protocols have been developed to efficiently identify optimized regions in such an execution, with the assistance of either design-time or runtime knowledge.
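
    The core rule of group checkpointing with selective logging can be sketched briefly (an illustrative fragment under assumed names, not the thesis's protocols): messages inside a locality region are covered by that region's coordinated checkpoint, while the few inter-region messages are logged so a recovery never spreads beyond one region.

        # Sketch of group checkpointing with selective logging (illustrative,
        # not the thesis's protocols): only inter-region messages are logged,
        # and a failure rolls back just the failed process's region.
        def on_send(src, dst, msg, region_of, log):
            if region_of[src] != region_of[dst]:
                log.append((dst, msg))          # selective logging: inter-region only

        def on_failure(proc, region_of, restore_region, log, deliver):
            region = region_of[proc]
            restore_region(region)              # group-wise coordinated rollback
            for dst, msg in log:                # replay logged messages into the region
                if region_of[dst] == region:
                    deliver(dst, msg)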

    High-fidelity rendering on shared computational resources

    The generation of high-fidelity imagery is a computationally expensive process, and parallel computing has traditionally been employed to alleviate this cost. However, traditional parallel rendering has been restricted to expensive shared-memory or dedicated distributed processors. In contrast, parallel computing on shared resources such as a computational or a desktop grid offers a low-cost alternative. But the prevalent rendering systems are currently incapable of seamlessly handling such shared resources, as they suffer from high latencies, restricted bandwidth and volatility. The conventional approach of rescheduling failed jobs in a volatile environment inhibits performance through redundant computations. Instead, clever task subdivision along with image reconstruction techniques provides an unrestrictive fault-tolerance mechanism, which is highly suitable for high-fidelity rendering. This thesis presents novel fault-tolerant parallel rendering algorithms for effectively tapping the enormous inexpensive computational power provided by shared resources. A first-of-its-kind system for fully dynamic high-fidelity interactive rendering on idle resources is presented, which is key for providing immediate feedback on the changes made by a user. The system achieves interactivity by monitoring and adapting computations according to run-time variations in the computational power, and employs a spatio-temporal image reconstruction technique for enhancing the visual fidelity. Furthermore, the algorithms described for time-constrained offline rendering of still images and animation sequences make it possible to deliver the results within a user-defined time limit. These novel methods enable the employment of variable resources in deadline-driven environments.
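
    The reconstruction-based alternative to rescheduling can be illustrated with a minimal sketch (an assumed tile layout and interpolation, not the thesis's actual reconstruction technique): tiles lost to volatile workers are filled in from completed neighbours instead of being recomputed.

        # Sketch of reconstruction-based fault tolerance for tiled rendering
        # (illustrative, not the thesis's algorithm): missing tiles are
        # interpolated from finished neighbours rather than re-rendered.
        def assemble(tiles, width, height):
            # tiles maps (x, y) -> pixel value; gaps came from failed workers.
            image = {}
            for y in range(height):
                for x in range(width):
                    if (x, y) in tiles:
                        image[(x, y)] = tiles[(x, y)]
                    else:
                        near = [tiles[n] for n in ((x-1, y), (x+1, y), (x, y-1), (x, y+1))
                                if n in tiles]
                        image[(x, y)] = sum(near) / len(near) if near else 0.0
            return image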

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    High performance deferred update replication

    Replication is a well-known approach to implementing storage systems that can tolerate failures. Replicated storage systems are designed such that the state of the system is kept at several replicas. A replication protocol ensures that the failure of a replica is masked by the rest of the system, in a way that is transparent to its users. Replicated storage systems are among the most important building blocks in the design of large-scale applications. Applications at scale are often deployed on top of commodity hardware, store a vast amount of data, and serve a large number of users. The larger the system, the higher its vulnerability to failures. The ability to tolerate failures is not the only desirable feature in a replicated system. Storage systems need to be efficient in order to accommodate requests from a large user base while achieving low response times. In that respect, replication can leverage multiple replicas to parallelize the execution of user requests. This thesis focuses on Deferred Update Replication (DUR), a well-established database replication approach. It provides high availability in that every replica can execute client transactions. In terms of performance, it is better than other replication techniques in that only one replica executes a given transaction while the other replicas only apply state changes. However, DUR suffers from the following drawback: each replica stores a full copy of the database, which has consequences in terms of performance. The first consequence is that DUR cannot take advantage of the aggregated memory available to the replicas. Our first contribution is a distributed caching mechanism that addresses this problem. It makes efficient use of the main memory of an entire cluster of machines, while guaranteeing strong consistency. The second consequence is that DUR cannot scale with the number of replicas. The throughput of a fully replicated system is inherently limited by the number of transactions that a single replica can apply to its local storage. We propose a scalable version of the DUR approach where the system state is partitioned into smaller replica sets. Transactions that access disjoint partitions are parallelized. The last part of the thesis focuses on latency. We show that the scalable DUR-based approach may have detrimental effects on response time, especially when replicas are geographically distributed. The thesis considers different deployments and their implications on latency. We propose optimizations that provide substantial gains in geographically distributed environments.
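
    The deferred-update flow the abstract outlines can be sketched compactly (an illustrative fragment under assumed names, not the thesis's system): one replica executes a transaction speculatively, and once the transaction is ordered, every replica certifies its read set against committed versions and, on success, applies its write set.

        # Sketch of the certification step in deferred update replication
        # (illustrative, not the thesis's implementation): a transaction commits
        # everywhere only if nothing it read has since been overwritten.
        class Replica:
            def __init__(self):
                self.store = {}     # key -> (value, version)
                self.commits = 0    # number of committed transactions

            def certify_and_apply(self, readset, writeset):
                # readset: {key: version read}; writeset: {key: new value}
                for key, version in readset.items():
                    if self.store.get(key, (None, 0))[1] != version:
                        return False                        # conflict: abort at all replicas
                self.commits += 1
                for key, value in writeset.items():
                    self.store[key] = (value, self.commits) # apply state changes only
                return True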

    Integrated Data, Message, and Process Recovery for Failure Masking in Web Services

    Modern Web Services applications encompass multiple distributed interacting components, possibly including millions of lines of code written in different programming languages. With this complexity, some bugs often remain undetected despite extensive testing procedures, and occasionally cause transient system failures. Incorrect failure handling in applications often leads to incomplete or unintentionally repeated request executions. A family of recovery protocols called interaction contracts provides a generic solution to this problem by means of system-integrated data, process, and message recovery for multi-tier applications. It is able to mask failures, and allows programmers to concentrate on the application logic, thus speeding up the development process. This thesis consists of two major parts. The first part formally specifies the interaction contracts using the state-and-activity chart language. Moreover, it presents a formal specification of a concrete Web Service that makes use of interaction contracts and contains no other error-handling actions. The formal specifications undergo verification, where crucial safety and liveness properties expressed in temporal logics are mathematically proved by means of model checking. In particular, it is shown that each end-user request is executed exactly once. The second part of the thesis demonstrates the viability of the interaction framework in a real-world system. More specifically, a cascadable Web Service platform, EOS, is built on widely used components, the Microsoft Internet Explorer browser and the PHP application server, with interaction contracts integrated into them.
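
    The exactly-once property that the verification establishes is commonly achieved with a persisted reply log; a minimal sketch follows (an illustrative mechanism under assumed names, not the interaction-contract protocol itself): a retried request replays the original reply instead of being re-executed.

        # Sketch of exactly-once request execution via a recoverable reply log
        # (illustrative, not the interaction-contract protocol): duplicates of a
        # request after a failure get the stored reply, never a re-execution.
        replies = {}                        # request_id -> reply; stands in for a
                                            # persistent, recoverable message log

        def handle(request_id, execute):
            if request_id in replies:       # duplicate caused by a retry
                return replies[request_id]
            reply = execute()               # runs at most once per request_id
            replies[request_id] = reply     # persist the reply before acknowledging
            return reply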