391 research outputs found

    Transparently Mixing Undo Logs and Software Reversibility for State Recovery in Optimistic PDES

    Get PDF
    The rollback operation is a fundamental building block to support the correct execution of a speculative Time Warp-based Parallel Discrete Event Simulation. In the literature, several solutions to reduce the execution cost of this operation have been proposed, either based on the creation of a checkpoint of previous simulation state images, or on the execution of negative copies of simulation events which are able to undo the updates on the state. In this paper, we explore the practical design and implementation of a state recoverability technique which allows to restore a previous simulation state either relying on checkpointing or on the reverse execution of the state updates occurred while processing events in forward mode. Differently from other proposals, we address the issue of executing backward updates in a fully-transparent and event granularity-independent way, by relying on static software instrumentation (targeting the x86 architecture and Linux systems) to generate at runtime reverse update code blocks (not to be confused with reverse events, proper of the reverse computing approach). These are able to undo the effects of a forward execution while minimizing the cost of the undo operation. We also present experimental results related to our implementation, which is released as free software and fully integrated into the open source ROOT-Sim (ROme OpTimistic Simulator) package. The experimental data support the viability and effectiveness of our proposal

    Autonomic State Management for Optimistic Simulation Platforms

    Get PDF
    We present the design and implementation of an autonomic state manager (ASM) tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the executable and linkable format (ELF), and developed for execution on x8664 architectures. With ASM, the state of any logical process (LP), namely the individual (concurrent) simulation unit being part of the simulation model, is allowed to be scattered on dynamically allocated memory chunks managed via standard API (e.g., malloc/free). Also, the application programmer is not required to provide any serialization/deserialization module in order to take a checkpoint of the LP state, or to restore it in case a causality error occurs during the optimistic run, or to provide indications on which portions of the state are updated by event processing, so to allow incremental checkpointing. All these tasks are handled by ASM in a fully transparent manner via (A) runtime identification (with chunk-level granularity) of the memory map associated with the LP state, and (B) runtime tracking of the memory updates occurring within chunks belonging to the dynamic memory map. The co-existence of the incremental and non-incremental log/restore modes is achieved via dual versions of the same application code, transparently generated by ASM via compile/link time facilities. Also, the dynamic selection of the best suited log/restore mode is actuated by ASM on the basis of an innovative modeling/optimization approach which takes into account stability of each operating mode with respect to variations of the model/environmental execution parameters

    Application-level differential checkpointing for HPC applications with dynamic datasets

    Get PDF
    High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also increases dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by only writing data differences. This is typically implemented at the memory page level, sometimes complemented with hashing algorithms. However, such a technique is unable to cope with dynamic-size datasets. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. In order to evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyze them regarding their potential of reducing I/O with dCP and how this data reduction influences the checkpoint performance. In our experiments, we achieve reductions of up to 62% of the checkpoint time.This project has received funding from the European Unions Seventh Framework Programme (FP7/2007-2013) and the Horizon 2020 (H2020) funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST); and from the European Unions Horizon 2020 research and innovation programme under the LEGaTO Project (legato- project.eu), grant agreement No 780681.Peer ReviewedPostprint (author's final draft

    Better database cost/performance via batched I/O on programmable SSD

    Get PDF

    Extending the OpenCHK Model with Advanced Checkpoint Features

    Full text link
    One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance. There are many solutions implementing CR at the application level. They all provide advanced I/O capabilities to minimize the overhead introduced by CR. Nevertheless, there is still room for improvement in terms of programmability and flexibility, because end-users must manually serialize and deserialize application state using low-level APIs, modify the flow of the application to consider restarts, or rewrite CR code whenever the backend library changes. In this work, we propose a set of compiler directives and clauses that allow users to specify CR operations in a simple way. Our approach supports the common CR features provided by all the CR libraries. However, it can also be extended to support advanced features that are only available in some CR libraries, such as differential checkpointing, the use of HDF5 format, and the possibility of using fault-tolerance-dedicated threads. The result of our evaluation revealed a high increase in programmability. On average, we reduced the number of lines of code by 71%, 94%, and 64% for FTI, SCR, and VeloC, respectively, and no additional overhead was perceived using our solution compared to using the backend libraries directly. Finally, portability is enhanced because our programming model allows the use of any backend library without changing any code

    Compiler-Enhanced Incremental Checkpointing for OpenMP Applications

    Get PDF
    As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces checkpoint sizes by as much as 80% and enables asynchronous checkpointing

    GPUs as Storage System Accelerators

    Full text link
    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications.Comment: IEEE Transactions on Parallel and Distributed Systems, 201

    Composing resilience techniques: ABFT, periodic and incremental checkpointing

    Get PDF
    An electrochemical sensor is described for the determination of L-dopa (levodopa; 3,4-dihydroxyphenylalanine). An inkjet-printed carbon nanotube (IJPCNT) electrode was modified with manganese dioxide microspheres by drop-casting. They coating was characterized by field emission scanning electron microscopy, Fourier-transform infrared spectroscopy and X-ray powder diffraction. The sensor, best operated at a working voltage of 0.3 V, has a linear response in the 0.1 to 10 \u3bcM L-dopa concentration range, a 54 nM detection limit, excellent reproducibility, repeatability and selectivity. The amperometric approach was applied to the determination of L-dopa in spiked biological fluids and displayed satisfactory accuracy and precision. Graphical abstract Schematic representation of an amperometric method for determination L-dopa. It is based on the use of inkjet-printed carbon nanotube electrode (IJPCNT) modified with manganese dioxide (MnO2)

    Composing resilience techniques: ABFT, periodic and incremental checkpointing

    Full text link
    • …
    corecore