104 research outputs found

    Transparently Mixing Undo Logs and Software Reversibility for State Recovery in Optimistic PDES

    The rollback operation is a fundamental building block to support the correct execution of a speculative Time Warp-based Parallel Discrete Event Simulation. In the literature, several solutions to reduce the execution cost of this operation have been proposed, either based on the creation of a checkpoint of previous simulation state images, or on the execution of negative copies of simulation events which are able to undo the updates on the state. In this paper, we explore the practical design and implementation of a state recoverability technique which allows restoring a previous simulation state by relying either on checkpointing or on the reverse execution of the state updates that occurred while processing events in forward mode. Unlike other proposals, we address the issue of executing backward updates in a fully transparent and event granularity-independent way, by relying on static software instrumentation (targeting the x86 architecture and Linux systems) to generate at runtime reverse update code blocks (not to be confused with reverse events, proper of the reverse computing approach). These are able to undo the effects of a forward execution while minimizing the cost of the undo operation. We also present experimental results related to our implementation, which is released as free software and fully integrated into the open source ROOT-Sim (ROme OpTimistic Simulator) package. The experimental data support the viability and effectiveness of our proposal.

    Autonomic State Management for Optimistic Simulation Platforms

    We present the design and implementation of an autonomic state manager (ASM) tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the executable and linkable format (ELF), and developed for execution on x86-64 architectures. With ASM, the state of any logical process (LP), namely the individual (concurrent) simulation unit being part of the simulation model, is allowed to be scattered on dynamically allocated memory chunks managed via standard API (e.g., malloc/free). Also, the application programmer is not required to provide any serialization/deserialization module in order to take a checkpoint of the LP state, or to restore it in case a causality error occurs during the optimistic run, or to provide indications on which portions of the state are updated by event processing, so as to allow incremental checkpointing. All these tasks are handled by ASM in a fully transparent manner via (A) runtime identification (with chunk-level granularity) of the memory map associated with the LP state, and (B) runtime tracking of the memory updates occurring within chunks belonging to the dynamic memory map. The co-existence of the incremental and non-incremental log/restore modes is achieved via dual versions of the same application code, transparently generated by ASM via compile/link time facilities. Also, the dynamic selection of the best suited log/restore mode is actuated by ASM on the basis of an innovative modeling/optimization approach which takes into account stability of each operating mode with respect to variations of the model/environmental execution parameters.

    An Evolutionary Algorithm to Optimize Log/Restore Operations within Optimistic Simulation Platforms

    In this work we address state recoverability in advanced optimistic simulation systems by proposing an evolutionary algorithm to optimize at run-time the parameters associated with state log/restore activities. Optimization takes place by adaptively selecting for each simulation object both (i) the best suited log mode (incremental vs non-incremental) and (ii) the corresponding optimal value of the log interval. Our performance optimization approach allows us to indirectly cope with hidden effects (e.g., locality) as well as cross-object effects due to the variation of log/restore parameters for different simulation objects (e.g., rollback thrashing). Neither of them is captured by literature solutions based on analytical models of the overhead associated with log/restore tasks. In more detail, our evolutionary algorithm dynamically adjusts the log/restore parameters of distinct simulation objects as a whole, towards a well suited configuration. In this way, we prevent negative effects on performance due to the biasing of the optimization towards individual simulation objects, which may cause reduced gains (or even a decrease) in performance precisely because of the aforementioned hidden and/or cross-object phenomena. We also present an application-transparent implementation of the evolutionary algorithm within the ROme OpTimistic Simulator (ROOT-Sim), namely an open source, general purpose simulation environment designed according to the optimistic synchronization paradigm.

    Autonomic log/restore for advanced optimistic simulation systems

    In this paper we address state recoverability in optimistic simulation systems by presenting an autonomic log/restore architecture. Our proposal is unique in that it jointly provides the following features: (i) log/restore operations are carried out in a completely transparent manner to the application programmer, (ii) the simulation-object state can be scattered across dynamically allocated non-contiguous memory chunks, (iii) two differentiated operating modes, incremental vs non-incremental, coexist via transparent, optimized run-time management of dual versions of the same application layer, with dynamic selection of the best suited operating mode in different phases of the optimistic simulation run, and (iv) determination of the best suited mode for any time frame is carried out on the basis of an innovative modeling/optimization approach that takes into account stability of each operating mode vs variations of the model execution parameters. © 2010 IEEE

    Cache-Aware Memory Manager for Optimistic Simulations

    Parallel Discrete Event Simulation is a well-known technique for executing complex general-purpose simulations where models are described as objects whose interaction is expressed through the generation of impulsive events. In particular, Optimistic Simulation allows full exploitation of the available computational power, avoiding the need to compute safety properties for the events to be executed. Optimistic Simulation platforms internally rely on several data structures, which are meant to support operations aimed at ensuring correctness, inter-kernel communication and/or event scheduling. These housekeeping and management operations access them according to complex patterns, commonly suffering from misuse of memory caching architectures. In particular, operations like log/restore access data structures on a periodic basis, causing the replacement of in-cache buffers related to the actual working set of the application logic and producing a non-negligible performance drop. In this work we propose generally applicable design principles for a new memory management subsystem targeted at Optimistic Simulation platforms, which can address this issue by wisely allocating memory buffers depending on their actual future access patterns, in order to enhance event-execution memory locality. Additionally, an application-transparent implementation within ROOT-Sim, an open-source general-purpose optimistic simulation platform, is presented along with experimental results testing our proposal.

    A load-sharing architecture for high performance optimistic simulations on multi-core machines

    In Parallel Discrete Event Simulation (PDES), the simulation model is partitioned into a set of distinct Logical Processes (LPs) which are allowed to concurrently execute simulation events. In this work we present an innovative approach to load-sharing on multi-core/multiprocessor machines, targeted at the optimistic PDES paradigm, where LPs are speculatively allowed to process simulation events with no preventive verification of causal consistency, and actual consistency violations (if any) are recovered via rollback techniques. In our approach, each simulation kernel instance, in charge of hosting and executing a specific set of LPs, runs a set of worker threads, which can be dynamically activated/deactivated on the basis of a distributed algorithm. The latter relies in turn on an analytical model that provides indications on how to reassign processor/core usage across the kernels in order to handle the simulation workload as efficiently as possible. We also present a real implementation of our load-sharing architecture within the ROme OpTimistic Simulator (ROOT-Sim), namely an open-source C-based simulation platform implemented according to the PDES paradigm and the optimistic synchronization approach. Experimental results for an assessment of the validity of our proposal are presented as well

    Fault-Tolerant Adaptive Parallel and Distributed Simulation

    Discrete Event Simulation is a widely used technique to model and analyze complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of multiple execution units which are found in multicore processors, clusters of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units. Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)

    A new approach to reversible computing with applications to speculative parallel simulation

    In this thesis, we propose an innovative approach to reversible computing that shifts the focus from the operations to the memory outcome of a generic program. This choice allows us to overcome some typical challenges of "plain" reversible computing. Our methodology is to instrument a generic application with the help of an instrumentation tool, namely Hijacker, which we have redesigned and developed for the purpose. Through compile-time instrumentation, we enhance the program's code to keep track of the memory trace it produces until the end. Regardless of the complexity behind the generation of each computational step of the program, we can build inverse machine instructions just by inspecting the instruction that is attempting to write some value to memory. From this information, we craft an ad-hoc instruction that conveys the old value and the knowledge of where to restore it. This instruction becomes part of a more comprehensive structure, namely the reverse window. Through this structure, we have sufficient information to cancel all the updates done by the generic program during its execution. In this thesis, we discuss the structure of the reverse window as the building block for the whole reversing framework we designed and realized. Although we situate our solution in the specific context of parallel discrete event simulation (PDES) adopting the Time Warp synchronization protocol, this framework paves the way for further general-purpose development and employment. We also present two additional contributions stemming from our reversibility approach, both of which still embrace the traditional state saving-based rollback strategy. The first contribution aims to harness the advantages of both approaches. We implement the rollback operation by combining state saving with our reversible support through a mathematical model. This model enables the system to choose the best rollback strategy autonomically, according to the changing runtime dynamics of programs. The second contribution explores an orthogonal direction, still related to reversible computing aspects. In particular, we address the problem of reversing shared libraries. Indeed, by their nature, shared objects are visible to the whole system, and so is every possible external modification of their code. As a consequence, it is not possible to instrument them without affecting other unaware applications. We propose a different method to deal with the instrumentation of shared objects. All our proposals have been assessed using the latest generation of the open source ROOT-Sim PDES platform, where we integrated our solutions. ROOT-Sim is a C-based package implementing a general purpose simulation environment based on the Time Warp synchronization protocol.

    Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

    This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units available in multicore processors, clusters of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units. Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731