Large-scale benchmarks of the Time-Warp/Graph-Theoretical Kinetic Monte Carlo approach for distributed on-lattice simulations of catalytic kinetics
We extend the work of Ravipati et al. [Comput. Phys. Commun., 2022, 270, 108148] in benchmarking the performance of large-scale, distributed, on-lattice kinetic Monte Carlo (KMC) simulations. Our software package, Zacros, employs a graph-theoretical approach to KMC, coupled with the Time-Warp algorithm for parallel discrete event simulations. The lattice is divided into equal subdomains, each assigned to a single processor; the cornerstone of the Time-Warp algorithm is the state queue, to which snapshots of the KMC (lattice) state are saved regularly, enabling historical KMC information to be corrected when conflicts occur at the subdomain boundaries. Focusing on three model systems, we highlight the key Time-Warp parameters that can be tuned to optimise KMC performance. The frequency of state saving, controlled by the state saving interval, δsnap, is shown to have the largest effect on performance, which favours balancing the overhead of re-simulating KMC history with that of writing state snapshots to memory. Also important is the global virtual time (GVT) computation interval, ΔτGVT, which has little direct effect on the progress of the simulation but controls how often the state queue memory can be freed up. We find that a vector data structure is, in general, more favourable than a linked list for storing the state queue, due to the reduced time required for allocating and de-allocating memory. These findings will guide users in maximising the efficiency of Zacros or other distributed KMC software, which is a vital step towards realising accurate, meso-scale simulations of heterogeneous catalysis.
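The roles of the state saving interval and of GVT in the abstract above can be illustrated with a toy state queue. This is a minimal sketch under stated assumptions, not Zacros code: the class `StateQueue`, its methods, and the dictionary-based state are hypothetical, snapshots are taken every `snap_interval` events, and the queue is stored in Python lists (a vector-like structure).

```python
from bisect import bisect_right


class StateQueue:
    """Toy Time-Warp state queue (hypothetical names, not the Zacros API)."""

    def __init__(self, initial_state, snap_interval):
        self.snap_interval = snap_interval    # events between snapshots
        self.times = [0.0]                    # snapshot timestamps, ascending
        self.states = [dict(initial_state)]   # copies of the state at those times
        self._events_since_snap = 0

    def record(self, time, state):
        """Call after each executed event; copies the state every snap_interval events."""
        self._events_since_snap += 1
        if self._events_since_snap >= self.snap_interval:
            self.times.append(time)
            self.states.append(dict(state))
            self._events_since_snap = 0

    def rollback(self, conflict_time):
        """Return the latest snapshot at or before conflict_time and drop later ones.

        The caller must re-simulate from the returned time, so a larger
        snap_interval means fewer copies but more history to redo.
        """
        idx = bisect_right(self.times, conflict_time) - 1
        if idx < 0:
            raise ValueError("no snapshot at or before the conflict time")
        del self.times[idx + 1:]
        del self.states[idx + 1:]
        self._events_since_snap = 0
        return self.times[idx], dict(self.states[idx])

    def fossil_collect(self, gvt):
        """Free snapshots older than the latest one at or before GVT."""
        idx = bisect_right(self.times, gvt) - 1
        if idx > 0:
            del self.times[:idx]
            del self.states[:idx]
```

In this sketch, memory is only released when `fossil_collect` runs, which mirrors the abstract's point that the GVT computation interval governs how often state queue memory can be freed rather than how fast the simulation advances.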
A Generic Checkpoint-Restart Mechanism for Virtual Machines
It is common today to deploy complex software inside a virtual machine (VM).
Snapshots provide rapid deployment, migration between hosts, dependability
(fault tolerance), and security (insulating a guest VM from the host). Yet, for
each virtual machine, the code for snapshots is laboriously developed on a
per-VM basis. This work demonstrates a generic checkpoint-restart mechanism for
virtual machines. The mechanism is based on a plugin on top of an unmodified
user-space checkpoint-restart package, DMTCP. Checkpoint-restart is
demonstrated for three virtual machines: Lguest, user-space QEMU, and KVM/QEMU.
The plugins for Lguest and KVM/QEMU require just 200 lines of code. The Lguest
kernel driver API is augmented by 40 lines of code. DMTCP checkpoints
user-space QEMU without any new code. KVM/QEMU, user-space QEMU, and DMTCP need
no modification. The design benefits from other DMTCP features and plugins.
Experiments demonstrate checkpoint and restart in 0.2 seconds using forked
checkpointing, mmap-based fast-restart, and incremental Btrfs-based snapshots.
Autonomic log/restore for advanced optimistic simulation systems
In this paper we address state recoverability in optimistic simulation systems by presenting an autonomic log/restore architecture. Our proposal is unique in that it jointly provides the following features: (i) log/restore operations are carried out in a completely transparent manner to the application programmer, (ii) the simulation-object state can be scattered across dynamically allocated non-contiguous memory chunks, (iii) two differentiated operating modes, incremental vs non-incremental, coexist via transparent, optimized run-time management of dual versions of the same application layer, with dynamic selection of the best suited operating mode in different phases of the optimistic simulation run, and (iv) determination of the best suited mode for any time frame is carried out on the basis of an innovative modeling/optimization approach that takes into account stability of each operating mode vs variations of the model execution parameters. © 2010 IEEE
Analyzing and Modeling the Performance of the HemeLB Lattice-Boltzmann Simulation Environment
We investigate the performance of the HemeLB lattice-Boltzmann simulator for
cerebrovascular blood flow, aimed at providing timely and clinically relevant
assistance to neurosurgeons. HemeLB is optimised for sparse geometries,
supports interactive use, and scales well to 32,768 cores for problems with ~81
million lattice sites. We obtain a maximum performance of 29.5 billion site
updates per second, with only an 11% slowdown for highly sparse problems (5%
fluid fraction). We present steering and visualisation performance measurements
and provide a model which allows users to predict the performance, thereby
determining how to run simulations with maximum accuracy within time
constraints. Comment: Accepted by the Journal of Computational Science. 33 pages, 16
figures, 7 tables
Lightweight Asynchronous Snapshots for Distributed Dataflows
Distributed stateful stream processing enables the deployment and execution
of large scale continuous computations in the cloud, targeting both low latency
and high throughput. One of the most fundamental challenges of this paradigm is
providing processing guarantees under potential failures. Existing approaches
rely on periodic global state snapshots that can be used for failure recovery.
Those approaches suffer from two main drawbacks. First, they often stall the
overall computation which impacts ingestion. Second, they eagerly persist all
records in transit along with the operation states which results in larger
snapshots than required. In this work we propose Asynchronous Barrier
Snapshotting (ABS), a lightweight algorithm suited for modern dataflow
execution engines that minimises space requirements. ABS persists only operator
states on acyclic execution topologies while keeping a minimal record log on
cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics
engine that supports stateful stream processing. Our evaluation shows that our
algorithm does not have a heavy impact on the execution, maintaining linear
scalability and performing well with frequent snapshots. Comment: 8 pages, 7 figures
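The barrier-based idea in the abstract above can be illustrated for a single operator in an acyclic dataflow. This is a minimal sketch under stated assumptions, not Apache Flink code: the `Operator` class, its method names, and the running-sum state are hypothetical. The operator blocks each input once it delivers a barrier, and snapshots only its own state when barriers have arrived on all inputs, so no in-flight records need to be persisted.

```python
class Operator:
    """Toy operator with barrier alignment (hypothetical names, not Flink's API)."""

    def __init__(self, inputs):
        self.inputs = set(inputs)   # names of the input channels
        self.blocked = set()        # inputs that already delivered the barrier
        self.buffered = []          # records held back from blocked inputs
        self.state = 0              # operator state: a running sum
        self.snapshots = []         # (epoch, state) pairs that were persisted
        self.out = []               # everything emitted downstream, in order

    def on_record(self, channel, value):
        if channel in self.blocked:
            # This record belongs to the next epoch; hold it until the snapshot.
            self.buffered.append((channel, value))
        else:
            self.state += value
            self.out.append(("record", self.state))

    def on_barrier(self, channel, epoch):
        self.blocked.add(channel)
        if self.blocked == self.inputs:
            # Barriers seen on every input: persist the operator state only.
            self.snapshots.append((epoch, self.state))
            self.out.append(("barrier", epoch))
            self.blocked.clear()
            pending, self.buffered = self.buffered, []
            for ch, v in pending:
                self.on_record(ch, v)
```

For example, with inputs `a` and `b`, a record arriving on `a` after its barrier is held back until the barrier on `b` arrives, so the snapshot reflects exactly the records that preceded the barrier on both inputs.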