Abstract-We present the design and implementation of an autonomic state manager (ASM) tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the executable and linkable format (ELF), and developed for execution on x86_64 architectures. With ASM, the state of any logical process (LP), namely the individual (concurrent) simulation unit that is part of the simulation model, can be scattered across dynamically allocated memory chunks managed via the standard API (e.g., malloc/free). Also, the application programmer is not required to provide any serialization/deserialization module in order to take a checkpoint of the LP state or to restore it in case a causality error occurs during the optimistic run, nor to indicate which portions of the state are updated by event processing so as to allow incremental checkpointing. All these tasks are handled by ASM in a fully transparent manner via (A) runtime identification (with chunk-level granularity) of the memory map associated with the LP state, and (B) runtime tracking of the memory updates occurring within chunks belonging to the dynamic memory map. The co-existence of the incremental and non-incremental log/restore modes is achieved via dual versions of the same application code, transparently generated by ASM via compile/link time facilities. Also, the dynamic selection of the best suited log/restore mode is actuated by ASM on the basis of an innovative modeling/optimization approach which takes into account the stability of each operating mode with respect to variations of the model/environmental execution parameters.

INTRODUCTION
TIMELINESS in the delivery of simulation outputs is an increasingly relevant issue to cope with, especially in contexts where simulation is exploited as a tool for decision making. For the case of discrete event simulation (DES) models, performance issues have been traditionally targeted via the parallel-DES (PDES) paradigm [1], based on partitioning the simulation model into distinct logical processes (LPs) to be executed concurrently. Each LP models a portion of the simulated system, and interactions between different portions are captured by the exchange of timestamped event messages across the LPs. Thanks to partitioning and to concurrent LPs' execution, PDES is able to exploit the computing power offered by (high-end) parallel/distributed platforms in order to both speed up model execution and make (very) large and/or accurate models tractable. The LP is usually implemented as a set of data structures updated via a callback, whose execution, representing the processing of a simulation event, is dispatched by an underlying simulation platform (see, e.g., [2], [3], [4]). Also, causally consistent execution is typically based on forcing any LP to process its input events in non-decreasing timestamp order.
To support local timestamp ordering at the LP, two synchronization approaches have been proposed: conservative and optimistic. The conservative approach (see, e.g., [5]) rules out any possibility for an event to be executed out of timestamp order. This is achieved via block-until-safe policies suspending processing activities at the LP until it is determined that the execution of its next pending event is coherent with logical-time ordering. On the other hand, the optimistic approach (see, e.g., [6]) allows the LP to speculatively process its available input events under the assumption that timestamp ordering will not be violated. If any violation is eventually detected, rollback recovery mechanisms bring the involved LPs back to a correct snapshot of their states, starting from which execution is resumed. Literature results show that the optimistic approach lends itself to higher exploitation of parallelism, and delivers performance which is less influenced by the simulation model lookahead. 1 These advantages are reflected also on the side of scalability, as recently shown in [7], where very large platforms (with thousands of CPU cores) are employed for a comparative analysis of conservative versus optimistic approaches.
On the other hand, recoverability of the LPs' states, which is the building block for optimistic synchronization and which has been traditionally supported via log/restore techniques, poses problems on the side of both resource usage and application transparency. As for the former aspect, we need to consider both CPU usage for tasks enabling state recoverability (such as state logging) and recovery actions, as well as memory usage for keeping recoverability-related data/metadata.
The issue of transparency deals with avoiding the need for recoverability modules to be demanded from the application programmer, hence masking to her the actual synchronization paradigm. This is a non-trivial aspect that relates to the flexibility according to which the programmer is allowed to organize the data structures representing the LP state image, whose log/restore is then demanded from the underlying PDES platform. For incremental logging, this also requires transparent and efficient identification of the updates occurring within the LP-state layout.
In this paper we present a fully innovative autonomic state manager (ASM) architecture, targeting C-based, ELF, x86_64 platforms, which jointly addresses transparency and performance issues in state recoverability by exhibiting the following features: (A) It allows the application programmer to use standard constructs for dynamic memory allocation/deallocation, hence allowing the LP state to be scattered across non-contiguous memory chunks. (B) It transparently enables phase-interleaved adoption of incremental and non-incremental log/restore modes. (C) It runs each log/restore mode in a highly optimized fashion, via the adoption of dual-coding approaches and of classical schemes for the optimization of parameters determining the actual overhead for each mode. (D) It dynamically (and transparently) switches to the best suited operating mode (incremental versus non-incremental) depending on execution dynamics of the optimistic simulation run. This is done on the basis of an innovative approach that takes into account performance stability of each operating mode versus variations of application and/or environmental parameters. While some of the above points are dealt with by log/restore proposals in the literature, none of them fully covers the whole set of listed features.
ASM has been integrated into the ROOT-Sim open source simulation platform [4], [8] based on the C language and MPI. We also report a performance study for an assessment of ASM, which has been based on running a suite of applications on top of a 32-core (64 GB of RAM) HP ProLiant machine, which is representative of current off-the-shelf commodity hardware exploitable for scientific computing.
This paper significantly expands the preliminary conference paper in [9]. Particularly, it provides details on the memory management architecture and reports experimental data for a wider set of benchmark applications, for scaled-up models and with a larger computing platform.
The remainder of this paper is structured as follows. In Section 2 related work is discussed. Section 3 presents the complete ASM architecture, including memory management aspects and the performance models it relies on. Experimental results are presented in Section 4. Conclusions are drawn in Section 5. A supplementary material section is available online, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2014.2323967.
RELATED WORK
Logging is the classical means to support recoverability of the LP state in optimistic PDES systems. It relies on (infrequently) saving the LP state image in order to generate restoration points along the simulation time axis. Several studies have provided analytic models describing the expected log/restore overhead when experiencing a given rollback pattern (e.g., in terms of frequency of rollback occurrence at the LP) and when taking state logs, namely checkpoints, at specific points of the execution (for instance every x events) [10], [11], [12]. By monitoring the independent parameters appearing in the analytic expressions, the models can be used to (dynamically) determine the settings that keep the log/restore overhead at minimum values. More specifically, taking checkpoints less frequently reduces the log cost. However, the state to be recovered might not be immediately available within the log, and would need to be reconstructed by restoring an older snapshot and by fictitiously reprocessing the intermediate events up to the restoration point (this re-processing phase is also known as coasting forward). Analytical models help determine a well-suited balance between these two opposite overhead tendencies.
The provided models target either non-incremental logging or incremental logging, or the case where the two approaches are used in combination (e.g., by taking incremental logs between subsequent non-incremental logs) [13] or are considered comparatively [14]. However, these proposals have been mainly tailored for the evaluation of log/restore policies (once the costs of basic operations are known, such as the copy of the whole, or part of, the LP state image into the log buffer), not to provide log/restore architectures explicitly tackling transparency of log/restore tasks to the application-level code.
The issue of transparency has been dealt with by other studies. For incremental log/restore schemes this has been done by either instrumenting application level code (in order to transparently insert code portions aimed at identifying state update operations) [15] or by employing operator overloading schemes, as for the case of the object-oriented proposal in [16]. These approaches require compile-time identification of the memory portions forming the actual state image of the LP, hence they do not cope with dynamic memory. On the other hand, the solution in [17] provides support for transparency in the context of dynamic memory based state layouts, but limited to non-incremental logging.
Other proposals have provided log architectures based on specialized hardware [18], [19]. They achieve some level of transparency, while also offloading the CPU, at the price of limiting the programming model, e.g., by imposing contiguousness or static determination of the memory area maintaining the state image of the LP.
The case of dynamic-memory based LP states has been addressed by the proposals in [20], [21]. However, the level of transparency is not maximized since ad-hoc dynamic memory allocation/deallocation APIs are used to notify the underlying simulation platform that the corresponding operation needs to be rollbackable.
Full transparency, in combination with incremental logging, has been provided by the high-level-architecture (HLA) oriented proposal in [22]. This approach exploits page-based memory update tracking mechanisms relying on facilities offered by the underlying operating system. Hence the granularity according to which incremental logs are taken cannot be set arbitrarily, and cannot be optimized depending on the actual needs. This proposal is suited for federations of simulation components where a middleware layer (namely the HLA RunTime Infrastructure) is used to operate distributed coordination and data exchange, whose overhead tends to mask the one imposed by page-based logging. It is less suited for traditional PDES platforms, which rely on highly optimized low-overhead engine-level coordination and data-exchange facilities.
An approach to state recoverability which is orthogonal to the aforementioned solutions is based on reverse computing [23], [24], [25], where the forward execution code is coupled with a reverse code version which is in charge of backward compensating (hence undoing) the updates performed on the LP state in case a rollback occurs. The issue of automating the generation of the reverse code, which targets transparency to the application programmer, is also addressed in [23]. This approach reduces the memory demand for state-log buffers, while also nullifying the log overhead (since logs are not taken at all). On the other hand, the tradeoff is towards a potential increase of the restore latency in case very long rollbacks occur, which would require long reverse computing paths to reach the requested state recovery point. This approach demands the combination with periodic state logging in order (a) to avoid excessively long backward computation phases, and (b) to deal with non-reversible operations (such as plain assignments within the state image).
THE AUTONOMIC STATE MANAGER
Memory Management Architecture
Memory Mapper and Allocator
ASM's memory allocator relies on malloc, realloc, calloc and free wrappers which are transparently interposed, via simple compile-time directives, between the application-level code and the standard malloc library (see Fig. 1). In order to allow rollbackable memory-management operations, the wrapper must know the identity of the calling LP. This is done by internally assigning a unique identifier in the range [0, N - 1] to each LP, where N is the total number of LPs hosted by the local instance of the simulation platform. 2 Using ASM_init(int num_LPs) the platform can notify the number of locally-hosted LPs to ASM, which in turn allocates an array of num_LPs entries structured as:
base_state_address, identifying the address that should be passed to the event processing callback upon dispatching the LP so as to allow it to correctly access its state in memory;
state_layout_info, identifying the address and the current size of a metadata table keeping information on the memory layout for the LP state.
The API void *set_current_LP(int LP_id, time_t sim_time) exposed by ASM allows the simulation platform to notify ASM of the identity of the local LP that is about to execute its next simulation event. In this way, the wrapper can identify the LP metadata it must refer to upon subsequent malloc/free invocations by the application software, and can be informed about the current logical time of the dispatched LP. This is required to make memory deallocations correctly rollbackable, based on the relation between the advancement of the global virtual time (GVT) of the simulation and the simulation time associated with free calls (see Section 3.1.2). 3 LP metadata, accessible via state_layout_info, are organized into table entries structured as:
    struct malloc_area {
        int my_index, dirty_area;
        size_t chunk_size;
        void *where;
        int total_chunks, in_use_chunks, dirty_chunks;
        int next_chunk;
        time_t last_access;
        struct malloc_area *prev, *next;
    };
Each entry is used to manage a block of contiguous memory chunks of a given size, and different blocks host chunks with size corresponding to different powers of 2 (as supported by standard configurations of the malloc library). The chunk_size field indicates the size associated with the malloc_area entry. The where field is initially set to NULL. When a malloc call is issued by the application-level code, the chunk size that best fits the request is identified, a block of contiguous chunks is allocated by the memory map manager via the underlying standard malloc library, and its address is registered within the where field, thus validating that malloc_area entry. This approach implements a pre-allocation strategy, where a block of pre-allocated chunks is reserved for a specific LP.
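As a minimal sketch of the size-class selection step described above, the following C fragment rounds a request up to the nearest power of two and maps it to a table index. It is illustrative only and not ASM's actual code: the function names chunk_size_for and size_class_index, and the MIN_CHUNK_SIZE value of 32 bytes, are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed smallest size class for this sketch; the paper does not state one. */
#define MIN_CHUNK_SIZE ((size_t)32)

/* Round a request up to the nearest power-of-two chunk size.
 * Returns 0 when doubling would overflow size_t. */
size_t chunk_size_for(size_t request)
{
    size_t size = MIN_CHUNK_SIZE;
    while (size < request) {
        if (size > (size_t)-1 / 2)
            return 0;
        size *= 2;
    }
    return size;
}

/* Position of a size class in a hypothetical per-LP table:
 * 0 for MIN_CHUNK_SIZE, 1 for twice that, and so on. */
int size_class_index(size_t chunk_size)
{
    assert(chunk_size >= MIN_CHUNK_SIZE);
    int index = 0;
    for (size_t s = MIN_CHUNK_SIZE; s < chunk_size && s <= (size_t)-1 / 2; s *= 2)
        index++;
    return index;
}
```

Under these assumptions a 100-byte request is served from the 128-byte class, so internal fragmentation stays below a factor of two for any request larger than the smallest class.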
By default, the virtual address returned by the ASM memory allocator upon the very first malloc call for a specific LP is registered as the base_state_address for that LP. However, the SetState(void *addr) API is offered to the programmer to allow changes in the base_state_address. Hence, the LP can notify the underlying platform about any change in the memory positioning of the data structure representing the current root for the whole memory layout of its state. For both time and space efficiency, each chunk within a pre-allocated block is associated with a single bit that indicates its current status, in terms of whether it is in use or not. 4 The resulting status bitmap is placed at the head of the pre-allocated block of chunks, along with a dirty bitmap (see Fig. 2).
2. For platforms not adopting this strategy to identify LPs, a mapping between ASM's strategy and any other one can be created.
3. Optimistic synchronization relies on both event-messages and anti-messages, used to annihilate previously sent event-messages and inform the original receiver of the occurred rollback (this may in turn trigger a rollback in case the annihilated event-message was already processed at the destination). The GVT value represents the commitment horizon of the simulation, namely the time barrier currently separating the set of committed events from the ones which can still be subject to rollback. This barrier corresponds to the minimum timestamp of not yet processed or in-transit event-messages/anti-messages.
4. This is a main difference from the original malloc library, where a complex header is associated with each managed chunk in order to maximize flexibility in memory usage (e.g., by dynamically partitioning or aggregating chunks according to the so-called "boundary tagging" scheme [26]). Nevertheless, ASM exploits such a flexibility by ultimately relying on the malloc library for actual virtual memory allocation.
Upon a memory allocation, the in_use_chunks counter is updated, and next_chunk in the involved malloc_area is used to identify the most convenient position for starting the bitmap search in order to identify a free chunk. The manipulation of next_chunk follows a first-fit policy aimed at reducing both free chunks and bitmap fragmentation by aggregating in-use chunks in the initial part of the block.
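The first-fit search driven by next_chunk can be sketched as follows. This is a simplified illustration under stated assumptions, not ASM's implementation: struct block_sketch, the fixed TOTAL_CHUNKS capacity and the three functions are hypothetical names, and the release path is included only to show one way of keeping next_chunk on the lowest index that may be free.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define TOTAL_CHUNKS 64                                  /* assumed block capacity */
#define WORD_BITS ((int)(sizeof(unsigned) * CHAR_BIT))
#define BITMAP_WORDS ((TOTAL_CHUNKS + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical, reduced view of one block: status bitmap plus counters. */
struct block_sketch {
    unsigned use_bitmap[BITMAP_WORDS];
    int total_chunks;
    int in_use_chunks;
    int next_chunk;      /* every chunk below this index is in use */
};

static int bit_is_set(const unsigned *bitmap, int i)
{
    return (bitmap[i / WORD_BITS] >> (i % WORD_BITS)) & 1u;
}

void block_init(struct block_sketch *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof *b);
    b->total_chunks = TOTAL_CHUNKS;
}

/* First fit: start at next_chunk and take the first free bit.
 * Returns the chunk index, or -1 when the block is exhausted. */
int chunk_alloc(struct block_sketch *b)
{
    for (int i = b->next_chunk; i < b->total_chunks; i++) {
        if (!bit_is_set(b->use_bitmap, i)) {
            b->use_bitmap[i / WORD_BITS] |= 1u << (i % WORD_BITS);
            b->in_use_chunks++;
            b->next_chunk = i + 1;
            return i;
        }
    }
    return -1;
}

/* Mark a chunk as free again and pull next_chunk back if needed,
 * so that later allocations refill the initial part of the block. */
void chunk_release(struct block_sketch *b, int i)
{
    if (i < 0 || i >= b->total_chunks || !bit_is_set(b->use_bitmap, i))
        return;
    b->use_bitmap[i / WORD_BITS] &= ~(1u << (i % WORD_BITS));
    b->in_use_chunks--;
    if (i < b->next_chunk)
        b->next_chunk = i;
}
```

Because a released low index is reused before any higher free one, in-use chunks tend to stay packed at the start of the block, which is the property the early-stop bitmap scan relies on.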
When a block of chunks of a given size gets exhausted, the metadata table is expanded via a standard realloc operation, leaving available at least one new malloc_area entry, which gets linked via the prev and next fields, creating a list of entries used to manage chunks of a given size. Also, a new block of contiguous chunks of that size gets allocated. In this scenario, we can deduce that chunks of that size are highly useful for serving memory allocation requests for the LP. Consequently, whenever we expand the metadata table, we double the size of the newly reserved block of chunks.
Upon a free call, the associated chunk (and the corresponding block) is not released. Instead it is marked in the status bitmap as available for future allocations. In this way, memory deallocations are correctly rollbackable until they get eventually committed due to GVT advancement. Operatively, this is achieved by also exploiting the last_access field within the malloc_area entry, which is used to record the logical time associated with the last memory allocation/deallocation operation within the corresponding block, and to determine whether a block formed by chunks that have all been released can be really deallocated.
ASM has also been equipped with a software-level direct-mapped caching subsystem, with cache lines formed by the tuple <chunk_addr, chunk_end_addr, malloc_area_index>. Upon chunk allocation, the cache line is filled so that, in case of a subsequent free operation associated with that same chunk address, the wrapper retrieves the corresponding malloc_area in O(1) time (except in cases where the same cache line is overwritten during the run). A cache line is reset only when the corresponding chunk gets really deallocated.
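A direct-mapped cache of this kind could look like the sketch below. The line count, the index function (which drops the four low address bits on the assumption of 16-byte chunk alignment) and all function names are illustrative assumptions, not details taken from ASM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINES 1024u     /* assumed; must be a power of two */

struct cache_line {
    uintptr_t chunk_addr;     /* 0 means the line is empty */
    uintptr_t chunk_end_addr;
    int malloc_area_index;
};

static struct cache_line cache[CACHE_LINES];

/* Direct mapping: each address selects exactly one line. */
static size_t line_of(uintptr_t addr)
{
    return (size_t)((addr >> 4) & (CACHE_LINES - 1));
}

/* Called on allocation: overwrites whatever the line held before. */
void cache_fill(void *chunk, size_t chunk_size, int area_index)
{
    assert(chunk != NULL);
    uintptr_t a = (uintptr_t)chunk;
    struct cache_line *l = &cache[line_of(a)];
    l->chunk_addr = a;
    l->chunk_end_addr = a + chunk_size;
    l->malloc_area_index = area_index;
}

/* Called on free: O(1) hit, or -1 when the line was overwritten. */
int cache_lookup(const void *chunk)
{
    uintptr_t a = (uintptr_t)chunk;
    const struct cache_line *l = &cache[line_of(a)];
    return (a != 0 && l->chunk_addr == a) ? l->malloc_area_index : -1;
}

/* Called when the chunk is really deallocated. */
void cache_reset(const void *chunk)
{
    uintptr_t a = (uintptr_t)chunk;
    struct cache_line *l = &cache[line_of(a)];
    if (l->chunk_addr == a)
        l->chunk_addr = 0;
}
```

On a miss the wrapper would have to fall back to a slower search of the metadata table; that fallback path is not shown here.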
Non-Incremental Log/Restore Support
The support for non-incremental logging operations (also termed full logs) linearizes and packs the LPs' currently allocated chunks in a properly-sized contiguous log buffer (allocated via the malloc library), along with metadata describing the current memory layout. Further, in order to make the invocation of SetState() rollbackable, the base_state_address value is also logged. The trigger of the full log operation is a call to the function take_full_log(void), which takes the checkpoint for the current LP, namely the one set via the aforementioned set_current_LP service. The log buffer is then linked to the head of a list of logs ordered by logical time values passed through by the LP.
To minimize checkpoint size, malloc_area entries that are not currently valid (i.e., those areas with no chunks allocated) are not logged at all, while the logged ones explicitly keep track (via the my_index field) of their original location within the metadata table. The chunks to be checkpointed are identified via a memory block's bitmap scan operation. Chunks with status bit set are the only ones packed into the checkpoint buffer, while the use bitmap is entirely logged, in order to allow correct chunks restoration by providing information on the correct positions they need to be copied back in their memory block. As the dirty bitmap does not contain any useful information for full logs, it is ignored.
Bitmap scanning is stopped early if the number of bits that have already been found set is equal to in_use_chunks. If only a few chunks are currently in use within a block, by the aforementioned chunk-selection-for-allocation algorithm (i.e., the one which updates the value of next_chunk), they are likely to be placed in the initial portion of the memory block, which helps make the early stop approach effective. 5 On the other hand, when most of the chunks in a memory block are currently in use, an orthogonal optimization has been introduced. Specifically, no bitmap scanning is performed. Rather, the whole memory block (including unused chunks as well) is copied with a single memcpy call. The considerable benefit comes from the fact that modern processors offer optimized instructions to copy contiguous memory buffers of generic size (e.g., the movs instruction on IA-32 compliant processors). This optimization exploits a threshold-based mechanism that switches between the two approaches (i.e., selective versus non-selective) when the percentage of in-use chunks within a memory block oversteps a given value, providing a hysteresis region for stability reasons.
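The two packing strategies and the hysteresis-based switch between them can be sketched as follows. The watermark values (0.80 and 0.60), the byte-oriented bitmap layout and the function names are assumptions made for illustration; the paper does not report the thresholds actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum copy_mode { COPY_SELECTIVE, COPY_WHOLE_BLOCK };

/* Assumed watermarks; the gap between them is the hysteresis region. */
#define HIGH_WATERMARK 0.80
#define LOW_WATERMARK  0.60

/* Keep the previous mode while occupancy sits between the watermarks,
 * so that small oscillations do not flip the strategy at every log. */
enum copy_mode choose_copy_mode(enum copy_mode previous, int in_use, int total)
{
    double occupancy = total > 0 ? (double)in_use / total : 0.0;
    if (occupancy > HIGH_WATERMARK)
        return COPY_WHOLE_BLOCK;
    if (occupancy < LOW_WATERMARK)
        return COPY_SELECTIVE;
    return previous;
}

/* Selective packing: copy only chunks whose status bit is set and stop
 * as soon as in_use_chunks of them have been found.
 * Returns the bytes written; *bits_scanned reports how far the scan went. */
size_t pack_selective(const unsigned char *bitmap, int total_chunks,
                      int in_use_chunks, const char *block,
                      size_t chunk_size, char *out, int *bits_scanned)
{
    assert(bitmap != NULL && block != NULL && out != NULL);
    size_t written = 0;
    int found = 0;
    int i = 0;
    for (; i < total_chunks && found < in_use_chunks; i++) {
        if (bitmap[i / 8] & (1u << (i % 8))) {
            memcpy(out + written, block + (size_t)i * chunk_size, chunk_size);
            written += chunk_size;
            found++;
        }
    }
    if (bits_scanned != NULL)
        *bits_scanned = i;
    return written;
}
```

When choose_copy_mode returns COPY_WHOLE_BLOCK, the caller would instead issue a single memcpy of total_chunks * chunk_size bytes and record a per-block flag so that the restore side knows the layout was saved non-selectively.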
The general API offered by ASM for triggering a restore operation of the state of the current LP is state_restore(time_t requested_time, time_t *restored_time). When operating in non-incremental log mode, this function traverses the log-chain searching for the most recent full log with logical time less than or equal to requested_time. The restoration procedure unpacks the state, by delinearizing memory chunks stored into the contiguous log buffer and placing them back into the original memory blocks' positions. malloc_area entries (maintained by the same log buffers) are used to identify correct original blocks, and are restored as well. The exact chunk locations within the blocks are identified using the logged bitmaps, which are also restored at the head of the corresponding memory blocks. 6 Finally, ASM notifies the restored logical time via the restored_time parameter.
5. The only exception is when a very adverse sequence of operations occurs, formed by several allocations and then a few deallocations, leaving memory holes scattered across the whole block.
6. The only exception is when the chunks within a given block were saved non-selectively due to large block occupancy, which is signaled via a special per-block flag inserted in the log buffer. In this case the bitmap is restored but is not used to identify the position of the chunks within the block since it is directly determined by the layout of the contiguous log buffer.
Metadata table entries which were not valid at checkpointing time are not present in the log buffer, yet they must be reinitialized to make them compliant with the restored state layout and its logical time. Specifically, we reset the in_use_chunks field to zero and we set the last_access field to the currently restored logical time. However, memory blocks pointed to by the where field are not really released, although the associated bitmap is reset. In fact, memory areas reserved for the LP are never deallocated due to the effects of a rollback operation. Also, when restoring valid malloc_area entries, the current linking of the metadata table entries (including the non-valid ones) is maintained. This allows the previously described algorithms for memory allocation to be piece-wise-deterministic (PWD), meaning that the address of the allocated chunk is deterministically selected, unless a new block allocation is really required. In the latter case the address depends on the block allocation address selected by the underlying malloc library. Hence, ASM can serve a replayed sequence of LP allocation requests (which has been already served before the state restoration procedure) with the same identical memory addresses. This allows supporting coasting-forward operations correctly in case the overlying application complies with the PWD assumption (even when the application logic strongly relies on the addresses of allocated buffers).
Hence, any optimized strategy for selecting checkpointing intervals and for balancing checkpoint-overhead reduction with coasting-forward latency can be also employed for performance optimization reasons, as we shall discuss later on.
Incremental Log/Restore Support
Incremental log operations for the current LP are similar to full ones, although they are triggered via the take_incremental_log(void) API offered by ASM. Also, the incremental-log buffer is still linked to the log chain according to increasing values of the logical time of the current LP. However, actual packing operations depend on extra information explicitly used to track updates in memory chunks as follows:
A: dirty_area is set and dirty_chunks is zero. In this case the malloc_area is logged together with the status bitmap. Yet the dirty bitmap and the currently in-use chunks are not logged.
B: dirty_area is set and dirty_chunks is greater than zero. In this case the malloc_area is packed into the log buffer together with the status bitmap, the dirty bitmap and the chunks that are currently in use and which have been dirtied.
C: dirty_area is not set. In this case, no information associated with the area is logged at all.
Further, independently of the actual case among the aforementioned ones, data structures tracking dirty data/ metadata are reset upon taking the incremental log.
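The three cases, together with the reset of the dirty-tracking fields, can be summarized by the following sketch. The enum, the reduced struct area_dirty_state and the function names are hypothetical and only mirror the dirty_area and dirty_chunks fields of malloc_area; the actual packing of bitmaps and chunks is omitted.

```c
#include <assert.h>
#include <stddef.h>

enum incr_action {
    LOG_NOTHING,                       /* case C */
    LOG_METADATA_ONLY,                 /* case A: malloc_area + status bitmap */
    LOG_METADATA_AND_DIRTY_CHUNKS      /* case B: plus dirty bitmap and dirtied chunks */
};

/* Hypothetical reduced view of the dirty-tracking fields of a malloc_area. */
struct area_dirty_state {
    int dirty_area;
    int dirty_chunks;
};

enum incr_action classify_area(int dirty_area, int dirty_chunks)
{
    if (!dirty_area)
        return LOG_NOTHING;
    if (dirty_chunks == 0)
        return LOG_METADATA_ONLY;
    return LOG_METADATA_AND_DIRTY_CHUNKS;
}

/* Decide what to pack for one area, then reset its dirty tracking so the
 * next incremental log only sees updates made after this one. A real
 * implementation would also clear the dirty bitmap of the block. */
enum incr_action take_incremental_step(struct area_dirty_state *area)
{
    assert(area != NULL);
    enum incr_action action = classify_area(area->dirty_area, area->dirty_chunks);
    area->dirty_area = 0;
    area->dirty_chunks = 0;
    return action;
}
```

Case A would correspond, for instance, to an area whose status bitmap changed because of an allocation or a free while no chunk content was written.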
We finally emphasize that, in our proposal, incremental state log operations can be taken infrequently. In fact they are based on the recognition of memory portions that have been dirtied since the last log, independently of the amount of events actually performing the dirtying operations.
When a restore operation must be executed, still invoked for the current LP via the generic API state_restore(time_t requested_time, time_t *restored_time), the latest incremental log with time less than or equal to requested_time is selected (its timestamp will then be returned via the restored_time parameter). Then, the following steps are iterated by backward traversing the chain of logs:
1) A (not-yet-restored) malloc_area found inside the log buffer is put back in place inside the metadata table. The associated status bitmap is also copied back from the log buffer.
2) Each dirty chunk found inside the log and associated with the malloc_area, which has not yet been restored in a previous iteration while backward traversing the log, is copied back in its correct position inside the corresponding memory block.
The iterative restore procedure stops when all the active malloc_area entries have been restored and all the in-use chunks that have been dirtied are also restored. Although in principle this could entail an indefinite number of iterative backward steps along the log chain, in practice the restore operation can be immediately finalized once we find a full log while backward traversing the log chain. In fact, all the in-use chunks that have not yet been restored are immediately available inside the full log for copy-back operations. Full logs can be explicitly interleaved with incremental ones just for performance optimization purposes, as we shall discuss.
As a final note, the application transparent detection of actual write operations occurring within the LP state image (and hence the runtime identification of the chunks being dirtied) takes place in our architecture via compile/link time instrumentation of the application code. Details on the instrumentation process we have instantiated, which is particularly tailored to runtime efficiency of the instrumented code, are reported in the supplementary material section of this paper, available online.
Memory Recovery
As explained, incremental and non-incremental logs are all linked together within per-LP lists, sorted by simulation time. Obsolete logs can be discarded, thus allowing virtual memory recovery, via the void prune_logs(time_t new_GVT) API. This function scans the log queue for each managed LP, finds the most recent full log with time less than or equal to the value of new_GVT, and prunes all the logs with a lower simulation time. Given the organization of the aforementioned recovery procedures, maintaining at least one full log with time less than or equal to the newly computed GVT value allows correct recoverability of the LP state.
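A pruning pass over one LP's log chain can be sketched as below. The sketch assumes a singly linked chain ordered from newest to oldest and keeps the most recent full log at or before the GVT, which is the reading under which older logs become reclaimable while the state stays recoverable. struct log_node, sim_time_t, push_log and prune_chain are hypothetical names, not ASM's data structures.

```c
#include <assert.h>
#include <stdlib.h>

typedef double sim_time_t;    /* assumed representation of simulation time */

/* Hypothetical log descriptor; payload omitted. Chain runs newest to oldest. */
struct log_node {
    sim_time_t time;
    int is_full;
    struct log_node *older;
};

/* Link a new log at the head of the chain; returns the new head. */
struct log_node *push_log(struct log_node *newest, sim_time_t time, int is_full)
{
    struct log_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->time = time;
    n->is_full = is_full;
    n->older = newest;
    return n;
}

/* Keep the most recent full log with time <= new_gvt and free every log
 * older than it. Returns the number of logs released. */
int prune_chain(struct log_node *newest, sim_time_t new_gvt)
{
    for (struct log_node *n = newest; n != NULL; n = n->older) {
        if (n->is_full && n->time <= new_gvt) {
            int pruned = 0;
            struct log_node *victim = n->older;
            n->older = NULL;
            while (victim != NULL) {
                struct log_node *next = victim->older;
                assert(victim->time <= n->time);   /* chain ordering invariant */
                free(victim);
                victim = next;
                pruned++;
            }
            return pruned;
        }
    }
    return 0;   /* no full log at or before the GVT: nothing can be discarded */
}
```

Incremental logs newer than the retained full log are left untouched, since a rollback to any time at or after the GVT may still need them.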
Dual Coding Mechanism
Memory-update tracking facilities should be enabled or disabled depending on the log-mode selected for the current phase of execution, in order to minimize the actual execution overhead. The most immediate way to achieve this goal is to insert a check on a particular predicate at the very beginning of the tracking routine, so that the monitor would simply return control to the application-level code if actual updates are not required to be identified. This is the case when running in full-log mode. However, such a solution would impose overhead for non-useful housekeeping tasks, given that we already know that the current log-mode is the full one.
To overcome this issue, our instrumentation process provides optimized coexistence of two versions of the same application modules, according to a dual-coding approach. Particularly, instrumentation generates two .text sections in the final executable, one containing a non-instrumented version of the application code, the other containing the instrumented counterpart (where memory writes are tracked). These sections are transparently placed into the executable layout (together with the respective .rodata) in different virtual memory sections, using some standard facilities provided by the GNU linker ld. At the same time, the function entry points are accessed (via function pointers) in such a way that the simulation platform is able to call the LP dispatching-callback via its standard API, and it is ASM's burden to make the pointers actually point to the right code version. Also, the replicated .data/.bss sections associated with the two versions of the application object code have been collapsed on the same virtual addressing range in order to provide a single actual copy of initialized and non-initialized data, accessible by both generated code versions (see Fig. 3).
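At source level, the effect of dual coding resembles switching a function pointer between two versions of the same handler, as in the sketch below. This is only an analogy under stated assumptions: in ASM the two versions are produced by instrumentation and placed in separate .text sections by the linker, whereas here they are written by hand, and all names (handler_plain, handler_tracked, select_log_mode, dispatch) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*event_handler_t)(int *state, int event);

enum log_mode { MODE_FULL, MODE_INCREMENTAL };

static unsigned long tracked_writes;   /* stand-in for the dirty-chunk bookkeeping */

/* Non-instrumented version: no tracking cost at all. */
static void handler_plain(int *state, int event)
{
    state[0] += event;
}

/* Instrumented version: every write is reported before being performed. */
static void handler_tracked(int *state, int event)
{
    tracked_writes++;
    state[0] += event;
}

static event_handler_t process_event = handler_plain;

/* The state manager, not the platform, decides which version is live. */
void select_log_mode(enum log_mode mode)
{
    process_event = (mode == MODE_INCREMENTAL) ? handler_tracked : handler_plain;
}

/* The platform always dispatches through the same entry point. */
void dispatch(int *state, int event)
{
    assert(process_event != NULL);
    process_event(state, event);
}

unsigned long tracked_write_count(void)
{
    return tracked_writes;
}
```

The point of the design is that the full-log path pays no per-write predicate check: the decision is made once, when the mode changes, rather than inside the tracking routine.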
Log/Restore Overhead Modeling
After having enabled the optimized co-existence of incremental and non-incremental log/restore modes, as explained in the previous section, we provide the models assessing the corresponding overhead per event (due to both log and restore operations). These models borrow from the one presented in [10] for periodic non-incremental logging, for which we provide both (i) a specialization to capture internal mechanisms proper of our advanced memory-map manager (i.e., the cost of managing metadata identifying scattered memory layouts), and (ii) an extension to accommodate the case of incremental logging as supported by our architecture. The model in [10] describes the log/restore overhead on a per-LP basis. We inherit this feature in our modeling approach, thus providing a scheme allowing dynamic optimization of the log/restore mode for any individual LP. Consequently, from now on, overhead modeling and optimization of the log/restore mode implicitly refer to what is experienced by each single LP.
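For orientation, a periodic-checkpointing cost model of this kind can be worked through as follows. The expression is a reconstruction under assumptions and not necessarily the exact formula of this paper or of [10]: it charges each event with the amortized cost of one full log every $\chi_F$ events, plus, with rollback probability $P_r$, the cost of reloading a full state image and of coasting forward over $(\chi_F - 1)/2$ events on average. Here $\delta_e$ is the average event execution cost, $S_F$ the average full-log size, and $\delta_{LB}$ and $\delta_{RB}$ the per-byte log and restore costs.

```latex
% Assumed per-event overhead for periodic full logging
OH_F(\chi_F) = \frac{\delta_{LB}\, S_F}{\chi_F}
             + P_r \left[ \delta_{RB}\, S_F + \frac{\chi_F - 1}{2}\, \delta_e \right]

% Treating \chi_F as continuous and setting the derivative to zero
\frac{d\, OH_F}{d \chi_F}
  = -\frac{\delta_{LB}\, S_F}{\chi_F^{2}} + \frac{P_r\, \delta_e}{2} = 0
\quad\Longrightarrow\quad
\chi_F = \sqrt{\frac{2\, \delta_{LB}\, S_F}{P_r\, \delta_e}}

% The second derivative 2\,\delta_{LB} S_F / \chi_F^{3} is positive,
% so this stationary point is a minimum.
```

Under this assumed form the reload term $\delta_{RB} S_F$ does not depend on $\chi_F$, so it shifts the overhead without moving the optimum; in practice the result would be rounded to an integer number of events.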
For the non-incremental case, borrowing from [10] and recalling the aforementioned specialization, the log/restore overhead per event can be expressed as
where: $\delta_e$ is the average event execution cost; $S_F$ is the average size of a full log; $\delta_{LB}$ is the average cost for logging a single byte belonging to the state image, which we consider to include the per-byte cost for logging the metadata kept by the memory-map manager; $\delta_{RB}$ is the average cost for restoring one byte from the log, again assumed to include the per-byte cost for restoring state-layout metadata; $P_r$ is the rollback probability; $\chi_F$ is the selected log interval when operating according to the non-incremental mode. The above overhead is minimized for
[10], and we denote as $\chi_F^{opt}$ the optimal non-incremental log interval according to this equation. 7 For the incremental mode, as supported by our architecture, log operations need not be taken at each simulation event; they can be taken periodically. Accordingly, state reconstruction at any simulation time can be supported via a mixture of state restore from the log and classical coasting forward. Also, full logs can be (infrequently) interleaved with incremental logs to enable fossil collection of incremental log records with a timestamp less than the timestamp of the latest committed full log. These full logs are in any case exploitable during recovery procedures since, while traversing the log chain backward, the restore operation of a complete state image is finalized by extracting from the log all the in-use chunks that have not yet been restored via the scan of incremental logs, and putting them back in place within the state layout. To account for such optimized internal mechanisms offered by the memory-map manager, the above equation can be adapted as shown below to model the log/restore overhead for the incremental mode
where: $S_P$ is the average size of a partial (incremental) log; $\chi_I$ is the selected log interval when operating according to the incremental mode; $\chi_{I,F}$ is the interleave step between full and incremental logs (number of incremental log operations after which a full log is taken); $\delta_m$ is the per-event cost for running the memory-update tracking module. In equation (2), the term $S_F \delta_{RB}$ accounting for the cost of state reload from the log is comparable to the one in equation (1), due to the aforementioned mechanism, according to which all the in-use chunks belonging to the state image are restored (by retrieving them either from the incremental logs along the log chain, or from the first full log found during the backward traversal of the log chain). Further, each event is charged with the memory-update monitoring overhead $\delta_m$, which also appears during coasting forward. By exploiting the same arguments used in [10] for the minimization of the overhead versus the log interval, we get that the optimum value for the interval of incremental logs can be computed as
and we denote as $\chi_I^{opt}$ the optimal interval according to this equation. 8
Fig. 3. The dual-coding generation process.
7. As suggested in [10], a proper upper bound can be selected for $\chi_F^{opt}$ in order to enable efficient fossil collection of obsolete event buffers. In fact, events with timestamp between the newly computed GVT and the timestamp of the latest available log preceding GVT cannot be garbage collected, to allow correct coasting forward.
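The backward traversal of the log chain described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not ASM's implementation: the `LogRecord` layout, the `in_use` field (standing in for the memory-map metadata that identifies allocated chunks) and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class LogRecord:
    full: bool                # True for a full log, False for an incremental one
    chunks: Dict[int, bytes]  # chunk id -> logged content
    in_use: FrozenSet[int]    # ids of the chunks allocated when the log was taken


def restore_state(chain: List[LogRecord]) -> Dict[int, bytes]:
    """Rebuild the state image at the newest record of a chain ordered oldest to newest."""
    if not chain:
        raise ValueError("empty log chain")
    wanted = chain[-1].in_use
    restored: Dict[int, bytes] = {}
    for record in reversed(chain):
        for chunk_id, content in record.chunks.items():
            # The most recent version wins: a chunk is restored at most once.
            if chunk_id in wanted and chunk_id not in restored:
                restored[chunk_id] = content
        if record.full:
            # The first full log met while going backward completes the image.
            break
    missing = wanted - set(restored)
    if missing:
        raise ValueError(f"chunks not found in the log chain: {sorted(missing)}")
    return restored
```

Chunks released before the restore point are skipped even if older logs still hold them, which is why the sketch filters on the in-use set of the newest record.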
Autonomic Optimization
In ASM, the log/restore overhead models derived in the previous section are not used to simply select, as the best operating log mode, the one for which the corresponding expected overhead is minimal (once the best log-interval value has been identified). Instead, the selection step takes into account fluctuations that can affect the set of parameters appearing within the overhead models (e.g., the expected event execution cost $\delta_e$), which cannot be directly controlled since they depend on runtime dynamics of the simulation model execution. This set includes all the parameters appearing within the models, except the log intervals $\chi_F$ and $\chi_I$ (or $\chi_{I,F}$), which can be controlled at runtime by ASM.
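To make the quantities involved concrete, the following sketch computes per-event overheads and optimal log intervals. The functional form is an assumption based on the classical periodic-checkpointing argument (log cost amortized over the interval, plus rollback probability times restore cost and an average coasting forward of half an interval); it is not a transcription of equations (1) and (2). The averaging of full and partial log sizes over the interleave step, the rounding up of the interval, and all function names are likewise illustrative choices.

```python
import math


def overhead_full(d_e, s_f, d_lb, d_rb, p_r, chi_f):
    """Assumed per-event overhead of the non-incremental mode."""
    log_cost = s_f * d_lb / chi_f  # one full log every chi_f events
    rollback_cost = p_r * (s_f * d_rb + (chi_f - 1) / 2 * d_e)
    return log_cost + rollback_cost


def overhead_incremental(d_e, d_m, s_f, s_p, d_lb, d_rb, p_r, chi_i, chi_if):
    """Assumed per-event overhead of the incremental mode."""
    # Assumption: one full log after every chi_if incremental logs.
    avg_log_size = (chi_if * s_p + s_f) / (chi_if + 1)
    log_cost = avg_log_size * d_lb / chi_i
    # Coasting forward replays events that still pay the tracking cost d_m.
    rollback_cost = p_r * (s_f * d_rb + (chi_i - 1) / 2 * (d_e + d_m))
    return d_m + log_cost + rollback_cost


def optimal_interval(log_cost, p_r, replay_cost):
    """Minimizer of log_cost/chi + p_r*(chi-1)/2*replay_cost over real chi, rounded up."""
    if p_r <= 0:
        raise ValueError("rollback probability must be positive")
    return max(1, math.ceil(math.sqrt(2 * log_cost / (p_r * replay_cost))))
```

Under this assumed form, a larger interval lowers the amortized log cost and raises the expected coasting-forward cost, and the square-root expression balances the two.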
Such an approach, aimed at proactively providing stability of the optimal performance, fits performance optimization well when the set of possible operating modes is differentiated, each of them exhibiting a different overhead sensitivity to parameter fluctuations and/or variations. Literature approaches for log/restore optimization do not cope with such a multiple operating-mode scenario, which is the reason why the sensitivity of the a-priori uniquely selected operating mode to parameter variations was not required to be addressed. Further, as shown in the literature [11], when dynamically optimizing the parameters driving the log/restore subsystem in optimistic PDES environments, the optimization process itself may give rise to secondary effects (particularly, throttling or thrashing effects) which slightly modify the actual dynamics in terms of, e.g., the rollback probability experienced by the LPs. An approach which selects and configures the log/restore operating mode on the basis of stability criteria versus variations of runtime parameters can also cope with such indirect effects.
Overall, the best suited operating mode is selected on the basis of a cost function $CF(\chi_F^{opt}, \chi_I^{opt})$ defined as:
and on the result of the integration of this cost function over a multi-dimensional domain defined by the values of the parameters $\{\delta_e, \delta_m, \delta_{LB}, \delta_{RB}, P_r, S_F, S_P\}$. For each parameter $x$ defining a dimension of the integration domain, we integrate the cost function over the interval $x \pm \alpha x$, where we suggest $\alpha = 0.1$ to capture statistically relevant fluctuations of the parameters that can be envisaged at the time the dynamic selection is carried out. If the integration result is negative, then the selected operating mode is non-incremental (with the log interval set to $\chi_F^{opt}$), otherwise the incremental mode is selected (with the log interval set to $\chi_I^{opt}$). Assuming the independence of the parameters defining the integration domain, the integral function for $CF(\chi_F^{opt}, \chi_I^{opt})$ is a polynomial that, after the substitution of the integral domain variables, has the following expression:
The above optimization procedure requires defining a trigger for the evaluation of the integral function in order to dynamically actuate the selection of the best suited log mode. We assume that the simulation run is partitioned into a startup phase and a normal phase. For the startup phase one of the two possible log modes is selected by default, and is kept until the end of that phase. Then, before starting the normal phase, the integral function is evaluated by using the mean $x$ and the corresponding relevant statistical fluctuation $\alpha x$ for the above parameters defining the integration domain, on the basis of samples observed during the startup phase (we refer the reader to the supplementary material section, available online, for the techniques we employed to sample/estimate these parameters).
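The selection step can be sketched numerically as follows. Several assumptions are made and should not be read as the paper's definitions: the cost function is taken as the non-incremental overhead minus the incremental one (consistent with a negative integral selecting the non-incremental mode), the overheads follow an assumed classical periodic-checkpointing form, the log intervals are fixed at the optima computed on the mean values, and a midpoint rule replaces the closed-form polynomial. All names, and the default interleave step `chi_if=10`, are hypothetical.

```python
import itertools
import math

PARAMS = ("d_e", "d_m", "d_lb", "d_rb", "p_r", "s_f", "s_p")


def oh_full(p, chi_f):
    """Assumed per-event overhead, non-incremental mode."""
    return p["s_f"] * p["d_lb"] / chi_f + p["p_r"] * (
        p["s_f"] * p["d_rb"] + (chi_f - 1) / 2 * p["d_e"])


def oh_incr(p, chi_i, chi_if):
    """Assumed per-event overhead, incremental mode."""
    avg_size = (chi_if * p["s_p"] + p["s_f"]) / (chi_if + 1)
    return p["d_m"] + avg_size * p["d_lb"] / chi_i + p["p_r"] * (
        p["s_f"] * p["d_rb"] + (chi_i - 1) / 2 * (p["d_e"] + p["d_m"]))


def opt_interval(log_cost, p_r, replay_cost):
    """Real-valued minimizer of the assumed overhead, rounded up."""
    return max(1, math.ceil(math.sqrt(2 * log_cost / (p_r * replay_cost))))


def select_mode(mean, alpha=0.1, chi_if=10, points=3):
    """Integrate CF over the box mean*(1 +/- alpha) and pick the log mode."""
    chi_f = opt_interval(mean["s_f"] * mean["d_lb"], mean["p_r"], mean["d_e"])
    avg_size = (chi_if * mean["s_p"] + mean["s_f"]) / (chi_if + 1)
    chi_i = opt_interval(avg_size * mean["d_lb"], mean["p_r"],
                         mean["d_e"] + mean["d_m"])
    # Midpoint nodes along each dimension of the integration domain.
    axes = []
    for name in PARAMS:
        low, width = mean[name] * (1 - alpha), 2 * alpha * mean[name]
        axes.append([low + width * (k + 0.5) / points for k in range(points)])
    cell_volume = math.prod(2 * alpha * mean[name] / points for name in PARAMS)
    integral = 0.0
    for values in itertools.product(*axes):
        p = dict(zip(PARAMS, values))
        integral += (oh_full(p, chi_f) - oh_incr(p, chi_i, chi_if)) * cell_volume
    if integral < 0:
        return "non-incremental", chi_f
    return "incremental", chi_i
```

With three nodes per dimension the loop evaluates the cost function 2,187 times, which is only meant to show the structure of the decision; a closed-form polynomial avoids this cost at runtime.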
Once the best suited log mode is selected at the end of the startup phase, subsequent re-selections can occur during the normal phase. The re-selection trigger is based on the current value of the mean $x$ of any of the parameters defining the integration domain, and on a predicate involving the values $x^*$ and $\alpha x^*$ that were used upon the last autonomic selection of the log mode. If, for any parameter $x$, the expression $|x - x^*| > \alpha x^*$ holds during the run, then the integral function is recalculated on the basis of current mean values. The reason for such a trigger is that the last dynamic selection of the best suited log mode was actuated on the basis of statistical parameter values $x^*$ and $\alpha x^*$ that can no longer be considered representative of actual runtime dynamics and related fluctuations.
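The trigger predicate is simple enough to state directly in code. The sketch below assumes the parameter means are kept in dictionaries keyed by parameter name; the function and key names are illustrative.

```python
from typing import Dict


def needs_reselection(current: Dict[str, float],
                      reference: Dict[str, float],
                      alpha: float = 0.1) -> bool:
    """True if any current mean left the band reference*(1 +/- alpha).

    `reference` holds the means used at the last log-mode selection.
    """
    return any(abs(current[name] - ref) > alpha * abs(ref)
               for name, ref in reference.items())
```

When the predicate fires, the reference values would be replaced by the current means and the selection re-run on them.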
EXPERIMENTAL RESULTS
We have integrated ASM within the ROme OpTimistic Simulator (ROOT-Sim) [4], [8], an ANSI-C/MPI-based open-source optimistic simulation platform based on the Time Warp protocol [6] and tailored for UNIX-like systems. To test the effects of ASM, we have run experiments with different discrete event simulation applications. In this section we focus on the results achieved with the personal communication system (PCS) benchmark. Experimental data related to other applications, as well as details on ROOT-Sim, can be found in the supplementary material section, available online.
PCS models a mobile network adhering to GSM technology. Each LP models the evolution of the state of an individual hexagonal cell, and the whole set of cells provides wireless coverage on a square region of variable size. Each cell handles a parameterizable number N of wireless channels, which are modeled in a high-fidelity fashion via explicit simulation of power regulation and interference/fading phenomena, according to the results in [27]. The event types which can occur at any LP are: Start Call, which simulates the setup of a new call on a target cell; End Call, which simulates a call termination; Handoff Leave, which simulates the departure of an ongoing call from its current cell; Handoff Receive, which simulates the setup of a call handed off from an adjacent cell; Recompute Fading, which simulates the effects of climatic variations on the fading (and consequently interference) phenomena for ongoing calls.
Upon the start of a call, a call-setup record is instantiated via dynamically allocated data structures, and is linked to a list of already active records within that same cell. Each record is released when the corresponding call ends or is handed off towards an adjacent cell. In the latter case, a similar call-setup procedure is executed at the destination cell. Upon call setup, power regulation is performed, which involves scanning the aforementioned list of records to compute the minimum transmission power allowing the current call setup to achieve the threshold-level signal-to-interference ratio (SIR) value. Data structures keeping track of fading coefficients are also updated while scanning the list, according to a meteorological model defining climatic conditions (and related variations). This application is highly parameterizable. Beyond the already mentioned number N of wireless channels per cell, the set of configurable parameters includes: i) $\tau_A$, which expresses the inter-arrival time of subsequent calls to any target cell; ii) $\tau_{duration}$, which expresses the expected call duration; iii) $\tau_{change}$, which expresses the residual residence time of a mobile device in the current cell. These parameters affect the utilization factor of available channels, expressed as $\tau_{duration}/(\tau_A \cdot N)$. This impacts the granularity of the events since the more channels are busy, the more power-management records are allocated and consequently scanned/updated during the processing of different events. At the same time, higher values of the channel utilization factor lead to higher memory requirements for the state image of individual LPs. Both the above dependencies (namely, CPU demand and memory) are in any case bounded by the total number N of channels managed per cell.
To study the effects of ASM when considering differentiated execution and memory access patterns for the application layer, we use two different configurations of the PCS application. In one configuration we simulate 1,024 cells, each one managing up to 1,000 wireless channels, where the expected duration of a call $\tau_{duration}$ has been set to 120 sec, the residual residence time for an active call in the current cell $\tau_{change}$ has been set to 300 sec, while the inter-arrival time $\tau_A$ has been varied during the simulation so as to generate a configuration where the actual load on the cells depends on the period of the day. Specifically, 17 hours of operation of the cellular system have been simulated (from 00:00 AM to 05:00 PM) with variations of $\tau_A$ in the interval [0.64, 3.20], with peak intensity of the workload during the morning until lunch time, and minimum load very early in the morning (around breakfast). Consequently, the utilization factor has been varied in the interval [0.31, 0.06]. For this configuration of the PCS model, climatic conditions have been set as good and steady, thus not causing the need for frequent recalculation of fading coefficients. We will refer to this configuration as "Variable $\tau_A$". On the other hand, the second configuration of PCS has been parameterized by having the expected inter-arrival time $\tau_A$ fixed to the value 0.8 (giving rise to channel utilization values on the order of 25 percent), which focuses the simulation on a morning operation scenario, but where the climatic conditions exhibit variations that lead to periods where frequent recalculation of fading coefficients is needed. We will refer to this configuration as "Frequent fading recalculation". Both the above configurations lead to runtime dynamics that vary, e.g., in terms of event granularity and portion of the LP state that needs to be updated by the events. However, this is achieved in different manners in the two scenarios.
In all the configurations of the PCS benchmark, we have evenly distributed the 1,024 LPs on top of 32 simulation-kernel processes, each of them being mapped onto one CPU core. We have run experiments on a 32-core HP ProLiant server equipped with 64 GB of RAM and running Debian 6 on top of the 2.6.32-5-amd64 Linux kernel.
We report in Figs. 4 and 5 the cumulated committed events achieved by the parallel run versus wall-clock time. These values have been computed as the average over ten runs (done with different pseudo-random seeds), with a minimal variance observed across different runs. This parameter (and the slope of the associated curve) indicates the speed at which a given platform configuration commits simulation events, and hence how fast the configuration supports model execution. We report five plots referring to: (i) the case in which ASM is active; (ii) the case in which ASM is active, but we always force the incremental log/restore mode, with the corresponding optimized value for $\chi_I$; (iii) the case in which ASM is active but the full log/restore mode is forced, with the corresponding optimized value for $\chi_F$; (iv) the case where we force the incremental log mode by having the application layer directly call the memory-map manager to notify which portions of the state have been updated by event processing; and (v) the case in which the application code was modified so as to avoid using dynamic memory, hence leading to the situation where the state buffer for each LP is pre-allocated at startup in the form of an array of entries. The plots for cases (ii) and (iii) express performance levels that could be achieved via an optimized log/restore mode (adaptive in the selection of the log interval) based on either the incremental or the non-incremental log mode, as supported by ASM, but not allowing autonomic switch between the two modes. On the other hand, the plots for case (iv) represent scenarios that benefit from optimal checkpoint interval calculation and incremental state log/restore, but require the intervention of the programmer in relation to some of the tasks enabling incremental logging, thus offering a transparency level which is strictly lower than the one offered by ASM.
Hence, this case allows quantifying the performance penalty associated with full state-management transparency as provided by ASM. Finally, case (v) is representative of scenarios where no facilities other than the bare minimal log and restore operations are supported, and without any infrastructure allowing for dynamic memory handling, thus requiring the state to be contiguous and statically sized to the maximum value admitted by the model parameterization. This is a baseline for the evaluation of the advantages of ASM. From the results, we see that, depending on the specific phase within the simulation run (e.g., early morning versus lunch time for the variable $\tau_A$ configuration), forced-incremental and forced-full modes alternately exhibit better execution speed (which is indicated by the different slope of the cumulated committed events curve while the run is in progress). The most important outcome of the cumulated event rate plots is that ASM always switches to the best performing mode (incremental versus non-incremental) depending on the currently simulated period, and hence depending on the actual dynamics (e.g., in terms of state size, event granularity, memory update pattern, etc.). The overall effect is that ASM allows faster execution, on the order of 10 to 14 percent over the other modes for the case of the variable $\tau_A$ configuration, and on the order of 11 to 27 percent for the case of frequent fading recalculation. Compared to configuration (iv), ASM shows a slowdown varying between 10 and 20 percent. However, configuration (iv) removes all the transparent facilities provided by ASM in terms of identification and notification to the memory-map manager of the portions of the LP state that have been dirtied.
Finally, compared to the baseline configuration (v), ASM has a throughput increase ranging from 35 to 40 percent, which indicates how an enhancement in programmability (via the transparent support for dynamic memory allocation) is strictly coupled with a non-negligible performance increase.
In Fig. 6 we report average per-process memory usage for all the considered configurations. In particular, we show average memory occupancy for the whole simulation process (i.e., simulation-platform layer and application-level model), for state logs, and for log metadata. In all the runs we have set the GVT (and memory recovery) period to 1 sec, which gives rise to negligible coordination overhead (given the tight coupling of the underlying architecture) while allowing prompt release of memory buffers. Also, the memory usage samples refer to the state of the processes as observed right before performing memory recovery. As we can see, memory requirements for metadata are very small (on the order of 1 percent) in any configuration, highlighting the memory efficiency of the data structures keeping track of memory allocation. The overall average memory occupancy shows a greater variance when dealing with the phase-interleaved configuration of the PCS benchmark, due to the fact that some phases execute more coarse-grained events and therefore require fewer logs per time unit. In both the frequent fading recalculation and the variable $\tau_A$ configurations, the forced full snapshot execution mode has a higher memory requirement, which is a predictable result due to the larger amount of information stored in a snapshot. However, such memory consumption remains significantly lower than the one for the baseline case (v), especially for the variable $\tau_A$ configuration, which gives rise to better locality, in turn favoring performance.
At the same time, the configuration relying on the forced incremental snapshot mode and the one where the application layer calls the memory-map manager to explicitly update the dirty portion of the memory map show a very comparable memory usage for logs, indicating similar dynamics in terms of logging frequency. This confirms that the 10 to 20 percent performance loss by ASM versus configuration (iv) is essentially related to the overhead for transparently handling memory updates via instrumentation. Finally, the memory usage for logs by ASM always stands between the forced incremental and the forced full ones, which reflects ASM's behavior of switching from one configuration to the other.
We also report in Table 1 the execution time for running the PCS applications (very same code used for the parallel runs) in serial mode on top of a calendar queue scheduler. By the data, the parallel runs with ASM allow significant speedups, especially for the frequent fading recalculation setting. Overall, the experimental study has been carried out with competitive parallel executions.
CONCLUSIONS
In this paper we have presented ASM, an innovative autonomic state management subsystem targeted at optimistic PDES engines. ASM provides full transparency of state log/restore to the application layer, and runtime autonomic re-selection of the best suited log mode (incremental versus non-incremental) depending on the actual runtime dynamics of the optimistic simulation run, accounting as well for stability of the selected mode versus fluctuations in such dynamics. Mode switching is supported by ASM via an application-transparent dual-coding mechanism, which makes it possible to run the application code that best fits the requirements of the currently active log mode.
