Mixing Hardware and Software Reversibility for Speculative Parallel Discrete Event Simulation by Cingolani, Davide et al.
Mixing Hardware and Software Reversibility for
Speulative Parallel Disrete Event Simulation
Davide Cingolani, Mauro Ianni, Alessandro Pellegrini, and Franeso Quaglia
Sapienza, University of Rome
Abstrat. Speulative parallel disrete event simulation requires a sup-
port for reversing proessed events, also alled state reovery, in ase they
reveal as ausally inonsistent. In this artile we present an approah
where state reovery relies on a mix of hardware- and software-based
tehniques. Partiularly, we exploit the Hardware Transational Memory
(HTM) support, as oered by Intel Haswell CPUs, to proess events by
the appliation ode as in-memory transations, whih are possibly om-
mitted only after their ausal onsisteny is veried. At the same time,
we exploit an innovative software-based reversibility tehnique, fully rely-
ing on transparent software instrumentation targeting x86/ELF objets,
whih enables undoing side eets by events with no atual bakward re-
omputation. Eah thread within our multi-thread speulative proessing
engine dynamially (namely, on a per-event basis) selets whih reovery
mode to rely on (hardware vs software) depending on varying runtime
dynamis. The latter are aptured by a lightweight model indiating to
what extent the HTM support (not paying any instrumentation ost) is
eient, and after what level of events' parallelism it starts degrading
its performane, e.g., due to exessive data onits while manipulating
ausality meta-data within HTM based transations. We released our
implementation as open soure software and provide some experimental
results for an assessment of its eetiveness.
1 Introdution
When dealing with Disrete Event Simulation (DES), its move onto parallel
arhitetures has been historially based on the Parallel Disrete Event Simula-
tion (PDES) paradigm [7℄. In this kind of simulation, as well as in the traditional
DES paradigm, the evolution of the system is desribed in terms of timestamped
disrete events, whih are impulsivethey happen at a spei simulation time
instant, the timestamp of the event, and have no duration. Parallelism is ahieved
in PDES by partitioning the simulation model into several distint entities, alled
simulation objets or logial proesses (LPs). Eah LP is assoiated with a pri-
vate simulation statethe whole simulation state is the union of these private
statesand the exeution of an impulsive simulation event at any LP produes
a state transition on the state of the LP itself. The privateness of the LPs'
simulation states implies that information exhange aross dierent LP is only
supported via the exhange of events, whih an be generated (in any number)
during the exeution of whihever event.
PDES speulative exeution [10℄ allows proessing events with no previous
assurane of their ausal onsisteny. This means that an event destined to some
LP an be dispathed for exeution with no guarantee at all regarding the fat
that no other events with a higher priority, say lower timestamp, will be ever re-
eived by that same LP in the future. Suh events, referred to as straggler events,
are the a-posteriori materialization of a timestamp-order violation, also referred
to as ausal violation. Suh violations require some state reovery (reversibility)
support for undoing the side eets on the LPs' states whih are assoiated with
inonsistent proessing of events.
In the literature, the reversibility support has traditionally been based on
pure software implementations exploiting either checkpointing techniques (see,
e.g., [15, 16]) or reverse computing ones (see, e.g., [2]). A few other
approaches have been based on off-loading the checkpoint task to off-the-shelf
or unconventional hardware [9, 18]. More recently, the Hardware Transactional
Memory (HTM) support offered by modern processors, such as the Intel Haswell,
has been taken into consideration in order to enable the speculative execution
of events as in-memory transactions [19], making them automatically recoverable
with low overhead thanks to the reliance on the hardware transactional cache.
However, to the best of our knowledge, there has been no attempt to exploit
hardware and software based reversibility in a synergic combination for
speculative PDES.
In this artile we present a speulative PDES engine, oriented to multi-ore
mahines, whih is based on suh a kind of hardware/software ombination. Par-
tiularly, we enable eah onurrent worker thread operating within the engine
to dynamially selet the best suited reversibility support among two: (1) one
relying on HTM failities inspired to [19℄ and (2) another relying based on soft-
ware reversibility, partiularly in the form of undo ode bloks [3℄. The dynami
seletion is based on the onsideration that not all the speulatively exeuted
events are valuable in the same manner when run as HTM transations due
to several reasons. A rst one deals with the fat that the nal ommit of the
transation needs to hek/update ausality meta-data, hene the higher the de-
gree of onurreny while aessing these meta-data, the higher the likelihood of
yielding to data onits that lead to the abort of the HTM transations. Also,
ausality meta-data are updated aording to the progress of the ommit horizon
of the PDES run, as determined along time by the ommit of the event with the
lowest timestamp. Hene speulatively proessed events with HTM support that
are further ahead of the ommit horizon will need to nd ausality meta-data
reeting more updates upon trying to ommit, whih again leads to an abort if
these updates were not yet issued by the ommitment of events with higher pri-
ority, say lower timestamps. Finally, the HTM support is limited to transations
whose read/write set ts (with no apaity onit by other ores of the same
CPU) the transational hardware ahe. Hene for models with events that (or
exeution phases where the events) have large data sets the likelihood of su-
essfully ommitting the orresponding HTM transations may be (signiantly)
redued.
We overome these drawbaks in our speulative PDES engine by dynami-
ally enabling any worker thread to proess an event not as an HTM transation
(just to redue the likelihood of running non-valuable transations), but rather
via a modied version of the original event-handler ode. This version is trans-
parently instrumented in order to be able to generate (at runtime) the minimal
set of mahine instrutions (the so alled undo ode blok) that allows revers-
ing any memory side eet. In the instrumentation proess we target x86/ELF
objets. The possibility to ommit events run with software reversibility is no
longer bound to the possibility to ommit an HTM transation. This leads to
the senario where the engine is able to improve fruitful usage of omputing
resoures just beause of the possibility to exploit the HTM support in the most
valuable manner, while jointly relying on a bit more ostly software reversibility
when valuable hardware based reversibility would be impaired.
Clearly, the oexistene of HTM and software based reversibility (with on-
urrent threads relying on one or the other at a given time instant) needs solu-
tions in order to avoid that the two tehniques do not interfere with eah other.
Speially, valuable HTM work should not be interfered by software reversibil-
ity based one. For the ase of onurrent speulatively proessed events bound to
the same LP (hene operating within the same loal state) this is ahieved by in-
troduing a prioritization mehanism that leads an HTM proessed event to gain
higher priority with respet to the events proessed with the software reversibil-
ity support. So the latter will never onurrently aess (any portion of) the
overall data setsay LP state as a wholepossibly targeted by the HTM trans-
ation, hene not leading to its abort. On the other hand, we still enable inter-LP
onurreny, thus enabling the so alled weak-ausality model [17℄, by not pre-
venting multiple HTM transations to suessfully operate on disjoint data sets
within the LP state. Also, given that in our software reversibility sheme we
avoid the usage of hekpointing (in fat the undo ode blok is not a log of
data, rather of mahine instrutions), we avoid at all the typially large usage
of memory by hekpointing (only partially resolved by inremental hekpoint-
ing shemes) hene further reduing the (potential) problems related to limited
ahe apaity issues of the HTM support and oniting ahe aesses by the
threads.
Our engine has been released as open source software¹, and we also provide
some experimental data for an assessment of its effectiveness when running the
classical Phold PDES benchmark [8] on an Intel Haswell processor, with HTM
support, equipped with 4 physical cores.
The remainder of this artile is strutured as follows. In Setion 2 we dis-
uss related work. In Setion 3 we present the methodology standing behind
hardware- and software-reversibility based exeution of PDES models, and we
desribe the design priniples haraterizing our mixed simulation engine arhi-
teture. Setion 4 presents an experimental assessment of our proposal.
¹ https://github.com/HPDCS/htmPDES/tree/reverse
2 Related Work
The state restore operation is of fundamental importance in speculative PDES,
and has therefore been extensively studied in the literature. Two main
incarnations of state restore schemes have been proposed, one based on state
checkpoint and reload, and one based on reverse computing. The former flavour
is based on the possibility for the simulation engine to know which memory
buffers keep each LP's simulation state; these are copied into a separate
buffer (called the simulation snapshot) at a given point of the execution. In
this way, undoing a chain of wrongly-computed events (namely, state updates)
boils down to selecting a simulation snapshot which is still consistent (i.e.,
it was taken at a simulation time smaller than the straggler's one). This
snapshot is then copied onto the LP live state image, thus undoing the effects
of causally inconsistent events. This approach is both memory- and
computationally-intensive, and might lead to poor simulation performance, since
if no causal inconsistency is detected at all, resources are spent for taking
unnecessary snapshots. To this end, several proposals have addressed the
possibility to take state snapshots less frequently (see, e.g., [16]) or in an
incremental way (see, e.g., [21]) or combining the two schemes (see, e.g.,
[15, 20]). Other solutions rely on hardware support to offload from the CPU the
memory copy for taking the checkpoint. Specifically, the work in [18] proposed
to exploit programmable DMA engines to perform the copy, while [9] presents the
design of a so called rollback-chip, a hardware facility that automatically
saves old versions of state variables upon their updates. Both these approaches
reduce the CPU time for checkpointing tasks but do not directly cope with
memory usage.
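As a minimal sketch of the checkpoint-and-reload scheme described above (not the implementation of any of the cited works), the following Python fragment takes a full snapshot of a dictionary-based LP state after every event and, upon a straggler, reloads the latest snapshot taken at a simulation time smaller than the straggler's timestamp. The class name CheckpointedLP, the dictionary-based state and the handler are illustrative assumptions.

```python
import copy


class CheckpointedLP:
    """Hypothetical LP that snapshots its whole state after each event."""

    def __init__(self, state):
        self.state = state
        # (simulation time, state copy); the first entry is the initial state.
        self.snapshots = [(float("-inf"), copy.deepcopy(state))]

    def process(self, timestamp, handler):
        handler(self.state)
        self.snapshots.append((timestamp, copy.deepcopy(self.state)))

    def rollback(self, straggler_ts):
        # Discard snapshots that are no longer consistent, i.e. those taken
        # at a simulation time not smaller than the straggler's timestamp.
        while self.snapshots[-1][0] >= straggler_ts:
            self.snapshots.pop()
        # Reload the latest consistent snapshot onto the live state.
        self.state = copy.deepcopy(self.snapshots[-1][1])


def increment(state):
    state["counter"] += 1


lp = CheckpointedLP({"counter": 0})
for ts in (1.0, 2.0, 3.0):
    lp.process(ts, increment)
lp.rollback(2.5)  # a straggler with timestamp 2.5 arrives
```

Every processed event pays for a full copy of the state even if no rollback ever occurs, which is the memory and CPU overhead that sparse and incremental checkpointing try to reduce.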
Reverse computing is instead based on the notion of reverse events. A reverse
event ē associated with a forward event e is an event such that if the
execution of e produces the state transition e(S) → S′, the execution of ē on
S′ produces the inverse transition ē(S′) → S. Such reverse events can be
implemented manually [2] or via compiler-assisted approaches [12]. Although
reverse computation is much less memory-greedy than checkpointing, the main
issue with this approach lies in the rollback length, namely the number of
events which must be undone upon a state restore operation. In particular, the
total cost of a rollback operation is directly proportional to the number of
undone events and their granularity, as reverse events re-process (although in
a reversed fashion) all the steps of a forward event, even if some of them are
not directly related to state updates.
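The relation between a forward event e and its reverse event ē can be illustrated with a small Python sketch. The handler below is hypothetical (it is not taken from [2] or [12]) and deliberately uses only invertible updates, so that no state needs to be saved.

```python
def forward_event(state):
    """Forward event e: S -> S'. Uses only invertible updates."""
    state["served"] += 1
    state["queue_len"] -= 1
    state["clock"] += 5


def reverse_event(state):
    """Reverse event: S' -> S. Inverts each forward step in reverse order."""
    state["clock"] -= 5
    state["queue_len"] += 1
    state["served"] -= 1


initial = {"served": 0, "queue_len": 3, "clock": 10}
state = dict(initial)
forward_event(state)            # S  -> S'
after_forward = dict(state)
reverse_event(state)            # S' -> S
```

The reverse handler executes as many steps as the forward one, which is why the rollback cost grows with the granularity of the undone events.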
The more reent proposal in [3℄ has takled the state restore operation via
software reversibility through the adoption of undo ode bloks. The goal of this
approah is to redue the time-omplexity of the rollbak operation, making the
reversibility of events independent of the forward exeution's granularity. This is
done by relying on stati binary instrumentation, targeting x86-64/ELF objets,
where the simulation model's ode is sanned searhing for all mahine-level
instrutions whih entail a memory update. These instrutions are transparently
augmented with an ad-ho routine whih omputes the target address of the
memory write just before it takes plae, so that the original value is diretly
paked into an on-the-y assembled mahine instrution whose exeution restores
it. All these runtime generated assembly instrutions are stored into an undo
ode blok whih, when exeuted, undoes all the eets of the exeution of
a forward event on the simulation state. This solution nds a good balane
between inremental hekpointingno atual meta-data are required to restore
a previous stateand reverse omputingthe exeution ost of an event is no
longer dependent on the omplexity of forward events. Nevertheless, if an event
is unlikely to be undone due to a rollbak operation, the ost of traing memory
updates and generating undo ode blok is paid unneessarily.
Another reent proposal [19℄ exploits HTM failities oered by modern Intel
Haswell CPUs to allow running simulation events within transations. An ad-ho
routine determines whether the exeution of an event is safe or not, by heking
ompat shared meta-data keeping trak of the simulation time assoiated with
the events that are being run by the onurrent threads. The event assoiated
with the smallest timestamp is onsidered safe, and it is therefore the only event
whih is exeuted outside of a transation. By using this sheme, all the events
whih are transationally exeuted are automatially aborted if a onit on the
same data strutures is deteted. At the end of a transation, the safety of the
just-exeuted event is evaluated again, and in ase the event has beome safe, it
is then ommitted. In the negative ase, the transation is immediately aborted
and (possibly) restarted, beause the aess to the shared meta-data makes it
doomed if the event is not safe yetin fat, another thread will eventually update
the ontent of the meta-data, to indiate that the exeution of a safe event has
been ompleted. A dynami throttling strategy is used to inrease the likelihood
of ommitting a transation, by delaying the time instant at whih the shared
meta-data are aessed.
Our work diers from previously published work sine none of the afore-
mentioned proposals makes use of a ombination of hardware and software re-
versibility for state restore operations. Partiularly, we use the results in [19℄
and [3℄ as baselines for building a mixed hardware/software reoverability sup-
port that takes the advantages of the two dierent tehniques As pointed out
in the introdution, we dynamially resort to undo ode bloks (thus paying the
ost of running an instrumented ode version) only in ase valuable speulative
work annot be arried out (by a thread at some point in time) via the reliane
on HTM. Thus we pay the overhead of software reversibility only when HTM
based reversibility does not pay o (or is inviable due to, e.g., transational ahe
apaity limitations).
3 The Hardware/Software Reversibility Based Engine
3.1 Basics
We target a baseline speculative PDES engine structure that is independent of
the actual reversibility support, whose schematization is provided in Figure 1.
In compliance with traditional PDES, the engine supports the partitioning of
the simulation model into n distinct LPs, each one associated with a unique ID
in the range [0, n − 1]. Each LP is associated with a private simulation state
(although possibly scattered on dynamic memory) and with one or more event
handlers representing the code blocks in charge of processing the simulation
events and generating state updates, as well as of (possibly) producing new
events to be injected in the system. The delivery of a simulation event to the
correct handler is delegated to the underlying simulation kernel, which is also
in charge of guaranteeing consistency of a shared event pool that keeps all the
already scheduled events, as well as causal consistency of the updates
occurring on the LPs' states. Concerning the event pool, we rely on a shared
lock-protected global queue, particularly a calendar queue [1]. Multiple
concurrent worker threads can extract events from the event pool and can
concurrently dispatch the execution of the corresponding LPs by activating some
event handler as a callback function.

[Figure 1: LP0, LP1, ..., LPn-2, LPn-1, each with its simulation state and
event handlers, together with a shared priority queue.]
Fig. 1. Basic engine organization.
3.2 Simulation Horizons and Valuability of Speculative Work
In speculative PDES, we can always identify a point on the simulation time axis
which is the commit horizon, commonly referred to as Global Virtual Time
(GVT). This is the simulation time instant that distinguishes between events
which might be undone (e.g., due to some causality violation) and events which
will never be undone. This time instant can be logically identified by
considering that any simulation event e executed at simulation time T can only
generate some new event e′ associated with timestamp T′ ≥ T. In fact, violating
this assumption would imply that an event in the future might affect the past,
which is clearly a non-meaningful condition for any real-world
process/phenomenon. Therefore, to identify the commit horizon, it is sufficient
to identify, across all the events which are currently scheduled at (or have
just been processed by) any LP in the system, the one associated with the
minimum timestamp. Such timestamp corresponds to the commit horizon. In fact,
no event still to be executed in the system might produce a causal
inconsistency involving the LP in charge of the execution of the commit horizon
event².
² Simultaneous events do not violate this assumption. Nevertheless, if not
properly handled by some tie-breaking function [11, 13], they could induce
livelocks in the speculative execution.

[Figure 2: simulation time (ST) axis starting from the commit horizon; labels:
high likelihood, low likelihood, abort probability, delay required.]
Fig. 2. Three logical regions on the simulation time axis, with varying density
of pending events (those still to be processed, which will possibly generate
new ones) along the simulation time (ST) axis.

With our target engine organization, the commit horizon is associated with
the oldest event that is currently being executed (or has just been executed)
at any worker thread. Therefore, keeping track of the commit horizon boils down
to registering, for each worker thread, the timestamp of the event e currently
being executed, by replacing the value only after a new event is fetched for
processing from the event pool, so that any new event possibly produced by e
has its timestamp already reflected into the event pool. The commit horizon can
be computed as the minimum among the registered values.
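The per-thread bookkeeping just described can be sketched as follows. This is an illustrative Python model under stated assumptions, not the engine's actual data structure: the class and method names are hypothetical, a real implementation would need atomic accesses to the shared array, and ties between simultaneous events are ignored.

```python
import math


class CommitHorizonTracker:
    """Hypothetical registry of the timestamp each worker thread is running."""

    def __init__(self, n_threads):
        # +inf marks a thread that has not fetched any event yet.
        self.current_ts = [math.inf] * n_threads

    def on_fetch(self, thread_id, timestamp):
        # Called only after the new event has been fetched, so that events
        # produced by the previous one are already in the event pool.
        self.current_ts[thread_id] = timestamp

    def commit_horizon(self):
        # The horizon is the minimum among the registered values.
        return min(self.current_ts)

    def is_safe(self, thread_id):
        # The event sitting at the commit horizon is causally consistent.
        return self.current_ts[thread_id] == self.commit_horizon()


tracker = CommitHorizonTracker(3)
tracker.on_fetch(0, 12.5)
tracker.on_fetch(1, 7.0)
tracker.on_fetch(2, 9.25)
```

In this example thread 1 holds the commit horizon event, while the events run by threads 0 and 2 remain speculative.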
At any time, the commit horizon event can be considered as a safe (namely,
causally consistent) one, and therefore does not require any reversibility
mechanism for its execution. Let us now discuss the likelihood of safety of
other events to be processed, which stand ahead of the commit horizon.
Empirical evidence plus statistical considerations based on classical
distributions for the timestamp increment driving the generation of events in
common simulation models (see, e.g., [5, 6]) have shown that event patterns
are, at any time, characterized by greater density of events, say locality of
activities, in the near future of the actual GVT. This situation is depicted in
Figure 2. Also, such locality tends to move along the time axis just based on
the advancement of the commit horizon. The implication is that the risk of
materialization of causal inconsistencies when speculatively processing one
event that is ahead of the commit horizon is somehow linked to its distance
from such horizon. This is also linked to the notion of lookahead of DES
models, a quantity expressing the minimal timestamp increment we can experience
for a given model when processing any event that originates new events to be
injected in the system. Larger lookahead leads to producing new events in the
far future, hence those getting closer to (although not coinciding with) the
current commit horizon become automatically safe.
By this consideration, the speculative processing of events that are closer
to the commit horizon looks more valuable in terms of avoidance of causality
inconsistencies, hence our approach is to enable the processing of these events
as HTM-based transactions, say via the more efficient (lower overhead)
recoverability support. We also note that running events that are close to the
commit horizon as HTM-based transactions will also lead to faster advancement
of such horizon, as compared to what we would expect if running them via
software-based reversibility, since the latter would lead to longer processing
times due to the overhead for producing the undo code blocks. However, an
HTM-based transaction can commit only after events standing in the past have
already been committed and the corresponding worker threads have already
updated their entries in the meta-data array keeping their current timestamp.
So, in order to increase the likelihood of committing the HTM-based
transactional execution of some event, this transaction typically needs to
include a busy-loop delay enabling a wait phase just before checking whether
the meta-data were updated³. Checking the meta-data at some wrong point in time
will in its turn lead to the impossibility to recheck these data fruitfully in
the future, since the updates occurring between the two checks will lead to a
data conflict and to the abort of the checking transaction. In Figure 2 we show
how such a delay should be selected somehow proportionally to the distance (in
terms of event count) of the event processed via HTM support from the commit
horizon. Overall, for events that are further ahead from the commit horizon,
the delay might not pay off, hence a more profitable approach to speculatively
processing them is to run them outside an HTM-based transaction, still with
reversibility guaranteed via software.
The problem of determining the threshold distance from the commit horizon
beyond which HTM support does not pay off is clearly also related to the
interference between concurrent HTM-based transactions when using the
underlying hardware resources. In fact, if we experience a scenario where two
concurrent transactions both require large transactional cache storage for
executing the corresponding dispatched events, and the cache is shared across
the cores, then even if an event would ideally turn out to be causally
consistent upon attempting to finalize the transactions, it would anyhow be
doomed to abort due to cache capacity conflicts. A similar cache-capacity abort
may even be experienced in case of a single HTM-based transaction instance,
just depending on the transaction data set, which might exceed the cache
capacity.
To cope with the runtime adaptive selection of the threshold value, we rely
on a hill climbing scheme based on the following parameters, easily measurable
at runtime across successive wall-clock-time windows:
- T_HTM, the total processing time spent across all the worker threads while
  processing events (either committed or aborted) via HTM support;
- COMMIT_HTM, the total number of committed events whose speculative
  execution has been based on HTM support;
- T_soft, the total processing time spent across all the worker threads while
  processing events (either committed or aborted) that are made recoverable
  via software-based support (here we include the time spent for
  instrumentation code used to generate undo code blocks, plus the time for
  running the undo code blocks in case the events are eventually undone);
- COMMIT_soft, the total number of committed events whose execution has
  been based on the software support for recoverability.
By the above quantities, we compute the so called work-value ratio (WVR)
for both HTM-based and software-based recoverability as:

$$ \mathrm{WVR}_{HTM} = \frac{T_{HTM}}{\mathrm{COMMIT}_{HTM}} \qquad \mathrm{WVR}_{soft} = \frac{T_{soft}}{\mathrm{COMMIT}_{soft}} \qquad (1) $$

which express the average amount of CPU time required for performing useful
work (namely, for processing an event that is not undone) with the two
different recoverability supports.

³ Other kinds of delays, such as operating system sleeps, are unfeasible since
any user/kernel transition will lead an HTM-based transaction to abort
deterministically on current HTM-equipped processors.

Then, the threshold value THR determining the commit horizon distance
(evaluated as event count) beyond which we consider it more convenient to
process the event via software reversibility, rather than HTM-based one, is
increased or decreased depending on whether the relation WVR_HTM ≤ WVR_soft is
verified (as computed on the basis of statistics, on the baseline parameters
listed above, collected in the last observation window). In order to avoid
stalling in local minima (e.g., due to the lack of runtime samples for any of
the above listed parameters), we intentionally perturb THR by ±1 within the
hill climbing scheme if its value reaches either zero or the number of threads
currently running in the PDES platform.
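One step of the threshold adaptation can be sketched in Python as follows. This is only one possible reading of the scheme: the function names are hypothetical, the step size of 1 and the way THR bounces off its two boundaries are assumptions, and at least two worker threads are assumed.

```python
def work_value_ratio(total_time, commits):
    """Average CPU time per committed event; +inf if nothing was committed."""
    return total_time / commits if commits > 0 else float("inf")


def update_threshold(thr, t_htm, commit_htm, t_soft, commit_soft, n_threads):
    """One hill-climbing step on THR from the last observation window."""
    wvr_htm = work_value_ratio(t_htm, commit_htm)
    wvr_soft = work_value_ratio(t_soft, commit_soft)
    # HTM is at least as efficient: let more events run as transactions.
    thr += 1 if wvr_htm <= wvr_soft else -1
    # Assumed perturbation rule: step back from either boundary so that both
    # recoverability modes keep producing samples in the next window.
    if thr <= 0:
        thr = 1
    elif thr >= n_threads:
        thr = n_threads - 1
    return thr


# HTM costs 2 time units per commit, software costs 6: THR grows from 2 to 3.
grow = update_threshold(2, 10.0, 5, 30.0, 5, n_threads=4)
# HTM costs 10 time units per commit, software costs 2: THR shrinks from 2 to 1.
shrink = update_threshold(2, 50.0, 5, 10.0, 5, n_threads=4)
```

Keeping THR strictly between zero and the number of threads means that neither mode is ever starved of samples, so both ratios in Equation (1) stay measurable.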
3.3 Engine Arhiteture
As mentioned, our engine allows the o-existene of hardware-based and software-
based reversibility failities. While introduing hardware-based reversibility fail-
ities is somehow easyit an be done using the primitives TRANSACTION_START,
TRANSACTION_END, and TRANSACTION_ABORT to drive event proessingsoftware-
based reversibility requires a bit more are, espeially when targeting full trans-
pareny to the appliation-level developer. To ope with this issue, we rely on
stati binary instrumentation. In partiular, we exploit the Hijaker [14℄ open-
soure ustomizable stati binary instrumentation tool. Using this tool, we are
able (before the nal linking stage of the appliation-level simulation model) to
identify any memory writing instrution (either a simple mov or a more omplex
ones, like move or movs instrutions) and to plae just before eah memory-
update instrution a all to a reverse_generatormodule whih reads the ur-
rent value of the target memory loation so as to diretly generate the reverse
instrution able to undo the orresponding side eet aording to the proposal
in [3℄. The sequene of reversing instrutions for a same event forms the undo
ode blok of the event. Clearly, the instrumented and the non-instrumented ver-
sions of the appliation modules also need to oexist (sine the non-instrumented
version is the one to be run in ase of HTM-based reversibility). Suh oexistene
has been ahieved by using a multi-oding sheme when rewriting the ELF of
the program at instrumentation time, and by identifying the entry points to the
two versions of ode (instrumented and not) within the same exeutable using
funtion pointers exposed to the PDES engine.
Algorithm 1 Shared Lock Acquisition/Release
    int lock_vector[n]
    double timestamp[n]        ⊲ To avoid priority inversion
    int thread_id[n]           ⊲ To avoid priority inversion

    procedure Lock_LP(e, LP, mode, locking)
        if mode = EXCLUSIVE then
            acquired ← false
            while ¬acquired ∧ locking do
                while lock_vector[LP] > 0 do
                    nop
                old_lock ← lock_vector[LP]
                if CAS(−1, old_lock, lock_vector[LP]) then
                    acquired ← true
        else
            acquired ← false
            while ¬acquired ∧ locking do
                while lock_vector[LP] < 0 do
                    nop
                old_lock ← lock_vector[LP]
                if CAS(old_lock + 1, old_lock, lock_vector[LP]) then
                    acquired ← true
        if ¬acquired then
            atomically {
                timestamp[LP] ← T(e)
                thread_id[LP] ← thread_id
            }
        return acquired

    procedure Unlock_LP(LP, mode)
        if mode = EXCLUSIVE then
            lock_vector[LP] ← 0
        else
            do
                old_lock ← lock_vector[LP]
            while ¬CAS(old_lock − 1, old_lock, lock_vector[LP])

In our implementation the reversing instructions associated with an event
(those forming the undo code block of the event) are organized into a reverse
window, which is used as a stack of negative instructions that can be invoked
via a call. Correct execution of an undo code block is ensured by the presence
of a ret instruction at the end of the reverse window. Also, if the forward
execution of an event updates the same memory location multiple times, only the
first instruction updating that location should be associated with the
generation of an inverse instruction, since the following updates would be
anyhow undone by the first inverse instruction. We therefore employ a fast
hashmap to keep track of destination addresses within a forward event. Whenever
reverse_generator is activated, this hashmap is queried to determine whether
the destination address was already involved in a negative instruction
generation.
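The interplay between the reverse window and the hashmap can be modelled with a short Python sketch. It is only an analogy under explicit assumptions: the real undo code block contains generated x86 instructions, whereas here each negative instruction is an (address, old value) pair, the LP state is a dictionary whose keys play the role of memory addresses, and the class UndoBlock is a hypothetical name.

```python
class UndoBlock:
    """Hypothetical stand-in for an undo code block.

    Each negative instruction is modelled as an (address, old value) pair
    instead of a generated machine instruction.
    """

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, models the LP state
        self.window = []       # reverse window, used as a stack
        self.seen = set()      # destination addresses already tracked

    def reverse_generator(self, address):
        # Invoked just before each write: only the first update of a
        # location needs a negative instruction.
        if address in self.seen:
            return
        self.seen.add(address)
        self.window.append((address, self.memory[address]))

    def write(self, address, value):
        # Models an instrumented memory-write instruction.
        self.reverse_generator(address)
        self.memory[address] = value

    def undo(self):
        # Execute the negative instructions from the top of the stack.
        while self.window:
            address, old_value = self.window.pop()
            self.memory[address] = old_value
        self.seen.clear()


memory = {0x10: 1, 0x18: 2}
block = UndoBlock(memory)
block.write(0x10, 5)
block.write(0x10, 9)   # same location: no second negative instruction
block.write(0x18, 7)
generated = len(block.window)
block.undo()
```

Three writes produce only two negative instructions, so the cost of undoing the event depends on the number of distinct locations touched rather than on the number of forward steps.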
As mentioned before, to ensure onsisteny and minimize the eets of data
ontention on HTM-based exeution of events, we must ensure that at no time
two dierent worker threads an exeute both software-reversible and hardware-
reversible events at one, whih target the same LP state. In fat, if this would
happen, we might inur the risk of having less valuable work to invalidate more
valuable one (sine the HTM-based transation would be aborted if its data
set would overlap the write set of the event exeuted via software-based re-
versibility). Also, we annot allow two (or more) events run via software-based
reversibility to simultaneously target the same LP state. In fat, these events
would not be regulated by any transational exeution sheme
4
. To this end,
we rely on a synhronization mehanism similar in spirit to an atomi shared
read/write lok [4℄. Whenever a worker thread extrats an event from the shared
4
The undo ode bloks guarantee reversibility of memory updates limited to events
exeuting the updates on the LP state in isolation, whih omplies with lassial
PDES where eah LP is an intrinsially sequential entity.
event pool, it rst determines whether it should exeute it using hardware-based
or software-based reversibility aording to the poliy introdued in Setion 3.2.
If the seleted exeution mode is HTM-based, the worker thread tries to aquire
the lok on the target LP in a non-exlusive way, whih nevertheless fails (i.e.,
requires spinning) in ase any other worker thread already took it in an exlu-
sive way. On the other hand, if the seleted exeution mode is based on software
reversibility, the worker thread tries to aquire the lok in an exlusive way, yet
this operation requires spinning if at least one worker thread has non-exlusively
taken the lok. Nevertheless, this approah might lead to some priority inversion,
among the threads whih are running more valuable events via HTM support and
threads whih are running less valuable events via software-based reversibility.
To avoid this, we use a locking ag to instrut the algorithm to avoid spin-
ning if it was not possible, for any reason, to aquire the loknamely, setting
locking to false transforms the lok into a trylok. Therefore, if the lok is not
taken, two additional values in two arrays are updated atomially: timestamp
and thread_id, whih are exploited on a per-LP basis. In partiular, the worker
thread registers the timestamp it has an event to proess at, and its thread id.
The latter value is only used to reate a total order among threads in ase simul-
taneous events are present, to avoid possible deadlok onditions. These values
are periodially inspeted by other worker threads (upon a safety hek for the
urrent proessed event, whih fails), so as to determine whether some higher
priority event is waiting. In that ase, if the work arried out is not likely to
be ommitted shortly, thanks to the reversibility supports it gets squashed, so
that higher priority is given immediately to events at a smaller timestamp. Algo-
rithm 1 shows the lok management pseudo-ode, whih relies on the Compare
and Swap (CAS) read-modify-write primitive to inrease/derease the value of a
shared per-LP ounter. Value -1 for the ounter means that the lok is exlusively
taken, while value 0 indiates that no thread is running an event bound to the
LP. A positive value is a sort of referene ounter whih tells how many worker
threads are onurrently exeuting events via hardware-based reversibility.
We an now disuss the organization of the main loop of threads within our
speulative PDES engine, whose pseudo-ode is shown in Algorithm 2. Essen-
tially, it is made up by three dierent exeution paths, eah one assoiated with
one of the dierent exeution modes. Initially, a all to a Feth() proedure al-
lows to extrat from the shared event pool the event with the smallest timestamp.
Then, a statistial approximation of the number of events whih are expeted
to fall before the urrently fethed event (sine others my still be proessed or
might be produed as a result of the proessing) is omputed as:
T (e)− commit_horizon
average_timestamp_increment
(2)
where average_timestamp_increment is omputed as
commit_horizon
total_committed_events
(
5
). This value, together with the threshold THR (see Setion 3.2), is used to
5
For non stationary models, where the distribution of the timestamp inrement be-
tween suessive events an hange over time in non-negligible way, this same statis-
Algorithm 2 Main loop
1: proedure MainLoop
2: new_events = ∅ ⊲ Set of events generated during the exeution of an event
3: while ¬endSimulation do
4: e ← Feth( )
5: if e = NULL then
6: goto 3
7: events_before←
T (e) − commit_horizon
average_timestamp_increment
8: if Safe( ) then ⊲ Safe exeution: on the ommit horizon
9: Lok_LP((e, LP (e), NON_EXCLUSIVE, true))
10: new_events ← ProessEvent(e)
11: Unlok_LP(LP (e), NON_EXCLUSIVE)
12: else if events_before ≤ THR then ⊲ HTM-based exeution: high likelihood region
13: if ¬ Lok_LP((e, LP (e), NON_EXCLUSIVE, false)) then
14: goto 7
15: BeginTransation( )
16: new_events ← ProessEvent(e)
17: Throttle(events_before)
18: if Safe( ) then
19: CommitTransation( )
20: Unlok_LP(LP (e), NON_EXCLUSIVE)
21: else
22: AbortTransation( )
23: Unlok_LP(LP (e), NON_EXCLUSIVE)
24: goto 7
25: else ⊲ Software-reversible exeution: low likelihood region
26: if ¬ Lok_LP((e, LP (e), EXCLUSIVE, false)) then
27: goto 7
28: SetupUndoCodeBlok( )
29: new_events ← ProessEvent_Reversible(e)
30: while ¬ Safe( ) do
31: if timestamp[LP℄ < T (e) ∨ ( timestamp[LP℄ = T (e)∧ thread_id[LP℄ < tid) then
32: Unlok_LP(LP (e), EXCLUSIVE)
33: UndoEvent(e)
34: new_events = ∅
35: goto 7
36: Flush(e, new_events)
37: atomially {
38: if thread_id[LP℄ = tid
39: timestamp[LP℄ ← T (e)
40: thread_id[LP℄ ← tid
41: }
determine whether a ertain event might be more valuable or not, thus requiring
either HTM-support or software-based reversibility (line 12). Additionally, it an
event is exeuted exploiting HTM, this value drives as well the seletion of a
delay before heking again the safety of the orresponding transation (namely,
whether the timestamp of the event has in the meanwile beome the ommit
horizon), so as to avoid making it doomed with a high likelihood (line 17).
In ase of a safe exeution, i.e. the exeution of the event on the ommit hori-
zon (lines 811), we take a non-exlusive lok, whih is used to inform any other
thread that the destination LP is urrently proessing an event. This avoids that
any other worker thread starts proessing an event via software-based reversibil-
ti ould be simply rejuvenated periodially, by disarding non-reent events om-
mitments and subtrating from commit_horizon the upper limit of the disarded
simulation time portion.
ity at the same LP while we are proessing in safe mode. Moreover, we ongure
the lok to spin beause the worker thread in harge of exeuting this event has
the highest priority and any other ompeting thread will try to give it permission
to ontinue exeution as fast as possible.
For a transational exeution (lines 12-24), we use the trylok version of the
per-LP lok. If we fail to aquire the lok, the exeution resumes from line 7,
meaning that we hek again whether the extrated event has beome safe or
not, in the meanwhile. Otherwise, as already explained before, we start exeuting
the event within an HTM-based transation, introduing an artiial delayvia
the Throttle(events_before) allwhih is proportional to the estimated
number of events in between the ommit horizon and the urrently exeuted
event. If the transation beomes doomed (lines 2124) the exeution restarts
from line 7, so as to hek whether the just-aborted event has beome safe.
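The estimate of Equation (2) and the way it steers both the choice of the
execution mode and the throttling delay can be sketched as follows. This is
only an illustration under stated assumptions: the function names, the
per_event_cost parameter and the threshold value used in the usage note are
hypothetical, and the actual Throttle() implementation is not given here.

```python
def average_timestamp_increment(commit_horizon, total_committed_events):
    # Average simulation-time distance between committed events.
    if commit_horizon <= 0 or total_committed_events <= 0:
        # Assumption: the estimate is only meaningful once something has committed.
        raise ValueError("no committed events to estimate from")
    return commit_horizon / total_committed_events


def events_before(event_ts, commit_horizon, total_committed_events):
    # Equation (2): expected number of events between the commit horizon and e.
    increment = average_timestamp_increment(commit_horizon, total_committed_events)
    return (event_ts - commit_horizon) / increment


def select_mode(is_safe, n_before, thr):
    # Mirrors the three branches of the main loop (lines 8, 12 and 25).
    if is_safe:
        return "safe"
    return "htm" if n_before <= thr else "software"


def throttle_delay(n_before, per_event_cost):
    # Assumption: a delay linear in the estimate; per_event_cost is a hypothetical
    # tuning constant (seconds per expected event).
    return n_before * per_event_cost
```

For instance, with a commit horizon of 100.0 and 200 committed events the
average increment is 0.5, so an event at timestamp 103.0 has an estimate of
6.0 events before it; with a hypothetical THR of 8 it would run via HTM,
whereas an event at timestamp 110.0 (estimate 20.0) would run via software
reversibility.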
The ase of exeution via software reversibility (lines 2534) is a bit dierent.
In fat, rst we have to take an exlusive lokin a trylok fashion, for the same
onsideration related to the HTM exeutionand we have to setup the undo
ode blok, by alloating a reverse window buer. At the end of the exeution
of the event, similarly to the HTM-based ase, we have to wait for the event
to beome safe. Nevertheless, sine this exeution entails taking an exlusive
lok, we ontinuously hek whether some other thread is registered at the same
LP with a higher priority (line 31). This situation might arise due to another
event, exeuted at any other worker thread, generating a new event to the same
LP with a timestamp smaller than the one of the event urrently proessed
via software-based reversibility. Failing to make this spei hek ould either
hamper liveness (a thread waits its event to be the ommit horizon, whih annot
happen) or orretness (events are ommitted out of order). Line 31, paired with
lines 2125 of Algorithm 1, is able to ensure both orretness and liveness.
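The priority test of line 31 amounts to a lexicographic comparison on the
pair (timestamp, thread id). A minimal sketch follows; the function name
must_yield and the use of infinity to denote "no thread registered" are
assumptions made for the example.

```python
import math

NOT_REGISTERED = (math.inf, math.inf)  # assumption: sentinel for an empty registration


def must_yield(registered_ts, registered_tid, my_ts, my_tid):
    """True when the thread registered at the LP has priority over the current one.

    Tuple comparison is lexicographic, hence equivalent to the condition
    registered_ts < my_ts or (registered_ts == my_ts and registered_tid < my_tid).
    """
    return (registered_ts, registered_tid) < (my_ts, my_tid)
```

When the predicate holds, the thread holding the exclusive lock releases it
and undoes its event, as in lines 32-35 of Algorithm 2. The thread id only
matters for simultaneous events, where it provides the total order among
threads.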
Whenever an event is exeuted, and then ommitted thanks to safety assur-
ane, in whihever exeution mode, we rst plae into the alendar queue any
possible new event generated (line 36), and we then unregister the thread from
the timestamp and thread_id vetors whih are used to avoid priority inversion
(lines 3740). For the implementations of Feth(), Flush(), and Safe(), we
refer the reader to [19℄.
4 Experimental Results
We tested our proposal with the Phold benchmark for PDES systems [8]. This is
made up of synthetic LPs whose behavior can be tailored depending on the test
scenario one would like to generate. We included 1024 LPs in the simulation
model, each one scheduling events for itself or for the other objects.
Specifically, upon processing an event, the probability to schedule a new
event destined to another simulation object has been set to 0.2, which is
representative of scenarios with non-minimal interactions across the
simulated parts. Also, the initial population of events has been set to 1
event per simulation object, while the timestamp increment determining the
actual timestamp of newly scheduled events has been set to follow the
exponential distribution with mean value equal to one simulation time unit.
The model lookahead has been set to a minimal value computed as 0.5% of the
average timestamp increment. Further, the overall simulation is partitioned
into 4 phases where the LPs exhibit alternate behaviors in terms of updates
to their states. Specifically, phases 1 and 3 are write-mild since each event
only updates the classical counter of processed events and a few other
statistical values within the LP state. Conversely, phases 2 and 4 are
write-intensive, since event processing also updates an array of counters'
values, still embedded within the LP state (particularly, by performing 500
updates on the array entries). Overall, the different phases mimic varying
locality and memory access profiles one might expect from real applications'
workloads. A classical busy-loop characterizing PHOLD event processing steps
is also added, which is set to generate an average event granularity of about
25 microseconds.
In this experiment, we compare the performance of our mixed hardware- and
software-based approach to both pure hardware-based reversibility (as
proposed in [19]) and pure software-based one exclusively relying on undo
code blocks (this is achieved by preventing any thread from exploiting HTM in
our engine). We did not compare with the performance achievable by some last
generation traditional speculative PDES platform because the data reported in
[19] have shown that event granularity values of a few (tens of) microseconds
do not allow this type of platform to provide significant speedup values (due
to the fact that they are based on explicit partitioning of the workload
across the threads, and on explicit message passing for event
cross-scheduling, thus being much more adequate for larger grain simulation
models). Overall, we assessed our proposal with a workload configuration just
requiring alternative forms of speculative parallelization (like the one we
propose), as compared to the classical ones.
We have run this benchmark by varying the number of employed threads from 1
to the maximum number of physical CPU-cores in the underlying HTM-equipped
machine, which is equipped with two Intel Haswell 3.5 GHz processors, 24 GB
of RAM and runs Linux kernel 3.2 (see footnote 6). For the case of
single-thread runs, the execution time values are those achieved by simply
running the application code on top of a calendar queue scheduler.
In Figure 3 we report the observed execution time values while varying the
number of threads (each reported value resulting as the average over 5
different samples). The data clearly show how our mixed HW/SW approach to
reversibility outperforms both the others, with a maximum gain of up to 10%
vs the pure HW approach and of 30% vs the pure SW approach (achieved when
running with 4 threads). Such a gain by the mixed approach is clearly related
to the fact that write-intensive phases lead the pure software recoverability
support to become more intrusive, because of costly generation of bigger undo
code blocks, which does not pay off compared to the reliance on pure
HTM-based reversibility. On the other hand, the pure HTM-based approach does
not allow the maximization of the usefulness of the carried out speculative
work for larger thread counts. In fact, the slope of the execution time curve
for the pure HW approach becomes slightly worse than the one of the pure SW
approach when moving from 3 to 4 threads. Our mixed approach is able to get
the best of the two by just avoiding excessive aborts of HTM transactions
when relying on larger thread counts, also reducing the cost of undo code
blocks generation thanks to a fraction of events executed with HTM support.
The data reported in Figure 4 show how the pure HW approach suffers from a
kind of thrashing when increasing the thread count, while the pure SW
approach has minimal incidence of events undo, and that the mixed approach
avoids the thrashing phenomenon just like the pure SW approach does (but has
less overhead since it executes a portion of the events via HTM support).

Footnote 6: The hyper-threading support offered by the processors has been
excluded just to avoid cross-thread interferences (due to conflicting
hyper-threads' accesses to hardware resources) which might alter the
reliability of our analysis.

Fig. 3. Execution time (seconds) vs. number of threads for Mixed HW/SW, Pure
HW and Pure SW - log scale on the y-axis.

Fig. 4. Undo probability vs. number of threads for HW and SW speculatively
processed events (series: SW undo (mixed), HW undo (mixed), HW undo (pure),
SW undo (pure)).
5 Conlusions
We have presented a speulative PDES engine where reversibility of ausal inon-
sistent events is based on a mix of hardware and software failities. The hardware
part relies on HTM support oered by modern proessors, partiularly the Intel
Haswell, while software reversibility is based on transparent instrumentation and
on the dynami generation of bloks of ode able to undo memory side eets.
We have shown via an experimental study with a lassial benhmark how the
proposed mixed approah an overome the drawbaks of both the two baseline
ones, in terms of delivered performane of by the simulation engine.
Referenes
1. Brown, R.: Calendar queues: a fast O(1) priority queue implementation for the
simulation event set problem. Communiations of the ACM 31(10), 12201227
(1988)
2. Carothers, C.D., Perumalla, K.S., Fujimoto, R.M.: Eient optimisti parallel sim-
ulations using reverse omputation. ACM Transations on Modeling and Computer
Simulation 9(3), 224253 (1999)
3. Cingolani, D., Pellegrini, A., Quaglia, F.: Transparently Mixing Undo Logs and
Software Reversibility for State Reovery in Optimisti PDES. In: Proeedings of
the 2015 ACM SIGSIM Conferene on Priniples of Advaned Disrete Simulation.
PADS, ACM Press (2015)
4. Die, D., Shavit, N.: TLRW: Return of the Read-write Lok. Proeedings of the
22nd Annual ACM Symposium on Parallel Algorithms and Arhitetures pp. 284
293 (2010)
5. Fersha, A.: Probabilisti Adaptive Diret Optimism Control in Time Warp. In:
Proeedings of the 9th Workshop on Parallel and Distributed Simulation. pp. 120
129. IEEE Computer Soiety (1995)
6. Fersha, A., Luthi, J.: Estimating Rollbak Overhead for Optimism Control in
Time Warp. In: Proeedings of the 28th Annual Simulation Symposium. pp. 212.
IEEE Computer Soiety (apr 1995)
7. Fujimoto, R.M.: Parallel Disrete Event Simulation. In: Communiations of the
ACM. WSC, vol. 33, pp. 1928. ACM Press (1989)
8. Fujimoto, R.M.: Performane of Time Warp Under Syntheti Workloads. In: Pro-
eedings of the Multionf. on Distributed Simulation. pp. 2328. Soiety for Com-
puter Simulation (1990)
9. Fujimoto, R.M., Tsai, J.J., Gopalakrishnan, G.: Design and Evaluation of the Roll-
bak Chip: Speial Purpose Hardware for {Time Warp}. IEEE Transations on
Computers 41(1), 6882 (1992)
10. Jeerson, D.R.: Virtual Time. ACM Transations on Programming Languages and
System 7(3), 404425 (1985)
11. Jha, V., Bagrodia, R.: Simultaneous Events and Lookahead in Simulation Proto-
ols. ACM Transations on Modeling and Computer Simulation 10(3), 241267
(2000), http://doi.am.org/10.1145/361026.361032
12. LaPre, J.M., Gonsiorowski, E.J., Carothers, C.D.: LORAIN: a step loser to the
PDES 'holy grail'. In: Proeedings of the 2nd ACM SIGSIM/PADS onferene on
Priniples of Advaned Disrete Simulation. pp. 314. PADS, ACM Press, New
York, New York, USA (2014)
13. Mehl, H.: A deterministi tie-breaking sheme for sequential and distributed sim-
ulation. In: Proeedings of the Workshop on Parallel and Distributed Simulation.
ACM (1992)
14. Pellegrini, A.: Hijaker: Eient stati software instrumentation with appliations
in high performane omputing: Poster paper. In: Proeedings of the 2013 Interna-
tional Conferene on High Performane Computing and Simulation, HPCS 2013.
pp. 650655. Helsinki, Finland (2013)
15. Pellegrini, A., Vitali, R., Quaglia, F., Pellegrini, A., Quaglia, F.: Autonomi State
Management for Optimisti Simulation Platforms. IEEE Transations on Parallel
and Distributed Systems 26(6), 15601569 (2015)
16. Preiss, B.R., Louks, W.M., MaIntyre, D.: Eets of the Chekpoint Interval on
Time and Spae in Time Warp. ACM Transations on Modeling and Computer
Simulation 4(3), 223253 (1994)
17. Quaglia, F., Baldoni, R.: Exploiting Intra-Objet Dependenies in Parallel Simu-
lation. Inf. Proess. Lett. 70(3), 119125 (1999)
18. Quaglia, F., Santoro, A.: Non-Bloking Chekpointing for Optimisti Parallel Sim-
ulation: Desription and an Implementation. IEEE Transations on Parallel and
Distributed Systems 14(6), 593610 (2003)
19. Santini, E., Ianni, M., Pellegrini, A., Quaglia, F.: HTM Based Speulative Parallel
Disrete Event Simulation of Very Fine Grain Models. In: Proeedings of the 22nd
International Conferene on High Performane Computing (HiPC). HiPC (2015)
20. Soliman, H.M., Elmaghraby, A.S.: An Analytial Model for Hybrid Chekpointing
in Time Warp Distributed Simulation. IEEE Transations on Parallel and Dis-
tributed Systems 9(10), 947951 (1998)
21. West, D., Panesar, K.: Automati Inremental State Saving. In: Proeedings of the
10th Workshop on Parallel and Distributed Simulation. pp. 7885. PADS, IEEE
Computer Soiety (1996)
