A Conflict-Resilient Lock-Free Calendar Queue for Scalable Share-Everything PDES Platforms
Emerging share-everything Parallel Discrete Event Simulation (PDES) platforms rely on worker threads fully sharing the workload of events to be processed. These platforms require efficient event pool data structures enabling high concurrency of extraction/insertion operations. Non-blocking event pool algorithms are arising as promising solutions for this problem. However, the classical non-blocking paradigm forces concurrent conflicting operations, acting on the same portion of the event pool data structure, to abort and then retry. In this article we present a conflict-resilient non-blocking calendar queue that enables conflicting dequeue operations, concurrently attempting to extract the minimum element, to survive, thus improving the scalability of accesses to the hot portion of the data structure, namely the bucket to which the current locality of the events to be processed is bound. We have integrated our solution within an open source share-everything PDES platform and report the results of an experimental analysis of the proposed concurrent data structure compared to some solutions from the literature.
Optimizing simulation on shared-memory platforms: The smart cities case
Modern advancements in computing architectures have been accompanied by new emergent paradigms to run Parallel Discrete Event Simulation models efficiently. Indeed, many new paradigms to effectively use the available underlying hardware have been proposed in the literature. Among these, the Share-Everything paradigm tackles massively-parallel shared-memory machines, in order to support speculative simulation by taking into account the limits and benefits related to this family of architectures. Previous results have shown how this paradigm outperforms traditional speculative strategies (such as data-separated Time Warp systems) whenever the granularity of executed events is small. In this paper, we show performance implications of this simulation-engine organization when the simulation models have a variable granularity. To this end, we have selected a traffic model, tailored for smart cities-oriented simulation. Our assessment illustrates the effects of the various tuning parameters related to the approach, leading to a deeper understanding of this innovative paradigm.
A Non-Blocking Priority Queue for the Pending Event Set
The large diffusion of shared-memory multi-core machines has impacted the way Parallel Discrete Event Simulation (PDES) engines are built. While they were originally conceived as data-partitioned platforms, where each thread is in charge of managing a subset of simulation objects, nowadays the trend is to shift towards share-everything settings. In this scenario, any thread can (in principle) take care of CPU-dispatching pending events bound to whichever simulation object, which helps to fully share the load across the available CPU-cores. Hence, a fundamental aspect to be tackled is to provide an efficient globally-shared pending events’ set from which multiple worker threads can concurrently extract events to be processed, and into which they can concurrently insert newly produced events to be processed in the future. To cope with this aspect, we present the design and implementation of a concurrent non-blocking pending events’ set data structure, which can be seen as a variant of a classical calendar queue. Early experimental data collected with a synthetic stress test are reported, showing excellent scalability of our proposal on a machine equipped with 32 CPU-cores.
A new approach to reversible computing with applications to speculative parallel simulation
In this thesis, we propose an innovative approach to reversible computing that shifts the focus from the operations to the memory outcome of a generic program. This choice allows us to overcome some typical challenges of "plain" reversible computing. Our methodology is to instrument a generic application with the help of an instrumentation tool, namely Hijacker, which we have redesigned and developed for the purpose. Through compile-time instrumentation, we enhance the program's code to keep track of the memory trace it produces until the end. Regardless of the complexity behind the generation of each computational step of the program, we can build inverse machine instructions just by inspecting the instruction that is attempting to write some value to memory. From this information, we craft an ad-hoc instruction that conveys the old value and the knowledge of where to restore it.
This instruction will become part of a more comprehensive structure, namely the reverse window. Through this structure, we have sufficient information to cancel all the updates done by the generic program during its execution.
In this thesis, we discuss the structure of the reverse window as the building block for the whole reversing framework we designed and realized. Although we place our solution in the specific context of parallel discrete event simulation (PDES) adopting the Time Warp synchronization protocol, this framework paves the way for further general-purpose development and employment. We also present two additional contributions stemming from our reversibility approach, both of which still embrace the traditional state saving-based rollback strategy. The first contribution aims to harness the advantages of both approaches. We implement the rollback operation by combining state saving with our reversible support through a mathematical model. This model enables the system to autonomously choose the best rollback strategy according to the changing runtime dynamics of programs. The second contribution explores an orthogonal direction, still related to reversible computing aspects. In particular, we address the problem of reversing shared libraries. Indeed, by their nature, shared objects are visible to the whole system, and so is every possible external modification of their code. As a consequence, it is not possible to instrument them without affecting other unaware applications. We propose a different method to deal with the instrumentation of shared objects.
All our proposals have been assessed using the latest generation of the open source ROOT-Sim PDES platform, where we integrated our solutions. ROOT-Sim is a C-based package implementing a general purpose simulation environment based on the Time Warp synchronization protocol.
A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines
Common implementations of core memory allocation components handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach tends not to scale with large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or replicating the core allocators, the bottommost ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to scalability of memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling the allocator's metadata. Beyond improving scalability and performance, our solution is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.
NBBS: A Non-blocking Buddy System for Multi-core Machines
Common implementations of core memory allocation components, like the Linux buddy system, handle concurrent allocation/release requests by synchronizing threads via spinlocks. This approach tends not to scale with large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or replicating the core allocators, the bottommost ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to scalability of memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system, where threads performing concurrent allocations/releases do not undergo any spinlock-based synchronization. Our solution allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling the allocator's metadata. Conflict detection relies on conventional atomic machine instructions in the Read-Modify-Write (RMW) class. Beyond improving scalability and performance, our solution can also avoid wasting clock cycles for spinlock operations by threads that could in principle carry out their memory allocation/release in full concurrency. Thus, it is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.
A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines
Common implementations of core memory allocation components, like the Linux
buddy system, handle concurrent allocation/release requests by synchronizing
threads via spin-locks. This approach clearly tends not to scale with large
thread counts, a problem that has been addressed in the literature by
introducing layered allocation services or replicating the core allocators, the
bottommost ones within the layered architecture. Both these solutions tend to
reduce the pressure of actual concurrent accesses to each individual core
allocator. In this article we explore an alternative approach to scalability of
memory allocation/release, which can be still combined with those literature
proposals. Conflict detection relies on conventional atomic machine
instructions in the Read-Modify-Write (RMW) class. Furthermore, beyond
improving scalability and performance, the proposed non-blocking buddy system
can also avoid wasting clock cycles for spin-lock operations by threads that
could in principle carry out their memory allocation/release in full
concurrency. Thus, it is resilient to performance degradation in the face of
concurrent accesses, independently of the current level of fragmentation of
the handled memory blocks.
Techniques for Transparent Parallelization of Discrete Event Simulation Models
Simulation is a powerful technique to represent the evolution of real-world phenomena
or systems over time. It has been extensively used in different research
fields (from medicine and biology to economics and disaster rescue) to study
the behaviour of complex systems during their evolution (symbiotic simulation)
or before their actual realization (what-if analysis).
A traditional way to achieve high performance simulations is the employment
of Parallel Discrete Event Simulation (PDES) techniques, which are based
on the partitioning of the simulation model into Logical Processes (LPs) that
can execute events in parallel on different CPUs and/or different CPU cores,
and rely on synchronization mechanisms to achieve causally consistent execution
of simulation events. As it is well recognized, the optimistic synchronization
approach, namely the Time Warp protocol, which is based on rollback for recovering from
possible timestamp-order violations due to the absence of block-until-safe
policies for event processing, is likely to favour speedup in general application/
architectural contexts.
However, the optimistic PDES paradigm implicitly relies on a programming
model that departs from traditional sequential-style programming, given
that there is no notion of global address space (fully accessible while processing
events at any LP). Furthermore, there is the underlying assumption that the
code associated with event handlers cannot execute unrecoverable operations
given their speculative processing nature. Nevertheless, even though no unrecoverable
action is ever executed by event handlers, a means to actually undo
the action if requested needs to be devised and implemented within the software
stack.
On the other hand, sequential-style programming is an easy paradigm for
the development of simulation code, given that it does not require the programmer
to reason about memory partitioning (and therefore message passing) and
speculative (concurrent) processing of the application.
In this thesis, we present methodological and technical innovations that
show how, by developing innovative runtime mechanisms, it is possible to
let programmers implement their simulation models in a fully sequential way
and have the underlying simulation framework execute them in parallel according
to speculative processing techniques. Some of the approaches we provide show
applicability in either shared- or distributed-memory systems, while others will
be specifically tailored to multi/many-core architectures.
While presenting these supports, we will show the effect of these solutions
on performance, which will nevertheless be negligible,
allowing a fruitful exploitation of the available computing power. In the end,
we will highlight which are the clear benefits on the programming model tha
- …