2,568 research outputs found
A Template for Implementing Fast Lock-free Trees Using HTM
Algorithms that use hardware transactional memory (HTM) must provide a
software-only fallback path to guarantee progress. The design of the fallback
path can have a profound impact on performance. If the fallback path is allowed
to run concurrently with hardware transactions, then hardware transactions must
be instrumented, adding significant overhead. Otherwise, hardware transactions
must wait for any processes on the fallback path, causing concurrency
bottlenecks, or move to the fallback path. We introduce an approach that
combines the best of both worlds. The key idea is to use three execution paths:
an HTM fast path, an HTM middle path, and a software fallback path, such that
the middle path can run concurrently with each of the other two. The fast path
and fallback path do not run concurrently, so the fast path incurs no
instrumentation overhead. Furthermore, fast path transactions can move to the
middle path instead of waiting or moving to the software path. We demonstrate
our approach by producing an accelerated version of the tree update template of
Brown et al., which can be used to implement fast lock-free data structures
based on down-trees. We used the accelerated template to implement two
lock-free trees: a binary search tree (BST), and an (a,b)-tree (a
generalization of a B-tree). Experiments show that, with 72 concurrent
processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many
operations per second as an implementation obtained using the original tree
update template
A Concurrency-Agnostic Protocol for Multi-Paradigm Concurrent Debugging Tools
Today's complex software systems combine high-level concurrency models. Each
model is used to solve a specific set of problems. Unfortunately, debuggers
support only the low-level notions of threads and shared memory, forcing
developers to reason about these notions instead of the high-level concurrency
models they chose.
This paper proposes a concurrency-agnostic debugger protocol that decouples
the debugger from the concurrency models employed by the target application. As
a result, the underlying language runtime can define custom breakpoints,
stepping operations, and execution events for each concurrency model it
supports, and a debugger can expose them without having to be specifically
adapted.
We evaluated the generality of the protocol by applying it to SOMns, a
Newspeak implementation, which supports a diversity of concurrency models
including communicating sequential processes, communicating event loops,
threads and locks, fork/join parallelism, and software transactional memory. We
implemented 21 breakpoints and 20 stepping operations for these concurrency
models. For none of these, the debugger needed to be changed. Furthermore, we
visualize all concurrent interactions independently of a specific concurrency
model. To show that tooling for a specific concurrency model is possible, we
visualize actor turns and message sends separately.Comment: International Symposium on Dynamic Language
A Conflict-Resilient Lock-Free Calendar Queue for Scalable Share-Everything PDES Platforms
Emerging share-everything Parallel Discrete Event Simulation (PDES) platforms rely on worker threads fully sharing the workload of events to be processed. These platforms require efficient event pool data structures enabling high concurrency of extraction/insertion operations. Non-blocking event pool algorithms are raising as promising solutions for this problem. However, the classical non-blocking paradigm leads concurrent conflicting operations, acting on a same portion of the event pool data structure, to abort and then retry. In this article we present a conflict-resilient non-blocking calendar queue that enables conflicting dequeue operations, concurrently attempting to extract the minimum element, to survive, thus improving the level of scalability of accesses to the hot portion of the data structure---namely the bucket to which the current locality of the events to be processed is bound. We have integrated our solution within an open source share-everything PDES platform and report the results of an experimental analysis of the proposed concurrent data structure compared to some literature solutions
Low cost management of replicated data in fault-tolerant distributed systems
Many distributed systems replicate data for fault tolerance or availability. In such systems, a logical update on a data item results in a physical update on a number of copies. The synchronization and communication required to keep the copies of replicated data consistent introduce a delay when operations are performed. A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated. The additional concurrency thus obtained results in better response time when performing operations on replicated data. How this technique performs in conjunction with a roll-back and a roll-forward failure recovery mechanism is also discussed
Load sharing for optimistic parallel simulations on multicore machines
Parallel Discrete Event Simulation (PDES) is based on the partitioning of the simulation model into distinct Logical Processes (LPs), each one modeling a portion of the entire system, which are allowed to execute simulation events concurrently. This allows exploiting parallel computing architectures to speedup model execution, and to make very large models tractable. In this article we cope with the optimistic approach to PDES, where LPs are allowed to concurrently process their events in a speculative fashion, and rollback/ recovery techniques are used to guarantee state consistency in case of causality violations along the speculative execution path. Particularly, we present an innovative load sharing approach targeted at optimizing resource usage for fruitful simulation work when running an optimistic PDES environment on top of multi-processor/multi-core machines. Beyond providing the load sharing model, we also define a load sharing oriented architectural scheme, based on a symmetric multi-threaded organization of the simulation platform. Finally, we present a real implementation of the load sharing architecture within the open source ROme OpTimistic Simulator (ROOT-Sim) package. Experimental data for an assessment of both viability and effectiveness of our proposal are presented as well. Copyright is held by author/owner(s)
A Concurrency-Optimal Binary Search Tree
The paper presents the first \emph{concurrency-optimal} implementation of a
binary search tree (BST). The implementation, based on a standard sequential
implementation of an internal tree, ensures that every \emph{schedule} is
accepted, i.e., interleaving of steps of the sequential code, unless
linearizability is violated. To ensure this property, we use a novel read-write
locking scheme that protects tree \emph{edges} in addition to nodes.
Our implementation outperforms the state-of-the art BSTs on most basic
workloads, which suggests that optimizing the set of accepted schedules of the
sequential code can be an adequate design principle for efficient concurrent
data structures
Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks
Distributed transactions on high-overhead TCP/IP-based networks were
conventionally considered to be prohibitively expensive and thus were avoided
at all costs. To that end, the primary goal of almost any existing partitioning
scheme is to minimize the number of cross-partition transactions. However, with
the new generation of fast RDMA-enabled networks, this assumption is no longer
valid. In fact, recent work has shown that distributed databases can scale even
when the majority of transactions are cross-partition. In this paper, we first
make the case that the new bottleneck which hinders truly scalable transaction
processing in modern RDMA-enabled databases is data contention, and that
optimizing for data contention leads to different partitioning layouts than
optimizing for the number of distributed transactions. We then present Chiller,
a new approach to data partitioning and transaction execution, which aims to
minimize data contention for both local and distributed transactions. Finally,
we evaluate Chiller using various workloads, and show that our partitioning and
execution strategy outperforms traditional partitioning techniques which try to
avoid distributed transactions, by up to a factor of 2
- …