2,272 research outputs found
Lock-free Concurrent Data Structures
Concurrent data structures are the data sharing side of parallel programming.
Data structures give the means to the program to store data, but also provide
operations to the program to access and manipulate these data. These operations
are implemented through algorithms that have to be efficient. In the sequential
setting, data structures are crucially important for the performance of the
respective computation. In the parallel programming setting, their importance
becomes more crucial because of the increased use of data and resource sharing
for utilizing parallelism.
The first and main goal of this chapter is to provide a sufficient background
and intuition to help the interested reader to navigate in the complex research
area of lock-free data structures. The second goal is to offer the programmer
familiarity to the subject that will allow her to use truly concurrent methods.Comment: To appear in "Programming Multi-core and Many-core Computing
Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and
Distributed Computin
Practical Dynamic Transactional Data Structures
Multicore programming presents the challenge of synchronizing multiple threads. Traditionally, mutual exclusion locks are used to limit access to a shared resource to a single thread at a time. Whether this lock is applied to an entire data structure, or only a single element, the pitfalls of lock-based programming persist. Deadlock, livelock, starvation, and priority inversion are some of the hazards of lock-based programming that can be avoided by using non-blocking techniques. Non-blocking data structures allow scalable and thread-safe access to shared data by guaranteeing, at least, system-wide progress. In this work, we present the first wait-free hash map which allows a large number of threads to concurrently insert, get, and remove information. Wait-freedom means that all threads make progress in a finite amount of time --- an attribute that can be critical in real-time environments. We only use atomic operations that are provided by the hardware; therefore, our hash map can be utilized by a variety of data-intensive applications including those within the domains of embedded systems and supercomputers. The challenges of providing this guarantee make the design and implementation of wait-free objects difficult. As such, there are few wait-free data structures described in the literature; in particular, there are no wait-free hash maps. It often becomes necessary to sacrifice performance in order to achieve wait-freedom. However, our experimental evaluation shows that our hash map design is, on average, 7 times faster than a traditional blocking design. Our solution outperforms the best available alternative non-blocking designs in a large majority of cases, typically by a factor of 15 or higher. The main drawback of non-blocking data structures is that only one linearizable operation can be executed by each thread, at any one time. To overcome this limitation we present a framework for developing dynamic transactional data containers. Transactional containers are those that execute a sequence of operations atomically and in such a way that concurrent transactions appear to take effect in some sequential order. We take an existing algorithm that transforms non-blocking sets into static transactional versions (LFTT), and we modify it to support maps. We implement a non-blocking transactional hash map using this new approach. We continue to build on LFTT by implementing a lock-free vector using a methodology to allow LFTT to be compatible with non-linked data structures. A static transaction requires all operands and operations to be specified at compile-time, and no code may be executed between transactions. These limitations render static transactions impractical for most use cases. We modify LFTT to support dynamic transactions, and we enhance it with additional features. Dynamic transactions allow operands to be specified at runtime rather than compile-time, and threads can execute code between the data structure operations of a transaction. We build a framework for transforming non-blocking containers into dynamic transactional data structures, called Dynamic Transactional Transformation (DTT), and provide a library of novel transactional containers. Our library provides the wait-free progress guarantee and supports transactions among multiple data structures, whereas previous work on data structure transactions has been limited to operating on a single container. Our approach is 3 times faster than software transactional memory, and its performance matches its lock-free transactional counterpart
PaRiS: Causally Consistent Transactions with Non-blocking Reads and Partial Replication
Geo-replicated data platforms are at the backbone of several large-scale
online services. Transactional Causal Consistency (TCC) is an attractive
consistency level for building such platforms. TCC avoids many anomalies of
eventual consistency, eschews the synchronization costs of strong consistency,
and supports interactive read-write transactions. Partial replication is
another attractive design choice for building geo-replicated platforms, as it
increases the storage capacity and reduces update propagation costs. This paper
presents PaRiS, the first TCC system that supports partial replication and
implements non-blocking parallel read operations, whose latency is paramount
for the performance of read-intensive applications. PaRiS relies on a novel
protocol to track dependencies, called Universal Stable Time (UST). By means of
a lightweight background gossip process, UST identifies a snapshot of the data
that has been installed by every DC in the system. Hence, transactions can
consistently read from such a snapshot on any server in any replication site
without having to block. Moreover, PaRiS requires only one timestamp to track
dependencies and define transactional snapshots, thereby achieving resource
efficiency and scalability. We evaluate PaRiS on a large-scale AWS deployment
composed of up to 10 replication sites. We show that PaRiS scales well with the
number of DCs and partitions, while being able to handle larger data-sets than
existing solutions that assume full replication. We also demonstrate a
performance gain of non-blocking reads vs. a blocking alternative (up to 1.47x
higher throughput with 5.91x lower latency for read-dominated workloads and up
to 1.46x higher throughput with 20.56x lower latency for write-heavy
workloads)
Scalable RDF Data Compression using X10
The Semantic Web comprises enormous volumes of semi-structured data elements.
For interoperability, these elements are represented by long strings. Such
representations are not efficient for the purposes of Semantic Web applications
that perform computations over large volumes of information. A typical method
for alleviating the impact of this problem is through the use of compression
methods that produce more compact representations of the data. The use of
dictionary encoding for this purpose is particularly prevalent in Semantic Web
database systems. However, centralized implementations present performance
bottlenecks, giving rise to the need for scalable, efficient distributed
encoding schemes. In this paper, we describe an encoding implementation based
on the asynchronous partitioned global address space (APGAS) parallel
programming model. We evaluate performance on a cluster of up to 384 cores and
datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-art
MapReduce algorithm, we demonstrate a speedup of 2.6-7.4x and excellent
scalability. These results illustrate the strong potential of the APGAS model
for efficient implementation of dictionary encoding and contributes to the
engineering of larger scale Semantic Web applications
HeTM: Transactional Memory for Heterogeneous Systems
Modern heterogeneous computing architectures, which couple multi-core CPUs
with discrete many-core GPUs (or other specialized hardware accelerators),
enable unprecedented peak performance and energy efficiency levels.
Unfortunately, though, developing applications that can take full advantage of
the potential of heterogeneous systems is a notoriously hard task. This work
takes a step towards reducing the complexity of programming heterogeneous
systems by introducing the abstraction of Heterogeneous Transactional Memory
(HeTM). HeTM provides programmers with the illusion of a single memory region,
shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with
support for atomic transactions. Besides introducing the abstract semantics and
programming model of HeTM, we present the design and evaluation of a concrete
implementation of the proposed abstraction, which we named Speculative HeTM
(SHeTM). SHeTM makes use of a novel design that leverages on speculative
techniques and aims at hiding the inherently large communication latency
between CPUs and discrete GPUs and at minimizing inter-device synchronization
overhead. SHeTM is based on a modular and extensible design that allows for
easily integrating alternative TM implementations on the CPU's and GPU's sides,
which allows the flexibility to adopt, on either side, the TM implementation
(e.g., in hardware or software) that best fits the applications' workload and
the architectural characteristics of the processing unit. We demonstrate the
efficiency of the SHeTM via an extensive quantitative study based both on
synthetic benchmarks and on a porting of a popular object caching system.Comment: The current work was accepted in the 28th International Conference on
Parallel Architectures and Compilation Techniques (PACT'19
A Template for Implementing Fast Lock-free Trees Using HTM
Algorithms that use hardware transactional memory (HTM) must provide a
software-only fallback path to guarantee progress. The design of the fallback
path can have a profound impact on performance. If the fallback path is allowed
to run concurrently with hardware transactions, then hardware transactions must
be instrumented, adding significant overhead. Otherwise, hardware transactions
must wait for any processes on the fallback path, causing concurrency
bottlenecks, or move to the fallback path. We introduce an approach that
combines the best of both worlds. The key idea is to use three execution paths:
an HTM fast path, an HTM middle path, and a software fallback path, such that
the middle path can run concurrently with each of the other two. The fast path
and fallback path do not run concurrently, so the fast path incurs no
instrumentation overhead. Furthermore, fast path transactions can move to the
middle path instead of waiting or moving to the software path. We demonstrate
our approach by producing an accelerated version of the tree update template of
Brown et al., which can be used to implement fast lock-free data structures
based on down-trees. We used the accelerated template to implement two
lock-free trees: a binary search tree (BST), and an (a,b)-tree (a
generalization of a B-tree). Experiments show that, with 72 concurrent
processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many
operations per second as an implementation obtained using the original tree
update template
- …