Non-blocking Priority Queue based on Skiplists with Relaxed Semantics
Priority queues are data structures that store information in an orderly fashion. They are of tremendous importance because they are an integral part of many applications, like Dijkstra’s shortest path algorithm, MST algorithms, priority schedulers, and so on.
Because priority queues inherently suffer high contention on the delete_min operation, an efficient design requires a careful choice of the underlying data structure as well as of the relaxation bounds placed on it. Lock-free data structures offer higher scalability and stronger progress guarantees than lock-based ones, which is another factor to consider in the priority queue design.
We present a relaxed non-blocking priority queue based on skiplists. We address all the design issues mentioned above in our priority queue. Use of skiplists allows multiple threads to concurrently access different parts of the skiplist quickly, whereas relaxing the priority queue delete_min operation distributes contention over the skiplist instead of just at the front. Furthermore, a non-blocking implementation guarantees that the system will make progress even when some process fails.
Our priority queue is internally composed of several priority queues, one for each thread and one shared priority queue common to all threads. Each thread selects the best value from its local priority queue and the shared priority queue and returns the value. In case a thread is unable to delete an item, it tries to spy items from other threads' local priority queues.
We show the correctness of our data structure both experimentally and theoretically. We also compare its performance with other variations, such as priority queues based on coarse-grained skiplists, under both relaxed and non-relaxed semantics.
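The composition described in this abstract (one local queue per thread, one shared queue, and "spying" as a fallback) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it uses binary heaps instead of skiplists and a single lock instead of a non-blocking scheme, and the class name `RelaxedPriorityQueue` and the `shared` flag on `insert` are hypothetical choices made for illustration only.

```python
import heapq
import threading


class RelaxedPriorityQueue:
    """Illustrative model only: heaps plus one lock, so it is NOT non-blocking."""

    def __init__(self, num_threads):
        self.local = [[] for _ in range(num_threads)]  # one heap per thread
        self.shared = []                               # heap common to all threads
        self.lock = threading.Lock()

    def insert(self, tid, value, shared=False):
        # Assumption: the caller decides whether an item goes to the shared heap.
        with self.lock:
            heapq.heappush(self.shared if shared else self.local[tid], value)

    def delete_min(self, tid):
        with self.lock:
            # Pick the better of the local minimum and the shared minimum.
            candidates = [h for h in (self.local[tid], self.shared) if h]
            if candidates:
                best = min(candidates, key=lambda h: h[0])
                return heapq.heappop(best)
            # Nothing local or shared: "spy" on other threads' local heaps.
            for other in self.local:
                if other:
                    return heapq.heappop(other)
            return None


q = RelaxedPriorityQueue(num_threads=2)
q.insert(0, 5)
q.insert(0, 3)
q.insert(1, 1)
q.insert(0, 4, shared=True)
results = [q.delete_min(0) for _ in range(5)]
print(results)
```

In this run thread 0 receives 3 before 1, although 1 is the global minimum: that is the relaxation, since thread 0 only consults its own heap and the shared one until both are empty.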
HeTM: Transactional Memory for Heterogeneous Systems
Modern heterogeneous computing architectures, which couple multi-core CPUs
with discrete many-core GPUs (or other specialized hardware accelerators),
enable unprecedented peak performance and energy efficiency levels.
Unfortunately, though, developing applications that can take full advantage of
the potential of heterogeneous systems is a notoriously hard task. This work
takes a step towards reducing the complexity of programming heterogeneous
systems by introducing the abstraction of Heterogeneous Transactional Memory
(HeTM). HeTM provides programmers with the illusion of a single memory region,
shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with
support for atomic transactions. Besides introducing the abstract semantics and
programming model of HeTM, we present the design and evaluation of a concrete
implementation of the proposed abstraction, which we named Speculative HeTM
(SHeTM). SHeTM makes use of a novel design that leverages speculative
techniques and aims at hiding the inherently large communication latency
between CPUs and discrete GPUs and at minimizing inter-device synchronization
overhead. SHeTM is based on a modular and extensible design that allows for
easily integrating alternative TM implementations on the CPU's and GPU's sides,
which allows the flexibility to adopt, on either side, the TM implementation
(e.g., in hardware or software) that best fits the applications' workload and
the architectural characteristics of the processing unit. We demonstrate the
efficiency of SHeTM via an extensive quantitative study based both on
synthetic benchmarks and on a port of a popular object caching system.
Comment: The current work was accepted at the 28th International Conference on
Parallel Architectures and Compilation Techniques (PACT'19).
Data Structures for Task-based Priority Scheduling
Many task-parallel applications can benefit from attempting to execute tasks
in a specific order, as for instance indicated by priorities associated with
the tasks. We present three lock-free data structures for priority scheduling
with different trade-offs on scalability and ordering guarantees. First we
propose a basic extension to work-stealing that provides good scalability, but
cannot provide any guarantees for task ordering between threads. Next, we
present a centralized priority data structure based on $k$-fifo queues, which
provides strong (but still relaxed with regard to a sequential specification)
guarantees. The parameter $k$ makes it possible to dynamically configure the trade-off
between scalability and the required ordering guarantee. Third, and finally, we
combine both data structures into a hybrid, $k$-priority data structure, which
provides scalability similar to the work-stealing based approach for larger
$k$, while giving strong ordering guarantees for smaller $k$. We argue for
using the hybrid data structure as the best compromise for generic,
priority-based task-scheduling.
We analyze the behavior and trade-offs of our data structures in the context
of a simple parallelization of Dijkstra's single-source shortest path
algorithm. Our theoretical analysis and simulations show that both the
centralized and the hybrid $k$-priority based data structures can give strong
guarantees on the useful work performed by the parallel Dijkstra algorithm. We
support our results with experimental evidence on an 80-core Intel Xeon system.
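To make the notion of "useful work" under relaxed ordering concrete, here is a small sequential Python sketch of Dijkstra's algorithm driven by a relaxed queue. The relaxation model is an assumption chosen for illustration (each pop returns a random one of the `k` smallest entries) and is not the ordering guarantee defined in the paper; the function name `relaxed_dijkstra` and the example graph are likewise hypothetical.

```python
import heapq
import random


def relaxed_dijkstra(graph, source, k, seed=0):
    """Label-correcting shortest paths with a relaxed pop.

    Returns (distances, expansions); expansions above the number of
    reachable nodes are wasted work caused by out-of-order pops.
    """
    rng = random.Random(seed)
    dist = {source: 0}
    queue = [(0, source)]
    expansions = 0
    while queue:
        # Assumed relaxation: any of the k smallest entries may be returned.
        item = rng.choice(heapq.nsmallest(k, queue))
        queue.remove(item)
        heapq.heapify(queue)
        d, u = item
        if d > dist[u]:
            continue  # stale entry: a shorter path to u was found meanwhile
        expansions += 1
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return dist, expansions


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
exact_dist, exact_expansions = relaxed_dijkstra(graph, "a", k=1)
relaxed_dist, relaxed_expansions = relaxed_dijkstra(graph, "a", k=3)
print(exact_dist, exact_expansions, relaxed_expansions)
```

With `k=1` the queue is exact and every reachable node is expanded once. With larger `k` the distances are still correct, because any improved label is pushed again, but some nodes may be expanded more than once.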
Lace: non-blocking split deque for work-stealing
Work-stealing is an efficient method to implement load balancing in fine-grained task parallelism. Typically, concurrent deques are used for this purpose. A disadvantage of many concurrent deques is that they require expensive memory fences for local deque operations.
In this paper, we propose a new non-blocking work-stealing deque based on the split task queue. Our design uses a dynamic split point between the shared and the private portions of the deque, and only requires memory fences when shrinking the shared portion.
We present Lace, an implementation of work-stealing based on this deque, with an interface similar to the work-stealing library Wool, and an evaluation of Lace based on several common benchmarks. We also implement a recent approach using private deques in Lace. We show that the split deque and the private deque in Lace have low overhead and high scalability similar to Wool.
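The split-point idea can be sketched with a single-threaded Python model. It only illustrates the bookkeeping (which tasks are shared and which are private) and is not Lace's actual code: a real implementation uses atomic operations and memory fences, and the class name `SplitDeque`, the method names, and the "move half of the tasks" policy are assumptions made here for illustration.

```python
class SplitDeque:
    """Single-threaded model: tasks[head:split] are shared, tasks[split:] private."""

    def __init__(self):
        self.tasks = []
        self.head = 0   # index of the next task a thief would steal
        self.split = 0  # boundary between shared and private portions

    def push(self, task):
        # Owner operation on the private end.
        self.tasks.append(task)

    def grow_shared(self):
        # Owner exposes half of its private tasks (rounded up) to thieves.
        private = len(self.tasks) - self.split
        self.split += (private + 1) // 2

    def pop(self):
        # Owner operation. If the private portion is empty, reclaim half of
        # the shared tasks (rounded up); per the abstract, shrinking is the
        # step where a real implementation would need a memory fence.
        if len(self.tasks) == self.split and self.split > self.head:
            self.split -= (self.split - self.head + 1) // 2
        if len(self.tasks) > self.split:
            return self.tasks.pop()
        return None

    def steal(self):
        # Thief operation: only the shared portion is visible.
        if self.head < self.split:
            task = self.tasks[self.head]
            self.head += 1
            return task
        return None


d = SplitDeque()
for t in (1, 2, 3, 4):
    d.push(t)
first_steal = d.steal()   # nothing shared yet
d.grow_shared()           # tasks 1 and 2 become shared
second_steal = d.steal()
popped = [d.pop() for _ in range(4)]
print(first_steal, second_steal, popped)
```

The owner pops 4 and 3 from its private portion without touching the split point, then reclaims task 2 from the shared portion once the private portion is empty.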
Load Balancing Algorithms for Parallel Spatial Join on HPC Platforms
Geospatial datasets are growing in volume, complexity, and heterogeneity. For efficient execution of geospatial computations and analytics on large-scale datasets, parallel processing is necessary. To exploit fine-grained parallel processing on large-scale compute clusters, skewed datasets must be partitioned in a load-balanced way, which is challenging. The workload in spatial join is data dependent and highly irregular. Moreover, wide variation in the size and density of geometries from one region of the map to another further exacerbates the load imbalance. This dissertation focuses on the spatial join operation used in Geographic Information Systems (GIS) and spatial databases, where the inputs are two layers of geospatial data, and the output is a combination of the two layers according to a join predicate.
This dissertation introduces a novel spatial data partitioning algorithm geared towards load balancing the parallel spatial join processing. Unlike existing partitioning techniques, the proposed partitioning algorithm divides the spatial join workload instead of partitioning the individual datasets separately to provide better load-balancing. This workload partitioning algorithm has been evaluated on a high-performance computing system using real-world datasets. An intermediate output-sensitive duplication avoidance technique is proposed that decreases the external memory space requirement for storing spatial join candidates across the partitions. GPU acceleration is used to further reduce the spatial partitioning runtime. For dynamic load balancing in spatial join, a novel framework for fine-grained work stealing is presented. This framework is efficient and NUMA-aware. Performance improvements are demonstrated on shared and distributed memory architectures using threads and message passing. Experimental results show effective mitigation of data skew.
The framework supports a variety of spatial join predicates and spatial overlay using partitioned and un-partitioned datasets.
Cooperative kernels: GPU multitasking for blocking algorithms
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploiting scheduling quirks of today’s GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative kernels, an extension to the traditional GPU programming model geared towards writing blocking algorithms. Workgroups of a cooperative kernel are fairly scheduled, and multitasking is supported via a small set of language extensions through which the kernel and scheduler cooperate. We describe a prototype implementation of a cooperative kernel framework implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking.
Scheduling computations with provably low synchronization overheads
Work Stealing has been a very successful algorithm for scheduling parallel
computations, and is known to achieve high performance even for computations
exhibiting fine-grained parallelism. We present a variant of Work Stealing that provably
avoids most synchronization overheads by keeping processors' deques entirely
private by default, and only exposing work when requested by thieves. This is
the first paper that obtains bounds on the synchronization overheads that are
(essentially) independent of the total amount of work, thus corresponding to a
great improvement, in both algorithm design and theory, over state-of-the-art
Work Stealing algorithms. Consider any computation with work $T_{1}$ and critical-path
length $T_{\infty}$ executed by $P$ processors using our scheduler. Our
analysis shows that the expected execution time is $O\left(\frac{T_{1}}{P} + T_{\infty}\right)$, and the expected synchronization overheads incurred during
the execution are at most $O\left(\left(C_{CAS} + C_{MFence}\right) P T_{\infty}\right)$, where $C_{CAS}$ and $C_{MFence}$
respectively denote the maximum cost of executing a Compare-And-Swap
instruction and a Memory Fence instruction.