82 research outputs found

    Efficient concurrent data structure access parallelism techniques for increasing scalability

    Get PDF
    Multi-core processors have revolutionised the way data structures are designed by bringing parallelism to mainstream computing. Key to exploiting hardware parallelism available in multi-core processors are concurrent data structures. However, some concurrent data structure abstractions are inherently sequential and incapable of harnessing the parallelism performance of multi-core processors. Designing and implementing concurrent data structures to harness hardware parallelism is challenging due to the requirement of correctness, efficiency and practicability under various application constraints. In this thesis, our research contribution is towards improving concurrent data structure access parallelism to increase data structure performance. We propose new design frameworks that improve access parallelism of already existing concurrent data structure designs. Also, we propose new concurrent data structure designs with significant performance improvements. To give an insight into the interplay between hardware and concurrent data structure access parallelism, we give a detailed analysis and model the performance scalability with varying parallelism.In the first part of the thesis, we focus on data structure semantic relaxation. By relaxing the semantics of a data structure, a bigger design space, that allows weaker synchronization and more useful parallelism, is unveiled. Investigating new data structure designs, capable of trading semantics for achieving better performance in a monotonic way, is a major challenge in the area. We algorithmically address this challenge in this part of the thesis. We present an efficient, lock-free, concurrent data structure design framework for out-of-order semantic relaxation. We introduce a new two-dimensional algorithmic design, that uses multiple instances of a given data structure to improve access parallelism. In the second part of the thesis, we propose an efficient priority queue that improves access parallelism by reducing the number of synchronization points for each operation. Priority queues are fundamental abstract data types, often used to manage limited resources in parallel systems. Typical proposed parallel priority queue implementations are based on heaps or skip lists. In recent literature, skip lists have been shown to be the most efficient design choice for implementing priority queues. Though numerous intricate implementations of skip list based queues have been proposed in the literature, their performance is constrained by the high number of global atomic updates per operation and the high memory consumption, which are proportional to the number of sub-lists in the queue. In this part of the thesis, we propose an alternative approach for designing lock-free linearizable priority queues, that significantly improve memory efficiency and throughput performance, by reducing the number of global atomic updates and memory consumption as compared to skip-list based queues. To achieve this, our new design combines two structures; a search tree and a linked list, forming what we call a Tree Search List Queue (TSLQueue). Subsequently, we analyse and introduce a model for lock-free concurrent data structure access parallelism. The major impediment to scaling concurrent data structures is memory contention when accessing shared data structure access points, leading to thread serialisation, and hindering parallelism. Aiming to address this challenge, a significant amount of work in the literature has proposed multi-access techniques that improve concurrent data structure parallelism. However, there is little work on analysing and modelling the execution behaviour of concurrent multi-access data structures especially in a shared memory setting. In this part of the thesis, we analyse and model the general execution behaviour of concurrent multi-access data structures in the shared memory setting. We study and analyse the behaviour of the two popular random access patterns: shared (Remote) and exclusive (Local) access, and the behaviour of the two most commonly used atomic primitives for designing lock-free data structures: Compare and Swap, and, Fetch and Add

    Tight Bounds on the Round Complexity of the Distributed Maximum Coverage Problem

    Full text link
    We study the maximum kk-set coverage problem in the following distributed setting. A collection of sets S1,,SmS_1,\ldots,S_m over a universe [n][n] is partitioned across pp machines and the goal is to find kk sets whose union covers the most number of elements. The computation proceeds in synchronous rounds. In each round, all machines simultaneously send a message to a central coordinator who then communicates back to all machines a summary to guide the computation for the next round. At the end, the coordinator outputs the answer. The main measures of efficiency in this setting are the approximation ratio of the returned solution, the communication cost of each machine, and the number of rounds of computation. Our main result is an asymptotically tight bound on the tradeoff between these measures for the distributed maximum coverage problem. We first show that any rr-round protocol for this problem either incurs a communication cost of kmΩ(1/r) k \cdot m^{\Omega(1/r)} or only achieves an approximation factor of kΩ(1/r)k^{\Omega(1/r)}. This implies that any protocol that simultaneously achieves good approximation ratio (O(1)O(1) approximation) and good communication cost (O~(n)\widetilde{O}(n) communication per machine), essentially requires logarithmic (in kk) number of rounds. We complement our lower bound result by showing that there exist an rr-round protocol that achieves an ee1\frac{e}{e-1}-approximation (essentially best possible) with a communication cost of kmO(1/r)k \cdot m^{O(1/r)} as well as an rr-round protocol that achieves a kO(1/r)k^{O(1/r)}-approximation with only O~(n)\widetilde{O}(n) communication per each machine (essentially best possible). We further use our results in this distributed setting to obtain new bounds for the maximum coverage problem in two other main models of computation for massive datasets, namely, the dynamic streaming model and the MapReduce model

    Performance Analysis and Modelling of Concurrent Multi-access Data Structures

    Get PDF
    The major impediment to scaling concurrent data structures is memory contention when accessing shared data structure access-points, leading to thread serialisation, hindering parallelism. Aiming to address this challenge, significant amount of work in the literature has proposed multi-access techniques that improve concurrent data structure parallelism. However, there is little work on analysing and modelling the execution behaviour of concurrent multi-access data structures especially in a shared memory setting. In this paper, we analyse and model the general execution behaviour of concurrent multi-access data structures in the shared memory setting. We study and analyse the behaviour of the two popular random access patterns: shared (Remote) and exclusive (Local) access, and the behaviour of the two most commonly used atomic primitives for designing lock-free data structures: Compare and Swap, and, Fetch and Add. We model the concurrent multi-accesses by splitting the thread execution procedure into five logical sessions: i) side-work, ii) access-point search iii) access-point acquisition, iv) access-point data acquisition and v) access-point data operation. We model the acquisition of an access-point, as a system of closed queuing networks with parallel servers, and data acquisition in terms of where the data is located within the memory system. We evaluate our model on a set of concurrent data structure designs including a counter, a stack and a FIFO queue. The evaluation is carried out on two state of the art multi-core processors: Intel Xeon Phi CPU 7290 with 72 physical cores and Intel Xeon E5-2695 with 14 physical cores. Our model is able to predict the throughput performance of the given concurrent data structures with 80% to 100% accuracy on both architectures

    New streaming algorithms for parameterized maximal matching & beyond

    Get PDF
    Very recently at SODA'15 [2], we studied maximal matching via the framework of parameterized streaming, where we sought solutions under the promise that no maximal matching exceeds k in size. In this paper, we revisit this problem and provide a much simpler algorithm for this problem. We are also able to apply the same technique to the Point Line Cover problem [3]

    Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs

    Full text link
    As massive graphs become more prevalent, there is a rapidly growing need for scalable algorithms that solve classical graph problems, such as maximum matching and minimum vertex cover, on large datasets. For massive inputs, several different computational models have been introduced, including the streaming model, the distributed communication model, and the massively parallel computation (MPC) model that is a common abstraction of MapReduce-style computation. In each model, algorithms are analyzed in terms of resources such as space used or rounds of communication needed, in addition to the more traditional approximation ratio. In this paper, we give a single unified approach that yields better approximation algorithms for matching and vertex cover in all these models. The highlights include: * The first one pass, significantly-better-than-2-approximation for matching in random arrival streams that uses subquadratic space, namely a (1.5+ϵ)(1.5+\epsilon)-approximation streaming algorithm that uses O(n1.5)O(n^{1.5}) space for constant ϵ>0\epsilon > 0. * The first 2-round, better-than-2-approximation for matching in the MPC model that uses subquadratic space per machine, namely a (1.5+ϵ)(1.5+\epsilon)-approximation algorithm with O(mn+n)O(\sqrt{mn} + n) memory per machine for constant ϵ>0\epsilon > 0. By building on our unified approach, we further develop parallel algorithms in the MPC model that give a (1+ϵ)(1 + \epsilon)-approximation to matching and an O(1)O(1)-approximation to vertex cover in only O(loglogn)O(\log\log{n}) MPC rounds and O(n/polylog(n))O(n/poly\log{(n)}) memory per machine. These results settle multiple open questions posed in the recent paper of Czumaj~et.al. [STOC 2018]
    corecore