Search CORE

21 research outputs found

Towards Scalable Synchronization on Multi-Cores

Author: Trigonakis Vasileios
Publication venue: Lausanne, EPFL
Publication date: 19/10/2016
Field of study

The shift of commodity hardware from single- to multi-core processors in the early 2000s compelled software developers to take advantage of the available parallelism of multi-cores. Unfortunately, only few---so-called embarrassingly parallel---applications can leverage this available parallelism in a straightforward manner. The remaining---non-embarrassingly parallel---applications require that their processes coordinate their possibly interleaved executions to ensure overall correctness---they require synchronization. Synchronization is achieved by constraining or even prohibiting parallel execution. Thus, per Amdahl's law, synchronization limits software scalability. In this dissertation, we explore how to minimize the effects of synchronization on software scalability. We show that scalability of synchronization is mainly a property of the underlying hardware. This means that synchronization directly hampers the cross-platform performance portability of concurrent software. Nevertheless, we can achieve portability without sacrificing performance, by creating design patterns and abstractions, which implicitly leverage hardware details without exposing them to software developers. We first perform an exhaustive analysis of the performance behavior of synchronization on several modern platforms. This analysis clearly shows that the performance and scalability of synchronization are highly dependent on the characteristics of the underlying platform. We then focus on lock-based synchronization and analyze the energy/performance trade-offs of various waiting techniques. We show that the performance and the energy efficiency of locks go hand in hand on modern x86 multi-cores. This correlation is again due to the characteristics of the hardware that does not provide practical tools for reducing the power consumption of locks without sacrificing throughput. We then propose two approaches for developing portable and scalable concurrent software, hence hiding the limitations that the underlying multi-cores impose. First, we introduce OPTIK, a new practical design pattern for designing and implementing fast and scalable concurrent data structures. We illustrate the power of our OPTIK pattern by devising five new algorithms and by optimizing four state-of-the-art algorithms for linked lists, skip lists, hash tables, and queues. Second, we introduce MCTOP, a multi-core topology abstraction which includes low-level information, such as memory bandwidths. MCTOP enables developers to accurately and portably define high-level optimization policies. We illustrate several such policies through four examples, including automated backoff schemes for locks, and illustrate the performance and portability of these policies on five platforms

Infoscience - École polytechnique fédérale de Lausanne

Performance Anomalies in Concurrent Data Structure Microbenchmarks

Author: Brown Trevor
Kharal Rosina
Publication venue
Publication date: 19/09/2022
Field of study

Recent decades have witnessed a surge in the development of concurrent data structures with an increasing interest in data structures implementing concurrent sets (CSets). Microbenchmarking tools are frequently utilized to evaluate and compare the performance differences across concurrent data structures. The underlying structure and design of the microbenchmarks themselves can play a hidden but influential role in performance results. However, the impact of microbenchmark design has not been well investigated. In this work, we illustrate instances where concurrent data structure performance results reported by a microbenchmark can vary 10-100x depending on the microbenchmark implementation details. We investigate factors leading to performance variance across three popular microbenchmarks and outline cases in which flawed microbenchmark design can lead to an inversion of performance results between two concurrent data structure implementations. We further derive a prescriptive approach for best practices in the design and utilization of concurrent data structure microbenchmarks

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Performance Anomalies in Concurrent Data Structure Microbenchmarks

Author: Brown Trevor
Kharal Rosina F.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 26th International Conference on Principles of Distributed Systems (OPODIS 2022)
Publication date: 01/01/2023
Field of study

Recent decades have witnessed a surge in the development of concurrent data structures with an increasing interest in data structures implementing concurrent sets (CSets). Microbenchmarking tools are frequently utilized to evaluate and compare the performance differences across concurrent data structures. The underlying structure and design of the microbenchmarks themselves can play a hidden but influential role in performance results. However, the impact of microbenchmark design has not been well investigated. In this work, we illustrate instances where concurrent data structure performance results reported by a microbenchmark can vary 10-100x depending on the microbenchmark implementation details. We investigate factors leading to performance variance across three popular microbenchmarks and outline cases in which flawed microbenchmark design can lead to an inversion of performance results between two concurrent data structure implementations. We further derive a set of recommendations for best practices in the design and usage of concurrent data structure microbenchmarks and explore advanced features in the Setbench microbenchmark

Dagstuhl Research Online Publication Server

Concurrent Search Data Structures Can Be Blocking and Practically Wait-Free

Author: Abramson Morton
Intel
John
Kivity Avi
McKenney Paul E
McKenney Paul E.
Timothy
Uhlig Volkmar
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/09/2016
Field of study

We argue that there is virtually no practical situation in which one should seek a "theoretically wait-free" algorithm at the expense of a state-of-the-art blocking algorithm in the case of search data structures: blocking algorithms are simple, fast, and can be made "practically wait-free". We draw this conclusion based on the most exhaustive study of blocking search data structures to date. We consider (a) different search data structures of different sizes, (b) numerous uniform and non-uniform workloads, representative of a wide range of practical scenarios, with different percentages of update operations, (c) with and without delayed threads, (d) on different hardware technologies, including processors providing HTM instructions. We explain our claim that blocking search data structures are practically wait-free through an analogy with the birthday paradox, revealing that, in state-of-the-art algorithms implementing such data structures, the probability of conflicts is extremely small. When conflicts occur as a result of context switches and interrupts, we show that HTM-based locks enable blocking algorithms to cope with the

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Universally Scalable Concurrent Data Structures

Author: David Tudor Alexandru
Publication venue: Lausanne, EPFL
Publication date: 20/09/2017
Field of study

The increase in the number of cores in processors has been an important trend over the past decade. In order to be able to efficiently use such architectures, modern software must be scalable: performance should increase proportionally to the number of allotted cores. While some software is inherently parallel, with threads seldom having to coordinate, a large fraction of software systems are based on shared state, to which access must be coordinated. This shared state generally comes in the form of a concurrent data structure. It is thus essential for these concurrent data structures to be correct, fast and scalable, regardless of the scenario (i.e.,different workloads, processors, memory units, programming abstractions). Nevertheless, few or no generic approaches exist that result in concurrent data structures which scale in a large spectrum of environments. This dissertation introduces a set of generic methods that allows to build - irrespective of the deployment environment - fast and scalable concurrent data structures. We start by identifying a set of sufficient conditions for concurrent search data structures to scale and perform well regardless of the workloads and processors they are running on.We introduce âasynchronized concurrencyâ, a paradigm consisting of four complementary programming patterns, which calls for the design of concurrent search data structures to resemble that of their sequential counterparts. Next, we show that there is virtually no practical situation in which one should seek a âtheoretically wait-freeâ algorithm at the expense of a state-of-the-art blocking algorithm in the case of search data structures: blocking algorithms are simple, fast, and can be made "practically wait-free". We then focus on the memory unit, and provide a method yielding fast concurrent data structures even when the memory is non-volatile, and structures must be recoverable in case of a transient failure. We start by introducing a generic technique that allows us to avoid doing expensive writes to non-volatile memory by using a fast software cache. We also study memory management, and propose a solution tailored to concurrent data structures that uses coarse-grained memory management in order to avoid logging. Moreover, we argue for the use of lock-free algorithms in this non-volatile context, and show how by optimizing them we can avoid expensive logging operations. Together, the techniques we propose enable us to avoid any form of logging in the common case, thus significantly improving concurrent data structure performance when using non-volatile RAM. Finally, we go beyond basic interfaces, and look at scalable partitioned data structures implemented through a transactional interface. We present multiversion timestamp locking (MVTL),a new genre of multiversion concurrency control algorithms for serializable transactions. The key idea behind MVTL is simple and novel: lock individual time points instead of locking objects or versions. We provide several MVTL-based algorithms, that address limitations of current concurrency-control schemes. In short, by spanning workloads, processors, storage abstractions, and system sizes, this dissertation takes a step towards concurrent data structures that are universally scalable

Infoscience - École polytechnique fédérale de Lausanne

Scale-Out ccNUMA: Exploiting Skew with Strongly Consistent Caching

Author: Gavrielatos Vasileios
Grot Boris
Joshi Arpit
Katsarakis Antonios
Nagarajan Vijayanand
Oswald Nicolai
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 23/04/2018
Field of study

Crossref

Edinburgh Research Explorer

Concurrent Data Structures Using Multiword Compare and Swap

Author: Sigouin William
Publication venue: 'University of Waterloo'
Publication date: 27/04/2020
Field of study

To maximize the performance of concurrent data structures, researchers have turned to highly complex fine-grained techniques. Resulting algorithms are often extremely difficult to understand and prove correct, allowing for highly cited works to contain correctness bugs that go undetected for long periods of time. This complexity is perceived as a necessary sacrifice: simpler, more general techniques cannot attain competitive performance with these fine-grained implementations. To challenge this perception, this work presents three data structures created using multi-word compare-and-swap (KCAS), version numbering, and double-collect searches that showcase the power of using a more coarse-grained approach. First, a novel lock-free binary search tree (BST) is presented that is both fully-internal and balanced, which is able to achieve competitive performance with the state-of-the-art fine-grained concurrent BSTs while being significantly simpler. Next, the first concurrent implementation of an Euler-tour data-structure is outlined, solving fully-dynamic graph connectivity. Finally, a KCAS variant of an (a,b)-tree implementation is presented, which shows significant performance improvements in certain workloads when compared to the original

University of Waterloo's Institutional Repository

Invalidation-based protocols for replicated datastores

Author: Katsarakis Antonis
Publication venue: The University of Edinburgh
Publication date: 30/11/2021
Field of study

Distributed in-memory datastores underpin cloud applications that run within a datacenter and demand high performance, strong consistency, and availability. A key feature of datastores is data replication. The data are replicated across servers because a single server often cannot handle the request load. Replication is also necessary to guarantee that a server or link failure does not render a portion of the dataset inaccessible. A replication protocol is responsible for ensuring strong consistency between the replicas of a datastore, even when faults occur, by determining the actions necessary to access and manipulate the data. Consequently, a replication protocol also drives the datastore's performance. Existing strongly consistent replication protocols deliver fault tolerance but fall short in terms of performance. Meanwhile, the opposite occurs in the world of multiprocessors, where data are replicated across the private caches of different cores. The multiprocessor regime uses invalidations to afford strongly consistent replication with high performance but neglects fault tolerance. Although handling failures in the datacenter is critical for data availability, we observe that the common operation is fault-free and far exceeds the operation during faults. In other words, the common operating environment inside a datacenter closely resembles that of a multiprocessor. Based on this insight, we draw inspiration from the multiprocessor for high-performance, strongly consistent replication in the datacenter. The primary contribution of this thesis is in adapting invalidating protocols to the nuances of replicated datastores, which include skewed data accesses, fault tolerance, and distributed transactions

arXiv.org e-Print Archive

Edinburgh Research Archive