17 research outputs found

    Creating a Concurrent In-Memory B-Tree Optimized for NUMA Systems

    Get PDF
    The size of main memory is becoming larger. With the number of Central Processing Unit (CPU) cores ever increasing in modern systems, with each of them being able to access memory, the organization of memory becomes more important. In multicore systems, there are two main architectures for memory organization with respect to the cores - Symmetric Multi-Processor (SMP) and Non-Uniform Memory Architecture (NUMA). Prior work has focused on the improvement of the performance of B-Trees in highly concurrent and distributed environments, as well as in memory, for shared-memory mul- tiprocessors. However, little focus has been given to the performance of main memory B-Trees for NUMA systems. This work focuses on improving the performance of B-Trees contained in main memory of NUMA systems by introducing modifications that consider its storage in the physically distributed main memory of the NUMA system. The work in this thesis makes the following contributions to the development of a distributed B-Tree, specifically in a NUMA environment, modified from a B-Tree originally designed for high concurrency: • It introduces replication of internal nodes of the tree and shows how this can improve its overall performance in a NUMA environment. • It introduces NUMA-aware locking procedures with the aim of managing contention and exploiting locality of lock requests with reference to previous client operation request locations. • It introduces changes in the granularity of locking, starting from the original locking of every node to the locking of certain levels of nodes, showing the tradeoff between the granularity of locking and the performance of the tree based on the workload. • It considers the combination of the different techniques, with the aim of finding the combination which performs well overall for varying read-heavy workloads and number of client threads

    Constant RMR Group Mutual Exclusion for Arbitrarily Many Processes and Sessions

    Get PDF
    Group mutual exclusion (GME), introduced by Joung in 1998, is a natural synchronization problem that generalizes the classical mutual exclusion and readers and writers problems. In GME a process requests a session before entering its critical section; processes are allowed to be in their critical sections simultaneously provided they have requested the same session. We present a GME algorithm that (1) is the first to achieve a constant Remote Memory Reference (RMR) complexity for both cache coherent and distributed shared memory machines; and (2) is the first that can be accessed by arbitrarily many dynamically allocated processes and with arbitrarily many session names. Neither of the existing GME algorithms satisfies either of these two important properties. In addition, our algorithm has constant space complexity per process and satisfies the two strong fairness properties, first-come-first-served and first-in-first-enabled. Our algorithm uses an atomic instruction set supported by most modern processor architectures, namely: read, write, fetch-and-store and compare-and-swap

    ATraPos: Adaptive Transaction Processing on Hardware Islands

    Get PDF
    Nowadays, high-performance transaction processing applications increasingly run on multisocket multicore servers. Such architectures exhibit non-uniform memory access latency as well as non-uniform thread communication costs. Unfortunately, traditional shared-everything database management systems are designed for uniform inter-core communication speeds. This causes unpredictable access latencies in the critical path. While lack of data locality may be a minor nuisance on systems with fewer than 4 processors, it becomes a serious scalability limitation on larger systems due to accesses to centralized data structures. In this paper, we propose ATraPos. a storage manager design that is aware of the non-uniform access latencies of multisocket systems. ATraPos achieves good data locality by carefully partitioning the data as well as internal data structures (e.g., state information) to the available processors and by assigning threads to specific partitions. Furthermore, ATraPos dynamically adapts to the workload characteristics, i.e., when the workload changes, ATraPos detects the change and automatically revises the data partitioning and thread placement to fit the current access patterns and hardware topology. We prototype ATraPos on top of an open-source storage manager Shore-MT and we present a detailed experimental analysis with both synthetic and standard (TPC-C and TATP) benchmarks. We show that ATraPos exhibits performance improvements of a factor ranging from 1.4 to 6.7x for a wide collection of transactional workloads. In addition, we show that the adaptive monitoring and partitioning scheme of ATraPos poses a negligible cost, while it allows the system to dynamically and gracefully adapt when the workload changes

    Contention Adapting Search Trees

    Get PDF
    Abstract-With multicores being ubiquitous, concurrent data structures are becoming increasingly important. This paper proposes a novel approach to concurrent data structure design where the data structure collects statistics about contention and adapts dynamically according to this statistics. We use this approach to create a contention adapting binary search tree (CA tree) that can be used to implement concurrent ordered sets and maps. Our experimental evaluation shows that CA trees scale similar to recently proposed algorithms on a big multicore machine on various scenarios with a larger set size, and outperform the same data structures in more contended scenarios and in sequential performance. We also show that CA trees are well suited for optimization with hardware lock elision. In short, we propose a practically useful and easy to implement and show correct concurrent search tree that naturally adapts to the level of contention. I. INTRODUCTION With multicores being widespread, the need for efficient concurrent data structures has increased. In this paper we propose a novel adaptive technique for creating concurrent data structures. Our technique collects statistics about contention in locks and does local adaptations dynamically to reduce the contention or to optimize for low contention. This is the first contribution of this paper. Previous research on adapting to the level of contention has focused on objects where access cannot be easily distibuted, such as locks We demonstrate the benefits of our contention adapting technique by describing and evaluating a data structure for concurrent ordered sets or maps. We call this data structure contention adapting search tree or CA tree for short. The design of CA trees is the second contribution of this paper. Curren

    Extremely fast (a,b)-trees at all contention levels

    Get PDF
    Many concurrent dictionary implementations are designed and evaluated with only low-contention workloads in mind. This thesis presents several concurrent linearizable (a,b)-tree implementations with the overarching goal of performing well on both low- and high-contention workloads, and especially update-heavy workloads. The OCC-ABtree uses optimistic concurrency control to achieve state-of-the-art low-contention performance. However, under high-contention, cache coherence traffic begins to affect its performance. This is addressed by replacing its test-and-compare-and-swap locks with MCS queue locks. The resulting MCS-ABtree scales well under both low- and high-contention workloads. This thesis also introduces two coalescing-based trees, the CoMCS-ABtree and the CoPub-ABtree, that achieve substantially better performance under high-contention by reordering and coalescing concurrent inserts and deletes. Comparing these algorithms against the state of the art in concurrent search trees, we find that the fastest algorithm, the CoPub-ABtree, outperforms the next fastest competitor by up to 2x. This thesis then describes persistent versions of the four trees, whose implementations use fewer sfence instructions than a leading competitor (the FPTree). The persistent trees are proved to be strictly linearizable. Experimentally, the persistent trees are only slightly slower than their volatile counterparts, suggesting that they have great use as in-memory databases that need to be able to recover after a crash

    Abstracting Multi-Core Topologies with MCTOP

    Get PDF
    Portability and efficiency are usually antagonists in multi-core computing. In order to develop efficient code, one needs to take into account the topology of the target multi-cores (e.g., for locality). This clearly hampers code portability. In this paper, we show that you can have the cake and eat it too. We introduce MCTOP, an abstraction of multi-core topologies augmented with important low-level hardware information, such as memory bandwidths and communication latencies. We show how to automatically generate MCTOP using libmctop, our library that leverages the determinism of cache-coherence protocols to infer the topology of multi-cores using only latency measurements. MCTOP enables developers to accurately and portably define high-level performance optimization policies. We illustrate several such policies through four examples: (i-ii) thread placement in OpenMP and in a MapReduce library, (iii) a topology-aware mergesort algorithm, as well as (iv) automatic backoff schemes for locks. We illustrate the portability of these optimizations on five processors from Intel, AMD, and Oracle, with low effort

    NUMA-Aware Reader-Writer Locks

    No full text
    Non-Uniform Memory Access (NUMA) architectures are gaining importance in mainstream computing systems due to the rapid growth of multi-core multi-chip machines. Extracting the best possible performance from these new machines will require us to revisit the design of the concurrent algorithms and synchronization primitives which form the building blocks of many of today’s applications. This paper revisits one such critical synchronization primitive – the reader-writer lock. We present what is, to the best of our knowledge, the first family of reader-writer lock algorithms tailored to NUMA architectures. We present several variations which trade fairness between readers and writers for higher concurrency among readers and better back-to-back batching of writers from the same NUMA node. Our algorithms leverage the lock cohorting technique to manage synchronization between writers in a NUMA-friendly fashion, binary flags to coordinate readers and writers, and simple distributed reader counter implementations to enable NUMA-friendly concurrency among readers. The end result is a collection of surprisingly simple NUMA-aware algorithms that outperform the state-of-theart reader-writer locks by up to a factor of 10 in our microbenchmark experiments. To evaluate our algorithms in a realistic setting we also present performance results of the kccachetest benchmark of the Kyoto-Cabinet distribution, an open-source database which makes heavy use of pthread reader-writer locks. Our locks boost the performance of kccachetest by up to 40 % over the best prior alternatives
    corecore