21 research outputs found

    Resizable, Scalable, Concurrent Hash Tables

    We present algorithms for shrinking and expanding a hash table while allowing concurrent, wait-free, linearly scalable lookups. These resize algorithms allow the hash table to maintain constant-time performance as the number of entries grows, and to reclaim memory as the number of entries decreases, without delaying or disrupting readers. We implemented our algorithms in the Linux kernel to test their performance and scalability. Benchmarks show lookup scalability improved 125x over reader-writer locking, and 56% over the current state of the art for Linux, with no performance degradation for lookups during a resize. To achieve this performance, this hash table implementation uses a new concurrent programming methodology known as relativistic programming. In particular, we make use of an existing synchronization primitive which waits for all current readers to finish, with little to no reader overhead; careful use of this primitive allows ordering of updates without read-side synchronization or memory barriers.

    A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access

    This paper explores the relationship between reader-writer locking and relativistic programming approaches to managing accesses to shared data. It demonstrates that by placing certain restrictions on writers, relativistic programming allows more concurrency than reader-writer locking while still providing the same isolation guarantees. Relativistic programming also provides a straightforward model for reasoning about the correctness of programs that allow concurrent read-write accesses.

    Lock-free atom garbage collection for multithreaded Prolog

    The runtime system of dynamic languages such as Prolog or Lisp and their derivatives contains a symbol table, in Prolog often called the atom table. A simple dynamically resizing hash table used to be an adequate way to implement this table. As Prolog becomes fashionable for 24x7 server processes, we need to deal with atom garbage collection and concurrent access to the atom table. Classical lock-based implementations to ensure consistency of the atom table scale poorly, and a stop-the-world approach to atom garbage collection quickly becomes a bottleneck, making Prolog unsuitable for soft real-time applications. In this article we describe a novel implementation of the atom table using lock-free techniques, where the atom table remains accessible even during atom garbage collection. Relying only on CAS (Compare And Swap) and not on external libraries, the implementation is straightforward and portable. Under consideration for acceptance in TPLP. Comment: Paper presented at the 32nd International Conference on Logic Programming (ICLP 2016), New York City, USA, 16-21 October 2016, 14 pages, LaTeX, 4 PDF figures.

    CPHASH: A cache-partitioned hash table

    CPHash is a concurrent hash table for multicore processors. CPHash partitions its table across the caches of cores and uses message passing to transfer lookups/inserts to a partition. CPHash's message passing avoids the need for locks, pipelines batches of asynchronous messages, and packs multiple messages into a single cache line transfer. Experiments on an 80-core machine with 2 hardware threads per core show that CPHash has ~1.6x higher throughput than a hash table implemented using fine-grained locks. An analysis shows that CPHash wins because it experiences fewer cache misses and its cache misses are less expensive, because of less contention for the on-chip interconnect and DRAM. CPServer, a key/value cache server using CPHash, achieves ~5% higher throughput than a key/value cache server that uses a hash table with fine-grained locks, but both achieve better throughput and scalability than memcached. The throughput of CPHash and CPServer also scales near-linearly with the number of cores. Quanta Computer (Firm); National Science Foundation (U.S.) (Award 915164)
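
The partition-and-message idea in this abstract can be illustrated with a small sketch. It is not CPHash itself: the names `Partition`, `PartitionedTable`, and `submit` are hypothetical, Python threads stand in for cores, and `queue.Queue` (which uses locks internally) stands in for the lock-free, cache-line-packed message buffers the abstract describes. What it does show is the structure: each partition's table is touched by exactly one server thread, and clients send one batched message per partition.

```python
import queue
import threading


class Partition(threading.Thread):
    """Owns one private dict; only this thread ever reads or writes it."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.table = {}

    def run(self):
        while True:
            batch, reply = self.inbox.get()
            if batch is None:              # shutdown sentinel
                break
            results = []
            for op, key, value in batch:
                if op == "insert":
                    self.table[key] = value
                    results.append(None)
                else:                      # any other op is treated as a lookup
                    results.append(self.table.get(key))
            reply.put(results)


class PartitionedTable:
    """Hypothetical client side: routes operations to the owning partition."""

    def __init__(self, n_partitions=4):
        self.parts = [Partition() for _ in range(n_partitions)]
        for part in self.parts:
            part.start()

    def submit(self, ops):
        """Group (op, key, value) tuples by partition; send one message each."""
        batches = {}
        for i, op in enumerate(ops):
            idx = hash(op[1]) % len(self.parts)
            batches.setdefault(idx, []).append((i, op))
        pending = []
        for idx, items in batches.items():
            reply = queue.Queue(maxsize=1)
            self.parts[idx].inbox.put(([op for _, op in items], reply))
            pending.append((items, reply))
        out = [None] * len(ops)
        for items, reply in pending:       # collect replies after all sends
            for (i, _), result in zip(items, reply.get()):
                out[i] = result
        return out

    def close(self):
        for part in self.parts:
            part.inbox.put((None, None))
        for part in self.parts:
            part.join()
```

Sending every batch before waiting on any reply mirrors, loosely, the asynchronous pipelining the abstract mentions; the cache-locality benefits it reports are not something this sketch can reproduce.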

    Runtime latency detection and analysis

    Detecting latency-related problems in production environments is usually carried out at the application level with custom instrumentation. This is enough to detect high latencies in instrumented applications, but it does not provide all the information required to understand the source of the latency, and it depends on manually deployed instrumentation. Abnormal latencies usually start in the operating system kernel because of contention on physical resources or locks. Hence, finding the root cause of a latency may require a kernel trace. Such a trace can easily contain hundreds of thousands of events per second. In this paper, we propose and evaluate a methodology, efficient algorithms, and concurrent data structures to detect and analyze latency problems that occur at the kernel level. We introduce a new kernel-based approach that enables developers and administrators to efficiently track latency problems in production and trigger actions when abnormal conditions are detected. The result of this study is a working scalable latency tracker and an efficient approach to perform stateful tracing in production.

    Performance analysis of the LSU3shell program

    Ab initio approaches to nuclear structure exploration are at the forefront of current nuclear physics research. LSU3shell is an implementation of the ab initio method called symmetry-adapted no-core shell model (SA-NCSM) optimized for distributed HPC systems. The goal of this thesis was to analyze the performance of the LSU3shell program with a focus primarily on dynamic memory allocation, to survey methods that can be used to improve performance and memory usage, and to apply them to LSU3shell. The focus on dynamic memory allocation proved to be the right one, leading to many possible optimizations. I was able to reduce the run time by 41% on average, thus potentially saving up to 1.4 million core-hours of our total BlueWaters allocation.

    Concurrent Lazy Splay Tree with Relativistic Programming Approach

    A splay tree is a self-adjusting binary search tree in which recently accessed elements are quick to access again. In a concurrent environment, the splay operation creates a sequential bottleneck at the root of the tree. Lazy splaying rotates the tree at most once per access, so that only very frequently accessed items are fully splayed. We present an RCU (read-copy-update) based synchronization mechanism for splay tree operations which allows reads to occur concurrently with updates such as deletion and restructuring by splay rotation. This approach is generalized as relativistic programming. Relativistic programming is a programming technique for concurrent shared-memory architectures which tolerates different threads seeing events occur in different orders, so that events are not necessarily globally ordered, but rather subject to constraints of per-thread ordering. The main idea of the algorithm is that update operations are carried out concurrently with traversals/reads. Each update is published so that new reads see the new state, while pre-existing reads are allowed to proceed on the old state. The update is then completed after all pre-existing reads have finished.
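
The "new reads see the new state, pre-existing reads proceed on the old state" behaviour described above can be sketched in a few lines. This is not the paper's algorithm: it swaps in path copying on a plain (non-splaying) binary search tree in place of in-place rotations, and it relies on Python's garbage collector in place of an explicit wait for pre-existing readers. All names (`Node`, `PublishedTree`, `snapshot`, `add`) are hypothetical.

```python
import threading


class Node:
    """BST node treated as immutable: updates build new nodes, never mutate."""
    __slots__ = ("key", "left", "right")

    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def insert(node, key):
    """Return the root of a new tree containing key, sharing untouched subtrees."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node                      # key already present; nothing to copy


def contains(node, key):
    """Lock-free read over whichever version of the tree `node` belongs to."""
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


class PublishedTree:
    def __init__(self):
        self._root = None
        self._writer_lock = threading.Lock()   # serializes writers only

    def snapshot(self):
        """Reader entry point: a single reference load, no lock taken."""
        return self._root

    def add(self, key):
        with self._writer_lock:
            # Build the new version off to the side, then publish it with
            # one reference store. Readers holding the old root are unaffected.
            self._root = insert(self._root, key)
```

A reader that called `snapshot()` before an `add` keeps traversing the old version, and the old nodes are reclaimed only once no reader references them, which is the role the grace period plays in RCU.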