Resizable, Scalable, Concurrent Hash Tables
We present algorithms for shrinking and expanding a hash table while allowing concurrent, wait-free, linearly scalable lookups. These resize algorithms allow the hash table to maintain constant-time performance as the number of entries grows, and reclaim memory as the number of entries decreases, without delaying or disrupting readers.
We implemented our algorithms in the Linux kernel to test their performance and scalability. Benchmarks show lookup scalability improved 125x over reader-writer locking, and 56% over the current state-of-the-art for Linux, with no performance degradation for lookups during a resize.
To achieve this performance, this hash table implementation uses a new concurrent programming methodology known as relativistic programming. In particular, we make use of an existing synchronization primitive which waits for all current readers to finish, with little to no reader overhead; careful use of this primitive allows ordering of updates without read-side synchronization or memory barriers.
A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access
This paper explores the relationship between reader-writer locking and relativistic programming approaches to managing accesses to shared data. It demonstrates that by placing certain restrictions on writers, relativistic programming allows more concurrency than reader-writer locking while still providing the same isolation guarantees. Relativistic programming also allows for a straightforward model for reasoning about the correctness of programs that allow concurrent read-write accesses.
Lock-free atom garbage collection for multithreaded Prolog
The runtime system of dynamic languages such as Prolog or Lisp and their
derivatives contain a symbol table, in Prolog often called the atom table. A
simple dynamically resizing hash-table used to be an adequate way to implement
this table. As Prolog becomes fashionable for 24x7 server processes we need to
deal with atom garbage collection and concurrent access to the atom table.
Classical lock-based implementations to ensure consistency of the atom table
scale poorly and a stop-the-world approach to implement atom garbage collection
quickly becomes a bottle-neck, making Prolog unsuitable for soft real-time
applications. In this article we describe a novel implementation for the atom
table using lock-free techniques where the atom-table remains accessible even
during atom garbage collection. Relying only on CAS (Compare And Swap) and not
on external libraries, the implementation is straightforward and portable.
Under consideration for acceptance in TPLP. Comment: Paper presented at the
32nd International Conference on Logic Programming (ICLP 2016), New York City,
USA, 16-21 October 2016, 14 pages, LaTeX, 4 PDF figures
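The CAS-based approach mentioned in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation: Python has no hardware compare-and-swap, so the hypothetical `CasSlot` class emulates one with a small lock, and the hypothetical `AtomTable` covers only insertion and lookup by linear probing, with no resizing and no atom garbage collection. The point is only the claim-a-slot-with-CAS pattern.

```python
import threading


class CasSlot:
    """One table slot with an emulated compare-and-swap (assumption:
    the lock stands in for a hardware CAS instruction)."""

    def __init__(self):
        self._value = None
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically replace the value only if it is still `expected`.
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class AtomTable:
    """Fixed-size open-addressing table; the slot index is the atom handle."""

    def __init__(self, capacity=64):
        self._slots = [CasSlot() for _ in range(capacity)]

    def intern(self, text):
        """Return the handle for `text`, inserting it if absent."""
        n = len(self._slots)
        start = hash(text) % n
        for step in range(n):
            i = (start + step) % n
            slot = self._slots[i]
            current = slot.load()
            if current is None:
                # Try to claim the empty slot; only one thread can win.
                if slot.compare_and_swap(None, text):
                    return i
                current = slot.load()  # lost the race: inspect the winner
            if current == text:
                return i
            # Slot holds a different atom: keep probing.
        raise RuntimeError("atom table full")
```

Because a filled slot never changes in this sketch, a thread that loses the race simply re-reads the slot and either reuses the winner's atom or moves on to the next probe position.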
CPHASH: A cache-partitioned hash table
CPHash is a concurrent hash table for multicore processors. CPHash partitions its table across the caches of cores and uses message passing to transfer lookups/inserts to a partition. CPHash's message passing avoids the need for locks, pipelines batches of asynchronous messages, and packs multiple messages into a single cache line transfer. Experiments on an 80-core machine with 2 hardware threads per core show that CPHash has ~1.6x higher throughput than a hash table implemented using fine-grained locks. An analysis shows that CPHash wins because it experiences fewer cache misses and its cache misses are less expensive, because of less contention for the on-chip interconnect and DRAM. CPServer, a key/value cache server using CPHash, achieves ~5% higher throughput than a key/value cache server that uses a hash table with fine-grained locks, but both achieve better throughput and scalability than memcached. The throughput of CPHash and CPServer also scales near-linearly with the number of cores. Quanta Computer (Firm); National Science Foundation (U.S.) (Award 915164
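The partition-and-message-passing idea described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not CPHash itself: the class names `Partition` and `PartitionedTable` are hypothetical, each partition is a plain `dict` owned by one server thread, and `queue.Queue` stands in for the cache-line message buffers (Python's queue uses locks internally, so the sketch shows only the ownership and batching structure, not the performance behaviour).

```python
import queue
import threading


class Partition(threading.Thread):
    """Server thread that owns one private dict; no lock guards the dict."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.table = {}  # touched only by this thread

    def run(self):
        while True:
            batch, reply = self.inbox.get()
            if batch is None:  # shutdown sentinel
                break
            results = []
            for op, key, value in batch:
                if op == "insert":
                    self.table[key] = value
                    results.append(True)
                else:  # "lookup"
                    results.append(self.table.get(key))
            reply.put(results)


class PartitionedTable:
    def __init__(self, num_partitions=4):
        self.partitions = [Partition() for _ in range(num_partitions)]
        for p in self.partitions:
            p.start()

    def execute(self, ops):
        """Run a list of (op, key, value) tuples; return results in order."""
        # Group operations by owning partition: one batched message each.
        batches = {}
        for i, msg in enumerate(ops):
            idx = hash(msg[1]) % len(self.partitions)
            batches.setdefault(idx, []).append((i, msg))
        # Send every batch before waiting, so partitions work in parallel.
        pending = []
        for idx, items in batches.items():
            reply = queue.Queue(maxsize=1)
            self.partitions[idx].inbox.put(([m for _, m in items], reply))
            pending.append((items, reply))
        results = [None] * len(ops)
        for items, reply in pending:
            for (i, _), r in zip(items, reply.get()):
                results[i] = r
        return results

    def close(self):
        for p in self.partitions:
            p.inbox.put((None, None))
        for p in self.partitions:
            p.join()
```

A client would call `execute([("insert", "a", 1), ("lookup", "a", None)])`; both operations hash to the same partition and are applied in order by that partition's single owner thread.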
Runtime latency detection and analysis
Detecting latency-related problems in production environments is usually carried out at the application
level with custom instrumentation. This is enough to detect high latencies in instrumented applications but
does not provide all the information required to understand the source of the latency and is dependent on
manually deployed instrumentation. The abnormal latencies usually start in the operating system kernel
because of contention on physical resources or locks. Hence, finding the root cause of a latency may require a
kernel trace. This trace can easily represent hundreds of thousands of events per second. In this paper,
we propose and evaluate a methodology, efficient algorithms, and concurrent data structures to detect and
analyze latency problems that occur at the kernel level. We introduce a new kernel-based approach that
enables developers and administrators to efficiently track latency problems in production and trigger actions
when abnormal conditions are detected. The result of this study is a working scalable latency tracker and an
efficient approach to perform stateful tracing in production.
Performance analysis of the LSU3shell program
Ab initio approaches to nuclear structure exploration are at the forefront of current nuclear physics research. LSU3shell is an implementation of the ab initio method called symmetry-adapted no-core shell model (SA-NCSM) optimized for distributed HPC systems. The goal of this thesis was to analyze the performance of the LSU3shell program, focusing primarily on dynamic memory allocation, to survey methods that can improve performance and memory usage, and to apply them to LSU3shell. The focus on dynamic memory allocation proved to be the right one, leading to many possible optimizations. I was able to reduce the run time by 41% on average, thus potentially saving up to 1.4 million core-hours of our total BlueWaters allocation.
Concurrent Lazy Splay Tree with Relativistic Programming Approach
A splay tree is a self-adjusting binary search tree in which recently accessed elements are quick to access again. In a concurrent environment, the splay operation creates a sequential bottleneck at the root of the tree. Lazy splaying rotates the tree at most once per access, so that only very frequently accessed items end up fully splayed. We present an RCU (read-copy-update) based synchronization mechanism for splay tree operations that allows reads to occur concurrently with updates such as deletion and restructuring by splay rotation. This approach is generalized as relativistic programming, a programming technique for concurrent shared-memory architectures that tolerates different threads seeing events occur in different orders, so that events are not necessarily globally ordered but are instead subject to per-thread ordering constraints.
The main idea of the algorithm is that update operations are carried out concurrently with traversals and reads. Each update is made visible so that new reads see the new state, while pre-existing reads proceed on the old state. The update is then completed after all pre-existing reads have finished.
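The update protocol described above (publish a new state, let pre-existing reads finish on the old state, then complete the update) can be sketched as below. This is a toy model, not the paper's algorithm or the kernel RCU API: the class `RcuCell` and its methods are hypothetical names, the protected data is a single immutable value rather than a splay tree, and readers register in a counter guarded by a condition variable, whereas real RCU readers have little to no overhead.

```python
import threading


class RcuCell:
    """Holds one published value; writers replace it copy-on-write."""

    def __init__(self, value):
        self._value = value
        self._epoch = 0
        self._active = {0: 0}  # epoch -> number of readers still inside
        self._cond = threading.Condition()

    def read_begin(self):
        """Enter a read-side section; returns (epoch, snapshot)."""
        with self._cond:
            self._active[self._epoch] += 1
            return self._epoch, self._value

    def read_end(self, epoch):
        """Leave the read-side section entered at `epoch`."""
        with self._cond:
            self._active[epoch] -= 1
            self._cond.notify_all()

    def publish(self, make_new):
        """Install make_new(old) so that new readers see the new state.

        make_new must build a new object rather than mutate the old one.
        """
        with self._cond:
            old_value, old_epoch = self._value, self._epoch
            self._value = make_new(old_value)
            self._epoch += 1
            self._active[self._epoch] = 0
            return old_value, old_epoch

    def synchronize(self, old_epoch):
        """Block until every reader that began in `old_epoch` has ended."""
        with self._cond:
            while self._active[old_epoch] > 0:
                self._cond.wait()
            del self._active[old_epoch]
```

A writer would call `publish`, then `synchronize` with the returned epoch, and only afterwards reclaim or reuse the old value; a reader that began before the publish keeps working on its snapshot in the meantime.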
Scalable Emulation of Heterogeneous Systems
The breakdown of Dennard's transistor scaling has driven computing systems toward application-specific accelerators, which can provide orders-of-magnitude improvements in performance and energy efficiency over general-purpose processors.
To enable the radical departures from conventional approaches that heterogeneous systems entail, research infrastructure must be able to model processors, memory and accelerators, as well as system-level changes---such as operating system or instruction set architecture (ISA) innovations---that might be needed to realize the accelerators' potential. Unfortunately, existing simulation tools that can support such system-level research are limited by the lack of fast, scalable machine emulators to drive execution.
To fill this need, in this dissertation we first present a novel machine emulator design based on dynamic binary translation that makes the following improvements over the state of the art: it scales on multicore hosts while remaining memory efficient, correctly handles cross-ISA differences in atomic instruction semantics, leverages the host floating point (FP) unit to speed up FP emulation without sacrificing correctness, and can be efficiently instrumented to---among other possible uses---drive the execution of a full-system, cross-ISA simulator with support for accelerators.
We then demonstrate the utility of machine emulation for studying heterogeneous systems by leveraging it to make two additional contributions. First, we quantify the trade-offs in different coupling models for on-chip accelerators. Second, we present a technique to reuse the private memories of on-chip accelerators when they are otherwise inactive to expand the system's last-level cache, thereby reducing the opportunity cost of the accelerators' integration.