    Toward Linearizability Testing for Multi-Word Persistent Synchronization Primitives

    Persistent memory makes it possible to recover in-memory data structures following a failure instead of rebuilding them from state saved in slow secondary storage. Implementing such recoverable data structures correctly is challenging, as their underlying algorithms must deal with both parallelism and failures, which makes them especially susceptible to programming errors. Traditional proofs of correctness should therefore be combined with other methods, such as model checking or software testing, to minimize the likelihood of uncaught defects. This research focuses specifically on the algorithmic principles of software testing, particularly linearizability analysis, for multi-word persistent synchronization primitives such as conditional swap operations. We describe an efficient decision procedure for linearizability in this context and discuss its practical application in detecting previously unknown bugs in implementations of multi-word persistent primitives.
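
    As an illustration of what linearizability analysis involves (and of why efficient decision procedures matter), the following C++ sketch brute-forces the check for a recorded history of single-word CAS operations: it searches for a permutation that respects real-time order and replays correctly against a sequential model. This is a minimal sketch with illustrative names, not the decision procedure described above, whose contribution is precisely to avoid this exponential search.

        // Brute-force linearizability check for a history of single-word CAS
        // operations (C++17). Exponential in history length; shown only to
        // make the problem concrete. All names are illustrative.
        #include <algorithm>
        #include <cstddef>
        #include <map>
        #include <numeric>
        #include <vector>

        struct Op {
            int invoke, response;          // real-time interval of the operation
            int addr, expected, desired;   // CAS(addr, expected, desired)
            bool succeeded;                // outcome the test harness observed
        };

        // A candidate order is admissible only if it never places an operation
        // after one that responded before it was invoked.
        static bool respects_real_time(const std::vector<Op>& h,
                                       const std::vector<int>& order) {
            for (std::size_t i = 0; i < order.size(); ++i)
                for (std::size_t j = i + 1; j < order.size(); ++j)
                    if (h[order[j]].response < h[order[i]].invoke) return false;
            return true;
        }

        // Replay the candidate order against a sequential CAS specification.
        static bool replays_correctly(const std::vector<Op>& h,
                                      const std::vector<int>& order) {
            std::map<int, int> mem;        // every word implicitly starts at 0
            for (int idx : order) {
                const Op& op = h[idx];
                bool ok = (mem[op.addr] == op.expected);
                if (ok) mem[op.addr] = op.desired;
                if (ok != op.succeeded) return false;
            }
            return true;
        }

        bool is_linearizable(const std::vector<Op>& h) {
            std::vector<int> order(h.size());
            std::iota(order.begin(), order.end(), 0);
            do {
                if (respects_real_time(h, order) && replays_correctly(h, order))
                    return true;           // found a valid linearization
            } while (std::next_permutation(order.begin(), order.end()));
            return false;                  // no admissible order replays correctly
        }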

    Efficient Multi-Word Compare and Swap

    Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet its widespread use has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions: existing MCAS implementations use at least 2k+1 CASes per k-word CAS (k-CAS). This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-CAS into fewer than k locations to be CASed. We then present the first algorithm that requires only k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in most workloads across a variety of benchmarks. We also present a durably linearizable (persistent-memory-friendly) version of our MCAS algorithm that uses only 2 persistence fences per call, while still requiring only k+1 CASes per k-CAS.
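
    The k+1 figure can be made concrete with a simplified sketch of the descriptor-based pattern that lock-free MCAS algorithms share: k CASes install a pointer to a shared descriptor in each target word, and one final CAS on the descriptor's status word commits the whole operation. The C++ below is illustrative only; it omits the helping, conflict-resolution, and memory-reclamation logic that a real lock-free MCAS (including the paper's algorithm) requires.

        // Deliberately simplified descriptor-based MCAS sketch (C++17).
        // Shows where k+1 CASes come from on the uncontended path; assumes
        // descriptors are aligned so bit 0 is free for pointer tagging.
        #include <atomic>
        #include <cstddef>
        #include <cstdint>
        #include <vector>

        struct MCASDescriptor {
            enum Status : int { ACTIVE, SUCCEEDED, FAILED };
            std::atomic<int> status{ACTIVE};
            std::vector<std::atomic<uint64_t>*> addrs;   // the k target words
            std::vector<uint64_t> expected, desired;
        };

        bool mcas(MCASDescriptor* d) {
            uint64_t marked = reinterpret_cast<uint64_t>(d) | 1;  // tagged pointer
            std::size_t k = d->addrs.size();
            for (std::size_t i = 0; i < k; ++i) {                 // k CASes: install
                uint64_t exp = d->expected[i];
                if (!d->addrs[i]->compare_exchange_strong(exp, marked)) {
                    int st = MCASDescriptor::ACTIVE;
                    d->status.compare_exchange_strong(st, MCASDescriptor::FAILED);
                    for (std::size_t j = 0; j < i; ++j)           // undo installs;
                        d->addrs[j]->store(d->expected[j]);       // safe only because
                    return false;                                 // nobody helps here
                }
            }
            int st = MCASDescriptor::ACTIVE;                      // +1 CAS: the single
            bool ok = d->status.compare_exchange_strong(          // commit point
                st, MCASDescriptor::SUCCEEDED);
            for (std::size_t i = 0; i < k; ++i)                   // plain writes, not
                d->addrs[i]->store(ok ? d->desired[i] : d->expected[i]);  // CASes
            return ok;
        }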

    A Scalable Recoverable Skip List for Persistent Memory on NUMA Machines

    Interest in recoverable, persistent-memory-resident (PMEM-resident) data structures is growing as the availability of Intel Optane Data Center Persistent Memory increases. An interesting use case for in-memory, recoverable data structures is database indexes, which need high availability and reliability. Skip lists are particularly well suited for use as a fully PMEM-resident index because their probabilistic balancing requires fewer writes than other index structures such as B-trees. The Untitled Persistent Skip List (UPSkipList) is a PMEM-resident recoverable skip list derived from Herlihy et al.'s lock-free skip list algorithm. It is developed using a new conversion technique that extends the RECIPE approach of Lee et al. to lock-free algorithms with non-blocking writes and no inherent recovery mechanism. It does this by tracking the current period between two failures, or failure-free epoch, and recording the current epoch in nodes as they are modified. An observing thread can thus determine whether an inconsistent node is being modified in the current epoch or was being modified in a previous epoch and is now in need of recovery. The algorithm is also extended to support concurrent data-node splitting to improve performance, which is easily made recoverable using the extension to RECIPE that allows detection of incomplete node splits. UPSkipList also supports cache-efficient NUMA awareness of dynamically allocated objects via an extension of the Region-ID in Value (RIV) method of Chen et al. By using additional bits after the most significant bits of an RIV pointer to identify the object to which the remaining bits are relative, chunks of memory can be dynamically allocated to UPSkipList from multiple shared pools without the need for fat pointers, which reduce cache efficiency by halving the number of pointers that fit in a cache line. This combines the benefits of the RIV method with those of the dynamic memory allocation built into the Persistent Memory Development Kit (PMDK), improving both performance and practicality. Additionally, memory managed manually within a chunk using the RIV method can have its post-crash recovery deferred to the next attempted allocation by a thread sharing the ID of the thread that allocated the memory being recovered, reducing recovery time for large pools with many threads active at the time of a crash. UPSkipList was compared against the BzTree of Arulraj et al., as implemented by Lersch et al., which has non-blocking, non-repairing writes implemented using the persistent multi-word CAS (PMwCAS) primitive of Wang et al., and against a transactional recoverable skip list implemented with the PMDK. Tested with the Yahoo Cloud Serving Benchmark (YCSB), UPSkipList achieves better performance than BzTree in write-heavy workloads at high levels of concurrency, and outperforms the PMDK-based skip list owing to the latter's higher average latency. Using the extended RIV pointers to allocate memory dynamically yielded a 40% performance increase over the PMDK's fat pointers. NUMA awareness using multiple pools of memory, compared with striping a single pool across multiple nodes, cost only 5.6% in performance. Finally, the recovery time of UPSkipList was comparable to that of the PMDK-based skip list and 9 times faster than that of BzTree with 500K descriptors in its PMwCAS pool.
    The correctness of UPSkipList and of its conversion and recovery techniques was tested using black-box recoverable linearizability analysis, which found UPSkipList to be free of strict linearizability errors across 30 trials.
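
    The extended RIV pointer layout lends itself to a compact illustration. The field widths and names below are assumptions made for the sketch, not UPSkipList's actual encoding; the point is that region, object, and offset all fit in one 64-bit word, so no fat pointers are needed.

        // Illustrative Region-ID-in-Value pointer layout (C++17). Assumed widths:
        //   [ region ID : 8 | object/chunk ID : 8 | offset in chunk : 48 ]
        #include <cstdint>

        constexpr int kRegionBits = 8, kObjectBits = 8, kOffsetBits = 48;
        static_assert(kRegionBits + kObjectBits + kOffsetBits == 64, "one word");

        uint64_t riv_make(uint64_t region, uint64_t object, uint64_t offset) {
            return (region << (kObjectBits + kOffsetBits)) |
                   (object << kOffsetBits) |
                   (offset & ((1ULL << kOffsetBits) - 1));
        }

        uint64_t riv_region(uint64_t p) { return p >> (kObjectBits + kOffsetBits); }
        uint64_t riv_object(uint64_t p) {
            return (p >> kOffsetBits) & ((1ULL << kObjectBits) - 1);
        }
        uint64_t riv_offset(uint64_t p) { return p & ((1ULL << kOffsetBits) - 1); }

        // After a restart, chunks may map at new virtual addresses; a table of
        // per-(region, object) base addresses rebuilt at recovery resolves the
        // pointer. Everything fits in 64 bits, so a cache line still holds the
        // full number of pointers.
        void* riv_resolve(uint64_t p, void* bases[][1 << kObjectBits]) {
            char* base = static_cast<char*>(bases[riv_region(p)][riv_object(p)]);
            return base + riv_offset(p);
        }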

    Effective testing for concurrency bugs

    In the current multi-core era, concurrency bugs are a serious threat to software reliability. As hardware becomes more parallel, concurrent programming will become increasingly pervasive. However, correct concurrent programming is known to be extremely challenging for developers and can easily lead to the introduction of concurrency bugs. This dissertation addresses this challenge by proposing novel techniques to help developers expose and detect concurrency bugs. We conducted a bug study to better understand the external and internal effects of real-world concurrency bugs. Our study revealed that a significant fraction of concurrency bugs qualify as semantic or latent bugs, two particularly challenging classes of concurrency bugs. Based on the insights from the study, we propose a concurrency bug detector, PIKE, that analyzes the behavior of program executions to infer whether concurrency bugs have been triggered during a concurrent execution. In addition, we present the design of a testing tool, SKI, that allows developers to test operating system kernels for concurrency bugs in a practical manner. SKI bridges the gap between user-mode testing and kernel-mode testing by enabling the systematic exploration of the kernel thread-interleaving space. Our evaluation shows that both PIKE and SKI are effective at finding concurrency bugs.
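
    SKI's core idea, systematically exploring the thread-interleaving space, can be illustrated at user level with a toy schedule explorer (SKI itself targets kernel interleavings and uses very different machinery). The C++ sketch below enumerates every interleaving of two threads' steps over a copied state and counts the schedules that break an invariant; all names are illustrative.

        // Toy systematic interleaving explorer (C++17): enumerates every
        // schedule of two threads' steps and counts invariant violations.
        #include <cstddef>
        #include <functional>
        #include <iostream>
        #include <vector>

        struct State { int x = 0; };
        using Step = std::function<void(State&)>;

        void explore(State s, const std::vector<Step>& a, std::size_t i,
                     const std::vector<Step>& b, std::size_t j, int& bad) {
            if (i == a.size() && j == b.size()) {      // schedule complete
                if (s.x != 2) ++bad;                   // both increments expected
                return;
            }
            if (i < a.size()) { State t = s; a[i](t); explore(t, a, i + 1, b, j, bad); }
            if (j < b.size()) { State t = s; b[j](t); explore(t, a, i, b, j + 1, bad); }
        }

        int main() {
            // Classic lost-update bug: each thread reads x, then writes x+1.
            int ra = 0, rb = 0;
            std::vector<Step> a = { [&](State& s) { ra = s.x; },
                                    [&](State& s) { s.x = ra + 1; } };
            std::vector<Step> b = { [&](State& s) { rb = s.x; },
                                    [&](State& s) { s.x = rb + 1; } };
            int bad = 0;
            explore(State{}, a, 0, b, 0, bad);
            std::cout << bad << " schedules lose an update\n";
        }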

    TANDEM: taming failures in next-generation datacenters with emerging memory

    The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions. To be recoverable, applications must take additional measures during failure-free execution to maintain a recoverable state of data and computation logic. However, these precautionary measures have severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their architectural attributes, which differ significantly from those of traditional memory devices, introduce new semantic challenges for implementing recovery, complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability). This thesis addresses these challenges through the following approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery. We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor's load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM, known as persistency models. We give a precise specification of a novel persistency model, Release Persistency (RP), that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitectural mechanism, lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance. We continue with memory disaggregation, which decouples memory from traditional monolithic servers and offers a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) does not work with RPC-style protocols, mandating one-sided transaction protocols.
    Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol to ensure correct, non-blocking, and fast recovery in a DKVS. Experiments with our implementation demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel targeted litmus-testing framework, DART, to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, requiring no intervention from programmers.
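
    The ordering problem that persistency models address can be seen with today's primitives. The sketch below, which assumes an x86 CPU with CLWB support and a node that fits in one cache line, shows the explicit flush-and-fence discipline a log-free durable insert needs so that a node is persistent before the pointer that publishes it; Release Persistency and LRP aim to make exactly this kind of ordering cheap to express and enforce.

        // Log-free durable publish on persistent memory (C++17, x86 with CLWB;
        // compile with -mclwb). Assumes the node fits in one cache line.
        #include <immintrin.h>   // _mm_clwb, _mm_sfence
        #include <atomic>
        #include <cstdint>

        struct Node { uint64_t key, value; Node* next; };

        // Write a cache line back toward the persistence domain, then fence so
        // no later store can become persistent before it.
        static inline void persist(const void* p) {
            _mm_clwb(const_cast<void*>(p));
            _mm_sfence();
        }

        // The node's contents must be persistent BEFORE the head pointer that
        // publishes it; otherwise a crash can leave the list pointing at
        // garbage. This manual flush/fence discipline is what persistency
        // models let programs state, and what mechanisms like LRP aim to
        // enforce more cheaply.
        void durable_push(std::atomic<Node*>& head, Node* n) {
            n->next = head.load(std::memory_order_relaxed);
            persist(n);                            // payload reaches NVM first
            head.store(n, std::memory_order_release);
            persist(&head);                        // then the publishing pointer
        }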

    EA-PHT-HPR: Designing Scalable Data Structures for Persistent Memory

    Volatile memory has long dominated main memory in servers and personal computers. In 2019, Intel released the Optane Data Center Persistent Memory Module (DCPMM) to the public. These devices offer the capacity and persistence of block devices while providing the byte addressability and low latency of DRAM. This technology now allows programmers to develop data structures that remain in main memory across crashes and power failures. Implementing recoverable code is not an easy task and adds a new degree of complexity to how we develop and prove the correctness of code. This thesis explores the different approaches that have been taken to develop persistent data structures, specifically hash tables. The work presents an iterative process for the development of a persistent hash table. The proposed designs are based on a previously implemented DRAM design, and we intend for the hash table to remain similar to its original DRAM design while achieving high performance and scalability in persistent memory. At each step of the iterative process, the proposed design's weak points are identified, and the implementations are compared with current state-of-the-art persistent hash tables. The final design is a hybrid hash table implementation that achieves up to 47% higher performance in write-heavy workloads, and up to 19% higher performance in read-only workloads, than dynamic and scalable hashing (DASH), currently one of the fastest hash tables for persistent memory. In addition, to reduce the latency of a full-table resize operation, the proposed design incorporates a new full-table resize mechanism that takes advantage of parallelization.
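
    One common way to parallelize a full-table resize, sketched below in C++, relies on table doubling: every entry in old bucket i can only move to new bucket i or i + old_size, so threads that own disjoint stripes of old buckets never contend on new buckets. This is a generic illustration under assumed names, not the thesis's specific mechanism, and it deliberately ignores persistence ordering.

        // Parallel full-table resize by doubling (C++17). Both sizes are
        // powers of two, so an entry in old bucket i lands in new bucket
        // i or i + old_n; threads owning disjoint old buckets therefore
        // own disjoint new buckets and need no locks.
        #include <cstddef>
        #include <cstdint>
        #include <thread>
        #include <vector>

        struct Entry { uint64_t hash, key, value; Entry* next; };

        void parallel_double(std::vector<Entry*>& old_t,
                             std::vector<Entry*>& new_t, unsigned nthreads) {
            const std::size_t old_n = old_t.size();       // power of two
            new_t.assign(old_n * 2, nullptr);
            std::vector<std::thread> workers;
            for (unsigned t = 0; t < nthreads; ++t)
                workers.emplace_back([&, t] {
                    for (std::size_t i = t; i < old_n; i += nthreads)
                        for (Entry* e = old_t[i]; e != nullptr;) {
                            Entry* nxt = e->next;
                            std::size_t dst = e->hash & (old_n * 2 - 1);
                            e->next = new_t[dst];         // dst is i or i + old_n,
                            new_t[dst] = e;               // owned by this thread
                            e = nxt;
                        }
                });
            for (auto& w : workers) w.join();
        }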

    Benchmarking Eventually Consistent Distributed Storage Systems

    Cloud storage services and NoSQL systems typically offer only "eventual consistency", a rather weak guarantee covering a broad range of potential data consistency behavior. The degree of actual (in-)consistency, however, is unknown. This work presents novel solutions for determining the degree of (in-)consistency via simulation and benchmarking, as well as the means to resolve inconsistencies by leveraging this information.
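
    The kind of measurement such benchmarking performs can be sketched with a toy simulation: a writer updates a primary replica and propagates the update to a second replica after an artificial delay, while a reader measures how long the stale value remains visible. All delays and names below are invented for illustration.

        // Toy staleness measurement (C++17): a "replica" sees a write only
        // after an artificial replication delay; the reader measures how long
        // the stale value stayed visible.
        #include <atomic>
        #include <chrono>
        #include <iostream>
        #include <thread>

        int main() {
            std::atomic<int> primary{0}, replica{0};
            const auto write_time = std::chrono::steady_clock::now();

            std::thread writer([&] {
                primary.store(1);                          // write acknowledged
                std::this_thread::sleep_for(std::chrono::milliseconds(50));
                replica.store(1);                          // async propagation
            });

            while (replica.load() != 1)                    // poll until visible
                std::this_thread::yield();
            const auto visible = std::chrono::steady_clock::now();
            writer.join();

            std::cout << "observed staleness: "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(
                             visible - write_time).count()
                      << " ms\n";
        }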

    Infrastructure for Performance Monitoring and Analysis of Systems and Applications

    The growth of High Performance Computing (HPC) systems increases complexity with respect to understanding resource utilization, system management, and performance issues. HPC performance monitoring tools need to collect information at both the application and system levels to yield a complete performance picture. Existing approaches limit users' ability to perform meaningful analysis on an actionable timescale. Efficient infrastructures are required to support performance data analysis for large-scale systems in both run-time troubleshooting and post-run processing modes. In this dissertation, we present methods to fill these gaps in the infrastructure for HPC performance monitoring and analysis. First, we enhance the architecture of a monitoring system to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. Next, we present an approach to streaming collection of application performance data. We integrate these methods with a monitoring system used on large-scale computational platforms. Finally, we present a new approach for constructing durable transactional linked data structures that takes advantage of byte-addressable non-volatile memory technologies; transactional data structures are building blocks of the in-memory databases that HPC monitoring systems use to store and retrieve data efficiently. We evaluate the presented approaches in a series of case studies. The experimental results demonstrate the impact of our tools while keeping overhead within an acceptable margin.
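
    The durable transactional updates mentioned above hinge on persisting an undo-log record before the in-place update it guards. The following sketch, assuming x86 CLWB support and single-cache-line records, shows that discipline in its simplest form; real designs batch flushes, handle multi-line objects, and manage log space.

        // Undo-logging discipline behind a durable transactional update
        // (C++17, x86 with CLWB; compile with -mclwb). Assumes the log entry
        // fits in one cache line.
        #include <immintrin.h>   // _mm_clwb, _mm_sfence
        #include <cstdint>

        static inline void persist(const void* p) {
            _mm_clwb(const_cast<void*>(p));
            _mm_sfence();
        }

        struct LogEntry { uint64_t addr, old_val, valid; };

        // Order matters: if the update reached NVM before its undo record, a
        // crash in between would leave no way to roll back.
        void transactional_store(LogEntry* log, uint64_t* word, uint64_t val) {
            log->addr = reinterpret_cast<uint64_t>(word);  // 1. record old value
            log->old_val = *word;
            log->valid = 1;
            persist(log);                                  // 2. log durable first
            *word = val;                                   // 3. in-place update
            persist(word);                                 // 4. update durable
            log->valid = 0;                                // 5. retire the entry
            persist(log);
        }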

    Programming Languages and Systems

    This open access book constitutes the proceedings of the 31st European Symposium on Programming, ESOP 2022, held during April 5-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 21 regular papers in this volume were carefully reviewed and selected from 64 submissions. They deal with fundamental issues in the specification, design, analysis, and implementation of programming languages and systems.