730 research outputs found

    Tailoring Transactional Memory to Real-World Applications

    Get PDF
    Transactional Memory (TM) promises to provide a scalable mechanism for synchronizationin concurrent programs, and to offer ease-of-use benefits to programmers. Since multiprocessorarchitectures have dominated CPU design, exploiting parallelism in program

    TANDEM: taming failures in next-generation datacenters with emerging memory

    Get PDF
    The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery, minimizing service interruptions. Applications, owing to recovery, entail additional measures to maintain a recoverable state of data and computation logic during their failure-free execution. However, these precautionary measures have severe implications on performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges; Their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for implementing recovery, complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness including recovery; and (4) how to preserve programmer productivity (Programmability). This thesis aims to address these challenges through the following approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery. We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeop- ardizing recovery and correctness. Therefore, recovery needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitecture mechanism, lazy release persistence (LRP). Using standard LFDs benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance. We continue our discussion with memory disaggregation which decouples memory from traditional monolithic servers, offering a promising pathway for achieving very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVS. Our experimental implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel target litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART’s target testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating any intervention from the programmers

    Accelerating Transactional Memory by Exploiting Platform Specificity

    Get PDF
    Transactional Memory (TM) is one of the most promising alternatives to lock-based concurrency, but there still remain obstacles that keep TM from being utilized in the real world. Performance, in terms of high scalability and low latency, is always one of the most important keys to general purpose usage. While most of the research in this area focuses on improving a specific single TM implementation and some default platform (a certain operating system, compiler and/or processor), little has been conducted on improving performance more generally, and across platforms.We found that by utilizing platform specificity, we could gain tremendous performance improvement and avoid unnecessary costs due to false assumptions of platform properties, on not only a single TM implementation, but many. In this dissertation, we will present our findings in four sections: 1) we discover and quantify hidden costs from inappropriate compiler instrumentations, and provide sug- gestions and solutions; 2) we boost a set of mainstream timestamp-based TM implementations with the x86-specific hardware cycle counter; 3) we explore compiler opportunities to reduce the transaction abort rate, by reordering read-modify-write operations — the whole technique can be applied to all TM implementations, and could be more effective with some help from compilers; and 4) we coordinate the state-of-the-art Intel Haswell TSX hardware TM with a software TM “Cohorts”, and develop a safe and flexible Hybrid TM, “HyCo”, to be our final performance boost in this dissertation.The impact of our research extends beyond Transactional Memory, to broad areas of concurrent programming. Some of our solutions and discussions, such as the synchronization between accesses of the hardware cycle counter and memory loads and stores, can be utilized to boost concurrent data structures and many timestamp-based systems and applications. Others, such as discussions of compiler instrumentation costs and reordering opportunities, provide additional insights to compiler designers. Our findings show that platform specificity must be taken into consideration to achieve peak performance

    Efficient Precise Dynamic Data Race Detection For Cpu And Gpu

    Get PDF
    Data races are notorious bugs. They introduce non-determinism in programs behavior, complicate programs semantics, making it challenging to debug parallel programs. To make parallel programming easier, efficient data race detection has been a research topic in the last decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their application to real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, which has a disparate programming and execution model from CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs

    Fast and transparent recovery for continuous availability of cluster-based servers

    Full text link
    corecore