31 research outputs found

    Synchronising C/C++ and POWER

    Get PDF
    Shared memory concurrency relies on synchronisation primitives: compare-and-swap, load-reserve/store-conditional (aka LL/SC), language-level mutexes, and so on. In a sequentially consistent setting, or even in the TSO setting of x86 and Sparc, these have well-understood semantics. But in the very relaxed settings of IBM®, POWER®, ARM, or C/C++, it remains surprisingly unclear exactly what the programmer can depend on. This paper studies relaxed-memory synchronisation. On the hardware side, we give a clear semantic characterisation of the load-reserve/store-conditional primitives as provided by POWER multiprocessors, for the first time since they were introduced 20 years ago; we cover their interaction with relaxed loads, stores, barriers, and dependencies. Our model, while not officially sanctioned by the vendor, is validated by extensive testing, comparing actual implementation behaviour against an oracle generated from the model, and by detailed discussion with IBM staff. We believe the ARM semantics to be similar. On the software side, we prove sound a proposed compilation scheme of the C/C++ synchronisation constructs to POWER, including C/C++ spinlock mutexes, fences, and read-modify-write operations, together with the simpler atomic operations for which soundness is already known from our previous work; this is a first step in verifying concurrent algorithms that use load-reserve/store-conditional with respect to a realistic semantics. We also build confidence in the C/C++ model in its own terms, fixing some omissions and contributing to the C standards committee adoption of the C++11 concurrency model

    TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA

    Full text link
    Memory consistency models (MCMs) which govern inter-module interactions in a shared memory system, are a significant, yet often under-appreciated, aspect of system design. MCMs are defined at the various layers of the hardware-software stack, requiring thoroughly verified specifications, compilers, and implementations at the interfaces between layers. Current verification techniques evaluate segments of the system stack in isolation, such as proving compiler mappings from a high-level language (HLL) to an ISA or proving validity of a microarchitectural implementation of an ISA. This paper makes a case for full-stack MCM verification and provides a toolflow, TriCheck, capable of verifying that the HLL, compiler, ISA, and implementation collectively uphold MCM requirements. The work showcases TriCheck's ability to evaluate a proposed ISA MCM in order to ensure that each layer and each mapping is correct and complete. Specifically, we apply TriCheck to the open source RISC-V ISA, seeking to verify accurate, efficient, and legal compilations from C11. We uncover under-specifications and potential inefficiencies in the current RISC-V ISA documentation and identify possible solutions for each. As an example, we find that a RISC-V-compliant microarchitecture allows 144 outcomes forbidden by C11 to be observed out of 1,701 litmus tests examined. Overall, this paper demonstrates the necessity of full-stack verification for detecting MCM-related bugs in the hardware-software stack.Comment: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating System

    A concurrency semantics for relaxed atomics that permits optimisation and avoids thin-air executions

    Get PDF
    Copyright is held by the owner/author(s). Despite much research on concurrent programming languages, especially for Java and C/C++, we still do not have a satisfactory definition of their semantics, one that admits all common optimisations without also admitting undesired behaviour. Especially problematic are the "thin-Air" examples involving high-performance concurrent accesses, such as C/C++11 relaxed atomics. The C/C++11 model is in a per-candidate-execution style, and previous work has identified a tension between that and the fact that compiler optimisations do not operate over single candidate executions in isolation; rather, they operate over syntactic representations that represent all executions. In this paper we propose a novel approach that circumvents this difficulty. We define a concurrency semantics for a core calculus, including relaxed-Atomic and non-Atomic accesses, and locks, that admits a wide range of optimisation while still forbidding the classic thin-Air examples. It also addresses other problems relating to undefined behaviour. The basic idea is to use an event-structure representation of the current state of each thread, capturing all of its potential executions, and to permit interleaving of execution and transformation steps over that to reflect optimisation (possibly dynamic) of the code. These are combined with a non-multi-copy-Atomic storage subsystem, to reflect common hardware behaviour. The semantics is defined in a mechanised and executable form, and designed to be implementable above current relaxed hardware and strong enough to support the programming idioms that C/C++11 does for this fragment. It offers a potential way forward for concurrent programming language semantics, beyond the current C/C++11 and Java models.This work was partly funded by the EPSRC Programme Grant REMS: Rigorous Engineering for Mainstream Systems, EP/K008528/

    The problem of programming language concurrency semantics

    Get PDF
    Despite decades of research, we do not have a satisfactory concurrency semantics for any general-purpose programming language that aims to support concurrent systems code. The Java Memory Model has been shown to be unsound with respect to standard compiler optimisations, while the C/C++11 model is too weak, admitting undesirable thin-air executions. Our goal in this paper is to articulate this major open problem as clearly as is currently possible, showing how it arises from the combination of multiprocessor relaxed-memory behaviour and the desire to accommodate current compiler optimisations. We make several novel contributions that each shed some light on the problem, constraining the possible solutions and identifying new difficulties. First we give a positive result, proving in HOL4 that the existing axiomatic model for C/C++11 guarantees sequentially consistent semantics for simple race-free programs that do not use low-level atomics (DRF-SC, one of the core design goals). We then describe the thin-air problem and show that it cannot be solved, without restricting current compiler optimisations, using any per-candidate-execution condition in the style of the C/C++11 model. Thin-air executions were thought to be confined to programs using relaxed atomics, but we further show that they recur when one attempts to integrate the concurrency model with more of C, mixing atomic and nonatomic accesses, and that also breaks the DRF-SC result. We then describe a semantics based on an explicit operational construction of out-of-order execution, giving the desired behaviour for thin-air examples but exposing further difficulties with accommodating existing compiler optimisations. Finally, we show that there are major difficulties integrating concurrency semantics with the C/C++ notion of undefined behaviour. We hope thereby to stimulate and enable research on this key issue

    TransForm: Formally Specifying Transistency Models and Synthesizing Enhanced Litmus Tests

    Full text link
    Memory consistency models (MCMs) specify the legal ordering and visibility of shared memory accesses in a parallel program. Traditionally, instruction set architecture (ISA) MCMs assume that relevant program-visible memory ordering behaviors only result from shared memory interactions that take place between user-level program instructions. This assumption fails to account for virtual memory (VM) implementations that may result in additional shared memory interactions between user-level program instructions and both 1) system-level operations (e.g., address remappings and translation lookaside buffer invalidations initiated by system calls) and 2) hardware-level operations (e.g., hardware page table walks and dirty bit updates) during a user-level program's execution. These additional shared memory interactions can impact the observable memory ordering behaviors of user-level programs. Thus, memory transistency models (MTMs) have been coined as a superset of MCMs to additionally articulate VM-aware consistency rules. However, no prior work has enabled formal MTM specifications, nor methods to support their automated analysis. To fill the above gap, this paper presents the TransForm framework. First, TransForm features an axiomatic vocabulary for formally specifying MTMs. Second, TransForm includes a synthesis engine to support the automated generation of litmus tests enhanced with MTM features (i.e., enhanced litmus tests, or ELTs) when supplied with a TransForm MTM specification. As a case study, we formally define an estimated MTM for Intel x86 processors, called x86t_elt, that is based on observations made by an ELT-based evaluation of an Intel x86 MTM implementation from prior work and available public documentation. Given x86t_elt and a synthesis bound as input, TransForm's synthesis engine successfully produces a set of ELTs including relevant ELTs from prior work.Comment: *This is an updated version of the TransForm paper that features updated results reflecting performance optimizations and software bug fixes. 14 pages, 11 figures, Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA

    Lasagne : a static binary translator for weak memory model architectures

    Get PDF
    Funding: This work was supported by a UK RISE Grant.The emergence of new architectures create a recurring challenge to ensure that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process where a program’s binary is automatically translated from one architecture to another, while preserving their original semantics. However, these SBT tools have limited support to various advanced architectural features. Importantly, they are currently unable to translate concurrent binaries. The main challenge arises from the mismatches of the memory consistency model specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. In this paper, we propose Lasagne, an end-to-end static binary translator with precise translation rules between x86 and Arm concurrency semantics. First, we propose a concurrency model for Lasagne’s intermediate representation (IR) and formally proved mappings between the IR and the two architectures. The memory ordering is preserved by introducing fences in the translated code. Finally, we propose optimizations focused on raising the level of abstraction of memory address calculations and reducing the number offences. Our evaluation shows that Lasagne reduces the number of fences by up to about 65%, with an average reduction of 45.5%, significantly reducing their runtime overhead.Postprin
    corecore