79 research outputs found

    Clarifying and compiling C/C++ concurrency: from C++11 to POWER

    Get PDF
    The upcoming C and C++ revised standards add concurrency to the languages, for the first time, in the form of a subtle *relaxed memory model* (the *C++11 model*). This aims to permit compiler optimisation and to accommodate the differing relaxed-memory behaviours of mainstream multiprocessors, combining simple semantics for most code with high-performance *low-level atomics* for concurrency libraries. In this paper, we first establish two simpler but provably equivalent models for C++11, one for the full language and another for the subset without consume operations. Subsetting further to the fragment without low-level atomics, we identify a subtlety arising from atomic initialisation and prove that, under an additional condition, the model is equivalent to sequential consistency for race-free programs

    Polynomial-Time Fence Insertion for Structured Programs

    Get PDF
    To enhance performance, common processors feature relaxed memory models that reorder instructions. However, the correctness of concurrent programs is often dependent on the preservation of the program order of certain instructions. Thus, the instruction set architectures offer memory fences. Using fences is a subtle task with performance and correctness implications: using too few can compromise correctness and using too many can hinder performance. Thus, fence insertion algorithms that given the required program orders can automatically find the optimum fencing can enhance the ease of programming, reliability, and performance of concurrent programs. In this paper, we consider the class of programs with structured branch and loop statements and present a greedy and polynomial-time optimum fence insertion algorithm. The algorithm incrementally reduces fence insertion for a control-flow graph to fence insertion for a set of paths. In addition, we show that the minimum fence insertion problem with multiple types of fence instructions is NP-hard even for straight-line programs

    Compositional Verification of Compiler Optimisations on Relaxed Memory

    Get PDF
    This paper is about verifying program transformations on an axiomatic relaxed memory model of the kind used in C/C++ and Java. Relaxed models present particular challenges for verifying program transformations, because they generate many additional modes of interaction between code and context. For a block of code being transformed, we define a denotation from its behaviour in a set of representative contexts. Our denotation summarises interactions of the code block with the rest of the program both through local and global variables, and through subtle synchronisation effects due to relaxed memory. We can then prove that a transformation does not introduce new program behaviours by comparing the denotations of the code block before and after. Our approach is compositional: by examining only representative contexts, transformations are verified for any context. It is also fully abstract, meaning any valid transformation can be verified. We cover several tricky aspects of C/C++-style memory models, including release-acquire operations, sequentially consistent fences, and non-atomics. We also define a variant of our denotation that is finite at the cost of losing full abstraction. Based on this variant, we have implemented a prototype verification tool and ap

    Lasagne : a static binary translator for weak memory model architectures

    Get PDF
    Funding: This work was supported by a UK RISE Grant.The emergence of new architectures create a recurring challenge to ensure that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process where a program’s binary is automatically translated from one architecture to another, while preserving their original semantics. However, these SBT tools have limited support to various advanced architectural features. Importantly, they are currently unable to translate concurrent binaries. The main challenge arises from the mismatches of the memory consistency model specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. In this paper, we propose Lasagne, an end-to-end static binary translator with precise translation rules between x86 and Arm concurrency semantics. First, we propose a concurrency model for Lasagne’s intermediate representation (IR) and formally proved mappings between the IR and the two architectures. The memory ordering is preserved by introducing fences in the translated code. Finally, we propose optimizations focused on raising the level of abstraction of memory address calculations and reducing the number offences. Our evaluation shows that Lasagne reduces the number of fences by up to about 65%, with an average reduction of 45.5%, significantly reducing their runtime overhead.Postprin

    Overhauling SC atomics in C11 and OpenCL

    Get PDF
    Despite the conceptual simplicity of sequential consistency (SC), the semantics of SC atomic operations and fences in the C11 and OpenCL memory models is subtle, leading to convoluted prose descriptions that translate to complex axiomatic formalisations. We conduct an overhaul of SC atomics in C11, reducing the associated axioms in both number and complexity. A consequence of our simplification is that the SC operations in an execution no longer need to be totally ordered. This relaxation enables, for the first time, efficient and exhaustive simulation of litmus tests that use SC atomics. We extend our improved C11 model to obtain the first rigorous memory model formalisation for OpenCL (which extends C11 with support for heterogeneous many-core programming). In the OpenCL setting, we refine the SC axioms still further to give a sensible semantics to SC operations that employ a ‘memory scope’ to restrict their visibility to specific threads. Our overhaul requires slight strengthenings of both the C11 and the OpenCL memory models, causing some behaviours to become disallowed. We argue that these strengthenings are natural, and that all of the formalised C11 and OpenCL compilation schemes of which we are aware (Power and x86 CPUs for C11, AMD GPUs for OpenCL) remain valid in our revised models. Using the HERD memory model simulator, we show that our overhaul leads to an exponential improvement in simulation time for C11 litmus tests compared with the original model, making exhaustive simulation competitive, time-wise, with the non-exhaustive CDSChecker tool

    Concurrency-aware thread scheduling for high-level synthesis

    Get PDF
    When mapping C programs to hardware, high-level synthesis (HLS) tools seek to reorder instructions so they can be packed into as few clock cycles as possible. However, when synthesising multi-threaded C, instruction reordering is inhibited by the presence of atomic operations (‘atomics’), such as compare- and-swap. Atomics, the fundamental concurrency primitive in C, are the basis of more abstract concurrency mechanisms such as locks, and also of efficient lock-free data structures. Whether a particular atomic can be legally reordered within a thread can depend on the memory access patterns of other threads. Existing HLS tools that support atomics typically sched- ule each thread independently, and so must be conservative when optimising around atomics. Yet HLS tools are distinguished from conventional compilers by having the entire program available. Can this information be exploited to allow more reorderings within each thread, and hence to obtain more efficient schedules? In this work, we propose a global analysis that determines, for each thread, which pairs of instructions must not be reordered. Our analysis is sensitive to the C consistency mode of the atomics involved (e.g. relaxed, release, acquire, and sequentially- consistent). We have used the Alloy model checker to validate our analysis against the C language standard, and have implemented it in the LegUp HLS tool. An evaluation on several lock-free data structure benchmarks indicates that our analysis leads to a 1.6 × average global speedup

    Automatically Comparing Memory Consistency Models

    Get PDF
    A memory consistency model (MCM) is the part of a programming language or computer architecture specification that defines which values can legally be read from shared memory locations. Because MCMs take into account various optimisations employed by archi- tectures and compilers, they are often complex and counterintu- itive, which makes them challenging to design and to understand. We identify four tasks involved in designing and understanding MCMs: generating conformance tests, distinguishing two MCMs, checking compiler optimisations, and checking compiler mappings. We show that all four tasks are instances of a general constraint-satisfaction problem to which the solution is either a program or a pair of programs. Although this problem is intractable for automatic solvers when phrased over programs directly, we show how to solve analogous constraints over program executions, and then construct programs that satisfy the original constraints. Our technique, which is implemented in the Alloy modelling framework, is illustrated on several software- and architecture-level MCMs, both axiomatically and operationally defined. We automatically recreate several known results, often in a simpler form, including: distinctions between variants of the C11 MCM; a failure of the ‘SC-DRF guarantee’ in an early C11 draft; that x86 is ‘multi-copy atomic’ and Power is not; bugs in common C11 compiler optimisations; and bugs in a compiler mapping from OpenCL to AMD-style GPUs. We also use our technique to develop and validate a new MCM for NVIDIA GPUs that supports a natural mapping from OpenCL

    Locality and Singularity for Store-Atomic Memory Models

    Full text link
    Robustness is a correctness notion for concurrent programs running under relaxed consistency models. The task is to check that the relaxed behavior coincides (up to traces) with sequential consistency (SC). Although computationally simple on paper (robustness has been shown to be PSPACE-complete for TSO, PGAS, and Power), building a practical robustness checker remains a challenge. The problem is that the various relaxations lead to a dramatic number of computations, only few of which violate robustness. In the present paper, we set out to reduce the search space for robustness checkers. We focus on store-atomic consistency models and establish two completeness results. The first result, called locality, states that a non-robust program always contains a violating computation where only one thread delays commands. The second result, called singularity, is even stronger but restricted to programs without lightweight fences. It states that there is a violating computation where a single store is delayed. As an application of the results, we derive a linear-size source-to-source translation of robustness to SC-reachability. It applies to general programs, regardless of the data domain and potentially with an unbounded number of threads and with unbounded buffers. We have implemented the translation and verified, for the first time, PGAS algorithms in a fully automated fashion. For TSO, our analysis outperforms existing tools
    corecore