The ISA is a multi-part specification of hardware behavior as seen by software. One significant, yet often underappreciated, aspect of this specification is the memory consistency model (MCM) which governs inter-module interactions in a shared memory system. This paper makes a case for MCM-aware ISA design, and provides a toolflow to support it. Our primary contribution is a methodology for evaluating and refining an ISA's MCM to ensure a correct and complete specification. We also apply our framework to the open source RISC-V ISA [51], focusing on the goal of accurate, efficient, and legal compilations from C/C++. We uncover under-specifications and potential inefficiencies in the current RISC-V ISA documentation and identify possible solutions for each. As an example, we find that a RISC-V-compliant microarchitecture allows 144 outcomes forbidden by C/C++ to be observed out of 1,701 litmus tests examined. Overall, this paper demonstrates the effectiveness of our methodology for pinpointing MCM-related issues early in the ISA design process.
Introduction
The Instruction Set Architecture (ISA) is a definitive specification of hardware behavior as seen by software. It is the target for compiler translation and the specification for what the hardware must implement. The ISA comprises two primary components: an operational specification that defines how instructions modify software-visible state, and an often underappreciated memory consistency model (MCM) [26, 1, 18]. In the era of multi-core, the ISA must provide an MCM abstraction for inter-core communication by defining the values that a read of a shared memory location is allowed to return. Three main categories of errors arise when translating a program from a high-level language (HLL) into correct, portable, and efficient assembly code: (1) incorrect compilation or mapping of instructions from the HLL to the target ISA, (2) an inadequate ISA specification, and (3) incorrect microarchitectural implementations of the ISA. If the ISA MCM is inadequately or incorrectly specified, or if a microarchitecture fails to implement the defined MCM, code compiled for that ISA may produce incorrect results.
Consider the C++ program in Figure 1a. When compiled by Clang++ v3.8, the resulting program intermittently produces a result which is illegal according to the C/C++ HLL memory model specification [13] when run on some ARM hardware platforms. This behavior was first reported by Alglave et al. [2], and we have observed this phenomenon on a Galaxy Nexus (ARM Cortex-A9) and a Nexus 6 (Qualcomm Snapdragon 805). Mis- and under-specification of MCMs in modern hardware is a real problem that leads to processors producing incorrect or counter-intuitive outcomes. This particular example, the ARM load-after-load hazard, was formally acknowledged by ARM in 2011 as a "problem that can occur, in rare circumstances, when performing successive reads from the same memory location on a Cortex-A9 MPCore processor": program order is not preserved between successive reads from the same location [4]. Whether this is the result of an incorrect compiler mapping, an inadequate ISA specification, or an incorrect hardware implementation is largely irrelevant. Figure 2 depicts a "sliding lever" of design options regarding compiler- and hardware-enforced orderings. Regardless of how ordering responsibilities are apportioned at various layers of the hardware-software stack, MCMs require careful and precise translation at all layer interfaces in order to assure correct program operation.
Central to precise translation and implementation is a precise ISA MCM specification. If the ISA MCM is unclear, or is defined in a way that is fundamentally at odds with the HLL requirements it intends to support, there is no longer a verifiable interface for compilers to target and hardware to implement.
Through this paper, we aim to make MCMs a first-class citizen of ISA design through the following contributions:
• We present a framework intended to enable architects to iteratively design and refine an ISA MCM in an effort to provide an accurate and minimally-constrained target for compiled C/C++ programs. Our methodology systematically compares the language-level execution semantics of C/C++ programs with their corresponding ISA-level executions on microarchitectural implementations of the ISA in question. When a microarchitectural execution fundamentally differs from its corresponding language level execution, we determine if this is the result of an incorrect compiler mapping, ISA specification, or hardware implementation.
• We apply our toolflow for ISA MCM evaluation and design to the open-source RISC-V ISA [51] to validate its applicability to modern ISA design, and determine how accurately and precisely the current specification can serve as a compiler target for C/C++ programs. As an example, we find that a RISC-V-compliant microarchitecture allows 144 outcomes forbidden by C/C++ to be observed out of 1,701 litmus tests examined.
• We propose improvements to the RISC-V MCM specification based on the results of our evaluation, in order to address the model's current shortcomings.
Background: Features of Memory Models
MCMs specify the rules and guarantees governing the ordering and visibility of accesses to shared memory. Frequently regarded as the most intuitive MCM, Sequential Consistency (SC) [26] requires that the result of a program execution is the same as if all cores execute their own instructions in program order (PO), and a total global order exists on all instructions from all cores. In hardware, there have been many attempts at mitigating SC's performance cost, commonly leveraging techniques such as aggressive post-retirement speculation and rolling back execution in the case of a coherence violation [11, 14, 19, 20, 39, 52]. Additionally, techniques have been proposed that aim to enforce SC only for conflicting accesses [21, 27, 44]. Nevertheless, single-core performance optimizations violate SC, and most manufacturers have elected to build hardware with MCMs that relax SC. Various issues can arise when the effects of weakening MCM features are not carefully considered at ISA design time.
Coherence and Same-Address Ordering
Coherence ensures that (1) all stores are eventually made visible to all cores and (2) there exists a single total order on all stores to the same address that all threads agree on [18, 17] . Consistency can be thought of as a superset of coherence in that it is additionally concerned with orderings of accesses to different addresses. Accesses from the same thread to the same address generally must maintain local PO-specified orderings, but there are exceptions: some old Power models and SPARC RMO relax same-address load-after-load ordering [46, 47] . Notably, imprecision in the coherence specification led to the ARM load-after-load hazard discussed in Section 1. ARM acknowledged that due to the vast number of load instructions in programs, binary patching in the linker is infeasible; they instead suggest that compilers be rewritten to issue a dmb fence instruction immediately following susceptible loads. To demonstrate the cost of imprecise ISA MCM specifications, we estimate the overhead of this workaround using the parallel sieve of Eratosthenes algorithm [12] . This application gives the same results regardless of whether there is any synchronization between threads. Thus, its reading and marking of entries can be implemented with either relaxed atomics or sequentially consistent atomics without compromising correctness.
We implemented three variants of the parallel sieve algorithm. The first variant uses relaxed atomics. The second uses relaxed atomics with a dmb ish fence added after the relaxed load, in accordance with ARM's recommended fix for the hazard. The third uses sequentially consistent atomics (implemented with DMB fences). We ran all variants with problem size = 10^8 on a Samsung Galaxy S7 with an Exynos 8890 8-core processor, varying the number of threads between 1 and 32. Figure 3 compares the three variants; in particular, it shows the overheads of the SC atomic variant and the relaxed variant with the fence fix compared to the relaxed atomic variant (which is the fastest of the three).
The relaxed variant with the fix is always slower than the relaxed atomic variant due to the extra dmb ish fence. The overhead of the fix is 20.3% additional execution time on average. Furthermore, the performance of the fixed variant degrades to the level of fully sequentially consistent atomics for thread counts over 13. This experiment indicates that the overhead of fixing the load-after-load hazard can be quite significant. We revisit the issue of same address load-load ordering in Section 5.1.3.
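The three sieve variants described above can be sketched in C++ as a single template. This is a minimal illustration, not the code used for the measurements: the name parallel_sieve, its parameters, and the strided work split are assumptions. In particular, std::atomic_thread_fence(std::memory_order_seq_cst) is used as a portable stand-in for the dmb ish fence; compilers targeting ARM typically emit a dmb ish for it, but that is not guaranteed by the language.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sketch only; parallel_sieve and its parameters are hypothetical
// names, not the benchmark code from the text.
// Order          : memory order for every atomic load and store
//                  (relaxed or seq_cst are the intended choices).
// FenceAfterLoad : if true, a full fence follows each load, standing in for
//                  the "dmb ish after susceptible loads" workaround.
template <std::memory_order Order, bool FenceAfterLoad>
std::size_t parallel_sieve(std::size_t n, unsigned num_threads) {
    if (n < 2 || num_threads == 0) return 0;  // degenerate inputs
    std::vector<std::atomic<bool>> composite(n + 1);
    for (auto& c : composite) c.store(false, std::memory_order_relaxed);

    // Each thread takes a strided share of candidate factors. A thread may
    // miss another thread's mark and redo some work, but it only ever marks
    // true composites, so the result does not depend on synchronization.
    auto worker = [&composite, n, num_threads](unsigned tid) {
        for (std::size_t p = 2 + tid; p * p <= n; p += num_threads) {
            bool marked = composite[p].load(Order);
            if (FenceAfterLoad)
                std::atomic_thread_fence(std::memory_order_seq_cst);
            if (marked) continue;
            for (std::size_t m = p * p; m <= n; m += p)
                composite[m].store(true, Order);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();

    // Count the entries that were never marked composite.
    std::size_t count = 0;
    for (std::size_t i = 2; i <= n; ++i)
        if (!composite[i].load(std::memory_order_relaxed)) ++count;
    return count;
}
```

Under these assumptions, the relaxed variant is parallel_sieve<std::memory_order_relaxed, false>, the variant with the fence fix is parallel_sieve<std::memory_order_relaxed, true>, and the SC variant is parallel_sieve<std::memory_order_seq_cst, false>. All three return the same prime count; only their cost differs.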
Dependencies
A dependency relates a load with a PO-later load or store. An address dependency results when the address of a load or store depends on the value returned by a PO-prior load. A data dependency exists between a load and a PO-later store when the store's value depends on the loaded value. A control dependency occurs when a load or store is in a branch that is controlled by the value returned by a PO-prior load. Intuitively, dependencies would seem to imply that ordering must be enforced, as the dependee seemingly cannot execute until it has all of its inputs available. However, in the presence of microarchitectural speculation, the dependee can in fact behave as if it were reordered with the instruction it depends on [33] , unless such behavior is explicitly prevented by the ISA specification. We address dependencies in the context of the RISC-V memory model in Section 4.1.
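The three kinds of dependency can be illustrated at the source level with the following C++ sketch. All names (idx, flag, out, table, and the three functions) are hypothetical. The dependencies defined above are properties of the instructions a compiler emits, and a C++ compiler is not required to preserve them for relaxed atomics, so this only shows the shape of each dependency, not a guaranteed ordering.

```cpp
#include <atomic>

// Hypothetical shared state used only for illustration.
std::atomic<int> idx{0};
std::atomic<int> flag{0};
std::atomic<int> out{0};
int table[4] = {10, 20, 30, 40};

// Address dependency: the address of the second load is computed from the
// value returned by the first load.
int address_dependency() {
    int i = idx.load(std::memory_order_relaxed);
    return table[i & 3];
}

// Data dependency: the value written by the store is computed from the value
// returned by the earlier load.
void data_dependency() {
    int v = idx.load(std::memory_order_relaxed);
    out.store(v + 1, std::memory_order_relaxed);
}

// Control dependency: the store sits in a branch whose condition is the value
// returned by the earlier load.
void control_dependency() {
    if (flag.load(std::memory_order_relaxed) == 1)
        out.store(42, std::memory_order_relaxed);
}
```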
Store Atomicity, Cumulativity, and C/C++ Atomics
As defined by Collier, a store is multiple-copy atomic if all cores in the system, including the performing core, see the updated value at the same instant [15]. To enable microarchitectural optimizations like store buffers, some architectures allow a core to read its own writes prior to their being made visible to other cores; we refer to this as ROWE-multiple-copy atomic (rMCA), where ROWE stands for Read Own Write Early. However, rMCA writes must be made visible to cores other than the performing core at the same time. Weaker models, like ARM and Power, feature non-multiple-copy atomic (nMCA) stores that may become visible to some remote cores before they become visible to others. Figure 4 demonstrates the often counterintuitive effects of nMCA stores. The specified non-SC outcome corresponds to a causality chain where T0 sets a flag by writing 1 to x, and T1 reads the updated value of x, subsequently setting its own flag by writing 1 to y. T2 then sees the update of y, reading 1; however, it has still not observed the update of x and reads its value as 0. If this C++ program is compiled down to regular loads and stores on an nMCA system, the forbidden outcome will (perhaps surprisingly) be observable.
C/C++ supports acquire and release operations to synchronize between threads. These operations were initially proposed as part of release consistency (RC) [18] . An acquire ensures that it was made visible before accesses after the acquire in program order, while a release ensures that accesses before it in program order were made visible before the release. The store and load of y in Figure 4 form a release-acquire pair that synchronizes the instructions on T1 and T2. C/C++ additionally requires release-acquire synchronization to be transitive [10, 13] . This means that T2 must observe the store to x when it acquires y, because T1 observed the store to x before its release of y. As a result, the outcome in Figure 4 is forbidden.
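The causality chain described above can be written as a small C++ litmus harness. This is a sketch under assumptions: the figure itself is not reproduced here, so the choice of relaxed accesses to x and a release/acquire pair on y is inferred from the surrounding text, and the function names are hypothetical. A harness like this can only fail to observe a forbidden outcome; it does not prove that an implementation is correct.

```cpp
#include <atomic>
#include <thread>

// Runs the WRC (write-to-read causality) shape once and reports whether the
// outcome forbidden by C/C++ (r1 == 1, r2 == 1, r3 == 0) was observed.
bool wrc_forbidden_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0, r3 = 0;

    std::thread t0([&] { x.store(1, std::memory_order_relaxed); });
    std::thread t1([&] {
        r1 = x.load(std::memory_order_relaxed);   // observe T0's flag
        y.store(1, std::memory_order_release);    // release: publish y
    });
    std::thread t2([&] {
        r2 = y.load(std::memory_order_acquire);   // acquire: synchronize with T1
        r3 = x.load(std::memory_order_relaxed);   // must not be older than r1's view
    });
    t0.join();
    t1.join();
    t2.join();
    return r1 == 1 && r2 == 1 && r3 == 0;
}

// Repeats the test and counts forbidden outcomes; a conforming compiler and
// hardware combination should always return 0.
int count_wrc_forbidden(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i)
        if (wrc_forbidden_once()) ++forbidden;
    return forbidden;
}
```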
To implement release-acquire synchronization using fences on an nMCA architecture, the fences must be cumulative. Cumulative fences ensure that accesses before the fence in program order as well as writes observed prior to the fence are made visible to cores reading from writes after the fence, which is sufficient to support C/C++ release-acquire synchronization. The Power architecture implements release-acquire synchronization using such cumulative fences.
C/C++ also supports sequentially consistent (SC) atomics. An SC load is an acquire operation and an SC store is a release operation, and there must also be a total order on all SC atomic operations that is respected by all cores. As such, the program in Figure 5 must forbid the listed outcome, as there is no total order of SC operations that would allow it. At the architecture level, cumulative fences are not sufficient to implement the required ordering for this program. Even if a cumulative fence were placed between each pair of loads on T2 and T3, neither T2 nor T3 reads from writes after the fences, so the writes observed before the fences need not be propagated to other cores. Instead, fences used to implement C/C++ SC atomics must be strong cumulative fences. Intuitively, a strong cumulative fence enforces that all accesses before the fence (including observed writes from other cores) are propagated to all other cores before any instruction (from any core) after the fence is made visible to any core. ARM and Power use strong cumulative fences to implement C/C++ SC atomics. Cumulativity is difficult to define completely; we refer the reader to previous work [2] for a full definition.
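The four-thread program discussed above has the IRIW (independent reads of independent writes) shape. A possible C++ rendering with SC atomics is sketched below; the function names are hypothetical and the exact program in the figure is assumed rather than copied. The forbidden outcome is the one in which the two reader threads disagree on the order of the two writes.

```cpp
#include <atomic>
#include <thread>

// Runs the IRIW shape once with seq_cst atomics and reports whether the
// readers disagreed on the order of the writes to x and y.
bool iriw_forbidden_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

    std::thread t0([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread t1([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread t2([&] {
        r1 = x.load(std::memory_order_seq_cst);
        r2 = y.load(std::memory_order_seq_cst);
    });
    std::thread t3([&] {
        r3 = y.load(std::memory_order_seq_cst);
        r4 = x.load(std::memory_order_seq_cst);
    });
    t0.join();
    t1.join();
    t2.join();
    t3.join();
    // T2 saw x before y, while T3 saw y before x: no single total order of
    // the SC operations can explain both observations.
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}

// Repeats the test and counts forbidden outcomes.
int count_iriw_forbidden(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i)
        if (iriw_forbidden_once()) ++forbidden;
    return forbidden;
}
```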
The C/C++ memory model has various forms of synchronization with different strength/performance tradeoffs, and so ISAs often provide a corresponding set of synchronization primitives with similar tradeoffs. For example, strong cumulative fences are capable of implementing release-acquire synchronization, but enforce unnecessary orderings while doing so. We explore the impact of synchronization primitive variety and strength in Section 5.2 and Section 6. 
Our Framework for MCM-Aware ISA Design
ISAs are designed somewhat frequently and evolve even more often. For example, ARM has recently added one-way acquire and release barriers to better accommodate the requirements of C/C++ atomics [5] . There is a window of opportunity in which ISA design and modification can take place prior to its official release as a target for compiler tool-chains and as an implementation specification for hardware design. Inspired by the effects of weakening MCM features and potential practical issues that arise when ISA MCMs are inadequately specified, we have designed a toolflow and methodology for ISA MCM evaluation and refinement. Our framework acknowledges the roles of software, hardware, and ISA in memory model design, and it incorporates the subtle ways in which common microarchitectural optimizations manifest at the architecture level.
Background: µCheck and Herd
Our approach builds on the µCheck [28, 29, 32] family of tools. A hardware designer can use a domain-specific language called µSpec to describe a processor by defining a set of ordering axioms. This specification along with a collection of user-provided litmus tests and corresponding required outcomes for each test serve as inputs to µCheck. µCheck evaluates correctness of the processor model by comparing the required litmus test outcomes with outcomes that are observable on the model. Our framework uses these tools along with Herd [2], a memory model simulator that takes as input a user-defined MCM (in a concise format) and a litmus test and outputs all executions of that test that are permitted by the model. Recent work has added support for language-level MCMs, and in particular, a model has been constructed for C/C++ [7], which we use.
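To make the idea of "all executions of a litmus test permitted by a model" concrete, the following C++ sketch enumerates every interleaving of a small litmus test under SC and collects the resulting register outcomes. It is not how Herd or µCheck work internally (both evaluate axiomatic models); it is a brute-force illustration for the simplest model only, and every name in it (Op, Thread, Outcome, sc_outcomes, wrc_test) is hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// One memory operation: a store writes `arg` to location `loc`;
// a load reads location `loc` into register number `arg`.
struct Op {
    bool is_store;
    int loc;
    int arg;
};
using Thread = std::vector<Op>;
using Outcome = std::vector<int>;  // final register values

// Depth-first search over all interleavings that respect program order.
void explore(const std::vector<Thread>& prog, std::vector<std::size_t>& pc,
             std::vector<int>& mem, Outcome& regs, std::set<Outcome>& out) {
    bool done = true;
    for (std::size_t t = 0; t < prog.size(); ++t) {
        if (pc[t] == prog[t].size()) continue;  // thread t has finished
        done = false;
        const Op& op = prog[t][pc[t]];
        int saved = op.is_store ? mem[op.loc] : regs[op.arg];
        if (op.is_store) mem[op.loc] = op.arg;
        else regs[op.arg] = mem[op.loc];
        ++pc[t];
        explore(prog, pc, mem, regs, out);
        --pc[t];                                 // undo and try the next thread
        if (op.is_store) mem[op.loc] = saved;
        else regs[op.arg] = saved;
    }
    if (done) out.insert(regs);
}

// Returns every register outcome reachable under SC; memory starts at zero.
std::set<Outcome> sc_outcomes(const std::vector<Thread>& prog,
                              int num_locs, int num_regs) {
    std::vector<std::size_t> pc(prog.size(), 0);
    std::vector<int> mem(num_locs, 0);
    Outcome regs(num_regs, 0);
    std::set<Outcome> out;
    explore(prog, pc, mem, regs, out);
    return out;
}

// WRC shape: location 0 is x, location 1 is y; registers r0, r1, r2.
std::vector<Thread> wrc_test() {
    Thread t0{Op{true, 0, 1}};                     // T0: x = 1
    Thread t1{Op{false, 0, 0}, Op{true, 1, 1}};    // T1: r0 = x; y = 1
    Thread t2{Op{false, 1, 1}, Op{false, 0, 2}};   // T2: r1 = y; r2 = x
    return {t0, t1, t2};
}
```

Calling sc_outcomes(wrc_test(), 2, 3) yields a set that does not contain the outcome r0=1, r1=1, r2=0, which is the non-SC outcome of the WRC shape. Weaker models permit more outcomes than this enumeration produces.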
Iterative MCM-Aware ISA Design
As depicted in Figure 2 , the ISA MCM specification is the contract between software and hardware that determines which orderings should be enforced by hardware and which are the responsibility of the compiler to enforce by leveraging ISA instructions. In this work, we focus on the time in the design process when the ISA MCM can be modified. When the ISA MCM is a fluid design component, it becomes necessary to have the ability to modify both microarchitectural implementations of that ISA MCM as well as compiler mappings from C/C++ MCM primitives to the ISA MCM.
In this light, our framework uses the µCheck framework to consider ISA MCM and microarchitecture co-design. The µCheck framework has been used in prior work as a way to verify that a microarchitecture is correctly upholding the requirements of its corresponding ISA MCM [28, 29, 32]. Here we show how we can leverage this framework to assist in determining if a specific combination of a C/C++→ISA MCM compiler mapping, an ISA MCM specification, and a microarchitectural implementation can prohibit all illegal-according-to-C/C++ executions and permit as many legal-according-to-C/C++ executions as possible (i.e., minimally constrain C/C++ programs).
µCheck also demonstrates that subtle microarchitectural modifications may directly affect the ISA-level MCM, and consequently how it can (or cannot) serve as a compiler target for C/C++ programs. While architects are selecting hardware optimizations for improved performance or simplifications for ease of verification, they can simultaneously study the effects of these choices on the ISA-visible MCM and the ability of their design to accurately and efficiently support C/C++ programs. µCheck is a robust framework that enables designers to model speculative execution as well as the subtle interplay between coherence and consistency [32] and virtual memory and consistency [29] . tests is run on the µCheck model of the ISA to determine the outcomes that are observable or unobservable on the microarchitecture.
6. HLL-MICROARCHITECTURE EQUIVALENCE CHECK: The results of Step 3 and Step 5 are compared for each test to determine if the microarchitecturally realizable outcomes are stronger than, weaker than, or equivalent to the outcomes required by the C/C++ model. A stronger than (resp. weaker than) outcome corresponds to a C/C++ program that is permitted (resp. forbidden) by the C/C++ MCM, yet unobservable (resp. observable) on the microarchitectural implementation of the ISA. If Step 6 concludes that the microarchitecturally realizable outcomes are more restrictive than what C/C++ requires, the designer may wish to conduct refinement for performance reasons. On the other hand, refinement is mandatory when outcomes forbidden by C/C++ are observable.
7. REFINEMENT: This step corresponds to refining any combination of the ISA MCM, compiler mappings, and microarchitectural implementation. Refinement refers to the process of modifying an input in response to a microarchitectural implementation that differs from the C/C++-specified outcome for a given set of program executions.
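The comparison and the refinement rule above reduce to a small decision table, sketched here in C++. The names Verdict, classify, and refinement_mandatory are hypothetical; the inputs stand for the per-outcome results of the C/C++ model and of the microarchitecture model.

```cpp
// Verdict on the microarchitecture relative to the C/C++ model for one
// litmus-test outcome.
enum class Verdict { Equivalent, Stronger, Weaker };

// permitted_by_hll    : the C/C++ model allows the outcome.
// observable_on_uarch : the microarchitecture model can produce the outcome.
Verdict classify(bool permitted_by_hll, bool observable_on_uarch) {
    if (permitted_by_hll == observable_on_uarch) return Verdict::Equivalent;
    // Permitted but unobservable: the hardware is more restrictive (stronger).
    // Forbidden but observable: the hardware is too permissive (weaker).
    return permitted_by_hll ? Verdict::Stronger : Verdict::Weaker;
}

// Refinement is optional for Stronger (a performance concern) and mandatory
// for Weaker (a correctness bug).
bool refinement_mandatory(Verdict v) { return v == Verdict::Weaker; }
```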
Microarchitectural Implementations
To support our framework, we have implemented a set of baseline microarchitectures in µSpec that relax various aspects of program order and store atomicity. These models, summarized in Table 7, can be extended to implement specific ISA primitives and subsequently serve as inputs into our tool-flow. We achieve these relaxation combinations as follows:
1. WR: W→R reordering is achieved by buffering stores in a FIFO queue prior to eventually pushing them out to the rest of the memory hierarchy. Value forwarding is disallowed, but younger loads may complete when their effective address does not match the address of a pending store.
2. rWR: Builds on WR by allowing value forwarding from pending stores to later loads of the same address.
3. rWM: Extends rWR by allowing writes (to different addresses) to retire from the store buffer out of order. Coherence requires a total global order on same-address stores.
4. rMM: Extends rWM by allowing reads to commit out of order with earlier reads or writes. We maintain that read→write ordering must be maintained for same address reads and writes, but we allow reordering for all read→read pairs in this baseline version.
5. nWR: Extends rWR by allowing cores to share store buffers. This is analogous to a shared write-through cache scenario [13], and enables nMCA stores.
6. nMM: Extends rMM by allowing shared store buffers in the same vein as nWR.
7. A9like: To demonstrate that the intricacies of nMCA stores do not depend on a design that contains a shared buffer or shared write-through cache, we model another microarchitecture with ISA-visible relaxations that match those of nMM. This time we leverage the ability of µCheck to model subtleties of the cache-coherence/consistency interface. To implement nMCA stores, this model features: i) write-back caches that allow multiple requests for write permission (for different addresses) to be in progress at the same time and ii) a non-stalling directory protocol that allows the storing core to forward the store's value to another core before it has received all invalidations for the access. In this scenario, coherence is preserved, but nMCA stores arise. This design captures reordering features similar to those of the ARM Cortex-A9 [3].
To demonstrate the applicability of our framework to modern ISA design, we have conducted a case study that applies our toolflow from Section 3 to the latest version of the RISC-V ISA specification [51]. In this experiment, we study both the Baseline (i.e., Base) and the Baseline + Atomics Extension (i.e., Base+A) RISC-V ISAs, evaluating each on how efficiently and accurately they are able to (or not able to) serve as compiler targets for C/C++ programs. For example, we show that the program in Figure 4 can show the forbidden outcome when compiled to the RISC-V ISA (via the "Intuitive" compiler mappings detailed in Table 1) if the microarchitectural implementation leverages non-MCA stores (which the specification allows). For the Base ISA, we show that there is no way to provide a correct mapping. For the Base+Atomics ISA, we show that likely-unintended inefficiencies can result from modifying the mapping to force the correct C/C++-required outcome.
RISC-V Baseline MCM
Relaxed memory model: The Base RISC-V MCM (Section 2.7 of the RISC-V ISA specification [51]) allows multiple threads of execution within a single user address space that may communicate and synchronize via the shared memory system. Each thread must observe its own memory operations as if they executed sequentially in program order. However, the manual specifies that RISC-V has a "relaxed memory model" that requires explicit FENCE instructions to guarantee any specific ordering between memory operations as viewed by other RISC-V threads.
FENCE Instructions for Memory Accesses: Any combination of memory read and write instructions may be ordered with any combination of the same. The manual states that, "Informally, no other RISC-V thread or external device can observe any operation in the successor set following a FENCE before any operation in the predecessor set preceding the FENCE." We interpret predecessor and successor sets here to be the accesses of the specified type(s) that come before and after the FENCE in program order, respectively.
Dependencies: Of particular note is the fact that RISC-V does not require memory ordering to be enforced for dependent instructions, even though this can result in counter-intuitive outcomes in multiprocessor systems [33]. Many commercial architectures such as x86, ARM, and Power respect address, data, and some control dependencies between instructions, and such dependencies can also be used as lightweight synchronization to enforce orderings locally [42]. More importantly, if dependency orderings are not preserved by default, they must be explicitly enforced through ISA instructions when necessary. For example, the Linux kernel includes a read_barrier_depends() barrier that is used to conditionally enforce data dependencies on systems that do not respect them, such as Alpha [48]. We note that the current Linux port of RISC-V does not map read_barrier_depends() to any fence, and so may be incorrect for some microarchitectural implementations [40]. Our recommendation is to require the preservation of dependency orderings in the ISA memory model. Other issues with the RISC-V memory model are discussed in Section 5.
RISC-V Baseline + Atomics Extension
RMWs with memory orders: The Standard Extension for Atomic Instructions (Chapter 6 of the RISC-V ISA specification [51] ) contains atomic fetch-and-op instructions (i.e., AMOs) and Load-Reserve/Store-Conditional (LR/SC) instructions. Both of these read-modify-write mechanisms may be annotated with various memory ordering semantics-unordered, acquire, release, and sequentially consistent. The manual states that these ordering mechanisms are meant to "implement the C11 and C++11 memory models efficiently." They are defined as follows:
• Unordered: "No additional ordering constraints are imposed on the atomic memory operation."
• Acquire (Acq): "No following memory operations on this RISC-V thread can be observed to take place before the Acq memory operation." The manual also states that FENCE R, RW suffices to implement acquire orderings.
• Release (Rel): "The Rel operation cannot be observed to take place before any earlier memory operations on this RISC-V thread." The manual also states that FENCE RW, W suffices to implement release orderings.
• Acquire+Release (AcqRel): "The AcqRel operation is sequentially consistent and cannot be observed to happen before any earlier memory operations or after any later memory operations in the same RISC-V thread, and can only be observed by any other thread in the same global order of all sequentially consistent atomic memory operations to the same address domain."
Store atomicity: The manual states that nMCA implementations are allowed, but that for AcqRel operations, the specification requires "full sequential consistency for the atomic operation which implies global store atomicity in addition to both acquire and release semantics."
Applying our Methodology to RISC-V
As a case study of our approach, we analyze the RISC-V ISA's memory models. We divide our case study into two halves, one for the Base ISA model and one for the Base+A ISA model. For each of these specifications, we begin with the MCM as specified in Sections 4.1 and 4.2. Our initial compiler mappings are the "Intuitive" mappings from Table 1 . These mappings are derived from information in the RISC-V manual [51] .
For the microarchitecture component of the analysis, we utilize the nMM microarchitecture (detailed in Section 3.3) as a baseline and augment it with the relevant components of the Base and Base+A ISA models. This microarchitecture has an out-of-order core to exhibit re-orderings and uses store buffers shared between pairs of cores to model the effects of an nMCA memory system. We apply the iterative design and refinement methodology of Figure 6 to these inputs. Our analysis shows that parts of the current RISC-V MCMs are too weak and others are too strong to implement C/C++ atomics correctly and efficiently. We recommend a set of possible model refinements to fix their problems, and use our framework to ensure that these changes have the desired effect.
Base RISC-V Model Analysis & Refinement
The Base ISA only provides memory fence instructions to establish synchronization between threads. As such, C/C++ atomics must be implemented in the Base ISA using a combination of fences and ordinary loads and stores. Our analysis discovered issues in the Base ISA's memory model related to cumulativity and the ordering of read operations to the same address.
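A fence-based mapping of this kind can be sketched as two C++ helpers that emit assembly text for an atomic load or store. This is an assumption-laden illustration, not a reproduction of Table 1: the acquire and release cases follow the manual's statements quoted in Section 4.2 (FENCE R, RW and FENCE RW, W), the sequentially consistent cases are inferred from the fence placement in the fence-only litmus listing in this section, and the names map_load and map_store are hypothetical.

```cpp
#include <atomic>
#include <string>

// Emits Base-ISA assembly text for an atomic load of (addr) into rd.
// Returns an empty string for memory orders this sketch does not cover.
std::string map_load(std::memory_order mo, const std::string& rd,
                     const std::string& addr) {
    const std::string lw = "lw " + rd + ", (" + addr + ")";
    switch (mo) {
        case std::memory_order_relaxed:
            return lw;                                   // plain load
        case std::memory_order_acquire:
            return lw + "\nfence r, rw";                 // trailing acquire fence
        case std::memory_order_seq_cst:
            return "fence rw, rw\n" + lw + "\nfence rw, rw";
        default:
            return "";
    }
}

// Emits Base-ISA assembly text for an atomic store of rs to (addr).
std::string map_store(std::memory_order mo, const std::string& rs,
                      const std::string& addr) {
    const std::string sw = "sw " + rs + ", (" + addr + ")";
    switch (mo) {
        case std::memory_order_relaxed:
            return sw;                                   // plain store
        case std::memory_order_release:
            return "fence rw, w\n" + sw;                 // leading release fence
        case std::memory_order_seq_cst:
            return "fence rw, rw\n" + sw;
        default:
            return "";
    }
}
```

For example, map_load(std::memory_order_seq_cst, "x3", "x7") produces the three-instruction sequence fence rw, rw; lw x3, (x7); fence rw, rw.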
Initial conditions: x=0, y=0
T0                 T1                 T2                 T3
a: fence rw, rw    c: fence rw, rw    e: fence rw, rw    k: fence rw, rw
b: sw x1, (x7)     d: sw x2, (x8)     f: lw x3, (x7)     l: lw x5, (x8)
                                      g: fence rw, rw    m: fence rw, rw
                                      h: fence rw, rw    n: fence rw, rw
                                      i: lw x4, (x8)     o: lw x6, (x7)
                                      j: fence rw, rw    p: fence rw, rw
Forbidden HLL Outcome: x1=1, x2=1, x3=1, x4=0, x5=1, x6=0
Lack of Cumulative Fences
As covered in Section 2.3, C/C++ release-acquire synchronization is required to be transitive, ordering both accesses before a release in program order *and* writes that were observed by the releasing core prior to the release. As such, in the WRC variant of Figure 4 , it is forbidden for T2 to return 0 for its load of x if it observes the release to y using its acquire. This ordering is not implicitly enforced for regular loads and stores in nMCA memory systems, which RISC-V allows.
When we ran the Base MCM through our tool-flow using the "Intuitive" compiler mappings from Table 1 , the test in Figure 4 compiled down to that in Figure 8 . Analysis of this program with µCheck indicated that the forbidden outcome was observable on the microarchitecture. Upon investigation of the results, we deduced that the bug was due to the absence of cumulative barriers in the Base ISA.
The Base RISC-V ISA does not contain any cumulative fences that are capable of enforcing this ordering. Thus, this problem cannot be fixed simply by changing the compiler mapping. Our recommended solution to the problem is to modify the ISA such that the fences used to implement acquires and releases are cumulative.
We modified the microarchitectural implementation of the fences used for acquire and release to be cumulative, and reran the framework with the new microarchitecture. This time, the tests such as WRC that require cumulative orderings duly disallowed the forbidden outcomes.
Lack of Strong Cumulative Fences
As discussed in Section 2.3, the enforcement of a total order is necessary for C/C++ SC atomics. This requirement is exhibited by the variant of the IRIW litmus test shown in Figure 5, whose non-SC outcome is forbidden by C/C++.
Using the "Intuitive" compiler mappings from Table 1, the test compiles down to Figure 9 for the Base ISA. µCheck reported that the forbidden outcome was allowed by our microarchitectural implementation for this test. Examination of the graph generated by µCheck showed that this was also due to the lack of cumulativity in fences. However, unlike the WRC case above, cumulative fences between the pairs of loads on T2 and T3 are insufficient to enforce the ordering required, and we verified this using our tool-flow. Instead, as discussed in Section 2.3, strong cumulative fences are required to prohibit the forbidden outcome in this case, a feature which the Base ISA does not provide.
As in Section 5.1.1, our recommended solution to this problem is to modify the ISA to include strong cumulative fences. We modified the microarchitectural implementation to support strong cumulative fences, changed the compiler mapping to use these fences when mapping sequentially consistent atomics, and reran the modified setup through our tool-flow. We observed that the forbidden outcome of Figure 5 was duly disallowed with the new instructions and mapping.
Reordering Loads to the Same Address
After making the above changes and rerunning with the modified setup, we discovered that variants of the CoRR and CO-RSDWI litmus tests were still producing forbidden outcomes. Upon investigation of the µCheck output, we discovered that the forbidden outcome was occurring because the microarchitectural implementation was not ordering loads to the same address. As discussed in Section 2, C/C++ atomics require that two loads of the same address maintain program order. The "Intuitive" compiler mapping for relaxed atomics from Table 1 implements relaxed atomics with regular loads and stores, which implies that the microarchitecture should enforce this ordering requirement; however, the microarchitecture was not doing so because the Base ISA does not require this. As a result, the forbidden outcome is visible on the microarchitecture.
This issue can be fixed in one of two ways: either the compiler mapping for relaxed loads can be changed to add a fence after each relaxed load, or the ISA memory model can be modified to require loads to the same address to be ordered by hardware. As a relatively new ISA, RISC-V can use either option. However, adding fences after each relaxed load can result in significant performance degradation for programs that liberally use relaxed atomics, as seen in Section 2.1. As such, a more efficient solution is for the ISA memory model to require program order to be preserved between two loads to the same address. We modified the microarchitecture to provide this ordering and used our tool-flow to verify that the forbidden outcome no longer occurred.
RISC-V Base+Atomics Model Analysis & Refinement
Virtually all of the instructions unique to Base+A are read-modify-write (RMW) instructions. The deficiencies in the Base model mentioned above apply to the Base+A MCM as well. In addition, the new instructions in Base+A have their own deficiencies with respect to mapping correctly and efficiently to C/C++, which we detail below.
The RISC-V manual [51] states that an atomic load operation may be implemented as an AMOADD to the zero register and an atomic store operation can be implemented as an AMOSWAP operation that writes the old value to the zero register (in other words, by discarding the store and load portions of certain RMWs). 
Lack of Transitive Releases
As discussed in Section 2.3, C/C++ releases are required to be transitive by the C/C++ memory model. However, releases in RISC-V are not required to be transitive, and only order the accesses before them in program order. As a result, using the "Intuitive" compiler mapping for atomics in Table 1, the test in Figure 4 compiles down to that in Figure 10. µCheck analysis of this test indicates that the forbidden outcome is visible on the microarchitecture, signifying a bug.
Note that even if the compiler mapping were changed to use AMO.aq.rl operations (the strongest synchronization instructions the ISA provides) for releases, the problem would persist. Even though AMO.aq.rl operations are store atomic and have both acquire and release semantics, they are not cumulative and will not enforce the required ordering (we verified this with our tool-flow).
Our recommended solution to this issue is to make release operations in the RISC-V ISA transitive, requiring that accesses before a release in program order and writes observed by the releasing core before the release are made visible before the release is made visible. We verified that making these changes to the microarchitecture's implementation of releases resulted in the forbidden outcome of Figure 8 being correctly disallowed.
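The transitivity requirement can be illustrated with C++ atomics. The sketch below assumes the classic three-thread wrc shape; the figures are not reproduced here, so the exact memory orders used in the test of Figure 4 may differ, and all names (`WrcShared`, `wrc_t0`, `wrc_t1`, `wrc_t2`, `wrc_forbidden`) are hypothetical. Each function is the body of one thread and is meant to run concurrently with the others.

```cpp
#include <atomic>
#include <cassert>

// Shared locations of the test, both initially 0.
struct WrcShared {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
};

// T0: a relaxed write to x.
void wrc_t0(WrcShared& s) {
    s.x.store(1, std::memory_order_relaxed);
}

// T1: observes x, then performs a release store to y.
int wrc_t1(WrcShared& s) {
    int r0 = s.x.load(std::memory_order_relaxed);
    s.y.store(1, std::memory_order_release);
    return r0;
}

struct WrcT2Result { int r1; int r2; };

// T2: acquires y, then reads x.
WrcT2Result wrc_t2(WrcShared& s) {
    int r1 = s.y.load(std::memory_order_acquire);
    int r2 = s.x.load(std::memory_order_relaxed);
    return {r1, r2};
}

// The outcome C/C++ forbids: T1 saw x == 1, T2 saw the release, yet T2 read x == 0.
bool wrc_forbidden(int r0, const WrcT2Result& t2) {
    return r0 == 1 && t2.r1 == 1 && t2.r2 == 0;
}
```

Under the C++ model, T1's load of x happens before T2's load of x once the acquire reads from the release, so read-read coherence prevents T2 from returning the older value. A transitive (cumulative) release is what lets hardware provide this guarantee even though the write to x originated on a different core than the release.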
With regard to the requirement of strong cumulativity for SC operations, a store-atomic cumulative release is sufficient to implement an SC store (we verified this using our framework). This is because the cumulative release semantics ensure that all previous accesses (including previously observed writes) are made visible before the release, and the store atomicity of the release ensures that the release is made visible to all cores at the same time.
Absence of Roach-Motel Movement for SC Atomics
In the C/C++ memory model, SC loads and stores only need to enforce acquire and release orderings respectively, in addition to appearing in a total order observed by all cores. SC loads do not need to implement release semantics and SC stores do not need to implement acquire semantics [24]. As a result, ordinary loads and stores (as well as relaxed atomics) that follow an SC store or precede an SC load in program order can be reordered before the SC store or after the SC load respectively. This is known as roach-motel movement and intuitively corresponds to making a critical section larger, which will not break code that uses atomic operations and locks in well-structured ways [13]. Roach-motel movement allows acquires and releases to function as one-way barriers, allowing more reordering of memory operations and theoretically improved performance.
The RISC-V ISA requires both the aq and rl bits to be set on an AMO operation in order to ensure the store atomicity required to correctly implement SC operations. There is no way to have store atomic operations with only acquire or release semantics, which would map closely to the requirements of C/C++ SC loads and stores. As a result, the implementation of SC loads and stores in the RISC-V Base+A ISA is too strict, unnecessarily enforcing more orderings than the C/C++ model requires. For example, in the version of the mp litmus test shown in Figure 11, the C/C++ memory model allows the relaxed store to y to be reordered before the SC store to x by roach-motel movement. Thus, it is possible for T1 to observe the store to y before it observes the store to x.
However, when using the "Intuitive" RISC-V mapping from Table 1 , the program in Figure 11 compiles down to that in Figure 12 . When we ran this program through µCheck, it deduced that the allowed outcome was in fact forbidden by the microarchitecture. Specifically, the acquire semantics of the AMO.aq.rl used to implement the SC store to x prevents the store to y from being reordered with it through roach motel movement.
One way to fix this excessive ordering enforcement is to decouple the store atomicity setting of an AMO from its acquire and release semantics, allowing AMOs to be store atomic when only having acquire or release semantics. We denote such store atomic AMOs as AMO.<aq|rl>.sc. Using our framework, we verified that if SC loads are mapped to AMO.aq.sc and SC stores are mapped to AMO.rl.sc, the outcome in Figure 11 is allowed, and no forbidden outcomes are allowed as a result of this relaxation.
Table 1 "Intuitive" compiler mappings. Registers x4 and x5 contain the addresses of x and y respectively on both cores.
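The effect of attaching acquire semantics to an SC store can be sketched with a small enumeration in C++. This is a toy model under stated assumptions, not the µCheck analysis: T0 performs an SC store to x followed by a relaxed store to y, T1 is assumed to perform its load of y and then its load of x in program order, and memory is a single copy. The function name `y_before_x_observable` and its parameter are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// T0: StX (SC store x = 1), then StY (relaxed store y = 1) in program order.
// T1: LdY, then LdX, assumed to be performed in program order.
enum Ev { StX = 0, StY = 1, LdY = 2, LdX = 3 };

// Returns true if T1 can observe y == 1 and then x == 0 in this toy model.
// `sc_store_has_acquire` is true for an AMO.aq.rl-style SC store and false for a
// release-only, store-atomic SC store such as the AMO.rl.sc variant discussed above.
bool y_before_x_observable(bool sc_store_has_acquire) {
    std::array<int, 4> order = {StX, StY, LdY, LdX};  // sorted, so all permutations are visited
    do {
        int pos[4];
        for (int i = 0; i < 4; ++i) pos[order[i]] = i;

        // T1's loads stay in program order (assumption of this sketch).
        if (pos[LdY] > pos[LdX]) continue;
        // Acquire semantics on the SC store keep the later store to y from moving above it.
        if (sc_store_has_acquire && pos[StY] < pos[StX]) continue;

        // Replay the events against single copies of x and y, both initially 0.
        int x = 0, y = 0, r_y = -1, r_x = -1;
        for (int ev : order) {
            switch (ev) {
                case StX: x = 1; break;
                case StY: y = 1; break;
                case LdY: r_y = y; break;
                default:  r_x = x; break;  // LdX
            }
        }
        assert(r_y != -1 && r_x != -1);  // both loads were performed

        if (r_y == 1 && r_x == 0) return true;
    } while (std::next_permutation(order.begin(), order.end()));
    return false;
}
```

With acquire semantics on the SC store the C/C++-allowed outcome becomes unreachable in the model, which corresponds to the Overly Strict behavior described above; with a release-only SC store it is reachable through roach-motel movement.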
Lazy Implementation Incompatibility
In the C/C++ memory model, acquires (and SC loads) can only synchronize with releases (and SC stores). In other words, if a release is observed by a relaxed atomic access, it is not necessary for the thread performing the relaxed atomic access to observe all accesses before the release as well. It is only when the release is observed by an acquire or an SC load that the accesses before the release must be observed by the loading core. As such, in the version of the mp litmus test shown in Figure 13, it is valid for T1 to observe the store to y but then still return the old value of 0 for x. This is true even though the execution of the two loads on T1 is locally ordered by means of an address dependency (assuming dependencies are respected; see Section 4.1). Enforcing that releases only synchronize with acquire operations allows for "lazy" implementations, which can delay processing coherence invalidations until an acquire operation is reached. Such implementations can help reduce false sharing and consume less bandwidth [16, 23, 25], and should not be outlawed by an ISA memory model specification if possible.
The C/C++ constraints on the observation of a release differ slightly from the corresponding constraints in RISC-V. In RISC-V, a release is considered to synchronize with respect to a given core when it is observed by any load on that core, and not necessarily by an acquire. Using the "Intuitive" compiler mappings from Table 1, the test in Figure 13 compiles down to the code in Figure 14. The µCheck portion of our framework confirmed that the allowed outcome was unnecessarily forbidden by the microarchitecture.
In order to allow this outcome and enable lazy implementations of the RISC-V Base+A ISA, we recommend that a release should only synchronize with respect to a core when it is observed by an acquire operation from that core. Upon making this modification to the microarchitectural implementation used in our analysis, we verified that the outcome from Figure 13 was now allowed by the microarchitecture.
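The distinction between observing a release with a relaxed load and observing it with an acquire can be written down in C++. The sketch below is an assumed rendering of an mp variant with an address dependency; the figures are not reproduced here, so the actual test in Figure 13 may differ, and the names `MpShared`, `mp_t0`, and `mp_t1` are hypothetical. Each function is the body of one thread.

```cpp
#include <atomic>
#include <cassert>

// x holds the data; y publishes a pointer to x (null until published).
struct MpShared {
    std::atomic<int> x{0};
    std::atomic<std::atomic<int>*> y{nullptr};
};

// T0: write the data, then publish its address with a release store.
void mp_t0(MpShared& s) {
    s.x.store(1, std::memory_order_relaxed);
    s.y.store(&s.x, std::memory_order_release);
}

struct MpT1Result {
    bool saw_y;  // true if T1 observed the release store to y
    int r_x;     // value read through the pointer (0 if y was not observed)
};

// T1: read y with a RELAXED load, then read x through the returned pointer.
// The second load is address-dependent on the first, but because the first
// load is not an acquire, C/C++ does not require it to synchronize with T0.
MpT1Result mp_t1(MpShared& s) {
    std::atomic<int>* p = s.y.load(std::memory_order_relaxed);
    if (p == nullptr) return {false, 0};
    return {true, p->load(std::memory_order_relaxed)};
}
```

When the two bodies run concurrently, C/C++ permits `saw_y == true` together with `r_x == 0`. Changing the load of y to `std::memory_order_acquire` would forbid that outcome, which is the point at which a lazy implementation would have to process pending invalidations.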
RISC-V MCM Shortcomings Quantified
Throughout our case study, we modeled several RISC-V processor designs with varying levels of microarchitectural optimization. We constructed RISC-V µSpec models by extending the baseline models of Table 7 to resemble Rocket Chip [6], a 6-stage in-order pipeline that supports the Base RISC-V ISA and some optional extensions, including the Atomics extension. These models were augmented with the appropriate RISC-V instructions depending on whether they were implementing the Base or Base+A ISA.
For each of the Base and Base+A MCMs, Figure 15 shows results for riscv-curr and riscv-ours versions as inputs to our toolflow. The riscv-curr version of the Base (resp. Base+A) MCM corresponds to the initial set of inputs to our toolflow: current Base (resp. Base+A) RISC-V MCM [51], "Intuitive" compiler mappings of Table 1, and Base (resp. Base+A) RISC-V implementations of the Table 7 µSpec models. The riscv-ours version of the Base (resp. Base+A) MCM corresponds to the final results of the refinement process of Section 5: refined Base (resp. Base+A) RISC-V MCM, "Refined" compiler mappings of Table 1, and refined Base (resp. Base+A) RISC-V implementations of the Table 7 µSpec models to accommodate RISC-V MCM changes.
The chart in the bottom right-hand corner of Figure 15 additionally depicts results aggregated across litmus tests in each litmus test suite. Bug bars correspond to the percentage of tests that ever produced an illegal outcome for a litmus test variation of a specified type when executed on any of our microarchitectural implementations. Overly Strict bars display the percentage of tests that ever produced an Overly Strict outcome, but never a Bug. Equivalent bars are the percentage of tests that always produced C/C++-specified outcomes.
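The aggregation rule for the three bar types can be sketched in C++ as follows. This is an illustrative reading of the definitions above rather than the toolflow's own code; the names `Run`, `Verdict`, `classify_run`, and `classify_test` are assumptions. A run pairs what C/C++ permits for an outcome with what the microarchitecture analysis reports as observable.

```cpp
#include <cassert>
#include <vector>

enum class Verdict { Bug, OverlyStrict, Equivalent };

// One comparison for a litmus test variation on one microarchitecture model.
struct Run {
    bool hll_allows;        // C/C++ permits the outcome of interest
    bool uarch_observable;  // the microarchitecture analysis reports it observable
};

// Classify a single comparison.
Verdict classify_run(const Run& r) {
    if (!r.hll_allows && r.uarch_observable) return Verdict::Bug;           // forbidden yet observable
    if (r.hll_allows && !r.uarch_observable) return Verdict::OverlyStrict;  // allowed yet unobservable
    return Verdict::Equivalent;
}

// Aggregate all comparisons for one test: Bug if any run is a Bug; otherwise
// Overly Strict if any run is Overly Strict; otherwise Equivalent.
Verdict classify_test(const std::vector<Run>& runs) {
    bool any_strict = false;
    for (const Run& r : runs) {
        Verdict v = classify_run(r);
        if (v == Verdict::Bug) return Verdict::Bug;
        if (v == Verdict::OverlyStrict) any_strict = true;
    }
    return any_strict ? Verdict::OverlyStrict : Verdict::Equivalent;
}
```

Giving Bug priority over Overly Strict matches the statement that Overly Strict bars count tests that "never" produced a Bug, so each test contributes to exactly one bar.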
Litmus Test Suite Evaluation
In Section 2, we discussed MCM features and alluded to issues that can result when these features are not carefully taken into account at design time. Through our Section 5 case study, we found that the Base and Base+A RISC-V MCMs are prone to pitfalls via these same MCM features. All of the errors enumerated below are eliminated in our refined riscv-ours µSpec model, ISA MCM, and compiler mappings, for both RISC-V Base and Base+A.
Lack of Cumulative Fences from Section 5.1.1: The µSpec models that are subject to errors as a result of the RISC-V MCM omitting cumulative fences are those with nMCA stores: nWR, nMM, and A9like. A lack of cumulative fences in the Base riscv-curr versions of these nMCA models resulted in 108 illegal outcomes out of the 243 variants of the wrc litmus test. The A9like model additionally exhibited 36 buggy outcomes for mp, and 1 for sb.
Lack of Strong Cumulative Fences from Section 5.1.2: The omission of strong cumulative fences from the RISC-V MCM also affects the three nMCA µSpec models. The result can be seen in the Base riscv-curr versions of the three nMCA microarchitectures for the rwc and iriw litmus tests. Each model exhibited 2 illegal outcomes out of the 243 variants of rwc. Out of the 729 variations of iriw, the nWR and nMM models exhibited 4 buggy executions, and the A9like model exhibited 2.
Reordering Loads to the Same Address from Section 5.1.3: We observed read→read reordering of reads of the same address on both the Base and Base+A RISC-V ISAs for the CoRR and CO-RSDWI litmus tests. We do not include quantitative results for these tests in Figure 15, as they do not rely on any subtle interplay of instructions and are straightforwardly observable (yet forbidden by C/C++) when same-address loads are implemented with RISC-V relaxed loads. For the Base and Base+A riscv-curr versions, CoRR variants produced 18 illegal results out of 81 for the µSpec models that relax read→read ordering: rMM, nMM, and A9like. CO-RSDWI variants produced 54 illegal results out of 243 for the same µSpec models.
Lack of Cumulative Releases from Section 5.2.1: The lack of cumulative releases in RISC-V again affects only the nMCA implementations (nWR, nMM, and A9like), as displayed by the wrc executions for the Base+A riscv-curr versions of these µSpec models. Out of the 243 wrc variants, the nWR and nMM models produce 96 illegal outcomes, and A9like exhibits 72.
Note that the rwc and iriw litmus tests are only forbidden at the C++ level when SC atomics are involved. Thus, the non-transitive behavior of riscv-curr acquires and releases is not buggy for these tests unless SC atomics are used, resulting in fewer cases being flagged as bugs.
Absence of Roach-Motel Movement for SC Atomics from Section 5.2.2: The effect of this over-constraint on C/C++ programs can be seen by comparing all Base+A riscv-curr and riscv-ours variants and noting that the Overly Strict bars decrease in size from riscv-curr to riscv-ours (or stay the same in a couple of cases). When they stay the same, as in the iriw case, this is the result of the microarchitectures themselves not being relaxed enough to exploit the difference between SC operations that allow roach-motel movement and those that do not.
Related Work
In the past decade, researchers have formalized the specifications of a number of important real-world memory models. Java, x86-TSO, Power, ARM, C++, and OpenCL have been formalized operationally [37, 38, 42] and/or axiomatically [2, 7, 31, 49]. Most of these efforts, however, use some pre-existing document(s) as a starting point, and generally the refinement is performed according to the designers' original intent. In contrast, this work treats the software requirements and microarchitectural guarantees as fixed, and we instead explore the ISA MCM design space itself as it ranges over the space between those two endpoints.
Verifying HLL Mappings onto Weak ISA MCMs: The two programming languages that have received the most attention in terms of memory model formalization are C/C++ and Java. In a series of work, Batty et al. developed a mathematically rigorous semantics for C++ concurrency, formalized using the Isabelle/HOL theorem prover via Lem [7, 10, 34, 35]. As part of this process, they produced a verified compilation scheme from C++11 onto the x86, ARM, and Power MCMs [9, 41]. Vafeiadis et al. developed various methods for proving the correctness of operations performed within a C/C++ compiler [49, 50]. Petri et al. developed an operational model of Java which specifically focused on its mapping onto x86 and Power [38]. Mappings from HLLs onto other architectures have also been considered with varying degrees of formality [43].
Verifying Microarchitectures against ISA MCMs: Prior work has also enabled flexible verification of hardware with respect to its ISA-level memory consistency model specification. Lustig et al. and Manerkar et al. developed a set of tools for specifying memory ordering behavior at the microarchitecture level and then comparing it to the ISA specification [28, 29, 32]. We use their µCheck tool in our work. They also developed a tool capable of injecting ordering enforcement mechanisms to restore orderings required by the ISA but not implemented (or incorrectly implemented) in hardware [30]. Extensive work has developed black-box testing methodologies using litmus tests [22]. We draw from these techniques, but we keep the microarchitecture constant and adjust the ISA MCM, rather than the converse in most related work.
Conclusion
MCM design choices are complicated and involve reasoning about the subtle interplay between many diverse features. Simply defining an ISA specification in light of the evaluation of a single microarchitecture is not sufficient, as future designs may expose inefficiencies or inaccuracies within the specification. In our RISC-V MCM case study, we found that one RISC-V-compliant microarchitecture allows 144 outcomes forbidden by C/C++ to be observed out of 1,701 litmus tests examined. We also demonstrated, however, that the same issues were not present across all RISC-V-compliant hardware designs. Using our framework, ISA designers can iteratively refine and evaluate ISA specifications in a microarchitecture-aware manner based on the ISA's ability to serve as a target for compiled C/C++ programs.
Acknowledgements
This work is sponsored in part by C-FAR, a funded center of STARnet, a Semiconductor Research Corporation (SRC) program sponsored by MARCO and DARPA.
