Counterexamples and Proof Loophole for the C/C++ to POWER and ARMv7
  Trailing-Sync Compiler Mappings by Manerkar, Yatin A. et al.
Yatin A. Manerkar Caroline Trippel Daniel Lustig∗ Michael Pellauer∗
Margaret Martonosi
Princeton University ∗NVIDIA
{manerkar,ctrippel,mrm}@princeton.edu {dlustig,mpellauer}@nvidia.com
Abstract
The C and C++ high-level languages provide programmers with atomic operations for writing high-
performance concurrent code. At the assembly language level, C and C++ atomics get mapped down to
individual instructions or combinations of instructions by compilers, depending on the ordering guarantees
and synchronization instructions provided by the underlying architecture. These compiler mappings must
uphold the ordering guarantees provided by C/C++ atomics or the compiled program will not behave
according to the C/C++ memory model. In this paper we discuss two counterexamples to the well-known
trailing-sync compiler mappings for the Power and ARMv7 architectures that were previously thought
to be proven correct. In addition to the counterexamples, we discuss the loophole in the proof of the
mappings that allowed the incorrect mappings to be proven correct. We also discuss the current state of
compilers and architectures in relation to the bug.
1 Introduction
The C and C++ high-level languages provide programmers with atomic memory operations for writing
high-performance concurrent code. Different types of atomic memory operations provide different levels of
memory ordering guarantees. Stronger memory ordering guarantees usually correlate to lower performance
and weaker memory ordering guarantees to higher performance. Users can utilise different types of atomic
memory operations depending on the guarantees and performance they require.
At the assembly language level, C and C++ atomics get mapped down to individual instructions
or combinations of instructions by compilers, depending on the ordering guarantees and synchronization
instructions provided by the underlying architecture. These compiler mappings must uphold the ordering
guarantees provided by C/C++ atomics or the compiled program will not behave according to the C/C++
memory model.
In this report we discuss two counterexamples to the well-known trailing-sync compiler mappings for
the Power and ARMv7 architectures that were supposedly proven correct [6]. In these counterexamples, a
particular execution of the program is forbidden by C/C++ but incorrectly allowed by the compiled program.
In addition to the counterexamples, we discuss the loophole in the proof of the mappings that allowed the
incorrect mappings to be proven correct. We also discuss the current state of compilers and architectures in
relation to the bug.
Section 2 provides background information on C/C++ atomic memory operations and the relations in
the memory model relevant to the counterexamples, as well as the relevant compiler mappings for ARMv7
and Power. Sections 3 and 4 discuss the two counterexamples for the trailing-sync mapping (variants of the
well-known IRIW and RWC litmus tests respectively) at both the C/C++ level and the Power/ARM level.
Section 5 discusses the loophole in the proof of the mappings [6] that allowed the incorrect mappings to be
proven correct. Section 6 discusses the current state of compilers and architectures in relation to the bug,
and Section 7 concludes.
arXiv:1611.01507v2 [cs.PL] 17 Nov 2016
2 Background Information
2.1 C/C++ atomics and Relevant Memory Model Relations
C/C++ introduced atomic objects and operations in the C/C++11 standards, along with a new memory
model governing the order in which C/C++ threads are allowed to observe each others’ memory accesses.
The C/C++ memory model is based on the data-race-free and properly-labeled models of Adve and Hill and
Gharachorloo et al. respectively [2, 9], which guarantee sequential consistency for programs that do not
contain any data races.
Here we provide a brief overview of the portions of C/C++ atomics and the C/C++ memory model that
are relevant to our counterexamples. We refer the reader to the large body of work on the C/C++ memory
model for further details [8, 10, 4, 15, 5].
Different operations can be conducted on C/C++ atomic objects, such as load, store, and compare-and-exchange
operations. Each such operation can be given a memory order, including memory_order_seq_cst,
memory_order_release, and memory_order_acquire, which represent sequentially consistent (SC) operations,
release operations, and acquire operations respectively. Different memory orders provide different guarantees
with respect to the ordering of the atomic memory access with other accesses in the program. Conceptually,
a release operation is a write which ensures that prior accesses are made visible before the release. Likewise,
an acquire operation is conceptually a read which ensures that memory operations after the acquire are made
visible after the acquire itself. A release corresponds to granting permission to access a set of shared locations,
while an acquire is performed to gain access to a set of shared locations [1]. Note that an SC read is also an
acquire operation and an SC store is also a release operation. SC operations have further constraints on their
execution, which are detailed below.
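The release/acquire pairing described above can be illustrated with a minimal message-passing sketch. The names `payload`, `ready`, and `message_passing` are hypothetical and chosen only for this illustration; they do not appear in the C/C++ standard or in the counterexamples that follow.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: one thread publishes plain data with a release store,
// another thread gains access to it with an acquire load.
int message_passing() {
    int payload = 0;                  // plain (non-atomic) shared data
    std::atomic<bool> ready(false);   // atomic flag guarding the data
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // access prior to the release
        ready.store(true, std::memory_order_release);  // release: grant access
    });
    std::thread consumer([&] {
        // Acquire: spin until the release store is observed.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // ordered after the acquire, so it reads 42
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Calling `message_passing()` returns 42 in every execution, because the acquire load that reads `true` synchronizes with the release store, which makes the earlier write to `payload` visible to the consumer.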
In a given C/C++ execution, the rf relation relates a write operation to a read operation which reads the
value of that write. The mo relation enforces a total order on all write operations to the same address, and
all threads must observe writes to a given address in this mo order. The fr relation relates a given read to
the writes that follow the source write of the read in mo-order.
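As a small illustration of how fr follows from rf and mo, the sketch below derives the fr edges for the outcome of the IRIW test in Figure 1, using that figure's event labels. The representation (strings for events, one vector per location for mo) and the function names are assumptions made for this example only.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Event = std::string;
using Edge = std::pair<Event, Event>;

// mo:  for each location, its writes listed in modification order.
// rf:  maps each read to the write it reads from.
// loc: maps each read to the location it accesses.
// Returns fr: each read is related to every write that follows its source in mo.
std::set<Edge> from_reads(const std::map<std::string, std::vector<Event>>& mo,
                          const std::map<Event, Event>& rf,
                          const std::map<Event, std::string>& loc) {
    std::set<Edge> fr;
    for (const auto& entry : rf) {
        const Event& read = entry.first;
        const Event& source = entry.second;
        bool past_source = false;
        for (const Event& write : mo.at(loc.at(read))) {
            if (past_source) fr.insert(Edge(read, write));
            if (write == source) past_source = true;
        }
    }
    return fr;
}

// The IRIW outcome r1=1, r2=0, r3=1, r4=0 with the labels of Figure 1.
std::set<Edge> iriw_from_reads() {
    std::map<std::string, std::vector<Event>> mo = {
        {"x", std::vector<Event>{"a", "c"}},   // a: x=0 precedes c: x=1
        {"y", std::vector<Event>{"b", "d"}}};  // b: y=0 precedes d: y=1
    std::map<Event, Event> rf = {{"e", "c"}, {"f", "b"}, {"g", "d"}, {"h", "a"}};
    std::map<Event, std::string> loc = {{"e", "x"}, {"f", "y"}, {"g", "y"}, {"h", "x"}};
    return from_reads(mo, rf, loc);
}
```

For this outcome the only fr edges are f to d and h to c: the two SC reads observe the initial writes, so they are fr-before the SC stores to the same locations.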
The happens-before relation hb is the transitive closure of the sequenced-before relation sb, which
corresponds to program order on an individual C/C++ thread, and the synchronizes-with relation sw, which
relates release store operations (and stores in their release sequence) to acquire read operations that read the
release store (or a store in its release sequence). (Release sequences are not necessary for understanding the
counterexamples in this paper.)
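The definition of hb as a transitive closure can be sketched directly. The example below is a simplified illustration, assuming only the sw and sb edges that matter for the IRIW test of Figure 1 (it omits the initial writes and release sequences); the function names are hypothetical.

```cpp
#include <set>
#include <string>
#include <utility>

using Event = std::string;
using Edge = std::pair<Event, Event>;

// Naive transitive closure: keep composing edges until nothing new appears.
std::set<Edge> transitive_closure(std::set<Edge> rel) {
    bool changed = true;
    while (changed) {
        changed = false;
        std::set<Edge> added;
        for (const Edge& p : rel) {
            for (const Edge& q : rel) {
                const Edge composed(p.first, q.second);
                if (p.second == q.first && rel.count(composed) == 0) {
                    added.insert(composed);
                }
            }
        }
        if (!added.empty()) {
            rel.insert(added.begin(), added.end());
            changed = true;
        }
    }
    return rel;
}

// hb for the IRIW outcome, built from sw and sb with Figure 1's labels.
std::set<Edge> iriw_happens_before() {
    std::set<Edge> base = {
        {"c", "e"}, {"d", "g"},   // sw: SC stores read by the acquire loads
        {"e", "f"}, {"g", "h"}};  // sb: program order on T2 and T3
    return transitive_closure(base);
}
```

The closure adds exactly two edges, c to f and d to h. These are hb edges between two SC accesses that pass through an intermediate non-SC (acquire) access.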
The total order on sequentially consistent operations sc must obey the following constraints [10, 5]:
• It must be a total order on SC operations, so any two SC operations must be ordered with respect to
each other.
• It must be consistent with hb and mo restricted to SC atomics. In other words, it is forbidden for
two accesses to be related by hb/mo in one direction and sc in the other direction. These first two
conditions are henceforth referred to as the consistent_sc_order property, using the terminology of Batty
et al. [6].
• SC reads (i.e. reads annotated with memory_order_seq_cst) must either read from the latest SC write
to the same location before them in the sc order, or they must read from a non-SC write that does not
happen-before the latest SC write to that location. This condition is henceforth referred to as the
sc_accesses_sc_reads_restricted property, again using the terminology of Batty et al. [6].
In summary, the sc order must be consistent with hb, and any two SC operations must be ordered with
respect to each other by sc. Consequently, if an hb edge exists between two SC atomics, an sc edge must
also exist between them in the same direction.
2.2 C/C++ Compiler Mappings to Power and ARMv7
The Power and ARMv7 architectures are two well-known architectures in use today. They are notable for
their weak memory models, which allow a great deal of reordering and can require careful use of dependencies
and synchronization instructions in order to ensure desired outcomes. Here we provide a brief overview of the
relevant instructions in the Power and ARMv7 memory models. We refer the reader to existing work on the
memory models of these architectures for further details [12, 3].
In Power, an lwsync is a fence which cumulatively orders all reads and writes prior to the fence before
any writes after the fence. An lwsync does not order writes prior to the fence with respect to reads after the
fence. A sync in Power is a fence which cumulatively orders all reads and writes prior to the fence before all
reads and writes after the fence. The ARMv7 dmb ish fence is analogous to the Power sync.
There are two commonly-accepted compiler mappings from C/C++ to Power and ARMv7: the leading-
sync mapping and the trailing-sync mapping [6, 13]. The relevant portions of these mappings are provided in
Tables 1 and 2 for reference. A notable difference between the Power and ARMv7 versions of each mapping is
that ARMv7 does not have an equivalent of the Power lightweight lwsync fence. It only has the heavyweight
dmb ish fence which provides orderings mostly equivalent to Power’s heavyweight sync fence. Thus, the
ARMv7 implementations of store releases and trailing-sync SC stores utilise a dmb ish fence where the
corresponding Power mappings use an lwsync.
The “cmp; bc; isync” and “teq; beq; isb” instruction sequences are known as ctrlisync and
ctrlisb respectively. The combination of a conditional branch followed by an isync (on Power) or an isb
(on ARMv7) instruction is enough to enforce that all instructions after the isync/isb begin execution after a
load which the branch depends on. This initially appears to be enough to implement the orderings required by
C/C++ memory_order_acquire primitives, but as our counterexamples show, issues can arise when acquires
interoperate with SC atomics.
As stated above, SC loads are also acquires and SC stores are also releases. In addition to providing
acquire and release semantics respectively, SC loads and stores must also obey the aforementioned constraints
on the total sc order. Part of these constraints requires ensuring that an SC store followed by an SC load in
sb appear in that order to all cores. This requires a heavyweight sync or dmb ish fence between the SC store
and the SC load on Power and ARMv7. Such a fence can either be incorporated into the mapping before all
SC loads (which gives the leading-sync mapping) or after all SC stores (which gives the trailing-sync mapping).
Only the instruction sequences for SC loads and stores change between the leading and trailing-sync mappings.
C/C++ Atomic Operation | Power Mapping            | ARMv7 Mapping
Load Acquire           | ld; cmp; bc; isync       | ldr; teq; beq; isb
Load Seq Cst           | sync; ld; cmp; bc; isync | dmb ish; ldr; teq; beq; isb
Store Release          | lwsync; st               | dmb ish; str
Store Seq Cst          | sync; st                 | dmb ish; str
Table 1: Leading-sync compiler mapping from certain C/C++11 atomic operations to Power and ARMv7.

C/C++ Atomic Operation | Power Mapping            | ARMv7 Mapping
Load Acquire           | ld; cmp; bc; isync       | ldr; teq; beq; isb
Load Seq Cst           | ld; sync                 | ldr; dmb ish
Store Release          | lwsync; st               | dmb ish; str
Store Seq Cst          | lwsync; st; sync         | dmb ish; str; dmb ish
Table 2: Trailing-sync compiler mapping from certain C/C++11 atomic operations to Power and ARMv7.
Examining the mappings at a high level, one can notice the following:
• The lwsync/sync or dmb ish prior to a release or SC store ensures that accesses before the release or
SC store are made visible to other cores before they observe the release.
• The ctrlisync/ctrlisb following a load acquire enforces that all accesses after the acquire begin
execution after the acquire.
• The extra sync/dmb ish before SC loads or after SC stores enforces ordering between SC stores and
subsequent SC loads in program order.
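The Power columns of Tables 1 and 2 can be written down as a small lookup, which makes the third observation above easy to check mechanically. This is a sketch only: the enum and function names are hypothetical, `ctrlisync` stands for the cmp; bc; isync sequence, and instructions are modelled as plain strings rather than real assembly.

```cpp
#include <string>
#include <vector>

enum class Op { LoadAcquire, LoadSeqCst, StoreRelease, StoreSeqCst };
enum class Mapping { LeadingSync, TrailingSync };

// Power instruction sequence for one atomic operation, per Tables 1 and 2.
std::vector<std::string> power_sequence(Op op, Mapping m) {
    switch (op) {
        case Op::LoadAcquire:
            return {"ld", "ctrlisync"};
        case Op::StoreRelease:
            return {"lwsync", "st"};
        case Op::LoadSeqCst:
            return m == Mapping::LeadingSync
                       ? std::vector<std::string>{"sync", "ld", "ctrlisync"}
                       : std::vector<std::string>{"ld", "sync"};
        case Op::StoreSeqCst:
            return m == Mapping::LeadingSync
                       ? std::vector<std::string>{"sync", "st"}
                       : std::vector<std::string>{"lwsync", "st", "sync"};
    }
    return {};
}

// Concatenate the sequences for a thread's operations in program order.
std::vector<std::string> compile_thread(const std::vector<Op>& ops, Mapping m) {
    std::vector<std::string> code;
    for (Op op : ops) {
        const std::vector<std::string> seq = power_sequence(op, m);
        code.insert(code.end(), seq.begin(), seq.end());
    }
    return code;
}

// True if a heavyweight sync lies between the first two memory accesses.
bool sync_between_first_two_accesses(const std::vector<std::string>& code) {
    int accesses = 0;
    bool seen_sync = false;
    for (const std::string& insn : code) {
        if (insn == "ld" || insn == "st") {
            if (++accesses == 2) return seen_sync;
        } else if (insn == "sync" && accesses == 1) {
            seen_sync = true;
        }
    }
    return false;
}
```

For an SC store followed by an SC load, both mappings place a sync between the two accesses: the leading-sync mapping contributes it from the load, and the trailing-sync mapping from the store.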
Initial conditions: a: x=0, b: y=0
T0                   | T1                   | T2                     | T3
c: st(x, 1, seq_cst) | d: st(y, 1, seq_cst) | e: r1 = ld(x, acquire) | g: r3 = ld(y, acquire)
                     |                      | f: r2 = ld(y, seq_cst) | h: r4 = ld(x, seq_cst)
Outcome forbidden by C/C++: r1=1, r2=0, r3=1, r4=0
Figure 1: The IRIW counterexample, specifically the case where both of the first loads on the reading cores
are acquires. In this figure, memory_order_x is abbreviated to x for brevity.

Figure 2: Execution graph of the IRIW counterexample generated with the help of CPPMEM [7], with relevant
edges showing why the execution is forbidden by the C/C++ memory model. (Graph not reproduced here. Nodes:
a:Wna x=0, b:Wna y=0, c:Wsc x=1, d:Wsc y=1, e:Racq x=1, f:Rsc y=0, g:Racq y=1, h:Rsc x=0. Edge labels:
sb, mo, sw, sc_hb, sc_fr.)
Both the leading and trailing-sync mappings were supposedly proven correct by Batty et al. [6]. However,
we discovered a loophole in their proof which allowed the incorrect trailing-sync mappings to be proven
correct. The loophole is detailed in Section 5.
The next two sections discuss the counterexamples we discovered for the trailing-sync mapping. These
counterexamples were discovered using a framework [14] capable of exhaustively enumerating common
C11 litmus tests with varied combinations of memory orders and comparing their outcomes against those
of the equivalent ISA-level litmus tests (obtained by compiling with a given mapping) on a variety of
microarchitectural implementations defined using the µspec language (as seen in the COATCheck paper [11]).
Specifically, these counterexamples were observed during runs of the framework on a microarchitecture with
Power/ARMv7-like features and using a trailing-sync compiler mapping. The runtimes of the framework are
very reasonable.
3 The IRIW Counterexample
The first counterexample is a variant of the well-known Independent Reads Independent Writes (IRIW) litmus
test, where at least one of the first loads on the reading cores is an acquire operation. All other accesses in
the test are SC accesses. The case where both of the first loads on the reading cores are acquires is shown in
Figure 1. The rest of this section focuses on this particular case of the counterexample, though the reasoning
for the cases where only one load is an acquire is very similar.
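For reference, the test of Figure 1 can be written against std::atomic as in the sketch below. The harness names (`IriwResult`, `run_iriw_once`, `count_forbidden`) are hypothetical. This is an illustration of the test's shape rather than a reproduction of the bug: on a compiler that does not use the trailing-sync mapping of Table 2, the forbidden outcome should never be reported, and a simple thread-per-run harness like this one explores few interleavings.

```cpp
#include <atomic>
#include <thread>

struct IriwResult { int r1, r2, r3, r4; };

// One run of the IRIW variant in Figure 1 (event labels in comments).
IriwResult run_iriw_once() {
    std::atomic<int> x(0), y(0);   // a: x=0, b: y=0
    IriwResult r = {-1, -1, -1, -1};
    std::thread t0([&] { x.store(1, std::memory_order_seq_cst); });  // c
    std::thread t1([&] { y.store(1, std::memory_order_seq_cst); });  // d
    std::thread t2([&] {
        r.r1 = x.load(std::memory_order_acquire);  // e
        r.r2 = y.load(std::memory_order_seq_cst);  // f
    });
    std::thread t3([&] {
        r.r3 = y.load(std::memory_order_acquire);  // g
        r.r4 = x.load(std::memory_order_seq_cst);  // h
    });
    t0.join(); t1.join(); t2.join(); t3.join();
    return r;
}

// The outcome that the C/C++ memory model forbids.
bool is_forbidden(const IriwResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0;
}

// Number of runs (out of `iterations`) that show the forbidden outcome.
int count_forbidden(int iterations) {
    int seen = 0;
    for (int i = 0; i < iterations; ++i) {
        if (is_forbidden(run_iriw_once())) ++seen;
    }
    return seen;
}
```

A conforming implementation must make `count_forbidden(n)` return 0 for every n; a non-zero count would indicate a miscompilation of the kind discussed in this section.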
We begin by showing why this outcome is forbidden under the current C/C++ memory model. One
execution graph for the test’s outcome is shown in Figure 2. Note that the SC stores c and d synchronize-with
the acquire operations e and g respectively, as the acquires read the values of the SC stores, and SC stores
are also releases.
As per the current C/C++ memory model, both the sc_fr and sc_hb edges must be part of the sc total
order. The sc_fr edges (labelled sc_fr in Figure 2) from h → c and f → d shadow the fr edges between these
operations, and must be part of the sc order to abide by the sc_accesses_sc_reads_restricted axiom mentioned
in Section 2. This is because both the SC reads f and h read from the non-SC initial writes (b and a
respectively) which hb all other writes. If the sc order contained the reverse of one of the sc_fr edges, then the
initial writes (accesses b and a) that the SC reads read from would hb the latest SC writes to the locations (d
and c respectively), thus causing the SC reads to fail the sc_accesses_sc_reads_restricted axiom.
Meanwhile, there is a hb edge from c → f through the transitive composition of the sw edge from c → e
and the sb edge from e → f. Likewise, there is a hb edge from d → h through the transitive composition of
the sw edge from d → g and the sb edge from g → h. Thus, in order to keep the sc order consistent with hb,
there must also be sc edges from c → f and d → h, which correspond to the sc_hb edges in Figure 2.
Power:
Initial conditions: x=0, y=0
C0       | C1       | C2        | C3
sync     | sync     |           |
st x = 1 | st y = 1 | r1 = ld x | r3 = ld y
         |          | ctrlisync | ctrlisync
         |          | sync      | sync
         |          | r2 = ld y | r4 = ld x
         |          | ctrlisync | ctrlisync
Forbidden: r1=1, r2=0, r3=1, r4=0

ARMv7:
Initial conditions: x=0, y=0
C0       | C1       | C2        | C3
dmb ish  | dmb ish  |           |
st x = 1 | st y = 1 | r1 = ld x | r3 = ld y
         |          | ctrlisb   | ctrlisb
         |          | dmb ish   | dmb ish
         |          | r2 = ld y | r4 = ld x
         |          | ctrlisb   | ctrlisb
Forbidden: r1=1, r2=0, r3=1, r4=0
Figure 3: IRIW counterexample compiled to Power (first listing) and ARMv7 (second listing) using the
leading-sync compiler mapping. The heavyweight sync/dmb ish fences between the pairs of loads on C2 and C3
are sufficient to disallow the forbidden outcome on ARMv7 and Power.

Power:
Initial conditions: x=0, y=0
C0       | C1       | C2        | C3
lwsync   | lwsync   |           |
st x = 1 | st y = 1 | r1 = ld x | r3 = ld y
sync     | sync     | ctrlisync | ctrlisync
         |          | r2 = ld y | r4 = ld x
         |          | sync      | sync
Allowed: r1=1, r2=0, r3=1, r4=0

ARMv7:
Initial conditions: x=0, y=0
C0       | C1       | C2        | C3
dmb ish  | dmb ish  |           |
st x = 1 | st y = 1 | r1 = ld x | r3 = ld y
dmb ish  | dmb ish  | ctrlisb   | ctrlisb
         |          | r2 = ld y | r4 = ld x
         |          | dmb ish   | dmb ish
Allowed: r1=1, r2=0, r3=1, r4=0
Figure 4: IRIW counterexample compiled to Power (first listing) and ARMv7 (second listing) using the
trailing-sync compiler mapping. The absence of heavyweight sync/dmb ish fences between the pairs of loads
on C2 and C3 results in the outcome being allowed by both Power and ARMv7 models (and visible on Power
hardware).
The combination of the sc_hb edges with the sc_fr edges results in a cycle in the sc order, which means it
is not a total order and is thus invalid. As a result, there are no consistent executions of the listed outcome
of this C/C++ program, as it is impossible to create a total sc order for the outcome that abides by both the
consistent_sc_order and sc_accesses_sc_reads_restricted axioms.
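The cycle argument can be checked mechanically. The sketch below, with hypothetical function names, collects the edges that the sc order of the IRIW outcome is required to contain and tests them for a cycle by repeatedly removing events with no incoming edge; if some events can never be removed, the edges cannot be extended to a total order.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Event = std::string;
using Edge = std::pair<Event, Event>;

// Returns true if the directed graph given by `edges` contains a cycle.
bool has_cycle(const std::vector<Edge>& edges) {
    std::map<Event, int> indegree;
    std::map<Event, std::vector<Event>> successors;
    for (const Edge& e : edges) {
        indegree[e.first];   // registers the source node with indegree 0 if new
        ++indegree[e.second];
        successors[e.first].push_back(e.second);
    }
    std::vector<Event> ready;
    for (const auto& kv : indegree) {
        if (kv.second == 0) ready.push_back(kv.first);
    }
    std::size_t removed = 0;
    while (!ready.empty()) {
        const Event node = ready.back();
        ready.pop_back();
        ++removed;
        for (const Event& next : successors[node]) {
            if (--indegree[next] == 0) ready.push_back(next);
        }
    }
    return removed != indegree.size();  // leftover nodes sit on a cycle
}

// Edges required in sc for the IRIW outcome (labels from Figure 1).
std::vector<Edge> iriw_sc_edges(bool include_sc_hb) {
    std::vector<Edge> edges = {{"h", "c"}, {"f", "d"}};  // sc_fr edges
    if (include_sc_hb) {
        edges.push_back(Edge("c", "f"));  // sc_hb via sw c->e and sb e->f
        edges.push_back(Edge("d", "h"));  // sc_hb via sw d->g and sb g->h
    }
    return edges;
}
```

With only the sc_fr edges the graph is acyclic, but adding the two sc_hb edges closes the cycle c, f, d, h, c, which is why no total sc order exists for this outcome.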
The compilation of this test program to Power and ARMv7 using the leading-sync mapping is shown in
Figure 3. The heavyweight sync (or dmb ish in the case of ARMv7) fences between each pair of loads on C2
and C3 are enough to disallow the forbidden outcome of the test in this case. These compiled Power and
ARMv7 tests are forbidden by the Power and ARMv7 models of Alglave et al. [3], and the corresponding
outcomes are not observable on Power or ARMv7 hardware.
On the other hand, when the C/C++ program is compiled to Power and ARMv7 using the trailing-sync
mapping, the resultant programs are shown in Figure 4. In this case, there is only a ctrlisync (or ctrlisb
in the case of ARMv7) between each pair of loads on C2 and C3. This is not sufficient to disallow the
forbidden outcome of the test, which is allowed by both the Power and ARMv7 models of Alglave et al. [3].
Furthermore, Alglave et al. have observed the forbidden C/C++ outcome of the trailing-sync Power version
of this test on Power hardware [3].
The problem which results in this bug is that the counterexample program induces a hb edge between
two SC accesses (such as c and f in Figure 2) by means of the transitive composition of hb edges to and
from an intermediate non-SC access (in this case, the sw and sb edges from c→ e and e→ f respectively).
The requirement of the C/C++ memory model that sc be consistent with hb thus requires that c be before
f in sc. In other words, no thread can observe f before it observes c. Both Power and ARMv7 require a
heavyweight sync/dmb ish fence between accesses e and f in order to guarantee this property. Similarly,
there must exist a sync/dmb ish fence between accesses g and h in order to guarantee the other sc edge
induced by the requirement that hb is consistent with sc.
When the test is compiled using the leading-sync mapping, a sync/dmb ish fence is correctly added
between each pair of loads to provide the required guarantees. However, in the version of the test compiled
using the trailing-sync mapping, there is only a ctrlisync/ctrlisb between each pair of loads, which is not
enough to provide the required guarantees. This results in the forbidden C/C++ outcome being allowed by
the Power and ARMv7 models (and observable on Power hardware).

Initial conditions: a: x=0, b: y=0
T0                   | T1                     | T2
c: st(x, 1, seq_cst) | d: r1 = ld(x, acquire) | f: st(y, 1, seq_cst)
                     | e: r2 = ld(y, seq_cst) | g: r3 = ld(x, seq_cst)
Outcome forbidden by C/C++: r1=1, r2=0, r3=0
Figure 5: The RWC counterexample. In this figure, memory_order_x is abbreviated to x for brevity.

Figure 6: Execution graph of the RWC counterexample generated with the help of CPPMEM [7], with relevant
edges showing why the execution is forbidden by the C/C++ memory model. (Graph not reproduced here. Nodes:
a:Wna x=0, b:Wna y=0, c:Wsc x=1, d:Racq x=1, e:Rsc y=0, f:Wsc y=1, g:Rsc x=0. Edge labels: sb, mo, sw,
sc_hb, sc_fr.)
4 The RWC Counterexample
The second counterexample is a variant of the well-known Read-to-Write-Causality (RWC) litmus test, where
the first load on the second core is an acquire operation. All other accesses in the test are SC accesses. The
C/C++ code for this test is shown in Figure 5.
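Written against std::atomic, the test of Figure 5 looks roughly as follows. As with any such harness, the names (`RwcResult`, `run_rwc_once`, `rwc_forbidden_seen`) are hypothetical, and a compiler that does not use the trailing-sync mapping of Table 2 should never produce the forbidden outcome.

```cpp
#include <atomic>
#include <thread>

struct RwcResult { int r1, r2, r3; };

// One run of the RWC variant in Figure 5 (event labels in comments).
RwcResult run_rwc_once() {
    std::atomic<int> x(0), y(0);   // a: x=0, b: y=0
    RwcResult r = {-1, -1, -1};
    std::thread t0([&] { x.store(1, std::memory_order_seq_cst); });  // c
    std::thread t1([&] {
        r.r1 = x.load(std::memory_order_acquire);  // d
        r.r2 = y.load(std::memory_order_seq_cst);  // e
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);     // f
        r.r3 = x.load(std::memory_order_seq_cst);  // g
    });
    t0.join(); t1.join(); t2.join();
    return r;
}

// The outcome that the C/C++ memory model forbids.
bool rwc_is_forbidden(const RwcResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 0;
}

// True if any of `iterations` runs shows the forbidden outcome.
bool rwc_forbidden_seen(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        if (rwc_is_forbidden(run_rwc_once())) return true;
    }
    return false;
}
```

Only the load d is weaker than seq_cst; every other access is SC, which is what makes the intermediate non-SC access the sole source of the problem.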
Once again, we begin by showing why the execution is forbidden under the C/C++ memory model. An
execution graph for the test’s outcome is shown in Figure 6. As in the IRIW counterexample, the total order
on SC operations must include the sc_fr edges as well as the sc_hb edges. The sc_fr edges from e → f and
g → c are required to satisfy sc_accesses_sc_reads_restricted, since both the SC reads read from non-SC writes.
Meanwhile, the sc_hb edges from c → e and f → g are required to satisfy consistent_sc_order, since there are
hb edges from c → e and f → g.
The hb edge from c → e arises through the transitive composition of sw and sb edges from c → d and
d → e respectively. Meanwhile, the accesses f and g are directly related by sequenced-before (and thus hb),
and not through an intermediate access.
The combination of the sc_fr and sc_hb edges generates a cycle in the sc order, which means it is not a
total order as C/C++ requires. Thus, there is no consistent execution of this program that generates the
listed outcome under the C/C++ memory model, as it is impossible to construct a correct sc order for such
an execution.
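Because the RWC test has only four SC events, the absence of a valid sc order can also be confirmed by brute force. The sketch below enumerates all 24 total orders of c, e, f, and g and counts those that respect a given set of required edges; the function names are hypothetical and single characters stand for the event labels of Figure 5.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<char, char>> Edges;

// Counts the total orders of the SC events c, e, f, g in which every
// required edge (first, second) has `first` placed before `second`.
int count_valid_sc_orders(const Edges& required) {
    std::string order = "cefg";  // sorted, so next_permutation visits all 24
    int valid = 0;
    do {
        bool ok = true;
        for (const auto& edge : required) {
            if (order.find(edge.first) > order.find(edge.second)) {
                ok = false;
                break;
            }
        }
        if (ok) ++valid;
    } while (std::next_permutation(order.begin(), order.end()));
    return valid;
}

// sc_hb edges c->e and f->g, optionally with sc_fr edges e->f and g->c.
Edges rwc_required_edges(bool include_sc_fr) {
    Edges edges = {{'c', 'e'}, {'f', 'g'}};
    if (include_sc_fr) {
        edges.push_back(std::make_pair('e', 'f'));
        edges.push_back(std::make_pair('g', 'c'));
    }
    return edges;
}
```

The sc_hb edges alone admit 6 of the 24 orders, but once the sc_fr edges are added no order survives, matching the cycle c, e, f, g, c described above.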
The compiled versions of this program to Power and ARMv7 using the leading-sync and trailing-sync
mappings are shown in Figures 7 and 8 respectively. As in the IRIW counterexample, to enforce the sc
ordering from c→ e, there must be a heavyweight sync/dmb ish between accesses d and e. In the version
compiled with the leading-sync mapping, there are sync/dmb ish fences between each pair of reads (including
between accesses d and e), and the overall outcome is correctly forbidden by both the Power and ARMv7
models of Alglave et al. [3]. On the other hand, when compiled with the trailing-sync mapping, there is
only a ctrlisync/ctrlisb between accesses d and e, which is not enough to enforce the sc ordering from
c→ e. Thus, the versions of the test compiled with the trailing-sync mapping incorrectly allow the forbidden
C/C++ outcome according to the Power and ARMv7 models of Alglave et al. In addition, Alglave et al.
have observed the forbidden C/C++ outcome of the trailing-sync Power version on Power hardware [3].
It is noteworthy that although the sc edge between accesses f and g is required because sc must be
consistent with hb, the accesses are directly related through sb, and not through an intermediate access.
Power:
Initial conditions: x=0, y=0
C0       | C1        | C2
sync     |           | sync
st x = 1 | r1 = ld x | st y = 1
         | ctrlisync |
         | sync      | sync
         | r2 = ld y | r3 = ld x
         | ctrlisync | ctrlisync
Forbidden: r1=1, r2=0, r3=0

ARMv7:
Initial conditions: x=0, y=0
C0       | C1        | C2
dmb ish  |           | dmb ish
st x = 1 | r1 = ld x | st y = 1
         | ctrlisb   |
         | dmb ish   | dmb ish
         | r2 = ld y | r3 = ld x
         | ctrlisb   | ctrlisb
Forbidden: r1=1, r2=0, r3=0
Figure 7: RWC counterexample compiled to Power (first listing) and ARMv7 (second listing) using the
leading-sync compiler mapping. The heavyweight sync/dmb ish fences between the pairs of accesses on C1 and
C2 are sufficient to disallow the forbidden outcome on ARMv7 and Power.

Power:
Initial conditions: x=0, y=0
C0       | C1        | C2
lwsync   |           | lwsync
st x = 1 | r1 = ld x | st y = 1
sync     | ctrlisync | sync
         | r2 = ld y | r3 = ld x
         | sync      | sync
Allowed: r1=1, r2=0, r3=0

ARMv7:
Initial conditions: x=0, y=0
C0       | C1        | C2
dmb ish  |           | dmb ish
st x = 1 | r1 = ld x | st y = 1
dmb ish  | ctrlisb   | dmb ish
         | r2 = ld y | r3 = ld x
         | dmb ish   | dmb ish
Allowed: r1=1, r2=0, r3=0
Figure 8: RWC counterexample compiled to Power (first listing) and ARMv7 (second listing) using the
trailing-sync compiler mapping. The absence of a heavyweight sync/dmb ish fence between the pair of loads
on C1 results in the outcome being allowed by both Power and ARMv7 models (and visible on Power hardware).
Ordering between SC accesses in program order on the same core is guaranteed by the sync/dmb ish fences
used to implement SC accesses in both the leading-sync and trailing-sync mappings. As a result, the required
sync/dmb ish fence between accesses f and g exists in the versions compiled using either mapping. Similarly,
in the cases of the IRIW counterexample where both of the reads on T2 or T3 are SC reads, a sync/dmb
ish fence will always exist between them. It is only in the case where the hb edge between two SC accesses
arises due to an intermediate non-SC access (as in the case from c→ e in the RWC counterexample) that the
choice of mapping affects the correctness of compilation.
5 Loophole in the Compilation Proof of Batty et al.
The correctness of compilation from C/C++ to Power (and by analogy, to ARMv7) using both the leading-
sync and trailing-sync mappings was supposedly proven by Batty et al. [6]. Given the above counterexamples,
there must be a loophole in the proof that allowed the incorrect mappings to be proven correct. We examined
the proof and discovered that the authors did not correctly check whether a given mapping enforced the
consistency of the sc and hb relations with respect to each other. This allowed a mapping like the trailing-sync
mapping (which does not always ensure that sc is consistent with hb) to be proven correct.
As part of their proof, the authors state that the sc order is an arbitrary linearization of
$(po^{sc}_t \cup co^{sc}_t \cup fr^{sc}_t \cup erf^{sc}_t)^*$ (which is, at a high level, the combination of
program order edges and coherence edges directly between SC accesses). Later in the proof, they state that
enforcing that the sc order is an arbitrary linearization of the above relation is enough to ensure that sc
is consistent with hb. These claims are not always true.
$(po^{sc}_t \cup co^{sc}_t \cup fr^{sc}_t \cup erf^{sc}_t)^*$ does not take into account hb edges between SC
accesses that can arise through the transitive composition of hb edges to and from an intermediate non-SC
access. (Such edges arise in both our counterexamples.) Per C/C++ memory model requirements, the sc order
must be consistent with these hb edges, but an arbitrary linearization of
$(po^{sc}_t \cup co^{sc}_t \cup fr^{sc}_t \cup erf^{sc}_t)^*$ may not be consistent with them. This also means
that enforcing that the sc order is an arbitrary linearization of
$(po^{sc}_t \cup co^{sc}_t \cup fr^{sc}_t \cup erf^{sc}_t)^*$ is not enough to guarantee that the compiled code
enforces the constraint that sc and hb are consistent with each other. This loophole allows the
trailing-sync mapping, which does not provide this guarantee (as seen in Sections 3 and 4), to be proven
correct.
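The gap can be made concrete on the IRIW counterexample. The sketch below is an illustrative simplification, not a rendering of the proof: it assumes that, for this outcome, the only edges directly between SC accesses are the two fr edges f to d and h to c (each thread has at most one SC access, the SC stores are to different locations, and neither is read by an SC load), and it works with the C/C++-level event labels of Figure 1 rather than the Power-level events used by Batty et al. All names are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<char, char>> Edges;

// True if every edge (first, second) has `first` before `second` in `order`.
bool respects(const std::string& order, const Edges& edges) {
    for (const auto& edge : edges) {
        if (order.find(edge.first) > order.find(edge.second)) return false;
    }
    return true;
}

struct LoopholeCount {
    int linearizations;   // total orders extending the direct SC-to-SC edges
    int also_respect_hb;  // of those, how many are also consistent with hb
};

LoopholeCount iriw_loophole() {
    const Edges direct = {{'f', 'd'}, {'h', 'c'}};  // assumed direct edges (fr)
    const Edges hb = {{'c', 'f'}, {'d', 'h'}};      // hb via the acquire loads
    std::string order = "cdfh";  // the four SC events, sorted
    LoopholeCount n = {0, 0};
    do {
        if (!respects(order, direct)) continue;
        ++n.linearizations;
        if (respects(order, hb)) ++n.also_respect_hb;
    } while (std::next_permutation(order.begin(), order.end()));
    return n;
}
```

Under these assumptions there are 6 linearizations of the direct edges, and none of them is consistent with the hb edges from c to f and d to h, so choosing an arbitrary linearization cannot by itself guarantee that sc agrees with hb.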
We contacted Batty et al. regarding this loophole in their proof, and they graciously confirmed our
findings.
6 Current State of Compilers and Architectures
The previous sections have established that the trailing-sync compiler mapping is invalid for the current
C/C++ memory model. Luckily, as of this paper, neither GCC nor Clang implements the exact trailing-sync
mapping in Table 2 for either Power or ARMv7. Specifically, GCC and Clang use the leading-sync
compiler mapping for Power, and while they use a trailing-sync mapping for ARMv7, the mapping for load
acquire operations is ld; dmb ish (or stronger). Thus, when compiled for ARMv7 with GCC or Clang, the
trailing-sync counterexamples do have dmb ish fences between each pair of loads, which, as outlined in
Sections 3 and 4, is enough to disallow the forbidden C/C++ outcome.
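The difference between the two ARMv7 variants can be seen by writing out the instruction sequence for an acquire load followed by an SC load, as on the reading cores of the counterexamples. The sketch below models instructions as strings; the function names and the boolean switch are hypothetical, and `ctrlisb` stands for the teq; beq; isb sequence.

```cpp
#include <string>
#include <vector>

// "load acquire; load seq_cst" under an ARMv7 trailing-sync mapping.
// acquire_uses_dmb == false: acquire as in Table 2 (ldr; ctrlisb).
// acquire_uses_dmb == true:  acquire followed by dmb ish, as described
//                            above for GCC and Clang.
std::vector<std::string> acquire_then_sc_load(bool acquire_uses_dmb) {
    std::vector<std::string> code;
    code.push_back("ldr");                                   // acquire load
    code.push_back(acquire_uses_dmb ? "dmb ish" : "ctrlisb");
    code.push_back("ldr");                                   // SC load
    code.push_back("dmb ish");                               // trailing fence
    return code;
}

// True if a dmb ish lies between the first and second load.
bool dmb_between_loads(const std::vector<std::string>& code) {
    int loads = 0;
    bool fence_seen = false;
    for (const std::string& insn : code) {
        if (insn == "ldr") {
            if (++loads == 2) return fence_seen;
        } else if (insn == "dmb ish" && loads == 1) {
            fence_seen = true;
        }
    }
    return false;
}
```

With the Table 2 acquire mapping no dmb ish separates the two loads, whereas the stronger acquire mapping places one there, which is the fence that Sections 3 and 4 identify as necessary.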
Architecturally speaking, the forbidden C/C++ outcomes of both counterexamples have been observed
on Power hardware when the tests are compiled using the trailing-sync mapping [3]. While the ARMv7
model of Alglave et al. allows the forbidden C/C++ outcome when the test is compiled to ARMv7 using the
trailing-sync mapping, the behaviour has not been observed on ARMv7 hardware [3].
Finally, while the leading-sync compiler mapping is not vulnerable to the counterexamples discussed in this
paper, Vafeiadis et al. have recently also found a counterexample for the leading-sync compiler mapping [16],
which they will be publishing separately. The combination of these counterexamples means that it is currently
impossible to correctly compile C/C++ to Power or ARMv7 with either mapping, and either the mappings
or the C/C++ memory model will need to change for correct compilation to be possible.
7 Conclusion
In this paper we have outlined two counterexamples for the trailing-sync compiler mappings from C/C++ to
Power and ARMv7, as well as the loophole in a prior proof of correctness of these mappings that allowed
them to be proven correct. Looking forward, either the mappings or the C/C++ memory model will need to
change in order to ensure that the guarantees of the high-level language memory model are respected by
compiled programs running on the Power and ARMv7 architectures.
8 Acknowledgements
This work was supported in part by C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO
and DARPA.
References
[1] Sarita Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer,
29(12):66–76, 1996.
[2] Sarita V. Adve and Mark D. Hill. Weak ordering – a new definition. In Proceedings of the 17th Annual
International Symposium on Computer Architecture, ISCA ’90, pages 2–14, New York, NY, USA, 1990.
ACM.
[3] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats: Modelling, simulation, testing, and
data mining for weak memory. ACM Transactions on Programming Languages and Systems (TOPLAS),
36(2):7:1–7:74, July 2014.
[4] Mark Batty. The C11 and C++11 Concurrency Model. PhD thesis, University of Cambridge, Cambridge,
UK, 2014.
[5] Mark Batty, Alastair F. Donaldson, and John Wickerson. Overhauling SC atomics in C11 and OpenCL.
In 43rd Annual Symposium on Principles of Programming Languages (POPL), 2016.
[6] Mark Batty, Kayvan Memarian, Scott Owens, Susmit Sarkar, and Peter Sewell. Clarifying and compiling
C/C++ concurrency: From C++11 to POWER. In Proceedings of the 39th Annual ACM SIGPLAN-
SIGACT Symposium on Principles of Programming Languages, POPL ’12, pages 509–520, New York,
NY, USA, 2012. ACM.
[7] Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. Mathematizing C++
concurrency. In 38th Annual Symposium on Principles of Programming Languages (POPL), 2011.
[8] Hans-J. Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model. In 29th
Conference on Programming Language Design and Implementation (PLDI), 2008.
[9] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John
Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. 17th
International Symposium on Computer Architecture (ISCA), 1990.
[10] ISO/IEC. Programming Languages – C++, 2014.
[11] Daniel Lustig, Geet Sethi, Margaret Martonosi, and Abhishek Bhattacharjee. COATCheck: Verifying
Memory Ordering at the Hardware-OS Interface. In Proceedings of the 21st International Conference on
Architectural Support for Programming Languages and Operating Systems, 2016.
[12] Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. Understanding POWER
multiprocessors. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language
Design and Implementation, PLDI ’11, pages 175–186, New York, NY, USA, 2011. ACM.
[13] Peter Sewell. C/C++11 mappings to processors. 2016.
[14] Caroline Trippel, Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. Exploring
the trisection of software, hardware, and ISA in memory model design. CoRR, abs/1608.07547, 2016.
[15] Viktor Vafeiadis, Thibaut Balabonski, Soham Chakraborty, Robin Morisset, and Francesco Zappa Nardelli.
Common compiler optimisations are invalid in the C11 memory model and what we can do about it. In
42nd Symposium on Principles of Programming Languages (POPL), 2015.
[16] Viktor Vafeiadis and Ori Lahav. Personal communication, Sept. 27th, 2016.
