On Architecture to Architecture Mapping for Concurrency by Chakraborty, Soham
ON ARCHITECTURE TO ARCHITECTURE MAPPING FOR
CONCURRENCY
Soham Chakraborty
Department of Computer Science and Engineering
IIT Delhi
Delhi 110016, India
soham@cse.iitd.ac.in
ABSTRACT
Mapping programs from one architecture to another plays a key role in technologies such as binary
translation, decompilation, emulation, virtualization, and application migration. Although multicore
architectures are ubiquitous, the state-of-the-art translation tools do not handle concurrency primi-
tives correctly. Doing so is rather challenging because of the subtle differences in the concurrency
models between architectures.
In response, we address various aspects of the challenge. First, we develop correct and efficient
translations between the concurrency models of two mainstream architecture families: x86 and ARM
(versions 7 and 8). We develop direct mappings between x86 and ARMv8 and ARMv7, and fence
elimination algorithms to eliminate redundant fences after direct mapping. Although our mapping
utilizes ARMv8 as an intermediate model for mapping between x86 and ARMv7, we argue that it
should not be used as an intermediate model in a decompiler because it disallows common compiler
transformations.
Second, we propose and implement a technique for inserting memory fences for safely migrating
programs between different architectures. Our technique checks robustness against x86 and ARM,
and inserts fences upon robustness violations. Our experiments demonstrate that in most of the
programs both our techniques introduce significantly fewer fences compared to naive schemes for
porting applications across these architectures.
1 Introduction
Architecture to architecture mapping is the widely applicable concept of converting an application that runs over some
architecture X to run over some different architecture Y. Examples include binary translators notaz [2014], Chernoff
et al. [1998], which recompile machine code from one architecture to another in a semantics-preserving manner. Such
translation is facilitated by decompilers Bougacha, Bits, Yadavalli and Smith [2019], avast, Shen et al. [2012], which
lift machine code from a source architecture to an intermediate representation (IR) and compile it to a target architecture.
Emulators implement a guest architecture on a host architecture. For instance, QEMU QEMU emulates a number of
architectures (including x86 and ARM) over other architectures, the Android emulator Android-x86 runs x86 images
on ARM, while Windows 10 on ARM emulates x86 applications Docs.
Architecture to architecture mapping is essential for application migration and compatibility. An application written
for an older architecture may need to be upgraded to execute on the latest architectures, while an application primarily targeting
a later architecture may need to preserve backward compatibility with an older one. For example, Arm discusses
the required measures to port an application from ARMv5 to ARMv7, including synchronization primitives. Besides
its practical uses, formally mapping between architectures is helpful in the design process of future processors and
architectures, as it allows one to compare and relate subtle features like concurrency, which vary significantly from
one architecture to another.
A key feature that has been overlooked in these mappings is concurrency, which is crucial for achieving good
performance with modern multicore processors. To emulate or port a concurrent application correctly requires us to
map the concurrency primitives of the source to those of the target, taking into account the subtle differences in their
concurrency models. Such semantic differences appear not only between architectures (e.g., x86 and ARM), but also
between different versions of the same architecture (e.g., ARMv7 and ARMv8 Pulte et al. [2018]).
arXiv:2009.03846v1 [cs.PL] 8 Sep 2020
[Diagram: mapping schemes among x86, ARMv8, and ARMv7/ARMv7-mca, with edges labeled §4.1, §4.3, §4.4, and §4.5, and C11-based variants labeled §4.2 and §4.6.]
Figure 1: Correct and efficient mapping schemes between x86, ARMv8, and ARMv7/ARMv7-mca.
In this paper, we address the challenge of developing correct and efficient translations between relaxed memory con-
currency models of x86, ARMv8, and ARMv7. We approach the problem from multiple angles.
First, we develop correct mapping schemes between these concurrency models, using the ARMv8 model as an efficient
intermediate concurrency model for mapping between x86 and ARMv7.
This naturally leads to the question of whether the ARMv8 model can also serve as a concurrency model for the IR in a
decompiler. Decompilers typically (1) raise the source machine code to an IR, (2) optimize the IR, and (3) generate the
target code. Thus, if the IR follows the ARMv8 concurrency model, steps (1) and (3) can be performed efficiently to
facilitate translations between x86 and ARM concurrent programs. For step (2), we evaluate common optimizations
on ARMv8 concurrency and observe that a number of common transformations are unsound. The result demonstrates
that achieving a correct and efficient mapping across all three steps requires a different concurrency
model. We leave the exploration of such a model for future research.
Next, we focus on optimizing the direct mappings further. For correctness, direct mappings introduce
fences when translating stronger accesses to weaker ones. The introduced fences are often redundant in certain
memory access sequences and can be eliminated safely. We identify conditions for safe fence elimination, prove that
elimination under these conditions is safe, and propose fence elimination algorithms based on them.
In addition to fence elimination, we apply memory sequence analysis to check and enforce robustness for a class of
concurrent programs. Robustness analysis checks whether a program running on a weaker model exhibits only the behaviors
which are allowed by a stronger model. The behaviors of a robust program on the weaker model are indistinguishable
from those on the stronger model, and therefore the program can seamlessly migrate from one architecture to another as far as
concurrent behaviors are concerned. If a program is not robust, we insert fences to enforce robustness against a stronger
model. This is especially beneficial in application porting and migration Barbalace et al. [2020, 2017], where it is crucial
to preserve the observable behaviors of a running application.
Contributions & Results. Now we discuss the specific contributions and obtained results.
• In §4 we propose the mapping schemes (↦) between x86 and ARMv8, and between ARMv8 and ARMv7 as shown
in Fig. 1. We do not propose any direct mapping between x86 and ARMv7; instead we consider ARMv8 as an
intermediate model. We achieve x86 to ARMv7 mapping by combining x86 to ARMv8 and ARMv8 to ARMv7
mappings. Similarly, ARMv7 to x86 mapping is derived by combining ARMv7 to ARMv8 and ARMv8 to x86
mappings. We show that the direct mapping schemes would be the same as these two-step mappings through ARMv8.
We also show that these mapping schemes are efficient: each of the leading and/or trailing fences used in mapping
the memory accesses is required to preserve correctness.
• We show that multicopy-atomicity (MCA) (a write operation is observable to all other threads at the same time)
does not affect the mapping schemes between ARMv8 and ARMv7, even though it is a major difference between the two
models Pulte et al. [2018]: ARMv7 allows non-MCA behavior, unlike ARMv8. To demonstrate this,
we propose ARMv7-mca in §4, which restricts non-MCA behaviors in ARMv7, and show that the mapping scheme
from ARMv8 to ARMv7-mca is the same as ARMv8 to ARMv7 (Fig. 13a) and the mapping scheme from ARMv7-mca
to ARMv8 is the same as ARMv7 to ARMv8 (Fig. 12a).
• In §4.2, §4.6, and in §4.8 we propose alternative schemes for x86 to ARMv8 and ARMv8 to ARMv7 mapping
where the respective x86 and ARMv8 programs are generated from C11 concurrent programs. In these schemes
we exploit the catch-fire semantics of C11 concurrency ISO/IEC 9899 [2011], ISO/IEC 14882 [2011]. We do not
generate additional fences for the load or store accesses generated from non-atomic loads or stores unlike the x86
to ARMv8 and ARMv8 to ARMv7 mappings.
X[1] = 1;  ||  a = X[1];  ||  c = Y[1];  ||  Y[1] = 1;
               b = Y[a];      d = Z[c];
↦
X[1] = 1;  ||  a = X[1];  ||  c = Y[1];  ||  Y[1] = 1;
               CBISB          CBISB
               b = Y[a];      d = Z[c];
               CBISB          CBISB
(a) Initially X[1] = Y[1] = 0 and behavior in question: a = c = 1, b = d = 0.
[Execution graph: St(X[1], 1), St(Y[1], 1), Ld(X[1], 1) -addr-> Ld(Y[1], 0), Ld(Y[1], 1) -addr-> Ld(X[1], 0)]
(b) Disallowed in ARMv8
[Execution graph: St(X[1], 1), St(Y[1], 1), Ld(X[1], 1) -R-> Ld(Y[1], 0), Ld(Y[1], 1) -R-> Ld(X[1], 0)]
(c) Allowed in ARMv7. R = ctrlisb ∪ addr
Figure 2: LDR ↦ LDR; CBISB in ARMv8 to ARMv7 mapping is unsound.
• In §5 we study the reordering, elimination, and access strengthening transformations in ARMv8 model. We prove
the correctness of the safe transformations and provide counter-examples for the unsafe transformations.
• The mapping schemes introduce additional fences while mapping the memory accesses. These fences are required
to preserve translation correctness in certain scenarios and are otherwise redundant. In §6 we identify the conditions
under which the fences are redundant and prove that eliminating them is safe. Based on these conditions we define
fence elimination algorithms to eliminate redundant fences without affecting the transformation correctness.
• We define the conditions for robustness for (i) an ARMv8 program against sequential consistency (SC) and the x86
model, (ii) an ARMv7/ARMv7-mca program against SC, x86, and ARMv8 models, and (iii) an ARMv7 program against the
ARMv7-mca model in §7, and prove their correctness in Appendix D. We also introduce fences to enforce robustness
of a program on a weaker model against a stronger model. To the best of our knowledge we are the first to check and enforce
robustness for ARM programs as well as for non-SC models.
• In §8 we discuss our experimental results. We have developed a compiler based on LLVM to capture the effect of
mappings between x86, ARMv8, and ARMv7. Next, we have developed fence elimination passes based on §6. The
passes eliminate a significant number of redundant fences in most of the programs and in some cases generate more
efficient programs than the LLVM mappings.
We have also developed analyzers to check and enforce robustness in x86, ARMv8, and ARMv7. For a number
of x86 programs the result of our SC-robustness checker matches the results from Trencher Bouajjani et al. [2013],
which also checks SC-robustness against the TSO model. Moreover, we enforce robustness with significantly fewer
fences compared to naive schemes which insert fences without robustness information.
In the next section we give an informal overview of the proposed approaches. Next, in §3 we discuss the
axiomatic models of the respective architectures which we use in later sections. The proofs and additional details are
in the supplementary material.
2 Overview
In this section we discuss the overview of our proposed schemes, related observations, and the analysis techniques.
2.1 Alternatives in x86 to ARMv8 mapping
In x86 to ARMv8 mapping we considered two alternatives for mapping loads and stores: (1) x86 store and load
to ARMv8 release-store (WMOV ↦ STLR) and acquire-load (RMOV ↦ LDAR) respectively. (2) x86 store and load to
ARMv8 regular store and load accesses with respective leading and trailing fences as proposed in Fig. 9a, that is,
WMOV ↦ DMBST; STR and RMOV ↦ LDR; DMBLD respectively. We choose (2) over (1) for the following reasons.
a = X; // 1   ||  b = Y; // 1
Y = a;            X = b;
Y = 1;            X = 1;
[Execution graph, per thread: Ld(X, 1) -data-> St(Y, 1) -coi-> St(Y, 1)  and  Ld(Y, 1) -data-> St(X, 1) -coi-> St(X, 1)]
Figure 3: a = b = 1 is disallowed in ARMv8 but allowed in ARMv7-mca for LDR ↦ LDR mapping.
Source (statements in figure order): a = X; Y = a; b = Y; Y = 2; c = Y; Z = c; d = Z; Z = 4; e = Z; X = e; X = 1;
↦
Target (statements in figure order): a = X; CBISB; Y = a; b = Y; CBISB; Y = 2; c = Y; CBISB; Z = c; d = Z; CBISB; Z = 4; e = Z; CBISB; X = e; X = 1;
[Execution graph: events Ld(X, 1), St(Y, 1), Ld(Y, 2), St(Y, 2), Ld(Y, 2), St(Z, 2), Ld(Z, 1), St(Z, 4), Ld(Z, 4), St(X, 4), St(X, 1); edges labeled R (three), coi, rfe (five), and coe (two)]
Figure 4: Behavior a = 1, b = c = 2, d = e = 4 is disallowed in ARMv8 but allowed in ARMv7-mca for LDR ↦
LDR; CBISB mapping. In the execution R = data in ARMv8 and R = data ∪ ctrlisb in ARMv7-mca.
• Reordering is restricted. x86 allows the reordering of independent store and load operations accessing different
locations Lahav and Vafeiadis [2016]. ARMv8 also allows reordering of different-location store-load pairs, but
restricts the reordering of a pair of release-store and acquire-load operations as it violates the barrier-ordered-by (bob)
order Pulte et al. [2018]. Thus scheme (1) is more restrictive than (2) considering reordering flexibility after mapping.
• Further optimization. (2) generates certain fences which are redundant in certain scenarios and can be removed
safely. Consider the mappings below; the generated DMBST is redundant in (2) and can be eliminated safely, unlike
in mapping (1).
(1) RMOV; WMOV ↦ LDAR; STLR
(2) RMOV; WMOV ↦ LDR; DMBLD; DMBST; STR ⇝ LDR; DMBLD; STR
• x86 ↦ ARMv8 ↦ ARMv7 would introduce additional fences. To map x86 to ARMv7, if we use ARMv8 as an
intermediate step then scheme (1) introduces additional fences, unlike (2), as follows.
(1) WMOV ↦ STLR ↦ DMB; STR; DMB        (2) WMOV ↦ DMBST; STR ↦ DMB; STR
2.2 ARMv8 to ARMv7 mapping: ARMv7 LDR is significantly weaker than ARMv8 LDR
ARMv8 to ARMv7 mapping in Fig. 13a introduces a trailing DMB fence for ARMv8 LDR and LDAR accesses, as
introducing a trailing control fence (CBISB) is not enough for correctness. Consider the mapping of the program from
ARMv8 to ARMv7 in Fig. 2a. The execution is disallowed in ARMv8 as it creates an ordered-before (ob) cycle as
shown in Fig. 2b. However, while mapping to ARMv7, if we map LDR ↦ LDR; CBISB and the rest of the instructions are
mapped following the mapping scheme in Fig. 13a, then the execution would be allowed as shown in Fig. 2c. Therefore
LDR ↦ LDR; CBISB is too weak, and we require a DMB fence after each load as well as after each RMW for the same reason.
2.3 Multicopy atomicity does not change the mapping from ARMv8
In ?? we strengthen the ARMv7 model to ARMv7-mca to exclude non-multicopy atomic behaviors. However, even
with such a strengthening an LDR mapping requires a trailing DMB fence.
Load access mapping without trailing fence is unsound in ARMv8 to ARMv7-mca mapping. Consider the
example in Fig. 3 where the ARMv8 to ARMv7-mca mapping does not introduce a trailing fence for a load access, so that
we analyze the same execution in ARMv8 and ARMv7-mca. The shown behavior is not allowed in ARMv8
as there is a dependency-ordered-before (dob) relation from the reads to the respective writes due to the data; coi relation.
In this case there is no preserved-program-order (ppo) relation in ARMv7-mca from the reads to the respective writes
as data; coi ⊈ ppo. Therefore the execution is ARMv7-mca consistent and the mapping introduces a new outcome.
Hence LDR ↦ LDR in ARMv8 to ARMv7-mca mapping is unsound.
Trailing control fence is not enough. Consider the example mapping and the execution in Fig. 4. The execution
in ARMv8 has an ordered-before (ob) cycle and hence is not consistent. The LDR ↦ LDR; CBISB mappings would result in
respective ppo relations in the execution, but these ppo relations do not restrict such a cycle. As a result, the execution
is ARMv7 or ARMv7-mca consistent. Hence LDR ↦ LDR; CBISB is unsound in ARMv8 to ARMv7-mca mapping.
2.4 Mapping schemes for programs generated from C11
In §4.2, §4.6, and §4.8 we study C11 ↦ x86 ↦ ARMv8, C11 ↦ ARMv8 ↦ ARMv7, and C11 ↦ ARMv8 ↦ ARMv7-mca
mapping schemes respectively. In these mappings from stronger to weaker models, we consider that the source
architecture program is generated from a C11 program following the mapping in map. We use this information to categorize
the accesses in architectures as non-atomic (NA) and atomic (A), and exploit two aspects of C11 concurrency: first,
a program with a data race on a non-atomic access results in undefined behavior. Second, C11 uses atomic accesses to
achieve synchronization and avoid data races on non-atomics. Considering these properties we introduce leading or
trailing fences in mapping particular atomic accesses and we map non-atomics to respective accesses without any
leading or trailing fence.
Pros and Cons. The C11 ↦ x86 ↦ ARMv8 scheme has a tradeoff: for non-atomics it is more efficient than
x86 ↦ ARMv8 as it does not introduce additional fences, whereas an atomic store mapping requires a leading
full fence or a pair of DMBLD and DMBST fences. Consider the mapping of the sequence: LdNA; StNA; StREL ↦
RMOVNA; WMOVNA; WMOVA ↦ LDR; STR; DMBFULL; STR.
In this case the C11 non-atomic memory accesses cannot be moved after the release write access. Hence we introduce
a leading DMBFULL with WMOVA in C11 ↦ x86 ↦ ARMv8 to preserve the same order. Consider the C11 to x86 to
ARMv8 mapping of the program below.
C11:
a = XNA;      ||  r = ZACQ;
YNA = 1;          if (r == 1) {
ZREL = 1;           XNA = 2;
                    YNA = 2;
                    c = YNA;
                  }
↦ x86:
a = XNA;      ||  r = ZA;
YNA = 1;          if (r == 1) {
ZA = 1;             XNA = 2;
                    YNA = 2;
                    c = YNA;
                  }
↦ ARMv8:
a = X;        ||  r = Z;
Y = 1;            DMBLD
DMBFULL           if (r == 1) {
Z = 1;              X = 2;
                    Y = 2;
                    c = Y;
                  }
The C11 program is data race free as it is well-synchronized by release-acquire accesses on Z, and the outcome
a = 2, r = c = 1 is disallowed in the program. The generated ARMv8 program disallows the outcome; however,
without the DMBFULL in the first thread the outcome would be possible. This is because a DMBLD or DMBFULL fence is
required to preserve the bob relation between the Ld(X, 2) and St(Z, 1) events. Note that a DMBLD is not sufficient to establish
a bob relation between St(Y, 1) and St(Z, 1), and hence we require a DMBST or DMBFULL fence. Therefore we have to
introduce a leading pair of DMBLD and DMBST fences or a DMBFULL fence for the WMOVA mapping.
As a result, Fig. 9b provides a more efficient mapping for RMOVNA and WMOVNA accesses, but incurs more cost for WMOVA
by introducing a leading DMBFULL instead of a DMBST fence. After the mapping we may weaken such a DMBFULL fence
whenever appropriate.
The C11 ↦ ARMv8 ↦ ARMv7 scheme does not introduce fences for mapping non-atomics and is therefore more
efficient than ARMv8 ↦ ARMv7. Note that C11 St⊒REL generates an STLR in ARMv8 and ARMv8 STR is generated
only from C11 St⊑RLX which does not enforce any such order.
2.5 ARMv8 as an intermediate model for mappings between x86 and ARMv7
Now we move to mappings between x86 and ARMv7. We do not propose direct mapping schemes; instead
we use ARMv8 concurrency as an intermediate concurrency model, as x86 ↦ ARMv7/ARMv7-mca
and ARMv7/ARMv7-mca ↦ x86 would be the same as x86 ↦ ARMv8 ↦ ARMv7/ARMv7-mca and
ARMv7/ARMv7-mca ↦ ARMv8 ↦ x86 respectively.
x86 ↦ ARMv7 vs x86 ↦ ARMv8 ↦ ARMv7. We derive x86 ↦ ARMv8 ↦ ARMv7 by combining x86 ↦ ARMv8
(Fig. 9a) and ARMv8 ↦ ARMv7 (Fig. 13a) as follows.
a = X; // 1   ||  b = Z; // 1
c = Y[a];         V[b] = 1;
Z = 1;            X = 1;
[Execution graph, per thread: Ld(X, 1) -addr-> Ld(Y[1], 1) -po-> St(Z, 1)  and  Ld(Z, 1) -addr-> St(V[1], 1) -po-> St(X, 1); the threads are linked by rfe edges]
Figure 5: Load-store or store-store reorderings introduce a = b = 1 outcome and are unsound in ARMv8.
MFENCE ↦ DMBFULL ↦ DMB
RMW    ↦ DMBFULL; RMW; DMBFULL ↦ DMB; RMW; DMB
RMOV   ↦ LDR; DMBLD ↦ LDR; DMB
WMOV   ↦ DMBST; STR ↦ DMB; STR
The correctness proofs of the x86 to ARMv8 and ARMv8 to ARMv7 mapping schemes in Fig. 9a and Fig. 13a
demonstrate the necessity of the introduced fences. The introduced fences only allow reordering of an independent
store-load access pair on different locations, which matches the reordering allowed by x86. Therefore
the introduced fences are necessary and sufficient.
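The two-step derivation can be sketched in Python. This is only an illustration: the per-instruction ARMv8 to ARMv7 table below is an assumed reading of Fig. 13a (every ARMv8 fence becomes a DMB; LDR and RMW receive a trailing DMB), the helper names `translate`, `merge_adjacent_fences`, and `compose` are hypothetical, and collapsing adjacent DMB fences into one is an assumption that matches the composed scheme listed above.

```python
# x86 -> ARMv8 scheme of Fig. 9a, written as instruction -> instruction list.
X86_TO_ARMV8 = {
    "RMOV": ["LDR", "DMBLD"],
    "WMOV": ["DMBST", "STR"],
    "RMW": ["DMBFULL", "RMW", "DMBFULL"],
    "MFENCE": ["DMBFULL"],
}

# Assumed reading of the ARMv8 -> ARMv7 scheme (Fig. 13a is not shown here).
ARMV8_TO_ARMV7 = {
    "LDR": ["LDR", "DMB"],
    "STR": ["STR"],
    "RMW": ["RMW", "DMB"],
    "DMBLD": ["DMB"],
    "DMBST": ["DMB"],
    "DMBFULL": ["DMB"],
}


def translate(seq, table):
    """Map every instruction of seq through table and concatenate the results."""
    out = []
    for instr in seq:
        out.extend(table[instr])
    return out


def merge_adjacent_fences(seq, fence="DMB"):
    """Collapse runs of identical adjacent fences into a single fence."""
    out = []
    for instr in seq:
        if instr == fence and out and out[-1] == fence:
            continue
        out.append(instr)
    return out


def compose(first, second):
    """Derive a one-step table by chaining two per-instruction tables."""
    return {src: merge_adjacent_fences(translate(mid, second))
            for src, mid in first.items()}


X86_TO_ARMV7 = compose(X86_TO_ARMV8, ARMV8_TO_ARMV7)
for src, dst in X86_TO_ARMV7.items():
    print(src, "->", "; ".join(dst))
```

Under these assumptions the composed table coincides with the x86 to ARMv7 scheme listed above, for example WMOV becomes DMB; STR.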
ARMv7 ↦ x86 vs ARMv7 ↦ ARMv8 ↦ x86. We derive ARMv7 ↦ ARMv8 ↦ x86 by combining ARMv7 ↦
ARMv8 (Fig. 12a) and ARMv8 ↦ x86 (Fig. 12b) as follows. Note that the mapping does not introduce any fence
along with the accesses and is therefore optimal.
DMB ↦ DMBFULL ↦ MFENCE        RMW ↦ RMW ↦ RMW
LDR ↦ LDR ↦ RMOV              STR ↦ STR ↦ WMOV
2.6 Common optimizations in ARMv8 concurrency
We consider ARMv8 as a concurrency model of an IR and find that many common compiler optimizations are unsound
in ARMv8.
• ARMv8 does not allow store-store and load-store reorderings. Consider the program and the execution in Fig. 5.
In this execution there are addr; [Ld]; po; [St] and addr; [St]; po; [St] relations in the first and second threads
respectively, which result in dob relations and in turn an ob cycle. Therefore the execution is not ARMv8 consistent and
the outcome a = b = 1 is disallowed. However, load-store reordering c = Y[a]; Z = 1 ⇝ Z = 1; c = Y[a]
or store-store reordering V[b] = 1; X = 1 ⇝ X = 1; V[b] = 1 removes the respective dob relation(s) and enables
a = b = 1 in the target. Thus store-store and load-store reorderings are unsafe in ARMv8.
• Overwritten-write (OW) is unsound. Consider the program and its outcome a = 1, b = 2 in Fig. 6a. In the
respective execution the first thread has data; coi ⊆ dob from Ld(X, 1) to St(Y, 2). The other thread has a bob
relation due to the DMBFULL fence, which in turn creates an ob cycle. Hence the execution is not ARMv8 consistent and
the outcome a = 1, b = 2 is disallowed. Overwriting Y = a in the first thread removes the dob relation and then
a = 1, b = 2 becomes possible.
• Read-after-write (RAW) is unsound. We study the RAW elimination in Fig. 6b, which is performed based on
dependence analysis. Before we go to the transformation, we briefly discuss dependence analysis on the access
sequence a = X; Y[a ∗ 0] = 1. In this case there is a false dependence from the load of X to the store of Y[a ∗ 0] as
a ∗ 0 = 0 always. ARMv8 does not allow removing such a false dependence Pulte et al. [2018]. However, we
observe that using a static analysis that distinguishes between true and false dependencies is also wrong in ARMv8.
In this example we analyze such a false dependency and based on that we perform read-after-write elimination on
the program, that is, Y[a ∗ 0] = 1; b = Y[0] ⇝ Y[a ∗ 0] = 1; b = 1.
The source program does not have any execution with a = 1, b = 1, c = 0 as addr; rfi; addr ⊆ dob and in the other
thread there is a bob relation, which together create an ob cycle. In the target execution there is no dob relation from
the load of X to the load of c = Z[b] and therefore the outcome a = 1, b = 1, c = 0 is possible. As a result, the
transformation is unsound in ARMv8.
2.7 Fence eliminations in ARMv8
The mapping schemes introduce leading and/or trailing fences for various memory accesses. However, some of these
fences may be redundant and can safely be eliminated. Consider the x86 ↦ ARMv8 mapping and subsequent redundant
fence eliminations below.
RMOV; MFENCE; WMOV ↦ LDR; DMBLD; DMBFULL; DMBST; STR ⇝ LDR; DMBLD; STR
a = X;          a = X;
Y = a;    ⇝     Y = 2;        Context: b = Y; DMBLD; X = 1;
Y = 2;
(a) OW introduces a = 1, b = 2
a = X;               a = X;
Y[a ∗ 0] = 1;   ⇝    Y[a ∗ 0] = 1;      Context: Z[1] = 1; DMBFULL; X = 1;
b = Y[0];            b = 1;
c = Z[b];            c = Z[b];
(b) RAW introduces a = b = 1, c = 0
Figure 6: Overwritten-write (OW) and Read-after-write elimination (RAW) are unsound in ARMv8.
SB ≜ ⟨X = 1; MFENCE; t = Y;⟩
SB′ ≜ ⟨Y = 1; MFENCE; t = X;⟩
SB′′ ≜ ⟨Y = 1; t = Z;⟩
[Access sequences: St(X, _); MFENCE; Ld(Y, _)  |  St(Y, _); MFENCE; Ld(X, _)  |  St(Y, _); Ld(Z, _)]
Figure 7: A program of the form SB || · · · || SB || SB′ || · · · || SB′ || SB′′ || · · · || SB′′ is SC-robust against x86.
The ARMv8 access sequence generated from x86 to ARMv8 mapping introduces three intermediate fences between
the load-store pair. A DMBLD fence suffices to order a load-store pair and hence the DMBFULL as well as the DMBST
fence are redundant and are safely eliminated.
To perform such fence eliminations, we first identify non-deletable fences and then delete the rest of the fences. A fence is
non-deletable if it is placed between a memory access pair in at least one program path such that the access pair may
execute out of order without the fence. Analyzing the ARMv8 sequence above, we mark the DMBLD as non-deletable
and the rest of the fences as redundant.
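A minimal Python sketch of this idea for a single straight-line ARMv8 sequence follows. It is not the algorithm of §6: it ignores control flow, dependencies, and release/acquire accesses, and the greedy choice of the cheapest covering fence (`COST`) is an assumption made for illustration. The `ORDERS` table encodes which (earlier, later) access kinds each fence orders: DMBLD orders a load with any later access, DMBST orders store-store pairs, and DMBFULL orders all pairs.

```python
# Access-pair kinds ordered by each ARMv8 fence placed between the two accesses.
ORDERS = {
    "DMBLD": {("LD", "LD"), ("LD", "ST")},
    "DMBST": {("ST", "ST")},
    "DMBFULL": {("LD", "LD"), ("LD", "ST"), ("ST", "LD"), ("ST", "ST")},
}
COST = {"DMBLD": 1, "DMBST": 1, "DMBFULL": 2}  # assumption: prefer weaker fences
ACCESS = {"LDR": "LD", "STR": "ST"}


def eliminate_fences(seq):
    """Keep a subset of fences that still orders every pair the input ordered."""
    accesses = [i for i, ins in enumerate(seq) if ins in ACCESS]
    fences = {i for i, ins in enumerate(seq) if ins in ORDERS}
    keep = set()  # indices of fences marked non-deletable
    for a in accesses:
        for b in accesses:
            if a >= b:
                continue
            pair = (ACCESS[seq[a]], ACCESS[seq[b]])
            covering = [f for f in sorted(fences)
                        if a < f < b and pair in ORDERS[seq[f]]]
            # Nothing to do if no fence orders the pair or a kept fence already does.
            if not covering or any(f in keep for f in covering):
                continue
            keep.add(min(covering, key=lambda f: (COST[seq[f]], f)))
    return [ins for i, ins in enumerate(seq) if i not in fences or i in keep]


print(eliminate_fences(["LDR", "DMBLD", "DMBFULL", "DMBST", "STR"]))
```

On the sequence discussed above, the sketch keeps only the DMBLD between the load and the store. The greedy selection preserves every directly ordered access pair but is not guaranteed to produce a minimal set of fences.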
2.8 Analyzing and enforcing robustness
Existing approaches Lahav and Margalit [2019], Bouajjani et al. [2013] explore program executions
to check robustness. We propose an alternative approach that analyzes memory access sequences. In this analysis:
1. We identify the program components which may run concurrently. Currently we consider fork-join parallelism
and identify the functions which create one or multiple threads. Our analysis assumes that each such function
creates multiple threads. Therefore, by analyzing these functions f1, . . . , fn, we analyze all programs of the form
f1 || · · · || f1 || · · · || fn || · · · || fn.
2. Next, we analyze the memory access sequences in f1, . . . , fn to check whether the memory access pairs in these
functions may create a cycle.
3. In case a cycle is possible, we check if each access pair on a cycle is ordered by the robustness condition. If so, then
all K consistent executions of these programs are also M consistent.
Consider the example in Fig. 7. We analyze the access sequences in thread functions SB, SB′, and SB′′ and derive a
graph of memory access pairs, which contains a cycle formed by the memory access pairs in SB and SB′. These pairs on the
cycle have intermediate MFENCE operations, which enforce only interleaving executions irrespective of the number of
threads created from SB, SB′, SB′′. Our analysis reports these x86 programs as SC-robust. Using this approach we
check M-robustness against K, where K is a weaker model than M.
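The three steps can be sketched in Python for SC-robustness of x86 programs. This is a simplified, conservative illustration and not the robustness conditions of §7: it assumes that the only pairs needing an order are store-load pairs on different locations, that an intermediate MFENCE orders such a pair, and that instantiating each thread function twice (`copies=2`) is enough to expose cycles. The function name `is_sc_robust_x86` and the encoding of operations as tuples are hypothetical.

```python
from itertools import product


def is_sc_robust_x86(thread_functions, copies=2):
    """Conservatively check SC-robustness of programs built from thread functions.

    thread_functions maps a name to a list of operations, each one of
    ("St", loc), ("Ld", loc), or ("MFENCE",).
    """
    # Step 1: one node per memory access, per instantiated copy of each function.
    nodes = [(name, copy, pos)
             for name, body in thread_functions.items()
             for copy in range(copies)
             for pos, op in enumerate(body)
             if op[0] in ("St", "Ld")]

    def op_of(node):
        return thread_functions[node[0]][node[2]]

    def same_thread(m, n):
        return m[:2] == n[:2]

    # Step 2: program-order edges inside a thread, conflict edges between threads.
    succ = {n: set() for n in nodes}
    for m, n in product(nodes, repeat=2):
        if same_thread(m, n):
            if m[2] < n[2]:
                succ[m].add(n)
        else:
            (kind_m, loc_m), (kind_n, loc_n) = op_of(m), op_of(n)
            if loc_m == loc_n and "St" in (kind_m, kind_n):
                succ[m].add(n)

    def reachable(src, dst):
        seen, stack = {src}, [src]
        while stack:
            cur = stack.pop()
            if cur == dst:
                return True
            for nxt in succ[cur] - seen:
                seen.add(nxt)
                stack.append(nxt)
        return False

    # Step 3: an unfenced store-load pair on a cycle is a robustness violation.
    for s, l in product(nodes, repeat=2):
        if not (same_thread(s, l) and s[2] < l[2]):
            continue
        (kind_s, loc_s), (kind_l, loc_l) = op_of(s), op_of(l)
        if (kind_s, kind_l) != ("St", "Ld") or loc_s == loc_l:
            continue
        between = thread_functions[s[0]][s[2] + 1:l[2]]
        if any(op[0] == "MFENCE" for op in between):
            continue
        if reachable(l, s):
            return False
    return True


FIG7 = {
    "SB": [("St", "X"), ("MFENCE",), ("Ld", "Y")],
    "SB'": [("St", "Y"), ("MFENCE",), ("Ld", "X")],
    "SB''": [("St", "Y"), ("Ld", "Z")],
}
print(is_sc_robust_x86(FIG7))
```

For the program of Fig. 7 the sketch reports robustness: the only unfenced store-load pair is in SB′′, and its load of Z conflicts with no store, so the pair lies on no cycle. Dropping the MFENCE from SB puts an unfenced pair on the cycle through SB′ and the check fails.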
Enforcing robustness. If we identify robustness violation for a program then we identify memory access pairs which
may violate a robustness condition. For these access pairs we introduce intermediate fences to enforce robustness
against a stronger model.
ppo does not suffice to enforce robustness in ARMv7. In addition to fences, ppo relations also order a pair of
accesses on different locations. However, we observe that ppo relations are not sufficient to ensure robustness for
the ARMv7 model.
Consider the execution in Fig. 8. The execution allows the cycle and violates SC robustness. Therefore ppo cannot be
used to order epo relations to preserve robustness.
[Execution graph: events a:Ld(A, 1), b:St(X, 2), c:St(X, 1), d:Ld(X, 1), e:St(Y, 1), f:Ld(Y, 1), g:St(Z, 1), h:St(Z, 2), i:Ld(Z, 2), j:St(A, 1); intra-thread edges labeled ppo, ppo, fence, ppo; inter-thread edges labeled coe (two) and rfe (four)]
Figure 8: Execution prop(b, g) ∧ coe(g, h) ∧ ahb(h, b) cycle is allowed.
3 Formal Models
Syntax Instead of delving into the syntactic notations in each instruction set, we use common expressions and
commands which can be extended in each architecture.
E ::=r | v | X | E + E | E ∗ E | E ≤ E | · · · (Expr)
C ::=skip | C;C | r = E | r = X | X = E | r = RMW(X,E,E) | r = RMW(X,E) | · · ·
| br label | br label label (Cmd)
P ::=X = v; · · ·X = v; {C | · · · | C} (Program)
In this notation we use X ∈ Locs, r ∈ Reg, and v ∈ val where Locs, Reg, val denote finite sets of memory loca-
tions, registers, and values respectively. A program P consists of a set of initialization writes followed by a parallel
composition of thread commands.
Semantics We follow the per-execution based axiomatic models for these architectures. In these models a program’s
semantics is defined by a set of consistent executions. An execution consists of a set of events and relations among the
events.
Given a binary relation R on events, R−1, R?, R+, and R∗ represent inverse, reflexive, transitive, and reflexive-
transitive closures of R respectively. dom(R) and codom(R) denote its domain and its range respectively. Relation
R is total on set S when total(S, R) ≜ ∀a, b ∈ S. a = b ∨ R(a, b) ∨ R(b, a). We compose binary relations R, S ⊆ E × E
relationally by R ; S. [A] denotes an identity relation on a set A. We write R|loc to denote R related event pairs on
same locations, that is, R|loc ≜ {(e, e′) ∈ R | e.loc = e′.loc}. Similarly, R|≠loc ≜ R \ R|loc is the R related event
pairs on different locations.
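For readers who prefer executable notation, these operators can be sketched in Python with relations as sets of pairs. The function names and the `loc` dictionary (event to location) are choices made for this illustration only.

```python
def compose(r, s):
    """Relational composition R ; S."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def inverse(r):
    """Inverse relation of R."""
    return {(b, a) for (a, b) in r}


def identity(a_set):
    """Identity relation [A] on a set A."""
    return {(a, a) for a in a_set}


def transitive_closure(r):
    """Transitive closure R+, computed as a fixpoint."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def reflexive_transitive_closure(r, universe):
    """Reflexive-transitive closure R* over the given set of events."""
    return transitive_closure(r) | identity(universe)


def is_total(s, r):
    """total(S, R): any two distinct elements of S are related one way or the other."""
    return all(a == b or (a, b) in r or (b, a) in r for a in s for b in s)


def same_loc(r, loc):
    """R|loc: pairs of R whose events access the same location."""
    return {(a, b) for (a, b) in r if loc[a] == loc[b]}


def diff_loc(r, loc):
    """R|≠loc: pairs of R whose events access different locations."""
    return r - same_loc(r, loc)


# Three events of one thread: a store to X, a load of X, and a store to Y.
loc = {"w1": "X", "r1": "X", "w2": "Y"}
po = transitive_closure({("w1", "r1"), ("r1", "w2")})
print(sorted(same_loc(po, loc)), sorted(diff_loc(po, loc)))
```

In the example, the transitive closure adds the pair (w1, w2), and only (w1, r1) relates two accesses to the same location.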
Definition 1. An event is of the form 〈id, tid, lab〉, where id, tid ∈ N, and lab are the unique identifier, thread id, and
the label of the event based on the respective executed memory access or fence instruction. A label is of the form
〈op, loc, rval, wval〉.
For an event e, whenever applicable, e.lab, e.op, e.loc, e.rval, and e.wval return the label, operation type, location,
read value, and written value respectively. We write Ld, St, U, and F to represent the sets of load, store, update, and
fence events. Moreover, load or update events represent read events (R) and store or update events are write events
(W), that is, R = Ld ∪ U and W = St ∪ U. We write [[i]] to represent the generated event in the respective model from
an instruction i. For example, in x86 [[i]] ∈ St holds when i is a WMOV instruction. We also overload the notation as
[[P]]M to denote the set of executions of program P in model M.
In an execution events are related by various types of relations. Relation program-order(po) captures the syntactic order
among the events. We write a.b to denote that b is immediate po-successor of event a. Reads-from (rf) associates a
write event to a read event that justifies its read value. Relation coherence-order(co) is a total-order on same-location
writes (stores or updates). The from-read (fr) relation relates a pair of same-location read and write events. We
also categorize the relations as external and internal relations and define extended-coherence-order (eco). Relation
modification order (mo) is a total-order on writes, updates, and fences such that mo ⊆ O×O where O = St ∪U ∪ F.
Note that the co relation is included in the mo relation. The mo relation is used in x86 model only; the ARM models
do not use mo in their definitions.
Definition 2. An execution is of the form X = 〈E, po, rf, co,mo〉 where X.E denotes the set of memory access or fence
events and X.po, X.rf, X.co, and X.mo denote the set of program-order, reads-from, coherence order, and modification
order relations between the events in X.E.
3.1 Concurrency models of x86, ARMv7, ARMv7-mca, and ARMv8
We now discuss the architectures and follow the axiomatic models of x86 and ARMv7 from Lahav et al. [2017], and
ARMv8 axiomatic model from Pulte et al. [2018]. We also present ARMv7-mca; a strengthened ARMv7 model with
multicopy atomicity (MCA).
x86. In x86, the MOV instruction is used both for loading a value from memory and for storing a value to memory. To
differentiate these two accesses we categorize them as WMOV and RMOV operations. In addition, there are atomic update
operations which we denote by RMW. x86 also provides MFENCE, which flushes buffers and caches and ensures ordering
between the preceding and following memory accesses.
In x86 concurrency WMOV, RMOV, and MFENCE generate St, Ld, and F events respectively. A successful RMW generates a U
event and otherwise an Ld event. We derive the x86-happens-before (xhb) relation from program-order and reads-from relations:
xhb ≜ (po ∪ rf)+. An x86 execution X is consistent when:
• X.xhb is irreflexive. (irrHB)
• X.mo; X.xhb is irreflexive. (irrMOHB)
• X.fr; X.xhb is irreflexive. (irrFRHB)
• X.fr; X.mo is irreflexive. (irrFRMO)
• X.fr; X.mo; X.rfe; X.po is irreflexive. (irrFMRP)
• X.fr; X.mo; [X.U ∪ X.F]; X.po is irreflexive. (irrUF)
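The six conditions can be checked mechanically on small executions. The Python sketch below is an illustration under stated assumptions: mo is not supplied but searched for among all total orders on stores, updates, and fences that include co; fr is taken as the usual rf−1; co; rfe is the restriction of rf to pairs in different threads; and initialization writes are modeled as stores of a separate thread 0. The encoding of events and the names `x86_consistent` and `store_buffering` are hypothetical.

```python
from functools import reduce
from itertools import permutations


def seq(*rels):
    """Relational composition r1 ; r2 ; ... of sets of pairs."""
    def compose2(r, s):
        return {(a, d) for (a, b) in r for (c, d) in s if b == c}
    return reduce(compose2, rels)


def plus(r):
    """Transitive closure."""
    closure = set(r)
    while True:
        extra = seq(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def irreflexive(r):
    return all(a != b for (a, b) in r)


def x86_consistent(events, po, rf, co):
    """events maps an id to (tid, kind) with kind in {"St", "Ld", "U", "F"}."""
    ordered = [e for e, (_, kind) in events.items() if kind in ("St", "U", "F")]
    uf = {(e, e) for e, (_, kind) in events.items() if kind in ("U", "F")}
    rfe = {(w, r) for (w, r) in rf if events[w][0] != events[r][0]}
    fr = {(r, w2) for (w1, r) in rf for (w, w2) in co if w == w1 and r != w2}
    xhb = plus(po | rf)
    if not (irreflexive(xhb) and irreflexive(seq(fr, xhb))):  # irrHB, irrFRHB
        return False
    for perm in permutations(ordered):  # candidate modification orders
        mo = {(perm[i], perm[j])
              for i in range(len(perm)) for j in range(i + 1, len(perm))}
        if (co <= mo
                and irreflexive(seq(mo, xhb))           # irrMOHB
                and irreflexive(seq(fr, mo))            # irrFRMO
                and irreflexive(seq(fr, mo, rfe, po))   # irrFMRP
                and irreflexive(seq(fr, mo, uf, po))):  # irrUF
            return True
    return False


def store_buffering(with_fences):
    """X = 1; [MFENCE;] t = Y  ||  Y = 1; [MFENCE;] t = X, both loads read 0."""
    events = {"ix": (0, "St"), "iy": (0, "St"),
              "a": (1, "St"), "b": (1, "Ld"),
              "c": (2, "St"), "d": (2, "Ld")}
    po = {("a", "b"), ("c", "d")}
    if with_fences:
        events.update({"f1": (1, "F"), "f2": (2, "F")})
        po |= {("a", "f1"), ("f1", "b"), ("c", "f2"), ("f2", "d")}
    rf = {("iy", "b"), ("ix", "d")}
    co = {("ix", "a"), ("iy", "c")}
    return x86_consistent(events, po, rf, co)


print(store_buffering(with_fences=False), store_buffering(with_fences=True))
```

Under these assumptions the store-buffering execution in which both loads read the initial value is consistent without fences and becomes inconsistent once an MFENCE separates each store from the following load, since irrMOHB and irrUF together force a cycle in mo.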
ARMv7. It provides LDR and STR instructions for load and store operations, and load-exclusive
(LDREX) and store-exclusive (STREX) instructions to perform the atomic update operation RMW, where RMW ≜
L : LDREX; mov; teq L′; STREX; teq L; L′ :. ARMv7 provides a full fence DMB which orders preceding and following
instructions. There is also a lightweight control fence ISB, which is used to construct CBISB ≜ cmp; bc; ISB to order
load operations.
In this model load (Ld), store (St), and F events are generated from the execution of LDR and LDREX, STR and STREX, and
DMB instructions respectively. Fence ISB is captured in ctrlISB (similar to ctrlisync in Lahav et al. [2017]) and in turn in the
ppo relation, but does not create any event in an execution.
ARMv7 defines the preserved-program-order (ppo) relation, which is a subset of the program-order relation.
We first discuss the primitives of ppo following §F.1 in Lahav et al. [2017]: ppo is based on data (⊆ Ld × St), control (⊆ Ld × E), and address (⊆ Ld × (Ld ∪ St)) dependencies. Moreover, ISB fences along with conditionals introduce the ctrlISB ⊆ ctrl preserved program order. Finally, ctrl; po ⊆ ctrl and ctrlISB; po ⊆ ctrlISB hold by definition.
Based on these primitives ARMv7 defines the read-different-writes (rdw) and detour (detour) relations as follows.
rdw ≜ (fre; rfe) ∩ po        detour ≜ (coe; rfe) ∩ po
The rdw relation relates two reads on the same location in a thread which read from different writes. The detour relation captures the scenario where an external write takes place between a same-location write and read in the same thread, and the read reads from that external write.
Based on these primitives ARMv7 defines the ii0, ci0, ic0, and cc0 components as follows.
ii0 ≜ addr ∪ data ∪ rdw ∪ rfi
ic0 ≜ ∅
ci0 ≜ ctrlISB ∪ detour
cc0 ≜ data ∪ ctrl ∪ addr; po?
Using these components ARMv7 defines ii, ic, ci, cc relations where each of these relations can be derived from the
following sequential compositions and the constraints.
xy ≜ ⋃_{n≥1} x^1y^1_0 ; x^2y^2_0 ; ··· ; x^ny^n_0
where
• x, y, x^1, ..., x^n, y^1, ..., y^n ∈ {i, c}.
• If x = c then x^1 = c.
• For every 1 ≤ k ≤ n − 1, if y^k = c then x^(k+1) = c.
• If y = i then y^n = i.

x86        ARMv8
RMOV       LDR; DMBLD
WMOV       DMBST; STR
RMW        DMBFULL; RMW; DMBFULL
MFENCE     DMBFULL
(a) x86 to ARMv8

C11 to x86   ARMv8
RMOV_NA      LDR
WMOV_NA      STR
RMOV_A       LDR; DMBLD
WMOV_A       DMBFULL; STR
RMW          DMBFULL; RMW; DMBFULL
MFENCE       DMBFULL
(b) C11 to x86 to ARMv8
Figure 9: Mapping schemes from x86 to ARMv8.
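The derivation of ii, ic, ci, and cc from their base components can be sketched as a small fixed-point computation. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function name derive and the event names are hypothetical, and the base relations ii0, ic0, ci0, cc0 are passed under the keys "ii", "ic", "ci", "cc". Each chain of base edges is tracked together with its first letter x^1 and last letter y^n, and a chain is extended only when the adjacency constraint (if y^k = c then x^(k+1) = c) holds.

```python
def derive(base):
    """base maps "ii", "ic", "ci", "cc" to the corresponding *0 relations
    (sets of event pairs). Returns the derived ii, ic, ci, cc relations."""
    # A chain is (start, end, first letter x^1, last letter y^n).
    chains = {(a, b, name[0], name[1])
              for name, rel in base.items() for (a, b) in rel}
    while True:
        grown = set(chains)
        for (a, b, first_x, last_y) in chains:
            for name, rel in base.items():
                # Adjacency constraint: if y^k = c then x^(k+1) = c.
                if last_y == "i" or name[0] == "c":
                    grown |= {(a, d, first_x, name[1])
                              for (c, d) in rel if c == b}
        if grown == chains:
            break
        chains = grown

    def pick(x, y):
        # If x = c then x^1 = c; if y = i then y^n = i.
        return {(a, b) for (a, b, first_x, last_y) in chains
                if (x != "c" or first_x == "c") and (y != "i" or last_y == "i")}

    return {x + y: pick(x, y) for x in "ic" for y in "ic"}

# Hypothetical events: e1 -addr-> e2 (in ii0) and e2 -ctrl-> e3 (in cc0).
result = derive({"ii": {("e1", "e2")}, "ic": set(),
                 "ci": set(), "cc": {("e2", "e3")}})
```

In this example the composed pair (e1, e3) ends up only in ic, since its chain starts with an i component and ends with a c component.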
Finally ARMv7 defines ppo as follows: ppo ≜ [Ld]; ii; [Ld] ∪ [Ld]; ic; [St]. ARMv7 also defines the fence, ARM-happens-before (ahb), and propagation (prop) relations as follows.
fence ≜ [Ld ∪ St]; po; [F]; po; [Ld ∪ St]
ahb ≜ ppo ∪ fence ∪ rfe
prop ≜ prop1 ∪ prop2 where
prop1 ≜ [St]; rfe?; fence; ahb∗; [St] and
prop2 ≜ (coe ∪ fre)?; rfe?; (fence; ahb∗)?; fence; ahb∗
These relations are used to define the consistency constraints of an ARMv7 execution X as follows:
• X.co is total (total-co)
• (X.po|loc ∪ X.rf ∪ X.fr ∪ X.co) is acyclic (sc-per-loc)
• X.fre;X.prop;X.ahb∗ is irreflexive. (observation)
• (X.co ∪ X.prop) is acyclic. (propagation)
• [X.rmw];X.fre;X.coe is irreflexive (atomicity)
• X.ahb is acyclic (no-thin-air)
ARMv7-mca. We strengthen the ARMv7 model and define the ARMv7-mca model to support multicopy atomicity. To do so, following Wickerson et al. [2017], we define write-order (wo) and impose the following additional constraint on ARMv7.
• X.wo+ is acyclic where wo ≜ (rfe; ppo; fre) (mca)
ARMv8. It provides load (LDR) and store (STR) instructions for load and store operations, and load-exclusive (LDXR) and store-exclusive (STXR) instructions to construct RMW similar to that of ARMv7. In addition, ARMv8 provides load-acquire (LDAR), store-release (STLR), load-acquire exclusive (LDAXR), and store-release exclusive (STLXR) instructions which operate as half fences. In addition to DMBFULL and ISB, ARMv8 provides load (DMBLD) and store (DMBST) fences. A DMBLD fence orders a load with other accesses and a DMBST orders a pair of store accesses.
Based on these primitives ARMv8 defines the coherence-after (ca), observed-by (obs), and atomic-ordered-by (aob) relations on same-location events. ARMv8 also defines the dependency-ordered-before (dob) and barrier-ordered-by (bob) relations to order a pair of intra-thread events. Finally, ordered-before (ob) is the transitive closure of the obs, aob, dob, and bob relations.
ca ≜ fr ∪ co
obs ≜ rfe ∪ fre ∪ coe
aob ≜ rmw ∪ [range(rmw)]; rfi; [A]
dob ≜ addr ∪ data ∪ ctrl; [St] ∪ (ctrl ∪ (addr; po)); [ISB]; po; [Ld] ∪ addr; po; [St] ∪ (ctrl ∪ data); coi ∪ (addr ∪ data); rfi
bob ≜ po; [F]; po ∪ [L]; po; [A] ∪ [Ld]; po; [FLD]; po ∪ [A]; po ∪ [St]; po; [FST]; po; [St] ∪ po; [L] ∪ po; [L]; coi
ob ≜ (obs ∪ dob ∪ aob ∪ bob)+
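Thread 1: X = 1; a = RMW(Y, 0, 1);        Thread 2: Y = 1; b = RMW(X, 0, 1);
[Execution graphs: the x86 execution with events St(X, 1); U(Y, 0, 1) ∥ St(Y, 1); U(X, 0, 1), and the corresponding ARMv8 execution in which each update becomes a load and a store related by rmw, with fre edges between the threads.]
Figure 10: In x86 to ARMv8 mapping RMW requires a leading F fence.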
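Thread 1: RMW(X, 0, 1); a = Y;        Thread 2: RMW(Y, 0, 1); b = X;
[Execution graphs: the x86 execution with events U(X, 0, 1); Ld(Y, 0) ∥ U(Y, 0, 1); Ld(X, 0), and the corresponding ARMv8 execution in which each update becomes a load and a store related by rmw, with fre edges between the threads.]
Figure 11: In x86 to ARMv8 mapping RMW requires a trailing F fence.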
Finally an ARMv8 execution X is consistent when:
• X.po|loc ∪ X.ca ∪ X.rf is irreflexive. (internal)
• X.ob is irreflexive (external)
• X.rmw ∩ (X.fre;X.coe) = ∅ (atomic)
4 Architecture to Architecture Mappings
We propose correct and efficient mapping schemes between x86 and ARM models. These schemes may introduce
leading and/or trailing fences while mapping memory accesses from one architecture to another. We show that the
fences are necessary by examples and prove that the fences are sufficient for correctness. To prove correctness we
show that for each consistent execution of the target program after mapping there exists a corresponding consistent execution of the source program before mapping with the same behavior.
4.1 x86 to ARMv8 mapping
The mapping scheme from x86 to ARMv8 is in Fig. 9a. The scheme generates a DMBFULL for an MFENCE. While mapping x86 memory accesses to those of ARMv8, the scheme introduces a leading DMBST fence with a store, a trailing DMBLD fence with a load, and leading as well as trailing DMBFULL fences with an update. We now discuss why these fences are required.
Leading store fence In an x86 execution a pair of stores is ordered, unlike in an ARMv8 execution. A pair of store events (St) in an ARMv8 execution is bob ordered when there is an intermediate FST or F event, that is, [St]; po; [FST ∪ F]; po; [St] ⊆ bob. To introduce such a bob order we require at least an intermediate FST fence event. Therefore the scheme generates a leading DMBST fence with a store, which ensures store-store order with preceding stores in ARMv8.
Trailing load fence A load-store or load-load access pair is ordered in x86. To preserve the same access ordering we require an FLD fence between a load-load or load-store access pair. Therefore the scheme generates a trailing DMBLD fence with a load, which ensures such order.
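The direct mapping of Fig. 9a amounts to a table lookup per instruction. The sketch below is illustrative only: instructions are represented by their mnemonic strings without operands, and the names X86_TO_ARMV8 and map_x86_to_armv8 are hypothetical.

```python
# Fig. 9a as a lookup table (operands omitted; mnemonics only).
X86_TO_ARMV8 = {
    "RMOV": ["LDR", "DMBLD"],               # trailing load fence
    "WMOV": ["DMBST", "STR"],               # leading store fence
    "RMW": ["DMBFULL", "RMW", "DMBFULL"],   # leading and trailing full fences
    "MFENCE": ["DMBFULL"],
}

def map_x86_to_armv8(ops):
    """Translate a straight-line x86 access sequence to ARMv8."""
    out = []
    for op in ops:
        out.extend(X86_TO_ARMV8[op])
    return out

load_store = map_x86_to_armv8(["RMOV", "WMOV"])
store_load_store = map_x86_to_armv8(["WMOV", "RMOV", "WMOV"])
```

For a load followed by a store this yields LDR; DMBLD; DMBST; STR, in which the adjacent DMBLD and DMBST fences are candidates for the fence elimination discussed in §6.2.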
ARMv7/ARMv7-mca   ARMv8
LDR               LDR
STR               STR
RMW               RMW
DMB               DMBFULL
ISB               ISB
(a) ARMv7 or ARMv7-mca to ARMv8

ARMv8             x86
LDR               RMOV
LDAR              RMOV
STR               WMOV
STLR              WMOV; MFENCE
RMW               RMW
DMBFULL           MFENCE
DMBLD/DMBST/ISB   skip
(b) ARMv8 to x86
Figure 12: Mapping schemes: ARMv8 to x86 and ARMv7/ARMv7-mca to ARMv8

Leading and trailing fence for atomic update Consider the x86 programs in Figs. 10 and 11 and the outcome a = b = 0. No x86 execution allows a = b = 0 in these two programs. However, if we translate these programs without intermediate DMBFULL fences between each pair of store and RMW accesses then a = b = 0 would be possible in these two programs in ARMv8 as shown in the corresponding executions. As a result, the translations from x86 to ARMv8 would be unsound. The leading and trailing DMBFULL fences with RMW accesses provide these intermediate fences in the respective programs to disallow a = b = 0 in both programs.
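The argument for the leading fence in Fig. 10 can be replayed mechanically. The following is a simplified sketch, not the full ARMv8 model: it considers only the (external) axiom, approximates ob by obs ∪ rmw ∪ po; [F]; po, and uses hypothetical event names in which fence events start with "F". Each RMW is modelled as a load and a store related by rmw, and both loads read 0.

```python
def acyclic(edges):
    """True if the directed graph given as a set of (src, dst) pairs has no cycle."""
    reach = set(edges)
    while True:
        bigger = reach | {(a, d) for (a, b) in reach
                          for (c, d) in reach if b == c}
        if bigger == reach:
            break
        reach = bigger
    return all(a != b for (a, b) in reach)

def externally_consistent(threads, obs, rmw):
    """threads: per-thread event lists in program order.
    Only the po; [F]; po component of bob is modelled here."""
    bob = set()
    for events in threads:
        for k, e in enumerate(events):
            if e.startswith("F"):
                bob |= {(a, b) for a in events[:k] for b in events[k + 1:]}
    return acyclic(obs | rmw | bob)

# Outcome a = b = 0: each RMW load is fr-before the other thread's store,
# and each RMW store is co-before the other thread's plain store.
obs = {("t1_ldY", "t2_stY"), ("t2_ldX", "t1_stX"),   # fre
       ("t1_stY", "t2_stY"), ("t2_stX", "t1_stX")}   # coe
rmw = {("t1_ldY", "t1_stY"), ("t2_ldX", "t2_stX")}

no_leading = [["t1_stX", "t1_ldY", "t1_stY"],
              ["t2_stY", "t2_ldX", "t2_stX"]]
with_leading = [["t1_stX", "F1", "t1_ldY", "t1_stY"],
                ["t2_stY", "F2", "t2_ldX", "t2_stX"]]

allowed_without_fence = externally_consistent(no_leading, obs, rmw)
allowed_with_fence = externally_consistent(with_leading, obs, rmw)
```

Without the leading fences the relation is acyclic, so this approximation admits a = b = 0; with them, the two bob edges close a cycle through the fre edges and the outcome is rejected.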
Mapping correctness These fences suffice to preserve mapping correctness as stated in Theorem 1 and proved in
Appendix A.1.
Theorem 1. The mappings in Fig. 9a are correct.
4.2 C11 to x86 to ARMv8 mapping
In this mapping from x86 to ARMv8 we exploit the C11 semantic rule that a data race results in undefined behavior. The mapping scheme is in Fig. 9b. In this scheme we categorize the x86 load and store accesses by whether they are generated from C11 non-atomic or atomic accesses. If we know that a load/store access is generated from a C11 non-atomic load/store then we do not introduce any trailing or leading fence. We prove the correctness of the scheme (Theorem 2) in Appendix A.2.
Theorem 2. The mapping scheme in Fig. 9b is correct.
In §2.4 we have already demonstrated the tradeoff between the x86 ↦ ARMv8 and C11 ↦ x86 ↦ ARMv8 mapping schemes.
4.3 ARMv8 to x86 mapping
The mapping scheme is in Fig. 12b. In this scheme an ARMv8 load or load-acquire is mapped to an x86 load, and a store is mapped to an x86 store operation. For an ARMv8 release-store the scheme generates a trailing MFENCE after the store in x86, because [L]; po; [A] ⊆ bob in ARMv8 whereas in x86 a store-load pair on different locations is unordered. Consider the example below.
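Thread 1     Thread 2
L(X, 1)      L(Y, 1)
A(Y, 0)      A(X, 0)
fre: A(Y, 0) → L(Y, 1) and A(X, 0) → L(X, 1)
(a) Disallowed in ARMv8

Thread 1     Thread 2
St(X, 1)     St(Y, 1)
F            F
Ld(Y, 0)     Ld(X, 0)
fre: Ld(Y, 0) → St(Y, 1) and Ld(X, 0) → St(X, 1)
(b) Fences disallow the execution in x86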
The scheme also maps an atomic access pair to an atomic update in x86. The DMBLD, DMBST, and ISB fences are not
mapped to any access.
Theorem 3. The mapping scheme in Fig. 12b is correct.
Proof Strategy To prove Theorem 3 we first define corresponding ARMv8 execution Xs for a given x86 consistent
execution Xt. Next we show that Xs is ARMv8 consistent. To do so, we establish Lemma 1 and then use the same to
establish Lemma 2 on x86 consistent execution. Next, we define x86-preserved-program-order (xppo) and then based
on xppo we define x86-observation (obx) on an x86 execution and establish Lemma 3. Finally we prove Theorem 3
using Lemma 2 and Lemma 3. The detailed proofs of Lemmas 1 to 3 and Theorem 3 are discussed in Appendix A.3.
obx ≜ rfe ∪ coe ∪ fre ∪ [U] ∪ xppo where xppo ≜ s1 ∪ s2 ∪ s3 ∪ s4 ∪ s5 ∪ s6 ∪ s7 ∪ s8 and
s1 ≜ [Ld]; po; [Ld ∪ St]
s2 ≜ po; [F]; po
s3 ≜ [St]; [F]; [Ld]
s4 ≜ [Ld]; po
s5 ≜ [St]; po; [St]
s6 ≜ po; [St]
s7 ≜ po; [St]; po|loc; [St]
s8 ≜ [U]; rfi; [Ld]

ARMv8              ARMv7/ARMv7-mca
LDR                LDR; DMB
STR                STR
LDAR               LDR; DMB
STLR               DMB; STR; DMB
RMW                RMW; DMB
RMW_A              RMW; DMB
RMW_⊒L             DMB; RMW; DMB
DMB(FULL/LD/ST)    DMB
ISB                ISB
(a) ARMv8 to ARMv7

C11 to ARMv8       ARMv7/ARMv7-mca
LDR_NA             LDR
LDR_A              LDR; DMB
STR                STR
LDAR               LDR; DMB
STLR               DMB; STR; DMB
RMW                RMW; DMB
RMW_A              RMW; DMB
RMW_⊒L             DMB; RMW; DMB
DMB(FULL/LD/ST)    DMB
ISB                ISB
(b) C11 to ARMv8 to ARMv7
Figure 13: Mapping schemes: ARMv8 ↦ ARMv7/ARMv7-mca and C11 ↦ ARMv8 ↦ ARMv7/ARMv7-mca.
Lemma 1. Suppose X is an x86 consistent execution. In that case X.po|loc; X.fr ⟹ X.fr ∪ X.co.
Lemma 2. Suppose X = 〈E, po, rf,mo〉 is an x86 consistent execution. For each (X.po|loc ∪ X.fr ∪ X.co ∪ X.rf)+
path between two events there exists an alternative (X.xhb ∪ X.fr ∪ X.co)+ path between these two events which has
no intermediate load event.
Lemma 3. Suppose X = 〈E, po, rf,mo〉 is an x86 consistent execution. For each obx path between two events there
exists an alternative obx path which has no intermediate load event.
4.4 ARMv7 to ARMv8 mappings
The mapping scheme in Fig. 12a from ARMv7 to ARMv8 is straightforward as no fence is introduced along with any
memory access.
Theorem 4. The mappings in Fig. 12a are correct.
To prove Theorem 4 we relate the preserved-program-order (ppo) relation in ARMv7 to the ordered-before (ob) relation in ARMv8. In ARMv7 ppo relates intra-thread events, and in ARMv8 dob, bob, and aob relate intra-thread event pairs. Note that the ARMv8 dob, bob, and aob relations together are not enough to capture the ARMv7 ppo relation as the detour component of ppo involves inter-thread relations. However, the ARMv7 detour relation implies the obs relation in ARMv8 and therefore we can relate the ppo and ob relations. Considering these aspects we state the following lemma.
Lemma 4. Suppose Xs is an ARMv7 consistent execution and Xt is the corresponding ARMv8 execution. In that case Xs.ppo ⟹ Xt.ob.
Based on Lemma 4 along with other helper lemmas we prove the mapping soundness Theorem 4. The detailed proofs
of Lemma 4, helper lemmas, and Theorem 4 are in Appendix A.5.
4.5 ARMv8 to ARMv7 mapping
The mapping scheme is in Fig. 13a. Now we show that the fences along with memory accesses are necessary to preserve mapping soundness. In §2.2 we have already shown that LDR ↦ LDR; CBISB is unsound and therefore LDR ↦ LDR; DMB is necessary for correctness. Similarly, LDAR ↦ LDR; CBISB is unsound and LDAR ↦ LDR; DMB is necessary for the same reasons.
Leading and trailing fences for release-store mapping Recall that po; [L] ⊆ bob in ARMv8. Consider the following example, where the bob relation in the first thread along with the other relations disallows the behavior in ARMv8.
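Thread 1     Thread 2
St(X, 1)     L(Y, 2)
L(Y, 1)      A(X, 0)
Edges: bob within each thread, moe from L(Y, 1) to L(Y, 2), fre from A(X, 0) to St(X, 1)
(a) Disallowed in ARMv8

Thread 1     Thread 2
St(X, 1)     St(Y, 2)
F            F
St(Y, 1)     Ld(X, 0)
⋮            ⋮
(b) Fences disallow the execution in ARMv7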
Without such an intermediate fence in the first thread the ARMv7 execution would be allowed, which in turn introduces a new outcome in the ARMv7 program, and as a result the mapping would be incorrect. Therefore the STLR mapping requires a leading fence to preserve mapping soundness. The STLR mapping also requires a trailing fence, as shown by an example similar to that of §4.3. With this mapping a CBISB is not required anymore as every load generates a trailing DMB fence.
In addition to RMW, ARMv8 provides acquire and release-or-stronger RMW accesses RMW_A and RMW_⊒L respectively. Before mapping from ARMv8 we perform the transformations RMW_A ⇝ RMW; DMBLD and RMW_⊒L ⇝ DMBFULL; RMW; DMBFULL. The trailing DMBLD provides the same ordering as an acquire-exclusive load with following accesses. In case of RMW_⊒L, we introduce leading and trailing DMBFULL fences similar to that of the STLR access.
For DMBFULL, DMBLD, and DMBST fences in ARMv8 the mapping scheme generates DMB fences so that the bob orders in
ARMv8 executions are preserved in corresponding ARMv7 executions. Now we prove the correctness of the mapping
as stated in Theorem 5.
Theorem 5. The mappings in Fig. 13a are correct.
To prove Theorem 5, we relate ARMv8 and ARMv7 consistent executions in Lemma 5 and Lemma 6 as intermediate
steps. Lemma 5, Lemma 6, and Theorem 5 are proved in Appendix A.6.
Lemma 5. Suppose Xt is an ARMv7 consistent execution and Xs is an ARMv8 execution following the mappings in Fig. 13a. In this case Xs.ob ⟹ (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw ∪ Xt.fence)+.
Lemma 6. Suppose Xt is an ARMv7 consistent execution and Xs is an ARMv8 execution following the mappings in Fig. 13a. In this case either Xs.ob ⟹ ((Xt.E × Xt.E)|loc \ [E]) or Xs.ob ⟹ (Xt.co; Xt.prop ∪ Xt.prop)+.
4.6 C11 to ARMv8 to ARMv7 mapping
Similar to C11 ↦ x86 ↦ ARMv8, we propose the C11 to ARMv8 to ARMv7 mapping scheme in Fig. 13b. The proof is discussed in detail in Appendix A.7. In §2.4 we have already shown that this mapping scheme is more efficient than the ARMv8 to ARMv7 mapping.
Theorem 6. The mapping scheme in Fig. 13b is correct.
4.7 ARMv7-mca to ARMv8 mapping
The mapping scheme for ARMv7-mca to ARMv8 is same as the ARMv7 to ARMv8 mapping scheme as shown
in Fig. 12a. To prove the mapping soundness we relate an ARMv7 consistent execution to corresponding ARMv8
execution as follows.
Lemma 7. Suppose Xt is an ARMv8 consistent execution and Xs is the corresponding ARMv7 consistent execution. In that case [Xs.Ld]; Xs.ppo; [Xs.Ld]; Xs.po|loc; [Xs.St] ⟹ [Xt.Ld]; Xt.ob; [Xt.St].
Using Lemma 7 we establish the acyclicity of write-order in the ARMv7-mca source execution.
Lemma 8. Suppose Xt is a target ARMv8 consistent execution and Xs is the corresponding ARMv7 consistent execution. In this case Xs.wo+ is acyclic.
The detailed proofs of Lemma 7 and Lemma 8 are discussed in Appendix A.8. The mapping correctness theorem below directly follows from Lemma 8.
Theorem 7. The mappings in Fig. 12a are correct for ARMv7-mca.
4.8 ARMv8 to ARMv7-mca and C11 to ARMv8 to ARMv7-mca mappings
The mapping schemes, ARMv8 to ARMv7-mca and C11 to ARMv8 to ARMv7-mca, are shown in Fig. 13. The
soundness proofs are same as Theorems 5 and 6 respectively. We have already discussed in §2.3 why mapping of a
load access requires a trailing DMB fence to preserve correctness.
↓ a \ b →   St   Ld   L    A    F    FLD  FST
St          ✗    ✓    ✗    ✓    ✗    ✓    ✗
Ld          ✗    ✓    ✗    ✓    ✗    ✗    ✓
L           ✗    ✓    ✗    ✗    ✗    ✓    ✗
A           ✗    ✗    ✗    ✗    ✓    ✓    ✓
F           ✗    ✗    ✓    ✗    =    ✓    ✓
FLD         ✗    ✗    ✓    ✗    ✓    =    ✓
FST         ✗    ✓    ✓    ✓    ✓    ✓    =

✓ Ld(X, v′) · Ld(X, v) ⇝ Ld(X, v′)   (RAR)
✓ A(X, v′) · Ld(X, v) ⇝ A(X, v′)     (RAA)
✓ A(X, v′) · A(X, v) ⇝ A(X, v′)      (AAA)
(a) LDR and LDAR eliminations.

✓ Ld(X, v) ⇝ A(X, v)   (R-A)
✓ St(X, v) ⇝ L(X, v)   (W-L)
✓ FLD/FST ⇝ F          (F)
(b) Access strengthening.
Figure 14: Reordering, elimination, and strengthening transformations in ARMv8.
5 Common Compiler Optimizations in ARMv8
In this section we study the correctness of independent access reordering, redundant access elimination, and access
strengthening in ARMv8 model. We prove the correctness of the safe transformations in Appendix B.
Reorderings. We show the safe (✓) and unsafe (✗) reordering transformations of the form a · b ⇝ b · a in Fig. 14 where a and b represent independent and adjacent shared memory accesses on different locations. We prove the correctness of the safe reorderings in Appendix B.1.
In Fig. 5 we have already shown that we cannot move a store before any load or store in ARMv8. The same reasoning extends to release-store and acquire-load. It is not safe to move a store before any fence as it may violate a dob relation. Similarly, a load cannot be moved before an acquire load, DMBLD, or DMBFULL operation as it may remove a bob relation. However, reordering with a DMBST is safe as the ordering between them does not affect any component of the ob relation. A release-store may safely reorder with a preceding fence as it does not eliminate any bob relation. Similarly, moving a load, store, or DMBST after an acquire-read is allowed as it does not eliminate any existing bob relation. We may safely reorder an acquire-read with DMBFULL as it does not affect the bob relations among the memory accesses. A DMBLD between a load and a load or store creates a bob relation. Hence moving a load after a DMBLD may eliminate a bob relation and is therefore disallowed.
Finally, reordering fences is safe as it preserves the bob relations between memory accesses.
Redundant access elimination In §2.5 we have shown that overwritten-write and read-after-write transformations are unsound. However, a read-after-read elimination is safe in ARMv8 as listed in Fig. 14. We prove the correctness of the transformation in Appendix B.2.
Access strengthening Strengthening memory accesses and fences may introduce new ordering among events and therefore the strengthening transformations listed in Fig. 14 hold trivially.
6 Fence Optimizations
In this section we prove the correctness of various fence eliminations and then propose respective fence elimination
algorithms. More specifically, the proposed mapping schemes in §4 may introduce fences some of which are redundant
in certain scenarios and can safely be eliminated. To do so, we first check if a fence is non-eliminable. If not, we delete
the fence.
6.1 x86 fence elimination
In x86 only a store-load pair on different locations is unordered. Therefore if a fence appears between such a pair then it is not safe to eliminate the fence. Otherwise we may eliminate the fence.
Theorem 8. An MFENCE in an x86 program thread is non-eliminable if it is the only fence on a program path from a
store to a load in the same thread which access different locations.
An MFENCE elimination is safe when it is not non-eliminable.
We prove the theorem in Appendix C.1. This fence elimination condition is particularly useful after ARMv8 to x86 mapping following the scheme in Fig. 12b as it introduces certain redundant fences. For instance, the ARMv8 to x86 mapping STLR; STR ↦ WMOV; MFENCE; WMOV results in an intermediate MFENCE which is redundant and can be safely deleted as stores are ordered in x86.
6.2 ARMv8 fence elimination (after mapping)
We identify non-eliminable DMBFULL, DMBST, and DMBLD fences and then safely eliminate the rest of the fences. We prove the correctness of these fence eliminations in Appendix C.2.
For instance, considering the Fig. 9a mapping scheme, the DMBLD fence after the RMOV; WMOV ↦ LDR; DMBLD; DMBST; STR mapping suffices to order the load and store access pair and the DMBST is not required. However, we cannot immediately conclude that such a DMBST fence is entirely redundant if we consider a mapping WMOV; RMOV; WMOV ↦ DMBST; STR; LDR; DMBLD; DMBST; STR where the second DMBST orders the two stores and is therefore non-eliminable.
Theorem 9. Suppose an ARMv8 program is generated by the x86 ↦ ARMv8 mapping (Fig. 9a). A DMBFULL in a thread of the program is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations.
A DMBFULL elimination is safe when it is not non-eliminable.
The trailing and leading fences in the x86 to ARMv8 mapping ensure that a DMBFULL fence can safely be eliminated following Theorem 9. Otherwise we cannot immediately eliminate a DMBFULL; rather, whenever appropriate, we may weaken such a DMBFULL fence by replacing it with a DMBST; DMBLD fence sequence when a DMBFULL fence is costlier than a pair of DMBST and DMBLD fences. We define safe fence weakening in Theorem 10 below and the detailed proof is in Appendix C.3.
Theorem 10. A DMBFULL in a program thread is non-eliminable if it is the only fence on a program path from a store to a load in the same thread which access different locations.
The weakening DMBFULL ⇝ DMBST; DMBLD is safe when the DMBFULL is not non-eliminable.
While fence weakening can be applied on any ARMv8 program, it is especially applicable after the ARMv7/ARMv7-mca to ARMv8 mapping. ARMv7 has only the DMB fence (except ISB) to order any pair of memory accesses and these DMB fences translate to DMBFULL fences in ARMv8. In many cases these DMBFULL fences can be weakened and then we can eliminate DMBLD and DMBST fences which are not non-eliminable.
Theorem 11. A DMBST in a program thread is non-eliminable if it is placed on a program path between a pair of
stores in the same thread which access different locations and there exists no other DMBFULL or DMBST fence on the
same path.
A DMBST elimination is safe when it is not non-eliminable.
Theorem 12. A DMBLD in a program thread is non-eliminable if it is placed on a program path from a load to a store
or load access in the same thread which access different locations and there exists no other DMBFULL or DMBLD fence
on the same path.
A DMBLD elimination is safe when it is not non-eliminable.
6.3 Fence Elimination in ARMv7
In ARMv7 we safely eliminate repeated DMB fences. ARMv7 DMB fence elimination is particularly useful after ARMv8 to ARMv7/ARMv7-mca mappings. For example, LDR; STLR ↦ LDR; DMB; DMB; STR; DMB generates repeated DMB fences and one of them can be safely eliminated.
Theorem 13. A DMB in a program thread is non-eliminable if it is the only fence on a program path between a pair of
memory accesses in the same thread.
A DMB elimination is safe when it is not non-eliminable.
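For straight-line code, the repeated-fence case above reduces to a peephole pass. The sketch below is a simplified illustration and not the control-flow-graph analysis of this section: it covers a subset of the Fig. 13a rows, represents instructions by mnemonic strings, and uses hypothetical names (ARMV8_TO_ARMV7, map_armv8_to_armv7, drop_repeated_dmb). It removes a DMB only when it is immediately preceded by another DMB, so no access pair loses its only intermediate fence.

```python
# A subset of Fig. 13a (operands omitted; RMW variants left out for brevity).
ARMV8_TO_ARMV7 = {
    "LDR": ["LDR", "DMB"],
    "STR": ["STR"],
    "LDAR": ["LDR", "DMB"],
    "STLR": ["DMB", "STR", "DMB"],
    "DMBFULL": ["DMB"],
    "DMBLD": ["DMB"],
    "DMBST": ["DMB"],
    "ISB": ["ISB"],
}

def map_armv8_to_armv7(ops):
    """Translate a straight-line ARMv8 access sequence to ARMv7."""
    out = []
    for op in ops:
        out.extend(ARMV8_TO_ARMV7[op])
    return out

def drop_repeated_dmb(ops):
    """Remove a DMB that directly follows another DMB (straight-line code only)."""
    out = []
    for op in ops:
        if op == "DMB" and out and out[-1] == "DMB":
            continue  # the preceding DMB already orders the same access pairs
        out.append(op)
    return out

mapped = map_armv8_to_armv7(["LDR", "STLR"])
optimized = drop_repeated_dmb(mapped)
```

On LDR; STLR the mapping produces LDR; DMB; DMB; STR; DMB and the pass keeps LDR; DMB; STR; DMB.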
We first check if a fence is non-eliminable based on the access pairs and fence locations on the program paths. We perform this analysis on the thread's control-flow graph G = 〈V, E〉 where G.V denotes the program statements including the accesses and G.E represents the set of edges between pairs of statements. Next, we delete a fence if it is not non-eliminable.
In Fig. 15 we define a number of conditions which we use in fence elimination. Condition Reach(G, i, j) holds if there is a path from instruction i to instruction j in G, and Path checks if there is any path from i to j through a fence f. mpairs(G, a, b) is the set of memory access pairs in G whose first access is of kind a and whose second access is of kind b. We compute mpairs(G, a, b)|≠loc, the set of memory access pairs on different locations, based on must-alias analysis. FDELETE deletes a set of fences. Procedure GETNFS updates the set of non-eliminable fences considering the positions of other fences between the access pairs. Given a fence f and an access pair (i, j), we check if there is a path from i to j through f without passing through the already identified non-eliminable fences B. If so, fence f is also non-eliminable.
Reach(G, i, j) ≜ (i, j) ∈ [G.V]; G.E+; [G.V]
Path(G, i, f, j) ≜ Reach(G, i, f) ∧ Reach(G, f, j)
ReachWO(G, i, j, F) ≜ Reach(〈G.V \ F, G.E \ B〉, i, j) where B = (G.V × F) ∪ (F × G.V)
NFS(G, i, f, j, F) ≜ Path(〈G.V \ F, G.E \ B〉, i, f, j) where B = (G.V × F) ∪ (F × G.V)
mpairs(G, a, b) ≜ {(i, j) | [[i]] ∈ a ∧ [[j]] ∈ b ∧ Reach(G, i, j)}
mpairs(G, a, b)|≠loc ≜ {(i, j) | (i, j) ∈ mpairs(G, a, b) ∧ ¬mustAlias(i, j)}
mpairs(G, a, b)|loc ≜ {(i, j) | (i, j) ∈ mpairs(G, a, b) ∧ mustAlias(i, j)}
FDelete(G, F) ≜ 〈G.V \ F, G.E \ ((G.V × F) ∪ (F × G.V))〉

procedure GETNFS(G, PR, F, B)
    for f ∈ F do
        for (i, j) ∈ PR do
            G′ ← FDelete(G, B)
            if Path(G′, i, f, j) then
                B ← B ∪ {f}
                break  // inner loop
    return B
end procedure

procedure FWEAKEN(G, F)
    for f ∈ F do
        V1 ← G.V ∪ {a, b | [[a]] ∈ FLD ∧ [[b]] ∈ FST}
        E1 ← G.E ∪ {(f, a), (a, b)}
        E2 ← E1 ∪ {(e, a) | G.E(e, f)}
        E3 ← E2 ∪ {(b, e) | G.E(f, e)}
        G′.V ← V1 \ {f}
        G′.E ← E3 \ ((G′.V × {f}) ∪ ({f} × G′.V))
    return G′
end procedure
Figure 15: Helper conditions and functions
procedure X86FELIM(G)
    F = {f | f ∈ G.V ∧ [[f]] ∈ F}
    U = {f | f ∈ G.V ∧ [[f]] ∈ U}
    SL ← mpairs(G, St, Ld)|≠loc
    nfs ← getNFS(G, SL, F, U)
    return FDelete(G, F \ nfs)
end procedure

procedure ARMV7FELIM(G)
    F = {f | f ∈ G.V ∧ [[f]] = F}
    M ← mpairs(G, E \ F, E \ F)
    nfs ← getNFS(G, M, F, ∅)
    return FDelete(G, F \ nfs)
end procedure

procedure ARMV8FELIM(G)
    F = {f | f ∈ G.V ∧ [[f]] = DMBFULL}
    SL ← mpairs(G, St, Ld)|≠loc
    nfs ← getNFS(G, SL, F, ∅)
    if x86 ↦ ARMv8 then
        G1 ← FDelete(G, F \ nfs)
    else
        G1 ← FWeaken(G, F \ nfs)
    FS = {f | f ∈ G1.V ∧ [[f]] = DMBST}
    SS ← mpairs(G1, St, St)|≠loc
    FF ← getNFS(G1, SS, FS, nfs)
    G2 ← FDelete(G1, FS \ FF)
    FL = {f | f ∈ G2.V ∧ [[f]] = DMBLD}
    LS ← mpairs(G2, Ld, St)|≠loc
    LL ← mpairs(G2, Ld, Ld)|≠loc
    FF′ ← getNFS(G2, LL ∪ LS, FL, nfs)
    return FDelete(G2, FL \ FF′)
end procedure
Figure 16: Fence elimination algorithms after mappings.
Fence elimination in x86, ARMv7, and ARMv8. In Fig. 16 we define x86, ARMv8, ARMv7 fence elimination
procedures. For instance, in X86FELIM we first identify store-load access pairs on different locations and the MFENCE
operations in a thread. Then we identify the set of non-eliminable fences nfs using getNFS procedure. In this case we
consider the positions of atomic updates along with fences as atomic updates also act as a fence. Finally FDELETE
eliminates rest of the fences.
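The x86 procedure can be sketched on an explicit control-flow graph. This is a minimal illustration under stated assumptions rather than the paper's implementation: nodes are hypothetical string identifiers, `kind` maps each node to "St", "Ld", "U", or "F", `loc` gives the location of each access and stands in for the must-alias analysis, and fences are visited in sorted order to keep the result deterministic. Instead of rewriting the graph, x86_felim returns the set of fences that may be deleted.

```python
def reach(nodes, edges, i, j):
    """True if j is reachable from i by a non-empty path whose nodes lie in `nodes`."""
    seen, stack = set(), [i]
    while stack:
        n = stack.pop()
        for (a, b) in edges:
            if a == n and b in nodes and b not in seen:
                if b == j:
                    return True
                seen.add(b)
                stack.append(b)
    return False

def get_nfs(nodes, edges, pairs, fences, blocked):
    """Collect non-eliminable fences, following the GETNFS procedure."""
    nfs = set(blocked)
    for f in fences:
        for (i, j) in pairs:
            live = nodes - nfs  # FDelete(G, B): ignore already-kept fences
            if reach(live, edges, i, f) and reach(live, edges, f, j):
                nfs.add(f)
                break
    return nfs

def x86_felim(kind, loc, edges):
    """Return the MFENCE nodes that are not non-eliminable."""
    nodes = set(kind)
    fences = sorted(n for n in nodes if kind[n] == "F")
    updates = {n for n in nodes if kind[n] == "U"}
    store_load = [(i, j) for i in sorted(nodes) for j in sorted(nodes)
                  if kind[i] == "St" and kind[j] == "Ld"
                  and loc[i] != loc[j] and reach(nodes, edges, i, j)]
    nfs = get_nfs(nodes, edges, store_load, fences, updates)
    return set(fences) - nfs

# WMOV X; MFENCE; WMOV Y: no store-load pair, so the fence can go.
stores_only = x86_felim({"s1": "St", "f": "F", "s2": "St"},
                        {"s1": "X", "s2": "Y"},
                        {("s1", "f"), ("f", "s2")})
# WMOV X; MFENCE; MFENCE; RMOV Y: one fence must stay, the other can go.
double_fence = x86_felim({"s": "St", "f1": "F", "f2": "F", "l": "Ld"},
                         {"s": "X", "l": "Y"},
                         {("s", "f1"), ("f1", "f2"), ("f2", "l")})
```

Returning the deletable set rather than a rewritten graph is a design choice of this sketch; it avoids deciding how the predecessors and successors of a deleted fence should be reconnected.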
Procedure ARMV8FELIM works in multiple steps for each of the fences. Note that while mapping to ARMv8 we do
not use release-write or acquire-load accesses. Therefore we use the same ReachWO condition to check if a fence is
non-eliminable. Moreover, in case of x86 to ARMv8 we eliminate DMBFULL fences. In this case DMBFULL elimination
is safe as it introduces other DMBLD and DMBST fences. However, we do not eliminate DMBFULL when it is generated
from ARMv7 as it may remove order between a pair of accesses. In this case or in general we can weaken a DMBFULL
fence and then eliminate redundant DMBST and DMBLD fences.
In ARMv7 an F is redundant when it appears between a pair of same-location load-load, store-store, store-load, and atomic load-store accesses. Such redundant fences appear in an ARMv7 program after mapping ARMv8 programs to ARMv7/ARMv7-mca following the mapping scheme in Fig. 13a. For example, a sequence LDR; LDR in ARMv8 results in a sequence LDR; F; LDR; F in ARMv7 where the introduced F instructions are redundant, and we eliminate these fences by the ARMV7FELIM procedure.

(SC-x86A)            [R]; po ∪ po; [W] ∪ po|loc ∪ fence
(SC-ARMv8)           po|loc ∪ (aob ∪ dob ∪ bob)+
(x86A-ARMv8)         po|loc ∪ (aob ∪ bob ∪ dob)+ ∪ WR
(SC-ARMv7)           po|loc ∪ fence
(x86A-ARMv7)         po|loc ∪ fence ∪ WR
(ARMv8-ARMv7)        po|loc ∪ [St]; po ∪ fence
(ARMv7mca-ARMv7)     [St]; po ∪ po; [St] ∪ [Ld]; (po|loc ∪ fence); [Ld]
Figure 17: (M-K): Condition R for M-robust against K analysis.

(SC)            acy(po ∪ rf ∪ fr ∪ co)
(atomicity)     irr([rmw]; fre; coe)
(a) SC

(sc-per-loc)    acy(po|loc ∪ rf ∪ fr ∪ co)
(atomicity)     irr([rmw]; fre; coe)
(GHB)           acy((po \ WR) ∪ fence ∪ rfe ∪ co ∪ fr)
                where fence = po; [rmw ∪ F]; po and
                WR = [St \ codom(rmw)]; po; [Ld \ dom(rmw)]
(b) x86A
Figure 18: SC and x86A model for robustness checking
7 Robustness Analysis
We first define robustness and then discuss the conditions and their analyses in more detail.
Definition 3. Suppose M and K are concurrency models. A program is M-robust against K if all its K-consistent executions are also M-consistent.
We observe that in axiomatic models the axioms are represented in the form of irreflexivity of a relation or acyclicity of one or a combination of relations. When an axiom is violated it results in a cycle on an execution graph. Such a cycle consists of a set of internal relations which are included in program order (po) along with external relations. If these involved po relations are appropriately ordered then such a cycle would not be possible. As a result the program would have no weaker behavior and would be M-robust against a weaker model K. To capture the idea we define the external-program-order (epo) relation as follows.
epo ≜ po ∩ (codom(eco) × dom(eco))
Based on this observation we check and enforce M-robustness against K considering the relative strength (⊏) of the memory accesses of the memory models: SC ⊏ x86 ⊏ ARMv8 ⊏ ARMv7-mca ⊏ ARMv7. In all these cases we define the required constraints on the external-program-order (epo) edges in an execution which preserve robustness.
Checking robustness in x86. A subtle issue in checking SC-robustness against x86 is that the mo relation may take place between writes on different locations, and in that case we have to consider a possible cycle through different-location writes as well. To avoid this complexity, we use the x86A model following Alglave et al. [2014], Alglave and Maranget as shown in Fig. 18 for robustness analyses. In this model there is no mo relation and, unlike x86, an update operation results in an rmw ⊆ po|loc relation instead of an event, similar to the ARM models. In Fig. 18 we also define the SC model Alglave et al. [2014], Alglave and Maranget for robustness analysis.
Robustness conditions In Fig. 17 we define the conditions which have to be fulfilled by epo in all executions for a
given program. An x86A execution is SC-robust when all epo relations are fully ordered as defined in (SC-x86A). In
ARMv8 model condition (SC-ARMv8) preserves order for all epo relations. Condition (x86A-ARMv8) orders all epo
relations except non-RMW store-load access pairs on different locations similar to x86A. ARMv7 model uses po|loc
and fence to order epo relations fullly to preserve SC robustness. We do not use ppo in these constraints as it violates
robustness as shown in the example in Fig. 8. To preserve x86A robustness, ARMv7 orders all epo relations except
18
On Architecture to Architecture Mapping for Concurrency
ReachWO(G, i, j, F) ≜ Reach(〈G.V \ F, G.E \ B〉, i, j) where B = (G.V × F) ∪ (F × G.V)
Ordered(G, (i, j), F) ≜ mustAlias(i, j) ∨ ¬ReachWO(G, i, j, F)
OnCyc(A) ≜ {(a, b) | (a, b) ∈ A ∧ ∃(p, q), (r, s) ∈ A. (a, b) ≠ (p, q) ∧ (a, b) ≠ (r, s)
             ∧ mayAlias(b, p) ∧ mayAlias(a, s)}
getG(b) ≜ G where b ∈ G.V

procedure INSERTF(P, O)
    H ← ∅
    for (a, b) ∈ O do
        if b ∉ H then
            G ← getG(b);
            f ← new(MFENCE);
            G.V ← G.V ∪ {f};
            P ← {(a, f) | (a, e) ∈ G.E+}
            Q ← {(f, b) | (e, b) ∈ G.E+}
            G.E ← G.E ∪ {(f, e)} ∪ P ∪ Q;
            H ← H ∪ {b}
end procedure

procedure SCROBUSTX86(P, N)
    ℓ ← St ∪ Ld ∪ U;
    A ← ⋃_{i∈N} mpairs(P(i), ℓ, ℓ);
    O ← ∅;
    for (a, b) ∈ OnCyc(A) do
        B ← {f | f ∈ G.V ∧ [[f]] ∈ F ∪ U}
        if ¬Ordered(getG(b), (a, b), B) then
            O ← O ∪ {(a, b)};
    if O == ∅ then return true;
    else
        INSERTF(P, O);
        return false;
end procedure
Figure 19: Analysis and enforcement of SC-robustness against x86.
procedure INSERTDMBV8(P, O)
    H ← ∅
    for (a, b) ∈ O do
        if a ∉ H then
            G′ ← getG(a)
            if isLd(a) ∧ ¬isLL(a) then
                f ← new(DMBLD);
            else
                f ← new(DMBFULL);
            G.V ← G.V ∪ {f};
            P ← {(p, f) | (p, a) ∈ G.E+}
            Q ← {(f, q) | (a, q) ∈ G.E+}
            G.E ← G.E ∪ {(f, a)} ∪ P ∪ Q;
            H ← H ∪ {a}
    return false;
end procedure
(a) ARMv8

procedure INSERTDMBV7(P, O)
    H ← ∅
    for (a, b) ∈ O do
        if a ∉ H ∧ ¬isLL(a) then
            G′ ← getG(a)
            f ← new(DMBFULL);
            G.V ← G.V ∪ {f};
            P ← {(p, f) | (p, a) ∈ G.E+}
            Q ← {(f, q) | (a, q) ∈ G.E+}
            G.E ← G.E ∪ {(f, a)} ∪ P ∪ Q;
            H ← H ∪ {a}
    return false;
end procedure
(b) ARMv7
Figure 20: Fence insertion in ARMv8 and ARMv7 for enforcing robustness.
non-RMW store-load access pairs on different locations. Condition (ARMv8-ARMv7) likewise does not rely on the
dependency-based ordering of ppo. For example, data; coi ⊆ dob in ARMv8 does not imply ppo in ARMv7. Such a
dob order may therefore rule out an execution as ARMv8-inconsistent while the ARMv7 model still allows it, which
would violate ARMv8-robustness. Finally, (ARMv7mca-ARMv7) checks if the program may have any MCA behavior.
Now we state the robustness theorem based on these constraints and prove the respective robustness results in
Appendix D.
Theorem 14. A program P is M-robust against K if, in every K-consistent execution X of P, X.epo ⊆ X.R holds, where R
is defined by condition (M-K) in Fig. 17.
ReachWO(G, i, j, F) ≜ Reach(〈G.V \ F, G.E \ B〉, i, j) where B = (G.V × F) ∪ (F × G.V)
RA(i, Rel, Acq) ≜ {a | ¬ReachWO(G, i, a, Rel) ∧ a ∈ Acq}
isSt(i) ≜ [[i]] ∈ St
isSC(i) ≜ [[i]] ∈ St ∩ codom(rmw)
isLd(i) ≜ [[i]] ∈ Ld
isAcq(i) ≜ [[i]] ∈ A
isLL(i) ≜ [[i]] ∈ Ld ∩ dom(rmw)
isW(i) ≜ [[i]] ∈ St ∪ L
isR(i) ≜ [[i]] ∈ Ld ∪ A

procedure ORDERED(G, i, j)
    FF ← {f | f ∈ G.V ∧ [[f]] ∈ F}; FL ← {f | f ∈ G.V ∧ [[f]] ∈ FLD};
    FS ← {f | f ∈ G.V ∧ [[f]] ∈ FST}; L ← {a | a ∈ G.V ∧ [[a]] ∈ L};
    A ← {a | a ∈ G.V ∧ [[a]] ∈ A}; B ← FF ∪ RA(i, L, A);
    Switch(i, j)
        Case mustAlias(i, j):
        Case isSt(i) ∧ isLd(j) ∧ ¬ReachWO(G, i, j, B):
        Case (isRel(j) ∨ isAcq(i)) ∨ (isRel(i) ∧ isAcq(j)):
        Case isLd(i) ∧ isLd(j) ∧ ¬ReachWO(G, i, j, B ∪ FL):
        Case isLd(i) ∧ isSt(j) ∧ ¬ReachWO(G, i, j, B ∪ FL ∪ Lcoi(G, j)):
        Case isSt(i) ∧ isSt(j) ∧ ¬ReachWO(G, i, j, B ∪ FS ∪ Lcoi(G, j)):
        Case (isLL(i) ∨ isLd(i)) ∧ (isSt(j) ∨ isSC(j)): return true;
    return false;
(a) Checking order for pairs

procedure SCROBUSTARMV8(P, N)
    ℓ ← St ∪ Ld ∪ L ∪ A
    A ← ⋃_{i∈N} mpairs(P(i), ℓ, ℓ);
    O ← ∅
    for (a, b) ∈ OnCyc(A) do
        B ← GETB(getG(b))
        if ¬Ordered(getG(b), a, b) then
            O ← O ∪ {(a, b)};
    if O == ∅ then return true;
    else INSERTDMBV8(P, O);
end procedure
(b) SC-robust against ARMv8

procedure X86ROBUSTARMV8(P, N)
    ℓ ← St ∪ Ld ∪ L ∪ A
    A ← ⋃_{i∈N} mpairs(P(i), ℓ, ℓ);
    O ← ∅
    for (a, b) ∈ OnCyc(A) do
        G ← getG(b)
        B ← GETB(G)
        C ← isW(i) ∧ isR(j) ∧ ¬(isSC(i) ∧ isLL(j))
        if ¬(C ∨ Ordered(G, a, b)) then
            O ← O ∪ {(a, b)};
    if O == ∅ then return true;
    else INSERTDMBV8(P, O);
end procedure
(c) x86 robust against ARMv8
Figure 21: Robustness analysis of ARMv8 programs
7.1 Checking and enforcing robustness
When an execution is K-consistent but violates M-consistency, it contains a cycle that violates an irreflexivity
condition. Such a cycle contains events on different locations and therefore two or more epo edges: for each such
epo edge (a, b) there exist other epo edge(s) (p, q) and (r, s) such that b and a access the same locations as p and
s respectively, as (b, p), (s, a) ∈ eco.
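The cycle shape described above has a direct syntactic counterpart in the OnCyc filter of Fig. 19. The following is a minimal Python sketch of that filter. The tuple encoding of accesses and the same_location predicate are illustrative assumptions, not part of the LLVM implementation.

```python
def on_cyc(pairs, may_alias):
    """Keep the access pairs that may sit on a robustness-violating cycle.

    Follows OnCyc(A) in Fig. 19: (a, b) is kept if there are other pairs
    (p, q) and (r, s) with may_alias(b, p) and may_alias(a, s).
    """
    pairs = set(pairs)
    kept = set()
    for a, b in pairs:
        others = pairs - {(a, b)}
        leaves = any(may_alias(b, p) for p, _q in others)   # b may relate to some p
        returns = any(may_alias(a, s) for _r, s in others)  # some s may relate to a
        if leaves and returns:
            kept.add((a, b))
    return kept


# Hypothetical encoding: an access is (thread, kind, location).
def same_location(u, v):
    return u[2] == v[2]


# Store-buffering shape: each thread stores one location and loads the other.
sb = {
    ((0, "St", "x"), (0, "Ld", "y")),
    ((1, "St", "y"), (1, "Ld", "x")),
}
# Two threads touching disjoint locations cannot close a cycle.
disjoint = {
    ((0, "St", "x"), (0, "Ld", "y")),
    ((1, "St", "z"), (1, "Ld", "w")),
}
```

With these inputs, both pairs of the store-buffering shape are kept, while the disjoint program yields no candidate pair.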
We lift this semantic notion of robustness to program syntax in order to analyze and enforce robustness. We first
identify the memory access pairs in all threads, as these are potential epo edges. Next, we conservatively check whether
the memory access pairs satisfy the robustness conditions in Fig. 17 in all K-consistent executions of the program. If so,
we report the program as M-robust against K. To enforce robustness we insert appropriate fences between the memory
access pairs.
We perform such an analysis in Fig. 19 to check and enforce SC-robustness against x86 by procedure
SCROBUSTX86, using a number of helper conditions. ReachWO(G, i, j, F) checks whether there is a program path from
access i to access j that does not pass through the fences F in G. Ordered(G, (i, j), F) checks whether the access pair (i, j) is
isWR(i, j) ≜ isW(i) ∧ isR(j) ∧ ¬(isSC(i) ∧ isLL(j))
Ordered(G, i, j, F) ≜ mustAlias(i, j) ∨ ¬ReachWO(G, i, j, F)

procedure SCROBUSTARMV7(P, N)
    ℓ ← St ∪ Ld
    pr ← ⋃_{k∈N} mpairs(P(k), ℓ, ℓ);
    O ← ∅
    for (i, j) ∈ OnCyc(pr) do
        G ← getG(b)
        F = {f | f ∈ G.V ∧ [[f]] ∈ F};
        if ¬Ordered(G, i, j, F) then
            O ← O ∪ {(i, j)}
    if O == ∅ then
        return true;
    else INSERTDMBV7(P, O);
end procedure
(a) SC-robustness against ARMv7

procedure X86ROBUSTARMV7(P, N)
    ℓ ← St ∪ Ld
    pr ← ⋃_{k∈N} mpairs(P(k), ℓ, ℓ);
    O ← ∅
    for (i, j) ∈ OnCyc(pr) do
        G ← getG(b)
        F = {f | f ∈ G.V ∧ [[f]] ∈ F};
        if ¬(isWR(i, j) ∨ Ordered(G, i, j, F)) then
            O ← O ∪ {(i, j)}
    if O == ∅ then
        return true;
    else INSERTDMBV7(P, O);
end procedure
(b) x86-robustness against ARMv7

procedure ARMV8ROBUSTARMV7(P, N)
    ℓ ← St ∪ Ld
    pr ← ⋃_{k∈N} mpairs(P(k), ℓ, ℓ);
    O ← ∅
    for (i, j) ∈ OnCyc(pr) do
        G ← getG(b)
        F = {f | f ∈ G.V ∧ [[f]] ∈ F};
        if ¬(isSt(i) ∨ Ordered(G, i, j, F)) then
            O ← O ∪ {(i, j)}
    if O == ∅ then
        return true;
    else INSERTDMBV7(P, O);
end procedure
(c) ARMv8 robust against ARMv7

procedure ARMV7MCAROBUSTARMV7(P, N)
    pr ← ⋃_{k∈N} mpairs(P(k), Ld, Ld);
    O ← ∅
    for (i, j) ∈ OnCyc(pr) do
        G ← getG(b)
        F = {f | f ∈ G.V ∧ [[f]] ∈ F};
        if ¬Ordered(G, i, j, F) then
            O ← O ∪ {(i, j)}
    if O == ∅ then
        return true;
    else INSERTDMBV7(P, O);
end procedure
(d) ARMv7-mca robust against ARMv7
Figure 22: Robustness analysis of ARMv7 programs
ordered in the respective model. For example, in Fig. 19 it checks whether i and j access the same location, using
mustAlias, or whether every path from i to j passes through at least one fence from F, using ReachWO.
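The two helper conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the control-flow graph is a plain adjacency dictionary, node names are hypothetical strings, and must_alias is a caller-supplied predicate standing in for the LLVM alias analysis.

```python
def reach(edges, src, dst):
    """True if dst is reachable from src along at least one edge."""
    stack = list(edges.get(src, ()))
    seen = set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


def reach_wo(edges, i, j, fences):
    """ReachWO: reachability after deleting fence nodes and their edges."""
    pruned = {
        u: [v for v in succs if v not in fences]
        for u, succs in edges.items()
        if u not in fences
    }
    return reach(pruned, i, j)


def ordered(edges, i, j, fences, must_alias):
    """Ordered: same location, or every path from i to j crosses a fence."""
    return must_alias(i, j) or not reach_wo(edges, i, j, fences)


# Hypothetical CFGs: the first has a branch that bypasses the fence.
bypass = {"st_x": ["mfence", "skip"], "mfence": ["ld_y"], "skip": ["ld_y"], "ld_y": []}
fenced = {"st_x": ["mfence"], "mfence": ["ld_y"], "ld_y": []}
never_alias = lambda u, v: False
always_alias = lambda u, v: True
```

The pair (st_x, ld_y) is ordered in the fenced graph but not in the graph with the bypassing branch, unless the two accesses are known to alias.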
Finally, given a set of memory access pairs A, OnCyc(A) ⊆ A identifies the memory access pairs that may
result in epo edges in an execution. SCROBUSTX86 checks whether all such store-load access pairs are appropriately
ordered, which in turn ensures SC-robustness for the program P with N thread functions. If so, we report SC-robustness
against x86. Otherwise, we insert fences between the unordered pairs using the INSERTF procedure to enforce robustness.
Similar to SCROBUSTX86, we also define procedures in Fig. 21 and Fig. 22 to check and enforce robustness in
ARMv8 and ARMv7 programs respectively.
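The enforcement step can be illustrated on a straight-line thread. The sketch below places one leading MFENCE before the second access b of each unordered pair, at most once per b, which is the role the set H plays in INSERTF. The list-of-strings instruction encoding and the index pairs are assumptions made for the example; the actual procedure updates a control-flow graph.

```python
def insert_leading_fences(thread, unordered, fence="MFENCE"):
    """Insert one fence before each target access of an unordered pair.

    thread    -- list of instructions in program order (hypothetical strings)
    unordered -- iterable of (a, b) index pairs reported as unordered
    """
    targets = {b for _a, b in unordered}  # one fence per b, like the set H
    fenced = []
    for index, instruction in enumerate(thread):
        if index in targets:
            fenced.append(fence)
        fenced.append(instruction)
    return fenced


thread = ["St x", "Ld y", "Ld z"]
```

For example, the pairs (0, 1) and (0, 2) yield a fence before each of the two loads, even though the second fence is redundant; this matches the remark that the insertion is not optimal.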
8 Experimental Evaluation
Based on the obtained results we have implemented the architecture to architecture (AA) mapping schemes defined in
Figs. 9, 12 and 13, followed by the fence elimination algorithms described in Fig. 16. We have also developed robustness
analyses for x86, ARMv8, and ARMv7 programs following the procedures in Figs. 19, 21 and 22.
We have implemented these mappings, fence eliminations, and robustness analyses in LLVM. To analyze programs for
fence elimination and checking robustness, we leverage the existing control-flow-graph analyses, alias analysis, and
memory operand type analysis in LLVM. The CFG analyses are used to define the mpairs, Path, Reach, and ReachWO
conditions. The mayAlias and mustAlias functions are defined using memory operand type and alias analyses.
Prog. Orig x-v8/AA x-v8/fd C-x-v8/AA C-x-v8/fd
barrier 0,6,6 5,5,10 2,1,6 4,0,14 2,1,8
dekker-tso 4,7,0 5,5,7 4,3,4 8,0,18 4,6,6
dekker-sc 0,7,0 5,5,3 4,5,0 8,0,14 4,6,2
pn-ra 4,3,0 5,12,7 4,7,2 4,0,16 4,5,6
pn-ra-b 0,9,6 5,10,7 4,7,2 4,0,14 4,5,4
pn-ra-d 0,5,4 5,10,7 4,5,4 4,0,14 4,5,4
pn-tso 2,3,0 5,12,7 4,7,2 4,0,14 4,5,4
pn-sc 0,3,0 5,12,3 4,9,0 4,0,12 4,5,2
lamport-ra 4,3,7 7,5,5 5,4,4 10,0,12 4,2,8
lamport-tso 2,3,5 7,5,3 5,4,2 8,0,10 4,2,6
lamport-sc 0,3,5 7,5,1 5,4,0 8,0,8 4,2,4
spinlock 0,8,6 5,7,10 2,6,0 2,0,14 2,10,0
spinlock4 0,14,12 9,11,18 4,10,0 4,0,24 4,18,0
tlock 0,8,4 7,8,8 4,5,2 4,0,16 2,5,4
tlock4 0,12,8 13,12,12 8,7,4 8,0,24 4,7,8
seqlock 0,6,4 6,4,12 5,3,2 5,0,16 5,3,2
nbw 0,3,4 10,8,12 6,6,1 7,0,18 6,7,6
rcu 0,2,10 12,15,2 3,12,0 12,0,11 2,4,4
rcu-ofl 4,16,8 17,18,24 12,6,9 15,0,51 11,2,42
cilk-tso 2,7,4 15,15,15 13,4,11 13,0,29 9,4,14
cilk-sc 0,7,4 15,15,13 13,6,9 13,0,27 9,6,12
cldq-ra 3,4,0 7,5,9 6,2,1 6,0,14 6,2,2
cldq-tso 1,4,0 9,5,7 6,2,1 6,0,12 6,2,2
cldq-sc 0,4,0 7,5,6 6,2,1 6,0,11 6,2,1
(a) x86 to ARMv8
Prog. Orig v8-x/AA v8-x/fd
barrier 6,0 4,2 4,1
dekker-tso 3,4 0,11 0,6
dekker-sc 3,0 0,7 0,3
pn-ra 3,4 0,7 0,3
pn-ra-b 5,0 2,7 2,3
pn-ra-d 5,0 2,3 2,1
pn-tso 3,2 0,5 0,5
pn-sc 3,0 0,3 0,1
lamport-ra 1,4 0,7 0,5
lamport-tso 1,2 0,5 0,3
lamport-sc 1,0 0,3 0,1
spinlock 4,0 2,4 2,1
spinlock4 6,0 4,6 4,2
tlock 6,0 2,6 2,3
tlock4 8,0 4,8 4,2
seqlock 6,0 4,4 4,1
nbw 4,0 2,3 2,2
rcu 2,0 0,2 0,1
rcu-ofl 16,4 1,20 1,12
cilk-tso 5,2 2,9 2,2
cilk-sc 5,0 2,7 2,1
cldq-ra 4,3 2,5 2,2
cldq-tso 4,1 2,3 2,2
cldq-sc 4,0 2,2 2,2
(b) ARMv8 to x86
Figure 23: Mappings between x86 and ARMv8. In x86 to ARMv8: #(ish, stl, lda) in original and #(ishld, ishst, ish)
after mapping. In ARMv8 to x86: #(RMW,mfence) in original and generated programs.
We have evaluated these implementations on a number of well-known concurrent algorithms and data structures
Lahav and Margalit [2019], Norris and Demsky [2013] that use C11 concurrency primitives extensively. These
programs exhibit fork-join concurrency where the threads are created from a set of functions. In these programs the
memory accesses are relaxed accesses in general, and for wait loops we use release/acquire accesses. Some of the
programs have release-acquire/TSO/SC versions, each of which assumes that the program runs on the respective
memory model.
8.1 Mapping Schemes
We have modified the x86, ARMv7, and ARMv8 code generation phases in LLVM to capture the effect of the mapping
schemes on C11 programs. For example, in the original LLVM mapping a non-atomic store (StNA) results in WMOV and
STR accesses in x86 and ARMv8 respectively. Following the AA-mapping in Fig. 9a, WMOV results in DMBST; STR in
ARMv8. Therefore, to capture the effect of x86 to ARMv8 translation, we generate DMBST; STR in ARMv8 instead
of a STR for a C11 non-atomic store access. We modify the code lowering phase in LLVM to generate the required
leading and trailing fences along with the memory accesses. All AA-mapping schemes introduce additional fences
compared to the original mapping, which is evident from the ‘Orig’ and ‘AA’ columns in Figs. 23a, 23b, 24a
and 24b.
x86 to ARMv8 mappings (Fig. 9a). In Fig. 23a we show the numbers of different fences resulting from C11 ↦ ARMv8
(Orig), x86 ↦ ARMv8 (AA in x-v8), and C11 ↦ x86 ↦ ARMv8 (AA in C-x-v8). Both the x86 ↦ ARMv8 and
C11 ↦ x86 ↦ ARMv8 mapping schemes generate more fences than the original C11 ↦ ARMv8 mapping.
x86 ↦ ARMv8 (x-v8) generates more DMBLD fences than C11 ↦ x86 ↦ ARMv8 (C-x-v8), as the former
scheme generates a trailing DMBLD fence for non-atomic loads. However, the number of DMBFULL fences is higher in
C-x-v8 than in x-v8, as atomic stores introduce leading DMBFULL fences instead of DMBST. For the same reason
there is no DMBST in the C-x-v8 column.
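The effect of a direct mapping on a single thread can be sketched as a table lookup. The sketch below encodes only the two x86 to ARMv8 rules stated in the text (a leading DMBST for a store and a trailing DMBLD for a non-atomic load). The opcode name RMOV for an x86 load, the tuple encoding of instructions, and the helper names are assumptions made for illustration; the complete scheme is the one in Fig. 9a.

```python
# Partial x86 -> ARMv8 table; only the rules quoted in the text are listed.
X86_TO_ARMV8 = {
    "WMOV": ["DMBST", "STR"],  # store: leading store fence
    "RMOV": ["LDR", "DMBLD"],  # load: trailing load fence (RMOV is an assumed name)
}

FENCES = {"DMBST", "DMBLD", "DMBFULL", "MFENCE"}


def translate(program, table):
    """Expand each (opcode, location) into its target instruction sequence."""
    out = []
    for opcode, location in program:
        for target in table[opcode]:  # raises KeyError for unmapped opcodes
            out.append((target, None) if target in FENCES else (target, location))
    return out


source = [("WMOV", "x"), ("RMOV", "y")]
```

Applied to the two-instruction source above, the translation yields DMBST; STR x; LDR y; DMBLD, which is the kind of fence overhead that the later elimination pass is meant to reduce.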
ARMv8 to x86 mappings (Fig. 12b). As shown in Fig. 23b, the number of atomic updates and fence operations in
the AA-mapping varies from Orig due to the mapping of C11 StSC and StREL accesses. In the original mapping StSC ↦ RMW
and StREL ↦ WMOV, whereas in the AA-mapping St(REL|SC) ↦ STLR ↦ WMOV; MFENCE. As a result, the number of
Programs Orig v7-v8/AA v7-v8/fd
barrier 0,6,6 0,0,12 1,1,8
dekker-tso 4,7,0 0,0,11 2,5,4
dekker-sc 0,7,0 0,0,7 2,5,0
pn-ra 4,3,0 0,0,7 0,2,4
pn-ra-b 0,9,6 0,0,15 0,4,6
pn-ra-d 0,5,4 0,0,9 0,2,6
pn-tso 2,3,0 0,0,5 0,2,2
pn-sc 0,3,0 0,0,3 0,2,0
lamport-ra 4,3,7 0,0,14 1,2,11
lamport-tso 2,3,5 0,0,10 1,2,7
lamport-sc 0,3,5 0,0,8 1,2,5
spinlock 0,8,6 0,0,14 2,11,0
spinlock4 0,14,12 0,0,26 4,21,0
tlock 0,8,4 0,0,12 2,5,4
tlock4 0,12,8 0,0,20 4,7,8
seqlock 0,4,4 0,0,8 4,3,2
nbw 0,3,4 0,0,9 1,3,4
rcu 0,2,10 0,0,12 0,1,10
rcu-ofl 4,16,8 0,0,29 1,1,22
cilk-tso 2,7,4 0,0,15 4,4,8
cilk-sc 0,7,4 0,0,13 4,6,6
cldq-ra 3,4,0 0,0,7 3,1,6
cldq-tso 1,4,0 0,0,5 1,1,2
cldq-sc 0,4,0 0,0,4 1,1,1
(a) ARMv7 to ARMv8
Prog. Orig v8-v7/AA v8-v7/fd C-v8-v7/AA C-v8-v7/fd
barrier 13 19 16 13 12
dekker-tso 12 25 23 22 19
dekker-sc 8 21 20 18 15
pn-ra 8 17 16 12 11
pn-ra-b 14 19 16 18 13
pn-ra-d 10 17 16 12 11
pn-tso 6 15 14 10 9
pn-sc 4 13 12 8 7
lamport-ra 15 21 20 18 17
lamport-tso 11 18 17 15 14
lamport-sc 9 16 15 13 12
spinlock 13 18 17 15 12
spinlock4 23 32 31 27 22
tlock 13 20 16 17 12
tlock4 21 34 29 29 20
seqlock 11 19 15 12 9
nbw 7 23 21 15 13
rcu 13 36 32 15 14
rcu-ofl 30 55 49 39 36
cilk-tso 13 34 31 30 22
cilk-sc 11 32 31 28 23
cldq-ra 8 19 18 12 12
cldq-tso 6 18 17 12 11
cldq-sc 5 18 17 11 11
(b) ARMv8 to ARMv7
Figure 24: Mappings between ARMv7 and ARMv8. Original mapping to ARMv8 is (DMBFULL, release-store, acquire-
load). In ARMv7-ARMv8 mapping the numbers are of (DMBLD, DMBST, DMBFULL).
atomic updates is lower and the number of fences is higher in the AA-mapping than in the original x86 mapping in
LLVM. We can observe the tradeoff between x86 ↦ ARMv8 and C11 ↦ x86 ↦ ARMv8 by considering the numbers of
generated DMBLD and DMBFULL fences. For example, in the barrier program x86 ↦ ARMv8 generates more DMBLD fences than
C11 ↦ x86 ↦ ARMv8, as it generates DMBLD fences for non-atomic loads. On the other hand, C11 ↦ x86 ↦ ARMv8
generates DMBFULL fences for relaxed atomic stores instead of DMBST fences.
ARMv8 to ARMv7 mappings (Fig. 13a). We show the number of DMB fences in Fig. 24b due to the C11 ↦ ARMv8 (Orig),
ARMv8 ↦ ARMv7 (AA in v8-v7), and C11 ↦ ARMv8 ↦ ARMv7 (AA in C-v8-v7) mappings. Both ARMv8 ↦ ARMv7 and
C11 ↦ ARMv8 ↦ ARMv7 generate more fences than the C11 ↦ ARMv8 mapping. Moreover, C11 ↦ ARMv8 ↦ ARMv7
generates fewer fences than ARMv8 ↦ ARMv7, as we do not generate trailing DMB fences for non-atomic
loads.
ARMv7 to ARMv8 mappings (Fig. 12a). The results are in Fig. 24a. The original C11 ↦ ARMv8 mapping
generates DMBFULL, release-store, and acquire-load operations for these programs, whereas the AA-mapping generates
only the respective DMBFULL fences, as ARMv7 does not have release-store and acquire-load operations.
8.2 Fence elimination
The fence optimization passes remove a significant number of fences, as shown in the ‘fd’ columns in Figs. 23a, 23b, 24a
and 24b. We have implemented the fence elimination algorithms as LLVM passes and run them after the AA-mappings
to eliminate redundant fences. Each pass extends the LLVM MachineFunctionPass and runs on each machine function of the
program. The precision of our analyses depends upon the underlying LLVM analyses that we use. For example,
we apply alias analysis and memory operand analysis to identify the memory location accessed by a particular access.
Consider a scenario where we have identified an MFENCE between a store-load pair. If we can determine precisely that the
store and the load access the same location, then we can eliminate the fence. Otherwise we conservatively mark the fence as
non-eliminable.
Fence elimination after x86 to ARMv8 mapping. The fence elimination algorithms eliminate a number of
redundant fences after the mapping. In some scenarios the original C11 to ARMv8 mapping is too restrictive as it generates
Prog.       x86A        ARMv8                 ARMv7                                   Rocker (RA)  Trencher (TSO)
            SC          SC           x86A     SC           x86A    v8      mca        SC           SC
barrier     8|0 ✗1      12|6 ✗6      ✗5       12|10 ✗1     ✓       ✓       ✓          ✓(#2)        ✗(#2)
dekker-tso  20|4 ✓      20|8 ✗6      ✗6       20|8 ✗8      ✗8      ✗8      ✗4         ✓(#2)        ✓(#2)
dekker-sc   20|0 ✗10    20|4 ✗12     ✗9       20|4 ✗12     ✗8      ✗8      ✗4         ✗(#2)        ✗(#2)
pn-ra       12|4 ✓      12|4 ✗7      ✗7       12|4 ✗8      ✗8      ✗6      ✗4         ✓(#2)        ✓(#2)
pn-ra-b     10|0 ✗2     12|10 ✗2     ✗2       12|12 ✗2     ✓       ✓       ✓          ✗(#2)        ✗(#2)
pn-ra-d     10|0 ✓      12|4 ✗8      ✗8       12|6 ✗8      ✗8      ✗4      ✗2         ✓(#2)        ✓(#2)
pn-tso      12|2 ✓      12|2 ✗9      ✗9       12|2 ✗10     ✗10     ✗6      ✗4         ✗(#2)        ✓(#2)
pn-sc       12|0 ✗4     12|0 ✗11     ✗11      12|0 ✗10     ✗10     ✗6      ✗4         ✗(#2)        ✗(#2)
lmprt-ra    19|4 ✗8     18|13 ✗7     ✗4       19|13 ✗6     ✗4      ✗3      ✓0         ✓(#2/3)      ✓(#2)
lmprt-tso   17|2 ✗6     16|9 ✗11     ✗10      17|9 ✗8      ✗7      ✗6      ✗1         ✗(#2)        ✓(#2)
lmprt-sc    17|0 ✗8     16|7 ✗14     ✗13      17|7 ✗10     ✗9      ✗8      ✗3         ✗(#2)        ✗(#2)
spinlock    8|0 ✓       10|8 ✓       ✓        12|12 ✓      ✓       ✓       ✓          ✓(#2)        ✓(#2)
spinlock4   16|0 ✓      20|16 ✓      ✓        24|24 ✓      ✓       ✓       ✓          ✓(#4)        ✓(#4)
tlock       10|0 ✓      10|6 ✓       ✓        12|8 ✓       ✓       ✓       ✓          ✓(#2)        ✓(#2)
tlock4      20|0 ✓      20| ✓        ✓        24|16 ✓      ✓       ✓       ✓          ✓(#2)        ✓(#2)
seqlock     7|0 ✓       11|8 ✗3      ✗3       11|8 ✗1      ✗1      ✗1      ✗1         ✓(#2)        ✓(#2)
nbw         15|0 ✓      18|2 ✗12     ✗12      20|7 ✗10     ✗10     ✗9      ✗8         ✓(#4)        ✓(#4)
rcu         27|0 ✗10    25|10 ✗16    ✗12      27|10 ✗18    ✗18     ✗9      ✗7         ✓(#4)        ✗(#4)
rcu-ofl     30|4 ✗7     33|14 ✗27    ✗25      36|19 ✗17    ✗16     ✗15     ✗6         ✓(#3)        ✗(#3)
cilk-tso    11|2 ✓      28|6 ✗8      ✗8       29|10 ✗7     ✗7      ✗7      ✗7         ✓(#2)        ✓(#2)
cilk-sc     11|0 ✓      28|4 ✗9      ✗9       29|8 ✗8      ✗8      ✗8      ✗8         ✗(#2)        ✗(#2)
cldq-ra     9|3 ✓       11|5 ✗3      ✗3       11|5 ✗3      ✗3      ✗3      ✗1         ✓(#3)        ✓(#3)
cldq-tso    9|1 ✓       11|3 ✗5      ✗5       11|3 ✗5      ✗5      ✗5      ✗3         ✗(#3)        ✓(#3)
cldq-sc     9|0 ✗1      11|2 ✗6      ✗7       11|2 ✗6      ✗6      ✗6      ✗4         ✗(#3)        ✗(#3)
Figure 25: Robustness analyses. Entry (a|b ✓/✗ c) where a: #fences inserted by the naive scheme excluding the existing
fences, b: #existing fences, ✓/✗: program is robust or not, c: #fences inserted to enforce robustness. Rocker and
Trencher robustness results (for #k threads) are taken from Lahav and Margalit [2019]. Our analysis of SC-robustness
against x86A matches Trencher in a number of cases. ARMv8 and ARMv7 are weaker than RA and therefore
we report non-robustness in these programs.
release-store and acquire-load accesses for C11 release-store and acquire-load accesses respectively. In our scheme
we prefer to generate fences separately, and fence elimination removes the extra fences.
Fence elimination after C11 to x86 to ARMv8 mapping. In this case we first weaken the DMBFULL fences to a pair of
DMBST and DMBLD fences whenever appropriate and then perform the fence elimination. This introduces some
DMBST fences in the ‘fd’ column of C-x-v8.
Fence elimination after ARMv8 to x86 mapping. The mapping generates an MFENCE for each release-store, and the
fence elimination safely eliminates these fences whenever possible.
Fence elimination after ARMv7 to ARMv8 mapping. In this case the mapping introduces DMBFULL fences in ARMv8
from ARMv7 DMB fences. We eliminate repeated fences, if any, then weaken the DMBFULL fences to DMBST and
DMBLD fences, and further eliminate redundant fences.
Fence elimination after ARMv8 to ARMv7 mapping. The ARMv8 to ARMv7 mapping generates extra fences in certain
scenarios such as LDR; STLR ↦ LDR; DMB; DMB; STR; DMB, where we can safely remove a repeated DMB fence. A similar
scenario arises for the LDRA; STLR mapping in the C11 to ARMv8 to ARMv7 mapping.
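The repeated-fence case above is the simplest elimination to illustrate. The following Python sketch drops a fence that immediately follows an identical fence in a straight-line instruction sequence. It covers only this adjacent-duplicate pattern and is not the full algorithm of Fig. 16; the string encoding of instructions is an assumption.

```python
def drop_repeated_fences(sequence, fences=frozenset({"DMB"})):
    """Remove a fence that directly follows an identical fence."""
    result = []
    for instruction in sequence:
        if instruction in fences and result and result[-1] == instruction:
            continue  # adjacent duplicate, the earlier fence already orders
        result.append(instruction)
    return result


mapped = ["LDR", "DMB", "DMB", "STR", "DMB"]
```

On the sequence produced for LDR; STLR, one of the two adjacent DMB fences is removed and the trailing DMB after the store is kept.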
8.3 Robustness analysis
We implement the robustness analyses as LLVM passes following the procedures in Fig. 19 as well as
Figs. 21 and 22 in the appendix, after instruction lowering in x86, ARMv8, and ARMv7. We report the analysis
results on the concurrent programs in Fig. 25. In these results we mark both the robustness checking and the robustness
enforcement results. We have also included the results from Lahav and Margalit [2019] for two other robustness
checkers: Trencher Bouajjani et al. [2013] and Rocker Lahav and Margalit [2019].
Now we discuss the robustness results of the benchmark programs, which are marked by ✓ or ✗. Among these
programs spinlock, spinlock4, seqlock, ticketlock (tlock), and ticketlock4 (tlock4) provide robustness in all models. These
results also match those of Trencher and Rocker, both of which are SC-robustness checkers. In the rest of the programs
we observe robustness violations due to various unordered access sequences. For example, (St-Ld) violates
SC-robustness in all architectures, (SC-St/Ld) violates x86A-robustness in ARMv8 and ARMv7, and (Ld-Ld) violates all
robustness notions in the ARMv8 and ARMv7 models.
Robustness of x86 programs. We first focus on SC-robustness against x86A and compare the results with Trencher. Our
analysis precisely analyzes robustness and agrees with Trencher in all cases except lamport-ra (lmprt-ra), lamport-tso
(lmprt-tso), and cilk-sc. Both lamport-ra and lamport-tso have a (St-Ld) sequence in different thread functions. As a
result, our analysis reports an SC-robustness violation, which is a false positive, as in actual executions these access pairs
never execute concurrently. In cilk-sc we report SC-robustness, although the program has store-load sequences of the form
a = LdRLX(T); StRLX(T, a−1); LdRLX(H). In this case StRLX(T, a−1); LdRLX(H) may yield non-SC behavior
during an execution, which is reported by Trencher and Rocker. However, LLVM combines the load and store of T
into an atomic fetch-and-sub (fsub) operation, that is, a = LdRLX(T); StRLX(T, a−1) ⇝ a = fsub(T, 1). As a result
the program becomes SC-robust against x86 in LLVM, as reported by our analysis.
Robustness of ARMv8 programs. Next, we study SC-robustness and x86A-robustness against ARMv8 for the benchmark
programs. ARMv8 allows out-of-order execution of memory accesses on different locations that do not affect
dependencies. Therefore many of these programs are not SC-robust or x86A-robust in ARMv8. Our robustness
analyzer also does not rely on dob ordering, as it performs the analysis during the code lowering phase, before the
ARMv8 machine code is generated. LLVM may perform optimizations after the analysis that remove certain
dependencies; an analysis relying on them would then be unsound and could report false negatives.
As ARMv8 is weaker than x86A, the programs that are not SC-robust in x86A are also not SC-robust in ARMv8.
Programs like barrier, peterson-ra-Bartosz (pn-ra-b), peterson-sc (pn-sc), lamport-ra/tso/sc, rcu, rcu-offline (rcu-ofl),
and chase-lev-dequeue-tso/sc (cldq-tso/sc) are in this category. There are also programs, such as dekker-tso, that are
SC-robust in x86 but not in ARMv8. These programs violate both SC-robustness and x86A-robustness due to unordered
(Ld-Ld) or (SC-St/Ld) pairs.
Robustness of ARMv7 programs. Now we move to the robustness analysis in ARMv7. Except for the spinlock, spinlock4,
and seqlock programs, all programs violate SC-robustness due to patterns similar to those discussed for ARMv8
robustness. Among these programs, SC-robustness is violated in barrier due to an unordered (St-Ld) sequence. This
access pattern is allowed in x86A, ARMv8, and ARMv7-mca, and therefore these ARMv7 programs are robust in
these models. Program rcu has unordered (St-St) pairs, which violate SC-robustness and x86A-robustness. However, these
pairs do not violate ARMv8 and ARMv7-mca robustness. The rest of the programs exhibit certain (Ld-Ld) pairs which result
in x86, ARMv8, and ARMv7-mca robustness violations.
8.3.1 Enforcing robustness
Whenever we identify a program as non-robust, we insert appropriate fences to enforce the respective robustness. For
example, in Fig. 19 we identify the different-location store-load access pairs that may violate robustness. We introduce
a leading MFENCE operation for the load operation of the pair as required.
A naive scheme does not use robustness information. It first eliminates the existing fences in concurrent threads and then
inserts a fence after each memory access, except atomic updates in x86 and load-exclusive accesses in the ARM models,
to restrict program behavior. In both the naive scheme and our approach we do not insert fences for atomic updates. In
ARMv8 we insert trailing DMBLD fences for loads and trailing DMBFULL fences for stores and store-exclusives when they
are unordered with a successor. In ARMv7 we insert trailing DMBFULL fences for loads, stores, and store-exclusives when
they are unordered with a successor.
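The choice of trailing fence can be summarized as a small decision function. The sketch below follows the branches of INSERTDMBV8 and INSERTDMBV7 in Fig. 20: in ARMv8 a plain load gets DMBLD and every other access gets DMBFULL, while in ARMv7 a load-exclusive gets no fence and every other access gets DMBFULL. The access-kind labels ("Ld", "St", "LdX", "StX") and the function name are assumptions made for the example.

```python
def trailing_fence(arch, kind):
    """Pick the trailing fence for an access that is unordered with a successor.

    kind is one of "Ld", "St", "LdX" (load-exclusive), "StX" (store-exclusive).
    Returns the fence name, or None when no fence is inserted.
    """
    if arch == "ARMv8":
        # Fig. 20a: isLd(a) and not isLL(a) selects DMBLD, otherwise DMBFULL.
        return "DMBLD" if kind == "Ld" else "DMBFULL"
    if arch == "ARMv7":
        # Fig. 20b: load-exclusives are skipped, everything else gets DMBFULL.
        return None if kind == "LdX" else "DMBFULL"
    raise ValueError(f"unsupported architecture: {arch}")
```

The only place where ARMv8 is cheaper than ARMv7 in this scheme is the plain load, which can use the weaker DMBLD fence.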
In Fig. 25 we report the number of fences required by the naive scheme, the results of the robustness analyses in our
proposed approach, and the number of fences introduced to enforce robustness. We compare our results to the naive scheme
as explained in Fig. 25 and find that our approach inserts fewer fences in most instances. However, our fence
insertion is not optimal; we leave optimal fence insertion for enforcing robustness for future investigation.
9 Related Work
Architecture to architecture mapping. A number of dynamic binary translators Ding et al. [2011], Wang et al.
[2011], Hong et al. [2012], Lustig et al. [2015], Cota et al. [2017] emulate multithreaded programs. Among these, earlier
translators such as PQEMU Ding et al. [2011], COREMU Wang et al. [2011], HQEMU Hong et al. [2012] and so on do
not address the memory consistency model mismatches. ArMOR Lustig et al. [2015] proposes a specification format
to define the ordering requirements for different memory models, which is used in translating between architectural
concurrency models in dynamic translation. The specification format is used in specifying the TSO and Power
architectures. Cota et al. [2017] uses the rules from ArMOR in the Pico dynamic translator for QEMU. Our mapping schemes
provide the ordering rules which can be used to populate the ordering tables for the x86 and ARM models. Moreover, the
ARMv8 reordering table in Fig. 14 demonstrates that reordering certain independent access pairs is not safe if they
are part of certain dependency-based orderings. In addition to the QEMU-based translators, LLVM-based decompilers
Bougacha, Bits, Yadavalli and Smith [2019], avast, Shen et al. [2012] raise binary code to LLVM IR and then compile it
to another architecture. These decompilers do not support relaxed memory concurrency.
Fence optimization. Redundant fence elimination is addressed by Vafeiadis and Zappa Nardelli [2011], Elhorst [2014],
Morisset and Nardelli [2017]. Vafeiadis and Zappa Nardelli [2011] performs safe fence elimination in x86, Elhorst
[2014] eliminates adjacent fences in ARMv7, and Morisset and Nardelli [2017] perform efficient fence elimination in
x86, Power, and ARMv7. However, none of these approaches performs ARMv8 fence elimination.
Robustness analysis. Sequential consistency robustness has been explored against TSO Bouajjani et al. [2013],
POWER Derevenetc and Meyer [2014], and Release-Acquire Lahav and Margalit [2019] models by exploring
executions using model checking tools. Alglave et al. [2017] proposed fence insertion in POWER to strengthen a program
to release/acquire semantics, which has the same preserved-program-order constraints between memory accesses as TSO.
In contrast, we identify robustness checking conditions in ARMv7 and ARMv8, where we show that
preserved-program-order is not sufficient to recover sequential consistency in ARMv7 models. Identifying a minimal set of
fences is NP-hard Lee and Padua [2001], and a number of approaches such as Shasha and Snir [1988], Bouajjani et al. [2013],
Lee and Padua [2001], Alglave et al. [2017] proposed fence insertion to recover stronger order, particularly sequential
consistency. Similar to Lee and Padua [2001], our approach is based on analyzing control flow graphs without
exploring the possible executions by model checkers. Though in certain scenarios we report false positives, our approach
precisely identifies robustness for a number of well-known programs.
10 Conclusion and Future Work
In this paper we propose correct and efficient mapping schemes between the x86, ARMv8, and ARMv7 concurrency
models. We have shown that ARMv8 can indeed serve as an intermediate model for mapping between x86 and
ARMv7. We have also shown that removing non-multicopy atomicity from ARMv7 does not affect the mapping
schemes. We also show that the ARMv8 model cannot serve as an IR in a decompiler as it does not support all common
compiler optimizations. Next, we propose fence elimination algorithms to remove additional fences generated by
the mapping schemes. We also propose robustness analyses and enforcement techniques based on memory access
sequence analysis for x86 and ARM programs.
Going forward we want to extend these schemes and analyses to other architectures as well. We believe these results
would play a crucial role in a number of translators, decompilers, and state-of-the-art systems. Therefore integrating
these results into such systems is another direction we would like to pursue in the future.
References
C/C++11 mappings to processors. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.
J. Alglave and L. Maranget. herd7 consistency model simulator. http://diy.inria.fr/www/.
J. Alglave, L. Maranget, and M. Tautschnig. Herding cats: modelling, simulation, testing, and data-mining for weak
memory. ACM Trans. Program. Lang. Syst., 36(2):7:1–7:74, 2014. doi: 10.1145/2627752.
J. Alglave, D. Kroening, V. Nimal, and D. Poetzl. Don’t sit on the fence: A static analysis approach to automatic fence
insertion. ACM Trans. Program. Lang. Syst., 39(2):6:1–6:38, 2017.
Android-x86. https://www.android-x86.org/.
Arm. Migrating a software application from armv5 to armv7-a/r application. http://infocenter.arm.com/help/
index.jsp?topic=/com.arm.doc.dai0425/chapter1intendreader.html.
avast. A retargetable machine-code decompiler based on llvm. https://github.com/avast/retdec.
A. Barbalace, R. Lyerly, C. Jelesnianski, A. Carno, H. Chuang, V. Legout, and B. Ravindran. Breaking the boundaries
in heterogeneous-isa datacenters. In ASPLOS 2017, pages 645–659, 2017. doi: 10.1145/3037697.3037738.
A. Barbalace, M. L. Karaoui, W. Wang, T. Xing, P. Olivier, and B. Ravindran. Edge computing: the case for
heterogeneous-isa container migration. In VEE’20, pages 73–87, 2020. doi: 10.1145/3381052.3381321.
L. Bits. Framework for lifting x86, amd64, and aarch64 program binaries to llvm bitcode. https://github.com/
lifting-bits/mcsema.
A. Bouajjani, E. Derevenetc, and R. Meyer. Checking and enforcing robustness against TSO. In ESOP 2013, pages
533–553, 2013. doi: 10.1007/978-3-642-37036-6_29.
A. Bougacha. Binary translator to llvm ir. https://github.com/repzret/dagger.
A. Chernoff, M. Herdeg, R. Hookway, C. Reeve, N. Rubin, T. Tye, S. Bharadwaj Yadavalli, and J. Yates. Fx32 a
profile-directed binary translator. IEEE Micro, 18(2):56–64, 1998.
E. G. Cota, P. Bonzini, A. Bennée, and L. P. Carloni. Cross-isa machine emulation for multicores. In CGO’2017, pages
210–220. IEEE Press, 2017.
E. Derevenetc and R. Meyer. Robustness against power is pspace-complete. In ICALP’14, volume 8573 of LNCS,
pages 158–170, 2014. doi: 10.1007/978-3-662-43951-7_14.
J. Ding, P. Chang, W. Hsu, and Y. Chung. PQEMU: A parallel system emulator based on QEMU. In ICPADS’11,
pages 276–283, 2011. doi: 10.1109/ICPADS.2011.102.
M. Docs. How x86 emulation works on arm. https://docs.microsoft.com/en-us/windows/uwp/porting/
apps-on-arm-x86-emulation.
R. Elhorst. Lowering C11 atomics for ARM in LLVM. In European LLVM Conference, 2014.
D.-Y. Hong, C.-C. Hsu, P.-C. Yew, J.-J. Wu, W.-C. Hsu, P. Liu, C.-M. Wang, and Y.-C. Chung. Hqemu: A
multi-threaded and retargetable dynamic binary translator on multicores. In CGO’12, pages 104–113, 2012. doi:
10.1145/2259016.2259030.
ISO/IEC 14882. Programming language C++, 2011.
ISO/IEC 9899. Programming language C, 2011.
O. Lahav and R. Margalit. Robustness against release/acquire semantics. In PLDI 2019, pages 126–141, 2019. doi:
10.1145/3314221.3314604.
O. Lahav and V. Vafeiadis. Explaining relaxed memory models with program transformations. In FM’16, pages
479–495, 2016. doi: 10.1007/978-3-319-48989-6_29.
O. Lahav, V. Vafeiadis, J. Kang, C.-K. Hur, and D. Dreyer. Repairing sequential consistency in C/C++11. In PLDI
2017, pages 618–632, 2017. doi: 10.1145/3062341.3062352. Technical appendix available at
https://plv.mpi-sws.org/scfix/full.pdf.
J. Lee and D. A. Padua. Hiding relaxed memory consistency with a compiler. IEEE Transactions on Computers,
50(8):824–833, 2001.
D. Lustig, C. Trippel, M. Pellauer, and M. Martonosi. Armor: Defending against memory consistency model
mismatches in heterogeneous architectures. In ISCA’15, pages 388–400, 2015. doi: 10.1145/2749469.2750378.
R. Morisset and F. Z. Nardelli. Partially redundant fence elimination for x86, arm, and power processors. In CC’17,
pages 1–10, 2017.
B. Norris and B. Demsky. CDSChecker: Checking concurrent data structures written with C/C++ atomics. In
OOPSLA’13, 2013.
notaz. Starcraft. http://repo.openpandora.org/, 2014.
C. Pulte, S. Flur, W. Deacon, J. French, S. Sarkar, and P. Sewell. Simplifying ARM concurrency: multicopy-atomic
axiomatic and operational models for ARMv8. PACMPL, 2(POPL):19:1–19:29, 2018. doi: 10.1145/3158107.
QEMU. the fast! processor emulator. https://www.qemu.org/.
D. E. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans.
Program. Lang. Syst., 10(2):282–312, 1988. doi: 10.1145/42190.42277.
B.-Y. Shen, J.-Y. Chen, W.-C. Hsu, and W. Yang. Llbt: An llvm-based static binary translator. In CASES 2012, pages
51–60, 2012. doi: 10.1145/2380403.2380419.
V. Vafeiadis and F. Zappa Nardelli. Verifying fence elimination optimisations. In SAS’11, volume 6887 of LNCS,
pages 146–162. Springer, 2011. doi: 10.1007/978-3-642-23702-7_14.
Z. Wang, R. Liu, Y. Chen, X. Wu, H. Chen, W. Zhang, and B. Zang. COREMU: a scalable and portable parallel
full-system emulator. In C. Cascaval and P. Yew, editors, PPOPP’11, pages 213–222, 2011.
doi: 10.1145/1941553.1941583.
J. Wickerson, M. Batty, T. Sorensen, and G. A. Constantinides. Automatically comparing memory consistency models.
In POPL’17, pages 190–204. ACM, 2017. doi: 10.1145/3009837.3009838.
S. B. Yadavalli and A. Smith. Raising binaries to llvm ir with mctoll (wip paper). In LCTES 2019, pages 213–218,
2019. doi: 10.1145/3316482.3326354.
A Proofs of Mapping Schemes
A.1 x86 to ARMv8 Mappings
We first restate Theorem 1.
Theorem 1. The mappings in Fig. 9a are correct.
To prove Theorem 1, we prove the following formal statement.
Px86  PARMv8 =⇒ ∀Xt ∈ [[PARMv8]]. ∃Xs ∈ [[Px86]]. Behavior(Xt) = Behavior(Xs)
Given an ARM execution Xt we define the corresponding x86 execution Xs where
1. [Xt.St ∪ Xt.F];Xt.ob; [Xt.St ∪ Xt.F] =⇒ Xs.mo
2. [Xt.St ∪ Xt.F];Xt.po; [Xt.St ∪ Xt.F] =⇒ Xs.mo
3. [Xt.F];Xt.po;Xt.fr =⇒ Xs.mo
4. Xt.co =⇒ Xs.mo|loc
We know that Xt is ARMv8 consistent. Now we show that Xs is x86 consistent.
Proof. We prove by contradiction.
(irrHB)
Assume Xs has an Xs.xhb cycle.
It implies a (Xs.po ∪ Xs.rfe)+ cycle.
Considering the possible cases of Xs.po edges on the cycle:
Case [Xs.Ld];Xs.po; [Xs.W]:
=⇒ [Xt.Ld];Xt.po; [Xt.FLD];Xt.po; [Xt.W].
=⇒ [Xt.Ld];Xt.bob; [Xs.W]
=⇒ [Xt.Ld];Xt.ob; [Xs.W]
Case [Xs.U];Xs.po; [Xs.W]:
=⇒ [Xt.Ld];Xt.rmw;Xt.po; [Xt.F];Xt.po;Xt.rmw; [Xt.St]
=⇒ [Xt.Ld];Xt.aob;Xt.bob;Xt.aob; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Thus in both cases Xs.xhb =⇒ (Xt.ob∪Xt.rfe)+ ⊆ Xt.ob. However, Xt is ARM consistent and Xt.ob is irreflexive.
Hence a contradiction and Xs.xhb is irreflexive.
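The (irrHB) argument ultimately rests on checking that the transitive closure of a relation is irreflexive. As a minimal sketch, and not part of the formal development, relations can be modelled as sets of event pairs. The event names and the helpers `compose`, `transitive_closure`, and `is_acyclic` are assumptions introduced only for illustration.

```python
def compose(r, s):
    """Relational composition r;s over sets of (a, b) pairs."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def transitive_closure(r):
    """Least transitive relation containing r (written r+ in the proofs)."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def is_acyclic(r):
    """A relation is acyclic iff its transitive closure is irreflexive."""
    return all(a != b for (a, b) in transitive_closure(r))


# Hypothetical message-passing execution:
# thread 1 writes x then y, thread 2 reads y then x.
po = {("Wx", "Wy"), ("Ry", "Rx")}
rfe = {("Wy", "Ry")}               # the read of y observes the write of y
hb_ok = po | rfe                   # no cycle: Wx -> Wy -> Ry -> Rx
hb_bad = hb_ok | {("Rx", "Wx")}    # an extra edge that closes a cycle
```

Here `is_acyclic(hb_ok)` holds, while adding the edge from `Rx` back to `Wx` produces the kind of cycle that the proof rules out.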
(irrMOHB)
Assume Xs has a Xs.mo;Xs.xhb cycle.
However, from definition, [Xs.W ∪ Xs.F];Xs.xhb; [Xs.W ∪ Xs.F]
Considering the po and rfe from xhb:
Case [Xs.W ∪ Xs.F];Xs.po; [Xs.W ∪ Xs.F]:
We know,
[Xs.W ∪ Xs.F];Xs.po; [Xs.W ∪ Xs.F]
Considering the subcases:
Subcase [Xs.St ∪ Xs.F];Xs.po; [Xs.St ∪ Xs.F]:
It implies [Xt.St ∪ Xt.F];Xt.po; [Xt.St ∪ Xt.F].
From definitions, [Xt.St ∪ Xt.F];Xt.po; [Xt.St ∪ Xt.F] =⇒ Xs.mo ∧ ¬Xs.mo−1.
Subcase Otherwise:
Possible scenarios are [Xs.U];Xs.po; [Xs.W ∪ Xs.F] or [Xs.W ∪ Xs.F];Xt.po; [Xt.U].
Now,
[Xs.U];Xs.po; [Xs.W ∪ Xs.F]
=⇒ Xt.rmw;Xt.po; [Xt.F];Xt.po; [Xt.W ∪ Xt.F]
=⇒ Xt.bob
=⇒ Xt.ob
Similarly,
[Xs.W ∪ Xs.F];Xt.po; [Xt.U]
=⇒ Xt.po; [Xt.F];Xt.po; [Xs.W]
=⇒ Xt.bob
=⇒ Xt.ob
From definitions, [Xt.St ∪ Xt.F];Xt.ob; [Xt.St ∪ Xt.F] =⇒ Xs.mo ∧ ¬Xs.mo−1.
Case [Xs.W ∪ Xs.F];Xs.rfe; [Xs.W ∪ Xs.F]:
It implies [Xs.W];Xs.rfe; [Xs.U]
=⇒ ([Xt.Ld];Xt.rmw)?; [Xt.St];Xt.rfe; [Xt.Ld];Xt.rmw; [Xt.St] following the mappings.
=⇒ ([Xt.Ld];Xt.aob)?; [Xt.St];Xt.obs; [Xt.Ld];Xt.aob; [Xt.St]
=⇒ ([Xt.Ld];Xt.aob)?; [Xt.St];Xt.ob; [Xt.St]
From definitions we know that [Xt.St];Xt.ob; [Xt.St] =⇒ Xs.mo ∧ ¬Xs.mo−1.
Therefore Xs.xhb =⇒ Xs.mo and hence Xs.mo;Xs.xhb is acyclic and Xs satisfies (irrMOHB).
(irrFRHB)
Assume Xs has a Xs.fr;Xs.xhb cycle.
We already know that Xs.xhb =⇒ Xt.ob holds.
Considering the cases of Xs.fr:
Case Xs.fre:
In this case Xs.fre =⇒ Xt.fre =⇒ Xt.obs.
In this case there exists a Xt.obs;Xt.ob cycle which violates (external) in Xt.
Hence a contradiction and Xs satisfies (irrFRHB).
Case Xs.fri:
Following the mappings Xs.fri =⇒ Xt.bob.
In this case there exists a Xt.bob;Xt.ob cycle which violates (external) in Xt.
Hence a contradiction and Xs satisfies (irrFRHB).
(irrFRMO)
Assume Xs has a Xs.fr;Xs.mo cycle.
It implies a Xs.fr;Xs.co cycle and in consequence a Xt.fr;Xt.co cycle which violates (internal) in Xt.
Hence a contradiction and Xs satisfies (irrFRMO).
(irrFMRP)
Assume Xs has a Xs.fr;Xs.mo;Xs.rfe;Xs.po cycle.
It implies a Xs.rfe;Xs.po;Xs.fr;Xs.mo cycle.
Now we consider a Xs.rfe;Xs.po;Xs.fr path.
Thus
[Xs.W];Xs.rfe;Xs.po;Xs.fr; [Xs.W]
=⇒ [Xs.W];Xs.rfe;Xs.po; [Xs.R];Xs.fre; [Xs.W]
∪ [Xs.W];Xs.rfe;Xs.po; [Xs.R];Xs.fri; [Xs.W]
=⇒ [Xt.St];Xt.rfe; [Xt.Ld];Xt.po; [Xt.FLD ∪ Xt.F];Xt.po; [Xt.Ld];Xt.fre; [Xt.St]
∪ [Xt.St];Xt.rfe; [Xt.Ld];Xt.po; [Xt.FLD ∪ Xt.F];Xt.po; [Xt.Ld];Xt.fri; [Xt.St]
=⇒ [Xt.St];Xt.obs; [Xt.Ld];Xt.bob; [Xt.Ld];Xt.obs; [Xt.St]
∪ [Xt.St];Xt.obs; [Xt.Ld];Xt.bob; [Xt.Ld];Xt.bob; [Xt.St]
=⇒ [Xt.St];Xt.ob; [Xt.St] ∪ [Xt.St];Xt.ob; [Xt.St]
=⇒ [Xt.St];Xt.ob; [Xt.St]
However, we know [Xt.St];Xt.ob; [Xt.St] =⇒ [Xs.W];Xs.mo; [Xs.W].
Thus [Xs.W];Xs.rfe;Xs.po;Xs.fr; [Xs.W] =⇒ Xs.mo ∧ ¬Xs.mo−1.
Hence a contradiction and thus Xs satisfies (irrFMRP).
(irrUF)
Assume Xs has a Xs.fr;Xs.mo; [Xs.U ∪ Xs.F];Xs.po cycle.
It implies [Xs.U ∪ Xs.F];Xs.po; [Xs.R];Xs.fr; [Xs.W];Xs.mo cycle.
Now, we consider a [Xs.U ∪ Xs.F];Xs.po; [Xs.R];Xs.fr; [Xs.W] path.
Considering possible cases:
Case [Xs.U];Xs.po; [Xs.R];Xs.fr; [Xs.W]:
=⇒ [Xt.Ld];Xt.po; [Xt.F];Xt.po; [Xt.Ld]; (Xt.fre ∪ Xt.fri); [Xt.St]
=⇒ [Xt.Ld];Xt.po; [Xt.F];Xt.po; [Xt.Ld];Xt.fre; [Xt.St]
∪ [Xt.Ld];Xt.po; [Xt.F];Xt.po; [Xt.Ld];Xt.fri; [Xt.St]
=⇒ [Xt.Ld];Xt.bob;Xt.obs; [Xt.St] ∪ [Xt.Ld];Xt.bob; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
=⇒ Xs.mo following the definition.
Case [Xs.F];Xs.po;Xs.fr; [Xs.W]:
=⇒ [Xt.F];Xt.po;Xt.fr; [Xt.St] following the mappings.
=⇒ Xs.mo ∧ ¬Xs.mo−1 following the definition.
Therefore [Xs.U ∪ Xs.F];Xs.po;Xs.fr;Xs.mo does not have a cycle.
Hence a contradiction and Xs satisfies (irrUF).
From definition we know Xt.co =⇒ Xs.mo|loc and therefore Behavior(Xs) = Behavior(Xt) holds.
A.2 Correctness of C11 to x86 to ARMv8 Mapping
We restate the theorem and then prove the same.
Theorem 2. The mapping scheme in Fig. 9b is correct.
Proof. The mapping can be represented as a combination of the following transformation steps.
1. PC11 ↦ Px86 mapping from map.
2. Px86 ↦ PARMv8 mappings from Fig. 9a.
3. Fence strengthening of DMBST; STR to DMBFULL; STR in PARMv8.
4. Elimination of leading DMBFULL and trailing DMBLD fences in the following cases.
(a) DMBFULL; STR to STR where WMOVNA ↦ STR.
(b) LDR; DMBLD to LDR where RMOVNA ↦ LDR.
We know (1), (2), (3) are sound and therefore it suffices to show that transformation (4) is sound.
Let Xa and X′a be the consistent executions of PARMv8 before and after transformation (4). Let X be the corresponding
C11 execution of PC11; we know PC11 is race-free. Therefore, for every non-atomic event a in X, if there exists another
same-location event b then X.hb=(a, b) holds.
Now we consider x86 to ARMv8 mapping scheme in Fig. 9a.
Considering the hb definition following are the possibilities:
Case [ENA];X.po; [WwRLX]:
=⇒ [E];Xa.po; [FLD];Xa.po; [F];Xa.po; [E]
=⇒ [E];X′a.po; [F];X′a.po; [E]
=⇒ [E];X′a.bob; [E]
Case [RwRLX];X.po; [ENA]:
=⇒ [Ld]; (Xa.rmw;Xa.F ∪ Xa.po; [FLD]);Xa.po; [E]
=⇒ [E];X′a.bob; [E]
Hence X′a.bob = Xa.bob and the transformation is sound for the x86 to ARMv8 mapping.
As a result, the mapping scheme in Fig. 9b is sound.
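Steps (3) and (4) of the proof are peephole rewrites over the generated ARMv8 instruction sequence. The following is a rough sketch of how such rewrites could be expressed, under the assumption that each instruction is a pair of an opcode and a tag recording the source access it was mapped from. The tag `"WMOV"` for an atomic store, the use of `None` for fences, and the function names `strengthen` and `eliminate` are hypothetical. The sketch does not check the race-freedom premise on which soundness relies.

```python
def strengthen(prog):
    """Step 3: rewrite DMBST; STR into DMBFULL; STR."""
    out = []
    for i, (op, src) in enumerate(prog):
        nxt = prog[i + 1][0] if i + 1 < len(prog) else None
        if op == "DMBST" and nxt == "STR":
            out.append(("DMBFULL", src))
        else:
            out.append((op, src))
    return out


def eliminate(prog):
    """Step 4: drop the fence that accompanies a non-atomic access."""
    out = []
    i = 0
    while i < len(prog):
        op, src = prog[i]
        nxt = prog[i + 1] if i + 1 < len(prog) else None
        if op == "DMBFULL" and nxt == ("STR", "WMOVNA"):
            i += 1                      # (a) skip the leading fence, keep the store
            continue
        out.append((op, src))
        if (op, src) == ("LDR", "RMOVNA") and nxt is not None and nxt[0] == "DMBLD":
            i += 1                      # (b) skip the trailing fence
        i += 1
    return out


# Hypothetical ARMv8 sequence produced by the direct mapping.
prog = [
    ("DMBST", None), ("STR", "WMOVNA"),   # non-atomic store
    ("LDR", "RMOVNA"), ("DMBLD", None),   # non-atomic load
    ("DMBST", None), ("STR", "WMOV"),     # atomic store (hypothetical tag)
]
optimized = eliminate(strengthen(prog))
```

In this example only the fence in front of the atomic store survives, now strengthened to `DMBFULL`; both fences attached to the non-atomic accesses are removed.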
A.3 ARMv8 to x86 Mappings
We restate Lemma 1.
Lemma 1. Suppose X is an x86 consistent execution. In that case X.po|loc;X.fr =⇒ X.fr ∪ X.co.
Proof. We consider two cases in X:
Case [X.Ld];X.po|loc;X.fr; [X.W]:
Let (r, e) ∈ [X.Ld];X.po|loc; [X.R], (e, w′) ∈ [X.R];X.fr; [X.W] holds.
Also consider X.rf(we, e) and X.rf(w, r) holds.
We show by contradiction that X.co(w,w′) and in consequence X.fr(r, w′) holds.
Assume X.co(we, w) holds. Therefore X.fr(e, w) holds. However, from definition, X.xhb(w, e) holds. It is not
possible in a x86 consistent execution as it violates irreflexive(X.fr;X.xhb) condition. Hence a contradiction and
X.co(w,we) holds.
We also know that X.co(we, w′) holds as from definition X.rf(we, e) ∧ X.fr(e, w′).
As a result X.co(w,w′) holds.
Therefore X.fr(r, w′) holds.
Thus [X.Ld];X.po|loc;X.fr; [X.W] =⇒ X.fr.
Case [X.W];X.po|loc;X.fr; [X.W]:
Let (w,w′) ∈ [X.W];X.po|loc;X.fr; [X.W] and
(w, r) ∈ [X.W];X.po|loc; [X.R] ∧ (r, w′) ∈ [X.R];X.fr; [X.W] holds.
Two subcases:
Subcase X.rf(w, r):
In this case X.co(w,w′) holds by definition.
Subcase X.rfe(wr, r):
In this case w ≠ wr.
We show X.co(w,wr) holds by contradiction.
Assume X.co(wr, w) holds. In that case X.fr(r, w) holds. This violates irreflexive(X.fr;X.xhb) constraint and hence a
contradiction.
Therefore, X.co(w,wr) holds and in consequence co(w,w′) holds.
Thus [X.W];X.po|loc;X.fr; [X.W] =⇒ X.co.
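Lemma 1 can be sanity-checked on a concrete execution. The sketch below is an illustration only: it derives fr as the composition of the inverse of rf with co (the standard from-read derivation) and checks the inclusion on a hypothetical single-location execution. The event names and the helper `from_reads` are assumptions.

```python
def compose(r, s):
    """Relational composition r;s over sets of (a, b) pairs."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def from_reads(rf, co):
    """fr = rf^-1 ; co: a read is ordered before every co-successor of its source write."""
    rf_inv = {(r, w) for (w, r) in rf}
    return compose(rf_inv, co)


# Hypothetical execution on one location x.
# Thread 1: W1; R1; R2      Thread 2: W2
po_loc = {("W1", "R1"), ("R1", "R2"), ("W1", "R2")}
rf = {("W1", "R1"), ("W1", "R2")}   # both reads observe W1
co = {("W1", "W2")}                 # W2 is coherence-after W1

fr = from_reads(rf, co)
lhs = compose(po_loc, fr)           # po|loc ; fr
```

The pair `("R1", "W2")` in `lhs` corresponds to the load case of the proof and lands in `fr`; the pair `("W1", "W2")` corresponds to the write case and lands in `co`.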
We restate Lemma 2.
Lemma 2. Suppose X = 〈E, po, rf,mo〉 is an x86 consistent execution. For each (X.po|loc ∪ X.fr ∪ X.co ∪ X.rf)+
path between two events there exists an alternative (X.xhb ∪ X.fr ∪ X.co)+ path between these two events which has
no intermediate load event.
Proof. Consider a load event r on (X.po|loc ∪ X.fr ∪ X.co ∪ X.rf)+ path. Considering the path, the possible incoming
edges to r are X.rf, X.po|loc, and the outgoing edges are X.fr, X.po|loc.
Let a and b be the source and destination of the incoming and outgoing edges on the path.
Possible cases:
Case X.rf(a, r) ∧ X.fr(r, b):
From definition X.co(a, b) holds.
Case X.rf(a, r) ∧ X.po|loc(r, b):
From definition, X.xhb(a, b).
Case X.po|loc(a, r) ∧ X.fr(r, b):
From Lemma 1, X.fr(a, b) ∨ X.co(a, b) holds.
Case X.po|loc(a, r) ∧ X.po|loc(r, b):
From definition, X.xhb(a, b) holds.
We restate Lemma 3.
Lemma 3. Suppose X = 〈E, po, rf,mo〉 is an x86 consistent execution. For each obx path between two events there
exists an alternative obx path which has no intermediate load event.
Proof. Consider a load event r on X.obx path. Considering the path, the possible incoming edges to r are X.rf, X.xppo,
and the outgoing edges are X.fr, X.xppo.
Let a and b be the source and destination of the incoming and outgoing edges on the path.
Possible cases:
Case X.rf(a, r) ∧ X.fr(r, b):
From definition X.mo(a, b) holds.
Case X.rf(a, r) ∧ X.xppo(r, b):
From definition X.xhb(a, b) holds as xppo ⊆ po.
Case X.po(a, r) ∧ X.po(r, b):
From definition X.xhb(a, b) holds as xppo ⊆ po.
Case X.xppo(a, r) ∧ X.fr(r, b):
Considering the subcases of a:
Subcase a ∈ (W ∪ F):
We show X.mo(a, b) holds.
In this case following the definition of xppo we know (a, r) ∈ [St]; po; [F]; po; [Ld]. Let c ∈ F such that X.po(a, c) ∧
X.po(c, r) holds.
We show X.mo(c, b) holds by contradiction.
Assume X.mo(b, c) holds.
In this case X.fr(r, b) ∧ X.mo(b, c) ∧ c ∈ F ∧ X.po(c, r) creates a cycle. Hence a contradiction as X is x86 consistent.
Therefore X.mo(c, b) holds.
We also know that X.mo(a, c) holds as X.mo(c, a) would lead to a X.mo;X.xhb cycle which is a contradiction.
As a result, X.mo(a, c) ∧ X.mo(c, b) implies that X.mo(a, b) holds.
Subcase a ∈ Ld:
Let X.rf(w, a). We consider two scenarios based on whether there is an intermediate fence:
Subsubcase (a, r) ∈ [Ld];X.po; [F];X.po; [Ld]:
Let c ∈ F be the intermediate fence event.
It implies (a, c) ∈ X.xppo following s6 and X.mo(c, b) holds. (see earlier subcase)
Thus there is a X.obx path from a to b without passing through r.
Subsubcase Otherwise:
In this case (a, r) ∈ [Ld];X.po; [Ld] ∧ ∄e. X.po(a, e) ∧ X.po(e, r).
Let c be the event such that (c, r) ∈ X.po ∩ X.obx and there is no such c′ in between c and r.
The scenarios are as follows:
• c ∈ U ∪ F.
In this case X.mo(c, b) holds as otherwise X.mo(b, c) creates a X.fr;X.mo; [U ∪ F];X.po cycle which results
in a contradiction.
Thus X.obx path between the same events does not pass through r.
• c ∈ St.
Following the definition of xppo, there is an intermediate fence event d ∈ F such that X.po(c, d)∧X.po(d, r)
holds. In this case X.mo(d, b) holds and also X.mo(c, d) holds. Hence X.mo(c, b) also holds.
Thus X.obx path between the same events does not pass through r.
• c ∈ Ld.
Let w ∈ W be the event on the X.obx path and X.rfe(w, c) holds.
In this case we show by contradiction that X.mo(w, b) holds.
Assume X.mo(b, w) holds.
In that case X.fr(r, b)∧X.mo(b, w)∧X.rfe(w, c)∧X.po(c, r) creates a cycle which violates x86 consistency
for X. Hence a contradiction and X.mo(w, b) holds.
Thus X.obx path between the same events does not pass through r.
We restate the theorem.
Theorem 3. The mapping scheme in Fig. 12b is correct.
To prove Theorem 3, we prove the following formal statement.
PARMv8  Px86 =⇒ ∀Xt ∈ [[Px86]]. ∃Xs ∈ [[PARMv8]]. Behavior(Xt) = Behavior(Xs)
Proof. Given an x86 execution Xt we define the corresponding ARM execution Xs.
We know that Xt is x86 consistent. Now we show that Xs is ARM consistent. We prove by contradiction.
(internal)
Assume Xs contains Xs.po|loc ∪ Xs.ca ∪ Xs.rf cycle.
It implies a Xt.po|loc ∪ Xt.ca ∪ Xt.rf cycle following the mappings.
In that case we can derive a (Xt.xhb ∪ Xt.fr ∪ Xt.co)+ cycle with no load event in Xt following Lemma 2.
Thus the cycle contains only same-location write events.
In that case Xt.fr =⇒ Xt.co and ([Xt.W];Xt.xhb; [Xt.W])|loc =⇒ Xt.co which implies a Xt.mo cycle as Xt.co ⊆
Xt.mo. However, we know Xt.mo has no cycle and hence a contradiction.
Therefore the source execution Xs in ARMv8 satisfies (internal).
(external)
We prove this by contradiction. Assume Xs contains an ob cycle. In that case Xt contains an obx cycle. Then,
from Lemma 3, we know that there exists an Xt.obx cycle which has no load event. Therefore the cycle contains only
W ∪ F events and thus there is an Xt.mo cycle. However, Xt is x86 consistent and hence there is no Xt.mo cycle. Thus
a contradiction and Xs has no ob cycle. Therefore the source execution Xs in ARMv8 satisfies (external).
(atomic)
We prove this by contradiction. Assume Xs.rmw ∩ (Xs.fre;Xs.coe) ≠ ∅.
In that case there exists u ∈ Xt.U, w ∈ Xt.W in x86 consistent execution Xt such that Xt.fre(u,w), Xt.coe(w, u)
hold.
It implies there is a Xt.fr;Xt.mo cycle as fre ⊆ fr and coe ⊆ mo hold.
However, a Xt.fr;Xt.mo cycle is not possible as Xt is consistent. Hence a contradiction and therefore
Xs.rmw ∩ (Xs.fre;Xs.coe) = ∅.
Thus Xs is ARMv8 consistent as it satisfies (internal), (external), and (atomic) constraints.
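The (atomic) constraint forbids an external write from being coherence-ordered between the read and the write of an update. As an illustrative sketch only, the check can be phrased over relations given as sets of pairs. The event names and the helper `violates_atomic` are assumptions.

```python
def compose(r, s):
    """Relational composition r;s over sets of (a, b) pairs."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def violates_atomic(rmw, fre, coe):
    """(atomic) requires the intersection of rmw and fre;coe to be empty."""
    return bool(rmw & compose(fre, coe))


# Hypothetical update on x made of a read Ru and a write Wu,
# together with an external write W2 to x from another thread.
rmw = {("Ru", "Wu")}

# Bad: W2 overwrites the value read by Ru and is coherence-before Wu,
# so W2 sits between the two halves of the update.
fre_bad = {("Ru", "W2")}
coe_bad = {("W2", "Wu")}

# Fine: W2 is coherence-after Wu.
fre_ok = {("Ru", "W2")}
coe_ok = {("Wu", "W2")}
```

Only the first configuration makes `violates_atomic` return true, which is the situation the proof shows to be impossible for an execution derived from a consistent x86 execution.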
A.4 ARMv7-mca to ARMv8 Mappings
In Appendix A.5 we have already shown all the relevant consistency constraints. It remains to show that (mca) holds
for ARMv7-mca to ARMv8 mappings.
We restate Lemma 7 and then prove the same.
Lemma 7. Suppose Xt is an ARMv8 consistent execution and Xs is corresponding ARMv7 consistent execution. In
that case [Xs.Ld];Xs.ppo; [Xs.Ld];Xs.po|loc; [Xs.St] =⇒ [Xt.Ld];Xt.ob; [Xt.St]
Proof. We start with
[Xs.Ld];Xs.ppo; [Xs.Ld];Xs.po|loc; [Xs.St]
Considering the final incoming edge to [Xs.Ld], we consider following cases:
Case [Xs.Ld];Xs.ppo?; [Xs.E];Xs.addr; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.E];Xs.addr;Xs.po; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?; [Xt.E];Xt.dob; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.E];Xs.rdw; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.E];Xs.coe;Xs.rfe; [Xs.Ld];Xs.po|loc; [Xs.St]
=⇒ [Xs.Ld];Xs.ppo?; [Xs.E];Xs.coe;Xs.coe; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?; [Xt.E];Xt.obs;Xt.obs; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo; [Xs.St];Xs.rfi; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld]; (Xs.ctrl∪Xs.data∪Xs.addr); [Xs.St];Xs.coi; [Xs.St] as Xs satisfies (sc-per-loc).
=⇒ [Xs.Ld];Xs.ppo?; [Xs.Ld]; (Xs.ctrl ∪ Xs.data); [Xs.St];Xs.coi; [Xs.St]
∪ [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.addr;Xs.po; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?; [Xt.Ld];Xt.dob; [Xt.St] ∪ [Xt.Ld];Xt.ob?; [Xt.Ld];Xt.dob; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrlISB; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrl; [Xs.St] as ctrlISB; po ⊆ ctrlISB and ctrlISB ⊆ ctrl.
=⇒ [Xt.Ld];Xt.ob?; [Xt.Ld];Xt.dob; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo; [Xs.St];Xs.detour; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo; [Xs.St];Xs.coe; [Xs.St];Xs.rfe; [Xs.Ld];Xs.po|loc; [Xs.St] from the definition of detour.
=⇒ [Xs.Ld];Xs.ppo; [Xs.St];Xs.coe; [Xs.St];Xs.coe; [Xs.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St];Xt.obs; [Xt.St];Xt.obs; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrl; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrl; [Xs.St] as ctrl; po ⊆ ctrl.
=⇒ [Xt.Ld];Xt.ob?;Xt.dob; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.addr;Xs.po?; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.addr;Xs.po?; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?;Xt.dob; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Now we show that Xs satisfies (mca). We restate Lemma 8 and then prove the same.
Lemma 8. Suppose Xt is a target ARMv8 consistent execution and Xs is corresponding ARMv7 consistent execution.
In this case Xs.wo+ is acyclic.
Proof. Following the definition of Xs.wo:
Xs.wo ≜ ((Xs.rfe;Xs.ppo;Xs.rfe−1) \ [Xs.E]);Xs.co
It implies
Xs.rfe;Xs.ppo; [Xs.Ld];Xs.fri; [Xs.St] ∪ Xs.rfe;Xs.ppo; [Xs.Ld];Xs.fre; [Xs.St]
=⇒ Xs.rfe; [Xs.Ld];Xs.ppo; [Xs.Ld];Xs.po|loc; [Xs.St] ∪ Xs.rfe;Xs.ppo;Xs.fre from definitions.
=⇒ Xt.rfe; [Xt.Ld];Xt.ob; [Xs.St] ∪ Xt.rfe;Xt.ob;Xt.fre from Lemma 7.
=⇒ Xt.obs; [Xt.Ld];Xt.ob; [Xs.St] ∪ Xt.obs;Xt.ob;Xt.obs from Lemma 9.
=⇒ Xt.ob.
Thus Xs.wo+ =⇒ Xt.ob ∪ Xt.ob =⇒ Xt.ob.
We know Xt.ob is acyclic.
Therefore Xs.wo+ is acyclic.
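To make the definition of wo concrete, the following sketch computes it from rfe, ppo, and co on a small hypothetical execution and checks acyclicity. It only illustrates the definition; it does not reproduce the proof. The event names and the helpers `write_order` and `is_acyclic` are assumptions.

```python
def compose(r, s):
    """Relational composition r;s over sets of (a, b) pairs."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def inverse(r):
    return {(b, a) for (a, b) in r}


def write_order(rfe, ppo, co):
    """wo = ((rfe; ppo; rfe^-1) minus the identity [E]); co."""
    through_reads = compose(compose(rfe, ppo), inverse(rfe))
    distinct = {(a, b) for (a, b) in through_reads if a != b}
    return compose(distinct, co)


def is_acyclic(r):
    """Acyclic iff the transitive closure is irreflexive."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            break
        closure |= extra
    return all(a != b for (a, b) in closure)


# Hypothetical execution: one thread reads x then y in preserved program
# order; the reads observe external writes Wx1 and Wy0; Wy1 is co-after Wy0.
rfe = {("Wx1", "Rx"), ("Wy0", "Ry")}
ppo = {("Rx", "Ry")}
co = {("Wy0", "Wy1")}

wo = write_order(rfe, ppo, co)
```

For this execution wo relates `Wx1` to `Wy1` and nothing else, so it is trivially acyclic.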
We restate Theorem 7 and then prove the same.
Theorem 7. The mappings in Fig. 12a are correct for ARMv7-mca.
We formally show
PARMv7-mca  PARMv8 =⇒ ∀Xt ∈ [[PARMv8]]. ∃Xs ∈ [[PARMv7-mca]]. Behavior(Xt) = Behavior(Xs)
Proof. Follows from Theorem 4 and Lemma 8. Moreover, Xs.co ⇐⇒ Xt.co holds. Therefore Behavior(Xt) =
Behavior(Xs) also holds.
A.5 ARMv7 to ARMv8 Mappings
We restate Theorem 4.
Theorem 4. The mappings in Fig. 12a are correct.
We prove the following formal statement.
PARMv7  PARMv8 =⇒ ∀Xt ∈ [[PARMv8]]. ∃Xs ∈ [[PARMv7]]. Behavior(Xt) = Behavior(Xs)
Given an ARMv8 execution Xt we define the corresponding ARMv7 execution Xs such that Xt.po ⇐⇒ Xs.po,
Xt.rf ⇐⇒ Xs.rf, and Xt.co ⇐⇒ Xs.co hold.
We know that Xt is ARMv8 consistent. We will show that Xs is ARMv7 consistent.
First we relate the Xs and Xt relations.
Lemma 9. Suppose Xs is an ARMv7 consistent execution and Xt is corresponding ARMv8 execution. In that case
Xs.fre =⇒ Xt.obs and Xs.rfe =⇒ Xt.obs.
Proof. Follows from definition.
Lemma 10. Suppose Xs is an ARMv7 consistent execution and Xt is corresponding ARMv8 execution. In that case
Xs.fence =⇒ Xt.bob.
Proof. Xs.fence =⇒ Xt.po; [Xs.F];Xt.po =⇒ Xt.bob.
Lemma 11. (ii0 ∪ ci0 ∪ cc0); [St]; rfi =⇒ ob
Proof. We know rfi ⊆ ii0 and ppo does not have cc0; ii0 subsequence following the constraint. Therefore we show
(ii0 ∪ ci0); [St]; rfi =⇒ ob. It implies (addr ∪ data ∪ ctrlISB); [St]; rfi
=⇒ (addr ∪ data); rfi ∪ ctrl; [St] =⇒ dob ∪ dob =⇒ ob
Let dobcc0 = data ∪ ctrl; [St] ∪ addr ∪ addr; po; [St] and ndobcc0 = ctrl; [Ld] ∪ addr; po; [Ld]. Therefore cc0 =
dobcc0 ∪ ndobcc0.
Lemma 12. cc+0 = dobcc0 ∪ ndobcc0
Proof. From definition cc+0 = (dobcc0 ∪ ndobcc0)+
Consider the following cases:
• dobcc0; dobcc0
=⇒ addr; addr =⇒ addr; po; [St] ∪ addr; po; [Ld] =⇒ dobcc0 ∪ ndobcc0
• dobcc0; ndobcc0 =⇒ addr; (ctrl; [Ld] ∪ addr; po; [Ld]) =⇒ addr; po; [Ld] =⇒ ndobcc0
• ndobcc0; dobcc0 =⇒ (ctrl; [Ld] ∪ addr; po; [Ld]); dobcc0; [Ld ∪ St]
=⇒ ctrl; [St] ∪ addr; po; [St] ∪ ctrl; [Ld] ∪ addr; po; [Ld] =⇒ dobcc0 ∪ ndobcc0
• ndobcc0; ndobcc0 =⇒ (ctrl; [Ld] ∪ addr; po; [Ld]); ndobcc0; [Ld] =⇒ ndobcc0
Therefore cc+0 = dobcc0 ∪ ndobcc0.
Now we restate Lemma 4 and then prove the same.
Lemma 4. Suppose Xs is an ARMv7 consistent execution and Xt is corresponding ARMv8 execution. In that case
Xs.ppo =⇒ Xt.ob.
Proof. From the definition of ppo and Lemma 12:
[Ld];Xt.ppo =⇒ [Ld]; (Xt.ii0 ∪ Xt.ci0 ∪ Xt.dobcc0; ci?0 ∪ Xt.ndobcc0;Xt.ci0)+
=⇒ [Ld]; (Xt.ii0 ∪ Xt.ci0 ∪ Xt.dobcc0;Xt.ci0 ∪ Xt.ndobcc0;Xt.ci0)+
=⇒ [Ld]; (Xs.addr ∪ Xs.data ∪ Xs.rdw ∪ Xs.ob ∪ Xs.ctrlISB ∪ Xs.detour ∪ Xs.dobcc0; ci?0 ∪ Xs.ndobcc0; ci0)+ by
reducing the rfi edges following Lemma 11.
=⇒ [Ld]; (Xs.ob ∪ Xs.ndobcc0; ci0)+ as
• Xs.addr ∪ Xs.data ∪ Xs.ctrlISB ⊆ Xs.dob ⊆ Xs.ob
• Xs.rdw = Xs.fre;Xs.rfe ⊆ Xs.obs;Xs.obs ⊆ Xs.ob
• Xs.detour = Xs.coe;Xs.rfe ⊆ Xs.obs;Xs.obs ⊆ Xs.ob
• Xs.dobcc0 = (Xs.data ∪ Xs.ctrl; [St] ∪ Xs.addr ∪ Xs.addr;Xs.po; [St]) ⊆ Xs.ob
Now, Xs.ndobcc0;Xs.ci0 = (Xs.ctrl; [Ld] ∪ Xs.addr;Xs.po; [Ld]); (Xs.ctrlISB ∪ Xs.detour) from definition.
=⇒ (Xs.ctrl; [Ld];Xs.ctrlISB ∪ Xs.addr;Xs.po; [Ld];Xs.ctrlISB) as dom(Xs.detour) ⊈ Ld.
=⇒ (Xs.dob ∪ Xs.dob) as dom(Xs.detour) ⊈ Ld =⇒ Xs.ob
=⇒ [Ld]; (Xs.ob ∪ Xs.ndobcc0;Xs.ci0)+ =⇒ Xs.ob.
Therefore Xt.ppo =⇒ Xs.ob.
Lemma 13. Suppose Xs is an ARMv7 consistent execution and Xt is corresponding ARMv8 execution. In that case
(i) Xs.ahb =⇒ Xt.ob and (ii) Xs.prop =⇒ Xt.ob
Proof. (i) Xs.ahb =⇒ Xs.ppo ∪ Xs.fence ∪ Xs.rfe =⇒ Xt.ob ∪ Xt.bob ∪ Xt.obs from Lemma 9, Lemma 10, and
Lemma 4.
(ii) We know Xs.prop = Xs.prop1 ∪ Xs.prop2 from definition.
Now,
Xs.prop1
=⇒ [Xt.W];Xt.rfe?;Xt.fence;Xt.ahb∗; [Xt.W]
=⇒ Xt.obs;Xt.bob; (Xt.dob ∪ Xt.bob ∪ Xt.obs); [Xt.W]
=⇒ Xt.ob
Also Xs.prop2
=⇒ ((Xt.co ∪ Xt.fr) \ Xt.po)?;Xt.rfe?; (Xt.fence;Xt.ahb∗)?;Xt.fence;Xt.ahb∗.
=⇒ ((Xt.coi ∪ Xt.coe ∪ Xt.fri ∪ Xt.fre) \ Xt.po)?;Xt.rfe?; (Xt.fence;Xt.ahb∗)?;Xt.fence;Xt.ahb∗
=⇒ ((Xt.coe ∪ Xt.fre) \ Xt.po)?;Xt.rfe?; (Xt.fence;Xt.ahb∗)?;Xt.fence;Xt.ahb∗
=⇒ Xt.obs; (Xt.fence;Xt.ahb∗)?;Xt.fence;Xt.ahb∗
=⇒ Xt.obs; (Xt.bob; (Xt.dob ∪ Xt.bob ∪ Xt.obs)∗)?;Xt.bob; (Xt.dob ∪ Xt.bob ∪ Xt.obs)∗
=⇒ Xt.ob
Hence Xs.prop =⇒ Xt.ob.
Now we prove Theorem 4.
Proof. We show Xs is ARMv7 consistent by contradiction.
(total-mo), (sc-per-loc), (atomicity) hold on Xs as they hold on Xt. It remains to show that (observation) and
(propagation) hold on Xs.
(observation)
Assume there is a Xs.fre;Xs.prop;Xs.ahb∗ cycle.
Considering the relations above,
Xs.fre;Xs.prop;Xs.ahb∗ =⇒ Xt.obs;Xt.ob; (Xt.dob ∪ Xt.bob ∪ Xt.obs)∗ =⇒ Xt.ob.
However, we know that Xt.ob is irreflexive and hence a contradiction.
Therefore, Xs.fre;Xs.prop;Xs.ahb∗ is irreflexive and Xs satisfies (observation).
(propagation)
Assume there is a Xs.co ∪ Xs.prop cycle.
It implies a Xt.co ∪ Xt.ob cycle.
We know Xt.co;Xt.co =⇒ Xt.co and Xt.prop;Xt.prop =⇒ Xt.prop.
Thus a Xt.co ∪ Xt.ob cycle can be reduced to a Xt.co ∪ Xt.ob cycle where Xt.co and Xt.prop take place alternatively.
In this case each of Xt.prop ⊆ (Xt.W × Xt.W)|loc ⊆ Xt.co.
It implies there is a Xt.co cycle which is a contradiction.
Hence Xs.co ∪ Xs.prop is acyclic and Xs satisfies (propagation).
Therefore Xs is ARMv7 consistent.
Moreover, Behavior(Xt) = Behavior(Xs) holds as Xt.co ⇐⇒ Xs.co.
A.6 ARMv8 to ARMv7 Mappings
We restate Lemma 5 and then prove the same.
Lemma 5. Suppose Xt is an ARMv7 consistent execution and Xs is ARMv8 execution following the mappings in
Fig. 13a. In this case Xs.ob =⇒ (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw ∪ Xt.fence)+.
Proof. (1) Xs.obs =⇒ Xs.rfe ∪ Xs.coe ∪ Xs.fre
=⇒ Xt.rfe ∪ Xt.coe ∪ Xt.fre from definition.
(2) We know
Xs.dob ⊆ [Xs.Ld];Xs.po; [Xs.E].
=⇒ [Xt.Ld];Xt.po; [Xt.F];Xt.po; [Xs.E] following the mappings in Fig. 13a.
=⇒ [Xt.Ld];Xt.fence; [Xt.E] from the definition.
(3)
We know aob ≜ rmw ∪ [range(rmw)]; rfi; [A]
Hence Xs.rmw ∪ [range(Xs.rmw)];Xs.rfi; [Xs.A ∪ Xs.Q]
=⇒ Xt.rmw ∪ [range(Xt.rmw)];Xt.rfi; [Xt.Ld];Xt.po; [Xt.F]
(4)
Following the definition of Xs.bob, we consider its components:
• Xs.po; [Xs.F];Xs.po
=⇒ Xt.po; [Xt.F];Xt.po
=⇒ Xt.fence
• [Xs.STLR];Xs.po; [Xs.LDAR]
=⇒ [Xt.F];Xt.po; [Xt.St];Xt.po; [Xt.F];Xt.po; [Xt.Ld]
=⇒ Xt.fence
• [Xs.Ld];Xs.po; [Xs.F];Xs.po
=⇒ Xt.fence
• [Xs.LDAR];Xs.po
=⇒ [Xt.Ld];Xt.po; [Xt.F];Xt.po
=⇒ Xt.fence
• [Xs.St];Xs.po; [Xs.FST];Xs.po; [Xs.St]
=⇒ [Xt.St];Xt.po; [Xt.F];Xt.po; [Xt.St]
=⇒ Xt.fence
• Xs.po; [Xs.STLR]
=⇒ Xt.po; [Xt.F];Xt.po; [Xt.St]
=⇒ Xt.fence
• Xs.po; [Xs.STLR];Xs.coi
=⇒ Xt.po; [Xt.F];Xt.po; [Xt.St];Xt.po
=⇒ Xt.fence
Thus Xs.bob =⇒ Xt.fence .
Therefore Xs.ob =⇒
(Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.fence ∪ Xt.rmw ∪ [range(Xt.rmw)];Xt.rfi; [Xt.Ld];Xt.po; [Xt.F])+
Considering the outgoing edges from the Ld event in [range(Xt.rmw)];Xt.rfi; [Xt.Ld];Xt.po; [Xt.F], we consider two cases:
case [range(Xt.rmw)];Xt.rfi; [Xt.Ld];Xt.po; [Xt.F];Xt.po
=⇒ Xt.fence
case [range(Xt.rmw)];Xt.rfi; [Xt.Ld];Xt.fre
=⇒ [range(Xt.rmw)];Xt.coe by definition of fre.
Therefore Xs.ob =⇒ (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.fence ∪ Xt.rmw)+.
We restate Lemma 6 and then prove the same.
Lemma 6. Suppose Xt is an ARMv7 consistent execution and Xs is ARMv8 execution following the mappings in
Fig. 13a. In this case either Xs.ob =⇒ ((Xt.E× Xt.E)|loc \ [E]) or Xs.ob =⇒ (Xt.co;Xt.prop ∪ Xt.prop)+.
Proof. We know Xs.ob =⇒ (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw ∪ Xt.fence)+ from Lemma 5.
Scenario (1): (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw ∪ Xt.fence)+ has no Xt.fence.
In this case (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+ =⇒ (Xt.E× Xt.E)|loc \ [E ] from the definitions.
Scenario (2): Otherwise
In this case Xs.ob =⇒ ((Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)∗;Xt.fence)+
Now we consider following cases:
(RR) [Xt.Ld]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.Ld];Xt.fence
(RW) [Xt.Ld]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.St];Xt.fence
(WR) [Xt.St]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.Ld];Xt.fence
(WW) [Xt.St]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.St];Xt.fence
Case (RR):
[Xt.Ld]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.Ld];Xt.fence
=⇒ [Xt.Ld]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence
=⇒ [Xt.Ld];Xt.fr; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence as Xt satisfies (sc-per-loc).
=⇒ [Xt.Ld];Xt.fri; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence ∪ [Xt.Ld];Xt.fre; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence
=⇒ [Xt.Ld]; (Xt.rmw ∪ Xt.fence); [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence ∪ Xt.prop2
following the mapping of Fig. 13a and definition of prop2.
=⇒ [Xt.Ld];Xt.rmw; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence
∪ [Xt.Ld];Xt.fence; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence ∪ Xt.prop2
=⇒ [Xt.Ld];Xt.ppo; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence
∪ [Xt.Ld];Xt.fence; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence ∪ Xt.prop2
as Xt.rmw =⇒ Xt.ppo.
=⇒ [Xt.Ld];Xt.ahb;Xt.fence
∪ [Xt.Ld];Xt.fence; [Xt.St];Xt.ahb; [Xt.Ld];Xt.fence ∪ Xt.prop2
from definition of prop2.
=⇒ Xt.prop2 ∪ prop2 ∪ prop2
=⇒ Xt.prop
Case (RW):
[Xt.Ld]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.St];Xt.fence
=⇒ [Xt.Ld];Xt.fr;Xt.fence
=⇒ [Xt.Ld];Xt.fri;Xt.fence ∪ [Xt.Ld];Xt.fre;Xt.fence
=⇒ [Xt.Ld];Xt.fence ∪ [Xt.Ld];Xt.fre;Xt.fence
=⇒ Xt.prop2 ∪ Xt.prop2 from definition of prop2.
=⇒ Xt.prop
Case (WR):
[Xt.St]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)+; [Xt.Ld];Xt.fence
=⇒ [Xt.St]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw)∗; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence
=⇒ [Xt.St];Xt.co; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence as Xt satisfies (sc-per-loc).
=⇒ Xt.co;Xt.prop1 from definition.
=⇒ Xt.co;Xt.prop as prop1 ⊆ prop
=⇒ [Xt.St];Xt.coi; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence ∪ [Xt.St];Xt.coe; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence
=⇒ [Xt.St];Xt.coi; [Xt.St];Xt.rfe; [Xt.Ld];Xt.fence ∪ Xt.prop2 from definitions
Case (WW):
[Xt.St]; (Xt.rfe ∪ Xt.coe ∪ Xt.fre ∪ Xt.rmw); [Xt.St];Xt.fence
=⇒ [Xt.St];Xt.co; [Xt.St];Xt.fence
=⇒ [Xt.St];Xt.coi; [Xt.St];Xt.fence ∪ [Xt.St];Xt.coe; [Xt.St];Xt.fence
=⇒ Xt.fence ∪ [Xt.St];Xt.coe; [Xt.St];Xt.fence
=⇒ Xt.prop2 ∪ Xt.prop2 from definition of prop2.
=⇒ Xt.prop
Thus (in Scenario-II) Xs.ob =⇒ (Xt.co;Xt.prop ∪ Xt.prop)+.
Finally we restate Theorem 5 and then prove the same.
Theorem 5. The mappings in Fig. 13a are correct.
To prove Theorem 5, we prove the following formal statement.
PARMv8  PARMv7 =⇒ ∀Xt ∈ [[PARMv7]]. ∃Xs ∈ [[PARMv8]]. Behavior(Xt) = Behavior(Xs)
Proof. We know that Xt is ARMv7 consistent. Now we show that Xs is ARMv8 consistent. We prove by contradiction.
Case (internal) : We know that (sc-per-loc) holds in Xt. Hence (internal) trivially holds in Xs.
Case (external): Assume there is a Xs.ob cycle.
From Lemma 6 we know that Xs.ob =⇒ ((Xt.E× Xt.E)|loc \ [E ]) ∪ (Xt.co;Xt.prop ∪ Xt.prop)+.
We know both ((Xt.E× Xt.E)|loc\[E ]) is acyclic as Xt satisfies (sc-per-loc) and (Xt.co;Xt.prop∪Xt.prop)+ is acyclic
as Xt satisfies (propagation).
Case (atomic):
We know that (atomic) holds in Xt. Hence (atomic) trivially holds in Xs.
Therefore Xs is consistent. Moreover, as Xs.co ⇐⇒ Xt.co holds, Behavior(Xs) = Behavior(Xt) also holds.
A.7 Proof of correctness: C11 to ARMv8 to ARMv7
We restate the theorem and then prove the correctness.
Theorem 6. The mapping scheme in Fig. 13b is correct.
Proof. The mapping can be represented as a combination of the following transformation steps.
1. PC11 ↦ PARMv8 mapping from map.
2. PARMv8 ↦ PARMv7 mappings from Fig. 13a.
3. Elimination of leading DMB fences for the LDRNA ↦ LDR mapping, that is, rewriting LDRNA ↦ LDR; DMB to LDR.
We know (1), (2) are sound and therefore it suffices to show that transformation (3) is sound.
Let Xa and X′a be the consistent executions of PARMv8 before and after transformation (3). Let X be the corresponding
C11 execution of PC11; we know PC11 is race-free. Therefore, for every non-atomic event a in X, if there exists another
same-location event b then X.hb=(a, b) holds.
Now we consider ARMv8 to ARMv7 mapping scheme.
Considering the hb definition following are the possibilities:
Case [ENA];X.po; [FwREL];X.po; [WRLX] ∪ [ENA];X.po; [WwREL]:
=⇒ [E];Xa.po; [F];Xa.po; [St ∪ rmw]
=⇒ [E];X′a.po; [F];X′a.po; [E]
=⇒ [E];X′a.fence; [E]
Case [RwRLX];X.po; [ENA] ∪ [RRLX];X.po; [FwACQ];X.po; [ENA]:
=⇒ [Ld];Xa.po; [F];Xa.po
=⇒ [E];X′a.fence; [E]
Therefore X′a.fence = Xa.fence and the transformation is sound for the ARMv8 to ARMv7 mapping.
A.8 ARMv7-mca to ARMv8 Mappings
In Appendix A.5 we have already shown all the relevant consistency constraints. It remains to show that (mca) holds
for ARMv7-mca to ARMv8 mappings.
We restate Lemma 7 and then prove the same.
Lemma 7. Suppose Xt is an ARMv8 consistent execution and Xs is corresponding ARMv7 consistent execution. In
that case [Xs.Ld];Xs.ppo; [Xs.Ld];Xs.po|loc; [Xs.St] =⇒ [Xt.Ld];Xt.ob; [Xt.St]
Proof. We start with
[Xs.Ld];Xs.ppo; [Xs.Ld];Xs.po|loc; [Xs.St]
Considering the final incoming edge to [Xs.Ld], we consider following cases:
Case [Xs.Ld];Xs.ppo?; [Xs.E];Xs.addr; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.E];Xs.addr;Xs.po; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?; [Xt.E];Xt.dob; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.E];Xs.rdw; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.E];Xs.coe;Xs.rfe; [Xs.Ld];Xs.po|loc; [Xs.St]
=⇒ [Xs.Ld];Xs.ppo?; [Xs.E];Xs.coe;Xs.coe; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?; [Xt.E];Xt.obs;Xt.obs; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo; [Xs.St];Xs.rfi; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld]; (Xs.ctrl∪Xs.data∪Xs.addr); [Xs.St];Xs.coi; [Xs.St] as Xs satisfies (sc-per-loc).
=⇒ [Xs.Ld];Xs.ppo?; [Xs.Ld]; (Xs.ctrl ∪ Xs.data); [Xs.St];Xs.coi; [Xs.St]
∪ [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.addr;Xs.po; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?; [Xt.Ld];Xt.dob; [Xt.St] ∪ [Xt.Ld];Xt.ob?; [Xt.Ld];Xt.dob; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrlISB; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrl; [Xs.St] as ctrlISB; po ⊆ ctrlISB and ctrlISB ⊆ ctrl.
=⇒ [Xt.Ld];Xt.ob?; [Xt.Ld];Xt.dob; [Xt.St] from Lemma 4.
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo; [Xs.St];Xs.detour; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo; [Xs.St];Xs.coe; [Xs.St];Xs.rfe; [Xs.Ld];Xs.po|loc; [Xs.St] from the definition of detour.
=⇒ [Xs.Ld];Xs.ppo; [Xs.St];Xs.coe; [Xs.St];Xs.coe; [Xs.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St];Xt.obs; [Xt.St];Xt.obs; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrl; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.ctrl; [Xs.St] as ctrl; po ⊆ ctrl.
=⇒ [Xt.Ld];Xt.ob?;Xt.dob; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Case [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.addr;Xs.po?; [Xs.Ld];Xs.po|loc; [Xs.St]:
It implies [Xs.Ld];Xs.ppo?; [Xs.Ld];Xs.addr;Xs.po?; [Xs.St]
=⇒ [Xt.Ld];Xt.ob?;Xt.dob; [Xt.St]
=⇒ [Xt.Ld];Xt.ob; [Xt.St]
Now we show that Xs satisfies (mca). We restate Lemma 8 and then prove the same.
Lemma 8. Suppose Xt is a target ARMv8 consistent execution and Xs is corresponding ARMv7 consistent execution.
In this case Xs.wo+ is acyclic.
Proof. Following the definition of Xs.wo:
Xs.wo ≜ ((Xs.rfe;Xs.ppo;Xs.rfe−1) \ [Xs.E]);Xs.co
It implies
Xs.rfe;Xs.ppo; [Xs.Ld];Xs.fri; [Xs.St] ∪ Xs.rfe;Xs.ppo; [Xs.Ld];Xs.fre; [Xs.St]
=⇒ Xs.rfe; [Xs.Ld];Xs.ppo; [Xs.Ld];Xs.po|loc; [Xs.St] ∪ Xs.rfe;Xs.ppo;Xs.fre from definitions.
=⇒ Xt.rfe; [Xt.Ld];Xt.ob; [Xt.St] ∪ Xt.rfe;Xt.ob;Xt.fre from Lemma 7.
=⇒ Xt.obs; [Xt.Ld];Xt.ob; [Xt.St] ∪ Xt.obs;Xt.ob;Xt.obs from Lemma 9.
=⇒ Xt.ob.
Thus Xs.wo+ =⇒ Xt.ob ∪ Xt.ob =⇒ Xt.ob.
We know Xt.ob is acyclic.
Therefore Xs.wo+ is acyclic.
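The definition of wo used in the proof above can be made concrete with a small relational sketch. The code below is only an illustration and not part of the paper's tooling: relations are plain sets of event pairs, and the event names and the IRIW-shaped execution are hypothetical. It computes wo from rfe, ppo, and co, and checks whether wo+ is acyclic.

```python
# Relations are sets of (source, target) pairs over event names.

def compose(r, s):
    """Sequential composition r;s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def inverse(r):
    return {(b, a) for (a, b) in r}

def transitive_closure(r):
    closure = set(r)
    while True:
        extended = closure | compose(closure, closure)
        if extended == closure:
            return closure
        closure = extended

def is_acyclic(r):
    """True when the transitive closure of r is irreflexive."""
    return all(a != b for (a, b) in transitive_closure(r))

def write_order(rfe, ppo, co):
    """wo = ((rfe;ppo;rfe^-1) minus identity);co, as in the definition above."""
    through_reads = compose(compose(rfe, ppo), inverse(rfe))
    without_identity = {(a, b) for (a, b) in through_reads if a != b}
    return compose(without_identity, co)

# Hypothetical IRIW-shaped execution: a writes x, d writes y, ix/iy are the
# initial writes. Reader b sees a and then c still sees iy; reader e sees d
# and then f still sees ix.
rfe = {("a", "b"), ("iy", "c"), ("d", "e"), ("ix", "f")}
ppo = {("b", "c"), ("e", "f")}
co = {("ix", "a"), ("iy", "d")}

wo = write_order(rfe, ppo, co)
print(sorted(wo))      # [('a', 'd'), ('d', 'a')]
print(is_acyclic(wo))  # False: this execution violates (mca)
```

In this example the two readers disagree on the order of the writes a and d, which shows up as a two-event wo cycle.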
We restate Theorem 7 and then prove the same.
Theorem 7. The mappings in Fig. 12a are correct for ARMv7-mca.
We formally show
PARMv7-mca ⇝ PARMv8 =⇒ ∀Xt ∈ [[PARMv8]]. ∃Xs ∈ [[PARMv7-mca]]. Behavior(Xt) = Behavior(Xs)
Proof. Follows from Theorem 4 and Lemma 8. Moreover, Xs.co ⇐⇒ Xt.co holds. Therefore Behavior(Xt) =
Behavior(Xs) also holds.
B Proofs and counter-examples for Optimizations in ARMv8
B.1 Proofs of Safe reorderings
We prove the following theorem for safe reorderings in Fig. 14.
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Proof. We know Xt is ARMv8 consistent. We define Xs where a · b ⇝ b · a.
Xs.E = Xt.E
Xs.po = (Xt.po \ {(b, a)} ∪ {(a, b)})+
Xs.rf = Xt.rf
Xs.co = Xt.co
We show that Xs is ARMv8 consistent.
(internal)
We know that Xt.po|loc = Xs.po|loc, Xs.rf = Xt.rf, Xs.fr = Xt.fr, Xs.co = Xt.co hold. We also know that Xt satisfies
(internal). Therefore Xs also satisfies (internal).
(external)
We relate the ob relations between memory accesses in Xt and Xs. Let M = St ∪ Ld ∪ L ∪ A.
• St(x)/L(x) · Ld(y) ⇝ Ld(y) · St(x)/L(x). In this case Xs.aob = Xt.aob, Xs.bob ⊆ Xt.bob, and Xs.dob =
Xt.dob hold.
• Ld(x) · Ld(y) ⇝ Ld(y) · Ld(x). In this case Xs.aob = Xt.aob, Xs.bob = Xt.bob, and Xs.dob = Xt.dob hold.
• FST · Ld(y) ⇝ Ld(y) · FST. In this case Xs.aob = Xt.aob, Xs.bob = Xt.bob, and Xs.dob = Xt.dob hold.
• St(x)/Ld(x)/FST · A(y) ⇝ A(y) · St(x)/Ld(x)/FST. In this case Xs.aob = Xt.aob, Xs.bob ⊆ Xt.bob, and
Xs.dob = Xt.dob hold.
• FLD/FST/F · L(y) ⇝ L(y) · FLD/FST/F. In this case Xs.aob = Xt.aob, and Xs.dob = Xt.dob hold. We also
know that [M ];Xs.bob; [Xs.L] = [M ];Xt.bob; [Xt.L] and [L];Xs.bob; [M ] ⊆ [L];Xt.bob; [M ] hold.
• A(x)/FLD/FST · F ⇝ F · A(x)/FLD/FST. In this case Xs.aob = Xt.aob, Xs.bob; [M ] ⊆ Xt.bob; [M ],
[M ];Xs.bob = [M ];Xt.bob, and Xs.dob = Xt.dob hold.
• St/L/A/F · FLD ⇝ FLD · St/L/A/F. In this case Xs.aob = Xt.aob, [M ];Xs.bob; [M ] ⊆ [M ];Xt.bob; [M ],
and Xs.dob = Xt.dob hold.
• FLD/A/F · FST ⇝ FST · FLD/A/F. In this case Xs.aob = Xt.aob, [M ];Xs.bob; [M ] = [M ];Xt.bob; [M ], and
Xs.dob = Xt.dob hold.
Hence [M ];Xs.obi; [M ] ⊆ [M ];Xt.obi; [M ] holds.
We also know that Xs.rf = Xt.rf and Xs.co = Xt.co hold.
We also know that irr(Xt.ob) holds.
Therefore irr(Xs.ob) also holds.
We know that Xt.rmw = Xs.rmw, Xs.rf = Xt.rf, Xs.fr = Xt.fr, Xs.co = Xt.co hold. We also know that Xt satisfies
(atomic). Therefore Xs also satisfies (atomic).
We already know Xs.co = Xt.co and therefore Behavior(Xs) = Behavior(Xt).
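The construction of Xs.po at the start of this proof, Xs.po = (Xt.po \ {(b, a)} ∪ {(a, b)})+, can be sketched as follows. This is an illustrative sketch under simplifying assumptions: a single thread whose program order is a total order, with hypothetical event names c, b, a, d. The helper names are not from the paper.

```python
def transitive_closure(r):
    closure = set(r)
    while True:
        step = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        extended = closure | step
        if extended == closure:
            return closure
        closure = extended

def total_order(seq):
    """Program order of a single thread given its events in sequence."""
    return {(seq[i], seq[j]) for i in range(len(seq)) for j in range(i + 1, len(seq))}

def source_po(po_t, a, b):
    """Xs.po: drop the target edge (b, a), add (a, b), and close transitively."""
    return transitive_closure((set(po_t) - {(b, a)}) | {(a, b)})

# The target thread executes c, b, a, d; the source had a before b.
po_t = total_order(["c", "b", "a", "d"])
po_s = source_po(po_t, "a", "b")
print(po_s == total_order(["c", "a", "b", "d"]))  # True
```

Only the relative order of a and b changes; all other program-order edges are carried over unchanged, which is why rf and co can be reused as they are.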
B.2 Safe eliminations
We prove the following theorem for (RAR), (RAA), and (AAA) safe eliminations in Fig. 14(a).
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Proof. We know Xt is ARMv8 consistent. We define Xs where a · b ⇝ a where
(RAR) a = Ld(X, v′) and b = Ld(X, v) or
(RAA) a = A(X, v′) and b = Ld(X, v) or
(AAA) a = A(X, v′) and b = A(X, v).
Xs.E = Xt.E ∪ {b}
Xs.po = (Xt.po ∪ {(a, b)})+
Xs.rf = Xt.rf ∪ {(w, b) | Xt.rf(w, a)}
Xs.co = Xt.co
Moreover, [{a}];Xs.poimm; [{b}];Xs.dob =⇒ [{a}];Xt.dob.
We show that Xs is ARMv8 consistent.
Assume Xs is not consistent.
(internal)
Assume a Xs.po|loc ∪ Xs.ca ∪ Xs.rf cycle.
It implies a Xt.po|loc ∪ Xt.ca ∪ Xt.rf cycle as [{b}];Xs.fr implies [{a}];Xs.fr, and [{a}];Xt.fr.
Therefore a contradiction and Xs satisfies (internal).
(external)
We know dom(Xs.dob); [{b}] =⇒ dom(Xs.dob); [{a}] = dom(Xt.dob); [{a}] hold.
Moreover, [{b}];Xs.dob =⇒ [{a}];Xt.dob.
Also in case of (AAA), codom([{b}];Xs.bob) = codom([{a}];Xs.bob) \ {b} = codom([{a}];Xt.bob) hold.
Hence Xs.ob ⊆ Xt.ob.
We know irr(Xt.ob) holds.
Therefore a contradiction and Xs satisfies (external).
(atomicity)
From definition Xs.rmw = Xt.rmw, Xs.coe = Xt.coe, and Xs.fre = Xt.fre hold.
Therefore Xs preserves atomicity as Xt preserves atomicity.
Moreover, Behavior(Xs) = Behavior(Xt) holds as Xs.co = Xt.co holds.
B.3 Access strengthening
We prove the following theorem for (R-A) from Fig. 14(a).
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Proof. We know Xt is ARMv8 consistent. We define Xs where a ⇝ b where a = Ld(X, v) and b = A(X, v).
Xs.E = Xt.E ∪ {a} \ {b}
Xs.po = Xt.po ∪ {(e, a) | Xt.po(e, b)} ∪ {(a, e) | Xt.po(b, e)}
Xs.rf = Xt.rf ∪ {(w, a) | Xt.rf(w, b)}
Xs.co = Xt.co
We show that Xs is ARMv8 consistent.
Assume Xs is not consistent.
(internal)
Assume a Xs.po|loc ∪ Xs.ca ∪ Xs.rf cycle.
It implies a Xt.po|loc ∪ Xt.ca ∪ Xt.rf cycle which is a contradiction and hence Xs satisfies (internal).
(external)
We know dom(Xs.ob); [{a}] = dom(Xs.ob); [{b}] and [{a}]; codom(X.po) = [{b}]; codom(Xt.bob).
Hence Xs.ob ⊆ Xt.ob.
We know irr(Xt.ob) holds.
Therefore a contradiction and Xs satisfies (external).
(atomicity)
From definition Xs.rmw = Xt.rmw, Xs.coe = Xt.coe, and Xs.fre = Xt.fre hold.
Therefore Xs preserves atomicity as Xt preserves atomicity.
Moreover, Behavior(Xs) = Behavior(Xt) holds as Xs.co = Xt.co holds.
C Fence Elimination
C.1 Fence Elimination in x86
We restate the theorem on x86 fence elimination.
Theorem 8. An MFENCE in an x86 program thread is non-eliminable if it is the only fence on a program path from a
store to a load in the same thread which access different locations.
An MFENCE elimination is safe when it is not non-eliminable.
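The criterion of Theorem 8 can be illustrated on straight-line code. The sketch below is a simplification under stated assumptions, not the paper's algorithm: a thread is a flat list of instructions (no branches, so there is exactly one program path), only MFENCE counts as a fence, and RMW instructions are not modeled. The `Instr` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Instr:
    kind: str                  # "store", "load", or "mfence"
    loc: Optional[str] = None  # accessed location; None for fences

def is_non_eliminable(thread: List[Instr], k: int) -> bool:
    """True if the MFENCE at index k is the only fence between some earlier
    store and some later load that access different locations."""
    if thread[k].kind != "mfence":
        raise ValueError("index k must point at an MFENCE")
    for i in range(k - 1, -1, -1):
        if thread[i].kind == "mfence":
            break  # stores before this point are separated by another fence
        if thread[i].kind != "store":
            continue
        for j in range(k + 1, len(thread)):
            if thread[j].kind == "mfence":
                break  # loads after this point are separated by another fence
            if thread[j].kind == "load" and thread[j].loc != thread[i].loc:
                return True
    return False

single = [Instr("store", "x"), Instr("mfence"), Instr("load", "y")]
double = [Instr("store", "x"), Instr("mfence"), Instr("mfence"), Instr("load", "y")]
print(is_non_eliminable(single, 1))  # True
print(is_non_eliminable(double, 1))  # False
```

Note that in `double` each fence is eliminable on its own, but not both at once. A pass built on this check would remove one fence and then re-evaluate the remaining ones.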
Proof. We show:
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Given an Xt ∈ [[Ptgt]] we define Xs ∈ [[Psrc]] by introducing the corresponding fence event e such that for all events
w ∈ Xs.W ∪ Xs.F,
• if (w, e) ∈ Xs.mo?;Xs.xhb holds then Xs.mo(w, e).
• Otherwise Xs.mo(e, w).
We know Xt is consistent.
Now we show Xs is consistent.
We prove by contradiction.
(irrHB) Assume Xs has Xs.xhb cycle. We know the incoming and outgoing edges to e are Xs.po edges and therefore
Xt.xhb already has a cycle. However, we know Xt.xhb is irreflexive. Hence a contradiction and Xs.xhb is irreflexive.
(irrMOHB) Assume Xs has Xs.mo;Xs.xhb cycle. We already know that Xt.mo;Xt.xhb is irreflexive. Therefore the
cycle contains e. Two possibilities:
Case e ∈ dom(Xs.xhb) and e ∈ codom(Xs.mo):
Suppose Xs.xhb(e, w) and Xs.mo(w, e). However, from definition we already know Xs.xhb(e, w) =⇒ Xs.mo(e, w)
when w ∈ Xs.W ∪ Xs.F. Hence a contradiction and Xs.mo;Xs.xhb is irreflexive in this case.
Case e ∈ codom(Xs.xhb) and e ∈ dom(Xs.mo):
Suppose Xs.xhb(w, e) and Xs.mo(e, w). However, from definition we already know Xs.xhb(w, e) =⇒ Xs.mo(w, e)
when w ∈ Xs.W ∪ Xs.F. Hence a contradiction and Xs.mo;Xs.xhb is irreflexive in this case.
(irrFRHB) We know Xt does not have a Xt.fr;Xt.xhb cycle. We also know fr ⊆ (W×W) and hence event e ∈ F does
not introduce any new Xs.fr;Xs.xhb cycle. Therefore Xs.fr;Xs.xhb is irreflexive.
(irrFRMO)
We know Xt does not have a Xt.fr;Xt.mo cycle. We also know fr ⊆ (W × W) and hence event e ∈ F does not
introduce any new Xs.fr;Xs.mo cycle. Therefore Xs.fr;Xs.mo is irreflexive.
(irrFMRP)
Assume Xs has a Xs.fr;Xs.mo;Xs.rfe;Xs.po cycle.
In that case the cycle is of the form:
[Xs.R];Xs.fr; [Xs.W];Xs.mo; [Xs.W];Xs.rfe; [Xs.R];Xs.po; [Xs.R].
We know e ∈ F and therefore does not introduce this cycle in Xs.
In that case Xt already has a Xt.fr;Xt.mo;Xt.rfe;Xt.po cycle which is a contradiction.
Hence Xs.fr;Xs.mo;Xs.rfe;Xs.po is irreflexive.
(irrUF)
Assume Xs has a Xs.fr;Xs.mo; [Xs.U ∪ Xs.F];Xs.po cycle.
Two possibilities:
Case Xs.fr;Xs.mo; [Xs.U];Xs.po:
It implies a Xt.fr;Xt.mo; [Xt.U];Xt.po cycle.
However, we know Xt satisfies (irrUF) and hence a contradiction.
Case Xs.fr;Xs.mo; [Xs.F];Xs.po:
It implies a [Xs.R];Xs.fr; [Xs.W];Xs.mo; [Xs.F];Xs.po; [Xs.R] cycle created by the introduced event e ∈ F.
It implies [Xs.R];Xs.fr; [Xs.W];Xs.mo; [{e}];Xs.po; [Xs.R]
From definition, we know [Xs.W];Xs.mo; [{e}] when [Xs.W];Xs.mo?;Xs.xhb; [{e}] holds.
Thus [Xs.R];Xs.fr; [Xs.W];Xs.mo; [{e}];Xs.po; [Xs.R]
=⇒ [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.xhb; [{e}];Xs.po; [Xs.R]
=⇒ [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W]; (Xs.xhb?; [Xs.W];Xs.rfe;Xs.po ∪ Xs.po); [{e}];Xs.po; [Xs.R]
=⇒ [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.xhb?; [Xs.W];Xs.rfe;Xs.po; [{e}];Xs.po; [Xs.R]
∪ [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.po; [{e}];Xs.po; [Xs.R]
=⇒ [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.xhb?; [Xs.W];Xs.rfe;Xs.po; [Xs.R]
∪ [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.po; [{e}];Xs.po; [Xs.R]
Now we consider two subcases:
Subcase [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.xhb?; [Xs.W];Xs.rfe;Xs.po; [Xs.R]:
=⇒ [Xt.R];Xt.fr; [Xt.W];Xt.mo?; [Xt.W];Xt.xhb?; [Xt.W];Xt.rfe;Xt.po; [Xt.R]
=⇒ [Xt.R];Xt.fr;Xt.mo?;Xt.rfe;Xt.po; [Xt.R]
This is a contradiction as Xt satisfies (irrFMRP).
Subcase [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.po; [{e}];Xs.po; [Xs.R]:
Now we consider the [Xs.W];Xs.po; [{e}];Xs.po; [Xs.R] subsequence.
Possible cases:
Subsubcase [Xs.St];Xs.po; [{e}];Xs.po; [Xs.Ld] :
It implies [Xt.St];Xt.po; [Xt.F];Xt.po; [Xt.Ld] from the definition.
=⇒ [Xt.St];Xt.mo; [Xt.F];Xt.po; [Xt.Ld].
In that case there exists a Xt.fr;Xt.mo; [Xt.F];Xt.po cycle.
This is a contradiction as Xt satisfies (irrFMRP).
Subsubcase [Xs.W];Xs.po; [{e}];Xs.po; [Xs.U] :
It implies [Xt.W];Xt.po; [Xt.U].
It implies [Xt.W];Xt.mo; [Xt.U] as Xt satisfies (irrMOHB).
In this case [Xs.R];Xs.fr; [Xs.W];Xs.mo?; [Xs.W];Xs.po; [{e}];Xs.po; [Xs.R]
=⇒ [Xt.R];Xt.fr; [Xt.W];Xt.mo?; [Xt.W];Xt.mo; [Xt.U]
=⇒ [Xt.R];Xt.fr;Xt.mo; [Xt.U]
Hence a contradiction as Xt satisfies (irrFRMO).
Subsubcase [Xs.U];Xs.po; [{e}];Xs.po; [Xs.Ld]:
It implies [Xt.U];Xt.po; [Xt.Ld] and in consequence a
[Xt.Ld];Xt.fr; [Xt.W];Xt.mo?; [Xt.U];Xt.po; [Xt.Ld] cycle.
Now, [Xt.Ld];Xt.fr; [Xt.W];Xt.mo?; [Xt.U];Xt.po; [Xt.Ld]
=⇒ [Xt.Ld];Xt.fr; [Xs.U];Xt.po; [Xt.Ld] ∪ [Xt.Ld];Xt.fr;Xt.mo; [Xt.U];Xt.po; [Xt.Ld]
=⇒ [Xt.Ld];Xt.fr;Xt.xhb; [Xt.Ld] ∪ [Xt.Ld];Xt.fr;Xt.mo; [Xt.U];Xt.po; [Xt.Ld]
Hence a contradiction as Xt satisfies (irrFRHB) and (irrUF).
Behavior(Xs) = Behavior(Xt) holds as Xs.mo|loc = Xt.mo|loc.
C.2 Fence Elimination in ARMv8
Observation. Let P be an ARMv8 program generated from an x86 program following the mappings in Fig. 9a. In this
case, for every consistent execution X ∈ [[P]] the following hold:
1. A non-RMW load event is immediately followed by a FLD event.
2. A non-RMW store event is immediately preceded by a FST event.
3. An RMW is immediately preceded by a F event.
4. An RMW is immediately followed by a F event.
We restate Theorem 9.
Theorem 9. Suppose an ARMv8 program is generated by x86 ↦ ARMv8 mapping (Fig. 9a). A DMBFULL in a thread
of the program is non-eliminable if it is the only fence on a program path from a store to a load in the same thread
which access different locations.
A DMBFULL elimination is safe when it is not non-eliminable.
To prove Theorem 9, we show:
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Proof. Given a target execution Xt ∈ [[Ptgt]] we define a source execution Xs ∈ [[Psrc]] by introducing the corresponding
fence event e ∈ F.
We know target execution Xt satisfies (internal) and (atomic). From definition, source execution Xs also supports
(internal) and (atomic) as the respective relations remain unchanged.
We now prove that Xs satisfies (external).
We prove by contradiction.
Assume Xs violates (external).
From definition we know that Xs.obs = Xt.obs, Xs.dob = Xt.dob, Xs.aob = Xt.aob.
In that case there exist events (a, b) ∈ Xs.bob but (a, b) /∈ Xt.bob.
Considering possible cases of a and b:
Case (a, b) ∈ [Xs.Ld]× [Xs.E]: Two subcases:
Subcase a /∈ dom(Xs.rmw):
It implies (a, b) ∈ [Xt.Ld];Xt.po; [Xt.FLD];Xt.po; [Xt.E] from Observation (1) in Appendix C.2.
=⇒ (a, b) ∈ [Xt.Ld];Xt.bob; [Xt.E]
Hence a contradiction and Xs satisfies (external).
Subcase a ∈ dom(Xs.rmw):
It implies (a, b) ∈ [Xt.Ld];Xt.po; [Xt.F];Xt.po; [Xt.E]
=⇒ (a, b) ∈ [Xt.Ld];Xt.bob; [Xt.E]
Hence a contradiction and Xs satisfies (external).
Case (a, b) ∈ [Xs.St]× [Xs.St]:
=⇒ [Xt.St];Xt.po; [Xt.FST];Xt.po; [Xt.St] from Observation (2) in Appendix C.2.
=⇒ [Xt.St];Xt.bob; [Xs.St]
This is a contradiction and hence Xs satisfies (external).
Case (a, b) ∈ [Xs.St]× [Xs.Ld]:
It implies (a, b) ∈ [Xt.St];Xt.po; [Xt.F];Xt.po; [Xt.Ld] from the condition in ??.
=⇒ [Xt.St];Xt.bob; [Xs.Ld]
This is a contradiction and hence Xs satisfies (external).
As a result, Xs also satisfies (external) and is ARMv8 consistent.
Moreover, we know that Xs.co = Xt.co. Hence Behavior(Xs) = Behavior(Xt).
We restate Theorem 11.
Theorem 11. A DMBST in a program thread is non-eliminable if it is placed on a program path between a pair of
stores in the same thread which access different locations and there exists no other DMBFULL or DMBST fence on the
same path.
A DMBST elimination is safe when it is not non-eliminable.
To prove Theorem 11, we show:
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Proof. Given a target execution Xt ∈ [[Ptgt]] we define a source execution Xs ∈ [[Psrc]] by introducing the corresponding
fence event e ∈ F.
We know target execution Xt satisfies (internal) and (atomic). From definition, source execution Xs also supports
(internal) and (atomic) as the respective relations remain unchanged.
We now prove that Xs satisfies (external) by showing Xt.ob = Xs.ob.
From definition we know that Xt.obs = Xs.obs, Xt.dob = Xs.dob, Xt.aob = Xs.aob.
In that case there exist events (a, b) ∈ Xs.bob but (a, b) /∈ Xt.bob.
Considering possible cases of a and b:
Case (a, b) ∈ [Xs.Ld]× [Xs.E]:
It implies (a, b) ∈ [Xt.Ld];Xt.po; [Xt.FLD ∪ Xt.F];Xt.po; [Xt.E] from Observation (1) and (4) in Appendix C.2.
=⇒ (a, b) ∈ [Xt.Ld];Xt.bob; [Xt.E]
Hence a contradiction and Xs satisfies (external).
Case (a, b) ∈ [Xs.St]× [Xs.St]:
=⇒ [Xt.St];Xt.po; [Xt.FST ∪ Xt.F];Xt.po; [Xt.St] from the condition in Theorem 9.
=⇒ [Xt.St];Xt.bob; [Xs.St]
This is a contradiction and hence Xs satisfies (external).
Case (a, b) ∈ [Xs.St]× [Xs.Ld]:
It implies (a, b) ∈ [Xs.St];Xs.po; [Xs.F];Xs.po; [Xs.Ld] as Xs.bob(a, b) holds.
It implies (a, b) ∈ [Xt.St];Xt.po; [Xt.F];Xt.po; [Xt.Ld]
=⇒ [Xt.St];Xt.bob; [Xs.Ld]
This is a contradiction and hence Xs satisfies (external).
We restate Theorem 13.
Theorem 13. A DMB in a program thread is non-eliminable if it is the only fence on a program path between a pair of
memory accesses in the same thread.
A DMB elimination is safe when it is not non-eliminable.
To prove Theorem 13, we show:
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Proof. From the mapping scheme and the constraint in Theorem 13, in all cases there is a pair of F fences between
the access pairs and therefore one of the fences is eliminable.
C.3 Fence Weakening in ARMv8
We restate Theorem 10.
Theorem 10. A DMBFULL in a program thread is non-eliminable if it is the only fence on a program path from a store
to a load in the same thread which access different locations.
For such a fence DMBFULL ⇝ DMBST; DMBLD is safe.
To prove Theorem 10, we show:
Psrc ⇝ Ptgt =⇒ ∀Xt ∈ [[Ptgt]]. ∃Xs ∈ [[Psrc]]. Behavior(Xt) = Behavior(Xs)
Proof. Given a target execution Xt ∈ [[Ptgt]] we define a source execution Xs ∈ [[Psrc]].
From definition we know that Xt.obs = Xs.obs, Xt.dob = Xs.dob, Xt.aob = Xs.aob.
We know target execution Xt satisfies (internal) and (atomic). From definition, source execution Xs also supports
(internal) and (atomic) as the respective relations remain unchanged.
We now prove that Xs satisfies (external).
We consider following possibilities:
Case (a, b) ∈ [Ld]× [E]:
In this case (a, b) ∈ [Xs.Ld];Xs.po; [Xs.F];Xs.po; [Xs.E]
and (a, b) ∈ [Xt.Ld];Xt.po; [Xt.FLD];Xt.po; [Xt.E].
It implies both Xs.bob(a, b) and Xt.bob(a, b) hold.
Case (a, b) ∈ [St]× [St]:
In this case (a, b) ∈ [Xs.St];Xs.po; [Xs.F];Xs.po; [Xs.St]
and (a, b) ∈ [Xt.St];Xt.po; [Xt.FST];Xt.po; [Xt.St].
It implies both Xs.bob(a, b) and Xt.bob(a, b) hold.
We know that Xt.ob is acyclic and hence Xs.ob is also acyclic.
As a result, Xs also satisfies (external) and is ARMv8 consistent.
Moreover, we know that Xs.co = Xt.co. Hence Behavior(Xs) = Behavior(Xt).
D Proofs and Algorithms of Robustness Analysis
D.1 SC robust against x86
Theorem 14. A program P is M -robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds where R
is defined as condition (M -K) in Fig. 17.
In this case R = [R]; po ∪ po; [W] ∪ po|loc ∪ po; [F]; po.
Proof. Both x86A and SC satisfy atomicity.
It remains to show that (X.po ∪ X.rf ∪ X.fr ∪ X.co) is acyclic by contradiction.
Assume (X.po ∪ X.rf ∪ X.fr ∪ X.co) creates a cycle.
It implies (X.po;X.eco)+ creates a cycle.
It implies (([R]; po ∪ po; [W] ∪ po|loc ∪ fence);X.eco)+ has a cycle.
Considering incoming and outgoing eco edges to po|loc:
• X.rfe; [Ld];X.po|loc; [Ld];X.fre =⇒ X.co
• [W];X.po|loc; [Ld];X.fre =⇒ X.co
• X.rfe; [Ld];X.po|loc; [W] =⇒ X.co
It implies (po \ WR ∪ fence ∪ rfe ∪ coe ∪ fre) has a cycle.
It implies (po \ WR ∪ fence ∪ rfe ∪ co ∪ fr) has a cycle as coi ∪ fri ⊆ po \ WR.
However, we know (po \ WR ∪ fence ∪ rfe ∪ co ∪ fr) is acyclic and therefore a contradiction.
Hence X satisfies acy(X.po ∪ X.rf ∪ X.fr ∪ X.co).
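The condition R = [R]; po ∪ po; [W] ∪ po|loc ∪ po; [F]; po of this subsection can be computed directly from an execution's events and program order. The sketch below is illustrative only. It assumes each event has a kind in {"R", "W", "F"} (RMWs are not modeled), that po is given as a transitive set of pairs, and that the caller supplies epo, whose definition is given elsewhere in the paper. The function names and the store-buffering example are hypothetical.

```python
def sc_x86_condition(kind, loc, po):
    """Pairs of po that lie in [R];po, po;[W], po|loc, or po;[F];po."""
    fences = [e for e in kind if kind[e] == "F"]
    allowed = set()
    for (a, b) in po:
        same_loc = loc.get(a) is not None and loc.get(a) == loc.get(b)
        fenced = any((a, f) in po and (f, b) in po for f in fences)
        if kind[a] == "R" or kind[b] == "W" or same_loc or fenced:
            allowed.add((a, b))
    return allowed

def satisfies_condition(epo, allowed):
    """The robustness condition of Theorem 14: epo is contained in R."""
    return epo <= allowed

# Store buffering: T1 = W x; R y and T2 = W y; R x, without fences.
kind = {"a": "W", "b": "R", "c": "W", "d": "R"}
loc = {"a": "x", "b": "y", "c": "y", "d": "x"}
po = {("a", "b"), ("c", "d")}
epo = {("a", "b"), ("c", "d")}
print(satisfies_condition(epo, sc_x86_condition(kind, loc, po)))  # False

# The same program with a fence between each store and load.
kind_f = dict(kind, f="F", g="F")
po_f = {("a", "f"), ("f", "b"), ("a", "b"), ("c", "g"), ("g", "d"), ("c", "d")}
print(satisfies_condition(epo, sc_x86_condition(kind_f, loc, po_f)))  # True
```

The unfenced store-to-load pairs are exactly the po edges that fall outside R, which matches the reordering x86 permits.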
D.2 SC, x86 robustness against ARMv8
Theorem 14. A program P is M -robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds where R
is defined as condition (M -K) in Fig. 17.
In this case R = po|loc ∪ (aob ∪ dob ∪ bob)+.
Proof. Both SC and ARMv8 satisfy atomicity.
It remains to show (X.po ∪ X.rf ∪ X.fr ∪ X.co) is acyclic by contradiction.
Assume (X.po ∪ X.rf ∪ X.fr ∪ X.co) creates a cycle.
If the cycle has one or no epo edge then the cycle violates (sc-per-loc).
Otherwise, the cycle contains two or more epo edges.
It implies (X.epo;X.eco)+ creates a cycle.
It implies ((X.po|loc ∪ (X.aob ∪ X.dob ∪ X.bob)+);X.eco)+ creates a cycle.
Considering X.po|loc with incoming and outgoing X.eco, possible cases:
(1) [Ld];X.po|loc; [Ld];X.fre; [St] =⇒ [Ld];X.fre
(2) [St];X.po|loc; [Ld];X.fre; [St] =⇒ [St];X.coe
(3) [St];X.rfe; [Ld];X.po|loc; [St] =⇒ [St];X.coe
(4) (X.coe ∪ X.fre); [St];X.po|loc; [St] =⇒ X.coe ∪ X.fre
Therefore a ((X.po|loc ∪ (X.aob ∪ X.dob ∪ X.bob)+);X.eco)+ cycle implies a ((X.aob ∪ X.dob ∪ X.bob)+;X.eco)+
cycle.
It implies an X.ob cycle which violates (external) and therefore a contradiction.
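The SC condition targeted by these proofs, acyclicity of po ∪ rf ∪ fr ∪ co, can be checked on a concrete execution as sketched below. This is an illustration rather than the paper's analysis: fr is derived as rf−1;co, initial writes are modeled as ordinary events ix and iy, and the two store-buffering outcomes are hypothetical examples.

```python
def is_acyclic(edges):
    """Kahn-style check: a relation is acyclic iff all nodes can be peeled off."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for (_, target) in edges:
        indegree[target] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for (source, target) in edges:
            if source == n:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return removed == len(nodes)

def from_read(rf, co):
    """fr = rf^-1;co: a read is before every write co-after the one it reads."""
    return {(r, w2) for (w1, r) in rf for (w, w2) in co if w == w1}

def is_sc(po, rf, co):
    return is_acyclic(po | rf | from_read(rf, co) | co)

# Store buffering: a = W x, b = R y (thread 1); c = W y, d = R x (thread 2).
po = {("a", "b"), ("c", "d")}
co = {("ix", "a"), ("iy", "c")}

weak_rf = {("iy", "b"), ("ix", "d")}  # both loads read the initial values
sc_rf = {("c", "b"), ("ix", "d")}     # b observes c's store

print(is_sc(po, weak_rf, co))  # False: cycle a -> b -> c -> d -> a
print(is_sc(po, sc_rf, co))    # True
```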
D.3 Proof of x86A robustness against ARMv8
Theorem 14. A program P is M -robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds where R
is defined as condition (M -K) in Fig. 17.
In this case R = po|loc ∪ (aob ∪ bob ∪ dob)+ ∪WR
Proof. Suppose ((X.po \WR) ∪ fence ∪ X.rfe ∪ X.co ∪ X.fr) is a cycle.
It implies ((X.po \WR) ∪ X.fence ∪ X.rfe ∪ X.coe ∪ X.fre) is a cycle as coi ⊆ (X.po \WR) and fri ⊆ (X.po \WR).
It implies ((X.po \ WR);X.eco ∪ X.fence;X.eco ∪ X.WR|loc;X.eco ∪ X.WR|≠loc;X.eco) cycle.
Now X.WR|≠loc =⇒ [St]; (X.po \ WR); [St]|≠loc;X.eco.
Therefore it implies ((X.po \WR);X.eco ∪ X.fence ∪ X.WR|loc;X.eco) cycle.
Following the definition of epo
It implies ((po|loc ∪ (X.aob ∪ X.dob ∪ X.bob)+);X.eco)+ cycle.
Considering the incoming and outgoing edges for X.po|loc:
[Ld];X.po|loc; [Ld];X.fre =⇒ [Ld];X.fre
[St];X.rfe; [Ld];X.po|loc; [St] =⇒ [St];X.coe
(X.fre ∪ X.coe); [St];X.po|loc; [St] =⇒ (X.fre ∪ X.coe)
[St];X.po|loc; [Ld];X.fre =⇒ [St];X.coe
It implies ((X.aob ∪ X.dob ∪ X.bob)+;X.eco)+ creates a cycle.
It implies X.ob creates a cycle which is a contradiction.
Therefore X is x86A consistent.
D.4 SC, x86A, ARMv8, ARMv7mca robust against ARMv7
Theorem 14. A program P is M -robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds where R
is defined as condition (M -K) in Fig. 17.
D.4.1 SC-robust against ARMv7
In this case R = po|loc ∪ fence.
Proof. Both SC and ARMv7 satisfy atomicity.
It remains to show that (X.po ∪ X.rf ∪ X.fr ∪ X.co) is acyclic by contradiction.
Assume (X.po ∪ X.rf ∪ X.fr ∪ X.co) creates a cycle.
If the cycle has one or no epo edge then the cycle violates (sc-per-loc).
Otherwise, the cycle contains two or more epo edges.
It implies (X.epo;X.eco)+ creates a cycle.
It implies ((X.po|loc ∪ X.fence);X.eco)+ creates a cycle.
Considering the incoming and outgoing edges for X.po|loc:
[Ld];X.po|loc; [Ld];X.fre =⇒ [Ld];X.fre
[St];X.rfe; [Ld];X.po|loc; [St] =⇒ [St];X.coe
(X.fre ∪ X.coe); [St];X.po|loc; [St] =⇒ (X.fre ∪ X.coe)
[St];X.po|loc; [Ld];X.fre =⇒ [St];X.coe
It implies (X.fence;X.eco)+ creates a cycle.
Now we consider [codom(fence)];X.eco; [dom(fence)] path.
Possible cases:
Case [Ld];X.eco; [Ld]:
It implies X.fre;X.rfe.
Case [Ld];X.eco; [St]:
It implies X.fre
Case [St];X.eco; [St]:
It implies X.coe
Case [St];X.eco; [Ld]:
It implies X.coe;X.rfe
Thus an (X.fence;X.eco)+ cycle implies
a ((X.coe ∪ X.fre);X.rfe?;X.fence)+ cycle.
It implies a prop+ cycle which violates (propagation).
Hence a contradiction and therefore SC is preserved.
D.4.2 x86A robust against ARMv7
Theorem 14. A program P is M -robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds where R
is defined as condition (M -K) in Fig. 17.
In this case R = po|loc ∪ fence ∪WR.
Proof. Suppose ((X.po \WR) ∪ X.rfe ∪ X.co ∪ X.fr) is a cycle.
It implies ((X.po \WR) ∪ X.rfe ∪ X.co ∪ X.fr) is a cycle.
It implies ((X.po \WR) ∪ X.fence ∪ X.rfe ∪ X.coe ∪ X.fre) is a cycle as coi ⊆WW and fri ⊆ (X.po \WR).
It implies ((X.po \ WR);X.eco ∪ X.fence;X.eco ∪ X.WR|loc;X.eco ∪ X.WR|≠loc;X.eco) cycle.
Now X.WR|≠loc =⇒ [St]; (X.po \ WR); [St]|≠loc;X.eco.
Therefore it implies ((X.po \WR);X.eco ∪ X.fence;X.eco ∪ X.WR|loc;X.eco) cycle.
It implies ((po|loc ∪ fence);X.eco)+ cycle following the definition of epo.
Considering the incoming and outgoing edges for X.po|loc:
[Ld];X.po|loc; [Ld];X.fre =⇒ [Ld];X.fre
[St];X.rfe; [Ld];X.po|loc; [St] =⇒ [St];X.coe
(X.fre ∪ X.coe); [St];X.po|loc; [St] =⇒ (X.fre ∪ X.coe)
[St];X.po|loc; [Ld];X.fre =⇒ [St];X.coe
It implies (X.fence;X.eco)+ creates a cycle.
It implies X.prop creates a cycle which is a contradiction.
Therefore X is x86A consistent.
D.4.3 ARMv8 robust against ARMv7
Theorem 14. A program P is M -robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds where R
is defined as condition (M -K) in Fig. 17.
In this case R = po|loc ∪ [St]; po ∪ fence.
Proof. We show X is ARMv8 consistent.
(internal)
Assume a (X.po|loc ∪ X.fr ∪ X.co ∪ X.rf) cycle.
However, X satisfies (sc-per-loc) and hence a contradiction.
Therefore, X satisfies (internal).
(external)
Assume a X.ob cycle.
It implies (X.obs; (X.aob ∪ X.bob ∪ X.dob))+ creates cycle.
From the definition,
(X.aob ∪ X.bob ∪ X.dob) ⊆ po|loc ∪ fence ∪ [St]; po and therefore
((X.rfe ∪ X.coe ∪ X.fre); (X.po|loc ∪ fence))+ creates cycle.
It implies prop creates a cycle which violates (propagation).
Therefore a contradiction and X satisfies (external).
(atomicity)
ARMv7 execution X satisfies (atomicity).
Therefore X has only ARMv8 execution.
D.4.4 ARMv7-mca robust against ARMv7
Theorem 14. A program P is M -robust against K if in all its K-consistent executions X, X.epo ⊆ X.R holds where R
is defined as condition (M -K) in Fig. 17.
In this case R = [Ld]; po|loc ∪ fence; [Ld].
Proof. We show that X satisfies (mca).
Assume X violates (mca) and therefore a wo+ cycle.
It implies a (X.rfe; [Ld];X.ppo; [Ld];X.fre)+ cycle.
However, [Ld];X.ppo; [Ld] ⊆ po|loc ∪ fence.
[Ld];X.ppo; [Ld] ⊆ po|loc violates (sc-per-loc) and therefore a contradiction.
Otherwise it implies a (X.rfe; [Ld];X.fence; [Ld];X.fre)+ cycle.
It implies a X.prop cycle which violates (propagation)
Therefore a contradiction and hence X is ARMv7-mca consistent.
