Formally specifying memory consistency models and automatically generating executable specifications by Chatterjee, Prosenjit
Formally specifying memory consistency models and automatically 
generating executable specifications 
Prosenjit Chatterjee and Ganesh Gopalakrishnan* 
School of Computing, University of Utah 
{prosen, ganesh}@cs.utah.edu 
Technical Report UUCS-01-012 
Abstract 
Memory ordering properties of shared memory multiprocessors are more subtle and less well understood than 
cache coherence. These properties tend to be processor or platform specific and are not always formally specified. 
It is difficult to compare even those platforms whose memory ordering properties have been clearly specified, as each 
such platform is usually specified in its own definitional framework. We present a generic and formal specification 
scheme that can express any realistic memory consistency model. It gives architects and implementors of the 
platform being specified an intuitive understanding of its memory model, and it provides a common definitional framework in which to 
compare memory models. A second contribution of the paper is the automatic generation of an executable specification 
from the specification of any memory consistency model expressed in our newly defined framework. This alternative 
specification can be used to generate all possible outcomes of small assembly-language multiprocessor programs in a 
given memory model, which is very helpful for understanding the subtleties of the model. The executable specification 
can also check the correctness of assembly language programs including synchronization routines. 
1 Introduction 
Shared-memory multiprocessors are increasingly employed both as servers (for computations, databases, files, 
and the web) and as clients. To improve performance, multiprocessor system designers use a variety of complex 
and interacting optimizations. These optimizations include cache coherence via snooping or directory protocols, 
out-of-order processors, and store buffers. They add considerable complexity at the architectural level 
and even more complexity at the implementation level. Directory protocols, for example, require the system 
to transition from many shared copies of a block to one exclusive one. Unfortunately, these transitions must 
be implemented with many non-atomic lower-level transitions that expose additional race conditions, buffering 
requirements, and forward-progress concerns. Due to this complexity, industrial product groups spend more time 
verifying their system than actually designing and optimizing it. 
To verify a system, engineers should unambiguously define what "correct" means. For a shared-memory system, 
"correct" is defined by a memory consistency model. A memory consistency model defines for programmers the 
allowable behavior of hardware. 
Sequential Consistency (SC) is the most intuitive model for writing shared memory programs[?], mainly because all the 
memory operations in an execution that obeys SC can be viewed as a total order, 'as if' the program had been run on 
a uniprocessor. However, many commercial processors implement more relaxed memory consistency models in an 
effort to improve performance, for example by adding FIFO or coalescing store buffers, non-blocking loads, 
and so on. SPARC Total Store Order (TSO)[?] relaxes the SC requirement: in the total ordering of memory 
operations, a store (st) can appear after a load (ld) that follows it in program order. More relaxed models, such as 
Compaq (DEC) Alpha[?], allow re-ordering between any two instructions. Cray T3D[?] and PowerPC[?] implement 
even more relaxed memory models, in which a processor can read the value written by another processor's store instruction 
even before the remaining processors see it. These models deviate substantially from SC, since their memory operations 
no longer form any such total order. While modern high performance processors gain additional performance from 
relaxed consistency schemes, these schemes lead to less intuitive programming models than SC, simply because the total 
ordering of all memory operations (loads and stores) in an execution no longer exists. This 
work contributes a way of formally specifying any memory model that is as strong as, or weaker than, sequential 
consistency, such that an execution obeys a given memory model if there exists at least one total order of 
all memory operations that satisfies all the memory ordering rules defining that memory model. Hence, even for 
a memory model that is very weak compared to sequential consistency and has hardly anything in 
common with it, it is still possible to confine the differences to behavior internal to the processor. 
*This work was supported by National Science Foundation Grants CCR-9987516 and CCR-0081406 
In Section 2 we define some memory ordering rules and the concept of logical total ordering of events. In Section 
3 we categorize memory models into four classes and define memory ordering rules for each class. Section 
4 describes the methodology to generate an operational model given the memory model specification expressed in 
our framework and, finally, Section 5 concludes with directions for future work. 
2 Memory model specification 
Before proceeding, we introduce some terminology. A concurrent shared memory program is viewed as a set of 
sequences of instructions, one sequence per processor. Each sequence in the set captures the program order at that 
processor. An execution of this program is obtained by running the shared memory program on an MP system. 
Formally, an execution is the above set of sequences of instructions with each load labeled with its returned value. 
Every instruction in an execution can be decomposed into one or more events (in the former case, we shall use the 
words 'instruction' and 'event' interchangeably). Rule Read Value indicates what data value a load event in an 
execution should return. Rule Per Processor Order constrains how any two events from the same processor should 
be ordered in their "logical" total order relation. Finally, Rule Write atomicity requires all store events to appear 
to be visible to all processors instantaneously. 
We define an execution to obey a memory model if all the memory events of the execution form at least one logical 
total order which obeys the Per Processor Order rule and is consistent with the Read Value rule. The symbol $\rightarrow$ denotes the total 
order relation (often referred to as "logical total order" or "logical order"). The logical order of completion of events 
may not be the same as the temporal order of completion. To illustrate this distinction, consider a processor that 
allows local bypassing, i.e., early retirement of matching loads by directly reading from the store buffer. Let this 
processor also implement memory fences¹ aggressively by marking existing entries in the store buffer, and flushing 
the marked entries only just prior to subsequent coherent transactions. In such a processor, an instruction sequence 
store(a); fence; load(a) may be carried out in temporal order as store(a).local; load(a); store(a).global. In 
other words, the load appears to have finished "globally" before the store finishes globally. However, in a logical 
explanation, store(a).global must precede load(a) due to the presence of a memory fence between 
them. Other examples of this situation arise in speculative implementation schemes where, following a store that 
misses in the local cache, a speculative load that hits in the local cache may be entertained [7]. 
Each instruction $t$ is defined as a tuple $(p, l, o, a, d)$ where 
• $p(t)$: processor in whose program $t$ originates. 
• $l(t)$: label of instruction $t$ in $p$'s program. 
• $o(t)$: operation type 
• $a(t)$: memory address 
• $d(t)$: data 
If there are two instructions $t_1$, $t_2$ of the same processor then $l(t_1) < l(t_2)$ means that $t_1$ appears before $t_2$ in program order. 
¹ Any event issued after a fence should appear to complete only after any event that was issued before the fence. 
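The instruction tuple and the program-order test can be written down directly. The following is a minimal sketch, not part of the original report: the class name `Instr`, the helper `program_order_before`, and the string encodings of operation types and addresses are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Illustrative encoding of the tuple (p, l, o, a, d)."""
    p: int             # processor whose program contains the instruction
    l: int             # label: position within that processor's program
    o: str             # operation type, assumed here to be "ld", "st" or "fence"
    a: Optional[str]   # memory address (None for a fence)
    d: Optional[int]   # data: value stored, or value returned by a load


def program_order_before(t1: Instr, t2: Instr) -> bool:
    """True when t1 precedes t2 in program order: same processor and l(t1) < l(t2)."""
    return t1.p == t2.p and t1.l < t2.l


# Hypothetical two-instruction program fragment for processor 0.
s = Instr(p=0, l=0, o="st", a="x", d=1)
r = Instr(p=0, l=1, o="ld", a="x", d=1)
print(program_order_before(s, r))
```

Labels are only compared within one processor in this sketch, which matches the way the ordering rules later pair $p(t_1) = p(t_2)$ with $l(t_1) < l(t_2)$.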
3 Hierarchy of Memory Models 
An executable specification (operational model) of any memory model can be visualized as a correct and simple 
'implementation' whose every run satisfies the memory model definition and also every run that satisfies the 
memory model can be generated by the executable specification. The selection of this simplified 'implementation' 
depends on the memory model under examination. In this section we categorize memory models into four classes 
and show how common data structures for the operational model can be derived for all memory models belonging 
to a particular class, thus providing a systematic approach to deriving the operational model. The 
four classes of memory models are as follows: 
1. Strong: requires Write Atomicity and does not allow local bypassing (e.g. Sequential Consistency, IBM-370). 
2. Weak: requires Write Atomicity and allows local bypassing (e.g. UltraSPARC TSO, PSO and RMO, Alpha). 
3. Weakest: does not require Write Atomicity and allows local bypassing (e.g. PC, PowerPC). 
4. Hybrid: supports weak load and store instructions that fall under the Weakest class and also supports 
strong load and store instructions that fall under the Strong or Weak classes (e.g. Itanium, 
DASH-RC, Repo). 
[Figure 1: tree diagram with root "Memory Consistency Models" and branches Strong (SC, IBM-370), Weak (TSO, PSO, RMO, Alpha), Weakest (PC, PowerPC) and Hybrid (Itanium, DASH-RC)] 
Figure 1. The four classes of memory models 
Memory models that are weaker than the Strong category do not usually require that an execution have all its load 
and store instructions form a total order as a part of their specification. 
In our work, we specify even these memory models by identifying a 
logical total order such as the relation $\rightarrow$ described earlier. In other words, even for these weaker memory models, by 
splitting loads and stores into finer events, we define a logical total order of all the memory events. 
Depending upon the category a memory model falls under, we split a store instruction into one or more events. 
Load instructions for any memory model can always be treated as a single event. Here are a few examples of 
splitting events. In the case of Sequential Consistency, we do not split even the stores, as sequential consistency 
demands a single global total order of the loads and stores. For a weak memory model such as UltraSPARC 
TSO, we split the store instruction into two events: a local store event (the store is visible only 
to the processor that issued it) and a global store event (the store is visible to all processors). 
Since the Weakest category of memory models lacks write atomicity, we need to split stores into $p + 1$ events, where 
$p$ is the number of processors, thus ending up with a local store event and $p$ global events (global event $i$ means 
that the store is visible to processor $i$). 
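The splitting scheme just described can be summarized as a small function. This is only an illustrative sketch: the function name `split_store`, the class names as strings, and the event tags are assumptions, and the Hybrid class is left out because its treatment is discussed separately in Section 3.4.

```python
from typing import List


def split_store(model_class: str, num_procs: int) -> List[str]:
    """Return the (hypothetical) event tags one store instruction is split into."""
    if model_class == "strong":
        # Stores are kept unsplit: the instruction is itself the event.
        return ["st"]
    if model_class == "weak":
        # One event for local visibility, one for visibility to all processors.
        return ["local", "global"]
    if model_class == "weakest":
        # One local event plus one global event per processor: p + 1 events.
        return ["local"] + [f"global_{k}" for k in range(1, num_procs + 1)]
    raise ValueError(f"unknown memory model class: {model_class}")


for cls in ("strong", "weak", "weakest"):
    print(cls, split_store(cls, num_procs=4))
```

With four processors the Weakest class yields five events per store, matching the $p + 1$ count in the text.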
We now look at each of the four classes of memory models in detail. 
3.1 Strong memory models: 
For specifying memory consistency models under this category, every load and store instruction is kept unsplit, 
so every instruction can be viewed as an event itself. 
3.1.1 Memory ordering rules 
1. Read Value: Let $t_1$ be a load instruction. Then the data value of $t_1$ is the data value of the "most recent" 
store instruction $t_2$ to the same memory location as $t_1$ in the total order relation $\rightarrow$, i.e., 
(a) $o(t_2) = st$, $a(t_1) = a(t_2)$, $t_2 \rightarrow t_1$, and 
(b) there does not exist a store instruction $t_3$ s.t. $a(t_1) = a(t_3)$ and $t_2 \rightarrow t_3 \rightarrow t_1$. 
2. Per processor order: Let $t_1$ and $t_2$ be two instructions s.t. $p(t_1) = p(t_2)$, $l(t_1) < l(t_2)$. The Per processor order 
rule consists of five sub-rules. 
(a) $ld \rightarrow ld$: $o(t_1) = ld$, $o(t_2) = ld$. 
(b) $ld \rightarrow st$: $o(t_1) = ld$, $o(t_2) = st$. 
(c) $st \rightarrow ld$: $o(t_1) = st$, $o(t_2) = ld$. 
(d) $st \rightarrow st$: $o(t_1) = st$, $o(t_2) = st$. 
(e) fence: there exists a fence instruction $t_f$ s.t. $l(t_1) < l(t_f) < l(t_2)$. 
For any sub-rule that holds true for a memory model, $t_1 \rightarrow t_2$. 
Sometimes a sub-rule holds only per memory location, and in that case we explicitly specify that the sub-rule 
holds per memory location (e.g. if $ld \rightarrow st$ holds only when $a(t_1) = a(t_2)$ then we say that $ld \rightarrow st$ per 
memory location holds). 
We use concise representations to express the list of sub-rules that hold for a memory model (e.g. if both the 
$ld \rightarrow ld$ and $ld \rightarrow st$ sub-rules are satisfied then we say that the memory model satisfies $ld \rightarrow X$, 
where $X$ means any memory instruction). 
3. Write Atomicity: All the store events form a single total order in such a way that every store instruction appears 
to be visible atomically to all processors. 
3.1.2 Examples of Strong Memory Models: 
An execution satisfies a Strong memory model if there exists at least one logical total order of all memory events 
in that execution that obeys the Read Value and Write Atomicity rules and also obeys zero or more sub-rules of the 
Per processor order rule, depending upon that memory model. 
The sub-rules of the Per processor order rule that apply to the following memory models are: 
1. Sequential Consistency: All sub-rules except fence hold (SC is defined in the absence of fence instructions). Hence 
we can concisely say that $X \rightarrow X$ holds. 
2. IBM-370: Sub-rules $ld \rightarrow X$, $st \rightarrow st$, $st \rightarrow ld$ per memory location, and fence hold. 
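To make the definition concrete for Sequential Consistency, the following sketch enumerates every total order of a tiny program that keeps program order ($X \rightarrow X$) and applies the Read Value rule to each load. It is a brute-force illustration only, not the generation method of Section 4. The litmus program, the register names, the function name `sc_outcomes`, and the initial memory value 0 (which the report states explicitly only for the later classes) are assumptions.

```python
from itertools import permutations

# Hypothetical store-buffering litmus test: (operation, address, value or register).
PROGRAM = {
    0: [("st", "x", 1), ("ld", "y", "r0")],
    1: [("st", "y", 1), ("ld", "x", "r1")],
}


def sc_outcomes(program):
    """Collect load results over all total orders that respect program order."""
    events = [(p, i) for p, instrs in program.items() for i in range(len(instrs))]
    outcomes = set()
    for order in permutations(events):
        # Per processor order with X -> X: labels of a processor must stay in order.
        if any(order.index((p, i)) > order.index((p, i + 1))
               for p, instrs in program.items() for i in range(len(instrs) - 1)):
            continue
        memory, regs = {}, {}
        for p, i in order:
            op, addr, x = program[p][i]
            if op == "st":
                memory[addr] = x               # becomes the most recent store
            else:
                regs[x] = memory.get(addr, 0)  # Read Value; assumed initial value 0
        outcomes.add(tuple(sorted(regs.items())))
    return outcomes


for outcome in sorted(sc_outcomes(PROGRAM)):
    print(dict(outcome))
```

Because stores are unsplit, Write Atomicity holds trivially in this sketch. The outcome in which both loads return 0 never appears, since it would require each load to precede the other processor's store, which contradicts program order.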
3.2 Weak memory models: 
Every store instruction is split into two memory events. Hence a store instruction $t = (p, l, st, a, d)$ is split 
into $t^{local} = (p, l, st^{local}, a, d)$ and $t^{global} = (p, l, st^{global}, a, d)$, both of which have the same data value. Each load 
instruction may get its Read Value from a $t^{local}$ for which the corresponding $t^{global}$ has not yet occurred. The goal 
in this case is to model "local bypassing", i.e., store buffer bypassing, where a store enters the store buffer 
as $t^{local}$ when it is visible only to the processor that issued it and exits as $t^{global}$ when it is visible to all processors. 
$t^{local} \rightarrow t^{global}$ always. 
3.2.1 Memory ordering rules: 
1. Read Value: Let $t_1$ be a load instruction. Then the data value of $t_1$ is the data value of the "most recent" 
store event split from the store instruction $t_2$ to the same memory location as $t_1$ in the total order relation 
$\rightarrow$, i.e., 
(a) if 
i. $p(t_1) = p(t_2)$, $a(t_1) = a(t_2)$, $t_2^{local} \rightarrow t_1 \rightarrow t_2^{global}$ and 
ii. there does not exist a store instruction $t_3$ s.t. $p(t_1) = p(t_3)$, $a(t_1) = a(t_3)$, $t_2^{local} \rightarrow t_3^{local} \rightarrow t_1$. 
(b) else if 
i. $a(t_1) = a(t_2)$, $t_2^{global} \rightarrow t_1$ and 
ii. there does not exist a store instruction $t_3$ s.t. $a(t_1) = a(t_3)$, $t_2^{global} \rightarrow t_3^{global} \rightarrow t_1$. 
(c) else, $t_1$ receives the initial value 0. 
2. Per processor order: Let $t_1$ and $t_2$ be two events s.t. $p(t_1) = p(t_2)$, $l(t_1) < l(t_2)$. There are six possible sub-rules 
as follows: 
(a) $ld \rightarrow ld$: $o(t_1) = ld$, $o(t_2) = ld$. 
(b) $ld \rightarrow st$: $o(t_1) = ld$, $o(t_2) = st^{global}$. 
(c) $st \rightarrow ld$: $o(t_1) = st^{global}$, $o(t_2) = ld$. 
(d) $st \rightarrow st$: $o(t_1) = st^{global}$, $o(t_2) = st^{global}$. 
(e) fence: there exists a fence instruction $t_f$ s.t. $l(t_1) < l(t_f) < l(t_2)$. 
(f) Memory Data Dependence: This sub-rule pertains to instructions involving the same memory location. 
Hence, for $a(t_1) = a(t_2)$, 
i. $ld \rightarrow ld$: $o(t_1) = ld$, $o(t_2) = ld$. 
ii. $ld \rightarrow st$: $o(t_1) = ld$, $o(t_2) = st^{local}$. 
iii. $st \rightarrow ld$: $o(t_1) = st^{local}$, $o(t_2) = ld$. 
iv. $st \rightarrow st$: $o(t_1) = st^{local}$, $o(t_2) = st^{local}$. 
For any sub-rule that holds true for a memory model, $t_1 \rightarrow t_2$. Whenever we use "X" we mean either an $ld$ or an 
$st$ instruction. 
3. Write Atomicity: All the store events $t$ where $o(t) = st^{global}$ form a single total order in such a way that every 
store event appears to be visible atomically to all processors. 
3.2.2 Examples of Weak Models : 
An execution satisfies a memory model in this category if there exists at least one logical total order of all memory 
events in that execution that obeys the Read Value and Write Atomicity rules and obeys zero or more sub-rules of the 
Per processor order rule, depending upon that memory model. 
The sub-rules of the Per processor order rule that apply to the following memory models are: 
1. TSO: Sub-rules $ld \rightarrow X$, $X \rightarrow st$, Memory Data Dependence and fence hold. 
2. PSO: Sub-rules $ld \rightarrow X$, Memory Data Dependence and fence hold. 
3. RMO: Sub-rules Memory Data Dependence and fence hold. 
4. Alpha: Sub-rules Memory Data Dependence and fence hold. Note that Alpha has two kinds of fences: MB, 
which is equivalent to our definition of fence, and MB-WW, which applies between store instructions 
only (MB-WW can be defined in a similar way as MB was defined). 
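The concise "X" notation can be expanded mechanically, which makes it easy to see which reorderings each model leaves unconstrained. The sketch below transcribes only the ordering sub-rules listed above (Memory Data Dependence and fence are deliberately left out); the names `expand`, `MODELS`, `ordered_pairs` and `relaxed_pairs` are illustrative assumptions.

```python
def expand(rule: str) -> set:
    """Expand a shorthand such as "ld -> X" into concrete (first, second) pairs."""
    left, right = [side.strip() for side in rule.split("->")]
    ops = ("ld", "st")
    firsts = ops if left == "X" else (left,)
    seconds = ops if right == "X" else (right,)
    return {(f, s) for f in firsts for s in seconds}


# Ordering sub-rules as listed in Section 3.2.2, without Memory Data Dependence and fence.
MODELS = {
    "TSO": ["ld -> X", "X -> st"],
    "PSO": ["ld -> X"],
    "RMO": [],
    "Alpha": [],
}


def ordered_pairs(model: str) -> set:
    """All (first, second) operation pairs that the model keeps in program order."""
    pairs = set()
    for rule in MODELS[model]:
        pairs |= expand(rule)
    return pairs


def relaxed_pairs(model: str) -> set:
    """Pairs that may be reordered, absent a fence or a same-address dependence."""
    return expand("X -> X") - ordered_pairs(model)


for name in MODELS:
    print(name, sorted(relaxed_pairs(name)))
```

Under this encoding TSO relaxes only the store-to-load pair, PSO additionally relaxes store-to-store, and RMO and Alpha relax all four pairs.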
3.3 Weakest memory models: 
Every store instruction is split into $p + 1$ events, where $p$ is the number of processors. Thus $t$ where $o(t) = st$ is 
split into $t^{local}$ where $o(t^{local}) = st^{local}$ and $t^1, \ldots, t^k, \ldots, t^p$ where $o(t^k) = st^k_{global}$ (abbreviated $st^k$ below), and $t^{local} \rightarrow t^{p(t)} \rightarrow t^k$ $\forall k$ ($1 \le k \le p$ 
and $k \ne p(t)$). Since Weakest memory models do not require Write Atomicity, a store instruction may become visible 
to one processor earlier or later than it becomes visible to another processor. Hence we need to identify separately when a store 
becomes visible to each processor: event $t^k$ corresponds to the point at which the store $t$ becomes visible 
to processor $k$. 
3.3.1 Memory Ordering Rules 
1. Read Value: Let $t_1$ be a load instruction. Then the data value of $t_1$ is the data value of the "most recent" 
store event split from the store instruction $t_2$ to the same memory location as $t_1$ in the total order relation 
$\rightarrow$, i.e., 
(a) if 
• $p(t_1) = p(t_2)$, $a(t_1) = a(t_2)$, $t_2^{local} \rightarrow t_1 \rightarrow t_2^{p(t_1)}$ and 
• there does not exist a store instruction $t_3$ s.t. $p(t_1) = p(t_3)$, $a(t_1) = a(t_3)$, $t_2^{local} \rightarrow t_3^{local} \rightarrow t_1$. 
(b) else if 
• $a(t_1) = a(t_2)$, $t_2^{p(t_1)} \rightarrow t_1$ and 
• there does not exist a store instruction $t_3$ s.t. $a(t_1) = a(t_3)$, $t_2^{p(t_1)} \rightarrow t_3^{p(t_1)} \rightarrow t_1$. 
(c) else, $t_1$ receives the initial value 0. 
2. Per processor order: Let $t_1$ and $t_2$ be two events (each can be $ld$, $st^{local}$, or $st^k$ for any processor $k$) s.t. $p(t_1) = 
p(t_2)$, $l(t_1) < l(t_2)$. There are six possible sub-rules. 
(a) $ld \rightarrow ld$: $o(t_1) = ld$, $o(t_2) = ld$. 
(b) $ld \rightarrow st$: $o(t_1) = ld$, $o(t_2) = st^{p(t)}$. 
(c) $st \rightarrow ld$: $o(t_1) = st^{p(t)}$, $o(t_2) = ld$. 
(d) $st \rightarrow st$: $o(t_1) = st^k$, $o(t_2) = st^k$ for all processors $k$. 
(e) fence: there exists a fence instruction $t_f$ s.t. $l(t_1) < l(t_f) < l(t_2)$ and $o(t_1) = o(t_2) = st^{p(t)}$. 
(f) Memory Data Dependence: This sub-rule pertains to instructions involving the same memory location. 
Hence, for $a(t_1) = a(t_2)$, 
i. $ld \rightarrow ld$: $o(t_1) = ld$, $o(t_2) = ld$. 
ii. $ld \rightarrow st$: $o(t_1) = ld$, $o(t_2) = st^{local}$. 
iii. $st \rightarrow ld$: $o(t_1) = st^{local}$, $o(t_2) = ld$. 
iv. $st \rightarrow st$: $o(t_1) = st^{local}$, $o(t_2) = st^{local}$. 
If any one of the above sub-rules holds then $t_1 \rightarrow t_2$. Again, "X" refers to either an $ld$ or an $st$ 
instruction. 
3. Coherence: Until now we have not referred to Coherence, because for Strong and Weak memory models Coherence is 
a sub-rule of Write Atomicity. We now need to define Coherence separately because Weakest 
memory models do not support Write Atomicity but may or may not support Coherence. 
Coherence requires that if $t_1$ and $t_2$ are two store instructions s.t. $a(t_1) = a(t_2)$ then 
(a) If $p(t_1) = p(t_2)$, $l(t_1) < l(t_2)$, then $t_1^k \rightarrow t_2^k$ $\forall$ processors $k$. 
(b) If $t_1^p \rightarrow t_2^p$ for some processor $p$ then $t_1^k \rightarrow t_2^k$ $\forall$ processors $k$. 
4. Write Atomicity: Although Weakest memory models do not require Write Atomicity, we still define it 
for store instructions that are split into $p + 1$ events; its importance will become apparent in the next section, 
where we define Hybrid models. 
Here we have two cases as follows: 
(a) case 1: If a store instruction $t$ is split into $p + 1$ events, i.e., $t^{local}, t^1, \ldots, t^p$, then they all appear to occur 
atomically, i.e., in the total order relation, if $t^i \rightarrow t^j$ where $i, j \in \{local, 1, \ldots, p\}$ and $t^i \rightarrow t' \rightarrow t^j$, then $t'$ 
can only be $\in \{t^{local}, t^1, \ldots, t^p\}$. Thus, all the events of $t$ are ordered in such a way that no event that does 
not belong to $t$ can come in between. 
(b) case 2: If a store instruction $t$ is split into $p + 1$ events, i.e., $t^{local}, t^1, \ldots, t^p$, then all of them except $t^{local}$ 
appear to occur atomically, i.e., in the total order relation, if $t^i \rightarrow t^j$ where $i, j \in \{1, \ldots, p\}$ and 
$t^i \rightarrow t' \rightarrow t^j$, then $t'$ can only be $\in \{t^1, \ldots, t^p\}$. 
Note that the definition of case 1 is the same as the definition of Write Atomicity for Strong memory models and the 
definition of case 2 is the same as the definition of Write Atomicity for Weak memory models; they differ only in how store 
instructions are split. 
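Both cases of Write Atomicity amount to a contiguity check on a candidate total order. The following sketch shows one way to test it; the event encoding (pairs of an instruction identifier and a tag) and the function name `write_atomic` are assumptions made for illustration.

```python
def write_atomic(order, store_id, include_local):
    """Check that the selected events of one store are contiguous in a total order.

    order         : list of (instruction id, tag) events; for a store the tag is
                    "local" or a processor number k, for a load it is "ld".
    include_local : True checks case 1 (t_local, t_1..t_p all contiguous),
                    False checks case 2 (only t_1..t_p contiguous).
    """
    positions = [i for i, (sid, tag) in enumerate(order)
                 if sid == store_id and (include_local or tag != "local")]
    # Contiguous means the index range covered by the events has no gaps.
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


# Hypothetical total order on two processors: a load slips in after the local event.
order = [("s", "local"), ("r", "ld"), ("s", 1), ("s", 2)]
print(write_atomic(order, "s", include_local=True))
print(write_atomic(order, "s", include_local=False))
```

In the example order the load separates the local event from the global events, so case 1 is violated while case 2 still holds, which is exactly the local-bypassing situation.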
3.3.2 Examples of Weakest Memory Models 
An execution satisfies a memory model in this category if there exists at least one logical total order of all memory 
events in that execution that obeys Read Value, may or may not satisfy Coherence, and obeys zero or more sub-rules 
of the Per processor order rule, depending upon that memory model. 
The sub-rules of the Per processor order rule that apply to the following memory models are: 
1. PC: Sub-rules $ld \rightarrow X$, $X \rightarrow st$, Memory Data Dependence and fence hold. 
2. PowerPC: Sub-rules Memory Data Dependence and fence hold. 
3.4 Hybrid memory models: 
Hybrid memory models are those that support more than one kind of load and store operation, where all executions containing 
only the strong load and store operations obey a memory model in the Strong or Weak class and 
all executions containing only the weak load and store operations obey a memory model in the Weakest 
class. 
In such models, where both strong and weak store operations exist, there are two ways to view the store 
instructions: 
• split each strong store operation $t$ into $t^{local}$ and $t^{global}$, and split each weak store operation into $p + 1$ events, 
i.e., $t^{local}, t^1, \ldots, t^p$. 
• split both the weak and the strong store operations into $p + 1$ events, i.e., $t^{local}, t^1, \ldots, t^p$. 
Although breaking up the strong store operations into $p + 1$ events seems redundant for defining their 
properties, the latter method retains uniformity and explains the interaction between weak and strong store 
operations more clearly. Hence all strong and weak store operations are split into 
$p + 1$ events as explained before. The memory ordering rules Read Value, Per processor order, Coherence and Write 
Atomicity are the same as those defined for Weakest memory models. However, a reference to an $ld$ or $st$ instruction now 
corresponds to a weak load or store instruction respectively, while $ld.acq$ and $st.rel$ refer to strong load and store instructions 
respectively, unlike in Weakest memory models, where $ld$ and $st$ each correspond to one unique type of load and 
store instruction. Consequently, for example, the sub-rule $ld \rightarrow st^{local}$ of the per processor rules defined for Weakest 
memory models corresponds to the sub-rules $ld.acq \rightarrow st^{local}$, $ld.acq \rightarrow st.rel^{local}$, $ld \rightarrow st^{local}$, and $ld \rightarrow st.rel^{local}$ for 
Hybrid memory models, where each of these four sub-rules is defined like the $ld \rightarrow st^{local}$ sub-rule for Weakest 
memory models. 
If the type of load or store instruction is not mentioned for a rule or sub-rule, then that rule or sub-rule applies 
to both types of load and store instructions respectively. "X" refers to an $ld$, $ld.acq$, $st^{local}$, or $st.rel^{local}$ event. 
3.4.1 Examples of Hybrid Memory Models 
1. Itanium™: An execution obeys the Itanium memory model if there exists a logical total order of all memory 
events ($ld$, $st$, $ld.acq$, $st.rel$, fence) that obeys Read Value, Coherence, Write Atomicity (case 2) for all $st.rel$ 
store events, and the Per processor order rules, which include the sub-rules $ld.acq \rightarrow X$, $X \rightarrow st.rel$, Memory Data 
Dependence except $ld \rightarrow ld$, and fence. Note that Coherence is defined irrespective of whether a store 
instruction is strong or weak. 
4 Automatic generation of executable specification 
4.1 Strong and Weak memory models 
The operational semantics of the Generalized Weak memory model is described in terms of three data structures 
(see Figure 1), and how each instruction tuple t that is issued updates these data structures and/or returns the 
read value, as per Table 1. By 'buffer' we mean an unbounded structure in which the entries maintain their arrival 
order as in a FIFO, but entries may be removed from anywhere provided a removal condition is satisfied. The 
oldest entry is always at the head and the youngest at the tail. Initially, all buffers are empty. The data structure 
elements are: 
Event | Guard | Actions 
$ld(t)$ (hit) | $\exists$ youngest $t' \in WOB_{p(t)}$ : $a(t') = a(t) \wedge d(t') = d(t)$ | (none) 
$ld(t)$ (miss) | $\neg\exists\, t' \in WOB_{p(t)}$ : $a(t') = a(t)$ | $Issue(RB_{p(t)}, t)$ 
$ld(t)$ (miss) contd. | $t \in RB_{p(t)} \wedge Allowed(RB_{p(t)}, t) \wedge M[a(t)] = d(t)$ | $Delete(RB_{p(t)}, t)$ 
$st^{local}(t)$ | True | $Issue(WOB_{p(t)}, t)$ 
$st^{global}(t)$ | $t \in WOB_{p(t)} \wedge Allowed(WOB_{p(t)}, t)$ | $M[a(t)] \leftarrow d(t)$; $Delete(WOB_{p(t)}, t)$ 
$Fence(t)$ | True | $Flush(t)$ 
Table 1. Transition System 
1. a single port memory $M$ that spans the entire address-space and holds word-sized data in each location. $M$ 
is updated when the $st^{global}(t)$ event of Table 1 fires, which removes an entry from $WOB_{p(t)}$ and writes it into $M$. 
Initially, each location of $M$ carries data 0. 
2. a write out re-order buffer $WOB_i$ into which $st$ instructions are enqueued. When the $st^{global}(t)$ event of 
Table 1 fires and $p(t) = i$, the entry $t$ is removed from $WOB_i$ and atomically copied into $M$. 
3. a re-order load buffer $RB_i$ into which $ld$ instructions are enqueued when the event $ld(t)$ (miss) of Table 1 fires. 
Eventually, $t$ is removed from $RB_i$, and the data $d(t)$ corresponding to this tuple gets returned. 
4.2 State Transition Rules 
Table 1 defines the operational semantics. The first column shows Events that happen if the Guard condition in 
the second column is true, performing the Actions shown in the last column. At any time, any one of the eligible 
events may be picked in a fair manner. Each event happens when the next instruction $t$ is issued by processor $p(t)$ 
($ld(t)$, $st(t)$, $Fence(t)$). Notice that in the case of event $ld(t)$, tuple $t$ carries the data $d(t)$ being returned (following the 
convention used in [7]). When these events fire, a constraint expressed in the Guard field shows what this data is. 
We use $=$ for equality testing, and $\leftarrow$ for assignment. 
$ld(t)$: We seek an entry $t'$ in $WOB_{p(t)}$ such that $a(t') = a(t)$, and $t'$ is the youngest such entry if multiple entries 
exist. If $t'$ exists (hit), the returned data $d(t)$ is the same as $d(t')$. If no such entry exists (miss), $t$ is enqueued 
into $RB_{p(t)}$ via $Issue(RB_{p(t)}, t)$. Eventually, the $ld(t)$ event completes by being serviced by $M$, which provides the 
data $d(t)$. Its guard 'Allowed' captures when tuple $t$, which is present in $RB_{p(t)}$, can be processed ahead of all the 
other tuples within $RB_{p(t)}$. 
$st^{local}(t)$ results in $t$ being enqueued into $WOB_{p(t)}$ via procedure Issue. 
$st^{global}(t)$ updates the memory array $M$ from $WOB_{p(t)}$. Its guard 'Allowed' captures when tuple $t$, which is present 
in $WOB_{p(t)}$, can be processed ahead of all the other tuples within $WOB_{p(t)}$. 
$Fence(t)$ is carried out by procedure Flush, which flushes every pending $RB_{p(t)}$ entry and every $WOB_{p(t)}$ entry 
that comes from $p(t)$ and occurs earlier than $t$ in program order. The functions used in the transition system 
are now described. 
$Allowed(WOB_{p(t)}, t)$: The function evaluates to true if both of the following conditions are satisfied: 
1. $\neg\exists\, t' \in WOB_{p(t)}$ s.t. $l(t') < l(t)$ and 
(a) sub-rule $st \rightarrow st$² or, 
(b) ($a(t') = a(t) \wedge$ Memory Data Dependence sub-rule $st \rightarrow st$). 
2. $\neg\exists\, t' \in RB_{p(t)}$ s.t. $l(t') < l(t)$ and 
(a) sub-rule $ld \rightarrow st$ or, 
(b) ($a(t') = a(t) \wedge$ Memory Data Dependence sub-rule $ld \rightarrow st$). 
² Naming a sub-rule here indicates that the condition is true if the memory model requires that sub-rule to hold. 
$Allowed(RB_{p(t)}, t)$: The function evaluates to true if both of the following conditions are satisfied: 
1. $\neg\exists\, t' \in RB_{p(t)}$ s.t. $l(t') < l(t)$ and 
(a) sub-rule $ld \rightarrow ld$ or, 
(b) ($a(t') = a(t) \wedge$ Memory Data Dependence sub-rule $ld \rightarrow ld$). 
2. $\neg\exists\, t' \in WOB_{p(t)}$ s.t. $l(t') < l(t)$ and 
(a) sub-rule $st \rightarrow ld$ or, 
(b) ($a(t') = a(t) \wedge$ Memory Data Dependence sub-rule $st \rightarrow ld$). 
$Issue(Buffer, t)$: Add $t$ to the tail of the Buffer queue. 
$Delete(Buffer, t)$: This procedure deletes $t$ wherever it may be in Buffer. 
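The guard for draining a store from the write out buffer can be sketched as a predicate over the two buffers of the issuing processor. This is an illustrative reading of the definition above rather than the authors' implementation: the names `Entry` and `allowed_wob`, and the representation of a memory model as two sets of (first, second) operation pairs, are assumptions.

```python
from typing import List, NamedTuple, Set, Tuple

Pair = Tuple[str, str]


class Entry(NamedTuple):
    l: int   # label of the buffered instruction (program order position)
    a: str   # memory address it accesses


def allowed_wob(t: Entry, wob: List[Entry], rb: List[Entry],
                rules: Set[Pair], mdd: Set[Pair]) -> bool:
    """True when no older entry in WOB or RB is required to precede store t.

    rules : ordering sub-rules the model requires, e.g. {("st", "st")}
    mdd   : Memory Data Dependence sub-rules the model requires (same address only)
    """
    def blocks(older: Entry, first_op: str) -> bool:
        if older.l >= t.l:
            return False                       # not earlier in program order
        pair = (first_op, "st")
        return pair in rules or (older.a == t.a and pair in mdd)

    return (not any(blocks(e, "st") for e in wob)
            and not any(blocks(e, "ld") for e in rb))


# Hypothetical rule sets transcribed from Section 3.2.2.
ALL_MDD = {("ld", "ld"), ("ld", "st"), ("st", "ld"), ("st", "st")}
TSO = {("ld", "ld"), ("ld", "st"), ("st", "st")}
PSO = {("ld", "ld"), ("ld", "st")}

wob = [Entry(l=0, a="x"), Entry(l=1, a="y")]
print(allowed_wob(wob[1], wob, [], TSO, ALL_MDD))   # younger store under TSO
print(allowed_wob(wob[1], wob, [], PSO, ALL_MDD))   # same store under PSO
```

With these inputs the younger store to a different address is held back under TSO, where $st \rightarrow st$ is required, but may drain first under PSO.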
Although we designed a Generalized Weak memory model, Strong memory models can also be designed with the 
same data structures and operational semantics, except that a load instruction can no longer hit the WOB buffer 
(local bypassing), because Strong memory models do not support local bypassing. 
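The way such an operational model generates all outcomes of a small program can be illustrated with a reduced state-space exploration. The sketch below is a simplification under stated assumptions, not the generated specification itself: loads are serviced at issue time in program order (so no RB is modeled), each WOB drains in FIFO order ($st \rightarrow st$), fences are omitted, and the flag `bypass` switches between a TSO-like machine and an SC-like machine in which a load waits for its own WOB to empty. All names are hypothetical.

```python
# Hypothetical store-buffering litmus test: (operation, address, value or register).
PROGRAM = {
    0: [("st", "x", 1), ("ld", "y", "r0")],
    1: [("st", "y", 1), ("ld", "x", "r1")],
}


def outcomes(program, bypass):
    """Explore every run of the simplified machine and collect final register values."""
    procs = sorted(program)
    start = (tuple(0 for _ in procs),   # next label to issue, per processor
             tuple(() for _ in procs),  # WOB per processor: FIFO of (addr, data)
             (), ())                    # memory M and registers as sorted item tuples
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, wobs, mem, regs = stack.pop()
        succs = []
        for k, p in enumerate(procs):
            if wobs[k]:                 # st_global: oldest WOB entry is written to M
                (addr, data), rest = wobs[k][0], wobs[k][1:]
                m = dict(mem)
                m[addr] = data
                succs.append((pcs, wobs[:k] + (rest,) + wobs[k + 1:],
                              tuple(sorted(m.items())), regs))
            if pcs[k] < len(program[p]):
                op, addr, x = program[p][pcs[k]]
                npcs = pcs[:k] + (pcs[k] + 1,) + pcs[k + 1:]
                if op == "st":          # st_local: enqueue at the tail of the WOB
                    grown = wobs[k] + ((addr, x),)
                    succs.append((npcs, wobs[:k] + (grown,) + wobs[k + 1:], mem, regs))
                elif bypass or not wobs[k]:
                    # ld: hit the youngest matching WOB entry, otherwise read M
                    hits = [d for a, d in wobs[k] if a == addr]
                    value = hits[-1] if hits else dict(mem).get(addr, 0)
                    r = dict(regs)
                    r[x] = value
                    succs.append((npcs, wobs, mem, tuple(sorted(r.items()))))
        if not succs:                   # all instructions issued and all WOBs empty
            finals.add(regs)
        for nxt in succs:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return finals


print(sorted(outcomes(PROGRAM, bypass=False)))
print(sorted(outcomes(PROGRAM, bypass=True)))
```

With `bypass=True` the outcome in which both loads return 0 becomes reachable, because each load may read $M$ while the other processor's store is still buffered; with `bypass=False` that outcome is excluded.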
4.3 Weakest and Hybrid memory models 
The operational semantics of the Generalized Weakest memory model is described in terms of four data structures 
(see Figure 1), and how each instruction tuple $t$ that is issued updates these data structures and/or returns the 
read value, as per Table 2. Initially, all buffers are empty. The data structure elements are: 
Event | Guard | Actions 
$ld(t)$ (hit) | $\exists$ youngest $t' \in WOB_{p(t)}$ : $a(t') = a(t) \wedge d(t') = d(t)$ | (none) 
$ld(t)$ (miss) | $\neg\exists\, t' \in WOB_{p(t)}$ : $a(t') = a(t)$ | $Issue(RB_{p(t)}, t)$ 
$ld(t)$ (miss) contd. | $t \in RB_{p(t)} \wedge Allowed(RB_{p(t)}, t) \wedge M_{p(t)}[a(t)] = d(t)$ | $Delete(RB_{p(t)}, t)$ 
$st^{local}(t)$ | True | $Issue(WOB_{p(t)}, t)$ 
$st^{global}(t)$ | $t \in WOB_{p(t)} \wedge Allowed(WOB_{p(t)}, t)$ | $\forall$ processors $i$: $Issue(WIB_i, t)$; $Delete(WOB_{p(t)}, t)$ 
$st^{global}(t)$ contd. | $t \in WIB_i \wedge Allowed(WIB_i, t)$ | $M_i[a(t)] \leftarrow d(t)$; $Delete(WIB_i, t)$ 
$Fence(t)$ | True | $Flush(t)$ 
Table 2. Transition System 
1. a memory $M_i$ per processor $i$ that spans the entire address-space and holds word-sized data in each location. 
$M_i$ is updated when the $st^{global}(t)$ contd. event of Table 2 fires, which removes an entry from $WIB_i$ and 
writes it into $M_i$. Initially, each location of $M_i$ carries data 0. 
2. a write out re-order buffer $WOB_i$ into which $st$ instructions are enqueued. When the $st^{global}(t)$ event of 
Table 2 fires, the entry $t$ is removed from $WOB_{p(t)}$ and atomically copied into $WIB_i$ for every processor $i$. 
3. a re-order load buffer $RB_i$ into which $ld$ instructions are enqueued when the event $ld(t)$ (miss) of Table 2 fires. 
Eventually, $t$ is removed from $RB_i$, and the data $d(t)$ corresponding to this tuple gets returned. 
4. a write in re-order buffer $WIB_i$ into which $st$ instructions are enqueued. When the $st^{global}(t)$ contd. event 
of Table 2 fires, the entry $t$ is removed from $WIB_i$ and atomically copied into $M_i$. 
4.4 State Transition Rules 
Table 2 defines the operational semantics. 
$ld(t)$: We seek an entry $t'$ in $WOB_{p(t)}$ such that $a(t') = a(t)$, and $t'$ is the youngest such entry if multiple entries 
exist. If $t'$ exists (hit), the returned data $d(t)$ is the same as $d(t')$. If no such entry exists (miss), $t$ is enqueued 
into $RB_{p(t)}$ via $Issue(RB_{p(t)}, t)$. Eventually, the $ld(t)$ event completes by being serviced by $M_{p(t)}$, which provides 
the data $d(t)$. Its guard 'Allowed' captures when tuple $t$, which is present in $RB_{p(t)}$, can be processed ahead of all 
the other tuples within $RB_{p(t)}$. 
$st^{local}(t)$ results in $t$ being enqueued into $WOB_{p(t)}$ via procedure Issue. 
$st^{global}(t)$ moves $t$ from $WOB_{p(t)}$ into $WIB_i$ for every processor $i$; its continuation then updates the memory array $M_i$ 
from $WIB_i$. The guard 'Allowed' of the continuation captures when tuple $t$, which is present 
in $WIB_i$, can be processed ahead of all the other tuples within $WIB_i$. 
$Fence(t)$ is carried out by procedure Flush, which flushes every pending $RB_{p(t)}$ entry and every $WOB_{p(t)}$ entry 
that comes from $p(t)$ and occurs earlier than $t$ in program order. The functions $Allowed(WOB_{p(t)}, t)$, 
$Allowed(RB_{p(t)}, t)$, $Issue(Buffer, t)$ and $Delete(Buffer, t)$ used in the transition system are the same as those of the 
Generalized Weak memory model. The function $Allowed(WIB_{p(t)}, t)$ is now described. 
$Allowed(WIB_{p(t)}, t)$: The function evaluates to true if $\neg\exists\, t' \in WIB_{p(t)}$ s.t. $l(t') < l(t)$ and either the sub-rule $st \rightarrow st$, or 
$a(t') = a(t)$ together with the Memory Data Dependence sub-rule $st \rightarrow st$, is required by the memory model. 
Although we designed a Generalized Weakest memory model, Hybrid memory models can also be designed with 
the same data structures and operational semantics; because of the multiple load and store instruction types, 
there will be additional rules pertaining to these new instructions, for example, reordering between $st$ and $st.rel$ 
in WOB and WIB. The operational model for Itanium has been designed [?] and can be consulted to understand 
how our Generalized Weakest memory model can be easily extended to handle Hybrid memory models. 
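The effect of per-processor memories and write in buffers can be illustrated with a reduced exploration of Table 2. The sketch rests on several assumptions that go beyond the text: loads are serviced at issue time in program order, each WOB drains in FIFO order, fences are omitted, and the label comparison in $Allowed(WIB, t)$ is applied only between stores issued by the same processor. The four-processor litmus program and all names are hypothetical.

```python
# Hypothetical litmus test: two independent writers and two readers.
IRIW = {
    0: [("st", "x", 1)],
    1: [("st", "y", 1)],
    2: [("ld", "x", "a"), ("ld", "y", "b")],
    3: [("ld", "y", "c"), ("ld", "x", "d")],
}


def put(seq, k, item):
    """Return a copy of tuple seq with position k replaced by item."""
    return seq[:k] + (item,) + seq[k + 1:]


def explore(program):
    """Collect final register values of a simplified Table 2 machine."""
    procs = sorted(program)
    empty = tuple(() for _ in procs)
    start = (tuple(0 for _ in procs), empty, empty, empty, ())
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, wobs, wibs, mems, regs = stack.pop()
        succs = []
        for k, p in enumerate(procs):
            if wobs[k]:
                # st_global: move the oldest WOB entry into the WIB of every processor
                entry = (k,) + wobs[k][0]
                succs.append((pcs, put(wobs, k, wobs[k][1:]),
                              tuple(w + (entry,) for w in wibs), mems, regs))
            for j, (src, addr, data) in enumerate(wibs[k]):
                # st_global contd.: allowed if no older entry from the same source
                if all(other != src for other, _, _ in wibs[k][:j]):
                    m = dict(mems[k])
                    m[addr] = data
                    succs.append((pcs, wobs,
                                  put(wibs, k, wibs[k][:j] + wibs[k][j + 1:]),
                                  put(mems, k, tuple(sorted(m.items()))), regs))
            if pcs[k] < len(program[p]):
                op, addr, x = program[p][pcs[k]]
                npcs = put(pcs, k, pcs[k] + 1)
                if op == "st":          # st_local: enqueue at the tail of the WOB
                    succs.append((npcs, put(wobs, k, wobs[k] + ((addr, x),)),
                                  wibs, mems, regs))
                else:                   # ld: own WOB first, then own memory M_k
                    hits = [d for a, d in wobs[k] if a == addr]
                    value = hits[-1] if hits else dict(mems[k]).get(addr, 0)
                    r = dict(regs)
                    r[x] = value
                    succs.append((npcs, wobs, wibs, mems, tuple(sorted(r.items()))))
        if not succs:                   # everything issued and every buffer drained
            finals.add(regs)
        for nxt in succs:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return finals


results = explore(IRIW)
print((("a", 1), ("b", 0), ("c", 1), ("d", 0)) in results)
```

The queried outcome has the two readers observing the two stores in opposite orders, which is possible here because each $WIB_i$ may apply stores from different processors in a different order; a single shared memory $M$, as in Table 1, rules it out when loads stay in program order.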
5 Conclusion 
