Mothers of Pipelines  by Krstić, Sava et al.
Mothers of Pipelines
Sava Krstić, Robert B. Jones, and John O’Leary 1,2
Strategic CAD Labs, Intel Corporation, Hillsboro, Oregon, USA
Abstract
We present a method for pipeline veriﬁcation using SMT solvers. It is based on a non-deterministic “mother
pipeline” machine (MOP) that abstracts the instruction set architecture (ISA). The MOP vs. ISA cor-
rectness theorem splits naturally into a large number of simple subgoals. This theorem reduces proving the
correctness of a given pipelined implementation of the ISA to verifying that each of its transitions can be
modeled as a sequence of MOP state transitions.
Keywords: Pipeline veriﬁcation, stuttering simulation, conﬂuence, SMT solvers
1 Introduction
Proving correctness of microarchitectural processor pipelines (MA) with respect to
their instruction set architecture (ISA) amounts to establishing a simulation relation
between the behaviors of MA and ISA. There are diﬀerent ways in the literature to
formulate the correctness theorem that relates the steps of the two machines [1], but
the complexity of the MA’s step function remains the major impediment to practical
veriﬁcation. The challenge is to ﬁnd a systematic way to break the veriﬁcation eﬀort
into manageable pieces.
We propose a solution based on the obvious fact that the execution of any
instruction can be seen as a sequence of smaller actions (let us call them mini-steps in
this informal overview), and the observation that the mini-steps can be understood
at an abstract level, without mentioning any concrete MA. Examples of mini-steps
are fetching an instruction, getting an operand from the register ﬁle, having an
operand forwarded by a previous instruction in the pipeline, writing a result to the
register ﬁle, and retiring. We introduce an intermediate speciﬁcation MOP between
ISA and MA that describes the execution of each instruction as a sequence of
mini-steps. By design, our highly non-deterministic intermediate specification
admits a broad range of implementations. For example, MOP admits implementations
that are out-of-order or not, speculative or not, superscalar or not, etc. This
approach allows us to separate the implementation-independent proof obligations
that relate ISA to MOP from those that rely upon the details of the MA. This can
potentially amortize some of the proof effort over several different designs.
1 Thanks to Arvind and Jesse Bingham for their comments.
2 Email: {sava.krstic,robert.b.jones,john.w.oleary}@intel.com
Electronic Notes in Theoretical Computer Science 174 (2007) 7–22
1571-0661 © 2007 Elsevier B.V.
www.elsevier.com/locate/entcs
doi:10.1016/j.entcs.2006.11.036
Open access under CC BY-NC-ND license.
The concept of parcels, formalizing partially-executed instructions, will be needed
for a thorough treatment of mini-steps. We will follow the intuition that from any
given state of any MA one can always extract the current state of its ISA compo-
nents and infer a queue of parcels currently present in the MA pipeline. In Section 2,
we give a precise deﬁnition of a transition system MOP whose states are pairs of
the form 〈ISA state, queue of parcels〉, and whose transitions are mini-steps as
described above. Intuitively, it is clear that with a suﬃciently complete set of mini-
steps we will be able to model any MA step in this transition system as a sequence
of mini-steps. Similarly, it should be possible to express any ISA step as a sequence
of mini-steps of MOP .
Figure 1 indicates that correctness of a microarchitecture MA with respect to
ISA is implied by correctness results that relate these machines with MOP . In Sec-
tion 3, we will prove the crucial MOP vs. ISA correctness property: despite its non-
determinism, all MOP executions correspond to ISA executions. The proof rests
on the local conﬂuence of MOP—a technique pioneered by Shen and Arvind [15].
We ﬁrst prove the correspondence in the Burch-Dill style, and then extend it to
stuttering bisimulation between bounded MOP and ISA.
[Figure 1: diagram relating the microarchitectures MA1, ..., MAn, the machine MOP, and ISA.]
Fig. 1. With transitions that express atomic steps in instruction execution, a mother of pipelines MOP
simulates the ISA and its multiple microarchitectural implementations. Simulation à la Burch-Dill flushing
justifies the arrow from MOP to ISA.
The MA vs. MOP relationship is discussed in Section 4. We will see that all
one needs to prove here is a precise form of the simulation mentioned above: there
exists an abstraction function that maps MA states to MOP states such that for
any two states joined by a MA transition, the corresponding MOP states are joined
by a sequence of mini-steps.
MA vs. MOP vs. ISA correctness theorems systematically reduce to numerous
subgoals, suitable for automated SMT solvers (“satisﬁability modulo theories”). We
used CVC Lite [4] and our initial experience is discussed in Section 5.
2 MOP Deﬁnition
The MOP deﬁnition depends on the ISA and the class of pipelined implementations
that we are interested in. The particular MOP described in this section has a
simple load-store ISA and can model complex superscalar implementations with
out-of-order execution and speculation.
2.1 The Instruction Set Architecture
instruction imem.pc      actions
opc1 dest src1 src2      pc := pc + 4    rf.dest := alu opc1 (rf.src1) (rf.src2)
opc2 dest src1 imm       pc := pc + 4    rf.dest := alu opc2 (rf.src1) imm
ld dest src1 offset      pc := pc + 4    rf.dest := mem.(rf.src1 + offset)
st src1 dest offset      pc := pc + 4    mem.(rf.dest + offset) := rf.src1
opc3 reg offset          pc := target if taken, pc + 4 otherwise, where
                           target = get_target pc offset
                           taken = get_taken (get_test opc3) (rf.reg)
Fig. 2. ISA instruction classes (left column) and corresponding transitions. The variables dest , src1 , src2 , reg
have type Reg, and imm, oﬀset have type Word. For the three opcodes, we have opc1 ∈ {add, sub,mult},
opc2 ∈ {addi, subi,multi}, opc3 ∈ {beqz, bnez, j}.
ISA is a deterministic transition system with system variables pc : IAddr, rf : RF,
mem : MEM, imem : IMEM. We assume the types Reg and Word of registers and
machine words, so that rf can be viewed as a Reg-indexed array with Word values.
Similarly, mem can be viewed as a Word-indexed array with values in Word, while
imem is an IAddr-indexed array with values in the type Instr of instructions.
Instructions fall into five classes that are identified by the predicates alu_reg,
alu_imm, ld, st, branch. The form of an instruction of each class is given in
Figure 2. The figure also shows the ISA transitions, that is, the change-of-state
equations defined separately for each instruction class.
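As a minimal sketch, the ISA transitions of Figure 2 can be rendered as an executable step function. The paper leaves alu, get_target, get_test and get_taken abstract, so the 32-bit word size, the pc-relative target computation, the zero tests for beqz/bnez, and the tuple encoding of instructions are all assumptions made only for this illustration.

```python
from dataclasses import dataclass

MASK = 0xFFFFFFFF  # assumption: 32-bit machine words

ALU_OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "mult": lambda a, b: (a * b) & MASK,
}
IMM_TO_REG_OP = {"addi": "add", "subi": "sub", "multi": "mult"}


def get_target(pc, offset):
    # Assumption: pc-relative targets; the paper keeps get_target uninterpreted.
    return pc + offset


def get_taken(opc, value):
    # Assumption: beqz/bnez compare against zero, j is always taken.
    # (This folds the paper's get_test into get_taken.)
    return {"beqz": value == 0, "bnez": value != 0, "j": True}[opc]


@dataclass(frozen=True)
class IsaState:
    pc: int
    rf: dict    # register name -> word
    mem: dict   # address -> word
    imem: dict  # instruction address -> instruction tuple


def isa_step(s):
    """One ISA transition, following the five instruction classes of Figure 2."""
    instr = s.imem[s.pc]
    opc = instr[0]
    rf, mem, pc = dict(s.rf), dict(s.mem), s.pc + 4
    if opc in ALU_OPS:                      # alu_reg class
        _, dest, src1, src2 = instr
        rf[dest] = ALU_OPS[opc](rf.get(src1, 0), rf.get(src2, 0))
    elif opc in IMM_TO_REG_OP:              # alu_imm class
        _, dest, src1, imm = instr
        rf[dest] = ALU_OPS[IMM_TO_REG_OP[opc]](rf.get(src1, 0), imm)
    elif opc == "ld":
        _, dest, src1, offset = instr
        rf[dest] = mem.get((rf.get(src1, 0) + offset) & MASK, 0)
    elif opc == "st":
        _, src1, dest, offset = instr
        mem[(rf.get(dest, 0) + offset) & MASK] = rf.get(src1, 0)
    else:                                   # branch class
        _, reg, offset = instr
        value = rf.get(reg, 0) if reg is not None else 0
        if get_taken(opc, value):
            pc = get_target(s.pc, offset)
    return IsaState(pc, rf, mem, s.imem)
```

Unset registers and memory cells read as 0 here, which is a convenience of the sketch rather than part of the ISA definition.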
2.2 State
Parcels are records with the following ﬁelds:
instr : Instr⊥    my_pc : IAddr⊥    dest, src1, src2 : Reg⊥
imm : Word⊥    opc : Opcode⊥    data1, data2, res, mem_addr : Word⊥
tkn : bool⊥    next_pc : IAddr⊥    wb : {⊥,}    pc_upd : {⊥, , ,}
The subscript ⊥ to a type indicates the addition of the element ⊥ (“undefined”)
to the type. The empty parcel has all fields equal to ⊥. The field wb indicates
whether the parcel has written back to the register file (for arithmetic parcels and
loads) or to the memory (for stores). Similarly, pc_upd indicates whether the parcel
has caused the update of pc. The two additional values of pc_upd record, respectively,
that the parcel has updated pc speculatively and that it is mispredicted.
In addition to the architected state components pc, rf , mem, imem, the state
of MOP contains integers head and tail , and a queue of parcels q . The queue is
represented by an integer-indexed array with head and tail deﬁning its front and
back ends. We write idx j as an abbreviation for the predicate head ≤ j ≤ tail ,
saying that j is a valid index in q. The jth parcel in q is denoted q.j.
2.3 Transitions
The transitions of MOP are deﬁned by the rules given in Figure 3. Each rule is a
guarded parallel assignment, where DEFN contains local deﬁnitions, GUARD is the
set of predicates deﬁning the rule’s domain, and ASSIGN are the assignments made
when the rule ﬁres. Some rules contain additional predicates and functions, deﬁned
next.
The rule decode requires the predicate decoded p ≡ p.opc ≠ ⊥ and the function
decode that updates the parcel field opc and some of the fields dest, src1, src2, imm.
This update depends on the instruction class of p.instr, as in the following table.
instruction      opc    dest   src1   src2   imm
ADD R1 R2 R3     add    R1     R2     R3     ⊥
ADDI R1 R2 17    addi   R1     R2     ⊥      17
LD R1 R2 17      ld     R1     R2     ⊥      17
ST R1 R2 17      st     ⊥      R1     R2     17
BEQZ R1 17       beqz   ⊥      R1     ⊥      17
J 17             j      ⊥      ⊥      ⊥      17
To specify how a given parcel should receive its data1 and data2 (from the register
file or by forwarding) we use the predicates no_mrw r j ≡ (S = ∅) and mrw r j k ≡
(S ≠ ∅ ∧ max S = k), where S = {k | k < j ∧ idx k ∧ q.k.dest = r}. The former
holds when the parcel q.j needs no forwarding for a given register r, and the latter
gives the position k of the forwarding parcel (mrw = “most recent write”).
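The two forwarding predicates can be sketched directly from the definition of the set S. The encoding below is an assumption of this illustration only: the queue q is a dictionary from integer positions to parcels, each parcel is a dictionary, and ⊥ is represented by None (so r is expected to be an actual register name).

```python
def idx(head, tail, j):
    """idx j: j is a valid position in the queue."""
    return head <= j <= tail


def writers_before(q, head, tail, r, j):
    """The set S = {k | k < j, idx k, q.k.dest = r} from the text."""
    return {k for k in range(head, j)
            if idx(head, tail, k) and q[k].get("dest") == r}


def no_mrw(q, head, tail, r, j):
    """No parcel in front of q.j writes register r: read r from the register file."""
    return not writers_before(q, head, tail, r, j)


def mrw(q, head, tail, r, j, k):
    """q.k is the most recent writer of r in front of q.j: forward from q.k."""
    s = writers_before(q, head, tail, r, j)
    return bool(s) and max(s) == k
```

For example, with writers of r1 at positions 3 and 5, the parcel at position 6 must take r1 from position 5, while the parcel at position 5 must take it from position 3.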
The rule write_back allows parcels to write back to the register file out of
order. The parcel q.j can write back assuming (1) it is not mispredicted, and (2)
there are no parcels in front of it that write to the same register or that have not
fetched an operand from that register. These conditions are expressed by the predicates
fit j ≡ ⋀_{head < j′ ≤ j} fit_at j′   and   valid_data_upto j ≡ ⋀_{head ≤ j′ ≤ j} valid_data j′,
where
fit_at j ≡ q.j.my_pc = q.(j − 1).next_pc ≠ ⊥
valid_data j ≡ q.j.data1 ≠ ⊥ ∧ (alu_reg q.j ⇒ q.j.data2 ≠ ⊥)
Memory access rules (load and store) enforce in-order execution of loads and
stores. The existence and the location of the most recent memory access parcel
are described by predicates mrma (“most recent memory access”) and no_mrma,
analogous to mrw and no_mrw above: one has mrma j k when k is the largest valid
index such that k < j and q.k is a load or store; and one has no_mrma j when no
such number k exists. The completion of a parcel’s memory access is formulated by
ma_complete p ≡ (load p ∧ p.res ≠ ⊥) ∨ (store p ∧ p.wb = )
The next four rules in Figure 3 (with branch in their name) cover the computation
of the next_pc value of a parcel, and the related test of whether the branch
is taken and (if so) the computation of the target address. The functions get_taken
and get_target are the same ones used by the ISA.

fetch
  DEFN    i = imem.pc
  GUARD   length = 0 ∨ q.tail.pc_upd ∈ { ,}
  ASSIGN  q.(tail + 1) := empty_parcel [instr → i, my_pc → pc],  tail := tail + 1
decode j
  DEFN    p = q.j
  GUARD   idx j,  ¬(decoded p)
  ASSIGN  p := decode p
data1_rf j
  DEFN    p = q.j
  GUARD   idx j,  decoded p,  p.src1 ≠ ⊥,  p.data1 = ⊥,  no_mrw (p.src1) j
  ASSIGN  p.data1 := rf.(p.src1)
data1_forward j
  DEFN    p = q.j,  p̄ = q.k, where mrw (p.src1) j k
  GUARD   idx j,  decoded p,  p.src1 ≠ ⊥,  p̄.res ≠ ⊥,  p.data1 = ⊥
  ASSIGN  p.data1 := p̄.res
result j
  DEFN    p = q.j,  d = p.data1,  d′ = p.data2 if alu_reg p, p.imm if alu_imm p
  GUARD   idx j,  p.data1 ≠ ⊥,  p.res = ⊥,  (alu_reg p ∧ p.data2 ≠ ⊥) ∨ alu_imm p
  ASSIGN  p.res := alu p.opc d d′
mem_addr j
  DEFN    p = q.j,  d = p.data1,  d′ = p.data2
  GUARD   idx j,  p.mem_addr = ⊥,  (ld p ∧ d ≠ ⊥) ∨ (st p ∧ d′ ≠ ⊥)
  ASSIGN  p.mem_addr := d + p.imm if ld p, d′ + p.imm if st p
write_back j
  DEFN    p = q.j
  GUARD   idx j,  alu_reg p ∨ alu_imm p ∨ ld p,  fit j,  valid_data_upto j,
          no_mrw (p.dest) j,  p.res ≠ ⊥,  p.wb = ⊥
  ASSIGN  rf.(p.dest) := p.res,  p.wb := 
load j
  DEFN    p = q.j
  GUARD   idx j,  ld p,  p.mem_addr ≠ ⊥,  p.res = ⊥,
          no_mrma j ∨ (mrma j k ∧ ma_complete q.k)
  ASSIGN  p.res := mem.(p.mem_addr)
store j
  DEFN    p = q.j
  GUARD   idx j,  st p,  p.mem_addr ≠ ⊥,  p.data1 ≠ ⊥,  p.wb = ⊥,  fit j,
          no_mrma j ∨ (mrma j k ∧ ma_complete q.k)
  ASSIGN  mem.(p.mem_addr) := p.data1,  p.wb := 
branch_target j
  DEFN    p = q.j
  GUARD   idx j,  branch p,  decoded p,  p.res = ⊥
  ASSIGN  p.res := get_target (p.my_pc) (p.imm)
branch_taken j
  DEFN    p = q.j,  t = get_test (p.opc)
  GUARD   idx j,  branch p,  decoded p,  p.data1 ≠ ⊥,  p.tkn = ⊥
  ASSIGN  p.tkn := get_taken t (p.data1)
next_pc_branch j
  DEFN    p = q.j
  GUARD   idx j,  branch p,  p.tkn ≠ ⊥,  p.res ≠ ⊥,  p.next_pc = ⊥
  ASSIGN  p.next_pc := p.res if p.tkn, (p.my_pc) + 4 otherwise
next_pc_nonbranch j
  DEFN    p = q.j
  GUARD   idx j,  ¬(branch p),  decoded p,  p.next_pc = ⊥
  ASSIGN  p.next_pc := (p.my_pc) + 4
pc_update
  DEFN    p = q.tail
  GUARD   length > 0,  decoded p,  p.next_pc ≠ ⊥,  p.pc_upd ≠ 
  ASSIGN  pc := p.next_pc,  p.pc_upd := 
speculate
  DEFN    p = q.tail
  GUARD   length > 0,  decoded p,  branch p,  p.pc_upd = ⊥,  p.next_pc = ⊥
  ASSIGN  pc := branch_predict p.my_pc,  p.pc_upd := 
prediction_ok j
  DEFN    p = q.j
  GUARD   idx j,  idx (j + 1),  p.pc_upd = ,  fit_at (j + 1)
  ASSIGN  p.pc_upd := 
squash j
  DEFN    p = q.j
  GUARD   idx j,  idx (j + 1),  p.pc_upd = ,  ¬(fit_at (j + 1)),  p.next_pc ≠ ⊥
  ASSIGN  tail := j,  p.pc_upd := 
retire
  GUARD   length > 0,  complete (q.head)
  ASSIGN  head := head + 1
Fig. 3. MOP transitions. The rules data2_rf and data2_forward are analogous to data1_rf and
data1_forward, and are not shown.
The rules pc_update and speculate govern the program counter updating.
The first is based on the next_pc value of the last parcel and implements the regular
ISA flow. The second implements practically unconstrained speculative updating
of the program counter, specified by an arbitrary branch_predict function.
Note that the status of a speculating branch changes when its next_pc value
is computed; if the prediction is correct (matches my_pc of the next parcel), the
change is modeled by rule prediction_ok. And if the next_pc value turns out
wrong, rule squash becomes enabled, effecting removal of all parcels in the shadow
of the mispredicted branch.
Rule retire ﬁres only for parcels that have completed their expected modiﬁcation
of the architected state. The predicate complete p is deﬁned by (p.wb = ) ∧
(p.pc upd = ) for non-branches, and by p.pc upd =  for branches.
3 MOP Correctness
We call MOP states with empty queues ﬂushed and consider them the initial states
of the MOP transition system. The map γ : s −→ 〈s, empty queue〉 establishes a
bijection from ISA states to ﬂushed MOP states.
Note that MOP simulates ISA: if s and s′ are two consecutive ISA states, then
there exists a sequence of MOP transitions that leads from γ(s) to γ(s′). The
sequence begins with fetch and proceeds depending on the class of the instruction
that was fetched, keeping the queue size equal to one until the last transition retire.
One can prove with little effort that a requisite sequence from γ(s) to γ(s′) can
always be found within the set described by the strategy
fetch ; decode ; (data1_rf [] (data1_rf ; data2_rf)) ;
(result [] mem_addr [] (branch_taken ; branch_target)) ; [load [] store] ;
(next_pc_branch [] next_pc_nonbranch) ; pc_update ; retire
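To make the shape of this strategy concrete, the following sketch enumerates the rule sequences it denotes, exactly as the expression is printed above. Two readings are assumptions of the sketch: ";" is sequencing and "[]" is choice, and a bracketed group "[ ... ]" is optional. The combinator names rule, seq, alt and opt are hypothetical helpers, not part of the paper.

```python
from itertools import product


def rule(name):
    """A strategy consisting of a single rule."""
    return [(name,)]


def seq(*parts):
    """Sequencing (';'): concatenate one choice from each part, in order."""
    return [sum(combo, ()) for combo in product(*parts)]


def alt(*parts):
    """Choice ('[]'): any sequence of any alternative."""
    return [s for part in parts for s in part]


def opt(part):
    """Optional group ('[ ... ]'): either nothing or the group."""
    return [()] + part


STRATEGY = seq(
    rule("fetch"),
    rule("decode"),
    alt(rule("data1_rf"), seq(rule("data1_rf"), rule("data2_rf"))),
    alt(rule("result"), rule("mem_addr"),
        seq(rule("branch_taken"), rule("branch_target"))),
    opt(alt(rule("load"), rule("store"))),
    alt(rule("next_pc_branch"), rule("next_pc_nonbranch")),
    rule("pc_update"),
    rule("retire"),
)
```

Only some of the enumerated sequences are enabled for a given instruction class; the strategy merely bounds the set in which the requisite sequence is searched.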
For the proof that ISA simulates MOP (Theorem 3.4 below), we use the ap-
proach introduced by Shen and Arvind [15].
A MOP invariant is a property that holds for all states reachable from initial
(ﬂushed) states. Local conﬂuence is MOP ’s fundamental invariant.
Theorem 3.1 Restricted to reachable states, MOP is locally conﬂuent.
We omit the proof of Theorem 3.1. Note, however, that the proof of local confluence
breaks down into lemmas, one for each pair of rules. For MOP, most of the cases
are resolved by rule commutation: if m1 ←ρ1− m −ρ2→ m2 (i.e., ρi applies to the
state m and leads from it to mi), then m1 −ρ2→ m′ ←ρ1− m2, for some m′. For
the sake of illustration, we show in Figure 4 three examples when local confluence
requires non-trivial resolution. Diagrams 1 and 2 show two ways of resolving the
confluence of the rule pair (fetch, pc_update). Note that both rules are enabled
only when q.tail.pc_upd = . Thus, the parcel q.tail must be a branch, and the
fetch is speculative. Diagram 1 applies when the speculation goes wrong, Diagram
2 when the fetched parcel is correct. (In Diagram 2, t is the index of the branch
at the tail of the original queue.) Diagram 3 shows local confluence for the pair
(retire, data1_forward j) when mrw (q.j.src1) j head holds.
[Figure 4: three commuting diagrams. Diagram 1 involves the rules fetch, pc_update, prediction_ok head, and fetch; Diagram 2 involves fetch, pc_update, and (squash t) ; pc_update; Diagram 3 involves retire, data1_forward j, data1_rf j, and retire.]
Fig. 4. Example non-trivial cases of local confluence
The second fundamental property of MOP is related to termination. Even
though MOP is not terminating (of course), every inﬁnite run must have an in-
ﬁnite number of fetches:
Lemma 3.2 Without the rule fetch, MOP (on reachable states) is terminating
and locally conﬂuent.
Proof. Every MOP rule except fetch either reduces the size of the queue, or makes
measurable progress in at least one of the ﬁelds of one parcel, while keeping all other
ﬁelds the same. Measurable progress means going from ⊥ to a non-⊥ value, or, in
the case of the pc upd ﬁeld, going up in the ordering ⊥ ≺ ≺ ≺ . This ﬁnishes
the proof of termination. Local conﬂuence of MOP without fetch follows from a
simple analysis of the (omitted) proof of Theorem 3.1. 
Let us say that a MOP state is irreducible if none of the rules, except possibly
fetch, applies to it. It follows from Lemma 3.2, together with Newman’s Lemma
[3], that for every reachable state m there exists a unique irreducible state which
can be reached from m using non-fetch rules. This state will be denoted |m|.
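The role of Newman’s Lemma here can be illustrated on a toy rewriting system; this is a sketch of the general principle, not of MOP itself. In the hypothetical system below, states are tuples of integers and rule i swaps the adjacent entries i and i+1 when they are out of order. The system is terminating (each step removes an inversion) and locally confluent, so every state has a unique irreducible state, independent of the order in which rules fire.

```python
def swap_rule(i):
    """Rule i: swap entries i and i+1 if out of order; None when not enabled."""
    def rule(xs):
        if i + 1 < len(xs) and xs[i] > xs[i + 1]:
            ys = list(xs)
            ys[i], ys[i + 1] = ys[i + 1], ys[i]
            return tuple(ys)
        return None
    return rule


def normal_form(state, rules):
    """Fire the first enabled rule repeatedly until the state is irreducible."""
    while True:
        for r in rules:
            nxt = r(state)
            if nxt is not None:
                state = nxt
                break
        else:
            return state


def all_normal_forms(state, rules):
    """Explore every rule order and collect all reachable irreducible states."""
    seen, stack, result = set(), [state], set()
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        successors = [t for t in (r(s) for r in rules) if t is not None]
        if not successors:
            result.add(s)
        stack.extend(successors)
    return result
```

In the same way, |m| may be computed by firing non-fetch rules in any order, which is what makes the definition of |m| meaningful.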
Lemma 3.3 For every reachable state m, the irreducible state |m| is ﬂushed.
Proof. Suppose the queue of |m| is not empty and let p be its head parcel. We
need to consider separately the cases defined by the instruction class of p. All cases
being similar, we will give a proof only for one: when p is a conditional branch.
Since decode does not apply to it, p must be fully decoded. Since data1_rf
does not apply to p, we must have p.data1 ≠ ⊥ (other conditions in the guard of
data1_rf are true). Now, since branch_taken and branch_target do not apply,
we can conclude that p.res ≠ ⊥ and p.tkn ≠ ⊥. This, together with the fact that
next_pc_branch does not apply, implies p.next_pc ≠ ⊥. Now, if p.pc_upd = ,
then retire would apply. Thus, we must have p.pc_upd ≠ . Since pc_update
does not apply, we must have head ≠ tail, so the queue has length at least 2. If
p.pc_upd = , then either squash or prediction_ok would apply to the parcel
p. Thus, p.pc_upd is equal to ⊥ or , and this contradicts the (easily checked)
invariant saying that a parcel with p.pc_upd equal to ⊥ or must be at the tail of
the queue.
Deﬁne α(m) to be the ISA component of the ﬂushed state |m|. Recall now the
function γ deﬁned at the beginning of this section. The functions γ and α map ISA
states to MOP states and the other way around. Clearly, α(γ(s)) = s.
The function α is analogous to the pipeline ﬂushing functions of Burch-Dill [5].
Indeed, we can prove that MOP satisﬁes the fundamental Burch-Dill correctness
property with respect to this ﬂushing function.
Theorem 3.4 Suppose a MOP transition leads from m to m′, and m is reachable.
Then α(m′) = isa_step (α(m)) or α(m′) = α(m).
Proof. We can assume the transition m −→ m′ is a fetch; otherwise, we clearly
have |m| = |m′|, and so α(m) = α(m′). The proof is by induction on the minimum
length k of a chain of (non-fetch) transitions from m to |m|. If k = 0, then m is
flushed, so m = γ(s) for some ISA state s. By the discussion at the beginning of
Section 3, the fetch transition m −→ m′ is the first in a sequence that, without
using any further fetches, leads from γ(s) to γ(s′), where s′ = isa_step s. It follows
that |m′| = |γ(s′)|, so α(m′) = α(γ(s′)) = s′, as required.
[Figure 5: two diagrams. Left: states m, m1, m′ and |m|, with transitions ρ, fetch and σ. Right: states m, m1, m′, m′1 and |m|, with transitions ρ, fetch, fetch and σ.]
Fig. 5. Two cases for the inductive step in the proof of Theorem 3.4
Assume now k > 0 and let m −ρ→ m1 be the first transition in a minimum-length
chain from m to |m|. Analyzing the proof of Theorem 3.1, one can see that local
confluence in the case of the rule pair (fetch, ρ) can be resolved in one of the two
ways shown in Figure 5, where σ has no occurrences of fetch. In the first case, we
have α(m′) = α(m1), and in the second case we have α(m′) = α(m′1), where m′1 is
as in Figure 5. In the first case, we have α(m′) = α(m1) = α(m). In the second
case, the proof follows from α(m) = α(m1), α(m′) = α(m′1), and the induction
hypothesis: α(m′1) = α(m1) or α(m′1) = isa_step(α(m1)).
3.1 Stuttering Equivalence of Bounded MOP and ISA
We begin with a set of deﬁnitions.
Two transition systems →1 and →2 deﬁned on sets S1, S2 respectively are said
to be stuttering equivalent when there exists a relation R ⊂ S1 × S2 that is a
stuttering bisimulation. This is to say that both R and its inverse R−1 are stuttering
simulations, where the last concept is deﬁned as follows.
Deﬁnition 3.5 Let R and the two transition systems be as in the previous para-
graph. We say that execution sequences
σ1 : a1 →1 a2 →1 a3 →1 · · · σ2 : b1 →2 b2 →2 b3 →2 · · ·
are R-matching if there exist strictly increasing functions f, g : {1, 2, 3, . . .} →
{1, 2, 3, . . .} such that for every k, i, j one has
f(k) ≤ i < f(k + 1) ∧ g(k) ≤ j < g(k + 1) ⇒ R(ai, bj).
We say that R is a stuttering simulation from S1 to S2 if for every pair (a1, b1) ∈ R
and every execution sequence σ1 that begins with a1, there exists an R-matching
sequence σ2 that begins with b1.
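For finite prefixes of execution sequences, the R-matching condition can be checked mechanically. The sketch below makes several assumptions that are not in the definition: sequences are finite Python lists, indices are 0-based, f and g are given as lists of block start positions (so f[k] plays the role of f(k + 1)), the first block starts at position 0, and the last block extends to the end of the prefix.

```python
def blocks(seq, starts):
    """Split seq into consecutive blocks beginning at the given start positions."""
    bounds = list(starts) + [len(seq)]
    return [seq[bounds[k]:bounds[k + 1]] for k in range(len(starts))]


def strictly_increasing(xs):
    return all(x < y for x, y in zip(xs, xs[1:]))


def r_matching(a, b, f, g, related):
    """Check that block k of a is pairwise related to block k of b, for every k."""
    if not f or len(f) != len(g) or f[0] != 0 or g[0] != 0:
        return False
    if not (strictly_increasing(f) and strictly_increasing(g)):
        return False
    if f[-1] >= len(a) or g[-1] >= len(b):
        return False
    return all(
        related(x, y)
        for block_a, block_b in zip(blocks(a, f), blocks(b, g))
        for x in block_a
        for y in block_b
    )
```

A typical use partitions a long run of one system into stuttering blocks, each of which corresponds to a single state of the other system.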
Systems that are stuttering equivalent satisfy the same properties in the tem-
poral logic without the next-state operator; see [13] and references therein.
Let MOPk (k ≥ 1) be the restriction of MOP on the subset of its states for
which the queue has length at most k.
Theorem 3.6 The relation R = {(s,m) | s = α(m)} is a stuttering bisimulation
between MOPk and ISA.
We need three lemmas about occurrences of squash in chains of MOP transi-
tions. The ﬁrst is a reﬁnement of an argument used in the proof of Theorem 3.4.
Lemma 3.7 Let m and m′ be reachable MOP states such that m −fetch→ m′. Let σ be
a sequence of rules that lead from m to |m|. If σ contains no occurrence of squash,
then α(m′) = isa_step(α(m)).
Proof. It is easy to see that σ applies to m′ as well. Moreover, one can prove that
fetch commutes with all transitions of σ, i.e. fetch ; σ =_m σ ; fetch. (See Figure 5,
diagram on the right.) Thus,
α(m′) = α(m′ σ) = α((m fetch) σ) = α(m (fetch ; σ)) = α(m (σ ; fetch))
      = α((m σ) fetch) = α(|m| fetch) = isa_step(α(|m|)) = isa_step(α(m))
Lemma 3.8 Suppose σ is a chain of MOP rules that applies to some MOP state.
If squash j occurs twice in σ, then between these two occurrences there must be an
occurrence of squash j′ for some j′ < j.
Proof. Notice a general fact that holds for an arbitrary sequence σ of MOP rules,
an arbitrary MOP state m and an arbitrary index j: if σ applies to m and squash j′
does not occur in σ for any j′ < j, then on the way from m to mσ via σ every ﬁeld
of the parcel m.q.j either makes progress or stays the same.
Now, write
σ = σ1 ; squash j ; σ2 ; squash j ; σ3,
where σ and j are as given in the lemma. Let m′ = m (σ1 ; squash j) and m′′ =
m′ σ2. Then m′.q.j.pc_upd = and m′′.q.j.pc_upd = . Since ≺ , the general
fact above implies that for some j′ < j, squash j′ must occur in σ2.
Let us say that j is safe for m if there is no chain of rules that applies to m and
contains squash j′ for some j′ ≤ j.
Lemma 3.9 Suppose σ is a chain of rules that applies to m and does not involve
squash j′ for any j′ ≤ j. Then: if j is safe for mσ, it is safe for m too.
[Figure 6: diagram with states m and m′, rules ρ and ρ′, chains π and π′, and a transition squash j′.]
Fig. 6. Illustration for proof of Lemma 3.9.
Proof (Sketch) It is no loss of generality to assume that σ is a single rule ρ distinct
from squash j′ for any j′ ≤ j. Assuming that there is a chain μ of rules that starts
at m and includes squash j′ for some j′ ≤ j, we can prove that there exists a chain
μ′ that starts at m′ = mρ and also includes squash j′ for some j′ ≤ j. See Figure 6,
where μ is the chain on the top. The proof is by induction on the number of squash
rules in μ and the length of μ. It depends on the case analysis over pairs (ρ, ρ′) in
the conﬂuence proof (Theorem 3.1) and the form of π, π′. 
Proof of Theorem 3.6. Part 1: From ISA to MOPk. Given an ISA execution
sequence
σ : s1 −→ s2 −→ s3 −→ · · ·
and m1 such that s1 = α(m1), we have a chain of MOP transitions m1 −→ · · · −→
|m1| = γ(s1). Furthermore, for every i there is a path in MOP1 from γ(si) to
γ(si+1), as explained at the beginning of Section 3. Splicing these paths together
gives a sequence
μ : m1 −→∗ γ(s1) −→∗ γ(s2) −→∗ · · ·
of MOP1 transitions. In this sequence, α(m) = s1 holds for all states m on the
path m1 −→∗ γ(s1). Also, for every i, α(m) = si holds for all states m on the path
γ(si−1) −→∗ γ(si), except the first. Thus, μ R-matches σ.
Part 2: From MOPk to ISA. Let
μ : m1 −→ m2 −→ m3 −→ · · ·
be an inﬁnite chain of MOPk-transitions. By Theorem 3.4, in the chain
σ : α(m1) −→ α(m2) −→ α(m3) −→ · · ·
we have for every i that either α(mi) −→ α(mi+1) is an ISA transition, or α(mi+1) =
α(mi) holds. We need to show that the ﬁrst case occurs inﬁnitely often.
Since MOP without fetch is terminating, the rule fetch must be used in in-
ﬁnitely many transitions mi −→ mi+1 in μ. We claim that retire must also be
used in inﬁnitely many transitions in μ. Suppose the claim is not true; then there
exists h and i0 such that mi.head = h and mi.tail ≤ h + k, for all i ≥ i0. Since
retire and squash are the only rules that decrease the queue size, to compensate
for the inﬁnitely many fetch’s, there must be inﬁnitely many squash transitions
in μ. More precisely, it follows that for some j such that h ≤ j ≤ h + k, there are
inﬁnitely many transitions squash j in μ. However, Lemma 3.8 easily implies that
any chain of MOP rules applicable to a MOP state may contain only ﬁnitely many
occurrences of squash j for any particular j. The contradiction ﬁnishes the proof
that μ contains inﬁnitely many occurrences of retire.
Now we know that the (non-decreasing) sequence m1.head, m2.head, m3.head, . . .
is unbounded. Consequently, for any given l, there are only finitely many i such
that mi.tail ≤ l. (Note that the sequence m1.tail, m2.tail, m3.tail, . . . is unbounded,
but not necessarily monotonic because of uses of squash.) Let then l̂ denote the
largest i such that mi.tail = l. Clearly, the transition m_l̂ −→ m_{l̂+1} must be a fetch
for every l.
Fix l. Let l′ be any index such that m_{l′}.head > l̂ and let σ denote the subchain
of μ leading from m_l̂ to m_{l′}. Note that squash j for j ≤ l cannot occur in σ because
that would violate the maximality condition on l̂. Note also that every j ≤ l is safe
for m_{l′}, as a consequence of l < m_{l′}.head. By Lemma 3.9, every j ≤ l = m_l̂.tail
is safe for m_l̂. This implies that no squash rule can occur in a (fetch-free)
reduction sequence m_l̂ −→∗ |m_l̂|. By Lemma 3.7, we have α(m_{l̂+1}) = α(m_l̂ fetch) =
isa_step(α(m_l̂)).
4 Simulating Microarchitectures in MOP
Suppose MA is a microarchitecture purportedly implementing the ISA. We will say
that a state-to-state map β from MA to MOP is a MOP-simulation if for every MA
transition s −→MA s′, the state β(s′) is reachable in MOP from β(s). Existence of a
MOP-simulation proves (the safety part of) the correctness of MA. Indeed, for every
execution sequence s1 −→MA s2 −→MA . . . , we have β(s1) −→+MOP β(s2) −→+MOP
. . ., and then by Theorem 3.4, α(β(s1)) −→∗ISA α(β(s2)) −→∗ISA . . ., demonstrating
the crucial simulation relation between MA and ISA.
For a given MA, the MOP-simulation function β should be easy to guess. The
difficult part is to verify that it satisfies the required property: the existence of
a chain of MOP transitions β(s) −→+MOP β(s′) for each transition s −→MA s′.
Somewhat simplistically, this verification task can be partitioned as follows.
Somewhat simplistically, this veriﬁcation task can be partitioned as follows.
Suppose MA’s state variables are v1, . . . , vn. (Among them are the ISA state
variables, of course.) The MA transition function
s = 〈v1, . . . , vn〉 −→ s′ = 〈v′1, . . . , v′n〉
is given by n functions next_v_i such that v′i = next_v_i 〈v1, . . . , vn〉. The n-step
sequence s = s0, s1, . . . , sn−1, sn = s′, where si = 〈v′1, . . . , v′i, vi+1, . . . , vn〉,
conveniently serializes the parallel computation that MA does when it makes a
transition from s to s′. These n steps are not MA transitions themselves since the
intermediate si need not be legitimate MA states at all. However, it is reasonable to
expect that the progress described by this sequence is reflected in MOP by actual
transitions:
β(s) = m0 −→∗MOP m1 −→∗MOP . . . −→∗MOP mn = β(s′). (1)
Defining the intermediate MOP states mi will usually be straightforward. Once
they have been identified, the task of proving that β(s′) is reachable from β(s) is
broken down into n tasks of proving that mi+1 is reachable from mi. Effectively,
the correctness of the MA next-state function is reduced to proving a correctness
property for each state component update function next_v_i.
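The serialization s0, s1, ..., sn can be sketched as follows. The tuple encoding of states and the two-variable example machine are assumptions of this illustration. Every v′i is computed from the original state s, and only then are the components replaced one at a time.

```python
def parallel_step(state, next_fns):
    """The MA transition: every component is computed from the same old state."""
    return tuple(fn(state) for fn in next_fns)


def serialized_steps(state, next_fns):
    """The sequence s0, ..., sn with s_i = <v'_1, ..., v'_i, v_{i+1}, ..., v_n>."""
    new = parallel_step(state, next_fns)
    return [new[:i] + tuple(state[i:]) for i in range(len(state) + 1)]


def naive_in_place(state, next_fns):
    """Incorrect serialization: later components see already-updated earlier ones."""
    current = list(state)
    for i, fn in enumerate(next_fns):
        current[i] = fn(tuple(current))
    return tuple(current)


# Hypothetical two-variable machine: x' = y, y' = x + y.
EXAMPLE_NEXT = [lambda s: s[1], lambda s: s[0] + s[1]]
```

On the state (1, 2) the serialized sequence is (1, 2), (2, 2), (2, 3). The intermediate (2, 2) need not be a legitimate machine state, and updating in place would end in (2, 4) instead of the intended (2, 3).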
5 Mechanization
Our method is intended to be used with a combination of interactive (or manual)
and automated theorem proving. The correctness of the MOP system (Theorem 3.4)
rests largely on its local confluence (Theorem 3.1), which is naturally and easily split
into a large number of cases that can be individually verified by an automated SMT
solver. The solver needs decision procedures for uninterpreted functions, a fragment
of arithmetic, and common datatypes such as arrays, records and enumeration
types. Once the MOP-simulation function β of Section 4 has been defined and the
intermediate MOP states mi in the chain (1) identified, it should also be possible to
generate the proof of reachability of mi+1 from mi with the aid of the same solver.
We have used manual proof decomposition and CVC Lite to implement the proof
procedure just described. Our models for ISA, MOP , and MA are all written in
the reﬂective general-purpose functional language reFLect [7]. In this convenient
framework we can execute speciﬁcations and—through a HOL-like theorem prover
on top of reFLect or an integrated CVC Lite—formally reason about them at the
same time. Local conﬂuence of MOP is (to some extent automatically) reduced to
about 400 goals, which are individually proved with CVC Lite. For MA we used
the textbook DLX model [9] and proved it is simulated in MOP by constructing
the chains (1) and verifying them with CVC Lite. This proof is sketched in some
detail below.
Mechanization of our method is still in progress. Clean and eﬃcient use of
SMT solvers to prove properties of executable high-level models written in another
language comes with challenges, some of which were discussed in [8]. For us, partic-
ularly exigent is the demand for heuristics for deciding when to expand occurrences
of function symbols in goals passed to the SMT solver with the functions’ deﬁnitions,
and when to treat them as uninterpreted.
5.1 Simulating DLX in MOP
To illustrate the ideas given in Section 5, we use a simple five-stage pipelined
processor based on DLX [9]. Its states are 7-tuples s = 〈p1, p2, p3, p4, pc, rf ,mem〉,
where the pi are the pipeline registers and the rest is the architected state. The
DLX next-state function is brieﬂy explained in Figure 7; for more details, see [9].
It is straightforward to associate a MOP parcel to each valid (non-bubble)
S. Krstić et al. / Electronic Notes in Theoretical Computer Science 174 (2007) 7–22
[Figure 7: three pipeline diagrams. (a) Actions 4, 3, 2, 1 apply to registers p1, p2, p3, p4 and action 5 fetches a new parcel, yielding p′1, p′2, p′3, p′4. (b) Actions 6, 3, 2, 1 apply to p1, p2, p3, p4, yielding p′2, p′3, p′4. (c) Actions 3, 2, 1 apply to p2, p3, p4 while p1 is unchanged, yielding p1, p′3, p′4.]
Fig. 7. Dynamics of DLX. (a) In a regular cycle, the DLX step can be seen as a sequence of five actions:
(1) parcel p4 writes back and retires; (2) p3 performs memory access; (3) alu computes the result of p2 or
the address for its memory access; (4) p1 receives data from the register file or by forwarding, and if it is a
branch, its target is computed, as well as whether the branch is taken or not; (5) a new parcel p′1 is fetched
and pc is incremented. (b) If p1 is a taken branch, action (4) is modified to include updating the pc with
the computed target (becoming action (6) in the picture), and no parcels are fetched. (c) The machine
stalls one cycle if p2 is a load and p1 depends on it.
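The three cases of Figure 7 can be summarized by a small case analysis. The following Python sketch is only an illustration under stated assumptions: the Parcel fields, the use of None for a bubble, the test "p2.dest in p1.srcs" for dependence, and the choice to check the stall condition before the branch condition are all hypothetical and are not taken from the DLX model itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Parcel:
    kind: str                    # "alu_reg", "alu_immed", "load", "store" or "branch"
    dest: Optional[int] = None   # destination register, if any (hypothetical field)
    srcs: Tuple[int, ...] = ()   # source registers (hypothetical field)
    taken: bool = False          # branch outcome; meaningful only for branches


def cycle_case(p1: Optional[Parcel], p2: Optional[Parcel]) -> str:
    """Return "a", "b" or "c", the panel of Figure 7 describing this cycle.
    A bubble is represented by None."""
    # (c) load-use stall: p2 is a load and p1 reads the register it loads.
    if p1 is not None and p2 is not None \
            and p2.kind == "load" and p2.dest in p1.srcs:
        return "c"
    # (b) taken branch: pc is redirected and nothing is fetched.
    if p1 is not None and p1.kind == "branch" and p1.taken:
        return "b"
    # (a) regular cycle.
    return "a"


# Action numbers performed in each case, in the order used by the caption.
ACTIONS: Dict[str, List[int]] = {
    "a": [1, 2, 3, 4, 5],
    "b": [1, 2, 3, 6],
    "c": [1, 2, 3],
}
```

For example, a load into register 1 in p2 followed by an ALU instruction reading register 1 in p1 falls under case (c), so only actions 1, 2 and 3 take place in that cycle.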
m0 = 〈p4 p3 p2 p1, pc, rf, mem〉
    --WB_RET (1)-->
m1 = 〈p3 p2 p1, pc, rf′, mem〉
    --LD_ST (2)-->
m2 = 〈p′4 p2 p1, pc, rf′, mem′〉
    --ALU (3)-->
m3 = 〈p′4 p′3 p1, pc, rf′, mem′〉
    --DATA (4)-->      m4 = 〈p′4 p′3 p′2, pc, rf′, mem′〉
    --DATA_PC (6)-->   m6 = 〈p′4 p′3 p′2, pc′, rf′, mem′〉
m4 = 〈p′4 p′3 p′2, pc, rf′, mem′〉
    --FETCH_PC (5)-->
m5 = 〈p′4 p′3 p′2 p′1, pc′, rf′, mem′〉

WB_RET =
    retire                                                if store p4 ∨ branch p4
    write_back₁ ; retire                                  if alu_reg p4 ∨ alu_immed p4 ∨ load p4
LD_ST =
    load₁                                                 if load p3
    store₁                                                if store p3
ALU =
    result₂                                               if alu_reg p2 ∨ alu_immed p2
    mar₂                                                  if load p2 ∨ store p2
DATA =
    data1₃ ; data2₃                                       if alu_reg p1 ∨ store p1
    data₃                                                 if alu_immed p1 ∨ load p1
    branch_data₃ ; next_pc_branch₃                        if branch p1
DATA_PC =
    branch_data₃ ; next_pc_branch₃ ; pc_update₃           if branch p1
FETCH_PC =
    fetch ; decode₄ ; speculate                           if branch p′1
    fetch ; decode₄ ; next_pc_nonbranch₄ ; pc_update₄     otherwise

Fig. 8. Simulation of DLX in MOP. The MOP state m0 is β(s), where s is an arbitrary DLX state. If s′
is the next DLX state, then β(s′) is either m5, m6, or m3, depending on whether we are in case (a), (b),
or (c) of Figure 7. The figure shows the sequence of MOP steps that go from β(s) to β(s′) in all cases.
The "rule" data1₃ is either data1_rf₃ or data1_forward₃, depending on whether the parcel p1
depends on the parcels p2, p3 or not (similarly for data2₃). The "rule" branch_data₃ is an abbreviation
for data₃ ; branch_taken₃ ; branch_target₃.
value of a DLX pipeline register. Combining the resulting parcels into a MOP
queue and copying the ISA state, we can deﬁne the simulation function β that
maps DLX states to MOP states. If s is as above, then β(s) can be written as
〈p˜4p˜3p˜2p˜1, pc, rf ,mem〉, where p˜4p˜3p˜2p˜1 denotes the MOP queue corresponding to
the contents of the pipeline registers (p1, p2, p3, p4), with the proviso that p˜i is ab-
sent if pi is a bubble. Figure 8 sketches the proof that β is a legitimate simulation
function. (To avoid clutter, we dropped the tildes from all MOP parcels.)
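As a rough illustration of the construction, the following Python sketch builds β(s) and lists the MOP rules for the part of the chain shared by all three cases of Figure 8 (from m0 to m3). It rests on several assumptions that are not part of our models: a parcel is represented only by its instruction class (a string), a bubble by None, the translation of a pipeline register value into a MOP parcel is taken to be the identity, a bubble is assumed to contribute no rules, and the rule names are plain-text renderings of those in Figure 8 for the bubble-free situation drawn there.

```python
from typing import List, Optional

Kind = Optional[str]  # instruction class of a parcel; None stands for a bubble


def beta(s: tuple) -> tuple:
    """Map a DLX state (p1, p2, p3, p4, pc, rf, mem) to a MOP state
    (queue, pc, rf, mem). The queue lists the oldest parcel first and
    omits bubbles; the architected state is copied unchanged."""
    p1, p2, p3, p4, pc, rf, mem = s
    queue = tuple(p for p in (p4, p3, p2, p1) if p is not None)
    return (queue, pc, rf, mem)


def wb_ret(p4: Kind) -> List[str]:
    """Rules of the WB_RET group, selected by the class of p4."""
    if p4 in ("store", "branch"):
        return ["retire"]
    if p4 in ("alu_reg", "alu_immed", "load"):
        return ["write_back_1", "retire"]
    return []  # assumption: a bubble needs no MOP step


def ld_st(p3: Kind) -> List[str]:
    """Rules of the LD_ST group; other classes are assumed to need none."""
    return {"load": ["load_1"], "store": ["store_1"]}.get(p3, [])


def alu(p2: Kind) -> List[str]:
    """Rules of the ALU group; other classes are assumed to need none."""
    if p2 in ("alu_reg", "alu_immed"):
        return ["result_2"]
    if p2 in ("load", "store"):
        return ["mar_2"]
    return []


def common_prefix(s: tuple) -> List[str]:
    """MOP rules leading from m0 = beta(s) to m3, shared by cases (a)-(c)."""
    p1, p2, p3, p4, *_ = s
    return wb_ret(p4) + ld_st(p3) + alu(p2)


s = ("branch", "alu_reg", "load", "store", 8, {}, {})
m0 = beta(s)
rules = common_prefix(s)
```

Verifying a chain then amounts to checking, for each listed rule, that it is enabled in the current MOP state and produces the next one; the sketch only enumerates the rules and performs no such check.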
6 Related Work
The idea of ﬂushing a pipeline automatically was introduced in a seminal paper
by Burch and Dill [5]. In the original approach, all in-ﬂight instructions in the
implementation state are ﬂushed out of the pipeline by inserting bubbles—NOPs
that do not aﬀect the program counter. Pipelines that use a combination of super-
scalar execution, out-of-order execution, and variable-latency execution units are too
complex to ﬂush directly. In response, researchers have invented a variety of ways,
many based on ﬂushing, to relate the implementation pipeline to the speciﬁcation.
We cover here only those approaches that are most closely related to our work.
The interested reader is referred to [1] for a relatively complete survey of pipeline
veriﬁcation approaches.
Shen and Arvind [15] were the first to prove an example of Burch-Dill correctness
using the ﬂushing function deﬁned as the normal form in a conﬂuent system. They
model an abstract out-of-order processor and a simple speciﬁcation machine as
term rewriting systems. Their implementation model is similar to our intermediate
speciﬁcation MOP , and its Burch-Dill correctness against the speciﬁcation ISA is
the main result of [15]. We go a step further by proving stuttering bisimulation.
Also, MOP is for us only an intermediate model that, in turn, allows us to reason
about deterministic and more realistic implementations.
Hosabettu et al. [10] devised a method to decompose the Burch-Dill correctness
statement into lemmas, one per pipeline stage. This inspired the decomposition we
describe in Section 4.
Lahiri and Bryant [12], and Manolios and Srinivasan [13] veriﬁed complex mi-
croprocessor models using the SMT solver UCLID. Some consistency invariants
in [12] occur naturally in our conﬂuence proofs as well, but the overall approach
is not closely related. The WEB-reﬁnement method used in [13] produces proofs
of stuttering bisimulation between ISA and MA that implies liveness. This gave
motivation for our Theorem 3.6, but our stuttering bisimulation proof is diﬀerent.
Skakkebæk et al. [16,11] introduce incremental ﬂushing and use a non-determini-
stic intermediate model to prove correctness of a simple out-of-order core with in-
order retirement. Like us, they rely on arguments about transaction re-ordering.
While incremental ﬂushing must deal with transactions as they are deﬁned for the
pipeline, we decompose pipeline transactions into much simpler “atomic” transac-
tions. This facilitates a more general abstraction and should require signiﬁcantly
less manual proof eﬀort for a given pipeline than the incremental ﬂushing approach.
Damm and Pnueli [6] use a non-deterministic speciﬁcation that generates all
program traces that satisfy data dependencies. They use an intermediate abstrac-
tion with auxiliary variables to relate the speciﬁcation and an implementation with
out-of-order retirement based on Tomasulo’s algorithm. In each step of the spec-
iﬁcation model, an entire instruction is executed atomically and its result written
back. In the MOP approach, the execution of each instruction is broken into a
sequence of mini-steps in order to relate to a pipelined implementation.
Sawada and Hunt [14] use an intermediate model with an unbounded history
table called a micro-architectural execution trace table. It contains instruction-
speciﬁc information similar to that found in the MOP queue. Arons [2] follows
a similar path, augmenting an implementation model with history variables that
record the predicted results of instruction execution. In these approaches, auxiliary
state is—like the MOP queue—employed to derive and prove invariants about the
implementation’s relation to the speciﬁcation. While their auxiliary state is derived
directly from the MA, MOP is largely independent of MA and has ﬁne-grained
transitions.
7 Conclusion
We have presented an approach for verifying a pipelined system P against its spec-
iﬁcation S by using an intermediate “pipeline mother” system M that explicates
atomic computations occurring in steps of S. For deﬁniteness, we assumed that P
is a microprocessor model and S is its ISA, but the method can potentially be ap-
plied to verify pipelined hardware components in general, or in protocol veriﬁcation.
This can all be seen as a reﬁnement of the classical Burch-Dill method, but with the
diﬃcult ﬂushing-based simulation pushed to the M vs. S level, where it amounts to
proving local conﬂuence of M—a conjunction of easily-stated properties of limited
size, readily veriﬁable by SMT solvers.
As an example, we speciﬁed a concrete intermediate model MOP for a simple
load-store architecture and proved its correctness. We also veriﬁed the textbook
machine DLX against it. However, our MOP contains more than is needed for
verifying DLX : it is designed for simulation of microprocessor models with complex
out-of-order execution that cannot be handled by currently-available methods. This
will be addressed in future work. Also left for future work are improvements to
our methodology (manual decomposition of veriﬁcation goals into subgoals which
we prove with CVC Lite [4]) and performance comparison with other published
methods.
References
[1] M. D. Aagaard, B. Cook, N. A. Day, and R. B. Jones. A framework for superscalar microprocessor
correctness statements. Software Tools for Technology Transfer, 4(3):298–312, 2003.
[2] T. Arons. Veriﬁcation of an advanced MIPS-type out-of-order execution algorithm. In Computer Aided
Veriﬁcation (CAV’04), volume 3114 of LNCS, pages 414–426, 2004.
[3] F. Baader and T. Nipkow. Term Rewriting and All That. Cambridge University Press, 1998.
[4] C. Barrett and S. Berezin. CVC Lite: A new implementation of the cooperating validity checker. In
R. Alur and D. A. Peled, editors, Computer Aided Veriﬁcation (CAV’04), volume 3114 of LNCS, pages
515–518, 2004.
[5] J. Burch and D. Dill. Automatic veriﬁcation of pipelined microprocessor control. In D. L. Dill, editor,
Computer Aided Veriﬁcation (CAV’94), volume 818 of LNCS, pages 68–80, 1994.
[6] W. Damm and A. Pnueli. Verifying out-of-order executions. In H. F. Li and D. K. Probst, editors,
Correct Hardware Design and Veriﬁcation Methods (CHARME’97), pages 23–47. Chapman and Hall,
1997.
[7] J. Grundy, T. Melham, and J. O’Leary. A reﬂective functional language for hardware design and
theorem proving. J. Functional Programming, 16(2):157–196, 2006.
[8] J. Grundy, T. F. Melham, S. Krstic´, and S. McLaughlin. Tool building requirements for an API to
ﬁrst-order solvers. Electr. Notes Theor. Comput. Sci., 144(2):15–26, 2006.
[9] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann,
1995.
[10] R. Hosabettu, G. Gopalakrishnan, and M. Srivas. Verifying advanced microarchitectures that support
speculation and exceptions. In E. A. Emerson and A. P. Sistla, editors, Computer Aided Veriﬁcation
(CAV’00), volume 1855 of LNCS, pages 521–537, 2000.
[11] R. B. Jones. Symbolic Simulation Methods for Industrial Formal Veriﬁcation. Kluwer, 2002.
[12] S. K. Lahiri and R. E. Bryant. Deductive veriﬁcation of advanced out-of-order microprocessors. In
W. A. Hunt Jr. and F. Somenzi, editors, Computer Aided Veriﬁcation (CAV’03), volume 2725 of
LNCS, pages 341–354, 2003.
[13] P. Manolios and S. K. Srinivasan. A complete compositional reasoning framework for the eﬃcient
veriﬁcation of pipelined machines. In IEEE/ACM International conference on Computer-aided design
(ICCAD’05), pages 863–870. IEEE Computer Society, 2005.
[14] J. Sawada and W. Hunt. Processor veriﬁcation with precise exceptions and speculative execution. In
A. J. Hu and M. Y. Vardi, editors, Computer Aided Veriﬁcation (CAV’98), volume 1427 of LNCS,
pages 135–146, 1998.
[15] X. Shen and Arvind. Design and veriﬁcation of speculative processors. In Workshop on Formal
Techniques for Hardware, Maarstrand, Sweden, June 1998.
[16] J. Skakkebæk, R. Jones, and D. Dill. Formal veriﬁcation of out-of-order execution using incremental
ﬂushing. In A. J. Hu and M. Y. Vardi, editors, Computer Aided Veriﬁcation (CAV’98), volume 1427
of LNCS, pages 98–109, 1998.