Algebraic models of behaviour and correctness of SMT and CMT processors  by Harman, N.A.
The Journal of Logic and Algebraic Programming 74 (2007) 32–56
Algebraic models of behaviour and correctness of SMT and CMT
processors
N.A. Harman
Department of Computer Science, University of Wales Swansea, Swansea SA2 8PP, UK
Received 28 November 2006; revised 8 July 2007; accepted 16 July 2007
Available online 19 September 2007
Abstract
Superscalar microprocessors execute multiple instructions simultaneously by virtue of large amounts of (possibly duplicated)
hardware. Much of this hardware is idle at least part of the time. Simultaneous multi-threaded (SMT) microprocessors utilize this
idle hardware by interleaving multiple independent execution threads. In essence, a single physical processor appears to be multiple
virtual processors. Multi-core, or chip-level multi-threaded (CMT) processors duplicate the execution pipeline, while sharing other
resources. Both approaches increase processor hardware utilization (and hence speed) by introducing thread-level parallelism.
The key question we consider in this paper is: how do we model SMT/CMT processors? In particular, how do we model multiple,
parallel, intercommunicating threads of execution, whose behaviour is defined in terms of some lower level implementation? And,
what does it mean for such SMT/CMT models to be correct? The model developed here focusses particularly on (a) the relationship
between timing behaviour at different levels of abstraction; and (b) what it means for a representation of a processor at one level of
abstraction (typically representing an implementation) to be correct with respect to another (typically representing a specification
consisting of multiple interacting threads). An inevitable, and realistic, consequence of the model of SMT/CMT processors developed
is a weakening of the long-established principle of separation between a processor implementation and specification.
© 2007 Elsevier Inc. All rights reserved.
1. Introduction
Modern high-performance microprocessors employ a range of techniques to improve performance. As well as
obvious (though sometimes misleading) increases in processor clock speed, the main sources of performance gains are
improvements in instruction-level parallelism (ILP) and thread-level parallelism (TLP). Pipelined processors increase
ILP by overlapping the different stages of instruction execution. There are problems that must be addressed: ensuring
that operands generated by earlier instructions and needed by later ones are available (data dependency); ensuring that
sufficient hardware resources are available for processing different instruction stages in parallel without overlapping
demands (resource conflicts); and correctly handling branches, where it is not immediately obvious which of two
E-mail address: n.a.harman@swan.ac.uk
1567-8326/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.jlap.2007.07.001
alternative instructions to start processing next (procedural dependency). Techniques to handle these problems must at
least ensure that instruction execution remains correct; ideally, they should remove or reduce any performance penalties
that arise. For example, forwarding can ensure operand availability (in at least some cases); resource duplication can
help resolve resource conflicts; and branch prediction and register renaming can ensure correct and fast handling of
branch operations [19].
Superscalar processors permit multiple instructions to be at the same stage of execution simultaneously (that is,
multiple parallel pipelines).1 As well as the problems faced by pipelined processors, superscalar processors must
consider issues that arise from instruction re-ordering. Commonly, instructions may be executed out-of-program
order, which means that the possibility of earlier instructions overwriting the operands (reverse data dependency) or
results (output dependency) of later ones must be considered. Again, these issues can typically be resolved by register
renaming.
Performance gains for superscalar processors are limited by data dependencies between instructions. It is not the
case that performance gains follow from arbitrarily increasing the number of instructions (or degree) a superscalar
processor can in principle execute. In practice, a degree 4 processor will show little gain over a degree 3 processor
because there are usually not enough instructions unaffected by dependencies to keep a high-degree processor’s pipeline
full. Simultaneous multi-threading (SMT) and multi-core, or chip-level multi-threaded (CMT), processors address this
by executing instructions from multiple threads in parallel. Data dependencies will not occur between instructions in
different threads2 and so delays are reduced. A CMT processor duplicates the entire processor core. So, for example,
a degree 4 superscalar processor will be replaced by two degree 2 processors.3 Provided the program being executed
is multi-threaded, and the operating system is aware of the processor’s capabilities, both of which are increasingly
common, it should normally be the case that the dual degree 2 processors are more fully utilized than the single degree 4
processor. A current example of a CMT processor is Intel’s Core 2 Duo. SMT processors operate instead by interleaving
multiple threads of execution within the same superscalar processor. Processor utilization is again increased, because
more instructions without mutual dependencies can be found to fill the pipeline. Some extra hardware is required to keep
track of which thread each instruction belongs to, and a proportion of the pipeline is duplicated (though relatively little).
Some Intel Pentium 4 processors use a form of SMT, called Hyperthreading by Intel [24]. Note that Hyperthreading
has been controversial and blamed for performance decreases, and it has been dropped from the Pentium 4’s immediate
successors (the Core processors). However, these problems arise from issues specific to the Pentium 4 microarchitecture
(NetBurst) and are not inherent in SMT (see Section 5.1). There is, of course, no reason why the SMT and CMT approaches
cannot be adopted simultaneously. From the point of view of modelling behaviour in this paper, the same approach
can be applied to both SMT and CMT processors, and we will use SMT/CMT to collectively refer to both techniques.
This paper is part of a series that develops algebraic models of the behaviour, especially the timing behaviour, and
the correctness of microprocessors. These algebraic models have a common form. Microprocessors are modelled as
(potentially simultaneous) iterated maps operating over some state set and some conveniently-chosen discrete model
of time. Models at different levels of abstraction are related by timing abstraction maps (called retimings) and data
abstraction maps. In [17,18] microprogrammed processors are modelled (including IO in [18]). Microprogrammed
processors complete execution of a single instruction before starting the next: there is a unique and simple timing
relationship between instructions at the level of abstraction visible to the programmer (the programmer’s model) and the
corresponding implementation (the abstract circuit model). This model is extended in [12,11] to accommodate pipelined
processors, which overlap instruction execution. In such processors, there is no longer a unique timing relationship
between instructions at different levels of abstraction (though instructions terminate at unique times). Superscalar
processors incorporate multiple, parallel execution pipelines and are modelled in [12,10]. In superscalar processors
instructions do not necessarily terminate at unique times, or in program order. Superscalar processors no longer represent
the state-of-the-art in processor design. SMT/CMT processors consist of multiple, parallel (real or virtual) superscalar
processors that execute separate (but potentially communicating) threads. The fundamental questions that concern us
in this paper are: How do we extend our algebraic model to accommodate SMT/CMT processors; and, What does it
mean for an SMT/CMT processor to be correctly implemented in the context of this extended model?
1 A fuller discussion of superscalar processors would consider legacy architectures and precise architectural states. However, we omit such a
discussion here.
2 Except for the relatively rare case – in terms of the total number of instructions executed – when threads communicate.
3 Obviously simplistic, but a useful example.
The structure of this paper is as follows. In Section 2 we consider other work, historical and current, on modelling
and verifying microprocessors. In Section 3 we summarize the fundamental concepts developed in previous works
by the author and co-workers for modelling conventional, pipelined and superscalar processors and their correctness.
In Section 4 we introduce the superscalar correctness model, which is used as a basis for representing SMT/CMT
correctness. In Section 5 we expand on data dependencies and SMT/CMT concepts. In Sections 6 and 7 we extend
the existing microprocessor model to accommodate SMT/CMT processors and their correctness. In Section 8 we
extend the one-step theorems, which are used to eliminate induction in processor verification, to SMT/CMT. Finally,
in Section 9 we consider a simple illustrative example.
2. Related work
There is a substantial body of work devoted to microprocessor verification and a common characteristic of much of
it is the need to address a specific, usually complex4 example [2,8,5]. A useful summary is [1].
Early work includes [13], which models and verifies a simple processor with eight instructions. This particular
example, and variations, was subsequently a common case study [18,14]. A more substantial example was Viper
which was intended for commercial use [4]. The rigour of Viper’s ‘verified’ status was controversial (leading to
unresolved legal action). This leads to the ‘obvious’ observation that it is difficult to ensure that a verified design is
actually manufactured correctly.
A landmark leading to much subsequent work is [3] which develops the concept of flushing and the notion of
verification based on comparisons of fragments of execution traces with appropriate timing (and data) abstraction –
though some of these concepts appeared earlier in [31,15]. Such techniques can be classified (in the useful terminology
of [27] which describes a recent evolution of the technique) as simulation based correspondence. Techniques derived
from [3] have been successfully applied to a wide range of examples.
A variation on such simulation based correspondence techniques is the predecessor work to this paper. In particular,
models of ‘conventional’ (i.e. non-pipelined) processors are described in [17,18], pipelined processors in [12,6,11],
and superscalar processors in [10,6]. The techniques of [12] have been used successfully in the ARM6 verification [8],
which would appear to be the first verification of an ‘off-the-shelf’ processor. That is, a processor that has not been
designed with verification explicitly in mind (and hence with ‘inconvenient’ and hard-to-verify features omitted). The
presence of explicit time in our model does entail establishing that verifications based on finite state traces do lead to
valid correctness proofs. However, these are only required once for each variation of the model: see Section 7.
A significant alternative to techniques derived from [3] (though still owing much to it) are the completion functions
of [20]. In this approach, the effect of each stage of an execution pipeline on the programmer-visible state of a processor is
considered separately (modelled by individual maps termed completion functions). This provides an obvious and
useful partition of the complete verification obligation, since each of these can be considered separately. Completion
functions have also been successfully applied in practice.
Other significant work includes the VAMP project [2] at Saarbrücken, which has worked on the verification of a
processor based on an extended model of DLX. The DLX processor [19] is a simple, fictitious example developed for
pedagogical purposes and commonly used as the basis of verification examples. Hunt’s Group in Austin, Texas [27]
works on formal verification of microprocessors. Hunt has been a longstanding contributor to the field [23,21,22,28].
3. The basics of the algebraic model
In this section we introduce the basic algebraic tools used to model microprocessors. We omit discussion of the
underlying algebraic theory, which is extensively described elsewhere: see, for example, [26,30].
3.1. Clocks
The underlying temporal model is based on clocks that divide time into discrete clock cycles, representing intervals.
4 At least with reference to the state-of-the-art in processor verification at the time.
Fig. 1. A retiming from T to S with associated immersion and tools.
Definition 1. A clock is an algebra (T | 0, t + 1) where: (i) T = {0, 1, . . . } is a set of clock cycles (a renamed copy
of N); (ii) 0 is the initial clock cycle; and (iii) t + 1 is the next or successor clock cycle function.
A clock cycle need not represent a constant subdivision of time. For example, we might use an instruction clock to
represent the execution of instructions in a microprocessor, where each cycle of the clock may last different amounts
of ‘real’ time. Note particularly that we are modelling time intervals and not time points. It is perfectly reasonable in
our model for events to take zero (apparent) time. This simply means the event starts and ends within a single clock
cycle. Furthermore, our philosophy is that events, and their order, define time, rather than the reverse. We are happy to
choose whichever clock (subdivision of time) is convenient in any given case.
3.2. Retimings
Let S and T be two clocks, representing different subdivisions of time. We formally define the relationship between
S and T using retimings5 and their associated tools.
Definition 2. Let S and T be clocks. A retiming λ : S → T is a monotonic, surjective map with λ(0) = 0.
The set of all retimings from S to T is denoted by Ret(S, T ). Fig. 1 shows a retiming λ between two clocks, S and
T . A clock S is said to be faster than a clock T (and clock T is slower than clock S) if there exists a retiming λ from
S to T . In most cases, timing abstraction is determined by state. Since we are concerned with deterministic systems,
one state (usually the initial state) is enough to characterize behaviour.
Definition 3. Given a non-empty set representing some state space A, a state-dependent retiming λ(a) ∈ Ret(S, T )
is determined by some state a ∈ A.
Example
Consider the state-dependent retiming λ : N+ → [S → T ] defined by
$\lambda(n)(s) = \lfloor s/n \rfloor$.
We denote the set of state-dependent retimings from clock S to clock T , parameterized by set A by Ret(A, S, T ).
Given a retiming λ ∈ Ret(S, T ), we commonly wish to define the corresponding immersion $\bar{\lambda} : T \to S$.
Definition 4. The immersion of a retiming λ ∈ Ret(S, T ), denoted by $\bar{\lambda} \in [T \to S]$, is defined by
$\bar{\lambda}(t) = (\text{least } s \in S)[\lambda(s) = t]$.
Observe that because λ is surjective, $\bar{\lambda}$ is always defined. The immersion $\bar{\lambda}$ is shown in Fig. 1.
5 Not to be confused with the retimings of [25].
We may also define a retiming λ in terms of its immersion $\bar{\lambda}$:
$\lambda(s) = (\text{least } t \in T)[\bar{\lambda}(t + 1) > s]$.
The definition of $\bar{\lambda}$ can be extended in the obvious way to accommodate state-dependent retimings.
Given s ∈ S and retiming λ ∈ Ret(S, T ), we commonly wish to identify a clock cycle s′ ∈ S such that s′ is the
earliest time for which
λ(s′) = λ(s),
as well as the number of cycles of clock S corresponding with clock cycle t ∈ T .
Definition 5. Given a retiming λ ∈ Ret(S, T ) and a time s ∈ S, the function start : Ret(S, T ) → [S → S]:
$\mathit{start}(\lambda)(s) = \bar{\lambda}(\lambda(s))$
returns the first time s′ ∈ S such that, λ(s′) = λ(s).
The start function is illustrated in Fig. 1.
Definition 6. Given a retiming λ ∈ Ret(S, T ) and a time t ∈ T , the length function l : Ret(S, T ) → [T → S+]
returns the number of cycles s′ ∈ S+ = S − {0}, such that λ(s′) = t . The length function l is defined as follows:
$l(\lambda)(t) = \bar{\lambda}(t + 1) - \bar{\lambda}(t)$.
Further discussion of retimings can be found in [16].
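The tools of Definitions 2 to 6 can be sketched executably. The following is a minimal illustration, not part of the original development: clock cycles are represented as non-negative Python integers, and the sample retiming (three cycles of S per cycle of T) and all function names are assumptions chosen for the example.

```python
from typing import Callable

# A retiming maps a cycle of the faster clock S to a cycle of the slower clock T.
Retiming = Callable[[int], int]


def immersion(lam: Retiming) -> Callable[[int], int]:
    """Immersion of lam: t -> least s in S with lam(s) == t."""
    def lam_bar(t: int) -> int:
        s = 0
        while lam(s) != t:  # terminates because lam is surjective
            s += 1
        return s
    return lam_bar


def retiming_from_immersion(lam_bar: Callable[[int], int]) -> Retiming:
    """Recover lam from its immersion: s -> least t with lam_bar(t + 1) > s."""
    def lam(s: int) -> int:
        t = 0
        while not lam_bar(t + 1) > s:
            t += 1
        return t
    return lam


def start(lam: Retiming) -> Callable[[int], int]:
    """start(lam)(s): first S-cycle s' with lam(s') == lam(s)."""
    lam_bar = immersion(lam)
    return lambda s: lam_bar(lam(s))


def length(lam: Retiming) -> Callable[[int], int]:
    """l(lam)(t): number of S-cycles that make up T-cycle t."""
    lam_bar = immersion(lam)
    return lambda t: lam_bar(t + 1) - lam_bar(t)


# Assumed sample retiming: monotonic, surjective, and sample_lam(0) == 0.
def sample_lam(s: int) -> int:
    return s // 3
```

With this sample retiming, cycle 7 of S lies in cycle 2 of T, which starts at cycle 6 of S, and every cycle of T has length 3.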
3.3. Iterated maps models of microprocessors
We model microprocessors as a trace of states from some set A, generated by the repeated application of a next-
state function next : A → A, starting from some initial state a ∈ A (possibly modified by an initialization function
init : A → A). A state function F : T × A → A, for some clock T , computes the microprocessor state at time t ∈ T ,
from starting state a ∈ A. Typically A will be a Cartesian product of sets representing registers and memories. The
exact choice of the component parts of A will determine the level of data abstraction. The choice of T governs the level
of timing abstraction. A clock T in which each cycle corresponds with an instruction is suitable for an architecture,
or programmer’s model PM . A clock T in which each cycle corresponds with a system clock cycle (or some multiple
thereof) is more suitable for an implementation, or abstract circuit model AC.
Definition 7. Given clock T , non-empty set A, and simultaneous primitive recursive functions next : A → A and
init : A → A, an iterated map with initialization function F : T × A → A is a primitive recursive function defined
by the following equations. For all t ∈ T and a ∈ A
F(0, a) = init (a),
F (t + 1, a) = next (F (t, a)).
Note that we omit discussion of input and output; however, the extension of the iterated map model to include input
and output presents no difficulties [18,6].
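Definition 7 can be read directly as a program. The sketch below is illustrative only: the helper name iterated_map and the toy machine (a state consisting of a program counter and an accumulator) are assumptions, not taken from the paper.

```python
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")


def iterated_map(init: Callable[[A], A],
                 next_state: Callable[[A], A]) -> Callable[[int, A], A]:
    """Build F with F(0, a) = init(a) and F(t + 1, a) = next_state(F(t, a))."""
    def F(t: int, a: A) -> A:
        state = init(a)
        for _ in range(t):
            state = next_state(state)
        return state
    return F


# Hypothetical toy machine: state = (pc, acc).
# Each cycle adds the current pc to acc and then increments pc.
def toy_next(state: Tuple[int, int]) -> Tuple[int, int]:
    pc, acc = state
    return (pc + 1, acc + pc)


def toy_init(state: Tuple[int, int]) -> Tuple[int, int]:
    return state  # identity initialization: every state is treated as legal


F = iterated_map(toy_init, toy_next)
```

Here F(3, (0, 0)) passes through (1, 0) and (2, 1) before reaching (3, 3), so the trace is generated purely by repeated application of the next-state function.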
3.3.1. Initialization functions in iterated map models
The purpose of initialization functions is not to describe the initial behaviour of a system. Rather it is to eliminate
unwanted starting states in traces: the ‘start’ of a trace may not represent the ‘start’ of execution. For example, consider an
implementation with memory m, program counter pc and instruction register ir: we may initially require ir = m(pc),
and hence not wish to consider starting (trace) states that do not have this property. Initialization functions are often
not required for programmer’s model descriptions of processors as often (though not always) all programmer-visible
states are legal. (In principle this can also be the case with AC-level models, and is for our examples in Section 9.
However, for realistic examples, this is unlikely.)
The choice of the term initialization function may seem unfortunate. However, it represents the initialization of the
model, rather than of the hardware being represented, since we may start an execution trace at some arbitrary point
(corresponding to the start/end of an instruction) during a processor’s operation.
The choice of initialization function will vary according to circumstances: we could choose an initialization function
that enforced some predefined reset state; or we could choose the identity function (which could, of course, mean that
the state evolution of F may not be correct if not all states a ∈ A are permitted). Between these alternatives is a useful
class of initialization functions that leaves initial state a unchanged provided a is already consistent with correct future
state traces of F . In realistic examples, there will be a (possibly large) conjunction κ of (possibly complex) relations
between the components of A (for example, ir = m(pc)) that must be true for correct future traces. We can regard
κ as a consistency-checking invariant that must hold, at certain times, for the correct state evolution of F : in the case
where F represents the implementation of a microprocessor, those times will correspond to the start/end of machine
instructions.6 Invariant κ may be checked by an initialization function init , on initial state a ∈ A: if κ holds, then
init (a) = a. Such initialization functions are an important part of the verification process (Section 8.1), and, together
with duration functions, are analogous to the pipeline invariants of [5] and others.
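As a small illustration of an initialization function that checks a consistency invariant, the sketch below uses the relation ir = m(pc) mentioned above. The state layout and, in particular, the repair action taken when κ fails are assumptions made for the example: the text only requires that init(a) = a when κ holds.

```python
from typing import Tuple

# Hypothetical state: (pc, ir, m) with the memory m held as a tuple of words.
State = Tuple[int, int, Tuple[int, ...]]


def kappa(state: State) -> bool:
    """Consistency invariant: the instruction register holds m(pc)."""
    pc, ir, m = state
    return ir == m[pc]


def init(state: State) -> State:
    """Leave consistent states unchanged; otherwise force the invariant.

    Assumption: an inconsistent state is repaired by reloading ir from memory.
    Other choices (for example, mapping to a fixed reset state) are possible.
    """
    if kappa(state):
        return state
    pc, _ir, m = state
    return (pc, m[pc], m)
```

Whatever repair is chosen, the useful property is that kappa(init(a)) holds for every a, so traces never begin from an inconsistent state.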
3.4. Uniform state-dependent retimings
For each state of an implementation there is an associated state-dependent retiming. In a uniform state-dependent
retiming λ ∈ Ret(A, S, T ) the length $\bar{\lambda}(a)(t + 1) - \bar{\lambda}(a)(t)$ of any clock cycle t ∈ T is independent of the numerical
value of t .
Example
The retiming λ ∈ Ret(A, S, T ) defined by $\lambda(a)(s) = \lfloor s/a \rfloor$ can be uniform, since l(λ(a))(t) = a for all t ∈ T and
a ∈ N+. However, the retiming $\lambda(a)(s) = \lfloor \log_a(s + 1) \rfloor$ cannot be uniform because the value of l(λ(a))(t) increases
monotonically as t becomes larger.
We achieve uniformity by associating a duration with each state in the state-space of F . We structure the definition of
uniform retimings to enable us to syntactically determine uniformity. The retimings used when defining the correctness
of a microprocessor implementation (Sections 3.5 and 4) must be uniform if the one-step theorems are to hold (Section 8).
Definition 8. A state dependent retiming λ ∈ Ret(A, S, T ) is uniform with respect to a state function F : S × A → A
if, and only if, there exists a map dur : A → S+ such that, for all a ∈ A and t ∈ T
$\bar{\lambda}(a)(0) = 0$,
$\bar{\lambda}(a)(t + 1) = \mathit{dur}(F(\bar{\lambda}(a)(t), a)) + \bar{\lambda}(a)(t)$,
where $\bar{\lambda} \in \mathit{Imm}(A, S, T)$ is the immersion of λ and S+ = S − {0}.
The duration function dur : A → S+ used in the definition of λ will form part of the correctness statement when
verifying an implementation with respect to a specification: see Section 3.5. In the case of a microprocessor dur
will generally identify the number of cycles of an implementation clock S corresponding to the instruction being
executed at some time t ∈ T . We may define dur non-constructively such that it searches the future states of an
implementation until it identifies some condition in the processor’s state identifying the end of a machine instruction.
A retiming λ defined in such a way would not explicitly specify how long instructions take to execute at the AC level.
Alternatively, we could define dur constructively such that rather than searching future states, it explicitly specifies the
number of cycles that each instruction should take in any given state. In highly-pipelined, superscalar and SMT/CMT
6 Identifying start/end times of instructions may be problematic in superscalar examples: [6,10] and Section 4.
processors, constructive duration functions become increasingly difficult to define, because of the complexity of the
timing relationships between implementation and specification.
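The recursion in Definition 8 can be sketched as follows, using a constructive duration function. Everything concrete here is an assumption made for illustration: the state layout, the three opcodes and their cycle counts do not come from the paper.

```python
from typing import Tuple

# Hypothetical AC-level state: (pc, busy, program), where busy counts the
# cycles already spent on the instruction at pc.
State = Tuple[int, int, Tuple[str, ...]]

COST = {"add": 1, "load": 2, "mul": 3}  # assumed cycle counts, illustration only


def next_state(state: State) -> State:
    """Advance one cycle of the implementation clock S."""
    pc, busy, program = state
    if busy + 1 == COST[program[pc]]:
        return ((pc + 1) % len(program), 0, program)  # instruction completes
    return (pc, busy + 1, program)


def F(s: int, a: State) -> State:
    """State function over clock S with identity initialization."""
    state = a
    for _ in range(s):
        state = next_state(state)
    return state


def dur(state: State) -> int:
    """Constructive duration: cycles remaining for the current instruction."""
    pc, busy, program = state
    return COST[program[pc]] - busy


def uniform_immersion(a: State, t: int) -> int:
    """lam_bar(a)(0) = 0; lam_bar(a)(t+1) = dur(F(lam_bar(a)(t), a)) + lam_bar(a)(t)."""
    s = 0
    for _ in range(t):
        s = dur(F(s, a)) + s
    return s


a0: State = (0, 0, ("add", "mul", "load"))
```

For a0 the instruction boundaries fall at cycles 0, 1, 4, 6 and 7 of S. The length of each instruction is determined by the state reached at its start, not by the numerical value of t, which is the sense of uniformity used above.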
3.5. Correctness definition for non-superscalar processors
We define the correctness of a non-superscalar processor as follows.
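Definition 9. Abstract circuit model map AC : S × B → B is said to be a correct implementation of programmer’s
model map PM : T × A → A for some initializations if, given state-dependent retiming λ ∈ Ret(B, S, T ) and
surjective data abstraction map ψ : B → A, then for all b ∈ B and all s ∈ S such that s = start (λ(b))(s), the following diagram commutes.
\[
\begin{array}{ccc}
T \times A & \xrightarrow{\;PM\;} & A \\
{\scriptstyle (\lambda,\,\psi)}\,\big\uparrow & & \big\uparrow\,{\scriptstyle \psi} \\
S \times B & \xrightarrow{\;AC\;} & B
\end{array}
\tag{1}
\]
That is, $PM(\lambda(b)(s), \psi(b)) = \psi(AC(s, b))$.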
Data abstraction function ψ maps states B of AC to states A of PM . In the case that PM and AC represent respectively
the programmer’s model PM and abstract circuit model AC of a microprocessor, commonly ψ will simply discard
elements of the AC state not present in the PM state (internal buffer registers etc.), and perhaps perform simple
operations on the remainder. The map ψ must be surjective to ensure correctness for all valid initial states of the map
PM .
State-dependent retiming λ(b) maps the time on the AC clock S to the time on PM clock T given starting state
b ∈ B. A clock condition s = start (λ(b))(s) is present to ensure that an appropriate number of clock cycles s ∈ S is
chosen. In the case of microprocessors, each cycle of the PM clock T typically represents the execution of a single
instruction; the clock condition is only true at those times s ∈ S corresponding to the start/finish of an instruction.
Clearly, we are not concerned with the correctness of the AC representation AC at times s ∈ S mid-way through the
execution of an instruction. In the event that λ is uniform (Section 3.4), the duration function dur : A → S+ used in
the definition of λ forms part of the correctness statement (as does the data abstraction function ψ).
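The correctness condition of Definition 9 can be checked by brute force on a toy example. In this sketch, which is an assumption-laden illustration rather than a verification method from the paper, the PM-level instruction maps an accumulator a to 2a + 1, and the AC-level implementation performs it in two clock cycles (double, then increment) tracked by a phase bit.

```python
from typing import Callable, Tuple

ACState = Tuple[int, int]  # hypothetical AC state: (acc, phase), phase in {0, 1}


def pm(t: int, a: int) -> int:
    """Programmer's model: each instruction maps acc to 2 * acc + 1."""
    for _ in range(t):
        a = 2 * a + 1
    return a


def ac_init(b: ACState) -> ACState:
    """Initialization: force the start of an instruction (phase 0)."""
    acc, _phase = b
    return (acc, 0)


def ac_next(b: ACState) -> ACState:
    """One system clock cycle: phase 0 doubles, phase 1 increments."""
    acc, phase = b
    return (2 * acc, 1) if phase == 0 else (acc + 1, 0)


def ac(s: int, b: ACState) -> ACState:
    b = ac_init(b)
    for _ in range(s):
        b = ac_next(b)
    return b


def psi(b: ACState) -> int:
    """Data abstraction: discard the phase bit."""
    return b[0]


def lam(b: ACState) -> Callable[[int], int]:
    """Retiming: every instruction takes two cycles, whatever the state."""
    return lambda s: s // 2


def clock_condition(s: int, b: ACState) -> bool:
    """s = start(lam(b))(s); for this retiming start(lam(b))(s) = 2 * (s // 2)."""
    return s == 2 * lam(b)(s)


def commutes(s: int, b: ACState) -> bool:
    """Diagram (1): PM(lam(b)(s), psi(b)) == psi(AC(s, b))."""
    return pm(lam(b)(s), psi(b)) == psi(ac(s, b))
```

The diagram commutes at every time satisfying the clock condition, but can fail mid-instruction: starting from acc = 3, after one cycle the implementation holds 6 while the specification still holds 3.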
4. Superscalar correctness models
Although it would in principle be possible to build a non-superscalar SMT processor, in practice this would not be
done because it is extremely unlikely that there would be any performance benefits: the reverse in fact.7 Consequently
we must modify our correctness model to accommodate superscalar processors. In a microprogrammed processor, one
instruction finishes execution before the next one starts: there is always a time s ∈ S on the AC clock corresponding
to each time t ∈ T on the PM clock. In a pipelined processor, instruction execution is overlapped. However, it is still
possible to identify a unique time s ∈ S on the AC clock for each time t ∈ T on the PM clock, corresponding to the
completion of each instruction, and so our correctness statement from Definition 9 remains applicable. In a superscalar
implementation, it is possible for multiple instructions to terminate, or retire, on a single clock cycle. It is also possible
for instructions to retire out of program order in some processor implementations.8 That is, there may be cycles of
clock T that correspond with no cycle of clock S, and no retiming from S to T .
To address this problem, we introduce a retirement clock R. Cycles of clock R mark the retirement of one or more
instructions. We construct two retimings λ1 ∈ Ret(T , R), mapping instruction clock cycles to retirement clock cycles,
and λ2 ∈ Ret(S, R), mapping system clock cycles to retirement clock cycles. Retimings λ1 and λ2 are illustrated in
Fig. 2, where the first two instructions at times t = 0 and t = 1 complete together, taking two cycles of clock S; then
the next three instructions at t = 2, . . . ,t = 4 complete together, taking four cycles of clock S; the instruction at time
t = 5 completes separately, taking two cycles of clock S; and so on.
7 Of course, a non-superscalar CMT processor is perfectly possible.
8 Internally, many superscalar implementations allow instructions to finish out of program order. However, this fact is generally hidden from the
programmer in order to preserve architectural compatibility (precise architectural state [19]) with previous non-superscalar implementations, and
because out-of-order terminations make debugging complex.
[Figure 2: the clocks R, T and S, with the retimings λ1 and λ2 drawn between them.]
Fig. 2. Retimings from T to R and from S to R.
We can construct the adjunct retiming ρ : S → T from system clock cycles to instruction clock cycles by composition:
$\rho(s) = \bar{\lambda}_1(\lambda_2(s))$,
where $\bar{\lambda}_1$ is the immersion of λ1.
Note that, unlike the retimings of Section 3.2, adjunct retimings need not be surjective.
Although adjunct retimings are not in general surjective, we can use them to construct correctness statements for
superscalar microprocessors. We are not concerned with times t ∈ T not in the image of ρ because such times do not
exist in the implementation: i.e. in such cases, a sequence of PM states $a_x, \ldots, a_{t-1}, a_t, a_{t+1}, \ldots, a_y$ is effectively
implemented as a single transition $a_x, a_y$, where x and y are consecutive times in the image of ρ.
Definition 10. The state-dependent adjunct retiming ρ constructed from state-dependent retimings λ1 ∈ Ret(A, T , R)
and λ2 ∈ Ret(A, S, R) is defined as follows:
$\rho(a) = \bar{\lambda}_1 \lambda_2(a)$.
The set of all state-dependent adjunct retimings from S to T through R is denoted by Ret(A, S,R, T ).
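The adjunct retiming can be illustrated with the retirement pattern described for Fig. 2. The tables below are a sketch reconstructed from that prose description and cover only the first three retirement cycles; the names LAMBDA1, LAMBDA2 and rho are chosen for the example.

```python
from typing import List

# lambda1 : T -> R for instruction cycles t = 0..5.
# Instructions 0-1 retire together, then 2-4 together, then 5 alone.
LAMBDA1: List[int] = [0, 0, 1, 1, 1, 2]

# lambda2 : S -> R for system cycles s = 0..7.
# The three retirement cycles take 2, 4 and 2 system clock cycles respectively.
LAMBDA2: List[int] = [0, 0, 1, 1, 1, 1, 2, 2]


def immersion(table: List[int], r: int) -> int:
    """Least index that the tabulated retiming maps to r."""
    return table.index(r)


def rho(s: int) -> int:
    """Adjunct retiming: immersion of lambda1 applied to lambda2(s)."""
    return immersion(LAMBDA1, LAMBDA2[s])


image = sorted({rho(s) for s in range(len(LAMBDA2))})
```

On this prefix the image of rho is {0, 2, 5}: instruction times 1, 3 and 4 are never hit, which is the sense in which an adjunct retiming need not be surjective.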
5. Simultaneous multi-threading and multi-core
Modern superscalar microprocessors attempt to execute multiple instructions simultaneously, possibly out of
program order [19]. In order to achieve this, they contain multiple, parallel pipelines,9 and multiple execution units
which are usually specialized to deal with different operations. A processor might contain execution units for integer,
floating point, memory access, and branch operations. Modern high-performance processors will duplicate units that
can expect high demand (for example, the integer unit). The Pentium 4 microarchitecture NetBurst has seven execution
units in its Prescott variant: there are two simple integer units, a complex integer unit, two floating point units (which
are also used for streaming SIMD extension 3 (SSE3) instructions, used mainly in graphics/multimedia applications),
and separate memory load and store units. Data dependencies between instructions prevent such hardware being fully
utilized. For example, consider the following pair of (integer) operations:
a = b + c;
d = b - c;
Both of these operations may proceed simultaneously on a processor with two integer units. However the operations:
a = b + c;
d = b - a;
9 Or a single pipeline capable of dealing with multiple instructions simultaneously, depending on one’s point of view.
cannot, because the second depends on the first. We say there is a (true) data dependency between the operations.
Because the operations are close together, they give rise to a read-after-write (RAW) hazard10 – an error will occur
unless we take special steps to prevent it. So it would not be possible for both to proceed simultaneously, and one
integer execution unit would be idle. Specifically, the first instruction must complete – or at least its result value a must
be available – before the second can proceed.11 It may well be the case that some other, unrelated, operation can use
the second integer unit, or that the compiler can re-order the code to insert other, unrelated, instructions between the
two. However, in real processors, execution units (and much of the rest of the implementation pipeline) are idle for
substantial periods of time.
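The distinction between the two pairs of operations above can be expressed as a simple check. This is a minimal sketch under stated assumptions: operations are reduced to a destination and a tuple of sources, only true (read-after-write) dependencies are considered, and the names Op and raw_dependency are hypothetical.

```python
from typing import NamedTuple, Tuple


class Op(NamedTuple):
    """A three-address operation: dest = srcs[0] <operator> srcs[1]."""
    dest: str
    srcs: Tuple[str, ...]


def raw_dependency(first: Op, second: Op) -> bool:
    """True if `second` reads the value that `first` writes (a true data dependency)."""
    return first.dest in second.srcs


def can_issue_together(first: Op, second: Op) -> bool:
    """Simplified test: two adjacent operations may proceed in parallel
    only if the second does not depend on the result of the first."""
    return not raw_dependency(first, second)


add_bc = Op("a", ("b", "c"))   # a = b + c
sub_bc = Op("d", ("b", "c"))   # d = b - c
sub_ba = Op("d", ("b", "a"))   # d = b - a
```

The first pair (add_bc, sub_bc) can occupy two integer units at once, while the second pair (add_bc, sub_ba) cannot, because sub_ba needs the value of a. A fuller check would also cover the reverse and output dependencies mentioned in the introduction.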
A simultaneous multi-threading (SMT) processor attempts to utilize idle resources in the execution pipeline by
interleaving multiple independent threads of instructions. Because threads are essentially independent, dependencies
cannot arise between them except in the specific and rare (at the level of abstraction of instructions) instances of
inter-thread communication. Note that from the perspective of an operating system, a simultaneous multi-threaded
processor12 appears as multiple physical processors, thus taking advantage of existing operating systems that are
already multi-processor aware.
At the level of their implementation (the AC level), SMT machines are nominally uniprocessors – that is, they do not
explicitly duplicate all the hardware typically found within a microprocessor, which would rather defeat the object. By
necessity some parts are replicated: however, the majority of hardware is shared by the executing instruction threads.
SMT is commonly called Hyperthreading – the name chosen by Intel for its implementation. By steadily increasing
the proportion of hardware that is duplicated in an SMT pipeline, eventually we will reach the point at which it is all
duplicated and we have, effectively, a CMT processor. Note there is no reason why a processor cannot be both CMT
and SMT.
It is important to note that ‘multi-threading’ in SMT/CMT processors is low level. That is, we are simply concerned
with the process of executing instructions from separate threads: we are not concerned with issues like deadlock and
lockout. All such high-level issues are implemented by the operating system or the application program using the
primitives provided by the hardware. We do, however, have to consider the issue of shared state: that is, inter-thread
communication via shared memory. Practical implementations of SMT/CMT generally also address issues like cache
coherence, though we omit discussion of that here.
5.1. SMT in the Pentium 4
The implementation of SMT in Pentium 4 processors (NetBurst) is called Hyperthreading [24] and is considered
problematic. This is because the very long pipeline is difficult to fill, and consequently the instruction scheduler is
aggressive. It will schedule instructions for execution even when operands are not available in the Level 1 cache. When
such an instruction (inevitably) fails, the replay unit will simply repeatedly attempt to re-execute it until it succeeds.
This will very nearly fully occupy the processor’s pipeline until it succeeds, starving other threads of processor time.
Since it may be necessary to fetch an operand, or operands, from main memory, this situation could last for some time.
This problem is a consequence of the NetBurst microarchitecture, which is now being superseded. Hyperthreading
is not enabled in the early Pentium 4 successor processors (the Core series). However, it is considered probable that it
will be reactivated in future processors. If so, such processors would be both CMT and SMT.
6. Modelling SMT processors
In this section, we consider how we can extend our existing microprocessor model to accommodate SMT/CMT
processors. From the perspective of an operating system kernel programmer, an SMT/CMT processor appears as
multiple PM-level processors in which some state is shared. We will use the term virtual thread model (VTM)
to distinguish these processors from the conventional PM level. The approach taken in this paper is to include the
shared state in the model definition. An alternative suggested in [9] is to move the shared state outside the (immediate)
processor model, and to treat communication with it as explicit input–output. Such an approach would enable each
thread to be modelled independently, potentially simplifying the model. We do not adopt that approach here (and
omit all discussion of explicit input–output) precisely because one of the aims is to model all threads simultaneously.
However, future work on an input–output based model is planned.
10 The term hazard is adopted from digital electronics where it denotes a slightly different concept – a timing error that may or may not occur,
depending on the precise physical circumstances at the time. The hazards we are concerned with will definitely lead to errors unless appropriate
steps are taken.
11 We omit discussion of the possibility of predicting the value of a since no available hardware implements this as yet, though it is in principle
possible.
12 At least in the case of Intel’s current implementation.
[Fig. 3. Conventional AC-PM Model compared to SMT/CMT VTM-ITM Model. In the conventional model the PM level is
related to the AC level by (ψ, λ/ρ). In the SMT/CMT model the VTM level is related to the ITM level by (ψ′, λ′/ρ′) and
the ITM level to the AC level by (ψ′′, λ′′/ρ′′), where ψ = ψ′ψ′′, λ = λ′λ′′, ρ = ρ′ρ′′. The AC levels are the same in both
models; the VTM level for CMT/SMT corresponds with the PM level in conventional processors; the ITM level serves to
minimize details of the AC implementation visible at the VTM level (in this paper, ITM = AC).]
In the case of communicating threads the temporal relationship with other VTM processors is exposed via the shared
state. This relationship is defined by state information that is not present in the VTM state, but is in the implementation
(AC) state. Consequently, VTM models must be parameterized by at least part of the corresponding AC state. In this
paper we choose to use the complete AC state. However, there is a case for introducing a new, intermediate level of
abstraction: that is, we should not expose any more complexity at any given level than is strictly necessary. This must be
set against the cost of introducing a new level of abstraction and the resulting overall model complexity. Nevertheless, we do
not wish to preclude the future inclusion of a separate abstraction level between the AC and VTM levels. Consequently
we will use the term intermediate thread model (ITM) which in this paper will be synonymous with abstract circuit
model, but in general need not be. The relationship between the ‘conventional’ AC-PM model and the VTM-ITM
model for SMT/CMT processors is shown in Fig. 3. In the case of this paper, the maps λ′′, ψ′′ and ρ′′ are all the identity
function.
6.1. VTM model definition
Consider an SMT/CMT processor able to execute n threads. Each virtual processor $F^i_{VTM}$, i ∈ {1, . . . , n}, will
operate over clock Ti; the state set VTM of $F^i_{VTM}$ will be composed of some parts that are local and some that are
shared with processors $F^1_{VTM}, \ldots, F^{i-1}_{VTM}, F^{i+1}_{VTM}, \ldots, F^n_{VTM}$. We assume, without loss of generality, that the local
state elements priv ∈ privVTM precede the shared state elements share ∈ shareVTM in the state vector:
VTM = privVTM × shareVTM.
The state trace of each VTM-level processor will be a function of its own local and shared state, and the shared state
of all other VTM-level processors. There is only one shared state in the ITM-level implementation state set ITM.
However each individual VTM model appears to have its own copy from its own perspective: a conceit we wish to
maintain. Consequently we need to merge the shared states of each VTM-level processor: Fig. 4.
Each VTM-level processor operates with its own clock: to correctly merge shared states from different VTM-level
processors, we must match states at the appropriate times. We can relate times on different VTM-level processors
using the retimings and corresponding immersions: Fig. 5.
[Fig. 4. Generating the shared state in VTM Models. The states (private and shared parts) of VTM1, . . . , VTMn are
merged by τ to generate the new shared state of VTMi.]
[Fig. 5. Timing Relationship between VTM Clocks for Thread 1 and Thread 2. The figure shows the clocks T1 and T2 for
Threads 1 and 2, the clocks R1 and R2, and the implementation clock S, together with the component functions for
mapping from Thread 1’s clock to Thread 2’s clock, where a is some ITM (implementation) state.]
Definition 11. Given clocks Ti, for i ∈ {1, . . . , n}; ITM state set ITM; VTM state set VTM; private state projection
functions $\pi^i_{priv}$ : VTM → privVTM; merge operator τ : (VTM)^n → shareVTM; data abstraction functions ψi : ITM →
VTM; and next-state and initialization functions next : VTM → VTM and init : VTM → VTM, we model an
individual FVTM level processor $F^i_{VTM}$ as follows:
\[
\begin{aligned}
F^i_{VTM} &: \mathit{ITM} \to [T_i \times \mathit{VTM} \to \mathit{VTM}],\\
F^i_{VTM}(\sigma_{ITM})(0, \sigma_{VTM}) &= \mathit{init}(\sigma_{VTM}),\\
F^i_{VTM}(\sigma_{ITM})(t+1, \sigma_{VTM}) &= \mathit{next}[\pi^i_{priv}(F^i_{VTM}(\sigma_{ITM})(t, \sigma_{VTM})),\ \mathit{merge}(t, \sigma_{ITM}, \sigma_{VTM})],
\end{aligned}
\]
where merge : Ti × ITM × VTM → shareVTM is defined by
\[
\begin{aligned}
\mathit{merge}(t, \sigma_{ITM}, \sigma_{VTM}) = \tau(&F^i_{VTM}(\sigma_{ITM})(t, \sigma_{VTM}),\\
&F^1_{VTM}(\sigma_{ITM})(\rho_{1,i}(\sigma_{ITM})(t), \psi_1(\sigma_{ITM})),\\
&\quad\vdots\\
&F^{i-1}_{VTM}(\sigma_{ITM})(\rho_{i-1,i}(\sigma_{ITM})(t), \psi_{i-1}(\sigma_{ITM})),\\
&F^{i+1}_{VTM}(\sigma_{ITM})(\rho_{i+1,i}(\sigma_{ITM})(t), \psi_{i+1}(\sigma_{ITM})),\\
&\quad\vdots\\
&F^n_{VTM}(\sigma_{ITM})(\rho_{n,i}(\sigma_{ITM})(t), \psi_n(\sigma_{ITM}))),
\end{aligned}
\]
and
\[
\rho_{i,j}(\sigma_{ITM}) = \rho_i(\sigma_{ITM})\,\bar{\rho}_j(\sigma_{ITM}), \quad \text{for } i, j \in \{1, \ldots, n\}.
\]
Definition 11 uses unchanged the next-state and initialization functions from a probably pre-existing PM model.
This preserves the (probably substantial) work of defining such functions.
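The recursion in Definition 11 can be illustrated with a minimal Python sketch. It is not the paper's model: it assumes that init is the identity, that every retiming is the identity (so thread j is sampled at the same time t as thread i), and that a state is simply a pair (private, shared) of integers. The names `run_vtm`, `tau_max` and `next_add` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]  # (private, shared): a toy VTM state, assumed for illustration


def run_vtm(
    start: List[State],
    next_fn: Callable[[int, int], State],
    tau: Callable[[int, List[State]], int],
    steps: int,
) -> List[State]:
    """Advance all virtual processors in lockstep for `steps` cycles.

    ASSUMPTIONS: init is the identity and all retimings are the identity,
    so every thread is sampled at the same time t.
    """
    states = list(start)
    for _ in range(steps):
        # Each thread keeps its private part and receives a freshly merged shared part.
        states = [next_fn(states[i][0], tau(i, states)) for i in range(len(states))]
    return states


def tau_max(i: int, states: List[State]) -> int:
    # Hypothetical merge operator: the largest shared value wins.
    return max(shared for _, shared in states)


def next_add(private: int, shared: int) -> State:
    # Hypothetical next-state function: add the private value to the shared value.
    return (private, shared + private)


final_states = run_vtm([(1, 0), (2, 0)], next_add, tau_max, 3)
print(final_states)  # [(1, 5), (2, 6)]
```

Each thread sees its own copy of the shared value, and the copies are reconciled by the merge before every next-state step, which is the structure of the t + 1 case in the definition.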
Note that the composition of adjunct retimings to define the map between thread clocks (as, for example, in Fig. 5)
is well-defined. In particular
\[
\rho_{j,i}(s)\,\rho_{i,j}(s)(t) \le t.
\]
As well as being parameterized by the implementation state ITM, our definition contains the state dependent
adjunct retimings ρi , i ∈ {1, . . . ,n}. Although it is in principle possible to define each ρi independently of an FITM
level model, in practice the complexity of such a definition makes it impractical, and it is more usual to define ρi in
terms of FITM (Section 3.2). Consequently, the ITM implementation of a VTM level model is deeply embedded in the
definition of FVTM. However, this embedding is necessary since the relative timing behaviour of threads in SMT/CMT
processors does depend on the precise implementation. Observe also that FVTM is parameterized by both the ITM and
VTM state sets, and we cannot expect correct operation if we do not use the correct, and corresponding starting states.
However, we address this issue when defining correctness (see Section 7).
We use the data abstraction maps ψi : ITM → VTM to extract the VTM-level state data for each processor. These
data abstraction maps will also form part of the correctness statement (see Section 7).
The merge operation τ unifies the various shared state components of the n FVTM processors. The definition of τ
will depend on (i) the precise nature of the shared state shareVTM; and (ii) the behaviour of the processor implementation
– for example, if two FVTM processors attempt to update the same state unit simultaneously. A common situation will
be where the shared state consists of the processor’s main memory. One possible, though very simplistic, definition in
this case is as follows:
Definition 12. Let [I → D] be the shared component of FVTM state VTM, where I is an index set and D is a data
set, both consisting of bit vectors. The merge operation τ : [I → D]^n → [I → D] is defined as follows:
\[
\tau(m_i, m_1, \ldots, m_{i-1}, m_{i+1}, \ldots, m_n)[j] =
\begin{cases}
m_i[j], & \text{if } m_1[j] = \cdots = m_n[j];\\
m_1[j], & \text{if } m_i[j] \ne m_1[j];\\
\quad\vdots & \\
m_n[j], & \text{if } m_i[j] \ne m_n[j].
\end{cases}
\]
We use the notation m[j ] to represent the j th value of m: see Section 9. Note that Definition 12 leaves open the
behaviour in the case that more than one FVTM processor has changed (or attempted to change) an individual memory
word.
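A minimal Python sketch of the merge in Definition 12 follows. It assumes a finite index set represented by dictionary keys, and, because the definition leaves the multiple-writer case open, it arbitrarily lets the lowest-numbered differing copy win; that tie-break is an assumption of this sketch, not part of the definition. The name `merge_tau` is hypothetical.

```python
from typing import Dict, List

Memory = Dict[int, int]  # finite stand-in for [I -> D] (assumption)


def merge_tau(own: Memory, others: List[Memory]) -> Memory:
    """Merge the shared memories of n threads from the viewpoint of one thread.

    For each index j: keep own[j] if every copy agrees with it; otherwise take
    the value from a copy that differs from own[j].
    ASSUMPTION: if several copies differ, the first one in `others` wins.
    """
    merged = dict(own)
    for j in own:
        for other in others:
            if other[j] != own[j]:
                merged[j] = other[j]
                break  # assumed tie-break: first differing copy
    return merged


m_i = {0: 5, 1: 7, 2: 9}
m_1 = {0: 5, 1: 8, 2: 9}  # thread 1 changed word 1
m_2 = {0: 5, 1: 7, 2: 4}  # thread 2 changed word 2
print(merge_tau(m_i, [m_1, m_2]))  # {0: 5, 1: 8, 2: 4}
```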
In Definition 11, the merge operator is a function of both private and shared state. However, the merge operator in
Definition 12 above does not depend on any private state components. In this case we can assume for the purposes
of the definition that there are no private state components. However, it will commonly, but not always (see
Section 9) be the case that the merge operator is independent of any private state components. In such cases,
we can modify Definition 11 for independent shared state by adding a shared state projection operator πshare as
follows:
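Definition 13. Given clocks Ti, for i ∈ {1, . . . , n}; ITM state set ITM; VTM state set VTM; projection functions
$\pi^i_{priv}$ : VTM → privVTM and πshare : VTM → shareVTM; merge operator τ : (shareVTM)^n → shareVTM; and next-state and
initialization functions next : VTM → VTM and init : VTM → VTM, we model an individual FVTM level processor
$F^i_{VTM}$ with independent shared state as follows.
\[
\begin{aligned}
F^i_{VTM} &: \mathit{ITM} \to [T_i \times \mathit{VTM} \to \mathit{VTM}],\\
F^i_{VTM}(\sigma_{ITM})(0, \sigma_{VTM}) &= \mathit{init}(\sigma_{VTM}),\\
F^i_{VTM}(\sigma_{ITM})(t+1, \sigma_{VTM}) &= \mathit{next}[\pi^i_{priv}(F^i_{VTM}(\sigma_{ITM})(t, \sigma_{VTM})),\ \mathit{merge}(t, \sigma_{ITM}, \sigma_{VTM})],
\end{aligned}
\]
where merge : Ti × ITM × VTM → shareVTM is defined by
\[
\begin{aligned}
\mathit{merge}(t, \sigma_{ITM}, \sigma_{VTM}) = \tau(&\pi_{share}(F^i_{VTM}(\sigma_{ITM})(t, \sigma_{VTM})),\\
&\pi_{share}(F^1_{VTM}(\sigma_{ITM})(\rho_{1,i}(\sigma_{ITM})(t), \psi_1(\sigma_{ITM}))),\\
&\quad\vdots\\
&\pi_{share}(F^{i-1}_{VTM}(\sigma_{ITM})(\rho_{i-1,i}(\sigma_{ITM})(t), \psi_{i-1}(\sigma_{ITM}))),\\
&\pi_{share}(F^{i+1}_{VTM}(\sigma_{ITM})(\rho_{i+1,i}(\sigma_{ITM})(t), \psi_{i+1}(\sigma_{ITM}))),\\
&\quad\vdots\\
&\pi_{share}(F^n_{VTM}(\sigma_{ITM})(\rho_{n,i}(\sigma_{ITM})(t), \psi_n(\sigma_{ITM})))).
\end{aligned}
\]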
Definition 13 differs from Definition 11 only in the definition of the merge operation.
A further variation we might wish to consider is multiple merge operators. For example, consider the case where
not all execution threads are considered equal. As a simple example, a processor with two threads, with a shared
memory, might give priority to memory writes from one thread when they conflict. That is, if both threads attempt
to write to the same word at the same time, the write from one thread is ignored. We modify Definition 11 above as
follows.
Definition 14. Given clocks Ti, for i ∈ {1, . . . , n}; ITM state set ITM; VTM state set VTM; private state
projection functions $\pi^i_{priv}$ : VTM → privVTM; merge operators τi : (VTM)^n → shareVTM; and next-state and initialization
functions next : VTM → VTM and init : VTM → VTM, we model an individual FVTM level processor $F^i_{VTM}$
with independent merge as follows:
\[
\begin{aligned}
F^i_{VTM} &: \mathit{ITM} \to [T_i \times \mathit{VTM} \to \mathit{VTM}],\\
F^i_{VTM}(\sigma_{ITM})(0, \sigma_{VTM}) &= \mathit{init}(\sigma_{VTM}),\\
F^i_{VTM}(\sigma_{ITM})(t+1, \sigma_{VTM}) &= \mathit{next}[\pi^i_{priv}(F^i_{VTM}(\sigma_{ITM})(t, \sigma_{VTM})),\ \mathit{merge}_i(t, \sigma_{ITM}, \sigma_{VTM})],
\end{aligned}
\]
where mergei : Ti × ITM × VTM → shareVTM is defined by
\[
\begin{aligned}
\mathit{merge}_i(t, \sigma_{ITM}, \sigma_{VTM}) = \tau_i(&F^1_{VTM}(\sigma_{ITM})(\rho_{1,i}(\sigma_{ITM})(t), \psi_1(\sigma_{ITM})),\\
&\quad\vdots\\
&F^{i-1}_{VTM}(\sigma_{ITM})(\rho_{i-1,i}(\sigma_{ITM})(t), \psi_{i-1}(\sigma_{ITM})),\\
&F^i_{VTM}(\sigma_{ITM})(t, \sigma_{VTM}),\\
&F^{i+1}_{VTM}(\sigma_{ITM})(\rho_{i+1,i}(\sigma_{ITM})(t), \psi_{i+1}(\sigma_{ITM})),\\
&\quad\vdots\\
&F^n_{VTM}(\sigma_{ITM})(\rho_{n,i}(\sigma_{ITM})(t), \psi_n(\sigma_{ITM}))).
\end{aligned}
\]
Clearly, we could also combine Definitions 13 and 14 (independent shared state and independent merge). We omit
this obvious definition.
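The two-thread priority scenario that motivates Definition 14 can be sketched in Python as follows. This is only an illustration under stated assumptions: each thread's view is a pair (memory copy, address it has just written), memories are dictionaries, and thread 1 is the priority thread. The names `ThreadView`, `write`, `tau_1` and `tau_2` are hypothetical.

```python
from typing import Dict, Tuple

Memory = Dict[int, int]
ThreadView = Tuple[Memory, int]  # (memory copy, address just written): assumed shape


def write(m: Memory, addr: int, value: int) -> Memory:
    """Functional memory update: return a copy of m with m[addr] = value."""
    updated = dict(m)
    updated[addr] = value
    return updated


def tau_1(own: ThreadView, other: ThreadView) -> Memory:
    """Merge for the priority thread: adopt the other copy, then re-apply own write."""
    m1, a1 = own
    m2, _ = other
    return write(m2, a1, m1[a1])


def tau_2(own: ThreadView, other: ThreadView) -> Memory:
    """Merge for the non-priority thread: keep own copy, but thread 1's write wins."""
    m2, _ = own
    m1, a1 = other
    return write(m2, a1, m1[a1])


# Both threads wrote word 1; thread 1 wrote 10, thread 2 wrote 20.
t1 = ({0: 1, 1: 10}, 1)
t2 = ({0: 1, 1: 20}, 1)
print(tau_1(t1, t2), tau_2(t2, t1))  # {0: 1, 1: 10} {0: 1, 1: 10}
```

The two operators take their arguments from different perspectives, yet both threads end up with the same shared memory, in which the conflicting write from thread 2 has been ignored.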
7. Correctness of SMT/CMT processors
Given definitions of VTM and ITM level models, we now consider what it means for a VTM-level model to be
correctly implemented by an ITM-level model. Note that because there are n VTM level processors corresponding
to each ITM level processor, we must establish the commutativity of diagram (2), or equivalently of Eq. (3), for each
of the n separate cases.
Definition 15. An intermediate thread model map FITM : S × ITM → ITM is said to be a correct implementation of
virtual thread model maps $F^i_{VTM}$ : ITM → [Ti × VTM → VTM], for i ∈ {1, . . . , n}, for some initializations if, given
state-dependent adjunct retimings ρi ∈ Ret(ITM, S, Ti) and surjective data abstraction maps ψi : ITM → VTM,
then, for each clock Ti, for all s such that s = start(ρi(σITM))(s), and for all states σITM ∈ ITM, the following diagrams,
i ∈ {1, . . . , n}, commute.
\[
\begin{array}{ccc}
T_i \times \mathit{VTM} & \xrightarrow{\;F^i_{VTM}(\mathit{ITM})\;} & \mathit{VTM}\\
\Big\uparrow{\scriptstyle(\rho_i,\,\psi_i)} & & \Big\uparrow{\scriptstyle\psi_i}\\
S \times \mathit{ITM} & \xrightarrow{\;F_{ITM}\;} & \mathit{ITM}
\end{array}
\tag{2}
\]
Alternatively
\[
\psi_i(F_{ITM}(s, \sigma_{ITM})) = F^i_{VTM}(\sigma_{ITM})(\rho_i(\sigma_{ITM})(s), \psi_i(\sigma_{ITM})). \tag{3}
\]
8. SMT/CMT processors and the one-step theorems
An important feature of our earlier models of microprocessors (microprogrammed, pipelined and superscalar)
is the set of one-step theorems, which reduce verification to state exploration by eliminating induction over time. The
fundamental notion is that, in actual hardware, future state evolution is dependent only on the current state13: not on the
value of some clock T . Therefore, an implementation AC is correct provided it (a) correctly implements all possible
state transitions in the specification PM; and (b) is subsequently in a state that will allow it to correctly implement
further possible state transitions of PM . As well as reducing the complexity of practical verifications, the one-step
theorems represent an important property of real hardware: that is, state transition does depend only on the current
state and (possibly) inputs at the current time. Hence if the one-step theorems do not hold for the extended SMT/CMT
model, confidence in it is substantially reduced.
13 And possibly current inputs: see [6].
The definitions of the one-step theorems found in [6,12] must be extended to accommodate our extended SMT/CMT
model. In particular, it is necessary to establish that our VTM model is time-consistent. We proceed as follows. First,
we briefly review the one-step theorems from [6,12]. Then we explain why further theorems are required for the VTM
model. Finally, we state and prove the new one-step theorems for the VTM model.
8.1. Simplifying correctness with the one-step theorems
The one-step theorems (Theorems 16 and 17 below) allow us to simplify the formal verification of a map AC :
S × B → B with respect to a map PM : T × A → A, with data abstraction map ψ : B → A and uniform state-
dependent retiming λ ∈ Ret(B, S, T ) by only considering correctness at times s = 0, and s = λ(b)(1): that is, initially
and after one cycle of the more abstract clock T . To apply the one-step theorem, we must establish that PM and AC
are time-consistent: that is, for all b ∈ B and t, t′ ∈ T :
AC(λ(b)(t + t′), b) = AC(λ(AC(λ(b)(t), b))(t′), AC(λ(b)(t), b)); and (4)
PM(t + t′, ψ(b)) = PM(t′, PM(t, ψ(b))). (5)
The correctness of Eqs. (4) and (5) depends on the definitions of the initialization functions initAC and initPM for AC
and PM respectively. Eqs. (4) and (5) require that initAC can be inserted in the trace of AC at times s = λ(b)(t), and
initPM in the trace of PM at all times t, with no effect. Note that Eqs. (4) and (5) are trivially true if initAC and
initPM are the identity operation, which is commonly true for the PM case (Eq. (5)) though generally not for the AC
case (Eq. (4)). Note that in the case of AC (Eq. (4)) we only require time consistency for times s ∈ S corresponding to the
start/end of cycles of clock T , where S and T are related by retiming λ. In this case, we say that AC is time-consistent
with respect to λ.
The obvious proof of Eq. (4) above is by induction, which would defeat the point of eliminating induction in the correctness
proof. However, the first one-step theorem (Theorem 16), enables us to eliminate induction when establishing time-
consistency for iterated maps with initialization functions. Then, in Theorem 17, we state the principal one-step theorem
which is used to establish correctness.
Theorem 16. Let F : T × A → A be an iterated map with initialization function init : A → A. Let λ ∈ Ret(A, S, T )
be a uniform retiming with respect to F . For all t, t′ ∈ T , a ∈ A
F(λ(a)(t + t ′), a) = F(λ(F (λ(a)(t ′), a))(t), F (λ(a)(t ′), a))
if and only if
F(λ(a)(0), a) = init (F (λ(a)(0), a)) and (6)
F(λ(a)(1), a) = init (F (λ(a)(1), a)). (7)
That is, to establish time-consistency, it is sufficient to check at times t = 0 and t = 1 that a = init (a), where a is
the state of F at times t = 0 and t = 1.
Proof. Omitted. See [6]. Note Theorem 16 still holds if retiming λ is replaced by adjunct retiming ρ (that is, for
superscalar processors): see [6]. 
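The content of Theorem 16 can be illustrated with a small Python sketch that compares the two conditions of Eqs. (6) and (7) against a brute-force check of time-consistency. The sketch assumes the identity retiming (so λ(a)(t) = t) and a small finite state set; the helper names and the two example initialization functions are hypothetical.

```python
from typing import Callable, Iterable


def iterated_map(init: Callable[[int], int], nxt: Callable[[int], int]):
    """Build F with F(0, a) = init(a) and F(t + 1, a) = nxt(F(t, a))."""
    def F(t: int, a: int) -> int:
        state = init(a)
        for _ in range(t):
            state = nxt(state)
        return state
    return F


def one_step_conditions(F, init, states: Iterable[int]) -> bool:
    # Eqs. (6) and (7) under the identity retiming: init is a no-op at t = 0 and t = 1.
    return all(F(t, a) == init(F(t, a)) for a in states for t in (0, 1))


def time_consistent(F, states: Iterable[int], horizon: int) -> bool:
    # Brute force: F(t + u, a) == F(u, F(t, a)) for all sampled t, u, a.
    return all(
        F(t + u, a) == F(u, F(t, a))
        for a in states for t in range(horizon) for u in range(horizon)
    )


nxt = lambda a: (a + 1) % 8
init_wrap = lambda a: a % 8   # maps every value to a legal state in 0..7
init_reset = lambda a: 0      # always resets: disturbs a running trace

good = iterated_map(init_wrap, nxt)
bad = iterated_map(init_reset, nxt)
print(one_step_conditions(good, init_wrap, range(16)), time_consistent(good, range(16), 6))
print(one_step_conditions(bad, init_reset, range(16)), time_consistent(bad, range(16), 6))
```

In both cases the two checks agree: the wrapping initialization passes both, and the resetting initialization fails both, already at t = 1.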
Theorem 17. Let PM : T × A → A and AC : S × B → B be iterated maps. Let λ ∈ Ret(B, S, T ) be a uniform
retiming with respect to AC. Let ψ : B → A be a surjective data abstraction map. If
(1) PM is time-consistent; and
(2) AC is time-consistent with respect to λ,
then for all b ∈ B and s = start (λ(b))(s)
PM(λ(b)(s), ψ(b)) = ψ(AC(s, b))
if and only if
PM(0, ψ(b)) = ψ(AC(0, b)) and (8)
PM(1, ψ(b)) = ψ(AC(λ(b)(1), b)). (9)
That is, having established that PM and AC are time-consistent, it is sufficient to show that diagram (1) commutes
at times t = s = 0, and t = 1, s = λ(b)(1) to ensure that it commutes at all times t . Note that the equivalent one-
step theorem for the superscalar case, where specification and implementation are temporally related by an adjunct
retiming is slightly different [6]. We omit the formal statement of this case but we do state and prove the one-
step theorem for SMT/CMT processors (which is very similar to the superscalar case) with adjunct retimings in
Section 8.2.
Proof. Omitted. See [6]. 
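The one-step check of Eqs. (8) and (9) can be sketched in Python on a toy pair of machines. Everything here is an assumption made for illustration: the implementation counts system cycles, the specification counts pairs of cycles, ψ(b) = b div 2, and the state-dependent map from cycle t of clock T to a time on clock S (written λ(b)(1) in Eq. (9)) is called `immersion`. Initialization functions are taken to be the identity, so both maps are trivially time-consistent.

```python
from typing import Callable, Iterable


def iterate(nxt: Callable[[int], int]):
    """Iterated map with identity initialization: F(t, a) = nxt applied t times."""
    def F(t: int, a: int) -> int:
        for _ in range(t):
            a = nxt(a)
        return a
    return F


PM = iterate(lambda a: a + 1)      # specification over clock T (toy)
AC = iterate(lambda b: b + 1)      # implementation over system clock S (toy)
AC_BAD = iterate(lambda b: b + 2)  # deliberately wrong implementation


def psi(b: int) -> int:
    # Toy data abstraction map: two implementation states per specification state.
    return b // 2


def immersion(b: int) -> Callable[[int], int]:
    # State-dependent timing: an odd start state needs one cycle fewer.
    return lambda t: 0 if t == 0 else 2 * t - (b % 2)


def one_step_ok(ac, states: Iterable[int]) -> bool:
    # Eqs. (8) and (9): check only t = 0 and t = 1.
    return all(
        PM(0, psi(b)) == psi(ac(0, b)) and PM(1, psi(b)) == psi(ac(immersion(b)(1), b))
        for b in states
    )


def correct_everywhere(ac, states: Iterable[int], horizon: int) -> bool:
    # Brute-force version of the commuting condition for t < horizon.
    return all(
        PM(t, psi(b)) == psi(ac(immersion(b)(t), b))
        for b in states for t in range(horizon)
    )


print(one_step_ok(AC, range(10)), correct_everywhere(AC, range(10), 6))
print(one_step_ok(AC_BAD, range(10)), correct_everywhere(AC_BAD, range(10), 6))
```

For the correct toy implementation both checks succeed; for the faulty one the failure is already visible at t = 1, without exploring longer traces.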
8.2. Extending the one-step theorems to the VTM model
In considering how we can apply the one-step theorems to our VTM model, recall that an ITM processor FITM :
S × ITM → ITM is essentially identical to an AC processor AC. That is, it represents the implementation of a
more abstract specification, normally including an initialization function initITM : ITM → ITM, and the fact that
it represents multiple more abstract processors is not relevant to its own time-consistency. We are interested in its
correctness with respect to some uniform state-dependent retiming λ at times s such that s = start (λ)(s). Hence the
existing one-step Theorem 16 shows how to establish that FITM is time-consistent. However, observe that unlike the
non-SMT/CMT case, there are n VTM level clocks T1, . . . ,Tn and n corresponding [adjunct] retimings ρ1, . . . ,ρn.
An ITM-level implementation must be shown to be time consistent (by means of Theorem 16 or some alternative
technique) with respect to each of these retimings.
Although the existing one step theorem for establishing time consistency (Theorem 16) applies to the ITM-level
implementation, it is not immediately clear that a VTM level model is time-consistent, where multiple iterated maps
interact and share state via merge operators, retimings and their immersions.
Theorem 18. Let Fi : ITM → [Ti × VTM → VTM] be the ith component of a VTM level processor; let G :
S × ITM → ITM be a time-consistent (candidate) ITM-level implementation of Fi, i ∈ {1, . . . , n}; let ρi ∈ ITM →
[S → Ti] be a uniform adjunct retiming; and let ψi : ITM → VTM be a data abstraction map. For all t, t′ ∈ Ti, i ∈ {1,
. . . , n} and σITM ∈ ITM:
Fi(G(0, σITM))(t + t′, ψi(σITM)) = Fi(G(ρi(σITM)(t′)))(t, Fi(G(0, σITM))(t′, σITM))
if and only if
Fi(σITM)(0, ψi(σITM)) = init (Fi(σITM)(0, ψi(σITM))), and (10)
Fi(σITM)(1, ψi(σITM)) = init (Fi(σITM)(1, ψi(σITM))) (11)
Note that we do not require that G is a correct implementation of F (which would result in a circular argument); only
that it is time-consistent. Clearly however, G is likely to be chosen because it is a candidate implementation of F whose
correctness we ultimately aim to establish.
Proof. Given that λ is a uniform retiming and G is a time-consistent iterated map, the desired state trace of
Fi(σITM)(t + t′, ψ(σITM)) is as follows:
Fi(σITM)(0, ψ(σITM)),
Fi(σITM)(1, ψ(σITM)),
. . .
Fi(σITM)(t, ψ(σITM)) = Fi(G(λi(σITM)(t), σITM))(0, Fi(σITM)(t, ψ(σITM))),
Fi(σITM)(t + 1, ψ(σITM)) = Fi(G(λi(σITM)(t), σITM))(1, Fi(σITM)(t, ψ(σITM))),
. . .
Fi(σITM)(t + t′, ψ(σITM)) = Fi(G(λi(σITM)(t), σITM))(t′, Fi(σITM)(t, ψ(σITM))).
The left-hand side represents the state trace if Fi continues execution uninterrupted until time t + t ′. The right-hand
side represents the state trace if execution is interrupted at time t , the clock is reset, and execution is restarted and
continued for a further t ′ clock cycles. The right-hand state trace is equivalent to the left-hand trace provided:
Fi(σITM)(t, ψ(σITM)) = init (Fi(σITM)(t, ψ(σITM))).
Fig. 6 illustrates the state trace of Fi up to time t + t′, with the effects of other VTM-level processors, except for
one (Fj ), and the projection and merge operations omitted for clarity. Time consistency requires that initialization
function init can be inserted at the points marked ‘*’ without changing the trace. Theorem 18 states that this can be
established by checking only at the points marked ‘•’. The state trace of Fi consists of repeated fragments of the form
shown in Fig. 7 with t ∈ {0, . . . }. In order to successfully assemble trace fragments of the form shown in Fig. 7 with
applications of init inserted at point ‘*’ in the case that the initialization function init is not the identity operation,14
the state set σVTM of Fi must contain a subset $\sigma^{perm}_{VTM}$ such that the following conditions hold:
init(p) = p iff p ∈ $\sigma^{perm}_{VTM}$, (12)
∀p ∈ $\sigma^{perm}_{VTM}$ : next(p) ∈ $\sigma^{perm}_{VTM}$. (13)
That is, (a) init does not change any permitted states but does change non-permitted ones; and (b) applying next
to a permitted state always results in a permitted state. Effectively, init must define $\sigma^{perm}_{VTM}$ by acting as a predicate:
σITM ∈ $\sigma^{perm}_{VTM}$ iff init(σITM) = σITM. Eq. (10) establishes the condition defined in Eq. (12) since
init(Fi(σITM)(0, ψi(σITM))) = init(init(ψ(σITM))).
If Eq. (10) holds for all σITM ∈ ITM, then init can be arbitrarily inserted into the state trace of Fi provided
next(Fi(σITM)(t, ψ(σITM))) ∈ $\sigma^{perm}_{VTM}$. (14)
Eq. (14) is established by Eq. (11). If for all σITM ∈ ITM
Fi(σITM)(1, ψi(σITM)) = init(Fi(σITM)(1, ψi(σITM))),
then
Fi(σITM)(t, ψi(σITM)) ∈ $\sigma^{perm}_{VTM}$
because no states can arise at an arbitrary time t that are not present at time t = 1 and Fi . 
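The two conditions on permitted states, Eqs. (12) and (13), can be checked mechanically on a finite toy state set. The Python sketch below is illustrative only: the state set, the initialization function and the two candidate next-state functions are assumptions, and the helper names are hypothetical.

```python
from typing import Callable, Iterable, Set


def permitted_states(states: Iterable[int], init: Callable[[int], int]) -> Set[int]:
    """Eq. (12): the permitted states are exactly the fixed points of init."""
    return {p for p in states if init(p) == p}


def closed_under_next(perm: Set[int], nxt: Callable[[int], int]) -> bool:
    """Eq. (13): next maps every permitted state to a permitted state."""
    return all(nxt(p) in perm for p in perm)


# Toy example: states 0..15, init clears the lowest bit, so the even states are permitted.
init_even = lambda p: p - (p % 2)
perm = permitted_states(range(16), init_even)

step_by_two = lambda p: (p + 2) % 16  # stays within the even states
step_by_one = lambda p: (p + 1) % 16  # leaves the even states

print(sorted(perm))
print(closed_under_next(perm, step_by_two), closed_under_next(perm, step_by_one))
```

With the first next-state function init can be inserted anywhere in a trace without effect; with the second it cannot, because the trace leaves the permitted set after one step.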
It remains to consider the second one-step theorem for SMT/CMT processors: namely, that given n time-consistent
FVTM processors F1 : ITM → [T1 × VTM → VTM], . . . , Fn and a time-consistent FITM-level implementation
G : S × ITM → ITM together with the appropriate (adjunct) retimings and data abstraction maps, to establish
correctness for all times t ∈ Ti it is sufficient to establish correctness at times 0, 1 ∈ Ti , i = 1, . . . ,n.
Note that we can consider a number of variants of this second one-step theorem: that is, processors with or without
explicit input–output, and superscalar or non-superscalar processors. We do not consider input–output in this paper.
However, given that a CMT processor must by necessity be superscalar, it is appropriate here to consider the superscalar
case: especially since the non-superscalar variant is just a special case of it.
14 If it is, then Fi is trivially time-consistent.
[Fig. 6. State trace of Fi. The figure shows the states Fi(σ)(0, ψi(σ)), . . . , Fi(σ)(t, ψi(σ)), . . . , Fi(σ)(t + t′ − 2, ψi(σ)),
Fi(σ)(t + t′ − 1, ψi(σ)), Fi(σ)(t + t′, ψi(σ)) linked by applications of next, together with the corresponding states
Fj(σ)(0, ψj(σ)), . . . , Fj(σ)(ρj,i(σ)(t), ψj(σ)), . . . , Fj(σ)(ρj,i(σ)(t + t′ − 1), ψj(σ)) of processor Fj, and the points marked ‘*’.]
[Fig. 7. Fragment of the state trace of Fi. The figure shows next applied to the results of $\pi^i_{priv}$ and τ, which take
Fi(σ)(t, ψi(σ)) and Fj(σ)(ρj,i(σ)(t), ψj(σ)) as inputs, with the point marked ‘*’.]
Theorem 19. Let Fi : ITM → [Ti × VTM → VTM] be the ith component of a FVTM level processor. Let G :
S × ITM → ITM be a time-consistent FITM-level implementation of Fi , i ∈ {1, . . . ,n}. Let ρi ∈ ITM → [S → Ti]
be a uniform adjunct retiming defined by λiλS,Ri , where λi ∈ Ret(Ti, Ri) is a retiming relating the instruction-
level clock Ti and the corresponding retirement clock Ri for FVTM processor Fi, and λS,Ri is a retiming relating
system clock S with retirement clock Ri . Let ψi : ITM → VTM be a data abstraction map. For all s ∈ S such that
s = start (λS,Ri )(s):
Fi(σITM)(ρi(σITM)(s), ψi(σITM)) = ψi(G(s, σITM))
if and only if
Fi(σITM)(0, ψi(σITM)) = ψi(σITM), and (15)
Fi(σITM)(λi(σITM)(1), ψi(σITM)) = ψi(G(λS,Ri (σITM)(1), σITM)). (16)
Informally, G correctly implements Fi provided G is correct at time r = 0, and r = 1.
Proof. The proof is by induction over clock Ri . The base case is established by Eq. (15). Assuming that
Fi(σITM)(λi(σITM)(r), ψi(σITM)) = ψi(G(λS,Ri (σITM)(r), σITM)),
we show that
Fi(σITM)(λi(σITM)(r + 1), ψi(σITM)) = ψi(G(λS,Ri (σITM)(r + 1), σITM)).
In the following, we use the abbreviations:
t = λi(σITM)(r),
s = λS,Ri (σITM)(r).
That is, the times on instruction clock Ti and system clock S corresponding to retirement clock cycle r ∈ Ri .
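\[
\begin{aligned}
&F_i(\sigma_{ITM})(\lambda_i(\sigma_{ITM})(r+1), \psi_i(\sigma_{ITM}))\\
&\quad= F_i(\sigma_{ITM})(\lambda_i(G(s, \sigma_{ITM}))(1) + t, \psi_i(\sigma_{ITM})), && \text{since } \lambda_i \text{ is uniform}\\
&\quad= F_i(G(s, \sigma_{ITM}))(\lambda_i(G(s, \sigma_{ITM}))(1), F_i(\sigma_{ITM})(t, \psi_i(\sigma_{ITM}))), && \text{since } F_i \text{ is time-consistent}\\
&\quad= F_i(G(s, \sigma_{ITM}))(\lambda_i(G(s, \sigma_{ITM}))(1), \psi_i(G(s, \sigma_{ITM}))), && \text{by induction hypothesis}\\
&\quad= \psi_i(G(\lambda_{S,R_i}(G(s, \sigma_{ITM}))(1), G(s, \sigma_{ITM}))), && \text{by Eq. (16)}\\
&\quad= \psi_i(G(\lambda_{S,R_i}(\sigma_{ITM})(r+1), \sigma_{ITM})), && \text{since } G \text{ is time-consistent.}
\end{aligned}
\]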
9. A simple example
We now consider a simple example of SMT/CMT. In practice, SMT/CMT processors are extremely complex: they
are by their nature superscalar, and even simple superscalar examples are non-trivial [6]. An SMT/CMT microprocessor
example is in development based on an extension of the superscalar example in [6,10]. However, in this paper, we restrict
ourselves to a simple example, that captures the key features of SMT/CMT: an implementation that can implement
multiple (virtual) specifications, that are in turn able to interact via shared memory. In Section 9.1 we first define the
conventional PM-level model TR; then we define an ITM level implementation TRH which implements two virtual
TR processors, with a shared state element; and finally we use the next-state and initialization functions from TR to
define a VTM level model TRV. Then in Section 9.2 we present a modified version of the same example, illustrating
multiple merge operators.
9.1. Example with single merge operator
Our example TR : T × TR → TR will consist of a simple processor with a memory m ∈ [N → N], an error bit
b ∈ B and a memory address register mr ∈ N computing over a clock T .
TR = [N → N] × B × N
On each cycle of clock T , TR will compute the function nextTR : TR → TR
\[
\mathit{next}_{TR}(m, b, mr) =
\begin{cases}
(m[mr/m[mr] + m[mr - 1]],\ b,\ mr + 1), & \text{if } b = \mathit{ff};\\
(m,\ b,\ mr), & \text{if } b = \mathit{tt},
\end{cases}
\]
where the (infix) memory read and memory write operations _[_] : [N → N] × N → N and _[_/_] : [N → N] × N^2 →
[N → N] are defined by the following equation:
\[
m[a/b][c] =
\begin{cases}
b, & \text{if } a = c;\\
m[c], & \text{if } a \ne c.
\end{cases}
\]
Observe that nextTR does not change the value of b: hence as a stand-alone system, TR either continuously executes
the b = ff case, or the b = tt case, depending on the initial value of b. However, our VTM model below will permit
b to change state during execution traces.
Because all states are legal, no initialization function is necessary. We define TR as follows:
TR(0, m, b, mr) = (m, b, mr),
TR(t + 1, m, b, mr) = nextTR(TR(t, m, b, mr)).
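The PM-level example can be transcribed into Python almost directly. The sketch below assumes that memory is a dictionary in which unlisted words read as 0, and that traces start with mr ≥ 1 so that mr − 1 is a valid address; both are assumptions of the sketch. The names `read`, `write`, `next_tr` and `tr` are hypothetical.

```python
from typing import Dict, Tuple

Memory = Dict[int, int]             # finite stand-in for [N -> N]; unlisted words read as 0
TRState = Tuple[Memory, bool, int]  # (m, b, mr)


def read(m: Memory, a: int) -> int:
    """Memory read m[a]."""
    return m.get(a, 0)


def write(m: Memory, a: int, v: int) -> Memory:
    """Memory write m[a/v]: a new memory that agrees with m except at a."""
    updated = dict(m)
    updated[a] = v
    return updated


def next_tr(state: TRState) -> TRState:
    m, b, mr = state
    if b:  # b = tt: the state is left unchanged
        return state
    # b = ff: add the previous word into the current word and advance mr
    return (write(m, mr, read(m, mr) + read(m, mr - 1)), b, mr + 1)


def tr(t: int, state: TRState) -> TRState:
    """Iterated map: tr(0, s) = s and tr(t + 1, s) = next_tr(tr(t, s))."""
    for _ in range(t):
        state = next_tr(state)
    return state


m0 = {0: 1, 1: 1, 2: 1, 3: 1}
print(tr(3, (m0, False, 1)))  # ({0: 1, 1: 2, 2: 3, 3: 4}, False, 4)
```

Starting from a memory of ones, three cycles turn words 1 to 3 into running sums, and the error bit keeps its initial value throughout.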
Our FITM level implementation TRH : S × TRH → TRH for some new clock S will implement two virtual TR
processors, where the memory m and error bit b are shared and mr is private. Hence the state of TRH is
TRH = [N → N] × B × N^2
We define TRH as follows:
TRH(0, m, b, mr1, mr2) = (m, b, mr1, mr2),
TRH(s + 1, m, b, mr1, mr2) = nextTRH(TRH(s, m, b, mr1, mr2)),
with nextTRH : TRH → TRH defined as follows:
\[
\mathit{next}_{TRH}(m, b, mr_1, mr_2) =
\begin{cases}
(m[mr_1/m[mr_1] + m[mr_1 - 1]][mr_2/m[mr_2] + m[mr_2 - 1]],\ \mathit{ff},\ mr_1 + 1,\ mr_2 + 1), & \text{if } mr_1 \ne mr_2 \text{ and } b = \mathit{ff};\\
(m,\ \mathit{tt},\ mr_1,\ mr_2), & \text{if } mr_1 = mr_2 \text{ or } b = \mathit{tt}.
\end{cases}
\tag{17}
\]
Again (and unusually at this level of abstraction), no initialization function is necessary. Observe that error bit b is
set to true if nextTRH attempts to write the result of the two additions to the same memory word.
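A Python sketch of the next-state function of Eq. (17) is given here for illustration. It assumes a dictionary memory in which unlisted words read as 0, and it computes both sums from the memory as it was at the start of the cycle, which is how the nested update in Eq. (17) is read in this sketch. The names `read`, `write` and `next_trh` are hypothetical.

```python
from typing import Dict, Tuple

Memory = Dict[int, int]                   # finite stand-in for [N -> N]; unlisted words read as 0
TRHState = Tuple[Memory, bool, int, int]  # (m, b, mr1, mr2)


def read(m: Memory, a: int) -> int:
    return m.get(a, 0)


def write(m: Memory, a: int, v: int) -> Memory:
    updated = dict(m)
    updated[a] = v
    return updated


def next_trh(state: TRHState) -> TRHState:
    m, b, mr1, mr2 = state
    if b or mr1 == mr2:
        # Error already set, or both threads target the same word: flag the error.
        return (m, True, mr1, mr2)
    # Both sums are taken from the memory at the start of the cycle.
    v1 = read(m, mr1) + read(m, mr1 - 1)
    v2 = read(m, mr2) + read(m, mr2 - 1)
    return (write(write(m, mr1, v1), mr2, v2), False, mr1 + 1, mr2 + 1)


m0 = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
print(next_trh((m0, False, 1, 4)))  # two independent writes in one cycle
print(next_trh((m0, False, 2, 2)))  # conflicting addresses: error bit set
```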
The VTM level model TRV will consist of two iterated maps TRV1 and TRV2, each operating over an independent
clock (T1 and T2 respectively) and the ITM level state set TRH:
TRV1 : TRH → [T1 × TR → TR]
TRV2 : TRH → [T2 × TR → TR]
TRVi , i ∈ {1, 2} will be defined in terms of the PM level next-state function nextTR, projection function
$\pi^{TRV}_{priv}$ : TR → N,
merge operator
$\tau^{TRV}$ : ([N → N] × B × N)^2 → B × [N → N];
and state dependent retimings
λi ∈ Ret(TRH, S, Ti), i ∈ {1, 2}.
We define these functions below
\[
\pi^{TRV}_{priv}(m, b, mr) = mr, \tag{18}
\]
and
\[
\tau^{TRV}(m_1, b_1, mr_1, m_2, b_2, mr_2) =
\begin{cases}
(\mathit{tt},\ m_1), & \text{if } mr_1 = mr_2;\\
(\mathit{ff},\ m_1[mr_2/m_2[mr_2]]), & \text{otherwise.}
\end{cases}
\]
If the two memory address registers mr1 and mr2 are equal, then an error has occurred: we set b true and leave the
memory m1 unchanged. Otherwise, we merge any changes that have been made to the other memory m2.
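The merge operator just described is small enough to sketch in Python. The sketch assumes dictionary memories in which unlisted words read as 0; the names `read`, `write` and `tau_trv` are hypothetical.

```python
from typing import Dict, Tuple

Memory = Dict[int, int]             # finite stand-in for [N -> N]; unlisted words read as 0
TRState = Tuple[Memory, bool, int]  # (m, b, mr)


def read(m: Memory, a: int) -> int:
    return m.get(a, 0)


def write(m: Memory, a: int, v: int) -> Memory:
    updated = dict(m)
    updated[a] = v
    return updated


def tau_trv(own: TRState, other: TRState) -> Tuple[bool, Memory]:
    """Return the merged shared state (error bit, memory) as seen by `own`."""
    m1, _, mr1 = own
    m2, _, mr2 = other
    if mr1 == mr2:
        return (True, m1)  # address clash: set the error bit, keep own memory
    # Otherwise copy the word addressed by the other thread into own memory.
    return (False, write(m1, mr2, read(m2, mr2)))


own = ({0: 1, 1: 2, 2: 1}, False, 1)
other = ({0: 1, 1: 1, 2: 9}, False, 2)
print(tau_trv(own, other))                          # (False, {0: 1, 1: 2, 2: 9})
print(tau_trv(own, ({0: 1, 1: 5, 2: 1}, False, 1)))  # (True, {0: 1, 1: 2, 2: 1})
```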
It only remains to define the retimings λi . In the case of TR, TRH and TRV, λ1 and λ2 are trivial:
λi(m, b, mr1, mr2)(s) = s, i ∈ {1, 2}.
It follows that the immersions are also trivial:
$\bar{\lambda}_i$(m, b, mr1, mr2)(t) = t.
Note that in a real microprocessor, issues such as instruction dependencies and memory access latencies would mean
that the retimings would be non-trivial.
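We can define TRV1 and TRV2 (expanding the trivial definitions of the retimings) as follows:
\[
\begin{aligned}
TRV_1(m, b, mr_1, mr_2)(0, m, b, mr_1) &= (m, b, mr_1),\\
TRV_1(m, b, mr_1, mr_2)(t + 1, m, b, mr_1) &= \mathit{next}_{TR}(\tau^{TRV}(TRV_1(m, b, mr_1, mr_2)(t, m, b, mr_1),\\
&\qquad\qquad TRV_2(m, b, mr_1, mr_2)(t, (m, b, mr_2))),\\
&\qquad\quad \pi^{TRV}_{priv}(TRV_1(m, b, mr_1, mr_2)(t, m, b, mr_1))),\\
TRV_2(m, b, mr_1, mr_2)(0, m, b, mr_2) &= (m, b, mr_2),\\
TRV_2(m, b, mr_1, mr_2)(t + 1, m, b, mr_2) &= \mathit{next}_{TR}(\tau^{TRV}(TRV_2(m, b, mr_1, mr_2)(t, m, b, mr_2),\\
&\qquad\qquad TRV_1(m, b, mr_1, mr_2)(t, (m, b, mr_1))),\\
&\qquad\quad \pi^{TRV}_{priv}(TRV_2(m, b, mr_1, mr_2)(t, m, b, mr_2))).
\end{aligned}
\]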
9.2. An example with separate merge operators
The example above illustrates Definition 11. An alternative behaviour in the event that both threads attempt to write
to the same memory word simultaneously would be to give priority to writes from one thread (eliminating the error
bit). This would be an example of Definition 14. We define this variant below.
T R2 : T × TR2 → TR2 consists of a memory m ∈ [N → N] and a memory address register mr ∈ N computing
over a clock T .
TR2 = [N → N] × N.
On each cycle of clock T , T R2 will compute the function nextTR2 : TR2 → TR2
nextTR2(m,mr) = m[mr/m[mr] + m[mr − 1]],mr + 1.
As before, no initialization function is necessary. We define T R2 as follows:
T R2(0,m,mr) = m,mr,
T R2(t + 1,m,mr) = nextTR2(T R(t,m,mr)).
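A minimal Python sketch of this iterated map may help to see what the single-thread specification computes. The dictionary representation of memory (unset words read as 0) and the names `next_tr2` and `tr2` are assumptions made for illustration; the example starts at address 1 so that $mr - 1$ stays within the natural numbers.

```python
def next_tr2(m, mr):
    """One clock cycle: add word mr-1 into word mr, then advance mr."""
    new_m = dict(m)  # functional update: the input memory is left untouched
    new_m[mr] = m.get(mr, 0) + m.get(mr - 1, 0)
    return new_m, mr + 1


def tr2(t, m, mr):
    """Iterated map: the state reached after t cycles from (m, mr)."""
    state = (dict(m), mr)
    for _ in range(t):
        state = next_tr2(*state)
    return state


# Starting at address 1, three cycles turn words 0..3 into running sums.
memory = {0: 1, 1: 2, 2: 3, 3: 4}
print(tr2(3, memory, 1))  # ({0: 1, 1: 3, 2: 6, 3: 10}, 4)
```

The loop is simply the recursion unrolled: `tr2(0, m, mr)` returns the initial state, and each further cycle applies `next_tr2` once.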
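The implementation $\mathit{TRH2} : S \times \mathrm{TRH2} \to \mathrm{TRH2}$ implements two virtual $\mathit{TR2}$ processors as before, with memory
$m$ shared and $mr$ private. Hence the state of $\mathit{TRH2}$ is
\[
\mathrm{TRH2} = [\mathbb{N} \to \mathbb{N}] \times \mathbb{N}^2.
\]
We define $\mathit{TRH2}$ as follows:
\[
\begin{aligned}
\mathit{TRH2}(0, m, mr_1, mr_2) &= m, mr_1, mr_2,\\
\mathit{TRH2}(s + 1, m, mr_1, mr_2) &= \mathit{nextTRH2}(\mathit{TRH2}(s, m, mr_1, mr_2)),
\end{aligned}
\]
where $\mathit{nextTRH2} : \mathrm{TRH2} \to \mathrm{TRH2}$ is defined as
\[
\mathit{nextTRH2}(m, mr_1, mr_2) = m[mr_2/m[mr_2] + m[mr_2 - 1]][mr_1/m[mr_1] + m[mr_1 - 1]].
\]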
Observe that we perform the mr2 write first, so in the event that both writes are to the same memory word, the mr1
result will replace the mr2 result.
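The effect of ordering the two writes can be sketched in Python. The helper `apply_writes`, the function name `next_trh2`, the dictionary representation of memory (unset words read as 0), and the assumption that both address registers advance by one each cycle (mirroring nextTR2) are all introduced here for illustration; the displayed formula gives only the memory component.

```python
def apply_writes(m, writes):
    """Apply (address, value) writes in order; a later write to the
    same address replaces an earlier one."""
    new_m = dict(m)
    for addr, value in writes:
        new_m[addr] = value
    return new_m


def next_trh2(m, mr1, mr2):
    """One cycle of the two-thread implementation (sketch)."""
    # Both sums are computed from the memory as it was before the cycle.
    sum1 = m.get(mr1, 0) + m.get(mr1 - 1, 0)
    sum2 = m.get(mr2, 0) + m.get(mr2 - 1, 0)
    # Thread 2 is written first, so thread 1 wins if the addresses coincide.
    new_m = apply_writes(m, [(mr2, sum2), (mr1, sum1)])
    # Assumption: each private address register advances by one.
    return new_m, mr1 + 1, mr2 + 1


# Ordering in isolation: the second write to address 3 survives.
print(apply_writes({3: 0}, [(3, 7), (3, 9)]))  # {3: 9}

# Two threads working on different words of a shared memory.
print(next_trh2({0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 1, 3))
```

In this small example both threads compute the same sum whenever their addresses coincide, so the priority rule only becomes observable when the two writes carry different values, as in the `apply_writes` call.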
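The VTM level model $\mathit{TRV2}$ consists of $\mathit{TRV2}_1$ and $\mathit{TRV2}_2$:
\[
\begin{aligned}
\mathit{TRV2}_1 &: \mathrm{TRH2} \to [T_1 \times \mathrm{TR2} \to \mathrm{TR2}],\\
\mathit{TRV2}_2 &: \mathrm{TRH2} \to [T_2 \times \mathrm{TR2} \to \mathrm{TR2}].
\end{aligned}
\]
$\mathit{TRV2}_i$, $i \in \{1, 2\}$, will be defined in terms of the PM level next-state function $\mathit{nextTR2}$, projection function $\pi^{TRV2}_{priv} :
\mathrm{TR2} \to \mathbb{N}$, data abstraction maps $\psi_i : \mathrm{TRH2} \to \mathrm{TR2}$, merge operators $\tau^{TRV2}_i : ([\mathbb{N} \to \mathbb{N}] \times \mathbb{N})^2 \to [\mathbb{N} \to \mathbb{N}] \times
\mathbb{N}$, $i \in \{1, 2\}$; and retimings $\lambda_i : \mathrm{TRH2} \to [S \to T_i]$, $i \in \{1, 2\}$, defined as follows:
\[
\begin{aligned}
\pi^{TRV2}_{priv}(m, mr) &= mr,\\
\psi_1(m, mr_1, mr_2) &= (m, mr_1),\\
\psi_2(m, mr_1, mr_2) &= (m, mr_2),
\end{aligned}
\]
with merge operators
\[
\tau^{TRV2}_1(m_1, mr_1, m_2, mr_2) = m_2[mr_1/m_1[mr_1]]
\]
and
\[
\tau^{TRV2}_2(m_2, mr_2, m_1, mr_1) =
\begin{cases}
m_1, & \text{if } mr_1 = mr_2,\\
m_1[mr_2/m_2[mr_2]], & \text{otherwise.}
\end{cases}
\]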
The first merge operator $\tau^{TRV2}_1$ copies the changes made by $\mathit{TRV2}_1$ regardless of the value of $mr_1$; however, $\tau^{TRV2}_2$ only
copies changes made by $\mathit{TRV2}_2$ if they will not overwrite the immediately preceding change made by $\mathit{TRV2}_1$.
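The asymmetry between the two merge operators can be sketched in Python by transcribing the two formulas literally. The dictionary representation of memory (unset words read as 0) and the names `merge_trv2_1` and `merge_trv2_2` are assumptions introduced for illustration.

```python
def read(m, addr):
    """Read word addr of memory m (assumption: unset words read as 0)."""
    return m.get(addr, 0)


def update(m, addr, value):
    """Functional update m[addr/value]: return a copy with one word replaced."""
    new_m = dict(m)
    new_m[addr] = value
    return new_m


def merge_trv2_1(m1, mr1, m2, mr2):
    """Thread 1's merge: its own word at mr1 is always written into m2."""
    return update(m2, mr1, read(m1, mr1))


def merge_trv2_2(m2, mr2, m1, mr1):
    """Thread 2's merge: its word at mr2 is written into m1 only when
    the two address registers differ."""
    if mr1 == mr2:
        return dict(m1)
    return update(m1, mr2, read(m2, mr2))


# Colliding addresses: both merges keep thread 1's value at address 4.
print(merge_trv2_1({4: 11}, 4, {4: 22}, 4))  # {4: 11}
print(merge_trv2_2({4: 22}, 4, {4: 11}, 4))  # {4: 11}

# Distinct addresses: both merges produce the same combined memory.
print(merge_trv2_1({4: 11, 6: 0}, 4, {4: 0, 6: 22}, 6))
print(merge_trv2_2({4: 0, 6: 22}, 6, {4: 11, 6: 0}, 4))
```

Note the argument order: each operator takes the calling thread's memory and register first, which is why `merge_trv2_2` is called with thread 2's state before thread 1's.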
The retimings λi and corresponding immersions are identical to those in Section 9.1.
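We can define $\mathit{TRV2}_1$ and $\mathit{TRV2}_2$ (again expanding the trivial definitions of the retimings) as follows:
\[
\begin{aligned}
\mathit{TRV2}_1(m, mr_1, mr_2)(0, m, mr_1) &= (m, mr_1),\\
\mathit{TRV2}_1(m, mr_1, mr_2)(t + 1, m, mr_1) &= \mathit{nextTR2}\bigl(\tau^{TRV2}_1\bigl(\mathit{TRV2}_1(m, mr_1, mr_2)(t, m, mr_1),\\
&\qquad\quad \mathit{TRV2}_2(m, mr_1, mr_2)(t, m, mr_2)\bigr),\\
&\qquad \pi^{TRV2}_{priv}\bigl(\mathit{TRV2}_1(m, mr_1, mr_2)(t, m, mr_1)\bigr)\bigr),\\
\mathit{TRV2}_2(m, mr_1, mr_2)(0, m, mr_2) &= (m, mr_2),\\
\mathit{TRV2}_2(m, mr_1, mr_2)(t + 1, m, mr_2) &= \mathit{nextTR2}\bigl(\tau^{TRV2}_2\bigl(\mathit{TRV2}_2(m, mr_1, mr_2)(t, m, mr_2),\\
&\qquad\quad \mathit{TRV2}_1(m, mr_1, mr_2)(t, m, mr_1)\bigr),\\
&\qquad \pi^{TRV2}_{priv}\bigl(\mathit{TRV2}_2(m, mr_1, mr_2)(t, m, mr_2)\bigr)\bigr).
\end{aligned}
\]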
10. Concluding remarks
We have extended our existing models of microprocessors and their correctness criteria to superscalar SMT/CMT
processor implementations, which represent the state-of-the-art in current commercial implementations. However,
although we can successfully model such processors, and define what it means for them to be correct, practical
verification of realistic examples would be a formidable undertaking. Of course, this is generally the case: practical
verifications of complete non-trivial processors are currently limited to non-superscalar pipelined processors [7,8].
There are some practical steps that can be taken to reduce complexity, though these simplifications are unlikely to
make realistic examples practical at the current time. Nonetheless, we feel that a systematic approach to modelling
processors and their correctness that runs ahead of actual application is useful, and indeed necessary: the modelling
approaches to pipelined processors that were ultimately used to verify ARM6 [7,8] were developed some years in
advance of their practical use. Approaches to understanding and modelling microprocessors have a range of goals. In
some instances, there is an immediate need to establish the correctness of a specific processor or processor fragment.
In other instances, the aim is to develop models of processors in general. In the former case, there is likely to be an
emphasis on tools and the development of precisely-focussed modelling techniques. In the latter case, work is likely
to focus more on establishing a mathematically-sound framework and the precise meaning of, for example,
correctness. Although both approaches are necessary, it would seem to be helpful for general modelling to
precede verification of complex examples. This is supported by the work in [7,8], where prior general model building
would seem to have simplified the verification of a specific commercial processor.
A point worthy of note is the presence of the definition of the AC level implementation in the definition of the V TM
level model. This should not be surprising: in practice, the implementation of an SMT/CMT processor does impact
the behaviour seen by programmers; and the timing behaviour of all processors is a function of their implementation.
This last fact is generally acknowledged in our model by the definition of retimings in terms of the AC-level model.
It has been considered desirable for decades that the programmer’s-level abstraction should be clearly separated
from any corresponding implementation. This separation makes it straightforward to maintain software compatibility
with new and more advanced implementations, for example. The model developed here does not preserve such a
separation. However, there has been a general weakening of the long-established separation of processor architecture
and implementation: good compilers for modern processors need to be aware of implementation details (e.g. how many
functional units there are, and of what type) in order to generate high-quality code. The relative timing of threads in
SMT/CMT processors does depend on the actual implementation. Hence an accurate model of SMT/CMT processors
should depend on the implementation.
Finally, observe that a situation similar to SMT/CMT occurs with operating system kernels: a single physical
processor presents as multiple virtual processors. The situation is somewhat different. In a kernel a privileged virtual
processor at the higher level of abstraction controls thread/process execution. In SMT/CMT processors, thread execution
is controlled from the lower level of abstraction. However, we believe the work here can be adapted to accommodate
operating system kernels. Together with [29] on modelling high- and low-level languages and their relationships, this
would produce a chain of fundamentally identical models from high-level languages to abstract hardware. The intention
to extend the model presented here to operating systems is one reason why we have chosen not to adopt the approach
suggested in [9] of modelling inter-thread communication as input–output. However, further work is also planned on
this alternative approach, and its relationship to the model presented in this paper.
References
[1] M. Aagaard, B. Cook, N. Day, R. Jones, A framework for superscalar microprocessor correctness statements, Softw. Tools Technol. Transfer,
4 (2003) 298–312.
[2] S. Beyer, C. Jacobi, D. Kröning, D. Leinenbach, W. Paul, Instantiating uninterpreted functional unit and memory system: functional verification
of VAMP, in: D. Geist, T. Enrico (Eds.), Correct Hardware Design and Verification Methods, Lecture Notes in Computer Science, vol. 2860,
Springer-Verlag, 2003, pp. 51–65.
[3] J. Burch, D. Dill, Automatic verification of pipelined microprocessor control, in: D. Dill (Ed.), Proceedings of the 6th International Conference,
CAV’94: Computer-Aided Verification, Lecture Notes in Computer Science, vol. 818, Springer-Verlag, 1994, pp. 68–80.
[4] A. Cohn, A proof of correctness of the VIPER microprocessor: the first levels, in: G. Birtwistle, P.A. Subrahmanyam (Eds.), VLSI Specification,
Verification and Synthesis, Kluwer Academic Publishers, 1987, pp. 27–72.
[5] D. Cyrluk, J. Rushby, M. Srivas, Systematic formal verification of interpreters, IEE International Conference on Formal Engineering Methods
ICFEM’97, IEEE Computer Society Press, 1997, pp. 140–149.
[6] A.C.J. Fox, Algebraic Representation of Advanced Microprocessors, Ph.D. thesis, Department of Computer Science, University of Wales
Swansea, 1998.
[7] A.C.J. Fox, Formal specification and verification of ARM6, in: D. Basin, B. Wolff (Eds.), TPHOLs ’03, Lecture Notes in Computer Science,
vol. 2758, Springer-Verlag, 2003, pp. 25–40.
[8] A.C.J. Fox, An algebraic framework for verifying the correctness of hardware with input and output: a formalization in HOL, in: J.L. Fiadeiro,
N.A. Harman, M. Roggenbach, J. Rutten (Eds.), CALCO 2005, Lecture Notes in Computer Science, vol. 3629, Springer-Verlag, 2005, pp.
157–174.
[9] A.C.J. Fox, Private Communication, 2007.
[10] A.C.J. Fox, N.A. Harman, Algebraic models of superscalar microprocessor implementations: a case study, in: B. Möller, J.V. Tucker (Eds.),
Prospects for Hardware Foundations, Lecture Notes in Computer Science, vol. 1546, Springer-Verlag, 1998, pp. 138–183.
[11] A.C.J. Fox, N.A. Harman, Algebraic models of correctness for microprocessors, Formal Aspects Comput. 12 (4) (2000) 298–312.
[12] A.C.J. Fox, N.A. Harman, Algebraic models of correctness for abstract pipelines, J. Log. Algebr. Program. 57 (2003) 71–107.
[13] M. Gordon, Proving a computer correct with the LCF-LSM hardware verification system, Technical Report 42, Computer Laboratory, University
of Cambridge, 1983.
[14] B. Graham, G. Birtwistle, Formalising the design of an SECD chip, in: M. Leeser, G. Brown (Eds.), Hardware Specification, Verification and
Synthesis: Mathematical Aspects, Lecture Notes in Computer Science, vol. 408, Springer-Verlag, 1990, pp. 40–66.
[15] N.A. Harman, J.V. Tucker, Formal specification and the design of verifiable computers, in: Proceedings of the 1988 UK IT Conference,
University College Swansea, IEE, 1988, pp. 500–503.
[16] N.A. Harman, J.V. Tucker, The formal specification of a digital correlator I: Abstract user specification, in: K. McEvoy, J.V. Tucker (Eds.),
Theoretical Foundations for VLSI Design, Tracts in Theoretical Computer Science, vol. 10, Cambridge University Press, 1990, pp. 161–262.
[17] N.A. Harman, J.V. Tucker, Algebraic models of microprocessors: architecture and organisation, Acta Inform. 33 (1996) 421–456.
[18] N.A. Harman, J.V. Tucker, Algebraic models of microprocessors: the verification of a simple computer, in: V. Stavridou (Ed.), Proceedings of
the 2nd IMA Conference on Mathematics for Dependable Systems, Oxford University Press, 1997, pp. 135–170.
[19] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1996.
[20] R. Hosabettu, G. Gopalakrishnan, M. Srivas, Formal verification of a complex pipelined processor, Formal Methods in System Design 23 (2)
(2003) 171–213.
[21] W. Hunt, A formal HDL and its use in the FM9001 verification, in: C.A.R. Hoare, M. Gordon (Eds.), Mechanized Reasoning in Hardware
Design, Prentice-Hall, 1992.
[22] W. Hunt, FM8501: A Verified Microprocessor, Lecture Notes in Artificial Intelligence, Springer-Verlag, 1994.
[23] W.A. Hunt, Microprocessor design verification, J. Automat. Reason. 5 (4) (1989) 429–460.
[24] D. Koufaty, D.T. Marr, Hyperthreading technology in the netburst microarchitecture, IEEE Micro 23 (2) (2003) 56–65.
[25] C.E. Leiserson, F.M. Rose, J.B. Saxe, Optimizing synchronous circuitry by retiming, in: Third Caltech Conference on VLSI, Computer Science
Press, 1983, pp. 87–116.
[26] K. Meinke, J.V. Tucker, Universal algebra, in: T.S.E. Maibaum, S. Abramsky, D. Gabbay (Eds.), Handbook of Logic in Computer Science,
Oxford University Press, 1992, pp. 189–411.
[27] S. Ray, W.A. Hunt, Deductive verification of pipelined machines using first-order quantification, in: Computer-Aided Verification CAV 2004,
Lecture Notes in Computer Science, vol. 3114, Springer-Verlag, 2004, pp. 31–43.
[28] J. Sawada, W.A. Hunt, Processor verification with precise exceptions and speculative execution, in: A.J. Hu, M.Y. Vardi (Eds.), Computer-Aided
Verification: 10th International Conference, Lecture Notes in Computer Science, vol. 1427, Springer-Verlag,
1998, pp. 135–147.
[29] K. Stephenson, Algebraic specification of the Java virtual machine, in: B. Möller, J.V. Tucker (Eds.), Prospects for Hardware Foundations,
Lecture Notes in Computer Science, vol. 1546, Springer-Verlag, 1998, pp. 236–277.
[30] W. Wechler, Universal Algebra for Computer Scientists, EATCS Monograph, Springer-Verlag, 1991.
[31] P. Windley, A theory of generic interpreters, in: L. Pierre, G. Milne (Eds.), Correct Hardware Design and Verification Methods, Lecture Notes
in Computer Science, vol. 683, Springer-Verlag, 1993, pp. 122–134.
