We apply algebraic tools for modelling microprocessors to the specification, implementation, and verification of an abstract pipelined case study. We employ a model of time based on counting events by means of a clock. We model systems by iterated maps that evolve over time from some initial state. We define formal correctness conditions, and introduce the one-step theorems that can reduce the complexity of formal verification. The algebraic models provide: (i) modular descriptions of pipelined systems; (ii) equational correctness criteria; and (iii) equational specification and verification techniques for the design of pipelined systems applicable to a range of software systems.
Introduction
This paper examines the nature of initialisation, data abstraction and temporal abstraction for pipelined systems with, and without, inputs, and dynamic stalling, i.e. externally-imposed delays of indeterminate length. Our particular interest is in building models, by which we mean generic mathematical structures which act as frameworks, or templates, that can be used to structure specific examples. By constructing these models carefully, we hope to be able to answer general questions concerning, for example, correctness, and to provide a "road map" for verification attempts.
The typical example of a pipelined system is a microprocessor, and it is microprocessors that particularly interest us. In this paper we focus on aspects of pipelined systems: specifically, those related to verifying the correctness of various forms of pipeline. This paper forms part of a series on algebraic models of microprocessors. In the past, we have considered specific microprocessor-related examples. These have included superscalar systems [14, 17, 18] and microprogrammed systems [26, 27]. However, we may also consider the concept of a pipelined system in a simplified, abstracted form, unrelated to specific applications. Such an approach allows us to examine theoretical concepts without the distractions of implementation-specific details. This is the approach we take in this paper in order to explore specific issues related to the correctness of pipelined microprocessor systems. In contrast to most other work on microprocessors (pipelined and otherwise) we do not focus on super-specific representations of particular (and often impressively complex) examples using particular software tools. Instead we are concerned with abstract models within mathematical frameworks (which we may later implement in software [24, 15]). A related paper is [19], which addresses in more detail the background to the one-step theorems (Section 4.6) that enable formal verification to be simplified: this paper focusses on the application of the one-step theorems to pipelined examples. More recently, experimental machine verifications of pipelined examples have been undertaken [23, 24, 15, 16], using Maude [8] and HOL [21].
We are interested in models of time, and the complex temporal relationships between microprocessors at different levels of abstraction. We use a model of time called a clock that divides time into segments defined by events. We relate levels of timing abstraction, or clocks, by surjective, monotonic maps called retimings. Abstract pipelines, and concrete examples such as microprocessors, are modelled as iterated maps

$$F : T \times A \to A$$

where T is a clock, dividing up time, and A represents the state-set of the microprocessor (registers and memories). F is equationally defined as follows:

$$F(0, a) = h(a), \qquad F(t + 1, a) = f(F(t, a)),$$

where f is a next-state function defining state evolution, and h is an (optional) initialisation function, ensuring/enforcing consistency of initial state a. Mathematically, these are instances of simultaneous primitive recursive functions. Initialisation function h also takes the rôle of an invariant when applying the one-step theorems (Section 4.6, [19, 14]) to reduce formal verification to state exploration.
To illustrate our techniques, we formally verify two simple implementations of a pipelined system. The first implementation contains a pipeline that is always full. The second has a pipeline that may be empty on initialisation. In addition, we consider systems with input and output, and show how we can model dynamic stalling: that is, externally imposed delays of indeterminate (though generally bounded) length. Dynamic stalls may occur in actual pipelines when, for example, a memory access causes a cache miss in a microprocessor, causing it to seek the required word in a secondary cache, or the main memory; or when a memory access occurs to a word that is being refreshed.
The algebraic methods on which this work is based: (i) are modular, and provide a basis for the formal decomposition of the descriptions of microprocessors and associated correctness criteria; (ii) express correctness in terms of equations, the simplest logical formulae; and (iii) support equational specification and verification techniques for the design of microprocessors, that are not dedicated to specific software systems, but are general, and may be represented in, and processed by, a range of machine reasoning systems.
In addition, the methods form the basis of a uniform theoretical framework for modelling microprocessors. Extended discussion of the work in this paper, and of that in [19, 17], can be found in [14].
Much of the work on pipelined microprocessors has been greatly influenced by the use of software tools and is motivated by the need to verify specific designs using specific theorem provers. The examples addressed are often frankly impressive. However, the practical necessities of dealing with large and complex examples usually leave little room for general, systematic model building. This paper, and our work in general, is primarily concerned with a general theoretical framework for mathematically modelling microprocessors. We focus on the notion of correctness and the use of abstraction mechanisms. We present a case study that is not intended to be wholly representative of existing microprocessor designs. Instead, focus is placed on correctness issues while avoiding all unhelpful and uninformative complexities. It is acknowledged that software tools are essential for the verification of 'real world' designs, but it is also important to establish an underlying theory to organise, structure and complement mechanical verifications. We believe that concrete case studies and verification attempts are more fully understood, and readily managed, when viewed in the context of a general, well-established and mathematically rigorous theory.
The structure of this paper is as follows. In Section 2 we outline the underlying concepts of our model. In Section 3 we examine the basic properties of pipelined systems, and briefly discuss other work on modelling pipelines in particular and microprocessors in general. In Section 4 we introduce the fundamental algebraic tools for modelling data, state, time, timing abstraction, and correctness. In Section 5 we introduce an abstract case study, and develop and prove two pipelined implementations (the proofs are in an appendix). In Section 6 we consider the notions of initialisation, and data and temporal abstraction in modelling systems and the correctness of systems. In Section 7 we extend our model to consider pipelines which can stall dynamically: that is, suspend activity in response to external events.
Evolving State Systems
In this section a philosophical basis for the work of Section 4 and Section 5 is outlined, and set in the context of related work on modelling pipelined systems.
States and State Sequences
Our models are constructed of systems that exhibit a property called state. By defining state, a level of abstraction is established. For example, one may define the state of a traffic signal to be red, yellow or green; but one could choose a more precise description of the light (wavelength and intensity), or observe other properties of the traffic signal such as temperature, location, mass and so forth. In general, we aim to choose those properties of a system related to the key aspects under consideration. In a digital system this will generally be [some abstraction of] registers and memories. Intuitively, state is the observed 'current condition of being' of the model. This principle of observation is important when considering temporal abstraction.
Computer systems are modelled by considering state-sequences. For example, let A be a non-empty set whose elements a_i ∈ A represent the states of some system. The sequence

$$a_0, a_1, a_2, \ldots$$

denotes the evolution of the system through a sequence of states. The state-space of the system is the non-empty set of all possible system states. Only deterministic state evolution is considered in this paper. That is, in the absence of inputs to the system, any given state is always followed by the same next-state. We will initially just consider systems without input. However, in Section 7 we will add input streams.
A Philosophy of Time
An abstract notion of time is defined by the enumeration of state change: time does not exist in itself, but arises from considering the ordering of distinct occurrent states. If a system does not change state, or ceases to change state, then time is, or becomes, redundant; because there is no next-state (though we note that there is a subtle point here in relation to the identity command). For example, consider the following finite state sequence

$$a_0, a_1$$

By definition the zeroth state a_0 occurs at time zero and the first state a_1 occurs at time one. There is no time two because the system only changes state once. Further, given such a model of time, the state sequence

$$a_0, a_1, a_1, a_2$$

does not define four times (0-3). There can only be three distinct times because the transition a_1 → a_1 does not represent a change in state, and it is impossible for an observer of the system to distinguish the first and second occurrence of the state a_1 without reference to an external meta-system. Time is defined relative to state transition and not the other way around.
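The counting rule described here can be sketched in a few lines of Python. The function name `observed_times` and the string labels used for states are illustrative assumptions, not notation from the paper: consecutive repetitions of a state are collapsed, and the clock simply enumerates what remains.

```python
from itertools import groupby

def observed_times(states):
    """Collapse consecutive repeats and return (time, state) pairs.

    Time is derived from state change, so a stuttering step such as
    a1 -> a1 does not generate a new time.
    """
    distinct = [state for state, _ in groupby(states)]
    return list(enumerate(distinct))

# Hypothetical four-element sequence containing one stuttering step.
timeline = observed_times(["a0", "a1", "a1", "a2"])
print(timeline)  # [(0, 'a0'), (1, 'a1'), (2, 'a2')]
```

Four listed states yield only three times, because the observer cannot separate the two adjacent occurrences of the same state.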
Data Abstraction and Re-timing
System states can be 'abstracted', 'specified' or 'observed' with respect to an abstraction mapping ψ. Such data abstraction occurs when the correctness of a system is defined with respect to a more abstract requirements specification. For example, a state a might represent the state of a microprocessor's micro-architecture. The state ψ(a) then represents the state of the processor's architecture. That is, the state components that are of direct relevance to a machine or assembly code programmer. Through the process of data abstraction, a notion of temporal abstraction is induced. For example, if the mapping
is applied to the state sequence then the abstracted state sequence
has only two observable state changes, and would be perceived as the sequence
In this example temporal abstraction has occurred because six distinct times have been replaced by three times: see Figure 1. In general, we consider time to be defined by events, where an event is simply the occurrence of something of significance at the level of abstraction under consideration. For example, when considering a microprocessor, at some levels of abstraction we may only consider the start/end of machine instructions to be events. At a lower level of abstraction, we may also consider register transfer operations to be events.
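As a rough illustration of how a data abstraction induces a temporal abstraction, the sketch below derives the abstract time of each low-level time from a state sequence. The state encoding, the map `psi` and the helper `induced_retiming` are all hypothetical choices made for this example; they are not the mapping used in Figure 1.

```python
def induced_retiming(states, psi):
    """Return lam as a list, where lam[s] is the abstract time of low-level time s.

    The abstract clock ticks only when the abstracted state psi(state) changes.
    """
    lam, t = [], 0
    for s, state in enumerate(states):
        if s > 0 and psi(state) != psi(states[s - 1]):
            t += 1
        lam.append(t)
    return lam

def psi(state):
    """Assumed abstraction map: keep the visible value, forget the micro-step."""
    visible, _micro_step = state
    return visible

# Hypothetical low-level states: (visible value, micro-step counter).
low_level = [("x", 0), ("x", 1), ("x", 2), ("y", 0), ("y", 1), ("z", 0)]

lam = induced_retiming(low_level, psi)
print(lam)  # [0, 0, 0, 1, 1, 2]
```

Six low-level times collapse to three abstract times with two observable changes, and the resulting list is non-decreasing and starts at zero.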
Pipelined Systems
Pipelining is a common and effective mechanism for increasing the effectiveness of hardware implementations of a range of systems, including (but not limited to) microprocessors. The basic concept is simple: computation is divided into a number of phases, and these phases are overlapped for successive operations. Consider for example a microprocessor with four stages of execution: instruction fetch, instruction decode, instruction execution, and instruction retirement, or writeback, as shown in Figure 2. Pipelining instruction execution in this way has a number of consequences.
• Hardware utilization is increased. In a typical non-pipelined implementation, each of the above phases of execution would still be present but only one of them would be in operation at any one time.
• Overall execution times are generally reduced, even though the execution time for a single instruction is likely to be unchanged, or even increased, because multiple executions are overlapped.
• There are a number of events that can disturb pipeline execution. In particular, conditional branches may force the pipeline to be emptied (flushed) if a prediction of their outcome is incorrect. Alternatively, if no prediction is made, the pipeline must halt (stall) until the outcome is known. Stalling can also occur in other circumstances. For example, a later instruction may require the result of an earlier instruction before it can proceed.
• The relationship between time in a specification (which is not pipelined) and a pipelined implementation is more complex than in the case of a non-pipelined implementation. In the case of a non-pipelined implementation, each time period in the implementation corresponds uniquely with some time period in the specification. In the case of a pipelined implementation, each time period in the implementation [potentially] corresponds with more than one time period in the specification. This is because operations that are temporally discrete at the level of the specification are being overlapped within the implementation.
The last point above has potential implications for correctness models, particularly in the case of superscalar systems. A superscalar system develops the concept of pipelining by permitting more than one operation to be in the same stage of execution simultaneously: more than one instruction may be fetched, decoded etc. at the same time. Commonly, out-of-order instruction execution is permitted. Superscalar systems are widely used in practice: for example all modern desktop systems use superscalar microprocessors. This requires a refinement of our correctness models: detailed discussion can be found in [14, 18, 17] . Fortunately, in the case of non-superscalar pipelined systems there is no difficulty: although events in the specification overlap in the implementation, the start and end of such events do not coincide and are never re-ordered at lower levels of abstraction (see Figure 2 ).
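The overlap of specification-level events in a non-superscalar pipeline can be made concrete with a small scheduling sketch. It assumes an idealised pipeline (one cycle per stage, no stalls or flushes); the stage names follow the four-stage example, while `schedule` and the table layout are hypothetical.

```python
STAGES = ["fetch", "decode", "execute", "writeback"]

def schedule(n_instructions, stages=STAGES):
    """Map each cycle to {stage: instruction index} for an ideal pipeline."""
    k = len(stages)
    table = []
    for cycle in range(n_instructions + k - 1):
        row = {}
        for j, stage in enumerate(stages):
            i = cycle - j          # instruction occupying stage j this cycle
            if 0 <= i < n_instructions:
                row[stage] = i
        table.append(row)
    return table

table = schedule(3)
completions = [(c, row["writeback"]) for c, row in enumerate(table)
               if "writeback" in row]
print(len(table))    # 6 cycles, versus 3 * 4 = 12 without overlap
print(completions)   # [(3, 0), (4, 1), (5, 2)]
```

Under these assumptions several instructions are in flight in most cycles, yet exactly one instruction completes per cycle and completions occur in program order.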
Related Work on Pipelined Systems
Interesting work on pipelined microprocessors includes [63] on UINTA, a processor of moderate complexity, and its verification in HOL; [45, 46] on AAMP5, a more complex processor, and its verification in PVS [48]; and [7] (which is the basis for much subsequent work) on a fragment of the DLX architecture [28]. More recently, superscalar processors have been addressed: in particular, the increased complexity of verification in the face of complex timing behaviour [62, 6, 58, 11, 45]. A refinement of the approach in [7], more applicable to out-of-order systems and long pipelines, is [35, 53]. [50] additionally considers exception processing in such an environment (this work is significant for the relatively advanced nature of the processor verified: see also [51]). The work of [34] examines scheduling of out-of-order instructions, and uses a mechanism for "tagging" the specification state with additional auxiliary variables. These contain "intermediate" state elements, not actually present in the specification, which can be compared with the corresponding implementation state elements. The work in [37, 10] uses Hawk, a variant of the functional language Haskell.
The intuitive models used by others in modelling and verifying [pipelined] microprocessors are conceptually similar to our own (at least informally) [26, 27, 17] . However, there are substantial differences, particularly in the approach to time, and timing abstraction. The main focus of related, formal work on microprocessors is the practical reality of developing techniques to successfully address more complex, and in some cases industrially-significant, examples (almost always in conjunction with, and tailored to, specific software tools). Our own work is concerned with developing a general formal framework for representing and verifying microprocessors within a uniform and well-developed algebraic theory. We have only recently applied software tools to significant examples [24, 15] .
In [63] systems are modelled as state streams: functions from time to state. Temporal and data abstraction functions are used to map between time and state at different levels of abstraction. In earlier work [61] (which also aims at a more general foundational framework), data and timing abstraction functions are separated (as in this paper). However, in later related work on pipelined systems (e.g. [63] ), data and timing abstraction are combined. This is because the view is taken that the values of specification state components are distributed in time at the level of abstraction of the implementation. For example, the value of a data register reg in an implementation may correspond with a specification state at time t, and the value of the program counter pc with a specification state t + n, where n pipeline stages are required for an instruction to progress from initiation to completion. In this paper, we take the view that, rather than being temporally shifted, such state components are fundamentally different at the levels of specification and implementation (Section 5.3). Consequently, we maintain a separation between data and temporal abstraction functions.
The work of [45, 46] derives from [2, 54] on a simpler pipelined processor. In [2, 54], specification and implementation are modelled as state sequences, but time is not explicitly present; to synchronise the specification and implementation state sequences, multiple copies of specification states are inserted. In [45, 46], a different approach is taken. A visible state predicate is introduced which identifies those implementation states that should correspond to a specification state. This predicate is similar in concept to the initialisation function used in this paper. This approach is modified, in a manner similar to that of [63], to cope with pipelining by distributing data in time. Again, in [45, 46], time is not explicitly present. A more recent account of this work may be found in [12]. Subsequently, the concept of completion functions [29, 30] has been developed as a technique to manage and organise the complexities of out-of-order instruction execution in superscalar systems when building abstraction functions. As is commonly the case, both data and timing abstractions are combined.
It is not clear that temporal distribution of data (or combining data and temporal abstraction) is necessary. Case studies to date (including a superscalar microprocessor in [14, 18]) have not required such techniques. Moreover, it may introduce extra complexities. For example, consider the case where an [implementation] user register reg is mapped to a specification state at time t, and the [implementation] program counter pc value is mapped to a specification state at time t + n. The pc value relates to an instruction entering the pipeline, and reg to an instruction leaving the pipeline. Hence it is necessary to consider the entire time span of an instruction's execution cycle, which will typically overlap with other instructions. Our approach only requires times when instructions complete to be considered. In a pipelined system (though not a superscalar one), only one instruction will complete at a time.
A Very Long Instruction Word (VLIW) machine is considered in [59] . A VLIW machine is essentially a superscalar-like processor in which instruction execution parallelism is statically identified, usually by a compiler. The techniques used in [59] are those of [7] together with the concept of uninterpreted functions and predicates as an abstraction mechanism [5] . Uninterpreted functions also appear in [14] .
A substantial number of correctness models used derive from [7], in which a simple three-stage ALU pipeline and a fragment of DLX are considered. Given a state Q of a pipelined implementation, a new state Q′ is generated after executing one step of an instruction I. Both Q and Q′ are then flushed by repeatedly stalling further execution (effectively filling them with no-ops). This results in two new states: Q_f, representing the (flushed) pipeline, and Q′_f, representing the (flushed) pipeline after executing instruction I. Q_f and Q′_f can be compared with appropriate specification states by simply projecting out the specification state elements. Note that there is no timing abstraction in this model: specification and implementation are both considered to take a single cycle to execute an instruction. This method is only applicable if some mechanism for stalling the pipeline is available. However, this is generally the case in real processors, though not in our case study (Section 5). Note that this technique does not model issues arising from the interaction of instructions in a pipeline: it simply confirms that individual instructions proceed through the datapath correctly. However, many of the interesting issues in the correctness of pipelined systems arise from the interaction between instructions. For example, if instruction i + 1 requires the result computed by instruction i, it may be necessary (depending on the pipeline design) to stall that pipeline until the result of i is stored. This is a convenient point to observe that some examples of verification in the literature do not establish "correctness" in the naïve sense used in this paper: that is, that an implementation will produce the same results as a specification when run on sequences of instructions (programs) (Section 4.5). Instead, a weaker definition of correctness (not always formally and explicitly stated) is used. A helpful analysis and classification of correctness models is [1].
An unusual (for microprocessor correctness) and interesting bisimulation-based model is [40] .
Also of interest is [44], which again has a somewhat similar model of time. An injective, monotonic function f_P maps abstract time to concrete time, and is defined in terms of a predicate P. If P(t_c) for some concrete time t_c, then there is an abstract time t_a such that f_P(t_a) = t_c. Predicate P is required to be true at an infinite number of times. The map f_P is similar to the immersion of Section 4.3.
Interesting earlier work on non-pipelined microprocessors includes the following. Gordon's Computer [20], a significant example since considered by others in [36, 55, 27]. Viper [9], which was partially verified in HOL. Landin's SECD machine [38] and others have been considered by Birtwistle in [22, 3]. At the time of writing Birtwistle, Gordon and one of the authors (ACJF) are working on modelling and verifying the ARM6 processor. [31, 32, 33, 4] discuss a PDP-11-based processor and a more advanced successor. [41, 49] discuss parts of the Inmos T800 and T9000 Transputers, using an Occam-based transformation system. A useful earlier work on formal models of hardware, including a comprehensive survey of work to 1989, is [42].
Algebraic Formalisms
Computer systems are modelled in an algebraic framework using primitive recursive functions. We omit detailed discussion of algebraic specification methods here, and refer the reader to [13, 64, 43, 60] . Functions are defined, with equations, primarily using definition by cases and primitive recursion. Time is modelled using a clock algebra and computer systems are modelled with [many-sorted] state algebras.
A [many-sorted] algebra consists of carrier sets and functions ranging over the carrier sets. For example,

$$(A_1, A_2, \ldots, A_l \mid f_1, f_2, \ldots, f_m)$$

denotes a many-sorted algebra with carrier sets A_1, A_2, ..., A_l and functions f_i of the form

$$f_i : A_{s_1} \times \cdots \times A_{s_n} \to A_s$$

where 1 ≤ i ≤ m and 1 ≤ s, s_j ≤ l for 1 ≤ j ≤ n. The value n is called the arity of function f_i and if n = 0 then f_i is called a constant.
Clocks and Iterated Maps
This section formally defines clocks and iterated maps. A system starts at time zero in an initial state h(a) where h is called an initialisation function. Subsequent system states are determined by a next-state function f and enumerated by a clock T , which divides time into discrete intervals. Clock enumeration is defined using a clock cycle successor function +1. The function h constrains the number of possible state sequences, thus enabling a more flexible definition of correctness. It is feasible for a system to be correct provided it starts in a valid state. For example, most pipelined designs have inconsistent states which should be unreachable in normal operation, and must be avoided if they are to function correctly: see Section 5.
Definition 1 A clock is an algebra T = (T | 0, +1), where T is a set of clock cycles isomorphic with the natural numbers N, 0 ∈ T is the initial clock cycle and +1 : T → T is the next (or successor) clock cycle function.
A clock cycle need not represent a constant subdivision of time, but will denote an interval between significant events (see Section 2.3). The definition of 'significant' will depend on the level of temporal abstraction we are considering.
For example, we might use an instruction clock to represent the execution of instructions in a microprocessor. Each cycle of the clock would typically last different amounts of real time, because instruction execution times vary in most processor implementations.
Definition 2 Let T be a clock and let A be any non-empty set representing a state space. An iterated map F : T × A → A is a primitive recursive function defined by the equations

$$F(0, a) = h(a), \qquad F(t + 1, a) = f(F(t, a)),$$

where h : A → A and f : A → A are primitive recursive functions called the initialisation and next-state functions of the state function F respectively.
Typically, in a microprocessor, A will be a product set of components representing registers and memories. By requiring f and h to be primitive recursive functions, or to have primitive recursive bounds, we eliminate all potential difficulties with partial functions, and non-termination. In practice, this condition is not restrictive.
Theorem 3 If F : T × A → A is an iterated map with initialisation function h : A → A and next-state function f : A → A, then F(t, a) = f^t(h(a)).
PROOF. Follows trivially from Definition 2.
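A minimal executable reading of Definition 2 and Theorem 3 is sketched below. The helper `iterate_map` and the toy state (a program counter paired with an accumulator) are assumptions introduced only for illustration; the loop computes f applied t times to h(a), which is the closed form given by Theorem 3.

```python
def iterate_map(h, f):
    """Build F : T x A -> A from initialisation h and next-state f.

    F(0, a) = h(a);  F(t + 1, a) = f(F(t, a)).
    """
    def F(t, a):
        state = h(a)
        for _ in range(t):
            state = f(state)
        return state
    return F

# Hypothetical toy state: (program counter, accumulator).
def h(a):
    pc, acc = a
    return (pc % 4, acc)          # assumed reset rule: force pc into 0..3

def f(a):
    pc, acc = a
    return ((pc + 1) % 4, acc + pc)

F = iterate_map(h, f)
print(F(0, (6, 10)))  # (2, 10)
print(F(3, (6, 10)))  # (1, 15)
```

Because h and f are total functions on the state set, F is total as well, which mirrors the role of the primitive recursiveness requirement.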
Definition 4 A non-initialised iterated map is any iterated map F with initialisation function id_A, where id_A : A → A is the identity function defined by the equation id_A(a) = a. All iterated maps that are not non-initialised are called initialised iterated maps.
The Rôle of Initialisation Functions
The purpose of initialisation functions is to eliminate unwanted starting states: it is not to describe the initial behaviour of a system. For example, consider an implementation with memory m, program counter pc and instruction register ir: we may require ir = m(pc) at the start of each instruction, and hence not wish to consider starting states that do not have this property. The precise choice of initialisation function will vary according to circumstances: we could choose an initialisation function that enforced some concrete (ground) 'reset' state; or we could choose the identity function. (In the latter case, if not every initial state a of system F is permitted, state evolution of F may not be correct.) Between these alternatives is a useful class of initialisation functions that leaves initial state a unchanged provided a is already consistent with correct future state evolution of F. We can regard the conjunction of the various required relationships between state components of iterated map F (for example, ir = m(pc)) as a consistency-checking invariant I that must hold, at certain times, for the correct state evolution of F: in the case where F represents the implementation of a microprocessor, those times will correspond to the start/end of machine instructions.
Invariant I may be checked by an initialisation function h, on initial state a ∈ A: if I holds, then h(a) = a. Such initialisation functions are an important part of the verification process (Section 4.6), and are analogous to the pipeline invariants and visible state predicates of [12].
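The intermediate class of initialisation functions can be sketched for the ir = m(pc) example as follows. The `State` record, and in particular the repair rule used when the invariant fails (reloading ir from memory), are assumptions made for this sketch; the text only requires that h(a) = a whenever the invariant holds.

```python
from typing import NamedTuple, Tuple

class State(NamedTuple):
    m: Tuple[int, ...]   # memory, indexed by address
    pc: int              # program counter
    ir: int              # instruction register

def invariant(a: State) -> bool:
    """The consistency condition I: ir = m(pc)."""
    return a.ir == a.m[a.pc]

def h(a: State) -> State:
    """Leave consistent states unchanged; otherwise reload ir from memory.

    The repair rule is an assumption of this sketch, not part of the paper.
    """
    return a if invariant(a) else a._replace(ir=a.m[a.pc])

good = State(m=(7, 8, 9), pc=1, ir=8)
bad = State(m=(7, 8, 9), pc=1, ir=0)
print(h(good) == good)       # True
print(h(bad))                # State(m=(7, 8, 9), pc=1, ir=8)
print(h(h(bad)) == h(bad))   # True
```

With this choice every state in the image of h satisfies the invariant, so h is idempotent and acts as the identity on exactly the consistent states.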
Data Abstraction
Data abstraction is modelled using a surjective mapping between two state-spaces. Let ψ : B → A be a surjective map between two non-empty sets A and B. Surjectivity ensures that all abstract states in the set A have at least one representative in the set B. If all the states in A have exactly one representative in B then the map ψ is bijective; in this case the state-spaces A and B are said to be at the same level of abstraction and ψ is an isomorphism. Data abstraction maps ψ : B → A are often projections between two composite state-spaces, for example, a map
Commonly, the implementation state-space B contains, without modification, all the components of the abstract state-space A together with components strictly unique to the implementation.
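A projection of this kind can be checked exhaustively on a small example. The component sets below (a program counter, a register and an implementation-only pipeline latch) are hypothetical and deliberately tiny; the sketch only confirms that dropping the implementation-only component gives a surjective map.

```python
from itertools import product

# Hypothetical component sets, kept tiny so the check is exhaustive.
PC = range(2)      # programmer-visible program counter
REG = range(2)     # programmer-visible register
LATCH = range(3)   # pipeline latch, present only in the implementation

A = set(product(PC, REG))          # abstract state-space
B = set(product(PC, REG, LATCH))   # implementation state-space

def psi(b):
    """Project an implementation state onto its abstract components."""
    pc, reg, _latch = b
    return (pc, reg)

image = {psi(b) for b in B}
print(image == A)       # True: every abstract state has a representative
print(len(B), len(A))   # 12 4
```

Here each abstract state has three representatives, so the map is surjective but not bijective, and the two state-spaces are at different levels of abstraction.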
Temporal Abstraction: Retimings and Immersions
This section formally defines retimings⁴ and immersions. Two clocks are related using a temporal abstraction map, or retiming. Retimings are characterised by three properties: (i) cycle zero of one clock is always mapped to cycle zero of the other; (ii) the mapping is surjective; and (iii) the mapping is monotonic. Monotonicity ensures there is never a discrepancy, after abstraction, in the temporal ordering of events because, for all s, s′ ∈ S, if s′ ≥ s, then λ(s′) ≥ λ(s), where λ is a retiming.
Definition 5 Let T and S be two clocks. A retiming λ : S → T is a surjective and monotonic map between two clocks such that λ(0) = 0. The set of all retimings from clock S to clock T is denoted by Ret(S, T ).
Definition 6 The immersion λ̄ : T → S of a retiming λ ∈ Ret(S, T) is defined by the equation

$$\bar{\lambda}(t) = \text{least } s \in S \text{ such that } \lambda(s) = t.$$

The set of all immersions of retimings in Ret(S, T) is denoted by Imm(S, T).
Note that although an immersion is defined by unbounded minimalisation it is total because retimings are surjective.
Theorem 7 If λ ∈ Ret(S, T) is a retiming then λ ∘ λ̄ = id_T.
PROOF. Follows trivially from Definition 6.
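The relationship between a retiming and its immersion can be tried out directly. In the sketch below the retiming `lam` (three cycles of clock S per cycle of clock T) is a hypothetical example, and the `horizon` parameter is an assumption that bounds the search so that the sketch terminates even if it is given a map that is not surjective.

```python
def immersion(lam, t, horizon=10_000):
    """Least s with lam(s) == t, found by bounded search.

    The horizon only guards this sketch against a non-surjective lam.
    """
    for s in range(horizon):
        if lam(s) == t:
            return s
    raise ValueError("lam does not reach t within the horizon")

def lam(s):
    """Hypothetical retiming: three cycles of clock S per cycle of clock T."""
    return s // 3

print([immersion(lam, t) for t in range(4)])                  # [0, 3, 6, 9]
print(all(lam(immersion(lam, t)) == t for t in range(100)))   # True
print(immersion(lam, lam(4)))                                 # 3
```

The second line checks the identity of Theorem 7 on an initial segment of T. The third shows that composing in the other order is not the identity on S: cycle 4 is sent back to cycle 3, the first cycle of S that maps to the same cycle of T.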
There are many possible maps λ̄ such that λ ∘ λ̄ = id_T. Each is called a section of λ. We have chosen a canonical section that is usually convenient in practice. However, we are prepared to modify our definition if a particular example demands it, and have done so [25].
Definition 9 The length function l : Ret(S, T) → [T → S⁺] is defined by the equation

$$l(\lambda)(t) = \bar{\lambda}(t + 1) - \bar{\lambda}(t).$$

Definitions 6 and 8 are illustrated with an example in Figure 3, and by Definition 9 the following equalities hold for the illustrated retiming: l(λ)(0) = 4 and l(λ)(1) = 2.

⁴ Not to be confused with the retimings of [39].
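The length function can be computed from a finite prefix of a retiming. The sketch below builds a retiming from a list of cycle lengths; the first two lengths are chosen to agree with the values l(λ)(0) = 4 and l(λ)(1) = 2 quoted for Figure 3, while the remaining lengths and all helper names are assumptions.

```python
def retiming_from_lengths(lengths):
    """Return lam as a list over clock S, given the length of each cycle of T."""
    lam = []
    for t, n in enumerate(lengths):
        lam.extend([t] * n)
    return lam

def immersion(lam, t):
    """Least s with lam[s] == t."""
    return lam.index(t)

def length(lam, t):
    """l(lam)(t) = immersion(t + 1) - immersion(t)."""
    return immersion(lam, t + 1) - immersion(lam, t)

# Assumed cycle lengths: the first two match the values quoted in the text,
# the remaining ones are arbitrary.
lam = retiming_from_lengths([4, 2, 3, 1])
print(lam)                              # [0, 0, 0, 0, 1, 1, 2, 2, 2, 3]
print(length(lam, 0), length(lam, 1))   # 4 2
```

Since only a finite prefix is stored, the length of the last listed cycle cannot be computed in this representation; a retiming over full clocks has no such limit.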
Definition 10 A clock S is faster than clock T if there is a retiming λ from S to T . Note that our definition of faster admits the possibility that S runs at the same speed as T .
State-Dependent and Uniform Retimings
Retimings, and consequently immersions, should be determined relative to state transition. In particular, the retiming λ : S → T relating two equivalent systems F : T × A → A and G : S × B → B should be determined by the initial state b ∈ B of G. This is because the orbits of a state function⁵ are determined by each initial state G(0, b) ∈ B.
Definition 11 A state-dependent retiming λ : A → Ret(S, T ) is a map from states to retimings. The set of all state-dependent retimings from state space A to retimings in Ret(S, T ) is denoted by Ret(A, S, T ).
Definition 12 The immersion λ̄ of a state-dependent retiming λ is defined by the equation

$$\bar{\lambda}(a)(t) = \text{least } s \in S \text{ such that } \lambda(a)(s) = t.$$

The set of all immersions of retimings in the set Ret(A, S, T) is denoted by Imm(A, S, T).

⁵ The orbit of an iterated map is the sequence of states generated by repeated applications of the next-state function. We will consider the case when input streams are present in Section 7.
For each state of an implementation there is an associated state-dependent retiming. Uniform retimings provide a strong connection between the states generated by a state function F , from an initial state a, and the one retiming associated with the state a. Given a uniform retiming λ : A → Ret(S, T ) the length l(λ(a))(t) of any clock cycle t ∈ T should be independent of the numerical value of t.
Example
The retiming λ : N → Ret(S, T ) defined by λ(a)(s) = s/a can be uniform, since l(λ(a))(t) = a for all t ∈ T and a ∈ N. However, the retiming λ(a)(s) = log_a(s) cannot be uniform because the value of l(λ(a))(t) increases monotonically as t becomes larger. We achieve uniformity by associating a duration with each state in the state-space of F . We choose to structure the definition of uniform retimings to enable us to determine syntactically, and hence easily, when a retiming is uniform. We note that compliance with Definition 13 is a sufficient but not a necessary condition for a retiming λ to be uniform.
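The contrast between the two retimings of the Example can be checked numerically. The sketch below rests on several assumptions not fixed by the text: it reads s/a as integer division, replaces the logarithm by ⌊log_a(s + 1)⌋ so that clock cycle 0 maps to 0, requires a ≥ 1 (respectively a ≥ 2), and truncates the search for the immersion at a finite horizon. All names are hypothetical.

```python
def immersion(lam, t, horizon=10_000):
    """Least s below the horizon with lam(s) == t."""
    for s in range(horizon):
        if lam(s) == t:
            return s
    raise ValueError("no such cycle below the horizon")

def length(lam, t):
    """l(lam)(t) = immersion(t + 1) - immersion(t)."""
    return immersion(lam, t + 1) - immersion(lam, t)

def divide(a):
    """lam(a)(s) = s / a, read as integer division (assumes a >= 1)."""
    return lambda s: s // a

def ilog(a):
    """lam(a)(s) = floor(log_a(s + 1)), computed with integers (assumes a >= 2)."""
    def lam(s):
        k, power = 0, a
        while power <= s + 1:
            k += 1
            power *= a
        return k
    return lam

print([length(divide(3), t) for t in range(4)])  # constant cycle length
print([length(ilog(2), t) for t in range(4)])    # cycle length grows with t
```

For the division retiming every cycle has the same length a, whereas for the logarithmic retiming the length depends on the numerical value of t, which is what rules out uniformity.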
Definition 13 A state-dependent retiming λ ∈ Ret(A, S, T ) is uniform with respect to a state function F : S × A → A if, and only if, there exists a map dur : A → S + such that, for all a ∈ A and t ∈ T ,
λ̄(a)(0) = 0,
λ̄(a)(t + 1) = dur(F (λ̄(a)(t), a)) + λ̄(a)(t),
where λ̄ ∈ Imm(A, S, T ) is the immersion of λ and S + = S − {0}. The set of all uniform retimings with respect to F is denoted by U Ret F (A, S, T ).
Suppose that F represents the implementation of some system over a clock S, and that T is the (slower) clock of the corresponding specification. Then specification clock cycle t + 1 ∈ T lasts dur(x) cycles of clock S, where x = F (λ̄(a)(t), a) is the state of F on clock cycle λ̄(a)(t) ∈ S, which is the cycle of implementation clock S corresponding with the start of the previous specification clock cycle t ∈ T . Note that dur is a function only of state, and consequently the number of cycles corresponding with any state is independent of the numerical value of t ∈ T and s ∈ S.
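To make the recursion of Definition 13 concrete, here is a minimal sketch that computes the start times of specification cycles from a duration map. It assumes that an initialised iterated map has the form F(0, a) = h(a), F(s + 1, a) = f(F(s, a)); the toy state space (a counter in {1, 2, 3, 4}), the maps h, f and dur, and all names are hypothetical and chosen only for illustration.

```python
def iterated_map(h, f):
    """Assumed form of an initialised iterated map: F(0, a) = h(a), F(s+1, a) = f(F(s, a))."""
    def F(s, a):
        x = h(a)
        for _ in range(s):
            x = f(x)
        return x
    return F

def uniform_immersion(F, dur):
    """Start times generated by a duration map, following the two equations of Definition 13."""
    def imm(a, t):
        s = 0                        # imm(a)(0) = 0
        for _ in range(t):
            s = dur(F(s, a)) + s     # imm(a)(t + 1) = dur(F(imm(a)(t), a)) + imm(a)(t)
        return s
    return imm

def h(n):
    return n if n in (1, 4) else 4   # toy initialisation: counters 2 and 3 reset to 4

def f(n):
    return max(n - 1, 1)             # toy next-state: count down, then stay at 1

def dur(n):
    return n                         # duration read directly from the state

F = iterated_map(h, f)
imm = uniform_immersion(F, dur)
print([imm(4, t) for t in range(4)])
print([imm(1, t) for t in range(4)])
```

Because dur inspects only the state reached at each start time, the resulting cycle lengths do not depend on the numerical value of t or s.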
In Section 2.3 it was shown that temporal abstraction is strongly related to data abstraction. Given a data abstraction ψ : B → A a simple definition for the duration map of a uniform retiming λ ∈ U Ret F (A, S, T ) is
where f is the next-state function of F (Definition 2). For simple case studies the duration map can be effectively defined in a static manner. That is, each duration is worked out in advance. However, this becomes increasingly difficult with complex examples [14, 18, 19]. Typically, the duration function of significant examples must be dynamically defined: that is, it must search the future states of F until it identifies a combination of state values indicating the end of a cycle of the specification clock T . Note that a consequence of this is that we cannot easily use our formal verifications to make concrete statements about the timing behaviour of implementations (and neither indeed can other workers who take a similar approach). This is because a retiming λ and its corresponding duration function dur form part of the correctness statement (as does data abstraction map ψ). If dur is defined statically, then it makes concrete statements about how many cycles of the implementation clock S correspond with one cycle of the specification clock T for each possible implementation state b ∈ B. These statements must be accurate for correctness to hold. If dur is defined dynamically, it makes concrete statements about relations between the elements of each b ∈ B that must hold to mark the end of a cycle of clock T . However, it does not identify how many cycles of clock S are required for this to happen.
Implementation Correctness
This section provides an equational definition of correctness through the comparison of two state algebras: the state function of an implementation is compared with that of an abstract requirements specification. For this comparison, the state sequences specified by the implementation are mapped into the abstract [requirements] domain by the suitable application of a data abstraction map ψ and a temporal abstraction map λ. Data and temporal abstractions specify exactly how an implementation is correct and should be viewed as intrinsic parts of the design. 
Theorem 15 A map G is a correct implementation of F with respect to λ and ψ if, and only if, the following diagram commutes for all b ∈ B and s = start(λ(b))(s)
PROOF. Follows trivially from Definition 14.
Correctness is required to hold at all 'start' clock cycles. That is, states of the implementation corresponding with [observable] specification states. This is expressed with the equation s = start(λ(b))(s). All cycles such that s = start(λ(b))(s) are enumerated by the immersion λ̄ ∈ Imm(B, S, T ).
The stipulation that the data abstraction map ψ is surjective means that all abstract states are representable at the implementation level, but this does not imply that representatives need ever occur in the range of the implementation: the implementation can be restricted in such a way that not all abstract state sequences can be generated. The following definition ensures that all valid initial abstract states can be represented by an initial implementation state (see Figure 4). Note: The set H({0} × C) denotes the image of the set {0} × C under H. That is, the set {H(0, c) | c ∈ C}.
Time-Consistency and the One-Step Theorems
Iterated map state functions are time-consistent if they facilitate a process of staggered state evolution. The following question is addressed: for all times s ∈ λ̄(b)(t), for all b ∈ B and t ∈ T , if the clock is reset to zero and the current state becomes an initial state, then is there any noticeable effect upon future state evolution? An iterated map F : S × A → A is time-consistent if its initialisation function h : A → A characterises a state invariant. Expressed formally: h(a) = a for all states a ∈ F (λ(A) × A) in the range of F .
We can exploit this property to eliminate induction in the verification of one iterated map representation with respect to another. Briefly, given two time-consistent iterated maps F : T × A → A and G : S × B → B, related by surjective data abstraction map ψ and uniform retiming λ, we can reduce the verification of G with respect to F by just considering correctness at specification times t = 0 and t = 1: that is, times s = 0 and s ∈ {λ̄(b)(1) for all b ∈ B}. The uniformity of retiming λ can be established syntactically; the first one-step theorem can be applied to establish time-consistency of an iterated map; and the second one-step theorem can be applied to eliminate induction in verifying correctness.
Definition 17
where s 2 = λ̄(a)(t 2 ) and s 1 = λ̄(F (s 2 , a))(t 1 ).
Theorem 18
If F is an iterated map with initialisation function h and next-state function f then F is time-consistent with respect to λ if, and only if, the following diagram commutes for all a ∈ A, s 2 = λ̄(a)(t 2 ) and
PROOF. Follows trivially from Definition 17.
The following two results are the one-step theorems. Theorem 19 states that if λ ∈ Ret(B, S, T ) is a uniform retiming then time-consistency with respect to λ is sufficiently verified by examining the implementation at times t = 0, 1.
Time-consistent iterated maps have the property that all possible occurrent states arise at time zero. Theorem 20 states that retiming uniformity and implementation time-consistency are sufficient conditions to enable correctness to be wholly verified by examining the two times t = 0, 1.
Theorem 19
If F : S × A → A is an iterated map with initialisation function h : A → A and if λ ∈ U Ret F (A, S, T ) is a uniform retiming then F is time-consistent with respect to λ if, and only if, for all a ∈ A
(1) F (λ̄(a)(0), a) = h(F (λ̄(a)(0), a)), and
(2) F (λ̄(a)(1), a) = h(F (λ̄(a)(1), a)).
PROOF. See [14] . This theorem has also been mechanically verified in HOL [15] .
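The following is a hedged sketch of how conditions (1) and (2) of Theorem 19 might be checked by exhaustive evaluation over a finite state space. It assumes the iterated-map form F(0, a) = h(a), F(s + 1, a) = f(F(s, a)), and uses the facts that the first start time is 0 and the second is dur(F(0, a)), as given by Definition 13. The counter example, the two candidate initialisation functions and all names are hypothetical.

```python
def one_step_conditions_hold(h, f, dur, states):
    """Check h(x) == x at the states reached at the first two start times."""
    for a in states:
        x0 = h(a)                  # state at start time 0: F(0, a) = h(a)
        x1 = x0
        for _ in range(dur(x0)):   # second start time is dur(F(0, a))
            x1 = f(x1)
        if h(x0) != x0 or h(x1) != x1:
            return False
    return True

STATES = (1, 2, 3, 4)

def f(n):
    return max(n - 1, 1)           # count down towards 1, then stay there

def dur(n):
    return n

def weak_h(n):
    return n if n in (1, 4) else 4  # leaves the reachable start states 1 and 4 unchanged

def strong_h(n):
    return 4                        # always resets, even a state that is already valid

print(one_step_conditions_hold(weak_h, f, dur, STATES))
print(one_step_conditions_hold(strong_h, f, dur, STATES))
```

In this toy setting the weaker initialisation function passes both conditions, while the stronger one fails condition (2), because it alters a state that arises at a later start time.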
Theorem 20 Let F : T × A → A and G : S × B → B be iterated maps. Let ψ : B → A be a data abstraction map and let λ ∈ U Ret G (B, S, T ) be a uniform retiming. If
(1) F is non-initialised, and (2) G is time-consistent with respect to λ then G is a correct implementation of F if, and only if
In practice, we initially apply Theorem 19 to establish the time-consistency of implementation G 6 and then apply Theorem 20 to verify the correctness of G with respect to F .
An Abstract Pipeline Case Study
In this section, two abstract pipelined designs, called P 1 and P 2 , are presented and completely verified, by hand, using term-rewriting under the correctness criteria of the one-step theorems. In addition, P 1 and P 2 have also been verified using HOL and Maude. Implementation P 1 is a simple pipeline that is always fully initialised; implementation P 2 may initially be empty. In Section 7.3 a third design P 3 with dynamic stalling is presented, though not verified. The intent of this work is to provide an elegant treatment of the subtle data and temporal abstraction aspects of pipelined organisations. A very simple non-pipelined requirements specification, called TR, is used. The device TR contains sufficient functionality to demonstrate the underlying temporal and invariance principles of pipelined designs. The device contains the following key components:
Memory. The device TR has two memories: one contains source data src, and the other contains computation results dst. The memories are addressed by registers: the memory source register msr, and the memory destination register mdr. Temporal abstraction is especially significant when a computing device contains memory. Almost by definition, read-write memory keeps a history of past events (state changes).
A Composite Operation. The device TR transfers data from source memory to destination memory, and in the process the data is transformed using a composite operation f = (f n ∘ · · · ∘ f 2 ∘ f 1 ). The pipelined implementations perform the operations f i in time.
The implementations P 1 and P 2 exclude the temporal consequences of [instruction] dependencies. The temporal behaviour of real pipelined [processor] designs is heavily influenced by the instructions under execution: see [14, 18] . Two memories are used to simplify the example, by ensuring the contents of the pipeline are not rendered obsolete by the storage of data to the source memory.
Addressable Memory
Memory is modelled as a map from a memory address-space to memory words. 
The memory word at address a ∈ MAR of memory m ∈ [MAR → W ] is denoted m(a); and if the value w ∈ W is stored at address a then the resultant memory is denoted m[w/a]. The memory substitution function is related to the memory read operation by the following equation
The Requirements Specification
The abstract device TR contains two memories and two memory-address registers. The memory state-space is M = [MAR → W ] where W is any non-empty set, and the memory-address register state-space is MAR. The state-space of TR is State TR = M × MAR × MAR × M .
The device TR is specified with state function TR : T × State TR → State TR and next-state function tr : State TR → State TR
TR(0, src, msr, mdr, dst) = (src, msr, mdr, dst),
TR(t + 1, src, msr, mdr, dst) = tr (TR(t, src, msr, mdr, dst))
where src ∈ M , msr ∈ MAR, mdr ∈ MAR and dst ∈ M . The next-state function tr updates the destination memory dst at location mdr with f (src(msr)), and increments both memory-address registers.
tr (src, msr, mdr, dst) = (src, msr + 1, mdr + 1, dst[f (src(msr))/mdr])
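As a minimal executable sketch of the specification, the fragment below models memories as fixed-size tuples and the address space MAR as the integers modulo N. The wrap-around arithmetic, the value of N, the sample operation f and all function names are assumptions made for illustration; the text leaves MAR, W and f abstract.

```python
N = 8  # hypothetical size of the address space MAR = {0, ..., N - 1}

def read(m, a):
    """m(a): the word stored at address a."""
    return m[a % N]

def substitute(m, w, a):
    """m[w/a]: the memory that agrees with m except that address a holds w."""
    a %= N
    return m[:a] + (w,) + m[a + 1:]

def tr(state, f):
    """Next-state function of TR: store f(src(msr)) at dst(mdr), increment both registers."""
    src, msr, mdr, dst = state
    return (src, (msr + 1) % N, (mdr + 1) % N, substitute(dst, f(read(src, msr)), mdr))

def TR(t, state, f):
    """State function of TR: TR(0, x) = x, TR(t + 1, x) = tr(TR(t, x))."""
    for _ in range(t):
        state = tr(state, f)
    return state

def f(w):
    return 10 * w                 # stand-in for the composite operation

src = tuple(range(1, N + 1))      # sample source data 1, 2, ..., 8
dst = (0,) * N
print(TR(2, (src, 0, 0, dst), f))
```

Reading back a substituted memory returns the new word at the written address and the old word everywhere else, which is the usual relationship between substitution and read.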
The primitive recursive function f : W → W is not explicitly defined, but the two pipelined implementations assume f = (f 4 ∘ f 3 ∘ f 2 ∘ f 1 ) for some functions f 1 , f 2 , f 3 and f 4 .
Fig. 6. P 1 : A pipelined implementation of TR.
A Permanently Full Four-Stage Pipeline
A four-stage pipelined implementation of TR, called P 1 , is shown in Figure 6. Three additional state components w 1 , w 2 and w 3 form a pipeline by storing intermediate computations of the operation f = (f 4 ∘ f 3 ∘ f 2 ∘ f 1 ).
The statespace of P 1 is
The iterated map state function P 1 : S × State P 1 → State P 1 is defined by the equations
where σ = (src, msr, w 1 , w 2 , w 3 , mdr, dst),
and
The initialisation function p 0 1 : State P 1 → State P 1 establishes a full pipeline by ensuring the state of component w i corresponds with source data from memory-cell address msr − i, after the appropriate incremental application of the operations f j , for all j ≤ i. For example, at the third pipeline stage w 3 = f 3 (f 2 (f 1 (src(msr − 3)))). Initialisation function p 0 1 is as weak as possible: if σ is already consistent with correct future execution (that is, the pipeline is already correctly initialised), then σ = p 0 1 (σ).
The next-state function p 1 : State P 1 → State P 1 maintains a full pipeline by forwarding computation results along the pipeline. For example, w 3 stores f 3 (w 2 ). The last stage of the pipeline applies the operation f 4 to component w 3 and stores the result in the memory dst at the address mdr.
The temporal relationship between P 1 and TR is trivial: a memory substitution occurs upon every cycle of the clock T and on every cycle of the clock S, therefore P 1 and TR are at the same level of temporal abstraction. That is, they are related by the retiming λ ∈ Ret(State P 1 , S, T ) with λ(a)(s) = s. This may seem counter-intuitive because the main purpose of pipelining is to increase temporal performance. One must remember that the clocks T and S enumerate state change and do not directly represent temporal performance in the physical sense. Given a clock RealTime enumerating the state change of a physical clock, for some suitable (discrete and equal) time intervals, then the retimings λ 1 ∈ Ret(RealTime, S) and λ 2 ∈ Ret(RealTime, T ) express the temporal characteristics of P 1 and TR respectively. For example, if each cycle of clock S lasts one RealTime cycle, then it is reasonable to assume as a first approximation that each cycle of clock T lasts approximately four RealTime cycles, because each operation f i should be a [more trivial] stage in the computation of the complex operation f . This is illustrated in Figure 7 where the retiming λ is plotted with respect to the clock RealTime, and a performance increase is observed for the pipelined implementation P 1 .
Care must be taken when defining data abstraction ψ : State P 1 → State TR . A naïve first attempt to define ψ is the projection ψ(src, msr, w 1 , w 2 , w 3 , mdr, dst) = (src, msr, mdr, dst). occurs. This is incompatible with the presumption that the msr of the specification and implementation are directly equivalent.
One method of perceiving the relationship between the memory source registers is that the implementation msr is a temporally advanced version of the specification msr. In our view, this is misleading, and suggests the components are still one and the same. Instead we regard the components as fundamentally different, because our data abstraction is temporally invariant. The rôle of msr is dictated by whether the pipeline is full or empty, and this is [primarily] a property of state not time. For any fixed time, msr in the specification represents the location of source data, but at the implementation level msr is a component used to fill the pipeline and msr − 3 is the required location of source data. That is, msr in the specification is a function of the current value of msr in the implementation (and possibly, in the general case, other state components). By taking this view, we are able to maintain the division between data and temporal abstraction functions, which are more conveniently defined separately. An alternative view is that msr in the specification is the value of msr in the implementation from three clock cycles earlier. This view does not allow data and timing abstraction to remain separate.
Theorem 21
The map P 1 is a correct implementation of TR with respect to data abstraction map ψ : State P 1 → State TR ψ(src, msr, w 1 , w 2 , w 3 , mdr, dst) = (src, msr − 3, mdr, dst) and uniform retiming λ ∈ U Ret P 1 (State P 1 , S, T ) whose duration function dur : State P 1 → S + is defined by the equation dur(src, msr, w 1 , w 2 , w 3 , mdr, dst) = 1.
PROOF. See Appendix. Theorem 21 has also been verified mechanically both with HOL and Maude.
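The statement of Theorem 21 can be exercised on sample data with a small simulation. This is a sketch under stated assumptions, not a substitute for the proof: the definitions of the initialisation and next-state functions of P 1 are reconstructed from the prose description above, addresses are taken modulo a hypothetical N, and the stage operations f1 to f4 are arbitrary stand-ins.

```python
N = 8                                   # hypothetical address-space size
f1 = lambda w: w + 1                    # arbitrary stand-ins for the four stages
f2 = lambda w: 2 * w
f3 = lambda w: w - 3
f4 = lambda w: w * w
f = lambda w: f4(f3(f2(f1(w))))         # composite operation

def sub(m, w, a):
    """m[w/a] on tuple memories."""
    a %= N
    return m[:a] + (w,) + m[a + 1:]

def TR(t, state):
    src, msr, mdr, dst = state
    for _ in range(t):
        dst = sub(dst, f(src[msr]), mdr)
        msr, mdr = (msr + 1) % N, (mdr + 1) % N
    return (src, msr, mdr, dst)

def p1_init(sigma):
    """Assumed initialisation: make w1, w2, w3 consistent with a full pipeline."""
    src, msr, _, _, _, mdr, dst = sigma
    w1 = f1(src[(msr - 1) % N])
    w2 = f2(f1(src[(msr - 2) % N]))
    w3 = f3(f2(f1(src[(msr - 3) % N])))
    return (src, msr, w1, w2, w3, mdr, dst)

def p1_next(sigma):
    """Assumed next-state: forward results along the pipeline, store f4(w3)."""
    src, msr, w1, w2, w3, mdr, dst = sigma
    return (src, (msr + 1) % N, f1(src[msr]), f2(w1), f3(w2),
            (mdr + 1) % N, sub(dst, f4(w3), mdr))

def P1(s, sigma):
    sigma = p1_init(sigma)
    for _ in range(s):
        sigma = p1_next(sigma)
    return sigma

def psi(sigma):
    src, msr, _, _, _, mdr, dst = sigma
    return (src, (msr - 3) % N, mdr, dst)

def psi_naive(sigma):
    src, msr, _, _, _, mdr, dst = sigma
    return (src, msr, mdr, dst)

def commutes(abstraction, sigma, steps):
    """Compare abstracted P1 states with TR, using the identity retiming (dur = 1)."""
    spec0 = abstraction(P1(0, sigma))
    return all(abstraction(P1(s, sigma)) == TR(s, spec0) for s in range(steps))

sigma0 = (tuple(range(1, N + 1)), 3, 0, 0, 0, 0, (0,) * N)
print(commutes(psi, sigma0, 12))
print(commutes(psi_naive, sigma0, 12))
```

On this sample the map with msr − 3 agrees with TR at every cycle tested, while the projection does not, since the word written by P 1 comes from an address three places behind its own msr.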
A Self-Initialising Four-Stage Pipeline
The implementation of Section 5.3 assumes that the pipeline is permanently full, but in practice pipelines exhibit different stages of operation. For example, a pipeline may be flushed or emptied. Pipelined microprocessors must flush the instruction pipeline when conditional-branch prediction fails. For example, suppose the instruction at address pc is a conditional branch which has been predicted not taken. If in actuality this branch is taken, the pipeline will contain instructions from addresses pc + 1, pc + 2, . . . but should contain the instructions from addresses dst, dst + 1, . . . where dst is the branch destination. We will fill our example pipeline using a counter ctr ∈ {1, 2, 3, 4}. If ctr = 1 then the pipeline is assumed to be full and the implementation maintains the functionality of the first implementation. If ctr = 4 then the pipeline is assumed to be empty with the pipeline components w 1 , w 2 and w 3 containing junk values. The pipeline is filled by decrementing ctr while ensuring junk values are not stored in the destination memory dst until ctr = 1.
The state-space State P 2 is an expansion of State P 1 to include the counter
The iterated map state function P 2 : S × State P 2 → State P 2 is defined by the equations
where σ = (ctr , src, msr, w 1 , w 2 , w 3 , mdr, dst),
(1, p 0 1 (src, msr, w 1 , w 2 , w 3 , mdr, dst)), if ctr = 1 and
(1, p 1 (src, msr, w 1 , w 2 , w 3 , mdr, dst)), if ctr = 1.
The initialisation function p 0 2 : State P 2 → State P 2 ensures the device is either empty or full at cycle zero. In the case that it is empty, we need take no steps to ensure that the intermediate state components w 1 , w 2 , w 3 contain values consistent with correct future execution; if it is full, then all of w 1 , w 2 , w 3 must be consistently initialised. We choose to map the intermediate pipeline stages when ctr = 2 or ctr = 3 to an empty pipeline by setting ctr = 4. This is a stronger initialisation function than is strictly necessary: we could choose to define functions to consistently initialise a partly-filled pipeline. However, in this particular example, the simpler definition is adequate, and does not violate time-consistency. This is because, like the first implementation, once the pipeline is full it remains so, and therefore these intermediate states will never re-occur.
If ctr = 1 then, for the state of the pipeline to be consistent with correct execution, w 1 , w 2 and w 3 should contain f 1 (src(msr − 1)), f 12 (src(msr − 2)) and f 123 (src(msr − 3)). Hence we use p 0 1 ∈ State P 1 to initialise the pipeline, since if initial state σ is already correctly initialised, σ = p 0 1 (σ).
The next-state function p 2 : State P 2 → State P 2 provides for two cases: if ctr = 1 then the pipeline is full and p 2 is [nearly] identical to the next-state function p 1 . But if ctr > 1 then the pipeline is progressively filled. This means ctr is decremented while the dst memory is unaltered to prevent storing junk values.
The second implementation P 2 is a correct implementation of TR. The temporal relationship between P 2 and TR is more in line with the classical model of pipelines. The duration associated with any given state is directly proportional to the value of the counter ctr . That is, the extent to which the pipeline is full. By the construction of p 0 2 , this gives two cases: four cycles when the pipeline is empty and one cycle when the pipeline is full. The retiming λ ∈ Ret(State P 2 , S, T ) is illustrated in Figure 8 for the case of an empty initial pipeline state. The first four cycles are used to fill the pipeline and after that Figure 8 is identical to Figure 7 with an offset of four. Note: The speed of the pipeline is dictated by the slowest pipeline operation and in practice the cycles s = 4 and t = 1 would not correspond with the same physical time. In practice, the four pipeline stages would take longer than a single application of the operation f . The performance increase of a pipelined design comes from maintaining a full pipeline: one would expect t′/4 < t < t′, where t is the duration of the slowest pipeline stage and t′ is the duration of a monolithic non-pipelined implementation.
The rôle of the memory source register is dependent on whether the pipeline is full or empty. If the pipeline is full then the data abstraction ψ : State P 2 → State TR is [nearly] identical to that of Theorem 21. If the pipeline is empty, then the memory source register of implementation P 2 is identical to the specification's memory source register. The rôle of the msr in the implementation is indirectly related to time (because the pipeline becomes full in time), but its relationship to msr in the specification is defined by the value of ctr .
Theorem 22
The map P 2 is a correct implementation of TR with respect to data abstraction map ψ :
(src, msr − 3, mdr, dst), if ctr = 1, and uniform retiming λ ∈ U Ret P 2 (State P 2 , S, T ) where duration function dur : State P 2 → S + is defined by the equation dur(ctr , src, msr, w 1 , w 2 , w 3 , mdr, dst) = ctr .
PROOF. See Appendix. Theorem 22 has also been verified mechanically both with HOL and Maude.
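A sketch of how the statement of Theorem 22 might be exercised on sample data follows. Several ingredients are assumptions reconstructed from the prose rather than taken from the formal definitions: the filling step of the next-state function (decrement ctr, advance msr and the pipeline, leave mdr and dst untouched), the treatment of every state with ctr ≠ 1 as having an unshifted msr under ψ, arithmetic modulo a hypothetical N, and the stand-in stage operations.

```python
N = 8                                   # hypothetical address-space size
f1 = lambda w: w + 1                    # arbitrary stand-ins for the four stages
f2 = lambda w: 2 * w
f3 = lambda w: w - 3
f4 = lambda w: w * w

def sub(m, w, a):
    a %= N
    return m[:a] + (w,) + m[a + 1:]

def TR(t, state):
    src, msr, mdr, dst = state
    for _ in range(t):
        dst = sub(dst, f4(f3(f2(f1(src[msr])))), mdr)
        msr, mdr = (msr + 1) % N, (mdr + 1) % N
    return (src, msr, mdr, dst)

def p2_init(sigma):
    ctr, src, msr, w1, w2, w3, mdr, dst = sigma
    if ctr == 1:                        # full: initialise w1..w3 consistently
        w1 = f1(src[(msr - 1) % N])
        w2 = f2(f1(src[(msr - 2) % N]))
        w3 = f3(f2(f1(src[(msr - 3) % N])))
        return (1, src, msr, w1, w2, w3, mdr, dst)
    return (4, src, msr, w1, w2, w3, mdr, dst)   # ctr = 2, 3 mapped to empty

def p2_next(sigma):
    ctr, src, msr, w1, w2, w3, mdr, dst = sigma
    stages = (f1(src[msr]), f2(w1), f3(w2))
    if ctr == 1:                        # full pipeline: behave like P1
        return (1, src, (msr + 1) % N, *stages, (mdr + 1) % N, sub(dst, f4(w3), mdr))
    return (ctr - 1, src, (msr + 1) % N, *stages, mdr, dst)  # assumed filling step

def P2(s, sigma):
    sigma = p2_init(sigma)
    for _ in range(s):
        sigma = p2_next(sigma)
    return sigma

def psi(sigma):
    ctr, src, msr, _, _, _, mdr, dst = sigma
    return (src, (msr - 3) % N if ctr == 1 else msr, mdr, dst)

def dur(sigma):
    return sigma[0]                     # dur(ctr, ...) = ctr

def start(sigma, t):
    """Start time of specification cycle t, built from dur as in Definition 13."""
    s = 0
    for _ in range(t):
        s += dur(P2(s, sigma))
    return s

def correct_up_to(sigma, cycles):
    spec0 = psi(P2(0, sigma))
    return all(psi(P2(start(sigma, t), sigma)) == TR(t, spec0) for t in range(cycles))

src, dst = tuple(range(1, N + 1)), (0,) * N
empty = (4, src, 3, 0, 0, 0, 0, dst)
print([start(empty, t) for t in range(4)])
print(all(correct_up_to((c, src, 3, 0, 0, 0, 0, dst), 10) for c in (1, 2, 3, 4)))
```

Starting from an empty pipeline the first specification cycle takes four implementation cycles and every later one takes a single cycle, matching the two cases described for the duration map.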
Initialisation and Abstraction
The purpose of the abstract case studies in Section 5.3 and Section 5.4 is to emphasise the distinct rôles of initialisation, data abstraction and temporal abstraction. Initialisation functions identify the desired behaviour of an implementation and directly affect verification efforts. This can be seen by comparing the second implementation P 2 of TR with the first implementation P 1 . The second implementation P 2 is an implementation of P 1 with respect to the retiming λ defined in Theorem 22, and the data abstraction map ψ : State P 2 → State P 1 that distinguishes a full and a partly-full, or empty, pipeline.
(src, msr, w 1 , w 2 , w 3 , mdr, dst) if ctr = 1.
There are two cases to consider in data abstraction map ψ: (i ) if the pipeline of the second implementation P 2 is empty (ctr > 1) then the msr of P 1 is three addresses ahead because its pipeline is full; and (ii ) if the second implementation P 2 is full, then the first and second implementations are identical once the counter ctr is hidden. The first implementation is more abstract because it does not specify how the pipeline is initially full: it just is. Nevertheless if the initialisation function for the second implementation was p 0 2 (ctr , src, msr, w 1 , w 2 , w 3 , mdr, dst) = (1, src, msr, w 1 , w 2 , w 3 , mdr, dst)
then the initial pipeline filling behaviour would never manifest itself and the two implementations are at the same level of abstraction. Therefore, when defining an initialisation function one must keep in mind two concerns: (i ) it must be sufficiently weak to ensure the implementation is time-consistent and complete; and (ii ) it can be even weaker to capture the desired initial implementation behaviour. There is little point in proving the second implementation is correct if the pipeline is always full. In the correctness definitions of [45, 7] it is not possible to define such systems (that is, a pipeline which is only empty at cycle zero) because an explicit model of time is not used. Our correctness definition is more expressive and general by virtue of an explicit and well-developed temporal model.
Data abstraction has been treated exclusively as a function of state, but not necessarily as a pure projection. Case-based data abstraction is used when the implementation exhibits different stages of operation. For example, data abstraction for the second implementation P 2 is adapted to accommodate the implications of empty, together with full, pipelines. It is possible to construct examples where data abstraction must be temporally dependent but it is believed that such examples would be forced and not reflective of actual pipelined microprocessors. The purpose of microprocessors is to implement the program-level behaviour of an architecture. Therefore it is unrealistic that the abstract parts of the implementation state cannot be readily determined, at a fixed time, from a state corresponding with the completion of an instruction. Although data abstraction might not be a pure projection it should not be necessary to 'look into the future' when determining the specification state.
Temporal abstraction has been modelled in a clean manner using duration maps. The first implementation performs a destination memory substitution on every machine cycle. Therefore dur(src, msr, w 1 , w 2 , w 3 , mdr, dst) = 1.
The second implementation might be initially empty. Therefore dur(ctr , src, msr, w 1 , w 2 , w 3 , mdr, dst) = ctr .
At cycle zero the component ctr either has the value four or one, therefore the definition above is effectively equivalent to dur(ctr , src, msr, w 1 , w 2 , w 3 , mdr, dst) =
since initially the pipeline is forced to be either full or empty, and once full remains so.
Input, Output and Dynamic Stalling
So far, we have only considered the case when state evolution is completely determined by the initial state of a system. In practice, this is not the case: state evolution is additionally governed by inputs to the system: either by means of direct input channels, or using some form of memory mapping in which some part of the memory address space is mapped to input devices. In addition, real systems can generate output, again either by means of direct output channels or memory mapping. Additionally, real systems can generally stall dynamically: that is, operation can be suspended because of some delay that does not directly relate to system state. A typical example of this is memory access, where some [possibly variable] significant delay can occur between starting and finishing a memory access operation.
In this section, we consider extending our model to include input and output data, and show how we can additionally apply the same techniques to model dynamic stalling.
Iterated Maps with Input
Our current iterated map model does not allow input or output. In this section, we extend our model to include input: output may also be trivially added [27] though we do not do so here. An input stream a ∈ [T → A] is a map from time to data: a(0) is the element of set A that arrives on input stream a at time 0; a(t) is the element that arrives at time t and so on. We extend initialised iterated maps with input streams as follows.
Definition 23 Let T be a clock, A be any non-empty set representing a state space, and In be any non-empty set representing input values. An initialised iterated map with input F : T × A × [T → In] → A is defined as follows:
where h : A → A and f : A × In → A are primitive recursive functions.
We now extend our correctness model to accommodate iterated maps with input while preserving the one-step theorems. To do this, we must (i) extend our concept of uniform retimings to include input streams; (ii) define an abstraction function mapping streams in an implementation to streams in a corresponding specification; and (iii) reformulate our correctness statement.
In order to extend uniform retimings to accommodate input streams, we must modify the definition of duration functions. This must be done carefully, as a duration function cannot be a [direct] function of time if uniformity is to be preserved. We first introduce the function advance : [T → In] × T → [T → In], which is defined as follows:
advance(in, t)(t′) = in(t + t′).
The function advance generates a new stream from in such that the first t elements of in are removed: that is, advance(in, t)(0) = in(t). We can now extend the definition of uniform retimings as follows. 
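The rôle of advance can be illustrated with a short sketch. Streams are modelled as Python functions from time to values. The form assumed here for an initialised iterated map with input, F(0, a, in) = h(a) and F(t + 1, a, in) = f(F(t, a, in), in(t)), is an assumption of the sketch, as are the running-sum example and all names.

```python
def advance(stream, t):
    """advance(in, t)(t') = in(t + t'): drop the first t elements of the stream."""
    return lambda t_prime: stream(t + t_prime)

def iterated_map_with_input(h, f):
    """Assumed form: F(0, a, in) = h(a), F(t + 1, a, in) = f(F(t, a, in), in(t))."""
    def F(t, a, stream):
        x = h(a)
        for k in range(t):
            x = f(x, stream(k))
        return x
    return F

# Hypothetical example: a running sum with identity initialisation.
F = iterated_map_with_input(lambda a: a, lambda a, x: a + x)
inp = lambda t: t * t                     # hypothetical input stream 0, 1, 4, 9, ...

# Restarting from the state reached at time 3, with the stream advanced by 3,
# reaches the same state as running for 7 cycles from the start.
restarted = F(4, F(3, 0, inp), advance(inp, 3))
direct = F(7, 0, inp)
print(restarted, direct)
```

The restart agrees with the direct run here because the initialisation function leaves the intermediate state unchanged; advancing the stream is what lets a duration function depend on pending input without depending directly on time.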
where λ̄ ∈ Imm(A, In, S, T ) is the immersion of λ. The set of all uniform retimings with respect to F is denoted by U Ret F (A, In, S, T ).
Next, given some state-set B, we must construct an abstraction map
for streams. There are two issues to consider: mapping In 1 to In 2 , and mapping streams over clock S to streams over clock T . To map In 1 to In 2 we assume the existence of some stream element abstraction map τ :
To map streams over [faster] clock S to streams over [slower] clock T we must select, or sample, some elements from the faster stream. There are many ways we could do this: the simplest is to select those elements at times s = start(λ)(s), where λ is the retiming that relates S and T . We define a function sample :
We can now define τ as follows
We are now in a position to reformulate our correctness statement. 
We are now able to extend the one-step theorems to iterated maps with input. (1) F (λ̄(a, in)(0), a, in) = h(F (λ̄(a, in)(0), a, in)), and (2) F (λ̄(a, in)(1), a, in) = h(F (λ̄(a, in)(1), a, in)).
PROOF. See [14] .
In 2 ] → B be iterated maps. Let ψ : B → A be a data abstraction map, let λ ∈ U Ret G (B, In, S, T ) be a uniform retiming and let τ :
It will commonly be the case that input streams will be present in both specification and implementation. However, it is possible that they will be present in the implementation only. This may be because they are hidden by abstraction mechanisms or, as in the case of dynamic stalling (Section 7.2), because they are used solely to model some facet of the implementation. This presents no difficulties: we simply omit the stream abstraction map τ and reformulate the correctness statement.
It is straightforward to modify the definitions of the one-step theorems: this is left as an exercise for the reader.
We will not consider examples with input here. Instead we will show how input streams can be used to model dynamic stalling.
Dynamic Stalling
Real hardware is commonly required to be able to [partially] suspend operation in the face of external delays of indeterminate, but generally bounded length.
Depending on the behaviour of the pipeline in the event of such delays, and However, such pipeline behaviour is not common. Typically, the early stages of the pipeline will stall, and the later stages (which may not be affected by the source of the stall) will continue. For example, in the case of a dynamic stall caused by a memory access, those stages responsible for, for example, writing back the results of previous instructions, can continue until the available data is exhausted. The net result is that pipelines typically partially empty, and then refill when the dynamic stall is resolved. For example, in Figure 9 pipeline element 4 stalls. Pipeline elements 1 to 3 consequently also stall, while elements 5 to 7 continue to operate as long as data is available. Superscalar machines effectively have a number of parallel pipelines, and a dynamic stall may only affect one (or a few) of them. Additionally, although it may in some circumstances be possible to temporally abstract away from some dynamic stalls, we may at a later date wish to model the same hardware at a lower level of temporal abstraction. The net result is that it is necessary to be able to model dynamic stalls in some way. We choose to model dynamic stalling by adding abstract Boolean streams to the implementation. Each source of dynamic stalls is represented by an abstract Boolean stream: when a true element arrives on the stream, the pipeline proceeds normally: when a false element arrives it stalls in the appropriate way.
Although at first sight somewhat clumsy, this method has two advantages. First, it allows us to model conceptually non-deterministic dynamic stalling in a deterministic way. Second, it allows us to apply the one-step theorems in the normal way without any modification. Note that these new Boolean streams appear only in the implementation, and not in the specification. However, this causes no difficulties: see Definition 28.
Although we may apply the one-step theorems to dynamic stalls, the net workload in undertaking verification may still be high: for example, in the event of a memory access that causes a cache miss 7 it may be necessary to wait 100 or more clock cycles. Consider that we will also need to consider delays of 0,1,2,. . . ,99 etc. clock cycles, for each case (combination of different instructions in the pipeline etc.).
In this paper, our principal concern is mathematical models, and not the practical efficiencies of [semi-]automated verification (which we begin to address elsewhere [15, 24]). However, clearly it would be advantageous to reduce the verification overhead in some way. Furthermore, the one-step theorems are a consequence of our mathematical models, but nonetheless are of practical benefit in simplifying verification obligations. We distinguish between simplifications derived from our mathematical model and its underlying [philosophical] assumptions, and simplifications arrived at operationally in implementing the mathematical model using software tools (for example, restating correctness conditions to eliminate repeated evaluation of the same terms). Therefore, we seek easily-identifiable conditions, within our mathematical model, that will allow us to further simplify verification obligations.
In the event of a dynamic stall caused by some element in the pipeline, some pipeline elements would typically retain their current state, and others would continue to operate on the available data until it has been exhausted. Subsequently, the pipeline state will remain unchanged until the stall is resolved. From the point of view of functional correctness, once this point has been reached, subsequent clock cycles can be ignored until the source of the stall is resolved.
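The draining behaviour described above can be illustrated with a small simulation. This is a minimal sketch and not part of the formal model: the list-of-slots representation and the names `stalled_step` and `drain` are assumptions made purely for illustration. Element `k` is the stalled element; elements up to and including `k` hold their contents, and the later elements keep shifting until no data is left.

```python
from typing import List, Optional, Tuple

Slot = Optional[int]  # None marks an empty pipeline element


def stalled_step(stages: List[Slot], k: int) -> Tuple[List[Slot], Slot]:
    """One clock cycle while element k (1-indexed) is stalled.

    Elements 1..k keep their contents; elements k+1..n shift along by one,
    with an empty slot entering element k+1. Returns the new contents and
    the value leaving the last element (None if nothing leaves).
    """
    held, tail = stages[:k], stages[k:]
    if not tail:  # the last element itself is stalled: nothing moves
        return list(stages), None
    return held + [None] + tail[:-1], tail[-1]


def drain(stages: List[Slot], k: int) -> Tuple[int, List[Slot], List[int]]:
    """Repeat stalled cycles until the pipeline state no longer changes."""
    cycles, retired = 0, []
    while True:
        new_stages, out = stalled_step(stages, k)
        if new_stages == stages:
            return cycles, stages, retired
        if out is not None:
            retired.append(out)
        stages, cycles = new_stages, cycles + 1


# Seven elements, element 4 stalls (compare the description of Figure 9).
cycles, final_state, retired = drain([1, 2, 3, 4, 5, 6, 7], 4)
```

In this toy run the state stops changing after three cycles, which is the point after which further clock cycles can be ignored until the stall is resolved.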
A Dynamically Stalling Pipeline
We now extend our self-filling pipeline example from Section 5.4 to include dynamic stalls. We assume that there is some source of dynamic stalls that affects the computation of $f_3$. Therefore, in the event of a stall, state elements $w_1$ and $w_2$ will retain their current states, and the pipeline will continue operating for one cycle utilising the value of $w_3$. After that, the entire pipeline will retain its current state until the stall is resolved. We will add a Boolean stream to control dynamic stalls. To ensure that stalls always terminate, we require Boolean streams to be from the set
$$Dstr = \{ str \in [S \to \mathbb{B}] \mid \forall s \in S,\ \exists s' \in S \text{ such that } s' > s \text{ and } str(s') = ff \}.$$
The state-space State P 2 is the same as that of the self-filling pipeline P 2 .
The iterated map state function $P_3 : S \times State_{P_2} \times Dstr \to State_{P_2}$ is defined by the equations
$$
\left\{
\begin{array}{lll}
(2, src, msr, f_1(src(msr-1)), \ldots, mdr, dst), & \ldots \text{ or } ctr = 2 \text{ and } stl = ff; & (1)\\
(2, src, msr, w_1, w_2, w_3, mdr, dst), & \text{if } ctr = 2 \text{ and } stl = tt; & (2)\\
(1, p_1(src, msr, w_1, w_2, w_3, mdr, dst)), & \text{if } ctr = 1 \text{ and } stl = ff; & (3)\\
(2, p_1(src, msr, w_1, w_2, w_3, mdr, dst)), & \text{if } ctr = 1 \text{ and } stl = tt. & (4)
\end{array}
\right.
$$
Observe that initialisation function $p^0_3$ is now more complex because a pipeline state with $ctr = 2$ should now not be emptied, since it may represent a stall state.
There are four cases to consider for next-state function $p_3$.
(1) The pipeline is being filled, either because it was initially empty, or because a stall has just finished.
(2) The pipeline is currently stalled.
(3) The pipeline is operating normally (not stalled).
(4) The pipeline has just stalled, and will continue to store results for one more cycle.
We now define the temporal relationship between $P_3$ and $TR$ in terms of a duration function that in turn defines a uniform state- and input-dependent retiming. First, we do not attempt to eliminate stall states when the entire pipeline is static. Duration function $dur : State_{P_2} \times Dstr \to \mathbb{N}^+$ is defined as follows.
$$dur(ctr, src, msr, w_1, w_2, w_3, mdr, dst, str) = NxtFalse(str) + 2, \quad \text{if } ctr = 2,$$
where $NxtFalse : Dstr \to \mathbb{N}$ is defined by
$$NxtFalse(str) = \text{least } s \in S \text{ such that } str(s) = ff.$$
Recall from Section 7.1 that before invoking $dur$ in the definition of a uniform state- and input-dependent retiming, the stream $str$ will be advanced, so $str(0)$ will be the current element and all earlier elements will have been discarded. Further recall that the definition of $Dstr$ ensures that for all $s \in S$ there will exist an $s' > s$ such that $str(s') = ff$, and hence $NxtFalse$ is a total function.
In the event that ctr = 1 or ctr > 2, then the duration function for P 3 is identical to that for P 2 . If ctr = 2 then either the pipeline is being filled or a dynamic stall is in progress. If the pipeline is being filled (str(0) = ff ) two cycles will elapse before a result is stored, even if a stall occurs on the following clock cycle (str(1) = tt ). If a stall is in progress (str(0) = tt ) then the next result will be stored two cycles after the stall finishes at some time s > 0 (str(s) = ff ). In either case, the value of dur is two more than the length of the block of contiguous true elements starting at time zero.
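The $ctr = 2$ case of the duration function can be sketched in a few lines. This is an illustrative sketch only: streams are represented as Python functions from clock cycles to Booleans (`True` for tt, `False` for ff), and the helper `from_prefix`, which pads a finite prefix with ff so that the stream lies in $Dstr$, is a hypothetical convenience and not part of the model.

```python
from itertools import count
from typing import Callable, Sequence

Stream = Callable[[int], bool]  # str : S -> B, True = tt, False = ff


def nxt_false(stream: Stream) -> int:
    """NxtFalse: the least s such that stream(s) is ff.

    The loop terminates only because streams in Dstr contain
    infinitely many ff elements.
    """
    for s in count():
        if not stream(s):
            return s
    raise AssertionError("unreachable: count() never ends")


def dur_ctr2(stream: Stream) -> int:
    """Duration in the ctr = 2 case: NxtFalse(str) + 2."""
    return nxt_false(stream) + 2


def from_prefix(prefix: Sequence[bool]) -> Stream:
    """Hypothetical helper: follow `prefix`, then ff forever."""
    return lambda s: prefix[s] if s < len(prefix) else False


# Pipeline being filled: str(0) = ff, even though str(1) = tt.
filling = dur_ctr2(from_prefix([False, True]))
# Stall in progress: three tt elements, then ff at s = 3.
stalled = dur_ctr2(from_prefix([True, True, True, False]))
```

In both runs the result is two more than the length of the block of contiguous true elements starting at time zero: the filling case gives 2 and the stalled case gives 5.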
Following our philosophy of time (Section 2.2) we can simplify the duration function (and the subsequent verification) by ignoring clock cycles in which $P_3$ does not change state. In the case of $P_3$ this point is reached when $ctr = 2$, and remains the case until a false element arrives on $str$. In effect, we combine each block of contiguous clock cycles $s \in S$ where $str(s) = tt$ into a single cycle of some new "semi-abstract" clock $S'$. We could do this by simply modifying our definition of stream set $Dstr$ to disallow contiguous sequences of more than one $tt$ element:
$$\{ str \in [S \to \mathbb{B}] \mid \forall s \in S,\ \exists s' \in S \text{ such that } s' > s \text{ and } str(s') = ff, \text{ and } \forall s' \in S, \text{ if } str(s') = tt \text{ then } str(s'-1) = str(s'+1) = ff \}.$$
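The condition in the set definition above (every tt element has ff neighbours) can be illustrated on finite stream prefixes. The following is a sketch under stated assumptions: prefixes are Python lists with `True` for tt and `False` for ff, the function names are hypothetical, and neighbours that fall outside the prefix are simply ignored.

```python
from itertools import groupby
from typing import List


def collapse_stalls(prefix: List[bool]) -> List[bool]:
    """Replace every run of consecutive tt elements by a single tt."""
    out: List[bool] = []
    for value, run in groupby(prefix):
        out.extend([True] if value else list(run))
    return out


def stalls_are_isolated(prefix: List[bool]) -> bool:
    """Check that no two adjacent elements of the prefix are both tt."""
    return not any(a and b for a, b in zip(prefix, prefix[1:]))


concrete = [False, True, True, True, False, False, True, False]
abstract = collapse_stalls(concrete)
```

Here `abstract` is `[False, True, False, False, True, False]`: the three-cycle stall has become a single cycle, while the cycles in which the pipeline changes state are untouched.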
However, observe that there is a retiming from clock $S$ to new clock $S'$. Therefore, it is more appropriate to define a stream abstraction map
in terms of a retiming $\lambda \in Ret(Dstr, S, S')$ (in this case the stream element abstraction map $\tau$ will be the identity function). If we are to apply the one-step theorems, it is necessary for $\lambda$ to be uniform. The simplest way to guarantee that $\lambda$ is uniform is to define its immersion in terms of a duration function $dur$:
where $dur$ is defined as follows.
Note that the definition of $Dstr$ ensures that $dur$ is a total function.
The definition of $dur$ for $P_3$ now becomes:
$$
dur(ctr, src, msr, w_1, w_2, w_3, mdr, dst, str) =
\begin{cases}
4, & \text{if } ctr > 2;\\
2, & \text{if } ctr = 2 \text{ and } str(1) = ff;\\
3, & \text{if } ctr = 2 \text{ and } str(1) = tt.
\end{cases}
\qquad (2)
$$
Theorem 29
The map $P_3$ is a correct implementation of $TR$ with respect to data abstraction map $\psi : State_{P_2} \to State_{TR}$ (Theorem 22) and uniform retiming $\lambda \in URet_{P_2}(State_{P_2}, S, T)$, where duration function $dur : State_{P_2} \to S^+$ is defined as in equation (2).
PROOF. The proof of theorem 29 is too large to include in the Appendix. However, this theorem has been verified mechanically using Maude.
$P_3$ is a simple example: the stalling element is near the end of the pipeline, so the point at which the pipeline stops changing state is quickly reached, and the pipeline also restarts quickly. In more complex pipelines, neither need be true. Furthermore, in the case of $P_3$ it is possible to constructively define $dur$ because once a dynamic stall occurs, the pipeline's behaviour is always the same: it continues writing to $dst$ for one cycle; its state then remains unchanged until the dynamic stall finishes; and then after one cycle it resumes writing to $dst$. For more complex pipelines such a constructive definition may not be available, and it may be necessary to compare successive pipeline states to identify the point at which no further changes occur.
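Comparing successive pipeline states, as suggested above, amounts to searching for a fixed point of the next-state function while the stall persists. The following generic sketch assumes only that states can be compared for equality; the name `cycles_until_static`, the `limit` safeguard and the toy next-state function are illustrative assumptions, not part of the model.

```python
from typing import Callable, Optional, Tuple, TypeVar

State = TypeVar("State")


def cycles_until_static(step: Callable[[State], State],
                        state: State,
                        limit: int = 1000) -> Tuple[int, State]:
    """Apply `step` until two successive states are equal.

    Returns the number of cycles in which the state changed, together
    with the static state. `limit` guards against non-termination.
    """
    for n in range(limit):
        nxt = step(state)
        if nxt == state:
            return n, state
        state = nxt
    raise RuntimeError("no static state reached within the limit")


# Toy stalled next-state function: the state is (w3, results stored).
# While stalled, a pending w3 is written once; after that nothing changes.
def toy_stalled_step(state: Tuple[Optional[int], int]) -> Tuple[Optional[int], int]:
    w3, stored = state
    return (None, stored + 1) if w3 is not None else state


changed, static_state = cycles_until_static(toy_stalled_step, (42, 0))
```

For the toy function the state changes for exactly one cycle, mirroring the behaviour of $P_3$ described above, where the pipeline continues writing to $dst$ for one cycle and then remains unchanged.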
Further Considerations
We have shown how the algebraic techniques developed by Harman and Tucker [26, 27] and extended by Fox and Harman [17, 14, 18] can be systematically applied to pipelined systems, by means of three simple examples. These methods have been applied to substantially more complex examples: in [14], a pipelined implementation of a microprocessor is defined, and formally verified; and in [14, 18], a superscalar implementation of the same processor is defined. One of us (ACJF) has also worked extensively on the ARM instruction set architecture. In [52] the ARM instruction set architecture is specified in HOL. He has also used HOL to verify the ARM 6 implementation with respect to the ARM instruction set architecture. Finally, we have undertaken initial analysis of pipelines that can dynamically stall.
We have developed a systematic framework for defining data and temporal abstraction maps, and maintain a separation between the two. We have shown how formal verification can be simplified given easily-established conditions. It is possible to define temporal abstraction in a logical and clean manner because all our implementations and specifications are deterministic. While this may in principle seem to be a disadvantage in modelling certain kinds of behaviour (dynamic stalling, transactional buses, etc.), we have not so far found this to be the case.
Most formal verifications to date have been performed manually. Our main aim is a systematic formal framework for modelling and representing microprocessors and related systems, and not [mechanised] formal verification. However, we have recently started to undertake machine verifications, largely because of the complexity of examples we wish to consider. By establishing a general theoretical framework, we have ensured our techniques are suited to a range of pre-existing software tools. One of us (NAH) has chosen to use Maude [8] to undertake proofs. Maude uses the same underlying algebraic model we have chosen for our theoretical model. In addition, it is fast, and has meta-level tools, permitting tailored proof strategies to be quickly and easily constructed. However, it is not a theorem prover: hence ACJF has chosen to use HOL to undertake proofs [15] . To date, Maude and HOL have been used to repeat the verifications in this paper (see appendix) and the pipelined processor from [14] as well as ARM 6 in HOL. Work on verifying a superscalar implementation of the same processor [14, 18] is underway.
$$
\begin{aligned}
P_1(0, src, msr, w_1, w_2, w_3, mdr, dst) &= p^0_1(P_1(0, src, msr, w_1, w_2, w_3, mdr, dst)),\\
P_1(1, src, msr, w_1, w_2, w_3, mdr, dst) &= p^0_1(P_1(1, src, msr, w_1, w_2, w_3, mdr, dst)),\\
TR(0, \psi(src, msr, w_1, w_2, w_3, mdr, dst)) &= \psi(P_1(0, src, msr, w_1, w_2, w_3, mdr, dst)),\\
TR(1, \psi(src, msr, w_1, w_2, w_3, mdr, dst)) &= \psi(P_1(1, src, msr, w_1, w_2, w_3, mdr, dst)).
\end{aligned}
$$
At time $t = 0$:
$$
\begin{aligned}
P_1(0, src, msr, w_1, w_2, w_3, mdr, dst)
&= p^0_1(src, msr, w_1, w_2, w_3, mdr, dst)\\
&= (src, msr, f_1(src(msr-1)), f_{12}(src(msr-2)), f_{123}(src(msr-3)), mdr, dst)\\
&= p^0_1(src, msr, f_1(src(msr-1)), f_{12}(src(msr-2)), f_{123}(src(msr-3)), mdr, dst)\\
&= p^0_1(P_1(0, src, msr, w_1, w_2, w_3, mdr, dst)).
\end{aligned}
$$
At time $t = 1$:
$$
\begin{aligned}
P_1(1, src, msr, w_1, w_2, w_3, mdr, dst)
&= p_1(P_1(0, src, msr, w_1, w_2, w_3, mdr, dst))\\
&= p_1(src, msr, f_1(src(msr-1)), f_{12}(src(msr-2)), f_{123}(src(msr-3)), mdr, dst)\\
&= (src, msr+1, f_1(src(msr)), f_{12}(src(msr-1)), f_{123}(src(msr-2)), mdr+1, dst[f(src(msr-3))/mdr])\\
&\qquad \text{because } f = f_4 \circ f_3 \circ f_2 \circ f_1\\
&= p^0_1(src, msr+1, f_1(src(msr)), f_{12}(src(msr-1)), f_{123}(src(msr-2)), mdr+1, dst[f(src(msr-3))/mdr])\\
&= p^0_1(P_1(1, src, msr, w_1, w_2, w_3, mdr, dst)).
\end{aligned}
$$
At time $t = 0$:
$$
\begin{aligned}
TR(0, \psi(src, msr, w_1, w_2, w_3, mdr, dst))
&= TR(0, src, dst, msr-3, mdr)\\
&= (src, dst, msr-3, mdr)\\
&= \psi(src, msr, f_1(src(msr-1)), f_{12}(src(msr-2)), f_{123}(src(msr-3)), mdr, dst)\\
&= \psi(P_1(0, src, msr, w_1, w_2, w_3, mdr, dst)).
\end{aligned}
$$
Finally, at time $t = 1$:
$$
\begin{aligned}
TR(1, \psi(src, msr, w_1, w_2, w_3, mdr, dst))
&= tr(TR(0, src, dst, msr-3, mdr))\\
&= tr(src, dst, msr-3, mdr)\\
&= (src, dst[f(src(msr-3))/mdr], msr-2, mdr+1)\\
&= \psi(src, msr+1, f_1(src(msr)), f_{12}(src(msr-1)), f_{123}(src(msr-2)), mdr+1, dst[f(src(msr-3))/mdr])\\
&= \psi(P_1(1, src, msr, w_1, w_2, w_3, mdr, dst)).
\end{aligned}
$$
□
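The four equations verified above can also be checked on concrete data. The sketch below is a sanity check under stated assumptions, not a substitute for the proof: the stage functions `f1` to `f4` and the source stream `src` are arbitrary choices made for illustration, the memory `dst` is represented as a dictionary, and the next-state function `p1` is written in the form implied by the $t = 1$ calculation above (each stage applies its function to the output of the previous stage).

```python
# Arbitrary stage functions (assumptions); f is their composition.
def f1(x): return x + 1
def f2(x): return 2 * x
def f3(x): return x - 3
def f4(x): return x * x
def f12(x): return f2(f1(x))
def f123(x): return f3(f12(x))
def f(x): return f4(f123(x))

def src(i): return 10 * i  # arbitrary source stream (assumption)

def p0_1(state):
    """Initialisation: make w1, w2, w3 consistent with src and msr."""
    s, msr, _w1, _w2, _w3, mdr, dst = state
    return (s, msr, f1(s(msr - 1)), f12(s(msr - 2)), f123(s(msr - 3)), mdr, dst)

def p1(state):
    """Assumed next-state function: each stage advances by one step."""
    s, msr, w1, w2, w3, mdr, dst = state
    return (s, msr + 1, f1(s(msr)), f2(w1), f3(w2), mdr + 1, {**dst, mdr: f4(w3)})

def P1(t, state):
    """Iterated map: initialise, then apply p1 t times."""
    state = p0_1(state)
    for _ in range(t):
        state = p1(state)
    return state

def psi(state):
    """Data abstraction map onto specification states."""
    s, msr, _w1, _w2, _w3, mdr, dst = state
    return (s, dst, msr - 3, mdr)

def tr(spec):
    """Specification next-state function."""
    s, dst, msr, mdr = spec
    return (s, {**dst, mdr: f(s(msr))}, msr + 1, mdr + 1)

def TR(t, spec):
    for _ in range(t):
        spec = tr(spec)
    return spec

# An arbitrary (inconsistent) starting state; msr >= 3 keeps indices valid.
sigma = (src, 5, 99, 98, 97, 0, {})
checks = [
    P1(0, sigma) == p0_1(P1(0, sigma)),
    P1(1, sigma) == p0_1(P1(1, sigma)),
    TR(0, psi(sigma)) == psi(P1(0, sigma)),
    TR(1, psi(sigma)) == psi(P1(1, sigma)),
]
```

All four entries of `checks` hold for this choice of data. A run of this kind can expose a mistake in a definition, but only the equational argument establishes the result for all states.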
Proof: Theorem 22
The map $P_2$ is a correct implementation of $TR$ if the following four equations hold, with $\sigma = (ctr, src, msr, w_1, w_2, w_3, mdr, dst) \in State_{P_2}$:
$$
\begin{aligned}
P_2(0, \sigma) &= p^0_2(P_2(0, \sigma)),\\
P_2(\lambda(\sigma)(1), \sigma) &= p^0_2(P_2(\lambda(\sigma)(1), \sigma)),\\
TR(0, \psi(\sigma)) &= \psi(P_2(0, \sigma)),\\
TR(1, \psi(\sigma)) &= \psi(P_2(\lambda(\sigma)(1), \sigma)).
\end{aligned}
$$
Each of these four conditions is split into two sub-cases: ctr = 1 and ctr = 2, 3, 4. This gives the eight cases below.
Case 1: $t = 0$ and $ctr = 1$
$$
\begin{aligned}
P_2(0, 1, src, msr, w_1, w_2, w_3, mdr, dst)
&= p^0_2(1, src, msr, w_1, w_2, w_3, mdr, dst)\\
&= (1, src, msr, f_1(src(msr-1)), f_{12}(src(msr-2)), f_{123}(src(msr-3)), mdr, dst)\\
&= p^0_2(1, src, msr, f_1(src(msr-1)), f_{12}(src(msr-2)), f_{123}(src(msr-3)), mdr, dst)\\
&= p^0_2(P_2(0, 1, src, msr, w_1, w_2, w_3, mdr, dst))
\end{aligned}
$$
Case 2: $t = 0$ and $ctr = 2, 3, 4$
$$
\begin{aligned}
P_2(0, ctr, src, msr, w_1, w_2, w_3, mdr, dst)
&= p^0_2(ctr, src, msr, w_1, w_2, w_3, mdr, dst)\\
&= (4, src, msr, w_1, w_2, w_3, mdr, dst)\\
&= p^0_2(4, src, msr, w_1, w_2, w_3, mdr, dst)\\
&= p^0_2(P_2(0, ctr, src, msr, w_1, w_2, w_3, mdr, dst))
\end{aligned}
$$
