Abstract
Introduction
Processors have evolved to deliver higher performance through design techniques such as pipelining and parallelism. The complexities resulting from such techniques manifest as interactions between operations and contention for resources. Simulation has been used in industry to validate such designs. To claim that simulation exercises all possible behaviors of a design, we would have to simulate the design exhaustively. Exhaustive simulation is prohibitively expensive in time and space. Hence, industry relies on simulating a limited number of patterns which exercise a small fraction of the circuit. Such a validation procedure can leave bugs undetected. This weakness motivates formal verification. Formal verification uses a set of languages, tools and techniques to mathematically reason about the hardware system. Our verification methodology, based on Symbolic Trajectory Evaluation, is able to verify an RTL, gate or switch-level implementation of a processor against its instruction set architecture. This paper focuses on the application of this methodology to verify the ARM core. This processor has many complicated features such as forwarding logic, an instruction pipeline, pipeline interlocks, multiple-cycle instructions and conditionally executed instructions. A practical consideration in choosing this processor was that the ARM designers were available at Carnegie Mellon University to provide descriptions of the design. A high-level overview of our methodology and some of the related work is presented in Section 2. Section 3 discusses the architecture and implementation details of the ARM. Section 4 describes the steps required by our methodology to verify the ARM. The abstract assertion and the implementation mapping for a representative immediate bitwise-OR instruction are detailed in this section. The bugs discovered are presented in Section 5.
Verification Methodology
Our verification methodology can be used to show that an implementation correctly fulfills an abstract specification of the desired system behavior. The abstract specification defines the instruction set architecture of the processor. The specification is a set of abstract assertions defining the effect of each instruction on the user-visible state elements. The verification process has to bridge a wide gap between the detailed processor implementation and the abstract specification. To bridge this gap, the verification process requires some additional mapping information. The implementation mapping relates the abstract specification to the complex temporal and spatial behavior of the pipelined implementation. In effect, the mapping exposes the micro-architecture of the processor. The implementation mapping is a nondeterministic mapping defined in terms of state diagrams. As an example, an instruction might stall in a pipeline stage waiting to obtain the necessary resources. The order and timing in which these resources are granted vary, leading to nondeterministic behavior. Our methodology will verify the implementation under all possible orders and timing. The abstract specification and the implementation mapping are used to generate the trajectory specification. The trajectory specification consists of a set of trajectory assertions. Each abstract assertion gets mapped into a trajectory assertion. A modified form of symbolic simulation called Symbolic Trajectory Evaluation (STE) [1] is used to verify the set of trajectory assertions on the implementation. The reader is referred to [2] [3] for a more detailed description of our verification methodology.
Related Work
Beatty [4] laid down the foundation of our methodology for formal verification of processors. He used the methodology to verify the Hector microprocessor. A switch-level implementation of the processor was verified against its instruction set architecture. Hector was a simple non-pipelined processor whereas the ARM has a 5-stage pipeline.
Nelson [5] used our methodology to verify parts of a PowerPC implementation called the Cobra-Lite processor. This was a post-facto verification that was done after the Cobra-Lite processor had been designed and fabricated. Cobra-Lite is implemented as a set of interconnected functional units. Since the Cobra-Lite verification was carried out by verifying individual functional units, complete processor verification would involve modeling the protocols on the interface signals between the unit under verification (UUV) and the units that interact with it, and then arguing that, since the individual UUVs work correctly, the overall system works correctly. In contrast, the entire ARM core was verified as a single unit without decomposing the verification task into smaller subtasks. The ARM verification was done concurrently with the design implementation, leading to considerable overlap of the two phases. Another differentiating factor is that we used the Efficient Memory Modeling (EMM) [11] technique to reduce the system memory demands of the verification task.
Srivas [6] used PVS to verify the AAMP5 microprocessor. The AAMP5 has a large complex instruction set, multiple data types and addressing modes and a microcoded, pipelined implementation. Srivas decomposed the verification problem into three subproblems:
1. A part that reasons about stalling behavior.
2. A part that reasons about individual instructions in the absence of stalling.
3. A part that combines the first two parts.
Sawada [7] used the ACL2 theorem prover to verify an out-of-order pipelined processor. The processor includes out-of-order execution and speculative instruction fetch. Sawada used a table to store an execution trace of instructions representing states in the implementation. The table representation makes it easy to define various pipeline properties such as the absence of WAW hazards. In a sense, the table captures the past history of the processor. In Symbolic Trajectory Evaluation, the initial state of the circuit is set to the most general state that reflects all possible past histories for the processor.
ARM Processor Architecture
The ARM CPU core is a 32-bit RISC processor macrocell upon which the current generation of ARM processors is based. It has 32-bit data and address buses. It has a single 32-bit external data interface through which both instructions and data pass during execution. It includes 15 general purpose registers. A 5-stage pipeline is employed to speed the execution of instructions. Because branches cause the sequential flow of instructions to be interrupted, it is usual to employ the ARM's conditional execution facility when possible. The ability of every instruction to be conditionally executed increases the chance that the program address references will run sequentially thereby allowing the memory subsystem to make predictions about the next address required. Non-sequential addresses are held for two cycles. The implementation we used was a hybrid between the ARM7 [8] and StrongARM [9] cores. The memory interface was derived from the ARM7. The pipeline structure was derived from the StrongARM.
CPU Core Functional Blocks
This section briefly explains some of the major functional blocks in the ARM such as the Register File, the Barrel Shifter, the ALU, the Booth's multiplier and the Control Logic.
Register File and other Registers: The ARM CPU core has a total of 16 registers, 15 of which are general purpose 32-bit registers. The implementation of the register file has two read ports and one write port. Register R15 is the Program Counter. Because the PC is accessible to programmers, it can be included in standard instructions and used as a base for load and store instructions. This permits the easy generation of position independent code. A further register, the Current Program Status Register (CPSR), is also accessible to programmers. This register stores the condition code flags. The condition code flags are Negative/Less than (N), Zero (Z), Carry/Borrow/Extend (C) and Overflow (V). These flags may be changed as a result of arithmetic, logical and comparison operations in the CPU and may be tested by all instructions to determine whether execution is to take place. Some of the other 28 bits of the 32-bit CPSR can be used for storing the CPU mode bits in future implementations of the ARM [10]. The current implementation of the ARM operates in the User Mode only.
The Barrel Shifter: The 32-bit Barrel Shifter shifts or rotates its input by any amount to produce an output within a fixed period. It has associated logic to allow values to be arithmetically shifted (preserving the sign bit) or rotated through the carry bit (to give a 33-bit shift register).
The ALU: The ALU performs all arithmetic, logical and comparison operations on two input operands. There is a carry look-ahead within each 4-bit ALU block. A second-level carry look-ahead option provides 16-bit carry look-ahead capability, increasing the speed of the ALU at the expense of area.
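As an illustration of the carry look-ahead idea mentioned for the ALU, the following Python sketch computes every carry of a 4-bit block directly from generate and propagate signals instead of rippling them. It is a textbook formulation under our own naming (`cla4`, `g`, `p`), not a description of the actual ARM ALU circuit.

```python
def cla4(a, b, cin):
    """4-bit carry look-ahead addition (illustrative, not the ARM netlist).

    Returns (sum, carry_out) for 4-bit operands a, b and carry-in cin.
    """
    # Per-bit generate (both inputs 1) and propagate (exactly one input 1).
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]

    carries = [cin]
    for i in range(4):
        # Expanded look-ahead equation:
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]...p[0]cin
        carry = g[i]
        term = p[i]
        for j in range(i - 1, -1, -1):
            carry |= term & g[j]
            term &= p[j]
        carry |= term & cin
        carries.append(carry)

    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[4]
```

Each carry depends only on the block inputs and `cin`, which is what allows the hardware to evaluate all four carries in parallel.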
The Multiplier: We did not verify the multiplier due to the exponential memory complexity associated with representing multipliers using BDDs.
Pipelining
The ARM uses a 5-stage instruction pipeline: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (M) and Write-back (WB).
The ARM completes an instruction every clock cycle under most circumstances. The instruction set allows instructions to execute conditionally. The pipeline has features like data forwarding and stalling to achieve maximal concurrency in instruction execution. Data forwarding can occur from EX ⇒ ID and M ⇒ ID. Since the PC can be the target of a particular instruction, three other forwarding paths exist, i.e. EX ⇒ PC, M ⇒ PC and ID ⇒ PC. The ID ⇒ PC forwarding path is activated when an instruction like MOVAL PC ← R10, LSL 0 is executed (the AL mnemonic encodes that the instruction is always executed, and LSL 0 implies that R10 is moved as is; consequently the EX stage is not necessary for shifting). An instruction stalls in a particular stage when an earlier instruction in the pipeline takes multiple cycles or when its operands are not yet available. The critical path of the processor is the ID stage since this is where many control decisions are made. Most instructions normally spend a single cycle in each stage. The cases in which the instruction spends more time in a particular stage are:
1. The instruction is waiting for the result of a previous instruction, i.e. a data dependency.
2. The instruction inherently takes more than one cycle, e.g. a multiply instruction which, depending on its arguments, may spend up to 3 cycles in the multiplier.
3. The instruction is doing a memory access and this takes more than 1 cycle.
If, as a result of an instruction spending more than one cycle in a pipeline stage, the next pipeline stage becomes empty, then the processor will place a null instruction in this next pipeline stage. Once a null instruction is in the pipeline, it will spend one cycle in each remaining pipeline stage unless the pipeline is stalled. If an instruction completes its work in the current stage before the next pipeline stage is available, it will stall in the current pipeline stage. This will normally stall all previous pipeline stages. If, however, a previous pipeline stage is executing a multi-cycle operation, then that stage will not stall until the multi-cycle operation completes. For some instructions, like long multiplies, the processor fetches and decodes a single instruction but the ID stage passes multiple instructions into the EX, M and WB stages. Also, in the case of a load instruction, since the off-chip memory requires that the address be held for 2 cycles, the ID stage splits the load instruction into two. Splitting the load instruction serves another purpose: base write-back (if the write-back bit is asserted) is done when the first copy of the instruction is in the WB stage and the actual data is loaded when the second copy of the instruction is in the M stage. Incidentally, this obviates the need to have two write ports in the register file despite the fact that a load instruction could potentially update two registers.
ARM Processor Verification
This section applies our methodology to verify the ARM processor.
Abstract Specification
The first step is to define the instruction set of the ARM as a set of abstract assertions in a Hardware Specification Language. The exact syntax and associated formal semantics of this language are described in [2]. For the purposes of this paper, an abstract assertion is of the form: P LEADSTO Q, where P serves as the precondition and Q as the postcondition. P and Q are conjunctions of clauses where each clause is an assignment to an abstract state element. As an example, the abstract assertion for the immediate bitwise-OR instruction is as follows:
(op IS OR) and (RA IS ra) and (RT IS rt)
and (Imm IS imm) and (Reg[ra] IS dataA)
LEADSTO (Reg[rt] IS dataA | imm)
The first two lines constitute the precondition of the abstract assertion. The clause (op IS OR) specifies that the opcode must be that for the immediate bitwise-OR instruction. The clauses (RA IS ra) and (RT IS rt) specify that the source and destination register identifiers are some symbolic values ra and rt respectively. The clause (Imm IS imm) says that the immediate source is the symbolic value imm. The clause (Reg[ra] IS dataA) specifies that the content of Reg[ra] is the symbolic value dataA. The last line specifies that in the postcondition, the content of the target register will contain the bitwise-OR of the immediate data and the register data.
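To make the reading of this assertion concrete, the following Python sketch checks the same precondition and postcondition against a tiny ISA-level reference model. It uses concrete rather than symbolic values, and the dictionary layout, `reference_step` and the other function names are illustrative assumptions, not part of the Hardware Specification Language or of our tools.

```python
MASK32 = 0xFFFFFFFF


def ori_precondition(state, ra, rt, imm, data_a):
    """P: the clauses of the precondition, one comparison per clause."""
    return (state["op"] == "OR" and state["RA"] == ra and state["RT"] == rt
            and state["Imm"] == imm and state["Reg"][ra] == data_a)


def ori_postcondition(state, rt, imm, data_a):
    """Q: the target register holds the bitwise-OR of dataA and imm."""
    return state["Reg"][rt] == (data_a | imm) & MASK32


def reference_step(state):
    """Hypothetical ISA-level step for the immediate bitwise-OR only."""
    nxt = dict(state, Reg=list(state["Reg"]))
    if state["op"] == "OR":
        src = state["Reg"][state["RA"]]
        nxt["Reg"][state["RT"]] = (src | state["Imm"]) & MASK32
    return nxt


def check_assertion(ra, rt, imm, data_a):
    """Build a state satisfying P, take one step, and evaluate Q."""
    regs = [0] * 16
    regs[ra] = data_a
    state = {"op": "OR", "RA": ra, "RT": rt, "Imm": imm, "Reg": regs}
    assert ori_precondition(state, ra, rt, imm, data_a)
    return ori_postcondition(reference_step(state), rt, imm, data_a)
```

For instance, `check_assertion(1, 2, 0x0F, 0xF0)` evaluates the assertion for one concrete choice of ra, rt, imm and dataA; symbolic evaluation covers all such choices at once.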
Implementation Mapping
The next step is to define the implementation mapping. The implementation mapping has to relate the high-level information flow to a transfer of logic values on actual signals in the circuit. Our intention is to verify the instruction under verification (IUV) under every possible sequence of leading instructions. One possible way to represent all leading instruction streams is to issue two completely symbolic instructions before fetching the IUV into the IF stage of the ARM pipeline since, potentially, the longest span of forwarding is two instructions away from the IUV. The problem with this approach is that the symbolic computation required for the leading instructions is prohibitively large. Hence, we capture all possible leading instruction streams by exposing and asserting some of the internal state elements in the ARM pipeline. We save symbolic computation because the IUV interacts with only a limited number of internal state elements, but we still capture every possible sequence of leading instructions.
The implementation mapping consists of a main machine and a set of map machines. The main machine defines the flow of control of a generic instruction. The map machines define a mapping for each abstract state element in the abstract specification. The main machine and the map machines are modeled as control graphs. Control graphs are state diagrams with the capability of synchronization at specific time points. A control graph has two sets of vertices: state vertices, which represent some non-zero duration of time, and event vertices, which represent instantaneous time points. A control graph has a source, an event vertex with no incoming edges, and a sink, an event vertex with no outgoing edges. Nondeterminism is modeled as multiple outgoing edges from a vertex.
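The control graph structure described above can be sketched as a small Python data structure. The class, its method names and the example graph (a stall modeled as a self-loop on a decode state) are illustrative assumptions; they do not reproduce the main machine of the ARM or the representation used by our tools.

```python
from dataclasses import dataclass, field


@dataclass
class ControlGraph:
    state_vertices: set = field(default_factory=set)   # non-zero duration
    event_vertices: set = field(default_factory=set)   # instantaneous
    edges: dict = field(default_factory=dict)          # vertex -> successors

    def add_edge(self, src, dst):
        self.edges.setdefault(src, []).append(dst)

    def sources(self):
        """Event vertices with no incoming edges."""
        targets = {d for dsts in self.edges.values() for d in dsts}
        return {v for v in self.event_vertices if v not in targets}

    def sinks(self):
        """Event vertices with no outgoing edges."""
        return {v for v in self.event_vertices if not self.edges.get(v)}

    def is_nondeterministic(self, vertex):
        """Nondeterminism: more than one outgoing edge from a vertex."""
        return len(self.edges.get(vertex, [])) > 1

    def runs(self, max_steps):
        """Enumerate source-to-sink paths using at most max_steps edges."""
        (src,) = self.sources()
        (snk,) = self.sinks()
        found, stack = [], [[src]]
        while stack:
            path = stack.pop()
            if path[-1] == snk:
                found.append(path)
            elif len(path) - 1 < max_steps:
                for nxt in self.edges.get(path[-1], []):
                    stack.append(path + [nxt])
        return found


# Hypothetical fragment: an instruction may stall in ID for any number of cycles.
g = ControlGraph(state_vertices={"IF", "ID", "EX"},
                 event_vertices={"start", "done"})
for src, dst in [("start", "IF"), ("IF", "ID"), ("ID", "ID"),
                 ("ID", "EX"), ("EX", "done")]:
    g.add_edge(src, dst)
```

Here the two outgoing edges of `ID` stand for the nondeterministic choice between stalling and advancing, and `g.runs(6)` lists the behaviors with zero, one or two stall cycles.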
Main Machine
The main machine for the ARM is shown in Figure 1. Eventually, the previous instruction will move into the EX stage and the previous-to-previous instruction will move into the M stage and forward the data to the IUV in the ID stage. The map machine for the abstract clause (Reg[ra] IS dataA) is shown in Figure 3. The control graph is aligned with the ID stage of the main machine. The node assignments in the upper half of the state vertex assert signals in the implementation based on the following criteria: 1. If the address ra is the same as the target of the previous instruction, prev_rt, then the hold register in the EX stage is asserted to the value dataA. 2. Else if the address ra is the same as the target of the previous-to-previous instruction, prevprev_rt, then the hold register in the M stage is asserted to the value dataA. The derived map machine is shown in Figure 3. The map machine is shifted by the nextmarker so that it is aligned with the EX stage of the main machine. The lower half of the state vertex defines the desired response from the processor. The desired response is that the hold register in the EX stage should be assigned the bitwise-OR of dataA and the immediate operand. This section has presented a somewhat simplified view of forwarding in the ARM. A few more internal states had to be exposed to set up all the necessary conditions for data forwarding.
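The selection criteria for the clause (Reg[ra] IS dataA) can be summarized by the following Python sketch. The node names `EX_hold`, `M_hold` and `register_file` are placeholders for signals in the implementation, and the final fall-through to the register file is our assumption for the case in which neither criterion applies; the simplified description above does not spell it out.

```python
MASK32 = 0xFFFFFFFF


def operand_source(ra, prev_rt, prevprev_rt):
    """Pick the node on which dataA is asserted for source register ra."""
    if ra == prev_rt:
        # Criterion 1: the previous instruction targets ra.
        return "EX_hold"
    if ra == prevprev_rt:
        # Criterion 2: the previous-to-previous instruction targets ra.
        return "M_hold"
    # Assumed default: no forwarding, read the register file.
    return "register_file"


def stimulus(ra, prev_rt, prevprev_rt, data_a):
    """Upper half of the state vertex: the node assignment to assert."""
    return {operand_source(ra, prev_rt, prevprev_rt): data_a}


def expected_response(data_a, imm):
    """Lower half of the state vertex: EX hold register gets dataA OR imm."""
    return {"EX_hold": (data_a | imm) & MASK32}
```

The ordering of the two tests matters: when both leading instructions target ra, the value from the more recent one must be used.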
Symbolic Trajectory Evaluation
Section 4.2 gave a flavor of the map machines that were specified for the ARM processor. A total of 10 map machines were required for the processor. The abstract assertion for the immediate bitwise-OR operation and the implementation mapping were used to automatically generate the trajectory assertion. The trajectory assertion corresponds to the composition of the 10 map machines defined in the implementation mapping. Composition amounts to taking the cross-product of these aligned map machines under restrictions placed by the synchronization function. Symbolic Trajectory Evaluation was used to verify the immediate bitwise-OR trajectory assertion on a gate-level model of the ARM processor. The node assignments in the upper half of the state vertices define the stimulus for the simulator. The node assignments in the lower half of the state vertices define the desired response and state transitions. The verification process uncovered a few bugs that are discussed in detail in the next section.
Bugs Uncovered
We discovered four bugs in the ARM core. Three of them were corner cases that resulted from designer oversight. Section 5.1 gives a background that helps to bring these bugs into context.
Background
All data processing instructions in the ARM accept one or more registers as their operands and always return the result to a register, optionally setting the condition code flags according to the result. The first source operand of a data processing instruction (except for MOV and MVN) is always a register and is known in syntax definitions as Rn. Any register may be specified, including the PC (R15). The second operand (the only operand of MOV and MVN) may either be a register Rm that is optionally shifted before use, or an 8-bit immediate constant optionally rotated before use. The shifted register forms allow one of the following types of multi-bit shifts:
1. LSL: Logical Shift Left
2. LSR: Logical Shift Right
3. ASR: Arithmetic Shift Right
4. ROR: Rotate Right
In each case, the number of bits to shift by is supplied either as a constant or by another register. One further shift type is available: Rotate Right Extended (RRX), which performs a single-bit rotation of the operand through the Carry Flag. In an LSL operation, the contents of the register Rm are moved by the number of bits specified by the shift amount to more significant bit positions. The least significant bits thus revealed are filled with zeros and the most significant bits are discarded, except that the least significant discarded bit becomes the shifter carry output (which may later set the Carry Flag in the CPSR). An LSL with a shift amount of 0 is treated as a special case: the shifter carry output is simply the old value of the Carry Flag and the contents of the operand register Rm are passed through unshifted. The LSR, ASR and ROR operations behave like the LSL operation except for the shift direction and the treatment of a shift by 0. Since LSR by 0, ASR by 0 and ROR by 0 would duplicate the effect of LSL by 0, the ARM avoids this redundancy by doing the following.
LSR by 0 is reserved and is used to encode LSR by 32, ASR by 0 is reserved and is used to encode ASR by 32, and ROR by 0 is reserved and is used to encode RRX. LSR by 32 yields a result of 0 but makes the shifter carry output become bit-31 of the source register. ASR by 32 duplicates the sign bit (bit-31) of the source register throughout the result (i.e. in 2's complement the result is -1 or 0) and the shifter carry output also takes the value of bit-31. ROR by 0 encodes the special case which performs RRX: the contents of the 33-bit shift register formed by concatenating the Carry Flag and Rm are rotated by a single bit to less significant bit positions and the new shifter carry output becomes the original bit-0.
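The shift encodings described in this section can be collected in a small reference function. The Python sketch below models a constant shift amount in the range 0 to 31; the function name and the string tags for the shift types are our own, and the carry outputs for non-zero LSR, ASR and ROR amounts follow the "last bit shifted out" rule by analogy with LSL, which is an assumption rather than something stated explicitly above.

```python
MASK32 = 0xFFFFFFFF


def shift_operand(kind, amount, rm, carry_in):
    """Return (result, shifter_carry_out) for a constant shift of Rm.

    kind is "LSL", "LSR", "ASR" or "ROR"; amount is the encoded field, 0..31.
    """
    rm &= MASK32
    bit31 = rm >> 31
    if kind == "LSL":
        if amount == 0:                      # pass through, carry unchanged
            return rm, carry_in
        return (rm << amount) & MASK32, (rm >> (32 - amount)) & 1
    if kind == "LSR":
        if amount == 0:                      # encodes LSR by 32
            return 0, bit31
        return rm >> amount, (rm >> (amount - 1)) & 1
    if kind == "ASR":
        if amount == 0:                      # encodes ASR by 32
            return (MASK32 if bit31 else 0), bit31
        signed = rm - (1 << 32) if bit31 else rm
        return (signed >> amount) & MASK32, (rm >> (amount - 1)) & 1
    if kind == "ROR":
        if amount == 0:                      # encodes RRX (rotate through carry)
            return ((carry_in & 1) << 31) | (rm >> 1), rm & 1
        result = ((rm >> amount) | (rm << (32 - amount))) & MASK32
        return result, (rm >> (amount - 1)) & 1
    raise ValueError(f"unknown shift type: {kind}")
```

The three `amount == 0` branches for LSR, ASR and ROR are exactly the reserved encodings discussed above, so a reference model of this kind makes the corner cases explicit.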
The Bugs
Three of the bugs were shift-class bugs. The specified behaviors of LSR by 0, ASR by 0 and ROR by 0 were not implemented in the ARM processor design. On talking with the designer about these bugs, it became clear that this was due to an oversight of what could be termed typical corner cases. These bugs went unnoticed when the ARM model was tested with the Dhrystone simulation benchmarks. The fourth, a datapath-class bug, related to the conditional execution feature in the ARM. In the specification, condition code 1001 in an instruction means that the instruction is executed if the C flag is cleared or the Z flag is set (unsigned lower or same). Instead, the implementation misinterpreted this condition code to mean that the instruction is executed if the C flag is cleared and the Z flag is set. We were able to detect design errors in a timely manner as the design evolved and provide valuable feedback to the designers.
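The difference between the specified and the implemented interpretation of condition code 1001 can be seen by enumerating the four combinations of the C and Z flags. The following Python sketch does this; the function names are ours and serve only to illustrate the two readings.

```python
from itertools import product


def ls_specified(c, z):
    """Condition 1001 as specified: execute if C is clear OR Z is set."""
    return (not c) or z


def ls_buggy(c, z):
    """Condition 1001 as implemented: execute if C is clear AND Z is set."""
    return (not c) and z


def mismatches():
    """Flag combinations (C, Z) on which the two readings disagree."""
    return [(c, z) for c, z in product((False, True), repeat=2)
            if ls_specified(c, z) != ls_buggy(c, z)]
```

The two readings agree on half of the flag combinations, so a test that happens to exercise only those combinations cannot expose the error, whereas a symbolic treatment of the flags covers all four.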
Conclusion
This paper has shown the applicability of our methodology for formal verification of an ARM core using Symbolic Trajectory Evaluation. The user specified the instruction set architecture as a set of abstract assertions. An implementation mapping captured the micro-architecture of the processor. The abstract specification and the implementation mapping were used to generate a set of trajectory assertions. Each trajectory assertion captures all possible nondeterministic interactions that can arise in the implementation due to an instruction. Symbolic Trajectory Evaluation was used to verify the trajectory assertions on a gate-level implementation of the ARM processor. The verification process uncovered four bugs that were reported back to the designers.
