Abstract. Theorem proving techniques are particularly well suited for reasoning about arithmetic above the bit level and for relating different levels of abstraction. In this paper we show how a non-restoring integer square root algorithm can be transformed to a very efficient hardware implementation. The top level is a Standard ML function that operates on unbounded integers. The bottom level is a structural description of the hardware consisting of an adder/subtracter, simple combinational logic and some registers. Looking at the hardware, it is not at all obvious what function the circuit implements. At the top level, we prove that the algorithm correctly implements the square root function. We then show a series of optimizing transformations that refine the top level algorithm into the hardware implementation. Each transformation can be verified, and in places the transformations are motivated by knowledge about the operands that we can guarantee through verification. By decomposing the verification effort into these transformations, we can show that the hardware design implements a square root. We have implemented the algorithm in hardware both as an Altera programmable device and in full-custom CMOS.
1 Introduction
In this paper we describe the design, implementation and verification of a subtractive, non-restoring integer square root algorithm. The top level description is a Standard ML function that implements the algorithm for unbounded integers. The bottom level is a highly optimized structural description of the hardware implementation. Due to the optimizations that have been applied, it is very difficult to directly relate the circuit to the algorithmic description and to prove that the hardware implements the function correctly. We show how the proof can be done by a series of transformations from the SML code to the optimized structural description.
At the top level, we have used the Nuprl proof development system [Lee92] to verify that the SML function correctly produces the square root of the input.
We then use Nuprl to verify that transformations to the implementation preserve the correctness of the initial algorithm.
Intermediate levels use Hardware ML [OLLA93], a hardware description language based on Standard ML. Starting from a straightforward translation of the SML function into HML, a series of transformations are applied to obtain the hardware implementation. Some of these transformations are expressly concerned with optimization and rely on knowledge of the algorithm; these transformations can be justified by proving properties of the top-level description.
The hardware implementation is highly optimized: the core of the design is a single adder/subtracter. The rest of the datapath is registers, shift registers and combinational logic. The square root of a 2n-bit-wide number requires n cycles through the datapath. We have two implementations of square root chips based on this algorithm. The first is done as a full-custom CMOS implementation; the second uses Altera EPLD technology. Both are based on a design previously published by Bannur and Varma [BV85]. Implementing and verifying the design from the paper required clearing up a number of errors in the paper and clarifying many details.
This is a good case study for theorem proving techniques. At the top level, we reason about arithmetic operations on unbounded integers, a task theorem provers are especially well suited for. Relating this to lower levels is easy to do using theorem proving based techniques. Many of the optimizations used are applicable only if very specific conditions are satisfied by the operands. Verifying that the conditions hold allows us to safely apply optimizations.
Automated techniques such as those based on BDDs and model checking are not well suited for verifying this and similar arithmetic circuits. It is difficult to come up with a Boolean statement for the correctness of the outputs as a function of the inputs and to argue that this specification correctly describes the intended behavior of the design. Similarly, specifications required for model checkers are difficult to define for arithmetic circuits.
There have been several verifications of hardware designs which lift the reasoning about hardware to the level of integers, including the Sobel Image processing chip [NS88] and the factorial function [CGM86]. Our work differs from these and similar efforts in that we justify the optimizations done in order to realize the square root design. The DDD system [BJP93] is based on the idea of design by verified transformation, and was used to derive an implementation of the FM9001 microprocessor. High level transformations in DDD are not verified by explicit use of theorem proving techniques.
The most similar research is Verkest's proof of a non-restoring division algorithm [VCH94]. This proof was also done by transforming a design description to an implementation. The top level of the division proof involves consideration of several cases, while our top level proof is done with a single loop invariant. The two implementations vary as well: the division algorithm was implemented on an ALU, and the square root on custom hardware. The algorithms and implementations are sufficiently similar that it would be interesting to develop a single verified implementation that performs both divide and square root based on the research in these two papers.
The remainder of this paper is organized as follows. In section 2 we describe the top-level non-restoring square root algorithm and its verification in the Nuprl proof development system. We then transform this algorithm down to a level suitable for modelling with a hardware description language. Section 3 presents a series of five optimizing transformations that refine the register transfer level description of the algorithm to the final hardware implementation. In section 4 we summarize the lessons learned and our plans for future research.
2 The Non-Restoring Square Root Algorithm

An integer square root calculates $y = \sqrt{x}$, where $x$ is the radicand, $y$ is the root, and both $x$ and $y$ are integers. We define the precise square root ($p$) to be the real-valued square root and the correct integer square root to be the floor of the precise root. We can write the specification for the integer square root as shown in Definition 1.
Definition 1 (Correct integer square root) $y$ is the correct integer square root of $x$ $\;\triangleq\;$ $y^2 \le x < (y + 1)^2$

We have implemented a subtractive, non-restoring integer square root algorithm. In each iteration we square the partial root ($y$), subtract the squared partial root from the radicand and revise the partial root based on the sign of the result. There are two major classes of algorithms: restoring and non-restoring [Flo63]. In restoring algorithms, we begin with a partial root of $y = 0$, and at the end of each iteration $y$ is never greater than the precise root ($p$). Within each iteration ($i$), we set the $i$th bit of $y$ and test whether $x - y^2$ is negative; if it is, then setting the $i$th bit made $y$ too big, so we reset the $i$th bit and proceed to the next iteration.
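The restoring scheme just described can be sketched in plain Standard ML. This is only an illustration under stated assumptions, not code from the design: the names pow2 and restoringSqrt are hypothetical, n is taken to be the number of bits in the root, and bits are tried from position n-1 down to 0.

```sml
(* pow2 e = 2^e for e >= 0; a helper introduced for this sketch. *)
fun pow2 (e : int) : int = if e <= 0 then 1 else 2 * pow2 (e - 1)

(* Restoring square root for an n-bit root:
   tentatively set bit i of y, and undo the change when x - y^2 goes negative. *)
fun restoringSqrt (n : int) (x : int) : int =
  let
    fun loop (i, y) =
      if i < 0 then y
      else
        let
          val trial = y + pow2 i
        in
          loop (i - 1, if x - trial * trial < 0 then y else trial)
        end
  in
    loop (n - 1, 0)
  end

val rs = map (restoringSqrt 3) [0, 15, 16, 17, 63]
(* rs = [0, 3, 4, 4, 7] *)
```

Every value returned here satisfies Definition 1, at the cost of a test and a possible undo for each bit position.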
Non-restoring algorithms modify each bit position once rather than twice. Instead of setting the $i$th bit of $y$, testing whether $x - y^2$ is positive, and then possibly resetting the bit, the non-restoring algorithms add or subtract a 1 in the $i$th bit of $y$ based on the sign of $x - y^2$ in the previous iteration. For binary arithmetic, the restoring algorithm is efficient to implement. However, most square root hardware implementations use a higher radix, non-restoring implementation. For higher radix implementations, non-restoring algorithms result in more efficient hardware implementations.
The results of the non-restoring algorithms do not satisfy our definition of correct, while restoring algorithms do satisfy our definition. The resulting value of $y$ in the non-restoring algorithms may have an error in the last bit position. For the algorithm used here, we can show that the final value of $y$ will always be either the precise root (for radicands which are perfect squares) or will be odd and be within one of the correct root. The error in non-restoring algorithms is easily corrected in a cleanup phase following the algorithm.
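One possible shape for such a cleanup phase is sketched below. The function name cleanup is hypothetical, and the sketch assumes only what is stated above, namely that the incoming y is within one of the correct root; it therefore checks both directions against Definition 1.

```sml
(* Hypothetical cleanup step: given radicand x and a root estimate y that is
   within one of the correct integer root, return the correct integer root. *)
fun cleanup (x : int, y : int) : int =
  if y * y > x then y - 1                      (* estimate is one too large *)
  else if (y + 1) * (y + 1) <= x then y + 1    (* estimate is one too small *)
  else y                                       (* already satisfies y^2 <= x < (y+1)^2 *)

val fixed = map cleanup [(17, 5), (16, 4), (15, 3), (0, 1)]
(* fixed = [4, 4, 3, 0] *)
```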
Below we show how a binary, non-restoring algorithm runs on some values for n = 3. Note that the result is either exact or odd.

fun sqrt n radicand = iterate update (n-1) (init (n, radicand))
The iterate function performs iteration by applying the function argument to the state argument, decrementing the count, and repeating until the count is zero.
fun iterate f n state =
  if n = 0 then state
  else iterate f (n - 1) (f state)
The top level (L0) is a straightforward implementation of the non-restoring square root algorithm. We represent the state of an iteration by the triple State{x,y,i}, where x is the radicand, y is the partial root, and i is the iteration number. Our initial guess is $y = 2^{n-1}$. In update, x never changes and i is decremented from n-2 to 1, since the initial values take care of the $(n-1)$st iteration. At L0, init and update are:

fun init (n,radicand) =
  State{x = radicand, y = 2 ** (n-1), i = n-2}

fun update (State{x, y, i}) =
  let
    val diffx = x - (y**2)
    val y' = if diffx = 0 then y
             else if diffx > 0 then (y + (2**i))
             else (* diffx < 0 *) (y - (2**i))
  in
    State{x = x, y = y', i = i-1}
  end
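To see the algorithm run on concrete values, the listing can be made self-contained as in the sketch below. Several details are assumptions of this sketch rather than part of the original code: the helper pow replaces the ** operator, the state datatype is declared explicitly, int is used in place of unbounded integers, and sqrt returns only the y field of the final state.

```sml
(* pow (b, e) = b^e for e >= 0; stands in for the ** operator. *)
fun pow (b : int, e : int) : int = if e <= 0 then 1 else b * pow (b, e - 1)

datatype state = State of {x : int, y : int, i : int}

fun iterate f n state = if n = 0 then state else iterate f (n - 1) (f state)

fun init (n, radicand) = State {x = radicand, y = pow (2, n - 1), i = n - 2}

fun update (State {x, y, i}) =
  let
    val diffx = x - y * y
    val y' = if diffx = 0 then y
             else if diffx > 0 then y + pow (2, i)
             else y - pow (2, i)
  in
    State {x = x, y = y', i = i - 1}
  end

(* Run n-1 updates and return the partial root held in the final state. *)
fun sqrt n radicand =
  let
    val State {y, ...} = iterate update (n - 1) (init (n, radicand))
  in
    y
  end

val roots = map (sqrt 3) [15, 16, 17, 25, 35, 63]
(* roots = [3, 4, 5, 5, 5, 7] *)
```

For n = 3 the radicands 16 and 25 yield their exact roots, while 17 yields 5, which is odd and one above the correct root 4.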
In the next section we discuss the proof that this algorithm calculates the square root. Then we show how it can be refined to an implementation that requires significantly less hardware. We show how to prove that the refined algorithm also calculates the square root; in the absence of such a proof it is not at all obvious that the algorithms have identical results.
Verification of Level Zero Algorithm
All of the theorems in this section were verified using the Nuprl proof development system. Theorem 1 is the overall correctness theorem for the non-restoring square root code shown above. It states that after iterating through update for n-1 times, the value of y is within one of the correct root of the radicand. We have proved this theorem by creating an invariant property and performing induction on the number of iterations of update. Remember that n is the number of bits in the result.

In Theorems 2 and 3 we show that init and update are correct, in that for all legal values of n and radicand, init returns a legal state, and for all legal input states, update returns a legal state and makes progress toward termination by decrementing i by 1. A legal state is one for which the loop invariant holds. The correctness of init is straightforward. The proof of Theorem 3 relies on Theorem 4, which describes the behavior of the update function. The body of update has three branches, so the proof of correctness of update has three parts, depending on whether $x - y^2$ is equal to zero, positive, or negative. Each case in Theorem 4 is straightforward to prove using ordinary arithmetic.

We now prove that iterating update a total of n-1 times will produce the correct final result. The proof is done by induction on n and makes use of Theorem 5 to describe one call to iterate. This allows us to prove that after iterating update a total of n-1 times, our invariant holds and i is zero. This is sufficient to prove that square root is within one of the correct root.
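Theorem 5 (Iterating a function)
\[
\vdash \forall \mathit{prop}, n, f, s.\;\;
\mathit{prop}\ n\ s \;\Rightarrow\;
\bigl(\forall n', s'.\; \mathit{prop}\ n'\ s' \Rightarrow \mathit{prop}\ (n' - 1)\ (f\ s')\bigr) \;\Rightarrow\;
\mathit{prop}\ 0\ (\mathit{iterate}\ f\ n\ s)
\]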
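The shape of the argument behind Theorem 5 can be sketched as an ordinary induction on $n$, following the definition of iterate. This is an informal outline assuming $n \ge 0$, not a transcription of the Nuprl proof.

```latex
\begin{align*}
\textbf{Base } (n = 0):\quad
  & \mathit{iterate}\ f\ 0\ s = s, \text{ and } \mathit{prop}\ 0\ s \text{ holds by assumption.} \\
\textbf{Step } (n > 0):\quad
  & \mathit{iterate}\ f\ n\ s = \mathit{iterate}\ f\ (n - 1)\ (f\ s); \\
  & \mathit{prop}\ n\ s \Rightarrow \mathit{prop}\ (n - 1)\ (f\ s)
    \text{ by the preservation hypothesis;} \\
  & \text{the induction hypothesis, applied to } n - 1 \text{ and } f\ s, \text{ gives }
    \mathit{prop}\ 0\ (\mathit{iterate}\ f\ (n - 1)\ (f\ s)).
\end{align*}
```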
Description of Level One Algorithm
The L0 SML code would be very expensive to directly implement in hardware.
If the state were stored in three registers, x would be stored but would never change; the variable i would need to be decremented every loop and we would need to calculate $y^2$, $x - y^2$, $2^i$, and $y \pm 2^i$ in every iteration. All of these are expensive operations to implement in hardware. By restructuring the algorithm through a series of transformations, we preserve the correctness of our design and generate an implementation that uses very little hardware.
The key operations in each iteration are to compute $x - y^2$ and then update $y$ using the new value $y' = y \pm 2^i$, where $\pm$ is $+$ if $x - y^2 \ge 0$ and $-$ if $x - y^2 < 0$. The variable x is only used in the computation of $x - y^2$. In the L1 code we introduce the variable diffx, which stores the result of computing $x - y^2$. This has the advantage that we can incrementally update diffx based on its value in the previous iteration:

\begin{align*}
y' &= y \pm 2^i \\
y'^2 &= y^2 \pm 2 \cdot y \cdot 2^i + (2^i)^2 = y^2 \pm y \cdot 2^{i+1} + 2^{2i} \\
\mathit{diffx} &= x - y^2 \\
\mathit{diffx}' &= x - y'^2 = x - (y^2 \pm y \cdot 2^{i+1} + 2^{2i}) = (x - y^2) \mp y \cdot 2^{i+1} - 2^{2i}
\end{align*}

The variable i is only used in the computations of $2^{2i}$ and $y \cdot 2^{i+1}$, so we create a variable b that stores the value $2^{2i}$ and a variable yshift that stores $y \cdot 2^{i+1}$.

The L1 versions of init and update are given below. Note that, although the optimizations are motivated by the fact that we are doing bit vector arithmetic, the algorithm is correct for unbounded integers. Also note that the most complex operations in the update loop are an addition and subtraction and only one of these two operations is executed each iteration. We have optimized away all exponentiation and any multiplication that cannot be implemented as a constant shift.
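A sketch of how the L1 functions might look follows. It is reconstructed from the incremental update of diffx and from the meanings given to b and yshift, and should not be read as the authors' listing. The names State1, pow2 and sqrt1, the use of div for halving and quartering, and the choice to read the root from yshift after the last iteration are assumptions of this sketch; n is assumed to be at least 2.

```sml
datatype state1 = State1 of {diffx : int, yshift : int, b : int}

(* pow2 e = 2^e for e >= 0; used only to build the initial state. *)
fun pow2 e = if e <= 0 then 1 else 2 * pow2 (e - 1)

fun iterate f n state = if n = 0 then state else iterate f (n - 1) (f state)

(* Mirrors the L0 init with y = 2^(n-1) and i = n-2:
   diffx = x - y^2, yshift = y * 2^(i+1), b = 2^(2i). *)
fun init1 (n, radicand) =
  let
    val y = pow2 (n - 1)
  in
    State1 {diffx = radicand - y * y,
            yshift = y * pow2 (n - 1),
            b = pow2 (2 * (n - 2))}
  end

(* One step: a single addition or subtraction on diffx and yshift,
   plus constant shifts (div 2 and div 4). *)
fun update1 (State1 {diffx, yshift, b}) =
  if diffx = 0 then
    State1 {diffx = diffx, yshift = yshift div 2, b = b div 4}
  else if diffx > 0 then
    State1 {diffx = diffx - yshift - b, yshift = yshift div 2 + b, b = b div 4}
  else
    State1 {diffx = diffx + yshift - b, yshift = yshift div 2 - b, b = b div 4}

(* After the iteration for i = 0, yshift holds the root itself. *)
fun sqrt1 n radicand =
  let
    val State1 {yshift, ...} = iterate update1 (n - 1) (init1 (n, radicand))
  in
    yshift
  end

val roots1 = map (sqrt1 3) [15, 16, 17, 25, 35, 63]
(* roots1 = [3, 4, 5, 5, 5, 7] *)
```

For these radicands the results agree with a direct evaluation of the L0 definitions for n = 3, which is the kind of agreement the mapping between levels is meant to establish in general.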
We could verify the L1 algorithm from scratch, but since it is a transformation of the L0 algorithm, we use the results from the earlier verification. We do this by defining a mapping function between the state variables in the two levels and then proving that the two levels return equal values for equal input states. The transformation is expressed as follows:
Again, the initialization theorem has an easy proof, and the update1 theorem is a case split on each of the three cases in the body of the update function, followed by ordinary arithmetic.
Description of Level Two Algorithm
To go from L1 to L2, we recognize that the operations in init1 are very similar to those in update1. By carefully choosing our initial values for diffx and y, we increase the number of iterations from n-1 to n and fold the computation of radicand - b' in init into the first iteration of update. This eliminates the need for special initialization hardware. The new initialize function is:

fun init2 (n,radicand) =
  State{diffx = radicand, yshift = 0, b = 2 ** (2*(n-1))}
The update function is unchanged from update1. The new calling function is:
fun sqrt n radicand = iterate update1 n (init2 (n, radicand))
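The folding of initialization into the first iteration can be checked on small cases with the sketch below. The bodies of update1 and init1 are reconstructions from the update equations for diffx, yshift and b rather than the authors' listings; State1, pow2 and sqrt2 are hypothetical names, and the check is restricted to positive radicands, which is an assumption of this sketch.

```sml
datatype state1 = State1 of {diffx : int, yshift : int, b : int}

fun pow2 e = if e <= 0 then 1 else 2 * pow2 (e - 1)

fun iterate f n state = if n = 0 then state else iterate f (n - 1) (f state)

(* Reconstructed L1 step (an assumption, see the lead-in). *)
fun update1 (State1 {diffx, yshift, b}) =
  if diffx = 0 then
    State1 {diffx = diffx, yshift = yshift div 2, b = b div 4}
  else if diffx > 0 then
    State1 {diffx = diffx - yshift - b, yshift = yshift div 2 + b, b = b div 4}
  else
    State1 {diffx = diffx + yshift - b, yshift = yshift div 2 - b, b = b div 4}

(* Reconstructed L1 initial state: y = 2^(n-1), i = n-2. *)
fun init1 (n, radicand) =
  State1 {diffx = radicand - pow2 (2 * (n - 1)),
          yshift = pow2 (2 * (n - 1)),
          b = pow2 (2 * (n - 2))}

(* L2 initial state, with ** replaced by pow2. *)
fun init2 (n, radicand) =
  State1 {diffx = radicand, yshift = 0, b = pow2 (2 * (n - 1))}

(* One step from init2 should land exactly on init1. *)
val foldOk =
  List.all (fn x => update1 (init2 (3, x)) = init1 (3, x)) [1, 15, 16, 17, 63]
(* foldOk = true *)

fun sqrt2 n radicand =
  let
    val State1 {yshift, ...} = iterate update1 n (init2 (n, radicand))
  in
    yshift
  end

val roots2 = map (sqrt2 3) [15, 16, 17, 63]
(* roots2 = [3, 4, 5, 7] *)
```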
Showing the equivalence between init2 and a loop that iterates n times and the L1 functions requires showing that the state in L2 has the same value after the first iteration that it did after init1. More formally, init1 = update1 ∘ init2. We prove this using the observation that, after init2, diffx is guaran-

3 Transforming Behavior to Structure with HML

The goal of this section is to produce an efficient hardware implementation of the L2 algorithm. The first subsection introduces Hardware ML, our language for specifying the behavior and structure of hardware. Taking an HML version of the L2 algorithm as our starting point, we obtain a hardware implementation through a sequence of transformation steps.
1. Translate the L2 algorithm into Hardware ML. Provision must be made to initialize and detect termination, which is not required at the algorithm level.
2. Transform the HML version of L2 to a set of register assignments using syntactic transformations.
3. Introduce an internal state register Exact to simplify the computation, and "factor out" the condition DiffX >= %0.
4. Partition into functional blocks, again using syntactic transformations.
5. Substitute lower level modules for register and combinational assignments.
Further optimizations in the implementation of the lower level modules are possible. Each step can be verified formally. Several of these must be justified by properties of the algorithm that we can establish through theorem proving.
Hardware ML
We have implemented extensions to Standard ML that can be used to describe the behavior of digital hardware at the register transfer level. Earlier work has illustrated how Hardware ML can be used to describe the structure of hardware [OLLA93, OLLA92]. HML is based on SML and supports higher-order, polymorphic functions, allowing the concise description of regular structures such as arrays and trees. SML's powerful module system aids in creating parameterized designs and component libraries.
Hardware is modelled as a set of concurrently executing behaviors communicating through objects called signals. Signals have semantics appropriate for hardware modelling: whereas a Standard ML reference variable simply contains a value, a Hardware ML signal contains a list of time-value pairs representing a waveform. The current value on signal a, written $a, is computed from its waveform and the current time.
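The idea of a signal as a waveform can be modelled in plain Standard ML as sketched below. This is not the HML implementation: the type waveform and the functions valueAt and schedule are hypothetical names, the pairs are stored as (value, time), and times are assumed to be non-negative integers.

```sml
(* A waveform as a list of (value, time) pairs. *)
type 'a waveform = ('a * int) list

(* Value at time t: the value of the latest pair whose time is <= t,
   or the supplied initial value when no such pair exists. *)
fun valueAt (initial : 'a) (w : 'a waveform) (t : int) : 'a =
  let
    fun scan ([], best, _) = best
      | scan ((v, tm) :: rest, best, bestTime) =
          if tm <= t andalso tm >= bestTime
          then scan (rest, v, tm)
          else scan (rest, best, bestTime)
  in
    scan (w, initial, ~1)
  end

(* Extend a waveform with value v taking effect at time tm. *)
fun schedule (w : 'a waveform) (v : 'a, tm : int) : 'a waveform = (v, tm) :: w

val w = schedule (schedule [] (true, 1)) (false, 3)
val samples = map (valueAt false w) [0, 1, 2, 3]
(* samples = [false, true, true, false] *)
```

Reading a signal thus depends on both its waveform and the current time, in contrast to an SML reference, which simply holds its latest contents.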
Two kinds of signal assignment operators are supported. Combinational assignment, written s == v, is intended to model the behavior of combinational logic under the assumption that gate delays are negligible. s == v causes the current value of the target signal s to become v. For example, we could model an exclusive-or gate as a behavior which assigns true to its output c whenever the current values on its inputs a and b are not equal:

HML's behavior constructor creates objects of type behavior. Its argument is a function of type unit -> unit containing HML code; in this case, a combinational assignment.

Register assignment is intended to model the behavior of sequential circuit elements. If a register assignment s <- v is executed at time t, the waveform of s is augmented with the pair (v, t+1), indicating that s is to assume the value v at the next time step. For example, we could model a delay element as a behavior containing a register assignment:

The val p = ... declaration introduces an internal signal whose initial value is false.

In the SML description of the algorithm, some elements of the control state are implicit. In particular, the initiation of the algorithm (calling sqrt) and its termination (it returning a value) are handled by the SML interpreter. Because hardware is free running, there is no built-in notion of initiation or termination of the algorithm. It is therefore necessary to make explicit provision to initialize the state registers at the beginning of the algorithm and to detect when the algorithm has terminated.
Initialization is easy: the diffx, yshift, and b registers are assigned their initial values when the init signal is high.
In computing an eight-bit root, the L2 algorithm terminates after seven iterations of its loop. An efficient way to detect termination of the hardware algorithm makes use of some knowledge of the high level algorithm. An informal analysis of the L2 algorithm reveals that b contains a single bit, shifted right two places in each cycle, and that the least significant bit of b is set during the execution of the last iteration. Consequently, the done signal is generated by testing whether the least significant bit of b is set (the expression $b sub 0 selects the lsb of b) and delaying the result of the test by one cycle. done is therefore set during the clock cycle following the final iteration. To justify this analysis, we must formally prove that the following is an invariant of the L2 algorithm:
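As an informal illustration of this analysis (not the formal invariant itself), the sketch below traces the values that b takes over successive iterations. It assumes the initial value 2^(2(n-1)) from init2, one division by 4 per iteration, and n iterations in total; bTrace and pow2 are hypothetical names.

```sml
fun pow2 e = if e <= 0 then 1 else 2 * pow2 (e - 1)

(* Values of b seen during n successive iterations, starting from 2^(2(n-1)). *)
fun bTrace n =
  let
    fun go (0, _) = []
      | go (k, b) = b :: go (k - 1, b div 4)
  in
    go (n, pow2 (2 * (n - 1)))
  end

val trace = bTrace 3
(* trace = [16, 4, 1] *)

(* The least significant bit of b is set only in the last iteration. *)
val lsbSet = map (fn b => b mod 2 = 1) trace
(* lsbSet = [false, false, true] *)
```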
Partitioning into Register Assignments
Our second transformation step is a simple one: we transform the HML version of the L2 algorithm into a set of register assignments. The goal of this transformation is to ensure that the control state of the algorithm is made completely explicit in terms of HML signals.
We make use of theorems about the semantics of HML to justify transformations of the behavior construct. First, if distributes over the sequential composition of signal assignment statements: if P and Q are sequences of signal assignments, then if e then (s <- a; P)

Repeatedly applying these two rules allows us to decompose our HML code into a set of assignments to individual registers. The register assignments after this transformation are

We begin by observing two facts about the L2 algorithm: if DiffX ever becomes zero the radicand has an exact root, and once DiffX becomes zero it remains zero for the rest of the computation. To simplify the computation of DiffX we introduce an internal signal Exact which is set if DiffX becomes zero in the course of the computation. If Exact becomes set, the value of DiffX is not used in the computation of YShift (subsequent updates of YShift involve only division by 2). DiffX becomes a don't care, and we can merge the $DiffX = %0 and $DiffX > %0 branches. We also replace $DiffX = %0 with $Exact in the assignment to YShift, and change the $DiffX > %0 comparisons to $DiffX >= %0 (note that this branch of the if is not executed when $DiffX = %0, because Exact is set in this case).

Simplifying the algorithm in this way requires proving a history property of the computation of DiffX. Using $DiffX(n) to denote the value of the signal DiffX in the n'th computation cycle, we state the property as:

\[ \forall t : t \ge 0.\ (\mathtt{\$DiffX}(t) = 0) \Rightarrow (\mathtt{\$DiffX}(t + 1) = 0) \]

Next, we note that the condition $DiffX >= 0 can be detected by negating DiffX's sign bit. We introduce the negated sign bit as the intermediate signal ADD (to specify that we are adding to YShift in the current iteration) and rewrite the conditions in the assignments to DiffX
Partitioning into Functional Blocks
The fourth transformation separates those computations which can be performed combinationally from those which require sequential elements, and partitions the computations into simple functional units. The transformation is motivated by our desire to implement the algorithm by an interconnection of lower level blocks; the transformation process is guided by what primitives we have available in our library. For example, our library contains such primitives as registers, multiplexers, and adders, so it is sensible to transform

This particular example is a consequence of some more general rules which are justified as before by appealing to the semantics of HML. The assignments resulting from this transformation are shown below. DiffX' can be computed by a multiplexer. DiffXTmp and Delta can be computed by adder/subtracters. The YShift register can be conveniently implemented as a shift register which shifts when Exact's value is true, and loads otherwise. YShift' can be computed by a multiplexer, and YShiftTmp by an adder/subtracter; multiplication and division by 2 are simply wired shifts.

The fifth, and final, transformation step is the substitution of lower-level modules for the register and combinational assignments; the result is a structural description of the integer square root algorithm which can readily be implemented in hardware, as shown in Figure 1. ShiftReg4 is a shift register which shifts its contents two bits per clock cycle; the Done signal is simply its shift output. ShiftReg2, Mux, AddSub and Reg are a shift register, multiplexer, adder/subtracter and register, respectively. SubAdd is an adder/subtracter, but the sense of its mode bit makes it the opposite of AddSub. The Hold element has the following substructure:

There are further opportunities for performing optimization in the implementation of the lower level blocks. Analysis of the L2 algorithm reveals that Delta is always positive, so we do not need its sign bit.
This property can be used to save one bit in the AddSub used to compute Delta; only 16 bits are now required. One bit can also be saved in the implementation of SubAdd; the value of the ADD signal can be shown to be identical to a latched version of the carry output of a 16-bit SubAdd, provided the latch initially holds the value true. Figure 2 shows a block diagram of the hardware to implement the square root algorithm, which includes this optimization.
A number of other optimizations are not visible in the figure. The ShiftReg4 can be implemented by a ShiftReg2 half its width if we note that every second bit of B is always zero. The two SubAdd blocks can each be implemented with only a few AND and OR gates per bit, rather than requiring subtract/add modules, if we make use of some results concerning the contents of the B and YShift registers.
To justify these optimizations we are obliged to prove that every second bit of B is always zero:

\[ \forall i : 0 \le i < n.\ \neg B_{2i+1} \]

and that the corresponding bits of B and YShift cannot both be set:

We have produced two implementations of the square root algorithm which incorporate all these optimizations. The first was a full-custom CMOS layout fabricated by Mosis, the second used Altera programmable logic parts. In the latter case, the structural description was first translated to Altera's AHDL language and then passed through Altera's synthesis software.
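The two obligations on B and YShift can be spot-checked by simulation, as in the sketch below. This is testing on a few radicands, not a proof. The step function is a reconstruction of the L2 update from the update equations for diffx, yshift and b, and the names St, step, oddBitsClear, disjoint and checkRun are hypothetical.

```sml
datatype st = St of {diffx : int, yshift : int, b : int}

fun pow2 e = if e <= 0 then 1 else 2 * pow2 (e - 1)

(* Reconstructed update step (an assumption, see the lead-in). *)
fun step (St {diffx, yshift, b}) =
  if diffx = 0 then
    St {diffx = diffx, yshift = yshift div 2, b = b div 4}
  else if diffx > 0 then
    St {diffx = diffx - yshift - b, yshift = yshift div 2 + b, b = b div 4}
  else
    St {diffx = diffx + yshift - b, yshift = yshift div 2 - b, b = b div 4}

(* True when no bit at an odd position of b is set. *)
fun oddBitsClear 0 = true
  | oddBitsClear b = (b div 2) mod 2 = 0 andalso oddBitsClear (b div 4)

(* True when a and c have no set bit in common. *)
fun disjoint (a, c) = Word.andb (Word.fromInt a, Word.fromInt c) = 0w0

fun holds (St {yshift, b, ...}) = oddBitsClear b andalso disjoint (b, yshift)

(* Check both properties in the initial state and after each of n steps. *)
fun checkRun (n, x) =
  let
    fun go (0, s) = holds s
      | go (k, s) = holds s andalso go (k - 1, step s)
  in
    go (n, St {diffx = x, yshift = 0, b = pow2 (2 * (n - 1))})
  end

val allOk = List.all (fn x => checkRun (3, x)) [0, 1, 15, 16, 17, 35, 63]
(* allOk = true *)
```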
4 Discussion
We have described how to design and verify a subtractive, non-restoring integer square root circuit by refining an abstract algorithmic specification through several intermediate levels to yield a highly optimized hardware implementation.
We have proved using Nuprl that the L0 algorithm performs the square root function, and we have also used Nuprl to show how proving the first few levels of refinement (L0 to L1, L1 to L2) can be accomplished by transforming the top level proof in a way that preserves its validity.
This case study illustrates that rigorous reasoning about the high-level description of an algorithm can establish properties which are useful even for bit-level optimization. Theorem provers provide a means of formally proving the desired properties; a transformational approach to partitioning and optimization ensures that the properties remain relevant at the structural level. Each of the steps identified in this paper can be mechanized with reasonable effort. At the bottom level, we have a library of verified hardware modules that correspond to the modules in the HML structural description [AL94].
In many cases the transformations we applied depend for their justification upon non-trivial properties of the square root algorithm: we are currently working on formally proving these obligations. Some of our other transformations are purely syntactic in nature and rely upon HML's semantics for their justification. We have not considered semantic reasoning in this paper; this is a current research topic.
The algorithm we describe computes the integer square root. The algorithm and its implementation are of general interest because most of the algorithms used in hardware implementations of floating-point square root are based on the algorithm presented here. One difference is that most floating-point implementations use a higher radix representation of operators. In the future, we will investigate incorporating higher radix floating-point operations. We believe much of the reasoning presented here will be applicable to higher radix implementations of square root as well.
Many of the techniques demonstrated in this case study are applicable to hardware verification in general. Proof development systems are especially well suited for reasoning at high levels of abstraction and for relating multiple levels of abstraction. Both of these techniques must be exploited in order to make it feasible to apply formal methods to large scale highly optimized hardware systems. Top level specifications must be concise and intuitively capture the designers' natural notions of correctness (for example, arithmetic operations on unbounded integers), while the low level implementation must be easy to relate to the final implementation (for example, operations on bit-vectors). By applying a transformational style of verification as a design progresses from an abstract algorithm to a concrete implementation, theorem proving based verification can be integrated into existing design practices.
