Abstract. This paper presents the operation and the CMOS implementation of novel fixed point square root algorithms with an input range of 0 to 1. These algorithms, referred to as the non-restoring and the restoring algorithm, were compared with a square root implementation based on a lookup table (LUT). The power consumption, the area consumption and the propagation delay were investigated, and the results of all three implementations are discussed in relation to their bitwidth with a view to expedient implementation. These results allow hardware designers to quickly identify the most suitable algorithm for a given application.
I. Introduction
Determining the square root is an important mathematical operation in digital signal processing (DSP) systems. However, square root algorithms are complex in their structure and function and are therefore not easy to implement. This paper compares three different algorithms. All three work with fixed point values and an input range between 0 and 1. The aim of this work is to implement the different square root algorithms and then to investigate and compare their power consumption, propagation delay and area consumption. The results are presented for different bitwidths. Both square root algorithms, the non-restoring and the restoring square root algorithm, were described in theory by Israel Koren in 2004 [1]. The name restoring means that, when the tentative remainder is negative, the partial remainder is restored and the tentative remainder is again shifted one digit to the left, to $4r_i$. The two algorithms work differently, but both calculate the same radix and the same final remainder. The third algorithm is based on a lookup table.
A. The Non-Restoring Square Root Algorithm
The non-restoring algorithm operates on a two's complement representation. In each iteration, an exact value with a remainder is calculated. This remainder is used for the further calculation until the radix has half the bitwidth of the radicand. If the radicand is an 8 bit vector, the final remainder and the radix are obtained after four iterations. This is possible because the square root of the denominator is also determined, which halves the bitwidth. An example is given in (1) for an 8 bit wide radicand and a 4 bit wide radix.
The non-restoring algorithm is based on the recursive relationship $r_i = 2r_{i-1} \pm 2Q_{i-1} - 2^{-i}$, where $r_i$ is the ith partial remainder, $Q_i$ is the partial square root after the ith iteration and $q_i$ is the ith square root bit. The computation is split into four subcomputations. First, the radicand or the partial remainder $r_i$ is shifted by one digit to the left to produce $2r_i$. The second step is the determination of $q_{i+1}$ by checking the sign of the partial remainder.
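The shift, sign check and add-or-subtract steps can be illustrated with a short software sketch. It is not the VHDL design of this paper: it uses a common integer formulation with root bits in {0, 1} instead of the fractional recursion above, and the function name `non_restoring_sqrt` and its parameters are assumptions made for illustration. The radicand is passed as an integer of `2 * root_bits` bits, so an 8 bit radicand yields a 4 bit radix after four iterations.

```python
def non_restoring_sqrt(radicand, root_bits):
    """Return (root, remainder) for an integer radicand of 2 * root_bits bits.

    Illustrative integer variant of the non-restoring scheme (an assumption,
    not the paper's implementation). The remainder may go negative between
    iterations and is corrected only once, at the end.
    """
    root = 0
    rem = 0
    for i in range(root_bits - 1, -1, -1):
        pair = (radicand >> (2 * i)) & 0b11  # next two radicand bits
        if rem >= 0:
            # shift in two bits, then subtract the root followed by "01"
            rem = 4 * rem + pair - (4 * root + 1)
        else:
            # negative remainder is not restored: add the root followed by "11"
            rem = 4 * rem + pair + (4 * root + 3)
        # the sign of the new partial remainder selects the next root bit
        root = 2 * root + (1 if rem >= 0 else 0)
    if rem < 0:
        # single final correction so that the remainder is non-negative
        rem += 2 * root + 1
    return root, rem


# 0.01010000 (80/256) gives the radix 0.1000 (8/16) and the remainder 16/256.
print(non_restoring_sqrt(0b01010000, 4))  # (8, 16)
```

Because a negative remainder is carried into the next iteration instead of being undone, every iteration performs exactly one addition or subtraction, and only the final remainder needs a correction.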
B. The Restoring Square Root Algorithm
The restoring algorithm also works with a two's complement representation. However, unlike the non-restoring algorithm, the square root and the final remainder are obtained after i iterations if the radicand is i bit wide. First, the radicand must be extended with zeros to twice its bitwidth. This extension is essential because the square root of the denominator is also determined. For example, 0.0101 is extended to 0.01010000.
In each iteration, four calculations are required. The radicand $r_0$ is shifted one digit to the left so that the product $2r_0$ is obtained.
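The zero extension and the restore step can be sketched in software as follows. This is a minimal integer formulation of the digit-by-digit idea, not the VHDL design of this paper; the function name `restoring_sqrt` and its parameters are assumptions made for illustration. An input of `bits` bits is extended to twice its width and produces a root of `bits` bits after `bits` iterations.

```python
def restoring_sqrt(radicand, bits):
    """Return (root, remainder) for a `bits` wide fixed point radicand in [0, 1).

    Illustrative integer variant of the restoring scheme (an assumption,
    not the paper's implementation).
    """
    extended = radicand << bits  # append `bits` zeros: double bitwidth
    root = 0
    rem = 0
    for i in range(bits - 1, -1, -1):
        pair = (extended >> (2 * i)) & 0b11  # next two radicand bits
        rem = 4 * rem + pair
        tentative = rem - (4 * root + 1)  # trial subtraction
        if tentative >= 0:
            rem = tentative      # accept: root bit is 1
            root = 2 * root + 1
        else:
            root = 2 * root      # restore: keep the old remainder, root bit is 0
    return root, rem


# 0.0101 is extended to 0.01010000; the radix is 0.1000 with remainder 16/256.
print(restoring_sqrt(0b0101, 4))  # (8, 16)
```

In this sketch the restore is implicit: the tentative remainder is simply discarded when it is negative, which corresponds to adding the subtracted value back in hardware.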
II. Results
The hardware design was written in the hardware description language VHDL according to the VHDL-87 standard [2]. The Synopsys Design Compiler [3] was used for synthesis without any design constraints. The implementation technology is the Europractice ES2 ECPD 0.7 µm CMOS technology [4].
The results of the square root algorithms and the lookup table are discussed below. The non-restoring and the restoring algorithm were implemented with bitwidths of 4, 8, 16, 32 and 64 bit. The lookup table was implemented with bitwidths of 4, 8 and 16 bit; its results for 32 bit and 64 bit were extrapolated so that all bitwidths can be compared. The following sections present the results for the power consumption, the area consumption and the propagation delay.
A. Power Consumption
The square root algorithms and the lookup table were tested for their power efficiency. Because LUTs and square root algorithms behave differently as the bitwidth changes, both algorithms and the LUT were investigated at several bitwidths. The power consumption is determined by the active capacitance of the transistors, the supply voltage and the switching frequency, and can be calculated as follows:
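As a minimal sketch of this relationship, the widely used switched-capacitance model $P = C \cdot V^2 \cdot f$ is assumed below, with $C$ the active capacitance, $V$ the supply voltage and $f$ the switching frequency. The function name `dynamic_power` and the capacitance value are hypothetical and are not measurements from this work.

```python
def dynamic_power(c_active, v_supply, f_switch):
    """Dynamic power in watts under the assumed model P = C * V^2 * f."""
    return c_active * v_supply ** 2 * f_switch


# Operating point of the power tests: 5 V supply, 10 MHz clock.
# The 100 pF active capacitance is a made-up value for illustration only.
power_watts = dynamic_power(100e-12, 5.0, 10e6)
print(f"{power_watts * 1e3:.1f} mW")  # 25.0 mW
```

Under this model, the supply voltage and the clock frequency are fixed for all tests, so differences in power between the implementations come from the active capacitance alone.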
Power is only consumed when a gate changes its state. Therefore, the test program PowerCount was used, which creates random test vectors and measures the active capacitance [5]. All power tests in Figure 1 used a clock frequency of 10 MHz and a supply voltage of 5 V. Figure 1 shows the power consumption of the different implementations. Comparing both algorithms and the lookup table, it can be seen that the non-restoring algorithm requires the most power. This can be explained by the higher number of arithmetic operations in the non-restoring algorithm. The lookup table consumes the least power because it only accesses stored values. The power consumption of the algorithms and the lookup table changes with the bitwidth. As the bitwidth increases, the computation of the square root becomes more complex. Therefore, the power consumption of the non-restoring and the restoring algorithm increases with the bitwidth, whereas the power consumption of the lookup table remains nearly constant. The non-restoring algorithm consumes more power than the restoring algorithm because of its additional correction step.
B. Area Consumption
One major objective of any hardware design is to achieve a small silicon area. Therefore, the area requirements of the square root algorithms and the lookup table are presented in Figure 2. The algorithms and the lookup table have different area requirements. Since the number of stored values is still small for the lookup table at 4 bit and 8 bit, its area consumption is smaller than that of the calculating algorithms. The non-restoring and the restoring algorithm always require arithmetic operations, and consequently the same arithmetic modules are used at different bitwidths. Only the number of operation steps increases.
If the length of the input vector doubles, the restoring and the non-restoring algorithm require twice as many steps to calculate the square root. For the lookup table, the number of stored values is squared, since an n bit input addresses 2^n entries. Therefore, the lookup table initially has a small area requirement at 4 bit and 8 bit. From 16 bit onwards, however, the LUT requires more silicon area than the other implementations.
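The different growth rates can be made concrete with a small sketch that builds a square root lookup table and counts its storage. It assumes that an n bit input addresses 2^n entries and that each entry is n bits wide; the entry width and the function names `build_sqrt_lut` and `lut_storage_bits` are assumptions for illustration, not details of the implemented LUT.

```python
import math


def build_sqrt_lut(bits):
    """Table of floor(sqrt(k / 2**bits) * 2**bits) for every `bits` wide input k."""
    return [math.isqrt(k << bits) for k in range(1 << bits)]


def lut_storage_bits(bits):
    """Storage in bits, assuming 2**bits entries of `bits` bits each."""
    return (1 << bits) * bits


lut = build_sqrt_lut(4)
print(lut[0b0101])  # 8, i.e. sqrt(0.0101) is stored as 0.1000

for bits in (4, 8, 16):
    # columns: bitwidth, iterations of the iterative algorithms,
    # LUT entries, LUT storage in bits
    print(bits, bits, 1 << bits, lut_storage_bits(bits))
```

Going from 8 bit to 16 bit doubles the iteration count of the two iterative algorithms but squares the number of table entries, from 256 to 65536.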
C. Propagation Delay
The propagation delay is the time a signal requires to travel from one functional block of a system to another. Here, the worst-case delay from the input of the circuit until the output has settled is reported. As can be seen in Figure 3, the lookup table has the lowest propagation delay because it needs no arithmetic operation. Since the input vector only causes a memory read operation, the propagation delay is short. This timing behaviour does not change considerably when the bitwidth is increased. The non-restoring algorithm takes the longest to perform the calculation because it has the most arithmetic modules and, in addition, its results have to be corrected. As the bitwidth increases, more iterations are required and the signal propagation delay becomes longer. The restoring algorithm has a shorter propagation delay because it is arithmetically less intensive, so the result is computed faster. The propagation delay of the restoring algorithm increases continuously with the bitwidth.
III. Conclusions
This paper has investigated the implementation of three different square root algorithms in CMOS. Firstly, the mode of operation of the non-restoring algorithm, the restoring algorithm and the lookup table was described, and the computation steps were investigated and explained. Furthermore, it was demonstrated that all three square root computations work with fixed point values between 0 and 1. In the process, the mathematical complexity of the different algorithms was shown. Secondly, the implementation of the different algorithms in hardware was described. Various implementations with different bitwidths were realised in high-level VLSI design using the hardware description language VHDL. The different bitwidths were implemented in an ASIC and compared with respect to power consumption, area requirements and propagation delay. These properties were illustrated to compare the advantages and disadvantages of each algorithm. It was shown that the non-restoring algorithm, the restoring algorithm and the lookup table have different trade-offs. With the figures presented in this paper, the hardware designer is therefore able to pick the best possible implementation for a given problem.
