Although the SRT division and square-root approaches and the GST division approach have been known for a long time, no GST-based square-root architecture has been proposed so far that avoids a final division/multiplication by the scale factor. A GST square-root architecture is developed that requires neither a multiplication to update the scaled square-root quotient in each iteration nor a division/multiplication by the scaling factor after the square-root iterations are complete. Additionally, a quantitative comparison of the speed and power consumption of GST and SRT division/square-root units is presented. Shared divider and square-root units are designed based on the SRT and the GST approaches, in minimally and maximally redundant radix-4 representations. Simulations demonstrate that the worst-case overall latency of the minimally redundant GST architecture is 35% smaller than that of the SRT architecture. Alternatively, for a fixed latency, division and square-root operations based on the minimally redundant GST architecture consume 32% and 28% less power, respectively, than the maximally redundant SRT approach.
INTRODUCTION
Among the four basic arithmetic functions, division is the most difficult to implement in silicon. Assuming the simplest implementations of adder, multiplier and divider, addition using a carry-ripple adder has a critical path of n full-adder delays, and multiplication employing a carry-save scheme and a final carry-ripple adder requires 2n full-adder delays (the multiplexer delays are neglected). However, division using the paper-and-pencil method, in which a carry-ripple adder is used in each iteration, has a critical path of n·n full-adder delays. The authors of [1] indicate that the total number of division operations can range from one-third to one-half the number of multiplications in computations. Oberman [2] concludes that even though division is an infrequent operation, it can degrade performance if its implementation is neglected.
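The critical-path estimates above can be tabulated for a given word-length. The following is a minimal sketch in Python; the function name and dictionary keys are hypothetical, and the figures simply restate the n, 2n and n·n full-adder delays quoted in the text.

```python
def full_adder_delays(n):
    """Critical path, in full-adder delays, for n-bit operands.

    Restates the simple estimates from the text (multiplexer delays neglected).
    """
    return {
        "carry_ripple_add": n,              # one carry-ripple adder
        "carry_save_multiply": 2 * n,       # carry-save array + final carry-ripple adder
        "paper_and_pencil_divide": n * n,   # n iterations, each a carry-ripple addition
    }


delays = full_adder_delays(32)
print(delays)
```

For 32-bit operands the division path is 32 times longer than the addition path under these assumptions, which is the gap that redundant number systems and higher radices aim to close.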
In general, division and square-root are defined as

$N = Q_d \cdot D + R$, (1)

$X = Q_s \cdot Q_s + R$, (2)

where N, D, $Q_d$, R, X and $Q_s$ are the dividend, divisor, division quotient, remainder, square-root operand and square-root quotient, respectively. Divider algorithms generally fall into two classes: the Multiplicative Algorithms (MA) [3] [4] [5] and the Iterative Digit Recurrence Algorithms (IDRA), also known as the paper-and-pencil method. A general overview of existing divider algorithms is given in Figure 1. The digit-recurrence algorithms include the SRT (named after Sweeney, Robertson and Tocher [6, 7]) and the GST (the G stands for generalized, the ST for the contributions of Svoboda and Tung [8, 9]). The case of higher-radix division with redundant digit sets, which also implies similar considerations for the corresponding higher-radix square-root, has been studied extensively in [10]. Other significant work on division and division/square-root can be found in [11] [12] [13] [14] [15] [16] [17].
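The two defining relations can be checked on integers. The sketch below is an illustration only: it uses Python's built-in `divmod` and `math.isqrt` rather than any digit-recurrence hardware scheme, and the function names are hypothetical.

```python
import math


def divide_with_remainder(n, d):
    """Return (Qd, R) such that N = Qd * D + R with 0 <= R < D."""
    return divmod(n, d)


def sqrt_with_remainder(x):
    """Return (Qs, R) such that X = Qs * Qs + R with Qs the integer square root."""
    qs = math.isqrt(x)
    return qs, x - qs * qs


qd, r_div = divide_with_remainder(47, 5)
qs, r_sqrt = sqrt_with_remainder(47)
assert 47 == qd * 5 + r_div      # Eq. (1)
assert 47 == qs * qs + r_sqrt    # Eq. (2)
```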
According to IEEE standard 754, the unsigned operands of the division and the square-root have to be in the ranges [1, 2) and [0.25, 1), respectively. If the inputs do not conform to this standard, they must be normalized and their exponents adjusted. Assuming binary operands of word-length n, the division/square-root computation requires n additions/subtractions to obtain the entire quotient.
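The normalization step can be sketched in software as follows. This is an illustration under stated assumptions, not the hardware procedure of the paper: it assumes a positive floating-point input, uses `math.frexp` from the Python standard library, and the function names are hypothetical. For the square-root, the exponent is made even so that it can be halved after the mantissa root has been computed.

```python
import math


def normalize_for_division(x):
    """Return (m, e) with x = m * 2**e and 1 <= m < 2 (x must be positive)."""
    if x <= 0:
        raise ValueError("x must be positive")
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    return 2 * m, e - 1


def normalize_for_sqrt(x):
    """Return (m, e) with x = m * 2**e, 0.25 <= m < 1 and e even."""
    if x <= 0:
        raise ValueError("x must be positive")
    m, e = math.frexp(x)          # 0.5 <= m < 1
    if e % 2:                     # odd exponent: shift mantissa into [0.25, 0.5)
        m, e = m / 2, e + 1
    return m, e


m, e = normalize_for_sqrt(8.0)
# sqrt(x) = sqrt(m) * 2**(e // 2) once e is even
assert abs(math.sqrt(m) * 2 ** (e // 2) - math.sqrt(8.0)) < 1e-12
```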
The computation can be accelerated either by reducing the number of iterations (using higher radices) or by shortening the critical path of an iteration. The SRT and GST algorithms take advantage of a redundant number system and a higher radix r. This makes the critical path of the addition and subtraction independent of the word-length and reduces the number of iterations to $n / \log_2 r$. In [18] a novel shared architecture for division and square-root was presented, which developed a GST square-root architecture without requiring an additional division by the scaling factor after the square-root operation as in [19].
This paper is organized as follows. Section 2 presents the mathematical background for GST division and GST square-root operations. Section 3 presents the architecture of the GST divider/square-root unit and the estimated power consumption of the SRT and GST square-root architectures. Section 4 presents the simulation results, while Section 5 concludes the paper.
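The iteration count $n / \log_2 r$ given in the introduction can be computed as follows. This is a minimal sketch; the function name is hypothetical, and rounding up when n is not a multiple of $\log_2 r$ is an assumption that the text does not state.

```python
def iterations(n, radix):
    """Number of digit-recurrence iterations for n result bits in radix r = 2**b."""
    bits_per_digit = radix.bit_length() - 1
    if radix < 2 or (1 << bits_per_digit) != radix:
        raise ValueError("radix must be a power of two")
    # Ceiling division: a partial final digit still costs one iteration (assumption).
    return -(-n // bits_per_digit)


for r in (2, 4, 8, 16):
    print(r, iterations(24, r))
```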
Before the first iteration can be performed, the operands must be scaled. Additionally, the arithmetic condition, which guarantees that the most significant digit equals zero after the subtraction/addition of the multiple of the denominator, has to be satisfied. Moreover, since the square-root quotient is unknown at the beginning of the computation, it has to be updated and scaled according to the quotient digits $q_i$ and k.
Since $Q_s$ changes every iteration, $kQ_s$ has to be updated according to the quotient digits $q_i$ and k. The sought result is obtained after every iteration by

$Q_{s,i} = Q_{s,i-1} + r^{-i} q_i$. (9)

Scaling Eq. (6) … Recalling that in (6) $2\hat{Q}_{s,i}$ was defined as

$2\hat{Q}_{s,i} = 2 \cdot Q_{s,i-1} + q_i r^{-i}$,

and by comparing Eqs. (10), (11) and (12), the sought equation is obtained as:
The next scaled square-root quotient can now be obtained by just adding the quantities $q_i \cdot r^{-i} \cdot k$ and $q_{i+1} \cdot r^{-(i+1)} \cdot k$ to the previous scaled square-root quotient [18].
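The additive update described above can be checked numerically with exact rational arithmetic. The sketch below is an illustration, not the paper's Eq. (13): it assumes that the scaled quantity being tracked is $k\,(2 Q_{s,i-1} + q_i r^{-i})$, that digits are indexed from i = 1, and that $Q_{s,0} = 0$. The function names, the scale factor 15/16 and the digit sequence are arbitrary example choices.

```python
from fractions import Fraction


def scaled_term_direct(k, digits, r, i):
    """k * (2*Q_{s,i-1} + q_i * r**-i), computed from the digits q_1..q_i."""
    q_prev = sum(Fraction(d, r**j) for j, d in enumerate(digits[:i - 1], start=1))
    return k * (2 * q_prev + Fraction(digits[i - 1], r**i))


def scaled_term_next(prev, k, q_i, q_next, r, i):
    """Additive update: add q_i*r**-i*k and q_{i+1}*r**-(i+1)*k to the previous term."""
    return prev + k * Fraction(q_i, r**i) + k * Fraction(q_next, r**(i + 1))


def verify_update(k, digits, r):
    """Check that the additive update reproduces the direct value at every step."""
    term = scaled_term_direct(k, digits, r, 1)
    for i in range(1, len(digits)):
        term = scaled_term_next(term, k, digits[i - 1], digits[i], r, i)
        if term != scaled_term_direct(k, digits, r, i + 1):
            return False
    return True


print(verify_update(Fraction(15, 16), [1, -1, 2, 0, -2, 1], 4))
```

Under these assumptions the identity holds because $2Q_{s,i} = 2Q_{s,i-1} + 2 q_i r^{-i}$, so only small multiples of k are added and no multiplication by the growing quotient is needed.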
which can be rewritten as

$\frac{\cdots}{r-1} < R_{i+1} < \frac{\cdots}{r-1}$. (16)
By employing the arithmetic condition in the square-root recurrence equation

$R_{i+1} = r R_i - q_i \left(2 Q_i + q_i r^{-i}\right)$, (17)

an upper and a lower bound can be computed between which the square-root quotient has to be located.
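The recurrence in Eq. (17) can be exercised with a small exact-arithmetic model. The sketch below is an illustration under stated assumptions and is neither the SRT nor the GST selection scheme: it uses a non-redundant digit set {0, ..., r-1} with an exact comparison to pick each digit, takes $Q_i$ to be the partial root before digit $q_i$ is appended, starts from $R_1 = X$, and expects $0 \le X < 1$. All names are hypothetical.

```python
from fractions import Fraction


def sqrt_digit_recurrence(x, radix=4, num_digits=8):
    """Radix-r digit-recurrence square root with exact (restoring-style) selection.

    Returns (root, remainder, digits) where remainder = r**num_digits * (x - root**2).
    """
    x = Fraction(x)
    root = Fraction(0)   # partial root Q_i before digit q_i is appended
    rem = x              # R_1 = X
    digits = []
    for i in range(1, num_digits + 1):
        weight = Fraction(1, radix**i)
        # Largest digit that keeps the next remainder non-negative; d = 0 always qualifies.
        for d in range(radix - 1, -1, -1):
            candidate = radix * rem - d * (2 * root + d * weight)   # Eq. (17)
            if candidate >= 0:
                break
        digits.append(d)
        rem = candidate
        root += d * weight
    return root, rem, digits


root, rem, digits = sqrt_digit_recurrence(Fraction(1, 2))
ulp = Fraction(1, 4**8)
assert root * root <= Fraction(1, 2) < (root + ulp) ** 2
```

A redundant digit set, as used by SRT and GST, replaces the exact comparison with a selection based on a few most significant remainder digits; the recurrence itself stays the same.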
With these upper and lower bounds for the square-root quotient, the limits on the operand X can be computed by introducing the same rewrite condition as in the division [20]. An even distribution of the scale intervals does not lead to an optimal solution, as shown in Table I. Besides the fact that eight bits of the square-root operand X have to be examined, the bounds of X shown there lead to a longer critical path in the update of the square-root quotient, since the word-length of k is seven bits. To minimize the number of bits that must be examined to predict the scale factor k, and to limit the scale factor to a multiple of 1/16, the upper and lower bounds of X have to be slightly altered (see Tab. II). This also yields a useful symmetry that simplifies the scale-factor selection. Figure 3 was introduced in [24].
Minimally Redundant Radix-4 GST
In [25] and [20] two minimally redundant radix-4 division architectures have been presented. The architecture has been expanded according to (13). To overcome the additional iteration due to the leading $q_0 = 1$ and the need to compute 13 fractional digits to meet precision requirements, the first subtraction can be performed by replacing h, which is zero, by the scale factor k. The second quotient digit can be obtained by applying random logic to the non-scaled radicand. The combination $q_0.q_1 = 1.\bar{1}$ covers the range [0.584, 1). Hence, as in the division, it is sufficient to examine 3 bits ($X_1 X_2 X_3$) of the unscaled input operand to predict this digit correctly.
In every iteration, a decision has to be made, a selection between division and square-root performed ($kD$ or $kQ_i$), and a multiple of $kY$ selected (using a multiplexer structure) and subtracted from the previous remainder (Hybrid-adder) (see Fig. 4). RW/DC represents the rewrite and decision criteria of the most significant two digits. The most significant two digits have to be rewritten for the following cases: 2. 12, 1. 02, 2 0., 22 12. This step ensures the convergence of the algorithm [20]. The residual bounds can also be found in [20]. The square-root iteration also includes a four-bit adder that adds twice the value of the corresponding multiple of the scale factor k to the previous scaled square-root quotient ($2 q_i \cdot r^{-i} \cdot k$). In case of a negative multiple, the addition leads to a wrong result, since the most significant bits of the negative multiple of k are ones. Normally, this calls for a full-length addition, which increases the critical path tremendously. Nevertheless, this bottleneck can be removed by modifying the Hybrid-adder in such a way that not one but two hybrid-additions are performed. The first hybrid-addition subtracts the possibly wrong scaled square-root quotient, while the second corrects the result by adding ones up to bit position 2i, where i corresponds to the current iteration. If there are negative quotient digits, entirely different words with leading ones have to be added. This bottleneck can be eliminated by realizing that the addition of all those correcting terms can be simplified by using an on-the-fly converter which takes the scale factor k and the quotient digit $q_i$ as its inputs.
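For readers unfamiliar with on-the-fly conversion, the sketch below shows the standard textbook form of the technique, which converts signed quotient digits to a conventional digit string without carry propagation by keeping two registers, Q and QM = Q - r^{-j}. It is an illustration only and not the modified converter of this architecture, which additionally takes the scale factor k as an input. The function name is hypothetical, and the sketch assumes a non-negative overall value (the first non-zero digit is positive).

```python
def on_the_fly_convert(signed_digits, radix=4):
    """Convert signed digits (most significant first) to digits in 0..radix-1."""
    q, qm = [], []            # Q and QM = Q - one unit in the last place
    seen_positive = False
    for d in signed_digits:
        if d < 0 and not seen_positive:
            raise ValueError("overall value would be negative")
        if d > 0:
            seen_positive = True
        # Append a digit to either Q or QM; no carry ever propagates.
        new_q = q + [d] if d >= 0 else qm + [radix + d]
        new_qm = q + [d - 1] if d > 0 else qm + [radix - 1 + d]
        q, qm = new_q, new_qm
    return q


# 1/4 - 1/16 + 2/64 - 2/256 = 54/256, i.e. 0.0312 in radix 4
print(on_the_fly_convert([1, -1, 2, -2]))
```

Each step only selects one of the two registers and appends a digit, so the delay per iteration does not grow with the word-length.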
The signal divneg selects the correct value for h when a square-root operation is performed and a negative quotient digit has been predicted. The updated square-root quotient and the correcting term are pipelined for the next iteration. To update the term $q_{i+1}^2 r^{-(i+1)} k$, a simple multiplexer structure can be chosen which selects between 0, k and 4k. This term is always positive because of $q_{i+1}^2$. The most significant bit of this term has a weight $2^{-1}$ smaller than the correcting term h, so it can be added in parallel to h. Figure 5 shows the scheme of the update of h. Depending on whether the quotient digit is positive or negative, either a word with all zeros or a word with ones is added to the previous value of h. However, the ones are only placed up to bit position 2i, where i corresponds to the iteration number.
FIGURE 5 The selection of the correcting term h required for the square-root computation.
In [13] a very high radix square-root architecture which utilizes prescaling and rounding is presented. That architecture requires two multiplications per iteration. These multiplications are in the critical path and increase the iteration delay.
SIMULATION RESULTS
The algorithm has been implemented using 32-bit operands, 24 … [27, 28]; however, they are limited to dividers. In Tables III, IV … [29]
The supply voltage may accordingly be reduced [30]. Since these improvements apply to SRT and GST dividers in an identical way, they do not change the overall ratio between GST and SRT in speed and power.
CONCLUSION
The GST division algorithm has been successfully applied to the square-root operation in a hardware-efficient manner. 
