We introduce a right-to-left digit serial algorithm for the integer power operation $x^y$, where $x$ and $y$ are positive integers. For $n$-bit words the algorithm utilizes $O(n)$ additions and does not require a multiplier. We describe a hardware implementation and evaluate its effectiveness using a Synopsys tool set with a standard cell implementation. Our digit serial algorithm compares favorably with a popular iterative square-and-multiply algorithm implemented with the same tool set.
INTRODUCTION
Algorithms for computing the operation $z = x^y$, where $y$ is a positive integer, have been the subject of considerable research.
The binary squaring method determines $x, x^2, x^4, x^8, \ldots$ and processes the bits of $y$ right-to-left, multiplying by the appropriate binary powers of $x$ to determine $x^y$. This algorithm has been described in many popular texts [9] [10] [11]. Knuth [11] traces this "fast" algorithm back to al-Kashi in the 15th century.
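The right-to-left binary squaring method described above can be sketched as follows. This is a minimal illustration, not the hardware formulation used later in the paper; the function name `power_mod_2k` and the choice to reduce modulo $2^k$ at every step are assumptions made for the example.

```python
def power_mod_2k(x: int, y: int, k: int) -> int:
    """Right-to-left binary squaring: x**y reduced modulo 2**k, for y >= 1."""
    mask = (1 << k) - 1          # keep only the k low-order bits
    result = 1
    square = x & mask            # holds x**(2**i) mod 2**k at step i
    while y > 0:
        if y & 1:                # bit i of y is set: multiply in x**(2**i)
            result = (result * square) & mask
        square = (square * square) & mask
        y >>= 1                  # move to the next bit of y
    return result
```

Each loop iteration needs one squaring and at most one further multiplication, which is why a hardware realization of this method relies on a fast multiplier.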
We are interested in the particular case where $x$, $y$ and the result $z$ are all non-negative $k$-bit integers. For typical word sizes such as $k$ = 8, 16, 32, 64, 128,…, this integer valued powering operation is proposed to supplement the integer addition and multiplication operations. The squaring algorithm may be implemented in hardware with microcode and a fast multiplier much like the floating-point transcendental operations in the Pentium and Athlon processors.
For implementation in hardware there is a need for a simpler algorithm that avoids the use of a large multiplier. There is a further need for a right-to-left digit serial algorithm that requires less time for lower precision operations when a family of precision levels is implemented in hardware.
In this paper, we introduce a novel digit serial algorithm for evaluation of the integer power operation $x^y$ that does not require a multiplier. The algorithm employs conversion of $x$ to a discrete log format [2], bit serial multiplication with the discrete log value providing the "recoded multiplier bits" and the exponent $y$ being the multiplicand, and bit serial deconversion of $x^y$ [6] to provide the result $z$.
The paper is organized as follows. We present some number theoretic background material on the integer power operation and review the foundations for the algorithms in Section 2. In Section 3, we present our digit serial integer power algorithm. Section 4 contains a description of the hardware implementation of our proposed algorithm and Section 5 provides area and delay estimates from the standard cell synthesis procedure. Conclusions are presented in Section 6.
A DIGIT SERIAL INTEGER POWER ALGORITHM
The inheritance principle [8] for integer operations on binary operands informally states that the $k$ low-order bits of the result depend only on the $k$ low-order bits of the operands, for all $k \ge 1$.
This principle provides the basis for right-to-left digit serial integer operations. Specifically, assume the low-order $(k-1)$ bits of the result have been determined from the $(k-1)$ low-order bits of the operands. Incorporating the $k$-th bits of the operands, the $k$-th result bit can then be determined, with the $(k-1)$ lower-order result bits "inherited" from the preceding serial computation.
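The inheritance principle can be checked numerically for addition and multiplication. The sketch below is only an illustration; the helper names `low_bits` and `inherits` are hypothetical and do not come from [8].

```python
import operator


def low_bits(value: int, k: int) -> int:
    """Return the k low-order bits of a non-negative integer."""
    return value & ((1 << k) - 1)


def inherits(op, a: int, b: int, k: int) -> bool:
    """Check that the k low-order bits of op(a, b) are unchanged
    when both operands are first truncated to their k low-order bits."""
    full = low_bits(op(a, b), k)
    truncated = low_bits(op(low_bits(a, k), low_bits(b, k)), k)
    return full == truncated


# Exhaustive check over small operands for several prefix lengths k.
ok = all(
    inherits(op, a, b, k)
    for op in (operator.add, operator.mul)
    for k in range(1, 9)
    for a in range(64)
    for b in range(64)
)
```

Here `ok` evaluates to `True`, reflecting that carries in addition and multiplication only propagate from low-order to high-order bit positions.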
Integer addition and multiplication clearly have the inheritance property, as evidenced in the traditional right-to-left "carry ripple" algorithms. To establish a foundation for our digit serial integer power algorithm, we note here without proof that the operation $x^y$ satisfies the inheritance property given appropriate exception handling for zero valued operands as indicated in the following:
Lemma: Let $x, y$ be integers with 0, 1
Note that $x$ has a unique factorization into odd and even terms, $x = 2^p n$ with $n$ odd. It is straightforward to show [1, 3] that every $k$-bit odd integer $n$ corresponds to a unique pair $(s, e)$ with $s \in \{0, 1\}$ and $0 \le e < 2^{k-2}$. Binary-to-discrete log conversion refers to determining the pair $(s, e)$ given the $k$-bit odd integer $n$, and deconversion refers to determining $n$ given the pair $(s, e)$, where $n$, $s$, $e$ satisfy $n \equiv (-1)^s 3^e \pmod{2^k}$. For completeness, we review algorithms from [2, 6] demonstrating that both the exponential residue operation (determining $n$ given $e$) and the discrete log operation (determining $e$ given $n$) can be performed by a series of fewer than $k$ table-assisted shift-and-add operations employing exponent recoding.
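To make the deconversion step concrete, the following sketch computes $(-1)^s 3^e$ modulo $2^k$ using only shifts, adds and a small table. It is an illustration in the spirit of the table-assisted shift-and-add method, not a transcription of the algorithm in [6]: the base 3, the function names, and the brute-force construction of the table of discrete logs of the "two-ones" values $2^i + 1$ are all assumptions of this example, and $k \ge 3$ is required.

```python
def two_ones_dlog_table(k: int) -> dict:
    """Map i -> e with 3**e == 2**i + 1 (mod 2**k), for i = 1 and 3 <= i < k.

    Built by brute force purely for illustration; a hardware design
    would hold these constants in a lookup table.
    """
    modulus = 1 << k
    wanted = {(1 << i) + 1: i for i in [1] + list(range(3, k))}
    table = {}
    power = 1
    for e in range(1 << (k - 2)):      # 3 has order 2**(k-2) modulo 2**k
        if power in wanted:
            table[wanted[power]] = e
        power = (power * 3) % modulus
    return table


def exp3_mod_2k(s: int, e: int, k: int) -> int:
    """Deconversion sketch: (-1)**s * 3**e mod 2**k via shift-and-add."""
    mask = (1 << k) - 1
    emask = (1 << (k - 2)) - 1
    table = two_ones_dlog_table(k)
    remaining = e & emask              # exponent still to be accounted for
    product = 1
    if remaining & 1:                  # use the factor 3 = 2 + 1
        product = (product + (product << 1)) & mask
        remaining = (remaining - table[1]) & emask
    for i in range(3, k):
        # The log of 2**i + 1 has its lowest set bit at position i - 2,
        # so subtracting it clears that bit of the remaining exponent.
        if (remaining >> (i - 2)) & 1:
            product = (product + (product << i)) & mask
            remaining = (remaining - table[i]) & emask
    return (-product) & mask if s & 1 else product
```

Each selected table entry costs one shift and one add on the running product, so the loop performs fewer than $k$ such operations.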
Existing "Fast" Binary Squaring Algorithm
The existing fast algorithm is based on the fact .. 
Additive Based Discrete Logarithm Modulo $2^k$
Computing the discrete logarithm for certain $k$-bit odd integers $x$ can be accomplished using a method [2] that is essentially the dual of the exponentiation method of Section 2. The method in [2] identifies the set of two-ones residues and is thus the core of a digit serial conversion method from binary to DLS.
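The duality with exponentiation can be sketched as follows: the odd operand is driven to 1 by multiplying with two-ones values $2^i + 1$ (each a shift and an add), while the corresponding table logs are subtracted from the exponent. This is a hedged illustration rather than the DLG algorithm of [2]; the base 3, the sign handling via residues modulo 8, the function names, and the brute-force table are assumptions of the example, and $k \ge 3$ is required.

```python
def two_ones_dlog_table(k: int) -> dict:
    """Map i -> e with 3**e == 2**i + 1 (mod 2**k), for i = 1 and 3 <= i < k.
    Brute force, for illustration only."""
    modulus = 1 << k
    wanted = {(1 << i) + 1: i for i in [1] + list(range(3, k))}
    table = {}
    power = 1
    for e in range(1 << (k - 2)):
        if power in wanted:
            table[wanted[power]] = e
        power = (power * 3) % modulus
    return table


def dlg_mod_2k(n: int, k: int):
    """Return (s, e) with n == (-1)**s * 3**e (mod 2**k) for odd n."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    mask = (1 << k) - 1
    emask = (1 << (k - 2)) - 1
    table = two_ones_dlog_table(k)
    n &= mask
    s = 0
    if n & 0b111 in (5, 7):            # negatives of powers of 3 modulo 8
        s = 1
        n = (-n) & mask
    e = 0
    product = n                        # invariant: product == n * 3**(-e)
    if product & 0b10:                 # n == 3 (mod 8): multiply by 3 = 2 + 1
        product = (product + (product << 1)) & mask
        e = (e - table[1]) & emask
    for i in range(3, k):
        if (product >> i) & 1:         # clear bit i with the factor 2**i + 1
            product = (product + (product << i)) & mask
            e = (e - table[i]) & emask
    # At this point product == 1 modulo 2**k.
    return s, e
```

The bits of the running product are cleared right-to-left, so each iteration depends only on the low-order bits already processed, which matches the digit serial character of the conversion.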
Algorithm 3 Binary to DLS Conversion Algorithm (DLG)
Stimulus
PROPOSED FEEDBACK SHIFT ADD (FSA) ALGORITHM
Based on previous work, we know that any number $x$ can be converted to a triple $(s, p, e)$, where $x = 2^p n$ with $n$ odd [2, 6], so that $x^y \equiv \left((-1)^s\, 2^p\, 3^e\right)^y \pmod{2^k}$. In Figure 1, we calculate the $y$-th power of operand $x$ in a serial fashion. That is, we start multiplication and decoding only after we obtain the entire value of $e$. A better technique is a pipelined arrangement of the sub-operations in which multiplication and decoding start when the first bit of $e$ is available. For every available bit of $e$, a bit of the intermediate product is generated, followed by a bit of $z$. This method is referred to as the pipelined algorithm and is described in the following algorithm.
The third stage updates $z$ according to the EXP algorithm (i.e. L19 − L24). The final result is obtained at line L25. As can be seen by inspection of the algorithm, the time complexity is essentially $k$ dependent shift-and-add operations modulo $2^k$.
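The serial (non-pipelined) arrangement of conversion, multiplication and deconversion can be sketched as below. This is only a functional model under stated assumptions: the discrete log is found by brute-force search and the deconversion uses Python's built-in `pow`, both standing in for the shift-and-add conversions; the function names are hypothetical; $k \ge 3$ and $y \ge 1$ are assumed, with a zero operand $x$ handled as a special case.

```python
def odd_dlog_bruteforce(n: int, k: int):
    """Return (s, e) with n == (-1)**s * 3**e (mod 2**k) for odd n.
    Brute-force stand-in for the binary-to-discrete-log conversion."""
    modulus = 1 << k
    power = 1
    for e in range(1 << (k - 2)):
        if power == n:
            return 0, e
        if modulus - power == n:
            return 1, e
        power = (power * 3) % modulus
    raise ValueError("n must be odd and less than 2**k")


def power_via_triple(x: int, y: int, k: int) -> int:
    """Compute x**y mod 2**k through the triple (s, p, e), for y >= 1."""
    modulus = 1 << k
    x %= modulus
    if x == 0:
        return 0                           # zero operand handled separately
    p = (x & -x).bit_length() - 1          # number of trailing zero bits
    n = x >> p                             # odd part of x
    s, e = odd_dlog_bruteforce(n, k)       # stage 1: conversion
    if p * y >= k:
        return 0                           # the factor 2**(p*y) clears all k bits
    sign = (s * y) & 1                     # stage 2: multiply by the exponent y
    exponent = (e * y) % (1 << (k - 2))
    odd_part = pow(3, exponent, modulus)   # stage 3: deconversion
    if sign:
        odd_part = modulus - odd_part
    return (odd_part << (p * y)) % modulus
```

The only product formed with the exponent is $e \cdot y$ reduced modulo $2^{k-2}$, which is where the bit serial multiplication of the proposed method takes place.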
HARDWARE IMPLEMENTATION
The state diagram of a controller for a hardware implementation is given in Figure 2.
EXPERIMENTAL RESULTS
In order to evaluate the effectiveness of our method as compared to the well-known "fast" squaring method, we described each method in Verilog RTL and synthesized the circuits using the Synopsys tool set based on a standard cell library from Synopsys [5] and a standard cell library from Oklahoma State University [7]. Since the results from the two standard cell libraries were similar, we only list the results based on the standard cell library from Synopsys. We implemented five designs corresponding to word sizes of $k$ = 8, 16, 32, 64 and 128. We also implemented the existing fast algorithm described in Section 2.3. Table 2 compares the results of our algorithm with the existing fast algorithm for different $k$ values. We also plot the trend of the two algorithms in Figure 3 (speed) and Figure 4 (area). For all $k$ values, our algorithm is faster than the existing fast algorithm when each algorithm is synthesized with the standard cell library. Regarding area, our method requires more area for small word sizes, but its area grows more slowly than that of the existing fast algorithm. Thus, when $k \ge 64$, our algorithm requires less area.
It should be noted that the area values reported here include only the total cell area. Since we did not route the resulting circuits, the additional area required by routing is not included.
CONCLUSION AND FUTURE WORK
We presented a novel algorithm for computing the powering operation modulo $2^k$. The algorithm has a critical path of fewer than $k$ shift-and-add operations modulo $2^k$. To evaluate the effectiveness of the method, we compared the standard cell synthesis results for the algorithm for word sizes of $k$ = 8, 16, 32, 64 and 128. The experimental results confirm that the algorithm is an effective way of calculating the powering operation modulo $2^k$.
The bottleneck in the new digit serial algorithm is the use of repetitive large shifts to implement the compound product. Further work in our laboratory is directed at reducing the shift penalty for realizing this product to further improve our synthesis results.
One observation we have made can significantly reduce the long shift once the iteration reaches the middle point of the result at bit $k/2$. At this middle point, the lower $k/2$ bits of the compound product will no longer change, while the upper $k/2$ bits are still being accumulated. However, the shift length after reaching the middle point is larger than $k/2$, which means that the upper $k/2$ bits will be shifted to positions beyond the final product size $k$. Thus, they will not affect the compound product any more. Based upon this observation, we conclude that when the iteration reaches the middle point, we only need to record the lower $k/2$ bits and shift the recorded data one bit left each time for the remaining $k/2$ computations.
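One reading of this observation can be checked with a short sketch: once the shift amount is at least $k/2$, the shifted term contributes to the $k$-bit result only through the low-order bits of the value being shifted. The function names below are hypothetical, and the sketch models a single shifted term rather than the full compound product.

```python
def shifted_term(value: int, shift: int, k: int) -> int:
    """Contribution of (value << shift) to a k-bit accumulator."""
    return (value << shift) & ((1 << k) - 1)


def shifted_term_low_only(value: int, shift: int, k: int) -> int:
    """Same contribution, computed from only the k - shift low-order
    bits of value; higher bits would be shifted past bit k - 1."""
    kept = value & ((1 << (k - shift)) - 1)
    return (kept << shift) & ((1 << k) - 1)


k = 16
# For every shift of at least k/2, at most the lower k/2 bits matter.
same = all(
    shifted_term(v, sh, k) == shifted_term_low_only(v, sh, k)
    for sh in range(k // 2, k)
    for v in range(0, 1 << k, 257)
)
```

Under this reading, a register holding only the lower $k/2$ bits is enough for the second half of the iterations, since `same` evaluates to `True` for all tested values.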
