Abstract-We present a $k$-bit encoding of the $k$-bit binary integers based on a discrete logarithm representation. The representation supports a discrete logarithm number system (DLS) that allows integer multiplication to be reduced to addition and integer exponentiation to be reduced to multiplication. We introduce right-to-left bit serial conversion, deconversion, and unified conversion/deconversion algorithms between binary and DLS. The conversion algorithms use $O(k)$ additions, do not require a multiplier, and are applicable at least up to 128-bit integers. We illustrate the use of the representation in determining a novel and efficient integer power modulo $2^k$ operation $|x^y|_{2^k}$ and compare hardware performance with a current state-of-the-art method. Furthermore, we describe properties of the conversion mappings that allow compact table lookup structures to be employed for direct conversion to and deconversion from the DLS encoding. Our lookup architecture allows 16-bit conversion and deconversion mappings to be realized with table sizes of order 2-8 Kbytes, which is up to a 64× size reduction of the 128 Kbytes of an arbitrary 16-bits-in, 16-bits-out function table. Performance and area results that demonstrate the effectiveness of the table lookup architecture are given. The lookup methodology extends to other 16-bit integer functions such as multiplicative inverse and squaring operations.
INTRODUCTION
This paper's goal is to present the foundation for a discrete log representation and encoding of the integers with efficient conversion between the standard binary radix bit string integer representation and the bit string encoding of the discrete log representation of each integer. Our bit string encoding of the discrete-log integer representation employs $k$ bits to represent the integers $[0, 2^k - 1]$ in a scalable manner for all $k$. The representation employs reduction modulo $2^k$ of a three-term product introduced by Benschop. In [1], Benschop showed that any $k$-bit integer $x = b_{k-1} b_{k-2} \ldots b_0$ can be represented by an exponent triple $(s, p, e)$ satisfying the factored expression $x = |(-1)^s 2^p 3^e|_{2^k}$, where $|\cdot|_{2^k}$ denotes the operation of reduction to the standard residue modulo $2^k$ [9]; $s \in \{0, 1\}$; and $p$ and $e$ are uniquely determined subject to upper bounds that depend on $k$. Note that this representation allows integer multiplication and powering to be executed more efficiently, much as in the case of real-valued logarithms. To take advantage of this representation, efficient methods to convert integers to and from the exponent triple are required. In the following, we discuss the mathematics involved and provide a number of algorithms for achieving this goal. We use the term discrete logarithm number system (DLS) to denote the representation of any integer $x \in [0, 2^k - 1]$ by a triple $(s, p, e)$, and use the term DLS bit string for the $k$-bit encoding of the triple. Access to the separate exponents $s$, $p$, and $e$ of the triple is useful for ALU design, as with the separate processing of the sign, exponent, and significand factors of a floating-point factorization $v = (-1)^s 2^p (1.b_1 b_2 \ldots b_{n-1})$. The integrated DLS bit string is most suitable for efficient storage and table lookup. This paper extends our previous papers focusing on the following:
• Conversion/deconversion. How do we implement the binary-integer-to-$(s, p, e)$ triple conversion and the $(s, p, e)$-to-binary-integer deconversion? Since the exponent $p$ can be determined by a count of low-order zeros in the binary radix integer bit string, the question reduces to how we determine the $(s, e)$ pair for an odd integer $q$ such that $q = |(-1)^s 3^e|_{2^k}$ in an efficient, reversible, and scalable manner.
• Encoding. How do we encode the triple $(s, p, e)$ into a bit string with an appropriate integer range and convenient scalability for variable $k$-bit word sizes?
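Our method can be summarized with regard to three distinct representations of integers by bit strings in a scalable manner parameterized by increasing $k$, where $k$ denotes the $k$-bit integers, i.e., the set $\{0, 1, 2, \ldots, 2^k - 1\}$:
• Representation A. The binary radix bit string $b_{k-1} b_{k-2} \ldots b_0$ of the integer $x$.
• Representation B. The exponent triple $(s, p, e)$ with $x = |(-1)^s 2^p 3^e|_{2^k}$.
• Representation C. The DLS bit string $a_{k-1} a_{k-2} \ldots a_0$ having a value of $v(a_{k-1} a_{k-2} \ldots a_0) = x$, where parsing of the string yields unique integers $s$, $p$, and $e$ such that $x = |(-1)^s 2^p 3^e|_{2^k}$.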
In Section 2, we provide the foundation for the encoding mapping between the triples $(s, p, e)$ of Representation B and the $k$-bit DLS strings of Representation C that were introduced only by an example in [8]. Our DLS encoding employs self-determined variable length bit fields for encodings of parameters $p$ and $e$ within a $k$-bit word, in contrast to floating-point encodings, which employ fixed length fields for sign, exponent, and significand terms. A fundamental property of our $k$-bit encoding shown in Section 2 is that it is one-to-one over the integers $[0, 2^k - 1]$ and constructed so that DLS strings satisfy the computationally useful inheritance property. This property provides that the $j$th bit $(a_{j-1})$ of the DLS string $a_{k-1} a_{k-2} \ldots a_0$ depends only on the $j$ low-order bits $b_{j-1} b_{j-2} \ldots b_0$ of the binary integer representation for all $0 \le j \le k - 1$. The inheritance property provides the theoretical foundation for bit serial algorithms. The reverse mapping from the DLS string to the $(s, p, e)$ triple is accomplished by parsing. The parsing process first identifies parameter $p$ and then the pair $(s, e)$.
In Section 3, we focus on the conversion/deconversion mappings between the binary radix bit strings of Representation A and the exponent triple $(s, p, e)$ of Representation B. Efficient scalable iterative solutions of the integer-to-discrete log conversion and deconversion questions were presented at the algorithmic level in [2], [3], [4], and [16], and with hardware implementation in [6] and [17]. The algorithm described in [2] uses binary arithmetic with 3 as the logarithmic base and has a critical path containing one modulo $2^k$ multiplication operation for each of its $k$ iterations. Extensions of the algorithm to other logarithmic bases and computations using digits in a higher radix $2^r$ are also described. The algorithm in [2] was improved in [3] by replacing $k$ modulo $2^k$ multiplication operations with $k$ table lookup-determined shift-and-add modulo $2^k$ operations. The algorithm described in [3] is well suited for implementation in special purpose hardware, as is a digit serial algorithm for deconversion modulo $2^k$ introduced in [4]. Our new contribution in Section 3 is a unified conversion/deconversion algorithm extending results from [3] and [4]. The extension provides integration, potentially reducing hardware area requirements by almost one half.
As an internal ALU application of DLS representation, in Section 4, we describe and analyze the performance of a novel bit serial algorithm for the integer power operation $|x^y|_{2^k}$ using the DLS triple as a catalyst. We extend the bit serial powering operation introduced in [7] by employing the unified conversion/deconversion from Section 3. Specifically, the design shows how the data path can be shared, substantially reducing the total area for the whole circuit. While the algorithms in [3] and [4] are scalable with increasing $k$, alternative compressed direct table lookup methods can be employed for sufficiently small values of $k$, such as values less than or equal to 16. In Section 5, we investigate a novel hierarchical table lookup architecture for direct binary-DLS bit string conversion traversing directly between Representations A and C. As mentioned in [8], the one-to-one hereditary property of DLS bit string encoding provides for table compression similar to that obtained for determining the multiplicative inverse in [5]. This hierarchical direct table lookup procedure demonstrates further options for table-assisted computation in optimizing an ALU design. Section 6 provides our conclusions highlighting the features and selective advantages of discrete log representation and encoding of the integers.
THE INHERITANCE PROPERTY AND DLS ENCODING
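The binary radix integer bit string $b_{n-1} b_{n-2} \ldots b_0$ represents the integer given by

$$x = \sum_{i=0}^{n-1} b_i 2^i,$$

with $b_i \in \{0, 1\}$ for $0 \le i \le n - 1$. Partitioning the bit string yields

$$x = 2^k \sum_{i=k}^{n-1} b_i 2^{i-k} + \sum_{i=0}^{k-1} b_i 2^i$$

for any $1 \le k \le n - 1$. This means reduction modulo $2^k$ is obtained simply by truncating the leading portion of the bit string. While straightforward, this reduction property as summarized in the following is a fundamental feature of radix representation.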
The inheritance principle introduced in [5] applies to many integer operations and functions and is formally defined in terms of modular reduction as follows:
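Inheritance principle. The integer operation $z = x \circ y$ has the inheritance property and is termed a hereditary operation if for all nonnegative integers $x$ and $y$,

$$|x \circ y|_{2^k} = \big|\, |x|_{2^k} \circ |y|_{2^k} \big|_{2^k} \quad \text{for all } k \ge 1.$$

The integer function or unary operation $z = f(x)$ has the inheritance property and is termed a hereditary function if for all nonnegative integers $x$,

$$|f(x)|_{2^k} = \big| f(|x|_{2^k}) \big|_{2^k} \quad \text{for all } k \ge 1.$$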
For example, integer addition and multiplication clearly are operations satisfying the inheritance property. The function $x^2$ and, more generally, $x^y$ for any fixed $y$ are hereditary functions.
In view of Observation 1, the inheritance principle may be interpreted as stating for hereditary operations and functions that the low-order $k$ bits of the input arguments determine the low-order $k$ bits of the output for all $k \ge 1$. With this interpretation, the inheritance principle is seen to be the basis for right-to-left bit serial algorithms such as grade school carry ripple addition. It follows from Observation 2 that an arbitrary hereditary function can be represented by a binary tree with edges labeled by the input bits and output bits at the vertices, with bit $a_k$ at depth $k + 1$. This lookup tree structure was introduced in [5] with specific regard to the modular multiplicative inverse function and table lookup size compression.
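As a quick numerical illustration of this interpretation (a sketch, not part of the original development), the following Python check tests whether a function commutes with reduction modulo $2^k$ over small arguments. The helper name `is_hereditary` and the test ranges are assumptions chosen for the example; squaring and cubing pass, while integer halving does not.

```python
def is_hereditary(f, k_max=6, x_max=64):
    """Check |f(x)|_{2^k} == |f(|x|_{2^k})|_{2^k} for small x and k.

    Hypothetical helper: an exhaustive test over a small range,
    not a proof of the inheritance property.
    """
    for k in range(1, k_max + 1):
        m = 1 << k
        for x in range(x_max):
            if f(x) % m != f(x % m) % m:
                return False
    return True


def square(x):
    return x * x


def cube(x):
    return x ** 3


def halve(x):
    # Right shift: low-order output bits depend on a higher input bit.
    return x // 2


print(is_hereditary(square), is_hereditary(cube), is_hereditary(halve))
```

Halving fails because its low-order output bit is the second input bit, which is exactly the dependence that the inheritance principle rules out.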
Our purpose in the rest of this section is to demonstrate the construction of a $k$-bit DLS encoding of the triple $(s, p, e)$, which provides a one-to-one hereditary function from binary integer radix representation directly to the DLS bit string.
To obtain the DLS encoding, first recall that a positive integer $x$ has a unique factorization into odd and even terms, $x = 2^p q$ with $q$ odd.

The extension of Theorem 4 to even integers is immediate noting that the DLS encoding includes the same $(p + 2)$ low-order bits as the binary representation of $x = 2^p q$. The one-to-one mapping between 5-bit discrete log numbers comprising a 5-bit DLS encoding and standard 5-bit binary radix representation is given in Table 1. The DLS bit string is partitioned as follows to determine the three exponents $p$, $e$, and $s$.
Consider the line in the table for DLS string $a_4 a_3 a_2 a_1 a_0 = 10110$ (highlighted in Table 1). The parsing begins from the right-hand side, determining the variable length field identifying $2^p = a_1 a_0 = 10_2$ by counting zeros until the first unit bit is encountered. The 2-bit field "unary" encoding of $p$ determines $p = 1$. The next bit is a separation bit providing the logical value $s \oplus e_0$ used to determine $s$ after $e$ is determined. The remaining leading bits are the $5 - (p + 2)$ bits of the exponent $0 \le e \le 2^{5-(p+2)} - 1$ sufficient to determine the odd factor $q = |(-1)^s 3^e|_{2^5}$, so that $x = |2^p q|_{2^5}$ is the integer represented with $0 \le x \le 2^5 - 1$ uniquely determined. In this example, $e = 10_2 = 2_{10}$, and then $s = 1$ is determined from $e_0 = 0$ and $s \oplus e_0 = 1$. Finally, $x = |(-1)^1 2^1 3^2|_{2^5} = |-18|_{32} = 14 = 01110_2$.

Note that the low-order $p + 2$ bits are identical in both DLS and binary integer encodings.
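The parsing rule above can be turned into a small executable sketch. The Python functions below are illustrative assumptions, not the paper's algorithms: the names `dls_encode` and `dls_decode` are hypothetical, the exponent $e$ is found by brute-force search rather than by the bit serial methods of Section 3, and the handling of the shortest fields (when fewer than three bits remain for the odd factor) is our reading of the field layout.

```python
def dls_encode(x, k):
    """Map a k-bit integer x to its k-bit DLS string, returned as an int."""
    if x == 0:
        return 0
    p = (x & -x).bit_length() - 1          # count of low-order zeros
    m = k - p                              # bits left for the odd factor q
    q = (x >> p) & ((1 << m) - 1)
    if m == 1:                             # only the unary field 1 0...0 fits
        return 1 << p
    if m == 2:                             # q is 1 or 3 = -1 (mod 4); e has no bits
        s, e = (q >> 1) & 1, 0
    else:
        s = (q >> 2) & 1                   # q = 5 or 7 (mod 8) needs the sign
        mod = 1 << m
        e = next(e for e in range(1 << (m - 2))
                 if ((-1) ** s * pow(3, e, mod)) % mod == q)
    # Fields, left to right: e | separation bit s XOR e_0 | unary p.
    return (e << (p + 2)) | ((s ^ (e & 1)) << (p + 1)) | (1 << p)


def dls_decode(a, k):
    """Parse a k-bit DLS string right to left and return the integer x."""
    if a == 0:
        return 0
    p = (a & -a).bit_length() - 1          # unary field gives p
    m = k - p
    if m == 1:
        return 1 << p
    e = a >> (p + 2)                       # leading bits are the exponent e
    s = ((a >> (p + 1)) & 1) ^ (e & 1)     # recover s from the separation bit
    mod = 1 << m
    q = ((-1) ** s * pow(3, e, mod)) % mod
    return q << p


print(format(dls_encode(14, 5), "05b"), dls_decode(0b10110, 5))
```

For the highlighted example, the sketch maps 14 to 10110 and back, and an exhaustive check for small $k$ confirms the one-to-one and inheritance properties under these assumptions.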
The conversion/deconversion for odd integers in Table 1 can be visualized by the lookup trees illustrated in Figs. 1a and 1b, respectively. Navigation in Fig. 1a for binary-to-DLS conversion occurs by reading down with edge direction determined by the 5-bit odd integer string read right-to-left. The DLS output string $a_4 a_3 a_2 a_1 a_0$ is obtained (right to left) from the bits extracted from the vertices.
DLS CONVERSION AND DECONVERSION ALGORITHMS
Exponent $p$ for an even number can be determined by counting the low-order zeros in the binary radix integer bit string. Without loss of generality, we focus on odd numbers for the rest of this paper. Binary-to-discrete log conversion refers to determining the pair $(s, e)$ given the $k$-bit odd integer $q$, and deconversion refers to determining $q$ given the pair $(s, e)$, where $q$, $s$, $e$ satisfy $q = |(-1)^s 3^e|_{2^k}$. For conversion, $s$ is determined by conditional complementation to obtain a normalized $q$. Without loss of generality, assume $s = b_2 = 0$, so that $q$ is congruent to 1 or 3 (mod 8). This reduces the conversion operation to the determination of the discrete log $e = \mathrm{dlg}(q)$, with $0 \le e \le 2^{k-2} - 1$ and $q \equiv 1, 3 \pmod 8$. The deconversion problem reduces to evaluating the exponential residue operation determining $q$, where $q = |3^e|_{2^k}$. For completeness, we review algorithms from [3] and [4] demonstrating that both the exponential residue operation (determining $q$ given $e$) and the discrete log operation (determining $e$ given $q$) can be performed by a series of fewer than $k$ table-assisted shift-and-add operations employing exponent recoding.
We conclude this section by emphasizing the algorithmic similarities between additive conversion and deconversion and by introducing a unified conversion/deconversion algorithm. The unified algorithm has the attractive property of reducing the required hardware area by sharing a common data path, as compared to independent circuit implementations of the conversion and deconversion algorithms.
Additive Exponentiation Modulo $2^k$
The exponential residue $|3^e|_{2^k}$ can be computed using the square-and-multiply method [12]. This entails computing $|3^{2^i}|_{2^k}$ by successive squaring. We observe that similar methods lead to the correct result when the exponent $e$ is recoded as a sum of elements $e = |\sum_i \varepsilon_i|_{2^{k-2}}$ [4]. In this case, $|3^e|_{2^k}$ can be computed as $|3^e|_{2^k} = |\prod_i 3^{\varepsilon_i}|_{2^k}$. In [4], it is shown that any exponent $e$ can be expressed as a sum of $\mathrm{dlg}(2^i + 1)$'s, termed the two-ones discrete logs.

Since $|3^{\mathrm{dlg}(2^i + 1)}|_{2^k} = 2^i + 1$, it follows that the corresponding multiplications can be performed as a series of shift-and-add operations, with the two-ones discrete logs taken from a lookup table (Table 2) of just $k$ entries.
Algorithm 1 [4] determines the unique set of $\mathrm{dlg}(2^i + 1)$'s whose sum modulo $2^{k-2}$ is equal to $e$. It thus allows efficient conversion from DLS to binary. In the algorithmic description that follows, index notation is used for the corresponding bit of the standard binary representation. As an example, if a value $x$ is formed as the bit string $x_{n-1} x_{n-2} \ldots x_2 x_1 x_0$, the notation $x_i$ refers to the bit with subscript $i$. Note that lines L1-L3 correspond to initialization. The product $z$ is set to either 1 or 3 (corresponding to $e_0 = 0$ or $e_0 = 1$, respectively). The working variable exponent $h$ is always set in such a way that $z$ corresponds, for each iteration, to 3 raised to the exponent $(e - h)$ and the least significant $i$ bits of $h$ are all 0s. The algorithmic step of lines L4-L8 represents updating $h$ by subtracting $\mathrm{dlg}(2^{i+2} + 1)$, which simply represents the exponent of 3 that reduces to $2^{i+2} + 1$. This is followed by updating the product $z$ to reflect the changes in exponent, $z := |z \times (2^{i+2} + 1)|_{2^k}$. Eventually, after $(k - 2)$ steps, $h$ becomes 0 and the "product" $z$ corresponds to $|3^{e-0}|_{2^k} = |3^e|_{2^k}$. The values $\mathrm{dlg}(2^{i+2} + 1)$ can be stored in a lookup table, and this method is practical for large $k = 64, 128, \ldots$, since the table has only $k$ entries.
Algorithm 1 DLS-to-Binary Deconversion Algorithm (EXP)
Stimulus: …
L4: for $i := 1$ to $k - 3$ do /* L4: Loop over bits of $h$, index 1 to $k - 3$ */
L5: if $h_i = 1$, then /* L5: Conditional on $i$th bit of $h$ */
L6: $z := |z \times (2^{i+2} + 1)|_{2^k}$ /* L6: …
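Because only part of the algorithm listing is available here, the following Python sketch reconstructs the deconversion loop from the prose description of lines L1-L8. The function names `two_ones_dlg_table` and `exp3` are hypothetical, and the table is filled by brute force (an assumption suitable only for small $k$; a hardware design would store precomputed constants).

```python
def two_ones_dlg_table(k):
    """Return {j: dlg(2^j + 1) mod 2^(k-2)} for j = 3..k-1.

    Brute-force construction, an assumption made for this sketch.
    """
    mod = 1 << k
    log_of, v = {}, 1
    for e in range(1 << (k - 2)):
        log_of[v] = e
        v = (3 * v) % mod
    return {j: log_of[(1 << j) + 1] for j in range(3, k)}


def exp3(e, k, table):
    """Compute |3^e|_{2^k} with shift-and-add steps only."""
    mask = (1 << k) - 1
    emask = (1 << (k - 2)) - 1
    z = 3 if e & 1 else 1                 # initial product is 3^(e_0)
    h = (e & emask) & ~1                  # remaining exponent, bit 0 cleared
    for i in range(1, k - 2):             # i = 1 .. k-3
        if (h >> i) & 1:
            z = (z + (z << (i + 2))) & mask          # z * (2^(i+2) + 1) mod 2^k
            h = (h - table[i + 2]) & emask           # subtract dlg(2^(i+2) + 1)
    return z


table = two_ones_dlg_table(8)
print(exp3(2, 8, table), exp3(5, 8, table))
```

The loop maintains the invariant stated in the text: $z$ equals 3 raised to $(e - h)$, and each subtraction clears bit $i$ of $h$ because $\mathrm{dlg}(2^{i+2} + 1)$ has its lowest set bit at position $i$.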
Additive-Based Discrete Logarithm Modulo $2^k$
Computing the discrete logarithm for certain $k$-bit odd integers $x$ can be accomplished using a method [3] that is essentially the dual of the exponentiation method of Section 3.1. The key idea is to express $x$, if possible, as a product of two-ones residues:

$$x = \Big| \prod_{i \in I} (2^i + 1) \Big|_{2^k}.$$

Once this is done, the discrete logarithm can be computed as the corresponding sum:

$$\mathrm{dlg}(x) = \Big| \sum_{i \in I} \mathrm{dlg}(2^i + 1) \Big|_{2^{k-2}}.$$

The solution involves identifying the cases when $x$ can be expressed as such a product and finding the corresponding unique set of two-ones residues. It is shown in [3] that $x$ can be expressed as a two-ones residue product as long as $x$ is congruent to 1 or 3 modulo 8. Note that for the remaining odd residues, corresponding to $x$ congruent to 5 or 7 modulo 8, their additive inverses $|-x|_{2^k}$ are congruent to 1 or 3 modulo 8. The method in [3] identifies the set of two-ones residues, and thus, it is the core of a digit serial conversion method from binary to DLS. In Algorithm 2, the product $t$ is updated by multiplying with $(2^i + 1)$, while the exponent $e$ is updated by adding the corresponding values $\mathrm{dlg}(2^i + 1)$, looked up from a table. The final result is returned in line L10 as the sign $s$ and the exponent $e$ pair. The updating of $e$ and $p$ in lines L6 and L7 can be performed concurrently. As can be seen by inspection of Algorithm 2, the time complexity is essentially $k$ dependent shift-and-add modulo $2^k$ operations.
Algorithm 2 Binary to DLS Conversion Algorithm (DLG)
Stimulus: $k$, $x = x_{k-1} x_{k-2} \ldots x_2 x_1 x_0$ with $x_0 = 1$ (odd values)
Response: discrete log of $x$, expressed as an $(s, e)$ pair, where $x = |(-1)^s 3^e|_{2^k}$.
Method:
L1: if $b_2 = 1$, then $x := 2^k - x$; /* L1: Get 2's complement of $x$ if $b_2 = 1$ */
L2: end if
L3: $t := 1$; $e := 0$; $s := b_2$ /* L3: Initialize arguments $t$, $e$, and $s$ */
L4: for $i := 1, 3$ to $k - 1$ do /* L4: Loop over bit indices 1 to $k - 1$ skipping $i = 2$ */
…
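The loop body of the listing is not available here, so the following Python sketch fills it in under a stated assumption: at index $i$, the product $t$ is multiplied by $(2^i + 1)$ exactly when bit $i$ of $t$ differs from bit $i$ of the normalized $x$, so that $t$ converges to $x$ from right to left. The names `dlg_table` and `dlg` are hypothetical, and the table is again built by brute force for small $k$.

```python
def dlg_table(k):
    """Return {i: dlg(2^i + 1) mod 2^(k-2)} for i = 1, 3, 4, ..., k-1.

    Brute-force construction, an assumption made for this sketch.
    """
    mod = 1 << k
    log_of, v = {}, 1
    for e in range(1 << (k - 2)):
        log_of[v] = e
        v = (3 * v) % mod
    return {i: log_of[(1 << i) + 1] for i in [1] + list(range(3, k))}


def dlg(x, k, table):
    """Return (s, e) with x = |(-1)^s 3^e|_{2^k} for an odd k-bit x."""
    mask = (1 << k) - 1
    emask = (1 << (k - 2)) - 1
    s = (x >> 2) & 1                      # sign is bit 2 of the input
    if s:
        x = (-x) & mask                   # 2's complement gives x = 1 or 3 (mod 8)
    t, e = 1, 0
    for i in [1] + list(range(3, k)):     # bit indices 1..k-1, skipping 2
        if ((t ^ x) >> i) & 1:            # assumed rule: bit i of t differs from x
            t = (t + (t << i)) & mask     # t * (2^i + 1) mod 2^k flips bit i
            e = (e + table[i]) & emask    # accumulate dlg(2^i + 1)
    return s, e


table = dlg_table(8)
print(dlg(9, 8, table), dlg(247, 8, table))
```

Adding the shifted copy of an odd $t$ flips bit $i$ and leaves all lower bits unchanged, which is why a single right-to-left pass suffices under this reading.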
Unified Conversion/Deconversion Algorithm
Similarities between the additive versions of the deconversion (Algorithm 1) and conversion (Algorithm 2) algorithms presented in Sections 3.1 and 3.2 are described here. Since there are minimal differences between the two algorithms, they are very suitable for hardware reuse and a unified algorithm is developed. While conceptually one operation is the inverse of the other, they can be executed on the same data path. An intuitive explanation as to why this is feasible is presented here.
Algorithm 2 computes the discrete log of $x$. In order to do this, $e$, the discrete log of $x$, is updated one digit at a time. Concurrently, $t$ is updated to eventually become $|x \times x^{-1}| = 1$. The way $t$ and $e$ are updated is strongly related in the sense that $t$ is multiplied with $(2^i + 1)$ while $e$ is adjusted by $\mathrm{dlg}(2^i + 1)$, its discrete log. Eventually, $e$ represents the discrete log of $\pm x$.
Algorithm 1 starts with product $z = 1$ and updates it by multiplying with selected two-ones residues, $(2^i + 1)$'s. Concurrently, $s$, the exponent, is correspondingly adjusted by subtracting $\mathrm{dlg}(2^i + 1)$. This way, the multiplications by … $(2^i + 1))$ before and inside the core loop of Algorithm 1. Due to this switching, the algorithm would require some changes in the initialization part, but it would still produce $s = \mathrm{dlg}(z)$ at the end.
As emphasized above, the two algorithms can have a common core, the iteration loop, and the only differences are the initialization steps and the result returned by the algorithm. Last, but not least, the …

Result: $(s, e)$ /* L19: Return discrete log result */
L20: end if

As mentioned previously and also shown in [7], the DLS together with the conversion/deconversion algorithms provides an efficient method for the integer powering modulo $2^k$ operation. In Section 4, we present a DLS-based method for the integer powering operation that uses the conversion/deconversion algorithms above and an analysis of its hardware implementation.
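To make the shared-core idea of this section concrete, here is a minimal Python sketch in which one loop serves both directions and only the initialization, the bit test, and the sign of the exponent update differ. This is an illustrative assumption about how such a unification could look, not a transcription of the unified algorithm itself; the names `build_table` and `unified`, the `mode` strings, and the brute-force table are all hypothetical.

```python
def build_table(k):
    """Return {j: dlg(2^j + 1) mod 2^(k-2)} for j = 3..k-1 (brute force, small k)."""
    mod = 1 << k
    log_of, v = {}, 1
    for e in range(1 << (k - 2)):
        log_of[v] = e
        v = (3 * v) % mod
    return {j: log_of[(1 << j) + 1] for j in range(3, k)}


def unified(mode, arg, k, table):
    """mode 'EXP': arg = e, returns |3^e|_{2^k}.
    mode 'DLG': arg = odd x, returns (s, e) with x = |(-1)^s 3^e|_{2^k}."""
    mask = (1 << k) - 1
    emask = (1 << (k - 2)) - 1
    x, sign = 0, 0
    if mode == "EXP":
        prod = 3 if arg & 1 else 1             # 3^(e_0)
        expo = (arg & emask) & ~1              # exponent still to be consumed
    else:
        sign = (arg >> 2) & 1
        x = (-arg) & mask if sign else arg     # normalize to 1 or 3 (mod 8)
        prod, expo = (3, 1) if (x >> 1) & 1 else (1, 0)
    for j in range(3, k):                      # shared core loop
        if mode == "EXP":
            take = (expo >> (j - 2)) & 1       # test a bit of the exponent
        else:
            take = ((prod ^ x) >> j) & 1       # test a bit of the product
        if take:
            prod = (prod + (prod << j)) & mask             # shared shift-and-add
            step = table[j]                                # shared table lookup
            expo = (expo - step if mode == "EXP" else expo + step) & emask
    return prod if mode == "EXP" else (sign, expo)


table = build_table(8)
print(unified("EXP", 5, 8, table), unified("DLG", 243, 8, table))
```

In this sketch the product register, the exponent register, the shifter, the adder, and the table are common to both modes, which is the kind of data path sharing the section describes.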
INTEGER POWERING USING INTERMEDIATE DLS REPRESENTATION
Algorithms for computing the operation $z = x^y$, where $y$ is a positive integer, have been the subject of considerable research. The binary square-and-multiply method determines $x, x^2, x^4, x^8, \ldots$ and processes the bits of $y$ from right to left to multiply by the appropriate binary powers of $x$ to determine $x^y$. This algorithm has been described in many popular texts [10], [11], [12]. Knuth [12] traces this "fast" algorithm back to al-Kashi in the 15th century.
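For reference, the right-to-left square-and-multiply method can be sketched in a few lines of Python. The function name `power_right_to_left` and the optional `modulus` parameter are assumptions made for this example.

```python
def power_right_to_left(x, y, modulus=None):
    """Compute x**y by scanning the bits of y from least to most significant."""
    result, square = 1, x
    while y > 0:
        if y & 1:                       # this bit of y selects the current power
            result *= square
            if modulus is not None:
                result %= modulus
        square *= square                # x, x^2, x^4, x^8, ...
        if modulus is not None:
            square %= modulus
        y >>= 1
    return result


print(power_right_to_left(3, 10), power_right_to_left(3, 10, 1 << 8))
```

Each iteration needs a full-width squaring and possibly a full-width multiplication, which is the cost the DLS-based method of this section aims to avoid.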
We are interested in the particular case of $z = |x^y|_{2^k}$, where $x$, $y$, and the result $z$ are all nonnegative $k$-bit integers. For typical word sizes such as $k = 8, 16, 32, 64, 128$, and so forth, this integer-valued powering operation is proposed to supplement the integer addition and multiplication operations. The squaring algorithm may be implemented in hardware with microcode and a fast multiplier, much like the floating-point transcendental operations in the Pentium and Athlon processors.
For implementation in hardware, there is a need for a simpler algorithm that avoids the use of a large multiplier. There is a further need for a right-to-left digit serial algorithm that requires less time for lower precision operations when a family of precision levels is implemented in hardware.
In this section, we present a novel digit serial algorithm for evaluation of the integer power operation $|x^y|_{2^k}$ that is based on Algorithm 3; hence, it does not require a multiplier. The algorithm employs both conversion and deconversion of $x$ to and from DLS, and bit serial multiplication. The conversion of the input $x$ to DLS is implemented with bit serial multiplication, and the discrete log (converted) value provides the "recoded multiplier bits" with the exponent $y$ being the multiplicand, and bit serial deconversion of $|x^y|_{2^k}$ provides the result $z$. In the following, we first describe the existing "Fast" binary squaring algorithm [10], [11], [12].
Existing "Fast" Binary Squaring Algorithm
The existing fast algorithm is based on the fact that $y = \sum_{i=0}^{k-1} y_i 2^i$, so that we can get the formula

$$x^y = \prod_{i=0}^{k-1} \left(x^{2^i}\right)^{y_i}.$$

As an example, $3^{10} = 3^8 \times 3^2$.

For the DLS-based method, writing $x = |(-1)^s 2^p 3^e|_{2^k}$ gives

$$z = |x^y|_{2^k} = |(-1)^{sy}\, 2^{py}\, 3^{ey}|_{2^k}.$$

In the above formula, $(-1)^{sy}$ determines the sign, and $2^{py}$ determines the number $(py)$ of least significant zeros. For odd numbers, we need to calculate $e \times y$ for the term $3^{ey}$. Then, we can convert $|(-1)^{sy} 3^{ey}|_{2^k}$ back to binary to obtain $z$. Computing the $y$th power of operand $x$ can be done in a serial fashion. That is, we start multiplication and decoding after we obtain the entire value of $e$. A better technique is a pipelined arrangement of the suboperations in which multiplication and decoding start when the first bit of $e$ is available. For every available bit of $e$, a bit of the intermediate product is generated, followed by a bit of $z$ being produced. This method is referred to as the pipelined algorithm and is described in the following algorithm.
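The exponent arithmetic of the DLS-based method can be checked with a short Python sketch. It is a functional model only: the name `dls_power` is hypothetical, a brute-force search stands in for the bit serial conversion, Python's built-in `pow` stands in for the deconversion, and no pipelining is modeled. It assumes $k \ge 3$ and a positive exponent $y$.

```python
def dls_power(x, y, k):
    """Compute |x^y|_{2^k} through the triple (s, p, e); assumes k >= 3, y >= 1."""
    mod = 1 << k
    if x == 0:
        return 0
    p = (x & -x).bit_length() - 1          # low-order zeros give p
    q = x >> p                             # odd factor
    s = (q >> 2) & 1                       # sign: q = 5 or 7 (mod 8)
    r = (-q) % mod if s else q             # r = 1 or 3 (mod 8), a power of 3
    e = next(e for e in range(1 << (k - 2)) if pow(3, e, mod) == r)
    sy = (s * y) & 1                       # exponent of -1
    py = p * y                             # exponent of 2
    ey = (e * y) % (1 << (k - 2))          # exponent of 3, reduced mod 2^(k-2)
    z = pow(3, ey, mod)
    if sy:
        z = (-z) % mod
    return (z << py) % mod if py < k else 0


print(dls_power(14, 3, 8), pow(14, 3, 256))
```

The powering of $x$ is thereby reduced to one multiplication $e \times y$ modulo $2^{k-2}$ plus trivial updates of the sign and of the count of low-order zeros.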
Algorithm 5 Additive Digit Serial Powering ($|x^y|_{2^k}$)
The initialization stage is performed in lines L1-L4. All the required initializations for both stages of the algorithm are performed here. The second stage (L5-L7) actually performs the computation for the next to the last LSB with index $i = 1$. The third stage contains the main iteration step and is represented by lines L8-L20. The third stage can be separated into three substages. In the first substage (i.e., L9-L11), both $t$ and the exponent $e$ are updated, which generates 1 bit of exponent according to the DLG algorithm. The second substage (i.e., L12-L16) corresponds to the accumulator used to implement $e \times y$. The third substage (i.e., L17-L19) updates $z$ according to the EXP algorithm. The final result is obtained at line L20. As can be seen by inspection of the algorithm, the time complexity is essentially $k$ dependent shift-and-add modulo $2^k$ operations.
Hardware Performance Evaluation for Integer Power Operation
In order to evaluate the effectiveness of our method as compared to the well-known "fast" squaring method, we described each method in Verilog RTL. To implement the "fast" squaring method described in Algorithm 4, a counter is employed to control the number of loops, and the values $z$ and $q$ are updated simultaneously. There are three major components in the implementation of the circuit described in Algorithm 5: a controller, a ROM lookup table, and a computation data path. The controller consists of a counter and a state control block implemented as a finite state machine (FSM). The FSM starts and stops the counting procedure. The output of the counter, count, is used for purposes such as address generation for the ROM, index production for the bit checker and loop controller, and feedback to the FSM for state transition. The ROM is used as a lookup table for the $\mathrm{dlg}(\cdot)$ function. The major components in the data path are adders, shifters, and muxes. The muxes are used to control whether the registers holding $p$, $e$, $z$, and $q$ will be updated by the shifter and adder circuits. The modulo operation used in the description of the algorithms is handled by limiting the register size of $p$, $e$, $z$, and $q$. The width of each register containing $p$, $e$, $z$, and $q$ is set to $k$. Thus, while updating $p$, $e$, $z$, and $q$, the result values may be longer than the specified size (causing overflow). This intentional overflow actually implements the modulo $2^k$ operation. The two designs corresponding to Algorithms 4 and 5 are implemented, and both are synthesized using the Synopsys toolset based on a standard cell library from Synopsys [13] and a standard cell library from Oklahoma State University [14]. Since the results from the two standard cell libraries were similar, we only list the result based on the standard cell library from Synopsys. Table 3 compares the results of our Algorithm 5 with the existing fast algorithm (Algorithm 4) for different $k$ values.
The latency of both designs is $k$ since both are bit serial. We also plot the trend of the two algorithms in Figs. 2 (period) and 3 (area). It is seen that for all $k$ values, our algorithm is faster than the existing fast algorithm when each algorithm is synthesized with the standard cell library. Regarding area, our method requires more space for small word sizes, but its area increases slowly compared with the existing fast algorithm. Thus, when $k \ge 64$, our algorithm requires less area. It should be noted that the area values reported here are only the total cell area since we did not route the resulting circuits; additional area and delay required by routing are therefore not included.
As is often the case, for relatively small values of the word size, a specific function can be looked up in a table as opposed to being computed using a functional hardware implementation. This allows for speeding up computations at the expense of area. In Section 5, we present the complementary part of the functional algorithms in Section 3 by introducing DLS-related lookup structures and efficient compression methods.
LOOKUP STRUCTURE-BASED METHOD
Integer functions that are determined modulo $2^k$ for $k$-bit word results generally have properties allowing much smaller tables for exhaustive storage of $k$-bits-in, $k$-bits-out function evaluation than corresponding real-valued $k$-bit functions. Four properties of integer functions are identified, which, when used in combination, allow for significant reduction in size of lookup tables for integer functions adhering to these properties (for 16-bit arguments, the advantages for integer function lookup can be as large as 32-to-1 or even 64-to-1, allowing 5 or 6 more index bits for comparable table size):
1. Inheritance principle. Briefly, this principle states that the low-order $k$ bits of the result depend only on the low-order $k$ bits of the integer argument for all $k$. In practice, the inheritance principle for integer functions means that a $k$-bits-in, $k$-bits-out lookup table can be reduced from a generic $k \times 2^k$ bits ROM table to a lookup tree structure of size $2 \times 2^k$ bits. This reduces table size by a factor of $k/2$ (e.g., reduction to 1/8 the size for 16-bit integers).
2. One-to-one correspondence. This property holds when distinct $k$-bit inputs have distinct $k$-bit outputs. This property holds for multiplicative inverse and the discrete log of odd integers and is extended to a discrete log encoding of all $k$-bit integers as illustrated in Section 2. With the inheritance principle, this property allows pre- and postprocessing logic to reduce the table size by another half.
3. Normalization (separating odd and even factors). Employing a right-normalized binary integer representation, $x = 2^p q$ (where $q$ is the odd factor and $2^p$ is the even-power factor), integer functions can often be determined in a separable fashion by applying table lookup to the argument's odd factor followed by function-specific postprocessing responsive to the even-power factor.
4. Conditional complementation. This property states that the result of the operation on the conditional 2's complement of the input is the conditional 2's complement of the output. Conditional complementation often applies only to selected bits of the odd factor of the normalized integer argument. When applicable, this allows one half or more further table size reduction.

To fully benefit from the implicit compression provided by exploitation of these properties, new table lookup architectures are developed. The functionality of these architectures is easily described through the concept of "lookup trees."
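As a worked check of the sizes quoted for properties 1 and 2, the arithmetic for $k = 16$ can be written out as follows. Only the first two reduction steps are shown, since the text gives exact factors for these; the Kbyte conversions assume 8 bits per byte and 1 Kbyte = $2^{10}$ bytes.

```latex
\begin{align*}
\text{generic ROM:} \quad & k \times 2^k = 16 \times 2^{16}\ \text{bits} = 2^{20}\ \text{bits} = 128\ \text{Kbytes},\\
\text{lookup tree (property 1):} \quad & 2 \times 2^k = 2^{17}\ \text{bits} = 16\ \text{Kbytes} = \tfrac{1}{8} \times 128\ \text{Kbytes},\\
\text{one-to-one (property 2):} \quad & \tfrac{1}{2} \times 2^{17}\ \text{bits} = 2^{16}\ \text{bits} = 8\ \text{Kbytes}.
\end{align*}
```

The 8 Kbyte figure matches the upper end of the 2-8 Kbyte range stated in the abstract, and a 64-to-1 overall reduction corresponds to $128/64 = 2$ Kbytes at the lower end.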
Lookup trees were introduced with regard to the multiplicative inverse function for odd integers modulo $2^k$ in [5]. Properties 1, 2, and 4 described above were shown to result in substantial table size reduction, but a method and an architecture for efficient lookup were left open. The integer square function satisfies the inheritance principle, with argument normalization and appropriate conditional complementation further reducing the size of the lookup tree. In Section 2, we showed a preferred encoding allowing the discrete logarithm to satisfy and benefit from all four preceding integer function properties.
Table Lookup Architecture
Table lookup allows for direct conversions between binary and DLS encodings, resulting in fast performance. Fig. 4 shows the generic table lookup architecture similar to that described in [15]; however, the originality of our approach is the structure of the pre- and postprocessing logic that results from the concept of the lookup tree described in [5]. In this section, we focus in detail on the example of binary-to-DLS conversion, although the methods pertain similarly to DLS-to-binary conversion. The hardware comprises three major components: the preprocess block, the postprocess block, and a ROM. The preprocess block produces the ROM address based on the input operand. After the data in the ROM is read, the postprocess block selects the correct bit fields and performs some additional processing, such as selective complementation. Two schemes for table lookup are compared here. One scheme uses a larger table supplemented by postprocess logic, while the other uses a smaller table with both preprocess and postprocess logic.
Direct Lookup with Unnormalized Table Index
For the unnormalized index larger sized table implementation, we only exploit the hereditary and one-to-one mapping properties of binary-to-DLS conversion. Due to the one-to-one property, only left children of the lookup tree need to be stored. No preprocessing is required before table lookup occurs. For postprocessing, conditional complementation is required on the table output value with the input value since only left children values are stored in the table. The circuit structure and then the hardware implementation are discussed next.
The ROM structure and select logic are shown in Fig. 5. The ROM is equivalent to three-level trees. The first level forms 256 rows, where the low 8 bits ([a0 : a7]) are used as address bits. In the second level, four subtrees between levels 8 and 9 are formed as 4 bytes. [a8 : a9] are used to select one of the 4 bytes. After the byte is determined, [a10] and [a10 : a11] are each used to select 1 bit from the byte, while the other 2 bits are extracted directly without selection. Therefore, a total of 4 bits is extracted from the selected byte. In the third level, there are 32 subtrees between level 8 and level 12 formed as 32 7-bit fields. [a8 : a12] are used to select one of the 32 7-bit fields. [a13] and [a13 : a14] are each used to select 1 bit from the selected field, while the single rightmost bit is extracted directly without selection. Therefore, a total of 3 bits is extracted from the selected 7-bit field. Finally, a 15-bit output is produced from the select logic.

The postprocessing logic for this unnormalized index table lookup scheme is simple. Since we only store left children, only 15 bits are extracted from the ROM. A one is padded to the LSB position to produce a 16-bit output. Also, 16 2-bit-input XOR gates serve as conditional complement logic, where the corresponding bits from the result of the padding and from the input are connected to the inputs of the XOR gates.
Direct Lookup with Normalized Table Index
The ROM table size may be reduced by utilizing more properties of our discrete log encoding. For normalized binary-to-DLS conversion, the inheritance principle, one-to-one mapping property, normalization to odd factor argument, and conditional complementation [5] are used.
Preprocessing consists of even-power and sign bit extraction. Normalization is used to produce the p field of the DLS triple. It is accomplished by shifting to the right and counting the number of trailing zeros. In the worst case, 16 shifts are required. A divide and conquer approach is adopted in our implementation. We first shift 8 bits to the right to check whether the lowest one bit is in the lower 8 bits or the higher 8 bits. Next, we shift 4 bits to the right within the 8-bit field selected in the previous step to check whether it is in the lower 4 bits or the higher 4 bits. This procedure continues until the binary exponent p of the operand is obtained. The other operation is sign extraction. The sign bit is the third bit (a2) of the normalized operand. If the sign bit is asserted, the normalized operand must be complemented. Because of normalization (the operand is odd, so a0 is not needed) and sign symmetry (the sign bit a2 is removed), the index for the address and select logic in the next step is formed as [a'1 a'3 : a'14] after conditional complementation. The ROM structure and select logic are shown in Fig. 6. The ROM is equivalent to three-level trees. The first level forms 128 rows, where the low 7 bits ([a'1 a'3 : a'8]) are used as address bits. In the second level, subtrees between levels 7 and 8 are represented as a 6-bit field. [a'9] and [a'9 : a'10] are used to select 1 bit from the selected field, respectively. Therefore, a total of 2 bits is extracted from the 6-bit field. In the third level, 16 subtrees between level 7 and level 10 are formed as 16 bytes. [a'9 : a'12] are used to select one of the 16 bytes. [a'13] and [a'13 : a'14] are used to select 2 bits in total from the selected byte, while the other 2 bits are extracted directly without selection. Therefore, a total of 4 bits is extracted from the selected byte. Finally, a 13-bit output is formed from the select logic.
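The preprocessing steps described above can be sketched in software as follows. This is a minimal illustration, not the circuit itself: the function names are hypothetical, the zero operand is excluded, and conditional complementation is modeled as negation modulo 2^16, which for an odd operand coincides with the bitwise complement of bits a1 through a15. That modeling choice is an assumption about the complement logic of [5].

```python
K = 16
MASK = (1 << K) - 1

def trailing_zeros16(x):
    """Count trailing zeros of a nonzero 16-bit value by halving: 8, 4, 2, 1."""
    assert 0 < x <= MASK
    p = 0
    for width in (8, 4, 2, 1):
        if x & ((1 << width) - 1) == 0:  # low `width` bits are all zero
            x >>= width
            p += width
    return p

def preprocess(x):
    """Return (p, s, index) for a nonzero 16-bit operand x."""
    p = trailing_zeros16(x)
    odd = x >> p                 # normalized (odd) operand
    s = (odd >> 2) & 1           # sign bit is a2 of the normalized operand
    if s:
        odd = (-odd) & MASK      # assumed model of conditional complementation
    # Index drops a'0 (always 1) and a'2 (always 0 after complementation):
    # bit a'1 followed by bits a'3 .. a'14, 13 bits in total.
    index = ((odd >> 1) & 1) | (((odd >> 3) & 0xFFF) << 1)
    return p, s, index
```

For example, `preprocess(40)` finds p = 3 because 40 = 2^3 * 5, and s = 1 because the odd factor 5 has bit a2 set. The binary search needs four conditional shifts rather than up to 16 single-bit shifts.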
Postprocessing for the smaller table lookup scheme with a normalized index is more complex than for the larger table approach. Since normalization is performed in the preprocessing circuitry, denormalization is necessary. All bits whose index is less than the power of the original operand are padded with zeros, while all bits whose index is larger than this power are filled with lookup values. Sixteen 2-bit-input XOR gates are used for conditional complementation as described previously.
Performance Evaluation
We described the circuits shown in Figs. 5 and 6 in Verilog and processed them with the Synopsys toolset (Design Compiler and Physical Compiler), using a standard cell library obtained from the Synopsys tutorial files [13]. Table 4 shows the comparison between the two schemes for direct lookup table conversion for k = 16. The ROM size is given in kilobytes. The core area is the area of the standard cell implementation for all logic other than the ROM. Both circuits have the same minimal clock period of 1.7 ns, but the larger table implementation requires one less cycle for postprocessing. Due to the extra processing before and after accessing the ROM, the normalized version of the circuit requires three clock periods of latency versus the two required for the unnormalized version; however, its ROM is only 27 percent as large.
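As a rough cross-check of the relative ROM sizes, the bit counts can be tallied from the row and field structure described for Figs. 5 and 6. This is only a sketch under stated assumptions: the widths of the first-level fields (8 bits and 7 bits per row) are inferred from the 15-bit and 13-bit outputs minus the bits extracted at the later levels, and are not stated explicitly in the text. The figures reported in Table 4 remain authoritative.

```python
def rom_bits(rows, fields_per_row):
    """Total ROM bits: rows times the sum of (count * width) over the row's fields."""
    return rows * sum(count * width for count, width in fields_per_row)

# Unnormalized index: 256 rows, each with an (assumed) 8-bit first-level field,
# 4 bytes for the second level, and 32 7-bit fields for the third level.
unnormalized = rom_bits(256, [(1, 8), (4, 8), (32, 7)])

# Normalized index: 128 rows, each with an (assumed) 7-bit first-level field,
# one 6-bit field for the second level, and 16 bytes for the third level.
normalized = rom_bits(128, [(1, 7), (1, 6), (16, 8)])

print(unnormalized // 8)                          # 8448 bytes
print(normalized // 8)                            # 2256 bytes
print(round(100 * normalized / unnormalized))     # 27 percent
```

Under these assumptions the tally gives roughly 8 Kbytes and 2 Kbytes, in line with the 27 percent ratio quoted above.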
Extension of Lookup Structure
In this section, we have investigated standard cell implementations of a novel table lookup procedure for binary-to-discrete log conversion. This method is equally applicable to realizing any integer function satisfying the inheritance principle that can then be described with a "treelike" lookup table structure.
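A treelike table for a generic function can be sketched as follows, taking squaring modulo 2^k as the example. The sketch assumes that the inheritance principle means output bit j depends only on input bits 0 through j, which holds for squaring modulo a power of two. The function names and the small word size k = 8 are illustrative choices, and only this one property is exploited here, not the further properties used in the architecture above.

```python
K = 8  # small word size so the tables are easy to enumerate

def square_mod(x, k=K):
    """Squaring modulo 2^k: keep the low k bits of x * x."""
    return (x * x) & ((1 << k) - 1)

def build_tree_table(f, k=K):
    """table[j][low] holds output bit j, indexed by the low j+1 input bits."""
    return [[(f(low, k) >> j) & 1 for low in range(1 << (j + 1))]
            for j in range(k)]

def tree_lookup(table, x, k=K):
    """Reassemble f(x) one output bit at a time from the treelike table."""
    y = 0
    for j in range(k):
        low = x & ((1 << (j + 1)) - 1)   # only input bits 0..j are consulted
        y |= table[j][low] << j
    return y

table = build_tree_table(square_mod)
tree_bits = sum(len(level) for level in table)   # 2^(k+1) - 2 stored bits
flat_bits = K * (1 << K)                         # k bits for each of 2^k inputs
```

For k = 8 the treelike table holds 510 bits against 2,048 bits for the flat table, and the gap widens as k grows.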
The distinction between real and integer arithmetic in an ALU is conveniently described with reference to the multiplication of two k-bit integer operands. The exact product fits in a 2k-bit field. The real (e.g., floating point) result typically provides a normalized high-order (approximate) part with the low part rounded off. The integer result is the k-bit low-order part providing an exact result in a modular system with the modulus determined by the word size implicitly truncating the high-order part.
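The split of the exact product into a high-order and a low-order part can be illustrated with a short sketch; the function name and the choice k = 16 are assumptions for illustration only.

```python
def split_product(a, b, k=16):
    """Return the (high, low) k-bit halves of the exact 2k-bit product a * b."""
    assert 0 <= a < (1 << k) and 0 <= b < (1 << k)
    full = a * b
    return full >> k, full & ((1 << k) - 1)

# 0xFFFF represents -1 modulo 2^16, so the low half of the product is (-1) * (-1) = 1.
high, low = split_product(0xFFFF, 0xFFFF)
print(hex(high), hex(low))   # 0xfffe 0x1
```

The low half is the exact integer result modulo 2^k, whereas a real valued result would keep a rounded version of the high-order part.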
Integer functions that are determined modulo 2^k for k-bit word results generally have properties allowing much smaller tables for exhaustive storage of k-bits-in, k-bits-out function evaluation than corresponding real valued k-bit functions. The four properties of integer functions identified in this section, used in combination, can fundamentally redefine and reduce the size of lookup tables for exhaustive storage. For 16-bit arguments, the advantages for integer function lookup can be as large as 32-to-1 or even 64-to-1, allowing 5 or 6 more index bits for comparable table size.
Our investigation indicates that this table lookup procedure is practical and allows for significant reductions in table size.
CONCLUSION
We have presented an alternative representation for the k-bit integers based on a discrete logarithm representation and introduced a novel DLS encoding with a one-to-one mapping between binary-encoded and DLS-encoded k-bit strings. The mapping is shown to be implementable using a new unified conversion/deconversion algorithm that employs just O(k) additions with references to a conversion lookup table having just k entries, allowing scalable implementations for k up to 128 bits or more. To illustrate the use of DLS representation, we provided a novel pipelined bit serial integer power operation for 
