Introduction
This paper investigates new table lookup architectures that extend the range of options for table-assisted computation when optimizing an ALU design. The focus is on integer function evaluation, where recent results have identified the need for new lookup architectures to exploit the potential savings available.
The distinction between real and integer arithmetic in an ALU is conveniently described with reference to the multiplication of two k-bit integer operands. The exact product fits in a 2k-bit field. The real (e.g. floating point) result typically provides a normalized high order (approximate) part with the low part rounded off. The integer result is the k-bit low order part providing an exact result in a modular system with the modulus determined by the word size implicitly truncating the high order part.
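The split between the two halves of the product can be illustrated with a minimal sketch. The function name `split_product` and the sample operands are illustrative assumptions, not taken from the paper; the high half stands in only loosely for the real-valued result, since normalization and rounding are omitted.

```python
from typing import Tuple

def split_product(a: int, b: int, k: int) -> Tuple[int, int]:
    """Return the (high, low) k-bit halves of the exact product of two k-bit operands."""
    mask = (1 << k) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError("operands must fit in k bits")
    product = a * b                      # exact product, fits in a 2k-bit field
    high = product >> k                  # high-order part (what a real-valued result approximates)
    low = product & mask                 # integer result: the product modulo 2**k
    return high, low

# Illustrative 16-bit operands (hypothetical values).
high, low = split_product(0xBEEF, 0x1234, 16)
```

The low half equals `(a * b) % 2**k`, which is the modular integer result described above.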
Integer functions determined modulo 2^k for k-bit word results generally have properties allowing much smaller tables for exhaustive storage of k-bits-in, k-bits-out function evaluation than corresponding real-valued k-bit functions. The following four properties of integer functions have recently been identified as combining to fundamentally redefine and reduce the size of lookup tables for exhaustive storage. For 16-bit arguments, the advantage for integer function lookup can be as large as 32 to 1 or even 64 to 1, allowing 5 or 6 more index bits for comparable table size.
(1) Inheritance principle: Briefly, this principle states that the low order k bits of the result depend only on the low order k bits of the integer argument, for all k. In practice the inheritance principle for integer functions means that a k-bits-in, k-bits-out lookup [...]
Lookup trees were introduced for the multiplicative inverse function for odd integers modulo 2^k in [6]. The preceding properties (1), (2), and (4) were shown to result in substantial table size reduction, but a method and architecture for efficient lookup was left open. The integer square function satisfies the inheritance principle, with argument normalization and appropriate conditional complementation further reducing the size of the lookup tree. In Section 2, we summarize the needed background on the integer discrete-log binary representation and a preferred encoding that allows the discrete logarithm to satisfy and benefit from all four preceding integer function properties.
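The inheritance principle can be checked by brute force for small word sizes. The sketch below is only an illustration under assumed names (`satisfies_inheritance`, `K`, and the three sample functions are hypothetical); it tests the square and the multiplicative inverse modulo 2^K, and contrasts them with the high half of the square, which does not have the property.

```python
def satisfies_inheritance(f, k_max, domain):
    """True if, for every k <= k_max, the low k bits of f(n) depend only on the low k bits of n."""
    for k in range(1, k_max + 1):
        mask = (1 << k) - 1
        for n in domain:
            if f(n) & mask != f(n & mask) & mask:
                return False
    return True

K = 8                                  # small word size so the exhaustive check is fast
all_args = range(1 << K)
odd_args = range(1, 1 << K, 2)         # the inverse exists only for odd arguments

def square(n):
    return (n * n) % (1 << K)

def inverse(n):
    return pow(n, -1, 1 << K)          # modular inverse; requires Python 3.8 or later

def high_half_of_square(n):
    return (n * n) >> K                # high-order part, not a modular integer function

print(satisfies_inheritance(square, K, all_args))               # True
print(satisfies_inheritance(inverse, K, odd_args))              # True
print(satisfies_inheritance(high_half_of_square, K, all_args))  # False
```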
Our focus shifts in Section 3 to the main issue of presenting an efficient architecture for implementing lookup trees for integer functions. Employing the discrete log function for illustration, our principal result shows how a rectangular row-by-column bit array similar to a ROM can be designed to store a lookup tree, with the details realized in the novel selection architecture for extracting and concatenating the bits of the result into the output register. Unnormalized and normalized argument versions, employing various amounts of pre- and post-processing logic to effect table size reduction, are described. Section 4 describes comparative results of standard cell implementations of the two versions and information about the cell library employed, and Section 5 provides a brief conclusion.
Normalization and Discrete Log Encoding
Any positive integer has a unique factorization into odd and even-power factors, n = i × 2^p, which provides a right-normalized format for binary integer representation. For integers in the "k-bit" range [...] [7]. In this paper we present two direct table lookup conversion solutions applicable for precisions up to 16 bits. Our solutions resolve the encoding question in a manner that facilitates reducing the size of the table lookup structure with a novel lookup architecture.
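As a small illustration of the factorization n = i × 2^p, the following sketch separates a positive integer into its odd factor and its power of two. The function name `normalize` is an assumption made for this example.

```python
from typing import Tuple

def normalize(n: int) -> Tuple[int, int]:
    """Return (i, p) with n == i * 2**p and i odd, for a positive integer n."""
    if n <= 0:
        raise ValueError("n must be positive")
    lowest_set_bit = n & -n                 # isolates the lowest 1 bit, i.e. 2**p
    p = lowest_set_bit.bit_length() - 1
    return n >> p, p

i, p = normalize(40)                        # 40 = 5 * 2**3, so (i, p) == (5, 3)
```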
The encoding employs variable-length fields and provides a one-to-one "hereditary" mapping between k-bit discrete-log numbers and k-bit unsigned binary integers. Details and proofs are beyond the scope of this paper. We utilize example tables and figures to illustrate the encoding and some of its significant properties. The one-to-one mapping between 5-bit discrete-log numbers comprising a 5-bit DLS and the 5-bit integers is given in Table 1, with associated lookup trees illustrated in Figures 1A and 1B. Binary-to-DLS conversion in Figure 1A occurs by reading down, with edge direction determined by the 5-bit odd integer string read right-to-left. The DLS output string is obtained (right-to-left) from the bits extracted from the vertices on the downward path. For the integer 13, employing the binary value as input in Figure 1A [...]
The one-to-one and conditional complementation properties hold for these conversions and are evident as symmetries in the lookup trees. With elementary pre- and post-processing logic, these properties can be employed to reduce lookup tree size for 16-bit DLS-to-binary and binary-to-DLS conversions to 2 Kbytes each.
Hardware Implementation
Figure 2 shows the generic table lookup architecture. In this section, we focus in detail on the example of binary-to-DLS conversion, although the methods pertain similarly to DLS-to-binary conversion, the modular multiplicative inverse, and the square function. The hardware comprises three major components: the pre-process block, the post-process block, and a ROM. The pre-process block produces the ROM address based on the input operand. After the data in the ROM is read, the post-process block selects the correct bit fields and performs some additional processing, such as selective complementation. Two schemes for table lookup are compared here. One scheme uses a larger table supplemented by post-process logic, while the other uses a smaller table with both pre-process and post-process logic.

Figure 2. Table lookup architecture
Direct lookup: Unnormalized table index
For the larger-table implementation with an unnormalized index, we exploit only the hereditary and one-to-one mapping properties of binary-to-DLS conversion. Due to the one-to-one property, only left children of the lookup tree need be stored. No pre-processing is required before table lookup occurs. For post-processing, conditional complementation of the table output value with the input value is required, since only left children values are stored in the table. The circuit structure and then the hardware implementation are discussed next.
The ROM structure and select logic are shown in Figure 3. The ROM is equivalent to 3-level trees. The first level forms 256 rows where the low 8 bits ([a0:a7]) are used as address bits. In the second level, four sub-trees between levels 8 and 9 are formed as four bytes. [a8:a9] are used to select one of the four bytes. After the byte is determined, [a10] and [a10:a11] are each used to select one bit from the byte, while the other two bits are extracted directly without selection. Therefore, a total of 4 bits are extracted from the selected byte. In the third level, there are 32 sub-trees between level 8 and level 12 formed as 32 7-bit fields. [a8:a12] are used to select one of the 32 7-bit fields. [a13] and [a13:a14] are each used to select one bit from the selected field, while the single rightmost bit is extracted directly without selection. Therefore, a total of 3 bits are extracted from the selected 7-bit field. Finally, a 15-bit output is produced from the select logic.
The post-processing logic for this unnormalized index table lookup scheme is simple. Since we only store left children, only 15 bits are extracted from the ROM. A one is padded at the Least Significant Bit (LSB) position to produce a 16-bit output. In addition, sixteen two-input XOR gates serve as conditional complement logic, where each gate takes as inputs the corresponding bit of the padded result and of the input.
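The bit selection within one field can be sketched in software as follows. This is a minimal model under stated assumptions, not the circuit of Figure 3: the function name `extract_bits`, the placement of the directly extracted bits at the front of the field, and the ordering of the two select bits within the 4-bit group are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

def extract_bits(field: Sequence[int], n_direct: int, s_first: int, s_second: int) -> List[int]:
    """Model of the select logic for one field of a ROM row.

    Assumed layout (hypothetical): n_direct pass-through bits, then a 2-bit
    group from which s_first selects one bit, then a 4-bit group from which
    the pair (s_first, s_second) selects one bit.
    """
    if len(field) != n_direct + 6:
        raise ValueError("field length must be n_direct + 6")
    direct = list(field[:n_direct])
    pair = field[n_direct:n_direct + 2]
    quad = field[n_direct + 2:n_direct + 6]
    return direct + [pair[s_first], quad[2 * s_first + s_second]]

# A byte yields 4 bits (two direct), selected by a10 and a11 in the unnormalized scheme.
byte_bits = extract_bits([1, 0, 0, 1, 0, 0, 1, 0], n_direct=2, s_first=1, s_second=0)

# A 7-bit field yields 3 bits (one direct), selected by a13 and a14.
field_bits = extract_bits([1, 1, 0, 0, 1, 0, 0], n_direct=1, s_first=0, s_second=1)
```

Under these assumptions the two calls together return 7 bits, matching the 4 bits and 3 bits extracted from the second and third levels in the description above.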
Direct lookup: Normalized table index
The ROM table size may be reduced by utilizing more properties of our discrete-log encoding. For normalized binary-to-DLS conversion, the inheritance principle, one-to-one mapping property, normalization to odd factor argument, and conditional complementation [6] are used.
Pre-processing consists of even-power and sign bit extraction. Normalization is used to produce the even-power field of the DLS triple. It is accomplished by shifting right and counting the number of trailing zeros. In the worst case, 16 shifts are required, so a divide and conquer approach is adopted in our implementation. We first examine an 8-bit shift to determine whether the lowest one bit lies in the lower 8 bits or the higher 8 bits. Next, we examine a 4-bit shift of the 8-bit field selected in the previous step to determine whether the lowest one bit lies in its lower 4 bits or its higher 4 bits. This procedure continues until the binary exponent p of the operand is obtained. The other operation is sign extraction. The sign bit is the third bit of the normalized operand. If the sign bit is asserted, the normalized operand must be conditionally complemented. Because the normalized operand is odd, its least significant bit is always one and need not be used, and the sign bit is removed by sign symmetry, so the index for the address and select logic in the next step is formed from the remaining bits of the normalized operand.
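The divide and conquer normalization can be sketched as four halving stages (8, 4, 2, and 1 bits) in place of up to 16 single-bit shifts. The names `trailing_zeros_16` and `preprocess` are assumptions for this example, the sign bit is taken as bit index 2 (the third bit counting from the least significant bit), and the subsequent conditional complementation is deliberately left out because its exact form is not specified here.

```python
from typing import Tuple

def trailing_zeros_16(x: int) -> int:
    """Count trailing zeros of a nonzero 16-bit value in four halving stages."""
    if not 0 < x < (1 << 16):
        raise ValueError("x must be a nonzero 16-bit value")
    p = 0
    for width in (8, 4, 2, 1):
        if x & ((1 << width) - 1) == 0:   # lowest one bit lies in the upper half
            x >>= width                    # discard the all-zero lower half
            p += width
    return p

def preprocess(x: int) -> Tuple[int, int, int]:
    """Return (odd_part, p, sign) for a nonzero 16-bit operand."""
    p = trailing_zeros_16(x)
    odd_part = x >> p                      # normalized (odd) operand
    sign = (odd_part >> 2) & 1             # assumed: third bit, counting from the LSB
    return odd_part, p, sign

print(preprocess(40))                      # (5, 3, 1): 40 = 0b101 * 2**3
print(preprocess(12))                      # (3, 2, 0): 12 = 0b011 * 2**2
```

Each stage needs only a zero test on a fixed-width field and a fixed shift, so the exponent p is available after four stages regardless of the operand.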
The ROM structure and select logic are shown in Figure 4. The ROM is equivalent to 3-level trees. The first level forms 128 rows where the low 7 bits ([a'1, a'3:a'8]) are used as address bits. In the second level, sub-trees between levels 7 and 8 are represented as a 6-bit field. [a'9] and [a'9:a'10] are each used to select one bit from the field. Therefore, a total of 2 bits are extracted from the 6-bit field. In the third level, 16 sub-trees between level 7 and level 10 are formed as 16 bytes. [a'9:a'12] are used to select one of the 16 bytes. [a'13] and [a'13:a'14] are each used to select one bit from the selected byte, while the other two bits are extracted directly without selection. Therefore, a total of 4 bits are extracted from the selected byte. Finally, a 13-bit output is formed from the select logic.
Post-processing for the normalized-index, smaller-table lookup scheme is more complex than for the larger-table approach. Since normalization is performed in the pre-processing circuitry, denormalization is necessary. All bits whose index is less than the power of the original operand are padded with zeros, while all bits whose index is larger than this power are filled with lookup values. Sixteen two-input XOR gates are used for conditional complementation as described previously.
Experimental Results
We described the circuits shown in Figures 3 and 4 in Verilog and implemented them using the tool set (Design Compiler and Physical Compiler) with a standard cell library obtained from the Synopsys tutorial files [5]. Table 2 compares the two schemes for direct lookup table conversion for k = 16. The ROM size is given in KB. The core area is the area of the standard cell implementation of all logic other than the ROM. Both circuits have the same minimal clock period of 1.7 ns, but the larger-table implementation requires one fewer cycle for post-processing. Due to the extra processing before and after accessing the ROM, the normalized version of the circuit requires 3 clock periods of latency versus the 2 required for the unnormalized version; however, its ROM is only 27% as large.
Conclusion
In this paper we have investigated standard cell implementations of a new table lookup procedure for binary-to-discrete-log conversion. This method is equally applicable to realizing any integer function that satisfies the inheritance principle and can be described with a "tree-like" lookup table structure. Our investigation indicates that this table lookup procedure is practical and allows for significant reductions in table size.
