Abstract: This paper presents an approach to area optimization of arithmetic datapaths that perform polynomial computations over bit-vectors with finite widths. Examples of such designs abound in DSP for audio, video and multimedia computations where the input and output bit-vector sizes are dictated by the desired precision. A bit-vector of size m represents integer values reduced modulo 2^m (% 2^m). Therefore, finite word-length bit-vector arithmetic can be modeled as algebra over finite integer rings, where the bit-vector size dictates the ring cardinality. This paper demonstrates how the number-theoretic properties of finite integer rings can be exploited for optimization of bit-vector arithmetic. Along with an analytical model to estimate the implementation cost at RTL, two algorithms are presented to optimize bit-vector arithmetic. Experimental results, conducted within practical CAD settings, demonstrate significant area savings due to our approach.
I. Introduction
RTL descriptions of integer datapaths that implement polynomial arithmetic are found in many practical applications, such as in digital signal processing (DSP) for audio, video and multimedia applications [1] [2]. Such designs perform a sequence of add, mult and shift type algebraic computations over bit-vectors; hence they are generally modeled at RTL or behavioural level as multi-variate polynomials of finite degree [2] [3]. Initial algorithmic specifications (such as a MATLAB model) of such systems represent data using floating-point formats. However, they are often implemented with fixed-point architectures in order to optimize the area, delay and power related costs of the implementation [4]. The fixed-point model can then be translated into an RTL description [5] and subsequently synthesized into a circuit.
Algebraic techniques and tools have been used for synthesis and optimization of such systems. However, for their efficient and correct modeling, it is important to account for the effect of bit-vector size of the operands on the resulting computation. Specifically, a bit-vector of size m represents integer values from 0 to 2^m − 1 (or integers reduced modulo 2^m). This implies that finite word-length (m) bit-vector arithmetic manifests itself as algebra over finite integer rings (Z_{2^m}). Properties of such finite rings should therefore be exploited for RTL optimization of bit-vector arithmetic.
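As a small illustration of this wrap-around behaviour, the following sketch reduces integer results modulo 2^m, as an m-bit datapath would. The helper name `to_bitvector` is hypothetical and chosen only for this example.

```python
def to_bitvector(value: int, m: int) -> int:
    """Return value reduced modulo 2**m, i.e. the result an m-bit vector would hold."""
    return value % (1 << m)


# An 8-bit adder and an 8-bit multiplier silently discard the upper bits.
sum_8bit = to_bitvector(200 + 100, 8)    # 300 reduced modulo 256
prod_8bit = to_bitvector(16 * 16, 8)     # 256 reduced modulo 256
all_ones_4bit = to_bitvector(-1, 4)      # -1 maps to the all-ones 4-bit pattern
print(sum_8bit, prod_8bit, all_ones_4bit)
```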
This paper models finite word-length bit-vector arithmetic as polynomial functions (or polyfunctions) f : Z_{2^{n_1}} × Z_{2^{n_2}} × · · · × Z_{2^{n_d}} → Z_{2^m}. Here, (n_1, n_2, ..., n_d) are the sizes of the input bit-vectors (x_1, x_2, ..., x_d), and m is the size of the output bit-vector f. In other words, the computation is modeled as a multi-variate polynomial f(x_1, ..., x_d) % 2^m, where each x_i ∈ Z_{2^{n_i}} and f is computed % 2^m. Properties of such polyfunctions have been analyzed in [6]. Over such finite rings, polynomials with different degrees and coefficients (i.e. computations with different costs) can become computationally (bit-true) equivalent. Using a cost model, along with number-theoretic properties of such rings, the paper presents two algorithms for optimization of finite word-length bit-vector arithmetic.
* This work has been supported in part by the following grants: 1) CAREER award, NSF CCF-546859; and 2) NSF CCF-515010; and Georgia State University (GSU) Research Initiation Grant.
Motivation: Consider a computation with three inputs:
II. Related Work
Lately, there has been increasing interest in exploring the use of algebraic manipulation for RTL synthesis of arithmetic datapaths. The works of [7] [8] derive new polynomial models of complex computational blocks for efficient synthesis. In [2], symbolic computer algebra tools are used to search for a decomposition of a given polynomial according to available library elements using a Gröbner-basis-based approach. However, the derived polynomial models represent the computations over the fields of reals (R), fractions (Q) or over the integral domain (Z), collectively called the unique factorization domains (UFDs). This often results in a polynomial approximation [3], without properly accounting for the effect of bit-vector size on the resulting computation. While the work of [9] does account for the datapath size for allocation, it operates directly on the original (given) arithmetic expression, thus limiting the degree of freedom in searching for a better implementation.
Finite rings of the type Z_{2^m} are non-UFDs, due to the presence of nilpotent elements. (An element x of a ring is nilpotent if x^n = 0 for some positive integer n.) Unfortunately, this disallows the use of fundamental computer algebra results on Euclidean division and factorization over non-UFDs. As a result, contemporary (algebra-based) high-level synthesis frameworks are limited in their capability to employ sophisticated algebraic manipulations to reduce the cost of the implementation [2].
Other algebraic transforms have also been explored for efficient hardware synthesis: factorization and common subexpression elimination [10] [11] , exploiting the structure of arithmetic circuits [12] , term re-writing [13] , etc. However, these techniques also overlook the effect of bit-vector size on the given computation.
Note that our approach does not preclude some of the above-mentioned synthesis procedures [11] [10] [9]; it can be combined with these approaches as an additional optimization step. Modulo arithmetic has been applied to the task of circuit/RTL verification [14]. The concept of polynomial functions over finite rings has also been applied to the equivalence verification of arithmetic datapaths in [15] [16]. This paper demonstrates its application to optimization of arithmetic datapaths.
1-4244-0630-7/07/$20.00 ©2007 IEEE.
5C-1
The following two sections give the theoretical foundation required to derive algorithmic solutions to the problem addressed in this paper. The proofs of some concepts (lemmas and theorems) are provided in [6] ; and hence not reproduced here.
III. Preliminary Concepts
Z corresponds to the set of integers, Z + to the set of non-negative integers and Zn to the finite set of integers {0, 1, . . . , n − 1}. [6] defines the corresponding polyfunction as follows:
It is possible for a polynomial with non-zero coefficients to vanish on such mappings; in which case the polynomial represents a nil polyfunction or a vanishing polynomial (Refer to column 2 in table I).
Henceforth, polynomial addition and multiplication are performed % n (n = 2^m). Also, we use the multi-index notation: k = <k_1, k_2, ..., k_d> are the (non-negative) degrees corresponding to the d input variables x = <x_1, x_2, ..., x_d>, respectively.
IV. Identifying Vanishing Polynomials
We analyze the univariate polynomials that vanish on Z2m [x] (for didactic purposes) and then extend the results to vanishing polynomials from
Number Theory Perspective: According to a fundamental result in number theory, for any n ∈ N, n! divides the product of any n consecutive integers. For example, 4! divides the product of the 4 consecutive numbers 99 × 100 × 101 × 102. Consequently, it is possible to find the least k ∈ N such that n divides k! (denoted n | k!). We denote this value as k = SF(n). In the ring Z_{2^m}, let SF(2^m) = k, such that 2^m | k!. For example, SF(2^3) = 4, as 8 divides 4! but 8 does not divide 3!; hence, the least k is 4.
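The quantity SF(n) can be computed directly from its definition. The sketch below is a brute-force illustration only (the function name `sf` is hypothetical); it simply grows k! until n divides it and makes no attempt at an efficient method.

```python
def sf(n: int) -> int:
    """Least k >= 1 such that n divides k! (brute force, straight from the definition)."""
    k, fact = 1, 1
    while fact % n != 0:
        k += 1
        fact *= k          # fact now equals k!
    return k


# SF(2^m) for a few output widths m
for m in range(1, 6):
    print(m, sf(2 ** m))
```

For m = 3 this gives SF(8) = 4, matching the example in the text.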
This property can be utilized to interpret the concept of a vanishing polynomial as a divisibility issue in Z_{2^m}. For a polynomial f(x) to vanish in Z_{2^3}, we need 8 | f(x) for all x. But 8 | 4! too, as SF(8) = 4. Therefore, if for all x, f(x) can be represented as a product of 4 consecutive numbers, then f(x) vanishes in Z_{2^3}. So, how can we represent a polynomial as a product of 4 consecutive numbers? The answer is: x(x − 1)(x − 2)(x − 3). Such a product expression is referred to as a falling factorial and is formally defined below.
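Definition IV.1: Falling factorials of degree k ∈ Z are defined according to:
Y_0(x) = 1; Y_1(x) = x; Y_k(x) = Y_{k−1}(x) · (x − k + 1) = x(x − 1) · · · (x − k + 1).
(Column 1 in Table II shows the example of a univariate vanishing polynomial.)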
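The vanishing behaviour of falling factorials is easy to check numerically. The sketch below assumes the usual convention Y_k(x) = x(x − 1)···(x − k + 1) with Y_0(x) = 1, which agrees with the product x(x − 1)(x − 2)(x − 3) discussed above; the function name is hypothetical.

```python
def falling_factorial(x: int, k: int) -> int:
    """Y_k(x) = x (x-1) ... (x-k+1), with Y_0(x) = 1 (assumed convention)."""
    result = 1
    for j in range(k):
        result *= x - j
    return result


# Y_4(x) is a product of 4 consecutive integers, so 4! = 24 (and hence 8) divides it.
vanishes_mod_8 = all(falling_factorial(x, 4) % 8 == 0 for x in range(-50, 50))

# Y_3 does not vanish modulo 8: Y_3(3) = 3 * 2 * 1 = 6.
counterexample = falling_factorial(3, 3) % 8
print(vanishes_mod_8, counterexample)
```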
The above concept of falling factorials can be similarly defined for multi-variate expressions over Z_{2^m}[x_1, ..., x_d]:
Y_k(x) = Y_{k_1}(x_1) · Y_{k_2}(x_2) · · · Y_{k_d}(x_d).
Extending the above concept, if a multivariate polynomial in Z_{2^m}[x_1, ..., x_d] can be factorized into a product of SF(2^m) consecutive numbers in at least one of the variables x_i, then it vanishes % 2^m. Column 2 in Table II illustrates this idea, where both the input variables x_1, x_2 and the output F are in Z_{2^2}. We wish to extend the above concepts to analyze polynomials from Z_{2^{n_1}} × Z_{2^{n_2}} × ... × Z_{2^{n_d}} to Z_{2^m}. For this purpose, we define another quantity [6]:
µ_i = min{2^{n_i}, SF(2^m)}, for i = 1, ..., d.
We consider the following results from [6] :
Column 3 in table II gives an example illustrating the application of this lemma.
When a polynomial cannot be factored into such Y_k expressions, can it still vanish? Consider the quadratic polynomial 4x^2 − 4x. It can be written as 4(x)(x − 1). While 4x^2 − 4x cannot be factorized as (x)(x − 1)(x − 2)(x − 3), it still vanishes in Z_8. The missing factors, (x − 2)(x − 3) in this case, are compensated for by the multiplicative constant 4; therefore, 4x^2 − 4x ≡ 0 % 8. We now need to identify the constraints on such multiplicative constants such that the given polynomial would vanish. We state the following result [6]:
|g k ; where,
. In column 4 of table II, we show an example where this lemma can be applied.
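The compensating constant can be illustrated numerically. The sketch below assumes the univariate form b_k = 2^m / gcd(2^m, k!), which is inferred from the worked value b<1,2> = 8/(8, 1! · 2!) = 4 given later in this section; the function names are hypothetical.

```python
from math import factorial, gcd


def b(k: int, m: int) -> int:
    """Assumed smallest constant b_k = 2^m / gcd(2^m, k!) for which b_k * Y_k vanishes."""
    mod = 1 << m
    return mod // gcd(mod, factorial(k))


def quadratic(x: int) -> int:
    """The polynomial 4x^2 - 4x = 4 * Y_2(x) from the text."""
    return 4 * x * x - 4 * x


multipliers = [b(k, 3) for k in range(5)]                 # b_0 .. b_4 in Z_8
vanishes = all(quadratic(x) % 8 == 0 for x in range(8))   # check every 3-bit input
print(multipliers, vanishes)
```

With m = 3 the multiplier for degree 2 is 4, which is exactly the constant that makes 4x^2 − 4x vanish in Z_8.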
The above results can be extended to derive a unique canonical representation for a polynomial function from Z_{2^{n_1}} × Z_{2^{n_2}} × ... × Z_{2^{n_d}} to Z_{2^m}. We state the following theorem [6]:
Theorem 1: Let F be a polynomial representation for the function f from Z_{2^{n_1}} × Z_{2^{n_2}} × ... × Z_{2^{n_d}} to Z_{2^m}. Then, F can be uniquely represented as:
where, Y k is the falling factorial defined in Eqn. 3;
.
Proof:
The proof is provided in [6]. Briefly reviewing it, any polynomial F from Z_{2^{n_1}} × Z_{2^{n_2}} × ... × Z_{2^{n_d}} to Z_{2^m} can be decomposed in the form
F = Q_µ Y_µ + Σ_k a_k b_k Y_k + Σ_k c_k Y_k    (6)
where Q_µ ∈ Z[x_1, ..., x_d] is an arbitrary polynomial; Y_k is the falling factorial defined in Eqn. 3;
and c_k ∈ Z is an arbitrary integer, such that 0 ≤ c_k < b_k. It can be clearly seen from eqn. 6 that the first term (Q_µ Y_µ) is a vanishing polynomial from Lemma IV.1, and the second term (Σ_k a_k b_k Y_k) is a vanishing polynomial from Lemma IV.2. The third term (Σ_k c_k Y_k) cannot be reduced any further, since the coefficient c_k < b_k and hence b_k cannot divide c_k (for Lemma IV.2 to hold true). Hence, eqn. 6 can simply be written as
F = Σ_k c_k Y_k
[Table I, column 1: Let f : Z_{2^1} × Z_{2^2} → Z_{2^3} be a polyfunction in two variables (x_1, x_2). Then f is a polyfunction representable by F = 1 + 2x_2 + x_1 x_2^2, since f(x_1, x_2) ≡ F(x_1, x_2) % 2^3 for x_1 = 0, 1 and x_2 = 0, 1, 2, 3.]
[Table I, column 2: Let f be represented by the polynomial F = 4x_1 x_2^2 + 4x_1 x_2. While F has non-zero coefficients, F % 8 ≡ 0, ∀ x_1 ∈ Z_2, x_2 ∈ Z_4.]
[Table II, column 1: F(x) can be factored as a product of 4 consecutive numbers, i.e. Y_4(x); therefore F ≡ 0.]
[Table II, column 2: The degrees of x_1 and x_2 are k_1 = 4 and k_2 = 1, respectively. Note that F % 4 can be equivalently written as F = Y<4,1>(x_1, x_2) % 4. Since F % 4 can be represented as a product of 4 consecutive numbers, F ≡ 0.]
[Table II, column 3: µ_1(2^1) = min{2^1, 4} = 2 = k_1 and µ_2(2^2) = min{2^2, 4} = 4 > k_2, satisfying Lemma IV.1.]
[Table II, column 4: k = <k_1, k_2> = <1, 2>, with µ_1(2) = min{2, 4} = 2 and µ_2(4) = min{4, 4} = 4.]
The following example illustrates the above concept.
. F can be written as follows:
Here, a<1,2> = 1, b<1,2> = 8/gcd(8, 1! · 2!) = 4 and c<1,2> = 2. F can thus be written in the form given by Theorem 1, which is the unique canonical form representation of the polynomial.
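The values quoted in this example can be reproduced with a few lines of code. The sketch below assumes the multi-variate form b_k = 2^m / gcd(2^m, k_1! · ... · k_d!), which matches the expression 8/(8, 1! · 2!) above. The coefficient value 6 is an assumption made for illustration: it is simply the value a · b + c implied by a = 1, b = 4 and c = 2.

```python
from math import factorial, gcd, prod


def b_multi(k, m):
    """Assumed b_k = 2^m / gcd(2^m, k_1! * ... * k_d!) for a multi-index k."""
    mod = 1 << m
    return mod // gcd(mod, prod(factorial(ki) for ki in k))


b12 = b_multi((1, 2), 3)        # multi-index <1, 2>, output width m = 3
a12, c12 = divmod(6, b12)       # split an assumed coefficient 6 as a*b + c, with 0 <= c < b


def y12(x1, x2):
    """Y<1,2>(x1, x2) = x1 * x2 * (x2 - 1)."""
    return x1 * x2 * (x2 - 1)


# The part a*b*Y<1,2> vanishes on Z_2 x Z_4 modulo 8, so only c*Y<1,2> remains.
vanishing_part = all((a12 * b12 * y12(x1, x2)) % 8 == 0
                     for x1 in range(2) for x2 in range(4))
print(b12, a12, c12, vanishing_part)
```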
V. Algorithms: Polynomial Reduction
In this section, we present two algorithms that use the concepts described in the earlier section to optimize a polynomial function from
It has a complexity of O(n/log n) [19]. This value is then used to obtain the µ_i values.
3. Find the max. degree (k_i) of each variable x_i in poly.
4. Divide the polynomial by the falling factorial expressions Y_µ in each of the d variables.
5. If the remainder is zero, it is a vanishing polynomial and the cost of the poly is zero, because F = Q_µ Y_µ. Else, use the remainder as the new poly.
6. Update the degrees (k_i), min cost and min poly and continue dividing from Y_{µ−1} (highest degree) to Y_0 for each variable.
After each division, check for the following conditions:
• If the quotient can be written as a k · b k (where b k is defined according to Theorem 1), and the remainder is zero, return 0. It is a vanishing polynomial.
• If the quotient can be written as a k · b k , and the remainder is non-zero, use the remainder as the new poly.
• Check if the coefficient c k > b k . If so, perform the division with b k *Y k , and again use the remainder as the new poly.
• Update the min poly and continue with the next iteration.
min poly gives the polynomial with the least cost implementation in this reduction procedure and min cost gives its corresponding cost. The number of multi-variate divisions is bound by O( d µi), where µ_i is as defined previously and d is the total number of variables.
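The reduction idea can be sketched for the univariate case as follows. This is not the division loop with cost tracking described above: as a simplification, it rewrites the polynomial in the falling-factorial basis, drops every Y_k with k ≥ µ, and reduces each remaining coefficient modulo b_k. All function names are hypothetical, and µ = min{2^n, SF(2^m)} and b_k = 2^m / gcd(2^m, k!) are the assumed forms of those quantities.

```python
from math import factorial, gcd


def sf(n):
    """Least k >= 1 such that n divides k!."""
    k, fact = 1, 1
    while fact % n:
        k += 1
        fact *= k
    return k


def to_falling_basis(coeffs):
    """coeffs[i] multiplies x**i; returns c with p(x) = sum(c[k] * Y_k(x))."""
    work, out, node = list(coeffs), [], 0
    while work:
        quot, acc = [0] * (len(work) - 1), 0
        for i in range(len(work) - 1, -1, -1):   # synthetic division by (x - node)
            acc = acc * node + work[i]
            if i > 0:
                quot[i - 1] = acc
        out.append(acc)                          # remainder = coefficient of Y_node
        work, node = quot, node + 1
    return out


def from_falling_basis(c, mod):
    """Expand sum(c[k] * Y_k(x)) into power-basis coefficients modulo mod."""
    poly = [0]
    for k in range(len(c) - 1, -1, -1):          # poly = poly * (x - k) + c[k]
        nxt = [0] * (len(poly) + 1)
        for i, a in enumerate(poly):
            nxt[i + 1] += a
            nxt[i] -= k * a
        nxt[0] += c[k]
        poly = [a % mod for a in nxt]
    while len(poly) > 1 and poly[-1] == 0:       # strip leading zero coefficients
        poly.pop()
    return poly


def reduce_poly(coeffs, n, m):
    """Return (falling-basis coefficients, power-basis coefficients) of the reduced form."""
    mod = 1 << m
    mu = min(1 << n, sf(mod))
    c = to_falling_basis(coeffs)[:mu]            # Y_k vanishes for k >= mu
    c = [ck % (mod // gcd(mod, factorial(k))) for k, ck in enumerate(c)]
    return c, from_falling_basis(c, mod)


def evaluate(coeffs, x, mod):
    return sum(a * x ** i for i, a in enumerate(coeffs)) % mod


original = [0, 8, 0, 8, 0, 0, 1]                 # x^6 + 8x^3 + 8x, 3-bit x, 4-bit f
canonical, reduced = reduce_poly(original, n=3, m=4)
print(canonical, reduced)
```

The reduced polynomial has degree at most 5 and agrees with the original on every 3-bit input modulo 16. Power-basis coefficients of equivalent polynomials are not unique, so the result need not match other equivalent forms coefficient for coefficient.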
B. Algorithm II
Consider a polynomial f = x^6 + 8x^3 + 8x, with bit-vector sizes of {x, f} being {3, 4}, respectively. According to the previous algorithm, the reduction starts with the highest degree monomial (highest degree = 6, in this case) and proceeds further. Using the first algorithm, the polynomial reduction results in the following set of polynomials.
Initial Polynomial: f = x^6 + 8x^3 + 8x
1st Intermediate Polynomial: f = 11 * x^5 + x^4 + 9 * x^3 + 8 * x^2 + 4 * x
2nd Intermediate Polynomial:
Final Reduced Polynomial: f = x^5 + x^4 + 3 * x^3 + 12 * x
Using the cost model, the initial polynomial is estimated to be the least cost polynomial (it requires only 7 multipliers, 2 constant multipliers and 2 adders). However, in this polynomial, the sub-expression 8x^3 + 8x is a vanishing polynomial in Z_{2^4}. Thus, if we choose to reduce this sub-expression only, the initial polynomial f optimizes to x^6.
Initial Polynomial: f = x^6 + 8x^3 + 8x (reduce only 8x^3 + 8x and retain x^6 as is)
Optimized Polynomial: f = x^6
Now, the optimized polynomial requires only 5 multipliers. Thus, using an approach where only sub-expressions of the polynomial are reduced, the optimization is further enhanced.
Algorithm 1 OPT POLY: Optimize a given polynomial.
OPT POLY (F1, d, x, m, n
To lend an algorithmic procedure to such an approach, instead of iterating over all possible degrees (refer to Alg. I), we iterate over all combinations of all possible degrees. Consider the previous example where f = x^6 + 8x^3 + 8x. The combination of all possible degrees is given by the set {(x^6 + 8x^3 + 8x), (x^6 + 8x^3), (8x^3 + 8x), (x^6 + 8x), (x^6), (8x^3), (8x)}. Each element of the set is considered as a sub-expression, and reduced. It should be noted that Algorithm I is subsumed in Algorithm II, since one of the elements of the set is the entire polynomial itself. Since this algorithm is more exhaustive than the previous one, the complexity clearly increases. In this algorithm, the number of multi-variate divisions is bound by
, because in the worst-case it has to iterate through all the combinations of all degrees for every variable to determine the optimized polynomial. Using a classic branch and bound procedure, we can further optimize this search and determine the least cost polynomial. Due to lack of space, we do not provide a pseudocode for algorithm II. 
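The sub-expression enumeration can be sketched for the running example as follows. As a simplification, this sketch detects vanishing sub-expressions by brute-force evaluation over all input values rather than by division with falling factorials, and it only drops vanishing subsets; the function name and the (coefficient, degree) term encoding are hypothetical.

```python
from itertools import combinations


def vanishes(terms, n, m):
    """True if the sum of (coeff, degree) terms is 0 modulo 2**m for every n-bit input."""
    mod = 1 << m
    return all(sum(c * x ** d for c, d in terms) % mod == 0 for x in range(1 << n))


terms = [(1, 6), (8, 3), (8, 1)]        # x^6 + 8x^3 + 8x
n, m = 3, 4                             # 3-bit input, 4-bit output

# All non-empty combinations of monomials (7 sub-expressions for 3 monomials).
subsets = [s for r in range(1, len(terms) + 1) for s in combinations(terms, r)]
vanishing = [s for s in subsets if vanishes(s, n, m)]

# Drop the largest vanishing sub-expression and keep the rest of the polynomial.
largest = max(vanishing, key=len)
remaining = [t for t in terms if t not in largest]
print(len(subsets), vanishing, remaining)
```

For this example the only vanishing sub-expression is 8x^3 + 8x, leaving x^6, in line with the optimized polynomial given above.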
VI. Modeling Area Cost at Polynomial Level
In the two algorithms, at every reduction step we get an intermediate polynomial equivalent to the original one. We wish to estimate the cost (implementation area) of the original polynomial, all intermediate polynomials and also the final reduced form and select the least cost expression for implementation.
Polynomial computations correspond to additions, multiplications and constant multiplication operations (where one input to the multiplier is a constant). For instance, consider f = 5 * x 3 * y + 10 * x 2 * y 2 + 13 * x * y + 6 * y. f can be implemented with 3 adders, 7 multipliers and 4 constant multipliers. If we can determine the cost of the implementation area of these modules separately, their total cost would reflect the cost of implementing the polynomial f . Hence, we model the cost of adders, multipliers and constant multipliers (implemented with finite input and output bit-vector sizes) at polynomial level.
Adders: We estimate the area of an adder based on the implementation of a ripple-carry adder. Let the input bit-vector sizes of the adder be n1 and n2, and the output bit-vector size be m. If Max(n1 + 1, n2 + 1) > m, then we require at least m Full Adder modules; otherwise, we require Max(n1 + 1, n2 + 1) Full Adder modules. The cost is then Cost(Add) = n * Cost(FA), where Cost(FA) is the cost of a full adder and n is the number of Full Adder modules.
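The adder estimate translates directly into code. In the sketch below the function names are hypothetical and the default `cost_fa=1` is only a placeholder for the cost of one full adder.

```python
def adder_full_adders(n1: int, n2: int, m: int) -> int:
    """Number of full-adder modules: m if Max(n1+1, n2+1) > m, else Max(n1+1, n2+1)."""
    return min(m, max(n1 + 1, n2 + 1))


def adder_cost(n1: int, n2: int, m: int, cost_fa: int = 1) -> int:
    """Cost(Add) = n * Cost(FA); cost_fa is a placeholder unit cost."""
    return adder_full_adders(n1, n2, m) * cost_fa


print(adder_full_adders(8, 8, 4))   # output narrower than the operands
print(adder_full_adders(3, 2, 8))   # output wider than the operands
```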
Multipliers: The estimated cost of an n1 × n2 to m-bit multiplier is modeled on an array multiplier implementation [20] . Consider the 4-bit array multiplier shown in Fig. 1 . It is composed of partial product generators and an array of full adder modules. Its area can be modeled as the sum of partial product cost and the array network cost. We are interested in the area occupied by the partial products and the array network responsible for generating only the lower m output bits. For instance, in Fig. 1 , if the value of m is 4, then the region of interest is to the right side of the dotted line. Therefore, the cost of the multiplier can be estimated as: Cost(mbit Mult) = Cost(P P (m)) + Cost(Arr(m)), where Cost(P P (m)) is the cost of partial products (implemented with AND gates) and Cost(Arr(m)) is the cost of the array network (implemented with FA modules). Using the structure of the array multiplier and the values of n1, n2 and m, we can determine the minimum number of partial products and Full adder modules required to implement an n1 × n2 to m-bit multiplier.
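One part of this estimate, the number of partial products that feed the lower m output bits, can be counted as sketched below. The assumption here is that the partial product a_i · b_j has weight 2^(i+j) and is needed only when i + j < m. The full-adder count of the array network depends on the structure in Fig. 1 and is not modeled; the function name is hypothetical.

```python
def partial_products(n1: int, n2: int, m: int) -> int:
    """Count AND-gate partial products a_i * b_j whose weight 2^(i+j) lies in the lower m bits."""
    return sum(1 for i in range(n1) for j in range(n2) if i + j < m)


print(partial_products(4, 4, 4))   # 4x4 multiplier truncated to 4 output bits
print(partial_products(4, 4, 8))   # 4x4 multiplier with the full 8-bit product
```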
Constant Multipliers: When an input to a multiplier is a constant, the constant can be propagated to simplify the circuit. To model this effect, we need to analyze its bit pattern and estimate a cost based on the simplification caused by propagating these bits. We model constant multiplication using the array multiplier model. An n1 × n2 to m-bit constant multiplier is modeled as an m × m to m-bit multiplier: if n1 (or n2) is smaller than m, the upper m − n1 (or m − n2) bits are padded with zeroes up to the m-th bit; if n1 (or n2) is greater than m, then only the lower order m bits (from n1 and n2) are chosen for the implementation. In this manner, for an m × m to m-bit multiplier, only the lower order m bits are analyzed for constant propagation.
Simplification using Constant Propagation: In figure 1 , consider X as the constant and A as the variable. To propagate the constant X, we analyze the bits from the least significant position (X[0]) to the most significant one (X[m − 1]). Here are some results that we have derived to estimate the area as a result of constant propagation.
1. While traversing X from its LSB to MSB, until we reach a bit position whose value is 1, the cost of the implementation is zero due to zero propagation:
Consider the bit pattern of X, where X[i] is the least significant bit with value 1. The partial products generated using X[k], k < i, will be 0. Therefore, up to the i-th level, 0s are fed into the full adder modules, which results in their complete elimination (simplification) up to (i − 1) levels.
2. Until we reach the second bit position with value 1 in X while traversing from its LSB to MSB, the cost of the implementation is still zero.
3. On encountering the second bit position with value 1 in the traversal of X from its LSB to MSB, the full adder modules in that level can be optimized to half adder modules:
Consider the bit pattern used in the previous result, with X[i] and X[k] being the first and second bit positions with value 1. The partial products generated by X[i] and X[k] are added at the k-th level. However, the carry signals feeding the full adder modules in the k-th level are 0. Hence these can be optimized to half adder modules.
4. For the subsequent levels, if the value of X[i] at any level is 0, then the full adders in that level reduce to half adders:
Since the partial products generated due to X[i] = 0 are also 0, the full adders being fed by these partial products are simplified to half adders.
Based on the bit pattern of the constants, the above models are employed to estimate the effect of constant propagation on the multiplier area.
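The four rules can be sketched as a per-level classification of the constant's bit pattern. The code below only labels each level as needing no adders, half adders or full adders; the number of adder cells per level depends on the array structure of Fig. 1 and is not modeled. The function name and the labels are hypothetical.

```python
def classify_levels(x: int, m: int):
    """Label each bit level of the m-bit constant x as 'none', 'half' or 'full'."""
    labels, ones_seen = [], 0
    for i in range(m):                    # traverse from LSB to MSB
        bit = (x >> i) & 1
        ones_seen += bit
        if ones_seen < 2:
            labels.append("none")         # rules 1 and 2: no cost before the second 1
        elif bit == 1 and ones_seen == 2:
            labels.append("half")         # rule 3: level of the second 1
        elif bit == 0:
            labels.append("half")         # rule 4: later level with a 0 bit
        else:
            labels.append("full")         # later level with a 1 bit keeps full adders
    return labels


print(classify_levels(3, 4))   # constant 3 = 0011
print(classify_levels(5, 4))   # constant 5 = 0101
```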
Example: Consider the effect of 3 * A and 5 * A in a multiplier with output bit-vector size m = 4. Figures 2 and 3 depict the optimization in the designs for the multiply operation with constants 3 and 5, respectively.
Quantifying the cost: We employ the unit cost model, where every logic gate is implemented with a unit cost, to quantitatively calculate the area of the polynomial.
VII. Experiments
The algorithms were implemented in Perl with calls to Maple [21] , along with the presented cost model, for optimizing the given polynomial. The polynomial representing the datapath and the operating bit-vector size (input/output -n1, n2, ....n d /m) were given as the inputs to the tool.
Step-by-step reductions of the given polynomial were performed using our algorithms until a minimal form was obtained. For the original, minimal and every intermediate polynomial generated, the implementation cost was estimated. The polynomial with the least estimated cost was selected for implementation.
We used the Synopsys Design Compiler to generate the required n1 × n2 to m-bit adders and multipliers. These units were used, subsequently, as functional units to implement the polynomials. To compare the area statistics, both the original polynomial and the reduced polynomial with least estimated cost were implemented using the Synopsys Module Compiler.
Experiments have been performed on a variety of DSP benchmarks and the results are presented in Table III. The first four examples are from [12]. Deg4, Janez and Cubic are polynomial filters used in image processing applications [1]. IRR is an image rejection receiver from [22]. Mibench is an automotive application from [23]. Antialias and PSK (phase shift keying) are from [2], and IIR-4 is a 4th order IIR computation. Column 2 lists the design characteristics: number of variables, their highest degree and the bit-vector sizes of the inputs/output (n1, ..., nd / m). Column 3 lists the estimated cost of the original polynomial. Columns 4 and 5 list the cost of the optimized polynomial using algorithm I (Alg1) and algorithm II (Alg2), respectively. Columns 6 and 7 list the percentage improvement obtained in the estimated cost using Alg1 (Imp1) and Alg2 (Imp2), respectively. For the implementation cost, we report the results of Alg2. Columns 8 and 10 list the actual implementation area of the original and the selected polynomial (synthesized), respectively. Columns 9 and 11 give the critical path delay of the original and the selected polynomial implementations, respectively. These implementations have been realized using shifters, multipliers and adders. Column 12 gives the improvement in the area of the implementation, while column 13 gives the improvement in its critical path delay. If the improvement in the estimated cost is less than 1%, we choose the original polynomial for implementation.
For the first 9 benchmarks, we are able to find a reduced implementation. There is an average improvement of approximately 34% in actual implementation. For the remaining benchmarks, our cost estimate provided an improvement of less than 1% and hence, the original polynomial was chosen for implementation. Considering all the benchmarks, the average improvement in the actual implementation area is still approximately 23%.
Expression manipulations: There are many expression manipulation techniques that have been used to optimize arithmetic datapaths, such as factorization, tree-height reduction, Horner-form implementation and common-subexpression elimination. While such techniques are commonly used in synthesis of arithmetic polynomials, our approach can be used as a pre-processing step, thus providing an additional scope for optimization. High-level synthesis techniques such as scheduling and resource-sharing can also be employed to reduce the number of components and improve the critical path in an arithmetic expression. All the techniques mentioned above operate on the given data-flow graph (computation) and will still need to implement all the operations shown in that graph. On the other hand, the data-flow graph generated by our approach leads to a better implementation. This graph can be further optimized by expression manipulation, scheduling and resource-sharing.
Limitation of our approach: Given a polynomial f of degree k, one can derive a vanishing polynomial q of higher degrees (say, k+1) too. By computing f + q, one can create a higher degree (k+1) polynomial equivalent to f . The cost of f + q might be cheaper than f . Our approach cannot identify cheaper implementations of a higher degree. Unfortunately, there can be more than one vanishing expression of a given degree (depending upon the coefficients) that can be added to f . This makes it difficult to derive a "convergent" algorithm to search for low-cost implementations of higher degree.
VIII. Conclusions and Future Work
This paper has presented an area optimization approach for polynomial datapaths, where the input and output bit-vector sizes of the operands are given as (n_1, n_2, ..., n_d) and (m), respectively. Finite word-length bit-vector arithmetic is then modeled as a polyfunction from Z_{2^{n_1}} × Z_{2^{n_2}} × · · · × Z_{2^{n_d}} to Z_{2^m}. Exploiting the concept of vanishing polynomials over this mapping, we present two algorithms to optimize a given polynomial to a polynomial with a low-cost implementation. A cost model to estimate the area at polynomial level is also presented. Using the optimization procedure along with the cost model allows one to select an equivalent lower cost expression for synthesis. Experiments show significant area savings using our approach. Also, it can be seen that the area savings do not worsen timing. We are currently investigating how to extend our approach to perform polynomial decompositions over such arithmetic.
