Abstract: This paper considers the design of bit-parallel dedicated finite field multipliers using standard basis. An explicit algorithm is proposed for efficient construction of the Mastrovito product matrix, based on which we present a systematic design of the Mastrovito multiplier applicable to $GF(2^m)$ generated by an arbitrary irreducible polynomial. This design effectively exploits the spatial correlation of elements in the Mastrovito product matrix to reduce the complexity. Using a similar methodology, we propose a systematic design of the modified Mastrovito multiplier, which is suitable for $GF(2^m)$ generated by high-Hamming weight irreducible polynomials. For both original and modified Mastrovito multipliers, the developed multiplier architectures are highly modular, which is desirable for VLSI hardware implementation. Applying the proposed algorithm and design approach, we study the Mastrovito multipliers for several special irreducible polynomials, such as trinomials and equally-spaced polynomials, and the obtained complexity results match the best known results. Moreover, we have discovered several new special irreducible polynomials which also lead to low-complexity Mastrovito multipliers.
INTRODUCTION
Efficient hardware implementations of finite (or Galois) field $GF(2^m)$ arithmetic units are highly desirable for many applications in error-correcting coding and cryptography [1], [2], [3], [4]. Among the $GF(2^m)$ arithmetic operations, multiplication is the most basic and important building block in many applications. A number of efficient $GF(2^m)$ multiplication approaches and architectures have been proposed in which different basis representations of field elements are used, such as standard basis, dual basis, and normal basis. Standard basis is more promising in the sense that it gives designers more freedom in irreducible polynomial selection and hardware optimization. For detailed discussions of these three basis representations, readers are referred to [3], [5]. In this paper, we are interested in the design of bit-parallel $GF(2^m)$ multipliers using standard basis.
The standard basis multiplication involves two steps: polynomial multiplication and modulo reduction. Let $f(x)$ be the irreducible polynomial generating $GF(2^m)$, and let $c(x)$ be the product of $a(x)$ and $b(x)$, where $a(x), b(x), c(x) \in GF(2^m)$. The finite field multiplication is performed as $c(x) = a(x) \cdot b(x) \bmod f(x)$. An efficient dedicated bit-parallel multiplier was proposed by Mastrovito [6], in which a product matrix $M$ is introduced to combine the above two steps. Thus, the multiplication is carried out by $c = M \cdot b$, where $c$ and $b$ represent the coefficient vectors of $c(x)$ and $b(x)$, respectively. Given the irreducible polynomial $f(x)$, each entry in matrix $M$ is obtained by XORing certain coefficients of $a(x)$. Many entries in matrix $M$ can be computed efficiently by sharing common items, e.g., two entries $M(i_1, j_1) = a_0 + a_2 + a_3$ and $M(i_2, j_2) = a_0 + a_2 + a_4$ can be computed using three XOR gates by sharing the common item $a_0 + a_2$. This method is called subexpression sharing [7]. Mastrovito multipliers using two special irreducible polynomials, trinomial and equally-spaced polynomial (ESP), have been studied by many researchers for their low-complexity implementations [8], [9], [10], [11], [12], [13], [14]. The essence of all these works is to find an architecture that exploits subexpression sharing efficiently for the specific irreducible polynomial. It has been shown in [8] that the Mastrovito multiplier using the irreducible trinomial $x^m + x^n + 1$ only requires $m^2 - 1$ XOR gates and $m^2$ AND gates. By generalizing the approach of [8], Halbutogullari and Koç [14] discovered that the space complexity of the Mastrovito multiplier using the irreducible ESP $x^m + x^{tr} + \cdots + x^r + 1$, where $(t+1)r = m$, can be reduced to $m^2 - r$ XOR gates and $m^2$ AND gates. Furthermore, [14] presents a new formulation of the Mastrovito product matrix for an arbitrary irreducible polynomial $x^m + x^{n_k} + \cdots + x^{n_1} + 1$, where the space complexity is given as $m^2$ AND gates and $(m-1)(m+k-1) + \sum_{j \in N}(m-1-j)$ XOR gates, $N \subset \{0, 1, \ldots, m-2\}$.
However, [14] does not give a method to explicitly compute the set $N$, which makes its result less practical in general.
In general cases, the complexity of computing the product matrix is proportional to the Hamming weight of the irreducible polynomial $f(x)$ (denoted as $F_{wt}$). So, the Mastrovito multiplier is good only when low-Hamming weight irreducible polynomials are used. A modified Mastrovito multiplier was proposed by Song and Parhi [15], which has a complexity proportional to $(m - 1 - F_{wt})$. Its basic idea is to use the complementary irreducible polynomial for the computation of matrix $M$ with appropriate compensation. Such a multiplication scheme is efficient when $GF(2^m)$ is generated by a high-Hamming weight irreducible polynomial.
Generally, for large $m$, efficient design of both original and modified Mastrovito multipliers becomes rather difficult and a systematic design approach is highly desirable. In this paper, we generalize the approach of [8] in a different way compared with [14]. We propose a theorem and an algorithm (which can explicitly compute the set $N$ in [14]) for the construction of the product matrix in the original Mastrovito multiplier, based on which we develop an efficient systematic design of the original Mastrovito multiplier. This design effectively exploits subexpression sharing in the computation of the product matrix to reduce the complexity. Using a similar methodology, we also develop a systematic design of the modified Mastrovito multiplier. For both original and modified Mastrovito multipliers, explicit algorithms and architectures are presented and the complexities are given in detail. Applying our proposed design approach, we study the irreducible trinomial and ESP, and the complexity results match the results in [8] and [14]. Meanwhile, another computation approach for the trinomial case is proposed to make a trade-off between space complexity and delay. Moreover, with the aid of the proposed algorithm, we discover several new irreducible polynomials leading to low-complexity original Mastrovito multipliers, which is especially desirable when neither an irreducible trinomial nor an irreducible ESP exists.
This paper is organized as follows: We introduce the notation of this paper and the fundamentals of finite fields and the Mastrovito multiplier in Section 2. In Section 3, we propose a theorem and an algorithm for the construction of the product matrix, based on which a systematic design approach for the original Mastrovito multiplier is developed. In Section 4, using a similar design methodology, we present a systematic design approach for the modified Mastrovito multiplier. Efficient computation approaches for several special irreducible polynomials are discussed in Section 5.
This paper is an extended version of [16] .
NOTATION AND PRELIMINARIES
Since several arithmetic operations pertaining to matrices and vectors will be extensively used throughout this paper, we first introduce the following notation: Column vectors and matrices are represented by small and capital boldfaced characters, respectively. Matlab matrix notations are used, e.g., $Z(i,:)$, $Z(:,j)$, and $Z(i,j)$ represent the $i$th row vector, the $j$th column vector, and the entry at position $(i,j)$ in matrix $Z$, respectively. The operations of shift by feeding zero are represented by corresponding arrows, e.g., $v \downarrow 2$, $Z \rightarrow 1$, and $Z \downarrow 1$ represent down shift of vector $v$ by two positions, right shift of matrix $Z$ by one column, and down shift of matrix $Z$ by one row, respectively, which for an $m$-dimensional vector $v$ and an $m \times n$ matrix $Z$ are explicitly given as:

$$v \downarrow 2 = [0, 0, v_1, \ldots, v_{m-2}]^T, \quad Z \rightarrow 1 = [o, Z(:,1), \ldots, Z(:,n-1)], \quad Z \downarrow 1 = \begin{bmatrix} o^T \\ Z(1:m-1,:) \end{bmatrix},$$

where $o$ represents the zero column vector. Furthermore, we note that the AND and XOR gates considered in this paper are all 2-input gates, whose delays are denoted as $T_A$ and $T_X$, respectively.
Finite field $GF(2^m)$ contains $2^m$ elements and can be viewed as an $m$-dimensional vector space over $GF(2)$, which only has two elements, 0 and 1. With the standard basis $\{1, x, x^2, \ldots, x^{m-1}\}$, the elements of the finite field $GF(2^m)$ can be represented as polynomials of degree at most $m - 1$ as follows:

$$a(x) = a_{m-1}x^{m-1} + \cdots + a_1 x + a_0, \quad a_i \in GF(2).$$

Such polynomial representation is generally used for finite field arithmetic operations, where addition in $GF(2^m)$ is carried out by polynomial addition using bit-independent XOR operations.
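The zero-feeding shift operators used throughout the paper can be illustrated with a short Python sketch. The function names are hypothetical and chosen only for this illustration; vectors are plain lists and matrices are lists of rows.

```python
def _shift(seq, k, zero):
    """Shift a list by k positions towards higher indices, feeding `zero` in front."""
    k = min(k, len(seq))
    return [zero] * k + seq[:len(seq) - k]

def shift_down_vector(v, k):
    """v (down arrow) k: shift a column vector down by k positions, feeding zeros."""
    return _shift(list(v), k, 0)

def shift_right_matrix(Z, k):
    """Z (right arrow) k: shift every row right by k columns, feeding zero columns."""
    return [_shift(list(row), k, 0) for row in Z]

def shift_down_matrix(Z, k):
    """Z (down arrow) k: shift the rows down by k, feeding zero rows at the top."""
    cols = len(Z[0])
    k = min(k, len(Z))
    return [[0] * cols for _ in range(k)] + [list(row) for row in Z[:len(Z) - k]]

if __name__ == "__main__":
    print(shift_down_vector([1, 2, 3, 4], 2))
    print(shift_right_matrix([[1, 2, 3], [4, 5, 6]], 1))
    print(shift_down_matrix([[1, 2, 3], [4, 5, 6]], 1))
```

Entries shifted past the boundary are discarded, which is the behaviour the later Toeplitz constructions rely on.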
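Finite field multiplication using standard basis is carried out by polynomial multiplication and modulo operation. Let $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + 1$ be the irreducible polynomial generating $GF(2^m)$, and let $c(x)$ be the product of $a(x)$ and $b(x)$, where $a(x), b(x), c(x) \in GF(2^m)$. The polynomial multiplication, $d(x) = a(x) \cdot b(x)$, can be performed as

$$d = A \cdot b, \qquad (1)$$

where $b = [b_0, b_1, \ldots, b_{m-1}]^T$ and $d = [d_0, d_1, \ldots, d_{2m-2}]^T$ are the coefficient vectors of $b(x)$ and $d(x)$, respectively, and matrix $A$ is given as

$$A = \begin{bmatrix} a_0 & 0 & \cdots & 0 \\ a_1 & a_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{m-1} & a_{m-2} & \cdots & a_0 \\ 0 & a_{m-1} & \cdots & a_1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{m-1} \end{bmatrix}.$$

Next, we perform the modulo reduction $c(x) = d(x) \bmod f(x)$, which can be expressed as

$$c(x) = \sum_{j=0}^{m-1} d_j x^j + \sum_{l=0}^{m-2} d_{m+l}\, u_l(x), \qquad (2)$$

where $u_l(x) = x^{m+l} \bmod f(x)$. Since $u_l(x) = x \cdot u_{l-1}(x) \bmod f(x)$, where $f_0 = 1$, we get

$$u^l_j = u^{l-1}_{j-1} + u^{l-1}_{m-1} f_j, \quad 0 \le j \le m-1, \qquad (3)$$

where $u^{l-1}_j$ and $u^l_j$ are the coefficients of $u_{l-1}(x)$ and $u_l(x)$, respectively, and $u^{l-1}_{-1} = 0$. Let $u_l$ denote the coefficient vector of $u_l(x)$ and $f = [1, f_1, \ldots, f_{m-1}]^T$; then, using vector notation, (3) can be rewritten as

$$u_l = u_{l-1} \downarrow 1 + u_{l-1}(m) \cdot f, \quad u_0 = f. \qquad (4)$$

Noting that the first sum in (2) consists of the first $m$ entries of $d$, we can rewrite (2) in matrix notation as follows:

$$c = [I_{m \times m}, U] \cdot d,$$

where $I_{m \times m}$ represents the $m \times m$ identity matrix and matrix $U = [u_0, u_1, \ldots, u_{m-2}]$.
Note that all successive columns in matrix $U$ have the recursive relation as in (4). Define

$$M = [I_{m \times m}, U] \cdot A. \qquad (5)$$

Thus, the $GF(2^m)$ multiplication can be carried out by $c = M \cdot b$, which is the well-known Mastrovito multiplication scheme, and the matrix $M$ is called the product matrix.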
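The construction above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: polynomials are coefficient lists with the constant term first, the irreducible polynomial is given by its degree `m` and the list `ks` of its middle exponents, and all function names are hypothetical.

```python
def reduction_matrix(m, ks):
    """m x (m-1) matrix whose column l holds the coefficients of x^(m+l) mod f(x),
    for f(x) = x^m + sum(x^k for k in ks) + 1."""
    f = [0] * m
    f[0] = 1
    for k in ks:
        f[k] = 1
    cols, u = [], f[:]
    for _ in range(m - 1):
        cols.append(u)
        carry = u[m - 1]
        u = [0] + u[:m - 1]                  # multiply by x: shift down by one
        if carry:                            # reduce the x^m term using f
            u = [ui ^ fi for ui, fi in zip(u, f)]
    return [[cols[l][j] for l in range(m - 1)] for j in range(m)]

def multiplication_matrix(a):
    """(2m-1) x m Toeplitz matrix with entry (i, j) = a[i-j], so that d = A b."""
    m = len(a)
    return [[a[i - j] if 0 <= i - j < m else 0 for j in range(m)]
            for i in range(2 * m - 1)]

def product_matrix(m, ks, a):
    """M = [I, U] A: the first m rows of A plus U times the last m-1 rows of A."""
    A, U = multiplication_matrix(a), reduction_matrix(m, ks)
    M = [row[:] for row in A[:m]]
    for r in range(m):
        for l in range(m - 1):
            if U[r][l]:
                M[r] = [x ^ y for x, y in zip(M[r], A[m + l])]
    return M

def matvec(M, b):
    """Matrix-vector product over GF(2)."""
    return [sum(x & y for x, y in zip(row, b)) % 2 for row in M]

if __name__ == "__main__":
    # f(x) = x^5 + x^4 + x^3 + x^2 + 1, a(x) = x^4 + x^2, b(x) = x^3 + x^2 + 1
    M = product_matrix(5, [2, 3, 4], [0, 0, 1, 0, 1])
    print(matvec(M, [1, 0, 1, 1, 0]))
```

Once `M` is available, every product with a fixed `a` costs one AND per matrix entry and one XOR tree per row, which is the cost model used in the complexity analysis.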
MASTROVITO MULTIPLIER
In this section, we first introduce a theorem and an algorithm pertaining to the construction of product matrix M. Then, we develop an approach to compute M by adding a series of Toeplitz matrices, through which subexpression sharing could be extensively exploited to reduce the XOR complexity. Accordingly, a highly modular multiplier architecture is presented.
Proposed Theorem and Algorithm
In Section 2, we have shown that the product matrix $M$ can be expressed as the product of two independent matrices, $[I_{m \times m}, U]$ and $A$, and there exists a recursive relation between successive columns in matrix $U$, as described in (4). Through the following theorem and algorithm, we will see that matrix $U$ can also be constructed by adding a series of Toeplitz matrices.
Input: The parameters of the irreducible polynomial: $m$,
Procedure:
1. Generate a weighted tree D according to the following properties:
. 
c. if the order of the $j$th multiset modulo 2 equals 1, then insert $j$ into the set $N$.
Here, we note that a multiset is like a set, except that repeated elements are allowed, and the order of a multiset is its number of elements. A proof of Theorem 3.1 is given in Appendix A, in which Algorithm 3.1 is developed. From the above algorithm, we know that the least two elements in $N$ are always $0$ and $(m - k_s)$ and we have $|N| \le k_s$.
Since the multisets for $j = 2$ and $j = 3$ both have even order, we get $N = \{0, 1\}$.
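The tree construction of Algorithm 3.1 is only partly legible here, so the following Python sketch computes the same kind of shift set from the column recursion (4) instead, and should be read as an assumption-laden illustration rather than the authors' procedure. It counts, modulo 2, the ordered sums of the weights $m - k_i$ that do not exceed $m - 2$, and then checks that adding shifted copies of the coefficient pattern of $f(x)$ reproduces the columns $x^{m+l} \bmod f(x)$. All function names are hypothetical.

```python
def low_coeffs(m, ks):
    """Coefficients f_0..f_(m-1) of f(x) = x^m + sum(x^k for k in ks) + 1."""
    f = [0] * m
    f[0] = 1
    for k in ks:
        f[k] = 1
    return f

def shift_set(m, ks):
    """Shifts n <= m-2 that arise an odd number of times as ordered sums of m-k_i."""
    weights = [m - k for k in ks]
    odd = [0] * (m - 1)
    odd[0] = 1                               # the empty sum
    for n in range(1, m - 1):
        odd[n] = sum(odd[n - w] for w in weights if w <= n) % 2
    return [n for n in range(m - 1) if odd[n]]

def columns_direct(m, ks):
    """Columns x^(m+l) mod f(x), l = 0..m-2, by repeated multiplication by x."""
    f, cols = low_coeffs(m, ks), []
    u = f[:]
    for _ in range(m - 1):
        cols.append(u)
        carry = u[-1]
        u = [0] + u[:-1]
        if carry:
            u = [x ^ y for x, y in zip(u, f)]
    return cols

def columns_from_shifts(m, ks):
    """Same columns, built by XORing right-shifted copies of the pattern of f."""
    f, shifts, cols = low_coeffs(m, ks), shift_set(m, ks), []
    for c in range(m - 1):
        col = [0] * m
        for n in shifts:
            if n <= c:
                shifted = [0] * (c - n) + f[:m - (c - n)]
                col = [x ^ y for x, y in zip(col, shifted)]
        cols.append(col)
    return cols

if __name__ == "__main__":
    print(shift_set(5, [2, 3, 4]))
    print(columns_direct(5, [2, 3, 4]) == columns_from_shifts(5, [2, 3, 4]))
```

For $f(x) = x^5 + x^4 + x^3 + x^2 + 1$ the sketch yields the set $\{0, 1\}$ quoted in the example.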
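Therefore, we perform the multiplication as

$$c = [I_{5 \times 5}, U] \cdot A \cdot b = \begin{bmatrix} 1&0&0&0&0&1&1&0&0 \\ 0&1&0&0&0&0&1&1&0 \\ 0&0&1&0&0&1&1&1&1 \\ 0&0&0&1&0&1&0&1&1 \\ 0&0&0&0&1&1&0&0&1 \end{bmatrix} \cdot \begin{bmatrix} 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&0&0&0 \\ 0&1&0&0&0 \\ 1&0&1&0&0 \\ 0&1&0&1&0 \\ 0&0&1&0&1 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0&1&1&1&1 \\ 0&0&1&1&1 \\ 1&1&1&0&0 \\ 0&0&0&0&1 \\ 1&1&1&1&1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}.$$

Thus, we have $c(x) = a(x) b(x) \bmod f(x) = x^4$.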
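The worked example can be cross-checked independently of the matrix formulation with plain shift-and-XOR polynomial arithmetic. In the sketch below, polynomials are Python integers with bit $i$ holding the coefficient of $x^i$. The operands are read off the printed matrices ($f(x) = x^5 + x^4 + x^3 + x^2 + 1$, $a(x) = x^4 + x^2$, $b(x) = x^3 + x^2 + 1$) and are therefore an inference from this example rather than values stated in the text; the function name is hypothetical.

```python
def gf2_mulmod(a, b, f):
    """Multiply polynomials a and b over GF(2) and reduce modulo f.
    Bit i of each integer is the coefficient of x^i."""
    m = f.bit_length() - 1                   # degree of f
    d = 0
    for i in range(b.bit_length()):          # carry-less (XOR) multiplication
        if (b >> i) & 1:
            d ^= a << i
    for i in range(d.bit_length() - 1, m - 1, -1):
        if (d >> i) & 1:                     # cancel the x^i term with x^(i-m) * f
            d ^= f << (i - m)
    return d

if __name__ == "__main__":
    f = 0b111101                             # x^5 + x^4 + x^3 + x^2 + 1
    a = 0b10100                              # x^4 + x^2
    b = 0b01101                              # x^3 + x^2 + 1
    print(bin(gf2_mulmod(a, b, f)))
```

The result is the bit pattern of $x^4$, in agreement with the product computed through the matrix $M$.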
Multiplier Architecture
Applying Theorem 3.1 and Algorithm 3.1, in this section, we develop an efficient computation approach for the product matrix $M$ and present the corresponding low-complexity Mastrovito multiplier architecture. In the following, we frequently use two special matrices: the Toeplitz matrix and the upper-triangular Toeplitz matrix. It is well-known that the sum of two upper-triangular Toeplitz matrices (or one Toeplitz matrix and one upper-triangular Toeplitz matrix) can be obtained by only computing the sum of two row vectors. This property will be used to exploit the subexpression sharing in the construction of the product matrix $M$ to reduce the entire XOR complexity.
Given a general irreducible polynomial $f(x) = x^m + x^{k_s} + \cdots + x^{k_1} + 1$, we express its coefficient vector as $f = e_1 + e_{k_1+1} + \cdots + e_{k_s+1}$, where $e_i$ is the $m$-dimensional $i$th canonical unit vector:
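Substituting (6), (8), and (9) into (5), with $A_L$ and $A_U$ (the first $m$ rows and the last $m - 1$ rows of $A$) as defined in (10), we get

$$M = A_L + \sum_{i=0}^{s} \sum_{n \in N} (E_{k_i+1} \rightarrow n) \cdot A_U. \qquad (11)$$

Based on the special structure of $E_j$ as defined in (7), it can be easily proven that

$$(E_{k_i+1} \rightarrow n) \cdot A_U = (\tilde{A}_U \rightarrow n) \downarrow k_i, \qquad (12)$$

where $\tilde{A}_U = [A_U^T, o]^T$, $k_0 = 0$, and $o$ is an $m$-dimensional zero column vector. Thus, if we denote $\sum_{n \in N} \tilde{A}_U \rightarrow n$ as $S$, (11) can be rewritten as

$$M = A_L + \sum_{i=0}^{s} S \downarrow k_i. \qquad (13)$$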
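Because each $\tilde{A}_U \rightarrow n$ is an upper-triangular Toeplitz matrix, we easily know that matrix $S$ is also an upper-triangular Toeplitz matrix with the following form:

$$S = \begin{bmatrix} 0 & s_1 & s_2 & \cdots & s_{m-1} \\ 0 & 0 & s_1 & \cdots & s_{m-2} \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & s_1 \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix}. \qquad (14)$$

Therefore, computing its first row

$$S(1,:) = \sum_{n \in N} A_U(1,:) \rightarrow n \qquad (15)$$

is sufficient to completely construct matrix $S$. Since the first entry of $A_U(1,:)$ is zero, the first $n + 1$ entries of $A_U(1,:) \rightarrow n$ are also zeroes. Recalling that $n = 0$ is the least element in $N$, we get the XOR complexity for computing $S(1,:)$ based on (15) as

$$\sum_{n \in N} (m - n - 1) - (m - 1), \qquad (16)$$

and if a binary tree structure is used, the delay will be $\lceil \log_2 |N| \rceil T_X$.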
After having obtained $S$, we use (13) to compute the product matrix $M$. From (10) and (14), we know that the addition of $A_L$ and $S$ does not need any XOR gates and the matrix $A_L + S$, denoted by $T$, is a Toeplitz matrix. Therefore, (13) can be rewritten as

$$M = T + \sum_{i=1}^{s} S \downarrow k_i. \qquad (17)$$
In order to compute (17) efficiently, we introduce the following observation:
Applying Observation 3.1, we compute $M$ based on (17) with linear tree structure as follows:
Finally, set $M = M^{(s)}$, where $M^{(i)}$ denotes the matrix obtained after the $i$th step.
In the above algorithm, for each $i$, the addition $M^{(i-1)}(k_i + 1,:) + S(1,:)$ requires $m - 1$ XOR gates, so we need $s(m - 1)$ XOR gates to compute $M$ with the delay of $sT_X$. In order to obtain the result $c$ via $M \cdot b$, we also need $m^2$ AND gates and $m(m - 1)$ XOR gates to complete this matrix-vector multiplication. Its delay will be $T_A + \lceil \log_2 m \rceil T_X$ if a binary tree structure is used.
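The Toeplitz-based computation of the product matrix can be sketched in Python as follows. The sketch assumes the reading $M = A_L + \sum_{i=0}^{s} S \downarrow k_i$ with $k_0 = 0$, where $A_L$ is the lower-triangular part of the multiplication matrix and $S$ is the upper-triangular Toeplitz matrix whose first row is the XOR of the shifted copies of $A_U(1,:) = [0, a_{m-1}, \ldots, a_1]$ selected by the shift set. The shift set is passed in explicitly, and the function name and argument layout are hypothetical.

```python
def product_matrix_toeplitz(a, ks, shifts):
    """Product matrix built from Toeplitz pieces.
    a      : coefficients a_0..a_(m-1) of the fixed operand
    ks     : middle exponents k_1..k_s of f(x) = x^m + ... + 1
    shifts : the shift set (e.g. [0, 1] for x^5 + x^4 + x^3 + x^2 + 1)"""
    m = len(a)
    au_row1 = [0] + a[:0:-1]                 # [0, a_(m-1), ..., a_1]
    s_row1 = [0] * m
    for n in shifts:                         # first row of S
        shifted = [0] * n + au_row1[:m - n]
        s_row1 = [x ^ y for x, y in zip(s_row1, shifted)]
    # S is upper-triangular Toeplitz, A_L is lower-triangular Toeplitz.
    S = [[s_row1[c - r] if c >= r else 0 for c in range(m)] for r in range(m)]
    A_L = [[a[r - c] if r >= c else 0 for c in range(m)] for r in range(m)]
    # T = A_L + S needs no XOR gates in hardware: the nonzero parts do not overlap.
    M = [[x ^ y for x, y in zip(A_L[r], S[r])] for r in range(m)]
    for k in ks:                             # add S shifted down by k_i rows
        for r in range(k, m):
            M[r] = [x ^ y for x, y in zip(M[r], S[r - k])]
    return M

if __name__ == "__main__":
    for row in product_matrix_toeplitz([0, 0, 1, 0, 1], [2, 3, 4], [0, 1]):
        print(row)
```

For the degree-5 example this reproduces the same product matrix as the direct construction $[I, U] \cdot A$, which gives some confidence in the reconstruction of the formulas above.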
Based on the above computation approach, we develop the dedicated Mastrovito multiplier architecture as shown in Fig. 2. Given the irreducible polynomial $f(x) = x^m + x^{k_s} + \cdots + x^{k_1} + 1$, set $N$ is constructed using Algorithm 3.1. The Product Matrix Module computes matrix $M$ and consists of two blocks: M1 and M2. Block M1 generates the vector $S(1,:)$ by computing $\sum_{n \in N} A_U(1,:) \rightarrow n$. Supplied with $S(1,:)$, block M2 computes the product matrix $M$ using Algorithm 3.2. M2 contains $(m - 1)$ row blocks indexed by $j$, each of which generates one row vector of $M$. If $j \in \{k_i, 1 \le i \le s\}$, then block $j$ is identical to block B2; otherwise, it is identical to block B1. The Matrix-Vector Multiplication Module computes $M \cdot b$ and consists of $m$ identical V blocks. The block V computes the inner product of two vectors of length $m$. The total complexity of the proposed original Mastrovito multiplier for a general irreducible polynomial is given by
• XOR complexity: $(m - 1)(m + s) + \sum_{n \in N}(m - n - 1) - (m - 1)$,
• AND complexity: $m^2$,
• Delay: $T_A + (s + \lceil \log_2 |N| \rceil + \lceil \log_2 m \rceil) T_X$.
It needs to be pointed out that, for a given irreducible polynomial, there likely exist some common items in the computation of (15) and (17) . By sharing these common items, we may further optimize the hardware architecture of product matrix module to some extent. Thus, the multiplier architecture shown in Fig. 2 may need some corresponding modifications and the above XOR complexity value is actually an upper bound for general cases.
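The upper-bound figures for the general case can be tabulated with a small helper. The sketch below adds up the three contributions described in the text: $m(m-1)$ XOR gates for the matrix-vector product, $s(m-1)$ for Algorithm 3.2, and $m - 1 - n$ for every nonzero shift $n$ in the shift set. The delay is reported as a count of XOR levels to be added to one AND-gate delay. The function name and the returned dictionary layout are hypothetical.

```python
def general_complexity(m, ks, shifts):
    """Upper-bound gate counts for the original Mastrovito multiplier.
    ks     : middle exponents k_1..k_s of the irreducible polynomial
    shifts : the shift set computed for that polynomial"""
    def ceil_log2(v):
        return (v - 1).bit_length()          # ceil(log2 v) for v >= 1

    s = len(ks)
    xor_first_row = sum(m - 1 - n for n in shifts if n != 0)
    return {
        "xor": m * (m - 1) + s * (m - 1) + xor_first_row,
        "and": m * m,
        "xor_levels": s + ceil_log2(len(shifts)) + ceil_log2(m),
    }

if __name__ == "__main__":
    # f(x) = x^5 + x^4 + x^3 + x^2 + 1 with shift set {0, 1}
    print(general_complexity(5, [2, 3, 4], [0, 1]))
```

As noted in the text, the XOR figure is an upper bound: sharing further common items for a specific polynomial can lower it.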
MODIFIED MASTROVITO MULTIPLIER
Using a similar design methodology as in Section 3, in this section, we develop a systematic design approach for the modified Mastrovito multiplier [15] , which is preferable if high-Hamming weight irreducible polynomial is used.
Proposed Theorem and Algorithm
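For a polynomial $f(x) = f_m x^m + f_{m-1} x^{m-1} + \cdots + f_1 x + f_0$, we define its complementary polynomial as

$$p(x) = \bar{f}_m x^m + \bar{f}_{m-1} x^{m-1} + \cdots + \bar{f}_1 x + \bar{f}_0, \quad \bar{f}_i = 1 + f_i.$$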
According to [15] , we have the following theorem to construct matrix U in (5) using the complementary irreducible polynomial: 
where $e_1$ is the first canonical unit vector and
In the following, we propose a theorem and an algorithm to construct matrix $V$ in Theorem 4.1 by adding a series of Toeplitz matrices together.
where $E_1$ is defined in (7) and
The sets $L$ and $J$ in Theorem 4.2 are constructed by the following algorithm:
1. Generate a weighted tree $D$ according to the following properties:
• The root $d_1$ always has one child node $d_2$ connected by an edge with weight $w = 1$.
other child nodes, where the weight of edge 
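$= j$, then insert the node into the $j$th multiset;
c. if the order of the $j$th multiset of the first kind modulo 2 equals 1, then insert $j$ into the set $L$; if the order of the $j$th multiset of the second kind modulo 2 equals 1, then insert $j$ into the set $J$.
A proof of Theorem 4.2 is given in Appendix B, in which Algorithm 4.1 is developed.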
$$V = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 1&1&0&1&0&1 \\ 0&1&1&0&1&0 \\ 0&0&1&1&0&1 \end{bmatrix} + \begin{bmatrix} 0&1&0&0&1&1 \\ 0&0&1&0&0&1 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix} = \begin{bmatrix} 0&1&0&0&1&1 \\ 0&0&1&0&0&1 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 1&1&0&1&0&0 \\ 0&1&1&0&1&0 \\ 0&0&1&1&0&1 \end{bmatrix}.$$

Since $U$ is obtained from $V$ as in Theorem 4.1, we get

$$U = \begin{bmatrix} 0&1&0&0&1&1 \\ 0&0&1&0&0&1 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 1&1&0&1&0&0 \\ 0&1&1&0&1&0 \\ 0&0&1&1&0&1 \end{bmatrix} + \begin{bmatrix} 1&0&0&1&1&0 \\ 1&0&0&1&1&0 \\ 1&0&0&1&1&0 \\ 1&0&0&1&1&0 \\ 1&0&0&1&1&0 \\ 1&0&0&1&1&0 \\ 1&0&0&1&1&0 \end{bmatrix} = \begin{bmatrix} 1&1&0&1&0&1 \\ 1&0&1&1&1&1 \\ 1&0&0&0&1&0 \\ 1&0&0&1&0&0 \\ 0&1&0&0&1&0 \\ 1&1&1&1&0&0 \\ 1&0&1&0&1&1 \end{bmatrix}.$$
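The numbers in this example can be checked numerically. The first column of the final matrix is the coefficient pattern of $x^7 \bmod f(x)$, which suggests $f(x) = x^7 + x^6 + x^5 + x^3 + x^2 + x + 1$ with complementary polynomial $x^4$; this polynomial is inferred from the printed matrices and is an assumption of the sketch, as are all names in it. The code recomputes $U$ by repeated multiplication by $x$ and removes the matrix with identical rows to recover $V$.

```python
def reduction_matrix(m, ks):
    """m x (m-1) matrix whose column l is x^(m+l) mod f(x), low coefficient first."""
    f = [0] * m
    f[0] = 1
    for k in ks:
        f[k] = 1
    cols, u = [], f[:]
    for _ in range(m - 1):
        cols.append(u)
        carry = u[-1]
        u = [0] + u[:-1]
        if carry:
            u = [x ^ y for x, y in zip(u, f)]
    return [[cols[l][j] for l in range(m - 1)] for j in range(m)]

# Matrices as printed in the example (rows top to bottom).
PRINTED_U = [[1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0],
             [1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0], [1, 1, 1, 1, 0, 0],
             [1, 0, 1, 0, 1, 1]]
PRINTED_V = [[0, 1, 0, 0, 1, 1], [0, 0, 1, 0, 0, 1], [0, 0, 0, 1, 0, 0],
             [0, 0, 0, 0, 1, 0], [1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0],
             [0, 0, 1, 1, 0, 1]]
COMMON_ROW = [1, 0, 0, 1, 1, 0]              # the row repeated in the second summand

def check_example():
    U = reduction_matrix(7, [1, 2, 3, 5, 6])
    V = [[u ^ q for u, q in zip(row, COMMON_ROW)] for row in U]
    return U == PRINTED_U, V == PRINTED_V

if __name__ == "__main__":
    print(check_example())
```

Because the correction term has identical rows, its contribution to the product only needs a single inner product, which is what makes the modified scheme attractive for high-Hamming weight polynomials.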
Multiplier Architecture
In this section, based on the above theorem and algorithm, we develop an efficient computation approach and corresponding multiplier architecture for the modified Mastrovito multiplication scheme. According to (5) , (9) 
Since each $\tilde{A}_U \rightarrow n$ is an upper-triangular Toeplitz matrix, we know that both $S_1$ and $S_2$ are also upper-triangular Toeplitz matrices, and computing
is sufficient to construct $S_1$ and $S_2$. Since the least element in $L$ is always 0, similar to (16), we obtain the XOR complexities of computing $S_1$ and $S_2$ as
with the delay of $\lceil \log_2 |J| \rceil T_X$ and $\lceil \log_2 |L| \rceil T_X$, respectively. Similarly to Algorithm 3.2, for the computation of $M_1$ using (20), we have the following algorithm:
Algorithm 4.2.
1. Initially, set $M_1^{(0)} = A_L + S_1$;
2. For $1 \le i \le m - s - 1$, construct $M_1^{(i)} = M_1^{(i-1)} + S_2 \downarrow t_i$ by computing
3. Finally, set $M_1 = M_1^{(m-s-1)}$.
In the above algorithm, the construction of the Toeplitz matrix $M_1^{(0)}$ doesn't need any gates. For $1 \le i \le m - s - 1$, each step needs $m - 1$ XOR gates, so the XOR complexity and delay of the above algorithm are $(m - s - 1)(m - 1)$ and $(m - s - 1) T_X$, respectively.
Based on the above discussion, we conclude that the matrix w I can be computed with the following procedure:
Procedure 4.1.
1. Given a high-Hamming weight irreducible polynomial $f(x) = x^m + x^{k_s} + \cdots + x^{k_1} + 1$, get its complementary polynomial $p(x) = x^{t_{m-s-1}} + \cdots + x^{t_1}$ and construct two sets $L$ and $J$ using Algorithm 4.1;
2. Construct $S_1$ and $S_2$ using (21); if we combine $L$ and $J$ together to form a new multiset $L^*$, then the complexity of this step is
• XOR complexity:
Therefore, the computation of $M_2$ only needs $\sum_{i=2}^{d} (m - r_i)$ XOR gates with the delay of $\lceil \log_2 d \rceil T_X$.
In the above, we have developed the approaches for computing $M_1$ and $M_2$. In order to complete the $GF(2^m)$ multiplication, we only need to compute $c = M \cdot b = M_1 \cdot b + M_2 \cdot b$. The matrix-vector multiplication $M_1 \cdot b$ requires $m(m - 1)$ XOR and $m^2$ AND gates with the delay of $T_A + \lceil \log_2 m \rceil T_X$. Since all row vectors in $M_2$ are identical and the first $r_1$ elements in each row are zeros, the matrix-vector multiplication $M_2 \cdot b$ only requires $m - r_1 - 1$ XOR and $m - r_1$ AND gates with the delay of $T_A + \lceil \log_2 (m - r_1) \rceil T_X$. The addition of $M_1 \cdot b$ and $M_2 \cdot b$ needs $m$ XOR gates with the delay of $T_X$.
Based on the above computation approach, we develop the modified Mastrovito multiplier architecture as shown in Fig. 4. The sets $L$ and $J$ are constructed using Algorithm 4.1. The Product Matrix Module computes the two matrices $M_1$ and $M_2$ and consists of four blocks: PM1, PM2, PM3, and PM4. Blocks PM1 and PM2 generate the vectors $S_2(1,:)$ and $S_1(1,:)$, respectively. Block PM3 computes the matrix $M_1$ using Algorithm 4.2. PM3 contains $(m - 1)$ row blocks indexed by $j$, each of which generates one row vector of $M_1$. If $j \in \{t_i, 1 \le i \le m - s - 1\}$, then block $j$ is identical to block B2; otherwise, it is identical to block B1. Block PM4 computes the vector $q^T \cdot A_U$, the row vector of matrix $M_2$, where $r_1, \ldots, r_d$ represent the positions of the nonzero elements in $q$. The Matrix-Vector Multiplication & Addition Module computes $M_1 \cdot b + M_2 \cdot b$. In this architecture, the operations $M_1 \cdot b$ and $M_2 \cdot b$ can be performed in parallel. If their delays are denoted by $T_A + m_1 T_X$ and $T_A + m_2 T_X$, the total delay of the modified Mastrovito multiplier will be $T_A + (1 + \max\{m_1, m_2\}) T_X$. Therefore, the total complexity of the modified Mastrovito multiplier is
• XOR complexity:
where $m_1 = m - s - 1 + \lceil \log_2 \max\{|L|, |J|\} \rceil + \lceil \log_2 m \rceil$ and $m_2 = \lceil \log_2 d \rceil + \lceil \log_2 (m - r_1) \rceil$.
We note that, for a given high-Hamming weight irreducible polynomial, further hardware optimization is possible by sharing common items during the computation of $M_1$ and $M_2$. So, the above XOR complexity result is also an upper bound for general cases.
SPECIAL IRREDUCIBLE POLYNOMIALS
In Section 3, we presented an efficient computation approach of the original Mastrovito multiplication for general cases and pointed out that further simplification can be achieved for specific irreducible polynomials by further exploiting subexpression sharing. In this section, we show that by applying our proposed explicit algorithm for the construction of set $N$, we can easily obtain efficient multiplication schemes with further reduced complexity for several special irreducible polynomials.
$m/2 \ge k_s$
In the following, we show that, for an irreducible polynomial $f(x) = x^m + x^{k_s} + \cdots + x^{k_1} + 1$, where $m/2 \ge k_s$, we can reduce the XOR complexity of the Mastrovito multiplier by computing $S(1,:)$ in (15) using a linear tree structure instead of a binary tree. Since
Using the linear tree structure, we compute $S(1,:)$, according to (24), as follows:
Algorithm 5.1.
1. Initially, set $v_0 = A_U(1,:)$;
2. For $1 \le i \le s$, compute
The XOR complexity of the above algorithm is $\sum_{i=1}^{s} (k_i - 1)$ with the delay of $sT_X$. Compared with using a binary tree, the XOR complexity doesn't change, but the delay increases from $\lceil \log_2 (s + 1) \rceil T_X$ to $sT_X$. Next, we will prove that, as compensation for the increased delay, the XOR complexity of computing $M$ using Algorithm 3.2 can be reduced from $s(m - 1)$ to $\sum_{i=1}^{s} (m - k_i)$ by sharing the intermediate results obtained in Algorithm 5.1.
Proof. In Algorithm 3.2, based on Observation 3.1, we construct each intermediate matrix $M^{(i)}$ by only computing
Moreover, since the last $m - k_i$ rows of each matrix $M^{(i)}$ form a Toeplitz submatrix, we have
Therefore, based on (26), (27), and the fact that $M^{(0)} = A_L + S$, by induction we have
From (10), we have
where $e_{k_i+1}$ is the $(k_i + 1)$th canonical unit vector. Therefore, (28) becomes
with the delay of $2sT_X$. Moreover, the matrix-vector multiplication $M \cdot b$ requires $m(m - 1)$ XOR and $m^2$ AND gates with the delay of $T_A + \lceil \log_2 m \rceil T_X$. Thus, for an irreducible polynomial in which $m/2 \ge k_s$, if the linear tree structure is used to compute $S(1,:)$, the entire complexity of the Mastrovito multiplier is
• XOR complexity: $(m + s)(m - 1)$,
• AND complexity: $m^2$,
• Delay: $T_A + (2s + \lceil \log_2 m \rceil) T_X$.
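For polynomials with $m/2 \ge k_s$, any sum of two weights $m - k_i$ already exceeds $m - 2$, so the shift set contains only 0 and the single weights. The Python sketch below tabulates the resulting figures and cross-checks the shift set against the counting recursion used for the general case. It is an illustration under these assumptions, with hypothetical function names; the polynomial in the usage line, $x^8 + x^4 + x^3 + x + 1$, is just a convenient irreducible example satisfying the condition.

```python
def shift_set(m, ks):
    """Shifts n <= m-2 reached an odd number of times by ordered sums of m-k_i."""
    weights = [m - k for k in ks]
    odd = [0] * (m - 1)
    odd[0] = 1
    for n in range(1, m - 1):
        odd[n] = sum(odd[n - w] for w in weights if w <= n) % 2
    return [n for n in range(m - 1) if odd[n]]

def half_degree_case(m, ks):
    """Figures for f(x) = x^m + x^(k_s) + ... + x^(k_1) + 1 with m/2 >= k_s."""
    if 2 * max(ks) > m:
        raise ValueError("requires m/2 >= k_s")
    s = len(ks)
    # A weight m-1 (k_i = 1) shifts the whole row out, so it is not listed.
    shifts = [0] + sorted(m - k for k in ks if m - k <= m - 2)
    return {
        "shifts": shifts,
        "xor": (m + s) * (m - 1),
        "and": m * m,
        "xor_levels": 2 * s + (m - 1).bit_length(),   # 2s + ceil(log2 m)
    }

if __name__ == "__main__":
    result = half_degree_case(8, [1, 3, 4])
    print(result, result["shifts"] == shift_set(8, [1, 3, 4]))
```

The linear tree trades delay ($2s$ XOR levels instead of $s + \lceil \log_2(s+1) \rceil$) for the lower XOR count.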
Trinomial
If $GF(2^m)$ is generated by the irreducible trinomial $f(x) = x^m + x^n + 1$, we have

$$N = \{0, (m - n), 2(m - n), \ldots, k(m - n)\}.$$

Therefore, with $k = \lfloor (m - 2)/(m - n) \rfloor$, (15) and (13) can be simplified as

$$S(1,:) = \sum_{i=0}^{k} A_U(1,:) \rightarrow i(m - n), \qquad (34)$$

$$M = A_L + S + S \downarrow n. \qquad (35)$$

Obviously, computing

$$M(n + 1,:) = A_L(n + 1,:) + S(n + 1,:) + S(1,:) \qquad (36)$$

is sufficient to construct $M$ in (35). According to the complexity results presented in Section 3, we know that the XOR complexities of computing (34) and (36) are $\sum_{i=1}^{k} (m - i(m - n) - 1)$ and $m - 1$, respectively. In the following, we will see that the above complexity values can be further reduced. First, we show that $m - n$, instead of $m - 1$, XOR gates are sufficient to complete the computation in (36). From (29), we have $A_L(n + 1,:) = A_U(1,:) \leftarrow (m - n) + a_0 e_{n+1}^T$ and, since $S$ is an upper-triangular Toeplitz matrix, its $(n + 1)$th row can be written as $S(n + 1,:) = S(1,:) \rightarrow n$.
So, (36) can be rewritten as

$$M(n + 1,:) = A_U(1,:) \leftarrow (m - n) + a_0 e_{n+1}^T + S(1,:) \rightarrow n + S(1,:). \qquad (37)$$

Let's consider the addition of $A_U(1,:) \leftarrow (m - n)$ and $S(1,:)$ in (37). Because the last $m - n$ entries of $A_U(1,:) \leftarrow (m - n)$ are zeros, we only need to compute the first $n$ entries of $A_U(1,:) \leftarrow (m - n) + S(1,:)$. Obviously, the first $n$ entries of $A_U(1,:) \leftarrow (m - n) + S(1,:)$ are equal to the last $n$ entries of $A_U(1,:) + S(1,:) \rightarrow (m - n)$ and we have

$$A_U(1,:) + S(1,:) \rightarrow (m - n) = S(1,:) + A_U(1,:) \rightarrow (k + 1)(m - n).$$

Since $k = \lfloor (m - 2)/(m - n) \rfloor$, we have $(k + 1)(m - n) \ge m - 1$. Thus, the item $A_U(1,:) \rightarrow (k + 1)(m - n)$ is actually a zero vector and the first $n$ entries of $A_U(1,:) \leftarrow (m - n) + S(1,:)$ are identical to the last $n$ entries of $S(1,:)$. Therefore, the addition of $A_U(1,:) \leftarrow (m - n)$ and $S(1,:)$ does not need any XOR gates. Furthermore, in (37), the sum of the other two items, $a_0 e_{n+1}^T + S(1,:) \rightarrow n$, has $m - n$ nonzero entries. Thus, we conclude that the computation of (37) only needs $m - n$ XOR gates. The XOR complexity of computing (34) can be reduced by using either the linear tree or the hybrid tree method, which lead to different trade-offs between XOR complexity and delay.
Linear Tree
If (34) is computed using a linear tree structure as $\sum_{0 \le i \le k} A_U(1,:) \rightarrow i(m - n)$, we achieve the lowest XOR complexity with the delay increasing linearly with $k$. This approach has been thoroughly studied in [8], in which it was proven that $n - 1$ XOR gates are sufficient to compute (34) with the delay of $kT_X$. We have seen that the computation of (35) needs $m - n$ XOR gates with the delay of $T_X$. Thus, the total complexity of the Mastrovito multiplier in [8] was obtained as:
• XOR complexity: $m^2 - 1$,
• AND complexity: $m^2$,
• Delay: $T_A + (k + 1 + \lceil \log_2 m \rceil) T_X$, $k = \lfloor (m - 2)/(m - n) \rfloor$.
In particular, it is pointed out in [8] that if $n = m/2$, then the XOR complexity can be further reduced from $(m^2 - 1)$ to $(m^2 - m/2)$, with the delay reduced by $T_X$.
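The trinomial figures for the linear tree can be collected in a small Python helper. This is a sketch with hypothetical names: it computes $k = \lfloor (m-2)/(m-n) \rfloor$, lists the shifts $0, (m-n), \ldots, k(m-n)$, and reports the gate counts and the number of XOR levels (to be added to one AND-gate delay), including the special case $n = m/2$ mentioned above.

```python
def trinomial_case(m, n):
    """Linear-tree figures for the irreducible trinomial x^m + x^n + 1."""
    k = (m - 2) // (m - n)
    shifts = [i * (m - n) for i in range(k + 1)]
    xor = m * m - 1
    levels = k + 1 + (m - 1).bit_length()    # k + 1 + ceil(log2 m)
    if 2 * n == m:                           # special case n = m/2
        xor = m * m - m // 2
        levels -= 1
    return {"k": k, "shifts": shifts, "xor": xor, "and": m * m, "xor_levels": levels}

if __name__ == "__main__":
    # x^7 + x^4 + 1 is an irreducible trinomial over GF(2)
    print(trinomial_case(7, 4))
```

Note that $k$ grows as $n$ approaches $m$, so trinomials with a small middle exponent give the shortest delay in this scheme.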
Hybrid Tree
We have seen that using a linear tree structure to compute (34) leads to a very low XOR complexity with the delay increasing linearly with $k$. However, when $k$ is very large, the delay may be intolerable for applications requiring high speed. In the following, we present another approach for computing (34), where the XOR complexity can also be reduced by exploiting subexpression sharing and the delay increases linearly with $\log_2 (k + 1)$. Let the Hamming weight of $k + 1$ be $h$ and $\lfloor \log_2 (k + 1) \rfloor = t_h$; then $k + 1$ can always be written as $k + 1 = 2^{t_h} + 2^{t_{h-1}} + \cdots + 2^{t_1}$, where $t_h > t_{h-1} > \cdots > t_1$. If we denote $A_U(1,:) \rightarrow i(m - n)$ as $A_U(1,:)_i$, (34) can be rewritten as
where
We compute the vector $g_h$ using a binary tree with the height of $t_h$ as shown in Fig. 5, where each node represents an $m$-dimensional vector. At each depth $j$, there are $2^{t_h - j}$ nodes with the following values:
In (40), all other items can be directly obtained by right shifting the first item $\sum_{i=0}^{2^j - 1} A_U(1,:)_i$ by $l \cdot 2^j$ columns, where $1 \le l \le 2^{t_h - j} - 1$. So, only computing the first node at each depth is sufficient to construct the whole binary tree. The first node at depth $j$ is computed by adding the first two nodes at depth $j - 1$, where $m - 2^{j-1}(m - n) - 1$ XOR gates are required. Thus, the XOR complexity of computing $g_h$ is
(41) and (42) give the total XOR complexity for the computation of (38) as
$m - n$ and the delay is $(t_h + 1) T_X$. We have seen that computing (35) requires $m - n$ XOR gates with the delay of $T_X$. Therefore, in trinomial cases, if the above hybrid tree structure is employed to compute (34), the total computation complexity of the Mastrovito multiplier will be
• XOR complexity:
• AND complexity: $m^2$,
• Delay: $T_A + (t_h + 2 + \lceil \log_2 m \rceil) T_X$, where $t_h = \lfloor \log_2 (k + 1) \rfloor$, $h$ is the Hamming weight of $k + 1$, and $w_i$ is defined in (39).
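The parameters that drive the hybrid tree can be extracted with a few lines of Python. The sketch below decomposes $k + 1$ into powers of two, returns the exponents $t_1 < \cdots < t_h$, and compares the resulting number of XOR levels with the linear tree. It covers only the delay side, since the XOR-count formula of the hybrid tree is not reproduced here; all names are hypothetical.

```python
def hybrid_tree_parameters(m, n):
    """Delay-related parameters of the hybrid tree for the trinomial x^m + x^n + 1."""
    k = (m - 2) // (m - n)
    value = k + 1
    exponents = [t for t in range(value.bit_length()) if (value >> t) & 1]
    t_h, h = exponents[-1], len(exponents)   # t_h = floor(log2(k+1)), h = Hamming weight
    log_m = (m - 1).bit_length()             # ceil(log2 m)
    return {
        "k": k,
        "exponents": exponents,
        "h": h,
        "t_h": t_h,
        "hybrid_xor_levels": t_h + 2 + log_m,
        "linear_xor_levels": k + 1 + log_m,
    }

if __name__ == "__main__":
    # x^7 + x^6 + 1: k = 5, so k + 1 = 6 = 2^2 + 2^1
    print(hybrid_tree_parameters(7, 6))
```

The gap between the two level counts widens as $k$ grows, which is the regime where the hybrid tree is intended to be used.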
Pentanomial
A polynomial $f(x) = x^m + x^{k_s} + \cdots + x^{k_1} + 1$ is called a pentanomial if $s = 3$. For general irreducible pentanomials, it is impossible to write the set $N$ in a simple form, e.g., as in the trinomial case, and we have to use the general approach presented in Section 3 to design the product matrix module and perform the possible hardware optimization for the dedicated irreducible pentanomial.
In the following, we present two special irreducible pentanomials for which the set $N$ has a simple form and the complexity of the corresponding Mastrovito multipliers can be easily obtained.
Special Case 1
For irreducible pentanomials where $m/2 \ge k_3$, it follows from the analysis in Section 5.1 that if a linear tree structure is used to compute $S(1,:)$, the total complexity of the Mastrovito multiplier is
• XOR complexity: $(m + 3)(m - 1)$,
• AND complexity: $m^2$,
• Delay: $T_A + (6 + \lceil \log_2 m \rceil) T_X$.
Special Case 2
For such irreducible pentanomials that
applying Algorithm 3.1, we have
[8], it only requires $m - 4r - 1$ XOR gates with the delay of $d_4 T_X$. Therefore, the total XOR complexity of computing $S(1,:)$ is $2m - 5r - 2$ with the delay of $(d_4 + 1) T_X$. Furthermore, using Algorithm 3.2, we need $3(m - 1)$ XOR gates to compute $M$ with the delay of $3T_X$. Thus, in this case, the total complexity of the Mastrovito multiplier is given by
• XOR complexity: $(m + 3)(m - 1) + 2m - 5r - 2$,
• AND complexity: $m^2$,
• Delay: $T_A + (d_4 + 4 + \lceil \log_2 m \rceil) T_X$.
ESP
A polynomial $f(x) = 1 + x^r + x^{2r} + \cdots + x^{tr} + x^m$, where $(t + 1)r = m$, is called an ESP (equally-spaced polynomial). An ESP with $r = 1$ is usually referred to as an AOP (all-one polynomial). For an irreducible ESP, applying Algorithm 3.1, we have $N = \{0, r\}$, based on which it can be shown that matrix $U$ in (5) always has the following form:
where $E_1$ is defined in (7) and
where $O_{i \times j}$ and $I_{i \times i}$ represent the $i \times j$ zero matrix and the $i \times i$ identity matrix, respectively. Therefore, the product matrix $M$ can be computed as $M = A_L + U \cdot A_U = A_L + L \cdot A_U + (E_1 \rightarrow r) \cdot A_U$. Let $M_1$ denote $L \cdot A_U$, we have
where the $r \times m$ matrix $A_r = A_U(1:r,:)$. Let $M_2$ denote $A_L + (E_1 \rightarrow r) \cdot A_U$, we have

$$M_2 = A_L + (E_1 \rightarrow r) \cdot A_U = A_L + \tilde{A}_U \rightarrow r, \qquad (47)$$

where $\tilde{A}_U = [A_U^T, o]^T$. Obviously, $M_1$ and $M_2$ are obtained without any computation. We perform the $GF(2^m)$ multiplication as $c = M \cdot b = M_1 \cdot b + M_2 \cdot b$. According to (46), we know that only computing $A_r \cdot b$ is sufficient to obtain the result of $M_1 \cdot b$. Since $A_r = A_U(1:r,:)$, we have $A_r(i,:) = A_U(i,:) = A_U(1,:) \rightarrow (i - 1)$, which shows that the first $i$ entries in $A_r(i,:)$ are zeros. Therefore we get the complexity of computing $M_1 \cdot b$ as
and the delay is $T_A + \lceil \log_2 (m - 1) \rceil T_X$. From the definition of $A_L$ and $A_U$ in (10), we know that each row vector in $M_2(1:m-r,:)$ contains $r$ zero entries and each row vector $M_2(m - r + i,:)$ contains $r - i$ zeros, where $1 \le i \le r$. Therefore, in the computation of $M_2 \cdot b$, the numbers of AND and XOR gates are
and its delay is $T_A + \lceil \log_2 m \rceil T_X$. Obviously, $M_1 \cdot b$ and $M_2 \cdot b$ can be computed in parallel. At last, we also need $m$ XOR gates to add $M_1 \cdot b$ and $M_2 \cdot b$ together to get the final result $c$. Therefore, we get the total complexity as follows:
• XOR complexity: $m^2 - r$,
• AND complexity: $m^2$,
• Delay: $T_A + (1 + \lceil \log_2 m \rceil) T_X$.
Here, we note that the above XOR complexity result is identical to that obtained in [14].
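The claim that an ESP always leads to the two-element shift set $\{0, r\}$ can be checked numerically with the counting recursion used for general polynomials. The Python sketch below builds the exponent list of an ESP, computes its shift set, and reports the figures quoted above. It is an illustration with hypothetical names and assumes the shift set is the set of shifts reached an odd number of times by ordered sums of the weights $m - k_i$.

```python
def shift_set(m, ks):
    """Shifts n <= m-2 reached an odd number of times by ordered sums of m-k_i."""
    weights = [m - k for k in ks]
    odd = [0] * (m - 1)
    odd[0] = 1
    for n in range(1, m - 1):
        odd[n] = sum(odd[n - w] for w in weights if w <= n) % 2
    return [n for n in range(m - 1) if odd[n]]

def esp_case(m, r):
    """Figures for the ESP 1 + x^r + x^(2r) + ... + x^m, where r divides m."""
    if m % r != 0:
        raise ValueError("r must divide m")
    ks = list(range(r, m, r))                # middle exponents r, 2r, ..., tr
    return {
        "ks": ks,
        "shifts": shift_set(m, ks),
        "xor": m * m - r,
        "and": m * m,
        "xor_levels": 1 + (m - 1).bit_length(),   # 1 + ceil(log2 m)
    }

if __name__ == "__main__":
    print(esp_case(6, 3))                    # x^6 + x^3 + 1
    print(esp_case(4, 1))                    # all-one polynomial of degree 4
```

In both usage lines the shift set comes out as $\{0, r\}$, matching the statement in the text.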
CONCLUSIONS
In this paper, we have presented a systematic design approach for Mastrovito multipliers. The complexity results are $m^2$ AND gates and at most $(m - 1)(m + s) + \sum_{n \in N}(m - n - 1) - (m - 1)$ XOR gates. We note that, although the complexity results appear the same as those presented in [14], we propose an explicit algorithm to compute the set $N$, which makes our design practical. We have extended this design approach to the modified Mastrovito multiplication scheme, which is suitable for high-Hamming weight irreducible polynomials. For both original and modified Mastrovito multipliers with general irreducible polynomials, the developed computation approach effectively exploits subexpression sharing and the complexity analyses are given in detail. The corresponding hardware architectures for both cases are highly modular. Meanwhile, in this paper, we have studied several special irreducible polynomials. For trinomials and ESPs, the complexity results match the best known results achieved in [8] and [14]. We also present another computation approach for trinomials to provide a trade-off between XOR complexity and delay. Moreover, several other special irreducible polynomials, which also lead to low-complexity implementation, have been discovered and the corresponding complexities are given. Finally, we note that, with the explicit algorithms and design procedures, all the proposed efficient design schemes can be easily employed by VLSI design automation tools for dedicated bit-parallel $GF(2^m)$ multiplier design.
and, except for the first matrix, every other matrix in the multiset can be constructed by shifting its parent matrix right by $w$ columns, where $w \in \{m - k_i, 1 \le i \le s\}$. Thus, we can construct a weighted tree $D$ in which each node $d_i$ represents the $i$th matrix in the multiset and the weight of each edge represents the number of columns by which the parent matrix is shifted right to generate its child matrix. According to Procedure A-1, it can be shown that this tree is uniquely determined by the following property:
Property A-1.
Since the first matrix is identical to the matrix $F$ in Theorem 3.1, we can construct the multiset as $\{F \rightarrow h, \forall h \in H\}$.
Obviously, the multiset $H$ may contain some repeated elements, which means the matrix multiset may contain some identical matrices. Because here the addition is logic XOR, the sum of two identical matrices is actually a zero matrix. So, we can remove those repeated elements in pairs from multiset $H$ using the following algorithm:
Algorithm A-1.
Similarly to the proof of Theorem 3.1, we prove Theorem 4.2 in two steps, through which Algorithm 4.1 is developed:
1) First, we use the following procedure to construct two matrix multisets, where the sum of all the matrices in these two multisets is equal to matrix $V$.
Procedure B-1.
1. Initially, set $i = 2$, $n_1 = 2$, and $n_2 = 1$. Create two multisets:
where $p$ represents the coefficient vector of the complementary irreducible polynomial $p(x)$ and $e_1$ is the first canonical unit vector.
2. If the entry at position $(m, 1)$ of the first matrix is 1, then {increase $n_1$ by 1, create the child matrix of the first matrix, and insert it into the first multiset};
3. For every matrix in the two multisets, do
4. For every matrix in the multiset whose entry at position $(m, i - 1)$ is 1, {increase both $n_1$ and $n_2$ by 1, create its two child matrices as
and insert them into the first and the second multiset, respectively};
5. For every matrix in the multiset whose entry at position $(m, i)$ is 1, {increase $n_1$ by 1, create its child matrix as
and insert it into the first multiset};
6. Increase $i$ by 1; if $i = m - 1$, the procedure terminates, else return to Step 3.
Using the above procedure, we get two multisets. Similarly to the proof of Theorem 3.1, based on the recursive relation of successive column vectors in matrix $V$ as shown in Theorem 4.1, it can be proven by induction that matrix $V$ is identical to the sum of all matrices in the two multisets. (B.1)
2) Next, we introduce another method to construct the two multisets with the aid of a weighted tree. Since the complementary polynomial is $p(x) = x^{t_{m-s-1}} + \cdots + x^{t_1}$, the last entry of $p \downarrow (j - 1)$ or $p \downarrow (j - 2)$ is 1 only when $j \in \{m - t_i, m - t_i + 1, 1 \le i \le m - s - 1\}$. According to Procedure B-1, each child matrix in the multiset can be obtained by right shifting its parent matrix by $w \in \{m - t_i, m - t_i + 1, 1 \le i \le m - s - 1\}$ columns and the whole construction process begins from two matrices:
the first matrix and the second matrix, which is the first matrix shifted right by one column. Therefore, we can construct a weighted tree $D$ in which every node $d_i$ represents the $i$th matrix in the multiset and the weight of each edge represents the number of columns by which the parent matrix is shifted right to generate its child matrix. According to Procedure B-1, it can be shown that the tree $D$ is uniquely determined by the following property:
Property B-1.
Since the first matrix is identical to the matrix $P$ in Theorem 4.2, we can construct the multiset as follows:
