ABSTRACT In this paper, we propose a new type of non-recursive Mastrovito multiplier for $GF(2^m)$ using an n-term Karatsuba algorithm (KA), where $GF(2^m)$ is defined by an irreducible trinomial $x^m + x^k + 1$, $m = nk$. We show that this type of trinomial, combined with the n-term KA, can fully exploit the spatial correlation of entries in the related Mastrovito product matrices and leads to a low-complexity architecture. The optimal parameter $n$ is further studied. As the main contribution of this paper, the lower bound of the space complexity of our proposal is about $m^2/2 + O(m^{3/2})$. Meanwhile, the time complexity matches the best Karatsuba multiplier known to date. To the best of our knowledge, this is the first time that a Karatsuba-based multiplier has reached such a space complexity bound while maintaining a relatively low time delay.
I. INTRODUCTION
Arithmetic in the finite field $GF(2^m)$ has many applications in cryptography and error-correcting codes [1], [2]. For instance, one of the most important applications of $GF(2^m)$ is the elliptic curve cryptosystem (ECC) [3]. Among the $GF(2^m)$ arithmetic operations, multiplication is the most important, because other costly operations such as exponentiation and inversion can be carried out by iterated multiplications. Therefore, it is necessary to design highly efficient multipliers for $GF(2^m)$.
The choices of the field basis and the irreducible polynomial are crucial to multiplier design. Compared with other bases, the polynomial basis (PB) is more promising in terms of flexibility in irreducible polynomial selection and hardware optimization [9]. Moreover, some variations of the polynomial basis, e.g., the shifted polynomial basis (SPB) [5], [13] and the generalized polynomial basis (GPB) [10], have been proposed to optimize the multiplier architecture further. Among the irreducible polynomials in use, the irreducible trinomial is one of the most common choices. In recent years, many bit-parallel multipliers using PB have been proposed for $GF(2^m)$ generated by an irreducible trinomial [7], [12], [16], [17], [27].
Generally speaking, PB multiplication consists of two steps: polynomial multiplication and modular reduction. The polynomial multiplication can be optimized using a divide-and-conquer algorithm such as the Karatsuba algorithm (KA) [4], [18]. Such an algorithm saves coefficient multiplications at the cost of extra additions compared with the schoolbook method. Thus, it can be readily adopted to design efficient $GF(2^m)$ multipliers. Specifically, there exists a class of Karatsuba-based multipliers, named non-recursive Karatsuba multipliers, which apply the KA only once in the polynomial multiplication and obtain a trade-off between space and time complexity. In recent years, several non-recursive Karatsuba multipliers have been proposed for various types of irreducible polynomials [14], [15], [21], [24], [28]. On the one hand, such multipliers cost several more XOR gate delays than the fastest bit-parallel multiplier known to date [6], where no divide-and-conquer algorithm is applied. On the other hand, the space complexities of these multipliers are reduced by roughly 1/4.
Empirically, non-recursive Karatsuba multipliers that target specific irreducible polynomials usually have better space and time complexity than those for general polynomials. Such polynomials include the equally-spaced trinomial (EST) [28], the all-one polynomial (AOP) [15], etc. Recently, we explored another special form of trinomial, $x^m + x^{m/3} + 1$, combined with a three-term Karatsuba algorithm to obtain an efficient bit-parallel multiplier [25]. That multiplier costs roughly 2/3 of the circuit gates of the fastest multipliers, while its time delay matches the best known Karatsuba multiplier.

In this study, we take inspiration from our previous scheme and investigate the construction of multipliers of a similar type. Consider the $GF(2^m)$ multiplication defined by an irreducible trinomial $x^m + x^k + 1$, where $m = nk$, $n \ge 2$. We call this type of trinomial an n-spaced trinomial. Obviously, it is an EST if $n = 2$. Shou et al. [26] have already investigated the development of a bit-parallel multiplier for this trinomial using an n-term Karatsuba algorithm, but their scheme requires 3 more XOR gate delays than the fastest one. In this paper, we apply an n-term Karatsuba algorithm along with the shifted polynomial basis (SPB) to simplify the field multiplication. The Mastrovito approach is utilized for polynomial reduction. We demonstrate that the corresponding Mastrovito matrices for the different parts of the field multiplication have relatively simple forms, which leads to an efficient architecture. Moreover, we give explicit formulae for the space and time complexity of the corresponding multipliers. As a result, at its lower bound our proposal costs approximately $m^2/2 + O(m^{3/2})$ circuit gates compared with the fastest bit-parallel multipliers, while its time delay matches that of the Karatsuba-based multipliers known to date.
The rest of this paper is organized as follows: In Section 2, we briefly review the n-term Karatsuba algorithm, the SPB representation and some pertinent notations. Then, we present a new bit-parallel multiplier architecture for n-spaced trinomial in Section 3. After that, a small example is given. Section 4 presents a comparison between the proposed multiplier and some others. More discussion about the optimal parameter is also given. Finally, some conclusions are drawn.
II. PRELIMINARY
In this section, we briefly review some important notations and related algorithms that are used throughout this paper.
A. IRREDUCIBLE n-SPACED TRINOMIAL
We first consider the existence of the irreducible trinomial $x^m + x^k + 1$, $m = nk$, which is used to define the finite field $GF(2^m)$. The following lemma is useful.
Lemma 1 [2] : 
B. SHIFTED POLYNOMIAL BASIS
The shifted polynomial basis (SPB) [13] is a variation of the polynomial basis. This notion was originally applied to the field $GF(2^m)$ generated by irreducible trinomials, and then by pentanomials [5]. In this study, we consider the field $GF(2^m)$ generated by an n-spaced trinomial $f(x) = x^{nk} + x^k + 1$. Let $x$ be a root of $f(x)$; the set $M = \{x^{nk-1}, \ldots, x, 1\}$ constitutes a polynomial basis (PB). The SPB is then obtained by multiplying the set $M$ by a certain power of $x$:
Definition 2 [13]: Let $v$ be an integer. The ordered set $x^{-v}M := \{x^{i-v} \mid 0 \le i \le m-1\}$ is called the shifted polynomial basis with respect to $M$.
Under the SPB representation, the field multiplication can be performed as:
$$C x^{-v} = A x^{-v} \cdot B x^{-v} \bmod f(x).$$
If the parameter $v$ is properly selected, the field multiplication using the SPB representation is simpler than that using the PB representation, especially for fields defined by irreducible trinomials or some types of pentanomials [5]. This characteristic directly leads to a more efficient Mastrovito multiplier, which has lower time complexity than the classic one using PB. Furthermore, it has been proved that the optimal value of $v$ is $k$ or $k-1$ for trinomials [13]. To construct an efficient multiplier for n-spaced trinomials, we choose $v = k$ and use this notation hereafter.
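To make the SPB choice $v = k$ concrete, the following Python sketch models SPB multiplication in software. It is only an illustration under stated assumptions: the function names (`clmul`, `pb_reduce`, `spb_mul`) are hypothetical, elements are stored as integers whose bit $i$ holds the coefficient of $x^{i-k}$, and out-of-range exponents are folded back with the identities $x^e = x^{e+m} + x^{e+k}$ and $x^e = x^{e-m+k} + x^{e-m}$, both of which follow from $f(x) = 0$. It is a bit-serial model, not the bit-parallel architecture developed in this paper.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two polynomials over F_2 stored as bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def pb_reduce(c: int, m: int, k: int) -> int:
    """Reduce c modulo x^m + x^k + 1 in the ordinary polynomial basis."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            # x^i = x^(i-m+k) + x^(i-m)
            c ^= (1 << i) | (1 << (i - m + k)) | (1 << (i - m))
    return c


def spb_mul(a: int, b: int, m: int, k: int) -> int:
    """Multiply two SPB elements (v = k); bit i is the coefficient of x^(i-k)."""
    c = clmul(a, b)              # bit j is the coefficient of x^(j - 2k)
    out = [0] * m                # out[i] is the coefficient of x^(i - k)
    for j in range(2 * m - 1):
        if (c >> j) & 1:
            e = j - 2 * k
            if e < -k:           # too small: x^e = x^(e+k) + x^(e+m)
                targets = (e + k, e + m)
            elif e > m - k - 1:  # too large: x^e = x^(e-m+k) + x^(e-m)
                targets = (e - m + k, e - m)
            else:
                targets = (e,)
            for t in targets:
                out[t + k] ^= 1
    return sum(bit << i for i, bit in enumerate(out))
```

Each out-of-range exponent lands inside $[-k, m-k-1]$ after a single fold, which is the property that makes $v = k$ attractive. One way to sanity-check the sketch is to compare `spb_mul(a, b, m, k)` multiplied by $x^k$ against the ordinary PB product of the same bit patterns.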
C. n-TERM KARATSUBA ALGORITHM
The classic Karatsuba algorithm multiplies two 2-term polynomials using three scalar multiplications at the cost of a few extra additions. Weimerskirch and Paar [8] then proposed a generalized algorithm for the multiplication of polynomials of arbitrary degree, which follows the same idea as the classic one. We denote this algorithm as the n-term KA ($n > 2$). Suppose that there are two polynomials of degree $n-1$ over $\mathbb{F}_2$:
$$A = \sum_{i=0}^{n-1} a_i x^i, \quad B = \sum_{i=0}^{n-1} b_i x^i.$$
The n-term KA for the polynomial multiplication $AB$ is as follows:
• Compute for each $i = 0, \ldots, n-1$,
$$D_i = a_i b_i.$$
• Compute for each $i = 1, \ldots, 2n-3$ and for all $s, t$ with $s + t = i$ and $n > t > s \ge 0$,
$$D_{s,t} = (a_s + a_t)(b_s + b_t).$$
The correctness proof of the above formulae can be found in [8]. Merging the similar items for $D_i$, we have
$$AB = \Big(\sum_{i=0}^{n-1} D_i x^i\Big)\Big(\sum_{i=0}^{n-1} x^i\Big) + \sum_{i=1}^{2n-3}\Big(\sum_{s+t=i,\ n>t>s\ge 0} D_{s,t}\Big)x^i.$$
One can easily check that the above formula costs $n(n+1)/2$ coefficient multiplications and $O(n^2)$ coefficient additions. Compared with the classic KA, the n-term KA saves more coefficient multiplications at the expense of more coefficient additions. Besides Weimerskirch and Paar's algorithm, Fan et al. [19] and Montgomery [20] proposed further Karatsuba-like formulae, which aim to use as few coefficient multiplications as possible. These formulae usually contain complicated linear combinations of the coefficients, which lead to more gate delays in a bit-parallel architecture. Thus, we prefer to use the above algorithm to develop our bit-parallel multiplier.
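The n-term KA can be sketched in a few lines of Python. This is a minimal software model under stated assumptions: each coefficient $a_i$, $b_i$ is taken to be a $k$-bit block (a polynomial over $\mathbb{F}_2$ stored as an integer), and the helper names `clmul`, `n_term_ka` and `recombine` are hypothetical. Over $\mathbb{F}_2$, subtraction coincides with addition, so the cross term $a_s b_t + a_t b_s$ is recovered as $D_{s,t} + D_s + D_t$.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two polynomials over F_2 stored as bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def n_term_ka(A, B):
    """n-term Karatsuba over F_2.

    A, B: lists of n coefficient blocks. Returns (c, mults), where c[i] is the
    coefficient block of X^i in the product and mults counts block products.
    """
    n = len(A)
    D = [clmul(A[i], B[i]) for i in range(n)]      # D_i = a_i * b_i
    mults = n
    c = [0] * (2 * n - 1)
    for s in range(n):
        c[2 * s] ^= D[s]
        for t in range(s + 1, n):
            D_st = clmul(A[s] ^ A[t], B[s] ^ B[t])  # D_{s,t}
            mults += 1
            # a_s*b_t + a_t*b_s = D_{s,t} + D_s + D_t over F_2
            c[s + t] ^= D_st ^ D[s] ^ D[t]
    return c, mults


def recombine(blocks, k: int) -> int:
    """Evaluate sum(blocks[i] * X^i) with X = x^k (blocks may overlap)."""
    r = 0
    for i, blk in enumerate(blocks):
        r ^= blk << (i * k)
    return r
```

For $n = 4$ the sketch performs $4 + 6 = 10$ block multiplications instead of the 16 needed by the schoolbook method, in line with the count $n(n+1)/2$.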
In Section 3, we investigate the construction of a non-recursive Karatsuba multiplier using the n-term KA for the n-spaced trinomial. Our main strategy is analogous to that in [24], which combines the Mastrovito approach with the n-term KA. Therefore, some notations pertaining to matrices and vectors are used as well. Note that these notations have already been presented in [9] and [24]. $\mathbf{Z}(i,:)$, $\mathbf{Z}(:,j)$ and $\mathbf{Z}(i,j)$ represent the $i$th row vector, the $j$th column vector, and the entry at position $(i,j)$ of the matrix $\mathbf{Z}$, respectively. Z [ i] represents the cyclic shift of $\mathbf{Z}$ by upper $i$ rows. Z[ i] represents appending $i$ zero vectors to the top of $\mathbf{Z}$.
III. EFFICIENT MULTIPLIER BASED ON n-TERM KARATSUBA ALGORITHM
In this section, we present an efficient non-recursive Karatsuba multiplier for the n-spaced trinomial $x^{nk} + x^k + 1$ using the SPB representation. We first investigate the structure of the product matrix for polynomial multiplication based on the n-term KA. Then, the reduced matrices are calculated using the Mastrovito approach. Accordingly, we propose the related multiplier architecture. It is shown that the corresponding matrix-vector multiplications can be implemented efficiently for n-spaced trinomials. The space and time complexity of the corresponding multiplier is also discussed.
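Suppose that the finite field $GF(2^m)$ is generated by an irreducible trinomial $x^m + x^k + 1$, $m = nk$, and that the field elements are represented in the SPB. To apply the n-term KA presented previously, we partition two arbitrary field elements $A = x^{-k}\sum_{i=0}^{m-1} a_i x^i$ and $B = x^{-k}\sum_{i=0}^{m-1} b_i x^i$ into $n$ parts of $k$ bits each:
$$A = x^{-k}\sum_{i=0}^{n-1} A_i x^{ik}, \quad B = x^{-k}\sum_{i=0}^{n-1} B_i x^{ik},$$
where
$$A_i = \sum_{j=0}^{k-1} a_{ik+j}x^j, \quad B_i = \sum_{j=0}^{k-1} b_{ik+j}x^j, \quad i = 0, 1, \ldots, n-1.$$
Then, we multiply $A$ and $B$ using the n-term Karatsuba algorithm presented in Section 2 and apply the following transformation:
$$AB = \sum_{i=0}^{n-1} A_i B_i h(x)\, x^{(i-1)k} + \sum_{i=1}^{2n-3}\Big(\sum_{s+t=i,\ n>t>s\ge 0} D_{s,t}\Big)x^{(i-2)k},$$
where $h(x) = x^{-k} + 1 + x^{k} + \cdots + x^{(n-2)k}$ and $D_{s,t} = (A_s + A_t)(B_s + B_t)$. We partition the above expression into two parts, i.e.,
$$S_1 = \sum_{i=0}^{n-1} A_i B_i h(x)\, x^{(i-1)k}, \quad S_2 = \sum_{i=1}^{2n-3}\Big(\sum_{s+t=i,\ n>t>s\ge 0} D_{s,t}\Big)x^{(i-2)k},$$
and compute them independently. Thus, the field multiplication $C = AB \bmod f(x)$ is now rewritten as
$$C = S_1 \bmod f(x) + S_2 \bmod f(x).$$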
In order to apply the Mastrovito approach, we have to rewrite both $S_1$ and $S_2$ in matrix-vector form and then reduce the resulting matrices. Note that $m = nk$, so the corresponding product matrices are more complicated than those presented in [24] and [25]. The following subsections give the details.
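The split of the product into $S_1$ and $S_2$ can be checked numerically with a short Python sketch. The assumptions are ours: the common SPB shift $x^{-2k}$ is dropped (it only relabels exponents), so $h(x)$ is taken here as $1 + x^k + \cdots + x^{(n-1)k}$, and the names `clmul` and `split_s1_s2` are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two polynomials over F_2 stored as bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def split_s1_s2(A, B, k: int):
    """Return (S1, S2) for the block lists A, B, ignoring the SPB shift.

    S1 = h(x) * sum_i A_i B_i x^(ik),  with h(x) = 1 + x^k + ... + x^((n-1)k)
    S2 = sum_{s<t} (A_s + A_t)(B_s + B_t) x^((s+t)k)
    """
    n = len(A)
    h = 0
    for i in range(n):
        h |= 1 << (i * k)
    diag = 0
    for i in range(n):
        diag ^= clmul(A[i], B[i]) << (i * k)
    s1 = clmul(diag, h)
    s2 = 0
    for s in range(n):
        for t in range(s + 1, n):
            s2 ^= clmul(A[s] ^ A[t], B[s] ^ B[t]) << ((s + t) * k)
    return s1, s2
```

The XOR of the two returned values equals the plain carry-less product of the full operands, which is why $S_1$ and $S_2$ can be reduced modulo $f(x)$ independently and added at the end.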
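Since $S_1 = \sum_{i=0}^{n-1} A_i h(x) B_i\, x^{(i-1)k}$, it is clear that $S_1$ in fact consists of $n$ parts, each of which can be recognized as a shift of $A_i h(x) B_i$. Therefore, once the matrix-vector form of $A_i h(x) B_i$ is known, we can develop the matrix-vector form of $S_1$.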
It is noted that
Such an expression can be written as a big matrix-vector multiplication derived from the matrix-vector form of $A_i B_i$. Let $\mathbf{A}_i$ represent the multiplication matrix related to $A_i h(x)$ and $\mathbf{b}_i$ represent the coefficient vector of $B_i$.
, where
The labels on the left side indicate the exponent of the indeterminate $x$ for each row of $\mathbf{A}_i$, which ranges from $-k$ to $nk-1$. However, the degrees of $x$ in $A_i h(x) B_i$ actually lie in the range $[-k, nk-2]$; the last row of $\mathbf{A}_i$ is therefore zero and does not affect the result. The matrices $\mathbf{A}_{i,H}$ and $\mathbf{A}_{i,L}$ are both $k \times k$ triangular Toeplitz matrices, i.e.,
and
Accordingly, these $n$ submatrix-vector multiplications constitute a bigger matrix-vector multiplication pertaining to $S_1$, denoted by $\mathbf{A}_{S_1} \cdot \mathbf{b}$. More explicitly,
For simplicity, we do not write the degree labels of the product matrix here. Notice that deg(
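One can check that the degrees of the terms of $S_1$ lie in the range $[-2k, 2m-2k-2]$. Based on the Mastrovito scheme, $S_1$ needs a further reduction by $f(x)$. The following reduction rule is applied:
$$x^i = \begin{cases} x^{i+k} + x^{i+m}, & -2k \le i \le -k-1, \\ x^{i-m+k} + x^{i-m}, & m-k \le i \le 2m-2k-2. \end{cases} \quad (4)$$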
The reduction can be regarded as the construction of the Mastrovito matrix from $\mathbf{A}_{S_1}$ according to (4). Let $\mathbf{M}_{S_1}$ denote the Mastrovito matrix related to $S_1$. In order to analyze the structure of $\mathbf{M}_{S_1}$, we introduce a lemma, which is the key step toward the development of the multiplier architecture.
We notice that the product matrix $\mathbf{A}$ here includes $2m-1$ rows, with the rows corresponding to the degrees from $-2k$ to $2m-2k-2$. Clearly, the first $k$ rows and the last $m-k-1$ rows correspond to term degrees outside the range $[-k, m-k-1]$. Based on (4), the reduction consists of reducing the rows $\{-2k, -2k+1, \ldots, -k-1\}$ by adding them to the rows $\{-k, \ldots, -1\}$ and $\{m-2k, \ldots, m-k-1\}$, and reducing the rows $\{m-k, \ldots, 2m-2k-2\}$ by adding them to the rows $\{0, \ldots, m-k-2\}$ and $\{-k, \ldots, m-2k-2\}$. The explicit reduction process follows the same line as the proof of Observation 3.1 in [24]. Then, we partition these rows into two categories. Let
We compare the row number and obtain the result immediately. Based on Lemma 3, we immediately give the following proposition with respect to the structure of M S 1 .
Proposition 4: The Mastrovito matrix $\mathbf{M}_{S_1}$ can be constructed as
The proof is the same as the proof of Lemma 3. We directly obtain this conclusion by replacing $\mathbf{A}$ with $\mathbf{A}_{S_1}$.
It is noted that there are some overlapping terms between $\mathbf{M}_{S_1,1}$ and $\mathbf{M}_{S_1,2}$. By adding these two matrices together, we obtain the explicit form of $\mathbf{M}_{S_1}$, which is given in (7), as shown at the bottom of this page. Moreover, the matrix-vector multiplication $S_1 = \mathbf{M}_{S_1} \cdot \mathbf{b}$ can be computed according to the strategy used in [25], and the overlapping terms are reused to save more logic gates.
(iii) Sum up all the $n$ entries of each row using a binary XOR tree to obtain the final result.
Remark: It is clear that the row-vector products in (8) contain all the possible row-vector products in (7). Only $nk^2$ AND gates are required to compute these expressions.
In addition,
These expressions can be computed in parallel, and more logic gates can be saved by sub-expression sharing among the binary XOR trees. Such an approach has already been studied in [24], where the authors show that if two binary XOR trees share $t$ common items, only $t - W(t)$ XOR gates can be saved, where $W(t)$ is the Hamming weight of $t$. It is easy to check that the $j$-th row includes $j$ terms and originally requires $j-1$ XOR gates for a binary XOR tree. Subtracting the saved XOR gates gives the number of XOR gates actually required. Table 1 summarizes the space and time complexity of $S_1 \bmod f(x)$ for all the steps. One can notice that after the calculation of the row-vector products in (8)
will be obtained using a binary XOR tree with a delay of $\lceil \log_2 k \rceil T_X$. Finally, we have to perform additions among the $n$ entries to obtain the coefficient vector with respect to $S_1$. More partial additions can be saved using the same sub-expression sharing. For simplicity, we defer the details to Appendix A.
2) AN EXAMPLE OF $S_1 \bmod f(x)$
We start from the irreducible 4-spaced trinomial $x^4 + x + 1$ over $\mathbb{F}_2$. Then, we can construct another irreducible 4-spaced trinomial of higher degree according to Lemma 1, i.e., $x^{12} + x^3 + 1$.
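The irreducibility of the two trinomials used in this example can be confirmed by brute force. The sketch below is only a check by trial division (it is not the construction of Lemma 1), and the names `poly_mod` and `is_irreducible` are hypothetical. Polynomials over $\mathbb{F}_2$ are stored as integers, with bit $i$ holding the coefficient of $x^i$.

```python
def poly_mod(a: int, b: int) -> int:
    """Remainder of a divided by b, both polynomials over F_2 (b != 0)."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a


def is_irreducible(f: int) -> bool:
    """Trial division by every polynomial of degree 1 .. deg(f)//2."""
    deg = f.bit_length() - 1
    for d in range(2, 1 << (deg // 2 + 1)):
        if poly_mod(f, d) == 0:
            return False
    return True


f4 = (1 << 4) | (1 << 1) | 1      # x^4 + x + 1
f12 = (1 << 12) | (1 << 3) | 1    # x^12 + x^3 + 1
print(is_irreducible(f4), is_irreducible(f12))
```

Trial division is exponential in the degree and is only practical for small examples such as this one; for the degrees of practical interest a dedicated irreducibility test would be preferable.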
Consider the field multiplication using the SPB representation over $GF(2^{12})$ defined by the previous trinomial. We have the SPB parameter $k = 3$, and the SPB is defined as $\{x^{-3}, x^{-2}, \ldots, x^{8}\}$.
Based on equation (2) and the previous description, it is obvious that $AB = S_1 + S_2$, and the explicit forms of $S_1$ and $S_2$ are as follows:
Accordingly, it is easy to compute the matrices $\mathbf{A}_{S_1}$, $\mathbf{M}_{S_1,1}$ and $\mathbf{M}_{S_1,2}$, which are presented in the appendix.
The Mastrovito matrix related to $S_1 \bmod f(x)$ is as shown at the top of this page. One can then check the exact number of logic gates required by every step of $S_1 \bmod f(x)$:
,H * b 3 requires 36 AND gates with one T A gate delay.
• 
sharing, as the binary XOR tree for these expressions can be embedded into those of
. These operations require $2T_X$ delay in parallel.
• The final additions among the 4 entries of each row cost 21 XOR gates using the trick presented in the appendix, which costs another $2T_X$ delay in parallel.
As a result, the calculation of $\mathbf{M}_{S_1,1} \cdot \mathbf{b}$ requires 36 AND gates and 45 XOR gates in total, with $T_A + 4T_X$ gate delay. This result agrees with the complexity formulae shown in Table 1.
We then consider the computation of $S_2 \bmod f(x)$ in detail. Note that
It is noted that these matrix-vector multiplications are independent and thus can be implemented in parallel. However, $S_2$ contains $\binom{n}{2}$ different expressions in all, each of which has a different degree. In order to simplify the reduction process, we first classify these expressions into several categories, where the expressions in the same category constitute a bigger matrix-vector multiplication. Then we perform a reduction for each category. This classification has already been studied in [23], and we use the result directly. Let
The classification lemma is as follows: Lemma 5 [23]: S(n) can be expressed as the plus of g
0 , . . .
Based on the above lemma, it is obvious that $S_2$ can be partitioned into $\lambda$ parts and all these parts are independent. More explicitly,
Obviously, $g_1, g_2, \ldots, g_\lambda$ contain all the nonzero terms of $S_2$, where each of them has $(n-2)k + 2k - 2 + 1 = m - 1$ terms if $n$ is even, or $(n-1)k + 2k - 2 + 1 = m + k - 1$ terms if $n$ is odd. We can first compute these expressions in parallel, then perform reductions related to
matrix-vector bitwise multiplications, i.e., $\mathbf{E}_{s,t} = \mathbf{U}_{s,t} * \mathbf{v}_{s,t}$, in parallel.
(iii) Classify these $\binom{n}{2}$ matrices $\mathbf{E}_{s,t}$ into $\lambda$ parts according to Lemma 5 and assemble the small matrices of the same category into $\lambda$ big matrices $\mathbf{E}_{g_1}, \ldots, \mathbf{E}_{g_\lambda}$, which correspond to $g_1, g_2, \ldots, g_\lambda$.
(iv) Add all the entries of the same row in $\mathbf{E}_{g_1}, \ldots, \mathbf{E}_{g_\lambda}$ using binary XOR trees, and obtain the coefficients of $g_1, g_2, \ldots, g_\lambda$.
Add all these results using a binary XOR tree to obtain $S_2 \bmod f(x)$.
Remark: According to (9), it is clear that after performing the bitwise multiplication, the $\mathbf{E}_{s,t}$ are all $(2k-1) \times k$ matrices. When we classify these matrices and assemble them into $\lambda$ big matrices, one can check that the number of entries in each row of $\mathbf{E}_{g_1}, \ldots, \mathbf{E}_{g_\lambda}$ is equal to $k$. Thus, the coefficients of $g_1, g_2, \ldots, g_\lambda$ are obtained with $\lceil \log_2 k \rceil T_X$ delay. Afterwards, we perform the modular reduction of $g_1 x^{(2\lambda-3)k}, g_2 x^{(2\lambda-5)k}, \ldots, g_\lambda x^{-k}$. Such reductions also rely on equation (4). We have the following observations for the computation of $S_2 \bmod f(x)$.
Observation 6: For $g_1 x^{(2\lambda-3)k}, g_2 x^{(2\lambda-5)k}, \ldots, g_\lambda x^{-k}$, we only need to reduce these expressions at most once.
Proof: The minimal and maximal degrees of the terms in $g_1 x^{(2\lambda-3)k}, g_2 x^{(2\lambda-5)k}, \ldots, g_\lambda x^{-k}$ are $-k$ and $2m-3k-2$, respectively. Applying the reduction formulae of (4), we have
The exponents of $x$ on the right side are now all in the range $[-k, m-k-1]$, so no further reduction is needed.
Observation 7: When the modular reduction and addition are combined, Steps (v) and (vi) can be calculated with at most $\lceil \log_2 n \rceil T_X$ delay.
only need to be reduced once. However, $g_i$ contains a different number of nonzero terms depending on the parity of $n$, which leads to different reduction formulae.
For simplicity, we only consider the case of odd $n$ here and defer the analysis of the other case to the Appendix.
If $n$ is odd, we have $\lambda = \frac{n-1}{2}$, and the degree of $g_i$ is $nk + k - 2$. When we reduce the above expression modulo $f(x) = x^{nk} + x^k + 1$, only two parts need to be reduced. Then,
By combining the non-overlapping parts of the above expressions, the result of $g_i x^{(n-2i-2)k} \bmod f(x)$ is given by
Moreover, it is noted that the term degrees of p
. Therefore, there is no overlapping term among the $p_3^{(i)}$, so adding them up costs no XOR gates. Denote by $r$ the sum of the $p_3^{(i)}$
. Consequently, to obtain $S_2 \bmod f(x)$, we only need to add $p_1^{(1)}, p_2^{(1)}, \ldots, p_1^{(\frac{n-1}{2})}, p_2^{(\frac{n-1}{2})}$ and $r$ in parallel, which costs $\lceil \log_2 n \rceil$ XOR gate delays. This directly yields the observation.
We next analyze the space and time complexity related to $S_2$. Firstly, $2k \cdot \binom{n}{2} = (n^2 - n)k$ XOR gates are needed for the pre-computation of $A_s + A_t$ and $B_s + B_t$ $(n > t > s \ge 0)$ in Step (i), which costs one $T_X$ in parallel. Then, the $\binom{n}{2}$ matrix-vector bitwise multiplications in Step (ii) cost $k^2 \cdot \binom{n}{2} = (n^2 - n)k^2/2$ AND gates with $T_A$ gate delay. The classification in Step (iii) does not cost any logic gates.
Step (iv) adds all the entries of the same row in $\mathbf{E}_{g_1}, \ldots, \mathbf{E}_{g_\lambda}$. Since these matrices are determined by $g_1, g_2, \ldots, g_\lambda$, the number of required XOR gates varies according to the parity of $n$. If $n$ is even, each of $g_1, g_2, \ldots, g_{n/2}$ consists of $n-1$ sub-polynomials. That is to say, each $\mathbf{E}_{g_i}$ $(i = 1, 2, \ldots, \frac{n}{2})$ corresponds to a combination of $n-1$ matrices $\mathbf{E}_{s,t}$. Thus, the coefficient computation for each $g_i$ costs $nk^2 - k^2 - m + 1$ XOR gates with $\lceil \log_2 k \rceil T_X$ delay. If $n$ is odd, each of $g_1, g_2, \ldots, g_{(n-1)/2}$ consists of $n$ sub-polynomials. Similarly, it costs $nk^2 - k - m + 1$ XOR gates for each $g_i$ with $\lceil \log_2 k \rceil T_X$ delay.
Steps (v) and (vi) follow the description in Observation 7. We only add $n$ (or $n-1$) vectors together to obtain $S_2 \bmod f(x)$. The space and time complexity of all the steps is stated in Table 2.
2) AN EXAMPLE OF $S_2 \bmod f(x)$
To illustrate our classification and reduction strategy, we give a small example here. Consider the $S_2$ presented in the former example. According to Lemma 5, $S_2$ can be rewritten as
The organization of $\mathbf{E}_{g_2}$ is almost the same as that of $\mathbf{E}_{g_1}$. It is easy to see that the computation of $g_1, g_2$ in Step (iv) costs 32 XOR gates with $2T_X$ delay. In addition, 17 more XOR gates are needed for Steps (v) and (vi), with $2T_X$ delay. Combined with the number of logic gates required in Steps (i) and (ii), $S_2 \bmod f(x)$ requires 54 AND gates and 85 XOR gates in total, with $T_A + 5T_X$ delay.
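The gate counts of this example can be reproduced from the per-step formulae given above. The following Python sketch is an illustration only: the function name `s2_gate_counts` is hypothetical, the number of groups is taken as $\lambda = n/2$ for even $n$ and, by our reading of the odd case, $\lambda = (n-1)/2$ for odd $n$, and the XOR count of Steps (v) and (vi) is passed in as a parameter (17 in this example) because its closed form appears only in Table 2.

```python
def s2_gate_counts(n: int, k: int, final_xor: int):
    """Return (#AND, #XOR) for S2 mod f(x), using the per-step formulae.

    final_xor is the XOR count of Steps (v) and (vi), supplied by the caller.
    """
    m = n * k
    pairs = n * (n - 1) // 2           # number of (s, t) pairs, n > t > s >= 0
    pre_xor = 2 * k * pairs            # Step (i): A_s + A_t and B_s + B_t
    and_gates = k * k * pairs          # Step (ii): bitwise multiplications
    if n % 2 == 0:
        lam = n // 2                   # lambda for even n
        per_g = n * k * k - k * k - m + 1
    else:
        lam = (n - 1) // 2             # assumed lambda for odd n
        per_g = n * k * k - k - m + 1
    tree_xor = lam * per_g             # Step (iv): XOR trees for g_1..g_lambda
    return and_gates, pre_xor + tree_xor + final_xor


# The example above: n = 4, k = 3, with 17 XOR gates for Steps (v) and (vi).
print(s2_gate_counts(4, 3, 17))
```

For $n = 4$, $k = 3$ this gives 36 XOR gates for Step (i), 54 AND gates for Step (ii) and $2 \times 16 = 32$ XOR gates for Step (iv), matching the totals of 54 AND gates and 85 XOR gates stated in the text.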
C. THEORETIC COMPLEXITY
After the computation of $S_1$ and $S_2$ modulo $f(x)$, another $m$ XOR gates are needed to add the two results together. From Tables 1 and 2, it is clear that the delay of $S_2 \bmod f(x)$ is one $T_X$ more than that of $S_1 \bmod f(x)$. Thus, when $S_1$ and $S_2$ modulo $f(x)$ are computed in parallel, the delay is $T_A + (1 + \lceil \log_2 k \rceil + \lceil \log_2 n \rceil)T_X$ (or $T_A + (1 + \lceil \log_2 k \rceil + \lceil \log_2 (n-1) \rceil)T_X$ for even $n$). Adding one more $T_X$ for the final addition, we obtain the time complexity of our proposed architecture as
$$T_A + (2 + \lceil \log_2 k \rceil + \lceil \log_2 n \rceil)T_X.$$
The space complexity is
The symbol "*" marks the case of odd $n$. The formula for the number of XOR gates varies according to the parity of $n$. We note that these formulae contain sums of Hamming weights related to $k-1$ or $n-2$. In fact, the expression $\sum_{i=0}^{\delta} W(i)$ can be roughly written as
, where $\delta$ is a nonzero integer. Thus, the Hamming weight terms related to $n$ are roughly $O(m \log_2 n)$, while the terms related to $k$ are roughly $O(m \log_2 k)$. Omitting the linear parts, the number of required XOR gates can be rewritten as:
The above formula reveals the lower bound of the space complexity of our proposal. Based on (10) and (11), it is obvious that the number of required AND gates decreases as the parameter $n$ increases. If $n = m$, #AND achieves its lower bound, i.e., $\frac{m^2 + m}{2}$. In this case, however, the number of required XOR gates is more than $\frac{7m^2}{4}$. Therefore, the optimal parameter $n$ should be the one that minimizes the total number of XOR and AND gates. We combine the two formulae for #AND and #XOR and define a function:
, we obtain the minimal value of $M(n)$, which indicates the best asymptotic space complexity of our proposal. In this case, $k$ is almost equal to $n$, and the space complexity is about $m^2/2 + O(m^{3/2})$. Figure 1 shows how the space complexity changes as $n$ increases. It is clear that $n$ cannot increase indefinitely. Combined with the analysis of the lowest asymptotic space complexity, we can see that our proposal is more suitable for $x^{nk} + x^k + 1$ where $n$ is smaller than $k$.
IV. SPEEDUP STRATEGY
As shown in the previous section, the time delay of our proposal is $T_A + (2 + \lceil \log_2 k \rceil + \lceil \log_2 n \rceil)T_X$. Since $\lceil \log_2 k \rceil + \lceil \log_2 n \rceil \le \lceil \log_2 m \rceil + 1$, the upper bound of the delay is $T_A + (3 + \lceil \log_2 m \rceil)T_X$. This result is worse than that of the multiplier using the classic Karatsuba algorithm. The main reason is that the delay of $S_2$ is larger than that of $S_1$. However, we can add the intermediate values in advance during the computation of $S_1$ and $S_2$ to speed up the whole architecture. For better comprehension, we define some additional notations.
• $\mathbf{q}_{S_2,0}, \mathbf{q}_{S_2,1}, \ldots, \mathbf{q}_{S_2,n-1}$ represent the coordinate vectors corresponding to the polynomials p
and $r$ (see footnote 1) after we compute the entry additions of Step (v). For example,
According to Table 1 and 2, it is easy to see that the computation of q S 1 ,0 , q
Our speedup strategy is to add the vectors $\mathbf{q}_{S_1,i}$ and $\mathbf{q}_{S_2,i}$ directly, before $S_1$ and $S_2$ are completed. Since the computation of $\mathbf{q}_{S_2,i}$ costs one more $T_X$ than that of $\mathbf{q}_{S_1,i}$, we can perform one more addition for every two vectors, i.e., $\mathbf{q}_{S_1,i} + \mathbf{q}_{S_1,i+1}$ for $i = 0, 2, \ldots, n-2$ (or $i = 0, 2, \ldots, n-3$ if $n$ is odd). After this addition, we obtain $\lceil n/2 \rceil$ column vectors. Together with the $n$ (or $n-1$) coordinate vectors $\mathbf{q}_{S_2,0}, \mathbf{q}_{S_2,1}, \ldots, \mathbf{q}_{S_2,n-1}$, there are at most $\lceil 3n/2 \rceil$ vectors to be added, which requires only $\lceil \log_2 \lceil 3n/2 \rceil \rceil T_X$. The computation sequence of our architecture is arranged as shown in Fig. 1.
Footnote 1: If $n$ is even, there are only $n-1$ such coordinate vectors.
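As a result, the whole time delay is
$$T_A + \left(1 + \lceil \log_2 k \rceil + \lceil \log_2 \lceil 3n/2 \rceil \rceil\right)T_X.$$
Furthermore, based on a related lemma of [11], we have $1 + \lceil \log_2 \lceil 3n/2 \rceil \rceil = \lceil \log_2 3n \rceil$. Thus, the time delay can be simplified to $T_A + (\lceil \log_2 k \rceil + \lceil \log_2 3n \rceil)T_X$.
VOLUME 6, 2018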
V. COMPARISON AND DISCUSSION
According to the descriptions in the previous section, the time delay of our proposal using the speedup strategy is $T_A + (\lceil \log_2 k \rceil + \lceil \log_2 3n \rceil)T_X$. It is especially attractive if
$$\lceil \log_2 k \rceil + \lceil \log_2 3n \rceil = \lceil \log_2 (3n \cdot k) \rceil = \lceil \log_2 3m \rceil. \quad (12)$$
In this case, the corresponding time delay is $T_A + \lceil \log_2 3m \rceil T_X$, which approximately equals that of the fastest 2-term Karatsuba-based multiplier [24]. In fact, we have checked all the irreducible trinomials $x^{nk} + x^k + 1$, $k > 1$, of degree $m = nk \in [100, 1023]$ over $\mathbb{F}_2$, and found that about 54% of them satisfy (12), while the rest require at most one more $T_X$ than the fastest Karatsuba multiplier so far. Table 3 gives a comparison of different implementations of bit-parallel multipliers in the fields generated by trinomials.
More explicitly, we omit the expression $O(m \log_2 n)$ in (11), as $n$ is usually smaller than $k$, as shown in Section 3.3. Based on this table, it is easy to see that our proposal has better space complexity while maintaining relatively low time complexity. The best of our results costs only about $m^2/2 + O(m^{3/2})$ circuit gates compared with the previous architectures that do not use a divide-and-conquer algorithm. On the other hand, the time complexity of the proposed multiplier is very close to the fastest result using the classic Karatsuba algorithm.
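The delay expressions of the speedup strategy and condition (12) are easy to evaluate for concrete parameters. The Python sketch below is illustrative; the helper names `ceil_log2`, `xor_levels` and `meets_condition_12` are hypothetical, and the ceilings are our reading of the delay formulae. It counts the XOR levels $\lceil \log_2 k \rceil + \lceil \log_2 3n \rceil$ and tests whether they coincide with $\lceil \log_2 3m \rceil$.

```python
def ceil_log2(x: int) -> int:
    """Ceiling of log2(x) for a positive integer x, using integer arithmetic."""
    return (x - 1).bit_length()


def xor_levels(n: int, k: int) -> int:
    """XOR gate levels with the speedup strategy: ceil(log2 k) + ceil(log2 3n)."""
    return ceil_log2(k) + ceil_log2(3 * n)


def meets_condition_12(n: int, k: int) -> bool:
    """True if ceil(log2 k) + ceil(log2 3n) == ceil(log2 3m) with m = n*k."""
    return xor_levels(n, k) == ceil_log2(3 * n * k)


# The running example x^12 + x^3 + 1 (n = 4, k = 3).
print(xor_levels(4, 3), meets_condition_12(4, 3))

# The simplification 1 + ceil(log2(ceil(3n/2))) == ceil(log2(3n)), checked numerically.
print(all(1 + ceil_log2((3 * n + 1) // 2) == ceil_log2(3 * n) for n in range(1, 1000)))
```

For the running example the sketch reports 6 XOR levels and shows that (12) holds, whereas a pair such as $n = 2$, $k = 5$ does not satisfy (12); the latter pair illustrates the arithmetic only, without regard to whether the corresponding trinomial is irreducible.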
VI. CONCLUSION
In this paper, we investigate the application of an n-term Karatsuba algorithm and develop a new type of bit-parallel multiplier for a class of irreducible trinomials. The proposed architecture shows that specific types of trinomials, combined with variations of the Karatsuba algorithm, can reduce the space complexity further compared with classic Karatsuba multipliers. In future work, we will study the construction of n-term Karatsuba multipliers for general trinomials.
APPENDIX A THE SUB-EXPRESSION SHARING FOR ENTRIES ADDITION IN $S_1$
Clearly, both P i and P i are k × 1 vectors. Therefore, (7) can be rewritten as:
So we only need to compute entries additions for k intermediate coordinate vectors and all the entries additions can be computed through reusing these values. Table 4 indicates the overlapped values and the number of saved XOR gates. Note that the additions between these vectors without sub-expression sharing require 2(n − 1)k − k n−2 i=1 i XOR gates. By subtracting the number of saved XOR gates, the number of required XOR gates actually is
APPENDIX B RELATED MATRICES OF THE EXAMPLE IN SECTION 3.1.2
Since we know the form of $\mathbf{A}_{i,L}$ and $\mathbf{A}_{i,H}$, it is easy to obtain the explicit formulae for $\mathbf{A}_i$ $(i = 0, 1, 2, 3)$ and $\mathbf{A}_{S_1}$.
Owing to the size of the above matrix, we do not show the row labels on the left side. Note that the rows of $\mathbf{A}_{S_1}$ correspond to the term degrees in $[-6, 17]$.
and A S 1 , as shown at the top of the next page. 
After the reduction process, the explicit forms of $\mathbf{M}_{S_1,1}$ and $\mathbf{M}_{S_1,2}$ are as shown at the top of the next page.
APPENDIX C PROOF OF OBSERVATION FOR EVEN n
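If $n$ is even, we have $\lambda = \frac{n}{2}$. Following the same line as in the odd case, the reduced terms of $g_1 x^{(2\lambda-3)k}, g_2 x^{(2\lambda-5)k}, \ldots, g_\lambda x^{-k}$ are added in parallel, which costs $\lceil \log_2 (n-1) \rceil$ XOR gate delays.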
$$\mathbf{M}_{S_1,2} = \left[\begin{array}{ccc|ccc|ccc|ccc}
a_0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_1 & a_0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_2 & a_1 & a_0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\vdots & & & & & & & & & & &
\end{array}\right]$$