simplicity, its polynomial version is widely adopted to design VLSI parallel multipliers in $GF(2^n)$-based cryptosystems [13]-[34].
Two parameters are often used to measure the performance of a $GF(2^n)$ parallel multiplier, namely, the space and time complexities. The space complexity is represented in terms of the total number of 2-input XOR and AND gates used. The corresponding time complexity is given in terms of the maximum delay faced by a signal due to these XOR and AND gates. Symbols "$T_A$" and "$T_X$" are often used to represent the delays of one 2-input AND gate and one 2-input XOR gate, respectively. The existing bit parallel $GF(2^n)$ multipliers may be simply classified into the following three categories according to the asymptotic space complexity of the multiplication algorithm: quadratic, subquadratic and hybrid multipliers. A number of quadratic multipliers have been proposed in the literature in which different basis representations of $GF(2^n)$ elements are used, e.g., polynomial, shifted polynomial, normal, dual, weakly dual, and triangular bases.
Their time complexities are lower than those of subquadratic multipliers. The main advantage of subquadratic multipliers is that their low asymptotic space complexities make it possible to implement VLSI multipliers for large values of n. But when the operands are small, e.g., 32 bits, the space complexity may no longer be the critical factor for a cryptographic processor designer. Instead, the computational speed becomes the key factor. Based on this consideration, the hybrid approach is often used to design practical multipliers [6] [18] [21] [23] [32]. These multipliers first perform a few KOA iterations to reduce the overall space complexity, and then apply a quadratic multiplication algorithm to the resulting small operands to achieve relatively high speed. By selecting different stop conditions for the KOA iterations, the hybrid approach can provide a trade-off between the time and space complexities. For the purpose of comparison, reference [21] implemented four parallel $GF(2^{233})$ multipliers on Xilinx FPGAs, namely classical, hybrid Karatsuba, Massey-Omura, and Sunar-Koç, and analyzed their time and space complexities in detail. It was shown that the hybrid Karatsuba multiplier is the best choice for polynomial basis representations, while the Sunar-Koç multiplier is the best choice for normal bases. An improved structure of the hybrid Karatsuba multiplier of [21] was presented later in [32], where different possibilities of implementing Karatsuba multipliers were also studied.
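The trade-off offered by the stop condition can be illustrated with a small counting sketch. It is not taken from the cited designs: the function name `hybrid_cost` and the parameter `stop` are hypothetical, n is assumed to be a power of two, each KOA iteration is assumed to replace one n-bit product by three n/2-bit products, and the quadratic multiplier is assumed to use n·n AND gates.

```python
def hybrid_cost(n, stop):
    """Return (and_gates, koa_levels) for an n-bit hybrid multiplier.

    Illustrative assumptions: n is a power of two; operands of at most
    `stop` bits are handled by a quadratic (schoolbook) multiplier that
    uses n*n AND gates; larger operands take one more KOA iteration.
    """
    if n <= stop:
        return n * n, 0
    sub_gates, sub_levels = hybrid_cost(n // 2, stop)
    # One KOA iteration: three half-size products instead of four.
    return 3 * sub_gates, sub_levels + 1


if __name__ == "__main__":
    for stop in (1, 4, 16, 64):
        gates, levels = hybrid_cost(64, stop)
        print(f"stop={stop:2d}  KOA levels={levels}  AND gates={gates}")
```

A smaller `stop` value means more KOA iterations and fewer AND gates, at the price of more XOR levels on the critical path.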
Other related work on KOA multipliers includes the following. In [15], the exact space and time complexities of KOA multipliers were derived. A generalization of the KOA was proposed in [22], and another generalization, i.e., the Winograd short convolution algorithm, was presented in [25]. In reference [26], a non-redundant KOA multiplier was proposed. Its space complexity is lower than that of the original KOA multiplier. References [27] and [28] presented some improved t-term ($4 < t < 19$) $GF(2)[x]$ Karatsuba-like formulae. These formulae may be mixed with the 2-term and 3-term formulae within recursive constructions, leading to low space complexity subquadratic $GF(2)[x]$ multipliers. FPGA and ASIC KOA implementations can be found in, for example, [19], [21], [29], [31], [32] and [34]. The interested reader is referred to [34] for a more detailed comparison of some hardware designs.
In this work, we propose a new algorithm for fast hardware implementations of the polynomial KOA. The proposed algorithm uses a simple and straightforward method to split input operands [3] [4]. The theoretical XOR gate delay of the proposed subquadratic Karatsuba multiplier is reduced significantly. For example, it is reduced by about 33% and 25% for $n = 2^t$ and $n = 3^t$ ($t > 1$), respectively. To the best of our knowledge, this parameter has never been improved since the original KOA was first used to design $GF(2^n)$ multipliers in 1990 [13].
elements. To explain the general idea of KOA easily, we will assume that $n = 2m = 2^t$ ($t > 1$) in the following.
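First, the previous KOA implementations split polynomials A and B into the "most significant half" and the "least significant half" as follows:

$$A = A_H x^m + A_L, \qquad B = B_H x^m + B_L,$$

where $A_H = \sum_{i=0}^{m-1} a_{i+m} x^i$, $A_L = \sum_{i=0}^{m-1} a_i x^i$, $B_H = \sum_{i=0}^{m-1} b_{i+m} x^i$ and $B_L = \sum_{i=0}^{m-1} b_i x^i$.

Then the product AB is computed recursively using

$$AB = A_H B_H x^n + \{[(A_H + A_L)(B_H + B_L)] - [A_H B_H + A_L B_L]\} x^m + A_L B_L. \qquad (1)$$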
We note that "−" is the same as "+" in $GF(2)$, and a 2-input XOR gate can be used to realize a "−" or "+" operation. For VLSI implementations of (1), the expressions in the two square brackets are calculated concurrently, and one XOR gate delay, i.e., $1T_X$, is required. Then the "−" operation is performed at a cost of $1T_X$. Therefore, two XOR gate delays, $2T_X$, are required to compute the expression in the curly bracket besides the gate delays to compute the three partial products. Finally, the three polynomials in (1) are XORed by adding coefficients of common exponents of x together. The VLSI module used to perform this XOR operation is called the overlap module [32]. In order to explain overlaps of common exponents of x clearly, we present the following table, which shows ranges of x's exponents in these three polynomials.

Polynomial in (1): range of exponents of x
$A_H B_H x^n$: from $n$ to $2n - 2$
$\{\cdot\} x^m$ (the curly-bracket term): from $m$ to $3m - 2$
$A_L B_L$: from $0$ to $n - 2$
From the table, it is clear that overlaps occur only when n ≥ 4 (or m ≥ 2), and there is no overlap when n = 2 (or m = 1).
Because of these overlaps, one XOR gate delay is required in the overlap module to compute the summation of the three polynomials in (1). Therefore, a total of 3 XOR gate delays, i.e., $3T_X$, are required in (1) besides the cost of the recursive computation of the three partial products.
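The half-split recursion described above can be checked in software with the following minimal sketch. It is a functional model only, not a gate-level design: polynomials over $GF(2)$ are stored in Python integers (bit i is the coefficient of $x^i$), n is assumed to be a power of two, and the names `clmul` and `koa` are hypothetical.

```python
def clmul(a, b):
    """Schoolbook product in GF(2)[x]; bit i of an int is the coefficient of x^i."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def koa(a, b, n):
    """Product of two polynomials with n coefficients (n a power of two),
    using the most/least significant half split."""
    if n == 1:
        return a & b  # one AND gate
    m = n // 2
    mask = (1 << m) - 1
    a_lo, a_hi = a & mask, a >> m
    b_lo, b_hi = b & mask, b >> m
    p_lo = koa(a_lo, b_lo, m)
    p_hi = koa(a_hi, b_hi, m)
    p_mid = koa(a_lo ^ a_hi, b_lo ^ b_hi, m)
    middle = p_mid ^ (p_lo ^ p_hi)  # "-" is the same as "+" (XOR) over GF(2)
    # For m >= 2 the three shifted terms share exponents of x, so they must
    # be XORed together; this step models the overlap module.
    return (p_hi << n) ^ (middle << m) ^ p_lo


if __name__ == "__main__":
    ok = all(koa(a, b, 4) == clmul(a, b) for a in range(16) for b in range(16))
    print("half-split KOA matches schoolbook for n = 4:", ok)
```

The final XOR of the three shifted terms is the operation whose extra gate delay is counted for the overlap module in the text.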
In order to compute the exact complexities of the above binary polynomial KOA, we introduce some symbols from [7]. Let S and D stand for "Space" and "Delay", respectively. We use $S^{\otimes}(n)$ and $S^{\oplus}(n)$ to denote the numbers of multiplication (AND) and addition (XOR) operations, respectively, and $D^{\otimes}(n)$ and $D^{\oplus}(n)$ to denote the gate delays introduced by multiplication and addition operations, respectively.
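Our earlier discussion shows that the XOR gate delay, measured in units of $T_X$, satisfies $D^{\oplus}(n) = D^{\oplus}(n/2) + 3$ for $n \geq 4$.
It is easy to see that $2T_X$ is required to compute the product of two polynomials of degree 1, i.e., $D^{\oplus}(2) = 2$.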
Thus, we have established the recurrence relation of the XOR gate delay. Similarly, we may obtain the recurrence relations of $S^{\otimes}(n)$, $S^{\oplus}(n)$ and $D^{\otimes}(n)$. These recurrence relations describe the time and space complexities of the original KOA [32].
After solving the above recurrence relations using the formula derived in [7] , we obtain the following complexity results for the binary polynomial KOA [15] , [32] .
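For the XOR gate delay alone, the solution can also be obtained by unrolling the recurrence directly. The following short derivation is a sketch under the assumptions stated in the discussion above, namely that each recursion level with $n \geq 4$ adds $3T_X$ and that $D^{\oplus}(2) = 2$. It does not use the general formula of [7].

```latex
\begin{align*}
D^{\oplus}(n) &= D^{\oplus}(n/2) + 3 \\
              &= D^{\oplus}(n/4) + 2 \cdot 3 \\
              &\;\;\vdots \\
              &= D^{\oplus}(2) + 3(\log_2 n - 1) \\
              &= 2 + 3\log_2 n - 3 \;=\; 3\log_2 n - 1 .
\end{align*}
```

For instance, $n = 4$ gives $D^{\oplus}(4) = 5$.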
B. Motivation
Besides KOA, a Toeplitz matrix-vector product approach was presented recently to construct subquadratic $GF(2^n)$ multipliers [7]. It takes advantage of the shifted polynomial basis [8] and applies the coordinate transformation technique of [9] and [10]. Both the space and time complexities of the resulting multiplier are better than those of the best KOA-based subquadratic multipliers. For example, with $n = 2^t$ ($t > 1$), the space complexity is about 8% better and the time complexity is about 33% better.
Since these Toeplitz matrix-vector product formulae are obtained by transposing [5, Th. 6, p. 17] the corresponding polynomial KOA-like formulae, the following question arises naturally: is it possible to reduce the time or space complexity of the KOA-based subquadratic $GF(2)[x]$ VLSI multiplier further? In the next section, we will answer this question positively, namely, we will improve the theoretical XOR gate delay of the KOA-based subquadratic $GF(2)[x]$ multiplier.
The improved KOA algorithm can be used to design multipliers in both the ring $GF(2)[x]$ and the finite field $GF(2^n)$, while the Toeplitz matrix-vector product method cannot be used directly to design
II. NEW METHOD FOR FAST IMPLEMENTATIONS OF GF (2)[x] KOAS
We first introduce the splitting method in [3] and [4]. Instead of splitting input operands into the "most significant half" and the "least significant half", the method splits operands according to the parity of x's exponent. That is to say, we may rewrite A and B as follows
Clearly, formula (3) also includes three partial products. For VLSI implementations of (3), the expressions in the three square brackets can be computed concurrently, and these addition operations require one XOR gate delay, $1T_X$. Since the "−" operation also needs $1T_X$, computing AB via (3) needs only a total of $2T_X$ besides the cost of the recursive computation of the three partial products. Compared to the $3T_X$ required in formula (1), one XOR gate delay is saved in each recursive iteration. Consequently, the following recurrence relations, which describe the algorithm complexities, can be established.
Their solutions are as follows:
Compared to the complexities of the original $GF(2)[x]$ KOA listed in (2), the proposed method reduces the XOR gate delay $D^{\oplus}(n)$ from $(3\log_2 n - 1)$ to $2\log_2 n$, or by about 33% for $n = 2^t$.
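The parity-based split can be modelled in software as follows. This is a hedged sketch rather than the exact formula (3): it assumes the identity $AB = [A_e B_e + y A_o B_o] + x\{(A_e + A_o)(B_e + B_o) - A_e B_e - A_o B_o\}$ with $y = x^2$, where $A_e$ and $A_o$ collect the coefficients of even and odd exponents of x. The symbols $A_e$, $A_o$ and all function names are hypothetical, and polynomials are stored in Python integers with bit i holding the coefficient of $x^i$.

```python
def clmul(a, b):
    """Schoolbook product in GF(2)[x], used only as a reference."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def split_parity(a, n):
    """Return (even, odd) coefficient polynomials of a, written in y = x^2."""
    even = odd = 0
    for i in range(n // 2):
        even |= ((a >> (2 * i)) & 1) << i
        odd |= ((a >> (2 * i + 1)) & 1) << i
    return even, odd


def spread(p):
    """Substitute y = x^2: move the coefficient at bit i to bit 2*i."""
    out, i = 0, 0
    while p:
        out |= (p & 1) << (2 * i)
        p >>= 1
        i += 1
    return out


def koa_parity(a, b, n):
    """Product of two polynomials with n coefficients (n a power of two)."""
    if n == 1:
        return a & b
    m = n // 2
    a_e, a_o = split_parity(a, n)
    b_e, b_o = split_parity(b, n)
    p_e = koa_parity(a_e, b_e, m)
    p_o = koa_parity(a_o, b_o, m)
    p_s = koa_parity(a_e ^ a_o, b_e ^ b_o, m)
    even_part = p_e ^ (p_o << 1)  # A_e*B_e + y*A_o*B_o, a polynomial in y
    odd_part = p_s ^ (p_e ^ p_o)  # coefficient of x, a polynomial in y
    # even_part lands on even exponents of x and odd_part on odd ones, so the
    # two results are interleaved (bitwise OR) with no further XOR gates.
    return spread(even_part) | (spread(odd_part) << 1)


if __name__ == "__main__":
    ok = all(koa_parity(a, b, 4) == clmul(a, b)
             for a in range(16) for b in range(16))
    print("parity-split KOA matches schoolbook for n = 4:", ok)
```

The last line of `koa_parity` uses a bitwise OR rather than an XOR. This is valid only because the two operands never share a set bit position, which is the software counterpart of the absence of an overlap module.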
Similar to generalizations of the original KOA, which is also called the 2-way split, we may derive some KOA-like formulae for j-way splits ($j > 2$). As an example, we now present the KOA formula for $n = 3k = 3^t$ ($t > 1$). It is based on the following 6-multiplication formula.
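Let $y = x^3$ and split A as follows

$$A = A_0(y) + x A_1(y) + x^2 A_2(y), \qquad A_i(y) = \sum_{j=0}^{k-1} a_{3j+i}\, y^j \quad (i = 0, 1, 2),$$

and split B in the same way. Then we have

$$AB = \{A_0 B_0 + y[(A_1 + A_2)(B_1 + B_2) - A_1 B_1 - A_2 B_2]\} + x\{(A_0 + A_1)(B_0 + B_1) - [A_0 B_0 + A_1 B_1] + y A_2 B_2\} + x^2\{(A_0 + A_2)(B_0 + B_2) - A_2 B_2 + [A_0 B_0 + A_1 B_1]\},$$

where "(y)"s in expressions $A_i(y)$ and $B_i(y)$ are omitted.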
There are four partial products in the first curly bracket, and they are polynomials in y of degrees $2k - 2$, $2k - 1$, $2k - 1$ and $2k - 1$, respectively. Since the constant terms of the last three partial products are zero, computing the expression in the first curly bracket requires $2k + (2k - 2) + (2k - 1) + (2k - 1) = 8k - 4$ XOR gates. Similarly, it is easy to see that the numbers of XOR gates required in the last two curly brackets are $2k + (2k - 2) + (2k - 1) + (2k - 1) = 8k - 4$ and $2k + 3(2k - 1) = 8k - 3$, respectively.
But the summation A 0 B 0 + A 1 B 1 , which appears in the last two curly brackets, can be reused.
Therefore, 2k − 1 XOR gates can be saved, and the total number of the XOR gates required in the above formula is (8k − 4) + (8k − 4) + (8k − 3) − (2k − 1) = 22k − 10 besides the cost of the recursive computation of the six partial products. Based on the above discussion, we obtain the following recurrence relations that describe the complexities of this formula. Their solutions will be presented in Table II in the next subsection.
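$S^{\otimes}(n) = 6\,S^{\otimes}(n/3)$ and $S^{\oplus}(n) = 6\,S^{\oplus}(n/3) + \frac{22}{3}n - 10$; and $D^{\oplus}(n) = D^{\oplus}(n/3) + 3$.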
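A software model of a 3-way split with six partial products is sketched below. It is an illustration under stated assumptions, not necessarily the exact bracket arrangement of the formula above: operands are split by the residue of x's exponent modulo 3 into $A_0$, $A_1$, $A_2$ (polynomials in $y = x^3$), the six products are $A_0B_0$, $A_1B_1$, $A_2B_2$, $(A_0+A_1)(B_0+B_1)$, $(A_0+A_2)(B_0+B_2)$ and $(A_1+A_2)(B_1+B_2)$, and n is a power of three. All function names are hypothetical.

```python
def clmul(a, b):
    """Schoolbook product in GF(2)[x], used only as a reference."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def split3(a, k):
    """Return [A0, A1, A2]; A_i collects the coefficients of x^(3j+i), in y = x^3."""
    parts = [0, 0, 0]
    for j in range(k):
        for i in range(3):
            parts[i] |= ((a >> (3 * j + i)) & 1) << j
    return parts


def spread3(p):
    """Substitute y = x^3: move the coefficient at bit j to bit 3*j."""
    out, j = 0, 0
    while p:
        out |= (p & 1) << (3 * j)
        p >>= 1
        j += 1
    return out


def koa3(a, b, n):
    """Product of two polynomials with n coefficients (n a power of three)."""
    if n == 1:
        return a & b
    k = n // 3
    a0, a1, a2 = split3(a, k)
    b0, b1, b2 = split3(b, k)
    p0, p1, p2 = koa3(a0, b0, k), koa3(a1, b1, k), koa3(a2, b2, k)
    p01 = koa3(a0 ^ a1, b0 ^ b1, k)
    p02 = koa3(a0 ^ a2, b0 ^ b2, k)
    p12 = koa3(a1 ^ a2, b1 ^ b2, k)
    shared = p0 ^ p1                      # A0*B0 + A1*B1, computed once, used twice
    c0 = p0 ^ ((p12 ^ p1 ^ p2) << 1)      # exponents of x congruent to 0 mod 3
    c1 = (p01 ^ shared) ^ (p2 << 1)       # exponents congruent to 1 mod 3
    c2 = p02 ^ p2 ^ shared                # exponents congruent to 2 mod 3
    # The three residue classes never collide, so they are simply interleaved.
    return spread3(c0) | (spread3(c1) << 1) | (spread3(c2) << 2)


if __name__ == "__main__":
    ok = all(koa3(a, b, 3) == clmul(a, b) for a in range(8) for b in range(8))
    print("3-way split matches schoolbook for n = 3:", ok)
```

The variable `shared` mirrors the reuse of the summation $A_0B_0 + A_1B_1$ that the XOR gate count above relies on.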
A. Comparisons
Table II compares asymptotic complexities of the proposed formulae with the previous KOA and Toeplitz matrix-vector product (TMVP) formulae over the ground field $GF(2)$, where #AND and #XOR denote the total numbers of AND and XOR gates, respectively. The size of operands is assumed to be $n = 2^t$ or $3^t$ ($t > 1$). These comparisons are made from a theoretical viewpoint.
For practical designs of VLSI multipliers, it is a better choice to merge the proposed method into the hybrid approach discussed in the introduction section.
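The XOR gate delays of the two 2-way algorithms can be tabulated with a few lines of code. The sketch below assumes the per-level costs stated earlier ($3T_X$ per iteration for the original KOA, $2T_X$ for the proposed method, and $2T_X$ for a product of two degree-1 polynomials in both cases). The function names are hypothetical, and delays are expressed in units of $T_X$.

```python
def xor_delay_original(n):
    """XOR gate delay of the original 2-way KOA for n = 2^t, t >= 1."""
    return 2 if n == 2 else xor_delay_original(n // 2) + 3


def xor_delay_proposed(n):
    """XOR gate delay of the parity-split KOA for n = 2^t, t >= 1."""
    return 2 if n == 2 else xor_delay_proposed(n // 2) + 2


if __name__ == "__main__":
    for t in (2, 4, 8, 10):
        n = 2 ** t
        old, new = xor_delay_original(n), xor_delay_proposed(n)
        saving = 100.0 * (old - new) / old
        print(f"n=2^{t}: original={old} T_X, proposed={new} T_X, saving={saving:.1f}%")
```

The relative saving equals $(t - 1)/(3t - 1)$, which approaches one third as t grows.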
As shown in the table, the proposed method and the previous KOA have the same space complexities, but the XOR gate delay of the proposed method is lower than that of the previous KOA when $t > 1$. We list complexities of the TMVP in the table because both KOA and TMVP can be used to design $GF(2^n)$ subquadratic parallel multipliers, which is an important application field of these two algorithms. But we must emphasize that these two algorithms are distinct, and each of them has its own application fields [6]. Take the $GF(2^n)$ subquadratic parallel multiplier as an example. Since there is no known value of n for which an irreducible polynomial of weight $w < 6$ does not exist [11], we need only to select either an irreducible trinomial or an irreducible pentanomial of degree n to generate $GF(2^n)$. In order to adopt the TMVP approach in the design stage, the coordinate transformation technique must be used to obtain the desired Toeplitz matrix [7]. The corresponding transformation matrices for irreducible trinomials and a special type of irreducible pentanomials have been derived when the $GF(2^n)$ elements are represented in the shifted polynomial basis [7]. But no explicit transformation matrices are currently available for other bases, e.g., the polynomial basis. On the other hand, the KOA-based $GF(2^n)$ subquadratic parallel multiplier consists of two steps: (1) the KOA multiplication, and (2) a modulo reduction operation using an irreducible polynomial. The second step, which depends on the form of the field generating irreducible polynomials, has been studied by many authors. Therefore, a hardware engineer can use these theoretical results directly to design a $GF(2^n)$ subquadratic parallel multiplier. The interested reader is referred to a recent survey paper [12] for more details.
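The two-step structure of a polynomial basis $GF(2^n)$ multiplier can be illustrated with the following sketch. It is a generic software model, not one of the reduction architectures studied in the literature: step 1 uses a schoolbook product as a stand-in for any $GF(2)[x]$ multiplier (a KOA could be substituted), step 2 is a bit-serial reduction, and the small field polynomial $x^4 + x + 1$ is chosen here only for illustration. All names are hypothetical.

```python
def clmul(a, b):
    """Step 1 stand-in: product in GF(2)[x]; bit i is the coefficient of x^i."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod(c, f, n):
    """Step 2: reduce c modulo f, where f has degree n (bit n of f is set)."""
    for i in range(c.bit_length() - 1, n - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - n)  # cancel x^i using x^(i-n) * f
    return c


def field_mul(a, b, f, n):
    """Multiply two GF(2^n) elements given in the polynomial basis."""
    return reduce_mod(clmul(a, b), f, n)


F = 0b10011  # x^4 + x + 1, irreducible over GF(2); illustrative choice only

if __name__ == "__main__":
    # x^3 * x = x^4, which reduces to x + 1 modulo x^4 + x + 1
    print(bin(field_mul(0b1000, 0b0010, F, 4)))
```

In hardware the reduction step is specialized to the chosen trinomial or pentanomial, whereas this loop works for any polynomial of degree n.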
B. An Example
We now present an example to compare the proposed method with the original KOA.
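[Table II: body not recovered; one surviving delay entry reads $(3\log_3 n)T_X + T_A$.]
Let A and B be two binary polynomials of degree 3 in x, each split into a "most significant half" and a "least significant half" of degree 1 in x. Then the original KOA computes the product AB using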
There are three products of polynomials of degree 1 in (4), and they can be computed recursively using the KOA at a cost of $2T_X$. For example, one of these products can be computed using
To show the role of the overlap in (4), let us group the three products in (4) and write them as polynomials of degree 2 in x as follows:
Clearly, one XOR gate delay, $1T_X$, is required to compute the overlap summations, e.g., $(k_0 + d_2)$.
Since we need $2T_X$ to perform the XOR operations in the curly bracket of (4) and $2T_X$ to compute the three products of polynomials of degree 1, the original KOA requires a total of $5T_X$ to compute AB.
From (3), the proposed method computes AB using
Now define four polynomials of degree 2 in y as follows:
We need $1T_X$ to perform the "+" operations in the last two equations. Since the proposed method is identical to the original KOA when the two input polynomials are of degree 1, i.e., formula (5), we need $2T_X$ to compute the three products of polynomials of degree 1 in y in the above four equations. Thus, we need a total of $3T_X$ to obtain all $GF(2)$ elements $p_i$, $q_i$, $r_i$ and $s_i$ in the above four equations, where $i = 0, 1$ and $2$. Now, the product AB can be computed using
Clearly, one XOR gate delay, $1T_X$, is required to obtain the summations in the five square brackets.
Therefore, the total number of XOR gate delays required to compute AB is $3 + 1 = 4$, and $1T_X$ is saved compared to the original KOA.
The arithmetic circuit in Fig. 1 illustrates the two-level recursion formula (7). The circuits in the three rectangular dotted boxes are the same, and each of them implements the original KOA formula (5), which computes the product of two input polynomials of degree 1. Due to the parallelism, the six XOR operations "+" in the six dotted circles contribute no gate delay to the total XOR gate delay of formula (7). The interested reader may compare this circuit diagram to Figure 8.1 of [6, p. 222], which illustrates the original KOA two-level recursion formula (6).
III. CONCLUSIONS
We have proposed a new method to implement the polynomial KOA for VLSI multipliers. It eliminates overlaps in the previous designs. The XOR gate delay of the proposed $GF(2)[x]$ KOA is significantly better than that of the previous KOA. Besides the theoretical significance, the proposed method is also suitable for practical VLSI applications, e.g., designs of hybrid $GF(2^n)$ multipliers.

Fig. 1. An arithmetic circuit illustrating formula (7).
