Abstract-Some applications like cryptography involve a large number of multiplications of binary polynomials. In this paper, we consider two-, three-, and four-way methods for parallel implementation of binary polynomial multiplication. We propose optimized three- and four-way split formulas which reduce the space and time complexity of the best known methods. Moreover, we present a block recombination method which provides some further reduction in the space complexity of the considered two-, three-, and four-way split multipliers.
INTRODUCTION
Finite fields are part of important applications like coding theory and cryptography. For example, protocols over elliptic curves [9], [10] require several hundreds of multiplications and additions either in a prime or extended binary finite field of size . Efficient finite field arithmetic, specifically field multiplication and addition, is thus essential to obtain improved implementations of cryptographic protocols.
In this paper, we focus on bit parallel hardware implementation of multiplication over extended binary fields $\mathbb{F}_{2^n}$. Such a field can be defined as the set of univariate binary polynomials modulo an irreducible polynomial of degree $n$. An addition of two elements in $\mathbb{F}_{2^n}$ consists of pairwise modulo-two addition of the coefficients, and this can be easily implemented with parallel XOR gates. The multiplication in $\mathbb{F}_{2^n}$ consists of a regular binary univariate polynomial multiplication followed by a reduction modulo the irreducible polynomial. When this polynomial is sparse, i.e., a trinomial or a pentanomial, the reduction consists of a few shifts and additions and is thus quite simple to implement. The multiplication of polynomials is generally more costly. For the considered range of $n$, the most efficient methods to multiply two binary polynomials are the subquadratic methods based on Karatsuba [11] or three-way split formulas [12], which have a complexity of $O(n^{1+\varepsilon})$ bit operations with $0 < \varepsilon < 1$.
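The two field operations described above can be sketched in software as follows. This is only an illustration under stated assumptions: a binary polynomial is stored as a Python integer (bit i holds the coefficient of X^i), the helper names `clmul`, `poly_mod`, `field_add` and `field_mul` are hypothetical, and the toy modulus X^3 + X + 1 is chosen for readability rather than taken from the paper. The product used here is the plain quadratic method, not one of the subquadratic formulas studied later.

```python
def clmul(a: int, b: int) -> int:
    """Product in GF(2)[X]; bit i of an int is the coefficient of X^i."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # adding a shifted copy of a is an XOR
        a <<= 1
        b >>= 1
    return result


def poly_mod(c: int, modulus: int) -> int:
    """Remainder of c modulo `modulus` in GF(2)[X]."""
    deg_m = modulus.bit_length() - 1
    while c.bit_length() - 1 >= deg_m:
        # Cancel the leading term of c with a shifted copy of the modulus.
        c ^= modulus << (c.bit_length() - 1 - deg_m)
    return c


def field_add(a: int, b: int) -> int:
    """Addition in F_{2^n}: coefficient-wise addition modulo two."""
    return a ^ b


def field_mul(a: int, b: int, modulus: int) -> int:
    """Multiplication in F_{2^n}: polynomial product, then reduction."""
    return poly_mod(clmul(a, b), modulus)


# Toy field F_{2^3} defined by X^3 + X + 1 (an assumption for this example).
TOY_MODULUS = 0b1011
```

For instance, in this toy field `field_mul(0b010, 0b100, TOY_MODULUS)` computes X times X^2, which reduces to X + 1.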
During the past few years improvements have been made on the Karatsuba and three-way split formulas. Specifically, Bernstein [2] and Zhou-Michalik [14] have independently improved the space complexity of Karatsuba formula by optimizing the reconstruction part of the formula. Bernstein in [2] has also extended this approach to two recursions of the Karatsuba formula, leading to a four-way split formula with a better space requirement than the original Karatsuba formula. In [4] , Cenk et al. have applied the technique of Bernstein to improve the space complexity of the three-way split formula with six recursive multiplications. Finally, Fan et al. [6] have optimized the delay of two and three-way split formulas by using a different kind of splitting of the polynomials.
In this paper we propose some further optimizations of three- and four-way formulas. Concerning the three-way split, we remove some redundant computation in the formula of [4] to reduce the space complexity. We then combine this idea with the approach of [6] to obtain a three-way split formula which has the same delay as the one in [6] but with a smaller space complexity. For the four-way formula, we combine the four-way approach of [2] and the modified splitting of [6]; this yields a formula which has the same delay as [6] but with a lower space complexity.
We also investigate an extension of the block recombination approach proposed in [8], which reduces the space complexity of a subquadratic space complexity architecture for Toeplitz matrix-vector product. Since subquadratic approaches for Toeplitz matrix-vector product and subquadratic algorithms for polynomial multiplication are very similar, we can extend the approach of Toeplitz matrix-vector product to the considered two-, three- and four-way split multipliers. This extension requires reevaluating the cost of each block of the multiplier. We are then able to recombine a parallel architecture performing two polynomial multiplications in parallel followed by an addition, and to apply this recombination to different two-, three- and four-way split multipliers. This results in multipliers with smaller space complexities while having the same delay.
The remainder of this paper is organized as follows: in Section 2 we review the best known two-, three- and four-way split formulas for binary polynomial multiplication. In Section 3 we propose some optimizations for three-way and four-way split formulas. We then present in Section 4 a block recombination approach for polynomial multiplication. We finally compare the complexity of our proposed methods to the best known methods and give some concluding remarks in Section 5.
REVIEW OF SUBQUADRATIC METHODS FOR POLYNOMIAL MULTIPLICATIONS
Let us consider two binary univariate polynomials $A = \sum_{i=0}^{n-1} a_i X^i$ and $B = \sum_{i=0}^{n-1} b_i X^i$ of degree at most $n-1$. The product $C = A \times B$ can be computed by a direct expansion of their respective expressions:

$$C = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j X^{i+j}.$$

This method, known as the schoolbook method, requires $n^2$ bit multiplications and $(n-1)^2$ bit additions. When $n$ is large and the bit operations are performed in parallel fashion, this approach results in a quite large circuit. For this size of $n$, subquadratic methods are generally more suitable for practical applications. These methods split each polynomial in two or three and then express the product in terms of a number of products of smaller degrees. When this process is applied recursively, it results in $O(n^{1+\varepsilon})$, $0 < \varepsilon < 1$, bit operations for the whole multiplication. In this section we review the best known two- and three-way split approaches.
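To make the schoolbook gate counts concrete, here is a minimal sketch that multiplies two n-term binary polynomials given as coefficient lists and counts the AND and XOR operations it performs. The function name `schoolbook_with_counts` and the list representation are assumptions made for this illustration only.

```python
def schoolbook_with_counts(a_bits, b_bits):
    """Schoolbook product of two n-term binary polynomials.

    a_bits[i] and b_bits[i] are the coefficients of X^i.
    Returns (c_bits, and_count, xor_count).
    """
    n = len(a_bits)
    assert len(b_bits) == n
    # terms[k] collects the partial products a_i & b_j with i + j == k.
    terms = [[] for _ in range(2 * n - 1)]
    and_count = 0
    for i in range(n):
        for j in range(n):
            terms[i + j].append(a_bits[i] & b_bits[j])
            and_count += 1
    c_bits = []
    xor_count = 0
    for group in terms:
        acc = group[0]
        for t in group[1:]:
            acc ^= t          # one XOR gate per extra partial product
            xor_count += 1
        c_bits.append(acc)
    return c_bits, and_count, xor_count
```

For n = 4 this performs 16 AND and 9 XOR operations, in line with the quadratic growth that motivates the subquadratic methods.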
Remark 1. In the sequel, we will not discuss the reduction operation modulo the irreducible polynomial which defines the finite field $\mathbb{F}_{2^n}$. We just mention that when this polynomial has a sparse form, for example when it is a trinomial $X^n + X^k + 1$ with $k < n/2$, the reduction can be performed by only a few shifts and additions. Indeed, let $C = C_0 + C_1 X^n$, with $\deg C_0 < n$, be a product of two degree $n-1$ polynomials; then

$$C \equiv C_0 + C_1 (X^k + 1) \pmod{X^n + X^k + 1},$$

since $X^n \equiv X^k + 1$. This means that after applying this process twice we get a polynomial of degree less than $n$.
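The two-pass reduction of Remark 1 can be sketched as follows, assuming a trinomial of the form X^n + X^k + 1 with k at most n/2 and an input of degree at most 2n - 2. The function names and the integer encoding (bit i is the coefficient of X^i) are hypothetical; `poly_mod` is a generic, slower reference included only to cross-check the shift-and-XOR version.

```python
def reduce_trinomial(c: int, n: int, k: int) -> int:
    """Reduce c (degree <= 2n-2) modulo X^n + X^k + 1 using shifts and XORs.

    Assumes k <= n/2, so that two passes are enough.
    """
    mask = (1 << n) - 1
    for _ in range(2):
        high = c >> n                       # part of c multiplied by X^n
        # X^n is congruent to X^k + 1, so high * X^n becomes high * (X^k + 1).
        c = (c & mask) ^ high ^ (high << k)
    return c


def poly_mod(c: int, modulus: int) -> int:
    """Generic remainder in GF(2)[X], used here only as a reference."""
    deg_m = modulus.bit_length() - 1
    while c.bit_length() - 1 >= deg_m:
        c ^= modulus << (c.bit_length() - 1 - deg_m)
    return c
```

Each pass costs only two XOR layers and some wiring, which is why the reduction step is cheap compared with the polynomial product itself.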
Review of Two-Way Formulas
Here we consider the two-way approach, also known as the Karatsuba method, for binary polynomial multiplication. We first review the original Karatsuba formula and then present some recent optimizations on the space and the time complexities.
Karatsuba Formula
We consider two $n$-term polynomials $A$ and $B$ in $\mathbb{F}_2[X]$ where $n$ is supposed to be a power of 2. The original Karatsuba approach consists of splitting $A$ and $B$ in two parts, $A = A_0 + A_1 X^{n/2}$ and $B = B_0 + B_1 X^{n/2}$, and then computing $C = A \times B$ in two steps:

Recursive product. We perform three half size polynomial products

$$P_0 = A_0 B_0, \quad P_1 = (A_0 + A_1)(B_0 + B_1), \quad P_2 = A_1 B_1.$$

Reconstruction. We reconstruct $C$ as

$$C = P_0 + (P_0 + P_1 + P_2) X^{n/2} + P_2 X^n.$$

Following [2], [14] we perform this reconstruction in three steps:

$$R_0 = P_0 + X^{n/2} P_2, \quad R_1 = R_0 (1 + X^{n/2}), \quad C = R_1 + X^{n/2} P_1,$$

and these three steps require $5n/2 - 3$ bit additions. The three multiplications $P_0$, $P_1$ and $P_2$ are performed by applying the same method recursively. When each product $P_0$, $P_1$ and $P_2$ is performed in parallel and recursively, the resulting multiplier has the complexity given in (1), with a recursive form (on the left side) and a nonrecursive form (on the right side). Note that $S_\oplus$ represents the number of bit additions and $S_\otimes$ the number of bit multiplications, and $D_\oplus$ and $D_\otimes$ represent the delay of a bit level addition and multiplication, respectively.
Remark 2. The Karatsuba approach applies over more general fields or rings: in those more general cases the formula is the same as the one previously reported except for the reconstruction part, which involves two subtractions. Since in $\mathbb{F}_2$ we have $-1 = 1$, these minus signs disappear in the case of binary polynomials. This remark also applies to the other formulas reported in the remainder of this paper.
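A software sketch of the two-way Karatsuba recursion over binary polynomials may help fix ideas. It uses the factored reconstruction attributed above to [2], [14], written here as an assumption in the form (P0 + X^h P2)(1 + X^h) + X^h P1 with h = n/2; the function names, the integer encoding (bit i is the coefficient of X^i) and the `clmul` reference are all hypothetical and only serve the illustration.

```python
def clmul(a: int, b: int) -> int:
    """Quadratic reference product in GF(2)[X]."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba(a: int, b: int, n: int) -> int:
    """Product of two n-term binary polynomials, n a power of 2."""
    if n == 1:
        return a & b                    # a single bit multiplication (AND)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h           # A = A0 + A1 * X^h
    b0, b1 = b & mask, b >> h
    # Recursive product: three half-size multiplications.
    p0 = karatsuba(a0, b0, h)
    p1 = karatsuba(a0 ^ a1, b0 ^ b1, h)
    p2 = karatsuba(a1, b1, h)
    # Reconstruction in three steps.
    r0 = p0 ^ (p2 << h)                 # R0 = P0 + X^h * P2
    r1 = r0 ^ (r0 << h)                 # R1 = R0 * (1 + X^h)
    return r1 ^ (p1 << h)               # C  = R1 + X^h * P1
```

Expanding the three steps gives P0 + (P0 + P1 + P2) X^h + P2 X^(2h), which is the usual Karatsuba reconstruction with the subtractions replaced by XORs.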
Overlap Free Method for Two-Way Split Multiplication
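Fan et al. in [6] use a different way to split the polynomials $A$ and $B$ in the Karatsuba formula, leading to a reduction in the delay of the resulting parallel multiplier. Indeed, following [6], we separate the even and odd coefficients of $A$ and $B$ as follows

$$A = A_0(X^2) + X A_1(X^2), \quad B = B_0(X^2) + X B_1(X^2).$$

The terms $A_i$ and $B_i$, for $i = 0, 1$, are considered as degree $n/2 - 1$ polynomials in $Y$ evaluated at $Y = X^2$. The resulting overlap free two-way split formula is as follows:

Recursive product. We perform three half size polynomial products

$$P_0 = A_0 B_0, \quad P_1 = (A_0 + A_1)(B_0 + B_1), \quad P_2 = A_1 B_1.$$

Reconstruction. We then reconstruct $C$ as

$$C = \big(P_0 + Y P_2\big) + X \big(P_0 + P_1 + P_2\big), \quad \text{with } Y = X^2.$$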
The space requirement ($S_\oplus$ and $S_\otimes$), given in (4), is slightly larger than that of the former Karatsuba formula (1), but the delay is reduced. Indeed, the data flow of the formulas (2) and (3) is shown in Fig. 1. Since the blocks "Eval. at " do not involve any gates, the critical path delay of Fig. 1 is equal to $D(n) = 2D_\oplus + D(n/2)$. When applied recursively, this results in a delay of $(2\log_2 n) D_\oplus + D_\otimes$. This is about a 33% improvement in the delay complexity with respect to the original Karatsuba method.
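The even/odd splitting can be illustrated with the following sketch. It assumes the splitting A = Ae(X^2) + X Ao(X^2) and the reconstruction C = (P0 + Y P2) + X (P0 + P1 + P2) with Y = X^2; the names `deinterleave`, `spread` and `overlap_free` are hypothetical, and polynomials are again stored as integers with bit i holding the coefficient of X^i.

```python
def clmul(a: int, b: int) -> int:
    """Quadratic reference product in GF(2)[X]."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def deinterleave(a: int, n: int):
    """Split an n-term polynomial into its even and odd coefficients."""
    even = odd = 0
    for i in range(n // 2):
        even |= ((a >> (2 * i)) & 1) << i
        odd |= ((a >> (2 * i + 1)) & 1) << i
    return even, odd


def spread(p: int) -> int:
    """Evaluate p(Y) at Y = X^2: coefficient i moves to position 2i."""
    out, i = 0, 0
    while p:
        out |= (p & 1) << (2 * i)
        p >>= 1
        i += 1
    return out


def overlap_free(a: int, b: int, n: int) -> int:
    """Two-way overlap-free product of n-term polynomials, n a power of 2."""
    if n == 1:
        return a & b
    ae, ao = deinterleave(a, n)
    be, bo = deinterleave(b, n)
    h = n // 2
    p0 = overlap_free(ae, be, h)
    p1 = overlap_free(ae ^ ao, be ^ bo, h)
    p2 = overlap_free(ao, bo, h)
    even_part = p0 ^ (p2 << 1)          # P0 + Y * P2
    odd_part = p0 ^ p1 ^ p2             # coefficient of X
    # Even and odd positions are disjoint: this XOR is a pure interleaving.
    return spread(even_part) ^ (spread(odd_part) << 1)
```

In hardware the final interleaving costs no gate, so each recursion level adds only the operand addition and one reconstruction layer on the longest path, which is the source of the shorter delay.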
Review of Three-Way Multiplication
In this section, we review the best known three-way split formulas with six recursive multiplications. Specifically, we review the formula proposed in [4], which has the smallest space complexity, and the formula of Fan et al. [6], which has the smallest delay among all formulas. As before, $A$ and $B$ are two binary polynomials of degree $n-1$ in $\mathbb{F}_2[X]$ where $n$ is a power of 3.
The formula based on the regular splitting [4] decomposes the two polynomials $A$ and $B$ in three parts and replaces $X^{n/3}$ by a new indeterminate $Y$. This results in two polynomials $A$ and $B$ of degree two in $Y$. We can then use a formula for multiplication of degree two polynomials to derive the formula shown in Table 1a.

Fan et al. in [6] have proposed to modify the three-way split formula by using a different splitting. Specifically, the authors in [6] split the polynomials $A$ and $B$ in three parts as

$$A = A_0(X^3) + X A_1(X^3) + X^2 A_2(X^3), \quad B = B_0(X^3) + X B_1(X^3) + X^2 B_2(X^3).$$

Then, we can view $A_i$ and $B_i$ as degree $n/3 - 1$ polynomials in $Y$ evaluated at $Y = X^3$, and we consider $A$ and $B$ as polynomials of degree two in $X$. Fan et al. in [6] have then applied a formula for degree two polynomials and have derived the formula described in Table 1b.
We have also reported in Table 1 the resulting overall cost, in terms of bit additions and bit multiplications, of the two three-way split formulas, along with the delay of the multipliers. These complexity results show that in terms of space complexity the formula of [4] is the best choice. The formula of [6] has a better delay due to the interleaving form of the reconstruction part of the formulas.
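As an illustration of a three-way split with six recursive multiplications, the sketch below uses the regular splitting A = A0 + A1 Y + A2 Y^2 with Y = X^(n/3). The six products and the reconstruction are the standard characteristic-two formula; the exact ordering and grouping of operations in Table 1a may differ, and the optimized reconstruction schedules of [4] and [6] are not reproduced. All names are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Quadratic reference product in GF(2)[X]."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def three_way(a: int, b: int, n: int) -> int:
    """Product of two n-term binary polynomials, n a power of 3."""
    if n == 1:
        return a & b
    t = n // 3
    mask = (1 << t) - 1
    a0, a1, a2 = a & mask, (a >> t) & mask, a >> (2 * t)
    b0, b1, b2 = b & mask, (b >> t) & mask, b >> (2 * t)
    # Six recursive products of size n/3.
    p0 = three_way(a0, b0, t)
    p1 = three_way(a1, b1, t)
    p2 = three_way(a2, b2, t)
    p3 = three_way(a0 ^ a1, b0 ^ b1, t)
    p4 = three_way(a0 ^ a2, b0 ^ b2, t)
    p5 = three_way(a1 ^ a2, b1 ^ b2, t)
    # Reconstruction: coefficients of Y^0 .. Y^4 with Y = X^t.
    c1 = p3 ^ p0 ^ p1                   # A0*B1 + A1*B0
    c2 = p4 ^ p0 ^ p1 ^ p2              # A0*B2 + A1*B1 + A2*B0
    c3 = p5 ^ p1 ^ p2                   # A1*B2 + A2*B1
    return p0 ^ (c1 << t) ^ (c2 << 2 * t) ^ (c3 << 3 * t) ^ (p2 << 4 * t)
```

The sums P0 + P1 and P1 + P2 each appear in two of the coefficients above, which hints at the kind of repeated computation that a careful reconstruction can share.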
Review of Four-Way Split Formula With Nine Recursive Multiplications
In [2], Bernstein has shown that it is possible to obtain further improvement in polynomial multiplication by considering a four-way split formula. We review here this approach. Let $A$ and $B$ be two polynomials of degree $n-1$, which are split in four parts $A = A_0 + A_1 X^{n/4} + A_2 X^{n/2} + A_3 X^{3n/4}$ and $B = B_0 + B_1 X^{n/4} + B_2 X^{n/2} + B_3 X^{3n/4}$, where $A_i$ and $B_i$ have degree $n/4 - 1$ for $i = 0, \ldots, 3$. The idea of Bernstein is to apply two recursions of the two-way split formula and then optimize the reconstruction part of the resulting formula. This results in nine recursive products of polynomials of size $n/4$:
Following [2] , we apply two recursions of the Karatsuba reconstruction which results in the following expression of in terms of .
Bernstein has proposed to arrange the previous expression as follows
As noticed by Bernstein, the above formula requires fewer bit additions than two recursions of the Karatsuba formula (Section 2.1.1). The recursive expression of the delay of Bernstein's four-way multiplier is $D(n) = 5D_\oplus + D(n/4)$, and this is better by one $D_\oplus$ than the delay of two recursions of the Karatsuba formula. The resulting nonrecursive expression of the four-way split formula, when $n$ is an even power of 2, is as follows
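The origin of the nine products can be sketched by unrolling two levels of the two-way formula. The code below is an assumption-laden illustration: it forms the nine operand pairs explicitly and reconstructs with two nested Karatsuba reconstructions, so it does not reproduce Bernstein's rearranged reconstruction or its reduced addition count. All names are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Quadratic reference product in GF(2)[X]; used for the nine leaf products."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba_reconstruct(p0: int, p1: int, p2: int, h: int) -> int:
    """P0 + (P0 + P1 + P2) X^h + P2 X^(2h), in factored form."""
    r0 = p0 ^ (p2 << h)
    return r0 ^ (r0 << h) ^ (p1 << h)


def operand_groups(x):
    """Operands of the nine products for the four parts x[0..3]."""
    low = (x[0], x[0] ^ x[1], x[1])
    high = (x[2], x[2] ^ x[3], x[3])
    mid = (x[0] ^ x[2], x[0] ^ x[1] ^ x[2] ^ x[3], x[1] ^ x[3])
    return low, mid, high


def four_way(a: int, b: int, n: int):
    """Four-way product of n-term polynomials (n a multiple of 4).

    Returns (product, number_of_quarter_size_products).
    """
    q = n // 4
    mask = (1 << q) - 1
    parts_a = [(a >> (i * q)) & mask for i in range(4)]
    parts_b = [(b >> (i * q)) & mask for i in range(4)]
    groups = []
    for ga, gb in zip(operand_groups(parts_a), operand_groups(parts_b)):
        groups.append([clmul(x, y) for x, y in zip(ga, gb)])
    low, mid, high = (karatsuba_reconstruct(g[0], g[1], g[2], q) for g in groups)
    product = karatsuba_reconstruct(low, mid, high, 2 * q)
    return product, sum(len(g) for g in groups)
```

In a recursive multiplier the nine calls to `clmul` would themselves be four-way products of size n/4.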
PROPOSED OPTIMIZATION FOR THREE AND FOUR-WAY FORMULAS
In this section we propose some optimizations for three and four-way split formulas. The three-way split optimization removes some redundant computations performed in the formulas presented in Section 2.2 resulting in a better space complexity for each case. We then present a four-way split formula which combines Bernstein's four-way approach and the overlap free approach. This formula achieves the delay of the two-way formula of Fan et al. with a smaller space complexity.
Improvement on the Space Complexity of Three-Way Multiplication
In this section, we present an optimized version of the formula presented in Table 1a. We keep the six recursive products unchanged and optimize only the reconstruction part of the formula. We recall that the six products in Table 1a result in six terms of degree $2n/3 - 2$ and the product $C$ is reconstructed as follows
The computation in , which has a cost of bit additions as shown in Table 1a , can be slightly optimized. For this, after splitting we write and we note that, based on Fig. 2 , there are two sums of two terms which appear twice during the computation of
We can modify the reconstruction in order to save each of these re-occurring two term sums. For this, we split the three products and in two parts and where and are degree polynomials for Table 2 .
If we consider the overall optimized formula which comprises the recursive products of Table 1a plus the proposed  optimized reconstruction of Table 2 , we obtain the recursive and non-recursive forms of the complexities given in (6) . Note that the modification done on the formula does not affect the delay, so it is unchanged.
Improvement of the Overlap Free Three-Way Formula
In this section, we present an optimization of the overlap free three-way split formula (Table 1b). We do so by leaving the recursive multiplications of the formula unchanged. Let be the six products. We solely modify the sequence of operations in the reconstruction: we propose to perform the reconstruction of as follows
We then denote and . We note that the computation of consists of interleaving the coefficients of and . This does not require any logic gate. Then we write and we note that for the computation of we can save some bit operations. Indeed, it is shown in Fig. 3 that some bit additions appear twice in the computation of .
Based on this fact, in the computation of , we perform only one of these redundant bit additions. The resulting optimized bit operations for are given below:
The final operation is done in a straightforward way: the operation consists of interleaving the coefficients of and and then we just have to add the resulting polynomial to .
Space complexity. The recursive products have the same cost as given in Table 1b , i.e., S bit additions and S bit multiplications. The proposed reconstruction formula requires no bit operations for , but bit additions for the computation of and for the final addition in . Consequently, the proposed three-way split multiplier has the following space complexity We first remark that the delay to compute is equal to the delay of the computation of plus which results in D . The delay required to compute is equal to the delay to compute and this is equal to D . Eventually, the critical path delay for the computation of is equal to D D which can be rewritten in the following non-recursive form: D
Proposed Four-Way Split Formula
We now propose a four-way split formula which improves the delay of Bernstein's four-way formula, while having a slightly larger space complexity. The proposed formula is obtained by combining the four-way formula of Bernstein and the overlap free method of [6] . Indeed, we use the following special splitting for the two polynomials and of degree :
where and are each a polynomial of degree in . Then we can consider and as polynomials of degree 3 in . We can then apply the four-way split formula of Bernstein (Section 2.3) to compute the product. To obtain the expression of we need to replace with , this is done just before proceeding to the reconstruction:
We provide a detailed four-way split formula in Table 3.

Space complexity. The costs of the recursive products are unchanged compared to Section 2.3: only the costs of the steps of the reconstruction, which are different, are to be determined. In Table 3 we detail these costs and derive the recursive form of the overall complexity. The corresponding non-recursive form of the complexity is given below:
Time complexity. The data-flow of the reconstruction part of the proposed four-way formula is depicted in Fig. 4. In this data-flow graph the interleaving blocks and shifting blocks do not involve any logic gate. A close look at this data-flow graph shows that any path which starts from the input $A$ or $B$ of the multiplier, goes through any of the recursive products and then down to the output encounters at most four XOR gates. We deduce that the critical path delay is $D(n) = 4D_\oplus + D(n/4)$, which results in the non-recursive form $D(n) = (2\log_2 n) D_\oplus + D_\otimes$ for $n$ an even power of 2.
BLOCK RECOMBINATION FOR POLYNOMIAL MULTIPLICATION
In this section we present a block recombination approach which reduces the space complexity of the multipliers presented in Sections 2 and 3. This approach is an extension of the approach presented in [8] for Toeplitz matrix-vector multiplication (TMVP). Indeed, the subquadratic formula for TMVP and polynomial multiplication are very similar; they differ slightly since there are no computations corresponding to the computations on the Toeplitz matrix and the reconstruction computations are different. We thus follow the same process as in [8] :
1. We decompose the parallel multipliers of Sections 2 and 3 into a number of blocks.
2. We then adapt the optimization of [8] concerning the two-multiplications-and-add architecture to the case of polynomial multiplication.
3. We then apply this optimization to recombine the blocks of the considered multipliers. This recombination is slightly different from the case of TMVP since the blocks involved are different: there are fewer blocks for the component formation and more blocks in the reconstruction than for TMVP computation.
Decomposition of Polynomial Multiplication Algorithms
Any of the polynomial multiplication algorithms presented in Sections 2 and 3 can be divided into three separate computations: the component polynomial formation (CPF), the component multiplication (CM) and the reconstruction (R). Here, we explain these computations for the two- and three-way split formulas described in Sections 2 and 3. The four-way split approach can be seen as an optimized version of the two-way formula. Consequently, we will not differentiate two- and four-way split multipliers and, later in this paper, complexity results of these multipliers are presented in the same table.
Component polynomial formation (CPF). Let $A$ be a polynomial of size $n$, where $n$ is a power of 2 in the two-way case and a power of 3 in the three-way case. We recursively define the component polynomial formation as given in (9).
We can similarly define and functions for overlap free formulas: this requires us to replace the regular two-way and three-way splitting by the corresponding overlap-free splitting.
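In the sequel, we will often refer to the component representation of a degree $n-1$ polynomial as the bit array produced by the two-way (resp. three-way) component polynomial formation, which has a bit length of $n^{\log_2 3}$ (resp. $n^{\log_3 6}$).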
Component multiplication (CM). The component multiplication consists of the bit-wise multiplication of the component polynomial formations of two polynomials $A$ and $B$. This operation can be implemented with $n^{\log_2 3}$ parallel AND gates in the case of two-way split approaches and $n^{\log_3 6}$ parallel AND gates in the case of three-way split approaches.

Reconstruction (R). The reconstruction of a vector of length $n^{\log_2 3}$ (resp. $n^{\log_3 6}$) consists of recursively applying the reconstruction part of the considered two- or four-way (resp. three-way) split formula of Sections 2.1 and 2.3 (resp. Sections 2.2 or 3.1) to obtain a polynomial of degree $2n-2$. For example, the reconstruction functions of the Karatsuba formula of Section 2.1.1 and of the three-way split formula of Table 1a are given in (10) and (11). Similar functions can be defined for each formula given in Sections 2 and 3. In the sequel, we will recombine the blocks of the multiplier in order to reduce the space complexity. But first, we establish the complexity of each block of the multiplier for various formulas presented in Sections 2 and 3.
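The three-block decomposition can be sketched for the regular two-way (Karatsuba) splitting as follows. The function names `cpf`, `cm` and `recon`, the coefficient-list representation and the ordering of the components (low part, sum, high part) are assumptions made for this illustration; the ordering used in (9) and (10) may differ.

```python
def cpf(a):
    """Component polynomial formation of an n-term polynomial (n a power of 2).

    a is a list of bits, a[i] being the coefficient of X^i.
    Returns a list of n^(log2 3) bits.
    """
    if len(a) == 1:
        return a[:]
    h = len(a) // 2
    a0, a1 = a[:h], a[h:]
    s = [x ^ y for x, y in zip(a0, a1)]
    return cpf(a0) + cpf(s) + cpf(a1)


def cm(ca, cb):
    """Component multiplication: one AND gate per component bit."""
    return [x & y for x, y in zip(ca, cb)]


def recon(c):
    """Reconstruction: component array of 3^k bits -> 2n-1 coefficients."""
    if len(c) == 1:
        return c[:]
    t = len(c) // 3
    p0, p1, p2 = recon(c[:t]), recon(c[t:2 * t]), recon(c[2 * t:])
    h = (len(p0) + 1) // 2              # each Pi has 2h-1 coefficients
    out = [0] * (4 * h - 1)
    for i in range(2 * h - 1):
        out[i] ^= p0[i]
        out[i + h] ^= p0[i] ^ p1[i] ^ p2[i]
        out[i + 2 * h] ^= p2[i]
    return out


def multiply(a, b):
    """Full multiplier assembled from the three blocks."""
    return recon(cm(cpf(a), cpf(b)))
```

Written this way, all XOR gates sit in `cpf` and `recon` and all AND gates in `cm`, which is the separation the block recombination relies on.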
Lemma 1 (Block complexities). We consider a multiplier based on one of the formulas of Sections 2 or 3. Let S be the number of bit additions involved in the multiplication, then the block complexities are as follows:
Two and four-way formulas Three-way formulas
S S S S S S S S
Proof. For the complexity of the function, from (9), we have
S S S S S S
as stated in Lemma 1. For the CM function, it is a direct consequence of the fact that a component element has a bit length of $n^{\log_2 3}$ (resp. $n^{\log_3 6}$). For the complexity of the R function: it is equal to the complexity of the multiplier minus the complexity of two CPF blocks and one CM block. Explicit complexities based on the different formulas stated in Sections 2 and 3 are reported in Table 4. We remark that the distribution of the gate counts among the blocks of the polynomial multiplier is slightly different from the distribution among the blocks of the TMVP multiplier [8]. Specifically, here, there is one big block while the three others are smaller, but in the case of TMVP there were two big blocks and two small blocks. ◽
Block Recombination of Two-Multiplications-and-Add
In this section, we state the basic block recombination operation, which we will use later to recombine the two and threeway multipliers. Specifically, we consider the problem of performing two instances of polynomial multiplications in parallel followed by an addition. In other words, if and are polynomials of degree each, we want to compute A straightforward approach to design a parallel architecture which computes is shown in Fig. 5 .
In order to reduce the space complexity of this architecture, we recombine the blocks of Fig. 5 based on the following lemma.
Lemma 2. Let $U$ and $V$ be two arrays of $n^{\log_2 3}$ bits (resp. $n^{\log_3 6}$ bits). Let $R$ be the two-way (resp. three-way) reconstruction function; then we have

$$R(U) + R(V) = R(U + V).$$
The proof is very similar to the proof of Property 1 in [8] in the case of TMVP: the only difference is that we have a reconstruction formula which is slightly different. We thus just skip this proof here.
The previous lemma shows that the sequence of operations, two-reconstructions-then-add, can be reversed, i.e., we can first perform an addition of the original input components and then perform the reconstruction on the sum. In Fig. 6 , we apply this property to recombine the lower two layers of blocks of the two-multiplication-and-an-add architecture (Fig. 5) . We call an addition of two components and a component addition (CA). This CA can be computed with parallel XOR gates. We thus have placed a block of this kind in the right side of Fig. 6 .
In Fig. 6 we also provide the sizes of different blocks: an addition of input components (CA) requires XOR gates, where
. Similarly, we note that the reconstruction block has a space complexity in the interval (see Table 4 ). The recombination replaces a block consisting of at least XOR gates by a block that has only XOR gates. This means that we save at least XOR gates.
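The saving described in this section can be checked on a small software model. The sketch below, for the regular two-way splitting, computes a sum of two products in two ways: with two reconstructions followed by an addition, and with one component addition (CA) followed by a single reconstruction. The names `cpf`, `cm`, `ca`, `recon`, the input names `a, b, c, d` and the component ordering are assumptions made for the illustration.

```python
def cpf(a):
    """Two-way component polynomial formation (a: list of n bits)."""
    if len(a) == 1:
        return a[:]
    h = len(a) // 2
    s = [x ^ y for x, y in zip(a[:h], a[h:])]
    return cpf(a[:h]) + cpf(s) + cpf(a[h:])


def cm(u, v):
    """Component multiplication (parallel AND gates)."""
    return [x & y for x, y in zip(u, v)]


def ca(u, v):
    """Component addition (parallel XOR gates)."""
    return [x ^ y for x, y in zip(u, v)]


def recon(c):
    """Two-way reconstruction of a component array."""
    if len(c) == 1:
        return c[:]
    t = len(c) // 3
    p0, p1, p2 = recon(c[:t]), recon(c[t:2 * t]), recon(c[2 * t:])
    h = (len(p0) + 1) // 2
    out = [0] * (4 * h - 1)
    for i in range(2 * h - 1):
        out[i] ^= p0[i]
        out[i + h] ^= p0[i] ^ p1[i] ^ p2[i]
        out[i + 2 * h] ^= p2[i]
    return out


def mul_add_direct(a, b, c, d):
    """a*b + c*d with two reconstruction blocks and a final addition."""
    left = recon(cm(cpf(a), cpf(b)))
    right = recon(cm(cpf(c), cpf(d)))
    return [x ^ y for x, y in zip(left, right)]


def mul_add_recombined(a, b, c, d):
    """a*b + c*d with one component addition and one reconstruction block."""
    return recon(ca(cm(cpf(a), cpf(b)), cm(cpf(c), cpf(d))))
```

Both functions return the same coefficients because the reconstruction only uses XORs and fixed shifts, so it is linear over the binary field; the recombined version trades one reconstruction block for one block of parallel XOR gates.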
Two-Way Split Block Recombination
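In this section we present a block recombination approach for the two-way split multiplier in order to reduce the space complexity. Let $A$ and $B$ be two degree $n-1$ polynomials where $n$ is a power of 2. We split $A$ and $B$ in two parts, $A = A_0 + A_1 X^{n/2}$ and $B = B_0 + B_1 X^{n/2}$, where $A_i$ and $B_i$ have degree $n/2 - 1$ for $i = 0, 1$. We then expand the product $C = A \times B$ as follows

$$C = A_0 B_0 + (A_0 B_1 + A_1 B_0) X^{n/2} + A_1 B_1 X^n.$$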
We can design an architecture based on the previous expression of which independently performs each product through a two-way split multiplier. After that we reconstruct by performing the additions and shifts corresponding to the expression . The resulting architecture is depicted in Fig. 7 .
We then note that the computation of in Fig. 7 consists of two parallel multiplications followed by an addition. We can apply the block recombination presented in Section 4.2 which reverses the order of the reconstruction and the addition. We also notice that we can remove some redundant blocks and by keeping only one block for each and for and . This results in the architecture shown in Fig. 8.

Complexity of the recombined architecture (Fig. 8). We first evaluate the space complexity of the recombined architecture. We note that the final addition block requires XOR gates since the degree of the polynomials and is at most . We then express the space complexity in terms of the complexity of each block:
S S S S S S
In (14), we can replace the terms S S S and S by their explicit expressions in terms of given in Lemma 1 and Table 4 . This leads to the complexity results reported in the middle two columns of Table 5 .
We now evaluate the delay of the two-way split multiplier obtained through block recombination. The critical path is . This results in the delays reported in the right most column of Table 5 corresponding to the formulas presented in Sections 2 and 3.
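A software model of a recombined two-way multiplier, in the spirit of the architecture described above, might look like the following. It assumes the expansion A0 B0 + (A0 B1 + A1 B0) X^(n/2) + A1 B1 X^n, forms the component representation of each half only once, and applies a single reconstruction to the middle term after a component addition. The block functions and their names are hypothetical, and the sketch does not claim to match Fig. 8 gate for gate.

```python
def cpf(a):
    """Two-way component polynomial formation (a: list of bits)."""
    if len(a) == 1:
        return a[:]
    h = len(a) // 2
    s = [x ^ y for x, y in zip(a[:h], a[h:])]
    return cpf(a[:h]) + cpf(s) + cpf(a[h:])


def cm(u, v):
    """Component multiplication (parallel AND gates)."""
    return [x & y for x, y in zip(u, v)]


def recon(c):
    """Two-way reconstruction of a component array."""
    if len(c) == 1:
        return c[:]
    t = len(c) // 3
    p0, p1, p2 = recon(c[:t]), recon(c[t:2 * t]), recon(c[2 * t:])
    h = (len(p0) + 1) // 2
    out = [0] * (4 * h - 1)
    for i in range(2 * h - 1):
        out[i] ^= p0[i]
        out[i + h] ^= p0[i] ^ p1[i] ^ p2[i]
        out[i + 2 * h] ^= p2[i]
    return out


def recombined_two_way(a, b):
    """Product of two n-term polynomials (lists of bits, n a power of 2, n >= 2)."""
    n = len(a)
    h = n // 2
    # Four CPF blocks, each shared by two component multiplications.
    ca0, ca1, cb0, cb1 = cpf(a[:h]), cpf(a[h:]), cpf(b[:h]), cpf(b[h:])
    low = recon(cm(ca0, cb0))                       # A0 * B0
    high = recon(cm(ca1, cb1))                      # A1 * B1
    # Component addition, then a single reconstruction for A0*B1 + A1*B0.
    mid_components = [x ^ y for x, y in zip(cm(ca0, cb1), cm(ca1, cb0))]
    mid = recon(mid_components)
    out = [0] * (2 * n - 1)
    for i in range(n - 1):                          # final overlapping additions
        out[i] ^= low[i]
        out[i + h] ^= mid[i]
        out[i + n] ^= high[i]
    return out
```

The model uses four component multiplications instead of three at the top level, but only three reconstruction blocks instead of four, and reconstruction is where most XOR gates are located.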
Four-Way and Three-Way Recombination
In Section 4.3, we have presented a two-way block recombination. We sketch here two extensions of this approach:
Three-way recombination. Let and be two polynomials of degree at most where is a power of 3. We split and in three parts and where and have degree for . After expanding the product , we obtain:
Four-way recombination. Let and be two polynomials of degree , where for . We split and in four and , where and have degree for . We expand the product as follows:
Based on the above three-way (resp. four-way) expressions of , we can design a three-way (resp. four-way) block recombined multiplier as follows:
1. We apply the CPF function to each of the parts of the two input polynomials.
2. We apply the CM function to all the required pairs of component representations in parallel.
3. The additions corresponding to each term of the expanded product are performed in component representation.
4. Each such term is then reconstructed through an R block.
5. The final result is obtained after performing the overlapping additions of (15) (resp. (16)).
We express the space complexity in terms of the complexities of the different blocks and plus the cost of the additions in the final step. We also derive the delay of the multiplier by finding the corresponding critical path of the considered multiplier. These complexities are given in Table 6 .
In the above table is equal to the delay of a three-way (resp. four-way) multiplier with input size (resp. ) bits. If we now replace the block complexities given in Lemma 1 and Table 4 in the above expressions, we obtain the complexities summarized in Table 7.
Remark 3. The four-way block recombination is a particular case of $k$-way block recombination. We have evaluated the resulting complexities for various values of $k$, and have noticed that the least total gate count (AND and XOR combined) is obtained when $k = 4$. So the above four-way approach is the best approach when considering space complexity.
COMPLEXITY COMPARISON AND CONCLUSION
In this section, we provide comparisons of the proposed formulas with the best known approaches for binary field multiplication: the formulas for polynomial multiplication reviewed in Section 2, along with binary field multiplication methods based on Toeplitz matrix-vector product (cf. [5], [8]). In Section 5.1, we compare the two-way split approaches; this includes results for the four-way split approach, as the latter is an extension of the two-way approach. The three-way split approaches are compared in Section 5.2, followed by some concluding remarks. Table 8 summarizes the complexity of the two-way split approaches, i.e., non-recombined and recombined formulas, for the multiplication of polynomials.
Comparison of Two-Way Split Approaches
If we consider non-recombined polynomial multipliers in Table 8, we see that the four-way approach of [2] has better space and time complexities compared to the two-way approach of [2], [14]. To the best of our knowledge this four-way approach has the smallest overall space complexity (AND and XOR gates combined) among all two- or four-way formulas. In terms of the time complexity, the two-way approach of [6] offers the best result among the existing non-recombined multipliers. Our proposed four-way formula achieves the same time complexity as [6] but with fewer gates. If we consider the space-time product complexity, the proposed four-way formula has the best result among all non-recombined two-way split formulas.
If we now compare all recombined and non-recombined polynomial approaches listed in Table 8 , we see that, when we only consider the overall space complexity, the best approach is the four-way recombination of the four-way formula of [2] , which requires gates operations. We also notice that the four-way recombined four-way split formula of Section 3.3 matches the best timing result of [6] while having a smaller space complexity.
If we now compare complexities of multipliers based on TMVP with multipliers based on polynomial formulas, we notice that the four-way recombination of the four-way formula of [2] achieves the best results in terms of total number of gates. Specifically, the four-way recombination of the four-way formula of [2] requires gates while the recombined TMVP formula necessitates . If we now compare the most efficient approach in terms of delay, we notice that the recombined TMVP formula remains the best approach: it requires gates and has a delay of while the four-way recombination of the proposed four-way formula in Section 3.3 necessitates gates with a delay of . But, in design environments like ASIC, the space of an XOR gate may be larger than that of an AND gate [13]: in ASIC, an XOR gate is generally twice the size of an AND gate. In this case, we obtain the following complexities:
The recombination of the TMVP formula of [8] has a space complexity of AND equivalents.
The four-way recombination of the formula of Section 3.3 requires AND equivalents, and this makes the proposed four-way recombination of the formula of Section 3.3 slightly better.

In Table 9, we have reported explicit complexities for three values of $n$ of cryptographic interest. We then notice that:
The total gate complexity of the proposed four-way formula in Section 3.3 improves the Fan et al. approach of [6] by . The four-way recombination of the formula of Section 3.3 improves the total gate counts of Fan et al. approach [6] by . And when we consider the ASIC environment where an XOR gate is twice the space requirement of an AND gate this improvement ratio goes up to .
The four-way recombination of four-way formula of [2] improves its non-recombined form by in terms of gate counts, and by in ASIC environment. The four-way recombination of the proposed four-way formula improves the gate count of the recombined TMVP multiplier by , and by in ASIC environment.
These comparisons show that the recombination provides some significant gain in terms of space complexity, particularly in an ASIC environment.
Comparison of Three-Way Split Approaches
In Table 10 we give the complexity of polynomial multiplication and TMVP based on three-way split methods.
Let us first compare the non-recombined polynomial formulas. Based on the results in Table 10, we remark that the proposed method in Section 3.1 has the smallest space requirement. We also notice that the proposed delay efficient method in Section 3.2 matches the time complexity result of the three-way split formula of [6] and has a smaller number of gates.
If we now consider the recombined polynomial formulas, we see that the recombination of the proposed two methods (Sections 3.1 and 3.2) reduces their total gate counts without increasing their delays.
If we compare the polynomial formulas with the recombined TMVP three-way formula we notice that the latter compares favorably.
Indeed, the proposed block recombination of the space optimized three-way formula has, roughly, the same space complexity as the block recombined TMVP formula while having a larger delay. Similarly, the recombination of the delay efficient three-way formula reaches the same delay as the recombined TMVP approach, but with a larger space complexity.
These remarks are confirmed by the explicit complexities shown in Table 11 for the two cases and . Indeed, the proposed space optimized three-way split formula improves the space requirement of [4] by while the proposed delay efficient formula improves the space requirement of the three-way multiplier of Fan et al. [6] by . The block recombination enhances these optimizations since we obtain a reduction of the area requirement by 10% for the space efficient formula and by 8% for the delay efficient formula.
Remark 4. We have compared both polynomial and TMVP approaches: indeed, both of them can be used to implement multipliers in a finite field defined by an irreducible trinomial or pentanomial. But we note that, compared to Toeplitz matrix-vector product, polynomial multiplication is found in more applications. For example, the subquadratic optimal normal basis multiplier presented in [3] requires polynomial multiplication only. Another example is the set of countermeasures preventing active and passive side channel analysis on ECC proposed by Baek and Vasyltsov in [1]: this approach is built upon binary polynomial arithmetic.
Concluding Remarks
Multiplication of binary polynomials with sub-quadratic arithmetic complexity is often used in today's cryptographic systems, such as those based on elliptic curves and pairings [10], [7]. To this end, in this paper we have considered efficient bit parallel designs of sub-quadratic space complexity polynomial multipliers. Specifically, we have proposed improvements in terms of gate counts and delay to the existing best two- and three-way multipliers. The improvements have been achieved by identifying and removing some re-occurring computations. We have also proposed improvements to Bernstein's formula for the four-way split multiplication. More importantly, for the first time we have applied the block recombination approach to polynomial multiplication. This has provided new area-time trade-offs and the least area-time product complexity among the various multiplication schemes considered in the paper.
ACKNOWLEDGMENTS
This work was supported by PAVOIS ANR 12 BS02 002 02. 
