Abstract: In this paper, we present a new method for parallel binary finite field multiplication which results in subquadratic space complexity. The method is based on decomposing the building blocks of the Fan-Hasan subquadratic Toeplitz matrix-vector multiplier. We reduce the space complexity of their architecture by recombining the building blocks. In comparison to other similar schemes available in the literature, our proposal presents a better space complexity while having the same time complexity. We also show that block recombination can be used for efficient implementation of the GHASH function of Galois Counter Mode (GCM).
INTRODUCTION
Finite fields have a wide range of applications in number theory, coding theory, and cryptography. Binary fields are especially attractive for high-speed cryptographic applications since they are inherently suitable for hardware implementations. A binary field $\mathbb{F}_{2^n}$ is generally constructed as the set of binary polynomials modulo an irreducible polynomial $P$ of degree $n$. An element of $\mathbb{F}_{2^n}$ is then a binary polynomial of degree smaller than $n$. Addition and multiplication in a binary field consist of polynomial addition and multiplication modulo $P$.
In general, field multipliers fall into three categories: bit-level, digit-level, and bit-parallel. Parallel multipliers are the fastest category of designs. Two main types of parallel multipliers exist in the literature. The first type comprises the quadratic complexity architectures, which require $O(n^2)$ gates for their implementation, cf. [10], [14]. The second type comprises the subquadratic space complexity designs, which require a smaller number of gates ($O(n^{\varepsilon})$ with $1 < \varepsilon < 2$), cf. [3], [13], [17], [1]. The latter provide practical architectures for hardware implementation of large field sizes, especially those used in elliptic curve cryptographic applications [9], [12] ($163 < n < 571$).
There exist a limited number of algorithms to design subquadratic space complexity multipliers. The most well-known algorithm is based on the integer Karatsuba multiplication scheme, which has been widely applied to develop subquadratic space complexity multipliers [8]. The architectures proposed in [3], [13] are based on this method. Another well-known algorithm is the Winograd short convolution algorithm [19], used for the same purpose in [17]. The Chinese Remainder Theorem (CRT) is another example of such algorithms that results in subquadratic complexity multipliers [1].
Recently, Fan and Hasan proposed a new scheme to design subquadratic space complexity multipliers using the Toeplitz matrix-vector product [4]. In their scheme, they use a shifted polynomial basis to represent field elements, which models the multiplication as a Toeplitz matrix-vector product. Their method can also be used to design subquadratic space complexity polynomial, dual, weakly-dual, and triangular basis multipliers. They then proposed two-way split and three-way split approaches to break down each matrix-vector product into a number of smaller matrix-vector products. In [4], recursive use of this approach results in subquadratic space complexity multipliers which outperform the ones using Karatsuba, Winograd, and CRT.
In this paper, we propose an extension of the work of Fan and Hasan. We first decompose the Fan-Hasan multiplier based on the Toeplitz matrix-vector product (TMVP) into a number of different blocks. Each block performs an independent computation (component matrix and vector formation, component multiplication, and reconstruction). We then propose to recombine these blocks in order to reduce the space complexity. We first apply this recombination to a two-TMVPs-and-add architecture. We reorder the blocks and replace one reconstruction block by a smaller bitwise addition block, which reduces the space complexity. After that, we use this recombination in a single TMVP multiplier by expressing this product in terms of two-TMVPs-and-add. We obtain a multiplier which has lower space and time complexities.
The rest of this work is organized as follows. In Section 2, we briefly review binary field multiplication, recall the multiplier of Fan and Hasan, and decompose it into four different blocks. In Section 3, we present our block recombination method in the special case of the two-TMVPs-and-add architecture. We then apply this method to reduce the space complexity of a single TMVP architecture and compare the results with previous similar approaches. We continue, in Section 4, by presenting an application of block recombination to the design of a space-efficient architecture for GHASH computation. Finally, Section 5 presents some concluding remarks.
BRIEF REVIEW OF PARALLEL BINARY FIELD MULTIPLICATION
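A binary field $\mathbb{F}_{2^n}$ is generally defined as the set of binary polynomials modulo an irreducible polynomial $P$ of degree $n$. An element in $\mathbb{F}_{2^n}$ is then a binary polynomial of degree less than $n$. If $A = \sum_{i=0}^{n-1} a_i X^i$ and $B = \sum_{i=0}^{n-1} b_i X^i$ are two elements of $\mathbb{F}_{2^n}$, we can compute the product $C = A \times B \bmod P$ as
$$C = \sum_{i=0}^{n-1} b_i A^{(i)},$$
by expanding the expression of $B$ and considering that $A^{(i)} = A X^i \bmod P$. This can be written through a matrix-vector product
$$C = [A^{(0)}, A^{(1)}, \ldots, A^{(n-1)}] \cdot B. \qquad (1)$$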
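The column-by-column formulation above can be sketched in a few lines of Python. This is only an illustrative software model, not a hardware description: polynomials are stored as integers (bit i is the coefficient of $X^i$), and the function names and the sample field $\mathbb{F}_{2^4}$ with $P = X^4 + X + 1$ are assumptions chosen for the example.

```python
def mulx_mod(a, p, n):
    """Return a * X mod P. Polynomials are ints; p includes the X^n term."""
    a <<= 1
    if (a >> n) & 1:      # degree reached n: subtract (XOR) P once
        a ^= p
    return a


def field_mul_columns(a, b, p, n):
    """C = sum_i b_i * A^(i), where A^(i) = A * X^i mod P is column i."""
    c = 0
    col = a                          # A^(0) = A
    for i in range(n):
        if (b >> i) & 1:
            c ^= col                 # add column A^(i) when b_i = 1
        col = mulx_mod(col, p, n)    # A^(i+1) = A^(i) * X mod P
    return c


def field_mul_reference(a, b, p, n):
    """Schoolbook polynomial product followed by reduction modulo P."""
    prod = 0
    for i in range(n):
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * n - 2, n - 1, -1):
        if (prod >> d) & 1:
            prod ^= p << (d - n)     # cancel the term of degree d
    return prod
```

For instance, with `p = 0b10011` and `n = 4`, `field_mul_columns(0b0010, 0b1000, p, n)` evaluates $X \cdot X^3 = X^4 = X + 1$ and returns `0b0011`. The double loop hidden in this model (n columns of n bits) is the quadratic cost that the subquadratic designs discussed next avoid.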
Direct hardware implementation of this matrix-vector product results in a quadratic area complexity circuit (i.e., it requires $O(n^2)$ gates). The two commonly used strategies to design an efficient hardware multiplier via the above matrix-vector product are the following:
1. The choice of the polynomial $P$ must provide an efficient computation of the columns $A^{(i)}$. Until now, irreducible all-one polynomials (AOP) and trinomials seem to be the best possible choices [18], [7]. However, neither of them exists for every degree $n$. Consequently, other types of irreducible polynomials have been considered, specifically pentanomials of a particular form [15].
2. The second strategy consists of modifying the matrix in (1) in order to obtain a Toeplitz matrix. Recall that an $n \times n$ Toeplitz matrix is a matrix $[t_{i,j}]_{0 \le i,j \le n-1}$ such that $t_{i,j} = t_{i-1,j-1}$ for $1 \le i, j \le n-1$. We will see in Section 2.1 that a Toeplitz matrix-vector product can be computed efficiently through a subquadratic complexity circuit (i.e., it requires $O(n^{\varepsilon})$ gates where $1 < \varepsilon < 2$). Generally, we can obtain the Toeplitz form of the matrix in (1) by performing some row operations or column operations. In other words, we get this Toeplitz structure by using different bases of representation, as was shown by Hasan and Bhargava in [6]. This can be performed efficiently on fields defined by trinomials or specific pentanomials [4]. For the remainder of this paper, we assume that the binary field multiplication $C = A \times B$ has already been expressed as a Toeplitz matrix-vector product.
Fan-Hasan Subquadratic Toeplitz Matrix-Vector Multiplier
In this section, we recall the method used to build a subquadratic circuit which computes a Toeplitz matrix-vector product [4]. If $2 \mid n$, Fan and Hasan proposed to use the two-way split approach shown in Table 1 to compute a matrix-vector product $T \cdot V$, where $T$ is an $n \times n$ Toeplitz matrix and $V$ is a vector of size $n$. The two-way split approach breaks down a Toeplitz matrix-vector product of size $n$ into three Toeplitz matrix-vector products of size $n/2$. If $3 \mid n$, they proposed to use the three-way split approach, which is also shown in Table 1. The three-way split approach results in six Toeplitz matrix-vector products of size $n/3$. If $n$ is a power of 2 or a power of 3, Fan and Hasan also proposed to use the formulas given in Table 1 recursively to perform $T \cdot V$. Using this recursive process through parallel computation, the resulting multiplier has the complexities given in Table 2 (cf. [4]). In this table, $D_A$ represents the delay of a two-input AND gate and $D_X$ the delay of a two-input XOR gate. It is also possible to design subquadratic TMVP multipliers for size $n = 3^i 2^j$ by combining the two-way and three-way split approaches in the recursive computations.
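A minimal software sketch of the two-way split recursion is given below. It is not the parallel circuit of [4]; it only illustrates how one product of size $n$ becomes three products of size $n/2$. Several points are assumptions made for the example: the Toeplitz matrix is stored as a list `t` of $2n - 1$ bits with $T_{i,j}$ = `t[n - 1 + i - j]`, so that $T$ has the block form $[[T_1, T_0], [T_2, T_1]]$, and the three sub-products are the usual two-way identities $P_0 = (T_0 + T_1)V_1$, $P_1 = T_1(V_0 + V_1)$, $P_2 = (T_1 + T_2)V_0$ with $W = (P_0 + P_1, P_1 + P_2)$. Function names are hypothetical.

```python
def xor_vec(a, b):
    """Bitwise addition over F_2 of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp_dense(t, v):
    """Quadratic reference: W_i = sum_j T[i][j] * v[j], T[i][j] = t[n-1+i-j]."""
    n = len(v)
    out = []
    for i in range(n):
        acc = 0
        for j in range(n):
            acc ^= t[n - 1 + i - j] & v[j]
        out.append(acc)
    return out


def tmvp_two_way(t, v):
    """Two-way split TMVP for n a power of 2 (three half-size products)."""
    n = len(v)
    if n == 1:
        return [t[0] & v[0]]
    h = n // 2
    # Diagonal lists of the blocks: T0 (top right), T1 (diagonal), T2 (bottom left).
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    v0, v1 = v[:h], v[h:]
    p0 = tmvp_two_way(xor_vec(t0, t1), v1)
    p1 = tmvp_two_way(t1, xor_vec(v0, v1))
    p2 = tmvp_two_way(xor_vec(t1, t2), v0)
    return xor_vec(p0, p1) + xor_vec(p1, p2)
```

Each recursion level triples the number of sub-products while halving their size, which is where the $n^{\log_2(3)}$ AND-gate count of the two-way split multiplier comes from.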
Block Decomposition of Fan-Hasan Multiplier
In this section, we decompose the Fan-Hasan multiplier into a number of independent blocks. We will then evaluate the complexity of each block. We decompose the recursive formulas of Table 1 into four independent computations:
- Component matrix formation (CMF). We call component matrix formation the recursive matrix computation of Table 1. For example, for the two-way split approach, the first recursion on $T$ computes $T_0 + T_1$, $T_1$, and $T_1 + T_2$, and the recursion is then applied to each of these three matrices. This means that the component matrix formation can be expressed as
$$CMF(T) = [CMF(T_0 + T_1), CMF(T_1), CMF(T_1 + T_2)].$$
In the sequel, we will often refer to $CMF(T)$ as the component representation of the Toeplitz matrix $T$.
- Component vector formation (CVF). This corresponds to the recursive computation on the vector $V$ in Table 1. For example, for the two-way split approach, we see that recursion is applied to $V_1$, $V_0 + V_1$, and $V_0$.
- Component multiplication (CM). In Table 1, we see that $CMF(T)$ and $CVF(V)$ are multiplied component by component at the end of the recursion (this corresponds to the recursive multiplication yielding $P_0$, $P_1$, and $P_2$).
- Reconstruction (R). The last operation is the reconstruction of the product $W = T \cdot V$ from the component multiplication of $CMF(T)$ and $CVF(V)$. For example, for the two-way split case, let $\hat{W}$ be equal to the component multiplication of $V$ and $T$. If we split $\hat{W} = [\hat{W}_0, \hat{W}_1, \hat{W}_2]$, then the formula in Table 1 states that
$$W = R(\hat{W}) = (R(\hat{W}_0) + R(\hat{W}_1), R(\hat{W}_1) + R(\hat{W}_2)).$$

We can then split the Fan-Hasan multiplier into four distinct blocks (CMF, CVF, CM, and R) as shown in Fig. 1.

Example 1. We consider the case $n = 4$, where $T$ is a $4 \times 4$ Toeplitz matrix with entries $t_0, \ldots, t_6$ and $V$ is a vector of size 4. For the CMF, the formula of Table 1 for the first recursion yields the three $2 \times 2$ Toeplitz matrices $T_0 + T_1$, $T_1$, and $T_1 + T_2$. This means that in the first recursion we calculate the following: $(t_3 + t_1, t_2 + t_0, t_4 + t_2)$, $(t_2, t_3, t_4)$, $(t_5 + t_3, t_2 + t_4, t_6 + t_4)$. Then the formulas of Table 1 are applied to the three $2 \times 2$ matrices. This results in the circuit depicted in the CMF block in Fig. 2. The same is done for the CVF block. Indeed, if we apply the formula of Table 1 to $V$, we get the three vectors $V_1$, $V_0 + V_1$, and $V_0$ of size 2. Then the formulas are applied to each of the three resulting vectors. We then obtain the CVF block shown in Fig. 2. The component multiplication is then just a level of parallel AND operations with inputs $CMF(T)$ and $CVF(V)$. Then, for the reconstruction, we apply the formulas of Table 1 to each group of three bits of the nine bits output by CM. Each group gives a vector of length two, and the reconstruction is then applied to these three vectors of size two. The resulting block decomposition of the Fan-Hasan multiplier is given in Fig. 2.
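The four blocks can be modelled as four small functions, which makes the data flow of Figs. 1 and 2 easy to check in software. The sketch below is an illustration under assumptions, not the authors' circuit: a Toeplitz matrix is a list `t` of $2n - 1$ bits with $T_{i,j}$ = `t[n - 1 + i - j]`, the component order is $(T_0 + T_1, T_1, T_1 + T_2)$ for the matrix and $(V_1, V_0 + V_1, V_0)$ for the vector, and all function names are hypothetical.

```python
def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]


def cmf(t):
    """Component matrix formation: 2n-1 diagonal bits -> n^log2(3) bits."""
    if len(t) == 1:
        return list(t)
    h = (len(t) + 1) // 4                      # h = n / 2
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    return cmf(xor_vec(t0, t1)) + cmf(t1) + cmf(xor_vec(t1, t2))


def cvf(v):
    """Component vector formation: n bits -> n^log2(3) bits."""
    if len(v) == 1:
        return list(v)
    h = len(v) // 2
    v0, v1 = v[:h], v[h:]
    return cvf(v1) + cvf(xor_vec(v0, v1)) + cvf(v0)


def cm(ct, cv):
    """Component multiplication: one level of parallel AND operations."""
    return [a & b for a, b in zip(ct, cv)]


def reconstruct(w):
    """Reconstruction R: n^log2(3) bits -> n bits."""
    if len(w) == 1:
        return list(w)
    k = len(w) // 3
    r0, r1, r2 = reconstruct(w[:k]), reconstruct(w[k:2 * k]), reconstruct(w[2 * k:])
    return xor_vec(r0, r1) + xor_vec(r1, r2)


def tmvp_blocks(t, v):
    """T . V computed as R(CM(CMF(T), CVF(V)))."""
    return reconstruct(cm(cmf(t), cvf(v)))
```

For $n = 4$, `cmf` and `cvf` both return nine bits, matching the nine AND operations of the CM block in the example above.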
Remark 1. For hardware implementation, it is interesting to know the maximal number of gates connected to a signal in the Fan-Hasan TMVP circuit. Clearly this number is bounded by the total number of gates of the multiplier. This means that, for the Fan-Hasan multiplier, a signal is connected to at most $O(n^{\log_2(3)})$ gates for a two-way split multiplier and $O(n^{\log_3(6)})$ gates for a three-way split multiplier. This upper bound is, however, not tight enough. We have determined more precisely the number of gates connected to a given signal in order to obtain a better upper bound. In each case, we have found that it is logarithmic in $n$, i.e., the number of gates connected to a signal is at most $O(\log(n))$. A precise bound is given in the following theorem, and its proof is given in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2010.275.

Theorem 1. Let $s$ be a signal in the Fan-Hasan TMVP multiplier. Then it is connected to at most $2\log_2(n) + 1$ gates in the two-way split case and $3\log_3(n) + 1$ gates in the three-way split case.
Block Complexity in Fan-Hasan Multiplier
We can evaluate the complexity of the Fan-Hasan architecture by counting separately the contributions of the above-mentioned four blocks to the space (S) and delay (D) complexities. We detail the computation for the number of XOR gates $S^{\oplus}_{CMF}(n)$ of the CMF block in the two-way split case. A direct method to compute the matrix additions $T_0 + T_1$ and $T_1 + T_2$ needs $2(n - 1)$ additions. As mentioned in [4], some terms can be reused in the computation of $T_0 + T_1$ and $T_1 + T_2$. Specifically, in the computation of $T_1 + T_2$ we only need to compute the first column of $T_1 + T_2$. Indeed, the first row of $T_1 + T_2$ also appears in the first column of $T_0 + T_1$. Therefore, the total number of XOR gates required for the two matrix additions is equal to $\frac{3n}{2} - 1$. We obtain the following inductive relation for $S^{\oplus}_{CMF}$:
$$S^{\oplus}_{CMF}(1) = 0, \qquad S^{\oplus}_{CMF}(n) = 3\,S^{\oplus}_{CMF}(n/2) + \frac{3n}{2} - 1. \qquad (2)$$
A lemma of [4] on such inductive relations (Lemma 1) helps us to get the noninductive expression of $S^{\oplus}_{CMF}$. Applying Lemma 1 to the inductive relation (2), we obtain
$$S^{\oplus}_{CMF}(n) = \frac{5}{2}\,n^{\log_2(3)} - 3n + \frac{1}{2}.$$
We can use the same method for both two-way split and three-way split approaches for the component vector formation of $V$, the component multiplication, and the vector reconstruction. Their space and time complexities are summarized in Table 3.

Remark 2. In the two-way split approach, the number of bits of the component representation of either $V$ or $T$ is equal to $n^{\log_2(3)}$. In the three-way split approach, the number of bits of the component representation of either $V$ or $T$ is equal to $n^{\log_3(6)}$.
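The passage from an inductive relation to a closed form can be checked numerically. The sketch below evaluates the CMF recurrence stated in the text ($\frac{3n}{2} - 1$ XOR gates per level, three half-size sub-blocks) and compares it with the closed form obtained by solving that recurrence by hand, $\frac{5}{2} n^{\log_2(3)} - 3n + \frac{1}{2}$. The reconstruction recurrence (two vector additions of size $n/2$ per level) is an assumption that is consistent with the $2n^{\log_2(3)} - 2n$ count used later in the text. Function names are hypothetical, and $n = 2^s$ is assumed so that $n^{\log_2(3)} = 3^s$.

```python
def cmf_xor_recurrence(n):
    """XOR gates of the two-way CMF block: S(n) = 3 S(n/2) + 3n/2 - 1, S(1) = 0."""
    if n == 1:
        return 0
    return 3 * cmf_xor_recurrence(n // 2) + (3 * n) // 2 - 1


def cmf_xor_closed(s):
    """Closed form (5/2) 3^s - 3 * 2^s + 1/2 for n = 2^s, kept in integers."""
    return (5 * 3 ** s - 6 * 2 ** s + 1) // 2


def recon_xor_recurrence(n):
    """XOR gates of the two-way reconstruction: S(n) = 3 S(n/2) + n, S(1) = 0."""
    if n == 1:
        return 0
    return 3 * recon_xor_recurrence(n // 2) + n


def recon_xor_closed(s):
    """Closed form 2 * 3^s - 2 * 2^s, i.e. 2 n^log2(3) - 2n for n = 2^s."""
    return 2 * 3 ** s - 2 * 2 ** s


def component_length(s):
    """Number of bits of CMF(T) or CVF(V) for n = 2^s (Remark 2)."""
    return 3 ** s
```

Calling `cmf_xor_recurrence(2 ** s)` and `cmf_xor_closed(s)` for a range of `s` values is a quick sanity check when adapting the recurrences to other block types.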
BLOCK RECOMBINATION OF FAN-HASAN MULTIPLIER
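We would like to use the block decomposition of the Fan-Hasan multiplier in order to reduce its space complexity by recombining the underlying blocks. We modify only the first step in the recursive computation of the Fan-Hasan multiplier. For the two-way split approach, we split $T$ into four blocks and $V$ into two blocks, and we perform $T \cdot V$ using the direct computation of a matrix-vector product shown in (3):
$$T \cdot V = \begin{bmatrix} T_1 & T_0 \\ T_2 & T_1 \end{bmatrix} \cdot \begin{bmatrix} V_0 \\ V_1 \end{bmatrix} = \begin{bmatrix} T_1 V_0 + T_0 V_1 \\ T_2 V_0 + T_1 V_1 \end{bmatrix}. \qquad (3)$$
We note that the Toeplitz matrix-vector product is reduced to the computation of two separate instances of two-TMVPs-and-add, i.e., $W_0 = T_1 V_0 + T_0 V_1$ and $W_1 = T_2 V_0 + T_1 V_1$.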
Recombination of Two Parallel TMVPs and Add
We consider here the problem of computing two parallel TMVPs followed by an addition. Specifically, if $T$ and $T'$ are $n \times n$ Toeplitz matrices, and $V$ and $V'$ are two vectors of size $n$, we want to compute
$$W = T \cdot V + T' \cdot V'.$$
A simple way to design an architecture to perform this operation consists of two parallel subquadratic TMVP multipliers and $n$ XOR gates to add the two products. This architecture is shown in Fig. 3. In order to reduce the space complexity, we try to combine parts of the two multipliers. This can be achieved by joining the reconstruction parts of the multipliers using the following property of vector reconstruction.

Property 1. Let $\hat{W}$ and $\hat{W}'$ be two vectors of size $n^{\log_2(3)}$ (resp. $n^{\log_3(6)}$ for the three-way split case). Let $R$ denote the reconstruction function. Then we have
$$R(\hat{W}) + R(\hat{W}') = R(\hat{W} + \hat{W}'). \qquad (4)$$
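Proof. We first prove it by induction for the two-way split case. When $n = 1$, we have $U = R(\hat{U}) = \hat{U}$ for all $U$, which implies that $R(\hat{W}) + R(\hat{W}') = \hat{W} + \hat{W}' = R(\hat{W} + \hat{W}')$. Let us now show it for $n = 2^s$, assuming that (4) is true for $n = 2^{s-1}$. We decompose $\hat{W}$ and $\hat{W}'$ into three parts of size $(n/2)^{\log_2(3)}$:
$$\hat{W} = [\hat{W}_0, \hat{W}_1, \hat{W}_2], \qquad \hat{W}' = [\hat{W}'_0, \hat{W}'_1, \hat{W}'_2].$$
We then apply the formula defining $R$ given in Table 1 to these expressions of $\hat{W}$ and $\hat{W}'$:
$$R(\hat{W}) = (R(\hat{W}_0) + R(\hat{W}_1), R(\hat{W}_1) + R(\hat{W}_2)), \qquad R(\hat{W}') = (R(\hat{W}'_0) + R(\hat{W}'_1), R(\hat{W}'_1) + R(\hat{W}'_2)).$$
Now we add these two identities and then we apply the induction hypothesis, resulting in
$$R(\hat{W}) + R(\hat{W}') = (R(\hat{W}_0 + \hat{W}'_0) + R(\hat{W}_1 + \hat{W}'_1), R(\hat{W}_1 + \hat{W}'_1) + R(\hat{W}_2 + \hat{W}'_2)) = R(\hat{W} + \hat{W}').$$
This ends the proof for the two-way split case. We give the proof of the three-way split case in the appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2010.275. □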
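Property 1 says that the reconstruction function is linear over $\mathbb{F}_2$, and this is easy to test in software. The following sketch assumes the two-way reconstruction $R(\hat{W}) = (R(\hat{W}_0) + R(\hat{W}_1), R(\hat{W}_1) + R(\hat{W}_2))$ on bit lists whose length is a power of 3; the helper names are hypothetical.

```python
def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]


def reconstruct(w):
    """Two-way reconstruction R on a bit list of length 3^s."""
    if len(w) == 1:
        return list(w)
    k = len(w) // 3
    r0, r1, r2 = reconstruct(w[:k]), reconstruct(w[k:2 * k]), reconstruct(w[2 * k:])
    return xor_vec(r0, r1) + xor_vec(r1, r2)


def property1_holds(w, w_prime):
    """Check R(W) + R(W') == R(W + W') for one pair of component vectors."""
    left = xor_vec(reconstruct(w), reconstruct(w_prime))
    right = reconstruct(xor_vec(w, w_prime))
    return left == right
```

Because `reconstruct` only ever XORs and concatenates its inputs, the check succeeds for every pair of vectors, which is exactly what allows the addition to be moved in front of the reconstruction block.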
We apply the previous property to reorganize the block decomposition of the two-TMVPs-and-add architecture. For the sake of simplicity we consider only the case of two-way split multipliers. Similar recombination could be performed on the three-way split multiplier. Property 1 allows us to perform the block recombination depicted in Fig. 4 for the two-way split case. In Fig. 4 , the structure on the left side is functionally equal to the new structure on the right side.
Reconstruction blocks are relatively space consuming since they require $(2n^{\log_2(3)} - 2n)$ XOR gates each. The recombination of Fig. 4 reduces the number of reconstruction blocks from two to one. At the same time, the number of XOR gates in the component addition (CA) block becomes $n^{\log_2(3)}$, since the size of the component representation is equal to $n^{\log_2(3)}$ bits. The space complexity of the architecture in the left part of Fig. 4 is equal to $4n^{\log_2(3)} - 3n$ XOR gates. The space complexity of the recombined architecture in the right part of Fig. 4 is equal to $3n^{\log_2(3)} - 2n$ XOR gates. Consequently, this recombination reduces the space complexity by $(n^{\log_2(3)} - n)$ XOR gates. Now, we include this recombination scheme in the two-TMVPs-and-add architecture. The resulting recombined architecture for two-TMVPs-and-add is given in Fig. 5.
Complexity. We evaluate the complexity of the architecture depicted in Fig. 5. In the new architecture there exist two CMFs, two CVFs, two CMs, one CA, and one reconstruction block, and its space complexity is the sum of the complexities of these blocks. Using the complexity of each block given in Table 3, we obtain the complexities listed in Table 4. In this table, we also give the complexities of the architecture in Fig. 3 for comparison. We can see that, for the two-way split case, the recombination performed in Fig. 5 reduces the number of XOR gates by $(n^{\log_2(3)} - n)$ while having the same delay. The space and delay complexities for the three-way split approach can be calculated in the same way; the results are also summarized in Table 4.
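The recombined two-TMVPs-and-add data flow (two CMFs, two CVFs, two CMs, one CA, one R) can be modelled as follows. This is a functional sketch under assumptions, not the circuit of Fig. 5: matrices are lists of $2n - 1$ diagonal bits with $T_{i,j}$ = `t[n - 1 + i - j]`, $n$ is a power of 2, and every function name is hypothetical.

```python
def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]


def cmf(t):
    if len(t) == 1:
        return list(t)
    h = (len(t) + 1) // 4
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    return cmf(xor_vec(t0, t1)) + cmf(t1) + cmf(xor_vec(t1, t2))


def cvf(v):
    if len(v) == 1:
        return list(v)
    h = len(v) // 2
    return cvf(v[h:]) + cvf(xor_vec(v[:h], v[h:])) + cvf(v[:h])


def reconstruct(w):
    if len(w) == 1:
        return list(w)
    k = len(w) // 3
    r0, r1, r2 = reconstruct(w[:k]), reconstruct(w[k:2 * k]), reconstruct(w[2 * k:])
    return xor_vec(r0, r1) + xor_vec(r1, r2)


def component_product(t, v):
    """CM block applied to CMF(T) and CVF(V)."""
    return [a & b for a, b in zip(cmf(t), cvf(v))]


def two_tmvps_plain(t, v, tp, vp):
    """Two reconstructions, then n XORs: R(W^) + R(W^')."""
    return xor_vec(reconstruct(component_product(t, v)),
                   reconstruct(component_product(tp, vp)))


def two_tmvps_recombined(t, v, tp, vp):
    """One component addition (CA), then a single reconstruction: R(W^ + W^')."""
    ca = xor_vec(component_product(t, v), component_product(tp, vp))
    return reconstruct(ca)
```

Both functions return the same vector $T \cdot V + T' \cdot V'$; the recombined one trades a second reconstruction block and the final $n$ XORs for $n^{\log_2(3)}$ XORs in the CA stage.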
Example 2. In Fig. 6, we give the recombined hardware architecture which computes $W = T \cdot V + T' \cdot V'$, where the Toeplitz matrices $T$ and $T'$ and the vectors $V$ and $V'$ have size $n = 4$.
Remark 3. The recombination done for the two-TMVPs-and-add architecture can be generalized to a k-TMVPs-and-add architecture where $k \ge 2$. Indeed, Property 1 can be extended to the following more general case:
$$\sum_{i=1}^{k} R(\hat{W}_i) = R\Big(\sum_{i=1}^{k} \hat{W}_i\Big),$$
where the $\hat{W}_i$ are vectors of size $n^{\log_2(3)}$ for the two-way split case (resp. $n^{\log_3(6)}$ for the three-way split case). Consequently, we can recombine a k-TMVPs-and-add architecture by moving the addition blocks before the reconstruction block. We organize the component addition blocks through a binary tree. We obtain the architecture given in Fig. 7.

TABLE 4. Complexities of Two-TMVPs-and-Add Architectures
We can evaluate the complexity of this architecture in terms of $n$ and $k$ by counting the number of each block, and then using the block complexities of Table 3. For the two-way split case, for example, this reduces the space complexity of the k-TMVPs-and-add architecture by $(k - 1)(n^{\log_2(3)} - n)$ XOR gates compared to the nonrecombined architecture.
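The saving of $(k - 1)(n^{\log_2(3)} - n)$ XOR gates can be re-derived by counting only the blocks that differ between the two architectures. The sketch below is an assumption-labelled bookkeeping exercise, not a reproduction of the lost formulas: it uses the reconstruction cost $2n^{\log_2(3)} - 2n$ and the component size $n^{\log_2(3)}$ stated in the text, takes $n = 2^s$, and uses hypothetical function names.

```python
def recon_xor(s):
    """XOR gates of one two-way reconstruction block for n = 2^s."""
    return 2 * 3 ** s - 2 * 2 ** s


def plain_tail_xor(s, k):
    """Nonrecombined: k reconstructions, then (k - 1) additions of n bits."""
    n = 2 ** s
    return k * recon_xor(s) + (k - 1) * n


def recombined_tail_xor(s, k):
    """Recombined: (k - 1) component additions of n^log2(3) bits, one reconstruction."""
    return (k - 1) * 3 ** s + recon_xor(s)


def xor_saving(s, k):
    """XOR gates saved by the recombination (CMF, CVF and CM blocks are unchanged)."""
    return plain_tail_xor(s, k) - recombined_tail_xor(s, k)
```

For $k = 2$ the two counts reduce to the $4n^{\log_2(3)} - 3n$ and $3n^{\log_2(3)} - 2n$ figures quoted for Fig. 4.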
The modification of the two-TMVPs-and-add architecture can be formulated algorithmically as presented in Algorithm 1 below for the two-way split approach. The algorithm for the three-way split approach can be obtained using the same methodology.

Algorithm 1. Two-TMVPs-and-Add($T$, $V$, $T'$, $V'$)
Require: $T$, $V$, $T'$, and $V'$, where $T$ and $T'$ are $n \times n$ Toeplitz matrices and $V$ and $V'$ are vectors of length $n = 2^s$ each.
The following theorem proves the correctness of Algorithm 1.
Theorem 2. Let $T$ and $T'$ be two $n \times n$ Toeplitz matrices and $V$ and $V'$ be two vectors of length $n$ with entries in $\mathbb{F}_2$ and $n = 2^s$. Then Algorithm 1 returns the correct value of $T \cdot V + T' \cdot V'$.
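Proof. This can be proved by induction on $s = \log_2(n)$. For $s = 0$, i.e., $n = 1$, the algorithm returns $T \cdot V + T' \cdot V'$, which is correct. Now we assume that Algorithm 1 returns the correct value for $n = 2^i$ where $i = 0, 1, 2, \ldots, s$, and we show that under this assumption it returns the correct value for $n = 2^{s+1}$ as well. For this case the algorithm recursively computes
$$P_0 = \text{Two-TMVPs-and-Add}(T_0 + T_1, V_1, T'_0 + T'_1, V'_1),$$
$$P_1 = \text{Two-TMVPs-and-Add}(T_1, V_0 + V_1, T'_1, V'_0 + V'_1),$$
$$P_2 = \text{Two-TMVPs-and-Add}(T_1 + T_2, V_0, T'_1 + T'_2, V'_0).$$
By induction, since the matrices and vectors have size $2^s$, the recursive calls to Algorithm 1 return the correct values. This means that
$$P_0 = (T_0 + T_1)V_1 + (T'_0 + T'_1)V'_1, \quad P_1 = T_1(V_0 + V_1) + T'_1(V'_0 + V'_1), \quad P_2 = (T_1 + T_2)V_0 + (T'_1 + T'_2)V'_0.$$
Then, the final returned vector satisfies
$$\begin{bmatrix} P_0 + P_1 \\ P_1 + P_2 \end{bmatrix} = \begin{bmatrix} T_1 V_0 + T_0 V_1 + T'_1 V'_0 + T'_0 V'_1 \\ T_2 V_0 + T_1 V_1 + T'_2 V'_0 + T'_1 V'_1 \end{bmatrix} = T \cdot V + T' \cdot V'. \qquad \square$$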
Remark 4.
- The problem of computing two-TMVPs-and-add can be seen as a single matrix-vector product. Indeed, consider an $n \times 2n$ matrix $T$ which can be split into $n \times n$ Toeplitz submatrices $T_1$ and $T_2$ as $T = [T_1 \; T_2]$, and let $V = [V_1, V_2]$ be a vector of size $2n$. Then the matrix-vector product $T \cdot V$ can be computed as
$$T \cdot V = T_1 \cdot V_1 + T_2 \cdot V_2.$$
This is a two-TMVPs-and-add computation. We can thus compute it by applying the method previously presented in this section. Thus Algorithm 1 can be used to perform the product of a block Toeplitz matrix and a vector.
- Algorithm 1 can be generalized to k-TMVPs-and-add with $k \ge 2$. This extends directly from the results on the two-TMVPs-and-add case.
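The recursion used in the correctness proof of Algorithm 1 can be sketched directly in Python. The body of Algorithm 1 is not reproduced in this text, so the version below is a reconstruction from the proof and should be read as an assumption: it recurses on the pairs $(T_0 + T_1, V_1)$, $(T_1, V_0 + V_1)$, $(T_1 + T_2, V_0)$ for both products at once and adds the two scalar products at the leaves. Matrices are lists of $2n - 1$ diagonal bits with $T_{i,j}$ = `t[n - 1 + i - j]`; all names are hypothetical.

```python
def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]


def split_toeplitz(t):
    """Return the diagonal lists of T0, T1, T2 for T = [[T1, T0], [T2, T1]]."""
    h = (len(t) + 1) // 4
    return t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]


def two_tmvps_and_add(t, v, tp, vp):
    """Compute T.V + T'.V' for n = 2^s with a single shared reconstruction."""
    n = len(v)
    if n == 1:
        # Leaf: two AND gates and one component-addition XOR.
        return [(t[0] & v[0]) ^ (tp[0] & vp[0])]
    h = n // 2
    t0, t1, t2 = split_toeplitz(t)
    tp0, tp1, tp2 = split_toeplitz(tp)
    v0, v1 = v[:h], v[h:]
    vp0, vp1 = vp[:h], vp[h:]
    p0 = two_tmvps_and_add(xor_vec(t0, t1), v1, xor_vec(tp0, tp1), vp1)
    p1 = two_tmvps_and_add(t1, xor_vec(v0, v1), tp1, xor_vec(vp0, vp1))
    p2 = two_tmvps_and_add(xor_vec(t1, t2), v0, xor_vec(tp1, tp2), vp0)
    # Only one set of reconstruction additions is performed per level.
    return xor_vec(p0, p1) + xor_vec(p1, p2)
```

Written this way, the component addition happens at the leaves and the reconstruction additions are shared by both products, which mirrors the CA-before-R ordering of the recombined architecture.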
Recombination for a Single Two-Way Split TMVP Multiplier
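We now consider the problem of computing a single Toeplitz matrix-vector product. Specifically, let $T$ be an $n \times n$ Toeplitz matrix and $V$ be a vector of size $n$. We would like to efficiently compute $W = T \cdot V$.
Our strategy is to modify the first recursion of the Fan-Hasan multiplier in order to reduce the space complexity by block recombination. We split $T$ into four blocks and $V$ into two blocks, and we perform $T \cdot V$ through a direct computation. We obtain
$$T \cdot V = \begin{bmatrix} T_1 & T_0 \\ T_2 & T_1 \end{bmatrix} \cdot \begin{bmatrix} V_0 \\ V_1 \end{bmatrix} = \begin{bmatrix} T_1 V_0 + T_0 V_1 \\ T_2 V_0 + T_1 V_1 \end{bmatrix}.$$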
Now if we perform the four TMVPs of size n=2 using the Fan-Hasan multiplier we obtain the architecture shown in Fig. 8 . Now if we use the recombination of Section 3.1 we obtain the architecture of Fig. 9 .
We remark that several blocks are performing exactly the same computation. Specifically, there are two blocks which perform the component matrix formation of T 1 , two blocks perform the component vector formation of V 0 , and two blocks perform the component vector formation of V 1 . We can thus keep only one of each of the above mentioned blocks. After doing this we obtain the architecture depicted in Fig. 10 .
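The block sharing described above can be sketched as follows: the component representations of $T_1$, $V_0$, and $V_1$ are each computed once and reused by both two-TMVPs-and-add instances. This is a functional model under assumptions (diagonal-list storage with $T_{i,j}$ = `t[n - 1 + i - j]`, $n$ a power of 2 with $n \ge 2$, hypothetical function names), not the circuit of Fig. 10.

```python
def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]


def and_vec(a, b):
    return [x & y for x, y in zip(a, b)]


def cmf(t):
    if len(t) == 1:
        return list(t)
    h = (len(t) + 1) // 4
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    return cmf(xor_vec(t0, t1)) + cmf(t1) + cmf(xor_vec(t1, t2))


def cvf(v):
    if len(v) == 1:
        return list(v)
    h = len(v) // 2
    return cvf(v[h:]) + cvf(xor_vec(v[:h], v[h:])) + cvf(v[:h])


def reconstruct(w):
    if len(w) == 1:
        return list(w)
    k = len(w) // 3
    r0, r1, r2 = reconstruct(w[:k]), reconstruct(w[k:2 * k]), reconstruct(w[2 * k:])
    return xor_vec(r0, r1) + xor_vec(r1, r2)


def tmvp_recombined(t, v):
    """T.V via two recombined two-TMVPs-and-add instances of size n/2."""
    h = len(v) // 2
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    ct0, ct1, ct2 = cmf(t0), cmf(t1), cmf(t2)    # three CMF blocks (T1 shared)
    cv0, cv1 = cvf(v[:h]), cvf(v[h:])            # two CVF blocks (both shared)
    # Four CM blocks, two CA blocks, two reconstruction blocks.
    top = reconstruct(xor_vec(and_vec(ct1, cv0), and_vec(ct0, cv1)))
    bottom = reconstruct(xor_vec(and_vec(ct2, cv0), and_vec(ct1, cv1)))
    return top + bottom
```

In this model the AND count is $4(n/2)^{\log_2(3)}$, since four component multiplications of size $n/2$ remain after the duplicated formation blocks are removed.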
Let us now evaluate the complexity of the architecture of Fig. 10. The space complexity of this approach is expressed in terms of the space complexity of its different blocks, all of size $n/2$: three CMF blocks, two CVF blocks, four CM blocks, two CA blocks, and two reconstruction blocks. Then, using the formulas in Table 3 and Remark 2, we obtain the space complexity, and the time complexity is equal to $D_A + (3\log_3(n) + 1)D_X$.
Complexity Comparison
In this section, we compare the multipliers of Fig. 10 and Fig. 12 to other similar multipliers in the literature. The comparison is made with the best known methods for polynomial multiplication (Karatsuba, Winograd), which do not include the reduction modulo the irreducible polynomial, and with the Toeplitz matrix-vector product and CRT methods, which include the reduction. The complexity results are listed in Table 5.
As can be seen from the table, the Fan-Hasan architectures outperform the CRT, Karatsuba, and Winograd methods regarding both space and time complexities. In the two-way split approach, our proposed method presents the same time complexity as that of Fan-Hasan. It also matches the total number of gates used by Fan-Hasan. Since our proposal uses fewer XOR gates, it is more suitable for ASIC implementations, as the area of an XOR gate is larger than that of an AND gate in CMOS libraries. Regarding the three-way split approach, it can be seen from the table that our proposal and the Fan-Hasan scheme present the lowest time complexities. Regarding space complexity, our proposed method has the smallest area usage compared to all other architectures listed in the table.
APPLICATION TO GHASH
In this section, we present an application of the block recombination of the two-TMVPs-and-add architecture presented in Section 3.1 to the GHASH function. GHASH is used in Galois Counter Mode (GCM) [11] to compute the Message Authentication Code (MAC). GHASH is defined as
$$GHASH(C) = C_1 H^N + C_2 H^{N-1} + \cdots + C_N H,$$
where $C = (C_1, \ldots, C_N)$ is such that $C_i \in \mathbb{F}_{2^n}$ and $H \in \mathbb{F}_{2^n}$. The element $H$ remains unchanged for a single block of messages. The most costly computation for GHASH is the multiplication in $\mathbb{F}_{2^n}$. To speed it up, two multiplications can be performed in parallel. We recall here a recently proposed method for parallel GHASH implementation presented in [16]. We then present our proposal for parallel implementation of GCM with smaller area requirements.
Parallel GHASH through Polynomial Decomposition in Even/Odd Parts [16]
The authors in [16] have noticed that $GHASH(C)$ can be split into two polynomial evaluations in $H^2$, where we assume that $N$ is even. In other words, if we call $C_{odd}(H^2)$ the odd part of $GHASH(C)$ and $C_{even}(H^2)$ the even part of $GHASH(C)$, then $GHASH(C)$ is obtained by combining these two parts.
$C_{even}(H^2)$ and $C_{odd}(H^2)$ are independent GHASH computations using the constant $H^2$. They can be computed in parallel. Algorithm 2 computes $GHASH(C)$ at $H$ using this approach.
Algorithm 2. ParallelEval
A hardware architecture corresponding to Algorithm 2 is depicted in Fig. 13. This architecture requires two $\mathbb{F}_{2^n}$ multipliers in parallel, which independently compute $C_{odd}(H^2)$ and $C_{even}(H^2)$. We have decomposed the two multipliers into elementary blocks in order to easily evaluate the space complexity.
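One way to picture the even/odd decomposition is the following software sketch. It is an illustration under several assumptions and is not claimed to match the exact expression of [16], which is not reproduced here: field elements are integers in polynomial basis (the bit-reflected convention of the GCM standard is ignored), $GHASH(C) = C_1 H^N + \cdots + C_N H$ is taken as the reference, the odd-indexed blocks are accumulated with a final multiplication by $H^2$, and the even-indexed blocks are accumulated without it and multiplied by $H$ at the end. All names are hypothetical.

```python
def gf_mul(a, b, p, n):
    """Shift-and-add multiplication in F_2[X]/(P); p includes the X^n term."""
    r = 0
    for i in range(n):
        if (b >> i) & 1:
            r ^= a
        a <<= 1
        if (a >> n) & 1:
            a ^= p
    return r


def ghash_horner(blocks, h, p, n):
    """Reference: C_1 H^N + C_2 H^(N-1) + ... + C_N H by Horner's rule."""
    y = 0
    for c in blocks:
        y = gf_mul(y ^ c, h, p, n)
    return y


def ghash_even_odd(blocks, h, p, n):
    """Two independent chains with the constant H^2 (N must be even)."""
    if len(blocks) % 2:
        raise ValueError("the number of blocks N is assumed to be even")
    h2 = gf_mul(h, h, p, n)
    odd = 0    # accumulates C_1, C_3, ... (odd indices, 1-based)
    even = 0   # accumulates C_2, C_4, ... (even indices, 1-based)
    for j in range(0, len(blocks), 2):
        odd = gf_mul(odd ^ blocks[j], h2, p, n)
        even = gf_mul(even, h2, p, n) ^ blocks[j + 1]
    return odd ^ gf_mul(even, h, p, n)
```

The two chains never depend on each other inside the loop, so each iteration corresponds to two multiplications by the same constant $H^2$ that can run in parallel.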
Two-TMVPs-and-Add Approach
Our proposal to compute $GHASH(C)$ consists of using a Most Significant Digit expression of $GHASH(C)$. This expression can be computed through a sequence of $N/2$ two-multiply-and-add operations in $\mathbb{F}_{2^n}$. Algorithm 3 computes $GHASH(C)$ using this method.
Algorithm 3. GHASH by two multiplications and add
The corresponding hardware architecture uses the approach previously presented in Section 3.1. Specifically the two multipliers are recombined in order to reduce the space complexity.
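The sequence of two-multiply-and-add steps can be sketched as below. The body of Algorithm 3 is not reproduced in this text, so the update rule $Y \leftarrow (Y + C_{2i-1}) H^2 + C_{2i} H$ is an assumption derived from $GHASH(C) = C_1 H^N + \cdots + C_N H$ with $N$ even. Field elements are integers in polynomial basis (the bit ordering of the GCM standard is ignored), and all names are hypothetical.

```python
def gf_mul(a, b, p, n):
    """Shift-and-add multiplication in F_2[X]/(P); p includes the X^n term."""
    r = 0
    for i in range(n):
        if (b >> i) & 1:
            r ^= a
        a <<= 1
        if (a >> n) & 1:
            a ^= p
    return r


def ghash_horner(blocks, h, p, n):
    """Reference: one multiplication by H per block."""
    y = 0
    for c in blocks:
        y = gf_mul(y ^ c, h, p, n)
    return y


def ghash_two_mul_add(blocks, h, p, n):
    """N/2 iterations, each a two-multiply-and-add with constants H^2 and H."""
    if len(blocks) % 2:
        raise ValueError("the number of blocks N is assumed to be even")
    h2 = gf_mul(h, h, p, n)
    y = 0
    for j in range(0, len(blocks), 2):
        # (Y + C_{2i-1}) * H^2  +  C_{2i} * H: two products, one sum.
        y = gf_mul(y ^ blocks[j], h2, p, n) ^ gf_mul(blocks[j + 1], h, p, n)
    return y
```

Each iteration has exactly the shape $T \cdot V + T' \cdot V'$ once the multiplications by $H^2$ and $H$ are expressed as Toeplitz matrix-vector products, which is what lets the two multipliers share a single reconstruction block.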
Comparison of the Two Approaches
We now evaluate the complexities of the architectures shown in Figs. 13 and 14 using the block complexity of Table 3 .
. Complexity of the architecture of Fig. 13 . The space complexity of the architecture of Fig. 13 is given by
R È þ 2n and S ¼ 2S M : All the multiplications are done with operand H 2 . Consequently, the component matrix formation of H 2 can be precomputed, and two CMF blocks can be removed from Fig. 13 . The number of XORs becomes
n þ 2n and the number of AND remains the same. The delay is equal to the delay of a Fan-Hasan multiplier plus one D X . The resulting complexities of the optimized version of Fig. 13 for the two-way split and three-way split are given in Table 6 . Table 6 . Table 6 shows that our architecture has the same delay as the architecture presented in [16] in each case while the area decreases by n log 2 ð3Þ À n (resp. n log 3 ð6Þ À n) for the two-way split (resp. the three-way split) approach.
Remark 5. The authors in [16] presented their approach for $k$ parallel multiplications. For the sake of simplicity, we have focused on the case $k = 2$. Algorithm 3 and its corresponding architecture (Fig. 14) can be extended to $k$ parallel multiplications and additions. For the general case, we would have similar results regarding the improvement in space complexity.
CONCLUSION
The Toeplitz matrix-vector multiplier of Fan and Hasan [4] can be decomposed into different independent blocks. In this paper, we have used this decomposition to design an architecture which performs two-TMVPs-and-add with smaller area requirements. Moreover, we have used this block recombination method for a single TMVP architecture. We have modified the first recursive computation in the Fan-Hasan multiplier and then recombined the resulting blocks. We again obtained a multiplier with a lower space complexity. Finally, we have applied our block recombination approach to develop an efficient parallel implementation of the GHASH function of GCM.
