Abstract. In this paper a new low complexity parallel multiplier for characteristic two finite fields $GF(2^m)$ is proposed. In particular, our multiplier works with field elements represented through both Canonical Basis and Type I Optimal Normal Basis (ONB), provided that the irreducible polynomial generating the field is an All One Polynomial (AOP). The main advantage of the scheme is the resulting space complexity, significantly lower than that of the other fast parallel multipliers of the same class currently available in the open literature.
Introduction
Finite fields have recently attracted a lot of attention due to the increasing number of cryptography and coding theory applications that require high performance finite field capabilities ([9]). Several new architectures have been proposed in order to meet the constraints imposed by specific purposes ([2, 8, 10]). Although different solutions can be compared from several points of view, time complexity and space complexity are usually the two most important parameters. The former is defined as the elapsed time between input and output of the circuit implementing the multiplier, and it is usually expressed as a function of the field degree $m$, the delay $T_A$ of an AND gate and the delay $T_X$ of an XOR gate. The latter is defined as the pair of numbers $\Sigma_A$ and $\Sigma_X$ of AND and XOR gates used, respectively. An improvement in space complexity over the best known algorithms is still possible, since an asymptotic space complexity of $O(m \log_2 m \log_2 \log_2 m)$ is achievable ([1]); nevertheless, these two parameters are characterized by an evident trade-off. In fact, reducing the number of gates causes, in general, a corresponding increase in the execution time. So, if performance is the most critical parameter, we can accept a greater space complexity in exchange for a reduction of the corresponding time delay. Conversely, in other applications, such as those based on smart cards, mobile phones, or other portable devices, a reduced space complexity is often the most important design aspect.
For these reasons we will focus on a special class of fast multipliers, characterized by a generator of type AOP, which can take advantage of the trade-off between time and space complexity to achieve a space complexity significantly lower than that offered by the traditional bit-parallel multipliers of the same class ([3, 4, 5, 6, 7]), with a small increase in the corresponding time delay. In other words, a limited rise in the time complexity is accepted in order to obtain a more substantial reduction in the corresponding circuit area.
The paper is organized as follows: section two introduces some useful preliminaries; section three provides an architectural description of the multiplier when the field elements are represented through a Canonical Basis, while section four focuses on Type I ONB representations. The last section summarizes the results obtained and draws some conclusions.
Preliminaries
Characteristic two finite fields $GF(2^m)$ provide a plethora of methods to represent field elements according to their particular application. Specifically, the two most classical schemes reported in the literature are Canonical Basis (also called Standard Basis) and Optimal Normal Basis, though other strategies have recently been proposed ([2]). In a Normal Basis generated by an element $\gamma$, the expansion of a field element is given by $a(\gamma) = \sum_{i=0}^{m-1} a_i \gamma^{2^i}$ (for more information see [9]). In order to reduce the complexity of the field multiplication, special classes of irreducible polynomials have been suggested ([7], [10]). Among them, the AOP generators have been shown to be particularly interesting. An AOP is a polynomial of the form $p(x) = 1 + x + x^2 + \dots + x^m$, which is irreducible if and only if $m + 1$ is prime and 2 is primitive modulo $m + 1$ ([9]). For instance, for $m \le 100$ there are thirteen useful values: 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82, and 100. Moreover, each N-polynomial generating a Type I ONB is also an AOP ([9]). For this reason, in the following we will focus on AOPs, discussing the advantages of this class in the context of both Canonical Basis and Type I ONB representations.
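The irreducibility condition quoted above is easy to check numerically. The following is a minimal sketch that enumerates the degrees $m \le 100$ for which the AOP is irreducible; the helper names `is_prime`, `multiplicative_order` and `aop_is_irreducible` are hypothetical and chosen only for this illustration.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def multiplicative_order(a: int, n: int) -> int:
    """Smallest k >= 1 with a**k == 1 (mod n); assumes gcd(a, n) == 1."""
    k, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        k += 1
    return k


def aop_is_irreducible(m: int) -> bool:
    """Condition from the text: m + 1 is prime and 2 is primitive modulo m + 1."""
    p = m + 1
    # p > 2 is checked before the order so that gcd(2, p) == 1 is guaranteed.
    return is_prime(p) and p > 2 and multiplicative_order(2, p) == m


aop_degrees = [m for m in range(2, 101) if aop_is_irreducible(m)]
print(aop_degrees)
```

The printed list can be compared directly with the thirteen values given in the text.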
Canonical Basis
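Let $a(x)$ and $b(x)$ be two elements of $GF(2^m)$ represented in Canonical Basis, and let $c(x) = a(x) \cdot b(x) \bmod p(x)$ be their product. This product can be computed in two different phases:
1. computation of the ordinary product of the two polynomials $a(x)$ and $b(x)$;
2. reduction of this product modulo the generator polynomial $p(x)$.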
Multiplication of Polynomials over GF(2)
First, we observe that the degrees of the polynomials $a(x)$ and $b(x)$ are both $\le m - 1$; therefore the degree of their ordinary product will be, in turn, $\le 2m - 2$. Formally we have:
This polynomial can be computed by means of a divide-and-conquer approach originally proposed to increase the speed of integer multiplications ([11]). This strategy, which we will slightly improve and extend with respect to the results obtained in [8], themselves reminiscent of the Karatsuba-Ofman algorithm, has also been successfully applied in the case of trinomial generators ([12]).
More precisely, let us observe that in this context $m$ is necessarily even, thanks to the conditions that make $p(x)$ irreducible ($m + 1$ must be prime). Therefore we can assume $m = 2N$. As a consequence the polynomials $a(x)$ and $b(x)$ can be rewritten as
respectively, where
and analogously
Therefore, the product (x) can be computed as
which, introducing the following auxiliary polynomials
we can also express as
Eq. (3) computes the product $a(x) \cdot b(x)$ by means of three multiplications of polynomials of degree $N - 1$, together with shifts and "lettings-down" of α powers. Specifically, the architectural structure of the multiplier can be organized as follows:
- two circuits, composed of $N$ XOR gates each, for the parallel computation of $A(x) + B(x)$ and $C(x) + D(x)$;
- three circuits, composed of $N^2$ AND and $(N-1)^2$ XOR gates each, for the parallel computation of $[A(x) + B(x)] \cdot [C(x) + D(x)]$, $A(x)C(x)$ and $B(x)D(x)$; the XOR tree depth is $\log_2(N-1)$, provided that the polynomials involved have at most degree $N$;
- one circuit, composed of $2N - 1$ XOR gates, for the computation of $A(x)C(x) + B(x)D(x) + [A(x) + B(x)] \cdot [C(x) + D(x)]$;
- one circuit, composed of $2N - 2$ XOR gates, for the computation of the product polynomial by means of Eq. (3), where each term, at this point, has already been pre-computed.
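The structure listed above can be mimicked in software. The sketch below is only an illustration, not the circuit itself: polynomials over GF(2) are packed into Python integers (bit $i$ holds the coefficient of $x^i$), the split $a(x) = A(x) + x^N B(x)$, $b(x) = C(x) + x^N D(x)$ is an assumed convention, and the names `schoolbook_mul` and `karatsuba_step` are hypothetical.

```python
def schoolbook_mul(a: int, b: int) -> int:
    """Direct (carry-less) product of two GF(2) polynomials packed into ints."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift  # add a(x) * x^shift; addition is XOR
        b >>= 1
        shift += 1
    return result


def karatsuba_step(a: int, b: int, n: int) -> int:
    """One divide-and-conquer step for operands with at most 2*n coefficients.

    Assumed split: a = A + x^n * B and b = C + x^n * D.
    """
    mask = (1 << n) - 1
    A, B = a & mask, a >> n
    C, D = b & mask, b >> n

    ac = schoolbook_mul(A, C)             # first half-size product
    bd = schoolbook_mul(B, D)             # second half-size product
    cross = schoolbook_mul(A ^ B, C ^ D)  # third half-size product

    middle = cross ^ ac ^ bd              # equals A*D + B*C over GF(2)
    return ac ^ (middle << n) ^ (bd << (2 * n))


# Example with m = 10 (N = 5): both routes give the same product.
a, b = 0b1011001110, 0b1110010101
print(karatsuba_step(a, b, 5) == schoolbook_mul(a, b))
```

Only three half-size products are formed, while the remaining work consists of XOR operations and shifts, mirroring the gate inventory in the list.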
As far as the time complexity is concerned, it should be noted that the overall circuit produces the product polynomial with a time delay of $T_A + T_X(\log_2(N-1) + 3)$. In fact, after a period of time equal to $T_X$, the intermediate values $A(x) + B(x)$ and $C(x) + D(x)$ will be available; therefore, when a further $T_A + T_X(\log_2(N-1) + 1)$ seconds have elapsed, the circuit will have also computed $[A(x) + B(x)] \cdot [C(x) + D(x)]$, $B(x)D(x)$, $A(x)C(x)$ and $A(x)C(x) + B(x)D(x)$, while after a further $T_X$ seconds the computation of the term $A(x)C(x) + B(x)D(x) + [A(x) + B(x)] \cdot [C(x) + D(x)]$ will also be completed. Therefore the final result, which needs a further $T_X$ seconds to be reached, requires a time complexity equal to $T_A + T_X(\log_2(N-1) + 3)$.
The overall characteristics of the algorithm, whose details have been presented in Table 1 , are respectively:
which can be compared with those provided by a direct parallel multiplication
It is evident how the former strategy gives up part of its time performance in order to gain a factor of 3/4 in the corresponding number of gates. The values in (4) can also be further manipulated and expressed as (see also Table 1)

$$(\Sigma_A)_m = 3\,(\Sigma_A)_{m/2} \qquad (6)$$
$$(\Sigma_X)_m = 3\,(\Sigma_X)_{m/2} + 4m - 4 \qquad (7)$$
$$(\Theta)_m = (\Theta)_{m/2} + 3T_X \qquad (8)$$

where $(C)_d$ represents the complexity $C$ of the multiplier, i.e. $\Sigma_A$, $\Sigma_X$ and $\Theta$, when the input polynomials have degree at most $d - 1$, that is $d$ coefficients. Eq. (6), (7), and (8) show that the product of two polynomials of degree $\le m - 1$ can be performed by means of three multiplications of two polynomials of degree equal (at most) to about half that of the original ones, plus a little overhead needed to combine the partial results and to obtain the final output. Moreover, these three multiplications can be computed in parallel, and this is the reason why the factor 3, present in (6) and (7), does not appear in the time complexity (8). It should also be pointed out that this additional overhead is relatively small, being limited to $4m - 4$ XOR gates in (7) and to an additional time delay equal to $3T_X$ in (8).

Table 1. Time and space complexity to multiply polynomials over $GF(2)$.
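The recurrences described above can be explored numerically. The sketch below is an illustration under stated assumptions: a direct multiplier for $d$ coefficients is taken to use $d^2$ AND and $(d-1)^2$ XOR gates, each splitting step multiplies the half-size costs by three and adds $4d - 4$ XOR gates and three XOR levels, and the XOR depth of a direct multiplier is assumed to be $\lceil \log_2 d \rceil$ (computed as `(d - 1).bit_length()`). The function names are hypothetical.

```python
def direct_cost(d: int) -> tuple:
    """(AND gates, XOR gates, XOR levels after the single AND level).

    The XOR depth ceil(log2(d)) is an assumption of this sketch.
    """
    return d * d, (d - 1) ** 2, (d - 1).bit_length()


def split_once_cost(d: int) -> tuple:
    """Cost when d coefficients are split once and the halves multiplied directly."""
    and_half, xor_half, depth_half = direct_cost(d // 2)
    return 3 * and_half, 3 * xor_half + 4 * d - 4, depth_half + 3


def recursive_cost(d: int, stop: int = 4) -> tuple:
    """Cost when splitting is repeated until `stop` coefficients remain (d a power of 2)."""
    if d <= stop:
        return direct_cost(d)
    and_half, xor_half, depth_half = recursive_cost(d // 2, stop)
    return 3 * and_half, 3 * xor_half + 4 * d - 4, depth_half + 3


for d in (4, 8, 16, 32, 64):
    print(d, "direct:", direct_cost(d), "recursive:", recursive_cost(d))
```

Each tuple reads (AND count, XOR count, number of XOR levels), so the delay corresponds to $T_A$ plus the last entry times $T_X$.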
Moreover, provided that $m/2$ is also even, this strategy can be applied again in order to gain a further reduction in the gate count. For instance, assuming that $m$ is a power of 2, after $k$ iterations we will obtain:
These results show a clear trade-off between time and space complexity. To significantly reduce the number of gates we have to increase the number of iterations, although, as a side effect, the time delay of the multiplier will also rise, linearly in the same number of iterations. Of course, an interesting question is: how far can we iterate the algorithm, provided that we want to reduce the space complexity as much as possible? It is easy to see that the optimal stop condition for this recursion is $m/2^k = 4$, a value for which a direct parallel multiplication is more advantageous than the recursive scheme. In fact, iterating the algorithm we obtain $(\Sigma_A)_4 = 12$ and $(\Sigma_X)_4 = 15$, from which $\Sigma_{TOT} = (\Sigma_A)_4 + (\Sigma_X)_4 = 27$, while $(\Theta)_4 = T_A + 4T_X$. On the contrary, using a direct strategy we have $(\Sigma_A)_4 = 16$ and $(\Sigma_X)_4 = 9$, from which $\Sigma_{TOT} = (\Sigma_A)_4 + (\Sigma_X)_4 = 25$, while $(\Theta)_4 = T_A + 2T_X$. Therefore, taking into account this stop condition, the corresponding complexities, in the case of $m = 2^t$, will be:

which slightly improves the results reached in [8]. For a quantitative comparison see also Table 2, which shows that our scheme requires a greater number of AND gates than [8], but reduces both the overall number of gates $\Sigma_{TOT}$ and the time complexity $\Theta$.

Now we have to generalize the previous results, in order to make the scheme suitable for AOP generators. In fact, it is possible to employ the same strategy also when $m$ is not a power of 2. To keep the design very modular, we do not optimize the structure of the multiplier by distinguishing the two cases, $m$ even and $m$ odd, as done in [12]. Instead, we simply expand the circuit registers to handle, at each step, polynomials of odd degree, that is, with an even number of coefficients. As a consequence, the following generalization can be derived and used to multiply polynomials of any degree ($m \ge 4$):
At the end of this first phase, the circuit outputs the $2m - 1$ coefficients of the product polynomial, i.e. those of the powers $x^0, x^1, \dots, x^{2m-2}$. The subsequent step will be the computation of the field element $c(x)$ as the remainder of this product modulo $p(x)$.
Reduction Phase
Let $\ell(x) = (\ell_0, \ell_1, \dots, \ell_{2m-2})$ be the polynomial given by the ordinary product of $a(x)$ and $b(x)$. The current phase prescribes the computation of the field element $c(x) = \ell(x) \bmod p(x)$ as the remainder of the polynomial division of $\ell(x)$ by the generator polynomial $p(x)$. To speed up this computation it is possible to take advantage of the structure of the generator $p(x)$. Thanks to the regular form of this polynomial, it is easy to express the coefficients $c_i$ of the field element in terms of the coefficients $\ell_i$. Specifically, it can be shown that the field element $c(x) \equiv (c_0, c_1, \dots, c_{m-1})$ can be computed as
Of course, this step can be accomplished with a time complexity equal to $(\Theta)_m = 2T_X$, while the related space complexity is given by $(\Sigma_X)_m = 2m - 2$.
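One way to obtain a reduction with exactly these costs is sketched below. It is a reconstruction under stated assumptions rather than the formula of the text: since $(x + 1)\,p(x) = x^{m+1} + 1$, we have $x^{m+1} \equiv 1$ and $x^m \equiv 1 + x + \dots + x^{m-1}$ modulo $p(x)$, so each output bit is the matching low coefficient, plus the coefficient of $x^m$, plus (for $i \le m - 3$) the coefficient of $x^{m+1+i}$. This uses $2m - 2$ XOR operations arranged in two levels. The function names are hypothetical, and a plain long division is included only as a cross-check.

```python
def reduce_mod_aop(coeffs: list, m: int) -> list:
    """Reduce a product with 2m-1 coefficients modulo the AOP of degree m.

    coeffs[k] is the coefficient of x^k; the result has m coefficients.
    """
    top = coeffs[m]                    # x^m folds onto every output position
    c = []
    for i in range(m):
        bit = coeffs[i] ^ top
        if i <= m - 3:                 # x^(m+1+i) is congruent to x^i
            bit ^= coeffs[m + 1 + i]
        c.append(bit)
    return c


def reduce_by_long_division(coeffs: list, m: int) -> list:
    """Reference implementation: schoolbook division by 1 + x + ... + x^m."""
    work = list(coeffs)
    for k in range(len(work) - 1, m - 1, -1):
        if work[k]:
            for j in range(m + 1):     # subtract x^(k-m) * p(x)
                work[k - j] ^= 1
    return work[:m]


# Example with m = 4: reduce x^5, whose remainder should be 1.
example = [0, 0, 0, 0, 0, 1, 0]
print(reduce_mod_aop(example, 4), reduce_by_long_division(example, 4))
```

Both functions agree on every input of the right length, while only the first reflects the shallow two-level XOR structure.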
As a consequence, the characteristics of the overall multiplier, taking as input the two bit vectors $(a_0, a_1, \dots, a_{m-1})$ and $(b_0, b_1, \dots, b_{m-1})$ and producing at the output of the circuit the product element $(c_0, c_1, \dots, c_{m-1})$, will be given by:
It should be noted that the final space complexities are notably lower than those currently available in the literature for the same class ([3, 6, 7]). For a direct comparison see also Table 3, which shows how our scheme trades time complexity for a more substantial reduction in both the number of AND and XOR gates. Moreover, this gain grows as $m$ grows. For instance, if $m = 226$, our multiplier provides a reduction factor of 2.7 in the overall gate count with respect to the best method ([3]), paying a corresponding time expansion factor of 2. On the other hand, in the case of $m = 2026$, the area reduction factor becomes 7.7, while the corresponding time expansion factor rises only to 2.36.
As an example, Figure 1 reports the scheme of the overall multiplier when the generating polynomial is $p(x) = 1 + x + x^2 + \dots + x^{10}$. In this case the two inputs $a(x)$ and $b(x)$ have been rewritten as
Therefore, the field element $c(x) = a(x) \cdot b(x) \in GF(2^{10})$ can be computed by means of three multiplication circuits for polynomials of degree 4, to obtain $(A + B) \cdot (C + D)$, $A \cdot C$ and $B \cdot D$, plus some XOR gates needed to recombine the partial results (block Recombination) and to perform the reduction phase (block Reduction Phase). To make the circuit design fully modular (which could be an advantage, especially if $m \gg 10$), we do not directly deal with these polynomials of degree 4. Instead, we extend these polynomials by a single bit, in order to obtain polynomials of degree 5. This allows us to iterate the algorithm further and to directly employ modules architecturally equivalent to the previous ones. In fact, each of these three products can be computed, in turn, by means of three further multiplication circuits for polynomials of degree 2, for the parallel computation of $(A' + B') \cdot (C' + D')$, $A' \cdot C'$ and $B' \cdot D'$, plus the XOR gates needed for the recombination. The resulting 9 polynomial multiplications, in contrast, are not iterated further, because of the lower time and space complexities provided by a direct multiplication.
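The recursion of this example can be traced in software. The sketch below is a minimal illustration with assumptions of its own: polynomials are packed into Python integers (bit $i$ is the coefficient of $x^i$), an odd number of coefficients is padded with one zero coefficient before splitting, recursion stops at four or fewer coefficients, and the names `schoolbook_mul` and `recursive_mul` are hypothetical. The `leaves` list records the operand size of every direct multiplication that is performed.

```python
def schoolbook_mul(a: int, b: int) -> int:
    """Direct (carry-less) product of two GF(2) polynomials packed into ints."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def recursive_mul(a: int, b: int, d: int, leaves: list) -> int:
    """Multiply two GF(2) polynomials with at most d coefficients each."""
    if d <= 4:
        leaves.append(d)          # a direct multiplier is used at this size
        return schoolbook_mul(a, b)
    d += d % 2                    # pad to an even number of coefficients
    n = d // 2
    mask = (1 << n) - 1
    A, B = a & mask, a >> n       # assumed split: a = A + x^n * B
    C, D = b & mask, b >> n       # assumed split: b = C + x^n * D
    ac = recursive_mul(A, C, n, leaves)
    bd = recursive_mul(B, D, n, leaves)
    cross = recursive_mul(A ^ B, C ^ D, n, leaves)
    return ac ^ ((cross ^ ac ^ bd) << n) ^ (bd << (2 * n))


leaves = []
a, b = 0b1011001110, 0b1110010101      # two arbitrary 10-coefficient inputs
product = recursive_mul(a, b, 10, leaves)
print(product == schoolbook_mul(a, b), leaves)
```

For ten coefficients the trace shows two splitting levels ending in direct multiplications of three-coefficient (degree 2) operands, in line with the description of the example.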
Type I Optimal Normal Basis
The previous scheme can also be adopted in the case of Type I ONB, following the smart strategy proposed in [3]. Specifically, let $p(x)$ be the N-polynomial generating a Type I ONB, and let $a(\gamma)$ and $b(\gamma)$ be two elements of $GF(2^m)$, represented through the $m$-bit vectors $(a_0, a_1, \dots, a_{m-1})$ and $(b_0, b_1, \dots, b_{m-1})$ with respect to the root $\gamma$ of $p(x)$. Given that $p(x)$ is also an AOP, the root $\gamma$ satisfies the property $\gamma^{m+1} = 1$; in fact, $\gamma^{m+1} + 1 = (\gamma + 1)\,p(\gamma) = 0$. As a consequence, the set

$$\{\gamma, \gamma^2, \dots, \gamma^m\} \qquad (14)$$

can also be used as a basis for $GF(2^m)$. More precisely, (14) is nothing but a shifted version of the Canonical Basis; therefore the elements of $GF(2^m)$ represented in Type I ONB can be quickly converted to Canonical Basis, and vice versa, by means of a simple permutation of the components. In fact, thanks to the relation $\gamma^{m+1} = 1$, we can write the conversion
by means of the permutation P defined as
Therefore, the elements to be multiplied in Type I ONB are simply converted to Canonical Basis, through the permutation $P$, before entering the multiplier. The output of the circuit, computed according to the complexities given in (12) and still represented in Canonical Basis, is restored to Normal Basis by the inverse permutation $P^{-1}$. It should be noted that these two additional permutations do not increase the overall time and space complexity of the multiplier. In fact, $P$ and its inverse $P^{-1}$ can be directly implemented by wiring the fan-in and fan-out of the circuit, without modifying any complexity. Therefore, our scheme maintains the previously discussed gate count reduction also in the case of Type I ONB. This reduction is significant, especially if compared with the one provided by the other fast parallel schemes currently available in the literature ([3, 4, 6]), as reported in Table 4. Finally, also in this case the gain factor becomes more substantial as $m$ grows, as previously seen for the Canonical Basis.
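The wiring permutation can be illustrated with a short sketch. It rests on an assumption: because $\gamma^{m+1} = 1$, the normal basis element $\gamma^{2^i}$ equals $\gamma^{2^i \bmod (m+1)}$, so coordinate $i$ of the ONB vector is moved to the slot of the power $2^i \bmod (m+1)$ in the shifted basis $\{\gamma, \gamma^2, \dots, \gamma^m\}$. The exact definition of $P$ used in the text may be arranged differently, and the function names are hypothetical.

```python
def onb_exponents(m: int) -> list:
    """Power of gamma carried by ONB coordinate i, namely 2^i mod (m + 1)."""
    return [pow(2, i, m + 1) for i in range(m)]


def onb_to_shifted(bits: list) -> list:
    """Rearrange ONB coordinates so that slot j holds the coefficient of gamma^(j+1)."""
    m = len(bits)
    shifted = [0] * m
    for i, e in enumerate(onb_exponents(m)):
        shifted[e - 1] = bits[i]
    return shifted


def shifted_to_onb(bits: list) -> list:
    """Inverse rearrangement, back to ONB coordinate order."""
    m = len(bits)
    return [bits[e - 1] for e in onb_exponents(m)]


# For m = 10 the exponents form a rearrangement of 1..10,
# so the conversion is pure rewiring with no gates involved.
print(onb_exponents(10))
```

Because 2 is primitive modulo $m + 1$, every power from 1 to $m$ appears exactly once, which is what makes the mapping a permutation.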
Conclusions
In this paper we have proposed a new low space complexity scheme for fast parallel multiplication of field elements represented through both Canonical and Type I Optimal Normal Bases. Specifically, the discussed strategy shows how to avoid quadratic space complexity while paying only a limited increase in the corresponding time delay. As reported in Tables 3 and 4, the proposed scheme offers a circuit complexity significantly lower than that of the other fast parallel schemes present in the open literature ([3, 4, 5, 6, 7]). This characteristic makes the multiplier particularly suitable for applications characterized by specific space constraints, such as those based on smart cards, token hardware, mobile phones or other portable devices.
