In this article, we present a new sequential multiplier for extended binary finite fields. Like its existing counterparts, the proposed multiplier has a linear complexity in flip-flop or temporary storage requirements, but a sub-linear complexity in gate counts. For the underlying polynomial multiplication, the proposed field multiplier relies on the Horner scheme.
For a small field such as the one used in the well-known symmetric-key system Advanced Encryption Standard (AES) [1], multiplication can be easily realized with a simple look-up table (LUT), where the product of each combination of inputs is pre-stored, preferably in a read-only memory (ROM). This table-based method is, however, impractical for the large fields used in modern asymmetric-key cryptosystems like elliptic curve cryptography (ECC) [6, 9]. Asymptotically, a look-up table for multiplication can take as much as $O(n 2^{2n})$ bits of ROM, and the time delay would be essentially equal to the table access time, which is normally a function of the table size. Techniques exist to reduce the table size at the expense of an increased number of table accesses along with logic circuits.
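To make the table-size estimate concrete, the following minimal sketch counts the ROM bits of a full multiplication table. The function name `lut_size_bits` is ours and purely illustrative; it assumes one n-bit entry per ordered pair of operands, with no symmetry or other table-reduction technique applied.

```python
def lut_size_bits(n: int) -> int:
    """Bits of ROM for a full multiplication table over a field with 2**n elements.

    Assumption: one n-bit product is stored for every ordered pair of operands.
    """
    entries = (2 ** n) * (2 ** n)  # one entry per (a, b) pair
    return entries * n             # each stored product is an n-bit element


if __name__ == "__main__":
    # The growth n * 2**(2n) quickly rules out tables for ECC-sized fields.
    for n in (4, 8, 16, 32):
        print(n, lut_size_bits(n))
```

For the AES field (n = 8) this gives 8 × 2^16 bits, i.e., 64 KiB, whereas for n of a few hundred the table is far beyond any practical memory.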
For high-speed multiplication over a large field, an alternative approach is to use only logic circuits or gates in some parallel way and to perform a multiplication on-the-fly, e.g., a fully bit-parallel multiplier. For ECC, where the value of n is only a few hundred, a practical bit-parallel multiplier would take $O(n^{1+\epsilon})$ logic gates, where $0 < \epsilon \le 1$, and have a time delay of $O(\log n)$ layers of gates.
In constrained environments, where silicon space is of prime concern, a fully bit-parallel multiplier for a large field is generally too big to fit the design. At the other end of the spectrum of choices, it is possible, at least in theory, to realize a multiplier with as few as two gates (one XOR and one AND), where the gates are used repeatedly and many intermediate bit-level results need to be stored. We will refer to this type of multiplier as super-sequential, since it would take $O(n^{1+\epsilon})$ iterations or clock cycles. The storage requirement for a super-sequential multiplier would be at least $O(n^{1+\epsilon})$. Because of their large timing and memory requirements, super-sequential multipliers are likely to be the least attractive option for both constrained and high-speed applications, and this is perhaps why no practical design of a super-sequential multiplier has been reported in the literature.
Between the above two extremes, namely fully parallel and super-sequential multipliers, one can find various sequential multipliers, operating either at the bit level or at the digit level. A sequential multiplier operates in an iterative way over a number of clock cycles, typically O(n) cycles for a bit-sequential or O(n/d) cycles for a digit-sequential architecture, where d is the digit size. A bit-sequential multiplier has a shorter critical path than its digit-sequential counterpart. For a bit- or digit-sequential multiplier, since intermediate results are stored in registers or read-write memories, the storage requirement is at least O(n) bits. Besides registers, a sequential multiplier also requires logic gates, generally in the amount of O(n), i.e., linear in n.
In our work, we differentiate ROM from registers in the usual way. In other words, while the content of a ROM is fixed and specific to a certain arithmetic operation, the content of a register can change. More importantly, registers are time-shared, i.e., a set of registers can be used by multiple operations, e.g., additions, multiplication and inversion, provided that these operations are not executed simultaneously. On the other hand, like ROM but unlike registers, most of the logic gates are assumed to be specific to a certain operation and not shared. As a result, even if an arithmetic unit, such as a multiplier, can be designed to take advantage of time shared registers, there is still a need to reduce the amount of logic gates, especially for area constrained applications.
In this article, we propose a new sequential multiplier. Like its existing counterparts, the proposed one requires O(n) bits of registers, but only $O(n^{1-\epsilon})$, $0 < \epsilon \le 1$, logic gates. To the best of our knowledge, no sequential multiplier with a gate count sub-linear in n has been reported earlier. To give an idea of where the proposed multiplier fits with respect to various others, in Table 1 we list their space and time complexities in an asymptotic or general way.
The remainder of this article is organized as follows. In Sect. 2, we briefly review schemes for a fully bit parallel and a digit-sequential multiplier. In Sect. 3, we present our sequential multiplier which is based on the Horner method and has a sub-linear gate complexity. Finally, a complexity comparison and concluding remarks are given in Sect. 4.
Review of extended binary field multipliers
The field $\mathbb{F}_{2^n}$ can be viewed as the set of binary polynomials in t modulo an irreducible polynomial P(t) of degree n. Let A(t) and B(t) be two elements of $\mathbb{F}_{2^n}$. The multiplication of these two elements is $C(t) = A(t) \times B(t) \bmod P(t)$, for which we can use the following two steps: first compute the polynomial product

$$C'(t) = A(t) \times B(t) \qquad (1)$$

and then reduce $C'(t)$ modulo P(t) to get

$$C(t) = C'(t) \bmod P(t). \qquad (2)$$
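The two steps can be sketched in software as follows. This is only an illustrative model, not a hardware description: a binary polynomial is stored in a Python integer (bit i holds the coefficient of $t^i$), and the helper names `clmul`, `poly_mod` and `field_mul` are ours. The usage example assumes the AES field polynomial $t^8 + t^4 + t^3 + t + 1$ (0x11B).

```python
def clmul(a: int, b: int) -> int:
    """Step (1): carry-less product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:          # coefficient of the current power of t in b is 1
            result ^= a    # addition in GF(2)[t] is XOR
        a <<= 1            # multiply a by t
        b >>= 1
    return result


def poly_mod(c: int, p: int) -> int:
    """Step (2): remainder of c modulo a non-zero polynomial p in GF(2)[t]."""
    deg_p = p.bit_length() - 1
    while c.bit_length() - 1 >= deg_p:
        # Cancel the leading term of c with a shifted copy of p.
        c ^= p << (c.bit_length() - 1 - deg_p)
    return c


def field_mul(a: int, b: int, p: int) -> int:
    """C = A * B mod P, computed as multiplication followed by reduction."""
    return poly_mod(clmul(a, b), p)


if __name__ == "__main__":
    # Example in the AES field: {57} * {83} modulo 0x11B.
    print(hex(field_mul(0x57, 0x83, 0x11B)))
```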
Bit parallel architectures
In this type of architecture, all the bits of the product C = AB mod P are generated in parallel. The design of the circuit is based on purely combinational logic and does not make any use of storage or flip-flops. Computations are done with no reuse of circuits. The approach we recall here performs the polynomial multiplication (1) and the reduction (2) separately. The reduction modulo P is generally quite simple when P is sparse (i.e., P is a pentanomial or a trinomial), and in this situation the reduction operation can be implemented with O(n) XOR gates and a delay of O(1). Consequently, we will focus here on the polynomial multiplication, which is the most costly and the most complicated part of a finite field multiplier.
• Quadratic multiplier. Let $C = \sum_{i=0}^{2n-2} c_i t^i$ be the product $C = A \times B$, where A and B are polynomials of degree n − 1 in t. The coefficients $c_i$ are given in terms of $a_i$ and $b_i$ as follows

$$c_i = \begin{cases} \sum_{j=0}^{i} a_j b_{i-j}, & 0 \le i \le n-1, \\ \sum_{j=i-n+1}^{n-1} a_j b_{i-j}, & n \le i \le 2n-2. \end{cases}$$

The computation of each $c_i$ is performed in parallel with AND gates (for the products $a_i b_j$) and a binary tree of XOR gates. The resulting gate complexity of this polynomial multiplier is equal to $n^2$ AND gates and n(n − 1) XOR gates. The critical path of the multiplier is $\log_2(n) D_X + D_A$, where $D_X$ and $D_A$ are the delays of a two-input XOR gate and a two-input AND gate, respectively.
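• Subquadratic multiplier [2]. A second strategy, which was first proposed in [4], is based on the method of Karatsuba. Specifically, writing $A = A_L + t^{n/2} A_H$ and $B = B_L + t^{n/2} B_H$, the multiplication is performed by applying the following formula recursively

$$A \times B = A_L B_L + t^{n/2}\bigl[(A_L + A_H)(B_L + B_H) + A_L B_L + A_H B_H\bigr] + t^{n} A_H B_H.$$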
As stated in [4], this approach requires $6n^{\log_2(3)} - 8n + 2$ XOR gates and $n^{\log_2(3)}$ AND gates, and has a delay of $3\log_2(n) D_X + D_A$. Some optimizations have since been proposed. Leone [7] noticed that the space complexity of the Karatsuba multiplier can be slightly reduced by stopping the recursion at polynomials of degree 4 or 8 and then applying the quadratic method. More recently, Fan et al. [3] proposed an odd/even splitting which reduces the delay to $2\log_2(n) D_X + D_A$.
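The recursion can be illustrated with the following software sketch. It is not the circuit of [4]: it only mirrors the three-product Karatsuba step over GF(2), where subtraction coincides with XOR, and it falls back to the quadratic method below a size `threshold`, in the spirit of the remark of Leone [7]. The function names, the integer encoding (bit i is the coefficient of $t^i$) and the default threshold of 4 are assumptions made for illustration; n is assumed to be a power of two.

```python
def clmul(a: int, b: int) -> int:
    """Quadratic (schoolbook) carry-less product, used as the base case."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba(a: int, b: int, n: int, threshold: int = 4) -> int:
    """Product of two binary polynomials with at most n coefficients each.

    Assumption: n is a power of two, so every split is into equal halves.
    """
    if n <= threshold:
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, a >> h
    b_lo, b_hi = b & mask, b >> h
    lo = karatsuba(a_lo, b_lo, h, threshold)                  # A_L * B_L
    hi = karatsuba(a_hi, b_hi, h, threshold)                  # A_H * B_H
    mid = karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, h, threshold)   # (A_L+A_H)(B_L+B_H)
    # Recombine: lo + t^h * (mid + lo + hi) + t^(2h) * hi, with + being XOR.
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << (2 * h))


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(karatsuba(a, b, 32) == clmul(a, b))
```

Each level replaces four half-size products by three, which is the source of the $n^{\log_2(3)}$ AND-gate count quoted above.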
In Table 2, we recall the complexity of the schoolbook method and then give the complexity of multipliers based on the Karatsuba method combined with the optimization approaches of Leone [7] and Fan et al. [3].
Digit sequential multiplier
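In sequential multipliers, computations are done with reuse of circuits and the final result is obtained over several clock cycles. Here, we review the digit sequential version of the multiplier of Song and Parhi [12]. Song and Parhi fix a digit size d, define $m = \lceil n/d \rceil$, and rewrite the two polynomials A and B as follows

$$A = \sum_{i=0}^{m-1} A_i t^{id}, \qquad B = \sum_{i=0}^{m-1} B_i t^{id},$$

where each digit $A_i$, $B_i$ is a polynomial of degree less than d.

The left-to-right (L-to-R) method for digit-serial multiplication is based on the following expansion of the product

$$A \times B = \bigl(\cdots\bigl((A B_{m-1})\, t^d + A B_{m-2}\bigr)\, t^d + \cdots\bigr)\, t^d + A B_0.$$

The previous expression results in the following L-to-R multiplication algorithm. It consists of a sequence of multiplications by $t^d$ followed by the accumulation of $A B_i$ in C modulo P.

Algorithm 1 Left-to-right digit sequential multiplication [12]
Require: $A$, $B = \sum_{i=0}^{m-1} B_i t^{id}$ and $P$
Ensure: $C = A \times B \bmod P$
$C \leftarrow 0$
for $i = m-1$ down to $0$ do
    $C \leftarrow C \times t^d + A \times B_i \bmod P$
end for
return $C$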
In Algorithm 1, the multiplication by $t^d$ is performed by shifting the coefficients of C. The multiplication $B_i \times A$ is done using m parallel polynomial multipliers of size d with quadratic space complexity. These digit multipliers compute the products $B_i A_j$ for $j = 0, \ldots, m-1$. Each product $B_i A_j$ has degree 2d − 2; thus the upper part of this product is added to $C_{j+1}$ and the lower part to $C_j$. The updated value of C then has m + 1 digits, since $C_m$ stores the upper part of the product $B_i A_{m-1}$. If we assume that $P = t^{md} + \sum_{i=0}^{m-1} P_i(t) t^{id}$, then we reduce $C_m$ modulo P using the following expression

$$C_m t^{md} \equiv \sum_{i=0}^{m-1} C_m P_i(t)\, t^{id} \pmod{P}.$$

Fig. 1 Left-to-right digit sequential multiplier of [12]

In practical applications, P can be taken as a pentanomial or a trinomial, so the $P_i$'s are in general either equal to zero or sparse. Here, for the sake of simplicity, we will further assume that the only non-zero $P_i$ is $P_0$. This means that the reduction step consists of $C_m t^{md} \equiv P_0 C_m \pmod{P}$ and can be performed with at most 3(d − 1) XOR gates (indeed, if P is a pentanomial, then $P_0$ has four non-zero coefficients).
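The left-to-right scheme described above can be modelled in software as follows. This is a behavioural sketch only: the m parallel digit multipliers of the architecture are collapsed into a single call to a carry-less product of A by the digit $B_i$, and the reduction uses a generic polynomial remainder rather than the sparse-P shortcut. The helper names and the integer encoding (bit i is the coefficient of $t^i$) are our own assumptions.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(c: int, p: int) -> int:
    """Remainder of c modulo a non-zero polynomial p in GF(2)[t]."""
    deg_p = p.bit_length() - 1
    while c.bit_length() - 1 >= deg_p:
        c ^= p << (c.bit_length() - 1 - deg_p)
    return c


def digit_serial_mul(a: int, b: int, p: int, n: int, d: int) -> int:
    """Left-to-right digit-serial product A * B mod P with digit size d."""
    m = -(-n // d)                 # m = ceil(n / d) digits
    mask = (1 << d) - 1
    c = 0
    for i in reversed(range(m)):   # most significant digit of B first
        b_i = (b >> (i * d)) & mask
        # One iteration: shift C by d positions, accumulate A * B_i, reduce.
        c = poly_mod((c << d) ^ clmul(a, b_i), p)
    return c


if __name__ == "__main__":
    # AES field polynomial 0x11B, processed two bits of B per iteration.
    print(hex(digit_serial_mul(0x57, 0x83, 0x11B, n=8, d=2)))
```

The loop body runs m times, matching the O(n/d) cycle count of a digit-sequential multiplier.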
The resulting digit sequential multiplier is depicted in Fig. 1 . The Digit Mult. boxes represent the quadratic parallel polynomial multipliers for degree d polynomials.
The complexity of this digit sequential multiplier is given below. Note that $S_\oplus$ represents the number of XOR gates, $S_\otimes$ represents the number of AND gates and D represents the delay of the critical path of the architecture. We remark that the total computational delay is the number of clock cycles times the clock period, whose duration is at least the critical path delay.
Remark 1 We have restricted our study to a quite specific P but in general when P has degree n and is a pentanomial, a similar approach can be applied with no additional significant complexity. Our choice was motivated only by the simplicity of this special case in the digit sequential design.
Sub-linear gate complexity using Horner's method
In this section, we present a multiplier with sub-linear gate complexity based on the Horner method. Our approach takes advantage of a subquadratic multiplier as well as a digit sequential multiplier. As before, let A(t) and B(t) be two polynomials of degree < n. Let m and d be two integers such that m × d = n. We rewrite A and B in a digit form as follows:

$$A = \sum_{i=0}^{m-1} A_i t^{id}, \qquad B = \sum_{i=0}^{m-1} B_i t^{id}.$$
We multiply A and B using Horner's method. Here, we assume that the irreducible polynomial P has the form $P = t^{dm} + P_0$, where $P_0$ has a degree less than d and has at most four non-zero coefficients. This assumption is not restrictive; a similar multiplier with the same order of area and delay complexity can be designed with a more general pentanomial P. The resulting method is described in Algorithm 2.

The architecture based on Algorithm 2 is shown in Fig. 2. In this architecture, elements A and B are each stored in a register of m cells, each cell containing d bits. The product C is also stored in a register of m cells and is initialized with m zeros. The registers containing A and C are shifted cyclically once in every clock cycle. On the other hand, B is shifted once after every m clock cycles. At each clock cycle, one of the following three cases is performed; in Fig. 2 these cases are indicated in red, blue and green.
• Case 1 (Red) A digit $A_j$ for $0 \le j < m-1$ is output from the A register and the digit $B_i$ is output from B. The A and C registers are left shifted but no shift is operated in …

Algorithm 2 Horner's method with reduction modulo P

• Case 3 (Green) This is the last step before the output of the product C. At this step, the C register is left shifted. The digits $A_{m-1}$ and $B_0$ are output from the A and B registers and then multiplied by the digit multiplier. Then, the lower part of the product is added to $C_{m-1}$ and the upper part is reduced modulo P, i.e., it is multiplied by $P_0$ and added to $C_0$ and $C_1$.
Fig. 3 First four steps, panels (a) to (d), of the multiplication of $A = (t + 1) + (t + 1) \times t^{24}$ and $B = t + t^7 \times t^{24}$ modulo $P = t^{4 \times 8} + t^7 + t^6 + t + 1$ with m = 4 and d = 8 (note that after the demultiplexer, the followed paths are indicated with bold lines)
To avoid any confusion between Case 2 and Case 3, in Fig. 2 we have used two Mult. by P 0 boxes which perform a multiplication by P 0 in the two distinct paths.
The main differences between Algorithm 1 and Algorithm 2 and their respective architectures are the following:
1. In Algorithm 2 and in the corresponding architecture in Fig. 2, the digit multiplications are done in sequence and not through m parallel digit multipliers.
2. In the proposed multiplier, the digit multiplication uses a subquadratic multiplier, and the digit size d in Fig. 2 is generally larger than the one in Fig. 1.
In Fig. 3 , we illustrate the functioning of the proposed multiplier by presenting the first four cycles of the sequential multiplier for n = 32.
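The overall data flow of the Horner-based scheme can be mimicked in software as sketched below. This is a behavioural model under our own assumptions, not a transcription of Algorithm 2 or of Fig. 2: it does not model the cyclic register shifts or the three colour-coded cases, it uses a generic polynomial remainder instead of the dedicated multiplication by $P_0$, and the names `horner_sequential_mul`, `clmul` and `poly_mod` are illustrative. What it does preserve is the key point that a single d-bit digit multiplier is used once per modelled clock cycle.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product; stands in for the single digit multiplier."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(c: int, p: int) -> int:
    """Remainder of c modulo a non-zero polynomial p in GF(2)[t]."""
    deg_p = p.bit_length() - 1
    while c.bit_length() - 1 >= deg_p:
        c ^= p << (c.bit_length() - 1 - deg_p)
    return c


def horner_sequential_mul(a: int, b: int, p: int, m: int, d: int):
    """Return (A * B mod P, cycles) using one digit product per cycle."""
    mask = (1 << d) - 1
    c = 0
    cycles = 0
    for i in reversed(range(m)):       # digits of B, most significant first
        b_i = (b >> (i * d)) & mask
        c = poly_mod(c << d, p)        # Horner step: C <- C * t^d mod P
        for j in range(m):             # digits of A, one per cycle
            a_j = (a >> (j * d)) & mask
            digit_product = clmul(a_j, b_i)            # degree <= 2d - 2
            c ^= poly_mod(digit_product << (j * d), p)
            cycles += 1
    return c, cycles


if __name__ == "__main__":
    # Operands and modulus taken from the caption of Fig. 3 (m = 4, d = 8).
    a = 0b11 | (0b11 << 24)            # (t + 1) + (t + 1) * t^24
    b = 0b10 | (1 << 31)               # t + t^7 * t^24
    p = (1 << 32) | (1 << 7) | (1 << 6) | 0b11
    product, cycles = horner_sequential_mul(a, b, p, m=4, d=8)
    print(hex(product), cycles)
```

In this model the cycle count is m × m, in contrast to the m iterations of the digit-serial scheme with m parallel digit multipliers.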
Complexity evaluation

Below we list, along with a brief explanation, the number of flip-flops, AND gates and XOR gates and the critical path of the architecture in terms of m and d. For the sake of simplicity, we assume that each path in Fig. 2 is of width d.
• 3m cells of d bits each for A, B and C. Thus, the architecture in Fig. 2 … [12]
• The architecture in Fig. 2 …

Using the subquadratic complexity results based on [3] and [7] as given in Table 2, we obtain the following complexity for the multiplier in …
Complexity comparison and conclusion
We finally obtain a sub-linear gate complexity by fixing the values of d and m as $d = \sqrt{n}$ and $m = \sqrt{n}$. The resulting complexity is given in Table 3. In the same table, we recall the complexity of the digit sequential multiplier of Song and Parhi [12] reviewed earlier in Sect. 2 (for which we assume that m = n/d and d is small compared to n).
The results in Table 3 suggest that the proposed sequential multiplier has a gate count which is much smaller than that of the multiplier of [12]. On the other hand, the proposed multiplier has a longer critical path and needs more clock cycles, resulting in a much larger overall delay. Thus, the proposed sub-linear gate complexity multiplier offers a new area-time trade-off. For example, if we consider n = 256 (i.e., d = m = 16), the multiplier of [12] would require 8,596 XORs, 8,192 ANDs and 16 clock cycles with 9 levels of gates on the critical path, whereas the corresponding values for the proposed multiplier are 397 XORs, 143 ANDs, 256 clock cycles and 11 levels of gates.
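The cycle counts in this example, and the sub-linear growth of the gate count, can be checked with a short calculation. The sketch below rests on two assumptions of ours: the digit-serial multiplier of [12] takes m = n/d cycles while the proposed one takes m × m cycles (one digit product per cycle), and the gate count of the digit multiplier is dominated by a Karatsuba-type term $d^{\log_2(3)}$, so that $d = \sqrt{n}$ gives a growth of $n^{\log_2(3)/2}$. It does not reproduce the exact XOR and AND counts of Table 3, which depend on the optimizations of [3] and [7].

```python
import math


def cycle_counts(n: int, d: int):
    """Return (cycles of the digit-serial scheme, cycles of the Horner scheme).

    Assumption: d divides n, so that m = n / d is an integer.
    """
    if n % d != 0:
        raise ValueError("d must divide n")
    m = n // d
    return m, m * m


def karatsuba_gate_exponent() -> float:
    """Exponent e such that d**log2(3) with d = sqrt(n) equals n**e."""
    return math.log2(3) / 2


if __name__ == "__main__":
    print(cycle_counts(256, 16))
    print(karatsuba_gate_exponent())
```

Since $\log_2(3)/2$ is roughly 0.79, the dominant gate term grows more slowly than n, which is the sense in which the gate complexity is sub-linear.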
