Abstract. We study different possibilities of implementing the Karatsuba multiplier for polynomials over F2 on FPGAs. This is a core task for implementing finite fields of characteristic 2. Algorithmic and platform-dependent optimizations yield efficient hardware designs. The resulting structure is hybrid in two different aspects. On the one hand, a combination of the classical and the Karatsuba methods decreases the number of bit operations. On the other hand, a mixture of sequential and combinational circuit design techniques includes pipelining and can be adapted flexibly to time-area constraints. The approach, both theory and implementation, can be viewed as a further step towards taming the machinery of fast algorithmics for hardware applications.
Introduction
Arithmetic in finite fields is a central algorithmic task in cryptography. There are two types of groups associated to such fields: their multiplicative group of invertible elements, and elliptic (or hyperelliptic) curves. These can then be used in group-based cryptography, relying on the difficulty of computing discrete logarithms. Here we focus on fields of characteristic 2. The most fundamental task in arithmetic is multiplication. In our case, this amounts to multiplication of polynomials over F 2 , followed by a reduction modulo the fixed polynomial defining the field extension. This reduction can itself be performed by using multiplication routines or by a small hardware circuit when the polynomial is sparse. A trinomial can be used in many cases, and it is conjectured that otherwise a pentanomial can be found (see [6] ). As to the other arithmetic operations, addition is bitwise XORing of vectors, squaring a special case of multiplication (much simplified by using a normal basis), and inversion more expensive and usually kept to a minimum.
Classical methods to multiply two n-bit polynomials require $O(n^2)$ bit operations. The Karatsuba algorithm reduces this to $O(n^{\log_2 3})$, and fast Fourier transformations to $O(n \log n \log\log n)$. The Cantor multiplier with a cost of $O(n (\log n)^2 (\log\log n)^3)$ is designed for fields of characteristic 2, but we do not study it here (see [3] and [4]). Traditional lore held that asymptotically fast methods are not suitable for hardware. We disprove this view in the present paper, continuing our work in [7].
Our methods are asymptotically good and thus efficient for large degrees. Sophisticated implementation strategies decrease the crossover points between different algorithms and make them efficient for practical applications. Much care is required for software implementations (see [5] , chapter 8, and Shoup's NTL software). The Karatsuba method has the lowest crossover point with the classical algorithm.
In hardware, the methods used are either platform independent or platform dependent. The first group consists of algorithmic optimizations which reduce the total number of operations, whereas the second approach uses specific properties of implementation environments to achieve higher performance.
The Karatsuba algorithm for multiplication of large integers was introduced in [10]. This algorithm is based on a formula for multiplying two linear polynomials which uses only 3 multiplications and 4 additions, as compared to 4 multiplications and 1 addition in the classical formula. The extra number of additions disappears asymptotically. This method can be applied recursively to $2^m$-bit polynomials, where m is an integer. Here we optimize and adapt the Karatsuba algorithm for hardware realization of cryptographic algorithms.
FPGAs provide useful implementation platforms for cryptographic algorithms both for prototyping where early error finding is possible, and as systems on chips where system parameters can easily be changed to satisfy evolving security requirements.
Efficient software implementations of Karatsuba multipliers using general purpose processors have been discussed thoroughly in the literature (see [12], [1], [11], [8], chapter 2, and [5], chapter 8), but hardware implementations have attracted less attention. The only works known to us are [9], [14], and our previous paper [7]. Both [9] and [14] suggest using algorithms with $O(n^2)$ operations to multiply polynomials which contain a prime number of bits. Their proposed number of bit operations is smaller than that of the classical method by a constant factor, but asymptotically larger than that of the Karatsuba method. [7] contains a hybrid implementation of the Karatsuba method which reduces the latency by pipelining and by mixing sequential and combinational circuits.
The present work is to our knowledge the first one which tries to decrease the resource usage of polynomial multipliers using both known algorithmic and platform-dependent methods. We present the best choice of hybrid multiplication algorithms for polynomials with at most 128 bits, as long as the choice is restricted to three (recursive) methods, namely classical, Karatsuba, and a variant of Karatsuba for quadratic polynomials. The "best" refers to minimizing the area measure. This is an algorithmic and machine-independent optimization. In an earlier implementation ([7]) we had designed a 240-bit multiplier on a XC2V6000-4FF1517-4 FPGA. We re-use this structure to illustrate a second type of optimization, which is machine-dependent. Our goal is a 240-bit multiplier with small area-time cost. This measure can be thought of as the time on a single-bit processor. We now put a single 30-bit multiplier on our FPGA and use three Karatsuba steps to get from $240 = 2^3 \cdot 30$ to 30 bits. This requires judicious application of multiplexer and adder circuitry, but the major computational cost still resides in the multiplier. One 240-bit product requires $27 = 3^3$ small multiplications, and their inputs are fed into the single small multiplier in a pipelined fashion. This has the pleasant effect of keeping the total delay small and the area reduced, with correspondingly small propagation delays. Using this 240-bit multiplier we cover in particular the 233-bit polynomials proposed by NIST for elliptic curve cryptography in [13].
One reviewer wrote: The idea of using such a generalization of Karatsuba's method is not new, but it is usually dismissed for operands of relatively small sizes because of lower performance in software implementations. The fact that some area on an FPGA is saved is an interesting and new remark: the kind of remark usually "obvious" after one has seen it, but that only few seem able to see in the first place.
The structure of this paper is as follows. First the Karatsuba method and its cost are studied in Section 2. Section 3 is devoted to optimized hybrid Karatsuba implementations. Section 4 shows how a hybrid structure and pipelining improve resource usage in our circuit from [7]. Section 5 analyzes the effect of the number of recursion levels on the performance, and Section 6 concludes the paper.
The Karatsuba algorithm
The three coefficients of the product $(a_1 x + a_0)(b_1 x + b_0) = a_1 b_1 x^2 + (a_1 b_0 + a_0 b_1) x + a_0 b_0$ are classically computed with 4 multiplications and 1 addition from the four input coefficients $a_1$, $a_0$, $b_1$, and $b_0$. The following formula uses only 3 multiplications and 4 additions:

$$(a_1 x + a_0)(b_1 x + b_0) = a_1 b_1 x^2 + \bigl((a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0\bigr) x + a_0 b_0. \qquad (1)$$

We call this the 2-segment Karatsuba method or $K_2$. Setting $m = \lceil n/2 \rceil$, two n-bit polynomials (thus of degrees less than n) can be rewritten and multiplied using the formula:

$$(f_1 x^m + f_0)(g_1 x^m + g_0) = h_2 x^{2m} + h_1 x^m + h_0,$$

where $f_0$, $f_1$, $g_0$, and $g_1$ are m-bit polynomials. The polynomials $h_0$, $h_1$, and $h_2$ are computed by applying the Karatsuba algorithm to the polynomials $f_0$, $f_1$, $g_0$, and $g_1$ as single coefficients and adding coefficients of common powers of x together. This method can be applied recursively. The circuit to perform a single stage is shown in Figure 1. The "Overlap circuit" adds common powers of x in the three generated products. For example if n = 8, then the input polynomials have degree at most 7, each of the polynomials $f_0$, $f_1$, $g_0$, and $g_1$ is 4 bits long and thus of degree at most 3, and their products will be of degree at most 6. The effect of the overlap module in this case is represented in Figure 2, where coefficients to be added together are shown in the same columns. Figures 1 and 2 show that we need three recursive multiplication calls and some additions: $2m$ for input adders, $2(2m - 1)$ for output adders, and $2(m - 1)$ for the overlap module, where $m = \lceil n/2 \rceil$. If $M^{(2)}_n$ is the total number of bit operations to multiply two n-bit polynomials, then

$$M^{(2)}_n \le 3 M^{(2)}_m + 8m - 4. \qquad (2)$$

When n is a power of 2, with the initial value $M^{(2)}_1 = 1$ we get:

$$M^{(2)}_n = 7 n^{\log_2 3} - 8n + 2. \qquad (3)$$

The gain in Karatsuba's method is visually illustrated in Figure 8.2 of [5]. The delay of the circuit for n ≥ 2 is at most

$$4 \lceil \log_2 n \rceil \qquad (4)$$

times the delay of a single gate. On the other hand, a classical multiplier for n-bit polynomials requires

$$2n^2 - 2n + 1 \qquad (5)$$

gates and has a propagation delay of

$$1 + \lceil \log_2 n \rceil. \qquad (6)$$
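The recursive 2-segment method described above can be sketched in software. The following is only an illustrative model, not the hardware design: polynomials over F2 are stored as Python integers (bit i holds the coefficient of x^i), addition is XOR, and the function names `clmul` and `k2_mul` are assumptions introduced here.

```python
def clmul(a: int, b: int) -> int:
    """Classical (schoolbook) product of two polynomials over F2 given as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in F2[x] is bitwise XOR
        a <<= 1
        b >>= 1
    return result


def k2_mul(f: int, g: int, n: int) -> int:
    """2-segment Karatsuba product of two n-bit polynomials over F2."""
    if n == 1:
        return f & g             # one-bit product: a single AND
    m = (n + 1) // 2             # m = ceil(n / 2)
    mask = (1 << m) - 1
    f0, f1 = f & mask, f >> m    # f = f1 * x^m + f0
    g0, g1 = g & mask, g >> m    # g = g1 * x^m + g0
    low = k2_mul(f0, g0, m)              # f0 * g0
    high = k2_mul(f1, g1, m)             # f1 * g1
    mid = k2_mul(f0 ^ f1, g0 ^ g1, m)    # (f0 + f1) * (g0 + g1)
    h1 = mid ^ low ^ high        # middle term; subtraction equals addition in char 2
    # Overlap step: the shifted pieces share coefficient positions; XOR merges them.
    return low ^ (h1 << m) ^ (high << (2 * m))


# (x^3 + x + 1) * (x^2 + x) = x^5 + x^4 + x^3 + x
print(bin(k2_mul(0b1011, 0b0110, 4)))
```

The three recursive calls correspond to the three multipliers of a single stage, and the final XOR of shifted pieces plays the role of the overlap circuit.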
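To multiply two quadratic polynomials, we use the following formula from [2] which we call 3-segment Karatsuba or $K_3$. It uses 6 multiplications and 12 additions when used for fields of characteristic 2, compared to 9 multiplications and 4 additions in the classical method:

$$(a_2 x^2 + a_1 x + a_0)(b_2 x^2 + b_1 x + b_0) = a_2 b_2 x^4 + \bigl((a_2 + a_1)(b_2 + b_1) - a_2 b_2 - a_1 b_1\bigr) x^3 + \bigl((a_2 + a_0)(b_2 + b_0) - a_2 b_2 - a_0 b_0 + a_1 b_1\bigr) x^2 + \bigl((a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0\bigr) x + a_0 b_0. \qquad (7)$$

Similar to (2) we can write the recursive costs of $K_3$ as:

$$M^{(3)}_n \le 6 M^{(3)}_m + 22m - 10, \qquad (8)$$

where $m = \lceil n/3 \rceil$. Since $\log_2 3 \approx 1.5850 < 1.6309 \approx \log_3 6$, this approach is asymptotically inferior to the original Karatsuba method. One result of this paper is to determine the range of usefulness for this method (namely some n ≤ 81) on our type of hardware.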
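A single level of the 3-segment method can be modelled in the same spirit. This is a sketch under stated assumptions: polynomials over F2 are Python integers, the six m-bit products are computed classically rather than recursively, and the names `clmul` and `k3_step` are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Classical product of two polynomials over F2 given as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def k3_step(f: int, g: int, n: int) -> int:
    """One level of 3-segment Karatsuba for n-bit polynomials over F2."""
    m = (n + 2) // 3                      # m = ceil(n / 3)
    mask = (1 << m) - 1
    f0, f1, f2 = f & mask, (f >> m) & mask, f >> (2 * m)
    g0, g1, g2 = g & mask, (g >> m) & mask, g >> (2 * m)
    # The six multiplications of the 3-segment formula.
    p0, p1, p2 = clmul(f0, g0), clmul(f1, g1), clmul(f2, g2)
    p01 = clmul(f0 ^ f1, g0 ^ g1)
    p02 = clmul(f0 ^ f2, g0 ^ g2)
    p12 = clmul(f1 ^ f2, g1 ^ g2)
    # Coefficients of x^m, x^(2m), x^(3m); signs vanish in characteristic 2.
    h1 = p01 ^ p0 ^ p1
    h2 = p02 ^ p0 ^ p2 ^ p1
    h3 = p12 ^ p1 ^ p2
    # Overlap: shifted partial results are merged by XOR.
    return p0 ^ (h1 << m) ^ (h2 << (2 * m)) ^ (h3 << (3 * m)) ^ (p2 << (4 * m))
```

Replacing the calls to `clmul` by recursive calls would give the fully recursive variant whose cost is described by the recursion for the 3-segment method.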
Hybrid design
For fast multiplication software, a judicious mixture of table look-up and classical, Karatsuba and even faster (FFT) algorithms must be used (see [5] , chapter 8, and [8] , chapter 2). The corresponding issues for hardware implementations have not been discussed in the literature, except that our previous paper [7] uses classical multipliers for polynomials with up to 40 bits.
We present a general methodology and execute it in the special case of a toolbox with these algorithms: classical, $K_2$, and $K_3$. The general idea is that we have a toolbox $\mathcal{A}$ of recursive multiplication algorithms. Each algorithm $A \in \mathcal{A}$ computes the product of two polynomials of degree less than n, for any n. The cost of A consists of some arithmetic operations plus recursive multiplications. For simplicity, we assume that the optimal hybrid multiplication routine using $\mathcal{A}$ is built from the bottom up. For each n ≥ 1, we determine the best method for n-bit polynomials, starting with a single arithmetic operation (namely, a multiplication) for constant polynomials (n = 1). For n ≥ 2, we compute the cost of applying each $A \in \mathcal{A}$ to n-bit polynomials, using the already computed optimal values for the recursive calls. We then enter into our table one of the algorithms with minimal cost.
We now execute this general approach on our toolbox $\mathcal{A} = \{\text{classical}, K_2, K_3\}$. The costs are given in (2) and (8). Whenever necessary, polynomials are padded with leading zeros. The results are shown in Table 1.
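The bottom-up selection can be sketched as a small dynamic program. The sketch assumes the operation counts stated in the text: 2n² − 2n + 1 for the classical method, three m-bit products plus 8m − 4 operations for the 2-segment method with m = ⌈n/2⌉, and six m-bit products plus 22m − 10 operations for the 3-segment method with m = ⌈n/3⌉. The function name `best_hybrid` and the tie-breaking rule (the earlier candidate wins) are assumptions made here, so individual entries may differ from Table 1 where costs tie.

```python
def classical_cost(n: int) -> int:
    """Gate count of a classical multiplier for n-bit polynomials over F2."""
    return 2 * n * n - 2 * n + 1


def best_hybrid(max_n: int):
    """Return (cost, choice) dictionaries for n = 1..max_n, built bottom-up."""
    cost = {1: 1}            # a constant polynomial needs one multiplication
    choice = {1: "C"}
    for n in range(2, max_n + 1):
        m2 = (n + 1) // 2    # ceil(n / 2)
        m3 = (n + 2) // 3    # ceil(n / 3)
        candidates = [
            (classical_cost(n), "C"),
            (3 * cost[m2] + 8 * m2 - 4, "2"),     # 2-segment Karatsuba on top
            (6 * cost[m3] + 22 * m3 - 10, "3"),   # 3-segment Karatsuba on top
        ]
        # min() keeps the first candidate among equal costs (assumed tie rule).
        cost[n], choice[n] = min(candidates, key=lambda c: c[0])
    return cost, choice


cost, choice = best_hybrid(128)
for n in (7, 21, 41):
    print(n, choice[n], cost[n])
```

For n = 41 this reproduces the chain used in the worked example of the text: the 2-segment method on top, the 3-segment method for 21 bits, and the classical method for 7 bits.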
The first column gives the number n of bits, so that we deal with polynomials of degree up to n − 1. The second column "rec" specifies the first recursive level, that is, the algorithm from $\mathcal{A} = \{\text{classical}, K_2, K_3\}$ to be used, abbreviated as {C, 2, 3}. The column "cost" gives the total number of arithmetic operations. The next column states the "ratio" of practice to theory, namely $c \cdot \mathrm{cost} / n^{\log_2 3}$, where the constant c is chosen so that the last entry is 1. The asymptotic regime visibly takes over already at the fairly small values that we consider. The final column gives the cost of the algorithm from [14], which is Karatsuba-based. We know of no other implementation that can be easily compared with ours.
For example, the entry n = 41 refers to polynomials of degree up to 40. The entry 2 in column "rec" says that $K_2$ is to be employed at the top of the recursion. Since $m = \lceil 41/2 \rceil = 21$, (2) says that three pairs of 21-bit polynomials need to be multiplied, plus 8 · 21 − 4 = 164 operations. One has to look up the algorithm for 21 bits in the table. Continuing in this way, the prescription for 41 bits is:
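41 bits: K_2, 164 operations plus three 21-bit multiplications
21 bits: K_3, 144 operations plus six 7-bit multiplications
7 bits: classical, 85 operations
total = 164 + 3 · (144 + 6 · 85) = 2126.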
In the recursive call of K 2 at n = 41, the inputs are split into two pieces of 20 and 21 bits. It is tempting to single out one of the three recursive multiplications as a 20-bit operation, and indeed this view is taken in [14] . They pad input polynomials with enough zero coefficients and apply the Karatsuba method in a recursive manner. Operations involving a coefficient known to be zero are neglected. In our designs, we use three 21-bit multiplications, for a small loss in the operations count but a huge gain in modularity: we only implement a single 21-bit multiplier, thus simplifying the design and enabling pipelining. Section 4 exemplifies this (with 30 rather than 21 bits).
We note that designers of fast arithmetic software have used the general methodology sketched above, in particular formulating it in terms of breakpoints between different algorithms. The classical algorithm can also be viewed recursively, which is used for some results in Table 2 below.
The goal of our hybrid design is to minimize the total arithmetic cost. The same methodology can, of course, also be applied to multi-objective applications, say minimizing both the area A and the area-time product AT. A concern with them would be to limit the number of table entries that are kept.
Hardware structure
According to (4) and (6), the delay of a fully parallel combinational Karatsuba multiplier is almost 4 times that of a classical multiplier. This is the main disadvantage of the Karatsuba method for hardware implementations. In [7], we have suggested as a solution a pipelined Karatsuba multiplier for 240-bit polynomials, shown in Figure 3.

Fig. 3. The 240-bit multiplier in [7]

The innermost part of the design is a combinational pipelined 40-bit classical multiplier equipped with 40-bit and 79-bit adders. The multiplier, these adders, and the overlap module, together with a control circuit, constitute a 120-bit multiplier. The algorithm is based on a modification of a Karatsuba formula for 3-segment polynomials which is similar to but slightly different from (7). (We were not aware of this better formula at that time.)
Another suitable control circuit performs the 2-segment Karatsuba method for 240 bits by means of a 120-bit recursion, 239-bit adders, and an overlap circuit.
This multiplier can be seen as implementing the factorization 240 = 2 · 3 · 40. Table 1 implies that it is usually best to apply the 2-segment Karatsuba, except for small inputs. Translating this into hardware reality, we now present a better design based on the factorization 240 = 2 · 2 · 2 · 30. The resulting structure is shown in Figure 4 .
The 30-bit multiplier follows the recipe of Table 1 . It is a combinational circuit without feedback and the design goal was to minimize its area. In general, k pipeline stages can perform n parallel multiplications in n + k − 1 instead of nk clock cycles without pipelining.
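The cycle counts in the last sentence can be checked with a few lines of arithmetic. The function names are hypothetical, and the inputs 27 and 4 are illustrative values chosen here (the number of small products per 240-bit multiplication and a four-cycle multiplier); the sketch does not model the control circuitry.

```python
def cycles_pipelined(num_products: int, stages: int) -> int:
    """Clock cycles when a new product enters the pipeline every cycle."""
    return num_products + stages - 1


def cycles_unpipelined(num_products: int, stages: int) -> int:
    """Clock cycles when each product must finish before the next one starts."""
    return num_products * stages


print(cycles_pipelined(27, 4), cycles_unpipelined(27, 4))
```

The first product leaves the pipeline after `stages` cycles, and every further product adds one cycle, which gives the sum n + k − 1.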
We have implemented our design, the structure of [7], and a purely classical implementation, on an XC2V6000-4FF1517-4 FPGA. The classical design has a classical 30-bit multiplier and applies the three classical recursion steps. The results after place and route are shown in Table 2. The second column shows the number of clock cycles for a multiplication. The third column represents the area in terms of number of slices. This measure contains both logic elements, or LUTs, and flip-flops used for pipelining. The fourth column is the multiplication time, and the last column is the product of area and time (AT). The synchronization is set so that the 30-bit multipliers require 1 and 4 clock cycles for classical and hybrid Karatsuba implementations, respectively. The new structure is smaller than the implementation in [7] but requires more area than the classical one. This drawback is due to the complicated structure of the Karatsuba method but is compensated by speed as seen in the time and AT measures. In the next section we further improve our structure by decreasing the number of recursions.

Table 2.
Multiplier type             Clock cycles   Slices   Time       AT (slices × µs)
[7] (Fig. 3)                54             1660     0.655 µs   1087
Hybrid Karatsuba (Fig. 4)   55             1513     0.670 µs   1014
Hybrid polynomial multiplier with few recursions
In the recursive Karatsuba multiplier of [7] , the core of the system, namely the combinational multipliers, is idle for about half of the time. To improve resource usage, we reduce the communication overhead by decreasing the levels of recursion. In this new 240-bit multiplier, an 8-segment Karatsuba is applied at once to 30-bit polynomials. We computed symbolically the formulas describing three recursive levels of Karatsuba, and implemented these formulas directly.
The new circuit is shown in Figure 5. The multiplexers mux1 to mux6 are adders at the same time. Their inputs are 30-bit sections of the two original 240-bit polynomials which are added according to the Karatsuba rules. Now their 27 output pairs are pipelined as inputs into the 30-bit multiplier. The 27 corresponding 59-bit polynomials are subsequently combined according to the overlap rules to yield the final result. Time and space consumptions are shown in Table 3 and compared with the results of [7]. The columns are as in Table 2. We see that this design improves on the previous ones in all respects.

Table 3.
Multiplier type             Clock cycles   Slices   Time       AT (slices × µs)
[7] (Fig. 3)                54             1660     0.655 µs   1087
Hybrid Karatsuba (Fig. 5)   30             1480     0.378 µs   559
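Which 30-bit sections are added together for each of the 27 products can be derived by expanding three levels of the 2-segment method symbolically. The sketch below does this with bit masks: bit i of a mask means that section i of the 240-bit input takes part in the sum. The function name `karatsuba_operands` and the ordering of the products are assumptions made here and need not match the multiplexer schedule of the actual circuit.

```python
def karatsuba_operands(sections):
    """Expand recursive 2-segment Karatsuba over a list of section masks.

    Each mask says which original sections are XORed to form one operand.
    Returns the operand masks of all leaf multiplications.
    """
    if len(sections) == 1:
        return [sections[0]]
    half = len(sections) // 2
    low, high = sections[:half], sections[half:]
    # Operands of the middle product: low half plus high half, section by section.
    mixed = [lo ^ hi for lo, hi in zip(low, high)]
    return (karatsuba_operands(low)
            + karatsuba_operands(high)
            + karatsuba_operands(mixed))


# Eight 30-bit sections of a 240-bit polynomial, one mask bit per section.
operands = karatsuba_operands([1 << i for i in range(8)])
print(len(operands))
print([format(op, "08b") for op in operands[:9]])
```

The same masks apply to both input polynomials, so each entry describes one pair of operands fed into the 30-bit multiplier.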
Conclusion
In this paper we have shown how combining algorithmic techniques with platform-dependent strategies can be used to develop designs which are highly optimized for FPGAs. These modules have been considered as appropriate implementation targets for cryptographic purposes both as prototyping platforms and as systems on chips.
We improved the structure proposed in [7] in both time and area aspects. The time has been improved by decreasing the number of recursion stages. To minimize the area we have further improved the results of [14] , as witnessed in Table 1 , by applying the Karatsuba method in a hybrid manner. The benefits of hybrid implementations are well known for software implementations, where the crossover points between subquadratic and classical methods depend on the available memory and processor word size. There seems to be no previous systematic investigation on how to apply these methods efficiently for hardware implementations. In this paper we have shown that a hybrid implementation mixing classical and two Karatsuba methods can result in significant area savings.
Comparisons with the work of [7] are shown in Table 3. The asymptotically fast methods are better than classical multipliers both with respect to time and area measures. An obvious open question is to optimize a larger class of recursive algorithms than our $K_2$ and $K_3$.
