Abstract
Introduction
Motivation. Mobile devices are penetrating our everyday life. More and more sensitive information is exchanged between mobile nodes and between mobile and fixed communication endpoints. This data exchange is normally protected by cipher mechanisms. But due to the scarce resources of mobile nodes, extensive use of cryptographic means is infeasible. This holds especially true for public key cryptography, which is normally used to establish a secure channel between the communicating parties as well as to provide digital signatures. Hardware accelerators for public key cryptography operations are an ideal means to reduce the calculation time as well as the energy consumption. However, a straightforward realization of cryptographic operations results in a relatively large area consumption, which makes the application of hardware accelerators economically infeasible. Thus, our design constraints were:
• Calculation time,
• Energy consumption, and
• Area consumption.
We decided to use Elliptic Curve Cryptography (ECC) since it guarantees the same security level as RSA but with significantly shorter keys. In addition, ECC operations are faster than those of RSA [1]. We selected the curve B-233 over the Galois field $GF(2^{233})$, which is recommended by NIST [2] and well suited to hardware implementation.
Although ECC is less computationally intensive than RSA, it still requires significant effort in terms of energy and time. In this paper we concentrate on the area-efficient realization of the basic mathematical operations used in ECC. Division of polynomials is usually done in two steps: first computing the inverse of the divisor with respect to the irreducible polynomial, and second multiplying the inverse by the dividend. Multiplication and division of polynomials account for the major part of the calculation time. We concentrate on polynomial multiplication, since our long-term goal is to implement a Montgomery multiplier for the 'kP' operation. The Montgomery method requires only one polynomial division for 'kP', so that the major effort comes from the multiplication.
Contribution and structure of this paper. In this paper we show that an iterative application of the Karatsuba method provides very good results with respect to three parameters: calculation time, area consumption and energy consumption. With our iterative hardware solution, the chip area needed to calculate the product of two 233-bit operands is 2.1 mm², whereas the standard application of Karatsuba's method needs 6.2 mm². Our approach also reduces the energy consumption to 60 per cent of that of the original approach. The price we pay for these achievements is an increased execution time: in our implementation a polynomial multiplication takes 3 clock cycles, whereas the original one needs only one clock cycle.
The rest of this paper is structured as follows. Section 2 contains a short description of the implemented methods. We propose to use Karatsuba's formula for polynomial multiplication iteratively. The detailed description of our approach is given in Section 3. Section 4 discusses the hardware realization of our approach and provides measurement results. We conclude the paper with a short discussion of our results and an outlook on further research steps.
State of the art
In this section we describe methods for polynomial multiplication in polynomial basis. We implemented these methods and different combinations of them to realize our own approach and to benchmark our solution. 
Polynomial multiplication
The product of two polynomials 
The straight forward implementation of formula (1) 
Karatsuba based methods

Original Karatsuba's method. For polynomial multiplication with the original Karatsuba method [3], both operands have to be split into two equal parts. If the length n of the operands is odd, they have to be padded with a leading '0'. We denote by $a_i$ the i-th bit and by $a^i$ the i-th segment of operand A(x). So, the operands can be written as:
$$A(x) = a^1 x^{n/2} + a^0$$
The polynomial B(x) is represented in the same way. Karatsuba's formula for the product $C(x) = A(x) \cdot B(x)$ is then:
$$C(x) = a^1 b^1\,x^{n} + \left[(a^0 + a^1)(b^0 + b^1) + a^0 b^0 + a^1 b^1\right]x^{n/2} + a^0 b^0$$
In order to calculate the partial products, Karatsuba's formula can be applied recursively. In this case it requires in total $3^{\log_2 s} = s^{\log_2 3} \approx s^{1.58}$ partial multiplications, where s is the number of segments. This method can be used to speed up software as well as hardware implementations. In software implementations, Karatsuba's approach is usually applied until both operands have a size of one word.
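The recursive application can be illustrated with a minimal Python sketch. It is not the hardware design evaluated in this paper: polynomials over GF(2) are stored as Python integers (bit i holds the coefficient of x^i), and the function names `clmul` and `karatsuba` as well as the `word` cut-off parameter are assumptions made for this example only.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook (shift-and-XOR) product of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def karatsuba(a: int, b: int, n: int, word: int = 8) -> int:
    """Recursive Karatsuba product of two n-bit GF(2) polynomials.

    Assumes n equals `word` times a power of two, so that every
    split yields two halves of equal length.
    """
    if n <= word:
        return clmul(a, b)   # recursion stops at one "word"
    half = n // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    low = karatsuba(a0, b0, half, word)
    high = karatsuba(a1, b1, half, word)
    mid = karatsuba(a0 ^ a1, b0 ^ b1, half, word)
    # three partial products instead of four
    return (high << n) ^ ((mid ^ low ^ high) << half) ^ low
```

With `word=1` the recursion runs down to one-bit operands, which corresponds to the fully recursive variant; larger values of `word` mimic the software practice of stopping at the machine word size.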
In Bailey and Paar [4] a new scheme for applying Karatsuba's idea was proposed. In this scheme the operands are divided into three parts. Throughout the rest of this paper we refer to this method as Bailey's method. It requires 6 partial multiplications of n/3-bit operands. This method can be combined with the original Karatsuba formula for operands whose length is divisible by six.
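A three-part split with six partial multiplications can be sketched as follows. The formula used here is the commonly cited three-way Karatsuba-like decomposition over GF(2); whether it matches the formulation in [4] term by term is an assumption. The names `clmul` and `three_way` are hypothetical and only one level of splitting is shown.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook (shift-and-XOR) product of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def three_way(a: int, b: int, n: int) -> int:
    """One level of a three-part split; n is assumed divisible by 3."""
    seg = n // 3
    mask = (1 << seg) - 1
    a0, a1, a2 = a & mask, (a >> seg) & mask, a >> (2 * seg)
    b0, b1, b2 = b & mask, (b >> seg) & mask, b >> (2 * seg)
    # six partial multiplications of n/3-bit operands
    d0, d1, d2 = clmul(a0, b0), clmul(a1, b1), clmul(a2, b2)
    d01 = clmul(a0 ^ a1, b0 ^ b1)
    d02 = clmul(a0 ^ a2, b0 ^ b2)
    d12 = clmul(a1 ^ a2, b1 ^ b2)
    return ((d2 << (4 * seg))
            ^ ((d12 ^ d1 ^ d2) << (3 * seg))
            ^ ((d02 ^ d0 ^ d1 ^ d2) << (2 * seg))
            ^ ((d01 ^ d0 ^ d1) << seg)
            ^ d0)
```

A schoolbook split into three parts would need nine partial multiplications; the sketch trades three of them for additional XOR operations.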
Iterative Application of the Karatsuba Approach
The major point of our approach is to apply the original version of Karatsuba's method iteratively. We denote this as the Iterative-Karatsuba method. The major benefits of this approach are:
• a smaller area consumption of the hardware accelerators, since the partial multiplications can be performed serially, and
• a reduced number of XOR operations compared with the recursive variant of Karatsuba's method.
We explain our idea of the iterative application of Karatsuba's formula using an example in which the operands are split up into four segments. First of all, we use the original Karatsuba formula to obtain the expression for a product, in which only 1-segment long operands for partial multiplication are used.
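So, at the beginning we have two operands, each of them 4n bits long and consisting of the n-bit segments $a^3, a^2, a^1, a^0$ and $b^3, b^2, b^1, b^0$, respectively. All additions below are over GF(2), i.e. bitwise XOR. We split each operand into two 2n-bit parts:
$$A(x) = a_2^3\,x^{2n} + a_0^1, \qquad B(x) = b_2^3\,x^{2n} + b_0^1$$
The result of applying Karatsuba's formula is:
$$C(x) = a_2^3 b_2^3\,x^{4n} + \left[a_{02}^{13} b_{02}^{13} + a_2^3 b_2^3 + a_0^1 b_0^1\right]x^{2n} + a_0^1 b_0^1 \qquad (6)$$
Every 2-segment element is:
$$a_0^1 = a^1 x^n + a^0, \qquad a_2^3 = a^3 x^n + a^2, \qquad a_{02}^{13} = (a^1 + a^3)\,x^n + (a^0 + a^2) \qquad (7)$$
and analogously for B(x). So, for each partial multiplication from (6) and (7) we use Karatsuba's formula again. The final result is given in formula (8):
$$\begin{aligned}
C(x) ={}& a^3 b^3\,x^{6n} \\
&+ \left[(a^2+a^3)(b^2+b^3) + a^2 b^2 + a^3 b^3\right]x^{5n} \\
&+ \left[(a^1+a^3)(b^1+b^3) + a^1 b^1 + a^2 b^2 + a^3 b^3\right]x^{4n} \\
&+ \left[(a^0+a^1+a^2+a^3)(b^0+b^1+b^2+b^3) + (a^0+a^1)(b^0+b^1) + (a^2+a^3)(b^2+b^3) \right. \\
&\qquad \left. {}+ (a^0+a^2)(b^0+b^2) + (a^1+a^3)(b^1+b^3) + a^0 b^0 + a^1 b^1 + a^2 b^2 + a^3 b^3\right]x^{3n} \\
&+ \left[(a^0+a^2)(b^0+b^2) + a^0 b^0 + a^1 b^1 + a^2 b^2\right]x^{2n} \\
&+ \left[(a^0+a^1)(b^0+b^1) + a^0 b^0 + a^1 b^1\right]x^{n} \\
&+ a^0 b^0 \qquad (8)
\end{aligned}$$
Formula (8) contains nine distinct partial products of 1-segment operands.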
Each of the operands is 1-segment long, so that the resulting partial product is (2n-1)-bit long. We denote the bits from n-1 to 0 of the product 
Using the notation introduced in (9) we can represent formula (8) as given in table 1.
Table 1. Representation of formula (8) (columns: partial products; segments of result)
We then calculate the segments of the product step by step, using the results already obtained, from Step 1 through Step 9.
This iterative calculation of C(x) reduces the area of our hardware multiplier. We need only one partial multiplier for 1-segment long operands. After each clock cycle this multiplier delivers the next partial product. In that way the segments of the product C(x) are accumulated. For the example given above this means that after 9 clock cycles all segments contain the correct product of the polynomial multiplication.
Additionally, we exploit another 'iterative possibility': we do not need to calculate all segments of C(x) separately. We can use $c^0$ to determine $c^1$ after the first clock, $c^1$ for $c^2$ after the second clock, and so on (see Table 2). This iterative calculation reduces the number of XOR operations to 29, compared to 42 XOR operations if every $c^i$ is calculated separately.
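The serial use of a single partial multiplier in the four-segment example can be sketched in Python as follows. This is only an illustration under stated assumptions: the order of the nine steps in `SCHEDULE` is chosen arbitrarily and is not taken from Table 2, the accumulation XORs every partial product directly into the result and therefore does not reproduce the reduced XOR count described above, and the names `clmul`, `SCHEDULE` and `iterative_karatsuba4` are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Stand-in for the one-segment partial multiplier (GF(2) product)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


# (segments XORed to form both operands, result offsets in units of n bits)
# Obtained by applying Karatsuba's formula twice to a 4-segment operand.
SCHEDULE = [
    ((0,), (0, 1, 2, 3)),
    ((1,), (1, 2, 3, 4)),
    ((2,), (2, 3, 4, 5)),
    ((3,), (3, 4, 5, 6)),
    ((0, 1), (1, 3)),
    ((2, 3), (3, 5)),
    ((0, 2), (2, 3)),
    ((1, 3), (3, 4)),
    ((0, 1, 2, 3), (3,)),
]


def iterative_karatsuba4(a: int, b: int, n: int) -> int:
    """Multiply two 4n-bit GF(2) polynomials with nine serial partial products."""
    mask = (1 << n) - 1
    a_seg = [(a >> (i * n)) & mask for i in range(4)]
    b_seg = [(b >> (i * n)) & mask for i in range(4)]
    product = 0
    for select, offsets in SCHEDULE:          # one pass models one clock cycle
        op_a = op_b = 0
        for i in select:                      # operand selection
            op_a ^= a_seg[i]
            op_b ^= b_seg[i]
        partial = clmul(op_a, op_b)           # the single partial multiplier
        for k in offsets:                     # product accumulation
            product ^= partial << (k * n)
    return product
```

Calling `iterative_karatsuba4(a, b, 64)` multiplies two 256-bit operands with nine uses of a 64-bit partial multiplier.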
In a similar way we applied our iterative approach to Bailey's method, which we call Iterative-Bailey throughout the rest of this paper.
The design of the Iterative-Karatsuba accelerator consists of three major parts (see Fig. 1):
• The Selection block feeds selected parts of both operands into the Partial Multiplier on each new clock signal.
• The Partial Multiplier block calculates the partial product of the operands delivered by the Selection block and passes the result to the Product Accumulation block.
• The Product Accumulation block computes the final product from the partial products it receives from the Partial Multiplier.
The theoretical basis and the exact operation sequence are discussed in detail in Section 3.
Figure 1: Block diagram of our Iterative-Karatsuba multiplier
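The interplay of the three blocks can be modelled in software with the following Python sketch. It is a behavioural illustration, not a description of the synthesized design: the function names (`build_schedule`, `select`, `accumulate`, `multiply`) are hypothetical, the schedule is derived by unrolling Karatsuba's formula for 2^levels segments, and the step order is an assumption.

```python
def clmul(a: int, b: int) -> int:
    """Partial Multiplier model: GF(2) product of two segment-sized operands."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def build_schedule(levels: int):
    """Unroll Karatsuba's formula for 2**levels segments.

    Returns a list of (segment indices, result offsets); each entry is
    one partial multiplication, i.e. one clock cycle in this model.
    """
    schedule = [((0,), frozenset({0}))]
    for level in range(levels):
        half = 2 ** level
        expanded = []
        for idx, offs in schedule:
            hi_idx = tuple(i + half for i in idx)
            at_half = frozenset(o + half for o in offs)
            at_full = frozenset(o + 2 * half for o in offs)
            expanded.append((idx, offs ^ at_half))        # low product
            expanded.append((hi_idx, at_half ^ at_full))  # high product
            expanded.append((idx + hi_idx, at_half))      # middle product
        schedule = expanded
    return schedule


def select(segments, idx):
    """Selection block model: XOR the chosen segments into one operand."""
    operand = 0
    for i in idx:
        operand ^= segments[i]
    return operand


def accumulate(product, partial, offsets, seg_bits):
    """Product Accumulation model: XOR the partial product into place."""
    for k in offsets:
        product ^= partial << (k * seg_bits)
    return product


def multiply(a: int, b: int, seg_bits: int, levels: int) -> int:
    count = 2 ** levels
    mask = (1 << seg_bits) - 1
    a_seg = [(a >> (i * seg_bits)) & mask for i in range(count)]
    b_seg = [(b >> (i * seg_bits)) & mask for i in range(count)]
    product = 0
    for idx, offsets in build_schedule(levels):
        partial = clmul(select(a_seg, idx), select(b_seg, idx))
        product = accumulate(product, partial, offsets, seg_bits)
    return product
```

In this model `multiply(a, b, 128, 1)` uses the partial multiplier three times, `multiply(a, b, 64, 2)` nine times and `multiply(a, b, 32, 3)` 27 times, which shows how a smaller partial multiplier is paid for with more iterations.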
The performance, chip area and energy consumption of a polynomial multiplier are dominated by the partial multiplier which is used. The longer the operands that the partial multiplier accepts, the faster the multiplier is. But this also results in a relatively large area consumption. So, the design decision to make seems to be straightforward: calculation time versus chip area. This is true as long as only the partial multiplier is considered. But for the polynomial multiplier the area of the selection block and of the product accumulation block also has to be taken into account. The chip area needed for the accumulation block is inversely related to the area of the partial multiplier, i.e. the smaller the partial multiplier, the larger the accumulation block. This results from the fact that with small partial multipliers more intermediate results have to be stored for the final calculation of the polynomial product. For example, the size of the accumulation block is 0.649 mm² if the partial multiplier accepts 128-bit operands, and 1.466 mm² if the maximum operand length is 32 bits.
In order to determine the most appropriate design for a polynomial multiplier we realized several partial multipliers. We realized three one-clock partial multipliers for our Iterative-Karatsuba approach as well as for our Iterative-Bailey approach. These partial multipliers accept operands with a maximal length of 128, 64 and 32 bits, respectively. They were synthesized with a library of our in-house 0.25 µm CMOS technology [5]. Table 3 shows the area, the time and the energy consumption for each of these six partial multipliers. These values stem from the Design Analyzer tool from Synopsys [6]. In order to benchmark our approach we realized polynomial multipliers using the following approaches:
• Iterative Karatsuba
• Iterative Bailey
• Original Karatsuba (recursive)
• Original Bailey (recursive)
For the first two approaches, i.e. for our own iterative approaches, we realized three polynomial multipliers using different partial multipliers (see Table 3) in order to see how the partial multiplier influences the overall parameters. We named these multipliers so that the name indicates the applied method. For example, the name iterative_Karatsuba_8segments means: the Iterative-Karatsuba method splitting the incoming operands into 8 segments.
In the two recursive multipliers the original Karatsuba and the Bailey formula are applied down to one-bit operands. Both multipliers deliver the polynomial product after one clock cycle. They differ in the length of the input operands: the Karatsuba multiplier always expects two 256-bit input values, whereas the Bailey multiplier expects two 243-bit input values.
Since we are going to use these multipliers for the elliptic curve B-233, the two input values are only 233 bits long. Therefore the operands are padded with leading 0's where necessary. The result of the multiplication is always 465 bits long.
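The relation between the partial multiplier width, the padded operand length and the number of iterations for the Iterative-Karatsuba multipliers can be checked with a short Python sketch. It assumes one partial product per clock cycle and a number of segments that is a power of two; the function name `karatsuba_parameters` is hypothetical, and the sketch does not cover the Bailey variants.

```python
def karatsuba_parameters(operand_bits: int, segment_bits: int):
    """Return (padded bits, segments, clock cycles, product bits).

    Assumes the operands are padded with leading zeros until they
    consist of a power-of-two number of segments, and that each of
    the 3**log2(segments) partial products takes one clock cycle.
    """
    segments = 1
    while segments * segment_bits < operand_bits:
        segments *= 2
    padded_bits = segments * segment_bits
    cycles = 3 ** (segments.bit_length() - 1)   # 3 ** log2(segments)
    product_bits = 2 * operand_bits - 1         # degree 2*(m-1) plus one
    return padded_bits, segments, cycles, product_bits


for width in (128, 64, 32):
    print(width, karatsuba_parameters(233, width))
```

For 233-bit operands all three partial multiplier widths lead to a padded length of 256 bits, and the 128-bit case yields the 3 clock cycles quoted in the introduction.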
We synthesized all polynomial multipliers using a library of our in-house 0.25 µm CMOS technology [5]. We obtained the data presented in the tables from different kinds of reports of the Synopsys "Design Analyzer" [6]. The parameters of the implemented polynomial multipliers are given in Table 4. Our results clearly indicate that an iterative application of the original Karatsuba and Bailey approaches significantly reduces the chip area. If the number of iterations is kept small, our approach also helps to reduce the energy consumption. In those designs the trade-off is less area and less energy versus slower execution time. Increasing the number of iterations helps to reduce the chip area further, but it also leads to an increased power consumption and an increased calculation time. So, these implementations are beneficial only if cost is the dominating parameter.
Conclusions and Outlook
In this paper we discussed the iterative application of Karatsuba's method for polynomial multiplication as a means to reduce the chip area and energy needed to run elliptic curve cryptography on mobile devices. In order to evaluate our approach we analyzed different methods for polynomial multiplication in $GF(2^n)$ and implemented different polynomial multiplication algorithms. For our own approach we realized several partial multipliers. We used them to implement a set of iterative polynomial multipliers with the goal of identifying the one which is best suited for application in mobile devices. Our results clearly indicate that our iterative approach leads to significantly better results with respect to area and energy consumption than the original straightforward application. Our next step is the finalization of our Montgomery 'kP' multiplier. In this multiplier we will use the Fermat theorem, since it allows the inverse needed for the division to be determined using only multiplication and squaring. Inversion based on the Fermat theorem is slower than the Extended Euclidean Algorithm or the method proposed by Shantz [7], but it requires less area. Since the Montgomery method requires only a single division, we think that the smaller area outweighs the slower performance.
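Inversion by the Fermat theorem, using only multiplication and squaring, can be sketched in Python as follows. This is a plain software illustration and not the planned hardware design. The reduction polynomial x^233 + x^74 + 1 is the one specified by NIST for B-233 and is not stated in this paper, and the names `clmul`, `gf_mul` and `gf_inv` are hypothetical.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1 (assumed from NIST B-233)


def clmul(a: int, b: int) -> int:
    """GF(2) polynomial product without reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^233): polynomial product reduced modulo F."""
    prod = clmul(a, b)
    for bit in range(prod.bit_length() - 1, M - 1, -1):
        if (prod >> bit) & 1:
            prod ^= F << (bit - M)   # cancel the term x^bit
    return prod


def gf_inv(a: int) -> int:
    """Inverse via the Fermat theorem: a^(2^M - 2) for non-zero a.

    Uses 2^M - 2 = 2 + 4 + ... + 2^(M-1), i.e. M-1 squarings and
    M-1 multiplications, and no other operation.
    """
    result = 1
    power = a
    for _ in range(M - 1):
        power = gf_mul(power, power)     # a^(2^i)
        result = gf_mul(result, power)
    return result
```

A division b/a is then `gf_mul(b, gf_inv(a))`. The sketch reuses the multiplier for every step, which reflects why this kind of inversion needs little additional area but many multiplier passes.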
