Abstract-We describe algorithms for point multiplication on Koblitz curves using multiple-base expansions of the form
Ç

INTRODUCTION
I N 1985, Koblitz [1] and Miller [2] independently proposed the use of the additive finite abelian group of points on elliptic curves defined over a finite field for cryptographic applications. The Koblitz curves [3] , or anomalous binary curves, are
defined over IF 2 , where a 2 f0; 1g. The number of points on these curves when considered over IF 2 m can be computed rapidly using a simple recurrence relation, and there are many prime values of m for which the number of points is twice a prime (when a ¼ 1) or four times a prime (when a ¼ 0). Five Koblitz curves are recommended for cryptographic use by NIST [4] . The main advantage of Koblitz curves is that the Frobenius automorphism of IF 2 acts on points via ðx; yÞ ¼ ðx 2 ; y 2 Þ and is essentially free to compute. Because satisfies ð 2 þ 2ÞP ¼ ðP Þ for all points P 2 E a ðIF 2 m Þ where ¼ ðÀ1Þ 1Àa , we can consider as a complex number satisfying x 2 À x þ 2 ¼ 0, i.e., ¼ ð þ ffiffiffiffiffiffi ffi À7 p Þ=2. Thus, computing kP , where k 2 Z Z and P 2 E a ðIF 2 m Þ, can be done using a representation of k involving powers of instead of the usual binary representation using powers of 2, yielding a point multiplication algorithm similar to the binary "doubleand-add" method in which the point doublings are replaced by applications of the Frobenius [3] , [5] . Solinas [5] showed how the nonadjacent form (NAF) and window-NAF methods can be extended to -adic expansions. The resulting point multiplication algorithms require on average ðlog 2 kÞ=3 point additions or ðlog 2 kÞ=ðw þ 1Þ point additions using width-w window methods requiring precomputations based on P . A result of Avanzi et al. [6] reduces this to ðlog 2 kÞ=4 at the cost of one additional point halving, but the practicality of this method has not yet been demonstrated. Double-base integer representations have been used to devise efficient point multiplication algorithms [7] , [8] , [9] . For example, it can be shown that the number of terms of the form AE2 a 3 b required to represent k is bounded by Oðlog k= log log kÞ. These representations can be computed efficiently and the resulting point multiplication algorithms are the only known methods for which the number of required point additions is sublinear in log k.
In this paper, which is based on a preliminary version that appeared in CHES 2006 [10] , we extend the double-base idea to -adic expansions for point multiplication on Koblitz curves by representing k as a sum of terms AE a ð À 1Þ b . Our algorithm requires no precomputations based on the point P , no point doublings, and fewer point additions than -adic NAF ð-NAFÞ for the five recommended Koblitz curves from [4] . Our algorithm for computing the doublebase representation of k is very efficient; it requires only the unsigned -adic expansion of k plus a few table lookups. A precomputed table of optimal representations for a small number of -adic integers is required, but these are independent of the multiplier k and the base point P . We have developed a novel FPGA implementation of both the conversion and point multiplication algorithms that demonstrates the efficiency of our method. This implementation takes advantage of the naturally arising parallelism in our point multiplication algorithm in order to further reduce the latency.
We conjecture that the average density of our representations is sublinear in log k and provide numerical evidence showing that the density is lower than that of -NAF expansions. Although we do not have a proof that the number of point additions required by our algorithm is sublinear, our numerical experiments indicate that the number of point additions required is within a small constant multiple of log k=ðlog log kÞ. This constant depends on the size of the precomputed table and was between 3 and 1, with smaller values occurring for larger table sizes. In addition, we provide a proof that sublinearity is obtained using similar expressions involving three bases of the form AE a ð À Þ b ð 2 À À 1Þ c . Although this provably sublinear algorithm is not as well suited for practical applications, this work is nevertheless of interest, as it represents the first rigorously proven sublinear point multiplication algorithm using complex bases.
Avanzi and Sica [11] have independently investigated the application of the double-base idea to Koblitz curves using bases and 3. It is not clear how their algorithm performs in practice. They include a proof that the density of their multiplier representations is sublinear under the unproven but reasonable assumption that the irrationality measure of log 2 3 and argðÞ= is 2, but the proof has been shown to have a gap [12] .
The remainder of this paper is organized as follows: In Section 2, we present our provably sublinear point multiplication algorithm. We present a similar algorithm using only two complex bases in Section 3. Although we cannot prove sublinearity for this algorithm, we conjecture that the density of the representations is in fact sublinear, and provide numerical evidence in Section 3.2 indicating that the density of our representations is lower than that of -NAF representations. A description of our FPGA implementation and numerical data demonstrating its efficiency are presented in Section 4. We show in Section 5 how parallel processing can be used for reducing computation latency and demonstrate the efficiency of parallelization with our FPGA implementation. Finally, we conclude with an outlook on possible directions for further research.
MULTIDIMENSIONAL FROBENIUS EXPANSIONS
We start with the following definitions: Definition 1. A complex number of the form e þ f, where e; f 2 Z Z, is called a Kleinian integer [13] .
The main idea of the new point multiplication algorithm over Koblitz curves is to extend the existing and widely used -NAF expansion of the scalar to a new form that will speed up the computations. The improvements obtained in this paper are based on the following representation, which we will call 2D or 3D Frobenius expansions (or f; À 1g-expansion and f; À;
2 À À1g-expansion, for short):
Such representations are clearly highly redundant. If we rearrange the summands in the above formula, then, using two bases, we can represent the scalar k as
where maxða i;l Þ is the maximal power of that is multiplied by ð À 1Þ l in (2) . Using three bases, we can represent k as
where maxða i;l1;l2 Þ is the maximal power of that is multiplied by
Algorithm 1 computes kP given a f; À 1g-expansion of k. In order to simplify, we denote the terms corresponding to ð À 1Þ l in (4) with r l ðkÞ, i.e., r l ðkÞ ¼ P maxðai;lÞ i¼1 s i;l ai;l .
The corresponding algorithm for f; À ; 2 À À 1g-expansions will be described later, along with a proof that the number of point additions is sublinear in log k. Essentially, kP is computed via a succession of 1D -adic expansions.
Algorithm 1 Point multiplication using f; À 1g-expansions.
The representation of k given in (4) is the cornerstone of our algorithm, so some comments on it are in order:
1. The multiplications by À (and À 1) cost one Frobenius mapping (free in our computational model) and one point addition or subtraction. The multiplications by 2 À À 1 cost two Frobenius mappings and two point additions/subtractions. Therefore, the total number of point additions/ subtractions, ASðkÞ, is given by
in the case of f; À 1g-expansions and
in the case of f; À ; 2 À À 1g-expansions. The smallest possible value of maxðb i Þ and maxðc i Þ, 0, corresponds to the classical (1D) -NAF expansion, for which it is known that the expected number of point additions/subtractions is ðlog 2 kÞ=3. It is clear that by allowing larger values for maxðb i Þ and maxðc i Þ one would decrease the corresponding number of summands, d. Therefore, it is vital to find out the optimal values for maxðb i Þ as a function of the size of the scalar. 2. Finding an algorithm that can return a fairly short decomposition of k as the sum of f; À 1g-Kleinian integers is absolutely essential. The most straightforward idea seems to be the greedy algorithm described in Algorithm 2. A greedy algorithm for computing f; À ; 2 À À 1g-expansions is an easy generalization of this algorithm.
Algorithm 2 Greedy algorithm for computing f; À 1g-expansions.
The complexity of the greedy algorithm depends crucially on the time spent to find the closest f; À 1g-Kleinian integer to the current Kleinian integer. Unfortunately, we were not able to find a significantly more efficient method to do this than precomputing all Kleinian integers AE x ð À 1Þ y for x, y less than certain bounds and finding the closest one using exhaustive search. In Section 2.1, we present an efficient algorithm for computing f; À 1g-expansions with slightly more weight than those produced by the greedy algorithm and an algorithm for computing f; À ; 2 À À 1g-expansions with weight provably sublinear in log k.
Comparison to Double-Base Number Systems
The similarities between (2) and the double-base number system (DBNS), in which one represents integers as the sum or difference of numbers of the form 2 a 3 b , a, b nonnegative integers (called {2, 3}-integers), are apparent. In the case of DBNS, one can prove the following result. Theorem 1. Every positive integer, n, can be written as the sum of at most Oðlog n= log log nÞ {2, 3}-integers and (one) such representation can be found by using the greedy algorithm.
The key point in proving this theorem is the following result of Tijdeman [14] .
Theorem 2. Let x and y be two {2, 3}-integers, x > y. Then, there exist effectively computable constants, c 1 and c 2 , such that x ðlog xÞ c1 < x À y < x ðlog xÞ c2 :
The proof of Theorem 1 uses only the first inequality. Theorem 2 provides a very accurate description of the difference between two consecutive {2, 3}-integers. More to the point, it can be generalized easily to any set of fp 1 ; p 2 ; . . . ; p s g-integers if p s is fixed. The proof depends on the main result of Baker [15] from the theory of linear form in logarithms. 
where
The constant CðkÞ is huge, even in the case of linear forms in two logarithms, approximately expð6 Á 10 9 Þ. By using some results aimed specifically at the case of two logarithms [16] , one can reduce CðkÞ to expð10 7 Þ, but this is still enormous. However, practical simulations suggest that this constant is likely to be much smaller, perhaps less than 100.
There are two very essential points that are often overlooked in the formulations of the above theorems [17] A paper by Avanzi and Sica [11] contains a proof that Conjecture 1 is true if one uses f; 3g-Kleinian integers under the unproven but reasonable assumption that the irrationality measure of log 2 3 and argðÞ= is 2. Unfortunately, the proof, even with the assumption on irrationality measures, has a gap [12] . The gap is due to a reasonable but unproven assumption about the uniform distribution of the arguments of the complex numbers of the form a 3 b [12] . The use of two complex bases, used in this paper, increases the theoretical difficulties in proving the conjecture but provides much more practical algorithms.
However, in the case of three bases, we can prove without any assumptions the following:
as the sum of at most Oðlog NðÞ= ðlog log NðÞÞÞ f; À ; 2 À À 1g-Kleinian integers, such that the largest power of both À and 2 À À 1 is Oðlog NðÞÞ for any real constant where 0 < < 1=2.
Proof. We assume that b ¼ 0; otherwise, one applies the same proof for the real and imaginary parts of , which leads to doubling the implicit constant hidden in the big-O notation. Let be a real constant, where 0 < < 1=2. We determine the -adic representation of a, the real part of , using digits 0 and 1. 2 À À 1g-expansion of the form (5) using Theorem 4. 4: Apply in succession -NAF based point multiplications based on (5) to compute Q.
The analysis of Algorithm 3 is simple.
Step 1 requires Oðlog NðÞÞ point additions and step 2 requires Oðlog 2 NðÞÞ point additions. Because < 1=2, the total number of point additions for steps 1 and 2 is oðlog NðÞÞ. According to Theorem 4, step 3 requires Oðlog NðÞ= ðlog log NðÞÞÞ point additions. The total number of point additions for Algorithm 3 is therefore Oðlog NðÞ= ðlog log NðÞÞÞ. Thus, one can compute kP in Oðlog k= ðlog log kÞ point additions by computing kðmodð m À1Þ= ð À 1ÞÞ and applying Algorithm 3 to compute P .
Note that the first two steps of Algorithm 3 are independent of k. If a fixed base point P is to be used, the points Q i1;i2 may be precomputed.
The parameter can be chosen in a variety of ways. The total number of point additions required by all three steps is roughly log NðÞ þ log 2 NðÞ þ 2 log NðÞ=ð log log NðÞÞ; for 163 < NðÞ < 571, taking such that 0:365 < < 0:368 minimizes this quantity. Smaller values of reduce the number of points Q i 1 ;i 2 that must be precomputed and stored at the cost of increasing the number of point additions that must be performed in step 3. On the other hand, larger values of decrease the number of point additions in step 3 at the cost of having to precompute and store more points.
A PRACTICAL BLOCKING ALGORITHM
Although, as proved in Theorem 4, using f; À ; 2 À À 1g-expansions does lead to a sublinear point multiplication algorithm, the resulting algorithm is likely not suitable for practical purposes. Nevertheless, assuming the truth of Conjecture 1, we can devise an efficient algorithm that computes f; À 1g-expansions with sublinear density of Kleinian integers. This algorithm is based on the following theorem.
Theorem 5. Assuming Conjecture 1, every Kleinian integer ¼ a þ b can be represented as the sum of at most Oðlog NðÞ= log log NðÞÞ f; À 1g-Kleinian integers such that the largest power of À 1 is Oðlog NðÞ= log log NðÞÞ.
Proof. We shall assume that b ¼ 0; otherwise, one applies the same proof for the real and imaginary parts of , which leads to doubling the implicit constant hidden in the big-O notation. Next, we represent a, the real part of , in base-expansion with digits 0 and 1. The length of this expansion is Oðlog NðÞÞ. Now, we break this representation into log log NðÞ blocks, each of length Oðlog NðÞ= log log NðÞÞ. Every block of digits corresponds to a Kleinian integer and, if Conjecture 1 is true, then every block can be represented as the sum at most O log NðÞ= log log NðÞ log log NðÞ= log log NðÞ ð Þ ¼ O log NðÞ log 2 log NðÞ À log log NðÞ log log log NðÞ ¼ O log NðÞ log 2 log NðÞ f; À 1g-Kleinian integers. As the number of blocks is log log NðÞ, we conclude that this representation consists of at most Oðlog NðÞ= log log NðÞÞ f; À 1g-Kleinian integers. The highest power of À 1 that can occur in such a representation is governed by the highest power that can occur in every block and it is at most Oðlog NðÞ= log log NðÞÞ. t u
This theorem leads to an efficient method to compute f; À 1g-expansions with sublinear density under Conjecture 1. The idea, described in Algorithm 4, is to apply the blocking strategy described in the proof and compute optimal f; À 1g-expansions for each block.
Algorithm 4 Blocking algorithm for computing f; À 1g-expansions. INPUT: A Kleinian integer ¼ e þ f, block size w, precomputed table of the minimal f; À 1g-expansion of every Kleinian integer
{Process blocks of length w} 5:
Find minimal f; À 1g-expansion of P wÀ1 i¼0 d iþjw i from the precomputed 
3.
One can choose the size of the blocks based on available memory. The larger the block size, the lower the density of the f; À 1g-expansions produced. 4. If the block size is not too big, one can precompute the minimal f; À 1g-expansion of every Kleinian integer of the form P wÀ1 i¼0 d i i , d i 2 f0; 1g, thereby ensuring as low a density as possible. This precomputation can be done using exhaustive search and needs to be done once per elliptic curve, as it does not depend on k nor the base point P .
Example
Consider the representation of 6,465 into a f; À 1g-expansion by using the two different algorithms we have described. Assume that we intend to compute ð6;465ÞP for some point
As in the case of computing the -NAF expansion, we first do a partial reduction of 6,465 modulo ð 17 À 1Þ=ð À 1Þ as in [5] , yielding ¼ À104 þ 50. The greedy algorithm, Algorithm 2, returns
The blocking algorithm, Algorithm 4, using a block size w ¼ 7 returns the same representation. Fig. 1 illustrates the idea behind the blocking algorithm.
13 . This 14-bit expansion of is broken into two 7-bit blocks. The right block corresponds to 2 þ 5 , and 2 ð À 1Þ 2 is its optimal f; À 1g-expansion. The left block corresponds to 2 þ 4 þ 5 þ 6 , and ð À 1Þ þ ð À 1Þ 6 is its optimal expansion. Finally, multiplying the expression for the left block by 7 yields the f; À 1g-expansion of . The usefulness of this idea is perhaps clearly visible if one uses a geometric representation of the f; À 1g-expansion. Table 1 shows the 2D representation of the f; À 1g-expansion computed above. The expansion is very sparse-of the 63 possible terms that could occur, assuming 8 is the maximum power of and ð À 1Þ 6 is the maximum power of À 1, only three actually occur in the expansion. Furthermore, note that when computing kP using this representation, each row in the table represents a 1D -adic expansion, and that these are very sparse.
Numerical Evidence
In this section, we present results from software implementations of Algorithms 2 and 4, the greedy and blockingbased f; À 1g-expansion algorithms. The objective is to compare the density of the f; À 1g-expansions computed by our algorithms with -NAF expansions. Our algorithms and the algorithm for computing the -NAF [5] of k were implemented in C, using the GMP library for multiprecision integer arithmetic. Tests were run on an Intel Xeon 2.8-GHz CPU running Linux.
In Table 2 , we present the average time required to compute the f; À 1g-expansion of k compared with the time to compute the -NAF of k for the sizes of k corresponding to the five Koblitz curves recommended by NIST [4] . We used the blocking algorithm, Algorithm 4, with w ¼ 10 and maxðb i Þ ¼ 6 to compute the ð; À 1Þ-expansions, and 500,000 random values of k for each size.
Our implementation of
Theorem 5 states that our conversion algorithm outputs expansions of k with sublinear density even if the maximum power of À 1 is bounded by some constant maxðb i Þ as long as any sublinear expansion exists. For practical purposes, we need to know what value of maxðb i Þ gives us minimal weight expansions on average.
In Fig. 2 , we plot the average number of point additions required to compute kP using a f; À 1g-expansion of k, i.e., the number of nonzero terms in the expansion plus maxðb i Þ À 1, as a function of maxðb i Þ. These average values were computed using 1,000 random values of k of each size. This graph illustrates that each size of k has an optimal value of maxðb i Þ ranging from 4 to 12. The precise value of maxðb i Þ to use corresponds to minimizing the overall time required to compute kP given a f; À 1g-expansion. As shown in Section 4.4, maxðb i Þ ¼ 3 turns out to be optimal for our FPGA implementation of point multiplication on E 1 ðIF 2 163 Þ.
In Table 3 , we list the average number of point additions required to compute kP on the five NIST-recommended Koblitz curves [4] when using -NAF, our greedy f; À 1g-expansion algorithm (Algorithm 2), and our blockingbased Algorithm 4 using block lengths of w ¼ 5; 10; 16 and maxðb i Þ ¼ 6. In all cases, the data are taken as the average over 500,000 random values of k. Our algorithm requires significantly fewer point additions than -NAF in all cases. In addition, the number of point additions required is within a small constant multiple of log k=ðlog log kÞ, for some constant depending on w, providing evidence in support of Conjecture 1. It is interesting to note that this constant appears to be small, between 3 and 1 according to our data, with smaller values occurring for larger table sizes. Unfortunately, a proof of this observation is currently beyond our reach. Affine coordinates, A, and Ló pez-Dahab coordinates, LD [18] , are used for representing points on E a ðIF 2 m Þ. In A, a point is represented with two coordinates as ðx; yÞ and, in LD, with three coordinates as ðX; Y ; ZÞ. The LD triple represents the point ðX=Z; Y =Z 2 Þ in A [18] . When P ¼ ðx; yÞ in A, ÀP ¼ ðx; x þ yÞ. Point addition in A is performed as presented, e.g., in [19] , and it costs I þ 2M þ S þ 8A, where I, M, S, and A denote inversion, multiplication, squaring, and addition in IF 2 m , respectively. Point addition in LD is performed as presented in [20] , and it requires 13M þ 4S þ 9A. A mixed coordinate point addition Q þ P , where Q is in LD and P in A, requires only 9M þ 5S þ 9A [21] . This cost reduces by M on Koblitz curves and by A if both P and ÀP are available. Because our implementation stores also ÀP , the cost is 8M þ 5S þ 8A. The cost would reduce by one further A if a ¼ 0. Point subtraction is simply a point addition with ÀP , i.e., Q À P ¼ Q þ ðÀP Þ. The A 7 ! LD mapping does not require any operations in IF 2 m while LD 7 ! A costs I þ 2M þ S. The cost of a Frobenius mapping is 3S in LD and 2S in A. An inversion in IF 2 m is computed using the ItohTsujii inversion requiring m À 1 squarings and blog 2 ðm À 1Þc þ H w ðm À 1Þ À 1 multiplications, where H w ðm À 1Þ is the Hamming weight of m À 1 [22] . Hence, I ¼ 9M þ 162S when m ¼ 163.
HARDWARE IMPLEMENTATION
Different coordinates are used in Algorithm 1 as follows: Point addition in A is used in computing P l so that mixed coordinate point addition can be used in S S AE P l computations (row 4 in Algorithm 1). Because the results of row multiplications are in LD, Q Q þ S is computed in LD.
The design in [10] was implemented in Xilinx Virtex-II, but here, we implemented the design in Altera Stratix II FPGAs [23] so that Altera Stratix II EP2S180 DSP Development Board [24] could be used for prototyping. The implementation includes a field arithmetic processor (FAP) for arithmetic in IF 2 m , control logic for controlling the FAP, and a converter for converting k to a f; À 1g-expansion. The FAP is considered in Section 4.1 and the control logic in Section 4.2. Section 4.3 discusses the conversion unit, i.e., the implementation of the blocking algorithm, Algorithm 4.
Field Arithmetic Processor (FAP)
The FAP includes a multiplier, a squarer, an adder, a storage element, and a control logic. A storage element for 163-bit elements of IF 2 163 is required in order to store points and temporary variables during computation of kP . As Stratix II offers embedded memory blocks that can be used without consuming logic resources, the storage element is implemented in M4K RAM blocks. One dual-port RAM can be configured into a 256 Â 18-bit mode in M4K. All 163 bits of an element must be accessed in parallel in the FAP architecture. Hence, 163 18 AE Ç ¼ 10 M4Ks are required. Write and read operations require both one clock cycle, i.e., W ¼ R ¼ 1.
The squarer is a shifter that performs A 163 is simply a bitwise exclusive-OR (XOR). Both addition and (successive) squarings are performed in one clock cycle, i.e., A ¼ S ¼ S Ã ¼ 1. Field multiplication is critical for the overall performance. The multiplier is a digit-serial implementation of the Massey-Omura multiplier [25] . In a bit-serial MasseyOmura multiplier, one bit of the output is calculated in one clock cycle, and hence, m cycles are required in total. One bit c i of the result C ¼ A Â B is computed from A and B by using an F -function. The F -function is field specific, and the same F is used for all output bits c i as follows: c i ¼ F ðA ( <i ; B ( <i Þ, where ( < i denotes cyclical left shift by i bits [25] .
In a digit-serial implementation, D bits are computed in parallel. Hence, m D
AE Ç
cycles are required in one multiplication. In this FAP, D ¼ 24. The F -function is pipelined in order to increase the maximum clock frequency by adding one register stage. As loading the operands into the shift registers requires one clock cycle and pipelining increases latency by one clock cycle, the latency is M ¼ 163 24
Control Logic
Logic controlling the FAP consists of a storage for k, a control finite state machine (FSM), and a ROM for control sequences. The implementation handles k in a coded form. The coding is performed using : fs; dg symbols, where s 2 f0; 0; 1; 1g and 0 d d max . 0 is a symbol reserved for a row change. Coding is started from the most significant bit of the first row and it proceeds as follows: s is the signed bit starting a symbol and d is the number of Frobenius mappings following s, i.e., the number of consecutive zeros plus one (the Frobenius mapping associated with the start bit of the next symbol). If the run of consecutive Frobenius [4] mappings is longer than d max , the run must be divided into two symbols and s ¼ 0, for the latter one. Each , with the exception of the row change symbol, transforms into an operation S d ðS þ sP Þ on E a ðIF 2 m Þ. Let ZðkÞ denote the maximum number of consecutive Frobenius mappings required by k. Then, the number of -symbols, e, required to represent k, is given by e ! H w k þ maxðb i Þ, with equality if and only if d max ! ZðkÞ.
The control FSM takes -symbols as input, and according to s and d of , it sets addresses of the control sequence ROM. The control sequences consist of successive FAP instructions directly controlling the FAP. There are control sequences for P lþ1 P l À P l (Frobenius mapping and point addition in A), point addition and subtraction (mixed coordinate point addition), Frobenius mapping, row change (point addition in LD), and LD 7 ! A mapping. They are all stored in a ROM implemented in an M4K block.
If implemented so that, for each operation, the operands would be first read from the memory, then the operation calculated and finally the result saved to the memory, the latency would be the latency of the operation (M, S, or A) plus two clock cycles (R þ W). The no-change read mode of Xilinx BlockRAMs was used in [10] for reducing the latency of control sequences. Stratix II devices do not support such read mode, but the latency can be reduced also with read-during-write mode supported by Stratix II memory blocks [23] . The reduction is, however, slightly smaller. When the control sequences were carefully handoptimized, different operations have the following latencies in clock cycles: the mixed coordinate point addition, L M ¼ 112, the Frobenius mapping, L F ¼ 7, row change (point addition in LD), L LD ¼ 167, the computation of P l , L Pl ¼ 171, and the LD 7 ! A mapping, L LD 7 ! A ¼ 145. The first point addition of each row is simply S AEP l and the first row change operation is given by Q S. Both of these operations have a latency of L C ¼ 6. In the beginning, an initialization is performed including, e.g., the transferring of P into the FAP. The latency of the initialization is L I ¼ 11. Thus, it follows that the latency of the kP computation becomes
and by assuming that d max ! ZðkÞ, i.e., e ¼ H w ðkÞ þ maxðb i Þ, (6) simplifies to
Conversion Unit
The conversion unit, which converts an integer k into a f; À 1g-expansion, is a straightforward implementation of Algorithm 4, our blocking-based method. The main part of this unit is an ALU, which has two integer multipliers, each of which makes use of one 18-bit Â 18-bit embedded multiplier to create 102-bit Â 102-bit products. The ALU also includes adders, shifters, and the rounding function required by the partial reduction algorithm [5] . The conversion unit uses the ALU and two intermediate registers for reducing every integer k to an equivalent r 0 þ r 1 , then gets the -adic expansion by a shift-and-add circuit, which produces one bit per cycle, from the least significant bit to the most significant bit.
For our implementation, we selected the block size w ¼ 10, so every 10 bits of the -adic expansion are used as an index into a lookup table. This table has one entry for each possible index ðb 9 b 8 . . . b 0 Þ, b i 2 f0; 1g, where each entry is the optimal f; À 1g-expansion of P 9 i¼0 b i i allowing a maximum exponent of 6 for À 1. At most three terms of the form AE a ð À 1Þ b are required for each representation, so each entry in the table consists of three tuples of the form ðd n ; i n ; j n Þ representing d n i n ð À 1Þ j n . Hence, each entry requires 27 bits and the whole lookup table requires a 27-Kbit RAM. Note that, according to the data in Section 3.2, using a block size of 5 would still give us a significant improvement over -NAF, and in this case, the table would require less than 1 Kbit.
Because integer operations are slower than the IF 2 m operations in the FAP, the conversion unit will be the bottleneck if the two units use the same clock. So dual-port RAMs are used in order to separate these units into different clock domains. The lookup results are written into the dual-port RAMs using one port, and the FAP will read them out from the other port later.
Results
The design was written in VHDL and implemented in an Altera Stratix II EP2S180F1020C3 FPGA [23] , which is on the Altera Stratix II EP2S180 DSP Development Board. The design was synthesized with Quartus II 6.0 SP1. In total, the design requires 8,799 Adaptive Logic Modules (ALMs) and 20 M512 and 36 M4K memory blocks including logic for interfacing and buffers separating clock domains. The FAP and its control logic require 6,084 ALMs and 16 M4Ks, and they operate at the maximum clock frequency of 152.28 MHz. The converter requires 1,947 ALMs, 8 M4Ks, and 4 DSP blocks and operates at 93.54 MHz. It takes 335 clock cycles, or 3.58 s, to convert one 163-bit integer.
Average timings of the design are presented in Table 4 . The latency L kP is given by (7) , and timings are calculated using a clock of 152 MHz. The time consumed in the conversion is neglected because the f; À 1g-expansion for the next kP can be computed simultaneously while the FAP processes the previous kP . Table 4 shows that the best performance is achieved when maxðb i Þ ¼ 3, which is smaller than estimated in Section 3.2, because the latencies of point additions differ. In Section 3.2, all point additions were assumed equal. The implementation on Xilinx Virtex-II computes kP in 35.75 s [10] , which shows that the longer latency caused by the lack of the no-change read mode in Stratix II is compensated by a higher clock frequency and the designs run slightly faster on Stratix II. Numerous publications considering implementation of elliptic curve cryptography on FPGAs have been published, e.g., in [26] , [27] , [28] , [29] , [30] , [31] , [32] , [33] , [34] , [35] , [36] , [37] , and [38] . To the authors' knowledge, the only published FPGA-based implementations using -NAF expansions were presented in [32] , [34] , and [35] , and they all use the NIST K-163 curve, E 1 ðIF 2 163 Þ. Okada et al. [35] published an implementation that computes a kP operation in 45.6 ms with an Altera EPF10K250AGC599-2, and Lutz and Hasan [34] presented a design that computes kP in 75 s on a Xilinx Virtex-E XCV2000E. Järvinen et al. [32] presented an implementation using parallel processing blocks that can compute up to 166,000 three-term point multiplications per second with an average computation time of 114.2 s on a Stratix II S180C3 FPGA. Comparing designs implemented on different FPGAs is difficult because it is hard to tell which portion of a performance difference is caused by different implementation platforms and which by some novel implementation technique. Anyhow, the design we have presented is faster than any of the above-mentioned designs. Moreover, it will be shown in Section 5 how our method can be further accelerated with parallel processing.
PARALLEL IMPLEMENTATION
This section presents an efficient method for reducing the latency of kP by using parallel processing. It is well known that reducing the latency of kP with parallelism is hard because of limited intrinsic parallelism in the operation if it is computed with traditional methods such as the binary double-and-add method. However, when k is a f; À 1g-integer, parallelism can be efficiently used for reducing the latency of our algorithm because rows of the table representing k can be computed in parallel.
Parallel computation is performed as follows: The implementation consists of p identical processing blocks (FAP and control logic), which can exchange data with each other. Processing blocks compute their subcomputations independently and the results of the subcomputations are combined with p À 1 point additions resulting in kP . The critical path consists of the most expensive subcomputation and dlog 2 pe point additions because parallelism can be utilized also in combining the subcomputations.
Let the area of a single processing block be A 0 . Because identical parallel processing blocks are used, the area of p processing blocks is approximately 1 pA 0 . Area efficiency of parallel implementations is measured via the latencyarea product R ¼ CA, where C is the critical path and A is the area. The best area efficiency is obtained with a setup that minimizes R.
The maximum number of parallel processing blocks is p max ¼ maxðb i Þ þ 1, i.e., the number of rows. Because the rows corresponding to ð À 1Þ l with small l have a higher average number of nonzero terms than the rows with large l as can be seen in Fig. 3 , using p max parallel processing blocks leads to an inefficient R although C is small. In practice, selecting p < p max and assigning rows to processing blocks by their computational cost leads to both small C and R.
When a row is computed, the base point P l ¼ ð À 1Þ l P needs to be computed first. If the base point is computed with sequential ð À 1ÞP computations, in total l point subtractions and Frobenius mappings are required. This is not a problem if p ¼ 1 because all rows are computed in the same processing block, but when p > 1, one would need to compute also the base points that are not needed in order to get P l with high l. For example, in order to get P 4 ¼ ð À 1Þ 4 P , four point subtractions and Frobenius mappings are required, i.e., P 1 , P 2 , and P 3 need to be computed, although they may not be needed. The cost of base point computations can be reduced by utilizing the fact that 
1. Some of the interface logic can be shared, but some extra logic is required in order to connect the processing blocks. Nonsequential formulas including only one point addition or subtraction cannot be found for l > 4, but (8)-(10) can be still used for reducing the computational cost. For example,
which requires only three point subtractions instead of 11. Generally, ð À 1Þ l P requires dl=4e point subtractions.
Similar equations cannot be found for f; À 1g-Kleinian integers and E 0 ðIF 2 m Þ, where ¼ 1 and 2 þ 2 ¼ À, but for f; þ 1g-Kleinian integers, similar formulas are given as follows:
Analogously, there are no formulas for f; þ 1g-Kleinian integers and E 1 ðIF 2 m Þ.
The assignment of rows is performed as presented in Algorithm 5, which produces a computation schedule including the indices of the rows. Cost of the row l is H w ðr l ðkÞÞ À 1 þ dl=4e, where H w ðr l ðkÞÞ is the number of nonzero terms in the row. An example of scheduling is presented in Fig. 4 . The example shows that, in that particular case, the critical path can be reduced from 36 À 1 þ 6 ¼ 41 to 14 À 1 þ dlog 2 3e ¼ 15 point additions by using three parallel processing blocks instead of one. 
Find the processing block with the smallest assigned cost (if more than one, select the one with the smallest index) 3:
Find the nonassigned row with the highest cost (if more than one, select the one with the smallest index) 4:
Assign the row to the processing block 5:
Add the cost of the row to the assigned cost of the block and remove the row from the list of nonassigned rows 6:
If the assigned row is not the first row computed in the block, add the cost of combining rows to the assigned cost of the block 7: end for It is not obvious which are the optimal choices for maxðb i Þ and p when C and R are both considered. Hence, Algorithm 5 was performed for 10,000 random 163-bit integers converted with Algorithm 4 ðw ¼ 10Þ and different maxðb i Þ and p. The resulted critical paths and latency-area products are shown in Table 5 . The critical path C is the number of point additions. All point additions are assumed to have a cost of one, reflecting the situation where only A is used for representing points on E a ðIF 2 m Þ. The best possible critical paths and latency-area products with different p (bold values in Table 5 ) are plotted in Fig. 5 , which clearly shows that the critical path shortens considerably when p > 1. However, major reductions are not achievable with p > 4 because the most expensive row becomes a bottleneck. Latency-area product always increases when p grows, but the increase is moderate when p 4. The most expensive row bounds the critical path, and as can be seen in Fig. 3 in Section 5, the cost of the most expensive row decreases very slowly when maxðb i Þ grows. Thus, the technique yields efficient results only with small p. This is not a major concern in practice as large p are often unreachable anyhow because of area constraints.
Hardware Implementation
The efficiency of the parallelization technique was studied in practice by designing an FPGA implementation based on the design described in Section 4. Because the design uses both A and LD , the analysis of optimal p and maxðb i Þ performed in Section 5 is not valid.
The design computes kP so that, when the integer is converted with the converter described in Section 4.3, the scheduling algorithm, Algorithm 5, is performed for the f; À 1g-expansion. It was implemented so that it assumes that point additions in A and LD cost 1.5 times as much as a mixed coordinate point addition. The correct values are 1.53 and 1.49, respectively, so the error of this assumption is negligible, but the implementation is easier and the result more area efficient. The scheduling increases the latency of the conversion by 16 clock cycles resulting in 351 clock cycles. After the conversions, the subcomputations are performed independently in parallel FAPs. The computations are The scheduling is performed for a 163-bit integer k ¼ 1902cd968f45b3c58932fdb63eea875b2884b3d35 x , which is converted using the blocking algorithm, Algorithm 4, with w ¼ 10 and maxðb i Þ ¼ 6. The most expensive computation is performed in the second processing block and it has a critical path of 14 point additions. The cost of the first processing block reduces by one because base point computations for P 5 and P 6 can be combined resulting in a reduction of one point subtraction.
performed as discussed in Section 4 with the exception that P i are computed using (8)- (10) . In the end, the results of the subcomputations are combined with point additions in A, each requiring 171 clock cycles.
Latencies for the above-described design with different p and maxðb i Þ are presented in Table 6 . Fig. 6 depicts the shortest latencies and the smallest latency-area products for all p. Comparisons of Tables 5 and 6 and Figs. 5 and 6 show that the parallelization technique gives similar results than in Section 5 also for our FPGA design. Also, in this case, p ¼ 4 is the highest number of parallel processing blocks that result in any significant reductions in latency. However, the values of maxðb i Þ that result in the shortest latencies differ from the analysis of Section 5. The speedups are also slightly smaller than in Section 5. Based on Table 6 , we selected the parameters p ¼ 4 and maxðb i Þ ¼ 6 for the FPGA implementation.
Results
The parallel implementation was also written in VHDL and synthesized for Stratix II S180C3 with Quartus II. The design requires 28,328 ALMs, which is 39 percent of the device resources, 52 M512s, and 66 M4Ks. The design operates with 152-and 93-MHz clocks similarly as the one processor implementation that was expected because the critical path determining the maximum clock frequency is the same in both cases. The average kP latency is 2,033 clock cycles as in Table 6 . Thus, a kP computation requires on average only 13.38 s. Conversions including schedulings require 3.77 s.
The speedup compared to the single processing block implementation presented in Section 4.4 is 35:04=13:38 ¼ 2:61 (excluding the conversion) or 38:62=17:15 ¼ 2:25 (including the conversion). The increase in area is 28;328=8;799 ¼ 3:21 (including the converter). This shows that very high speedups are achievable with moderate area increase by using parallel processing. Furthermore, the resulted implementation computing kP on average in 17.15 s (including the conversion) is the fastest published FPGA-based implementation, at least to the authors' knowledge.
CONCLUSIONS AND FURTHER WORK
Our results have demonstrated that f; À 1g-expansions lead to a competitive point multiplication algorithm for Koblitz curves when the base point P is not fixed. Nevertheless, there are a number of aspects we are continuing to explore.
Alternative choices of the bases, or even using three bases, may lead to further improvements. For example, using f; þ 1g may be advantageous in our implementation because computing ð þ 1Þ l P from ð þ 1Þ lÀ1 P requires a point addition as opposed to a point subtraction when using bases and À 1, and subtraction is slightly more expensive than addition. f; þ 1g-expansions may also be useful in parallel implementations because ð þ 1Þ l P can be computed efficiently on E 0 ðIF 2 m Þ but ð À 1Þ l P cannot.
Our point multiplication algorithm does not require any precomputations involving the base point P nor storage of additional points and, hence, is well suited to applications where P is random. We are investigating the possibility of generalizing window methods, using 2D windows, to our algorithm in order to obtain further improvements when precomputations involving P are permitted. Table 5 , with p 2 ½1; 6.
TABLE 5
Critical Paths and Latency-Area Products with p 2 ½1; 6 and maxðb i Þ 2 ½2; 10
It is also possible to augment our conversion algorithm using a sliding window analog, in which we slide the blocks along the -adic expansion of the scalar so that the loworder bit is always one. Experiments indicate that this does not significantly reduce the required number of point additions, but it does have the advantage that the size of the lookup table is cut in half.
Although our numerical data suggests that the density of the f; À 1g-expansions obtained by our conversion algorithm is sublinear in the bit length of k, we do not yet have a proof of this fact. In addition, our conversion algorithm requires a modest amount of storage. These precomputed quantities are independent of both the base point P and multiplier k and can be viewed as part of the domain parameters. Nevertheless, we continue to search for an efficient memory-free conversion algorithm.
TABLE 6
Critical Path and Latency-Area Product Estimates with p 2 ½1; 6 and maxðb i Þ 2 ½2; 10 for the FPGA Implementation Critical path is given as the number of clock cycles including cost of combining subcomputations, and latency-area product is R ¼ pC. The best values for all p are in bold. Fig. 6 . The (solid line) best critical path C and (dotted line) latency-area product R estimates for the FPGA implementation, i.e., the bold values in Table 6 , with p 2 ½1; 6.
Vassil S. Dimitrov received the PhD degree in applied mathematics from the Bulgarian Academy of Sciences in 1995. Since 1995, he has held postdoctoral positions at the University of Windsor, Ontario (1996-1998) and Helsinki University of Technology (1999 Technology ( -2000 . During the period 1998-1999, he worked as a research scientist for Cigital, Dulles, Virginia (formerly known as Reliable Software Technology), where he conducted research work on different cryptanalysis problems. Since 2001, he has been an associate professor in the Department of Electrical and Computer Engineering, University of Calgary, Alberta. His main research areas include implementation of cryptographic protocols, number-theoretic algorithms, computational complexity, image processing and compression, and related topics. . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
