Abstract. Scalar multiplication on Koblitz curves can be very efficient due to the elimination of point doublings. Modular reduction of scalars is commonly performed to reduce the length of expansions, and τ-adic Non-Adjacent Form (NAF) can be used to reduce the density. However, such modular reduction can be costly. An alternative to this approach is to use a random τ-adic NAF, but some cryptosystems (e.g. ECDSA) require both the integer and the scalar multiple. This paper presents an efficient method for computing integer equivalents of random τ-adic expansions. The hardware implications are explored, and an efficient hardware implementation is presented. The results suggest significant computational efficiency gains over previously documented methods.
Introduction
While compact keys and signatures occur naturally when using elliptic curves, the computational efficiency of elliptic curve cryptosystems is the subject of much research. Koblitz [1] showed that scalar multiplication can be done very fast on a certain family of binary curves now commonly referred to as Koblitz curves. In the same paper, Koblitz credited Hendrik Lenstra for first suggesting random base-τ expansions for key agreement protocols using Koblitz curves.
Meier and Staffelbach [2] showed how to significantly reduce the length of τ-adic expansions by performing modular reduction on scalars. Solinas [3, 4] later built on this idea and additionally reduced the weight by designing a τ-adic analogue of Non-Adjacent Form (NAF).
Unfortunately, performing such modular reduction can be costly. As future work, Solinas suggested a study of the distribution of random τ-adic NAFs. Lange and Shparlinski [5, 6] have studied the distribution of such expansions in depth.
For key agreement protocols like Diffie-Hellman [7], the integer equivalent of such a random τ-adic expansion is not needed. However, for ElGamal [8] type digital signatures like ECDSA [9], both the integer and the scalar multiple are needed to generate a signature. Lange [10] discussed many of the details of this approach, as well as a straightforward method for recovering the integer equivalent using a number of multiplications.
In this paper, we present a new method for recovering integer equivalents of random τ-adic expansions using only additions and one field multiplication. This method is shown to be very efficient and has significant hardware implications. A hardware implementation is also presented and studied in depth. The results are then compared to current similar methods in hardware.
Sec. 2 reviews background information on Koblitz curves and τ-adic expansions. Sec. 3 covers more recent research on random τ-adic expansions, as well as our new method for efficiently computing integer equivalents of such expansions. Sec. 4 presents an efficient hardware implementation of our new method on a field programmable gate array (FPGA), as well as a comparison to current methods. We conclude in Sec. 5.
Koblitz Curves
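Koblitz curves [1] are non-supersingular elliptic curves defined over $\mathbb{F}_2$, i.e.

$$E_a : y^2 + xy = x^3 + ax^2 + 1, \quad a \in \{0, 1\}. \qquad (1)$$

These curves have the nice property that if a point $P = (x, y)$ is on the curve, so is the point $(x^2, y^2)$. The map $\sigma : E_a(\mathbb{F}_{2^m}) \to E_a(\mathbb{F}_{2^m})$, $(x, y) \mapsto (x^2, y^2)$, called the Frobenius endomorphism, satisfies

$$\sigma^2(P) - \mu\sigma(P) + 2P = \mathcal{O}, \quad \mu = (-1)^{1-a}. \qquad (2)$$

The last equation also allows us to look at σ as a complex number τ and we can extend scalar multiplication to scalars $d_0 + d_1\tau \in \mathbb{Z}[\tau]$ by $(d_0 + d_1\tau)P = d_0 P + d_1\sigma(P)$.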
Higher powers of τ make sense as repeated applications of the endomorphism. Koblitz showed how to use the Frobenius endomorphism very efficiently in scalar multiplication: a scalar $n = d_0 + d_1\tau$ is expanded using (2) repeatedly, i.e. put $\epsilon_0 \equiv d_0 \pmod 2$ and replace $n$ by $(n - \epsilon_0)/\tau$ to compute $\epsilon_1$, etc. This iteration leads to a so-called τ-adic expansion of some length $\ell$ with coefficients $\epsilon_i \in \{0, 1\}$ such that $nP = \sum_{i=0}^{\ell-1} \epsilon_i \tau^i(P)$. As a result of representing integers as the sum of powers of τ, scalar multiplication can be accomplished with no point doublings by combining point additions and applications of σ. Koblitz noted in [1] that such base-τ expansions unfortunately have twice the length when compared to binary expansions. This leads to twice the number of point additions on average.
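The iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch and not code from [1]; the function names `tau_expansion` and `evaluate` are hypothetical. The division by τ uses the identity 2/τ = μ − τ, which follows from τ² = μτ − 2, and `evaluate` is a Horner-style inverse included only so that the result can be checked.

```python
def tau_expansion(d0, d1, mu):
    """Digits e_0, e_1, ... in {0, 1} with d0 + d1*tau = sum(e_i * tau**i).

    mu is +1 or -1, and tau satisfies tau**2 = mu*tau - 2.
    """
    digits = []
    while d0 != 0 or d1 != 0:
        e = d0 % 2              # tau divides d0 + d1*tau exactly when d0 is even
        digits.append(e)
        half = (d0 - e) // 2    # exact division, d0 - e is even
        # (d0 - e + d1*tau) / tau, using 2/tau = mu - tau
        d0, d1 = d1 + mu * half, -half
    return digits


def evaluate(digits, mu):
    """Horner evaluation of sum(e_i * tau**i), returned as the pair (d0, d1)."""
    d0, d1 = 0, 0
    for e in reversed(digits):
        # (d0 + d1*tau) * tau + e, reduced with tau**2 = mu*tau - 2
        d0, d1 = e - 2 * d1, d0 + mu * d1
    return d0, d1
```

For example, with μ = 1 the integer 3 expands to the digits 1, 1, 1, 1, 0, 1 (least significant first), i.e. 3 = 1 + τ + τ² + τ³ + τ⁵, which already shows the roughly doubled length compared to the binary expansion.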
Integer Equivalents
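To overcome this drawback, Meier and Staffelbach [2] made the following observation. Given any point $P \in E_a(\mathbb{F}_{2^m})$, it follows that

$$\sigma^m(P) = P, \quad\text{i.e.}\quad (\tau^m - 1)P = \mathcal{O}.$$

Two elements $\gamma, \rho \in \mathbb{Z}[\tau]$ with $\gamma \equiv \rho \pmod{\tau^m - 1}$ are said to be equivalent with respect to $P$ as γ multiples of $P$ can also be obtained using the element ρ since for some κ

$$\gamma P = \rho P + \kappa(\tau^m - 1)P = \rho P + \kappa\mathcal{O} = \rho P.$$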
While this relation holds for all points on the curve, cryptographic operations are often limited to the main subgroup; that is, the points of large prime order. Solinas [4] further improved on this. The small subgroup can be excluded since $(\tau - 1)$ divides $(\tau^m - 1)$, and for multiples of points in the main subgroup scalars can be reduced modulo $\delta = (\tau^m - 1)/(\tau - 1)$ to further reduce the length to a maximum of $m + a$. For computational reasons, it is often convenient to have the form $\delta = c_0 + c_1\tau$ as well. For reference, the procedure for performing such modular reduction is presented as Algorithm 1, which calculates $\rho'$ such that the probability that $\rho' \neq \rho$ holds is less than $2^{-(C-5)}$. This probabilistic approach is used to avoid costly rational divisions, trading them for a number of multiplications.
Solinas [4] also developed a τ-adic analogue of Non-Adjacent Form (NAF). Signed representations are used as point subtraction has roughly the same cost as point addition; NAF guarantees that no two adjacent coefficients are non-zero. By reducing elements of $\mathbb{Z}[\tau]$ modulo δ and using τ-adic NAF, such expansions have roughly the same length and average density (1/3 when $\epsilon_i \in \{0, 1, -1\}$) as normal NAFs of the same integers.
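To make the non-adjacency property concrete, the following Python sketch follows the well-known digit-selection rule for the τ-adic NAF: when the constant term is odd, the signed digit is chosen so that the next digit is forced to zero. It is a sketch under stated assumptions: the names `tnaf` and `evaluate` are hypothetical, and no reduction modulo δ is performed, so the expansions are about twice as long as reduced ones.

```python
def tnaf(d0, d1, mu):
    """tau-adic NAF digits (least significant first) of d0 + d1*tau.

    Digits lie in {-1, 0, 1}; mu is +1 or -1 with tau**2 = mu*tau - 2.
    The input is NOT reduced modulo delta in this sketch.
    """
    digits = []
    while d0 != 0 or d1 != 0:
        if d0 % 2:
            # +1 or -1, chosen so that the following digit becomes 0
            u = 2 - ((d0 - 2 * d1) % 4)
            d0 -= u
        else:
            u = 0
        digits.append(u)
        half = d0 // 2          # exact, d0 is even here
        # divide by tau using 2/tau = mu - tau
        d0, d1 = d1 + mu * half, -half
    return digits


def evaluate(digits, mu):
    """Horner evaluation of sum(e_i * tau**i), returned as the pair (d0, d1)."""
    d0, d1 = 0, 0
    for e in reversed(digits):
        d0, d1 = e - 2 * d1, d0 + mu * d1
    return d0, d1
```

With μ = 1, `tnaf(3, 0, 1)` gives the digits −1, 0, 1, 0, 0, 1, i.e. 3 = −1 + τ² + τ⁵, with three non-zero digits instead of the five produced by the unsigned expansion of the same integer.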
Algorithm 2. Rounding off in Z[τ].
Input: Rational numbers λ0 and λ1.
Output: Integers q0, q1 such that q0 + q1τ is close to λ0 + λ1τ.
Random τ -adic Expansions
Instead of performing such non-trivial modular reduction, Solinas [4] suggested (an adaptation of an idea credited to Hendrik Lenstra in Koblitz's paper [1]) producing a random τ-adic NAF; that is, to build an expansion by generating digits with Pr(−1) = 1/4, Pr(1) = 1/4, Pr(0) = 1/2, and following each non-zero digit with a zero. Lange and Shparlinski [5] proved that such expansions with length $\ell = m - 1$ are well-distributed and virtually collision-free.
This gives an efficient way of obtaining random multiples of a point (for example, a generator). For some cryptosystems (e.g. Diffie-Hellman [7]), the integer equivalent of the random τ-adic NAF is not needed. The expansion is simply applied to the generator, then to the other party's public point. However, for generating digital signatures (e.g. ECDSA [9]), the equivalent integer is needed.
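The digit-generation rule described above can be sketched as follows. The function name `random_tnaf` is hypothetical, the order in which digits are drawn is an assumption, and so is the handling of the final position (a non-zero digit in the last position is kept without a trailing zero); the text does not fix these details.

```python
import secrets


def random_tnaf(length):
    """Random tau-adic NAF with `length` digits in {-1, 0, 1}.

    Each free position is 0 with probability 1/2 and +1 or -1 with
    probability 1/4 each; every non-zero digit is followed by a zero.
    """
    digits = []
    while len(digits) < length:
        r = secrets.randbelow(4)        # two uniform random bits
        if r < 2:
            digits.append(0)
        else:
            digits.append(1 if r == 2 else -1)
            if len(digits) < length:    # assumption: no padding past the end
                digits.append(0)
    return digits
```

For K-163, for example, `random_tnaf(162)` would produce an expansion of the length m − 1 considered in [5].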
Lange [10] covered much of the theory of this approach, as well as a method for recovering the integer. Given a generator $G$ of prime order $r$ and a group automorphism σ, there is a unique integer $s$ modulo $r$ which satisfies $\sigma(G) = sG$, and $s$ (fixed per-curve) is a root modulo $r$ of the characteristic polynomial $T^2 - \mu T + 2$ of σ. We note that $s$ also satisfies $s = (-c_0)(c_1)^{-1} \pmod r$. Values of $s$ for standard Koblitz curves are listed in Table 1 for convenience.
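The relation s = (−c0)(c1)^−1 (mod r) can be checked on a toy example. The sketch below computes δ = c0 + c1τ as the geometric sum 1 + τ + ... + τ^(m−1) and assumes that r equals the norm of δ, which requires that norm to be prime. The parameters m = 5, a = 1 (so μ = 1) are a hypothetical toy case chosen for readability, not a standard curve, and the function name `delta_and_s` is likewise hypothetical.

```python
def delta_and_s(m, mu):
    """Return (c0, c1, r, s) for delta = (tau**m - 1)/(tau - 1) = c0 + c1*tau.

    Assumes r = norm(delta) is the prime order of the main subgroup.
    """
    c0, c1 = 0, 0
    p0, p1 = 1, 0                        # tau**0 as p0 + p1*tau
    for _ in range(m):
        c0, c1 = c0 + p0, c1 + p1        # accumulate 1 + tau + ... + tau**(m-1)
        p0, p1 = -2 * p1, p0 + mu * p1   # multiply by tau, tau**2 = mu*tau - 2
    r = c0 * c0 + mu * c0 * c1 + 2 * c1 * c1   # norm of c0 + c1*tau
    s = (-c0 * pow(c1 % r, -1, r)) % r         # modular inverse needs Python 3.8+
    return c0, c1, r, s
```

For the toy case this gives δ = −1 − 2τ, r = 11 and s = 5, and indeed 5² − 5 + 2 = 22 ≡ 0 (mod 11), so s is a root of the characteristic polynomial modulo r.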
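Given some τ-adic expansion $\epsilon = \sum_{i=0}^{\ell-1} \epsilon_i\tau^i$, the integer equivalent can be recovered as $\sum_{i=0}^{\ell-1} \epsilon_i s^i$ using at most $\ell - 2$ multiplications and some additions for non-zero coefficient values, all modulo $r$ [10]. Our approach improves on these computation requirements to recover the integer equivalent.

Table 1. Values of s for standard Koblitz curves (hexadecimal).

Curve  a  s
K-163  1  00000003 81afd9e3 493dccbf c2faf1d2 84e6d34e bd67a6da
K-233  0  00000060 6590ef0a 0a0abf8d 755a2be3 1f5449df ff5b4307 33472d49 10444625
K-283  0  00d5d05a 1b6c5ace e76b8ee3 f925a572 19bcb952 12945154 588d0415 a5b4bb50 57f69216
K-409  0  0024ef90 54eb3a6c f4bdc6ed 021f6e5c b8da0c79 5f913c52 ebaa9239 8d1b7d3d 0adb8a34 add81800 acf7e302 a7d25095 1701d7a4
K-571  0  01cc6c27 e62f3e0d df5ea7eb 1ab1cc4d 0da631c0 d70a969a a14b0350 85b31511 f5a97455 20cba528 e2d1e647 f4f708d3 9fba0c3b e4e35543 821344d1 662727bd 2d59dbc0 5e6853b1

Solinas [4] showed that given the recurrence relation

$$U_0 = 0, \quad U_1 = 1, \quad U_{i+1} = \mu U_i - 2U_{i-1} \quad\text{for } i \geq 1,$$

it holds that

$$\tau^i = U_i\tau - 2U_{i-1} \quad\text{for all } i > 0.$$

Solinas noted that this equation can be used for computing the order of the curve. In addition to the aforementioned application, it is clear that this equation can also be directly applied to compute the equivalent element

$$d_0 + d_1\tau = \sum_{i=0}^{\ell-1} \epsilon_i\tau^i \in \mathbb{Z}[\tau], \quad d_0 = \epsilon_0 - 2\sum_{i=1}^{\ell-1} \epsilon_i U_{i-1}, \quad d_1 = \sum_{i=1}^{\ell-1} \epsilon_i U_i.$$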
We present an efficient algorithm to compute $d_0$ and $d_1$ as Algorithm 3. Note that the values $d_0$ and $d_1$ are built up in a right-to-left fashion.

Algorithm 3. Integer equivalents of τ-adic expansions.
Input: ℓ-bit τ-adic expansion ε, curve constants r, s, μ
Output: Integer equivalent n
d0 ← ε0, d1 ← ε1
j ← 1, k ← 0 /* U1 and U0 */
for i = 2 to ℓ − 1 do
  u ← μj − 2k /* Ui */
  d0 ← d0 − 2jεi
  d1 ← d1 + uεi
  k ← j, j ← u /* setup for next round */
end
n ← d0 + d1s mod r /* integer equivalent */
return n

Clearly it makes sense to generate the fixed U sequence right-to-left. So if the coefficients $\epsilon_i$ are generated left-to-right and are not stored, then U should be precomputed, stored, and given as input to Algorithm 3, which is then modified to run left-to-right. The storage of ε is small; indeed it can be stored in $\ell$ bits if zeros inserted to preserve non-adjacency are omitted. Either way, the choice of which direction to implement Algorithm 3 is dependent on many factors, including storage capacity and direction of the scalar multiplication algorithm. In any case, the main advantage we are concerned with in this paper is the ability to compute the integer equivalent in parallel to the scalar multiple, and hence in our implementations ε is assumed to be stored. Thus, we omit any analysis of generating coefficients of ε one at a time, and concentrate on the scenario of having two separate devices, each with access to the coefficients of ε: one for computing the integer equivalent, and one for computing the scalar multiple.
As s is a per-curve constant, it is clear that the integer equivalent n can be computed using one field multiplication and one field addition. This excludes the cost of building up d 0 + d 1 τ from the τ -adic expansion, done as shown in Algorithm 3 using the U sequence with only additions.
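A software sketch may help to compare the two ways of recovering the integer. The first function builds up d0 and d1 with additions only, using the recurrence U_{i+1} = μU_i − 2U_{i−1} with U_0 = 0, U_1 = 1, and finishes with a single modular multiplication by s. The second evaluates the expansion at s with one multiplication per digit. The function names are hypothetical, and the parameters used in the usage note below (r = 11, s = 5, μ = 1) are a toy example in which s is a root of T² − T + 2 modulo r, not a standard curve.

```python
def integer_equivalent(eps, r, s, mu):
    """Integer equivalent of sum(eps[i] * tau**i) via d0 + d1*s mod r.

    eps holds digits in {-1, 0, 1}, least significant first. Because the
    digits are that small, the loop body needs only additions and shifts.
    """
    d0 = eps[0] if len(eps) > 0 else 0
    d1 = eps[1] if len(eps) > 1 else 0
    u_prev, u_cur = 0, 1                     # U_0, U_1
    for e in eps[2:]:
        u_next = mu * u_cur - 2 * u_prev     # U_i
        d0 -= 2 * u_cur * e                  # tau**i contributes -2*U_{i-1}
        d1 += u_next * e                     # and U_i to the tau coefficient
        u_prev, u_cur = u_cur, u_next        # setup for next round
    return (d0 + d1 * s) % r                 # the single multiplication


def integer_equivalent_powers(eps, r, s):
    """Reference method: evaluate the expansion at s, one multiplication per digit."""
    n = 0
    for e in reversed(eps):
        n = (n * s + e) % r
    return n
```

With the toy parameters, the expansion 1 − τ² (digits 1, 0, −1, 0) gives d0 = 3, d1 = −1 and the integer equivalent 3 − 5 ≡ 9 (mod 11); the power-based method returns the same value.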
For the sake of simplicity, only width-2 τ-adic NAFs are considered here (all $\epsilon_i \in \{0, 1, -1\}$). Since generators are fixed and precomputation can be done offline, it is natural to consider arbitrary window width as well; we defer this to future work.
Following this reasoning, our method can be summarized as follows. 
Hardware Implementation
If scalar multiplication is computed with a random integer, the integer is typically first reduced modulo δ. The τ-adic NAF then usually needs to be computed before scalar multiplication because the algorithms for producing the τ-adic NAF presented in [4] produce it from right to left, whereas scalar multiplication is typically more efficient when computed from left to right. In either case, the end-to-end computation time is $T_\tau + T_{sm}$ where $T_\tau$ is the conversion time (including reducing modulo δ and generating the τ-adic NAF) and $T_{sm}$ is the scalar multiplication time. However, when scalar multiplication is computed on hardware with a random τ-adic NAF of length $\ell$, the calculation of an integer equivalent can be performed simultaneously with scalar multiplication (assuming separate dedicated hardware), thus resulting in an end-to-end time of only $\max(T_\tau, T_{sm}) = T_{sm}$ with the reasonable assumption that $T_{sm} > T_\tau$. In this case, $T_\tau$ denotes the time needed to generate a random τ-adic NAF and calculate the integer equivalent. We assume storage for the coefficients $\epsilon_i$ exists. This parallelization implies that our method is especially well-suited for hardware implementations.
An FPGA design was implemented in order to investigate the practical feasibility of our method on hardware. The implementation consists of two adders, two comparators, a U sequence block and certain control logic. The structure of the implementation is shown in Fig. 1. An integer equivalent is computed so that, first, $d_0$ and $d_1$ are built up using the U sequence and, second, $n = d_0 + d_1 s$ is calculated as shown in Sec. 3. The design operates in two modes.
The first mode computes $d_0$ and $d_1$ in parallel using the adders. A τ-adic expansion is input into the design in serial starting from $\epsilon_0$ and the U sequence is either read from ROM or computed on-the-fly. As shown in Sec. 3, $U_i$ can be directly applied in computing $d_1$ by simply adding $U_i$ to or subtracting it from $d_1$ according to $\epsilon_i$. In the $d_0$ calculation, $2U_{i-1}$ is obtained by shifting $U_i$ to the left and delaying it by one clock cycle. Because the least significant bit (LSB) $\epsilon_0$ is handled similarly, an additional value $U_{-1} = -0.5$ is introduced into the U sequence in order to get $d_0 = \epsilon_0$.
If the U sequence is precomputed and stored in ROM, the required size of the ROM depends on $m$. The depth of the ROM is $m$ and the width is determined by the longest $U_i$ in the sequence. ROM sizes for the NIST curves [9] are listed in Table 2. It makes sense not to reduce the sequence U modulo $r$, as $r > 1 + 2\sum_{i=0}^{m-3} |U_i|$ and hence U modulo $r$ requires significantly more storage space than U alone. It is easy to check that all $r$ listed in [9] fulfill this condition.
If the amount of memory is an issue, the U sequence can be computed on-the-fly by using an adder as shown in Algorithm 3. This implies that extra storage for the coefficients $\epsilon_i$ is also needed for performing the scalar multiplication simultaneously. The width of the adder is also determined by the longest $U_i$. As shown in Table 2, the size of the ROM grows rapidly with $m$ and in practice ROMs can be used only when $m$ is small. However, the on-the-fly computation is also a viable approach for large $m$.
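The data flow of the first mode can be modelled behaviourally in software. The sketch below is not the VHDL design and does not model timing or bit widths; the name `first_mode` is hypothetical. One loop iteration stands for one digit entering the datapath, `twice_prev` plays the role of the shifted and delayed value 2U_{i−1}, and its initial value −1 corresponds to the extra entry U_{−1} = −0.5. The U sequence is generated on-the-fly with U_{i+1} = μU_i − 2U_{i−1}.

```python
def first_mode(eps, mu):
    """Behavioural model of the first mode: returns (d0, d1).

    eps holds digits in {-1, 0, 1}, least significant first, so each
    product below is really an add, a subtract, or nothing.
    """
    d0, d1 = 0, 0
    u = 0               # U_0
    twice_prev = -1     # 2*U_{-1}, i.e. U_{-1} = -0.5
    for e in eps:
        d1 += e * u                 # add or subtract U_i
        d0 -= e * twice_prev        # add or subtract 2*U_{i-1}
        # next U via the recurrence; 2*U_i is a left shift of U_i
        u, twice_prev = mu * u - twice_prev, u << 1
    return d0, d1
```

For instance, with μ = 1 the digits 1, 0, −1, 0 yield (d0, d1) = (3, −1), matching 1 − τ² = 3 − τ, and the first iteration alone gives d0 = ε0 as intended by the choice of U_{−1}.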
The first mode requires $\ell + 1$ clock cycles. Because both phases in the second mode require one clock cycle per bit of $s$ and the length of $s$ is $m$ bits, the latency of the second mode is $2m$ clock cycles. Thus, the latency of a conversion is exactly $\ell + 2m + 1$ clock cycles where $\ell$ is the length of the τ-adic expansion. As $\ell = m - 1$, the conversion requires $3m$ clock cycles. The design is inherently resistant against side-channel attacks based on timing because its latency is constant.
If the U sequence is stored in ROM, it would be possible to reduce the latency by on average $\frac{2}{3}m$ clock cycles by skipping all zeros in random τ-adic NAFs in the first mode. This could also be helpful in thwarting side-channel attacks based on power or electromagnetic measurements. Unfortunately, the latency would no longer be constant, making the design potentially insecure against timing attacks. Reductions based on zero skipping are not possible if the U sequence is computed on-the-fly because that computation always requires $m$ clock cycles.
Results
The design presented in Sec. 4 was written in VHDL and simulated in ModelSim SE 6.1b. To the best of our knowledge, the only published hardware implementation of the integer to τ-adic NAF conversion was presented in [11] where the converter was implemented on an Altera Stratix II EP2S60F1020C4 FPGA. In order to ensure fair comparison, we synthesized our design for the same device using Altera Quartus II 6.0 SP1 design software. Two curve-specific variations of the design were implemented for the NIST curves K-163 and K-233.
The K-163 design with ROM requires 929 ALUTs (Adaptive Look-Up Tables) and 599 registers in 507 ALMs (Adaptive Logic Modules) and 13,529 bits of ROM which were implemented by using 6 M4K memory blocks. The maximum clock frequency is 56.14 MHz which yields a conversion time of 8.7 μs. The K-233 design with ROM has the following characteristics: 1,311 ALUTs and 838 registers in 713 ALMs, 27,612 memory bits in 7 M4Ks and 42.67 MHz resulting in 16.4 μs. Implementations where the U sequence is computed on-the-fly require 1,057 and 1,637 ALUTs and 654 and 934 registers in 592 and 894 ALMs for K-163 and K-233, respectively. They operate at maximum clock frequencies of 55.17 and 43.94 MHz, resulting in computation times of 8.9 μs and 15.9 μs. The differences in computation times compared to the ROM-based implementations are caused by small variations in place-and-route results which yield slightly different maximum clock frequencies. The latencies in clock cycles are the same for both ROM-based and memory-free implementations, i.e. 489 for K-163 and 699 for K-233. As the required resources are small and the conversion times are much shorter than any reported scalar multiplication time, our method is clearly suitable for hardware implementations.
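The reported conversion times follow directly from the cycle count of Sec. 4 and the maximum clock frequencies. The short sketch below reproduces this arithmetic; the function names are hypothetical and the frequencies are simply the values quoted above.

```python
def conversion_cycles(m):
    """Latency in clock cycles: (ell + 1) for the first mode plus 2m for the second."""
    ell = m - 1                  # length of the random tau-adic NAF
    return (ell + 1) + 2 * m     # equals 3m


def conversion_time_us(m, fmax_mhz):
    """Conversion time in microseconds at clock frequency fmax_mhz (MHz)."""
    return conversion_cycles(m) / fmax_mhz
```

For K-163 this gives 489 cycles, i.e. about 8.7 μs at 56.14 MHz and 8.9 μs at 55.17 MHz; for K-233 it gives 699 cycles, i.e. about 16.4 μs at 42.67 MHz and 15.9 μs at 43.94 MHz.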
Comparisons
The implementation presented in [11] computes the τ-adic NAF mod $(\tau^m - 1)$ so that it first converts the integer to τ-adic NAF with an algorithm from [3], then reduces it modulo $(\tau^m - 1)$ and finally reconstructs the NAF which was lost in the reduction. This was claimed to be more efficient in terms of required resources than the reductions modulo δ presented in [3], which are problematic on hardware because they require several multiplications and hence either a lot of resources or computation time.
Table 3 summarizes the implementations presented here and in [11]. Comparing the implementations is straightforward because the FPGAs are the same. It should be noted that the converter in [11] has a wider scope of possible applications since our approach is only for taking random multiples of a point (this is not useful in signature verifications, for example). Obviously, the computation of an integer equivalent can be performed with fewer resources. The reductions are 35 % in ALUTs and 39 % in registers for K-163 and 27 % and 30 % for K-233 when the U sequence is stored in the ROM. However, it should be noted that the implementations presented in [11] do not require such additional memory. When the U sequence is computed with logic and no ROM is needed, the reductions in ALUTs and registers are 26 % and 34 % for K-163 and 9 % and 22 % for K-233, so our converter is more compact in this case as well.
The average latencies of both converters are approximately the same. The difference is that the latency of our converter is always exactly 489 or 699 clock cycles whereas the converter in [11] has an average latency of 491 or 701 clock cycles for K-163 and K-233, respectively. The maximum clock frequencies of our converters are lower and, thus, the implementations of [11] can compute conversions faster. However, an integer equivalent can be computed in parallel with scalar multiplication and, thus, it can be claimed that the effective conversion time is 0 μs.
To support the argument that the effective elimination of the conversion time T τ is significant, there are several implementations existing in the literature computing scalar multiplications on K-163 in less than 100 μs. For example, [12] reports a time of 44.8 μs, and [13] a time of 75 μs. Hence conversions requiring several μs are obviously significant when considering the overall time.
To summarize, computing the integer equivalent of a random τ-adic expansion offers the following two major advantages from the hardware implementation point-of-view compared to computing the τ-adic NAF of a random integer:
- Conversions can be computed in parallel with scalar multiplications.
- Computing the integer equivalent can be implemented with fewer resources.
As a downside, the calculation of an integer equivalent has a longer latency; however, this is insignificant since the conversion is not on the critical path.
Conclusion
As shown, our new method for computing integer equivalents of random τ -adic expansions is very computationally efficient. This has been demonstrated with an implementation in hardware, where the parallelization of computing the integer equivalent and the scalar multiple yields significant efficiency gains. It seems unlikely that such gains are possible with this approach in software.
Future Work
Side-channel attacks based on timing, power, or electromagnetic measurements are a serious threat to many implementations; not only on smart cards, but on FPGAs [14] as well. Our converter provides inherent resistance against timing attacks because its latency is constant. Side-channel countermeasures against other attacks are beyond the scope of this paper. However, before the suggested implementation can be introduced in any practical application where these attacks are viable, it must be protected against such attacks. This will be an important research topic in the future.
As mentioned, only width-2 τ-adic NAFs have been considered here (all $\epsilon_i \in \{0, 1, -1\}$). Arbitrary window width would clearly be more efficient for the scalar multiplication. We are currently researching efficient methods for scanning multiple bits at once, as well as simple alternatives to using the U sequence.
