Abstract-In this paper, we discuss conversions between integers and τ-adic expansions and we provide efficient algorithms and hardware architectures for these conversions. The results have significance in elliptic curve cryptography using Koblitz curves, a family of elliptic curves offering faster computation than general elliptic curves. However, in order to enable these faster computations, scalars need to be reduced and represented using a special base-τ expansion. Hence, efficient conversion algorithms and implementations are necessary. Existing conversion algorithms require several complicated operations, such as multiprecision multiplications and computations with large rationals, resulting in slow and large implementations in hardware and microcontrollers with limited instruction sets. Our algorithms are designed to utilize only simple operations, such as additions and shifts, which are easily implementable on practically all platforms. We demonstrate the practicability of the new algorithms by implementing them on Altera Stratix II FPGAs. The implementations considerably improve both computation speed and required area compared to the existing solutions.
INTRODUCTION
Elliptic curve cryptography, introduced independently by Koblitz [1] and Miller [2] in 1985, has matured into a widely accepted method for public-key cryptography and it is deployed in numerous practical applications. Scalar multiplication, $kP$, where $k$ is an integer and $P$ is a point on an elliptic curve, is the cornerstone of every elliptic curve cryptosystem. It is also the most demanding operation computationally and many papers have discussed its efficient computation. Koblitz curves [3] are an attractive class of elliptic curves because they offer faster scalar multiplications. They are widely used in practical cryptosystems and listed in major standards issued by NIST [4] and SECG [5], [6]. However, in order to utilize the potential of Koblitz curves, $k$ must be given as a so-called τ-adic expansion, and conversions can be costly. They require operations which are considered awkward on many platforms, e.g., multiprecision multiplications and operations with large rationals [7].
Accelerating scalar multiplications with dedicated hardware has been extensively studied in the literature; see, e.g., [8] for a review. However, despite their importance, conversions are often overlooked in papers discussing hardware implementations using Koblitz curves. Traditionally, only scalar multiplications were delegated to a hardware accelerator, while a host processor took care of the conversions and sent the resulting τ-adic expansions to the accelerator; see, e.g., [9], [10]. Recent publications, such as [11], [12], have shown this approach to be impractical by introducing Field-Programmable Gate Array (FPGA) based accelerators which compute scalar multiplications in a few microseconds. Because software conversions are much slower, the conversions should also be delegated to hardware in order to prevent them from becoming the bottleneck.
Hardware implementation of conversions has, however, gained only limited interest. Prior to this paper, converters have been presented in [13], [14], [15], [16]. A converter for integer to τ-adic nonadjacent form (τNAF) conversions was presented in [13], whereas [14] described an algorithm and a converter for the conversion in the other direction, i.e., from a τNAF to an integer, which is useful in certain elliptic curve cryptosystems. A converter for integer to double-base τ-adic expansion conversions has been presented in [15], [16]. Even these converters are too slow to prevent conversions from becoming the bottleneck in certain cases [11]. The converters are also relatively large, especially the one presented in [15], [16], which reduces their feasibility.
On the software side, current solutions [17] , [7] for such conversions are not necessarily any less awkward to implement. In particular, they require large integer multiplications (or optimally, divisions) as well as computations in the rationals. The feasibility of implementing such an algorithm is highly platform dependent; while this would be of little consequence on modern x86 architectures, implementing the current solutions on an embedded system with limited memory and a small instruction set might not even be possible. For example, the Atmel tinyAVR family [18] has an extremely limited instruction set, lacking integer multiplication; other platforms have similar restrictions, not just with respect to the instruction set but memory constraints as well.
In this paper, we improve the existing conversion algorithms and present efficient hardware architectures for them. In particular, our algorithms can be implemented without any complex operations and they are, therefore, suitable also for platforms where existing algorithms are problematic. We demonstrate the feasibility of our algorithms and converter architectures with implementations on Altera Stratix II FPGAs. They are shown to be both smaller and faster than those currently available in the literature. As a consequence, our results have significance especially for high-speed hardware accelerators using Koblitz curves, but also for applications requiring lower resource utilization.
The remainder of the paper is structured as follows: Section 2 introduces the fundamentals of elliptic curve cryptography and Koblitz curves and reviews existing work on conversions. In Section 3, we describe our new algorithms. Their hardware implementations are discussed in Section 4. We prove the efficiency of the algorithms and implementations by providing the implementation results with comparisons to existing works in Section 5. We end with conclusions in Section 6.
PRELIMINARIES
The points on an elliptic curve form an additive abelian group with an identity element, $\mathcal{O}$, called the point at infinity. We consider elliptic curves over finite fields with characteristic two, $\mathbb{F}_{2^m}$, commonly called binary curves, and we denote the group by $E(\mathbb{F}_{2^m})$. The group operation, $P_1 + P_2$, is referred to as point addition if $P_1 \neq P_2$ and point doubling if $P_1 = P_2$. Scalar multiplication, $kP$, is typically computed via the "double-and-add" algorithm, analogous to the well-known "square-and-multiply" exponentiation; i.e., each bit, $k_i$, in the binary expansion of $k$ results in a point doubling, and a point addition is required if $k_i = 1$. Hence, scalar multiplication with an $\ell$-bit scalar costs $\ell$ point doublings and $\ell/2$ point additions, on average.
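The double-and-add procedure described above can be sketched in Python as follows. This is a minimal illustration rather than an implementation from the paper: the function name `double_and_add` and its callback parameters are hypothetical, and the usage lines substitute the additive group of integers modulo `n` for an elliptic curve group so that the sketch stays self-contained.

```python
def double_and_add(k, point, add, double, identity):
    """Left-to-right double-and-add; returns (k*point, #doublings, #additions).

    `add`, `double`, and `identity` describe an abstract additive group and
    are placeholders for real elliptic curve point arithmetic.
    """
    result = identity
    doublings = additions = 0
    for bit in bin(k)[2:]:          # most significant bit first
        result = double(result)     # one doubling per bit of k
        doublings += 1
        if bit == "1":              # one addition per nonzero bit
            result = add(result, point)
            additions += 1
    return result, doublings, additions


# Stand-in group (an assumption for illustration): integers modulo n.
n = 1009
add_mod = lambda a, b: (a + b) % n
double_mod = lambda a: (2 * a) % n

# 45 = 0b101101 has 6 bits, 4 of which are nonzero.
res, dbl, adds = double_and_add(45, 7, add_mod, double_mod, 0)
```

The operation counts returned by the sketch mirror the cost model in the text: one doubling per bit and one addition per nonzero bit.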
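Koblitz curves [3] are binary curves of the form

$$E_a: y^2 + xy = x^3 + ax^2 + 1, \quad a \in \{0, 1\}. \tag{1}$$

We denote $\#E_a(\mathbb{F}_{2^m}) = fr$ with small cofactor $f = 2^{2-a} \in \{2, 4\}$ and large prime $r$, the order of the main subgroup. Not all such curves satisfy $r$ prime, but a sufficient number of curves do and appear in a variety of standards; we limit our discussion to such curves. These curves are of cryptographic interest as they admit an efficient endomorphism called the Frobenius $\tau$:

$$\tau: E_a(\mathbb{F}_{2^m}) \to E_a(\mathbb{F}_{2^m}), \quad (x, y) \mapsto (x^2, y^2), \quad \mathcal{O} \mapsto \mathcal{O}. \tag{2}$$

From the point addition formula, it can be shown that $\tau$ satisfies the characteristic polynomial

$$(\tau^2 + 2)P = \mu\tau P, \quad \mu = (-1)^{1-a}. \tag{3}$$

Denoting $\tau$ as a complex root of (3), the Frobenius carries out complex multiplication by the complex number

$$\tau = \frac{\mu + \sqrt{-7}}{2}.$$

An element $d_0 + d_1\tau \in \mathbb{Z}[\tau]$ can be divided by $\tau$ with a remainder $\epsilon_0 \in \{0, 1\}$; the quotient is then divided by $\tau$ again to compute $\epsilon_1$ and so on. This iteration leads to a so-called τ-adic expansion with coefficients $\epsilon_i \in \{0, 1\}$:

$$k = \sum_{i=0}^{\ell-1} \epsilon_i \tau^i. \tag{4}$$

This allows scalar multiplication to be accomplished without point doublings:

$$kP = \sum_{i=0}^{\ell-1} \epsilon_i \tau^i(P). \tag{5}$$

Applying the Frobenius is significantly cheaper than point doubling because squaring is a linear operation in $\mathbb{F}_{2^m}$; hence, this is extremely computationally efficient. In fact, the cost is considered negligible, as with a normal basis representation it requires only a simple rotation of bits.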
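The repeated division by $\tau$ that produces a binary τ-adic expansion can be illustrated with a short Python sketch. It assumes the standard identities $\tau^2 = \mu\tau - 2$ and $(d_0 + d_1\tau)/\tau = (d_1 + \mu d_0/2) - (d_0/2)\tau$ for even $d_0$; the function names `tau_adic_expansion` and `evaluate` are hypothetical and chosen only for this illustration.

```python
def tau_adic_expansion(k, mu):
    """Return digits eps_0, eps_1, ... in {0, 1} with k = sum(eps_i * tau**i)."""
    d0, d1 = k, 0                      # current element d0 + d1*tau
    digits = []
    while d0 != 0 or d1 != 0:
        eps = d0 & 1                   # remainder of the division by tau
        digits.append(eps)
        half = (d0 - eps) // 2         # exact, because d0 - eps is even
        # (d0 - eps + d1*tau) / tau = (d1 + mu*half) - half*tau
        d0, d1 = d1 + mu * half, -half
    return digits


def evaluate(digits, mu):
    """Evaluate sum(eps_i * tau**i) in Z[tau]; returns the pair (d0, d1)."""
    d0 = d1 = 0
    for eps in reversed(digits):       # Horner's rule, most significant first
        # multiply by tau using tau^2 = mu*tau - 2, then add the digit
        d0, d1 = -2 * d1 + eps, d0 + mu * d1
    return d0, d1


digits = tau_adic_expansion(2, 1)      # mu = 1 corresponds to a = 1
```

For $k = 2$ and $\mu = 1$ the sketch yields the digits `[0, 1, 1, 1, 0, 1]`, i.e., $2 = \tau + \tau^2 + \tau^3 + \tau^5$, which illustrates that the τ-adic expansion of an integer is roughly twice as long as its binary expansion unless it is reduced first.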
Nonadjacent Form (NAF)
Signed representations are often used in elliptic curve cryptography in an attempt to reduce the overall weight of the scalar and, hence, the number of point additions that need to be performed; as negating a point is virtually free, point subtraction has the same cost as point addition. NAF is an efficient representation with coefficients $\epsilon_i \in \{0, 1, -1\}$ having the property that no two adjacent coefficients are nonzero. The strategy is to repeatedly divide by 2, choosing nonzero coefficients such that the result upon division by 2 is divisible by 2. A width-$w$ NAF is produced in a similar manner by expanding the coefficient set, ensuring that at most one out of any $w$ consecutive coefficients is nonzero. The average weight of an $\ell$-bit scalar is then $\ell/(w + 1)$, as opposed to $\ell/2$ in the binary case. As Solinas showed [17], the base-$\tau$ case, called τNAF, is completely analogous: repeatedly divide by $\tau$, choosing nonzero coefficients such that the result upon division by $\tau$ is divisible by $\tau$, so that the next coefficient is zero. This is depicted in Fig. 1. A width-$w$ τNAF is also produced analogously [17].
Given $\kappa \in \mathbb{Z}[\tau]$, Solinas [7] proved that the τNAF length $\ell$ satisfies the upper bound

$$\ell < \log_2(N(\kappa)) + 3.51559412, \tag{6}$$

where $N$ is the norm; for explicit details on the norm and the above algorithm, we refer the reader to the extensive results of Solinas [7].
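Since Fig. 1 is not reproduced here, the following Python sketch illustrates the τNAF strategy described above. It assumes the usual digit selection rule attributed to Solinas, $u = 2 - ((d_0 - 2d_1) \bmod 4)$ for odd $d_0$; the function names `tnaf` and `evaluate` are hypothetical.

```python
def tnaf(k, mu):
    """Return tauNAF digits (least significant first) of the integer k."""
    d0, d1 = k, 0                          # current element d0 + d1*tau
    digits = []
    while d0 != 0 or d1 != 0:
        if d0 & 1:
            # choose u in {+1, -1} so that the next quotient is divisible by tau
            u = 2 - ((d0 - 2 * d1) % 4)
            d0 -= u
        else:
            u = 0
        digits.append(u)
        half = d0 // 2                     # exact, because d0 is now even
        d0, d1 = d1 + mu * half, -half     # division by tau
    return digits


def evaluate(digits, mu):
    """Evaluate sum(u_i * tau**i) in Z[tau]; returns the pair (d0, d1)."""
    d0 = d1 = 0
    for u in reversed(digits):
        d0, d1 = -2 * d1 + u, d0 + mu * d1  # multiply by tau, add the digit
    return d0, d1


digits = tnaf(3, 1)
```

For $k = 3$ and $\mu = 1$ the sketch returns `[-1, 0, 1, 0, 0, 1]`, i.e., $3 = -1 + \tau^2 + \tau^5$, and no two adjacent digits are nonzero.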
Double-Base Expansions
The weight of the scalar can be reduced also by using a double-base number system (DBNS), where the scalar is given as a sum of powers of two coprime integers, e.g., 2 and 3, as follows:

$$k = \sum_{i,j} \epsilon_{i,j}\, 2^i 3^j.$$

Obviously, such representations are not unique and it is possible to find expansions which are very sparse, leading to fewer point additions in the scalar multiplication. The idea of the DBNS has been adapted to τ-adic expansions [15], [16], [20], [21]. The expansion suggested in [15], [16] is so far the only one which has been rigorously shown to be feasible in practice, and it represents the scalar as follows:

$$k = \sum_{i,j} \epsilon_{i,j}\, \tau^i (\tau - 1)^j. \tag{7}$$
An efficient way to obtain the expansion of (7), called a blocking algorithm, was introduced in [15] , [16] . The blocking algorithm uses a precomputed lookup table (LUT) for replacing fixed-length blocks of (4) with optimal double-base equivalents. Although the overall weight of the resulting expansion is not optimal, the algorithm is easily implementable and the weight is considerably lower compared to the width 2 NAF. The reader is referred to [16] for further details.
Reduction of Scalars
When producing a base-τ expansion of a scalar $k$
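Two elements $\gamma, \rho \in \mathbb{Z}[\tau]$ such that $\gamma \equiv \rho \pmod{\tau^m - 1}$ are said to be equivalent with respect to $P$, as multiples of $P$ can also be obtained using the element $\rho$ since, for some $\kappa \in \mathbb{Z}[\tau]$,

$$\gamma P = \rho P + \kappa(\tau^m - 1)P = \rho P + \kappa\mathcal{O} = \rho P. \tag{8}$$

Here $(\tau^m - 1)P = \mathcal{O}$ because $x^{2^m} = x$ for all $x \in \mathbb{F}_{2^m}$.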
While this relation holds for all points on the curve, cryptographic operations are often limited to the main subgroup; that is, the points of large prime order. Solinas [7] further improved on this. The small subgroup can be excluded since $(\tau - 1)$ divides $(\tau^m - 1)$, and for multiples of points in the main subgroup scalars can be reduced modulo $\delta = (\tau^m - 1)/(\tau - 1)$ to further reduce the length to a maximum of $m + a$. For computational reasons, it is often convenient to have the form $\delta = c_0 + c_1\tau$ as well. For reference, the procedure for performing such modular reduction is shown in Fig. 2 and Fig. 3, which calculates $\rho'$ such that $\Pr(\rho' \neq \rho) < 2^{-(C-5)}$, where $\rho$ denotes the exactly reduced element. This probabilistic approach is used to avoid costly rational divisions, trading them for a number of multiplications. For explicit details on these algorithms, we again refer the reader to [7].
Reduction by Recoding
Using again the observation in (8), Lutz [23] showed an alternative approach. Instead of performing the costly operations needed for partial modular reduction, simply produce the longer (signed) base-τ expansion; that is, setting $d_0 = k$ and $d_1 = 0$. We are then left with the expansion
and we again have the shorter length, although the coefficient set is no longer restricted to $\{0, 1, -1\}$. The τNAF can be recovered by examining the two least significant digits, say the element $\epsilon_1\tau + \epsilon_0$, producing the τNAF of this element and repeating for all digits of the expansion. The result, after repeating as needed until the expansion length is less than $m$, is a τNAF of approximate average length $m$. This approach is attractive because the logic needed in an implementation is simple; it requires relatively little area compared to partial modular reduction, and uses only simple logic instead of computations in $\mathbb{Q}$ and multiprecision multiplications.
Integer Equivalents
As an alternative approach, for some cryptosystems, Solinas [7] suggested (an adaptation of an idea credited to Hendrik Lenstra in Koblitz's paper [3]) producing a random τNAF; that is, to build an expansion by generating digits with $\Pr(-1) = 1/4$, $\Pr(1) = 1/4$, and $\Pr(0) = 1/2$, and following each nonzero digit with a zero. Lange and Shparlinski [24] proved that such expansions with length $\ell = m - 1$ are well distributed and virtually collision free. This gives an efficient way of obtaining random multiples of a point (for example, a generator). For some cryptosystems (e.g., Diffie-Hellman [25] key agreement), the integer equivalent of the random τNAF is not needed. The expansion is simply applied to the generator, then to the other party's public point. However, for generating digital signatures (e.g., ECDSA [4]), not only is the expansion needed for computing a multiple of the generator, but also its integer value modulo the group order. In contrast to integer to τNAF conversion, we are left with the need for τNAF to integer conversion.
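The digit-generation rule for a random τNAF can be sketched as follows. This is only an illustration of the rule stated above: the function name `random_tnaf` is hypothetical, and Python's `random.Random` is used so that the example is reproducible, whereas a real system would need a cryptographically secure generator.

```python
import random


def random_tnaf(length, rng):
    """Generate `length` digits with Pr(1) = Pr(-1) = 1/4 and Pr(0) = 1/2,
    following each nonzero digit with a zero."""
    digits = []
    while len(digits) < length:
        x = rng.randrange(4)          # uniform over {0, 1, 2, 3}
        if x == 0:
            digits += [1, 0]          # nonzero digit, then a forced zero
        elif x == 1:
            digits += [-1, 0]
        else:
            digits.append(0)          # x in {2, 3}: zero with probability 1/2
    return digits[:length]            # truncation keeps the nonadjacency property


m = 163
expansion = random_tnaf(m - 1, random.Random(2024))
```

Truncating to the requested length can only drop a trailing forced zero, so the nonadjacency property is preserved.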
Lange [26] covered much of the theory of this approach, as well as a method for recovering the integer. Given a generator $G$ of prime order $r$ and a group automorphism $\tau$, there is a unique integer $s$ modulo $r$ which satisfies $\tau(G) = sG$, and $s$ (fixed per curve) is obtained using (T -
Given some τ-adic expansion, it follows that the equivalent integer can be recovered deterministically as

$$k \equiv \sum_{i=0}^{\ell-1} \epsilon_i s^i \pmod{r} \tag{10}$$
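using at most $\ell - 2$ multiplications and some additions for nonzero coefficient values (assuming $\epsilon_i \in \{0, 1, -1\}$), all modulo $r$ [26]. We improved on this result in [14]. It is known (and shown in [7]) that given the recurrence relation

$$U_0 = 0, \quad U_1 = 1, \quad U_{i+1} = \mu U_i - 2U_{i-1},$$

the powers of $\tau$ satisfy $\tau^i = U_i\tau - 2U_{i-1}$ for all $i > 0$.
This equation can be used for computing the order of the curve. In addition to the aforementioned application, this equation can also be directly applied to compute the equivalent element

$$d_0 + d_1\tau = \sum_{i=0}^{\ell-1} \epsilon_i\tau^i,$$

and once $d_0 + d_1\tau$ is computed, the equivalent integer $k$ modulo $r$ is easily obtained using

$$k \equiv d_0 + d_1 s \pmod{r}.$$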
In comparison to (10) , instead of a multiplication modulo r in each iteration, this method uses only cheap shifts and additions, then finally one multiplication modulo r.
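The two ways of recovering the integer can be compared with a small Python sketch. It assumes the Lucas-type relation $\tau^i = U_i\tau - 2U_{i-1}$ with $U_0 = 0$, $U_1 = 1$, $U_{i+1} = \mu U_i - 2U_{i-1}$, and it uses toy parameters ($\mu = 1$, $r = 11$, $s = 5$) that were chosen only because $s^2 - \mu s + 2 \equiv 0 \pmod{r}$; they are not parameters of a standardized curve. The function names are hypothetical.

```python
from itertools import product


def integer_equivalent_horner(digits, s, r):
    """Evaluate sum(eps_i * s**i) mod r with one modular multiplication per digit."""
    k = 0
    for eps in reversed(digits):
        k = (k * s + eps) % r
    return k


def integer_equivalent_lucas(digits, mu, s, r):
    """Accumulate d0 + d1*tau with additions and doublings only, then apply
    a single multiplication by s modulo r."""
    if not digits:
        return 0
    d0, d1 = digits[0], 0              # tau^0 = 1
    u_prev, u_cur = 0, 1               # U_0, U_1
    for eps in digits[1:]:
        # tau^i = U_i*tau - 2*U_{i-1}; eps is 0 or +-1, so these are add/sub
        d1 += eps * u_cur
        d0 -= 2 * eps * u_prev
        u_prev, u_cur = u_cur, mu * u_cur - 2 * u_prev
    return (d0 + d1 * s) % r


MU, R, S = 1, 11, 5                    # toy values: 5*5 - 5 + 2 = 22 = 2*11
k = integer_equivalent_lucas([0, 1, 1, 1, 0, 1], MU, S, R)
```

The digit string used in the last line is the binary τ-adic expansion of 2 for $\mu = 1$, so both functions return 2 for it.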
Remarks on Computation Schedules
Fig. 4 presents computation schedules for the two approaches discussed above, i.e., when $k$ is given (a) as an integer or (b) as a random τNAF. It is assumed in the schedules that separate circuitries are used for conversions and scalar multiplications, allowing parallel computations. Fig. 4a shows that computation time is the sum of conversion time, $t_c$, and scalar multiplication time, $t_{sm}$. However, throughput (scalar multiplications per second) is bounded by the longer of the two, i.e., $\max(t_{sm}, t_c)$. The reason for this is that the conversion of the next $k$ can be computed in parallel with the previous scalar multiplication. Fig. 4a depicts the situation where $t_{sm} > t_c$. Conversions to equivalent integers can be computed in parallel with scalar multiplications in the case of random τNAF [14], as depicted in Fig. 4b. Hence, both total computation time and throughput are determined by $\max(t_{sm}, t_c)$. Recent developments in processors computing scalar multiplications and the lack of fast converters have led to an unwanted situation where the conversions have become the bottleneck, i.e., $t_c > t_{sm}$ [11]. This is one of the main motivators for the development of more efficient algorithms and implementations for conversions.
NEW ALGORITHMS
In this section, we present two new algorithms for conversions based on the simple observation that multiplication (division) by $\tau$ is an extremely cheap operation, requiring only a small number of shifts and additions, in contrast to division by $\tau^m - 1$ or $(\tau^m - 1)/(\tau - 1)$ as shown in Fig. 2, which can not only be slow but also require significant implementation area.
Lazy Reduction
We begin with the new algorithm for reduction. The idea is similar to Crandall reduction [27] (or, from another perspective, Optimal Extension Fields [28]), often used in $\mathbb{F}_p$ with $p$ of special form, say $p = 2^i - C$, where $C$ is a small (e.g., 32-bit), low-weight constant. Crandall reduction leverages the fact that division (multiplication) by 2 is only shifting, and thus reduction with such a $p$ can be accomplished using only shifts and additions (subtractions).
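In the case of Koblitz curves, we have a similar situation, but with base $\tau$ instead of base 2. Division by $\tau$ is not completely free, but still comparatively cheap, using only shifts and additions. The approach is as follows: Instead of performing a division by $\tau^m - 1$, we divide by $\tau$ repeatedly $m$ times (similarly as when producing a base-τ expansion); we can then express the integer $k$ as

$$k = (d_0 + d_1\tau)\tau^m + \sum_{i=0}^{m-1} u_i\tau^i \equiv (d_0 + d_1\tau) + \sum_{i=0}^{m-1} u_i\tau^i \pmod{\tau^m - 1}, \tag{12}$$

where $d_0 + d_1\tau$ is the quotient after $m$ divisions and the $u_i$ are the remainders.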
The desired reduction is obtained directly from (12); we refer to this method as lazy reduction, and the algorithm is shown in Fig. 5 .
Although the algorithm in Fig. 5 allows only $u_i \in \{0, 1\}$, we note that it is entirely possible to combine the logic with that which is used to produce the τNAF (Fig. 1), pausing after iteration $m - 1$ to perform the addition of the $d_i$ and $b_i$, then continuing on to generate the τNAF of the resulting element. The benefit of such an approach in terms of area would be device and implementation specific; hence, we merely note the alternative here.
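As Fig. 5 is not reproduced here, the following Python sketch reconstructs lazy reduction from the prose alone: divide by $\tau$ exactly $m$ times, accumulate the remainders as $b_0 + b_1\tau = \sum u_i\tau^i$ with the help of a running power $a_0 + a_1\tau = \tau^i$, and finally add the quotient $d_0 + d_1\tau$. The function name `lazy_reduce` and the toy parameters in the usage lines ($m = 5$, $\mu = 1$, $r = 11$, $s = 5$, for which $s^2 - \mu s + 2 \equiv 0$ and $s^m \equiv 1 \pmod{r}$) are assumptions made for illustration.

```python
def lazy_reduce(k, m, mu):
    """Return (rho0, rho1) with rho0 + rho1*tau congruent to k mod (tau^m - 1)."""
    d0, d1 = k, 0        # element being divided by tau
    a0, a1 = 1, 0        # running power tau^i
    b0, b1 = 0, 0        # accumulated remainders sum(u_i * tau^i)
    for _ in range(m):
        u = d0 & 1                         # remainder bit u_i in {0, 1}
        half = (d0 - u) // 2
        d0, d1 = d1 + mu * half, -half     # division by tau (shifts and adds)
        if u:
            b0, b1 = b0 + a0, b1 + a1      # b += tau^i
        a0, a1 = -2 * a1, a0 + mu * a1     # a *= tau, using tau^2 = mu*tau - 2
    # tau^m is congruent to 1, so the quotient is simply added to the remainder part
    return d0 + b0, d1 + b1


M, MU, R, S = 5, 1, 11, 5                  # toy values, not a standardized curve
rho0, rho1 = lazy_reduce(2, M, MU)
```

For $k = 2$ the sketch returns $(-3, 1)$, i.e., $-3 + \tau$, whose integer equivalent $-3 + 5 \equiv 2 \pmod{11}$ agrees with the input.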
Length
Theorem 1. The result of the lazy reduction algorithm (Fig. 5) is an element of $\mathbb{Z}[\tau]$ with τNAF of length at most $m + 4$.
Proof. For explicit details on the following notation and terminology (such as the norm and the Triangle Inequality), we refer the reader to [7], [17]. Recall that the norm satisfies $N(\alpha\beta) = N(\alpha)N(\beta)$.
We know $N(\tau) = 2$, thus $N(\tau^m) = 2^m$. Observe $\mathbb{Z}[\tau]$ is a Euclidean domain, so in (12) the equation
Using the triangle inequality, we obtain
Hasse's theorem states (see, for example, [29, p. 127,
and for Koblitz curves this yields
To continue, we consider the two cases $a = 0$ and $a = 1$ separately. We will also use the fact that $\log_2(x + \eta) - \log_2(x) < \frac{\eta \log_2 e}{x}$ for all positive real numbers $x$ and $\eta$.
For $a = 0$, we have
By taking base 2 logarithms and multiplying by 2 we get
Given (6) and the fact that the length must be an integer, we conclude that the τNAF length $\ell$ of the result satisfies $\ell \le m + 4$ when $m \ge 8$ for both $a = 0$ and $a = 1$.

is given, and again with $N(\tau^m - 1) = \#E_a(\mathbb{F}_{2^m}) = fr$ with cofactor $f$, the worst case effect of lazy reduction modulo $\tau^m - 1$ compared to exact reduction modulo $\delta$ (not partial reduction as shown in Fig. 2) is at most a few bits.
Compared to the partial modular reduction by $\delta = (\tau^m - 1)/(\tau - 1)$, the bound
is given in [7], leading to a τNAF of length at most $m + a + 3$, a similar bound as that of Theorem 1 for the new method. Experimental results on the practical effect of lazy reduction are given in Section 5.1.
Integer Equivalents
Similar to the previous algorithm where division by $\tau$ is cheap, we can leverage the fact that multiplication by $\tau$ is cheap; indeed, to multiply $d_0 + d_1\tau$ by $\tau$, we compute

$$(d_0 + d_1\tau)\tau = -2d_1 + (d_0 + \mu d_1)\tau, \tag{15}$$

where higher powers of $\tau$ are reduced using the characteristic equation (3). This simple idea can be directly applied to computing the integer equivalent of a τ-adic expansion using a τ-and-add approach, where the accumulator is multiplied by $\tau$ in each iteration (a shift and add).
The number of iterations of (15) can be reduced by processing several $\epsilon_i$ at a time. Let $\omega$ denote the number of $\epsilon_i$ processed in one iteration. Then, the integer equivalent for a τ-adic expansion can be found with the algorithm given in Fig. 6. The algorithm requires four parameters $\alpha_i, \beta_i \in \mathbb{Z}$ such that
and two functions $\phi_i(x_{\omega-1}, \ldots, x_0)$ which satisfy
with the output of the $\phi_i$ taken as integers. The constants and the functions are generated by repeatedly applying (15) and Table 1 presents their values for $\omega \le 4$. For instance, the simplest case, $\omega = 1$, yields
Because the parameters and the functions depend only on $\omega$, we note that they can be easily precomputed and hardwired in an implementation. The feasibility of a large $\omega$ is questionable because the equations (specifically, the multiplications by $\alpha_i$ and $\beta_i$) become more involved.
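Because Fig. 6 and Table 1 are not reproduced here, the following Python sketch only illustrates the general idea of processing $\omega$ digits per iteration. It derives the per-block constants from $\tau^\omega = a + b\tau$ instead of using the tabulated parameters, so the names `tau_power`, `block_value`, and `expansion_to_integer`, as well as the toy parameters $\mu = 1$, $r = 11$, $s = 5$, are assumptions made for illustration. In hardware, the role of `block_value` would be played by small hardwired functions of the $\omega$ input digits.

```python
from itertools import product


def tau_power(omega, mu):
    """Return (a, b) with tau**omega = a + b*tau."""
    a, b = 1, 0
    for _ in range(omega):
        a, b = -2 * b, a + mu * b          # multiply by tau
    return a, b


def block_value(block, mu):
    """Value in Z[tau] of a block of digits given most significant first."""
    v0 = v1 = 0
    for eps in block:
        v0, v1 = -2 * v1 + eps, v0 + mu * v1
    return v0, v1


def expansion_to_integer(digits, mu, s, r, omega=1):
    """digits are least significant first; returns the integer equivalent mod r."""
    a, b = tau_power(omega, mu)
    msd_first = list(reversed(digits))
    msd_first = [0] * ((-len(msd_first)) % omega) + msd_first   # pad to full blocks
    d0 = d1 = 0
    for i in range(0, len(msd_first), omega):
        v0, v1 = block_value(msd_first[i:i + omega], mu)
        # (d0 + d1*tau)(a + b*tau) = (a*d0 - 2*b*d1) + (b*d0 + (a + mu*b)*d1)*tau
        d0, d1 = a * d0 - 2 * b * d1 + v0, b * d0 + (a + mu * b) * d1 + v1
    return (d0 + d1 * s) % r               # single multiplication by s at the end


MU, R, S = 1, 11, 5                        # toy values: 5*5 - 5 + 2 = 22 = 2*11
k = expansion_to_integer([0, 1, 1, 1, 0, 1], MU, S, R, omega=2)
```

For $\omega = 1$ the constants reduce to $a = 0$, $b = 1$, which recovers the plain τ-and-add update $(d_0, d_1) \leftarrow (-2d_1 + \epsilon_i,\ d_0 + \mu d_1)$; larger $\omega$ trades fewer iterations for multiplications by larger constants.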
HARDWARE IMPLEMENTATIONS
In this section, we discuss hardware implications of the new algorithms presented in Section 3 and present efficient architectures for them. The algorithms require neither divisions nor several multiplications of large integers, which increases their suitability for hardware because such operations consume considerable amounts of area or computation time.
Computation speed was prioritized over compactness, and operations were parallelized as much as possible. As a consequence, the presented architectures are unsuitable for highly constrained applications, but implementations with small resource requirements can be realized, as will be discussed shortly in Section 4.4. In the architectures that are described in the following, data path width, $D$, must satisfy $D \ge m$. $D$ affects both area requirements and maximum clock frequencies; hence, it should be as small as possible.
We design the converters specifically for two Koblitz curves, K-163 and K-233, recommended by NIST in [4]. K-163 is the curve $E_1(\mathbb{F}_{2^{163}})$ and K-233 is $E_0(\mathbb{F}_{2^{233}})$. However, the architectures are not restricted to any particular Koblitz curves.
Hardware architectures for conversions from integer to τNAF (I-to-T), τNAF to integer (T-to-I), and integer to double-base τ-adic expansion (I-to-DB) are described in detail in Sections 4.1-4.3, respectively. Issues such as compact implementations, side-channel analysis (SCA), etc., are discussed in Section 4.4.
The I-to-T Converter
In this section, we describe the new I-to-T converter which finds the τNAF for an input integer $k$ and outputs it in serial starting from $\epsilon_0$. The converter implements the algorithms from Figs. 1 and 5. Fig. 7 depicts the I-to-T converter. It comprises the main τNAF circuitry (on the left), reduction circuitry (on the right), and simple control logic (the control block). The main τNAF circuitry is based on the architecture presented in Fig. 1 of [13], but we introduce certain simple optimizations discussed in the following. The reduction circuitry and the control logic are not based on any existing architectures.
The main τNAF circuitry consists of logic for obtaining the signed result bit, $u$, a 2-bit register for storing it, two $(D + 1)$-bit adders (subtractors) for computing $c_0$ and $c_1$, and two registers for storing them, both $D + 1$ bits wide. The magnitude bit of $u$ is simply the lsb of $c_0$ [13]. Simple modifications are applied to the equations updating $c_0$ and $c_1$; see Fig. 1. In particular, the negation in the equation of $c_1$ is removed by reversing the operands of the subtraction, and the equation updating $c_0$ was changed accordingly. The resulting equations are as follows:
The reduction circuitry computing $b_0 + b_1\tau$ consists of three adder/subtractors, one negation circuitry, a few multiplexers, and four registers, all $\lceil D/2 \rceil + 1$ bits wide. The reduction circuitry is a straightforward implementation of Fig. 5.
The conversion is performed in four modes, and they are denoted by $M \in \{0, 1, 2, 3\}$. The modes are controlled by the control logic which is implemented as a finite state machine (FSM). In different modes, the circuitry operates as described in the following:
0. This mode initializes the circuitry and it is used at the beginning of a conversion. All registers are cleared, i.e., set to zero, with the exception of $c_0$, which is set to $k$, and $a_0$, which is set to one.

The latency of a conversion, $L_{\text{I-to-T}}$, is on average $2m + 3$ clock cycles. It comprises the following latencies: Mode 0 requires one clock cycle, Mode 1 requires $m$ clock cycles, and Mode 2 is computed in one clock cycle. The latency of Mode 3 is the length $\ell$ of the resulting τNAF, and thus varies. The average value is $m$ clock cycles and the maximum is $m + 4$ (given in Theorem 1). The registered outputs result in an additional delay of one clock cycle, giving the total latency

$$L_{\text{I-to-T}} = m + \ell + 3,$$

and the maximum latency, $L_{\text{I-to-T,max}} = 2m + 7$.
BRUMLEY AND JÄRVINEN: CONVERSION ALGORITHMS AND IMPLEMENTATIONS FOR KOBLITZ CURVE CRYPTOGRAPHY
Fig. 7. The architecture of the I-to-T converter. The main τNAF circuitry is on the left, the reduction circuitry is on the right, and control logic is in the control block. The critical path determining the maximum clock frequency is located in the main τNAF circuitry. The circuitry operates in four modes, $M \in \{0, 1, 2, 3\}$. All signals are represented in two's complement.
The T-to-I Converter
In this section, we describe the new T-to-I converter which finds an integer equivalent $k$ for a τ-adic expansion, $\sum_{i=0}^{\ell-1} \epsilon_i\tau^i$ where $\epsilon_i \in \{0, \pm 1\}$. The τ-adic expansion is input in serial starting from $\epsilon_0$ and the result, $k$, is output in parallel. The converter implements Fig. 6 with $\omega = 1$.
The converter is depicted in Fig. 8. It consists of two adder/subtractors, two $(D + 2)$-bit registers for storing their results, one of which supports shifts to the left, a $(\lfloor \log_2 s \rfloor + 1)$-bit shift register for storing the constant $s$, control logic, comparators, and multiplexers.
The conversion is, again, performed in four modes, denoted by $M \in \{0, 1, 2, 3\}$, and the control logic, implemented as an FSM, controls the modes which operate as follows:
0. This mode initializes the circuitry and it is used at the beginning of a conversion. The registers $d_0$ and $d_1$ are cleared and the constant $s$ is assigned to the register s.

If not, $r$ is either added or subtracted. The register s is shifted to the right so that its next bit is processed during the following Mode 2. If the entire $s$ has been processed, the register $d_0$ holds $k \in [0, r)$, ending the conversion.

Building up $d_0$ and $d_1$ from the τ-adic expansion, i.e., Mode 1, takes $\ell$ clock cycles (the length of the τ-adic expansion), and the computation of the multiplication takes two clock cycles for each bit of $s$. Thus, the latency including interfacing delays is given by

$$L_{\text{T-to-I}} = \ell + 2(\lfloor \log_2 s \rfloor + 1) + 1 = m + 2(\lfloor \log_2 s \rfloor + 1),$$

assuming $\ell = m - 1$. The length of $s$ is $\lfloor \log_2 s \rfloor + 1 = 159$ for NIST K-163 and $\lfloor \log_2 s \rfloor + 1 = 229$ for NIST K-233. Hence, the latencies are 481 and 691 clock cycles for NIST K-163 and K-233, respectively. Notice that $L_{\text{T-to-I}}$ is constant because it is possible to use a fixed $\ell$, contrary to $L_{\text{I-to-T}}$ discussed in Section 4.1.
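The latency figures quoted for the two converters can be reproduced with a few lines of Python. The sketch uses only the cycle counts stated in the text; the single interfacing cycle of the T-to-I converter is an inference from the quoted totals of 481 and 691 cycles rather than a value stated explicitly, and the function names are hypothetical.

```python
def i_to_t_latency(m, naf_length):
    """Mode 0 (1 cycle) + Mode 1 (m) + Mode 2 (1) + Mode 3 (naf_length)
    + 1 cycle for the registered outputs."""
    return 1 + m + 1 + naf_length + 1


def t_to_i_latency(expansion_length, s_bits, interface_cycles=1):
    """Mode 1 takes one cycle per digit; the multiplication by s takes two
    cycles per bit of s. interface_cycles=1 is inferred from the quoted totals."""
    return expansion_length + 2 * s_bits + interface_cycles


# NIST K-163 and K-233 with an expansion length of m - 1
k163 = t_to_i_latency(163 - 1, 159)
k233 = t_to_i_latency(233 - 1, 229)

# I-to-T: average case (length m) and worst case (length m + 4) for m = 163
avg_163 = i_to_t_latency(163, 163)
max_163 = i_to_t_latency(163, 163 + 4)
```

With these inputs the sketch gives 481 and 691 cycles for the T-to-I converter, and $2m + 3 = 329$ and $2m + 7 = 333$ cycles for the I-to-T converter on K-163.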
The architecture shown in Fig. 8 can be applied to both $\mu = \pm 1$, but one of the multiplexers can be removed when $\mu = 1$ by rearranging the operands of the adder on the right. If $\mu = 1$, (17) is an addition and the order of the operands can be reversed, i.e.,
In that case, the operand A is always d 1 which removes the multiplexer.
Contrary to the I-to-T converter, the T-to-I converter was designed primarily for a fixed curve. The converter can be modified to support arbitrary curves, but the modifications are costly. The main problem is that $s$ and $r$ need to be available, and computing them on the fly would require considerable amounts of area and time. However, for a small set of curves, e.g., the NIST curves, they can be precomputed and, hence, support for a few curves could be realized with minor additional costs including interface logic and a register for $r$. Supporting both $\mu = \pm 1$ requires only minor modifications to the control logic.
The I-to-DB Converter
In this section, we show how our I-to-T converter improves the I-to-DB converter described in [15], [16]. The modified converter operates identically to the original converter, i.e., it takes $k$ in parallel and outputs the resulting double-base expansion, $\sum_{i,j} \epsilon_{i,j}\tau^i(\tau - 1)^j$, in serial one term at a time. The latency of the converter also remains the same.
The modifications are applied only to the part that finds the binary τ-adic expansion, (4), while the rest of the converter remains unchanged. Fig. 9 shows the architecture of the I-to-DB converter. This architecture does not differ from the one described in [15], [16], but instead of using the algorithms from [7] for obtaining the τ-adic expansion, we use the algorithms of Figs. 1 and 5 and the architecture of Section 4.1.
The converter operates as follows: The integer $k$ is first converted into a binary τ-adic expansion. The resulting expansion is fed into a 10-bit shift register. Every 10th clock cycle the contents of the shift register are used as an input to the LUT, which includes optimal double-base expansions for each possible 10-bit input block. Finally, the expansion from the LUT is put in place by multiplying it by a power of $\tau$ according to the sequence number of the input block. Because the blocks are converted with the LUT while the I-to-T converter computes and outputs the τ-adic expansion, the requirement for storage is very small (only the shift register) and the latency overhead is only a few clock cycles. See [16] for a more detailed description.
Because the LUT requires a binary τ-adic expansion, i.e., $\epsilon_i \in \{0, 1\}$, the I-to-T converter presented in Section 4.1 is modified to output binary τ-adic expansions. These modifications are applied to the main τNAF conversion circuitry, and they simplify the circuitry even further. The result bit, $\epsilon_i$, is simply the lsb of $c_0$ and, hence, the AND and XOR gates are removed; see Fig. 7.
Discussion

Compact Implementations
The converters presented above prioritize speed over compactness. In certain applications, however, small area requirements are more important. Our algorithms are feasible also in that case, whereas other existing algorithms are most probably not because they require complicated operations.
Our algorithms can be implemented with an arithmetic logic unit (ALU) supporting additions, subtractions, shifts, and comparisons of signed integers. Notice that the limit $D \ge m$ is valid only for the architectures proposed in Sections 4.1-4.3 and, thus, a data path of $D < m$ bits can be used in an ALU, if longer latencies are tolerated. Implementing such an ALU would require only minuscule resources, at least compared to the processor computing scalar multiplications.
Supposedly, the largest problem for compact implementations is the relatively large number of registers, e.g., $\sim 4m$ for Figs. 1 and 5 or $\sim 3m$ for Fig. 6. The register count of Fig. 5 can be reduced to $\sim 3m$ by merging the equations updating $a_0$ and $a_1$ into the equations updating $b_0$ and $b_1$.
Another option is to include conversion logic in the processor computing scalar multiplications. The benefit would be that registers could be shared between conversions and scalar multiplications, solving the problem of the large number of registers. However, logic required in computations cannot be shared efficiently because conversions require large integer operations whereas scalar multiplications on Koblitz curves utilize arithmetic in $\mathbb{F}_{2^m}$. Besides, the possibility of computing conversions and scalar multiplications in a pipelined fashion would be lost.
Implementing compact designs is left for future work.
Width w Expansions
Although the converters were described merely for signed-binary τ-adic expansions, i.e., $\epsilon_i \in \{0, \pm 1\}$, generalization to larger width $w$ is trivial. It can be done with simple string replacements either after conversions for the I-to-T converter or before conversions for the T-to-I converter. For instance, a width 2 τNAF string, 10 10 10 10 100 1, would be replaced by a width 4 τNAF string, 3000 500007, or vice versa. This approach has been demonstrated at least in [11] where width 4 τNAF was generated from a width 2 τNAF with a simple string replacement circuitry. The area overhead of such a circuitry is very small. Hence, width $w$ τNAF offers significant speedups with an almost negligible area increase, assuming that the processor computing scalar multiplications contains enough storage space for precomputed points. Furthermore, precomputations can be performed simultaneously with conversions because they do not depend on $k$, which decreases their overhead.
Side-Channel Analysis and Countermeasures
SCA tries to reveal secret information, such as the scalar k, by observing the physical implementation of a cryptosystem. It is widely agreed that SCA poses a serious threat to many cryptographic devices. Although the most serious vulnerabilities to SCA are most likely located in the processor computing scalar multiplications, a converter may also leak information about a secret scalar. Thus, resistance against SCA requires attention. The purpose of the following discussion is not to describe specific countermeasures against SCA but rather to point out a few possible vulnerabilities and act as a starting point for further studies.
We focus this discussion on two SCA attacks, namely, timing analysis (TA) [30] and simple power analysis (SPA) [31]. In general, differential power analysis (DPA) [31] is more difficult to thwart because it utilizes statistical methods to analyze several power traces. However, DPA is not a threat in this case because each conversion needs to be performed only once. Even if the same scalar is used several times, the conversion needs to be computed only once if the resulting expansion is stored instead of the original scalar, which nullifies DPA.
The I-to-T converter presented in Section 4.1 is exposed to TA because the variations in latency leak the length of the expansion, ℓ. However, because the computation time leaks no information about the values of the digits, the effect on security is very small. Nevertheless, the leakage can be prevented by adding dummy operations at the end of the computation that equalize the latency of each conversion to L_{I-to-T,max}. The fact that the b_0 and b_1 registers are updated only when u = ±1 causes a vulnerability to SPA in the standard version of the I-to-T converter. However, this vulnerability does not leak any direct information about the resulting NAF because the registers are used only in Mode 1 and their values are fixed to zero during the derivation of the actual expansion, i.e., during Mode 3. Nevertheless, the leakage could be reduced by using dummy registers which are updated when u = 0.
Fig. 9. The architecture of the I-to-DB converter. We present improvements to the binary I-to-T converter; the rest of the circuitry remains unchanged from [16]. The adder is used for multiplying the output of the LUT with a power of τ.
Because the latency of the T-to-I converter, discussed in Section 4.2, is constant, the converter is inherently resistant against TA. However, SPA poses a more serious risk in this case, and it would most likely easily expose some of the first digits of the τ-adic expansion. For example, if the first processed digit is 1, d_0 changes from 0 to 1 (one bit flips) after the first cycle but, if it is −1, d_0 changes from 0 to −1 (all D + 2 bits flip). This problem can be prevented with countermeasures at the cell level, e.g., with masked dual-rail precharge logic style [32]; see, e.g., [33] for a review on countermeasures against SPA and DPA. As a downside, such countermeasures increase both area and overall power consumption. The I-to-DB converter, discussed in Section 4.3, shares many of the side-channel characteristics of the I-to-T converter, which is used as a subcomponent. It seems unlikely that SPA could recover information about the expansion from the power trace of the LUT but, nonetheless, countermeasures at the cell level could be used to prevent such problems. Hence, we omit further analysis.
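The bit-flip asymmetry noted above for the T-to-I converter can be checked with a few lines of arithmetic. The sketch below assumes that d_0 is held in a two's complement register of D + 2 bits and counts how many register bits toggle between two consecutive values; the helper name `register_flips` and the choice D = 163 are illustrative assumptions only.

```python
def register_flips(old, new, width):
    """Count bits that toggle when a two's complement register of the given
    width changes from `old` to `new` (its Hamming distance)."""
    mask = (1 << width) - 1          # keep only the low `width` bits
    return bin((old ^ new) & mask).count("1")


D = 163                               # illustrative data path width
width = D + 2

flips_plus = register_flips(0, 1, width)    # first digit +1
flips_minus = register_flips(0, -1, width)  # first digit -1
```

Under these assumptions `flips_plus` is 1 and `flips_minus` is 165, i.e., all D + 2 bits, so the two cases differ sharply in switching activity during the first cycle.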
RESULTS
We begin this section with experimental software results that measure the practical effect of lazy reduction, given in Section 5.1. In Section 5.2, we continue with the more practical results of the previously described hardware implementations.
Software Experiments
A software implementation was done using the GNU MP multiprecision library version 4.2.1, in particular the C++ wrapper. The compiler used was GCC 4.1.2 at optimization level two (-O2) on a 1.86 GHz Intel Core 2 6320 with 2 GB RAM running Debian Linux 4.0.
For this experiment, we compared the output of the lazy reduction method with the output of exact reduction modulo τ^m − 1. We measured the average weight, average length, maximum length, and timing over 1M iterations. The probability that the two methods produce the same value is also given. From Table 2, we can see that the output from lazy reduction has very similar properties to that of exact reduction. Indeed, the two outputs coincide more often than not. The occasions where they differ have little effect; the maximum length of the NAF is increased by only one bit. The slightly larger average length leads to a slightly larger average weight, but the effect is negligible. For example, for K-163 we would expect, roughly, the cost of one extra point addition every eight scalar multiplications. Unfortunately, the timings from Table 2 are rather disappointing; lazy reduction makes use of a large number of shifts. While these are virtually free in hardware, in software they require shifting and propagation across multiple computer words. We note that no attempt was made to optimize the implementation of either method for a specific curve. The timings are provided only for an informal comparison and not for any benchmarking purpose. To our knowledge, the only timing of a similar operation appearing in the literature is given in [34], a particularly efficient implementation of ECC for standardized curves where partial modular reduction for K-163 is performed in 50 µs on a Pentium II 400 MHz, accounting for roughly 3.5 percent of the total time for reduction, NAF generation, and scalar multiplication.
On a positive note, we can see that, in the worst case, we can expect lazy reduction to be a small factor slower than exact reduction. Again, in contrast to exact reduction or partial reduction, lazy reduction enjoys the benefit of being deployable on a wide variety of platforms as the only operations involved are simple shifts and additions.
Hardware Implementation Results
The architectures described in Section 4 were captured in VHDL and synthesized with Altera Quartus II 7.2 SP1 design software for Altera Stratix II FPGAs [35]. A Stratix II EP2S60F1020C4, or S60C4 for short, was used for all architectures, except for the I-to-DB converter discussed in Section 4.3, which was implemented on a Stratix II EP2S180F1020C3, or S180C3 for short. These selections ensure fair comparison in Section 5.3 because comparable designs in the literature use these devices. Table 3 lists implementation results obtained from Quartus II. They include I-to-T, T-to-I, and I-to-DB converters optimized for specific curves, NIST K-163 and K-233 [4]. Table 3 also includes an I-to-T converter that can be used for an arbitrary Koblitz curve, denoted by Arbit., as long as m ≤ D = 256.
The basic component of Stratix II FPGAs is the adaptive logic module (ALM), which decomposes into two adaptive lookup tables (ALUTs), two registers, and some carry and control logic [35]. Stratix II devices also include embedded memory blocks of different sizes [35], of which we used M4Ks in our I-to-DB converter. Table 3 lists area requirements as ALMs, but also as ALUTs and registers, in order to provide seamless comparison to existing works in Section 5.3. Table 4 presents a comparison to other implementations available in the literature. To our knowledge, Table 4 includes all converters that have been published thus far, with the exception of the converter given in [15], which is exactly the same as the one in [16] but on a Xilinx Virtex-II FPGA. Because all converters listed in Table 4 are implemented on Stratix II FPGAs, comparison is straightforward. The only exception is that the computation times of the I-to-DB converters are not comparable with other implementations because of the faster and larger device.
Comparisons
Our I-to-T converters are both smaller and faster than the converters presented in [13], which use reduction by recoding discussed in Section 2.3.1. The speed difference is about one-third. The area requirements (ALUTs and registers) are reduced by 19-35 percent. The largest area benefits (−35 percent in ALUTs and −34 percent in registers) are achieved with the converter supporting arbitrary curves. The reason is that arbitrary curves (μ = ±1) can be supported by using two adder/subtractors instead of an adder and a subtractor, whereas considerably more significant changes need to be applied to the converter presented in [13] including, e.g., doubling the size of an LUT used in reductions.
Compared to the converters presented in [14] , our T-to-I converters achieve similar conversion times with area requirements which are 20-25 percent smaller. The reason for this is that our converter does not require circuitry for computing the U sequence that was required in [14] whereas otherwise the layout of the circuitry is very similar.
Inclusion of our I-to-T converter into the I-to-DB converter presented in [16] results in major reductions in area, while the latency remains the same. Our converter also has a slightly higher maximum clock frequency, resulting in a small improvement in overall conversion time. The logic requirements, i.e., the number of ALMs, are reduced by 62 percent. Additionally, our converter does not require any of the hardwired multipliers (DSPs) used in [16] and, hence, the difference would be even larger if they were taken into account. Our I-to-DB converter achieves a faster conversion time than the I-to-T converter, despite the fact that the LUT increases its latency slightly compared to the I-to-T converter. This difference is explained by the higher maximum clock frequency (see Table 3) obtained by using a faster FPGA.
The converter of [16] implements the algorithms presented by Solinas in [7] and almost all ALMs (and all DSPs) are used for computing those algorithms because the conversion to the double-base representation is performed with an LUT implemented in embedded memory. Hence, the reductions in area, witnessed by using our converter, reflect the benefits of our new algorithms compared to the algorithms given in [7] . The fact that our algorithms do not require large multipliers has cut down area requirements to less than half without increasing conversion time.
CONCLUSIONS
We discussed efficient computation of conversions related to scalar multiplication on Koblitz curves. We presented new algorithms for integer to NAF and NAF to integer conversions utilizing only simple operations, such as additions and shifts. Our algorithms solve the problems faced when existing algorithms were implemented in hardware and on processors lacking support for multiplications, e.g., certain microcontrollers. These problems were caused by requirements for several involved operations, e.g., multiprecision multiplications or divisions and computations with large rationals. The fact that our algorithms do not require such operations makes them suitable for hardware and platforms with limited instruction sets and storage.
We demonstrated the practicability of the new algorithms by implementing converters in Stratix II FPGAs. Our converters are faster than the existing ones. Furthermore, their conversion times are faster than any reported scalar multiplication times and, hence, they solve the problems noted in [11] , [12] , where the conversions had become the bottleneck. Not only are our converters faster, but they also consume significantly fewer resources than other reported converters. In fact, more than half of the area was saved by using our algorithms instead of the ones proposed in [7] without reductions in speed.
Despite the efficiency of our converters, we will continue to develop them further. In particular, minimizing the area of the converters will be studied in the future in order to increase the feasibility of Koblitz curves on highly constrained 
