Abstract: Koblitz curves are a class of computationally efficient elliptic curves on which scalar multiplication can be accelerated using τNAF representations of scalars. However, conversion from an integer scalar to a short τNAF is a costly operation. In this paper, we improve the recently proposed scalar conversion scheme based on division by $\tau^2$. We apply two levels of optimization to the scalar conversion architecture. First, we reduce the number of long integer subtractions during the scalar conversion. This optimization reduces the computation cost and also simplifies the critical paths in the conversion architecture. Then we implement pipelines in the architecture. The pipeline splitting increases the operating frequency without increasing the number of cycles. We provide detailed experimental results to support the claims made in this paper.
I. INTRODUCTION

ELLIPTIC curve cryptography (ECC) is the modern standard for public key cryptography thanks to its higher security per bit and its implementation friendliness on embedded platforms. Elliptic curve scalar multiplication (ECSM) is the core operation of any ECC processor. In ECSM, a point P on an elliptic curve is multiplied by a large scalar k to get kP. The standard way to perform ECSM is the double-and-add algorithm [1]-[3], where a point doubling is performed for every key bit and a point addition is performed for every nonzero key bit. As the number of point additions is determined by the Hamming weight of the scalar, the computation time of an ECSM depends on the scalar k. Numerous works in the literature improve the computation time by applying optimizations such as efficient representation of scalars, efficient group operations, use of projective coordinate systems, and use of computation-friendly classes of elliptic curves. Architectural optimizations that depend on the platform add further acceleration [4]-[7].
The authors are with the Department of Electrical Engineering (ESAT/COSIC), KU Leuven and iMinds, Leuven 3001, Belgium (e-mail: sujoyetc@gmail.com; junfeng.fan@esat.kuleuven.be; ingrid.verbauwhede@esat.kuleuven.be).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2014.2321282
Koblitz curves [8] are a special class of elliptic curves on which the Frobenius endomorphism can be used to represent an integer scalar in a τ-adic form. With this representation of a scalar, costly point doublings are replaced by cheaper Frobenius operations. Solinas extended the work and proposed the τ-adic nonadjacent form, known as τNAF [9]. However, both the τ-adic and the τNAF representations of a scalar are almost twice as long as the integer scalar. This increase in length (and thus in the number of point additions) partly negates the acceleration offered by the Frobenius endomorphism. Length reduction schemes were first introduced by Meier and Staffelbach [10] and were later improved by Solinas [9]. The length reduction algorithms proposed by Solinas are efficient in software but are not amenable to hardware implementation because they require multiprecision integer multiplications and divisions.
On the hardware side, very few research papers exist on designing efficient scalar conversion algorithms. In [11] and [12], the conversions are performed in a software processor, while the ECSMs are performed on dedicated hardware. Such an approach is well suited when the scalar multiplier architecture is slow. However, the present speed records for ECSMs in hardware have made such an approach a bottleneck [6], [13]. The first hardware implementation of scalar conversion was reported in [14]. Later, [15] and [16] kept the conversion units in hardware along with the scalar multipliers. Still, these hardware converters are slow and have a large area.
The first hardware-friendly scalar conversion scheme was reported by Brumley and Järvinen [17]. Their conversion algorithm repeatedly divides the scalar by τ to generate a length-reduced scalar. Because of its sequential nature, the authors call the algorithm lazy reduction. The algorithm is very simple as it requires only integer addition or subtraction and shifting. Adikari et al. [18] proposed an improvement over [17]. Since division by $\tau^2$ is cheap in hardware, their double-lazy reduction scheme reduces the computation time to nearly half by using division by $\tau^2$ instead of τ.
In this paper, we propose acceleration techniques for the double-lazy reduction algorithm [18]. We observe that several additions and subtractions can be eliminated during the scalar reduction and the τNAF generation. Subtraction or addition of the nonzero remainders is replaced by alteration of the low-order bits of the operands, by use of the one's complement of the operands, and by use of borrow or carry inputs in the subtracter and adder circuits. We eliminate unnecessary subtractions from zero using the iterative property of the conversion steps. Such optimizations reduce the number of integer adder and subtracter circuits on the critical paths of the conversion architecture without affecting the cycle requirement. With the proposed acceleration techniques, we achieve an improvement in the operating frequency of at least 12.5% and 17.9% compared with [18] for the fields $\mathbb{F}_{2^{233}}$ and $\mathbb{F}_{2^{283}}$, respectively.
Next we perform efficient pipelining in the proposed conversion architecture. Using the iterative property of the conversion steps, we pipeline the conversion architecture in such a way that the pipeline stages are always in use. Because of its bubble-free nature, almost no cycles are wasted to satisfy the data dependencies between the pipeline stages. Our two-stage pipelined conversion architecture reduces the overall conversion time by 35.5% and 40% compared with [18] for the fields $\mathbb{F}_{2^{233}}$ and $\mathbb{F}_{2^{283}}$, respectively.
The organization of this paper is as follows. Section II gives a brief mathematical background on Koblitz curves and scalar conversion techniques. In Section III, computational optimizations for the double-lazy reduction algorithm are discussed. Section IV shows optimizations in the τNAF generation steps. A hardware architecture for the scalar conversion is designed in Section V and an efficient pipeline strategy is presented in Section VI. Experimental results are presented in Section VII. The final section draws the conclusions.
II. PRELIMINARIES
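The Koblitz curves over $\mathbb{F}_{2^m}$ have the following form:
$$E_a : y^2 + xy = x^3 + ax^2 + 1, \quad a \in \{0, 1\}.$$
We denote the Koblitz curve group by $E_a(\mathbb{F}_{2^m})$. In $E_a(\mathbb{F}_{2^m})$, the Frobenius mapping can be applied to reduce the complexity of ECSM. The Frobenius mapping $\tau : E_a(\mathbb{F}_{2^m}) \rightarrow E_a(\mathbb{F}_{2^m})$ is defined as follows:
$$\tau(x, y) = (x^2, y^2), \quad \tau(\mathcal{O}) = \mathcal{O}.$$
Application of τ to any point P squares the coordinates and gives another point Q on the curve. As squaring in $\mathbb{F}_{2^m}$ is cheap [1], [19], computing the Frobenius map of a point is also cheap. The map operator τ satisfies the relation $\tau^2 + 2 = \mu\tau$, where $\mu = (-1)^{1-a}$.
The ring of polynomials in τ with integer coefficients is denoted by $\mathbb{Z}[\tau]$. For any element $u = \sum_{i=0}^{l-1} u_i \tau^i \in \mathbb{Z}[\tau]$ with $u_i \in \{0, 1\}$, and any base point P on a Koblitz curve, we see the following relation:
$$uP = \sum_{i=0}^{l-1} u_i \tau^i(P).$$
In the preceding equation, the base point P is multiplied by a scalar which is an element of $\mathbb{Z}[\tau]$. During ECSM, only point additions and Frobenius operations are performed. Solinas [9] proposed algorithms to convert integer scalars into polynomials in τ with coefficients $u_i \in \{0, 1\}$ (τ-adic form). He also showed that the number of point additions can be reduced using a nonadjacent form, which is known as the τNAF of a scalar. Calculation of the τNAF from a scalar requires iterative divisions by τ. Here are two theorems from [9] related to the division of any element $\alpha = d_0 + d_1\tau \in \mathbb{Z}[\tau]$ by τ and by $\tau^2$.
Theorem 1: α is divisible by τ if and only if $d_0$ is even. The result of the division, when stored in $\alpha' = d'_0 + d'_1\tau$, is $(d'_0, d'_1) = (d_1 + \mu d_0/2, \; -d_0/2)$.
Theorem 2: α is divisible by $\tau^2$ if and only if $d_0 \equiv 2d_1 \pmod 4$.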
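The divisibility conditions above can be checked numerically. The following is a minimal sketch, not taken from [9] or [18]: elements $d_0 + d_1\tau$ are modelled as Python tuples, and the helper names `mul`, `div_tau`, and `div_tau2` are hypothetical. The quotient formulas are derived only from $\tau^2 = \mu\tau - 2$; the $\tau^2$ quotient is simply two applications of the τ quotient.

```python
MU = -1  # assumed curve parameter mu = (-1)^(1 - a); a = 0 gives mu = -1


def mul(x, y, mu=MU):
    """Multiply x = x0 + x1*tau by y = y0 + y1*tau using tau^2 = mu*tau - 2."""
    x0, x1 = x
    y0, y1 = y
    return (x0 * y0 - 2 * x1 * y1, x0 * y1 + x1 * y0 + mu * x1 * y1)


def div_tau(d, mu=MU):
    """Exact division by tau; requires d0 even."""
    d0, d1 = d
    assert d0 % 2 == 0, "not divisible by tau"
    return (d1 + mu * (d0 // 2), -(d0 // 2))


def div_tau2(d, mu=MU):
    """Exact division by tau^2; requires d0 = 2*d1 (mod 4)."""
    d0, d1 = d
    assert (d0 - 2 * d1) % 4 == 0, "not divisible by tau^2"
    # Obtained by applying the tau-division formula twice.
    return ((2 * mu * d1 - d0) // 4, -(2 * d1 + mu * d0) // 4)


if __name__ == "__main__":
    tau = (0, 1)
    tau2 = mul(tau, tau)          # equals (-2, mu)
    beta = (3, 5)
    print(div_tau(mul(beta, tau)), div_tau2(mul(beta, tau2)))
```

Multiplying an arbitrary element by τ or $\tau^2$ and dividing again should return the original element, which is what the usage lines print.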
A. Length of τNAF and Reduction Schemes
The length (l) of a τNAF for an integer scalar k is approximately $2\log_2 k$, which is almost double the length of the scalar. Length reduction schemes were proposed by Solinas [9]. An integer scalar k is first reduced to $k \bmod \delta$, where $\delta = (\tau^m - 1)/(\tau - 1)$, and then a τNAF is generated from the reduced scalar. The maximum length of the generated τNAF is m + a in $\mathbb{F}_{2^m}$. This scalar reduction scheme involves multiprecision integer division. Solinas proposed another reduction scheme [9] in which the scalar is partially reduced to avoid integer division at the cost of integer multiplication. Because multiprecision division and multiplication are complicated operations, hardware implementations of these schemes are inefficient.
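To make the length claim concrete, the sketch below generates a τNAF directly from an unreduced integer using the well-known digit-by-digit procedure of Solinas (division by τ with signed remainders). It is an illustration under stated assumptions: the function names `tnaf` and `evaluate` are hypothetical, digits are returned least significant first, and no length reduction is applied.

```python
def tnaf(k, mu=-1):
    """Return the tauNAF digits (least significant first) of the integer k."""
    d0, d1 = k, 0
    digits = []
    while d0 != 0 or d1 != 0:
        if d0 % 2 == 1:
            # Signed remainder in {+1, -1}, chosen so that the next digit is 0.
            u = 2 - ((d0 - 2 * d1) % 4)
            d0 -= u
        else:
            u = 0
        digits.append(u)
        # Exact division by tau (Theorem 1).
        d0, d1 = d1 + mu * (d0 // 2), -(d0 // 2)
    return digits


def evaluate(digits, mu=-1):
    """Evaluate sum(u_i * tau^i) in Z[tau] with Horner's rule."""
    a0, a1 = 0, 0
    for u in reversed(digits):
        # Multiply the accumulator by tau, then add the digit.
        a0, a1 = -2 * a1 + u, a0 + mu * a1
    return (a0, a1)


if __name__ == "__main__":
    k = 2 ** 20 + 12345
    digits = tnaf(k)
    # The length comes out close to twice the bit length of k.
    print(k.bit_length(), len(digits), evaluate(digits) == (k, 0))
```

Running it on a few scalars shows the digit count growing at roughly twice the bit length of k, which is the overhead the reduction schemes in this section are designed to remove.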
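Brumley and Järvinen [17] presented the lazy reduction algorithm, where the scalar k is repeatedly divided by τ, m times, to get the following relation:
$$k = (d_0 + d_1\tau)\tau^m + (b_0 + b_1\tau) \equiv (d_0 + b_0) + (d_1 + b_1)\tau = \gamma \pmod{\tau^m - 1}.$$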
The authors use γ as the reduced scalar and show that the length of the τNAF generated from γ is at most m + 4 in $\mathbb{F}_{2^m}$. Because division by τ is a simple operation (Theorem 1), the algorithm is suitable for hardware. Adikari et al. [18] proposed an improvement of the lazy reduction, which they call the double-lazy reduction. In the double-lazy reduction, the scalar is divided by $\tau^2$ a total of (m − 1)/2 times. Finally, one division by τ is performed to obtain the reduced scalar. The cycle count for the scalar reduction is nearly halved compared with the lazy reduction. The computational steps in the double-lazy reduction are shown in Algorithm 1. We advise readers to study the double-lazy reduction algorithm in [18], as this will help in understanding the acceleration techniques proposed in this paper.
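The two reduction strategies described above can be sketched as follows. This is a reconstruction from the prose, not the listing of Algorithm 1 in [18]: the accumulator names `a`, `b`, `d` follow the variable names used in Section III, while the function names and the update rules (derived from $\tau^2 = \mu\tau - 2$) are assumptions. Both routines maintain the invariant $k = d \cdot a + b$ with $a$ a power of τ, so that at the end $k = d\,\tau^m + b$.

```python
def mul(x, y, mu):
    """Multiply two elements of Z[tau] using tau^2 = mu*tau - 2."""
    return (x[0] * y[0] - 2 * x[1] * y[1],
            x[0] * y[1] + x[1] * y[0] + mu * x[1] * y[1])


def tau_pow(n, mu):
    """Return tau^n as a pair (a0, a1)."""
    a = (1, 0)
    for _ in range(n):
        a = (-2 * a[1], a[0] + mu * a[1])
    return a


def _step_tau(d, a, b, mu):
    """One lazy step: remainder in {0, 1}, then exact division by tau."""
    u = d[0] % 2
    b = (b[0] + u * a[0], b[1] + u * a[1])
    d0 = d[0] - u
    d = (d[1] + mu * (d0 // 2), -(d0 // 2))
    a = (-2 * a[1], a[0] + mu * a[1])
    return d, a, b


def lazy_reduction(k, m, mu=-1):
    d, a, b = (k, 0), (1, 0), (0, 0)
    for _ in range(m):
        d, a, b = _step_tau(d, a, b, mu)
    return d, b


def double_lazy_reduction(k, m, mu=-1):
    assert m % 2 == 1, "sketch assumes odd m, as for K-233 and K-283"
    d, a, b = (k, 0), (1, 0), (0, 0)
    for _ in range((m - 1) // 2):
        u0 = d[0] % 2
        u1 = ((d[0] - u0 - 2 * d[1]) // 2) % 2      # makes d - u divisible by tau^2
        ua = mul((u0, u1), a, mu)
        b = (b[0] + ua[0], b[1] + ua[1])
        d0, d1 = d[0] - u0, d[1] - u1
        d = ((2 * mu * d1 - d0) // 4, -(2 * d1 + mu * d0) // 4)
        a = mul(a, (-2, mu), mu)                    # a <- a * tau^2
    d, a, b = _step_tau(d, a, b, mu)                # final single division by tau
    return d, b


if __name__ == "__main__":
    k, m = 3 ** 140, 233
    d, b = double_lazy_reduction(k, m)
    gamma = (d[0] + b[0], d[1] + b[1])              # candidate reduced scalar
    print(gamma)
```

With unsigned remainders, one $\tau^2$ step is exactly two τ steps fused together, so both routines return the same pair (d, b) while the double-lazy loop runs about half as many iterations.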
III. IMPROVED REDUCTION ALGORITHM
In this section, we propose optimization steps to reduce the number of long integer additions and subtractions in Algorithm 1. Throughout this discussion, we consider μ = −1. Similar optimizations for μ = 1 are shown in the Appendix.
A. Elimination of Long Subtractions for Remainders
Algorithm 1: Double-Lazy Reduction
In line 6 of Algorithm 1, the remainders $u_0, u_1 \in \{0, 1\}$ are subtracted from $d_0$ and $d_1$. We observe that these subtractions are easy in some cases. For example, when $d_0 \equiv 1 \pmod 4$ and $2d_1 \equiv 0 \pmod 4$ (i.e., $u_0 = 1$ and $u_1 = 0$), the subtraction of $u_0$ from $d_0$ is equivalent to changing the least significant bit of $d_0$ from 1 to 0. Hence, in this case the long subtraction can be replaced by a bit alteration. However, when carry propagation is involved in a long subtraction, alteration of a few specific bits does not work as a replacement. For example, when $d_0 \equiv 3 \pmod 4$ and $2d_1 \equiv 0 \pmod 4$ (i.e., when $u_0 = 1$ and $u_1 = 1$), a long subtraction appears. The use of signed remainders $u_0, u_1 \in \{0, \pm 1\}$ helps to some extent in eliminating the long subtractions of nonzero remainders in such cases. Table I shows how the signed remainders are generated during the reduction steps depending on the low bits of $d_0$ and $2d_1$. In this table, the % operator represents the modular reduction operation. Except in Case 4, subtractions of $u_0$ and $u_1$ from $d_0$ and $d_1$ involve no carry propagation and thus can be performed by altering the low-order bits of $d_0$ and $d_1$.
For Case 4, if we perform the subtraction of $u_0 = -1$ in line 7 of Algorithm 1 instead of line 6 (i.e., we put $d_0 + 1$ in place of $d_0$), then we have the following observation:
This is equivalent to taking carry or borrow inputs in the adder or subtracter circuits as follows:
From the foregoing observations, we conclude that the long integer subtractions of the nonzero remainders in Algorithm 1 can be eliminated using cheaper alternatives.
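The two ideas of this subsection (bit alteration and carry/borrow inputs) can be illustrated with plain integer arithmetic. The sketch below is an assumption-laden illustration rather than the paper's exact equations: it assumes that the carry-propagating case is $d_0 \equiv 3 \pmod 4$, $2d_1 \equiv 0 \pmod 4$ with signed remainder $u_0 = -1$, and all function names are hypothetical. Python integers behave like unbounded two's-complement numbers, so `>>` and `~` model the hardware shift and one's complement.

```python
def clear_lsb(d0):
    """d0 - 1 for odd d0, done by clearing bit 0 (no borrow chain)."""
    return d0 & ~1


def add_with_carry(x, y, cin):
    """Model of a long adder with a 1-bit carry input."""
    return x + y + cin


def sub_with_borrow(x, y, bin_):
    """Model of a long subtracter with a 1-bit borrow input."""
    return x - y - bin_


def sub_via_complement(x, y):
    """x - y computed as x + (one's complement of y) + carry-in 1."""
    return add_with_carry(x, ~y, 1)


def case4_sum_div4(d0, d1):
    """((d0 + 1) + 2*d1) / 4 without a separate '+1' adder.

    Assumes d0 = 3 (mod 4) and d1 even; the +1 becomes the carry input
    of the adder that works on the truncated operands.
    """
    return add_with_carry(d0 >> 2, d1 >> 1, 1)


def case4_diff_div4(d0, d1):
    """(2*d1 - (d0 + 1)) / 4 using a borrow input instead of a '+1' adder."""
    return sub_with_borrow(d1 >> 1, d0 >> 2, 1)


if __name__ == "__main__":
    d0, d1 = 4 * 25 + 3, 2 * 7
    print(case4_sum_div4(d0, d1) == (d0 + 1 + 2 * d1) // 4)
    print(case4_diff_div4(d0, d1) == (2 * d1 - (d0 + 1)) // 4)
```

In both Case 4 helpers only one long addition or subtraction is performed; the increment that would otherwise need its own carry chain is absorbed into the carry or borrow input.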
B. Elimination of Subtractions From Zero
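In line 7 of Algorithm 1, a subtraction from zero is required for $d_0$ after computing $2d_1 + d_0$:
$$(d_0, d_1) \leftarrow \left( -\frac{d_0 + 2d_1}{4}, \; \frac{d_0 - 2d_1}{4} \right). \tag{2}$$
We eliminate the subtraction from zero using the following scheme: Instead of (2), we compute
$$(d_0, d_1) \leftarrow \left( \frac{d_0 + 2d_1}{4}, \; \frac{2d_1 - d_0}{4} \right). \tag{3}$$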
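The results from (2) and (3) have opposite signs but the same magnitudes. So, when (3) is used in place of (2) in every iteration, $(d_0, d_1)$ carries the wrong sign after every odd number of iterations and the correct sign after every even number of iterations.
The same trick is applied to eliminate the subtractions from zero during the computation of $(a_0, a_1)$ in line 12 of Algorithm 1. Instead of computing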
we compute
After the completion of the for-loop, one subtraction from zero is required to correct the signs of (a 0 , a 1 ) when (m − 1)/2 is odd. This subtraction can be eliminated if we compute
Since the remainders are generated by observing the low-order bits of $d_0$ and $d_1$ (Table I), the use of a pair $(d_0, d_1)$ that has the wrong sign must be justified for the correctness of the reduction algorithm. Suppose that after any odd number of iterations of the for-loop in Algorithm 1 we have the pairs $(a_{i,0}, a_{i,1})$, $(b_{i,0}, b_{i,1})$, and $(d_{i,0}, d_{i,1})$. Since the wrong sign is assigned to $(d_{i,0}, d_{i,1})$ after any odd number of iterations, we have the following relation:
In 
After plugging (7) into (6), we get the following equation:
In the above equation, the actual remainders are $(-u_0, -u_1)$. Interestingly, after any odd number of iterations, the wrong sign is also assigned to $(a_0, a_1)$. Since $(u_0, u_1)$ and $(a_0, a_1)$ have the same sign in any iteration, the computation of $(b_0, b_1)$ is insensitive to the wrong sign of the operands. Thus, $(b_{i+1,0}, b_{i+1,1})$ is computed as follows:
This justifies the correctness of the reduction when the wrong sign is assigned to the variables $(d_0, d_1)$ and $(a_0, a_1)$.
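The alternating-sign behaviour argued above can be observed numerically. The sketch below is an illustration for μ = −1 under explicit assumptions: `exact_div` is the true quotient by $\tau^2$ derived from $\tau^2 = -\tau - 2$, `flipped_div` is the same quotient with both components negated (so no subtraction from zero is needed), and the input is constructed as an exact multiple of a power of $\tau^2$ so that all remainders are zero. The function names are hypothetical.

```python
def mul(x, y):
    """Multiply in Z[tau] for mu = -1, i.e. tau^2 = -tau - 2."""
    return (x[0] * y[0] - 2 * x[1] * y[1],
            x[0] * y[1] + x[1] * y[0] - x[1] * y[1])


def exact_div(d):
    """True quotient of d by tau^2 (needs one negation of a long sum)."""
    d0, d1 = d
    return (-(d0 + 2 * d1) // 4, (d0 - 2 * d1) // 4)


def flipped_div(d):
    """Negated quotient: the negation is folded into the subtraction order."""
    d0, d1 = d
    return ((d0 + 2 * d1) // 4, (2 * d1 - d0) // 4)


def sign_trace(beta, n):
    """Divide beta * tau^(2n) by tau^2 n times with both variants.

    Returns a list of +1 / -1 telling, after each iteration, whether the
    flipped variant equals the true value (+1) or its negation (-1).
    """
    tau2 = (-2, -1)
    alpha = beta
    for _ in range(n):
        alpha = mul(alpha, tau2)
    true_d, lazy_d, trace = alpha, alpha, []
    for _ in range(n):
        true_d, lazy_d = exact_div(true_d), flipped_div(lazy_d)
        if lazy_d == true_d:
            trace.append(+1)
        else:
            assert lazy_d == (-true_d[0], -true_d[1])
            trace.append(-1)
    return trace


if __name__ == "__main__":
    print(sign_trace((7, -3), 6))
```

The trace alternates between wrong and correct sign, which matches the statement that a single sign correction is needed only when the loop runs an odd number of times.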
Throughout this section we have discussed how the number of long integer additions/subtractions can be reduced during the scalar reduction steps. Our proposed improvements over the double-lazy reduction are described in Algorithm 2. We see that only one addition or subtraction is performed during the computation of each of $d_0$, $d_1$, $a_0$, $a_1$, and $b_0$ in every iteration. For $b_1$, at most two additions/subtractions are performed per iteration. Thus, if implemented in hardware, the critical path contains only one adder/subtracter circuit of width m + 1 bits. In the previous reduction architectures [17], [18], the critical paths pass through two cascaded adder and subtracter circuits of data width m + 1. Because integer adder and subtracter circuits have a large delay, removing such circuits from the critical paths helps improve the operating frequency. In the next section, we look further into the τNAF generation algorithm and discuss how long subtractions of nonzero remainders can be eliminated during the τNAF generation steps.
IV. IMPROVED DOUBLE-DIGIT τNAF GENERATION
In [18], two consecutive τNAF digits are generated in a single step from the reduced scalar $d_0 + \tau d_1$ by performing divisions by $\tau^2$. The authors call this representation the double τNAF. Table II describes the generation of the consecutive τNAF digits $r_0$ and $r_1$ from $d_0$ and $d_1$. As described in Section III, we eliminate subtractions of nonzero remainders from $d_0$ and $d_1$ during the τNAF generation.
From Table II Table II can be handled in the same way we did for Case 4 in Table I (Section III-A).
In Case 3.A, the subtraction of $r_1 = 1$ from $d_1$ involves borrow propagation and thus may affect all the bits of $d_1$.
If we incorporate this subtraction in the next step, where we perform the division by $\tau^2$ as per (2), we have the following observation:
V. HARDWARE ARCHITECTURE
The hardware architecture for performing scalar conversion (for μ = −1) using the proposed acceleration techniques is shown in Fig. 1 . Similar to [17] and [18] , our conversion architecture is capable of performing both scalar reduction and double-digit τ NAF generation.
In any iteration of Algorithms 2 and 3, the variables $a_0$, $a_1$
A controller is used to generate the control signals for the multiplexers and the adder/subtracter circuits. The control block also generates the carry and borrow inputs for the adder/subtracter circuits A1 and A2. In the figure, T1 and T2 are a special category of multiplexers which produce $x$, $\bar{x}$, and $y$ from the inputs $x$ and $y$ as per the equation $((x \oplus s_0) \cdot \bar{s}_1) \mid (y \cdot s_1)$ when the selection inputs $(s_1, s_0)$ are 00, 01, and 10, respectively. For LUT-based FPGAs, this special construction for T1 and T2 achieves better LUT utilization [20] and thus saves area. The counter circuit is used to count the number of τNAF digits generated. Completion of the τNAF generation is indicated when m + 4 τNAF digits have been generated.
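The selection behaviour of the T1/T2 multiplexers follows directly from the Boolean expression above and can be checked bitwise. The sketch below is a software model only; the function name `t_mux` and the `width` parameter are hypothetical, and the behaviour for the unused selection value 11 is simply what the expression yields, not something the text specifies.

```python
def t_mux(x, y, s1, s0, width):
    """Bitwise model of ((x XOR s0) AND NOT s1) OR (y AND s1) on `width` bits."""
    mask = (1 << width) - 1
    s0w = mask if s0 else 0      # replicate the select bits across the word
    s1w = mask if s1 else 0
    return (((x ^ s0w) & ~s1w) | (y & s1w)) & mask


if __name__ == "__main__":
    x, y, w = 0b1011_0010, 0b0110_1100, 8
    for s1, s0 in ((0, 0), (0, 1), (1, 0)):
        print(s1, s0, format(t_mux(x, y, s1, s0, w), "08b"))
```

Selecting between an operand and its one's complement in the same multiplexer is what lets a single adder with a carry input serve as both adder and subtracter.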
VI. PIPELINING THE CONVERSION HARDWARE
The critical path of the conversion architecture is indicated by the dotted line in Fig. 1. As integer adders are slow, the proposed computational optimizations alone are not enough to achieve high speed for the scalar conversion architecture. Use of faster adder circuits increases the frequency at the cost of area. However, for long operand sizes, such adders are still slower than the binary field primitives, especially when pipelines are implemented in the binary field primitives [21]-[23]. In this section, we propose a solution to this problem by implementing pipelines in the conversion architecture.
A. Pipelining Iterative Addition and Subtraction Operations
The central operations in the scalar conversion hardware are additions and subtractions. We first consider a simple example where iterative additions and subtractions are performed. Let two variables $c_0$ and $c_1$ have some initial values and be updated iteratively as per the following equation:
$$(c_0, c_1) \leftarrow (c_0 + c_1, \; c_0 - c_1). \tag{10}$$
Let the data width for both $c_0$ and $c_1$ be at most m. The adder and subtracter circuits are split into two equal stages (Fig. 2) of width m/2 by putting registers in the carry and borrow propagation paths. Because of the data dependency of stage 2 on stage 1, the computations in stage 2 lag by one cycle. The timing diagram of the two stages is shown in Fig. 3 for the first five cycles. Iteration numbers are indicated by the superscripts. As per the timing diagram, the first four iterations of the consecutive additions and subtractions complete after the fifth cycle. In general, for I iterations of (10), the two-stage architecture takes I + 1 cycles. In comparison, a nonpipelined architecture takes I cycles to finish I rounds. The advantage of the pipelined architecture is that the delay per cycle is (ideally) halved compared with the nonpipelined architecture at the cost of only two flip-flops and one extra cycle.
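The I + 1 cycle behaviour can be reproduced with a small cycle-accurate model. The sketch below is an illustration under stated assumptions: operands are treated as unsigned m-bit words that wrap modulo $2^m$, m is even, and the names `pipelined_add_sub` and `reference_add_sub` are hypothetical. Stage 1 processes the low halves of iteration t in cycle t and registers one carry bit and one borrow bit; stage 2 consumes them one cycle later for the high halves.

```python
def reference_add_sub(c0, c1, iterations, m):
    """Nonpipelined model: one full-width add and subtract per cycle."""
    full = (1 << m) - 1
    for _ in range(iterations):
        c0, c1 = (c0 + c1) & full, (c0 - c1) & full
    return c0, c1, iterations


def pipelined_add_sub(c0, c1, iterations, m):
    """Two-stage model with registered carry and borrow between the halves."""
    h = m // 2
    mask = (1 << h) - 1
    lo0, lo1 = c0 & mask, c1 & mask
    hi0, hi1 = (c0 >> h) & mask, (c1 >> h) & mask
    carry_reg = borrow_reg = 0
    cycles = 0
    for cycle in range(1, iterations + 2):
        cycles += 1
        # Stage 2: high halves of iteration (cycle - 1), using the bits
        # registered by stage 1 in the previous cycle.
        if cycle >= 2:
            hi0, hi1 = ((hi0 + hi1 + carry_reg) & mask,
                        (hi0 - hi1 - borrow_reg) & mask)
        # Stage 1: low halves of iteration `cycle`.
        if cycle <= iterations:
            s = lo0 + lo1
            carry_reg = s >> h
            borrow_reg = 1 if lo0 < lo1 else 0
            lo0, lo1 = s & mask, (lo0 - lo1) & mask
    return (hi0 << h) | lo0, (hi1 << h) | lo1, cycles


if __name__ == "__main__":
    print(reference_add_sub(0xBEEF, 0x1234, 4, 16))
    print(pipelined_add_sub(0xBEEF, 0x1234, 4, 16))
```

Both models return the same values; the pipelined one reports one extra cycle, while each of its stages only has to propagate a carry across m/2 bits.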
B. Pipelining the Conversion Architecture
We apply the same concept in pipelining the conversion hardware. However, data dependencies become more complicated because of the shifter circuits and the different data widths of the registers in the conversion architecture (Fig. 1). Additionally, synchronization between the parallel data paths is essential to maintain functional correctness of the conversion hardware. Fig. 4 shows the two-stage pipelined conversion architecture for μ = −1. The data paths are split into almost symmetric stages to achieve the best operating frequency for the two-stage architecture. In the figure, the suffixes #1 and #2 indicate the parts of the different components that belong to the first and second stages of the pipelined architecture, respectively. The registers $a_0$, $a_1$
VII. EXPERIMENTAL RESULTS
We have evaluated the proposed acceleration techniques for the NIST-recommended Koblitz curves [24] K-233 and K-283 on the Xilinx Virtex 4 FPGA xc4vlx200-11ff1513. Both curves have μ = −1 and support the present security standards. Table III [18] .
In the table, $T_{Red}$ and $T_{Conv}$ represent the scalar reduction time and the scalar conversion time, respectively. The scalar conversion time is the sum of the scalar reduction time and the τNAF generation time. The proposed nonpipelined architecture and the architecture in [18] have the same cycle count of m + 6 for a scalar conversion in $\mathbb{F}_{2^m}$. The pipelined conversion architecture takes only two extra cycles and thus requires m + 8 cycles in $\mathbb{F}_{2^m}$. In [17], the conversion architecture uses division by τ and takes 2m + 7 cycles.
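For quick reference, the cycle-count expressions quoted above can be evaluated for the two field sizes used in the experiments. The snippet below only restates those formulas; the function name and dictionary keys are hypothetical labels, and no frequencies or timings are assumed.

```python
def conversion_cycles(m):
    """Cycle counts for one scalar conversion in F_{2^m}, per the formulas in the text."""
    return {
        "lazy reduction [17]": 2 * m + 7,
        "double-lazy [18]": m + 6,
        "two-stage pipelined": m + 8,
    }


if __name__ == "__main__":
    for m in (233, 283):
        print(m, conversion_cycles(m))
```

The two extra cycles of the pipelined design are under 1% of the total for both curves, so any frequency gain translates almost directly into a shorter conversion time.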
Operating frequencies achieved for the conversion architectures depend on the type of integer adders and also on the optimization settings in the synthesis tools. Our implementation uses generic adders and subtracters. For a fair comparison of the frequencies, we have implemented a small circuit which is the same as the data path for the $d_0$ register in [18] and uses carry-propagation subtracter circuits. With the same optimization parameters in the Xilinx ISE tool, we achieved frequencies of 72 and 59.5 MHz for the fields $\mathbb{F}_{2^{233}}$ and $\mathbb{F}_{2^{283}}$, respectively. If the conversion hardware in [18] is implemented using generic adder and subtracter circuits, its frequencies will be limited by the aforementioned values because of the increased circuit complexity. Under this fair comparison scenario, our nonpipelined architectures achieve an improvement in frequency of at least 12.5% and 17.9% for K-233 and K-283, respectively. The area of the nonpipelined architectures is slightly smaller than that of the architectures in [18]. The proposed optimizations reduce the number of adder and subtracter circuits but increase the number of multiplexers.
Use of the pipeline strategy improves the frequency drastically. Since there are no bubbles in the pipelined data path, the cycle requirements for scalar reduction and conversion remain almost the same. Our two-stage pipelined architectures achieve 35.5% and 40% reduction in the overall conversion time compared with [18] for the curves K-233 and K-283, respectively. Further, to show that the proposed pipeline strategy is not limited to only two stages, we have implemented a three-stage pipelined architecture for K-283. The three-stage architecture improves the frequency by around 14% compared with the two-stage architecture and thus reduces the computation time.
An interesting observation is that the pipelined architectures have a smaller area than the nonpipelined architectures. On one hand, the pipeline strategy is very cost effective as it requires very few flip-flops. On the other hand, the ISE tool performs less logic replication because of the shorter critical paths [25]. The combined effect reduces the overall area.
VIII. CONCLUSION
In this paper, we have proposed acceleration techniques for scalar conversions required in the Koblitz curve-based cryptoprocessors. The scalar conversion time is improved by reducing the number of costly integer additions and subtractions and by implementing pipelines in the data path. The proposed reduction in the arithmetic cost simplifies the critical paths in the conversion architecture. Further, an efficient pipeline strategy is used to drastically improve the frequency of the conversion architecture without increasing the cycle count.
In this paper, we have implemented pipelined architectures with up to three stages. We remark that more pipeline stages can be implemented in the conversion architecture to achieve a faster computation time. However, the percentage reduction in the computation time will diminish as the number of pipeline stages increases. The actual number of pipeline stages in the conversion architecture can be chosen to match the speed of the binary field primitives used in the Koblitz curve cryptoprocessor.
APPENDIX
Here we present computational optimizations for the curve parameter μ = 1. We compute $(d_0, d_1)$ as per (11) to avoid long subtractions from zero (Section III-B)
During the iterative divisions by $\tau^2$, the wrong sign is assigned to either $d_0$ or $d_1$ in any iteration. The assignment of the wrong sign alternates in every consecutive iteration. We find that (11) is the same as (3), differing only in the relative positions of $d_0$ and $d_1$ on the left-hand side. During the reduction of the scalar, the nonzero remainders are generated as per Table I for both μ = 1 and μ = −1. Thus, the computational optimizations we followed in Section III for μ = −1 are also applicable for μ = 1. The double-digit τNAF is generated as per Table IV for μ = 1. Comparing with Table II, we see that only Cases 3.A-3.D are different in Table IV. For the cases which are the same in both tables, we apply the same computational optimizations discussed in Section IV for μ = −1. Subtractions of remainders from $d_1$ are performed by altering the low-order bits of $d_1$ for Cases 3.A, 3.C, and 3.D. However, the subtraction of the remainder in Case 3.B involves carry propagation. We eliminate this long subtraction by incorporating it in the next step, where we perform division by $\tau^2$. This is shown in (12). Subtraction of 2 from, or addition of 1 to, $d_0$ is easy as it requires only alteration of the low-order bits of $d_0$. We also consider a borrow input to the adder/subtracter circuit in the critical path of $d_1$
